xgrammar — context-free grammars at sample-time

Regex catches "this string matches a pattern". It does not catch "this string has balanced brackets" or "this is a valid SQL query." That power requires a CFG and a pushdown automaton. The trick that makes it cheap at sample-time is the same as for FSMs: precompute a token-aligned bitmask per state, and look it up.

Why FSMs aren't enough

A regular language can match \d+\.\d+ or even \{"name":\s*"[^"]*"\}. It cannot match any of these:

Recursive JSON. Nested objects of arbitrary depth: {"a":{"b":{"c":1}}}. A regular language cannot count brace depth.
Balanced parentheses in arithmetic expressions: ((1+2)*(3-(4+5))).
SQL. Balanced quotes, balanced parens, recursive subqueries.
Function-call schemas with arrays of objects with arrays: the schema's nesting can be unbounded in principle.

The classical hierarchy says: regex = regular = FSM; balanced/nested = context-free = pushdown automaton (PDA). To constrain a model to a CFG, we need a PDA at sample-time.

A pushdown automaton, briefly

An FSM has a finite set of states. A PDA has a finite set of states plus a stack. Transitions can push or pop the stack. This is what lets a PDA "remember" how deep into a nested structure it is. A balanced-parens PDA pushes on ( and pops on ); it accepts only when the stack is empty at end-of-input.

{"a":{"b":1}} — stack visualized read: { " a " : { " b " : 1 } } stack: [obj] [obj,key] [obj,val] [obj,val,obj,val] [obj,val,obj]→[obj]→[] at each character read, the PDA either stays in state, pushes (opening token), pops (closing token), or rejects. stack depth here is at most 2; for deeply nested schemas it can grow to ~100. key insight: at any moment, the set of legal next characters is determined entirely by (state, top-of-stack).

The naive PDA approach is too slow

If we follow the FSM playbook directly — at each step, ask "what characters does the PDA accept?" then build a token mask — we are stuck with two costs:

Per-step mask construction. For each of the 128K vocab tokens, walk its string through the PDA from the current configuration. That's potentially 128K × 16 = ~2M character transitions per token. At 100+ steps per output, the mask construction dominates the model forward pass.
State explosion. A PDA has infinite configurations in principle (the stack can grow). You cannot precompute a per-configuration mask.

So you can't cache directly. The fix is to observe what xgrammar observes: the mask depends only on the current grammar production and a small bounded context, not on the whole stack.

The xgrammar idea — adaptive token mask caching

xgrammar (Dong et al., 2024) makes three moves:

Decompose the grammar into rule items (Earley-style). An item is "we're currently inside rule R and have matched its first k symbols, with m still to go." For the toy grammar S → '(' S ')' | ε, the items are S → •(S), S → (•S), S → (S•), S → (S)•, S → •ε — five items, regardless of how deep the actual nesting goes at runtime. The stack is then a list of which item we're inside at each level, not new states.
Split tokens into context-independent and context-dependent. Most tokens in a typical schema (e.g., the letters of "name") are valid or invalid based on the current item alone — they are context-independent, and their legality can be cached per item. A minority of tokens (e.g., a closing bracket that needs to know which kind of bracket is on top of the stack) are context-dependent and require a quick stack walk at runtime.
Cache the context-independent mask per item, lazily. When the engine reaches an item for the first time, build its bitmask once and cache it. Future hits at that item are an O(1) lookup. Context-dependent tokens add a small constant per step.

The runtime cost per step becomes:

Item lookup → cached context-independent mask: O(1).
Context-dependent fixup over the small minority of "depends-on-stack" tokens: O(stack-walk-depth · small_constant).
Apply mask to logits: O(V/64) words, ~30 µs on H100.
Sample, then walk PDA forward by emitted token: O(token length).

This is comparable to FSM masking — and xgrammar additionally batches the "warm-up" of new items so the first time you hit a new schema, you don't pay a single-threaded compile-and-mask serialization.

Token-grammar alignment, the actual headache

The PDA wants to consume characters; the model emits tokens. A single token like "name":\"Alice\" might span ten grammar transitions. xgrammar handles this by computing, for each token, the shortest grammar configuration sequence that accepts its string starting from each possible source item. Tokens that can validly start at item i and end at item j are added to i's mask, with a side note about where they land.

Multi-byte tokens add complexity: the same byte sequence can sometimes be tokenized two ways. xgrammar's compiler walks all consistent tokenizations and accepts a token if any of them keeps the grammar valid.

What "adaptive" buys you

Building masks for all items eagerly is wasted work — most schemas only reach a handful of items per call. xgrammar builds masks lazily as the runtime encounters new items, and amortizes the cost across all future calls that share a schema. After a few warm requests, mask construction is effectively free.

What it can express that FSMs can't

Constraint	FSM (regex)	xgrammar (CFG)
"a string of digits"	✓	✓
"a date YYYY-MM-DD"	✓	✓
flat JSON: `{"k":"v"}`	✓ (with careful regex)	✓
recursive JSON: nested objects/arrays	✗ (unbounded depth)	✓
balanced parens in math	✗	✓
valid SQL SELECT	✗	✓
typed function call (JSON schema with $ref)	✗ (recursion)	✓

Cost in numbers

From xgrammar's reported measurements (and SGLang's integration):

Mask lookup + apply per step: ~10–30 µs on H100. Compare to ~1–2 ms for a decode forward at small batch — masking is in the noise.
Schema compile time: a few hundred microseconds for typical schemas; cached per-schema-hash, so amortized to zero.
Speedup over the "rebuild mask every step" baseline (older outlines path): 5–100× lower mask-construction cost on recursive schemas; equivalent on simple ones (reported in the xgrammar paper).
Combined with fast-forward: Same trick as the FSM lesson — when only one token is legal, skip the forward. CFG schemas have similar deterministic runs (closing brackets after a value, key strings after a comma).

How SGLang exposes CFG decoding

# json_schema is the most common entry point; xgrammar compiles it to a CFG
s += sgl.gen("order",
            json_schema=Order.model_json_schema(),
            max_tokens=512)

# raw EBNF / BNF for non-JSON grammars (SQL, custom DSLs)
s += sgl.gen("query",
            grammar=open("sql_grammar.bnf").read())

# regex still works — internally compiled to a trivial CFG
s += sgl.gen("uuid", regex=r"[0-9a-f]{8}-[0-9a-f]{4}-…")

The SGLang server picks the backend automatically: regex → FSM, json_schema or grammar → xgrammar. You don't choose; you just specify the constraint.

The combined picture

Interactive · expressive power vs sample-time cost

Move along the spectrum from "no constraint" to "regex" to "JSON schema" to "deeply nested schema." Watch wall-clock cost and the probability of a valid output.