all_lessons / sglang / 07 · xgrammar lesson 7 / 11

xgrammar — context-free grammars at sample-time

Regex catches "this string matches a pattern". It does not catch "this string has balanced brackets" or "this is a valid SQL query." That power requires a CFG and a pushdown automaton. The trick that makes it cheap at sample-time is the same as for FSMs: precompute a token-aligned bitmask per state, and look it up.

Why FSMs aren't enough

A regular language can match \d+\.\d+ or even \{"name":\s*"[^"]*"\}. It cannot match any of these:

The classical hierarchy says: regex = regular = FSM; balanced/nested = context-free = pushdown automaton (PDA). To constrain a model to a CFG, we need a PDA at sample-time.

A pushdown automaton, briefly

An FSM has a finite set of states. A PDA has a finite set of states plus a stack. Transitions can push or pop the stack. This is what lets a PDA "remember" how deep into a nested structure it is. A balanced-parens PDA pushes on ( and pops on ); it accepts only when the stack is empty at end-of-input.

PDA reading {"a":{"b":1}} — stack visualized read: { " a " : { " b " : 1 } } stack: [obj] [obj,key] [obj,val] [obj,val,obj,val] [obj,val,obj]→[obj]→[] at each character read, the PDA either stays in state, pushes (opening token), pops (closing token), or rejects. stack depth here is at most 2; for deeply nested schemas it can grow to ~100. key insight: at any moment, the set of legal next characters is determined entirely by (state, top-of-stack).

The naive PDA approach is too slow

If we follow the FSM playbook directly — at each step, ask "what characters does the PDA accept?" then build a token mask — we are stuck with two costs:

So you can't cache directly. The fix is to observe what xgrammar observes: the mask depends only on the current grammar production and a small bounded context, not on the whole stack.

The xgrammar idea — adaptive token mask caching

xgrammar (Dong et al., 2024) makes three moves:

  1. Decompose the grammar into rule items (Earley-style). An item is "we're currently inside rule R and have matched its first k symbols, with m still to go." For the toy grammar S → '(' S ')' | ε, the items are S → •(S), S → (•S), S → (S•), S → (S)•, S → •ε — five items, regardless of how deep the actual nesting goes at runtime. The stack is then a list of which item we're inside at each level, not new states.
  2. Split tokens into context-independent and context-dependent. Most tokens in a typical schema (e.g., the letters of "name") are valid or invalid based on the current item alone — they are context-independent, and their legality can be cached per item. A minority of tokens (e.g., a closing bracket that needs to know which kind of bracket is on top of the stack) are context-dependent and require a quick stack walk at runtime.
  3. Cache the context-independent mask per item, lazily. When the engine reaches an item for the first time, build its bitmask once and cache it. Future hits at that item are an O(1) lookup. Context-dependent tokens add a small constant per step.

The runtime cost per step becomes:

This is comparable to FSM masking — and xgrammar additionally batches the "warm-up" of new items so the first time you hit a new schema, you don't pay a single-threaded compile-and-mask serialization.

grammar (CFG) json_schema, BNF, EBNF grammar items ~10²–10³ items per schema per-item char masks bitvector over ASCII token masks bitvector over V compile-time compile-time compile-time cached on first use decode step (1) lookup current item's token mask (2) AND into logits → sample → advance PDA

Token-grammar alignment, the actual headache

The PDA wants to consume characters; the model emits tokens. A single token like "name":\"Alice\" might span ten grammar transitions. xgrammar handles this by computing, for each token, the shortest grammar configuration sequence that accepts its string starting from each possible source item. Tokens that can validly start at item i and end at item j are added to i's mask, with a side note about where they land.

Multi-byte tokens add complexity: the same byte sequence can sometimes be tokenized two ways. xgrammar's compiler walks all consistent tokenizations and accepts a token if any of them keeps the grammar valid.

What "adaptive" buys you
Building masks for all items eagerly is wasted work — most schemas only reach a handful of items per call. xgrammar builds masks lazily as the runtime encounters new items, and amortizes the cost across all future calls that share a schema. After a few warm requests, mask construction is effectively free.

What it can express that FSMs can't

ConstraintFSM (regex)xgrammar (CFG)
"a string of digits"
"a date YYYY-MM-DD"
flat JSON: {"k":"v"}✓ (with careful regex)
recursive JSON: nested objects/arrays✗ (unbounded depth)
balanced parens in math
valid SQL SELECT
typed function call (JSON schema with $ref)✗ (recursion)

Cost in numbers

From xgrammar's reported measurements (and SGLang's integration):

How SGLang exposes CFG decoding

# json_schema is the most common entry point; xgrammar compiles it to a CFG
s += sgl.gen("order",
            json_schema=Order.model_json_schema(),
            max_tokens=512)

# raw EBNF / BNF for non-JSON grammars (SQL, custom DSLs)
s += sgl.gen("query",
            grammar=open("sql_grammar.bnf").read())

# regex still works — internally compiled to a trivial CFG
s += sgl.gen("uuid", regex=r"[0-9a-f]{8}-[0-9a-f]{4}-…")

The SGLang server picks the backend automatically: regex → FSM, json_schema or grammar → xgrammar. You don't choose; you just specify the constraint.

The combined picture

model forward logits[V] mask lookup PDA item → bitmask apply + softmax illegal → −∞ sample · advance PDA walk emitted token if mask has one legal token → fast-forward skip forward pass; append forced token to KV; advance PDA mask lookup & apply: ~30 µs · model forward (small batch): ~2 ms · fast-forward: ~0.2 ms / forced token net effect: structural correctness for ≤1% overhead at worst, and ~2× speedup when the schema is forcing-heavy

Interactive · expressive power vs sample-time cost

Move along the spectrum from "no constraint" to "regex" to "JSON schema" to "deeply nested schema." Watch wall-clock cost and the probability of a valid output.

Constraint level vs cost and validity

No constraint: zero overhead, low validity. CFG with fast-forward: small overhead, perfect validity.