Constrained decoding — regex, FSMs, and the compressed-FSM trick

The model emits a probability over the whole vocabulary; the application wants only valid JSON. Filter the logits with a finite-automaton mask at every step. Then realize most automaton states are deterministic — and fast-forward through them. Roughly 2× speedup on JSON-heavy outputs.

The problem in one image

Without constraints, the model samples one of ~128,000 tokens at each step. About 99.9% of them would produce invalid JSON. The output looks fine 80% of the time and silently breaks 20% of the time — at which point you have to either retry the call (wasted compute, wasted latency) or post-fix the parse (lossy, hacky). The fix is to mask the invalid tokens before sampling so an illegal token is impossible by construction.

From regex to FSM

A regular expression compiles to a (deterministic) finite automaton: a set of states, an alphabet, a transition function. For example, the regex r"\d+(\.\d+)?" compiles to:

At any state s, the set of valid characters is determined. To do this at the token level instead, we precompute, for each state, the bitmask of tokens whose string forms only contain characters that the FSM accepts from this state. This precomputation is what makes per-step masking cheap.

The per-step mask, in detail

# at each decode step, the engine has: current_state, logits[V]
mask = state_to_mask[current_state]    # bitvector of length V, precomputed
logits[~mask] = -inf                   # invalid tokens removed
probs = softmax(logits)                # renormalize over legal tokens
next_token = sample(probs)             # standard temperature / top-p / greedy
# advance FSM state by walking each character of next_token's string form
for ch in vocab[next_token]:
    current_state = transition(current_state, ch)

Per-step cost: one bitmask OR-into-logits (V/64 = 2K 64-bit words for a 128K vocab — kernel-friendly), one FSM walk per emitted token. Negligible compared to the actual transformer forward pass.

What the mask preserves

Masking sets the logits of illegal tokens to −∞. Softmax then renormalizes the legal subset to probabilities that sum to 1. Temperature, top-p, top-k, and greedy all operate on this renormalized distribution. The distribution is the target model's distribution conditioned on legality — sampling itself is undistorted. The model's downstream behavior can still degrade: restricting the output space prevents the model from emitting prose-style scratch work that it would have used to think, which can hurt reasoning quality on hard tasks. The fix is to keep a free-text "reasoning" step before the constrained JSON output, not to weaken the constraint.

The compressed-FSM observation

Look at a real JSON-schema FSM, e.g., one that emits {"name": "<string>", "age": <int>}. Most of its states have exactly one valid transition — at "after {, the only legal next character is ", then the only continuation is n, then a, then m, then e, …". For 8–10 characters in a row, there is no choice. The model's "freedom" only resumes when the string-value field opens.

prepended to the input; the model never samples them.

How fast-forwarding actually works

When the FSM enters a state whose mask has exactly one set bit (one legal token, not one legal character — this is the right granularity), the runtime:

Skips the forward pass for that step entirely.
Appends the forced token to the output and the KV cache.
Advances the FSM.
Loops until the FSM state has more than one legal token.

If k consecutive deterministic tokens get forced, we save k model forwards. On a typical JSON object with 6 string keys, deterministic runs add up to 40–80 tokens out of a 200-token output. That's a 25–40% reduction in decode steps, often visible as a ~1.5–2× wall-clock speedup on JSON-heavy workloads.

Tokenizer alignment

Tokens don't align with characters. The string {"name" might be one token ({"name") in some tokenizers, three tokens ({, ", name") in others. The compressed-FSM runtime has to walk the FSM character-by-character but emit tokens. SGLang handles this with a "longest token that fits the deterministic run" greedy strategy. Edge cases (tokens that cross a state boundary) require careful handling, and getting them right is a substantial fraction of the implementation.

How SGLang exposes this

# client side
s += sgl.gen("score", regex=r"[0-9]+\.[0-9]{2}")        # decimal with 2 fractional digits
s += sgl.gen("answer", regex=r"yes|no|maybe")           # choice
s += sgl.gen("json",   json_schema=Order.model_json_schema())   # full JSON schema

# server side
# 1. compile regex / schema → DFA at request-admission time (cached per shape)
# 2. precompute state → token bitmask
# 3. each decode step:
#    - apply mask, sample (or fast-forward if mask is single-bit)
#    - advance FSM by emitted token
#    - if FSM at accept state and stop condition met → done

The compilation step is the once-per-shape cost; the mask application is the per-step cost. SGLang's outlines integration (older path) and its own outlines-compatible compiler share the same DFA structure.

Why this is more than ergonomics

You might think "I'll just postprocess the output with a JSON parser." That works when generation is cheap and parses succeed. It breaks when:

The output is long. A 4K-token JSON output with one bad bracket costs you the whole generation. With constraints, the bracket couldn't have been emitted.
You're chaining calls. Each downstream call depends on parsed structure; a failed parse cascades.
You're paying for tokens. Tokens spent on retries are wasted.
You want guaranteed schema. "Probably JSON" is not the same as "JSON". Closed-loop systems need the guarantee.

Constrained decoding makes the guarantee an invariant, not an aspiration. The compressed-FSM trick makes the invariant cheap enough that you turn it on by default.

Interactive · how much speedup does fast-forwarding give

Set the JSON schema's "deterministic fraction" (the ratio of structural tokens to free-content tokens). Watch the wall-clock speedup grow.

What lesson 07 builds

FSMs are limited to regular languages. They can't express nested JSON (recursive), balanced brackets, or SQL syntax — those need context-free grammars. Lesson 07 covers xgrammar: a pushdown automaton + per-state bitmask cache, with the same per-step cost as the FSM mask but the expressive power of a CFG.