all_lessons / sglang / 10 · speculative lesson 10 / 11

Speculative decoding — EAGLE trees in SGLang

Decoding is bandwidth-bound at small batch — most of the GPU time is moving weights through HBM, not computing. A small draft model proposes a few tokens, the big model verifies them all in one forward pass. Verifying K tokens costs almost the same as decoding 1, so if even half the draft is accepted you've cut wall-clock roughly in half.

The asymmetry that makes it work

Decode at batch 1 is dominated by HBM-to-SRAM traffic for the model weights. At batch B the same weights flow through HBM once and serve all B queries — so doubling batch from 1 to 2 only marginally increases wall-clock. Time-to-process-K-tokens-at-batch-1 ≈ Time-to-process-1-token-at-batch-K.

If you can predict K future tokens cheaply and then verify them in a single batched-K forward, you skip K−1 separate forwards. Predict cheaply: a smaller model, or a one-layer "head," or n-gram. Verify exactly: the same big model, batched over the K positions.

naive — 4 sequential big-model decodes forward · 1 tok forward · 1 tok forward · 1 tok forward · 1 tok 4 forwards = 4 × t_big. Throughput = 1 token / t_big. speculative — draft 4, verify in 1 batched forward d d d d verify · 4 tok in one forward cost = 4·t_draft + t_big ≈ 1.2·t_big at typical t_draft / t_big ≈ 0.05. Net: ~3 accepted tokens per t_big.

The verification math (rejection sampling)

Spec decoding is exact — the output distribution is identical to running the big model alone. The trick is rejection sampling: the draft proposes a token x with probability q(x); the big model gives p(x); we accept with probability min(1, p(x)/q(x)). If we reject, we re-sample from a corrected distribution that exactly preserves p. Walk through the rule:

  1. Draft proposes token xi at position i with probability q(xi).
  2. The target model's forward pass returns p(xi) for each draft token at each draft position.
  3. Accept xi with probability min(1, p(xi)/q(xi)).
  4. If we accept i, move to i+1 and repeat.
  5. On first rejection at position i, sample from the distribution (p − q)+ (clipped, normalized) and stop — the rest of the draft is discarded.

Per draft position the probability of acceptance is α (the "acceptance rate"). For typical Llama-3 + Llama-3 1B draft pairs, α ≈ 0.5–0.7. With K draft positions, the expected number of net new tokens per round is (1 − αK+1) / (1 − α) — the "+1" comes from the corrected token that's always sampled exactly when a rejection occurs, plus the bonus token if we reach the end of the draft. At α=0.6, K=4: expected ≈ 2.3 tokens per verification. In the large-K limit this approaches the classical geometric 1/(1−α) ≈ 2.5.

Why this is exact, not approximate
The acceptance rule is derived to make the distribution of accepted tokens equal to p. The only way to lose exactness is to use a wrong q at decode time (e.g., apply different temperature to draft vs target). SGLang ensures the draft sees the same sampling settings as the target.

Tree-shaped drafts — EAGLE

The plain spec-decoding above proposes a linear sequence of K tokens. EAGLE (Li et al., 2024) generalizes to a tree: at each position, propose the top-m tokens with high draft confidence, branching where multiple candidates look reasonable. The verification then evaluates the whole tree in one batched forward, and accepts the longest path that the target agrees with.

linear draft (K=4) · a b c d if first reject is at position 2, only 1 accepted token. K=4 is wasted. tree draft (EAGLE, fan-out 2) a A b B B' c c' verification evaluates all 7 nodes in one forward; longest accepted path wins. EAGLE-1 / -2 / -3 vary in how the draft head is trained and how the tree is built; the verification structure is the same. Trees have more candidates per verification step → higher α at the same K → faster decode.

What SGLang ships

The throughput equation

With:

Expected accepted tokens per round: n = (1 − αK+1) / (1 − α) for linear; higher for tree.

Per-round time: t = K · tdraft + tbig.

Throughput = n / t. Compare to 1 / tbig for the baseline.

αKSpeedup over baseline
0.32~1.2×
0.53~1.6×
0.74~2.4×
0.855 (tree)~3×+

α is the dominant lever, not K. Increasing K past the point where rejections are common buys you nothing — the extra draft work is wasted. Tree drafting helps when individual positions have multiple plausible tokens (most generative tasks); it hurts on highly deterministic outputs (formatted JSON post-fast-forward) where the linear acceptance is already high.

When spec decoding is not worth it

Failure modes

What this actually composes with

Spec decoding sits on top of every previous lesson without changing them:

Interactive · the spec-decoding break-even

Set draft cost, acceptance rate, and K. The widget computes throughput and shows when the technique stops paying.

Spec decoding throughput

Acceptance rate dominates. K just needs to be "large enough to amortize" — 3–5 is the usual sweet spot.

What lesson 11 builds

You now have every mechanism SGLang ships: program-aware DSL, radix tree cache, cache-aware scheduler, compressed FSM + xgrammar, FlashInfer kernels, CUDA graphs, TP/DP/EP parallelism, speculative decoding. Lesson 11 puts the whole stack against vLLM — where each framework's bets win, where they tie, and how to read benchmark numbers without being misled by them.