Speculative decoding — EAGLE trees in SGLang

Decoding is bandwidth-bound at small batch — most of the GPU time is moving weights through HBM, not computing. A small draft model proposes a few tokens, the big model verifies them all in one forward pass. Verifying K tokens costs almost the same as decoding 1, so if even half the draft is accepted you've cut wall-clock roughly in half.

The asymmetry that makes it work

Decode at batch 1 is dominated by HBM-to-SRAM traffic for the model weights. At batch B the same weights flow through HBM once and serve all B queries — so doubling batch from 1 to 2 only marginally increases wall-clock. Time-to-process-K-tokens-at-batch-1 ≈ Time-to-process-1-token-at-batch-K.

If you can predict K future tokens cheaply and then verify them in a single batched-K forward, you skip K−1 separate forwards. Predict cheaply: a smaller model, or a one-layer "head," or n-gram. Verify exactly: the same big model, batched over the K positions.

The verification math (rejection sampling)

Spec decoding is exact — the output distribution is identical to running the big model alone. The trick is rejection sampling: the draft proposes a token x with probability q(x); the big model gives p(x); we accept with probability min(1, p(x)/q(x)). If we reject, we re-sample from a corrected distribution that exactly preserves p. Walk through the rule:

Draft proposes token x_i at position i with probability q(x_i).
The target model's forward pass returns p(x_i) for each draft token at each draft position.
Accept x_i with probability min(1, p(x_i)/q(x_i)).
If we accept i, move to i+1 and repeat.
On first rejection at position i, sample from the distribution (p − q)₊ (clipped, normalized) and stop — the rest of the draft is discarded.

Per draft position the probability of acceptance is α (the "acceptance rate"). For typical Llama-3 + Llama-3 1B draft pairs, α ≈ 0.5–0.7. With K draft positions, the expected number of net new tokens per round is (1 − α^K+1) / (1 − α) — the "+1" comes from the corrected token that's always sampled exactly when a rejection occurs, plus the bonus token if we reach the end of the draft. At α=0.6, K=4: expected ≈ 2.3 tokens per verification. In the large-K limit this approaches the classical geometric 1/(1−α) ≈ 2.5.

Why this is exact, not approximate

The acceptance rule is derived to make the distribution of accepted tokens equal to p. The only way to lose exactness is to use a wrong q at decode time (e.g., apply different temperature to draft vs target). SGLang ensures the draft sees the same sampling settings as the target.

Tree-shaped drafts — EAGLE

The plain spec-decoding above proposes a linear sequence of K tokens. EAGLE (Li et al., 2024) generalizes to a tree: at each position, propose the top-m tokens with high draft confidence, branching where multiple candidates look reasonable. The verification then evaluates the whole tree in one batched forward, and accepts the longest path that the target agrees with.

candidates per verification step → higher α at the same K → faster decode.

What SGLang ships

Plain draft model spec. Any small model can be a draft. Common pairing: Llama-3-70B target with Llama-3-1B draft, or Qwen-72B with Qwen-7B.
EAGLE / EAGLE-2 / EAGLE-3. A separately-trained small autoregressive module (typically one or two transformer layers) that consumes the target's last hidden state and predicts the next hidden feature, then projects to tokens through the target's existing LM head. The module shares the target's embedding and unembedding so only the small transformer layer is new weight. Cheaper than a full draft model because the feature-space prediction is easier than re-running the model. SGLang loads pretrained EAGLE weights for the popular base models.
n-gram drafting. No model at all — just retrieve common continuations from the prompt's own text. Cheap and surprisingly effective for code completion-style outputs.
Tree-aware sampling kernel. SGLang's Triton sampler takes a per-position tree mask and returns the accepted path in one launch.

The throughput equation

With:

K = draft length (linear) or expected path length (tree)
α = per-position acceptance rate
t_big = target forward time at the verification batch shape
t_draft = draft forward time (or 0 for n-gram)

Expected accepted tokens per round: n = (1 − α^K+1) / (1 − α) for linear; higher for tree.

Per-round time: t = K · t_draft + t_big.

Throughput = n / t. Compare to 1 / t_big for the baseline.

α	K	Speedup over baseline
0.3	2	~1.2×
0.5	3	~1.6×
0.7	4	~2.4×
0.85	5 (tree)	~3×+

α is the dominant lever, not K. Increasing K past the point where rejections are common buys you nothing — the extra draft work is wasted. Tree drafting helps when individual positions have multiple plausible tokens (most generative tasks); it hurts on highly deterministic outputs (formatted JSON post-fast-forward) where the linear acceptance is already high.

When spec decoding is not worth it

Failure modes

Large batch. When the batch is already large (B≥32), t_big approaches compute-bound. The verification forward isn't free anymore. Speedup collapses.
Mismatched temperature. Sampling at high temperature lowers α (draft and target disagree more). Below α ≈ 0.4, speedup vanishes.
Heavy constrained decoding. Compressed-FSM fast-forward already removes the deterministic tokens — the remaining tokens are the genuinely-hard ones where draft acceptance is low. Stacking spec on top of fast-forward is often a wash.
Tiny target model. If t_big is already small, draft overhead is a larger fraction of the budget.

What this actually composes with

Spec decoding sits on top of every previous lesson without changing them:

RadixAttention still caches prefixes; the draft model uses its own small KV cache.
Cache-aware scheduling still applies — spec is per-request.
Constrained decoding still applies: the grammar mask is applied to the target's verification logits.
CUDA graphs capture the verification path at a few common (K, batch) shapes.
TP/DP/EP shard the target model; the draft model is typically replicated (it's small).

Interactive · the spec-decoding break-even

Set draft cost, acceptance rate, and K. The widget computes throughput and shows when the technique stops paying.

What lesson 11 builds

You now have every mechanism SGLang ships: program-aware DSL, radix tree cache, cache-aware scheduler, compressed FSM + xgrammar, FlashInfer kernels, CUDA graphs, TP/DP/EP parallelism, speculative decoding. Lesson 11 puts the whole stack against vLLM — where each framework's bets win, where they tie, and how to read benchmark numbers without being misled by them.