Speculative decoding — EAGLE trees in SGLang
Decoding is bandwidth-bound at small batch — most of the GPU time is moving weights through HBM, not computing. A small draft model proposes a few tokens, the big model verifies them all in one forward pass. Verifying K tokens costs almost the same as decoding 1, so if even half the draft is accepted you've cut wall-clock roughly in half.
The asymmetry that makes it work
Decode at batch 1 is dominated by HBM-to-SRAM traffic for the model weights. At batch B the same weights flow through HBM once and serve all B queries — so doubling batch from 1 to 2 only marginally increases wall-clock. Time-to-process-K-tokens-at-batch-1 ≈ Time-to-process-1-token-at-batch-K.
If you can predict K future tokens cheaply and then verify them in a single batched-K forward, you skip K−1 separate forwards. Predict cheaply: a smaller model, or a one-layer "head," or n-gram. Verify exactly: the same big model, batched over the K positions.
The verification math (rejection sampling)
Spec decoding is exact — the output distribution is identical to running the big model alone. The trick is rejection sampling: the draft proposes a token x with probability q(x); the big model gives p(x); we accept with probability min(1, p(x)/q(x)). If we reject, we re-sample from a corrected distribution that exactly preserves p. Walk through the rule:
- Draft proposes token xi at position i with probability q(xi).
- The target model's forward pass returns p(xi) for each draft token at each draft position.
- Accept xi with probability min(1, p(xi)/q(xi)).
- If we accept i, move to i+1 and repeat.
- On first rejection at position i, sample from the distribution (p − q)+ (clipped, normalized) and stop — the rest of the draft is discarded.
Per draft position the probability of acceptance is α (the "acceptance rate"). For typical Llama-3 + Llama-3 1B draft pairs, α ≈ 0.5–0.7. With K draft positions, the expected number of net new tokens per round is (1 − αK+1) / (1 − α) — the "+1" comes from the corrected token that's always sampled exactly when a rejection occurs, plus the bonus token if we reach the end of the draft. At α=0.6, K=4: expected ≈ 2.3 tokens per verification. In the large-K limit this approaches the classical geometric 1/(1−α) ≈ 2.5.
Tree-shaped drafts — EAGLE
The plain spec-decoding above proposes a linear sequence of K tokens. EAGLE (Li et al., 2024) generalizes to a tree: at each position, propose the top-m tokens with high draft confidence, branching where multiple candidates look reasonable. The verification then evaluates the whole tree in one batched forward, and accepts the longest path that the target agrees with.
What SGLang ships
- Plain draft model spec. Any small model can be a draft. Common pairing: Llama-3-70B target with Llama-3-1B draft, or Qwen-72B with Qwen-7B.
- EAGLE / EAGLE-2 / EAGLE-3. A separately-trained small autoregressive module (typically one or two transformer layers) that consumes the target's last hidden state and predicts the next hidden feature, then projects to tokens through the target's existing LM head. The module shares the target's embedding and unembedding so only the small transformer layer is new weight. Cheaper than a full draft model because the feature-space prediction is easier than re-running the model. SGLang loads pretrained EAGLE weights for the popular base models.
- n-gram drafting. No model at all — just retrieve common continuations from the prompt's own text. Cheap and surprisingly effective for code completion-style outputs.
- Tree-aware sampling kernel. SGLang's Triton sampler takes a per-position tree mask and returns the accepted path in one launch.
The throughput equation
With:
- K = draft length (linear) or expected path length (tree)
- α = per-position acceptance rate
- tbig = target forward time at the verification batch shape
- tdraft = draft forward time (or 0 for n-gram)
Expected accepted tokens per round: n = (1 − αK+1) / (1 − α) for linear; higher for tree.
Per-round time: t = K · tdraft + tbig.
Throughput = n / t. Compare to 1 / tbig for the baseline.
| α | K | Speedup over baseline |
|---|---|---|
| 0.3 | 2 | ~1.2× |
| 0.5 | 3 | ~1.6× |
| 0.7 | 4 | ~2.4× |
| 0.85 | 5 (tree) | ~3×+ |
α is the dominant lever, not K. Increasing K past the point where rejections are common buys you nothing — the extra draft work is wasted. Tree drafting helps when individual positions have multiple plausible tokens (most generative tasks); it hurts on highly deterministic outputs (formatted JSON post-fast-forward) where the linear acceptance is already high.
When spec decoding is not worth it
- Large batch. When the batch is already large (B≥32), tbig approaches compute-bound. The verification forward isn't free anymore. Speedup collapses.
- Mismatched temperature. Sampling at high temperature lowers α (draft and target disagree more). Below α ≈ 0.4, speedup vanishes.
- Heavy constrained decoding. Compressed-FSM fast-forward already removes the deterministic tokens — the remaining tokens are the genuinely-hard ones where draft acceptance is low. Stacking spec on top of fast-forward is often a wash.
- Tiny target model. If tbig is already small, draft overhead is a larger fraction of the budget.
What this actually composes with
Spec decoding sits on top of every previous lesson without changing them:
- RadixAttention still caches prefixes; the draft model uses its own small KV cache.
- Cache-aware scheduling still applies — spec is per-request.
- Constrained decoding still applies: the grammar mask is applied to the target's verification logits.
- CUDA graphs capture the verification path at a few common (K, batch) shapes.
- TP/DP/EP shard the target model; the draft model is typically replicated (it's small).
Interactive · the spec-decoding break-even
Set draft cost, acceptance rate, and K. The widget computes throughput and shows when the technique stops paying.
What lesson 11 builds
You now have every mechanism SGLang ships: program-aware DSL, radix tree cache, cache-aware scheduler, compressed FSM + xgrammar, FlashInfer kernels, CUDA graphs, TP/DP/EP parallelism, speculative decoding. Lesson 11 puts the whole stack against vLLM — where each framework's bets win, where they tie, and how to read benchmark numbers without being misled by them.