Speculative decoding

Draft K cheap tokens, verify them in one parallel target forward, keep the prefix that survives a ratio test. Same output distribution as plain decode, ~2× faster.

The asymmetry that makes it work

Lesson 01, Consequence B: decode is memory-bound. Each step the target model reads its whole KV cache once and does one token's worth of matmul. The compute is tiny relative to the bytes moved.

That has an underappreciated corollary. A K-token parallel forward through the target model takes essentially the same wall-clock time as a 1-token forward:

t_target(K parallel tokens) ≈ t_target(1 token)

Why? Both forwards move the KV cache the same amount of bytes (one pass over the full cache). The compute scales with K but compute was never the bottleneck. So a K-token parallel verify is essentially free.

Speculative decoding exploits exactly this: instead of generating 1 token per target step, generate K cheap guesses with a small model, then verify them all in one expensive-but-not-K×-expensive target step.

The algorithm

From Leviathan et al. 2023 (and concurrently Chen et al. at DeepMind). Given target M_T, draft M_D, speculative length K.

One round:

Draft K tokens sequentially from the small model:
for i = 1..K: q_i = p_D(· | ctx + x_1..i-1); x_i ~ q_i
Target verifies in ONE parallel forward over K+1 positions:
p_Tⁱ(·) = p_T(· | ctx + x_1..i-1) for i = 1..K+1
Accept-reject by the ratio test. For each i = 1..K, sample r ~ U(0,1):
accept x_i iff r < min(1, p_Tⁱ(x_i) / q_i(x_i))
On first rejection at position j, resample from the residual:
p̃(x) = max(0, p_T^j(x) − q_j(x)) / Z, Z = Σ_x max(0, p_T^j(x) − q_j(x))
Append this corrected token; discard the rest of the draft.
If all K accepted, sample one bonus token from p_T^K+1. Round yields K+1 tokens.

The non-obvious bit

Step 4's residual is exactly what makes the whole thing an exact sampler of p_T. Skip the proof below and you'll either disbelieve the unbiasedness or implement the rejection step wrong.

Unbiasedness — why this is exact, not approximate

Claim: per round, the distribution of the first emitted token equals p_T exactly. (The argument iterates: condition on context and apply the same logic to position 2, and so on for accepted positions.)

For any token x, the round emits x at position 1 via one of two disjoint events:

P(x emitted) = P(x proposed & accepted) + P(some y proposed, rejected, x resampled)

The first term is direct from the rule:

P(x proposed & accepted) = q(x) · min(1, p_T(x) / q(x)) = min(q(x), p_T(x))

For the second, total rejection mass is

P(any rejection) = Σ_y q(y) · (1 − min(1, p_T(y)/q(y))) = Σ_y max(0, q(y) − p_T(y)) = Z

(Equivalently, Z = 1 − Σ_y min(q(y), p_T(y)) — the two forms agree because min + max-of-the-difference partitions the probability mass.) Conditional on rejection we resample from p̃(x) = max(0, p_T(x) − q(x)) / Z, so

P(rejection & x resampled) = Z · max(0, p_T(x) − q(x)) / Z = max(0, p_T(x) − q(x))

Adding the two terms:

P(x emitted) = min(q(x), p_T(x)) + max(0, p_T(x) − q(x))

Case split:

case	min term	max term	sum
p_T(x) ≥ q(x)	q(x)	p_T(x) − q(x)	p_T(x)
p_T(x) < q(x)	p_T(x)	0	p_T(x)

Both cases give p_T(x). The marginal of the emitted token is exactly the target distribution. ∎

Quality is preserved

Speculative decoding is not a "draft model approximates the target" heuristic. Sample-by-sample, the output stream is statistically indistinguishable from running M_T alone. The draft only affects how many target steps you needed, never what you got out.

Interactive · how the residual fills the gap, per token

The widget below shows two paths emitting the same target distribution. For each token x:

The p_T bar is split into green (mass arriving via accept — equal to min(p_T, q_D)) and yellow (mass arriving via residual resample — equal to max(0, p_T − q_D)).
The q_D bar is split into green (proposals that will be accepted — same height as the green on p_T) and red (proposals that will be rejected — equal to max(0, q_D − p_T)).
The empirical bar is what the algorithm has actually emitted. Click step repeatedly — green grows from accept events, yellow grows from rejection-then-resample events. Their sum per token tracks p_T(x).

The whole red mass on the q_D side has nowhere to go on the accept path — that's the rejection probability Z. The residual max(0, p_T − q_D) / Z exactly redistributes those rejections so the totals match. Watch the yellow on empirical refill what the red rejected.

Residual resampling · live decomposition (vocab size = 8)

Triple bars per token: p_T (target) │ q_D (draft) │ empirical. Hit step a few times, then run 1000.

—

rejection mass Z

—

samples drawn

accept rate α

—

L1(empirical, p_T)

—

show the per-token identity this widget illustrates

// For any vocab item x:
accept_mass(x)   = min(p_T(x), q_D(x))         // emitted via accept path
residual_mass(x) = max(0, p_T(x) - q_D(x))     // emitted via reject→resample path

// Algebraic identity (proof above):
accept_mass(x) + residual_mass(x) = p_T(x)     for every x

// And by complementary symmetry on q_D:
accept_mass(x) + reject_mass(x)   = q_D(x)     where reject_mass(x) = max(0, q_D(x) - p_T(x))
Z = sum(reject_mass) = sum(residual_mass)      // the algorithm's rejection probability

Speedup math

Let α be the per-position acceptance probability, assumed iid across the K positions in a round (a reasonable working assumption — empirical α drifts only weakly with depth). Then the number of accepted prefix tokens is geometric-truncated:

E[accepted prefix length] = Σ_i=0..K-1 αⁱ = (1 − α^K) / (1 − α)

Every round we additionally emit one extra token — either the residual-resampled correction on first reject, or the bonus sample on a full-accept. So:

E[tokens / round] = (1 − α^K) / (1 − α) + 1

The cost of a round, measured in target-forward equivalents, where c = t_draft / t_target:

cost / round = 1 + K · c (one target verify + K sequential draft steps)

Speedup over plain decode (which emits 1 token per target step):

speedup(α, K, c) = [(1 − α^K) / (1 − α) + 1] / (1 + K · c)

Worked example

α = 0.7, K = 5, c = 0.2 (draft is 5× cheaper than target):

E[tokens/round] = (1 − 0.7^5) / (1 − 0.7) + 1
                = (1 − 0.16807) / 0.3 + 1
                = 2.7731 + 1
                = 3.7731

cost/round      = 1 + 5 · 0.2 = 2.0

speedup         = 3.7731 / 2.0 ≈ 1.89×

That's right in the middle of reported production numbers (1.5-3×). Greedy decoding at temperature 0 gives the highest α and so the best speedup — the draft and target agree more often when both pick argmax.

Variants worth knowing

variant	idea	tradeoff
Medusa (Cai et al. 2024)	Add K small "decoding heads" to the target itself; each predicts (pos+1), (pos+2), …	No separate draft model, no separate KV cache. Trains the heads on top of the frozen target.
EAGLE (Li et al. 2024)	Draft model takes the target's penultimate-layer features as input, not raw token IDs.	Higher α — features carry more predictive info than the discrete sampled token.
Lookahead (Fu et al. 2024)	No draft model. Jacobi fixed-point iteration on the target's own K-step conditional.	Trades draft cost for extra target FLOPs; useful when no good small model exists.
Tree spec (Sequoia, Medusa-tree)	Draft a tree of branches, not a single line. Verify all branches in one target pass; take the longest accepted path.	Higher expected accepted length per round at fixed K; needs a kernel that supports custom tree attention masks.

When speculative decoding does not help

Three regimes where you lose

Compute-saturated. Large-batch serving where the target is already near peak FLOPs — the K-token parallel forward is no longer ~free. The +K·c draft cost is just pure overhead.
Low α with expensive draft. Low acceptance erodes the gain, and only pushes below 1× when α is very small and draft cost is high. Plug in α=0.2, K=4, c=0.5: E[tokens/round] = (1 − 0.2⁴) / 0.8 + 1 ≈ 2.248; cost = 1 + 4·0.5 = 3.0; speedup ≈ 0.75×. At c=0.2 you'd still win modestly even at α=0.4 (≈1.32×) — the K·c overhead only dominates when both knobs go bad.
Tight KV budget. Verifying K+1 positions at once needs that much extra KV scratch. Under preemption pressure (lesson 11) spec decoding can push you into eviction.

Interactive · the speedup landscape

The speedup formula has three knobs. Slide them and watch the surface. The heatmap below plots speedup(α, K) for the c you set. Sweet spots and dead zones are immediately visible.

Things to find:

Set c = 0.2. Trace the speedup ridge as α rises — it bends toward larger K. At α ≥ 0.9, K = 8-10 dominates. At α = 0.5, the ridge stops at K ≈ 3-4: more guesses are just wasted draft work.
Set c = 0.5 (slow draft). Most of the heatmap goes cold; only high-α / low-K regions stay green. This is the "no good small model" regime where Lookahead beats classical spec.
Drop α to 0.3. Almost the whole grid is < 1.0 — you've made the system slower.

Speedup landscape, speedup(α, K) for fixed c

Green = speedup > 1, blue = < 1. White dot = your current (α, K). The colored stripe to the right reads off your exact value.

α (accept rate): 0.70 K (draft len): 5 c = t_draft/t_target: 0.20

E[tokens/round]

—

cost/round

—

speedup

—

optimal K @ this α,c

—

Empirical unbiasedness check

Run 5000 rounds of speculative decoding on a synthetic (target, draft) pair and compare the empirical distribution of the first emitted token against the true p_T. L1 should converge to 0.

L1(empirical, p_T)

—

rounds run

—

mean accepted

—

show the formula being plotted

// expected tokens per round (geometric-truncated + 1 bonus/correction):
const Etok = (a === 1) ? K + 1 : (1 - Math.pow(a, K)) / (1 - a) + 1;
// cost per round in target-step-equivalents:
const cost = 1 + K * c;
// speedup vs plain decode:
const speedup = Etok / cost;