Scheduling tricks — continuous batching, prefix caching, chunked prefill, spec decode

Lesson 20 gave us a paged KV cache. This lesson is everything else the rollout engine does with that storage: how it admits and evicts sequences, how it shares prompts across K-rollout RL, how it interleaves prefill with decode, and how it asks a small model to do part of the work. Four mechanisms, each a throughput or latency win on its own: from 10–20% for chunked prefill up to several-fold for continuous batching and prefix caching.

Why this is its own lesson

Paged storage (lesson 20) is the substrate. Continuous batching is the scheduler over it. Prefix caching is the indexing scheme that makes K-rollout RL practical. Chunked prefill is the interleaving rule that keeps prefill from blocking decode. Speculative decoding is the trick that hides decode's memory-bandwidth bottleneck. All four assume PagedAttention exists; none of them work the same way without it.

Trick 1 · Continuous batching

Naïve batched generation: pad N sequences to the length of the longest, decode them in lockstep, finish when the longest finishes. The shorter sequences sit emitting padding tokens for hundreds of steps after they've already produced an EOS. The compute is wasted, the KV slots are wasted, and the time per batch is set by the longest sequence rather than the average.

Continuous (rolling) batching: at every step, evict any sequence that just emitted EOS, accept new sequences into the freed slots, and decode the current batch. No padding-to-longest, no idle slots. With PagedAttention as the underlying storage this is essentially free — block reclamation is O(1) — and the engine's batch is always the maximum the cache can hold.

For RL specifically: you submit K rollouts per prompt × P prompts per step. The K rollouts for the same prompt have wildly different lengths (one finishes in 100 tokens, another runs to 2000). Continuous batching lets them execute concurrently and finish independently, and the rollout pool stays full until the last one returns. The published wins are 2–20× over padded static batching, scaled by length variance — the more variable the trajectory lengths, the bigger the gap.
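To make the gap concrete, here is a minimal sketch that counts decode steps under both policies. It assumes every running sequence emits exactly one token per step and ignores prefill; the function names and the example lengths are illustrative, not taken from any engine.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Decode steps when finished sequences are evicted and replaced every step."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active sequence
    steps = 0
    while waiting or running:
        # Admit new sequences into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every running sequence emits one token.
        running = [r - 1 for r in running]
        steps += 1
        # Evict sequences that just finished.
        running = [r for r in running if r > 0]
    return steps

# Hypothetical rollout lengths with high variance, as in K-rollout RL.
lengths = [100, 2000, 150, 300, 120, 1800, 200, 250]
static = static_batching_steps(lengths, batch_size=4)
continuous = continuous_batching_steps(lengths, batch_size=4)
print(f"static: {static} steps, continuous: {continuous} steps")
```

Both policies generate the same tokens, so the ratio of step counts is the throughput ratio in this toy model. The two long sequences dominate either way; continuous batching simply lets them overlap instead of running back to back.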

Trick 2 · Prefix caching

RL rollouts almost always share a prompt. K rollouts of the same problem differ only in their sampled completions; they have identical prefixes for the first |x| tokens. Prefix caching detects this — by hashing token prefixes — and reuses the KV blocks computed during the first sequence's prefill for all subsequent ones.

The win is large and RL-specific:

without sharing: K · P prefill work + K · G decode work
with sharing: P prefill work + K · G decode work

where P is the prompt length in tokens (not the prompts-per-step count from the continuous-batching section) and G is the generation length. For K=16 and a 1k-token math prompt with 200-token completions: 16 prefills (~480 ms total at 30 ms each) collapse to 1 prefill (~30 ms). Lesson 25's prefix-sharing kernel discussion calls this the highest-leverage single optimization for RL post-training; the speedup on K-rollout benchmarks is typically 3–10× on the rollout phase as a whole.
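The same accounting as a few lines of code, using the numbers from the example. This is only the bookkeeping from the formula: it counts token positions processed and says nothing about how prefill and decode positions differ in wall-clock cost. The function name is illustrative.

```python
def rollout_work(k, prompt_len, gen_len, share_prefix):
    """Token positions processed for K rollouts of one prompt: (prefill, decode)."""
    prefill = prompt_len if share_prefix else k * prompt_len
    decode = k * gen_len
    return prefill, decode

K, P, G = 16, 1000, 200
no_share = rollout_work(K, P, G, share_prefix=False)
share = rollout_work(K, P, G, share_prefix=True)

PREFILL_MS = 30  # per-prefill latency quoted in the text
print("prefill positions:", no_share[0], "->", share[0])
print("prefill latency:  ", K * PREFILL_MS, "ms ->", PREFILL_MS, "ms")
```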

vLLM's APC vs SGLang's RadixAttention

vLLM's automatic prefix caching (APC) detects prefix sharing block by block: two requests share cached KV for the longest run of leading full blocks whose tokens are identical, so a match always ends on a block boundary. SGLang's RadixAttention (lesson 20) generalizes to arbitrary subtrees of tokens: a multi-turn agent reusing the system prompt, several earlier turns, and a tool-call header all benefit. For single-turn math RL, the two are roughly equivalent. For multi-turn agentic RL, RadixAttention pulls ahead.
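Block-aligned matching can be sketched with a chained hash: each full block's key covers its own tokens plus the key of the block before it, so equal keys imply equal prefixes. This is a toy illustration of the idea, not vLLM's implementation; `PrefixCache`, `block_hashes`, and the block size of 16 are assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block (assumed value for illustration)

def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Chain-hash each full block so a block's key depends on its whole prefix."""
    hashes = []
    parent = b""
    for b in range(len(token_ids) // block_size):
        block = token_ids[b * block_size:(b + 1) * block_size]
        parent = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(parent)
    return hashes

class PrefixCache:
    """Toy map from block hash to physical block id (hypothetical, not a real API)."""

    def __init__(self):
        self.table = {}

    def admit(self, token_ids):
        """Return (reused_blocks, new_blocks) for one request's full blocks."""
        reused = new = 0
        for h in block_hashes(token_ids):
            if h in self.table:
                reused += 1  # KV for this block already exists
            else:
                self.table[h] = len(self.table)  # allocate a new block id
                new += 1
        return reused, new

cache = PrefixCache()
prompt = list(range(40))                   # 2 full blocks + 8 leftover tokens
first = cache.admit(prompt)                # nothing cached yet
second = cache.admit(prompt + [999] * 16)  # same first 32 tokens
print(first, second)
```

The trailing partial block is never shared in this sketch, which is the practical meaning of "block-aligned": reuse stops at the last full block the two requests have in common.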

Trick 3 · Chunked prefill

If you submit a 32k-token prompt to a vanilla engine, prefill becomes a 32k-long matmul and blocks the engine from running any decode steps for hundreds of ms. Decoding clients see latency spikes. Worse: in an RL rollout pool, a single long-context tool-use trajectory can stall the other 100 short-context rollouts.

Chunked prefill breaks the prompt into pieces (say, 4k tokens each), interleaves them with decode steps, and amortizes the long prefill across many engine steps. The prompt's tokens are admitted to the KV cache one chunk at a time; decoding sequences continue uninterrupted between chunks.

The win is 10–20% on long-prompt workloads, and crucially it stabilizes tail latency. For RL: when one of your rollouts is a long-context tool-use trace, chunked prefill is what keeps the other 100 short-context rollouts from blocking on it.
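One way to picture the interleaving is a per-step token budget: decoding sequences each take one token, and whatever budget remains goes to the next prefill chunk. The sketch below assumes that policy (decode first, prefill with the leftover); real engines differ in how they prioritize, and `schedule_step` and its parameters are hypothetical names.

```python
def schedule_step(decoding, prefilling, token_budget, chunk_size):
    """Plan one engine step under a fixed token budget.

    decoding:   list of sequence ids in decode (one token each per step)
    prefilling: dict of seq_id -> prompt tokens still to prefill (mutated in place)
    Returns (decode_ids, {seq_id: prefill_tokens_this_step}).
    """
    decode_ids = decoding[:token_budget]
    budget = token_budget - len(decode_ids)
    prefill_plan = {}
    for seq_id in list(prefilling):
        if budget <= 0:
            break
        take = min(chunk_size, prefilling[seq_id], budget)
        prefill_plan[seq_id] = take
        prefilling[seq_id] -= take
        budget -= take
        if prefilling[seq_id] == 0:
            del prefilling[seq_id]  # prompt fully in the KV cache
    return decode_ids, prefill_plan

# 100 short rollouts decoding while one 32k-token prompt is admitted in 4k chunks.
decoding = list(range(100))
prefilling = {"long_trace": 32768}
steps = 0
while prefilling:
    decode_ids, plan = schedule_step(decoding, prefilling,
                                     token_budget=4196, chunk_size=4096)
    steps += 1
print(f"long prompt admitted over {steps} steps; "
      f"{len(decode_ids)} sequences decoded in every one of them")
```

The sketch leaves out the hand-off of a finished prefill into the decode list, which a real scheduler would do at the end of the step.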

Trick 4 · Speculative decoding

Decode is memory-bandwidth-bound (lesson 20): every decode step reads all of the model weights and the KV cache from HBM, while the compute per token is trivial. Speculative decoding exploits this asymmetry: have a small draft model generate K tokens speculatively (K here is the draft length, not the rollout count), then have the large target model verify all K in a single batched forward pass.

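The verification is mathematically careful: each draft token y is accepted with probability min(1, π_target(y) / π_draft(y)). At the first rejection, the remaining draft tokens are discarded and that position is resampled from the normalized residual max(0, π_target − π_draft). Together these two rules ensure the final samples are exactly from the target distribution. If all K pass, a single target forward pass has produced K tokens (plus one bonus token sampled from the target's own output at the next position), so decode throughput approaches K× minus the draft's cost. If some fail, you still keep every token before the first failure plus the resampled one.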
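A minimal sketch of the accept/reject loop, written against plain probability lists so it runs without a model. It implements the standard speculative sampling rule: accept with probability min(1, π_target/π_draft), and on the first rejection resample from the renormalized residual max(0, π_target − π_draft). The function name and argument layout are assumptions made for illustration.

```python
import random

def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Accept or reject drafted tokens so the output follows the target distribution.

    draft_tokens: K drafted token ids
    draft_probs:  K distributions (lists of floats) the draft sampled from
    target_probs: K+1 target distributions, one per drafted position plus one after
    Returns the accepted tokens plus one extra token (resampled or bonus).
    """
    out = []
    for i, y in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        if rng.random() < min(1.0, p[y] / q[y]):
            out.append(y)  # accepted: keep the draft token
            continue
        # Rejected: resample this position from the residual max(0, p - q);
        # rng.choices normalizes the weights. Later draft tokens are dropped.
        residual = [max(0.0, pt - qt) for pt, qt in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        return out
    # All K accepted: the target's distribution at position K yields a bonus token.
    bonus = target_probs[len(draft_tokens)]
    out.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return out

rng = random.Random(0)
q = [0.6, 0.3, 0.1]  # draft distribution over a 3-token vocabulary
p = [0.2, 0.5, 0.3]  # target distribution
y = rng.choices(range(3), weights=q)[0]
print(verify_draft([y], [q], [p, p], rng))
```

Running the single-token case many times and tallying the first output token recovers the target distribution p, not the draft distribution q, which is the exactness property the text describes.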

Variant	Draft mechanism	Best case
Vanilla draft model	Separate small model (e.g. 1B) drafts for the 70B target	1.5–2.5× on greedy decode
Medusa	Extra prediction heads on the target itself; no separate model	1.5–2× with one model in memory
EAGLE-2 / EAGLE-3	Small head conditioned on target's last hidden state — much higher acceptance	2–3× even at moderate T
Lookahead decoding	Draft n-grams from the target itself via Jacobi-style parallel decoding and verify them in the same pass	1.3–1.8×; no auxiliary model

Speculative decoding + RL caveat

Speculative decoding's correctness depends on faithfully sampling from the target policy. Sample-time numerical drift (e.g., kernel fusion that changes log-prob precision) can quietly make the rollout policy diverge from the trainer's policy — and then your old_logp for the rollout doesn't match what the trainer recomputes. Symptom: PPO ratio is unexpectedly skewed away from 1 at the very first iteration. Fix: turn off speculative decoding for RL, or ensure trainer and rollout use the same kernels.
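A cheap guard is to compare the rollout engine's log-probs with the trainer's recomputed log-probs before the first gradient step, when the two policies are supposed to be identical. The sketch below assumes you already have both as per-token lists; the 0.05 tolerance is an arbitrary illustrative threshold, not a recommended value.

```python
import math

def ratio_stats(old_logp, trainer_logp):
    """Per-token importance ratios exp(trainer - rollout): (mean, max deviation from 1)."""
    ratios = [math.exp(t - o) for o, t in zip(old_logp, trainer_logp)]
    mean = sum(ratios) / len(ratios)
    max_dev = max(abs(r - 1.0) for r in ratios)
    return mean, max_dev

def rollout_matches_trainer(old_logp, trainer_logp, tol=0.05):
    """True if every first-iteration ratio is within tol of 1 (tol is an assumption)."""
    _, max_dev = ratio_stats(old_logp, trainer_logp)
    return max_dev <= tol

old_logp = [-1.20, -0.35, -2.10]       # from the rollout engine
trainer_logp = [-1.20, -0.15, -2.10]   # recomputed by the trainer; one token drifted
print(ratio_stats(old_logp, trainer_logp))
print(rollout_matches_trainer(old_logp, trainer_logp))
```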

Also: spec decoding helps most at low sampling temperature, where the draft model agrees often with the target. At T=1 (typical RL exploration), draft acceptance drops and the K-token-verify overhead doesn't pay back. Plan for ~1× at T=1, ~2× at T=0.5, ~3× at T=0.
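The temperature effect can be estimated with the usual back-of-envelope model for speculative decoding. It assumes each draft token is accepted independently with the same probability `alpha`, which is a simplification, and that one draft forward pass costs `draft_cost` target forward passes. Under those assumptions the expected tokens per verification pass is (1 − α^(K+1)) / (1 − α).

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens per target pass with draft length k and acceptance rate alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def estimated_speedup(alpha, k, draft_cost):
    """Expected speedup over plain decoding; draft_cost is draft time / target time."""
    return expected_tokens_per_pass(alpha, k) / (k * draft_cost + 1.0)

# Illustrative acceptance rates only; measure them on your own draft/target pair.
for alpha in (0.4, 0.6, 0.8, 0.9):
    print(alpha, round(estimated_speedup(alpha, k=4, draft_cost=0.05), 2))
```

The same function shows why a larger draft length K only pays off when acceptance is high: the numerator saturates at 1 / (1 − α) while the denominator keeps growing with K.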

Interactive · stacking the tricks

Below: a steady-state throughput model with each trick toggleable. The bar shows total tokens/sec at fixed concurrency; KPIs name the bottleneck (KV bandwidth / weight bandwidth / FLOPs). Toggle features in any order and watch how the bottleneck shifts.

Scheduling tricks · throughput simulator

Each trick multiplies an effective throughput term. Continuous batching reduces padding waste. Prefix caching cuts prefill cost (modelled as a 0.4× rollout-time multiplier for K-rollout-shaped traffic). Chunked prefill stabilizes throughput on long prompts. Spec decoding multiplies per-stream tokens/sec by ~2× when conditions are right.

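[Interactive simulator not reproduced here. Settings shown: concurrency B = 32, average sequence length L = 2048, sampling temperature T = 0.70. Toggles: continuous batching, prefix caching, chunked prefill, speculative decoding. Readouts: aggregate tokens/sec, per-stream tok/s, bottleneck, speedup vs all-off.]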

vLLM vs SGLang · what's different

	vLLM	SGLang
Core data structure	PagedAttention block table	RadixAttention prefix tree (on paged storage)
Prefix sharing granularity	Block-aligned	Arbitrary subtree
Strong point	High throughput on independent requests	Tree-structured prompts; tool-use; complex prefixes
Programming model	OpenAI-compatible API	SGLang front-end DSL for structured generation
RL integration	verl, OpenRLHF default	SLIME default; first-class for agentic RL
Weight-update path	Custom `update_weights` NCCL hook	Built-in `resume_memory_occupation` for hot-swap

For pure verifier-RL on single-turn tasks (math, code completion) either engine works well; vLLM has a slight edge on raw throughput. For multi-turn agentic tasks with branchy tool-use, SGLang's RadixAttention gives a real lift because the tool-call tree has substantial KV-reuse opportunities that vLLM's block-aligned prefix cache catches less aggressively.

Trade-offs you'll actually face

Knob	Up means…	Down means…
Batch size (concurrent sequences)	Higher tokens/sec; bigger KV memory; longer time-to-first-token	Lower tokens/sec; smaller memory; faster first-token latency
Max generation length	Longer chains possible; KV memory grows linearly	Shorter chains; more concurrency at same memory
Chunk size (chunked prefill)	Prefill closer to one big matmul; potential decode stalls	More decode interleaving; lower prefill efficiency per step
Spec decode K	Higher per-stream tokens/sec; more wasted draft compute on miss	Lower per-stream; less waste on miss
Sampling temperature	More exploration; lower spec-decode acceptance; longer tails	More exploitation; spec decode shines; shorter tails
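The batch-size and max-length rows trade against each other through KV memory, which is linear in both. A rough sizing sketch, assuming a standard attention layout where each token stores one key and one value vector per layer per KV head; the model shape and the 40 GiB budget below are hypothetical.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """KV cache bytes for one token: a key and a value per layer per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def max_concurrent_sequences(kv_budget_bytes, max_seq_len, bytes_per_token):
    """Sequences that fit if every one of them grows to max_seq_len tokens."""
    return kv_budget_bytes // (max_seq_len * bytes_per_token)

# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, fp16 (2 bytes).
per_token = kv_bytes_per_token(32, 8, 128, 2)
budget = 40 * 1024**3  # 40 GiB left for KV after weights (assumed)
for max_len in (2048, 8192, 32768):
    print(max_len, max_concurrent_sequences(budget, max_len, per_token))
```

This is the worst-case bound. With paged storage and continuous batching, sequences that finish early return their blocks, so the achieved concurrency is usually higher than this number suggests.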

Takeaway

Four orthogonal scheduling tricks on a paged storage substrate. Continuous batching kills padding waste; prefix caching kills prompt re-prefill (the single biggest RL-specific win); chunked prefill kills long-prompt latency spikes; speculative decoding kills decode's bandwidth bottleneck when temperature is low enough. Real engines stack all four. The kernel engineer's first job (lesson 25) is to make sure the scheduler exposes them correctly to the RL framework's K-rollout-shaped traffic.