Scheduling tricks — continuous batching, prefix caching, chunked prefill, spec decode
Lesson 20 gave us a paged KV cache. This lesson is everything else the rollout engine does with that storage: how it admits and evicts sequences, how it shares prompts across K-rollout RL, how it interleaves prefill with decode, and how it asks a small model to do part of the work. Four mechanisms, each a throughput or tail-latency win in its own right; the size of the win depends on the workload, and each section below gives a rough range.
Trick 1 · Continuous batching
Naïve batched generation: pad N sequences to the length of the longest, decode them in lockstep, finish when the longest finishes. The shorter sequences sit emitting padding tokens for hundreds of steps after they've already produced an EOS. The compute is wasted, the KV slots are wasted, and the batch takes as long as its longest sequence rather than its average one.
Continuous (rolling) batching: at every step, evict any sequence that just emitted EOS, accept new sequences into the freed slots, and decode the current batch. No padding-to-longest, no idle slots. With PagedAttention as the underlying storage this is essentially free — block reclamation is O(1) — and the engine's batch is always the maximum the cache can hold.
For RL specifically: you submit K rollouts per prompt × P prompts per step. The K rollouts for the same prompt have wildly different lengths (one finishes in 100 tokens, another runs to 2000). Continuous batching lets them execute concurrently and finish independently, and the rollout pool stays full until the last one returns. The published wins are 2–20× over padded static batching, scaled by length variance — the more variable the trajectory lengths, the bigger the gap.
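To make the length-variance argument concrete, here is a minimal sketch that counts decode steps under both policies. It is a toy model under stated assumptions: one token per sequence per step, a fixed number of slots, and no prefill cost. The function names and the example lengths are hypothetical.

```python
from collections import deque

def static_batch_steps(lengths, slots):
    """Decode steps when each batch of `slots` sequences runs until its longest member ends."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps

def continuous_batch_steps(lengths, slots):
    """Decode steps when a finished sequence's slot is refilled on the next step."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each active sequence
    steps = 0
    while waiting or running:
        # Admit new sequences into any free slots.
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        # One decode step: every running sequence emits one token.
        running = [r - 1 for r in running]
        steps += 1
        # Evict sequences that just finished (emitted EOS).
        running = [r for r in running if r > 0]
    return steps

# Hypothetical rollout lengths with high variance, 4 slots.
lengths = [100, 2000, 150, 300, 120, 1800, 200, 250]
static = static_batch_steps(lengths, slots=4)          # 2000 + 1800 = 3800 steps
continuous = continuous_batch_steps(lengths, slots=4)  # 2000 steps
print(static, continuous)
```

In this toy workload the continuous schedule finishes when the single longest sequence does, because every freed slot is reused immediately. The static schedule pays for the longest sequence in each batch.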
Trick 2 · Prefix caching
RL rollouts almost always share a prompt. K rollouts of the same problem differ only in their sampled completions; they have identical prefixes for the first |x| tokens. Prefix caching detects this — by hashing token prefixes — and reuses the KV blocks computed during the first sequence's prefill for all subsequent ones.
The win is large and RL-specific. Counting the token positions pushed through the model for K rollouts of one prompt:

$$
\underbrace{K\,(P + G)}_{\text{no prefix cache}} \;\longrightarrow\; \underbrace{P + K\,G}_{\text{with prefix cache}}
$$

where P is prompt length and G is generation length. For K=16 and a 1k-token math prompt with 200-token completions: 16 prefills (~480 ms total at 30 ms each) collapse to 1 prefill (~30 ms). Lesson 25's prefix-sharing kernel discussion calls this the highest-leverage single optimization for RL post-training; the speedup on K-rollout benchmarks is typically 3–10× on the rollout phase as a whole.
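The hashing idea can be sketched in a few lines. This is an illustrative toy, not any engine's implementation: the block size of 16, the `PrefixCache` class, and the chained SHA-256 key are all assumptions made for the example. Each block's key covers the whole prefix up to that block, so a hit on block i implies the preceding blocks match too.

```python
import hashlib

BLOCK = 16  # tokens per KV block (assumed value)

def block_keys(tokens, block=BLOCK):
    """One chained hash per full block; each key depends on all earlier tokens."""
    keys, parent = [], b""
    full = len(tokens) - len(tokens) % block  # ignore the ragged tail
    for i in range(0, full, block):
        chunk = ",".join(map(str, tokens[i:i + block])).encode()
        parent = hashlib.sha256(parent + chunk).digest()
        keys.append(parent)
    return keys

class PrefixCache:
    """Toy cache: maps prefix-block keys to stand-in KV blocks (no eviction)."""

    def __init__(self):
        self.blocks = {}

    def prefill(self, tokens):
        """Return how many prompt tokens had to be computed for this sequence."""
        hits = 0
        for key in block_keys(tokens):
            if key in self.blocks:
                hits += 1
            else:
                self.blocks[key] = object()  # placeholder for real KV storage
        return len(tokens) - hits * BLOCK

prompt = list(range(1024))  # hypothetical 1k-token prompt
cache = PrefixCache()
costs = [cache.prefill(prompt) for _ in range(16)]  # K = 16 rollouts
print(costs[0], sum(costs[1:]))  # first rollout pays 1024 tokens, the rest pay 0
```

A real engine would typically still run at least the final prompt token through the model to obtain logits for sampling, and it would need an eviction policy; both are left out here to keep the sketch short.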
Trick 3 · Chunked prefill
If you submit a 32k-token prompt to a vanilla engine, prefill becomes a 32k-long matmul and blocks the engine from running any decode steps for hundreds of ms. Decoding clients see latency spikes. Worse: in an RL rollout pool, a single long-context tool-use trajectory can stall the other 100 short-context rollouts.
Chunked prefill breaks the prompt into pieces (say, 4k tokens each), interleaves them with decode steps, and amortizes the long prefill across many engine steps. The prompt's tokens are admitted to the KV cache one chunk at a time; decoding sequences continue uninterrupted between chunks.
The throughput win is a modest 10–20% on long-prompt workloads; the more important effect is that tail latency stabilizes. For RL: when one of your rollouts is a long-context tool-use trace, chunked prefill is what keeps the other 100 short-context rollouts from blocking on it.
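One way to picture the interleaving is a per-step token budget: decoding sequences each claim one token, and whatever is left goes to the next prefill chunk. The sketch below assumes exactly that policy. The `schedule` function and the budget of 4,196 tokens are hypothetical values chosen so that the chunk comes out at 4k.

```python
def schedule(prompt_len, n_decoding, token_budget):
    """Plan engine steps as (decode_tokens, prefill_tokens) until the prompt is prefilled.

    Assumes decode has priority: each running sequence gets one token per step,
    and the remaining budget is spent on the long prompt's next chunk.
    """
    remaining = prompt_len
    plan = []
    while remaining > 0:
        decode_tokens = min(n_decoding, token_budget)
        chunk = min(remaining, token_budget - decode_tokens)
        if chunk == 0:
            raise ValueError("token budget leaves no room for prefill")
        plan.append((decode_tokens, chunk))
        remaining -= chunk
    return plan

# A 32k-token prompt arriving while 100 short rollouts are decoding.
plan = schedule(prompt_len=32_768, n_decoding=100, token_budget=4_196)
print(len(plan), plan[0])  # 8 steps, each (100, 4096)
```

The 100 decoding sequences emit a token on every one of the 8 steps, instead of waiting for a single 32k-token prefill to complete.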
Trick 4 · Speculative decoding
Decode is memory-bandwidth-bound (lesson 20): every token requires reading all model weights and the sequence's KV cache from HBM, while the compute per token is trivial. Speculative decoding exploits this asymmetry: have a small draft model generate K tokens speculatively (K here is the draft length, not the rollout count from the earlier sections), then have the large target model verify all K in a single batched forward pass.
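The verification is mathematically careful: each draft token y is accepted with probability min(1, π_target(y) / π_draft(y)). On the first rejection, that position is resampled from the normalized residual max(0, π_target − π_draft) and the remaining draft tokens are discarded. Together these two steps ensure the final samples are drawn exactly from the target distribution. If all K pass, you've gained up to ~K× decode throughput, before paying for the draft model. If some fail, you still keep every token up to the first failure.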
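The exactness claim can be checked numerically for a single position. The sketch below computes the distribution of the emitted token in closed form, assuming the standard speculative-sampling scheme in which a rejected position is resampled from the normalized residual max(0, target − draft). The two example distributions are made up, and the draft must give nonzero probability to every token.

```python
def accept_prob(p, q, y):
    """Probability of accepting draft token y (p = target, q = draft)."""
    return min(1.0, p[y] / q[y])

def residual(p, q):
    """Distribution used to resample after a rejection: normalize max(p - q, 0)."""
    raw = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    total = sum(raw)
    return [r / total for r in raw]

def output_distribution(p, q):
    """Exact distribution of the token emitted at one speculative position."""
    accepted = [q[y] * accept_prob(p, q, y) for y in range(len(p))]
    p_reject = 1.0 - sum(accepted)
    if p_reject <= 1e-12:  # draft equals target: nothing is ever rejected
        return accepted
    res = residual(p, q)
    return [a + p_reject * r for a, r in zip(accepted, res)]

target = [0.6, 0.3, 0.1]  # hypothetical target probabilities
draft = [0.3, 0.5, 0.2]   # hypothetical draft probabilities
out = output_distribution(target, draft)
print(out)  # matches `target` up to floating-point error
```

Here the draft is accepted 70% of the time, and the remaining 30% of the mass is routed entirely to token 0, which is the only token the draft under-weights.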
| Variant | Draft mechanism | Best case |
|---|---|---|
| Vanilla draft model | Separate small model (e.g. 1B) drafts for the 70B target | 1.5–2.5× on greedy decode |
| Medusa | Extra prediction heads on the target itself; no separate model | 1.5–2× with one model in memory |
| EAGLE-2 / EAGLE-3 | Small head conditioned on target's last hidden state — much higher acceptance | 2–3× even at moderate T |
| Lookahead decoding | Draft n-grams from the target itself via parallel Jacobi-style iterations, then verify candidates from an n-gram pool | 1.3–1.8×; no auxiliary model |
A caveat for RL: with speculative decoding enabled, the old_logp recorded for the rollout may not match what the trainer recomputes. Symptom: the PPO ratio is unexpectedly skewed away from 1 at the very first iteration. Fix: turn off speculative decoding for RL, or ensure trainer and rollout use the same kernels.
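A cheap diagnostic for this symptom is to compute the ratio on the first batch, before any gradient step, when the policy has not changed and every ratio should sit near 1. The sketch below is a hypothetical helper: the function names and the tolerance of 0.05 are assumptions, and the right threshold depends on your numerics.

```python
import math

def ratio_stats(old_logp, new_logp):
    """Mean PPO ratio and worst deviation from 1, over per-token log-probs."""
    ratios = [math.exp(n - o) for o, n in zip(old_logp, new_logp)]
    mean = sum(ratios) / len(ratios)
    worst = max(abs(r - 1.0) for r in ratios)
    return mean, worst

def first_iteration_ok(old_logp, new_logp, tol=0.05):
    """True if rollout and trainer log-probs agree closely enough (tol is an assumed value)."""
    _, worst = ratio_stats(old_logp, new_logp)
    return worst <= tol

rollout_logp = [-1.0, -2.0, -0.5]  # hypothetical values from the rollout engine
trainer_logp = [-1.0, -2.3, -0.5]  # hypothetical trainer recomputation
print(first_iteration_ok(rollout_logp, rollout_logp))  # True
print(first_iteration_ok(rollout_logp, trainer_logp))  # False: one ratio is exp(-0.3)
```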
Also: spec decoding helps most at low sampling temperature, where the draft model agrees often with the target. At T=1 (typical RL exploration), draft acceptance drops and the K-token-verify overhead doesn't pay back. Plan for ~1× at T=1, ~2× at T=0.5, ~3× at T=0.
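The payback argument can be made quantitative with the usual simplified model: assume each draft token is accepted independently with rate `alpha`, and one draft step costs `draft_cost` target passes. The expected tokens per verify pass is then (1 − alpha^(k+1)) / (1 − alpha), which counts the one token the target's own pass always contributes. The independence assumption and the example numbers below are illustrative, not measurements.

```python
def expected_tokens_per_verify(alpha, k):
    """Expected tokens emitted per target pass, with i.i.d. acceptance rate alpha and k drafts."""
    if alpha >= 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def expected_speedup(alpha, k, draft_cost):
    """Speedup over plain decoding; draft_cost is one draft step relative to one target pass."""
    return expected_tokens_per_verify(alpha, k) / (k * draft_cost + 1.0)

# Illustrative acceptance rates: lower alpha stands in for higher temperature.
for alpha in (0.3, 0.6, 0.9):
    print(alpha, round(expected_speedup(alpha, k=4, draft_cost=0.05), 2))
```

With a low acceptance rate the numerator stays close to 1 while the denominator still charges for all k draft steps, which is why the speedup can fall to roughly 1× or below.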
Interactive · stacking the tricks
[Interactive widget: a steady-state throughput model with each trick toggleable. The bar shows total tokens/sec at fixed concurrency; KPIs name the bottleneck (KV bandwidth / weight bandwidth / FLOPs). Toggling features in any order shows how the bottleneck shifts.]
vLLM vs SGLang · what's different
| | vLLM | SGLang |
|---|---|---|
| Core data structure | PagedAttention block table | RadixAttention prefix tree (on paged storage) |
| Prefix sharing granularity | Block-aligned | Arbitrary subtree |
| Strong point | High throughput on independent requests | Tree-structured prompts; tool-use; complex prefixes |
| Programming model | OpenAI-compatible API | SGLang front-end DSL for structured generation |
| RL integration | verl, OpenRLHF default | SLIME default; first-class for agentic RL |
| Weight-update path | Custom update_weights NCCL hook | Built-in resume_memory_occupation for hot-swap |
For pure verifier-RL on single-turn tasks (math, code completion) either engine works well; vLLM has a slight edge on raw throughput. For multi-turn agentic tasks with branchy tool-use, SGLang's RadixAttention gives a real lift because the tool-call tree has substantial KV-reuse opportunities that vLLM's block-aligned prefix cache catches less aggressively.
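The granularity difference in the table can be illustrated with a toy calculation: how many tokens of a second branch can reuse KV from a first branch when matching is token-level versus rounded down to whole blocks. This is only a sketch of the idea, not how either engine is implemented; the function names, the block size of 16, and the token values are hypothetical.

```python
def common_prefix_len(a, b):
    """Length of the longest shared token prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def reusable_tokens(a, b, block=None):
    """Tokens of b whose KV could be reused from a; block=None means token-level matching."""
    shared = common_prefix_len(a, b)
    if block is None:
        return shared
    return (shared // block) * block  # round down to whole blocks

system = list(range(100))                # hypothetical shared system + tool-schema prefix
branch_a = system + [1000, 1001, 1002]   # one tool-call branch
branch_b = system + [1000, 2001, 2002]   # sibling branch that diverges one token later

print(reusable_tokens(branch_a, branch_b))            # 101 tokens at token granularity
print(reusable_tokens(branch_a, branch_b, block=16))  # 96 tokens at block granularity
```

The gap per branch point is always smaller than one block, so it matters mainly when a trajectory tree has many short branches.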
Trade-offs you'll actually face
| Knob | Up means… | Down means… |
|---|---|---|
| Batch size (concurrent sequences) | Higher tokens/sec; bigger KV memory; longer time-to-first-token | Lower tokens/sec; smaller memory; faster first-token latency |
| Max generation length | Longer chains possible; KV memory grows linearly | Shorter chains; more concurrency at same memory |
| Chunk size (chunked prefill) | Prefill closer to one big matmul; fewer steps per prompt; potential decode stalls | Prefill spread over more engine steps; more decode interleaving; more per-chunk overhead |
| Spec decode K | Higher per-stream tokens/sec; more wasted draft compute on miss | Lower per-stream; less waste on miss |
| Sampling temperature | More exploration; lower spec-decode acceptance; longer tails | More exploitation; spec decode shines; shorter tails |
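The first two rows trade against each other through KV memory, and the arithmetic is simple enough to sketch. The per-token cost is 2 (keys and values) × layers × KV heads × head dimension × bytes per element. The model shape and the 40 GiB budget below are hypothetical, and the concurrency estimate assumes the worst case where every sequence reaches the maximum length.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """KV cache bytes for one token: keys and values across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

def max_concurrent_sequences(kv_budget_bytes, max_seq_len, **model):
    """Worst-case concurrency if every sequence grows to max_seq_len tokens."""
    return kv_budget_bytes // (kv_bytes_per_token(**model) * max_seq_len)

model = dict(n_layers=32, n_kv_heads=8, head_dim=128)  # hypothetical model shape, 16-bit KV
budget = 40 * 2**30                                    # hypothetical 40 GiB left for KV cache

print(kv_bytes_per_token(**model))                    # 131072 bytes = 128 KiB per token
print(max_concurrent_sequences(budget, 4096, **model))  # 80 sequences
print(max_concurrent_sequences(budget, 2048, **model))  # 160 sequences
```

Halving the maximum length doubles the worst-case concurrency at the same memory. With a paged cache the engine can usually run more sequences than this bound, since most rollouts stop well short of the maximum.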