
Scheduling tricks — continuous batching, prefix caching, chunked prefill, spec decode

Lesson 20 gave us a paged KV cache. This lesson is everything else the rollout engine does with that storage: how it admits and evicts sequences, how it shares prompts across K-rollout RL, how it interleaves prefill with decode, and how it asks a small model to do part of the work. Four mechanisms, each an independent throughput win, ranging from 10–20% for chunked prefill to several-fold for continuous batching and prefix caching.

Why this is its own lesson
Paged storage (lesson 20) is the substrate. Continuous batching is the scheduler over it. Prefix caching is the indexing scheme that makes K-rollout RL practical. Chunked prefill is the interleaving rule that keeps prefill from blocking decode. Speculative decoding is the trick that hides decode's memory-bandwidth bottleneck. All four assume PagedAttention exists; none of them work the same way without it.

Trick 1 · Continuous batching

Naïve batched generation: pad N sequences to the length of the longest, decode them in lockstep, finish when the longest finishes. The shorter sequences sit emitting padding tokens for hundreds of steps after they've already produced an EOS. The compute is wasted; the KV slots are wasted; and batch time is set by the longest sequence, not the average.

Continuous (rolling) batching: at every step, evict any sequence that just emitted EOS, accept new sequences into the freed slots, and decode the current batch. No padding-to-longest, no idle slots. With PagedAttention as the underlying storage this is essentially free — block reclamation is O(1) — and the engine's batch is always the maximum the cache can hold.

For RL specifically: you submit K rollouts per prompt × P prompts per step. The K rollouts for the same prompt have wildly different lengths (one finishes in 100 tokens, another runs to 2000). Continuous batching lets them execute concurrently and finish independently, and the rollout pool stays full until the last one returns. The published wins are 2–20× over padded static batching, scaled by length variance — the more variable the trajectory lengths, the bigger the gap.
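
To make the padding waste concrete, here is a minimal sketch that counts decode steps under both policies. It is a toy step counter, not an engine: the function names, the trajectory lengths, and the batch size of 4 are all made up for illustration, and it ignores prefill and KV memory limits entirely.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest sequence finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when finished sequences are evicted and slots refilled every step."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each in-flight sequence
    steps = 0
    while waiting or running:
        # Admit waiting sequences into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every running sequence emits one token.
        steps += 1
        running = [r - 1 for r in running]
        # Evict sequences that just emitted EOS.
        running = [r for r in running if r > 0]
    return steps


# Hypothetical rollout lengths with high variance, as in K-rollout RL.
lengths = [100, 2000, 150, 300, 120, 1800, 200, 90]
print(static_batching_steps(lengths, batch_size=4))      # 3800 (2000 + 1800)
print(continuous_batching_steps(lengths, batch_size=4))  # 2000
```

In this toy run the short rollouts slot in behind the 2000-token one, so the whole pool finishes when the longest trajectory does, instead of paying for the longest sequence of every batch.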

Trick 2 · Prefix caching

RL rollouts almost always share a prompt. K rollouts of the same problem differ only in their sampled completions; they have identical prefixes for the first |x| tokens. Prefix caching detects this — by hashing token prefixes — and reuses the KV blocks computed during the first sequence's prefill for all subsequent ones.

The win is large and RL-specific:

without sharing: K · P prefill work + K · G decode work
with sharing:    P prefill work + K · G decode work

where P is prompt length and G is generation length. For K=16 and a 1k-token math prompt with 200-token completions: 16 prefills (~480 ms total at 30 ms each) collapse to 1 prefill (~30 ms). Lesson 25's prefix-sharing kernel discussion calls this the highest-leverage single optimization for RL post-training; the speedup on K-rollout benchmarks is typically 3–10× on the rollout phase as a whole.
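
The arithmetic above can be checked in a few lines. This is a sketch under stated assumptions: token counts are used as a rough proxy for "work" (prefill and decode tokens do not cost the same in practice), the 1k-token prompt is taken as exactly 1000 tokens, and the function names are hypothetical.

```python
def rollout_tokens(k, prompt_len, gen_len, share_prefix):
    """Return (prefill_tokens, decode_tokens) for K rollouts of one prompt."""
    prefill = prompt_len if share_prefix else k * prompt_len
    decode = k * gen_len
    return prefill, decode


def prefill_time_ms(k, ms_per_prefill, share_prefix):
    """Total prefill time if each full-prompt prefill costs ms_per_prefill."""
    return (1 if share_prefix else k) * ms_per_prefill


K, P, G = 16, 1000, 200  # rollouts per prompt, prompt length, generation length

print(rollout_tokens(K, P, G, share_prefix=False))  # (16000, 3200)
print(rollout_tokens(K, P, G, share_prefix=True))   # (1000, 3200)
print(prefill_time_ms(K, 30, share_prefix=False))   # 480
print(prefill_time_ms(K, 30, share_prefix=True))    # 30
```

The decode term is untouched by sharing; only the prefill term shrinks by a factor of K.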

vLLM's APC vs SGLang's RadixAttention
vLLM detects prefix sharing block-by-block: two requests share a cached KV block only if every token up to the end of that block is identical, so reuse happens in whole-block units. SGLang's RadixAttention (lesson 20) generalizes to arbitrary subtrees of tokens: a multi-turn agent reusing the system prompt, several earlier turns, and a tool-call header all benefit. For single-turn math RL, the two are roughly equivalent. For multi-turn agentic RL, RadixAttention pulls ahead.
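
One way to picture block-aligned sharing is a chained hash: each full block is hashed together with the hash of everything before it, so a hit on block i implies the whole prefix matched. The sketch below illustrates that idea only. It is not vLLM's code; `PrefixCache`, `block_hashes`, and the block size of 4 are hypothetical names and values chosen for readability.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV block; tiny on purpose (hypothetical value)


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash each full block chained with the hash of the preceding blocks."""
    hashes = []
    prev = b""
    full = len(tokens) - len(tokens) % block_size  # trailing partial block is not cached
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(digest)
        prev = digest
    return hashes


class PrefixCache:
    """Toy index from chained block hash to a physical block id."""

    def __init__(self):
        self.blocks = {}

    def match_and_insert(self, tokens):
        """Return (reused_blocks, new_blocks) for one request's prompt."""
        reused = new = 0
        for h in block_hashes(tokens):
            if h in self.blocks:
                reused += 1
            else:
                self.blocks[h] = len(self.blocks)
                new += 1
        return reused, new


cache = PrefixCache()
prompt = list(range(10))            # 10 tokens -> 2 full blocks + 2 leftover tokens
print(cache.match_and_insert(prompt))   # (0, 2): first rollout pays for prefill
print(cache.match_and_insert(prompt))   # (2, 0): second rollout reuses both blocks

diverged = prompt.copy()
diverged[5] = 99                    # differs inside the second block
print(cache.match_and_insert(diverged))  # (1, 1): only the first block is shared
```

A radix tree over tokens would let the third request share up to the exact point of divergence rather than up to the last matching block boundary, which is the granularity difference the paragraph above describes.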

Trick 3 · Chunked prefill

If you submit a 32k-token prompt to a vanilla engine, prefill becomes a 32k-long matmul and blocks the engine from running any decode steps for hundreds of ms. Decoding clients see latency spikes. Worse: in an RL rollout pool, a single long-context tool-use trajectory can stall the other 100 short-context rollouts.

Chunked prefill breaks the prompt into pieces (say, 4k tokens each), interleaves them with decode steps, and amortizes the long prefill across many engine steps. The prompt's tokens are admitted to the KV cache one chunk at a time; decoding sequences continue uninterrupted between chunks.

[Figure: engine timeline. Without chunked prefill: one prefill blocks the engine for 250 ms while the other (decode-only) sequences wait for it to finish. With chunked prefill (4k chunks): chunk 1, decode, chunk 2, decode, chunk 3; other sequences make decode progress between chunks with no stall.]

The win is 10–20% on long-prompt workloads, and crucially it stabilizes tail latency. For RL: when one of your rollouts is a long-context tool-use trace, chunked prefill is what keeps the other 100 short-context rollouts from blocking on it.
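
The interleaving rule can be sketched as a timeline generator. This is a simplified model that assumes strict alternation between one prefill chunk and one decode step for the running sequences; real schedulers may instead mix prefill and decode tokens in the same step under a token budget. The function names are hypothetical, and "32k" and "4k" are taken as 32768 and 4096.

```python
def engine_timeline(prompt_tokens, chunk_size=None):
    """List of (kind, tokens) engine steps needed to prefill one long prompt.

    chunk_size=None models an unchunked engine: the whole prompt is one prefill step.
    """
    if chunk_size is None:
        chunk_size = prompt_tokens
    timeline = []
    for start in range(0, prompt_tokens, chunk_size):
        timeline.append(("prefill", min(chunk_size, prompt_tokens - start)))
        # After every chunk, the other running sequences get one decode step.
        timeline.append(("decode", 1))
    return timeline


def longest_stall(timeline):
    """Largest number of prefill tokens processed between two decode steps."""
    return max(n for kind, n in timeline if kind == "prefill")


unchunked = engine_timeline(32768)
chunked = engine_timeline(32768, chunk_size=4096)

print(longest_stall(unchunked))  # 32768
print(longest_stall(chunked))    # 4096
print(sum(1 for kind, _ in chunked if kind == "decode"))  # 8
```

Total prefill work is unchanged; what changes is the longest stretch a decoding sequence has to wait, which is why the benefit shows up mainly in tail latency.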

Trick 4 · Speculative decoding

Decode is memory-bandwidth-bound (lesson 20): every token requires reading the full weight tensor and KV cache from HBM. The compute per token is trivial. Speculative decoding exploits this asymmetry: have a small draft model generate K tokens speculatively, then have the large target model verify all K in a single batched forward pass.

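The verification is mathematically careful: each draft token y is accepted with probability min(1, π_target(y) / π_draft(y)); on the first rejection, a replacement token is drawn from the renormalized residual max(0, π_target − π_draft) and the remaining draft tokens are discarded. This accept-or-resample rule is what makes the final samples exactly distributed as the target. If all K pass, one target forward pass has produced K tokens, a gain of roughly K× in decode throughput. If some fail, you still keep every token before the first rejection plus the resampled one.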
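
A single-position sketch of the acceptance test shows why the output distribution stays exact. It uses the standard speculative-sampling rule: accept with probability min(1, π_target / π_draft), and on rejection resample from the renormalized residual max(0, π_target − π_draft). The two three-token distributions below are made up for illustration, and a real engine would apply this to K positions in sequence, stopping at the first rejection.

```python
import random


def speculative_sample(p_target, p_draft, rng):
    """Return (token, accepted) with token distributed exactly as p_target."""
    vocab = list(range(len(p_target)))
    # 1. The draft model proposes y ~ p_draft.
    y = rng.choices(vocab, weights=p_draft)[0]
    # 2. Accept y with probability min(1, p_target(y) / p_draft(y)).
    if rng.random() < min(1.0, p_target[y] / p_draft[y]):
        return y, True
    # 3. Otherwise resample from the residual max(0, p_target - p_draft);
    #    rng.choices normalizes the weights for us.
    residual = [max(0.0, t - d) for t, d in zip(p_target, p_draft)]
    return rng.choices(vocab, weights=residual)[0], False


p_target = [0.6, 0.3, 0.1]  # hypothetical target distribution
p_draft = [0.2, 0.5, 0.3]   # hypothetical draft distribution

rng = random.Random(0)
n = 50_000
counts = [0, 0, 0]
accepted = 0
for _ in range(n):
    token, ok = speculative_sample(p_target, p_draft, rng)
    counts[token] += 1
    accepted += ok

freqs = [c / n for c in counts]
print(freqs)         # empirically close to p_target
print(accepted / n)  # close to sum(min(p, q)) = 0.6 for these two distributions
```

The acceptance rate equals the overlap between the two distributions, which is why a draft that tracks the target closely is worth so much more than one that does not.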

| Variant | Draft mechanism | Best case |
|---|---|---|
| Vanilla draft model | Separate small model (e.g. 1B) drafts for the 70B target | 1.5–2.5× on greedy decode |
| Medusa | Extra prediction heads on the target itself; no separate model | 1.5–2× with one model in memory |
| EAGLE-2 / EAGLE-3 | Small head conditioned on target's last hidden state; much higher acceptance | 2–3× even at moderate T |
| Lookahead decoding | Draft from the target itself by re-using past hidden states | 1.3–1.8×; no auxiliary model |

Speculative decoding + RL caveat
Speculative decoding's correctness depends on faithfully sampling from the target policy. Sample-time numerical drift (e.g., kernel fusion that changes log-prob precision) can quietly make the rollout policy diverge from the trainer's policy — and then your old_logp for the rollout doesn't match what the trainer recomputes. Symptom: PPO ratio is unexpectedly skewed away from 1 at the very first iteration. Fix: turn off speculative decoding for RL, or ensure trainer and rollout use the same kernels.

Also: spec decoding helps most at low sampling temperature, where the draft model agrees often with the target. At T=1 (typical RL exploration), draft acceptance drops and the K-token-verify overhead doesn't pay back. Plan for ~1× at T=1, ~2× at T=0.5, ~3× at T=0.
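
To see how quickly the payoff falls as acceptance drops, a common back-of-envelope model assumes each draft token is accepted independently with the same probability α. Under that assumption the expected number of tokens produced per target forward pass, counting the one token the target always contributes, is (1 − α^(K+1)) / (1 − α). The sketch below just evaluates that formula; the α values are illustrative and are not measurements for any particular temperature or model pair.

```python
def expected_tokens_per_verify(alpha, k):
    """Expected tokens per target forward pass with K drafted tokens.

    Assumes i.i.d. per-token acceptance probability alpha (a simplification).
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


K = 4
for alpha in (0.0, 0.5, 0.9):
    print(alpha, expected_tokens_per_verify(alpha, K))
# alpha = 0.0 gives 1.0 token per pass: no gain, and the draft compute is pure overhead.
# alpha = 0.5 gives 1.9375 tokens per pass.
# alpha = 0.9 gives roughly 4.1 tokens per pass.
```

The net speedup is lower than these numbers because the draft model's own forward passes are not free, which is how a low-acceptance regime ends up near 1× overall.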

Interactive · stacking the tricks

Below: a steady-state throughput model with each trick toggleable. The bar shows total tokens/sec at fixed concurrency; KPIs name the bottleneck (KV bandwidth / weight bandwidth / FLOPs). Toggle features in any order and watch how the bottleneck shifts.

Scheduling tricks · throughput simulator
Each trick multiplies an effective throughput term. Continuous batching reduces padding waste. Prefix caching cuts prefill cost (modelled as a 0.4× rollout-time multiplier for K-rollout-shaped traffic). Chunked prefill stabilizes throughput on long prompts. Spec decoding multiplies per-stream tokens/sec by ~2× when conditions are right.
Readouts: tokens/sec (aggregate), per-stream tok/s, bottleneck, speedup vs all-off.
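
For readers without the interactive widget, the multiplicative model it describes can be approximated in a few lines. This is a rough stand-in, not the simulator's actual code: the multipliers are picked from figures quoted in this lesson (low ends of the quoted ranges where a range is given), the 1000 tokens/sec baseline is invented, and the bottleneck readout is not modelled at all.

```python
# Throughput multipliers per trick. Values are assumptions drawn from this lesson's text.
MULTIPLIERS = {
    "continuous_batching": 2.0,  # low end of the quoted 2 to 20x over padded static batching
    "prefix_caching": 1 / 0.4,   # 0.4x rollout-time multiplier, i.e. 2.5x throughput
    "chunked_prefill": 1.1,      # low end of the quoted 10 to 20% win on long prompts
    "spec_decoding": 2.0,        # ~2x per-stream when conditions are right
}


def aggregate_tokens_per_sec(baseline, enabled):
    """Multiply the baseline by the factor of every enabled trick."""
    tps = baseline
    for name, factor in MULTIPLIERS.items():
        if name in enabled:
            tps *= factor
    return tps


baseline = 1000.0  # hypothetical all-off tokens/sec
all_on = aggregate_tokens_per_sec(baseline, set(MULTIPLIERS))
print(all_on / baseline)  # speedup vs all-off, about 11x with these assumed factors
print(aggregate_tokens_per_sec(baseline, {"prefix_caching"}) / baseline)  # about 2.5x
```

Treating the tricks as independent multipliers is itself a modelling choice; in a real engine the gains interact (for example, spec decoding at T=1 would contribute close to 1×).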

vLLM vs SGLang · what's different

| | vLLM | SGLang |
|---|---|---|
| Core data structure | PagedAttention block table | RadixAttention prefix tree (on paged storage) |
| Prefix sharing granularity | Block-aligned | Arbitrary subtree |
| Strong point | High throughput on independent requests | Tree-structured prompts; tool-use; complex prefixes |
| Programming model | OpenAI-compatible API | SGLang front-end DSL for structured generation |
| RL integration | verl, OpenRLHF default | SLIME default; first-class for agentic RL |
| Weight-update path | Custom update_weights NCCL hook | Built-in resume_memory_occupation for hot-swap |

For pure verifier-RL on single-turn tasks (math, code completion) either engine works well; vLLM has a slight edge on raw throughput. For multi-turn agentic tasks with branchy tool-use, SGLang's RadixAttention gives a real lift because the tool-call tree has substantial KV-reuse opportunities that vLLM's block-aligned prefix cache catches less aggressively.

Trade-offs you'll actually face

| Knob | Up means… | Down means… |
|---|---|---|
| Batch size (concurrent sequences) | Higher tokens/sec; bigger KV memory; longer time-to-first-token | Lower tokens/sec; smaller memory; faster first-token latency |
| Max generation length | Longer chains possible; KV memory grows linearly | Shorter chains; more concurrency at same memory |
| Chunk size (chunked prefill) | Prefill closer to one big matmul; potential decode stalls | More decode interleaving; prefill spread over more engine steps |
| Spec decode K | Higher per-stream tokens/sec; more wasted draft compute on miss | Lower per-stream; less waste on miss |
| Sampling temperature | More exploration; lower spec-decode acceptance; longer tails | More exploitation; spec decode shines; shorter tails |

Takeaway
Four orthogonal scheduling tricks on a paged storage substrate. Continuous batching kills padding waste; prefix caching kills prompt re-prefill (single biggest RL-specific win); chunked prefill kills long-prompt latency spikes; speculative decoding kills decode's bandwidth bottleneck when temperature is low enough. Real engines stack all four. The kernel-engineer's first job (lesson 25) is to make sure the scheduler exposes them correctly to the RL framework's K-rollout-shaped traffic.