Preemption and swap
Everything in lessons 02–10 quietly assumed there was always a free block when a sequence needed one. There isn't. This is what vLLM does when the cache fills up.
The reality we've been ducking
The block pool is a fixed-size resource. Three ways to exhaust it:
- An oversized prompt arrives. A 6000-token request needs ⌈6000/16⌉ = 375 blocks before it can even start decoding.
- An active decoder runs longer than expected. The model kept sampling past where you guessed the EOS would land; that one extra block is no longer free.
- A burst of arrivals pushes the active set above the steady-state working set the pool was sized for.
vLLM's invariant: never admit work it can't fit. But it has already admitted plenty of in-flight requests, and one of them just needs a block that isn't there. The scheduler has to do something. What it does is preempt.
The two strategies
Strategy A · Recompute
Drop the preempted sequence's KV blocks entirely. The blocks go back to the pool. Put the sequence back in the waiting queue. When it gets re-admitted, recompute its prefill from the original token IDs (which are basically free to store — a few KB).
- Cost: redo the full prefill — (P + G) tokens worth of compute, where P is prompt length and G is the tokens generated so far at the time of preemption.
- Saves: all HBM bandwidth — no transfer.
- Wins when: prompts are short (small P), generations are short (small G), or you're compute-rich and bandwidth-starved.
Strategy B · Swap
Copy the preempted sequence's KV blocks from HBM to CPU DRAM. On resume, allocate fresh GPU blocks and copy the saved KV back in. Both transfers are block-level memcpys over NVLink / PCIe Gen5 — sequential, coalesced, fast (~50 GB/s on H100).
- Cost: HBM ↔ DRAM transfer of (P + G) tokens worth of KV bytes, twice (out then in).
- Saves: recomputing the prefill.
- Wins when: prompts are long (P large → prefill is expensive), CPU DRAM is plentiful, and the swap bandwidth beats the prefill compute rate.
Why vLLM defaults to swap
Because paged KV (lesson 02) makes the unit of transfer small and uniform: one block is block_size · n_layers · hkv · d_head · dtype bytes, independent of how long the sequence is. There's no large contiguous allocation to find or to move. CPU DRAM is cheap and large (often 512 GB+ on a serving node). And the worst-case latency of a swap is bounded by bandwidth, not by compute; recompute's worst case is unbounded prefill.
The eviction rule · reverse FIFO
Which in-flight sequence gets preempted? vLLM picks the most recently admitted active sequence first. Two reasons, both from first principles:
- Least work lost. The most-recently admitted sequence has produced the fewest tokens. Preempting it discards/swaps the smallest amount of KV. Preempting the oldest sequence is the opposite of what you want.
- Least likely to be near completion. If you preempt a sequence that was about to emit its EOS, you've just doubled its latency for one block of memory. Reverse-FIFO biases evictions toward sequences that are statistically far from the finish line.
Alternatives exist (LRU, longest-remaining, SLO-priority) and are tunable. For the typical mix, reverse-FIFO is the right default.
Cost-model arithmetic
Let:
- P = prompt length, G = tokens generated so far
- B = bytes per token in the KV (e.g. 0.5 MB for LLaMA-1 7B fp16, 0.32 MB for LLaMA-2 70B at one TP shard)
- Cf = effective prefill throughput in bytes/sec (compute-bound — roughly model weights size / step time)
- Cs = swap bandwidth in bytes/sec (HBM ↔ DRAM)
Then:
The ratio of costs is Cs / (2 · Cf). For LLaMA-2 7B numbers (effective prefill ~10 GB/s of KV-equivalent work, swap ~30 GB/s over NVLink/PCIe Gen5): swap is about 1.5× cheaper per unit KV. In practice, swap's advantage grows when the recompute side has to redo G generated tokens too (lost on KV drop), so for long generations swap can be 2–3× cheaper than what the steady-state cost model alone predicts. The 1.5× figure is the floor; real production gaps land closer to 2–3× on long-prompt long-generation workloads.
SLO impact · the double hit
A preempted request takes the latency hit twice:
- The work done before preemption is wasted (recompute) or paused (swap).
- It re-enters the waiting queue and competes with new arrivals for the next admission slot.
p99 latency is what suffers. Under bursty load you'll see a clean knee in the latency curve at exactly the request rate where the pool first saturates: below it, no preemptions, latency follows queue theory; above it, preemptions kick in and the tail explodes.
Mitigations
- Stability floor. Never preempt a sequence that has emitted fewer than K tokens since its prefill completed. Prevents pathological thrash where the same sequence is repeatedly admitted and evicted.
- SLO-aware admission. Don't admit a new request if doing so would force a preemption that violates a latency bound. This trades throughput for tail latency.
- --max-num-seqs cap. The blunt instrument: cap concurrent admissions to a fixed number known to fit comfortably in the pool. Throughput drops; preempt rate goes to zero. Use it when your workload is unpredictable.
Interactive · pool pressure simulator
200 simulated steps of a bursty workload. The blue bar shows pool usage over time; red flashes are preempted sequences. The KPI shows what the two strategies give you. The reveal: at pool capacity below the workload's sustained working set, swap dramatically beats recompute on long-prompt mixes. At capacity well above the working set, both approach zero preempt rate and the choice doesn't matter.
Putting it together
| regime | preempt rate | p99 latency | which wins |
|---|---|---|---|
| pool ≫ working set | ~0% | queue-theory tail only | doesn't matter |
| pool ≈ working set | 5–15% | 2–3× mean | swap (if long prompts) |
| pool < working set | 30%+ | 5–10× mean | swap, by a lot |
--max-num-seqs conservatively or add HBM.