vllm_lessons / 11 · preemption and swap lesson 11 / 12

Preemption and swap

Everything in lessons 02–10 quietly assumed there was always a free block when a sequence needed one. There isn't. This is what vLLM does when the cache fills up.

The reality we've been ducking

The block pool is a fixed-size resource. Three ways to exhaust it:

  1. An oversized prompt arrives. A 6000-token request needs ⌈6000/16⌉ = 375 blocks before it can even start decoding.
  2. An active decoder runs longer than expected. The model kept sampling past where you guessed the EOS would land; that one extra block is no longer free.
  3. A burst of arrivals pushes the active set above the steady-state working set the pool was sized for.

vLLM's invariant: never admit work it can't fit. But it has already admitted plenty of in-flight requests, and one of them just needs a block that isn't there. The scheduler has to do something. What it does is preempt.

The two strategies

Strategy A · Recompute

Drop the preempted sequence's KV blocks entirely. The blocks go back to the pool. Put the sequence back in the waiting queue. When it gets re-admitted, recompute its prefill from the original token IDs (which are basically free to store — a few KB).

Strategy B · Swap

Copy the preempted sequence's KV blocks from HBM to CPU DRAM. On resume, allocate fresh GPU blocks and copy the saved KV back in. Both transfers are block-level memcpys over NVLink / PCIe Gen5 — sequential, coalesced, fast (~50 GB/s on H100).

Why vLLM defaults to swap

Because paged KV (lesson 02) makes the unit of transfer small and uniform: one block is block_size · n_layers · hkv · d_head · dtype bytes, independent of how long the sequence is. There's no large contiguous allocation to find or to move. CPU DRAM is cheap and large (often 512 GB+ on a serving node). And the worst-case latency of a swap is bounded by bandwidth, not by compute; recompute's worst case is unbounded prefill.

The eviction rule · reverse FIFO

Which in-flight sequence gets preempted? vLLM picks the most recently admitted active sequence first. Two reasons, both from first principles:

  1. Least work lost. The most-recently admitted sequence has produced the fewest tokens. Preempting it discards/swaps the smallest amount of KV. Preempting the oldest sequence is the opposite of what you want.
  2. Least likely to be near completion. If you preempt a sequence that was about to emit its EOS, you've just doubled its latency for one block of memory. Reverse-FIFO biases evictions toward sequences that are statistically far from the finish line.

Alternatives exist (LRU, longest-remaining, SLO-priority) and are tunable. For the typical mix, reverse-FIFO is the right default.

Cost-model arithmetic

Let:

Then:

trecompute  ≈  (P + G) · B / Cf
tswap-out + tswap-in  ≈  2 · (P + G) · B / Cs

The ratio of costs is Cs / (2 · Cf). For LLaMA-2 7B numbers (effective prefill ~10 GB/s of KV-equivalent work, swap ~30 GB/s over NVLink/PCIe Gen5): swap is about 1.5× cheaper per unit KV. In practice, swap's advantage grows when the recompute side has to redo G generated tokens too (lost on KV drop), so for long generations swap can be 2–3× cheaper than what the steady-state cost model alone predicts. The 1.5× figure is the floor; real production gaps land closer to 2–3× on long-prompt long-generation workloads.

SLO impact · the double hit

A preempted request takes the latency hit twice:

  1. The work done before preemption is wasted (recompute) or paused (swap).
  2. It re-enters the waiting queue and competes with new arrivals for the next admission slot.

p99 latency is what suffers. Under bursty load you'll see a clean knee in the latency curve at exactly the request rate where the pool first saturates: below it, no preemptions, latency follows queue theory; above it, preemptions kick in and the tail explodes.

Mitigations

Interactive · pool pressure simulator

200 simulated steps of a bursty workload. The blue bar shows pool usage over time; red flashes are preempted sequences. The KPI shows what the two strategies give you. The reveal: at pool capacity below the workload's sustained working set, swap dramatically beats recompute on long-prompt mixes. At capacity well above the working set, both approach zero preempt rate and the choice doesn't matter.

Pool-pressure simulator (200 steps)
Try pool=600 with strategy=recompute vs swap at long_prompt_pct=40 — the latency gap is the lesson.
mean latency
p99 latency
preempt rate
work wasted
show the cost model the simulator uses
// Per token costs (ms):
const PREFILL_MS_PER_TOK = 0.5;     // compute-bound (recompute cost)
const SWAP_MS_PER_TOK    = 0.1;     // HBM <-> DRAM
const DECODE_MS          = 10.0;    // per step over the active set
const BLOCK_SIZE         = 16;      // tokens per block

// When the pool is full and an admission requires `need` blocks,
// pop the most-recently admitted active seq (reverse FIFO) and either:
//   recompute: drop KV, requeue, pay full prefill again on re-admit
//   swap     : pay swap-out time now, pay swap-in time on resume

Putting it together

regimepreempt ratep99 latencywhich wins
pool ≫ working set~0%queue-theory tail onlydoesn't matter
pool ≈ working set5–15%2–3× meanswap (if long prompts)
pool < working set30%+5–10× meanswap, by a lot
Takeaway
Preemption is what makes vLLM degrade gracefully under load instead of OOMing or rejecting requests. Swap is the default because the cost model (2 · B / Cs < B / Cf) usually picks it. Reverse-FIFO is the eviction rule because it minimizes work lost and tail-latency damage. If your p99 has a sharp cliff above some QPS, that's the pool saturating — raise --max-num-seqs conservatively or add HBM.