Preemption and swap

Everything in lessons 02–10 quietly assumed there was always a free block when a sequence needed one. There isn't. This is what vLLM does when the cache fills up.

The reality we've been ducking

The block pool is a fixed-size resource. Three ways to exhaust it:

An oversized prompt arrives. A 6000-token request needs ⌈6000/16⌉ = 375 blocks before it can even start decoding.
An active decoder runs longer than expected. The model kept sampling past where you guessed the EOS would land; that one extra block is no longer free.
A burst of arrivals pushes the active set above the steady-state working set the pool was sized for.

vLLM's invariant: never admit work it can't fit. But it has already admitted plenty of in-flight requests, and one of them just needs a block that isn't there. The scheduler has to do something. What it does is preempt.

The two strategies

Strategy A · Recompute

Drop the preempted sequence's KV blocks entirely. The blocks go back to the pool. Put the sequence back in the waiting queue. When it gets re-admitted, recompute its prefill from the original token IDs (which are basically free to store — a few KB).

Cost: redo the full prefill — (P + G) tokens worth of compute, where P is prompt length and G is the tokens generated so far at the time of preemption.
Saves: all HBM bandwidth — no transfer.
Wins when: prompts are short (small P), generations are short (small G), or you're compute-rich and bandwidth-starved.

Strategy B · Swap

Copy the preempted sequence's KV blocks from HBM to CPU DRAM. On resume, allocate fresh GPU blocks and copy the saved KV back in. Both transfers are block-level memcpys over NVLink / PCIe Gen5 — sequential, coalesced, fast (~50 GB/s on H100).

Cost: HBM ↔ DRAM transfer of (P + G) tokens worth of KV bytes, twice (out then in).
Saves: recomputing the prefill.
Wins when: prompts are long (P large → prefill is expensive), CPU DRAM is plentiful, and the swap bandwidth beats the prefill compute rate.

Why vLLM defaults to swap

Because paged KV (lesson 02) makes the unit of transfer small and uniform: one block is block_size · n_layers · h_kv · d_head · dtype bytes, independent of how long the sequence is. There's no large contiguous allocation to find or to move. CPU DRAM is cheap and large (often 512 GB+ on a serving node). And the worst-case latency of a swap is bounded by bandwidth, not by compute; recompute's worst case is unbounded prefill.

The eviction rule · reverse FIFO

Which in-flight sequence gets preempted? vLLM picks the most recently admitted active sequence first. Two reasons, both from first principles:

Least work lost. The most-recently admitted sequence has produced the fewest tokens. Preempting it discards/swaps the smallest amount of KV. Preempting the oldest sequence is the opposite of what you want.
Least likely to be near completion. If you preempt a sequence that was about to emit its EOS, you've just doubled its latency for one block of memory. Reverse-FIFO biases evictions toward sequences that are statistically far from the finish line.

Alternatives exist (LRU, longest-remaining, SLO-priority) and are tunable. For the typical mix, reverse-FIFO is the right default.

Cost-model arithmetic

Let:

P = prompt length, G = tokens generated so far
B = bytes per token in the KV (e.g. 0.5 MB for LLaMA-1 7B fp16, 0.32 MB for LLaMA-2 70B at one TP shard)
C_f = effective prefill throughput in bytes/sec (compute-bound — roughly model weights size / step time)
C_s = swap bandwidth in bytes/sec (HBM ↔ DRAM)

Then:

t_recompute ≈ (P + G) · B / C_f

t_swap-out + t_swap-in ≈ 2 · (P + G) · B / C_s

The ratio of costs is C_s / (2 · C_f). For LLaMA-2 7B numbers (effective prefill ~10 GB/s of KV-equivalent work, swap ~30 GB/s over NVLink/PCIe Gen5): swap is about 1.5× cheaper per unit KV. In practice, swap's advantage grows when the recompute side has to redo G generated tokens too (lost on KV drop), so for long generations swap can be 2–3× cheaper than what the steady-state cost model alone predicts. The 1.5× figure is the floor; real production gaps land closer to 2–3× on long-prompt long-generation workloads.

SLO impact · the double hit

A preempted request takes the latency hit twice:

The work done before preemption is wasted (recompute) or paused (swap).
It re-enters the waiting queue and competes with new arrivals for the next admission slot.

p99 latency is what suffers. Under bursty load you'll see a clean knee in the latency curve at exactly the request rate where the pool first saturates: below it, no preemptions, latency follows queue theory; above it, preemptions kick in and the tail explodes.

Mitigations

Stability floor. Never preempt a sequence that has emitted fewer than K tokens since its prefill completed. Prevents pathological thrash where the same sequence is repeatedly admitted and evicted.
SLO-aware admission. Don't admit a new request if doing so would force a preemption that violates a latency bound. This trades throughput for tail latency.
--max-num-seqs cap. The blunt instrument: cap concurrent admissions to a fixed number known to fit comfortably in the pool. Throughput drops; preempt rate goes to zero. Use it when your workload is unpredictable.

Interactive · pool pressure simulator

200 simulated steps of a bursty workload. The blue bar shows pool usage over time; red flashes are preempted sequences. The KPI shows what the two strategies give you. The reveal: at pool capacity below the workload's sustained working set, swap dramatically beats recompute on long-prompt mixes. At capacity well above the working set, both approach zero preempt rate and the choice doesn't matter.

Pool-pressure simulator (200 steps)

Try pool=600 with strategy=recompute vs swap at long_prompt_pct=40 — the latency gap is the lesson.

pool capacity (blocks): 600 arrival rate (req/step): 3

long-prompt %: 25 strategy:

mean latency

—

p99 latency

—

preempt rate

—

work wasted

—

show the cost model the simulator uses

// Per token costs (ms):
const PREFILL_MS_PER_TOK = 0.5;     // compute-bound (recompute cost)
const SWAP_MS_PER_TOK    = 0.1;     // HBM <-> DRAM
const DECODE_MS          = 10.0;    // per step over the active set
const BLOCK_SIZE         = 16;      // tokens per block

// When the pool is full and an admission requires `need` blocks,
// pop the most-recently admitted active seq (reverse FIFO) and either:
//   recompute: drop KV, requeue, pay full prefill again on re-admit
//   swap     : pay swap-out time now, pay swap-in time on resume

Putting it together

regime	preempt rate	p99 latency	which wins
pool ≫ working set	~0%	queue-theory tail only	doesn't matter
pool ≈ working set	5–15%	2–3× mean	swap (if long prompts)
pool < working set	30%+	5–10× mean	swap, by a lot

Takeaway

Preemption is what makes vLLM degrade gracefully under load instead of OOMing or rejecting requests. Swap is the default because the cost model (2 · B / C_s < B / C_f) usually picks it. Reverse-FIFO is the eviction rule because it minimizes work lost and tail-latency damage. If your p99 has a sharp cliff above some QPS, that's the pool saturating — raise --max-num-seqs conservatively or add HBM.