The KV cache
Where every vLLM optimization is aimed. Get one formula in your head before reading anything else.
Why start here
vLLM is an inference engine, not a training framework. It exists to make decode-time generation cheaper. Every design choice in the project — paged memory, continuous batching, FlashAttention integration, FP8 KV, speculative decoding — is aimed at one object: the KV cache. Get its properties straight, and every later lesson reads as a direct consequence.
What the KV cache is, in one paragraph
An autoregressive transformer emits one token at a time. To produce token t+1 we need its query qt+1 to compare against all previous keys k1..t, and to pool the corresponding values v1..t:
If we recomputed all ki, vi every step the projection cost would be O(t) per step → O(T²) over a generation of length T. So instead, after emitting token t we append (kt, vt) to per-layer buffers. Step t+1 then projects only the new token (O(1)) and does the unavoidable O(t) attention dot product. The KV cache is just that buffer — incremental state, nothing more.
A concrete number
LLaMA-2 7B (32 layers, 32 heads, head_dim 128, fp16):
A single 2048-token sequence: ~1 GB of KV cache. On an 80 GB H100, after weights (~14 GB) and activations (~6 GB activations + framework overhead), you have ~60 GB left — so the theoretical ceiling is ~60 concurrent 2048-token sequences. In practice you're nowhere near that, because the cache is fragmented. That's lesson 02.
The two consequences
Two facts follow directly from the formula and a touch of arithmetic. Almost everything in vLLM is one of these in disguise.
Prefill — processing the prompt — is the opposite: T tokens worth of math in one parallel pass. Compute-bound, matrix-matrix products, high arithmetic intensity. The two phases having opposite bottlenecks is why chunked prefill (lesson 10) and P/D disaggregation (lesson 08) exist.
The fragmentation problem, made concrete
Textbook implementations of the KV cache pre-allocate max_seq_len slots per request, contiguously, up front. The reasoning is reasonable in isolation: you don't know how long the response will be, and you need contiguous memory so the attention kernel can stride through it cleanly. The cost shows up only across requests:
seq A: ████░░░░░░░░░░░░░░░░░░░░░░░░░░░░ (50/2048 used)
seq B: ███░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ (40/2048 used)
seq C: ██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ (30/2048 used)
─────────────────────────────────
useful: 5.9% wasted: 94.1% ← "internal fragmentation"
vLLM's fix is to allocate in fixed-size blocks (say 16 tokens) from a shared pool, on demand. Worst-case waste drops from ~max_seq_len bytes to at most BLOCK_SIZE − 1 bytes per sequence. That's the lesson 02 jump.
Interactive · feel the bytes
Pick a model, a context length, and a concurrent-sequence count. The widget shows you how much HBM the KV cache would consume, side-by-side with the weights, on an 80 GB H100. Try a few:
- LLaMA-1 7B (MHA) at 4096 ctx, 64 sequences — clearly impossible on one GPU. Where does the cliff hit?
- Switch to LLaMA-2 70B (GQA) — note that the per-token KV is smaller than the 7B's, despite a much bigger model. That's lesson 09.
- Drop dtype to fp8 — your KV halves. That's the second knob in lesson 05.
What the next four lessons do to this picture
| Lesson | What it attacks | How much |
|---|---|---|
| 02 PagedAttention | Fragmentation | Wasted KV: ~95% → ~6% |
| 03 FlashAttention | HBM traffic per attention op | O(T²) → O(T) intermediates |
| 04 Continuous batching | Cycle-level wait for stragglers | Decode utilization 45% → 90% |
| 05 APC + FP8 KV + … | Bytes per token, redundant prefill | 2× cache, ~95% prefill savings on shared prompts |