vllm_lessons / 01 · the KV cache lesson 1 / 12

The KV cache

Where every vLLM optimization is aimed. Get one formula in your head before reading anything else.

Why start here

vLLM is an inference engine, not a training framework. It exists to make decode-time generation cheaper. Every design choice in the project — paged memory, continuous batching, FlashAttention integration, FP8 KV, speculative decoding — is aimed at one object: the KV cache. Get its properties straight, and every later lesson reads as a direct consequence.

What the KV cache is, in one paragraph

An autoregressive transformer emits one token at a time. To produce token t+1 we need its query qt+1 to compare against all previous keys k1..t, and to pool the corresponding values v1..t:

attnt+1 = softmax(qt+1 · [k1, …, kt]T / √d) · [v1; …; vt]

If we recomputed all ki, vi every step the projection cost would be O(t) per step → O(T²) over a generation of length T. So instead, after emitting token t we append (kt, vt) to per-layer buffers. Step t+1 then projects only the new token (O(1)) and does the unavoidable O(t) attention dot product. The KV cache is just that buffer — incremental state, nothing more.

The one formula
Per token, per sequence, the KV cache stores:
bytes_per_token  =  2 (K and V) × n_layers × n_heads × head_dim × bytes_per_dtype
Memorize this. It's the number you'll use to predict whether a workload fits in HBM, how many concurrent sequences you can serve, and whether GQA / FP8 will buy you what you need.

A concrete number

LLaMA-2 7B (32 layers, 32 heads, head_dim 128, fp16):

2 × 32 × 32 × 128 × 2  =  524,288 bytes  ≈  0.5 MB per token

A single 2048-token sequence: ~1 GB of KV cache. On an 80 GB H100, after weights (~14 GB) and activations (~6 GB activations + framework overhead), you have ~60 GB left — so the theoretical ceiling is ~60 concurrent 2048-token sequences. In practice you're nowhere near that, because the cache is fragmented. That's lesson 02.

The two consequences

Two facts follow directly from the formula and a touch of arithmetic. Almost everything in vLLM is one of these in disguise.

Consequence A · throughput ↔ HBM occupancy
Inference throughput on a fixed model is roughly proportional to the number of sequences in flight, which is gated on how efficiently you pack KV into HBM. Less fragmentation = more concurrent sequences = more tokens per second. This is what PagedAttention (lesson 02) attacks.
Consequence B · decode is memory-bound
Each decode step reads the full KV cache once: one matmul against a buffer that grows with t. The compute is tiny (one token's worth of projections) — what limits you is HBM bandwidth. Fewer bytes moved per step = faster. This is what FlashAttention (lesson 03) and KV quantization (lesson 05) attack.

Prefill — processing the prompt — is the opposite: T tokens worth of math in one parallel pass. Compute-bound, matrix-matrix products, high arithmetic intensity. The two phases having opposite bottlenecks is why chunked prefill (lesson 10) and P/D disaggregation (lesson 08) exist.

The fragmentation problem, made concrete

Textbook implementations of the KV cache pre-allocate max_seq_len slots per request, contiguously, up front. The reasoning is reasonable in isolation: you don't know how long the response will be, and you need contiguous memory so the attention kernel can stride through it cleanly. The cost shows up only across requests:

seq A: ████░░░░░░░░░░░░░░░░░░░░░░░░░░░░    (50/2048 used)
seq B: ███░░░░░░░░░░░░░░░░░░░░░░░░░░░░░    (40/2048 used)
seq C: ██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░    (30/2048 used)
       ─────────────────────────────────
       useful: 5.9%   wasted: 94.1%   ← "internal fragmentation"

vLLM's fix is to allocate in fixed-size blocks (say 16 tokens) from a shared pool, on demand. Worst-case waste drops from ~max_seq_len bytes to at most BLOCK_SIZE − 1 bytes per sequence. That's the lesson 02 jump.

Interactive · feel the bytes

Pick a model, a context length, and a concurrent-sequence count. The widget shows you how much HBM the KV cache would consume, side-by-side with the weights, on an 80 GB H100. Try a few:

KV cache budget on an 80 GB H100
All bars are on the same 80 GB scale. Red overflow means it doesn't fit; you'd need TP / quantization / fewer sequences.
per-token KV
per-seq KV
total KV
fits?
show the formula this widget uses (≈3 lines)
// bytes_per_token = 2 (K and V) * n_layers * n_kv_heads * head_dim * dtype_bytes
const per_token = 2 * cfg.n_layers * cfg.n_kv_heads * cfg.head_dim * dtype_bytes;
const per_seq   = per_token * ctx_len;
const total_kv  = per_seq   * n_sequences;

What the next four lessons do to this picture

LessonWhat it attacksHow much
02 PagedAttentionFragmentationWasted KV: ~95% → ~6%
03 FlashAttentionHBM traffic per attention opO(T²) → O(T) intermediates
04 Continuous batchingCycle-level wait for stragglersDecode utilization 45% → 90%
05 APC + FP8 KV + …Bytes per token, redundant prefill2× cache, ~95% prefill savings on shared prompts
Takeaway
Bytes per token = 2 · L · h_kv · d_head · dtype_bytes. Throughput is bounded by HBM occupancy; decode latency is bounded by HBM bandwidth. Every later lesson is a way to push one of those two numbers.