The KV cache

Where every vLLM optimization is aimed. Get one formula in your head before reading anything else.

Why start here

vLLM is an inference engine, not a training framework. It exists to make decode-time generation cheaper. Every design choice in the project — paged memory, continuous batching, FlashAttention integration, FP8 KV, speculative decoding — is aimed at one object: the KV cache. Get its properties straight, and every later lesson reads as a direct consequence.

What the KV cache is, in one paragraph

An autoregressive transformer emits one token at a time. To produce token t+1 we need its query q_t+1 to compare against all previous keys k_1..t, and to pool the corresponding values v_1..t:

attn_t+1 = softmax(q_t+1 · [k₁, …, k_t]^T / √d) · [v₁; …; v_t]

If we recomputed all k_i, v_i every step the projection cost would be O(t) per step → O(T²) over a generation of length T. So instead, after emitting token t we append (k_t, v_t) to per-layer buffers. Step t+1 then projects only the new token (O(1)) and does the unavoidable O(t) attention dot product. The KV cache is just that buffer — incremental state, nothing more.

The one formula

Per token, per sequence, the KV cache stores:

bytes_per_token = 2 (K and V) × n_layers × n_heads × head_dim × bytes_per_dtype

Memorize this. It's the number you'll use to predict whether a workload fits in HBM, how many concurrent sequences you can serve, and whether GQA / FP8 will buy you what you need.

A concrete number

LLaMA-2 7B (32 layers, 32 heads, head_dim 128, fp16):

2 × 32 × 32 × 128 × 2 = 524,288 bytes ≈ 0.5 MB per token

A single 2048-token sequence: ~1 GB of KV cache. On an 80 GB H100, after weights (~14 GB) and activations (~6 GB activations + framework overhead), you have ~60 GB left — so the theoretical ceiling is ~60 concurrent 2048-token sequences. In practice you're nowhere near that, because the cache is fragmented. That's lesson 02.

The two consequences

Two facts follow directly from the formula and a touch of arithmetic. Almost everything in vLLM is one of these in disguise.

Consequence A · throughput ↔ HBM occupancy

Inference throughput on a fixed model is roughly proportional to the number of sequences in flight, which is gated on how efficiently you pack KV into HBM. Less fragmentation = more concurrent sequences = more tokens per second. This is what PagedAttention (lesson 02) attacks.

Consequence B · decode is memory-bound

Each decode step reads the full KV cache once: one matmul against a buffer that grows with t. The compute is tiny (one token's worth of projections) — what limits you is HBM bandwidth. Fewer bytes moved per step = faster. This is what FlashAttention (lesson 03) and KV quantization (lesson 05) attack.

Prefill — processing the prompt — is the opposite: T tokens worth of math in one parallel pass. Compute-bound, matrix-matrix products, high arithmetic intensity. The two phases having opposite bottlenecks is why chunked prefill (lesson 10) and P/D disaggregation (lesson 08) exist.

The fragmentation problem, made concrete

Textbook implementations of the KV cache pre-allocate max_seq_len slots per request, contiguously, up front. The reasoning is reasonable in isolation: you don't know how long the response will be, and you need contiguous memory so the attention kernel can stride through it cleanly. The cost shows up only across requests:

seq A: ████░░░░░░░░░░░░░░░░░░░░░░░░░░░░    (50/2048 used)
seq B: ███░░░░░░░░░░░░░░░░░░░░░░░░░░░░░    (40/2048 used)
seq C: ██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░    (30/2048 used)
       ─────────────────────────────────
       useful: 5.9%   wasted: 94.1%   ← "internal fragmentation"

vLLM's fix is to allocate in fixed-size blocks (say 16 tokens) from a shared pool, on demand. Worst-case waste drops from ~max_seq_len bytes to at most BLOCK_SIZE − 1 bytes per sequence. That's the lesson 02 jump.

Interactive · feel the bytes

Pick a model, a context length, and a concurrent-sequence count. The widget shows you how much HBM the KV cache would consume, side-by-side with the weights, on an 80 GB H100. Try a few:

LLaMA-1 7B (MHA) at 4096 ctx, 64 sequences — clearly impossible on one GPU. Where does the cliff hit?
Switch to LLaMA-2 70B (GQA) — note that the per-token KV is smaller than the 7B's, despite a much bigger model. That's lesson 09.
Drop dtype to fp8 — your KV halves. That's the second knob in lesson 05.

KV cache budget on an 80 GB H100

All bars are on the same 80 GB scale. Red overflow means it doesn't fit; you'd need TP / quantization / fewer sequences.

model: dtype:

ctx length: 2048 concurrent seqs: 16

per-token KV

—

per-seq KV

—

total KV

—

fits?

—

show the formula this widget uses (≈3 lines)

// bytes_per_token = 2 (K and V) * n_layers * n_kv_heads * head_dim * dtype_bytes
const per_token = 2 * cfg.n_layers * cfg.n_kv_heads * cfg.head_dim * dtype_bytes;
const per_seq   = per_token * ctx_len;
const total_kv  = per_seq   * n_sequences;

What the next four lessons do to this picture

Lesson	What it attacks	How much
02 PagedAttention	Fragmentation	Wasted KV: ~95% → ~6%
03 FlashAttention	HBM traffic per attention op	O(T²) → O(T) intermediates
04 Continuous batching	Cycle-level wait for stragglers	Decode utilization 45% → 90%
05 APC + FP8 KV + …	Bytes per token, redundant prefill	2× cache, ~95% prefill savings on shared prompts

Takeaway

Bytes per token = 2 · L · h_kv · d_head · dtype_bytes. Throughput is bounded by HBM occupancy; decode latency is bounded by HBM bandwidth. Every later lesson is a way to push one of those two numbers.