Optimization catalog
The remaining levers, derived from first principles. Each one targets one of the two consequences from lesson 01 — bytes per token, or bytes moved per step.
How to read this lesson
The previous four lessons have done the heavy lifting. The KV cache (01), paging it (02), reading it efficiently (03), and scheduling around it (04) are most of the throughput story. What's left is a catalog of additional levers — none of them on the scale of paged KV + continuous batching, but each one buys you a measurable factor. We cover the ones that don't have a dedicated lesson elsewhere:
- Automatic Prefix Caching (APC) — share KV blocks across requests with common prefixes.
- Quantization — fewer bytes per parameter, fewer bytes per KV slot.
- Tensor parallelism — split each matmul across N GPUs.
- CUDA graph capture — remove Python from the steady-state path.
- Speculative decoding (teaser) — see it derived in full at lesson 07.
Forward references: speculative decoding (07), prefill/decode disaggregation (08), chunked prefill (10) each get their own lesson.
1 · Automatic Prefix Caching
Almost every serving workload has shared prefixes. A system prompt that's identical for every user. A few-shot example block reused across calls. A RAG context document that backs hundreds of queries. With a naive cache, every request prefills its prompt from scratch, producing bit-for-bit identical KV blocks that you immediately overwrite when the request finishes.
The block manager from lesson 02 gave us refcounted blocks and copy-on-write. APC asks: when we're about to allocate a new block for a request, can we check whether some other request already computed the same one and just point at it?
The hash chain
For two blocks to be interchangeable, they need to encode the same KV — which means they need to contain the same token IDs and have been produced after the same earlier sequence of tokens (because every K and V depends on the full context to the left through attention). So the cache key for block i can't just be the 16 token IDs in that block. It has to be the 16 tokens plus the identity of the chain that led to it.
The trick is a chained hash:
where h0 is a fixed sentinel for the empty prefix. Now hi uniquely identifies a complete (token, position, predecessor) tuple. Two sequences whose third block happens to contain the same 16 tokens get the same h3 only if their first two blocks also matched. This is exactly what correctness demands — and it costs one SHA per block boundary, amortized over hundreds of attention FLOPs.
The cache is a hash map: h → block_id, refcount. On allocation we check; if h is present we just bump its refcount (this is the same fork-on-share mechanism from lesson 02). If not, we allocate a fresh block, compute its KV during prefill, and insert it.
The numbers
Take a system prompt of 2000 tokens, a user query of 100 tokens, block size 16. The first request prefills:
Of those, 125 are blocks of the (shared) system prompt — they hash deterministically given the prompt's token IDs. The next request, with the same system prompt and a different user query, will find those 125 blocks already in the cache and only need to compute the 7 user-query blocks itself. That's
of its prefill compute that never happens. The first request pays full freight; every subsequent request with the same system prompt pays for its query only. For a server fielding thousands of requests against one system prompt, the amortized prefill cost goes to the cost of the user query, which is usually a tiny fraction.
Interactive · the hit grid
Configure a system prompt and user query length, pick how many requests share the prompt, and see the per-block hit grid. Each cell is one 16-token block. The first request misses every block (it's populating the cache). Subsequent requests hit on the system-prompt blocks and miss on their own unique query tail. Push "shared prompt length" up and watch the green region of the grid widen at the expense of the blue stripe on the right.
2 · Quantization
Decode is memory-bound (lesson 01, consequence B). The kernel's wall-clock is set by how many bytes leave HBM per step, not by how many FLOPs the SMs do. So if you halve the bytes per element, you roughly halve the decode latency. Two targets:
Weight-only quantization (AWQ, GPTQ)
Store W in int8 or int4. Dequantize on the fly inside the matmul kernel — the activations stay in fp16 / bf16. Total weight-load traffic per step drops by 2× (int8) or 4× (int4). The arithmetic happens at fp16 precision, so accuracy stays surprisingly close to baseline: typical perplexity hit is <1% for int4 on 7B+ models with a good calibration scheme.
The catch is in choosing the per-channel scale factors:
- GPTQ: layer-wise second-order optimization. Run a calibration set through the layer, solve for scales that minimize the layer's output MSE. Slow to compute, runs once.
- AWQ: identify the 1% of weight channels that handle the activation outliers, give those higher-resolution scales. Cheaper calibration, comparable quality.
KV cache quantization (fp8)
The KV cache is the other big consumer of HBM traffic per decode step (the kernel reads all of it per step — that's exactly why decode is memory-bound). Store K and V in fp8 (E4M3 or E5M2) instead of fp16 and the per-token cache is halved:
Two downstream effects, both linear: (a) decode HBM traffic halves → decode is ~2× faster (in the bandwidth-bound regime), and (b) for a fixed HBM budget, the cache holds 2× as many tokens → 2× the concurrent sequences. The two combine.
Quality cost: small but nonzero. fp8 E4M3 has 3 mantissa bits, so each individual K/V element rounds to one of 256 values. Attention is a weighted average so the rounding noise mostly cancels, but very long contexts (32k+) start to show it. vLLM exposes --kv-cache-dtype fp8.
Why activation quantization is the hard one
Activations have outliers — a handful of channels carry magnitudes 10–100× the median. Naive int8 quantization with a single tensor-wide scale wastes 95% of the range on the outlier channels, leaving everyone else with 2–3 bits of effective resolution. Solutions exist (SmoothQuant migrates outlier scale into the weights; per-channel scaling) but the headline number is smaller than weight + KV quant. There's also a compounding problem: quantization error from layer ℓ becomes the input to layer ℓ+1's quantizer. Cascading 80 layers of small errors is what limits how aggressive you can get.
3 · Tensor parallelism (Megatron-style)
One GPU isn't big enough for a 70B model in fp16 (140 GB of weights, vs 80 GB of HBM). The two strategies are tensor parallelism (split each layer's matmul across N GPUs) and pipeline parallelism (give each GPU consecutive layers). vLLM is built around tensor parallelism.
The split, in one MLP block
A transformer MLP is y = W2 · σ(W1 · x), with W1 ∈ ℝd × 4d and W2 ∈ ℝ4d × d. Split across N GPUs:
- Column-split W1: each GPU holds d × (4d/N). Each GPU computes zk = σ(W1,k · x) independently. No communication needed; activations x are replicated.
- Row-split W2: each GPU holds (4d/N) × d. GPU k computes its partial output yk = W2,k · zk. The full y is Σk yk — one all-reduce.
Attention is sharded by heads: each GPU gets h / N heads, computes its slice of the output independently, then one all-reduce to recombine into the residual stream. Net communication per transformer layer: 2 all-reduces (MLP + attention), each carrying batch · seq · d activations.
Why it scales within a node but not across
Per-layer all-reduce cost is ~ (batch · seq · d · bytes) / bandwidth. NVLink on H100 is ~900 GB/s; PCIe is ~64 GB/s; cross-node InfiniBand is ~50 GB/s. The same all-reduce that takes ~50 μs on NVLink takes ~700 μs on PCIe. With 80 layers per pass that's 50 ms of pure communication per token on PCIe — completely dominating the compute. So TP works inside a DGX-style node with NVLink; across nodes you switch to pipeline parallelism, which only sends activations once per pipeline stage rather than per layer.
4 · CUDA graph capture
A decode step launches a few hundred small CUDA kernels: attention, two matmuls per MLP, layer-norms, activation functions, residual adds, all per layer for L = 32–80 layers. Each kernel launch incurs ~10 μs of Python + driver overhead. At L = 80 layers and ~6 ops per layer that's:
A 70B decode step on H100 takes ~5–10 ms of actual GPU work. So the Python overhead is genuinely 30–50% of step time. The fix is CUDA graphs: record the whole sequence of launches once (with placeholder inputs), then on every subsequent decode step submit the recorded graph with one API call. Python is out of the loop entirely.
The catch is that the graph shape is fixed at capture time — the batch dimension, the active set, the block-table strides are all baked in. vLLM handles this by capturing graphs at a small set of common batch sizes at startup (1, 2, 4, 8, 16, 32, …) and dispatching to the smallest one that fits the current step. If your active set is 6, you replay the captured-for-8 graph and waste two slots — still a big win vs. uncaptured.
5 · Speculative decoding — teaser
The idea, in two lines: a small "draft" model proposes K tokens. The big "target" model evaluates all K in one parallel forward pass (which costs ~the same as evaluating one token, because both are memory-bound and the K-token pass amortizes the KV read). A clever rejection-sampling rule decides which proposals to accept.
The speedup is a + 1 tokens per target step, where a is the average number of accepted draft tokens (typically 2–4). The derivation, the failure modes, and the Medusa / EAGLE / Lookahead variants are lesson 07.
6 · Prefill/decode disaggregation — sketch
From lesson 01 we know prefill is compute-bound and decode is memory-bound. They're best served by different hardware: prefill wants high-FLOP / lower-HBM (think H100 SXM), decode wants high-bandwidth-per-dollar (or simply more cards). Disaggregation runs them on separate pools, with the prefill pool handing off the KV cache to the decode pool over fast interconnect. Done right, it eliminates the interference between phases inside one server and lets each pool be sized independently. Full treatment in lesson 08.
Where each lever sits
| Technique | What it pushes | Typical factor |
|---|---|---|
| APC | Prefill cost on shared prompts | up to 95% prefill skipped |
| fp8 KV | HBM bytes/step · cache capacity | 2× / 2× |
| int4 weights | HBM bytes/step on weight load | ~4× decode |
| TP (within node) | Fits a model that doesn't fit | N× memory headroom, ~linear throughput |
| CUDA graphs | Python overhead in decode loop | 10–50% step time, model-dependent |
| Spec decoding | Tokens/target step | 2–3× (workload-dependent) |