Optimization catalog

The remaining levers, derived from first principles. Each one targets one of the two consequences from lesson 01 — bytes per token, or bytes moved per step.

How to read this lesson

The previous four lessons have done the heavy lifting. The KV cache (01), paging it (02), reading it efficiently (03), and scheduling around it (04) are most of the throughput story. What's left is a catalog of additional levers — none of them on the scale of paged KV + continuous batching, but each one buys you a measurable factor. We cover the ones that don't have a dedicated lesson elsewhere:

Automatic Prefix Caching (APC) — share KV blocks across requests with common prefixes.
Quantization — fewer bytes per parameter, fewer bytes per KV slot.
Tensor parallelism — split each matmul across N GPUs.
CUDA graph capture — remove Python from the steady-state path.
Speculative decoding (teaser) — see it derived in full at lesson 07.

Forward references: speculative decoding (07), prefill/decode disaggregation (08), chunked prefill (10) each get their own lesson.

1 · Automatic Prefix Caching

Almost every serving workload has shared prefixes. A system prompt that's identical for every user. A few-shot example block reused across calls. A RAG context document that backs hundreds of queries. With a naive cache, every request prefills its prompt from scratch, producing bit-for-bit identical KV blocks that you immediately overwrite when the request finishes.

The block manager from lesson 02 gave us refcounted blocks and copy-on-write. APC asks: when we're about to allocate a new block for a request, can we check whether some other request already computed the same one and just point at it?

The hash chain

For two blocks to be interchangeable, they need to encode the same KV — which means they need to contain the same token IDs and have been produced after the same earlier sequence of tokens (because every K and V depends on the full context to the left through attention). So the cache key for block i can't just be the 16 token IDs in that block. It has to be the 16 tokens plus the identity of the chain that led to it.

The trick is a chained hash:

h_i = hash( tokens_i || h_i−1 )

where h₀ is a fixed sentinel for the empty prefix. Now h_i uniquely identifies a complete (token, position, predecessor) tuple. Two sequences whose third block happens to contain the same 16 tokens get the same h₃ only if their first two blocks also matched. This is exactly what correctness demands — and it costs one SHA per block boundary, amortized over hundreds of attention FLOPs.

The cache is a hash map: h → block_id, refcount. On allocation we check; if h is present we just bump its refcount (this is the same fork-on-share mechanism from lesson 02). If not, we allocate a fresh block, compute its KV during prefill, and insert it.

Why tail blocks are never cached

A request whose total prompt is 97 tokens produces 6 full blocks (96 tokens) + one tail block with 1 token in it. We never cache the tail. Its hash would change the moment the next token is generated and appended — every decode step would invalidate the entry. Caching only happens at full-block boundaries; the cost is at most BLOCK_SIZE − 1 redundantly-computed tail tokens per request, which is noise.

The numbers

Take a system prompt of 2000 tokens, a user query of 100 tokens, block size 16. The first request prefills:

⌈2000 / 16⌉ + ⌈100 / 16⌉ = 125 + 7 = 132 blocks

Of those, 125 are blocks of the (shared) system prompt — they hash deterministically given the prompt's token IDs. The next request, with the same system prompt and a different user query, will find those 125 blocks already in the cache and only need to compute the 7 user-query blocks itself. That's

125 / 132 ≈ 94.7%

of its prefill compute that never happens. The first request pays full freight; every subsequent request with the same system prompt pays for its query only. For a server fielding thousands of requests against one system prompt, the amortized prefill cost goes to the cost of the user query, which is usually a tiny fraction.

What can defeat APC

Any per-request variation that lands before the shared portion. Examples: prepending a user-specific timestamp, an A/B-test token, a session UUID. The very first token shifts every subsequent hash, and your hit rate goes to zero. Move per-request data to the end of the prompt.

Interactive · the hit grid

Configure a system prompt and user query length, pick how many requests share the prompt, and see the per-block hit grid. Each cell is one 16-token block. The first request misses every block (it's populating the cache). Subsequent requests hit on the system-prompt blocks and miss on their own unique query tail. Push "shared prompt length" up and watch the green region of the grid widen at the expense of the blue stripe on the right.

APC hit-rate explorer

Each cell is one 16-token block. Green = cache hit (already computed by a prior request). Blue = miss (this request computes it, populates the cache). The KPIs sum across all requests. Note: the "reuse rate" KPI is token-weighted (tokens saved / tokens total); the "cached hits = X / Y blocks" line at the bottom is block-count-weighted, so the two percentages can differ when tail blocks are present.

scenario:

system prompt tokens: 96 user query tokens: 32

requests: 2 block size: 16

total prefill tokens · no APC

—

total prefill tokens · with APC

—

reuse rate

—

compute saved

—

show the hash-chain logic this widget uses

// One hash entry per (block-index, parent_hash, token_ids_of_block).
parent_hash = ""
for each full block i in request:
    h = hash(tokens[i], parent_hash)
    if h in by_hash:  cache_hits   += 1     // fork existing block, ref++
    else:             cache_misses += 1     // allocate, compute, insert
                      by_hash[h]    = block_id
    parent_hash = h
// Tail block (less than BLOCK_SIZE tokens) is NEVER cached.

2 · Quantization

Decode is memory-bound (lesson 01, consequence B). The kernel's wall-clock is set by how many bytes leave HBM per step, not by how many FLOPs the SMs do. So if you halve the bytes per element, you roughly halve the decode latency. Two targets:

Weight-only quantization (AWQ, GPTQ)

Store W in int8 or int4. Dequantize on the fly inside the matmul kernel — the activations stay in fp16 / bf16. Total weight-load traffic per step drops by 2× (int8) or 4× (int4). The arithmetic happens at fp16 precision, so accuracy stays surprisingly close to baseline: typical perplexity hit is <1% for int4 on 7B+ models with a good calibration scheme.

The catch is in choosing the per-channel scale factors:

GPTQ: layer-wise second-order optimization. Run a calibration set through the layer, solve for scales that minimize the layer's output MSE. Slow to compute, runs once.
AWQ: identify the 1% of weight channels that handle the activation outliers, give those higher-resolution scales. Cheaper calibration, comparable quality.

KV cache quantization (fp8)

The KV cache is the other big consumer of HBM traffic per decode step (the kernel reads all of it per step — that's exactly why decode is memory-bound). Store K and V in fp8 (E4M3 or E5M2) instead of fp16 and the per-token cache is halved:

bytes_per_token_fp8 = ½ · bytes_per_token_fp16

Two downstream effects, both linear: (a) decode HBM traffic halves → decode is ~2× faster (in the bandwidth-bound regime), and (b) for a fixed HBM budget, the cache holds 2× as many tokens → 2× the concurrent sequences. The two combine.

Quality cost: small but nonzero. fp8 E4M3 has 3 mantissa bits, so each individual K/V element rounds to one of 256 values. Attention is a weighted average so the rounding noise mostly cancels, but very long contexts (32k+) start to show it. vLLM exposes --kv-cache-dtype fp8.

Why activation quantization is the hard one

Activations have outliers — a handful of channels carry magnitudes 10–100× the median. Naive int8 quantization with a single tensor-wide scale wastes 95% of the range on the outlier channels, leaving everyone else with 2–3 bits of effective resolution. Solutions exist (SmoothQuant migrates outlier scale into the weights; per-channel scaling) but the headline number is smaller than weight + KV quant. There's also a compounding problem: quantization error from layer ℓ becomes the input to layer ℓ+1's quantizer. Cascading 80 layers of small errors is what limits how aggressive you can get.

3 · Tensor parallelism (Megatron-style)

One GPU isn't big enough for a 70B model in fp16 (140 GB of weights, vs 80 GB of HBM). The two strategies are tensor parallelism (split each layer's matmul across N GPUs) and pipeline parallelism (give each GPU consecutive layers). vLLM is built around tensor parallelism.

The split, in one MLP block

A transformer MLP is y = W₂ · σ(W₁ · x), with W₁ ∈ ℝ^{d × 4d} and W₂ ∈ ℝ^{4d × d}. Split across N GPUs:

Column-split W₁: each GPU holds d × (4d/N). Each GPU computes z_k = σ(W_1,k · x) independently. No communication needed; activations x are replicated.
Row-split W₂: each GPU holds (4d/N) × d. GPU k computes its partial output y_k = W_2,k · z_k. The full y is Σ_k y_k — one all-reduce.

Attention is sharded by heads: each GPU gets h / N heads, computes its slice of the output independently, then one all-reduce to recombine into the residual stream. Net communication per transformer layer: 2 all-reduces (MLP + attention), each carrying batch · seq · d activations.

Why it scales within a node but not across

Per-layer all-reduce cost is ~ (batch · seq · d · bytes) / bandwidth. NVLink on H100 is ~900 GB/s; PCIe is ~64 GB/s; cross-node InfiniBand is ~50 GB/s. The same all-reduce that takes ~50 μs on NVLink takes ~700 μs on PCIe. With 80 layers per pass that's 50 ms of pure communication per token on PCIe — completely dominating the compute. So TP works inside a DGX-style node with NVLink; across nodes you switch to pipeline parallelism, which only sends activations once per pipeline stage rather than per layer.

4 · CUDA graph capture

A decode step launches a few hundred small CUDA kernels: attention, two matmuls per MLP, layer-norms, activation functions, residual adds, all per layer for L = 32–80 layers. Each kernel launch incurs ~10 μs of Python + driver overhead. At L = 80 layers and ~6 ops per layer that's:

80 × 6 × 10 μs = 4.8 ms of launch overhead

A 70B decode step on H100 takes ~5–10 ms of actual GPU work. So the Python overhead is genuinely 30–50% of step time. The fix is CUDA graphs: record the whole sequence of launches once (with placeholder inputs), then on every subsequent decode step submit the recorded graph with one API call. Python is out of the loop entirely.

The catch is that the graph shape is fixed at capture time — the batch dimension, the active set, the block-table strides are all baked in. vLLM handles this by capturing graphs at a small set of common batch sizes at startup (1, 2, 4, 8, 16, 32, …) and dispatching to the smallest one that fits the current step. If your active set is 6, you replay the captured-for-8 graph and waste two slots — still a big win vs. uncaptured.

5 · Speculative decoding — teaser

The idea, in two lines: a small "draft" model proposes K tokens. The big "target" model evaluates all K in one parallel forward pass (which costs ~the same as evaluating one token, because both are memory-bound and the K-token pass amortizes the KV read). A clever rejection-sampling rule decides which proposals to accept.

The unbiasedness claim

Speculative decoding is an exact sampler of the target model's distribution, not an approximation. The acceptance rule (Leviathan et al. 2023) — accept token x with probability min(1, p_target(x) / p_draft(x)) and, on first rejection, sample from the residual max(0, p_target − p_draft) renormalized — preserves the target's marginal distribution at every position.

The speedup is a + 1 tokens per target step, where a is the average number of accepted draft tokens (typically 2–4). The derivation, the failure modes, and the Medusa / EAGLE / Lookahead variants are lesson 07.

6 · Prefill/decode disaggregation — sketch

From lesson 01 we know prefill is compute-bound and decode is memory-bound. They're best served by different hardware: prefill wants high-FLOP / lower-HBM (think H100 SXM), decode wants high-bandwidth-per-dollar (or simply more cards). Disaggregation runs them on separate pools, with the prefill pool handing off the KV cache to the decode pool over fast interconnect. Done right, it eliminates the interference between phases inside one server and lets each pool be sized independently. Full treatment in lesson 08.

Where each lever sits

Technique	What it pushes	Typical factor
APC	Prefill cost on shared prompts	up to 95% prefill skipped
fp8 KV	HBM bytes/step · cache capacity	2× / 2×
int4 weights	HBM bytes/step on weight load	~4× decode
TP (within node)	Fits a model that doesn't fit	N× memory headroom, ~linear throughput
CUDA graphs	Python overhead in decode loop	10–50% step time, model-dependent
Spec decoding	Tokens/target step	2–3× (workload-dependent)

Takeaway

Every item in this catalog is one of the two consequences from lesson 01 wearing a different hat. APC, fp8 KV, weight quant, TP, CUDA graphs — each one reduces either bytes-per-token or bytes-moved-per-step, just at a different layer (data, dtype, sharding, dispatch). The first-principles question for any new "vLLM optimization" you encounter: which of those two does it move?