Inference optimization as a decision discipline
By now you can size a replica (lesson 04) and a fleet (lesson 05) and read off which wall binds. This lesson is the catalogue of fixes — but it is not a checklist. Each technique is the answer to one named bottleneck from lesson 01's wall analysis, and applying it against the wrong wall ranges from useless to actively harmful. The skill is diagnosis first, optimization second.
The decision tree — diagnose the wall, then pick
Start from the symptom you measured, follow it to the wall, and the technique falls out. This is the whole lesson in one picture; the sections after it are just the arithmetic behind each branch.
1 · Quantization — fewer bytes per weight
Wall it targets: memory bandwidth + memory capacity. Lesson 01 said decode latency ≈ model_bytes / BW. Weights are 2N bytes at fp16 (lesson 02); store them at int8 and they become N, at int4 they become N/2. Decode streams fewer bytes per token, so per-token latency drops roughly proportionally:
t_token ≈ bytes_per_weight · N / BW
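A 70B at fp16 streams 140 GB → ~42 ms/token at one H100's bandwidth (3.35 TB/s). Treat that as a bandwidth-only figure, since 140 GB does not fit in a single H100's 80 GB of HBM. At int8 it streams 70 GB → ~21 ms/token, and the weights fit on one card. The second win is capacity: halving the weight term frees HBM for KV, raising the batch ceiling from lesson 04, which means more concurrent requests per GPU and lower $/token.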
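The arithmetic above can be sketched in a few lines. This is a rough lower bound only: it assumes batch 1, counts weight traffic and nothing else (no KV reads, no dequantization overhead), and the function and table names are illustrative.

```python
# Bandwidth-bound decode: every generated token streams each weight once.
BYTES_PER_WEIGHT = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def decode_ms_per_token(n_params: float, dtype: str, bw_bytes_per_s: float) -> float:
    """Lower-bound per-token decode latency in ms (weights only, batch 1)."""
    weight_bytes = n_params * BYTES_PER_WEIGHT[dtype]
    return weight_bytes / bw_bytes_per_s * 1e3


H100_BW = 3.35e12  # bytes/s, the bandwidth figure quoted in the text

for dtype in ("fp16", "int8", "int4"):
    ms = decode_ms_per_token(70e9, dtype, H100_BW)
    print(f"70B {dtype}: {ms:.1f} ms/token")
```

For the 70B example this gives about 41.8, 20.9 and 10.4 ms/token, matching the ~42 and ~21 ms figures above.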
KV-cache quantization is a separate lever. Store K and V at fp8 instead of fp16 and lesson 02's kv_bytes/token = 2·L·H_kv·d·dtype halves. That directly doubles the KV-limited batch ceiling from lesson 04 — independent of what you do to the weights. It targets the capacity wall specifically, and it composes with GQA (which already shrank H_kv) and paging.
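To see the doubling concretely, here is a minimal sketch of the kv_bytes/token formula and the resulting batch ceiling. The model shape (80 layers, 8 KV heads, head dimension 128), the 4096-token context and the 40 GB of free HBM are hypothetical values chosen for illustration, not figures from this lesson.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """kv_bytes/token = 2 * L * H_kv * d * dtype (the 2 covers K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def kv_batch_ceiling(free_hbm_bytes: int, ctx_len: int, bytes_per_token: int) -> int:
    """How many full-length sequences fit in the HBM left over for KV."""
    return free_hbm_bytes // (ctx_len * bytes_per_token)


# Hypothetical GQA model and budget (assumptions, not from the lesson).
L, H_KV, D = 80, 8, 128
FREE_HBM = 40 * 10**9
CTX = 4096

for name, dtype_bytes in (("fp16", 2), ("fp8", 1)):
    per_tok = kv_bytes_per_token(L, H_KV, D, dtype_bytes)
    print(name, per_tok, "bytes/token ->", kv_batch_ceiling(FREE_HBM, CTX, per_tok), "sequences")
```

Under these assumptions the ceiling goes from 29 to 59 sequences: the same HBM budget holds roughly twice the batch, regardless of the weight dtype.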
2 · Speculative decoding — spend spare compute to beat the bandwidth wall
Wall it targets: decode is bandwidth-bound at SMALL batch. When batch is small the GPU reads the full weight set to emit a single token and the SMs sit mostly idle (far left of the ridge). You have spare compute — speculative decoding spends it. A cheap draft model (or EAGLE / Medusa head) proposes K tokens; the target model verifies all K in one forward pass. Accepted tokens are free relative to the one weight-read you were paying for anyway.
With per-token acceptance probability α, the expected number of tokens accepted (committed) per target pass is the truncated geometric sum:
E[accepted] = 1 + α + α² + … + α^K = (1 − α^(K+1)) / (1 − α)
So at α=0.8, K=4: (1 − 0.8⁵)/(1 − 0.8) ≈ 3.36 tokens per target pass instead of 1, which caps the decode speedup at about 3.4× before the draft's own cost. If the draft costs a fraction c of the target per token, you pay roughly 1 + c·K target-equivalents per step, so net speedup ≈ E[accepted] / (1 + c·K).
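A small sketch of that break-even arithmetic follows. It assumes a constant, independent acceptance probability per token and a memory-bound verification pass that costs the same as a normal decode step; the draft cost c = 0.05 in the usage lines is a hypothetical value.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Truncated geometric sum 1 + a + ... + a^k = (1 - a^(k+1)) / (1 - a)."""
    if alpha >= 1.0:
        return float(k + 1)  # every draft token accepted
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def net_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """E[accepted] / (1 + c*K), with draft_cost = c as a fraction of the target."""
    return expected_tokens_per_pass(alpha, k) / (1.0 + draft_cost * k)


print(f"{expected_tokens_per_pass(0.8, 4):.2f} tokens per target pass")
print(f"{net_speedup(0.8, 4, 0.05):.2f}x net speedup at c = 0.05 (hypothetical)")
```

With those inputs the sum is about 3.36 tokens per pass and the net speedup about 2.8×. Sweeping k for a fixed alpha shows the diminishing return: each extra draft token adds less to the numerator than c adds to the denominator.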
3 · Prefix caching / RadixAttention — delete redundant prefill
Wall it targets: wasted compute on repeated prefill. Lesson 03 told you to characterize the workload's prefix-sharing fraction f — how much of each request's prompt is byte-identical to others (system prompts, few-shot blocks, agent history, RAG context, many-sample decoding). Prefill recomputes the KV for those shared tokens every time. Prefix caching keeps the KV of seen prefixes resident and reuses it; RadixAttention organizes all cached prefixes in a radix tree so any new request matches the longest stored prefix automatically.
The leverage is entirely the workload's f. At f≈5% (mostly-unique chat prompts) it's a rounding error. At f≈80% (agents replaying one system prompt, RAG over a shared corpus, tree-of-thought) it is a multi-× win — which is the entire premise of the SGLang track. Measure f before you celebrate; mechanism in SGLang 04.
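The longest-prefix match and the dependence on f can be illustrated with a toy. The sketch below uses an uncompressed token-level trie, not a true radix tree and not SGLang's implementation, and it stores no KV tensors; `PrefixCache` and `prefill_speedup` are illustrative names. The speedup bound assumes prefill cost is linear in uncached tokens and every shared token hits the cache.

```python
class PrefixCache:
    """Toy token-level trie; a radix tree would compress single-child chains."""

    def __init__(self) -> None:
        self.root: dict = {}

    def insert(self, tokens: list[int]) -> None:
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def match_len(self, tokens: list[int]) -> int:
        """Length of the longest stored prefix of `tokens` (reusable KV)."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node = node[t]
            n += 1
        return n


def prefill_speedup(f: float) -> float:
    """Upper bound on prefill speedup when a fraction f of prompt tokens is cached."""
    return 1.0 / (1.0 - f)


cache = PrefixCache()
cache.insert([1, 2, 3, 4])                 # e.g. a shared system prompt
print(cache.match_len([1, 2, 3, 9, 9]))    # 3 tokens of prefill can be skipped
print(f"{prefill_speedup(0.05):.2f}x at f=5%, {prefill_speedup(0.80):.2f}x at f=80%")
```

Under these assumptions f = 5% buys about 1.05× and f = 80% about 5×, which is the gap between a rounding error and a multi-× win.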
4 · Chunked prefill — stop long prompts from spiking TPOT
Wall it targets: TTFT/TPOT contention on a shared replica (lesson 04). A monolithic 4K-token prefill is one long compute-bound burst that freezes every in-flight decode until it finishes — the TPOT spike lesson 04 named. Chunked prefill splits that prompt into chunks (say eight 512-token slices) and interleaves each chunk with the ongoing decode batch in the same step.
The trade is quantified and lopsided: the chunked prompt's own TTFT rises a little (it now shares steps), but every other request's TPOT tail stops spiking. You exchange a tiny prefill-latency increase for a large TPOT-tail improvement and higher goodput (lesson 03's metric). Worth it whenever prompts are long and you have a TPOT SLO; mechanism in vLLM 10.
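A rough way to see the asymmetry is to count how long an in-flight decode can be held up by the prefill work sharing its step. The sketch assumes prefill cost is linear in tokens and ignores the decode batch's own cost; the 0.05 ms per prefill token is a hypothetical figure, and the function names are illustrative rather than any scheduler's API.

```python
def split_prefill(prompt_len: int, chunk_size: int) -> list[int]:
    """Token counts of the prefill slices, one slice per scheduler step."""
    return [min(chunk_size, prompt_len - start) for start in range(0, prompt_len, chunk_size)]


def worst_decode_stall_ms(prompt_len: int, chunk_size: int, ms_per_prefill_token: float) -> float:
    """Longest wait an in-flight decode sees: the largest slice sharing its step."""
    return max(split_prefill(prompt_len, chunk_size)) * ms_per_prefill_token


MS_PER_TOKEN = 0.05  # hypothetical prefill cost per token

print(split_prefill(4096, 512))                        # eight 512-token slices
print(worst_decode_stall_ms(4096, 4096, MS_PER_TOKEN))  # monolithic prefill
print(worst_decode_stall_ms(4096, 512, MS_PER_TOKEN))   # chunked prefill
```

With these numbers the worst stall drops from roughly 205 ms to roughly 26 ms, while the long prompt itself now needs eight steps instead of one to finish its prefill.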
5 · Mixture-of-Experts — trade compute for capacity
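Wall it targets: compute. A dense model spends 2N FLOPs/token over all N params. MoE routes each token to k of E experts, so only the active params do work: cost drops to 2N_active FLOPs/token (e.g. 8 experts, top-2 → ~¼ the FLOPs of a same-total-size dense model, counting expert weights only). When you're compute-bound (large-batch prefill, long context) that is a direct win. The price is capacity and network: all E experts must stay resident in HBM, and expert-parallel serving adds all-to-all traffic, so MoE makes things worse when either of those walls binds.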
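The compute-versus-capacity split can be made explicit with a short sketch. It assumes, for simplicity, that all parameters live in the experts (real models also carry attention and shared weights that every token uses), and the 100B total size is a hypothetical figure.

```python
def moe_cost(n_total: float, n_experts: int, top_k: int, bytes_per_weight: float = 2.0) -> dict:
    """Per-token FLOPs scale with active params; resident bytes with total params."""
    n_active = n_total * top_k / n_experts  # assumption: all params are expert params
    return {
        "flops_per_token": 2.0 * n_active,
        "flops_vs_dense": n_active / n_total,
        "resident_bytes": n_total * bytes_per_weight,  # every expert stays in HBM
    }


cost = moe_cost(n_total=100e9, n_experts=8, top_k=2)  # hypothetical 100B MoE at fp16
print(cost["flops_vs_dense"])         # 0.25 of the dense FLOPs
print(cost["resident_bytes"] / 1e9)   # still 200 GB of weights to hold
```

The FLOPs fall to a quarter while the resident weight bytes do not move, which is the trade the section title names.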
6 · The attention O(seq²) correction — why long context is its own regime
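Lesson 02's 2N rule deliberately ignored attention's O(seq²) term. That's fine at short context, but it has a crossover. Over a prefill of seq tokens, the MLP/projection FLOPs are 2N·seq (∝ seq·d_model², since N ∝ L·d_model²); the attention score+value FLOPs scale with seq²·d_model. The attention term rivals the MLP/projection term roughly when:
seq²·d_model ~ seq·d_model²  ⇒  seq ~ c·d_model, with c on the order of 10 for a standard transformer block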
Below the crossover, prefill cost is the familiar 2N·seq and 2N dominates. Above it (contexts of tens of thousands of tokens, on models with d_model in the thousands) the seq² term takes over: prefill FLOPs grow quadratically, prefill becomes firmly compute-bound, and TTFT scales super-linearly with prompt length. This is why long context is a distinct cost regime — your 2N napkin math under-counts it — and why long-context serving leans on the compute-wall tools (MoE, the seq²-aware kernels) plus the capacity tools (KV quant, paging) at once.
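A per-layer FLOP count makes the crossover concrete. The sketch assumes a standard block with about 12·d_model² parameters (attention projections plus a 4× MLP), counts 2·seq²·d_model FLOPs each for the score and value matmuls, and halves the attention work for causal masking. GQA, gated MLPs and other variants change the constant, so the numbers are indicative only.

```python
def prefill_flops_per_layer(seq: float, d_model: int, coeff: int = 12, causal: bool = True):
    """Return (linear, attention) FLOPs for one layer over a seq-token prefill."""
    linear = 2.0 * coeff * d_model**2 * seq          # 2 FLOPs per param per token
    attn = 4.0 * seq**2 * d_model                    # QK^T plus (scores @ V)
    if causal:
        attn *= 0.5                                  # only the lower triangle is computed
    return linear, attn


def crossover_seq(d_model: int, coeff: int = 12, causal: bool = True) -> float:
    """Sequence length at which attention FLOPs equal linear-layer FLOPs."""
    return coeff * d_model / (1.0 if causal else 2.0)


for d in (4096, 8192):
    s = crossover_seq(d)
    lin, att = prefill_flops_per_layer(s, d)
    print(f"d_model={d}: crossover near {s:.0f} tokens (attn/linear = {att / lin:.2f})")
```

Under these assumptions the crossover sits near 12·d_model tokens, about 49K for d_model = 4096, in line with the "tens of thousands of tokens" regime described above.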
The technique × wall table
| Technique | Wall it targets | When it pays | When it's a trap |
|---|---|---|---|
| Weight quantization (int8/fp8/int4) | bandwidth + capacity | bandwidth-bound decode; tight HBM | compute-bound prefill (dequant overhead); int4 accuracy |
| KV-cache quantization (fp8 KV) | memory capacity | KV-limited batch ceiling | capacity already roomy; accuracy on long ctx |
| Speculative decoding | bandwidth (small batch) | latency-critical, small batch, spare compute | large batch (compute-bound) — benefit collapses |
| Prefix caching / RadixAttention | redundant prefill compute | high prefix-share f (agents, RAG, samples) | low f — pure overhead for a rounding error |
| Chunked prefill | TTFT/TPOT contention | long prompts + a TPOT SLO | uniformly short prompts (no contention to fix) |
| MoE | compute | compute-bound, FLOPs are the limiter | memory-capacity or network wall (all experts resident; all-to-all) |
Interactive · Speculative decoding break-even
Tune the acceptance rate, speculation length, draft cost, and — crucially — the batch size. Watch the net speedup as the batch climbs toward the ridge: the same draft/target setup that's a big win at batch 1 barely helps once the GPU is compute-bound. This is the "right wall" thesis made into a slider.
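The same break-even can be approximated offline with a two-term roofline. This is a sketch under stated assumptions, not the widget's actual model: each pass costs max(weight-read time, 2N FLOPs per token in the pass), KV traffic is ignored, the draft costs c·K baseline steps, and the hardware figures (3.35 TB/s, an assumed 989 TFLOP/s dense peak) and the 70B fp16 model are illustrative defaults.

```python
def pass_time(tokens: float, n_params: float, weight_bytes: float, bw: float, flops: float) -> float:
    """Roofline time for one forward pass over `tokens` tokens."""
    return max(weight_bytes / bw, 2.0 * n_params * tokens / flops)


def spec_speedup(alpha: float, k: int, c: float, batch: int,
                 n_params: float = 70e9, bytes_per_weight: float = 2.0,
                 bw: float = 3.35e12, flops: float = 989e12) -> float:
    """Net speculative-decoding speedup at a given batch size (assumed model)."""
    wb = n_params * bytes_per_weight
    base = pass_time(batch, n_params, wb, bw, flops)               # normal decode step
    verify = pass_time(batch * (k + 1), n_params, wb, bw, flops)   # K drafts + 1 per sequence
    draft = c * k * base                                           # draft cost in baseline steps
    expected = (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
    return expected * base / (verify + draft)


for b in (1, 16, 64, 256, 512):
    print(f"batch {b:>3}: {spec_speedup(0.8, 4, 0.05, b):.2f}x")
```

With α = 0.8, K = 4 and c = 0.05, this model gives about 2.8× at batch 1 and drops below 1× once the baseline step is itself compute-bound, because verifying K + 1 tokens per sequence then costs close to K + 1 full steps.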
What carries forward
- Diagnose, then optimize. Every technique is the answer to one named wall (memory-capacity, bandwidth, network, compute, $). Name your wall first — an optimization that doesn't relax the binding constraint does nothing or hurts.
- Bandwidth wall, small batch: weight quantization and speculative decoding — but spec decoding's win evaporates as batch approaches the ridge.
- Capacity wall (KV too big): KV quantization, GQA, paging — they raise the lesson-04 batch ceiling.
- Redundant prefill: prefix caching, with win ∝ the workload's prefix-share f (lesson 03). Contention: chunked prefill trades a tiny TTFT increase for big TPOT-tail relief.
- Compute wall: MoE cuts FLOPs per token at the price of memory capacity and all-to-all traffic; and remember the seq² term makes long context its own compute regime.
- Re-run the loop after every change. Relaxing one wall moves the bottleneck to the next — the binding constraint shifts, so the right next optimization changes too.