all_lessons / ml_system_design / 06 · optimization lesson 6 / 20

Inference optimization as a decision discipline

By now you can size a replica (lesson 04) and a fleet (lesson 05) and read off which wall binds. This lesson is the catalogue of fixes — but it is not a checklist. Each technique is the answer to one named bottleneck from lesson 01's wall analysis, and applying it against the wrong wall ranges from useless to actively harmful. The skill is diagnosis first, optimization second.

The thesis — optimize the binding wall, nothing else
An optimization speeds up the system only if it relaxes the constraint that is currently limiting it. Lesson 01 gave you the five walls: memory-capacity, memory-bandwidth, network, compute, and $. Every technique below is introduced the same way: binds when wall X is closest; useless or harmful otherwise. Before you reach for any of them, re-run the loop and name your wall. If you can't name it, you can't optimize it.

The decision tree — diagnose the wall, then pick

Start from the symptom you measured, follow it to the wall, and the technique falls out. This is the whole lesson in one picture; the sections after it are just the arithmetic behind each branch.

which wall is closest? (lesson 01) decode bandwidth-bound, small batch → spec decoding · quantization memory-capacity (KV too big) → quantize KV · GQA · paging (raises the batch ceiling) redundant prefill (shared prefix) → prefix caching / RadixAttention TTFT vs TPOT contention (long prompts stall decodes) → chunked prefill compute-bound (large batch / long ctx) → MoE (fewer active params) · mind the seq² attention term no clear wall? you're at the SLO knee — replicate (lesson 05), don't micro-optimize

1 · Quantization — fewer bytes per weight

Wall it targets: memory bandwidth + memory capacity. Lesson 01 said decode latency ≈ model_bytes / BW. Weights are 2N bytes at fp16 (lesson 02); store them at int8 and they become N, at int4 they become N/2. Decode streams fewer bytes per token, so per-token latency drops roughly proportionally:

decode_latency(int8) ≈ (N bytes) / BW ≈ ½ · decode_latency(fp16)

A 70B at fp16 streams 140 GB → ~42 ms/token on one H100 (3.35 TB/s); at int8 it streams 70 GB → ~21 ms/token. The second win is capacity: halving the weight term frees HBM for KV, raising the batch ceiling from lesson 04 — more concurrent requests per GPU, lower $/token.

Trap: useless or harmful when you're already compute-bound
At large-batch prefill you're right of the roofline ridge (~295 FLOP/byte) — compute-bound, not bandwidth-bound. There, weight bytes aren't the limiter, so int8 weights buy you nothing on latency, and the per-matmul dequant step (int8 → fp16 before the FLOPs) is added work that can make it slower. Quantization is a bandwidth/capacity lever; spending it on a compute wall is a loss. And int4 in particular degrades accuracy enough that you must gate it behind an eval harness (lesson 10) — a faster wrong answer is not an optimization.

KV-cache quantization is a separate lever. Store K and V at fp8 instead of fp16 and lesson 02's kv_bytes/token = 2·L·H_kv·d·dtype halves. That directly doubles the KV-limited batch ceiling from lesson 04 — independent of what you do to the weights. It targets the capacity wall specifically, and it composes with GQA (which already shrank H_kv) and paging.

2 · Speculative decoding — spend spare compute to beat the bandwidth wall

Wall it targets: decode is bandwidth-bound at SMALL batch. When batch is small the GPU reads the full weight set to emit a single token and the SMs sit mostly idle (far left of the ridge). You have spare compute — speculative decoding spends it. A cheap draft model (or EAGLE / Medusa head) proposes K tokens; the target model verifies all K in one forward pass. Accepted tokens are free relative to the one weight-read you were paying for anyway.

With per-token acceptance probability α, the expected number of tokens accepted (committed) per target pass is the truncated geometric sum:

E[accepted] = (1 − α^(K+1)) / (1 − α)

So at α=0.8, K=4: (1 − 0.8⁵)/(1 − 0.8) ≈ 3.3 tokens per target pass instead of 1 — a ~3× decode speedup, minus the draft's own cost. If the draft costs a fraction c of the target per token, you pay roughly 1 + c·K target-equivalents per step, so net speedup ≈ E[accepted] / (1 + c·K).

Critical trap: the benefit collapses as batch grows
Verification is "free" only because you had idle compute. Push batch toward the ridge (~295 tokens-in-flight, lesson 01) and the GPU becomes compute-bound — now those extra K verification FLOPs are not free, they compete with the real batch. Speculative decoding's win shrinks toward zero (and can go negative) as batch rises. It is a small-batch / latency-critical optimization, the opposite regime from throughput-maximizing large batches. Mechanism and acceptance-rate tuning: vLLM 07 and SGLang 10.

3 · Prefix caching / RadixAttention — delete redundant prefill

Wall it targets: wasted compute on repeated prefill. Lesson 03 told you to characterize the workload's prefix-sharing fraction f — how much of each request's prompt is byte-identical to others (system prompts, few-shot blocks, agent history, RAG context, many-sample decoding). Prefill recomputes the KV for those shared tokens every time. Prefix caching keeps the KV of seen prefixes resident and reuses it; RadixAttention organizes all cached prefixes in a radix tree so any new request matches the longest stored prefix automatically.

prefill_saved ≈ f · prefill_cost   |   win scales with f, not with the technique

The leverage is entirely the workload's f. At f≈5% (mostly-unique chat prompts) it's a rounding error. At f≈80% (agents replaying one system prompt, RAG over a shared corpus, tree-of-thought) it is a multi-× win — which is the entire premise of the SGLang track. Measure f before you celebrate; mechanism in SGLang 04.

4 · Chunked prefill — stop long prompts from spiking TPOT

Wall it targets: TTFT/TPOT contention on a shared replica (lesson 04). A monolithic 4K-token prefill is one long compute-bound burst that freezes every in-flight decode until it finishes — the TPOT spike lesson 04 named. Chunked prefill splits that prompt into chunks (say eight 512-token slices) and interleaves each chunk with the ongoing decode batch in the same step.

monolithic prefill: decodes frozen for the whole burst [######## 4K prefill ########] then decode decode decode ↑ TPOT tail spikes here chunked prefill: decode never stalls more than one slice [512][dec][512][dec][512][dec] … TPOT stays flat, TTFT +tiny

The trade is quantified and lopsided: the chunked prompt's own TTFT rises a little (it now shares steps), but every other request's TPOT tail stops spiking. You exchange a tiny prefill-latency increase for a large TPOT-tail improvement and higher goodput (lesson 03's metric). Worth it whenever prompts are long and you have a TPOT SLO; mechanism in vLLM 10.

5 · Mixture-of-Experts — trade compute for capacity

Wall it targets: compute. A dense model spends 2N FLOPs/token over all N params. MoE routes each token to k of E experts, so only the active params do work: cost drops to 2N_active FLOPs/token (e.g. 8 experts, top-2 → ~¼ the FLOPs of a same-total-size dense model). When you're compute-bound — large-batch prefill, long context — that is a direct win.

Trap: it moves the cost onto the capacity and network walls
All E experts must sit in HBM simultaneously — the weight term is unchanged or worse, so MoE buys cheaper FLOPs at the price of memory capacity. Worse, routing is data-dependent: token-to-expert assignment is uneven (load imbalance idles some experts while others queue), and expert-parallel layouts add an all-to-all exchange every MoE layer — a network cost that, like TP's all-reduce (lesson 05), wants to stay intra-node. So MoE is great on a compute wall and costly on a memory or network wall. Mechanics in system_ml 09; the modeling trade-offs in DL 12.

6 · The attention O(seq²) correction — why long context is its own regime

Lesson 02's 2N rule deliberately ignored attention's O(seq²) term. That's fine at short context, but it has a crossover. The MLP/projection FLOPs scale with 2N (∝ d_model); the attention score+value FLOPs scale with seq²·d_model. The attention term rivals the per-token MLP term roughly when:

attention FLOPs ∝ seq²·d_model   rival   MLP FLOPs ∝ seq·(≈12·d_model²)   ⇒   crossover at seq ≈ 12·d_model (≈8k d_model for a 70B → crossover near ~16–24k tokens)

Below the crossover, prefill cost is the familiar 2N·seq and 2N dominates. Above it (contexts of tens of thousands of tokens, on models with d_model in the thousands) the seq² term takes over: prefill FLOPs grow quadratically, prefill becomes firmly compute-bound, and TTFT scales super-linearly with prompt length. This is why long context is a distinct cost regime — your 2N napkin math under-counts it — and why long-context serving leans on the compute-wall tools (MoE, the seq²-aware kernels) plus the capacity tools (KV quant, paging) at once.

The technique × wall table

TechniqueWall it targetsWhen it paysWhen it's a trap
Weight quantization (int8/fp8/int4)bandwidth + capacitybandwidth-bound decode; tight HBMcompute-bound prefill (dequant overhead); int4 accuracy
KV-cache quantization (fp8 KV)memory capacityKV-limited batch ceilingcapacity already roomy; accuracy on long ctx
Speculative decodingbandwidth (small batch)latency-critical, small batch, spare computelarge batch (compute-bound) — benefit collapses
Prefix caching / RadixAttentionredundant prefill computehigh prefix-share f (agents, RAG, samples)low f — pure overhead for a rounding error
Chunked prefillTTFT/TPOT contentionlong prompts + a TPOT SLOuniformly short prompts (no contention to fix)
MoEcomputecompute-bound, FLOPs are the limitermemory-capacity or network wall (all experts resident; all-to-all)

Interactive · Speculative decoding break-even

Tune the acceptance rate, speculation length, draft cost, and — crucially — the batch size. Watch the net speedup as the batch climbs toward the ridge: the same draft/target setup that's a big win at batch 1 barely helps once the GPU is compute-bound. This is the "right wall" thesis made into a slider.

Speculative decoding break-even

Assumptions: per-token acceptance is i.i.d. with rate α; expected committed tokens per target pass = (1−α^(K+1))/(1−α); draft adds ~c·K target-equivalents of work, so raw speedup = E[accepted]/(1+c·K). The batch damp models "spare compute vanishes near the ridge": net = raw damped by max(0, 1 − batch/ridge), ridge ≈ 295 tokens-in-flight (lesson 01/02). Order-of-magnitude — real engines also pay scheduling overhead.

exp. accepted / pass
raw speedup
batch-damped net
verdict

What carries forward