all_lessons / ml_system_design / 01 · what's different lesson 1 / 20

What makes ML systems design different

The classic distributed-systems playbook — stateless workers, scale out on cheap boxes, push state to a database — quietly fails on ML serving and training. Three facts break it: the unit of compute costs as much as a car, the bottleneck is memory bandwidth not CPU, and what looks like one workload is actually two with opposite profiles. Internalize these and the rest of the track is consequences.

First principle: the cost structure is inverted

A web service is built on the assumption that compute is cheap and roughly free to replicate. A request costs microseconds of a CPU that costs cents per hour; if you need more, you start more identical stateless workers behind a load balancer. The hard problems are state (databases, consistency) and fan-out (one request touching many services).

ML serving inverts every term. The unit of compute is one GPU — an H100 lists around $25,000–$40,000 and rents for roughly $2–$10 / hour. A single LLM request can occupy that GPU for hundreds of milliseconds. Utilization isn't a tidiness concern; at these prices, 10% idle is the difference between a profitable product and a burning one.

The consequence
In web design you optimize for developer velocity and accept some waste, because the hardware is cheap. In ML design you optimize for hardware utilization (MFU in training, goodput-per-GPU in serving), because the hardware is the entire cost. Almost every "weird" decision in this track — batching unrelated users together, paging KV like an OS, splitting prefill from decode — is a direct attempt to stop a $30k chip from idling.

First principle: the wall is bandwidth, not compute

The instinct from CS is that "fast" means "fewer FLOPs." On a modern GPU that instinct is usually wrong. Consider an H100: roughly 990 TFLOP/s of bf16 matrix compute, but only 3.35 TB/s of memory bandwidth. The ratio — call it the machine's arithmetic intensity — is about 295 FLOPs per byte. A kernel that doesn't do at least ~300 math operations on every byte it reads from memory leaves the compute units starving, waiting on memory.

LLM decoding is the canonical victim. Generating one token reads all the model's weights from memory to do a tiny amount of math per weight (batch size 1 ⇒ roughly one multiply-add per weight). That is an arithmetic intensity near 1 — two orders of magnitude below what the hardware wants. So single-stream decode runs at a small fraction of peak FLOP/s and its speed is set almost entirely by how fast you can stream the weights:

decode latency per token ≈ model_bytes / memory_bandwidth

For a 70B model in fp16 (140 GB) on an H100 (3.35 TB/s): 140e9 / 3.35e12 ≈ 42 ms / token as a hardware floor, regardless of how "fast" the GPU's math units are. This one equation explains why batching exists (amortize the weight read over many tokens), why quantization helps decode (fewer bytes to stream), and why KV-cache size matters so much (it's more bytes to move). We will use it constantly.

The trap it sets for designers
Teams routinely "optimize" by reducing FLOPs (a cheaper attention variant, fewer layers) and see no speedup, because they were never compute-bound. The first question in any ML performance design is not "how many FLOPs?" but "am I compute-bound or memory-bound right now?" — the roofline question of lesson 02. Get that wrong and every subsequent decision is aimed at the wrong wall.

First principle: it's two workloads wearing one trench coat

An LLM request has two phases with opposite hardware profiles, and treating them as one workload is the most common design error.

PREFILL — process the whole prompt at once All P prompt tokens in parallel → big matmuls high arithmetic intensity → COMPUTE-bound DECODE — emit one token at a time 1 token at a time → tiny matmuls, full weight read arithmetic intensity ≈ 1 → BANDWIDTH-bound One request = a short compute-bound burst, then a long bandwidth-bound tail. prefill (1 step) decode (hundreds of steps, one token each) They contend for the same GPU. A long prefill blocks everyone's decode (a latency spike); batching decodes starves if prefills hog the compute. Lessons 04–05 are largely about refereeing this fight. This split is why metrics come in pairs (TTFT for prefill, TPOT for decode) and why "disaggregated serving" — separate machines per phase — is even a thing.

This is unique to autoregressive generation. No web request has a "phase 1 that's CPU-bound and phase 2 that's I/O-bound on the same data, hundreds of times in a row." Almost every serving design decision — the metrics you pick (lesson 03), how you batch (04), whether you disaggregate (05) — descends from this single structural fact.

Putting it together: the design loop

Given those three facts, the loop from the orientation page is the rational response. Here it is again, now with the why attached:

StepWhat you doBecause of the principle…
1 · RequirementsPin SLOs, workload, scale, budget$/GPU-hour is so high that "build it and see" is unaffordable
2 · ArithmeticFLOPs, bytes, bandwidth, $The wall is usually bandwidth/memory — you must compute which
3 · TopologyParallelism, replication, layoutModels don't fit on one GPU; phases want different placement
4 · BottleneckFind the wall that binds firstOptimizing a non-binding wall buys nothing
5 · IterateRelax it, re-run, repeatRemoving one wall always exposes the next

Interactive · find the wall

Below: a single decode step of a dense model on one GPU. Drag the batch size. At batch 1 you are deep in bandwidth-bound territory (the weights are read once and barely used). As batch grows, the same weight-read serves more tokens, arithmetic intensity climbs, and eventually you cross the roofline ridge into compute-bound. The crossover point is the single most important number in serving design — it's the batch size where the GPU finally stops wasting its math units.

Memory-bound or compute-bound?

Increase batch size and watch arithmetic intensity climb toward the machine's ridge (~295 FLOP/byte for H100 bf16). Left of the ridge = bandwidth-bound (add batch, it's free throughput). Right = compute-bound (batching no longer helps; you need more FLOP/s or fewer of them).

arith. intensity
regime
decode latency / token
tokens / sec (all streams)

The thing to notice
Throughput keeps rising with batch size for free until you hit the ridge — you're serving more users at the same per-token latency, because you were wasting bandwidth-bound cycles anyway. That "free throughput up to the ridge" is the economic engine of LLM serving, and the reason continuous batching (lesson 04) is the first thing every serving stack implements.

What carries forward