What makes ML systems design different
The classic distributed-systems playbook — stateless workers, scale out on cheap boxes, push state to a database — quietly fails on ML serving and training. Three facts break it: the unit of compute costs as much as a car, the bottleneck is memory bandwidth not CPU, and what looks like one workload is actually two with opposite profiles. Internalize these and the rest of the track is consequences.
First principle: the cost structure is inverted
A web service is built on the assumption that compute is cheap and roughly free to replicate. A request costs microseconds of a CPU that costs cents per hour; if you need more, you start more identical stateless workers behind a load balancer. The hard problems are state (databases, consistency) and fan-out (one request touching many services).
ML serving inverts every term. The unit of compute is one GPU — an H100 lists around $25,000–$40,000 and rents for roughly $2–$10 / hour. A single LLM request can occupy that GPU for hundreds of milliseconds. Utilization isn't a tidiness concern; at these prices, 10% idle is the difference between a profitable product and a burning one.
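To put rough numbers on that, here is a minimal sketch of per-request GPU cost. The function name, the $4/hour rate (picked from inside the rental range above), the 300 ms of GPU time, and the one-request-at-a-time simplification are illustrative assumptions, not measurements.

```python
def cost_per_request(gpu_dollars_per_hour: float,
                     gpu_seconds_per_request: float,
                     utilization: float = 1.0) -> float:
    """Dollars of GPU rental attributable to one request.

    Assumes (hypothetically) that requests run one at a time and that the
    GPU does useful work only `utilization` of the wall-clock time, so the
    idle time is paid for by the requests that do run.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    dollars_per_second = gpu_dollars_per_hour / 3600.0
    return dollars_per_second * gpu_seconds_per_request / utilization


if __name__ == "__main__":
    # Illustrative inputs: $4/hour rental, 0.3 s of GPU time per request.
    for util in (1.0, 0.9, 0.5):
        cost = cost_per_request(4.0, 0.3, util)
        print(f"utilization {util:.0%}: ${cost * 1000:.3f} per 1,000 requests")
```

Under these assumptions cost scales as 1/utilization, so halving utilization doubles the cost of every request served.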
First principle: the wall is bandwidth, not compute
The instinct from CS is that "fast" means "fewer FLOPs." On a modern GPU that instinct is usually wrong. Consider an H100: roughly 990 TFLOP/s of bf16 matrix compute, but only 3.35 TB/s of memory bandwidth. The ratio of the two is the machine balance: the arithmetic intensity a kernel needs before the chip shifts from bandwidth-bound to compute-bound. Here it is about 295 FLOPs per byte. A kernel that does fewer than ~300 math operations on every byte it reads from memory leaves the compute units starved, waiting on memory.
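The ratio above, and what it implies for a kernel of a given intensity, can be sketched with the standard roofline bound. The constants are the H100 figures quoted in the text; the function names are hypothetical helpers for illustration.

```python
H100_PEAK_FLOPS = 990e12  # bf16 matrix FLOP/s, figure quoted in the text
H100_MEM_BW = 3.35e12     # bytes/s of memory bandwidth, figure quoted in the text


def machine_balance(peak_flops: float, mem_bw: float) -> float:
    """FLOPs the chip can perform per byte it can read (the roofline ridge)."""
    return peak_flops / mem_bw


def attainable_flops(intensity: float, peak_flops: float, mem_bw: float) -> float:
    """Roofline bound: the lower of the compute roof and intensity * bandwidth."""
    return min(peak_flops, intensity * mem_bw)


if __name__ == "__main__":
    ridge = machine_balance(H100_PEAK_FLOPS, H100_MEM_BW)
    print(f"ridge point: {ridge:.1f} FLOPs/byte")
    for intensity in (1, 10, 100, 1000):
        bound = attainable_flops(intensity, H100_PEAK_FLOPS, H100_MEM_BW)
        print(f"intensity {intensity:>4} FLOPs/byte -> "
              f"at most {bound / H100_PEAK_FLOPS:.1%} of peak")
```

This is an upper bound only: real kernels usually land below the roofline because of launch overhead, imperfect overlap, and cache effects.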
LLM decoding is the canonical victim. Generating one token reads all the model's weights from memory to do a tiny amount of math per weight (batch size 1 ⇒ roughly one multiply-add per weight). With 2-byte fp16 weights that is about 2 FLOPs per 2 bytes, an arithmetic intensity near 1: two orders of magnitude below what the hardware wants. So single-stream decode runs at a small fraction of peak FLOP/s and its speed is set almost entirely by how fast you can stream the weights:

$$t_{\text{token}} \approx \frac{\text{model size in bytes}}{\text{memory bandwidth in bytes/s}}$$
For a 70B model in fp16 (140 GB) on an H100 (3.35 TB/s): 140e9 / 3.35e12 ≈ 42 ms / token as a hardware floor, regardless of how "fast" the GPU's math units are. (140 GB exceeds a single H100's 80 GB of memory, so read this as the floor at one H100's worth of bandwidth.) This one equation explains why batching exists (amortize the weight read over many tokens), why quantization helps decode (fewer bytes to stream), and why KV-cache size matters so much (it's more bytes to move). We will use it constantly.
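The floor is easy to recompute for other precisions and batch sizes. A minimal sketch, assuming weight streaming is the only cost: KV-cache reads, activations, and kernel overheads are ignored, and the helper names are hypothetical.

```python
def decode_floor_ms(n_params: float, bytes_per_param: float, mem_bw: float) -> float:
    """Lower bound on ms per decode step: time to stream every weight once.

    Ignores KV-cache reads, activations, and kernel overheads, all of which
    only push the real number higher.
    """
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / mem_bw * 1e3


def amortized_ms_per_token(floor_ms: float, batch_size: int) -> float:
    """Per-token share of one weight read that serves `batch_size` sequences.

    Only meaningful while the step stays bandwidth-bound (assumed here).
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return floor_ms / batch_size


if __name__ == "__main__":
    mem_bw = 3.35e12  # bytes/s, the H100 figure from the text
    for label, bytes_per_param in (("fp16", 2.0), ("int8", 1.0), ("4-bit", 0.5)):
        floor = decode_floor_ms(70e9, bytes_per_param, mem_bw)
        print(f"70B {label}: {floor:.1f} ms/step, "
              f"{amortized_ms_per_token(floor, 8):.1f} ms/token at batch 8")
```

Halving the bytes per weight halves the floor, and a batch of B sequences shares one weight read B ways, which is the arithmetic behind both quantization and batching.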
First principle: it's two workloads wearing one trench coat
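An LLM request has two phases with opposite hardware profiles: prefill, which processes the whole prompt in one parallel pass and is typically compute-bound, and decode, which generates output one token at a time and is typically bandwidth-bound. Treating them as one workload is the most common design error.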
This is unique to autoregressive generation. No web request has a "phase 1 that's CPU-bound and phase 2 that's I/O-bound on the same data, hundreds of times in a row." Almost every serving design decision — the metrics you pick (lesson 03), how you batch (04), whether you disaggregate (05) — descends from this single structural fact.
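A rough latency model makes the split concrete for the two phases of a request: prefill (one pass over the prompt) and decode (one pass per output token). This sketch assumes about 2 FLOPs per parameter per token, one full weight read per forward pass, perfect overlap of compute and memory traffic, and no attention or KV-cache cost. The function names, the 1,000-token prompt, and the 200-token output are illustrative.

```python
def pass_time_s(n_params: float, tokens: int, bytes_per_param: float,
                peak_flops: float, mem_bw: float) -> tuple[float, str]:
    """Time for one forward pass over `tokens` tokens, and which wall binds."""
    compute_s = 2.0 * n_params * tokens / peak_flops  # ~2 FLOPs per weight per token
    memory_s = n_params * bytes_per_param / mem_bw    # weights streamed once per pass
    bound = "compute" if compute_s >= memory_s else "bandwidth"
    return max(compute_s, memory_s), bound


def request_profile(n_params: float, prompt_tokens: int, output_tokens: int,
                    bytes_per_param: float, peak_flops: float, mem_bw: float) -> dict:
    """Split one request into a prefill pass and `output_tokens` decode passes."""
    prefill_s, prefill_bound = pass_time_s(
        n_params, prompt_tokens, bytes_per_param, peak_flops, mem_bw)
    step_s, decode_bound = pass_time_s(
        n_params, 1, bytes_per_param, peak_flops, mem_bw)
    return {
        "prefill_s": prefill_s,
        "prefill_bound": prefill_bound,
        "decode_s": step_s * output_tokens,
        "decode_bound": decode_bound,
    }


if __name__ == "__main__":
    # 70B fp16 model with the H100 figures quoted earlier in the text.
    profile = request_profile(70e9, 1000, 200, 2.0, 990e12, 3.35e12)
    for key, value in profile.items():
        print(key, value)
```

Under these assumptions the prompt is processed in a single compute-bound pass, while each output token pays for a full bandwidth-bound weight read, so the two phases respond to different optimizations.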
Putting it together: the design loop
Given those three facts, the loop from the orientation page is the rational response. Here it is again, now with the why attached:
| Step | What you do | Because of the principle… |
|---|---|---|
| 1 · Requirements | Pin SLOs, workload, scale, budget | $/GPU-hour is so high that "build it and see" is unaffordable |
| 2 · Arithmetic | FLOPs, bytes, bandwidth, $ | The wall is usually bandwidth/memory — you must compute which |
| 3 · Topology | Parallelism, replication, layout | Models don't fit on one GPU; phases want different placement |
| 4 · Bottleneck | Find the wall that binds first | Optimizing a non-binding wall buys nothing |
| 5 · Iterate | Relax it, re-run, repeat | Removing one wall always exposes the next |
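As one concrete instance of steps 2 and 3, the capacity check behind "models don't fit on one GPU" is a few lines of arithmetic. This sketch counts weights only. The 80 GB per-GPU memory figure is an assumption not stated in the text, `usable_fraction` is a placeholder reserve for KV-cache and activations, and the function name is hypothetical.

```python
import math


def min_gpus_for_weights(n_params: float, bytes_per_param: float,
                         gpu_mem_bytes: float, usable_fraction: float = 0.9) -> int:
    """Smallest GPU count whose combined usable memory holds the weights.

    `usable_fraction` is a placeholder for memory reserved for KV-cache and
    activations; a real design would size that reserve from the workload.
    """
    if not 0.0 < usable_fraction <= 1.0:
        raise ValueError("usable_fraction must be in (0, 1]")
    weight_bytes = n_params * bytes_per_param
    usable_bytes = gpu_mem_bytes * usable_fraction
    return math.ceil(weight_bytes / usable_bytes)


if __name__ == "__main__":
    gpu_mem = 80e9  # assumed per-GPU memory in bytes
    for label, bytes_per_param in (("fp16", 2.0), ("int8", 1.0)):
        n = min_gpus_for_weights(70e9, bytes_per_param, gpu_mem)
        print(f"70B {label}: at least {n} GPU(s) for weights alone")
```

The result is a lower bound on the topology, not a recommendation: bandwidth, network, and latency targets can each push the GPU count higher.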
Find the wall
Consider a single decode step of a dense model on one GPU as the batch size varies. At batch 1 you are deep in bandwidth-bound territory (the weights are read once and barely used). As batch grows, the same weight read serves more tokens, arithmetic intensity climbs, and eventually you cross the roofline ridge into compute-bound territory. The crossover point is the single most important number in serving design: it's the batch size where the GPU finally stops wasting its math units.
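The crossover can be estimated in closed form. A minimal sketch, assuming about 2 FLOPs per parameter per sequence in the batch, one weight read per step, and no KV-cache traffic (which grows with batch size and shifts the real crossover). The function names are hypothetical.

```python
def crossover_batch(peak_flops: float, mem_bw: float, bytes_per_param: float) -> float:
    """Batch size at which compute time equals weight-streaming time.

    compute time = 2 * P * B / peak_flops
    memory time  = P * bytes_per_param / mem_bw
    Setting them equal gives B = peak_flops * bytes_per_param / (2 * mem_bw).
    """
    return peak_flops * bytes_per_param / (2.0 * mem_bw)


def decode_throughput(n_params: float, batch_size: int, bytes_per_param: float,
                      peak_flops: float, mem_bw: float) -> float:
    """Tokens per second for one decode step, limited by the slower wall."""
    compute_s = 2.0 * n_params * batch_size / peak_flops
    memory_s = n_params * bytes_per_param / mem_bw
    return batch_size / max(compute_s, memory_s)


if __name__ == "__main__":
    peak, bw = 990e12, 3.35e12  # H100 figures quoted in the text
    print(f"fp16 crossover batch: {crossover_batch(peak, bw, 2.0):.0f}")
    for batch in (1, 8, 64, 256, 512, 1024):
        tps = decode_throughput(70e9, batch, 2.0, peak, bw)
        print(f"batch {batch:>4}: {tps:,.0f} tokens/s")
```

Below the crossover, throughput grows linearly with batch size because the step time is fixed by the weight read. Above it, throughput flattens because every extra sequence adds compute time.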
What carries forward
- Utilization is the objective, because the hardware is the cost. Every design is judged on whether the expensive chip is busy doing useful work.
- Ask "which wall?" before optimizing. Memory capacity, memory bandwidth, network, compute, or budget — exactly one binds at a time.
- Prefill and decode are different workloads. Expect every serving decision to treat them separately.
- The loop is the skill. Lessons 02 (the numbers) and 03 (the requirements) finish equipping it; everything after is the loop, run on a new problem.