What makes ML systems design different

The classic distributed-systems playbook — stateless workers, scale out on cheap boxes, push state to a database — quietly fails on ML serving and training. Three facts break it: the unit of compute costs as much as a car, the bottleneck is memory bandwidth not CPU, and what looks like one workload is actually two with opposite profiles. Internalize these and the rest of the track is consequences.

First principle: the cost structure is inverted

A web service is built on the assumption that compute is cheap and roughly free to replicate. A request costs microseconds of a CPU that costs cents per hour; if you need more, you start more identical stateless workers behind a load balancer. The hard problems are state (databases, consistency) and fan-out (one request touching many services).

ML serving inverts every term. The unit of compute is one GPU — an H100 lists around $25,000–$40,000 and rents for roughly $2–$10 / hour. A single LLM request can occupy that GPU for hundreds of milliseconds. Utilization isn't a tidiness concern; at these prices, 10% idle is the difference between a profitable product and a burning one.

The consequence

In web design you optimize for developer velocity and accept some waste, because the hardware is cheap. In ML design you optimize for hardware utilization (MFU in training, goodput-per-GPU in serving), because the hardware is the entire cost. Almost every "weird" decision in this track — batching unrelated users together, paging KV like an OS, splitting prefill from decode — is a direct attempt to stop a $30k chip from idling.

First principle: the wall is bandwidth, not compute

The instinct from CS is that "fast" means "fewer FLOPs." On a modern GPU that instinct is usually wrong. Consider an H100: roughly 990 TFLOP/s of bf16 matrix compute, but only 3.35 TB/s of memory bandwidth. The ratio — call it the machine's arithmetic intensity — is about 295 FLOPs per byte. A kernel that doesn't do at least ~300 math operations on every byte it reads from memory leaves the compute units starving, waiting on memory.

LLM decoding is the canonical victim. Generating one token reads all the model's weights from memory to do a tiny amount of math per weight (batch size 1 ⇒ roughly one multiply-add per weight). That is an arithmetic intensity near 1 — two orders of magnitude below what the hardware wants. So single-stream decode runs at a small fraction of peak FLOP/s and its speed is set almost entirely by how fast you can stream the weights:

decode latency per token ≈ model_bytes / memory_bandwidth

For a 70B model in fp16 (140 GB) on an H100 (3.35 TB/s): 140e9 / 3.35e12 ≈ 42 ms / token as a hardware floor, regardless of how "fast" the GPU's math units are. This one equation explains why batching exists (amortize the weight read over many tokens), why quantization helps decode (fewer bytes to stream), and why KV-cache size matters so much (it's more bytes to move). We will use it constantly.

The trap it sets for designers

Teams routinely "optimize" by reducing FLOPs (a cheaper attention variant, fewer layers) and see no speedup, because they were never compute-bound. The first question in any ML performance design is not "how many FLOPs?" but "am I compute-bound or memory-bound right now?" — the roofline question of lesson 02. Get that wrong and every subsequent decision is aimed at the wrong wall.

First principle: it's two workloads wearing one trench coat

An LLM request has two phases with opposite hardware profiles, and treating them as one workload is the most common design error.

This is unique to autoregressive generation. No web request has a "phase 1 that's CPU-bound and phase 2 that's I/O-bound on the same data, hundreds of times in a row." Almost every serving design decision — the metrics you pick (lesson 03), how you batch (04), whether you disaggregate (05) — descends from this single structural fact.

Putting it together: the design loop

Given those three facts, the loop from the orientation page is the rational response. Here it is again, now with the why attached:

Step	What you do	Because of the principle…
1 · Requirements	Pin SLOs, workload, scale, budget	$/GPU-hour is so high that "build it and see" is unaffordable
2 · Arithmetic	FLOPs, bytes, bandwidth, $	The wall is usually bandwidth/memory — you must compute which
3 · Topology	Parallelism, replication, layout	Models don't fit on one GPU; phases want different placement
4 · Bottleneck	Find the wall that binds first	Optimizing a non-binding wall buys nothing
5 · Iterate	Relax it, re-run, repeat	Removing one wall always exposes the next

Interactive · find the wall

Below: a single decode step of a dense model on one GPU. Drag the batch size. At batch 1 you are deep in bandwidth-bound territory (the weights are read once and barely used). As batch grows, the same weight-read serves more tokens, arithmetic intensity climbs, and eventually you cross the roofline ridge into compute-bound. The crossover point is the single most important number in serving design — it's the batch size where the GPU finally stops wasting its math units.

The thing to notice

Throughput keeps rising with batch size for free until you hit the ridge — you're serving more users at the same per-token latency, because you were wasting bandwidth-bound cycles anyway. That "free throughput up to the ridge" is the economic engine of LLM serving, and the reason continuous batching (lesson 04) is the first thing every serving stack implements.

What carries forward

Utilization is the objective, because the hardware is the cost. Every design is judged on whether the expensive chip is busy doing useful work.
Ask "which wall?" before optimizing. Memory capacity, memory bandwidth, network, compute, or budget — exactly one binds at a time.
Prefill and decode are different workloads. Expect every serving decision to treat them separately.
The loop is the skill. Lessons 02 (the numbers) and 03 (the requirements) finish equipping it; everything after is the loop, run on a new problem.