Orientation — start here
This page is the map. It tells you what "ML systems design" means in this track, why it deserves its own series separate from the mechanism tracks, and the single loop you will run twelve times. Read it once; come back to it whenever a later lesson feels like a pile of disconnected facts.
The gap this track fills
The other series on this site each teach a mechanism deeply: how a kernel uses the GPU, how FSDP shards a model, how RadixAttention reuses a prefix, how GRPO drops the critic. Each answers a how.
None of them answers the question you actually get asked on the job, or in a staff-level interview:
That is a different skill from knowing the mechanisms — the way knowing the rules of chess is different from playing well. This track is the playing-well layer. It assumes the mechanisms exist (and links down to them) and spends its attention on selection under constraints.
The loop, in one page
Every design — inference, training, RL, a data plane — is the same five steps. You will see this loop in lesson 01 stated formally, and then run explicitly in 04 through 12.
- Requirements. Turn vague asks into numbers: SLOs (latency percentiles, throughput), workload shape (prompt/output lengths, request rate), scale, and budget. A design without a number is an opinion. (Lesson 03.)
- Arithmetic. Estimate the cost in the three currencies that bind ML systems: FLOPs (compute), bytes (memory capacity and bandwidth), and dollars (GPU-hours). This is back-of-the-envelope, done before any code. (Lesson 02.)
- Topology. Lay the work onto hardware: which parallelism, how many replicas, what is co-located vs split apart, how data flows. (Lessons 04–09.)
- Bottleneck. Of the four walls — memory capacity, memory bandwidth, network, compute (and the meta-wall, dollars) — find which one the arithmetic says you hit first. That wall, and only that wall, is your problem right now.
- Iterate. Apply the one mechanism that moves that wall. Re-run the arithmetic. A different wall is now closest. Repeat until the closest wall is the budget — then you are done, because no further engineering helps without more money.
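The bottleneck and iterate steps above can be sketched in a few lines. This is a minimal illustration, not a method the later lessons prescribe: the `Wall` class, the `pressure` ratio (demand divided by capacity), and every number except the 140 GB of weights on an 80 GB GPU are hypothetical placeholders chosen to show the shape of the reasoning.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Wall:
    """One resource limit. Hypothetical helper for illustration only."""
    name: str
    demand: float    # what the design needs, in the wall's own unit
    capacity: float  # what the hardware provides, in the same unit

    @property
    def pressure(self) -> float:
        # > 1.0 means the design does not fit; the largest value is the wall hit first.
        return self.demand / self.capacity


def closest_wall(walls: list[Wall]) -> Wall:
    """Bottleneck step: return the wall with the highest demand/capacity ratio."""
    return max(walls, key=lambda w: w.pressure)


# Placeholder estimates from the arithmetic step (assumed values, not measurements).
walls = [
    Wall("memory capacity (GB)", demand=140, capacity=80),
    Wall("memory bandwidth (GB/s)", demand=2700, capacity=3000),
    Wall("network (GB/s)", demand=50, capacity=400),
    Wall("compute (TFLOP/s)", demand=200, capacity=900),
]

first = closest_wall(walls)
print(first.name, round(first.pressure, 3))

# Iterate step: apply one change that moves only that wall (here, assume the
# weights are split across two GPUs, doubling capacity), then re-run the arithmetic.
walls = [replace(w, capacity=160) if w is first else w for w in walls]
second = closest_wall(walls)
print(second.name, round(second.pressure, 3))
```

With these placeholder inputs, memory capacity is the first wall; once it is moved, a different wall becomes the closest one, which is the point of re-running the arithmetic after every change.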
Three numbers to carry in your head
You will derive these in lesson 02, but seeing them now makes the early lessons concrete. For a dense transformer with N parameters:
| Quantity | Rule of thumb | Why it matters |
|---|---|---|
| Forward FLOPs / token | ≈ 2N | Sets inference compute and prefill cost |
| Training FLOPs / token | ≈ 6N | Fwd + bwd; sets pretraining GPU-months |
| Weights memory | 2N bytes (fp16/bf16) | The floor before you've served a single token |
A 70B model is therefore ~140 GB of weights — already more than one 80 GB H100. That single fact forces tensor parallelism or quantization before you've thought about anything else. The whole track is fact-after-fact like this: a number forces a structural decision.
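The three rules of thumb can be checked with a short calculation. This is a rough sketch under the table's assumptions (dense transformer, 2 bytes per parameter for fp16/bf16); the function name `rules_of_thumb` is a hypothetical helper, and the GPU count covers weights only, ignoring KV cache, activations, and any runtime overhead.

```python
import math


def rules_of_thumb(n_params: float, bytes_per_param: int = 2) -> dict:
    """Back-of-the-envelope estimates for a dense transformer with n_params parameters."""
    return {
        "forward_flops_per_token": 2 * n_params,    # ≈ 2N
        "training_flops_per_token": 6 * n_params,   # ≈ 6N (forward + backward)
        "weights_bytes": bytes_per_param * n_params,  # 2N bytes in fp16/bf16
    }


est = rules_of_thumb(70e9)
weights_gb = est["weights_bytes"] / 1e9

# Minimum number of 80 GB GPUs needed just to hold the weights.
gpus_for_weights = math.ceil(weights_gb / 80)

print(f"forward FLOPs/token:  {est['forward_flops_per_token']:.2e}")
print(f"training FLOPs/token: {est['training_flops_per_token']:.2e}")
print(f"weights: {weights_gb:.0f} GB -> at least {gpus_for_weights} x 80 GB GPUs")
```

Passing `bytes_per_param=1` approximates 8-bit quantization, which is the other way out of the single-GPU limit that the paragraph above mentions.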
How the lessons depend on each other
This is a strict-linear chain. Each lesson assumes everything before it and nothing after.
Read 01–03 carefully and in order — they are the foundation and they are short. After that, the inference (04–06), training (07–08), and RL (09) blocks can be skimmed by interest, but the capstone (12) assumes you've done all of them.
What this track is not
- Not a mechanism tutorial. We won't derive FlashAttention or implement PagedAttention. We decide when you need them and link to the track that builds them.
- Not classic web-system design. Load balancers, sharded SQL, and CAP theorem have their place, but the binding constraints here are GPU memory bandwidth and interconnect, not database round-trips. Lesson 01 is about exactly this difference.
- Not framework documentation. vLLM, SGLang, Megatron, and veRL are instances of these designs. We reason about the design space they all live in, so you can evaluate the next framework too.