Orientation — start here

This page is the map. It tells you what "ML systems design" means in this track, why it deserves its own series separate from the mechanism tracks, and the single loop you will run in lesson after lesson. Read it once; come back to it whenever a later lesson feels like a pile of disconnected facts.

The gap this track fills

The other series on this site each teach a mechanism deeply: how a kernel uses the GPU, how FSDP shards a model, how RadixAttention reuses a prefix, how GRPO drops the critic. Each answers a how.

None of them answers the question you actually get asked on the job, or in a staff-level interview:

The question this track is about

"Here is a model, a cluster, a budget, and a latency target. Design the system." Which parallelism? How many GPUs? Replicate or shard? What breaks first when traffic triples? What does it cost per million tokens, and which single change halves that?

That is a different skill from knowing the mechanisms — the way knowing the rules of chess is different from playing well. This track is the playing-well layer. It assumes the mechanisms exist (and links down to them) and spends its attention on selection under constraints.

The loop, in one page

Every design, whether inference, training, RL, or a data plane, follows the same five steps. Lesson 01 states this loop formally, and lessons 04 through 12 run it explicitly.

1. Requirements. Turn vague asks into numbers: SLOs (latency percentiles, throughput), workload shape (prompt/output lengths, request rate), scale, and budget. A design without a number is an opinion. (Lesson 03)
2. Arithmetic. Estimate the cost in the three currencies that bind ML systems: FLOPs (compute), bytes (memory capacity and bandwidth), and dollars (GPU-hours). This is back-of-the-envelope, done before any code. (Lesson 02)
3. Topology. Lay the work onto hardware: which parallelism, how many replicas, what is co-located vs split apart, how data flows. (Lessons 04–09)
4. Bottleneck. Of the four walls (memory capacity, memory bandwidth, network, and compute, plus the meta-wall, dollars), find which one the arithmetic says you hit first. That wall, and only that wall, is your problem right now.
5. Iterate. Apply the one mechanism that moves that wall. Re-run the arithmetic. A different wall is now closest. Repeat until the closest wall is the budget; at that point you are done, because no further engineering helps without more money.
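Steps 4 and 5 can be sketched in a few lines. This is a toy illustration, not part of the track's material: the function names `closest_wall` and `iterate_once`, the `relief` factor, and every number except the 140 GB vs 80 GB memory-capacity pair are hypothetical placeholders chosen only to show the wall moving after one iteration.

```python
# Toy sketch of "find the closest wall, relieve it, re-run the arithmetic".
# All names and most numbers are illustrative assumptions.

WALLS = ("memory_capacity", "memory_bandwidth", "network", "compute")


def closest_wall(demand, capacity):
    """Return (wall, utilization) for the resource with the highest demand/capacity ratio."""
    utilization = {wall: demand[wall] / capacity[wall] for wall in WALLS}
    worst = max(utilization, key=utilization.get)
    return worst, utilization[worst]


def iterate_once(demand, capacity, relief=0.5):
    """Apply one hypothetical mechanism that scales demand on the closest wall by `relief`."""
    wall, _ = closest_wall(demand, capacity)
    relieved = dict(demand)
    relieved[wall] = demand[wall] * relief
    return wall, relieved


# Placeholder estimates: what the workload needs vs what the hardware offers.
demand = {
    "memory_capacity": 140e9,   # bytes of weights (70B params at 2 bytes each)
    "memory_bandwidth": 3.6e12, # bytes/s, made-up
    "network": 1e11,            # bytes/s, made-up
    "compute": 2e14,            # FLOP/s, made-up
}
capacity = {
    "memory_capacity": 80e9,    # one 80 GB GPU
    "memory_bandwidth": 4.0e12, # made-up
    "network": 5e11,            # made-up
    "compute": 1e15,            # made-up
}

wall, util = closest_wall(demand, capacity)
print(f"closest wall: {wall} at {util:.2f}x capacity")

fixed, demand_after = iterate_once(demand, capacity)
next_wall, next_util = closest_wall(demand_after, capacity)
print(f"after relieving {fixed}: closest wall is {next_wall} at {next_util:.2f}x capacity")
```

With these placeholder numbers the first pass names memory capacity, and after one relief step the closest wall becomes memory bandwidth, which is the "a different wall is now closest" behavior the loop describes.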

The discipline that makes it "linearized"

You never apply a mechanism (paging, quantization, pipeline parallel, speculative decoding) until step 4 has named the wall it removes. Optimizations are answers to a measured question. A design that lists ten optimizations without naming which bottleneck each one targets is a cargo cult, and an interviewer will catch it in one follow-up question: "why that one, and what does it buy you?"

Three numbers to carry in your head

You will derive these in lesson 02, but seeing them now makes the early lessons concrete. For a dense transformer with N parameters:

Quantity	Rule of thumb	Why it matters
Forward FLOPs / token	≈ 2N	Sets inference compute and prefill cost
Training FLOPs / token	≈ 6N	Fwd + bwd; sets pretraining GPU-months
Weights memory	2N bytes (fp16/bf16)	The floor before you've served a single token

A 70B model is therefore ~140 GB of weights, already more than a single 80 GB H100 can hold. That single fact forces you to shard the weights across GPUs (for example with tensor parallelism) or to quantize them before you've thought about anything else. The whole track is fact after fact like this: a number forces a structural decision.
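The three rules of thumb and the 70B example can be checked with a short calculation. This is a minimal sketch: the function names are made up for illustration, and the GPU count covers weights only (no KV cache, activations, or optimizer state), so it is a floor rather than a deployment plan.

```python
import math

# Rules of thumb from the table above, for a dense transformer with n_params parameters.


def forward_flops_per_token(n_params):
    """Approximate forward-pass FLOPs per token: 2N."""
    return 2 * n_params


def training_flops_per_token(n_params):
    """Approximate forward + backward FLOPs per token: 6N."""
    return 6 * n_params


def weight_bytes(n_params, bytes_per_param=2):
    """Weights memory in bytes; 2 bytes per parameter for fp16/bf16."""
    return n_params * bytes_per_param


n = 70 * 10**9          # 70B parameters
hbm_per_gpu = 80 * 10**9  # one 80 GB H100

weights_gb = weight_bytes(n) / 1e9
min_gpus = math.ceil(weight_bytes(n) / hbm_per_gpu)

print(f"forward FLOPs/token:  {forward_flops_per_token(n):.1e}")
print(f"training FLOPs/token: {training_flops_per_token(n):.1e}")
print(f"weights: {weights_gb:.0f} GB -> at least {min_gpus} x 80 GB GPUs for weights alone")
```

Passing `bytes_per_param=1` models 8-bit quantization: the same 70B model then needs 70 GB, which is the other way out of the single-GPU wall.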

How the lessons depend on each other

This is a strictly linear chain. Each lesson assumes everything before it and nothing after.

01 what's different ─┐
02 napkin math ──────┼─→ 04 single replica ─→ 05 at scale ─→ 06 optimization
03 SLOs / workload ──┘                                            │
                                                                  ▼
07 pretraining ─→ 08 data plane ─→ 09 RL post-training ←──────────┘
      │                                   │   (RL = training + inference in a loop,
      └───────────────────────────────────┘    so it reuses 04–05 and 07)
                                          ▼
10 evaluation / flywheel ─→ 11 production / cost ─→ 12 capstone (uses all)

Read 01–03 carefully and in order — they are the foundation and they are short. After that, the inference (04–06), training (07–08), and RL (09) blocks can be skimmed by interest, but the capstone (12) assumes you've done all of them.

What this track is not

Not a mechanism tutorial. We won't derive FlashAttention or implement PagedAttention. We decide when you need them and link to the track that builds them.
Not classic web-system design. Load balancers, sharded SQL, and the CAP theorem have their place, but the binding constraints here are GPU memory bandwidth and interconnect, not database round-trips. Lesson 01 is about exactly this difference.
Not framework documentation. vLLM, SGLang, Megatron, and veRL are instances of these designs. We reason about the design space they all live in, so you can evaluate the next framework too.

A note on the numbers

Every figure here is a 2024–2025-era rule of thumb (H100/H200 class hardware, dense and MoE transformers). Hardware moves; the method doesn't. When the next hardware generation ships, you re-run the same arithmetic with new constants. Memorize the loop, not the gigabytes.
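As a small illustration of "same arithmetic, new constants", the sketch below keeps the formula fixed and swaps only the hardware table. The per-GPU memory figures (80 GB for H100, 141 GB for H200) are the commonly quoted capacities and are treated here as assumptions to verify against the vendor datasheets; `min_gpus_for_weights` is a hypothetical helper, and it counts weights only, ignoring KV cache and activations.

```python
import math

# Assumed per-GPU HBM capacity in bytes; check against current datasheets.
HBM_BYTES = {
    "H100": 80e9,
    "H200": 141e9,
}


def min_gpus_for_weights(n_params, bytes_per_param, hbm_bytes):
    """Fewest GPUs whose combined memory holds the weights (weights only)."""
    return math.ceil(n_params * bytes_per_param / hbm_bytes)


# Same model, same formula, different constants.
results = {
    name: min_gpus_for_weights(70e9, 2, hbm)
    for name, hbm in HBM_BYTES.items()
}
for name, gpus in results.items():
    print(f"70B in bf16 on {name}: at least {gpus} GPU(s) for weights alone")
```

Under these assumptions the weight floor drops from two GPUs to one when the constant changes, even though nothing about the method did. A real design would still need headroom beyond the weights, so the one-GPU result is a lower bound and not a recommendation.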