ML Systems Design, from first principles

The other tracks teach mechanisms — FlashAttention, FSDP, RadixAttention, GRPO. This one teaches the skill that sits on top of them: given a workload and a cluster, design the system. Every lesson runs the same loop — requirements → arithmetic → topology → bottleneck → iterate — and every choice is justified in bytes, FLOPs, and dollars.

Who this is for

You can explain what a KV cache is, what FSDP shards, and roughly what PPO does — but when someone says "design the training-and-serving platform for a reasoning model on 1,024 H100s," you don't have a method to attack it. By the end you do: a repeatable design loop, the back-of-the-envelope numbers it runs on, and worked designs for inference, pretraining, RL post-training, and the production lifecycle around them. This is the staff-MLE / design-interview layer.

The one method, applied again and again

"System design" sounds like a grab-bag of opinions. It isn't. Every design in this series is the same five-step loop, run until the numbers stop moving. The art is only in which constraint binds first — and that is always a number you can estimate before you write any code.

This is what "linearized" means here: you never reach for a mechanism (paging, sharding, speculative decoding) until the arithmetic has named the bottleneck it removes. Optimizations are answers to measured questions, never a checklist.

Part I · The method and its numbers

Orientation — start here

What this track is, how it relates to the mechanism tracks (GPU kernels, vLLM, SGLang, System ML, RL), and how to read it. The design loop in one page.

What makes ML systems design different

Why the web-systems playbook (stateless, CPU-bound, scale-out) misfires on ML. The resource is a $30k GPU, the bottleneck is memory bandwidth, and the workload is two workloads (prefill vs decode). The design loop, stated.

Napkin math — the numbers every design rests on

The 2N / 6N FLOPs rules, parameter & optimizer & activation memory, KV-cache bytes/token, the roofline and arithmetic intensity, the GPU spec sheet, and $/token. Estimate any ML system's cost before writing code.
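
A minimal sketch of the rules named above, in Python. The function names, the 70B model shape, and the price and throughput figures are illustrative assumptions, not numbers taken from the lesson.

```python
# Napkin-math helpers. Every concrete number below is an illustrative assumption.

def forward_flops_per_token(n_params: int) -> int:
    """The 2N rule: roughly 2 FLOPs per parameter per token for a forward pass."""
    return 2 * n_params

def train_flops_per_token(n_params: int) -> int:
    """The 6N rule: roughly 6 FLOPs per parameter per token, forward plus backward."""
    return 6 * n_params

def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """One K and one V vector per layer per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def dollars_per_million_tokens(gpu_dollars_per_hour: float,
                               tokens_per_second: float) -> float:
    """Cost of one GPU-hour spread over the tokens it produces in that hour."""
    return gpu_dollars_per_hour / (tokens_per_second * 3600) * 1e6

if __name__ == "__main__":
    n = 70 * 10**9  # hypothetical 70B dense model
    print("train FLOPs/token:", train_flops_per_token(n))
    print("KV bytes/token:", kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128))
    print("$/1M tokens:", dollars_per_million_tokens(2.0, 1000.0))
```

The KV formula counts KV heads rather than attention heads, so grouped-query attention shows up directly as a smaller byte count.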

Requirements — SLOs, workload, and Little's Law

TTFT, TPOT/ITL, throughput, goodput, and why you specify percentiles not means. Little's Law as the bridge from latency to concurrency to GPU count. Characterizing a workload you can't yet see.
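
Little's Law as a sizing tool can be sketched in a few lines. The arrival rate, latency, and per-replica batch below are made-up inputs; note that the law uses the mean time in system, even though the SLO itself is stated in percentiles.

```python
import math

def concurrent_requests(arrival_rate_rps: float, mean_latency_s: float) -> float:
    """Little's Law: L = lambda * W (mean requests in flight)."""
    return arrival_rate_rps * mean_latency_s

def replicas_needed(arrival_rate_rps: float, mean_latency_s: float,
                    max_batch_per_replica: int) -> int:
    """Replicas required to hold the in-flight requests, ignoring headroom."""
    in_flight = concurrent_requests(arrival_rate_rps, mean_latency_s)
    return math.ceil(in_flight / max_batch_per_replica)

if __name__ == "__main__":
    # Hypothetical workload: 50 req/s, 8 s mean end-to-end latency, batch 64 per replica.
    print(concurrent_requests(50, 8.0))   # requests in flight
    print(replicas_needed(50, 8.0, 64))   # replicas before any safety margin
```

A real design would add headroom for burstiness on top of this count; the sketch gives only the floor.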

Estimating latency — TTFT and TPOT in milliseconds

Lesson 03 named the metrics; this one predicts them. Prefill is compute-bound → TTFT = 2NP/(peak·MFU); decode is bandwidth-bound → TPOT = (2N+B·L·kv)/BW. The weight-bound vs KV-bound flip at B*, E2E = TTFT + TPOT·(G−1), and for offline batch jobs, makespan.
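
The formulas above translate almost directly into code. This sketch assumes one reading of the symbols, which the blurb does not define: N is parameter count, P is prompt tokens, B is batch size, L is context length, kv is KV bytes per token, BW is memory bandwidth, G is generated tokens, and the 2N in TPOT means two bytes per parameter. The hardware numbers in the example are placeholders.

```python
def ttft_s(n_params: float, prompt_tokens: int, peak_flops: float, mfu: float) -> float:
    """Compute-bound prefill: TTFT = 2NP / (peak * MFU)."""
    return 2 * n_params * prompt_tokens / (peak_flops * mfu)

def tpot_s(n_params: float, batch: int, context_len: int,
           kv_bytes_per_token: float, mem_bw: float) -> float:
    """Bandwidth-bound decode: TPOT = (2N + B*L*kv) / BW, with 2 bytes per weight."""
    return (2 * n_params + batch * context_len * kv_bytes_per_token) / mem_bw

def crossover_batch(n_params: float, context_len: int, kv_bytes_per_token: float) -> float:
    """B*: the batch at which KV bytes equal weight bytes (2N = B*L*kv)."""
    return 2 * n_params / (context_len * kv_bytes_per_token)

def e2e_s(ttft: float, tpot: float, gen_tokens: int) -> float:
    """E2E = TTFT + TPOT * (G - 1)."""
    return ttft + tpot * (gen_tokens - 1)

if __name__ == "__main__":
    # Placeholder inputs: 8B params, 1,000-token prompt, 1e15 FLOP/s peak at 50% MFU,
    # 3.35e12 B/s memory bandwidth, 131,072 KV bytes/token, 4,096-token context.
    t_first = ttft_s(8e9, 1000, 1e15, 0.5)
    t_next = tpot_s(8e9, 16, 4096, 131072, 3.35e12)
    print("TTFT s:", t_first)
    print("TPOT s:", t_next)
    print("B*:", crossover_batch(8e9, 4096, 131072))
    print("E2E s for 256 tokens:", e2e_s(t_first, t_next, 256))
```

Below B* the weights dominate the bytes read per step; above it the KV cache does.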

Estimating throughput — tokens/s, requests/s, and the frontier

Throughput = batch ÷ TPOT: it rises near-free, then saturates at a bandwidth ceiling no batch beats. The frontier as a computed curve, the knee under the SLO, the chain bytes → batch → throughput → $/token, and the batch/offline regime — drop the SLO, ride the ceiling (embeddings = prefill-only).
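
A small sketch of the frontier as a computed curve. It reuses the bandwidth-bound TPOT model (weight bytes plus per-sequence KV bytes, divided by bandwidth); that model and all the inputs are simplifying assumptions, and the function names are hypothetical.

```python
def tpot_s(batch: int, weight_bytes: float, kv_bytes_per_seq: float, mem_bw: float) -> float:
    """Bytes read per decode step divided by memory bandwidth."""
    return (weight_bytes + batch * kv_bytes_per_seq) / mem_bw

def throughput_tok_s(batch: int, weight_bytes: float, kv_bytes_per_seq: float,
                     mem_bw: float) -> float:
    """Throughput = batch / TPOT."""
    return batch / tpot_s(batch, weight_bytes, kv_bytes_per_seq, mem_bw)

def ceiling_tok_s(kv_bytes_per_seq: float, mem_bw: float) -> float:
    """Limit as batch grows: weights amortize away, KV reads remain."""
    return mem_bw / kv_bytes_per_seq

def max_batch_under_slo(tpot_slo_s: float, weight_bytes: float,
                        kv_bytes_per_seq: float, mem_bw: float) -> int:
    """Largest batch whose TPOT still meets the SLO (the knee under the SLO)."""
    budget = tpot_slo_s * mem_bw - weight_bytes
    return max(0, int(budget // kv_bytes_per_seq))

if __name__ == "__main__":
    # Placeholder inputs: 16 GB of weights, 0.5 GB of KV per sequence, 3.35e12 B/s.
    w, kv, bw = 16e9, 0.5e9, 3.35e12
    for b in (1, 8, 64, 512):
        print(b, round(throughput_tok_s(b, w, kv, bw), 1))
    print("ceiling:", ceiling_tok_s(kv, bw))
    print("max batch at 30 ms TPOT:", max_batch_under_slo(0.030, w, kv, bw))
```

Doubling the batch nearly doubles throughput while weight bytes dominate, then the curve flattens toward the ceiling.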

Part II · Inference system design

Designing one serving replica

The request lifecycle, continuous batching, the KV memory budget, and how prefill and decode fight for the same GPU. Sizing max-batch and max-context from the memory equation. The single-replica throughput–latency frontier.
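
One way to write the memory equation is HBM = weights + reserve + batch × context × KV bytes per token, then solve for the unknown. The sketch below does that with integer bytes; the `reserve_bytes` term (activations, fragmentation, runtime overhead) and the example sizes are assumptions.

```python
def max_batch(hbm_bytes: int, weight_bytes: int, reserve_bytes: int,
              context_len: int, kv_bytes_per_token: int) -> int:
    """Largest batch whose KV cache fits at a fixed context length."""
    free = hbm_bytes - weight_bytes - reserve_bytes
    if free <= 0:
        return 0
    return free // (context_len * kv_bytes_per_token)

def max_context(hbm_bytes: int, weight_bytes: int, reserve_bytes: int,
                batch: int, kv_bytes_per_token: int) -> int:
    """Longest context that fits at a fixed batch size."""
    free = hbm_bytes - weight_bytes - reserve_bytes
    if free <= 0:
        return 0
    return free // (batch * kv_bytes_per_token)

if __name__ == "__main__":
    # Hypothetical replica: 80 GB HBM, 16 GB weights, 8 GB reserve,
    # 131,072 KV bytes/token.
    GB = 10**9
    print(max_batch(80 * GB, 16 * GB, 8 * GB, 4096, 131072))
    print(max_context(80 * GB, 16 * GB, 8 * GB, 32, 131072))
```

Batch and context trade against each other through the same free-memory term, which is why the two limits are sized together.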

Scaling inference — replication, TP, and disaggregation

When to replicate, when to shard with tensor parallel, and when to split prefill from decode onto different machines. Cache-aware routing, autoscaling on goodput, multi-tenancy, and the capacity/cost model for a fleet.

Inference optimization as a decision discipline

Quantization, speculative decoding, prefix caching, chunked prefill, MoE. For each: the arithmetic for when it pays and when it's a trap. A decision tree, not a checklist.

Part III · Training system design

Designing a pretraining system

The parallelism design space (DP / FSDP / TP / PP / EP / 3D) as a constraint-satisfaction problem: pick the config from model size, cluster, and memory. MFU as the score, communication overlap, and checkpoint / fault-tolerance at 1,000-GPU scale.
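
The memory constraint is the first filter on that design space. As a rough sketch, assume 16 bytes of training state per parameter (bf16 weights and gradients, fp32 master weights, two fp32 Adam moments) and ignore activations entirely; both are assumptions, and the function names are hypothetical.

```python
def train_state_bytes(n_params: int, bytes_per_param: int = 16) -> int:
    """Weights + grads + optimizer state.
    16 B/param assumes bf16 weights and grads (2 + 2), an fp32 master copy (4),
    and two fp32 Adam moments (4 + 4). Activations are not counted."""
    return n_params * bytes_per_param

def min_shard_degree(n_params: int, usable_hbm_bytes: int,
                     bytes_per_param: int = 16) -> int:
    """Smallest number of GPUs to shard state across so each shard fits."""
    total = train_state_bytes(n_params, bytes_per_param)
    return -(-total // usable_hbm_bytes)  # ceiling division on integers

if __name__ == "__main__":
    usable = 60 * 10**9  # assume 60 GB of an 80 GB GPU is left for state
    for billions in (7, 70, 405):
        n = billions * 10**9
        print(billions, "B params ->", min_shard_degree(n, usable), "GPUs minimum")
```

This gives only a lower bound on sharding; activation memory and the choice between FSDP, TP, and PP decide the actual config.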

The communication tax — when a collective hides

Lesson 07 called parallelism's costs "overlappable." This lesson cashes that out: every axis moves a computable buffer (∝N for DP/FSDP, ∝b·s·h for TP/PP/EP), and one inequality, t_comm ≤ t_compute, decides if it's free. Two clean thresholds fall out: DP hides at ~8K tok/GPU; TP tax ≈ ⅓·peak/(BW·h), ~5% in-node vs ~80% across nodes. The 18× rule, derived.
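
The data-parallel case of that inequality can be sketched as follows. It assumes a ring all-reduce (each GPU moves about 2·(p−1)/p times the gradient buffer), compares communication against the whole step's 6N-per-token compute, and uses placeholder hardware numbers. The lesson's own constants are not shown on this page, so the threshold printed here need not match its ~8K figure.

```python
def allreduce_seconds(n_params: float, grad_bytes_per_param: float,
                      link_bytes_per_s: float, n_gpus: int) -> float:
    """Ring all-reduce: each GPU sends about 2*(p-1)/p times the buffer."""
    buffer_bytes = n_params * grad_bytes_per_param
    return 2 * (n_gpus - 1) / n_gpus * buffer_bytes / link_bytes_per_s

def compute_seconds(n_params: float, tokens_per_gpu: float,
                    peak_flops: float, mfu: float) -> float:
    """6N FLOPs per token at the achieved fraction of peak."""
    return 6 * n_params * tokens_per_gpu / (peak_flops * mfu)

def min_tokens_per_gpu_to_hide(grad_bytes_per_param: float, link_bytes_per_s: float,
                               peak_flops: float, mfu: float, n_gpus: int) -> float:
    """Tokens per GPU at which t_comm == t_compute. N cancels out."""
    ring = 2 * (n_gpus - 1) / n_gpus
    return ring * grad_bytes_per_param * peak_flops * mfu / (6 * link_bytes_per_s)

if __name__ == "__main__":
    # Placeholder cluster: 1e15 FLOP/s peak, 40% MFU, 50 GB/s per-GPU link, bf16 grads.
    print(min_tokens_per_gpu_to_hide(2, 50e9, 1e15, 0.4, 1024))
```

Because the parameter count cancels, the threshold depends only on the ratio of achieved compute to link bandwidth, which is why it can be stated as a single tokens-per-GPU number.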

The scaling ceiling — how far "add GPUs" gets you

Run the tax forward to thousands of GPUs. Strong scaling (fixed model) decays toward an Amdahl ceiling as the comm fraction climbs; weak scaling (grow the model) holds flat — why frontier runs co-design cluster + model + tokens. And the wall that isn't hardware: critical batch size caps data parallelism at dp ≤ B_crit/(micro·accum·seq).
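
Both limits reduce to one-line formulas. In this sketch the Amdahl form treats non-overlapped communication as the serial fraction, and B_crit is taken to be measured in tokens; both are assumptions about the lesson's conventions, and the inputs are made up.

```python
def amdahl_speedup(parallel_fraction: float, n_gpus: int) -> float:
    """Speedup when only parallel_fraction of the step scales with GPU count."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_gpus)

def amdahl_ceiling(parallel_fraction: float) -> float:
    """Limit of the speedup as n_gpus grows without bound."""
    return 1.0 / (1.0 - parallel_fraction)

def max_data_parallel(b_crit_tokens: int, micro_batch: int,
                      grad_accum: int, seq_len: int) -> int:
    """dp <= B_crit / (micro * accum * seq), with B_crit in tokens."""
    return b_crit_tokens // (micro_batch * grad_accum * seq_len)

if __name__ == "__main__":
    for n in (8, 64, 1024):
        print(n, round(amdahl_speedup(0.95, n), 2))
    print("ceiling:", amdahl_ceiling(0.95))
    # Hypothetical 4M-token critical batch, micro-batch 1, no accumulation, 4,096 seq.
    print("max dp:", max_data_parallel(4_000_000, 1, 1, 4096))
```

The second cap is independent of hardware: past that data-parallel degree, extra GPUs only grow the batch beyond the point where it still helps convergence.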

The scaling ladder — 7B on 8 GPUs to 405B on 2,048

One continuous climb that threads 07 + 7a + 7b: at each rung, bill → binding wall → cheapest relieving axis → re-check. Watch the wall move — state > 1 GPU, state > 1 node, weights > TP alone, 6N/token — forcing FSDP → TP → PP → MoE in order. The config is an output, never an input. Ends with a topology recommender.

The data plane that keeps the GPUs fed

An $8M training run idles if the dataloader stalls. Throughput targets, streaming & sharding, packing, deterministic resumption, and the train/data co-design. Where the bottleneck hides and how to prove it isn't the GPU.

Part IV · RL post-training system design

Designing an RL post-training system

RL is a training system and an inference system wired in a loop. Actor / learner / reward roles, colocated vs disaggregated placement, weight sync, the rollout-bound vs train-bound diagnosis, and the async / off-policy staleness trade.

RL framework bottlenecks from first principles

The RL loop as a distributed system: decode bandwidth, learner FLOPs/HBM, reward/env latency, weight movement, and freshness budgets. Includes a bottleneck triage widget for actor, learner, reward, and sync walls.

Placement, layout conversion, and weight sync

Why actor serving and learner training want different parallel layouts, how full-policy fanout becomes the wall, and when to use colocated engines, resharding, bucketed updates, DMA, or relay workers.

Async rollout, environments, and streaming dataflow

The rollout plane as a serving fleet plus trajectory store: continuous batching, speculative decoding, env pools, async reward queues, dynamic repacking, backpressure, and bounded staleness.

SOTA framework optimization playbook

A workload-first map of TRL, OpenRLHF, verl, slime, NeMo RL, LlamaRL, AReaL, Laminar, AsyncFlow, Relax, and Agent Lightning. Choose by rollout wall, sync wall, env latency, model size, and staleness tolerance.

Part V · The MLE lifecycle

Designing evaluation and the feedback flywheel

A system you can't measure, you can't improve. Offline evals, LLM-as-judge and its biases, online A/B with guardrails, and the data flywheel that turns production traffic into the next training set. Regression gating as a release contract.

Production — reliability, observability, and cost

The metrics that actually predict an incident, deploy & rollback for stateful GPU services, blast-radius and redundancy math, and cost governance — the lever that decides whether the system survives its own success.

Part VI · Synthesis

Capstone — design a reasoning-model platform

One walkthrough that threads every prior lesson: take a reasoning model from pretraining → RL → serving → eval → iterate, on a fixed cluster and budget, naming the binding constraint at each stage. Includes an interactive design self-test, then hands off to the case-study library.

Part VII · Design case studies

Six self-contained worked designs. Each applies the same loop to a workload with a different binding wall, so the resulting system looks different even though the model is ordinary. Read in any order; together they're the proof that the method generalizes.

Code assistant (Copilot-style) — TTFT-bound

Tiny outputs, near-static 6K context typed one keystroke at a time. The wall is redundant prefill; prefix caching turns a compute problem into a routing + KV-residency problem. Smaller is faster here.

Consumer chatbot at scale — memory × cost

Multi-turn history growth × six-figure concurrency × a 10× diurnal swing. The wall is KV memory and the bill; fp8/GQA density, model tiering, and autoscaling against the swing.

RAG knowledge assistant — prefill + retrieval

Long retrieved context dominates TTFT, a second subsystem (retrieval) shares the budget, and a new wrinkle appears: cache coherence — invalidating hot-doc KV when documents change.

Agentic tool-use platform — tail + state

Tasks are 10–30 sequential calls, so a per-call p99 event lands in roughly 10–26% of tasks instead of 1% (tail amplification along the chain, assuming independent calls), and tasks are stateful, pinned to their KV. A reliability/scheduling problem.
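
The tail-amplification arithmetic is short enough to check directly. This sketch assumes calls are independent and that each exceeds its p99 latency with probability 0.01; the function name is hypothetical.

```python
def p_task_hits_tail(n_calls: int, per_call_tail_prob: float = 0.01) -> float:
    """Probability that at least one of n sequential calls lands in the tail."""
    return 1.0 - (1.0 - per_call_tail_prob) ** n_calls

if __name__ == "__main__":
    for n in (1, 10, 30, 100):
        print(n, "calls ->", round(p_task_hits_tail(n), 3))
```

Since the calls are sequential, one slow call delays the whole task, so the per-call percentile that matters for a task-level SLO is much stricter than p99.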

Long-context document AI — seq² + KV

100K–1M tokens: prefill goes quadratic (the seq² regime) and a single request's KV exceeds one GPU, forcing sequence/context parallelism. Amortizing the giant prefill is the big win.
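
Both walls can be estimated before any code runs. The sketch below counts attention-score FLOPs as 4·s²·d per layer (QKᵀ plus the attention-weighted sum, no causal-mask saving) against the linear 2N·s term, and divides a single request's KV bytes by usable HBM. The model shape and memory budget are assumptions.

```python
def linear_prefill_flops(n_params: int, seq_len: int) -> int:
    """The 2N-per-token term: grows linearly with sequence length."""
    return 2 * n_params * seq_len

def attention_prefill_flops(n_layers: int, d_model: int, seq_len: int) -> int:
    """QK^T and attention-times-V: about 4 * s^2 * d per layer, quadratic in s."""
    return 4 * n_layers * d_model * seq_len * seq_len

def gpus_for_kv(seq_len: int, kv_bytes_per_token: int, usable_hbm_bytes: int) -> int:
    """GPUs needed just to hold one request's KV cache."""
    total = seq_len * kv_bytes_per_token
    return -(-total // usable_hbm_bytes)  # ceiling division on integers

if __name__ == "__main__":
    # Hypothetical 70B-class shape: 80 layers, d_model 8192, 327,680 KV bytes/token.
    n, layers, d, kv = 70 * 10**9, 80, 8192, 327_680
    for s in (8_000, 100_000, 1_000_000):
        ratio = attention_prefill_flops(layers, d, s) / linear_prefill_flops(n, s)
        print(s, "tokens: attention/linear FLOPs =", round(ratio, 2),
              "| GPUs for KV =", gpus_for_kv(s, kv, 60 * 10**9))
```

The ratio grows linearly with sequence length, so the quadratic term that is negligible at a few thousand tokens dominates well before 1M.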

High-throughput batch & embeddings — tokens / $

The mirror image of the code assistant: no latency SLO, so batching is the whole game. Push past the roofline ridge, quantize, run on spot GPUs — pure tokens-per-dollar.

How to use this

Read 01–03b in order, no skipping. They are the method. Every later lesson assumes the design loop and the napkin-math numbers; the arithmetic is load-bearing, not decoration.
Do the back-of-the-envelope before reading the answer. Each design lesson states the requirements, then pauses. Estimate the GPU count yourself, then check. The gap between your guess and the number is the lesson.
Touch the widgets. Each has one knob whose extreme setting flips the binding constraint — memory becomes bandwidth, latency-bound becomes throughput-bound. Finding that flip is the design intuition.
This track points down at the mechanism tracks. When a lesson decides "shard with tensor parallel here," the how lives in System ML 06. This series is the why-and-when; follow the links for the how.
Then drill with Part VII (13–18). After the capstone, the six case studies apply the loop cold to real products. Predict the binding wall before you read each one — that prediction is the skill the whole track is building.

Prerequisites

None are strictly required — the arithmetic is built from scratch in lesson 02 — but the payoff multiplies if you've seen the mechanisms in action. Ideal prior exposure: vLLM 01 (KV cache), System ML 01 (why distributed), and RL Post-Training 01. This track turns those mechanisms into design decisions.