System ML — Parallel Training & Inference, From First Principles
A linearized tour of how a 70B-parameter model is trained on thousands of GPUs and served to millions of users. Each lesson isolates one mechanism — derived from a bandwidth-vs-compute argument, not memorized from a slide deck.
interview_questions/traditional_ml/ track covers the interview-prep side of the same fundamentals.
The system you're learning
Every distributed deep-learning system reduces to one tension: HBM, compute, and interconnect are three orthogonal resources, and the way you partition the work decides which one binds first. The seven parallelism strategies below are seven different bets about which axis is cheap and which is scarce on your hardware. The art is composing them so all three are saturated at once.
Each strategy lives at a different point on the triangle. Pure DDP sits squarely in the centre (it spends HBM and interconnect, not much else). FSDP slides toward the interconnect corner — trading communication for memory. TP and EP push almost everything onto the interconnect. The lessons walk this triangle.
The four questions this series answers
- Why can't I just buy a bigger GPU? Because the model, the gradients, the optimizer state, and the activations grow at different rates than HBM does. Memory is the first wall (lesson 01).
- Why is AllReduce a constant — not a function of N? Ring topology. The bandwidth-optimal collective derivation in lesson 02 is the single most useful piece of math in this series.
- Why does TP "want" to stay on one node? The bandwidth pyramid: 10 TB/s registers, 3 TB/s HBM, 0.9 TB/s NVLink, 0.05 TB/s IB per NIC. Two orders of magnitude per layer (lesson 03).
- What changes between training and inference? Training is dominated by gradient sync; inference is dominated by per-step memory bandwidth. The same hardware, used differently. Lessons 11–12.
Part I · Foundations (lessons 01–03 · the language and the constraints)
AllReduce = ReduceScatter + AllGather and the ring derivation that makes its cost independent of N for large tensors.Part II · Training parallelism (lessons 04–09 · six strategies, six places to shard)
Each strategy shards something different: the data, the optimizer state, the weights themselves, the layers, the sequence, or the experts. Read in order and you can derive any modern training stack (Megatron-LM, DeepSpeed, NeMo, FSDP) as a composition of these six.
replicate shard
weights optimizer
┌── DDP ───────────────────── FSDP / ZeRO ──┐ memory
│ (lesson 04) (lesson 05) │
│ │
│ shard within shard across │
│ layers layers │
├── TP ──────────────────────── PP ──────────┤ compute
│ (lesson 06) (lesson 07) │
│ │
│ shard the shard the │
│ sequence experts │
└── SP / CP ────────────────── EP ───────────┘ specialty
(lesson 08) (lesson 09)
Part III · Composition & inference (lessons 10–12 · putting it together, and what changes at serve-time)
TP=8 × PP=4 × DP=16 is the canonical layout for ~70B on 512 GPUs. The decision tree: TP first intra-node, PP across nodes, DP/FSDP outermost, SP/CP for long-context, EP for MoE.Part IV · The single-GPU stack (lessons 13–19 · framework, kernels, compilers)
Parts I–III sit at the cluster level: ranks, collectives, fabric. Below all that is one GPU running one forward pass — and the per-GPU throughput is set by a stack of layers most of which you never see. The Python you write becomes an autograd graph, which becomes a dispatcher trace, which becomes a stream of kernel launches, which become matmuls and elementwise ops, which read and write HBM. Each layer is a place where performance leaks or is reclaimed.
user code ────▶ PyTorch dispatcher ────▶ autograd graph
(lesson 13) (lesson 13) (lesson 13)
│
┌────────────────────────┘
▼
autocast / precision ─────▶ caching allocator
(lesson 14) (lesson 15)
│ │
▼ │
kernel launch ◀────── CUDA stream │
(lesson 16) │
│ │
▼ │
┌── handwritten CUDA / cuBLAS / cuDNN ───┤
├── Triton (lesson 17) │
└── Inductor codegen (lesson 18) │
│
CUDA Graphs / TensorRT (lesson 19)
│
▼
HBM
torch.matmul" actually does. The dispatcher's device/dtype dispatch, the autograd graph that quietly forms behind your back, and the per-op Python tax that torch.compile eventually fixes.cudaMalloc is too slow per-op. PyTorch's caching allocator, fragmentation in real training, and expandable_segments as the modern escape hatch. Streams and the per-stream pool.torch.compile, and when it doesn't.Part V · CUDA kernels — moved to a dedicated track
What was Part V (lessons 20–28 here) — GPU execution model, memory hierarchy, vector add, coalesced access, tiled matmul, warps & divergence, reductions, occupancy, tensor cores — is now the foundations half of the GPU Kernels for LLM Serving track, where it sits directly under the serving lessons that depend on it (FlashAttention, paged KV, prefix reuse, scheduling, framework).
How to read this
- Linearly. Lesson n assumes the math of n−1. The collective derivation in 02 is reused in 04, 05, 06, and 09; the bandwidth pyramid from 03 explains every "why intra-node?" sentence in Part II.
- Touch every widget. Each one has a regime where the strategy fails. Find it. The failure is the lesson — "what breaks DDP" tells you why FSDP exists; "what blows up TP" tells you why TP stays on one node.
- Do the back-of-envelope. The numbers in this series are not decorative. Memorize four: NVLink ~900 GB/s, IB ~50 GB/s/NIC, HBM ~3 TB/s, H100 peak ~1 PFLOP bf16. Most distributed-training arguments are settled by comparing two of these.
interview_questions/traditional_ml/. The inference-serving side has its own lesson series in vllm/lessons/. RL training-system architecture (which composes all of these) is in RL/lessons/.