system_ml / lessons / index 19 lessons · ~4h read

System ML — Parallel Training & Inference, From First Principles

A linearized tour of how a 70B-parameter model is trained on thousands of GPUs and served to millions of users. Each lesson isolates one mechanism — derived from a bandwidth-vs-compute argument, not memorized from a slide deck.

Who this is for
You can read PyTorch and you have run a single-GPU training job. The word "AllReduce" rings a bell but you couldn't draw it. By the end you'll be able to take a fresh model spec ("70B params, 80 layers, 32k context, MoE 8×expert"), name the parallelism strategy out loud, and predict roughly where the bottleneck will live. The companion interview_questions/traditional_ml/ track covers the interview-prep side of the same fundamentals.

The system you're learning

Every distributed deep-learning system reduces to one tension: HBM, compute, and interconnect are three orthogonal resources, and the way you partition the work decides which one binds first. The seven parallelism strategies below are seven different bets about which axis is cheap and which is scarce on your hardware. The art is composing them so all three are saturated at once.

HBM (memory) Compute (FLOPs) Interconnect (BW) DATA PARALLEL replicate weights · AllReduce grads FSDP / ZeRO shard params · trade BW for HBM TENSOR PARALLEL shard within layer · NVLink-heavy PIPELINE PARALLEL shard across layers · bubble cost SEQ / CONTEXT PAR. shard the sequence axis EXPERT PARALLEL MoE · AllToAll twice per layer

Each strategy lives at a different point on the triangle. Pure DDP sits squarely in the centre (it spends HBM and interconnect, not much else). FSDP slides toward the interconnect corner — trading communication for memory. TP and EP push almost everything onto the interconnect. The lessons walk this triangle.

The four questions this series answers

  1. Why can't I just buy a bigger GPU? Because the model, the gradients, the optimizer state, and the activations grow at different rates than HBM does. Memory is the first wall (lesson 01).
  2. Why is AllReduce a constant — not a function of N? Ring topology. The bandwidth-optimal collective derivation in lesson 02 is the single most useful piece of math in this series.
  3. Why does TP "want" to stay on one node? The bandwidth pyramid: 10 TB/s registers, 3 TB/s HBM, 0.9 TB/s NVLink, 0.05 TB/s IB per NIC. Two orders of magnitude per layer (lesson 03).
  4. What changes between training and inference? Training is dominated by gradient sync; inference is dominated by per-step memory bandwidth. The same hardware, used differently. Lessons 11–12.

Part I · Foundations (lessons 01–03 · the language and the constraints)

01
Why distributed at all?
The three walls — memory, compute, throughput — and the bytes-per-parameter accounting that decides whether your model fits. A 70B model needs ~1.1 TB of training state on one rank. One GPU has 80 GB.
02
Collectives — the seven primitives
Broadcast, Reduce, AllReduce, AllGather, ReduceScatter, AllToAll. The identity AllReduce = ReduceScatter + AllGather and the ring derivation that makes its cost independent of N for large tensors.
03
Interconnect — NVLink, NVSwitch, IB
The bandwidth pyramid. NVLink is ~18× faster than IB per pair. This single number determines whether tensor parallel stays intra-node and whether pipeline parallel can stretch across racks.

Part II · Training parallelism (lessons 04–09 · six strategies, six places to shard)

Each strategy shards something different: the data, the optimizer state, the weights themselves, the layers, the sequence, or the experts. Read in order and you can derive any modern training stack (Megatron-LM, DeepSpeed, NeMo, FSDP) as a composition of these six.

                replicate                           shard
                weights                              optimizer
        ┌── DDP ───────────────────── FSDP / ZeRO ──┐ memory
        │  (lesson 04)                  (lesson 05) │
        │                                            │
        │ shard within                shard across   │
        │   layers                       layers      │
        ├── TP ──────────────────────── PP ──────────┤ compute
        │  (lesson 06)                  (lesson 07)  │
        │                                            │
        │ shard the                   shard the      │
        │ sequence                    experts        │
        └── SP / CP ────────────────── EP ───────────┘ specialty
           (lesson 08)                 (lesson 09)
04
Data parallel (DDP) — the simplest fan-out
Replicate weights, split the batch, AllReduce the gradients. Why bucketing matters. The "compute / communication overlap" trick that hides the AllReduce behind backward. Why DDP runs out at the memory wall.
05
FSDP / ZeRO — sharding what DDP replicates
ZeRO-1 (optimizer), ZeRO-2 (gradients), ZeRO-3 (parameters). The memory ↔ communication tradeoff exactly. HSDP — the production compromise that shards intra-node, replicates inter-node.
06
Tensor parallel — Megatron's column/row trick
When a single layer is bigger than one GPU. Column-parallel A, row-parallel B, one AllReduce per block. The four AllReduces per layer that pin TP to one node.
07
Pipeline parallel — and the bubble
Shard layers, not weights within a layer. The bubble fraction (N−1)/(M+N−1) and why 1F1B doesn't shrink it (it shrinks memory instead). Interleaved scheduling.
08
Sequence and context parallel — long context
SP shards the activations TP can't reach. CP shards the attention itself — Ring Attention rotates K/V around the ring and reuses the FlashAttention online-softmax accumulator. The trick that makes 1M-token training viable.
09
Expert parallel — MoE and AllToAll
An MoE layer routes each token to k of E experts. Sharding experts means tokens have to travel: two AllToAlls forward, two backward. Why load balancing is a loss term, not a config flag.

Part III · Composition & inference (lessons 10–12 · putting it together, and what changes at serve-time)

10
3D / nD parallelism — composition rules
Why TP=8 × PP=4 × DP=16 is the canonical layout for ~70B on 512 GPUs. The decision tree: TP first intra-node, PP across nodes, DP/FSDP outermost, SP/CP for long-context, EP for MoE.
11
Inference — TP for latency, replicas for throughput
Decode is memory-bound. A TP=8 replica reads weights 8× faster than a TP=1 replica — but its per-GPU throughput doesn't improve. The Amdahl-flavoured math that picks the right shape.
12
Disaggregated prefill / decode
Prefill is compute-bound; decode is memory-bound. Same model, opposite bottlenecks. Separate pools, transfer the KV cache between them, and the per-pool utilization climbs at the cost of one cross-rack copy.

Part IV · The single-GPU stack (lessons 13–19 · framework, kernels, compilers)

Parts I–III sit at the cluster level: ranks, collectives, fabric. Below all that is one GPU running one forward pass — and the per-GPU throughput is set by a stack of layers most of which you never see. The Python you write becomes an autograd graph, which becomes a dispatcher trace, which becomes a stream of kernel launches, which become matmuls and elementwise ops, which read and write HBM. Each layer is a place where performance leaks or is reclaimed.

  user code  ────▶  PyTorch dispatcher  ────▶  autograd graph
  (lesson 13)         (lesson 13)                (lesson 13)
                                                       │
                              ┌────────────────────────┘
                              ▼
                       autocast / precision  ─────▶  caching allocator
                          (lesson 14)                  (lesson 15)
                              │                            │
                              ▼                            │
                       kernel launch  ◀──────  CUDA stream │
                          (lesson 16)                      │
                              │                            │
                              ▼                            │
                  ┌── handwritten CUDA / cuBLAS / cuDNN ───┤
                  ├── Triton  (lesson 17)                  │
                  └── Inductor codegen  (lesson 18)        │
                                                           │
                            CUDA Graphs / TensorRT  (lesson 19)
                                       │
                                       ▼
                                   HBM
13
PyTorch internals — dispatcher & autograd
What "calling torch.matmul" actually does. The dispatcher's device/dtype dispatch, the autograd graph that quietly forms behind your back, and the per-op Python tax that torch.compile eventually fixes.
14
Mixed precision — bf16, fp16, fp8
Why bf16 won. The fp16 loss-scaler trick and why bf16 deleted it. fp8's two formats (E4M3 / E5M2) and why the master weights are still fp32. The autocast mechanic in one diagram.
15
CUDA memory & the caching allocator
Why cudaMalloc is too slow per-op. PyTorch's caching allocator, fragmentation in real training, and expandable_segments as the modern escape hatch. Streams and the per-stream pool.
16
Custom kernels & fusion
Every kernel launch reads inputs from HBM and writes outputs to HBM. Fusion is the art of doing two kernels' worth of work on one HBM round-trip. FlashAttention as the case study; the elementwise epilogue as the warm-up.
17
Triton — the ML kernel DSL
A "program" is a block of threads; pointers are blocks of addresses. The SRAM ↔ HBM dance you have to choreograph yourself. When Triton wins over torch.compile, and when it doesn't.
18
torch.compile — Dynamo + AOT Autograd + Inductor
Dynamo turns Python into a graph (or several, with graph breaks). AOT Autograd attaches a backward. Inductor lowers to Triton kernels. The three failure modes — graph breaks, recompilation, fallbacks — and what each costs.
19
CUDA Graphs & TensorRT — serve-time graph capture
Decoding a token is ~10 μs of Python overhead per launch — capture the graph once, replay forever. The shape-dependence trap. TensorRT-LLM and what it does beyond torch.compile (pre-tuned kernels, layer-by-layer plan files, persistent kernels).

Part V · CUDA kernels — moved to a dedicated track

What was Part V (lessons 20–28 here) — GPU execution model, memory hierarchy, vector add, coalesced access, tiled matmul, warps & divergence, reductions, occupancy, tensor cores — is now the foundations half of the GPU Kernels for LLM Serving track, where it sits directly under the serving lessons that depend on it (FlashAttention, paged KV, prefix reuse, scheduling, framework).

If you came here for CUDA basics
Open gpu_kernel_serving — lessons 01 through 09 are the CUDA primitives (execution model → memory hierarchy → vector add → coalescing → tiled matmul → warps → reductions → occupancy → tensor cores). Lessons 10 through 17 then use those primitives to build vLLM- and SGLang-style serving.

How to read this

  1. Linearly. Lesson n assumes the math of n−1. The collective derivation in 02 is reused in 04, 05, 06, and 09; the bandwidth pyramid from 03 explains every "why intra-node?" sentence in Part II.
  2. Touch every widget. Each one has a regime where the strategy fails. Find it. The failure is the lesson — "what breaks DDP" tells you why FSDP exists; "what blows up TP" tells you why TP stays on one node.
  3. Do the back-of-envelope. The numbers in this series are not decorative. Memorize four: NVLink ~900 GB/s, IB ~50 GB/s/NIC, HBM ~3 TB/s, H100 peak ~1 PFLOP bf16. Most distributed-training arguments are settled by comparing two of these.
Companion material
Interview-prep material lives in interview_questions/traditional_ml/. The inference-serving side has its own lesson series in vllm/lessons/. RL training-system architecture (which composes all of these) is in RL/lessons/.