System ML — Parallel Training & Inference, From First Principles

A linearized tour of how a 70B-parameter model is trained on thousands of GPUs and served to millions of users. Each lesson isolates one mechanism — derived from a bandwidth-vs-compute argument, not memorized from a slide deck.

Who this is for

You can read PyTorch and you have run a single-GPU training job. The word "AllReduce" rings a bell but you couldn't draw it. By the end you'll be able to take a fresh model spec ("70B params, 80 layers, 32k context, MoE 8×expert"), name the parallelism strategy out loud, and predict roughly where the bottleneck will live. The companion interview_questions/traditional_ml/ track covers the interview-prep side of the same fundamentals.

The system you're learning

Every distributed deep-learning system reduces to one tension: HBM, compute, and interconnect are three orthogonal resources, and the way you partition the work decides which one binds first. The seven parallelism strategies below are seven different bets about which axis is cheap and which is scarce on your hardware. The art is composing them so all three are saturated at once.

Each strategy lives at a different point on the triangle formed by those three resources. Pure DDP sits squarely in the centre (it spends HBM and interconnect, not much else). FSDP slides toward the interconnect corner, spending communication to save memory. TP and EP push almost everything onto the interconnect. The lessons walk this triangle.

The four questions this series answers

Why can't I just buy a bigger GPU? Because the model, the gradients, the optimizer state, and the activations grow at different rates than HBM does. Memory is the first wall (lesson 01).
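Why is AllReduce's cost (nearly) constant, not a function of N? Ring topology. On a ring each rank sends 2(N−1)/N times the tensor size, which approaches 2× the tensor as N grows. The bandwidth-optimal collective derivation in lesson 02 is the single most useful piece of math in this series.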
Why does TP "want" to stay on one node? The bandwidth pyramid: 10 TB/s registers, 3 TB/s HBM, 0.9 TB/s NVLink, 0.05 TB/s IB per NIC. That is roughly two orders of magnitude from top to bottom, and the steepest single step (~18×) is the one from NVLink to IB (lesson 03).
What changes between training and inference? Training is dominated by gradient sync; inference is dominated by per-step memory bandwidth. The same hardware, used differently (lessons 14–15).

Part I · Foundations (lessons 01–03 · the language and the constraints)

Why distributed at all?

The three walls — memory, compute, throughput — and the bytes-per-parameter accounting that decides whether your model fits. A 70B model needs ~1.1 TB of training state on one rank. One GPU has 80 GB.
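The ~1.1 TB figure can be reproduced with a per-parameter byte count. The sketch below assumes mixed-precision Adam at 16 bytes per parameter (bf16 weights and gradients, fp32 master weights, two fp32 Adam moments). That breakdown is an assumption about what lesson 01 counts, and activations are left out.

```python
# Assumed per-parameter byte budget for mixed-precision Adam (illustrative, not from the lesson).
BYTES_PER_PARAM = {
    "bf16_weights": 2,
    "bf16_grads": 2,
    "fp32_master_weights": 4,
    "fp32_adam_m": 4,
    "fp32_adam_v": 4,
}


def training_state_bytes(n_params: float) -> float:
    """Bytes of weights + gradients + optimizer state, excluding activations."""
    return n_params * sum(BYTES_PER_PARAM.values())


state = training_state_bytes(70e9)
print(f"training state: {state / 1e12:.2f} TB")
print(f"80 GB GPUs needed just to hold it: {state / 80e9:.0f}")
```

Under these assumptions the state alone is more than an order of magnitude larger than one GPU's HBM, before a single activation is stored.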

Collectives — the seven primitives

Broadcast, Reduce, AllReduce, AllGather, ReduceScatter, AllToAll. The identity AllReduce = ReduceScatter + AllGather and the ring derivation that makes its cost independent of N for large tensors.
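The identity AllReduce = ReduceScatter + AllGather can be checked with a toy simulation. This is a minimal sketch in plain Python, not NCCL: ranks are list entries, each rank's vector has one element per rank, and a "send" is a list write. The function names are hypothetical.

```python
def ring_allreduce(data):
    """Sum-AllReduce over a simulated ring, as ReduceScatter followed by AllGather.

    data[r] is rank r's vector; it has one element ("chunk") per rank.
    """
    n = len(data)
    bufs = [list(row) for row in data]

    # ReduceScatter: after n-1 steps, rank r holds the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        # Snapshot all sends first so every rank sends "simultaneously".
        sends = [((r - step) % n, bufs[r][(r - step) % n]) for r in range(n)]
        for r, (chunk, value) in enumerate(sends):
            bufs[(r + 1) % n][chunk] += value

    # AllGather: pass the finished chunks around the ring, overwriting stale ones.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, bufs[r][(r + 1 - step) % n]) for r in range(n)]
        for r, (chunk, value) in enumerate(sends):
            bufs[(r + 1) % n][chunk] = value
    return bufs


def bytes_sent_per_rank(size_bytes, n):
    """Each rank sends 2*(n-1) chunks of size_bytes/n, which tends to 2*size_bytes."""
    return 2 * (n - 1) / n * size_bytes


ranks = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
print(ring_allreduce(ranks))  # every rank ends with [111, 222, 333]
```

The per-rank traffic depends on N only through the factor (N−1)/N, which is why the cost flattens out for large tensors.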

Interconnect — NVLink, NVSwitch, IB

The bandwidth pyramid. NVLink is ~18× faster than IB per pair. This single number determines whether tensor parallel stays intra-node and whether pipeline parallel can stretch across racks.

Part II · Training parallelism (lessons 04–09 · six strategies, six places to shard)

Each strategy shards something different: the data, the optimizer state, the weights themselves, the layers, the sequence, or the experts. Read in order and you can derive any modern training stack (Megatron-LM, DeepSpeed, NeMo, FSDP) as a composition of these six.

                replicate                           shard
                weights                              optimizer
        ┌── DDP ───────────────────── FSDP / ZeRO ──┐ memory
        │  (lesson 04)                  (lesson 05) │
        │                                            │
        │ shard within                shard across   │
        │   layers                       layers      │
        ├── TP ──────────────────────── PP ──────────┤ compute
        │  (lesson 06)                  (lesson 07)  │
        │                                            │
        │ shard the                   shard the      │
        │ sequence                    experts        │
        └── SP / CP ────────────────── EP ───────────┘ specialty
           (lesson 08)                 (lesson 09)

Data parallel (DDP) — the simplest fan-out

Replicate weights, split the batch, AllReduce the gradients. Why bucketing matters. The "compute / communication overlap" trick that hides the AllReduce behind backward. Why DDP runs out at the memory wall.

FSDP / ZeRO — sharding what DDP replicates

ZeRO-1 (optimizer), ZeRO-2 (gradients), ZeRO-3 (parameters). The memory ↔ communication tradeoff exactly. HSDP — the production compromise that shards intra-node, replicates inter-node.
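The memory side of that tradeoff is a three-line calculation. The sketch assumes the same 16-byte budget as mixed-precision Adam (2 bytes of weights, 2 of gradients, 12 of optimizer state); the split and the function name are assumptions for illustration.

```python
def bytes_per_param_per_rank(stage: int, n_ranks: int) -> float:
    """Per-rank bytes per parameter under ZeRO stage 0 (plain DDP) through 3."""
    weights, grads, optim = 2.0, 2.0, 12.0  # assumed mixed-precision Adam budget
    if stage >= 1:
        optim /= n_ranks    # ZeRO-1 shards optimizer state
    if stage >= 2:
        grads /= n_ranks    # ZeRO-2 also shards gradients
    if stage >= 3:
        weights /= n_ranks  # ZeRO-3 also shards parameters
    return weights + grads + optim


for stage in range(4):
    gb = 70e9 * bytes_per_param_per_rank(stage, n_ranks=64) / 1e9
    print(f"ZeRO-{stage}: {gb:8.1f} GB per rank for 70B params on 64 ranks")
```

Only stage 3 scales all 16 bytes by 1/N, and it is also the stage that has to gather parameters before every forward and backward.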

Tensor parallel — Megatron's column/row trick

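When a single layer is bigger than one GPU. Column-parallel A, row-parallel B, one AllReduce per block in the forward pass and one more in the backward. With two blocks per layer (attention and MLP), those are the four AllReduces per layer that pin TP to one node.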
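The column/row split can be verified on tiny matrices. The sketch below uses plain Python lists, loops over "ranks" sequentially, and uses ReLU as a stand-in for the real activation; all three are simplifications chosen here, not Megatron's implementation.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def tp_mlp(x, a, b, tp):
    """Y = relu(X A) B with A split by columns and B split by rows over `tp` ranks."""
    shard = len(b) // tp  # hidden width per rank; len(b) must be divisible by tp
    partials = []
    for r in range(tp):
        cols = slice(r * shard, (r + 1) * shard)
        a_r = [row[cols] for row in a]  # column shard of A
        b_r = b[cols]                   # matching row shard of B
        # Elementwise activation commutes with the column split: no communication yet.
        partials.append(matmul(relu(matmul(x, a_r)), b_r))
    # The single AllReduce of the block: sum the partial outputs across ranks.
    return [[sum(p[i][j] for p in partials) for j in range(len(b[0]))]
            for i in range(len(x))]


x = [[1.0, -2.0], [3.0, 0.5]]
a = [[1.0, -1.0, 2.0, 0.0], [0.5, 1.0, -1.0, 2.0]]
b = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [-1.0, 0.5]]
print(tp_mlp(x, a, b, tp=2) == matmul(relu(matmul(x, a)), b))
```

Splitting A by columns first is what lets the activation run locally; splitting it by rows would force a communication step before the nonlinearity.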

Pipeline parallel — and the bubble

Shard layers, not weights within a layer. The bubble fraction (N−1)/(M+N−1), for N pipeline stages and M micro-batches, and why 1F1B doesn't shrink it (it shrinks memory instead). Interleaved scheduling.
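The bubble formula is easy to tabulate. A small sketch, taking N as the number of pipeline stages and M as the number of micro-batches per step (the function name is hypothetical):

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle fraction of a pipeline step: (N - 1) / (M + N - 1)."""
    return (stages - 1) / (microbatches + stages - 1)


for m in (4, 8, 32, 128):
    print(f"N=4 stages, M={m:3d} micro-batches: bubble = {bubble_fraction(4, m):.1%}")
```

The only lever in this expression is M relative to N, which is why the fix for a large bubble is more micro-batches rather than a different non-interleaved schedule.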

Sequence and context parallel — long context

SP shards the activations TP can't reach. CP shards the attention itself — Ring Attention rotates K/V around the ring and reuses the FlashAttention online-softmax accumulator. The trick that makes 1M-token training viable.
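The online-softmax accumulator is what lets attention consume K/V one block at a time, in whatever order the blocks arrive. A minimal single-query sketch in plain Python follows; it omits the 1/√d scaling, masking, and the actual ring communication, and the function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend_full(q, keys, values):
    """Reference: softmax(q . K^T) V computed in one shot."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def attend_blockwise(q, kv_blocks):
    """Same result, visiting one (keys, values) block at a time.

    Running state: max score m, normaliser l, unnormalised output acc.
    """
    m, l = float("-inf"), 0.0
    acc = [0.0] * len(kv_blocks[0][1][0])
    for keys, values in kv_blocks:
        scores = [dot(q, k) for k in keys]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)  # rescales old state; exp(-inf) = 0 on the first block
        probs = [math.exp(s - m_new) for s in scores]
        l = l * scale + sum(probs)
        acc = [a * scale + sum(p * v[j] for p, v in zip(probs, values))
               for j, a in enumerate(acc)]
        m = m_new
    return [a / l for a in acc]
```

In a ring setting each rank would run the loop body once per K/V block it receives from its neighbour; because the accumulator is order-independent, the rotation order does not change the result.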

Expert parallel — MoE and AllToAll

An MoE layer routes each token to k of E experts. Sharding experts means tokens have to travel: two AllToAlls forward, two backward. Why load balancing is a loss term, not a config flag.

Part III · Training memory & batch (lessons 10–11 · the other two HBM consumers, and the batch math)

Parts I–II shard the parameters, gradients, and optimizer state across ranks. But two more quantities grow with a real run, the activations and the effective batch, and neither is fixed by sharding weights. These two lessons close lesson 01's memory accounting and set up the batch math before we compose everything in Part IV. That is why they sit here, ahead of the kernel stack: lessons 01, 05, and 12 already lean on activation memory.

Activation checkpointing & recomputation

The fourth HBM consumer, activations, is the one FSDP can't touch, and the only one that grows with batch and sequence. Throw them away on forward, recompute on backward: a flat ~33% compute tax (one extra forward pass) collapses activation memory from O(L) to O(√L) in the layer count L. Selective recompute makes it nearly free.
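Both numbers fall out of a simple counting model. The sketch assumes uniform layers, a backward pass that costs twice the forward, and checkpoints placed every `segment` layers; it counts stored layer-activations rather than bytes, and the function name is hypothetical.

```python
import math


def peak_stored_activations(n_layers: int, segment: int) -> int:
    """Layer-activations held at once: one checkpoint per segment,
    plus one segment rematerialised during backward."""
    return math.ceil(n_layers / segment) + segment


# Full recompute adds one forward (1 unit) to forward + backward (1 + 2 units).
RECOMPUTE_TAX = 1 / 3

L = 80
best = min(range(1, L + 1), key=lambda k: peak_stored_activations(L, k))
print(f"no checkpointing: {L} layer-activations stored")
print(f"segment={best}: {peak_stored_activations(L, best)} stored, "
      f"vs 2*sqrt(L) = {2 * math.sqrt(L):.1f}")
```

The minimum sits near a segment length of √L, where the checkpoint count and the segment length balance.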

Gradient accumulation & the effective batch

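The optimizer wants B_eff; the GPU fits a micro-batch. Accumulation G is the slack between them: B_eff = b·N·G, with b the per-rank micro-batch and N the number of data-parallel ranks. Skipping the gradient sync on all but the last micro-step with no_sync cuts DDP communication by G× for free. The critical-batch knee and why LR scales with the batch.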
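Solving B_eff = b·N·G for G is the usual first step when sizing a run. A small sketch, with b the per-rank micro-batch and N the number of data-parallel ranks; the example numbers are made up.

```python
def accumulation_steps(b_eff: int, micro_batch: int, dp_ranks: int) -> int:
    """G such that b_eff = micro_batch * dp_ranks * G."""
    per_micro_step = micro_batch * dp_ranks
    if b_eff % per_micro_step:
        raise ValueError("b_eff must be a multiple of micro_batch * dp_ranks")
    return b_eff // per_micro_step


G = accumulation_steps(b_eff=1024, micro_batch=2, dp_ranks=16)
print(f"G = {G}: one gradient AllReduce per optimizer step instead of {G}")
```

Raising an error on a non-divisible batch is a design choice of this sketch; real configs often round B_eff instead.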

Part IV · Composition, overlap & inference (lessons 12–15 · putting it together, hiding the comms, and serve-time)

3D / nD parallelism — composition rules

Why TP=8 × PP=4 × DP=16 is the canonical layout for ~70B on 512 GPUs. The decision tree: TP first intra-node, PP across nodes, DP/FSDP outermost, SP/CP for long-context, EP for MoE.
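One way to see "TP first intra-node" is to write the rank-to-coordinate mapping down. The sketch assumes 8 GPUs per node and a layout with TP as the fastest-varying index, then PP, then DP; real frameworks expose their own ordering options, so treat this mapping as illustrative.

```python
GPUS_PER_NODE = 8  # assumed node size


def rank_to_coords(rank: int, tp: int = 8, pp: int = 4, dp: int = 16):
    """Map a global rank to (dp_rank, pp_rank, tp_rank), TP innermost."""
    assert 0 <= rank < tp * pp * dp
    tp_rank = rank % tp
    pp_rank = (rank // tp) % pp
    dp_rank = rank // (tp * pp)
    return dp_rank, pp_rank, tp_rank


print(8 * 4 * 16)            # world size: 512
print(rank_to_coords(0))     # (0, 0, 0)
print(rank_to_coords(8))     # (0, 1, 0): next pipeline stage, next node
print(rank_to_coords(511))   # (15, 3, 7)
```

With TP innermost and TP equal to the node size, every TP group occupies exactly one node, so its AllReduces never leave NVLink.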

Communication–computation overlap

Every collective is dead time unless it hides behind compute. The single principle behind DDP bucketing, FSDP prefetch, SP/TP fusion, and the pipeline schedule — CUDA streams, the prefetch-depth knob, and why some collectives (the TP AllReduce on the critical path) are far harder to hide than others. The lever that turns the comm costs of Parts II–III from wall-clock into (mostly) zero.

Inference — TP for latency, replicas for throughput

Decode is memory-bound. A TP=8 replica reads weights 8× faster than a TP=1 replica — but its per-GPU throughput doesn't improve. The Amdahl-flavoured math that picks the right shape — plus the other three levers on single-replica decode speed: the KV cache (and the memory wall it sets), quantization, and speculative decoding.
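The "8× faster, same per-GPU throughput" claim follows from a bandwidth bound. The sketch assumes batch size 1, bf16 weights, that every decode step reads all weights once, and it ignores KV-cache reads and TP communication. The TP=1 row is hypothetical, since 140 GB of weights would not fit in 80 GB of HBM.

```python
HBM_BYTES_PER_S = 3.35e12  # per-GPU HBM bandwidth, the series' own round number


def decode_steps_per_s(n_params: float, bytes_per_param: int, tp: int) -> float:
    """Upper bound on decode steps/s for one replica: aggregate HBM bandwidth
    across the TP group divided by the bytes of weights read per step."""
    weight_bytes = n_params * bytes_per_param
    return tp * HBM_BYTES_PER_S / weight_bytes


for tp in (1, 8):
    steps = decode_steps_per_s(70e9, 2, tp)
    print(f"TP={tp}: {steps:6.1f} steps/s per replica, {steps / tp:5.1f} per GPU")
```

Latency improves with TP because each GPU reads one eighth of the weights; throughput per GPU is flat because eight GPUs are now serving that one stream.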

Disaggregated prefill / decode

Prefill is compute-bound; decode is memory-bound. Same model, opposite bottlenecks. Separate pools, transfer the KV cache between them, and the per-pool utilization climbs at the cost of one cross-rack copy.

Part V · The single-GPU stack (lessons 16–22 · framework, kernels, compilers)

Parts I–IV sit at the cluster level: ranks, collectives, fabric. Below all that is one GPU running one forward pass, and its throughput is set by a stack of layers, most of which you never see. The Python you write goes through the PyTorch dispatcher, which records an autograd graph and issues a stream of kernel launches; those kernels are matmuls and elementwise ops, which read and write HBM. Each layer is a place where performance leaks or is reclaimed.

  user code  ────▶  PyTorch dispatcher  ────▶  autograd graph
  (lesson 16)         (lesson 16)                (lesson 16)
                                                       │
                              ┌────────────────────────┘
                              ▼
                       autocast / precision  ─────▶  caching allocator
                          (lesson 17)                  (lesson 18)
                              │                            │
                              ▼                            │
                       kernel launch  ◀──────  CUDA stream │
                          (lesson 19)                      │
                              │                            │
                              ▼                            │
                  ┌── handwritten CUDA / cuBLAS / cuDNN ───┤
                  ├── Triton  (lesson 20)                  │
                  └── Inductor codegen  (lesson 21)        │
                                                           │
                            CUDA Graphs / TensorRT  (lesson 22)
                                       │
                                       ▼
                                   HBM

PyTorch internals — dispatcher & autograd

What "calling torch.matmul" actually does. The dispatcher's device/dtype dispatch, the autograd graph that quietly forms behind your back, and the per-op Python tax that torch.compile eventually fixes.

Mixed precision — bf16, fp16, fp8

Why bf16 won. The fp16 loss-scaler trick and why bf16 deleted it. fp8's two formats (E4M3 / E5M2) and why the master weights are still fp32. The autocast mechanic in one diagram.

CUDA memory & the caching allocator

Why cudaMalloc is too slow per-op. PyTorch's caching allocator, fragmentation in real training, and expandable_segments as the modern escape hatch. Streams and the per-stream pool.

Custom kernels & fusion

Every kernel launch reads inputs from HBM and writes outputs to HBM. Fusion is the art of doing two kernels' worth of work on one HBM round-trip. FlashAttention as the case study; the elementwise epilogue as the warm-up.

Triton — the ML kernel DSL

A "program" is a block of threads; pointers are blocks of addresses. The SRAM ↔ HBM dance you have to choreograph yourself. When Triton wins over torch.compile, and when it doesn't.

torch.compile — Dynamo + AOT Autograd + Inductor

Dynamo turns Python into a graph (or several, with graph breaks). AOT Autograd attaches a backward. Inductor lowers to Triton kernels. The three failure modes — graph breaks, recompilation, fallbacks — and what each costs.

CUDA Graphs & TensorRT — serve-time graph capture

Decoding a token pays ~10 μs of Python overhead per kernel launch, so capture the graph once and replay it forever. The shape-dependence trap. TensorRT-LLM and what it does beyond torch.compile (pre-tuned kernels, layer-by-layer plan files, persistent kernels).

Part VI · Running the job (lessons 23–26 · measure it, keep it stable, feed it, survive it)

Parts I–V give you every mechanism and how to make one GPU fast. But a real training run is a three-week, multi-thousand-GPU job: you have to see the bottleneck you were taught to predict, keep the loss from diverging, stay fed with data, and survive the hardware dying under it. These four lessons are the cross-cutting concerns every large run actually spends its engineering on.

Profiling, MFU & finding the bottleneck

The whole series taught you to predict where the bottleneck lives; this is how you measure it. The PyTorch profiler / Nsight trace, reading a kernel timeline, attributing a slow step to compute vs collective vs data-load — and MFU, the single number that scores every optimization in Parts II–V.
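MFU itself is one division. The sketch uses the common approximation of 6 FLOPs per parameter per trained token (2 forward, 4 backward), which ignores attention FLOPs; the throughput figure in the example is invented for illustration.

```python
def mfu(n_params: float, tokens_per_s: float, n_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs utilisation: useful model FLOP/s over aggregate peak FLOP/s."""
    model_flops_per_s = 6 * n_params * tokens_per_s  # assumed 6*P FLOPs per token
    return model_flops_per_s / (n_gpus * peak_flops_per_gpu)


# Hypothetical run: 70B params, 512 GPUs at ~1 PFLOP/s bf16 peak, 400k tokens/s.
print(f"MFU = {mfu(70e9, 400_000, 512, 1e15):.1%}")
```

Because recomputed forwards are not counted as model FLOPs, turning on activation checkpointing lowers MFU even when the hardware is equally busy.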

Numerical stability & convergence

A 70B run lives or dies on not diverging. Global gradient-norm clipping (itself a collective across FSDP/TP shards), loss-spike detection and rollback, LR warmup + the WSD/cosine schedule, and the z-loss / logit-softcap tricks that keep MoE and large-vocab training stable.
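Why clipping is a collective: the norm has to be taken over all shards before any rank can scale its own. A minimal sketch in plain Python, where the outer list stands in for ranks holding disjoint gradient shards (as under FSDP). Parameters replicated across ranks would need to be counted once, which this sketch does not handle.

```python
import math


def clip_global_norm(shards, max_norm: float, eps: float = 1e-6):
    """Scale every shard by the same factor so the global L2 norm is <= max_norm."""
    local_sq = [sum(g * g for g in shard) for shard in shards]  # computed per rank
    global_norm = math.sqrt(sum(local_sq))  # this sum is the AllReduce
    scale = min(1.0, max_norm / (global_norm + eps))
    clipped = [[g * scale for g in shard] for shard in shards]
    return clipped, global_norm


clipped, norm = clip_global_norm([[3.0], [4.0]], max_norm=1.0)
print(norm)  # 5.0
```

Clipping each shard against its own local norm would apply different scale factors on different ranks and change the gradient direction.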

Feeding the GPUs — the distributed data pipeline

Two silent failure modes: partition wrong (DistributedSampler — a seeded shared shuffle, strided slice) and starve the GPU (prefetch workers until T_l/W ≤ T_c). Both still train, so both cost money without an error. Then the operational half: packing + varlen attention, resumable loader state, the storage tier, and shuffle quality at web scale.
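Both failure modes have a short correct version. The sketch mirrors the idea behind DistributedSampler (a shuffle every rank seeds identically, then a strided slice) without its padding or drop-last handling, and sizes the worker count from T_l/W ≤ T_c. Function names are hypothetical.

```python
import math
import random


def shard_indices(n_samples: int, rank: int, world_size: int, seed: int, epoch: int):
    """Every rank builds the same permutation, then takes its own stride."""
    indices = list(range(n_samples))
    random.Random(seed + epoch).shuffle(indices)  # identical on all ranks
    return indices[rank::world_size]


def min_workers(t_load: float, t_compute: float) -> int:
    """Smallest W with t_load / W <= t_compute."""
    return math.ceil(t_load / t_compute)


shards = [shard_indices(12, r, world_size=4, seed=0, epoch=1) for r in range(4)]
print(shards)
print(min_workers(t_load=0.9, t_compute=0.25))
```

If each rank seeded its shuffle differently, the strided slices would overlap and some samples would be seen twice per epoch while others were never seen, with no error raised.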

Checkpointing & fault tolerance

M_job = M_1/N makes failure routine at scale: the job's mean time between failures is one component's divided by the component count. Save all state (Adam moments dominate), and set the cadence by Young–Daly: τ* = √(2CM), with C the cost of one checkpoint and M = M_job. Shrink the checkpoint cost C with sharded + async + redundant writes; recover with resharding + elastic restart. Then the recovery half: detecting hangs (NCCL timeouts, flight recorder), checkpoint durability, NaN-as-a-fault, and goodput.
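The two formulas chain together directly. In this sketch M_1 is one component's mean time between failures, N the component count, and C the time to write one checkpoint; the numbers in the example are invented, not measurements.

```python
import math


def job_mtbf(m_1: float, n: int) -> float:
    """M_job = M_1 / N, assuming independent failures."""
    return m_1 / n


def young_daly_interval(c: float, m_job: float) -> float:
    """Checkpoint interval tau* = sqrt(2 * C * M)."""
    return math.sqrt(2 * c * m_job)


# Hypothetical: 50,000 h per-GPU MTBF, 4096 GPUs, 5-minute checkpoint write.
m_job_s = job_mtbf(50_000 * 3600, 4096)
tau_s = young_daly_interval(5 * 60, m_job_s)
print(f"job MTBF = {m_job_s / 3600:.1f} h, checkpoint every {tau_s / 60:.0f} min")
```

Since τ* grows only with the square root of C, halving the checkpoint cost shortens the optimal interval by about 30% rather than by half.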

Beyond this track · CUDA kernels & serving engines

The CUDA-primitives track that once lived here — GPU execution model, memory hierarchy, vector add, coalesced access, tiled matmul, warps & divergence, reductions, occupancy, tensor cores — is now the foundations half of the GPU Kernels for ML Engineers track, where it sits directly under the serving lessons that depend on it (FlashAttention, paged KV, prefix reuse, scheduling, framework).

If you came here for CUDA basics

Open gpu_kernels — lessons 01 through 09 are the CUDA primitives (execution model → memory hierarchy → vector add → coalescing → tiled matmul → warps → reductions → occupancy → tensor cores). Lessons 10 through 17 then use those primitives to build vLLM- and SGLang-style serving.

How to read this

Linearly. Lesson n assumes the math of n−1. The collective derivation in 02 is reused in 04, 05, 06, and 09; the bandwidth pyramid from 03 explains every "why intra-node?" sentence in Part II.
Touch every widget. Each one has a regime where the strategy fails. Find it. The failure is the lesson — "what breaks DDP" tells you why FSDP exists; "what blows up TP" tells you why TP stays on one node.
Do the back-of-envelope. The numbers in this series are not decorative. Memorize four: NVLink ~900 GB/s peak (≈150 GB/s effective on a flat ring), IB ~50 GB/s/NIC, HBM ~3.35 TB/s, H100 peak ~1 PFLOP/s bf16. Most distributed-training arguments are settled by comparing two of these.
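As an example of comparing two of these numbers, dividing peak compute by HBM bandwidth gives the arithmetic intensity at which a kernel stops being memory-bound. The sketch uses the round figures above; the batch-1 decode intensity of about 1 FLOP per byte (2 FLOPs per 2-byte bf16 weight) is an assumption that ignores the KV cache.

```python
PEAK_FLOPS = 1e15   # H100 bf16 peak, FLOP/s (round number)
HBM_BW = 3.35e12    # bytes/s
NVLINK_BW = 900e9   # bytes/s, peak
IB_BW = 50e9        # bytes/s per NIC


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """FLOPs per byte above which compute, not bandwidth, is the limit."""
    return peak_flops / bandwidth


def is_compute_bound(flops_per_byte: float) -> bool:
    return flops_per_byte >= ridge_point(PEAK_FLOPS, HBM_BW)


print(f"HBM ridge point: {ridge_point(PEAK_FLOPS, HBM_BW):.0f} FLOPs/byte")
print(f"NVLink / IB: {NVLINK_BW / IB_BW:.0f}x")
print(f"batch-1 decode compute-bound? {is_compute_bound(1.0)}")
```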

Companion material

Interview-prep material lives in interview_questions/traditional_ml/. The inference-serving side has its own lesson series in vllm/lessons/. RL training-system architecture (which composes all of these) is in reinforcement_learning/lessons/.