system_ml / 01 · why distributed? lesson 1 / 19

Why distributed at all?

Three failure modes — memory, compute, throughput — each with its own response. Most of the complexity in this series exists to address one of them.

The honest framing

"Distributed" is not an architectural goal; it is a tax you pay because something doesn't fit. Whenever a teammate says "let's go distributed", three questions should fire before any code is touched:

  1. Does the training state fit in one GPU's HBM? Weights + gradients + optimizer state + activations. If no — you have a memory wall, and your response is to shard those things (FSDP / ZeRO, lesson 05).
  2. If it fits, can one GPU finish training in tolerable wall-clock time? If no — you have a throughput wall, and your response is to replicate the model and split the data across workers (DDP / FSDP, lesson 04).
  3. For inference: can one replica meet the QPS / latency target? If no — you have a serving wall, and your response is replication plus, sometimes, intra-replica sharding (lesson 11).

A model that fits, trains in reasonable time, and serves at moderate QPS doesn't need any of this. Distributed engineering is genuinely expensive — NCCL deadlocks, opaque hangs, version skew, straggling ranks, silent data corruption at scale. Pay only for one of the three reasons above.

Walking the first wall — the bytes per parameter

The memory wall is the cleanest one to reason about, because we can count bytes. For a parameter θ stored in bf16 (2 bytes), under Adam, in the standard mixed-precision recipe:

bytes_per_param  =  2  (bf16 weight)  +  2  (bf16 grad)  +  4  (fp32 master weight)  +  8  (fp32 Adam m + v)  =  16

Sixteen bytes is the headline number. For a 70B-parameter model that's 1.12 TB of training state — before we count activations, before we count the KV-cache-like intermediates that backward needs. On an 80 GB H100, that's 14× too big to fit anywhere on the device.

Where the 16 comes from, exactly
Mixed-precision training (Micikevicius 2017) keeps a master copy of every parameter in fp32 because Adam's update is numerically unstable in bf16. The forward and backward use bf16 copies for speed, but the optimizer step happens in fp32. So you pay for both: 2 bytes of bf16 weight + 4 bytes of fp32 master + 8 bytes of Adam state per param. Without the master copy you'd lose the small updates (a 1e-4 LR update to a 1.0 bf16 weight rounds to zero). Trade that off and you save 4 bytes per param at the cost of training stability.

Activations are the other axis, and they scale with sequence length and batch, not just with parameters. For a transformer with hidden size d, sequence length T, batch B, and L layers, the activations checkpoint-able for backward are roughly:

activation_bytes  ≈  B · T · L · d · 2 · k

where k is some small multiplier (typically 12–20 for a vanilla layer; activation checkpointing cuts it dramatically at the cost of recomputing forward during backward). For Llama-70B-ish numbers (d=8192, L=80) and a single sequence at T=8192: roughly ~130–200 GB per sequence without checkpointing, dropping to ~20 GB with aggressive activation checkpointing — on top of the 1.1 TB of state. The point is not the exact number but its shape: activations grow with batch, state grows with model size, and they grow independently.

Animated · the 16-byte stack vs the HBM ceiling

The bytes-per-param stack literally stacks: bf16 weight + bf16 grad + fp32 master + Adam m + Adam v. Multiply by parameter count and the column grows. Slide it past the H100 ceiling, then the H200, then the B200. Watch the column turn red the moment it overshoots — that's the memory wall as a height. The animation phase shows each byte-category appearing in sequence so you can see which 4 bytes you'd save with Adafactor, or with mixed-precision tricks.

Bytes-per-param column · grows with model size, capped by HBM
Each colored slab is one byte-category, scaled by P. Horizontal dashed lines are HBM ceilings for current GPUs. ▶ to animate the stack assembling; the column turns red when it pierces your selected ceiling.
column height
vs ceiling
fits on 1 GPU?
min GPUs needed

Walking the second wall — the time per step

Even if the model fit in one GPU's HBM, training a 70B model on 15 trillion tokens at single-GPU speed is infeasible. The arithmetic is brutal but useful to keep in your head:

total_FLOPs  ≈  6 · params · tokens

The factor of 6 comes from the standard scaling-laws accounting (Kaplan et al. 2020; also used in Chinchilla / Hoffmann et al. 2022): 2 FLOPs per parameter per token for the forward pass, plus 4 for the backward pass — split as 2 for the input-gradient pass and 2 for the weight-gradient pass. For 70B × 15T tokens, that's 6.3 × 10²⁴ FLOPs. An H100 SXM does about 1 × 10¹⁵ bf16 FLOPs/s peak; in practice you achieve ~40% of that (MFU, "model FLOP utilization") because of bandwidth-bound layers and overhead. One GPU at 0.4 PFLOPS:

time  =  6.3 × 10²⁴ / (4 × 10¹⁴)  =  1.6 × 10¹⁰ seconds  ≈  500 years

500 years. Even if memory were free, time isn't. The whole point of data parallel (lesson 04) is to drop this number by a factor of N by running N forward passes simultaneously on different data — the elapsed time falls to ~6 months at 1024 GPUs. The cost is one AllReduce of all gradients per step. We will spend lesson 02 making sure that AllReduce is essentially free.

2D · three walls in one picture

The three walls — memory, compute, throughput — are three different scaling lines that each get hit at different model sizes. Pick a model size with the slider; the three bars on the left show how full each budget is. The animated "ball" on each track shows the linear scaling. When a bar fills past 100%, that wall is hit and turns red.

Three walls, three budgets · scroll model size and watch them fill
Memory wall: training-state bytes vs one GPU's HBM. Compute wall: per-step FLOPs vs one GPU's per-second budget at a target step time. Throughput wall: total training FLOPs vs a deadline. Each bar fills linearly; whichever fills first is the binding constraint.
memory wall
compute wall (step)
throughput wall (deadline)
first wall hit

The roofline — the one diagram for "what's bound by what"

Every kernel on a GPU is bound by one of two things: how fast you can stream bytes from HBM, or how fast the tensor cores can do FLOPs on bytes already in SRAM. The roofline plot makes the choice visible:

arithmetic intensity (FLOPs / byte) throughput (FLOPs/s) ridge point ≈ peak / BW memory-bound throughput = BW × intensity compute-bound throughput = peak FLOPs/s decode step (batch=1) ~1 FLOP/byte attention (FlashAttn, long T) training MLP / prefill ~200+ FLOP/byte

Two of the most important distributed-systems consequences fall out of this picture:

The accounting habit

The most important skill in this series is doing back-of-envelope estimates of bytes and FLOPs before writing any code. Practising it on a few benchmark numbers:

QuantitySymbol / formula70B model (B=1, T=8k)
Param state (mixed-prec Adam)16 · params1.12 TB
Forward activations (no checkpointing)~B · T · L · d · 14 bytes~150 GB
Forward activations (checkpointed)~B · T · L · 2 bytes~2.6 GB
One step FLOPs6 · params · B · T~3.4 × 10¹⁵
One step time at 1 PFLOPS @ 40% MFUFLOPs / (0.4 · peak)~8.4 ms (per GPU, fictionally)
Grad AllReduce volume per step~2 · 2 · params280 GB / rank, ring-asymptotic
Grad AllReduce time at 100 GB/s ring BWvol / BW2.8 s if not overlapped

That last line is the punchline: a 2.8s AllReduce against an 8ms step is ~350× too slow. So the AllReduce has to be either small (FSDP cuts it; lesson 05), local (intra-node NVLink; lesson 03), or hidden behind the compute (overlap; lesson 04). All three are real strategies, and all three are tested in the lessons ahead.

3D · the regime cube

Plot a model in (memory, compute, bandwidth) space. Each axis is "how much of that resource does one rank need". Models cluster into regimes: small models live near the origin (any GPU), 7B models press against the memory axis, frontier 405B models pin all three. The dot color tells you which axis is the binding wall. Rotate the cube to read off the depth.

Models in resource space · isometric cube
Each colored dot is a model; axis-aligned dropped lines show its (mem, compute, BW) coordinates. Color = binding axis (blue = memory, orange = compute, green = bandwidth). Click a dot to see its numbers.
selected model
— click a dot —
memory load
compute load
BW load

Interactive · feel the memory wall

Slide the model size, the precision, the optimizer, and the batch around. Watch the bar chart show which buckets fit on which GPU. The dotted line is your chosen GPU's HBM budget. The lesson is in which slider hits the wall first — and that wall is what each subsequent lesson is going to dismantle.

Training-state budget on one GPU
Per-rank memory required to start a single forward+backward. Bars exceed the dashed budget? That's the memory wall. Note: activations here assume no checkpointing; flipping the checkpoint toggle is one of FSDP's main competitor strategies (lesson 05 explains why we usually want both).
total state
vs one GPU
min ranks (FSDP, perfect shard)
first slider that hit it

Where each wall sends you

WallSymptomFirst responseLessons
Memory (state)OOM at step 0FSDP / ZeRO-3 → activation checkpointing → TP05, then 06
Memory (activations)OOM at longer sequenceActivation checkpointing → SP → CP05, 08
Memory (KV cache)OOM serving long contextGQA → KV quantization → paged KVsee vLLM/09, vLLM/02
Time (training)One-step time × steps > deadlineDP → FSDP-HSDP → 3D parallelism04, 05, 10
Latency (inference)TTFT or ITL miss targetTP per replica → speculative decode → PD-disagg11, 12
Throughput (inference)QPS shortfallReplicate → continuous batching → APC11, see vLLM/04
Takeaway
Distributed is a response to a constraint, not a goal. Diagnose which wall you're hitting (memory? time? throughput?) before reaching for a hammer. The lessons in Part II are each a different hammer; reading them in order is reading the hammers from cheapest to most expensive.