Why distributed at all?

Three failure modes — memory, compute, throughput — each with its own response. Most of the complexity in this series exists to address one of them.

The honest framing

"Distributed" is not an architectural goal; it is a tax you pay because something doesn't fit. Whenever a teammate says "let's go distributed", three questions should fire before any code is touched:

Does the training state fit in one GPU's HBM? Weights + gradients + optimizer state + activations. If no — you have a memory wall, and your response is to shard those things (FSDP / ZeRO, lesson 05).
If it fits, can one GPU finish training in tolerable wall-clock time? If no — you have a throughput wall, and your response is to replicate the model and split the data across workers (DDP / FSDP, lesson 04).
For inference: can one replica meet the QPS / latency target? If no — you have a serving wall, and your response is replication plus, sometimes, intra-replica sharding (lesson 14).

A model that fits, trains in reasonable time, and serves at moderate QPS doesn't need any of this. Distributed engineering is genuinely expensive — NCCL deadlocks, opaque hangs, version skew, straggling ranks, silent data corruption at scale. Pay only for one of the three reasons above.

Walking the first wall — the bytes per parameter

The memory wall is the cleanest one to reason about, because we can count bytes. For a parameter θ stored in bf16 (2 bytes), under Adam, in the standard mixed-precision recipe:

bytes_per_param = 2 (bf16 weight) + 2 (bf16 grad) + 4 (fp32 master weight) + 8 (fp32 Adam m + v) = 16

Sixteen bytes is the headline number. For a 70B-parameter model that's 1.12 TB of training state — before we count activations, before we count the KV-cache-like intermediates that backward needs. On an 80 GB H100, that's 14× too big to fit anywhere on the device.
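The byte count above can be reproduced with a small sketch. This is a back-of-envelope helper, not a profiler: the per-category byte sizes are the ones from the formula, while the function names and the 80 GB ceiling argument are illustrative assumptions.

```python
# Back-of-envelope training-state size for mixed-precision Adam.
# Byte counts come from the formula above; all names are illustrative.
import math

BYTES_PER_PARAM = {
    "bf16_weight": 2,
    "bf16_grad": 2,
    "fp32_master_weight": 4,
    "fp32_adam_m": 4,
    "fp32_adam_v": 4,
}

def training_state_bytes(n_params: float) -> float:
    """Bytes of weights + grads + optimizer state (activations excluded)."""
    return n_params * sum(BYTES_PER_PARAM.values())

def min_gpus_for_state(n_params: float, hbm_bytes: float) -> int:
    """GPUs needed if the state sharded perfectly (an idealized lower bound)."""
    return math.ceil(training_state_bytes(n_params) / hbm_bytes)

state = training_state_bytes(70e9)
print(f"{state / 1e12:.2f} TB")          # 1.12 TB
print(min_gpus_for_state(70e9, 80e9))    # 14
```

Dropping an entry from the dictionary (for example the fp32 master weight) shows directly how many bytes per parameter a given trick would save.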

Where the 16 comes from, exactly

Mixed-precision training (Micikevicius et al. 2017) keeps a master copy of every parameter in fp32 because small weight updates applied directly in 16-bit precision get rounded away. The forward and backward passes use bf16 copies for speed, but the optimizer step happens in fp32. So you pay for both: 2 bytes of bf16 weight + 4 bytes of fp32 master + 8 bytes of Adam state per param, plus the 2-byte bf16 gradient, for 16 in total. Without the master copy you'd lose the small updates: bf16 keeps only 8 significant bits, so the spacing between representable values just above 1.0 is 2⁻⁷ ≈ 0.0078, and a 1e-4 update to a 1.0 weight rounds to zero. Drop the master copy and you save 4 bytes per param at the cost of training stability.
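The rounding claim can be checked without a GPU. The sketch below emulates bf16 by rounding the fp32 bit pattern to its top 16 bits (round-to-nearest-even). The helper `to_bf16` is a hypothetical stand-in written for this example, not a library call, and a Python float (fp64) stands in for the fp32 master weight.

```python
import struct

def to_bf16(x: float) -> float:
    """Round a float to the nearest bf16 value (round-to-nearest-even).

    bf16 is the top 16 bits of an IEEE fp32, so we round the fp32 bit
    pattern to 16 bits and zero the rest. NaN/inf handling is omitted.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)   # ties go to even
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

update = 1e-4
w_bf16 = 1.0     # weight kept only in bf16
master = 1.0     # higher-precision master copy (fp64 here, fp32 in practice)

for _ in range(100):
    w_bf16 = to_bf16(w_bf16 + update)   # each step rounds back to 1.0
    master += update                    # updates accumulate

print(w_bf16)            # 1.0: all 100 updates were lost
print(to_bf16(master))   # 1.0078125: the bf16 copy finally moves one step
```

The bf16-only weight never moves because each individual update is below half the bf16 spacing at 1.0, while the master copy accumulates the updates until they are large enough to show up in the bf16 copy.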

Activations are the other axis, and they scale with sequence length and batch, not just with parameters. For a transformer with hidden size d, sequence length T, batch B, and L layers, the activations saved for the backward pass total roughly:

activation_bytes ≈ B · T · L · d · 2 · k

where k is not a fudge factor but a count: how many B·T·d-sized tensors each layer stashes on the forward pass because the backward pass needs them to form gradients. Walking one transformer block, the stashed tensors are roughly: for attention, the Q, K, V projections plus the attention output that feeds the output projection; for the MLP, the up-projection output and the input to the GELU/SiLU nonlinearity (its derivative needs the pre-activation); two LayerNorm/RMSNorm inputs (one before attention, one before the MLP); and the residual-stream tensor carried into each add. Tally those and you land at roughly a dozen to twenty B·T·d tensors per layer. That is the k ≈ 12–20: a count of saved activations rather than a tuned constant. Activation checkpointing cuts it dramatically by throwing most of them away and recomputing the forward pass during backward.

For Llama-70B-ish numbers (d=8192, L=80) and a single sequence at T=8192, that is roughly 130–200 GB per sequence without checkpointing. Checkpointing trades memory for recompute, and how much it saves depends on the granularity. Full recompute stores only each layer's block input and recomputes the rest, which leaves one bf16 B·T·d tensor per layer: B·T·L·d·2 bytes ≈ 11 GB. Selective recompute, which keeps a few cheap-to-store tensors, lands closer to ~20 GB (full treatment in lesson 10). All of this sits on top of the 1.1 TB of state. The point is not the exact number but its shape: activations grow with batch and sequence length, state grows with model size, and the two grow independently.
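The no-checkpointing estimate is easy to reproduce. The sketch below is a minimal version of the formula, assuming bf16 activations (2 bytes per element); the function and argument names are illustrative, not taken from any library.

```python
def activation_bytes(batch, seq_len, n_layers, hidden, k, bytes_per_elem=2):
    """B * T * L * d * bytes_per_elem * k, the estimate from the text."""
    return batch * seq_len * n_layers * hidden * bytes_per_elem * k

# Llama-70B-ish shape, one sequence of 8192 tokens.
cfg = dict(batch=1, seq_len=8192, n_layers=80, hidden=8192)

for k in (12, 14, 20):
    gb = activation_bytes(**cfg, k=k) / 1e9
    print(f"k={k}: {gb:.0f} GB")
# k=12: 129 GB
# k=14: 150 GB
# k=20: 215 GB
```

Doubling `batch` or `seq_len` doubles every figure, while changing the parameter count alone does not appear in the formula at all.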

The fourth HBM consumer: activations

The 16-byte stack above names three consumers — weights, gradients, optimizer state — all of which scale with parameter count. There is a fourth, and it behaves differently. Every forward op stashes its output so the backward pass can use it to compute gradients; that stashed tensor is an activation, and the pile of them is the fourth thing competing for HBM.

WHY IT IS DIFFERENT

Three properties make activations the awkward consumer. (a) They are the only consumer that scales with batch × sequence, not with params: double the batch or the context and this term doubles while the other three sit still. (b) FSDP / ZeRO shard params, grads, and optimizer state across ranks, but they do not shard activations; each rank still holds the full activations for the tokens it processes. (c) The rough size is ≈ batch · seq · n_layers · hidden · k · bytes, with k a small per-layer count (the B·T·L·d·2·k figure above, k ≈ 12–20), plus an attention term that grows with sequence length, detailed in lesson 08.

Because sharding weights does nothing for it, activation memory needs its own lever: throw the stashed tensors away on forward and recompute them on backward. That √L recomputation trade is lesson 10. Keep this consumer in mind — lessons 05, 07, 08, and 12 all lean on it.
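Point (b) can be made concrete with a deliberately idealized sketch: the parameter state divides evenly across ranks, the activations do not. It ignores the temporary unsharded parameters that sharded training gathers during compute, communication buffers, and fragmentation. The function name is hypothetical, and the ~150 GB activation figure is simply a value inside the 130–200 GB range estimated earlier.

```python
def per_rank_bytes(n_params, n_ranks, act_bytes_per_rank, bytes_per_param=16):
    """Idealized per-rank footprint: state shards evenly, activations do not."""
    sharded_state = n_params * bytes_per_param / n_ranks
    return sharded_state + act_bytes_per_rank

ACT_BYTES = 150e9   # one 8k-token sequence, no checkpointing (assumed value)
HBM_BYTES = 80e9    # one 80 GB GPU

for n_ranks in (16, 64, 256):
    total = per_rank_bytes(70e9, n_ranks, ACT_BYTES)
    fits = "fits" if total <= HBM_BYTES else "does not fit"
    print(f"{n_ranks:4d} ranks: {total / 1e9:6.1f} GB per rank ({fits})")
```

Under these assumptions the state term shrinks toward zero as ranks are added, but the total never drops below the activation term, which is why activations need a separate lever.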

Animated · the 16-byte stack vs the HBM ceiling

The bytes-per-param stack literally stacks: bf16 weight + bf16 grad + fp32 master + Adam m + Adam v. Multiply by parameter count and the column grows. Slide it past the H100 ceiling, then the H200, then the B200. Watch the column turn red the moment it overshoots — that's the memory wall as a height. The animation phase shows each byte-category appearing in sequence so you can see which 4 bytes you'd save with Adafactor, or with mixed-precision tricks.

Bytes-per-param column · grows with model size, capped by HBM

Each colored slab is one byte-category, scaled by P. Horizontal dashed lines are HBM ceilings for current GPUs. ▶ to animate the stack assembling; the column turns red when it pierces your selected ceiling.

[Interactive widget. Controls: params (B, default 70), HBM ceiling. Readouts: column height, vs ceiling, fits on 1 GPU?, min GPUs needed.]

Walking the second wall — the time per step

Even if the model fit in one GPU's HBM, training a 70B model on 15 trillion tokens at single-GPU speed is infeasible. The arithmetic is brutal but useful to keep in your head:

total_FLOPs ≈ 6 · params · tokens

The factor of 6 comes from the standard scaling-laws accounting (Kaplan et al. 2020; also used in Chinchilla / Hoffmann et al. 2022): 2 FLOPs per parameter per token for the forward pass, plus 4 for the backward pass, split as 2 for the input-gradient pass and 2 for the weight-gradient pass. For 70B × 15T tokens, that's 6.3 × 10²⁴ FLOPs. An H100 SXM peaks at about 1 × 10¹⁵ dense bf16 FLOPs/s; in practice you achieve ~40% of that (MFU, "model FLOPs utilization") because of bandwidth-bound layers and overhead. One GPU at 0.4 PFLOPS:

time = 6.3 × 10²⁴ / (4 × 10¹⁴) = 1.6 × 10¹⁰ seconds ≈ 500 years

500 years. Even if memory were free, time isn't. The whole point of data parallelism (lesson 04) is to cut this number by a factor of N by running N forward and backward passes simultaneously on different data: at 1024 GPUs with perfect scaling, the elapsed time falls to ~6 months. The cost is one AllReduce of all gradients per step. We will spend lesson 02 making sure that AllReduce is essentially free.
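The arithmetic above fits in a few lines. The sketch below assumes perfect linear scaling across GPUs and a fixed MFU, which real clusters only approximate; the function name is illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def training_seconds(n_params, n_tokens, peak_flops, mfu, n_gpus=1):
    """6 * params * tokens FLOPs divided by achieved cluster FLOP/s."""
    total_flops = 6 * n_params * n_tokens
    return total_flops / (peak_flops * mfu * n_gpus)

one_gpu = training_seconds(70e9, 15e12, peak_flops=1e15, mfu=0.4)
cluster = training_seconds(70e9, 15e12, peak_flops=1e15, mfu=0.4, n_gpus=1024)

print(f"{one_gpu:.2e} s")                                # 1.58e+10 s
print(f"{one_gpu / SECONDS_PER_YEAR:.0f} years")         # 499 years
print(f"{cluster / SECONDS_PER_YEAR * 12:.1f} months")   # 5.8 months
```

Lowering `mfu` or `n_gpus` shows how quickly the deadline slips when utilization or cluster size falls short of the plan.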

Scaling laws — how big, on how much data

The 6 · params · tokens law above (6ND, with N parameters and D training tokens) takes the token budget D as given, as if the 15T tokens fell out of the sky. They didn't. The token budget is itself the output of a scaling law, and that output is what the training-time estimate then multiplies.

Chinchilla (Hoffmann et al. 2022) asked: for a fixed compute budget C ≈ 6ND, how should you split it between parameters N and training tokens D to minimize loss? The answer is that they should scale together — neither a huge model on few tokens nor a tiny model on a flood of tokens is optimal. The compute-optimal ratio lands at roughly:

D ≈ 20 · N (tokens per parameter)

So a 70B model "wants" about 1.4 × 10¹² = 1.4T tokens to be compute-optimal. Plug N=70B and D=1.4T into 6ND and you get about 5.9 × 10²³ FLOPs. That is roughly a tenth of the 6.3 × 10²⁴ FLOPs behind the 500-year single-GPU arithmetic, which used 15T tokens (about 214 tokens per parameter) and therefore already sits well past compute-optimal. Chinchilla is the bridge from "I have this much compute" to "train these many params on these many tokens".

INFERENCE-AWARE CORRECTION

Production models are deliberately over-trained past Chinchilla — more tokens, smaller N — and Llama-3-8B on 15T tokens (D ≈ 1875·N, ~90× past Chinchilla's 20) is the canonical example. The reason is an accounting asymmetry that the compute-optimal frame ignores: training cost is paid once, but inference cost scales with every token you ever serve. A smaller model is cheaper to serve on every single request forever. So if shrinking N (and paying for it with extra training tokens) keeps quality fixed while cutting per-token serve cost, it wins — even though it burned more training FLOPs to get there. Chinchilla minimizes training FLOPs for a target loss; the real objective minimizes training + lifetime-inference cost.

Axis	Compute-optimal (Chinchilla)	Inference-optimal (over-trained)
Tokens / param	~20	100s – 1000s
Model size N for a target loss	larger	smaller
Training FLOPs for that loss	minimal	higher (paid once)
Per-token inference cost	higher	lower (paid forever)
Who it's for	a one-off research run, or a model you'll rarely serve	a model you'll serve to millions

The takeaway for this series: the token budget that drives the throughput wall is not a free parameter. It is set by a scaling law, then bent away from compute-optimal by how heavily you expect to serve the model. Both regimes still hit all three walls — they just move where on the (params, tokens) plane you sit when you hit them.
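The two rules of thumb in this section, D ≈ 20 · N and C ≈ 6ND, combine into a short calculator. This is a sketch of the rules as stated here, not a fit of the Chinchilla loss curves; the function names are illustrative.

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Compute-optimal token budget under the D ≈ 20 · N rule of thumb."""
    return tokens_per_param * n_params

def train_flops(n_params, n_tokens):
    """Training compute under the C ≈ 6 · N · D approximation."""
    return 6 * n_params * n_tokens

def tokens_per_param(n_params, n_tokens):
    """How far a run sits from the ~20 tokens/param compute-optimal point."""
    return n_tokens / n_params

d_opt = chinchilla_tokens(70e9)
print(f"{d_opt:.2e} tokens")                          # 1.40e+12 tokens
print(f"{train_flops(70e9, d_opt):.2e} FLOPs")        # 5.88e+23 FLOPs
print(f"{tokens_per_param(70e9, 15e12):.0f}")         # 214 (70B on 15T)
print(f"{tokens_per_param(8e9, 15e12):.0f}")          # 1875 (8B on 15T)
```

Any ratio far above 20 marks a run that has been pushed toward the inference-optimal column of the table.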

2D · three walls in one picture

The three walls — memory, compute, throughput — are three different scaling lines that each get hit at different model sizes. Pick a model size with the slider; the three bars on the left show how full each budget is. The animated "ball" on each track shows the linear scaling. When a bar fills past 100%, that wall is hit and turns red.

Three walls, three budgets · scroll model size and watch them fill

Memory wall: training-state bytes vs one GPU's HBM. Compute wall: per-step FLOPs vs one GPU's per-second budget at a target step time. Throughput wall: total training FLOPs vs a deadline. Each bar fills linearly; whichever fills first is the binding constraint.

[Interactive widget. Controls: model (B params, default 70), deadline (days, default 60), tokens (T, default 15). Readouts: memory wall, compute wall (step), throughput wall (deadline), first wall hit.]

The roofline — the one diagram for "what's bound by what"

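Every kernel on a GPU is bound by one of two things: how fast you can stream bytes from HBM, or how fast the tensor cores can do FLOPs on bytes already in SRAM. The roofline plot makes the choice visible. Attainable throughput is min(peak FLOP/s, arithmetic intensity × HBM bandwidth): a sloped bandwidth-bound line that meets a flat compute-bound roof at the ridge point.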

Two of the most important distributed-systems consequences fall out of this picture:

Decode is memory-bound, prefill is compute-bound. Per-token decode reads the entire weight matrix from HBM and does one matrix-vector product with it. Arithmetic intensity is therefore roughly equal to the batch size in FLOPs per byte (2 FLOPs per weight per token over 2 bytes per bf16 weight). So a per-replica batch size of 1 lives far down the bandwidth slope, far below peak. Bigger batches climb the slope; eventually you hit the ridge and become compute-bound. This single fact drives PD-disaggregation, continuous batching, speculative decoding, and almost every inference optimization. (Lessons 14 and 15.)
Training prefers high arithmetic intensity, so big batches win — until the AllReduce stops scaling. Training a transformer layer at batch 1024 on a long sequence comfortably sits above the ridge. Distributed training keeps that batch large by adding ranks: each rank takes B/N of a batch, the per-rank work stays comfortably compute-bound, and the only added cost is gradient sync. This is why DDP works at all.
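The roofline itself is a one-line function. The sketch below uses the round 1 × 10¹⁵ FLOP/s peak from this lesson and an assumed HBM bandwidth of about 3.35 TB/s for an H100 SXM; treat both as approximate. The decode model (arithmetic intensity ≈ batch size) ignores KV-cache and activation traffic.

```python
def attainable_flops(arith_intensity, peak_flops, hbm_bandwidth):
    """Roofline: the lower of the compute roof and the bandwidth slope."""
    return min(peak_flops, arith_intensity * hbm_bandwidth)

PEAK = 1e15        # bf16 FLOP/s, the round figure used in this lesson
HBM_BW = 3.35e12   # bytes/s, ASSUMED approximate H100 SXM HBM bandwidth

ridge = PEAK / HBM_BW   # FLOPs per byte needed to become compute-bound
print(f"ridge at about {ridge:.0f} FLOPs/byte")

# Decode with bf16 weights: ~2 FLOPs per weight per token and 2 bytes per
# weight, so arithmetic intensity is roughly the batch size.
for batch in (1, 32, 256, 512):
    frac = attainable_flops(batch, PEAK, HBM_BW) / PEAK
    print(f"batch {batch:4d}: {frac:.1%} of peak")
```

Under these assumptions batch 1 reaches well under 1% of peak, and the batch has to climb into the hundreds before the kernel crosses the ridge.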

The accounting habit

The most important skill in this series is doing back-of-envelope estimates of bytes and FLOPs before writing any code. Here it is practised on a few benchmark numbers:

Quantity	Symbol / formula	70B model (B=1, T=8k)
Param state (mixed-prec Adam)	16 · params	1.12 TB
Forward activations (no checkpointing)	~B · T · L · d · 2 · k bytes, k ≈ 14	~150 GB
Forward activations (checkpointed, full recompute; full treatment in lesson 10)	~B · T · L · d · 2 bytes	~11 GB (selective recompute ≈ 20 GB)
One step FLOPs	6 · params · B · T	~3.4 × 10¹⁵
One step time at 1 PFLOPS peak, 40% MFU	FLOPs / (0.4 · peak)	~8.5 s (per GPU, fictionally: the state does not fit on one)
Grad AllReduce volume per step	2 bytes/param × 2 (ring AllReduce) ≈ 4 · params bytes/rank	140 GB bf16 grads → ~280 GB / rank moved
Grad AllReduce time at ~150 GB/s effective intra-node ring bandwidth (well below the ~900 GB/s NVLink peak; see lesson 03)	vol / BW	~1.9 s if not overlapped

The two factors of 2 in the volume row are different: the first is bytes/param (bf16 grads), the second is the ring-AllReduce traffic factor. A ring moves ~2·S per rank for a payload of S (reduce-scatter, then all-gather). So 140 GB of bf16 grads becomes ~280 GB moved per rank.

That last line is the punchline: a ~1.9 s AllReduce on top of an ~8.5 s step adds roughly 22% to every step if it is not overlapped, and that is at intra-node bandwidth with nothing else competing for the links. So the AllReduce has to be either small (FSDP cuts it; lesson 05), local (intra-node NVLink; lesson 03), or hidden behind the compute (overlap; lesson 04). All three are real strategies, and all three are tested in the lessons ahead.
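The AllReduce rows of the table can be recomputed with the exact ring factor, 2 · (N − 1) / N, which approaches the 2 used above as the ring grows. The sketch below is a bandwidth-only model that ignores latency and overlap; the function name is illustrative and the 150 GB/s figure is the effective bandwidth quoted in the table.

```python
def ring_allreduce_bytes_per_rank(payload_bytes, n_ranks):
    """Bytes each rank sends in a ring AllReduce: 2 * (N - 1) / N * payload."""
    return 2 * (n_ranks - 1) / n_ranks * payload_bytes

GRAD_BYTES = 70e9 * 2    # bf16 gradients for 70B params: 140 GB
BANDWIDTH = 150e9        # bytes/s, effective ring bandwidth from the table

for n_ranks in (2, 8, 1024):
    moved = ring_allreduce_bytes_per_rank(GRAD_BYTES, n_ranks)
    print(f"N={n_ranks:4d}: {moved / 1e9:.0f} GB moved, "
          f"{moved / BANDWIDTH:.2f} s")
# N=   2: 140 GB moved, 0.93 s
# N=   8: 245 GB moved, 1.63 s
# N=1024: 280 GB moved, 1.86 s
```

The per-rank traffic is nearly independent of ring size once N is large, which is why the estimate depends mainly on gradient size and link bandwidth.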

3D · the regime cube

Plot a model in (memory, compute, bandwidth) space. Each axis is "how much of that resource does one rank need". Models cluster into regimes: small models live near the origin (any GPU), 7B models press against the memory axis, frontier 405B models pin all three. The dot color tells you which axis is the binding wall. Rotate the cube to read off the depth.

Interactive · feel the memory wall

Slide the model size, the precision, the optimizer, and the batch around. Watch the bar chart show which buckets fit on which GPU. The dotted line is your chosen GPU's HBM budget. The lesson is in which slider hits the wall first — and that wall is what each subsequent lesson is going to dismantle.

Training-state budget on one GPU

Per-rank memory required to start a single forward+backward. Bars exceed the dashed budget? That's the memory wall. Note: activations here assume no checkpointing; flipping the checkpoint toggle models full recompute (~B·T·L·d·2 bytes; full treatment in lesson 10). Checkpointing is the main alternative to FSDP for relieving memory pressure (lesson 05 explains why we usually want both).

[Interactive widget. Controls: params (B, default 70), batch × seq (default 8), layers (default 80), hidden d (default 8192), optimizer, GPU, activation checkpointing (full recompute · lesson 10). Readouts: total state, vs one GPU, min ranks (FSDP, perfect shard), first slider that hit it.]

Where each wall sends you

Wall	Symptom	First response	Lessons
Memory (state)	OOM at step 0	FSDP / ZeRO-3 → activation checkpointing → TP	05, then 06
Memory (activations)	OOM at longer sequence	Activation checkpointing → SP → CP	05, 08
Memory (KV cache)	OOM serving long context	GQA → KV quantization → paged KV	see vLLM/09, vLLM/02
Time (training)	One-step time × steps > deadline	DP → FSDP-HSDP → 3D parallelism	04, 05, 12
Latency (inference)	TTFT or ITL miss target	TP per replica → speculative decode → PD-disagg	14, 15
Throughput (inference)	QPS shortfall	Replicate → continuous batching → APC	14, see vLLM/04

Takeaway

Distributed is a response to a constraint, not a goal. Diagnose which wall you're hitting (memory? time? throughput?) before reaching for a hammer. The lessons in Part II are each a different hammer; reading them in order is reading the hammers from cheapest to most expensive.