Designing a pretraining system
Serving (lessons 03–06) was a latency game. Pretraining is a different beast: there is one job, it must run for weeks, and its memory bill exceeds a single GPU by 10–100×. The whole design reduces to one sentence — split the model across devices so each GPU's share fits in 80 GB and the GPUs stay busy. Parallelism is not a menu of techniques you pick from taste; it is the solution to that constraint-satisfaction problem.
The bill that forces the split
From lesson 02: mixed-precision Adam holds ~16 bytes per parameter — 2 for the bf16 weight, plus ~14 for the fp32 master weight, fp32 momentum, fp32 variance, and the gradient. So the persistent training state is:
Activations are separate and scale with batch · seq · layers — they grow with the work, not the model. The 16N term is the forcing function. Worked across model sizes against an 80 GB H100:
| Model N | weights (2N) | train state (16N) | H100s just for state |
|---|---|---|---|
| 7B | 14 GB | 112 GB | ≥ 2 |
| 70B | 140 GB | 1,120 GB ≈ 1.1 TB | ≥ 14 |
| 405B | 810 GB | 6,480 GB ≈ 6.5 TB | ≥ 81 |
Even the "small" 7B overflows one GPU once you add optimizer state and activations. A 70B needs 14 GPUs' worth of memory before a single activation byte. There is no single-device option; the only question is how to spread it.
The axes — each relieves one constraint, at one cost
Five ways to cut a model into pieces. Read each as "what memory does it shrink, and what does it cost in communication?"
1 · Data parallel / FSDP-ZeRO — shard the 16N state across replicas
Plain data parallelism replicates the whole model on each of dp ranks and all-reduces gradients each step (system_ml 04). That replicates the 16N bill dp times — wasteful. FSDP / ZeRO instead shards that state across the dp ranks, so each holds 16N / dp (system_ml 05). ZeRO-1 shards the optimizer state, ZeRO-2 adds gradients, ZeRO-3 adds the parameters themselves — progressively more memory relief.
Cost: with ZeRO-3 each layer's parameters must be all-gathered just before its forward/backward, then freed. That is bandwidth-heavy but overlappable — prefetch layer i+1's weights while computing layer i. Overlapped, MFU holds; not overlapped, the GPU stalls on the network and MFU tanks.
2 · Tensor parallel — shard each matmul across GPUs
Split every weight matrix (and its activations) column- or row-wise across tp GPUs, so weights and activations both shrink by tp (system_ml 06). The catch: producing the correct output requires an all-reduce twice per transformer layer (once in attention, once in the MLP). With 80 layers that is 160 all-reduces per step, each on the critical path.
3 · Pipeline parallel — split the layers across stages
Assign contiguous blocks of layers to pp stages, so each stage holds only params/pp (system_ml 07). The only thing that crosses a stage boundary is the activation tensor at that cut — a cheap point-to-point send, not a collective. Because it is so light, PP is the axis that survives crossing nodes on InfiniBand.
Its cost is the pipeline bubble: while stage 1 processes the first microbatch, stages 2…pp sit idle waiting for work to reach them. The idle fraction is:
With pp=8 and M=8 the bubble is 7/15 ≈ 47% — nearly half the GPUs idle. Push to M=64 and it falls to 7/71 ≈ 10%. PP only pays off with many microbatches to amortize the fill/drain.
4 · Sequence / context parallel — shard the sequence dimension
Activations scale with batch · seq · layers, so long contexts blow up activation memory even when weights fit. Sequence/context parallelism shards the sequence dimension across GPUs so each holds a slice of the tokens' activations (system_ml 08). This is the long-context lever; it trades extra communication in attention (which now spans the sequence shards) for activation memory.
5 · Expert parallel (MoE) — shard the experts
A Mixture-of-Experts layer has many expert FFNs but routes each token to only a few. Expert parallelism places different experts on different GPUs (system_ml 09). The cost is an all-to-all token exchange — every GPU ships its tokens to whichever GPU holds their chosen expert, then ships the results back. Bandwidth-sensitive, but it lets total parameter count grow far past what dense parallelism affords.
6 · 3D / nD parallelism — compose them
Real frontier runs use several axes at once: TP inside the node, PP across nodes, FSDP/DP filling the rest of the cluster, sometimes SP and EP layered on (system_ml 10). The degrees multiply: total_GPUs = tp · pp · dp. The recipe below is how you choose those degrees in order.
The design recipe — ordered, and it is the heart of the lesson
You do not pick axes by preference. You apply them in a fixed order, cheapest-comms first, stopping as soon as it fits:
- Compute the bill. 16N for state, plus an activation estimate ∝ batch·seq·layers. That is your target per-GPU number to get under 80 GB.
- TP within a node first (degree up to 8). It is the cheapest relief on the fast NVLink link and cuts both weights and activations by tp. Exhaust the node before reaching across one.
- PP across nodes if a single node still can't hold the model. It splits layers with only cheap point-to-point comms that tolerate InfiniBand — but budget enough microbatches to keep the bubble small.
- FSDP / ZeRO across the dp dimension to shard whatever optimizer state remains and to fill the rest of the cluster as data-parallel replicas, dividing the state by dp.
- Set the global batch. global_batch = micro_batch · grad_accum · dp must hit your target token batch — large enough for throughput, but not so large it hurts convergence (huge batches waste tokens per step of learning).
- Score by MFU and verify every collective overlaps compute. If a config fits but MFU is poor, a collective isn't hidden — fix the overlap or rebalance the degrees.
MFU — the single score for a config
Model FLOPs Utilization is achieved useful FLOP/s divided by the hardware's peak. From lesson 02, it is the number that swings the multi-million-dollar pretraining bill. Rough reading:
- 40% is good, 50%+ is excellent for large dense pretraining on H100s.
- A collective that isn't overlapped with compute is dead GPU time — it drags MFU straight down. Most "why is my MFU 20%?" investigations end at an un-hidden all-gather or all-reduce, or a PP bubble that wasn't amortized.
Two levers you will always reach for
Gradient (activation) checkpointing
Instead of storing every layer's activations for the backward pass, store only a few checkpoints and recompute the rest during backward. This slashes activation memory dramatically at the cost of ~33% extra compute (roughly one extra forward pass). It is the standard long-context lever: when activations — not weights — are what overflow 80 GB, checkpointing buys the headroom that no parallelism axis cheaply provides.
Fault tolerance at 1000-GPU scale
Interactive · parallelism planner — does it fit?
Pick a model and a cluster, then set the parallelism degrees. The widget computes the data-parallel degree, the per-GPU memory, whether it fits in 80 GB, and a rough MFU. Watch the two failure modes: OOM (raise TP/PP or enable checkpointing) and low MFU (a fat PP bubble, or TP pushed past 8 onto the slow inter-node link).
What carries forward
- Pretraining is constraint satisfaction: fit each GPU's share into 80 GB and keep MFU high by overlapping every collective with compute.
- The bill is 16N + activations. A 70B needs ≥ 14 H100s of memory before activations; nothing fits on one device.
- The ordered recipe: TP within a node (≤8, NVLink) → PP across nodes (cheap p2p, mind the bubble) → FSDP/ZeRO shards 16N/dp and fills the cluster → set global batch → score by MFU.
- The 18× intra-vs-inter-node gap decides placement: chatty TP stays in-node, light PP crosses nodes.
- Activation checkpointing trades ~33% compute for big activation-memory cuts — the long-context lever.
- At 1000-GPU scale, failure is routine: async checkpoints (≈16N bytes each), low MTTR, and a cadence that balances I/O against lost work.