Designing a pretraining system

Serving (lessons 03–06) was a latency game. Pretraining is a different beast: there is one job, it must run for weeks, and its memory bill exceeds a single GPU by 10–100×. The whole design reduces to one sentence — split the model across devices so each GPU's share fits in 80 GB and the GPUs stay busy. Parallelism is not a menu of techniques you pick from taste; it is the solution to that constraint-satisfaction problem.

The framing: this is constraint satisfaction, not a technique catalog

Two hard constraints, in tension. (a) It must fit: per-GPU memory ≤ 80 GB. (b) It must be fast: MFU stays high, which means every collective communication overlaps compute instead of stalling it. Each parallelism axis relieves one memory pressure at a known communication cost. The art is composing the cheapest axes that satisfy (a) without breaking (b).

The bill that forces the split

From lesson 02: mixed-precision Adam holds ~16 bytes per parameter — 2 for the bf16 weight, plus ~14 for the fp32 master weight, fp32 momentum, fp32 variance, and the gradient. So the persistent training state is:

state ≈ 16N bytes (2N weights + ~14N optimizer/grad) + activations

Activations are separate and scale with batch · seq · layers — they grow with the work, not the model. The 16N term is the forcing function. Worked across model sizes against an 80 GB H100:

Model N	weights (2N)	train state (16N)	H100s just for state
7B	14 GB	112 GB	≥ 2
70B	140 GB	1,120 GB ≈ 1.1 TB	≥ 14
405B	810 GB	6,480 GB ≈ 6.5 TB	≥ 81

Even the "small" 7B overflows one GPU once you add optimizer state and activations. A 70B needs 14 GPUs' worth of memory before a single activation byte. There is no single-device option; the only question is how to spread it.

The axes — each relieves one constraint, at one cost

Five ways to cut a model into pieces. Read each as "what memory does it shrink, and what does it cost in communication?"

1 · Data parallel / FSDP-ZeRO — shard the 16N state across replicas

Plain data parallelism replicates the whole model on each of dp ranks and all-reduces gradients each step (system_ml 04). That replicates the 16N bill dp times — wasteful. FSDP / ZeRO instead shards that state across the dp ranks, so each holds 16N / dp (system_ml 05). ZeRO-1 shards the optimizer state, ZeRO-2 adds gradients, ZeRO-3 adds the parameters themselves — progressively more memory relief.

Cost: with ZeRO-3 each layer's parameters must be all-gathered just before its forward/backward, then freed. That is bandwidth-heavy but overlappable — prefetch layer i+1's weights while computing layer i. Overlapped, MFU holds; not overlapped, the GPU stalls on the network and MFU tanks.

2 · Tensor parallel — shard each matmul across GPUs

Split every weight matrix (and its activations) column- or row-wise across tp GPUs, so weights and activations both shrink by tp (system_ml 06). The catch: producing the correct output requires an all-reduce twice per transformer layer (once in attention, once in the MLP). With 80 layers that is 160 all-reduces per step, each on the critical path.

TP must stay inside a node — the 18× rule from lesson 02

NVLink is ~900 GB/s intra-node; InfiniBand is ~50 GB/s inter-node — an 18× gap. TP's per-layer all-reduces are too frequent to hide over the slow link, so crossing a node boundary collapses MFU. Practical limit: tp ≤ 8 (one node). TP is the cheap, in-node lever; never stretch it across nodes.

3 · Pipeline parallel — split the layers across stages

Assign contiguous blocks of layers to pp stages, so each stage holds only params/pp (system_ml 07). The only thing that crosses a stage boundary is the activation tensor at that cut — a cheap point-to-point send, not a collective. Because it is so light, PP is the axis that survives crossing nodes on InfiniBand.

Its cost is the pipeline bubble: while stage 1 processes the first microbatch, stages 2…pp sit idle waiting for work to reach them. The idle fraction is:

bubble = (pp − 1) / (M + pp − 1) (M = microbatches per step)

With pp=8 and M=8 the bubble is 7/15 ≈ 47% — nearly half the GPUs idle. Push to M=64 and it falls to 7/71 ≈ 10%. PP only pays off with many microbatches to amortize the fill/drain.

4 · Sequence / context parallel — shard the sequence dimension

Activations scale with batch · seq · layers, so long contexts blow up activation memory even when weights fit. Sequence/context parallelism shards the sequence dimension across GPUs so each holds a slice of the tokens' activations (system_ml 08). This is the long-context lever; it trades extra communication in attention (which now spans the sequence shards) for activation memory.

5 · Expert parallel (MoE) — shard the experts

A Mixture-of-Experts layer has many expert FFNs but routes each token to only a few. Expert parallelism places different experts on different GPUs (system_ml 09). The cost is an all-to-all token exchange — every GPU ships its tokens to whichever GPU holds their chosen expert, then ships the results back. Bandwidth-sensitive, but it lets total parameter count grow far past what dense parallelism affords.

6 · 3D / nD parallelism — compose them

Real frontier runs use several axes at once: TP inside the node, PP across nodes, FSDP/DP filling the rest of the cluster, sometimes SP and EP layered on (system_ml 10). The degrees multiply: total_GPUs = tp · pp · dp. The recipe below is how you choose those degrees in order.

The design recipe — ordered, and it is the heart of the lesson

You do not pick axes by preference. You apply them in a fixed order, cheapest-comms first, stopping as soon as it fits:

Compute the bill. 16N for state, plus an activation estimate ∝ batch·seq·layers. That is your target per-GPU number to get under 80 GB.
TP within a node first (degree up to 8). It is the cheapest relief on the fast NVLink link and cuts both weights and activations by tp. Exhaust the node before reaching across one.
PP across nodes if a single node still can't hold the model. It splits layers with only cheap point-to-point comms that tolerate InfiniBand — but budget enough microbatches to keep the bubble small.
FSDP / ZeRO across the dp dimension to shard whatever optimizer state remains and to fill the rest of the cluster as data-parallel replicas, dividing the state by dp.
Set the global batch. global_batch = micro_batch · grad_accum · dp must hit your target token batch — large enough for throughput, but not so large it hurts convergence (huge batches waste tokens per step of learning).
Score by MFU and verify every collective overlaps compute. If a config fits but MFU is poor, a collective isn't hidden — fix the overlap or rebalance the degrees.

Worked: place a 70B model on a cluster

Bill: 16 · 70 = 1,120 GB of state — needs ≥ 14 H100s' memory before activations. Step 2 — TP=8 (one node): weights and activations divide by 8, so weights drop to 140/8 ≈ 17.5 GB/GPU and activations shrink in step. Step 4 — FSDP across dp: shard the 16N optimizer state over the data-parallel dimension. On, say, 256 GPUs with TP=8, PP=1, we get dp = 256/8 = 32. Per-GPU state ≈ 1,120 / (8 · 32) ≈ 4.4 GB. Add ~17.5 GB of bf16 weights held during compute and a checkpointed activation budget of ~10–20 GB and per-GPU memory lands comfortably under 80 GB. The model fits, no node-crossing TP, and the only frequent collective (ZeRO-3 all-gather) is overlappable — MFU can stay in the good zone.

MFU — the single score for a config

Model FLOPs Utilization is achieved useful FLOP/s divided by the hardware's peak. From lesson 02, it is the number that swings the multi-million-dollar pretraining bill. Rough reading:

40% is good, 50%+ is excellent for large dense pretraining on H100s.
A collective that isn't overlapped with compute is dead GPU time — it drags MFU straight down. Most "why is my MFU 20%?" investigations end at an un-hidden all-gather or all-reduce, or a PP bubble that wasn't amortized.

Two levers you will always reach for

Gradient (activation) checkpointing

Instead of storing every layer's activations for the backward pass, store only a few checkpoints and recompute the rest during backward. This slashes activation memory dramatically at the cost of ~33% extra compute (roughly one extra forward pass). It is the standard long-context lever: when activations — not weights — are what overflow 80 GB, checkpointing buys the headroom that no parallelism axis cheaply provides.

Fault tolerance at 1000-GPU scale

At scale, hardware failure is the normal case, not the exception

Across 1,000+ GPUs a node or GPU dies every few hours. A run that can't survive that never finishes. The defenses: checkpoint frequently (a checkpoint is ≈ 16N bytes — for 70B that's ~1.1 TB to write), checkpoint asynchronously (snapshot to host RAM, flush to storage in the background so training barely pauses), and keep MTTR low so a restart costs minutes, not hours. The cost of a crash is the work since the last checkpoint — lose a 30-minute interval on a 1,000-GPU run and you've burned ~500 GPU-hours. Checkpoint cadence is a direct trade between I/O overhead and expected lost work.

Interactive · parallelism planner — does it fit?

Pick a model and a cluster, then set the parallelism degrees. The widget computes the data-parallel degree, the per-GPU memory, whether it fits in 80 GB, and a rough MFU. Watch the two failure modes: OOM (raise TP/PP or enable checkpointing) and low MFU (a fat PP bubble, or TP pushed past 8 onto the slow inter-node link).

Parallelism planner / does it fit?

Assumptions: 80 GB H100s; ZeRO-3 shards the full 16N state across dp; weights held in bf16 during compute; activation term ∝ layers·batch·seq, divided by TP·PP, ×0.3 with checkpointing; M=32 microbatches for the bubble; MFU starts at 50% and is docked by the PP bubble and by any cross-node TP (tp>8). Order-of-magnitude, per lesson 02's ±30% contract.

params N (B) 70 total GPUs 256 TP degree 8 PP degree 1 activation checkpointing on

dp degree

–

per-GPU mem

–

fits 80 GB?

–

est. MFU

–

What carries forward

Pretraining is constraint satisfaction: fit each GPU's share into 80 GB and keep MFU high by overlapping every collective with compute.
The bill is 16N + activations. A 70B needs ≥ 14 H100s of memory before activations; nothing fits on one device.
The ordered recipe: TP within a node (≤8, NVLink) → PP across nodes (cheap p2p, mind the bubble) → FSDP/ZeRO shards 16N/dp and fills the cluster → set global batch → score by MFU.
The 18× intra-vs-inter-node gap decides placement: chatty TP stays in-node, light PP crosses nodes.
Activation checkpointing trades ~33% compute for big activation-memory cuts — the long-context lever.
At 1000-GPU scale, failure is routine: async checkpoints (≈16N bytes each), low MTTR, and a cadence that balances I/O against lost work.