The scaling ladder — 7B on 8 GPUs to 405B on 2,048
07 gave the ordered recipe, 07a the cost of each axis, 07b the limits. Now we climb the ladder in one continuous walkthrough — same model family, growing cluster — and at every rung do exactly one thing: compute the bill, find the wall that binds, add the cheapest axis that relieves it, re-check. The point isn't the four answers; it's that each new axis appears only when a number forces it. That forcing is the whole skill.
Rung 0 · 7B — does it even fit on one GPU?
Inference fits (14 GB of weights on an 80 GB H100). Training does not: 16·7 = 112 GB of state alone, before a single activation. So even the "small" model needs ≥ 2 GPUs' worth of memory — the forcing function of lesson 07.
On 8 GPUs: the wall is memory, and the cheapest cure is the axis whose collectives are rarest and easiest to hide: FSDP/ZeRO-3 across dp=8. Each GPU holds 16·7/8 = 14 GB of sharded state, plus ~14 GB of bf16 weights gathered during compute, plus a checkpointed activation budget, comfortably under 80 GB. No TP, no PP. The only collectives are FSDP's own (the weight all-gathers and the gradient sync), hidden behind compute as long as local tokens clear 07a's threshold. Binding wall: memory. Axis added: FSDP. MFU: high.
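The rung 0 bill is small enough to check by hand, but a few lines make the rule explicit. This is a minimal sketch, assuming the lesson's 16 bytes/parameter of training state and 2 bytes/parameter for bf16 weights. The function names are illustrative, and activations are left out because the text gives no figure for them.

```python
# Rung 0 memory bill. Sizes are in GB with parameters counted in billions,
# so bytes-per-parameter multiplies straight through.
STATE_BYTES_PER_PARAM = 16   # the lesson's 16N training state
BF16_BYTES_PER_PARAM = 2     # bf16 weights
HBM_GB = 80                  # H100 capacity used in the text


def training_state_gb(params_billion: float) -> float:
    """Total training state before any sharding."""
    return STATE_BYTES_PER_PARAM * params_billion


def fsdp_bill_gb(params_billion: float, dp: int) -> tuple[float, float]:
    """Per-GPU (sharded state, gathered bf16 weights) under FSDP/ZeRO-3.

    Follows the text's accounting: state is split dp ways, while the bf16
    weights are counted in full because they are gathered during compute.
    """
    sharded_state = training_state_gb(params_billion) / dp
    gathered_weights = BF16_BYTES_PER_PARAM * params_billion
    return sharded_state, gathered_weights


state, weights = fsdp_bill_gb(7, dp=8)
print(f"unsharded state: {training_state_gb(7)} GB (one GPU has {HBM_GB} GB)")
print(f"per GPU at dp=8: {state} GB state + {weights} GB weights")
```

Whatever is left of the 80 GB after these two terms is the room available for checkpointed activations.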
Rung 1 · 70B on one node (8 GPUs) — the wall moves
16·70 = 1,120 GB of state. Sharded across dp=8 that's 140 GB/GPU, still over 80, and that's before the bf16 weights you must hold during the matmul (2·70 = 140 GB in total, 17.5 GB/GPU if split 8 ways) and activations. FSDP alone can't save a single node here. Add the in-node lever: TP=8 cuts weights and activations by 8×, but with tp=8, dp=1 on one node the 16N state still lands near 140 GB/GPU. No mix of axes fits on 8 GPUs, so the number forces more nodes. Binding wall: state > 1 node. Axis added: none; the rung is infeasible.
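A quick lower bound shows why no single-node setting can rescue this rung. The sketch below assumes the same 16 bytes/parameter and 80 GB per GPU, and counts only the training state, so the real requirement (with weights and activations) is higher. `min_gpus_for_state` is an illustrative name, not something from the lesson.

```python
import math

STATE_BYTES_PER_PARAM = 16   # the lesson's 16N training state
HBM_GB = 80                  # per-GPU memory used in the text
GPUS_PER_NODE = 8


def min_gpus_for_state(params_billion: float, hbm_gb: float = HBM_GB) -> int:
    """Fewest GPUs whose combined memory could hold the training state alone.

    This is a lower bound: it ignores gathered weights and activations.
    """
    return math.ceil(STATE_BYTES_PER_PARAM * params_billion / hbm_gb)


for size in (7, 70):
    need = min_gpus_for_state(size)
    verdict = "fits in one node" if need <= GPUS_PER_NODE else "needs more than one node"
    print(f"{size}B: at least {need} GPUs -> {verdict}")
```

For 70B the bound is 14 GPUs, already more than the 8 in a node, and that is before anything except state has been counted.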
Rung 2 · 70B on 256 GPUs — the comfortable config
Now there's room to compose. Apply 07's order: TP=8 inside each node (weights → 17.5 GB/GPU, on NVLink so 07a's tax is ~5%), then FSDP across dp = 256/8 = 32 to shard the rest: 1,120 GB of state over all 256 GPUs is about 4.4 GB/GPU, for roughly 22 GB/GPU before activations.
Fits with headroom. Now check the two non-memory walls. Comm (07a): TP stays in-node, grad sync hides if local tokens ≥ ~8K — set micro-batch accordingly. DP cap (07b): global batch = dp · micro · seq; with dp=32 and seq 8K that's well under a 2M-token critical batch, so there's room to grow dp further. Binding wall: none is tight — this is the sweet spot. This is exactly lesson 07's worked example, now justified at every step.
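The DP-cap check above is one multiplication, sketched here with the text's figures. The micro-batch of 1 sequence per GPU is an assumption (the text only says to set it so local tokens reach roughly 8K), and the constant and function names are illustrative.

```python
SEQ_LEN = 8192                      # "seq 8K" from the text
CRITICAL_BATCH_TOKENS = 2_000_000   # the 2M-token critical batch
MIN_LOCAL_TOKENS = 8_000            # approximate threshold for hiding the grad sync


def global_batch_tokens(dp: int, micro_batch: int, seq_len: int = SEQ_LEN) -> int:
    """Tokens per optimizer step: dp * micro * seq, as in the text."""
    return dp * micro_batch * seq_len


def max_dp(micro_batch: int, seq_len: int = SEQ_LEN,
           critical: int = CRITICAL_BATCH_TOKENS) -> int:
    """Largest dp that keeps the global batch at or under the critical batch."""
    return critical // (micro_batch * seq_len)


micro = 1  # assumed: one 8K sequence per GPU per step
local_tokens = micro * SEQ_LEN
print("local tokens clear the threshold:", local_tokens >= MIN_LOCAL_TOKENS)
print("global batch at dp=32:", global_batch_tokens(32, micro))
print("dp could grow to:", max_dp(micro))
```

Under these assumptions dp=32 uses only a fraction of the critical batch, which is the headroom the paragraph refers to.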
Rung 3 · 405B on 2,048 GPUs — when even TP can't hold the weights
16·405 = 6,480 GB of state; weights alone are 810 GB. Try TP=8: weights become 810/8 ≈ 101 GB/GPU, over 80 before optimizer or activations. TP maxed at the node boundary still can't hold the model. The next-cheapest axis in 07's order is the one that survives crossing nodes: pipeline parallel.
Add PP=16 across nodes (each stage holds 1/16 of the layers), keep TP=8 in-node, and let dp = 2,048/(8·16) = 16 shard the rest with FSDP: weights drop to 810/(8·16) ≈ 6.3 GB/GPU and the sharded state to 6,480/2,048 ≈ 3.2 GB/GPU.
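The per-GPU bill for any composed config follows mechanically from tp, pp and the GPU count. The sketch below assumes the accounting used on the earlier rungs: bf16 weights are split by tp·pp, and the 16N state is sharded across every GPU. `composed_bill_gb` is an illustrative helper, and activations are again excluded.

```python
def composed_bill_gb(params_billion: float, gpus: int, tp: int, pp: int = 1):
    """Return (dp, state GB/GPU, bf16 weights GB/GPU) for a tp x pp x dp layout."""
    if gpus % (tp * pp) != 0:
        raise ValueError("gpus must be a multiple of tp * pp")
    dp = gpus // (tp * pp)
    state_gb = 16 * params_billion / gpus          # sharded over tp * pp * dp
    weights_gb = 2 * params_billion / (tp * pp)    # each GPU holds its tp/pp slice
    return dp, state_gb, weights_gb


for label, args in {
    "70B on 256, tp=8": (70, 256, 8, 1),
    "405B on 2,048, tp=8 only": (405, 2048, 8, 1),
    "405B on 2,048, tp=8 pp=16": (405, 2048, 8, 16),
}.items():
    dp, state, weights = composed_bill_gb(*args)
    print(f"{label}: dp={dp}, state={state:.1f} GB, weights={weights:.1f} GB")
```

The middle case is the one that fails: with TP alone the weight slice is still larger than an 80 GB GPU, which is what forces the pipeline axis.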
Memory is solved — but PP introduced a new binding wall, the bubble (07/07a). With pp=16 you need many microbatches to amortize it: at M=16 the bubble is 15/31 ≈ 48% (catastrophic); push to M=64 and it's 15/79 ≈ 19%; M=128 → ~10%. Binding wall: the PP bubble. Lever: microbatch count, not memory. The design problem shifted from "does it fit" to "keep the pipe full" — a different question answered by a different knob, which is the whole reason we re-check after every axis.
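The bubble figures above are consistent with the usual estimate (pp − 1)/(M + pp − 1) for a non-interleaved pipeline schedule. That formula is inferred here from the quoted ratios (15/31 at M=16), not stated in this lesson. A minimal sketch under that assumption, with illustrative function names:

```python
def bubble_fraction(pp: int, microbatches: int) -> float:
    """Idle share of a pipeline step: (pp - 1) / (M + pp - 1)."""
    return (pp - 1) / (microbatches + pp - 1)


def min_microbatches(pp: int, target: float) -> int:
    """Smallest microbatch count whose bubble is at or below `target`."""
    m = 1
    while bubble_fraction(pp, m) > target:
        m += 1
    return m


for m in (16, 64, 128):
    print(f"pp=16, M={m}: bubble = {bubble_fraction(16, m):.1%}")

print("M needed for a 10% bubble:", min_microbatches(16, 0.10))
```

Because the bubble depends only on pp and M, the lever is the microbatch count; adding memory or GPUs does not change it.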
The ladder as one table
| Rung | Model | GPUs | Binding wall | Axis added | Config |
|---|---|---|---|---|---|
| 0 | 7B | 8 | state > 1 GPU | FSDP/ZeRO-3 | dp=8 |
| 1 | 70B | 8 | state > 1 node | (forces multi-node) | — infeasible — |
| 2 | 70B | 256 | weights/GPU | TP + FSDP | tp=8, dp=32 |
| 3 | 405B | 2,048 | weights > TP alone | + PP (then bubble) | tp=8, pp=16, dp=16 |
| 4 | >1T sparse | 4,096+ | 6N compute/token | + EP (MoE) | tp·pp·dp·ep |
Read down the "binding wall" column: it changes every rung. That is the entire message of this track applied to training — you don't memorize configs, you find the wall and relieve it with the cheapest axis, then look again. The config is an output, never an input.
Interactive · the topology recommender
Give it a model and a cluster; it climbs the ladder for you — applying 07's order until the model fits, then reporting the binding wall and a rough MFU. Try the rungs above and watch the config it picks. Then break it: ask for 405B on 64 GPUs (won't fit — it tells you the minimum), or 7B on 4,096 (fits trivially, but you're past the DP cap of 07b, so it warns the extra GPUs are wasted). The recommender is just this lesson's loop in code.
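The loop can be sketched in a few dozen lines. This is not the interactive recommender's actual code: it is a simplified stand-in that assumes a flat, hypothetical 20 GB activation reserve, tries pp only in powers of two, fixes micro-batch at 1 and seq at 8K, and does not estimate MFU or compute the minimum cluster size. All names and constants are illustrative.

```python
from dataclasses import dataclass
from typing import Optional

HBM_GB = 80.0
ACTIVATION_RESERVE_GB = 20.0        # hypothetical flat reserve, not from the lesson
GPUS_PER_NODE = 8
SEQ_LEN = 8192
MICRO_BATCH = 1                     # assumed
CRITICAL_BATCH_TOKENS = 2_000_000


@dataclass
class Plan:
    tp: int
    pp: int
    dp: int
    per_gpu_gb: float
    warning: Optional[str] = None


def bill_gb(params_billion: float, tp: int, pp: int, dp: int) -> float:
    """Sharded 16N state plus this GPU's slice of the bf16 weights."""
    return 16 * params_billion / (tp * pp * dp) + 2 * params_billion / (tp * pp)


def recommend(params_billion: float, gpus: int) -> Optional[Plan]:
    """Try FSDP only, then TP in-node, then growing PP; stop at the first fit."""
    budget = HBM_GB - ACTIVATION_RESERVE_GB
    candidates = [(1, 1)]                       # FSDP across every GPU
    pp = 1
    while GPUS_PER_NODE * pp <= gpus:           # TP=8, then PP = 2, 4, 8, ...
        candidates.append((GPUS_PER_NODE, pp))
        pp *= 2
    for tp, pp in candidates:
        if gpus % (tp * pp) != 0:
            continue
        dp = gpus // (tp * pp)
        gb = bill_gb(params_billion, tp, pp, dp)
        if gb <= budget:
            warning = None
            if dp * MICRO_BATCH * SEQ_LEN > CRITICAL_BATCH_TOKENS:
                warning = "global batch exceeds the critical batch; extra dp is wasted"
            return Plan(tp, pp, dp, gb, warning)
    return None                                 # nothing fits on this cluster


for model, cluster in [(7, 8), (70, 8), (70, 256), (405, 64), (7, 4096)]:
    print(f"{model}B on {cluster} GPUs ->", recommend(model, cluster))
```

On rung 3 this sketch stops at the smallest pp whose state-plus-weights bill fits under the reserve, which is lower than the pp=16 used above; a flat reserve is too crude to stand in for a real activation model.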
What carries forward
- The config is an output of the loop, not a choice. At each scale: bill → binding wall → cheapest relieving axis → re-check. Memorize the loop, derive the config.
- The binding wall moves as you climb: 7B = state > 1 GPU (FSDP); 70B = state > 1 node (forces multi-node, then TP+FSDP); 405B = weights > TP alone (forces PP, then the bubble binds); >1T = 6N compute/token (forces MoE/EP).
- Axes are added in 07's order and never sooner: TP in-node (cheap, ≤8) → PP across nodes (survives IB) → FSDP fills dp. Each appears only when a number forces it.
- After each axis, a new wall appears: memory gives way to the comm tax (07a), then to the DP cap (07b), then to the bubble. Re-checking is the discipline.
- This is the capstone of the training arc: 07 (axes + recipe) + 07a (cost) + 07b (limits) compose into one repeatable climb you can run on any model + cluster pair.