all_lessons / ml_system_design / 07c · scaling ladder lesson 7c / 20

The scaling ladder — 7B on 8 GPUs to 405B on 2,048

07 gave the ordered recipe, 07a the cost of each axis, 07b the limits. Now we climb the ladder in one continuous walkthrough — same model family, growing cluster — and at every rung do exactly one thing: compute the bill, find the wall that binds, add the cheapest axis that relieves it, re-check. The point isn't the four answers; it's that each new axis appears only when a number forces it. That forcing is the whole skill.

The loop, specialized for training
At each rung: (1) bill = 16N state + activations (lesson 02); (2) the binding wall is whichever is true first — weights don't fit, state doesn't fit, comm won't hide (07a), or DP is capped (07b); (3) add the cheapest axis per 07's order — TP in-node → PP across nodes → FSDP across dp; (4) re-run. We never pick an axis we don't need.

Rung 0 · 7B — does it even fit on one GPU?

Inference fits (14 GB of weights on an 80 GB H100). Training does not: 16·7 = 112 GB of state alone, before a single activation. So even the "small" model needs ≥ 2 GPUs' worth of memory — the forcing function of lesson 07.

On 8 GPUs: the wall is memory, and the cheapest cure is the one with the rarest, most-hideable collective — FSDP/ZeRO-3 across dp=8. Each GPU holds 16·7/8 = 14 GB of sharded state, plus ~14 GB of bf16 weights gathered during compute, plus a checkpointed activation budget — comfortably under 80 GB. No TP, no PP. The only collective is the gradient sync, hidden behind the backward pass as long as local tokens clear 07a's threshold. Binding wall: memory. Axis added: FSDP. MFU: high.

Rung 1 · 70B on one node (8 GPUs) — the wall moves

16·70 = 1{,}120 GB of state. Sharded across dp=8 that's 140 GB/GPU — still over 80, and that's before the bf16 weights you must hold during the matmul (140 GB ÷ 8 = 17.5 GB) and activations. FSDP alone can't save a single node here. Add the in-node lever: TP=8 cuts weights and activations by 8× — but with tp=8, dp=1 on one node, the 16N state still lands near 140 GB/GPU.

The lesson of rung 1: 70B simply does not fit on one node
No combination of in-node axes gets 1{,}120 GB of state under 8 × 80 = 640 GB of node memory. The arithmetic forces you off the node. This is why you never see a 70B pretrained on a single DGX — the bill exceeds the box. The question was never "which axis," it was "how many nodes," and the bill answered it: ≥ 2 nodes minimum, and many more for headroom and speed.

Rung 2 · 70B on 256 GPUs — the comfortable config

Now there's room to compose. Apply 07's order: TP=8 inside each node (weights → 17.5 GB/GPU, on NVLink so 07a's tax is ~5%), then FSDP across dp = 256/8 = 32 to shard the rest:

state/GPU ≈ 16·70 / (tp·dp) = 1{,}120 / (8·32) ≈ 4.4 GB  +  17.5 GB weights  +  ~15 GB acts (ckpt) ≈ 37 GB ✓

Fits with headroom. Now check the two non-memory walls. Comm (07a): TP stays in-node, grad sync hides if local tokens ≥ ~8K — set micro-batch accordingly. DP cap (07b): global batch = dp · micro · seq; with dp=32 and seq 8K that's well under a 2M-token critical batch, so there's room to grow dp further. Binding wall: none is tight — this is the sweet spot. This is exactly lesson 07's worked example, now justified at every step.

Rung 3 · 405B on 2,048 GPUs — when even TP can't hold the weights

16·405 = 6{,}480 GB of state; weights alone are 810 GB. Try TP=8: weights become 810/8 ≈ 101 GB/GPUover 80 before optimizer or activations. TP maxed at the node boundary still can't hold the model. The next-cheapest axis in 07's order is the one that survives crossing nodes: pipeline parallel.

Add PP=16 across nodes (each stage holds 1/16 of the layers), keep TP=8 in-node, and let dp = 2,048/(8·16) = 16 shard the rest with FSDP:

weights/GPU ≈ 810/(8·16) ≈ 6.3 GB  |  state/GPU ≈ 6{,}480/(8·16·16) ≈ 3.2 GB  |  + acts ✓ fits

Memory is solved — but PP introduced a new binding wall, the bubble (07/07a). With pp=16 you need many microbatches to amortize it: at M=16 the bubble is 15/31 ≈ 48% (catastrophic); push to M=64 and it's 15/79 ≈ 19%; M=128 → ~10%. Binding wall: the PP bubble. Lever: microbatch count, not memory. The design problem shifted from "does it fit" to "keep the pipe full" — a different question answered by a different knob, which is the whole reason we re-check after every axis.

Past dense 405B: when parameters outgrow the budget, go sparse
Push the dense model bigger and the 6N compute per token (lesson 02) becomes unaffordable — every token pays for every parameter. The escape is Mixture-of-Experts: grow total parameters but route each token to a few experts, so compute per token stays flat. That adds expert parallel (07a's all-to-all) to the stack. The ladder's top rungs are sparse for exactly this reason — it's the only way to keep growing N without growing the per-token bill.

The ladder as one table

RungModelGPUsBinding wallAxis addedConfig
07B8state > 1 GPUFSDP/ZeRO-3dp=8
170B8state > 1 node(forces multi-node)— infeasible —
270B256weights/GPUTP + FSDPtp=8, dp=32
3405B2,048weights > TP alone+ PP (then bubble)tp=8, pp=16, dp=16
4>1T sparse4,096+6N compute/token+ EP (MoE)tp·pp·dp·ep

Read down the "binding wall" column: it changes every rung. That is the entire message of this track applied to training — you don't memorize configs, you find the wall and relieve it with the cheapest axis, then look again. The config is an output, never an input.

model size × cluster size → 7B / 8 FSDP 70B / 256 + TP (in-node) + FSDP 405B / 2048 + TP + PP (cross-node) + FSDP wall: bubble >1T MoE / 4096 + TP + PP + FSDP + EP (all-to-all) wall: 6N/token each rung keeps the axes below it and adds exactly one — the one the new wall demands

Interactive · the topology recommender

Give it a model and a cluster; it climbs the ladder for you — applying 07's order until the model fits, then reporting the binding wall and a rough MFU. Try the rungs above and watch the config it picks. Then break it: ask for 405B on 64 GPUs (won't fit — it tells you the minimum), or 7B on 4,096 (fits trivially, but you're past the DP cap of 07b, so it warns the extra GPUs are wasted). The recommender is just this lesson's loop in code.

Topology recommender / climb the ladder

Assumptions (±30%): 80 GB H100s; greedy order TP(≤8, in-node)→PP(cross-node)→FSDP across remaining dp; bf16 weights + 16N state + checkpointed activations; PP bubble at M=64; DP capped at B_crit=2M tokens, seq 8K (07b); TP tax from 07a’s 1/h with width inferred from N. Order-of-magnitude — sizes a config, doesn’t replace a profiler.

recommended config
per-GPU memory
binding wall
est. MFU

What carries forward