The scaling ceiling — how far "add more GPUs" actually gets you
07a gave you the per-step communication tax. This lesson runs it forward: what happens to that tax as the cluster grows from 8 GPUs to 8,000? The answer is two different stories depending on what you hold fixed — and a hard wall, set not by hardware but by optimization math, that stops data parallelism long before you run out of GPUs to buy. Knowing where that wall is tells you when scaling is nearly free and when every extra GPU is half wasted.
1 · Two questions that sound the same and aren't
| | Strong scaling | Weak scaling |
|---|---|---|
| What's fixed | the job (one model, one global batch) | the work per GPU |
| Adding GPUs… | finishes the same job faster | does a bigger job in the same time |
| Per-GPU compute | shrinks as 1/P | stays constant |
| Comm as a fraction | rises → efficiency decays | stays flat → efficiency holds |
| Where you meet it | splitting a model with TP/PP; fast iteration | frontier pretraining (bigger model + more tokens) |
The mistake that wastes the most money in practice is running a strong-scaling job (fixed model, "let's just throw 2× the GPUs at it") and expecting weak-scaling efficiency. The arithmetic below shows why that expectation is wrong, and exactly when.
2 · Why strong scaling decays — Amdahl, in bytes
Split a fixed model across P GPUs (more TP, more PP). Per-GPU compute falls as 1/P, but from 07a the communication per step does not fall with it: TP's per-layer all-reduces and PP's bubble are roughly fixed costs that you now pay against a smaller compute window. So the comm fraction climbs. Scaling efficiency, the speedup you actually get divided by the speedup you paid for, is:
E = 1 / (1 + comm_frac)
This is Amdahl's law wearing a network: the serial-ish, non-shrinking part (communication) sets a ceiling, and past some P you're adding GPUs that mostly talk. A config that hit 50% MFU on 256 GPUs can sag to 30% on 2,048 for the same model — not because anything broke, but because the comm fraction grew exactly as this formula predicts.
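To make the decay concrete, here is a minimal Python sketch of the relation E = 1/(1 + comm_frac) under a toy assumption: exposed communication time per step stays fixed while compute time shrinks as 1/P. Reading comm_frac as exposed communication time divided by compute time is an assumption of this sketch, and the timings (1.0 s of compute at 256 GPUs, 0.05 s of communication) are illustrative, not measurements.

```python
def scaling_efficiency(comm_frac: float) -> float:
    """E = 1 / (1 + comm_frac)."""
    return 1.0 / (1.0 + comm_frac)


def strong_scaling_curve(compute_s_at_base, comm_s, base_gpus, gpu_counts):
    """Toy strong-scaling model (an assumption, not a measurement):
    compute time shrinks as 1/P, exposed comm time per step stays fixed."""
    rows = []
    for p in gpu_counts:
        compute_s = compute_s_at_base * base_gpus / p   # same job, more GPUs
        comm_frac = comm_s / compute_s                  # comm relative to compute
        rows.append((p, comm_frac, scaling_efficiency(comm_frac)))
    return rows


# Illustrative numbers only.
rows = strong_scaling_curve(
    compute_s_at_base=1.0, comm_s=0.05, base_gpus=256,
    gpu_counts=[256, 512, 1024, 2048],
)
for p, frac, eff in rows:
    print(f"{p:>5} GPUs  comm_frac={frac:.2f}  efficiency={eff:.2f}")
```

In this toy setup the comm fraction doubles every time the GPU count doubles, so efficiency slides even though the communication cost itself never changed.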
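3 · The wall that isn't hardware — critical batch size
"Just add data-parallel replicas" sounds free: 07a showed DP's gradient all-reduce is independent of replica count and hides behind the backward pass. So why not run dp = 4,000? Because data parallelism works by growing the global batch:
global batch (tokens) = dp · micro · accum · seq
where micro is the per-replica micro-batch size, accum the number of gradient-accumulation steps, and seq the sequence length.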
And there is a point — the critical batch size B_crit — beyond which a bigger batch stops buying you faster learning. Below B_crit, doubling the batch roughly halves the steps to a target loss (gradient noise dominates, more samples genuinely help). Above it, the gradient is already accurate, so doubling the batch barely reduces the step count — you process 2× the tokens per step and learn almost the same amount. Those extra tokens are wasted compute.
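The sketch below puts rough numbers on both halves of this argument. The DP cap uses the lesson's own bound, dp ≤ B_crit/(micro·accum·seq). The steps-to-target curve, steps ≈ S_min · (1 + B_crit/B), is a common empirical approximation from the large-batch training literature and is an assumption here, not something this lesson derives. `B_CRIT`, `S_MIN`, and the sequence length are hypothetical values chosen for illustration.

```python
def max_dp(b_crit_tokens: int, micro_batch: int, grad_accum: int, seq_len: int) -> int:
    """Largest replica count that keeps the global batch at or below B_crit."""
    tokens_per_replica_step = micro_batch * grad_accum * seq_len
    return b_crit_tokens // tokens_per_replica_step


def steps_to_target(batch_tokens: int, b_crit_tokens: int, s_min: float) -> float:
    """ASSUMED model: steps = S_min * (1 + B_crit / B)."""
    return s_min * (1.0 + b_crit_tokens / batch_tokens)


B_CRIT = 4 * 1024 * 1024   # hypothetical critical batch, in tokens
S_MIN = 10_000.0           # hypothetical minimum step count

dp_cap = max_dp(B_CRIT, micro_batch=1, grad_accum=1, seq_len=4096)
print("DP cap:", dp_cap)

for batch in (B_CRIT // 8, B_CRIT // 4, 8 * B_CRIT, 16 * B_CRIT):
    steps = steps_to_target(batch, B_CRIT, S_MIN)
    tokens = steps * batch  # total tokens processed to reach the target loss
    print(f"batch={batch / B_CRIT:>6.3f} x B_crit  steps={steps:>9.0f}  tokens={tokens:.3e}")
```

Under this assumed curve, doubling the batch from B_crit/8 to B_crit/4 cuts the step count by about 44%, while doubling from 8·B_crit to 16·B_crit cuts it by about 6% and nearly doubles the tokens spent.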
4 · Weak scaling is the escape hatch — and why frontier runs use it
Strong scaling decays and DP hits the critical-batch wall. So how do real runs reach 16,000 GPUs at decent MFU? They don't hold the model fixed — they grow it. Weak scaling keeps each GPU's compute constant by making the problem bigger as the cluster grows:
- Bigger model → more compute per token (6N grows), and from 07a a wider model pays less TP tax (the 1/h term). So model-parallel efficiency actually improves with scale.
- More tokens → the longer run amortizes fixed costs (checkpointing, warmup, restarts).
- Critical batch grows too → bigger models tolerate bigger batches, lifting the DP cap of §3.
This is the deep reason the field chases scale on all three axes at once (model, data, GPUs) rather than just buying more GPUs for a fixed model: weak scaling is the regime where efficiency holds, and the scaling-law economics reward growing the model anyway. The cluster, the model, and the token budget are co-designed.
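A small worked example of the difference, using the 6N FLOPs-per-token rule mentioned above. The model sizes, batch size, and GPU counts are hypothetical, and the sketch only counts compute per GPU per step; it does not model communication.

```python
import math


def flops_per_gpu_per_step(n_params: float, batch_tokens: float, gpus: int) -> float:
    """Training compute per GPU per step, using ~6*N FLOPs per token."""
    return 6.0 * n_params * batch_tokens / gpus


BATCH = 4 * 1024 * 1024  # hypothetical global batch, in tokens

base = flops_per_gpu_per_step(n_params=7e9, batch_tokens=BATCH, gpus=256)

# Strong scaling: same model and batch, 8x the GPUs.
strong = flops_per_gpu_per_step(n_params=7e9, batch_tokens=BATCH, gpus=2048)

# Weak scaling: 8x the GPUs and an 8x larger model (hypothetical sizes).
weak = flops_per_gpu_per_step(n_params=56e9, batch_tokens=BATCH, gpus=2048)

print(f"strong scaling keeps {strong / base:.3f} of the per-GPU work")
print(f"weak scaling keeps   {weak / base:.3f} of the per-GPU work")
```

With the per-GPU compute window held constant, a roughly fixed communication cost stays the same fraction of the step, which is the sense in which weak-scaling efficiency holds.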
Interactive · the scaling-efficiency curve
Set a model and a target global batch, then sweep the cluster size. The widget marks the DP cap (where the critical batch is reached) and plots efficiency in two regimes: cheap data-parallel scaling up to the cap, then decaying model-parallel scaling past it. Push the model size up and watch the whole curve lift and the cap move right — weak scaling, live. Drop the critical batch and watch the cheap zone collapse.
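For readers without the widget, the shape of the curve can be approximated with a toy model. This is not the widget's actual code. It assumes DP scaling is perfectly efficient up to the cap, that the model fits on one GPU below the cap, and that past the cap each extra unit of model-parallel degree adds a fixed comm fraction (`tax_per_split`, a hypothetical parameter).

```python
def efficiency(gpus: int, dp_cap: int, tax_per_split: float = 0.1) -> float:
    """Toy two-regime efficiency curve (all modelling choices are assumptions)."""
    if gpus <= dp_cap:
        return 1.0  # assumed: DP all-reduce fully hidden behind backward
    mp_degree = gpus / dp_cap                    # GPUs per replica past the cap
    comm_frac = tax_per_split * (mp_degree - 1)  # assumed linear comm tax
    return 1.0 / (1.0 + comm_frac)


def relative_throughput(gpus: int, dp_cap: int, tax_per_split: float = 0.1) -> float:
    """Throughput in units of 'perfectly used GPUs'."""
    return gpus * efficiency(gpus, dp_cap, tax_per_split)


DP_CAP = 1024  # hypothetical cap from the critical batch size
for gpus in (256, 1024, 2048, 4096, 8192, 16384):
    e = efficiency(gpus, DP_CAP)
    t = relative_throughput(gpus, DP_CAP)
    print(f"{gpus:>6} GPUs  efficiency={e:.2f}  throughput={t:>8.0f} GPU-equivalents")
```

In this toy model throughput keeps rising past the cap but approaches a ceiling of dp_cap / tax_per_split, which mirrors the bend-then-flatten shape the widget draws. Raising `DP_CAP` moves the bend to the right.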
What carries forward
- Strong vs weak scaling are opposite regimes. Fixed job + more GPUs (strong) decays toward an Amdahl ceiling as the comm fraction climbs; constant per-GPU work (weak) holds efficiency flat.
- Scaling efficiency E = 1/(1+comm_frac), and comm_frac rises with model-parallel degree, so a fixed model on 8× the GPUs delivers well under 8× the throughput.
- Critical batch size hard-caps data parallelism: dp ≤ B_crit/(micro·accum·seq). Past it you either waste tokens or break 07a's overlap threshold — DP runs out of road.
- Above the DP cap you must split the model (TP/PP/EP), paying 07a's comm tax — which is why throughput bends then flattens, and where the cost-effective scale lives.
- Weak scaling is the escape: grow model + data + cluster together. Wider models pay less TP tax and tolerate bigger batches — the reason frontier runs co-design cluster, model, and token budget.