The scaling ceiling — how far "add more GPUs" actually gets you

07a gave you the per-step communication tax. This lesson runs it forward: what happens to that tax as the cluster grows from 8 GPUs to 8,000? The answer is two different stories depending on what you hold fixed — and a hard wall, set not by hardware but by optimization math, that stops data parallelism long before you run out of GPUs to buy. Knowing where that wall is tells you when scaling is nearly free and when every extra GPU is half wasted.

The one idea

There are two ways to "add GPUs," and they behave oppositely. Weak scaling — grow the model and data with the cluster — keeps each GPU's work constant, so efficiency stays flat: this is why frontier pretraining scales to thousands of GPUs. Strong scaling — pin the job and split it across more GPUs — shrinks each GPU's compute while comm stays or grows, so efficiency decays toward an Amdahl ceiling. And a third fact decides which regime you're forced into: the critical batch size caps how far data parallelism can take you before you must reach for the comm-hungry model-parallel axes.

1 · Two questions that sound the same and aren't

	Strong scaling	Weak scaling
What's fixed	the job (one model, one global batch)	the work per GPU
Adding GPUs…	finishes the same job faster	does a bigger job in the same time
Per-GPU compute	shrinks as 1/P	stays constant
Comm as a fraction	rises → efficiency decays	stays flat → efficiency holds
Where you meet it	splitting a model with TP/PP; fast iteration	frontier pretraining (bigger model + more tokens)

The mistake that wastes the most money in practice is running a strong-scaling job (fixed model, "let's just throw 2× the GPUs at it") and expecting weak-scaling efficiency. The arithmetic below shows why that expectation is wrong, and exactly when.

2 · Why strong scaling decays — Amdahl, in bytes

Split a fixed model across P GPUs (more TP, more PP). Per-GPU compute falls as 1/P, but from 07a the communication per step does not fall in step — TP's per-layer all-reduces and PP's bubble are roughly fixed costs that you now pay against a smaller compute window. So the comm fraction climbs. Scaling efficiency — the speedup you actually get divided by the speedup you paid for — is:

E(P) = compute / (compute + comm) = 1 / (1 + comm_frac(P)),   where comm_frac(P) = comm / compute rises with P

This is Amdahl's law wearing a network: the serial-ish, non-shrinking part (communication) sets a ceiling, and past some P you're adding GPUs that mostly talk. A config that hit 50% MFU on 256 GPUs can sag to 30% on 2,048 for the same model — not because anything broke, but because the comm fraction grew exactly as this formula predicts.
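The decay can be sketched numerically. This is a toy model, not a measurement: it assumes compute time shrinks exactly as 1/P while communication per step stays fixed, and the function names and the 64 s / 0.05 s figures are hypothetical. It also reads the MFU example above as a throughput ratio, using the fact that for a fixed model on the same hardware, throughput is proportional to GPUs × MFU.

```python
def scaling_efficiency(comm_frac):
    """E = 1 / (1 + comm_frac), where comm_frac = comm time / compute time."""
    return 1.0 / (1.0 + comm_frac)


def strong_scaling(compute_1gpu_s, comm_s, gpus):
    """Toy assumption: compute splits perfectly (1/P), comm per step is constant."""
    compute_s = compute_1gpu_s / gpus
    eff = scaling_efficiency(comm_s / compute_s)
    speedup = gpus * eff  # same as compute_1gpu_s / (compute_s + comm_s)
    return eff, speedup


def throughput_ratio(gpus_a, mfu_a, gpus_b, mfu_b):
    """Throughput scales with GPUs x MFU for a fixed model and hardware."""
    return (gpus_b * mfu_b) / (gpus_a * mfu_a)


COMPUTE_1GPU_S = 64.0  # hypothetical single-GPU compute time per step
COMM_S = 0.05          # hypothetical fixed communication time per step

for p in (8, 64, 512, 4096):
    eff, speedup = strong_scaling(COMPUTE_1GPU_S, COMM_S, p)
    print(f"{p:>5} GPUs  E={eff:.2f}  speedup={speedup:7.1f}x")

# Amdahl ceiling: in this model the speedup can never exceed compute / comm.
print("ceiling:", COMPUTE_1GPU_S / COMM_S)

# 50% MFU on 256 GPUs vs 30% on 2,048: 8x the GPUs, about 4.8x the throughput.
print(round(throughput_ratio(256, 0.50, 2048, 0.30), 2))
```

In this sketch the speedup approaches but never reaches compute / comm, which is the ceiling the text describes. Real communication costs usually grow with P rather than staying constant, so treat the curve as optimistic.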

The tell: throughput rises sub-linearly, then flattens

Plot tokens/sec against GPU count for a fixed model. Early on it's nearly a straight line (comm hidden, efficiency ~1). Then it bends — each doubling of GPUs buys less than 2× — and eventually flattens: you've hit the regime where added GPUs are absorbed by communication. The knee of that curve is your cost-effective scale for that model. Past it you're still going faster, just paying 2× the GPUs for 1.3× the speed.
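One way to locate the knee from a sweep is to look at the marginal efficiency of each step up in GPU count: the throughput gain divided by the GPU gain. The sketch below is illustrative only. The measurements are made-up numbers, and the 0.75 cutoff is an arbitrary assumption you would tune to your own cost model.

```python
def marginal_efficiency(measurements):
    """measurements: list of (gpus, tokens_per_sec), sorted by gpus.

    Returns (gpus_from, gpus_to, throughput_gain / gpu_gain) per step.
    A value of 1.0 means the extra GPUs paid off fully; 0.65 means
    2x the GPUs bought 1.3x the speed.
    """
    out = []
    for (p0, t0), (p1, t1) in zip(measurements, measurements[1:]):
        out.append((p0, p1, (t1 / t0) / (p1 / p0)))
    return out


def find_knee(measurements, min_marginal=0.75):
    """Largest GPU count reached before a step falls below min_marginal."""
    knee = measurements[0][0]
    for _, p1, m in marginal_efficiency(measurements):
        if m < min_marginal:
            break
        knee = p1
    return knee


# Hypothetical sweep for one fixed model (not real data).
sweep = [
    (64, 100_000),
    (128, 196_000),
    (256, 372_000),
    (512, 650_000),
    (1024, 845_000),   # 2x the GPUs for 1.3x the speed
    (2048, 930_000),
]

for p0, p1, m in marginal_efficiency(sweep):
    print(f"{p0:>5} -> {p1:<5} marginal efficiency {m:.2f}")
print("knee at", find_knee(sweep), "GPUs")
```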

3 · The wall that isn't hardware — critical batch size

"Just add data-parallel replicas" sounds free: 07a showed DP's gradient all-reduce is independent of replica count and hides behind the backward pass. So why not run dp = 4,000? Because data parallelism works by growing the global batch:

global_batch (tokens) = dp · micro_batch · grad_accum · seq

And there is a point — the critical batch size B_crit — beyond which a bigger batch stops buying you faster learning. Below B_crit, doubling the batch roughly halves the steps to a target loss (gradient noise dominates, more samples genuinely help). Above it, the gradient is already accurate, so doubling the batch barely reduces the step count — you process 2× the tokens per step and learn almost the same amount. Those extra tokens are wasted compute.
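The two regimes can be illustrated with a simple curve. The form used here, steps = S_min · (1 + B_crit / B), is an assumption brought in for illustration: it is a commonly used empirical shape in the large-batch training literature, not something this lesson derives. The B_crit value is hypothetical.

```python
def steps_to_target(batch_tokens, b_crit_tokens, s_min_steps):
    """Assumed form: steps = S_min * (1 + B_crit / B)."""
    return s_min_steps * (1.0 + b_crit_tokens / batch_tokens)


def tokens_to_target(batch_tokens, b_crit_tokens, s_min_steps):
    """Total tokens processed to reach the target loss at this batch size."""
    return batch_tokens * steps_to_target(batch_tokens, b_crit_tokens, s_min_steps)


def step_reduction_from_doubling(batch_tokens, b_crit_tokens):
    """How many times fewer steps are needed after doubling the batch.

    2.0 is the ideal (steps halve); 1.0 means doubling bought nothing.
    """
    before = steps_to_target(batch_tokens, b_crit_tokens, 1.0)
    after = steps_to_target(2 * batch_tokens, b_crit_tokens, 1.0)
    return before / after


B_CRIT = 2_000_000  # hypothetical critical batch size, in tokens

for mult in (1 / 16, 1 / 4, 1, 4, 16):
    gain = step_reduction_from_doubling(B_CRIT * mult, B_CRIT)
    print(f"B = {mult:>6.4g} x B_crit: doubling cuts steps by {gain:.2f}x")
```

Under this assumed form, doubling a batch far below B_crit cuts steps by close to 2×, while doubling a batch far above it cuts steps by only a few percent, and the total tokens needed keep climbing.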

So data parallelism has a hard cap, set by optimization, not bandwidth

Useful data parallelism is bounded by dp ≤ B_crit / (micro_batch · grad_accum · seq). Past that, there are two bad options: (a) keep growing the batch beyond B_crit and burn tokens for almost no convergence gain, or (b) shrink each rank's local tokens to hold the global batch fixed, which pushes you below 07a's peak·MFU/BW overlap threshold, so the all-reduce stops hiding and MFU drops. Either way, DP runs out of road. For a fixed model, the only way to consume more GPUs is to split the model (TP/PP/EP), which puts you back in the strong-scaling regime of §2, where efficiency decays.

Worked: where DP runs out on a typical pretraining run

Frontier runs target a global batch around B_crit ≈ 2M tokens (B_crit grows over training, so treat this as an order-of-magnitude figure). With seq = 8K and micro_batch · grad_accum = 1 sequence per rank, each DP rank consumes 8K tokens per step, so dp_max ≈ 2M / 8K = 256 (reading 2M and 8K as powers of two, 2^21 / 2^13; in decimal, 2,000,000 / 8,000 = 250). Compose with the in-node tp = 8 ceiling (07a) and you can fill dp · tp = 2,048 GPUs at high efficiency on this model. Want to use 8,000? You must add PP (and eat its bubble) or train a bigger model (weak scaling), which is exactly what labs do. The cluster size and the model size are not independent choices; B_crit ties them together.
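The worked numbers can be checked with a few lines. This is a sketch under the text's assumptions (B_crit ≈ 2M tokens, seq = 8K, one sequence per rank per step, tp = 8). The function names are illustrative, and it shows both the power-of-two and the decimal reading of "2M / 8K".

```python
def global_batch_tokens(dp, micro_batch, grad_accum, seq_len):
    """global_batch (tokens) = dp * micro_batch * grad_accum * seq."""
    return dp * micro_batch * grad_accum * seq_len


def max_useful_dp(b_crit_tokens, micro_batch, grad_accum, seq_len):
    """Largest dp whose global batch does not exceed B_crit."""
    return b_crit_tokens // (micro_batch * grad_accum * seq_len)


TP = 8  # in-node tensor-parallel ceiling, taken from the text

# Power-of-two reading: 2M = 2**21 tokens, 8K = 2**13 tokens.
dp_pow2 = max_useful_dp(2**21, micro_batch=1, grad_accum=1, seq_len=2**13)
print(dp_pow2, dp_pow2 * TP)

# Decimal reading: 2,000,000 / 8,000.
dp_dec = max_useful_dp(2_000_000, micro_batch=1, grad_accum=1, seq_len=8_000)
print(dp_dec, dp_dec * TP)

# Raising grad_accum or micro_batch lowers the cap proportionally.
print(max_useful_dp(2**21, micro_batch=1, grad_accum=4, seq_len=2**13))
```

Either reading lands at roughly 2,000 GPUs for dp · tp. The last line shows that gradient accumulation spends the same batch budget, so it lowers the number of DP replicas you can usefully run.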

4 · Weak scaling is the escape hatch — and why frontier runs use it

Strong scaling decays and DP hits the critical-batch wall. So how do real runs reach 16,000 GPUs at decent MFU? They don't hold the model fixed — they grow it. Weak scaling keeps each GPU's compute constant by making the problem bigger as the cluster grows:

Bigger model → more compute per token (6N grows), and from 07a a wider model pays less TP tax (the 1/h term). So model-parallel efficiency actually improves with scale.
More tokens → the longer run amortizes fixed costs (checkpointing, warmup, restarts).
Critical batch grows too → bigger models tolerate bigger batches, lifting the DP cap of §3.

This is the deep reason the field chases scale on all three axes at once (model, data, GPUs) rather than just buying more GPUs for a fixed model: weak scaling is the regime where efficiency holds, and the scaling-law economics reward growing the model anyway. The cluster, the model, and the token budget are co-designed.
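The contrast between the two regimes shows up directly in per-GPU work. The sketch below uses the 6N FLOPs-per-token approximation mentioned above. The model sizes, batch sizes, and the choice to grow the model and the batch by 4× each for a 16× larger cluster are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math


def per_gpu_flops_per_step(n_params, global_batch_tokens, gpus):
    """Approximate training FLOPs per GPU per step: 6 * N * tokens / P."""
    return 6 * n_params * global_batch_tokens / gpus


BASE_GPUS = 256
BASE_PARAMS = 7_000_000_000   # hypothetical 7B-parameter model
BASE_BATCH = 2**21            # hypothetical global batch, in tokens

base = per_gpu_flops_per_step(BASE_PARAMS, BASE_BATCH, BASE_GPUS)

# Strong scaling: same model, same batch, 16x the GPUs.
strong = per_gpu_flops_per_step(BASE_PARAMS, BASE_BATCH, 16 * BASE_GPUS)

# Weak scaling: 16x the GPUs, with a 4x model and a 4x batch (assumed split).
weak = per_gpu_flops_per_step(4 * BASE_PARAMS, 4 * BASE_BATCH, 16 * BASE_GPUS)

print(f"strong: per-GPU work is {strong / base:.4f}x the baseline")
print(f"weak:   per-GPU work is {weak / base:.4f}x the baseline")
```

With per-GPU compute held constant, a roughly fixed per-step communication cost stays the same fraction of the step, which is why efficiency holds in the weak-scaling regime.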

Interactive · the scaling-efficiency curve

Set a model and a target global batch, then sweep the cluster size. The widget marks the DP cap (where the critical batch is reached) and plots efficiency in two regimes: cheap data-parallel scaling up to the cap, then decaying model-parallel scaling past it. Push the model size up and watch the whole curve lift and the cap move right — weak scaling, live. Drop the critical batch and watch the cheap zone collapse.

What carries forward

Strong vs weak scaling are opposite regimes. Fixed job + more GPUs (strong) decays toward an Amdahl ceiling as the comm fraction climbs; constant per-GPU work (weak) holds efficiency flat.
Scaling efficiency E = 1/(1+comm_frac), and comm_frac rises with model-parallel degree, so a fixed model on 8× the GPUs delivers well under 8× the throughput.
Critical batch size hard-caps data parallelism: dp ≤ B_crit/(micro_batch·grad_accum·seq). Past it you either waste tokens or break 07a's overlap threshold, and DP runs out of road.
Above the DP cap you must split the model (TP/PP/EP), paying 07a's comm tax — which is why throughput bends then flattens, and where the cost-effective scale lives.
Weak scaling is the escape: grow model + data + cluster together. Wider models pay less TP tax and tolerate bigger batches — the reason frontier runs co-design cluster, model, and token budget.