The scaling ceiling — how far "add more GPUs" actually gets you

07a gave you the per-step communication tax. This lesson runs it forward: what happens to that tax as the cluster grows from 8 GPUs to 8,000? The answer is two different stories depending on what you hold fixed — and a hard wall, set not by hardware but by optimization math, that stops data parallelism long before you run out of GPUs to buy. Knowing where that wall is tells you when scaling is nearly free and when every extra GPU is half wasted.

The one idea
There are two ways to "add GPUs," and they behave oppositely. Weak scaling — grow the model and data with the cluster — keeps each GPU's work constant, so efficiency stays flat: this is why frontier pretraining scales to thousands of GPUs. Strong scaling — pin the job and split it across more GPUs — shrinks each GPU's compute while comm stays or grows, so efficiency decays toward an Amdahl ceiling. And a third fact decides which regime you're forced into: the critical batch size caps how far data parallelism can take you before you must reach for the comm-hungry model-parallel axes.

1 · Two questions that sound the same and aren't

                    | Strong scaling                               | Weak scaling
What's fixed        | the job (one model, one global batch)        | the work per GPU
Adding GPUs…        | finishes the same job faster                 | does a bigger job in the same time
Per-GPU compute     | shrinks as 1/P                               | stays constant
Comm as a fraction  | rises → efficiency decays                    | stays flat → efficiency holds
Where you meet it   | splitting a model with TP/PP; fast iteration | frontier pretraining (bigger model + more tokens)

The mistake that wastes the most money in practice is running a strong-scaling job (fixed model, "let's just throw 2× the GPUs at it") and expecting weak-scaling efficiency. The arithmetic below shows why that expectation is wrong, and exactly when.

2 · Why strong scaling decays — Amdahl, in bytes

Split a fixed model across P GPUs (more TP, more PP). Per-GPU compute falls as 1/P, but from 07a the communication per step does not fall in step — TP's per-layer all-reduces and PP's bubble are roughly fixed costs that you now pay against a smaller compute window. So the comm fraction climbs. Scaling efficiency — the speedup you actually get divided by the speedup you paid for — is:

E(P) = compute / (compute + comm) = 1 / (1 + comm_frac(P)),     where comm_frac(P) = comm / compute rises with P

This is Amdahl's law wearing a network: the serial-ish, non-shrinking part (communication) sets a ceiling, and past some P you're adding GPUs that mostly talk. A config that hit 50% MFU on 256 GPUs can sag to 30% on 2,048 for the same model — not because anything broke, but because the comm fraction grew exactly as this formula predicts.
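The formula is easy to run forward. What follows is a minimal sketch, not a measurement: it assumes compute divides perfectly as 1/P, treats communication as a fixed, un-overlapped cost per step, and uses made-up timings (`compute_total_s`, `comm_s`) picked only so the decay between 256 and 2,048 GPUs lands in the same ballpark as the 50% to 30% example.

```python
def strong_scaling_efficiency(p, compute_total_s=256.0, comm_s=0.1):
    """E(P) = compute / (compute + comm) for a fixed job split across p GPUs.

    Assumptions (hypothetical numbers, not from any real cluster):
      - compute_total_s is the single-GPU compute time per step and divides as 1/p
      - comm_s is a per-step communication cost that does not shrink with p
    """
    compute = compute_total_s / p
    return compute / (compute + comm_s)


if __name__ == "__main__":
    for p in (8, 64, 256, 2048):
        e = strong_scaling_efficiency(p)
        print(f"P={p:5d}  comm_frac={1 / e - 1:5.3f}  E(P)={e:5.3f}")
```

With these toy values the efficiency ratio between 256 and 2,048 GPUs is about 1.6, close to the 50/30 ratio quoted above; real curves depend on how much communication overlaps with compute.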

The tell: throughput rises sub-linearly, then flattens
Plot tokens/sec against GPU count for a fixed model. Early on it's nearly a straight line (comm hidden, efficiency ~1). Then it bends — each doubling of GPUs buys less than 2× — and eventually flattens: you've hit the regime where added GPUs are absorbed by communication. The knee of that curve is your cost-effective scale for that model. Past it you're still going faster, just paying 2× the GPUs for 1.3× the speed.
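One way to locate the knee numerically is to look at what each doubling buys. The sketch below reuses the same toy cost model (fixed per-step comm, compute shrinking as 1/P); the timings, the tokens-per-step figure, and the 1.5× cutoff (`min_gain`) are all illustrative assumptions, and you would replace `throughput` with measured tokens/sec in practice.

```python
def throughput(p, compute_total_s=256.0, comm_s=0.1, tokens_per_step=2**21):
    """Tokens/sec for a fixed job on p GPUs under a toy cost model (assumed)."""
    step_time = compute_total_s / p + comm_s
    return tokens_per_step / step_time


def doubling_gains(start=8, max_p=4096):
    """Return [(p, speedup obtained by going from p to 2p GPUs), ...]."""
    gains = []
    p = start
    while 2 * p <= max_p:
        gains.append((p, throughput(2 * p) / throughput(p)))
        p *= 2
    return gains


def knee(gains, min_gain=1.5):
    """Largest p at which doubling still buys at least min_gain x speed."""
    worthwhile = [p for p, gain in gains if gain >= min_gain]
    return max(worthwhile) if worthwhile else None


if __name__ == "__main__":
    gains = doubling_gains()
    for p, gain in gains:
        print(f"{p:5d} -> {2 * p:5d} GPUs: {gain:.2f}x")
    print("knee at P =", knee(gains))
```

Early doublings come out near 2×, later ones drift toward 1×; where you draw the cutoff is a cost decision, not something the curve decides for you.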

3 · The wall that isn't hardware — critical batch size

"Just add data-parallel replicas" sounds free: 07a showed DP's gradient all-reduce is independent of replica count and hides behind the backward pass. So why not run dp = 4,000? Because data parallelism works by growing the global batch:

global_batch (tokens) = dp · micro_batch · grad_accum · seq

And there is a point — the critical batch size B_crit — beyond which a bigger batch stops buying you faster learning. Below B_crit, doubling the batch roughly halves the steps to a target loss (gradient noise dominates, more samples genuinely help). Above it, the gradient is already accurate, so doubling the batch barely reduces the step count — you process 2× the tokens per step and learn almost the same amount. Those extra tokens are wasted compute.
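To see the two regimes in numbers, here is a sketch using one functional form from the large-batch training literature (McCandlish et al.'s empirical model), S(B) = S_min · (1 + B_crit / B). That form is an assumption brought in for illustration; the lesson itself only describes the qualitative behaviour. `s_min` and `b_crit` are hypothetical values.

```python
def steps_to_target(batch_tokens, b_crit=2**21, s_min=100_000):
    """Steps to reach a target loss under S(B) = S_min * (1 + B_crit / B).

    Assumed model: s_min is the step count in the infinite-batch limit,
    b_crit the critical batch size in tokens. Both values are made up.
    """
    return s_min * (1 + b_crit / batch_tokens)


def tokens_to_target(batch_tokens, b_crit=2**21, s_min=100_000):
    """Total tokens processed to reach the same loss at this batch size."""
    return steps_to_target(batch_tokens, b_crit, s_min) * batch_tokens


if __name__ == "__main__":
    b_crit = 2**21
    for label, b in (("B_crit/16", b_crit / 16), ("8*B_crit", 8 * b_crit)):
        saved = steps_to_target(b) / steps_to_target(2 * b)
        extra = tokens_to_target(2 * b) / tokens_to_target(b)
        print(f"doubling from {label}: {saved:.2f}x fewer steps, "
              f"{extra:.2f}x the total tokens")
```

Well below B_crit a doubling cuts the step count almost in half for barely more total tokens; well above it the step count hardly moves and the token bill nearly doubles.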

So data parallelism has a hard cap, set by optimization, not bandwidth
Useful data parallelism is bounded by dp ≤ B_crit / (micro·accum·seq). Past that, two bad options: (a) keep growing the batch beyond B_crit and burn tokens for nearly no convergence gain, or (b) shrink the local tokens to keep the batch fixed — which pushes you below 07a's peak·MFU/BW overlap threshold, so the all-reduce stops hiding and MFU drops. Either way, DP runs out of road. The only way to consume more GPUs is to split the model (TP/PP/EP) — back into the strong-scaling regime of §2, where efficiency decays.

Worked: where DP runs out on a typical pretraining run
Frontier runs target a global batch around B_crit ≈ 2M tokens (it grows over training, but this is the right order of magnitude). With seq = 8K and a per-rank micro·accum of 1 sequence, each DP rank consumes 8K tokens/step, so dp_max ≈ 2,097,152 / 8,192 = 256 (reading 2M as 2^21 and 8K as 2^13; with round decimal figures it comes to 250). Compose with the in-node tp = 8 ceiling (07a) and you can fill dp·tp = 2,048 GPUs at high efficiency on this model. Want to use 8,000? You must add PP (and eat its bubble) or train a bigger model (weak scaling), which is exactly what labs do. The cluster size and the model size are not independent choices; B_crit ties them together.
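The same arithmetic as a small calculator, in case you want to plug in your own run. It takes 2M as 2^21 tokens and 8K as 8,192, which is what makes the cap come out to exactly 256; the function names are illustrative, not from any library.

```python
def global_batch_tokens(dp, micro_batch, grad_accum, seq):
    """global_batch (tokens) = dp * micro_batch * grad_accum * seq."""
    return dp * micro_batch * grad_accum * seq


def dp_max(b_crit_tokens, micro_batch, grad_accum, seq):
    """Largest DP degree that keeps the global batch at or below B_crit."""
    return b_crit_tokens // (micro_batch * grad_accum * seq)


if __name__ == "__main__":
    B_CRIT = 2**21   # ~2M tokens (order-of-magnitude figure from the text)
    SEQ = 8192       # 8K tokens per sequence
    TP = 8           # in-node tensor-parallel ceiling from 07a

    cap = dp_max(B_CRIT, micro_batch=1, grad_accum=1, seq=SEQ)
    print("dp_max          =", cap)
    print("global batch    =", global_batch_tokens(cap, 1, 1, SEQ), "tokens")
    print("GPUs at dp * tp =", cap * TP)
```

Raising micro_batch or grad_accum per rank lowers the cap proportionally, which is why long sequences and large local batches eat into the cheap DP zone.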

4 · Weak scaling is the escape hatch — and why frontier runs use it

Strong scaling decays and DP hits the critical-batch wall. So how do real runs reach 16,000 GPUs at decent MFU? They don't hold the model fixed; they grow it. Weak scaling keeps each GPU's compute constant by making the problem bigger as the cluster grows.

This is the deep reason the field chases scale on all three axes at once (model, data, GPUs) rather than just buying more GPUs for a fixed model: weak scaling is the regime where efficiency holds, and the scaling-law economics reward growing the model anyway. The cluster, the model, and the token budget are co-designed.
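A toy side-by-side makes the contrast concrete. This sketch assumes a fixed per-step comm cost and perfectly divisible compute, and it models "growing the problem" as scaling total compute in proportion to GPU count; the baseline of 256 GPUs and all timings are hypothetical.

```python
def efficiency(total_compute_s, p, comm_s=0.1):
    """compute / (compute + comm) with compute split evenly over p GPUs."""
    compute = total_compute_s / p
    return compute / (compute + comm_s)


BASE_P = 256        # hypothetical baseline cluster size
BASE_WORK = 256.0   # hypothetical single-GPU compute seconds per step


def strong(p):
    """Fixed job: the same total work spread over more GPUs."""
    return efficiency(BASE_WORK, p)


def weak(p):
    """Growing job: total work scales with p, so per-GPU work is constant."""
    return efficiency(BASE_WORK * p / BASE_P, p)


if __name__ == "__main__":
    for p in (256, 1024, 4096, 16384):
        print(f"P={p:6d}  strong={strong(p):.3f}  weak={weak(p):.3f}")
```

In this simplified model the weak-scaling column never moves because per-GPU compute and per-step comm are both held constant; in a real run both grow with the model, and the claim is only that their ratio stays roughly flat.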

[Figure: efficiency (MFU) against GPU count P. Weak scaling: grow the model, efficiency stays flat. Strong scaling: fixed model, comm fraction climbs. DP cap (B_crit): the cheap DP zone ends here, and model parallelism must be added past it.]

Interactive · the scaling-efficiency curve

Set a model and a target global batch, then sweep the cluster size. The widget marks the DP cap (where the critical batch is reached) and plots efficiency in two regimes: cheap data-parallel scaling up to the cap, then decaying model-parallel scaling past it. Push the model size up and watch the whole curve lift and the cap move right — weak scaling, live. Drop the critical batch and watch the cheap zone collapse.

Scaling efficiency — strong vs weak

Assumptions (±30%): DP is free (efficiency ~constant) up to dp_max = B_crit/(micro·accum·seq) with tp=8 in-node; past the cap, extra GPUs come from PP, whose bubble (P-1)/(M+P-1) and rising comm fraction dock efficiency (in the bubble term, P is the number of pipeline stages and M the number of micro-batches, not the GPU count); wider models pay less TP/PP tax (1/h from 07a). MFU base 50%.

Readouts: DP cap (GPUs) · regime at this size · est. efficiency · useful speedup
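If you can't run the widget, a rough stand-in for its two-regime estimate might look like the sketch below. It is deliberately cruder than the stated assumptions: it keeps the flat DP zone and the pipeline bubble, but ignores the rising comm fraction and the 1/h width effect. The cap of 2,048 GPUs comes from the worked example; `micro_batches=32` and the rule "one extra pipeline stage per multiple of the cap" are hypothetical choices.

```python
import math


def pipeline_bubble(stages, micro_batches):
    """Idle fraction of a simple pipeline schedule: (P-1)/(M+P-1)."""
    return (stages - 1) / (micro_batches + stages - 1)


def estimate(gpus, dp_cap_gpus=2048, micro_batches=32, base_mfu=0.50):
    """Two-regime efficiency estimate (toy model, assumptions as labeled).

    Up to dp_cap_gpus: data-parallel zone, efficiency held at base_mfu.
    Past it: extra GPUs are assumed to come from pipeline stages, and
    efficiency is docked by the bubble only.
    """
    if gpus <= dp_cap_gpus:
        return {"regime": "data-parallel", "efficiency": base_mfu}
    stages = math.ceil(gpus / dp_cap_gpus)
    eff = base_mfu * (1 - pipeline_bubble(stages, micro_batches))
    return {"regime": "model-parallel", "stages": stages, "efficiency": eff}


if __name__ == "__main__":
    for gpus in (512, 2048, 4096, 8192, 16384):
        print(gpus, estimate(gpus))
```

Because the comm-fraction term is left out, this will read optimistic past the cap; treat it as an upper bound on what the widget would show.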

What carries forward