The communication tax — when a collective hides behind compute

Lesson 07 named the parallelism axes and called their costs "overlappable" or "bandwidth-heavy." That word was doing a lot of work. This lesson cashes it out: every axis moves a computable number of bytes per step, and whether that traffic is free or fatal comes down to a single inequality — t_comm ≤ t_compute. Get that inequality into your hands and you can predict the MFU of a parallelism plan before you launch it.

The one idea

A collective is free if it finishes before the compute it overlaps with does, and pure overhead otherwise. So for every axis we compute two times — the bytes it moves ÷ the link bandwidth (t_comm), and the FLOPs it can hide behind ÷ the achieved compute rate (t_compute) — and compare. MFU is just how much of the comm you managed to bury. Everything below is filling in those two numbers for each axis.

1 · The cost of a collective is bytes ÷ bandwidth — and it barely depends on P

Lesson 02 (system_ml 02) gave the surprising result we lean on here: a well-implemented ring collective moves a fixed amount of data per GPU regardless of how many GPUs are in the ring. To all-reduce a buffer of S bytes, each GPU sends and receives ≈ 2S bytes total (a reduce-scatter then an all-gather, each ≈ S):

t_allreduce ≈ 2S / BW_link t_allgather ≈ t_reducescatter ≈ S / BW_link t_all2all ≈ S / BW_link

Why "independent of P" is the whole game

Naively, all-reducing across 256 GPUs sounds 256× worse than across 2. It isn't: the ring pipelines the buffer around the loop, so each link only ever carries ≈ 2S no matter how long the loop is. That is why data-parallel scaling is even possible — adding replicas does not inflate the per-step sync cost. The cost is set by the buffer size and the slowest link in the ring, not the GPU count. (Latency does creep up with P — more hops — but for the multi-megabyte buffers in training, bandwidth dominates and we ignore the latency term, per lesson 02's ±30% contract.)

So the entire communication tax reduces to two questions per axis: how big is S, and which link does it cross — the fast in-node NVLink (~900 GB/s) or the slow inter-node InfiniBand (~50 GB/s per GPU)? That 18× gap (lesson 02) is the single most important number on this page.

2 · The buffer size for each axis

Two kinds of buffers move in a training step. Parameter/gradient-sized buffers scale with the model (∝ N); activation-sized buffers scale with the work in flight (∝ b·s·h, the local microbatch). Each axis moves one or the other:

Axis	Collective	Buffer S	How often	Link
DP (DDP)	all-reduce grads	≈ 2N bytes (whole model)	once / step	inter-node
FSDP / ZeRO-3	all-gather params + reduce-scatter grads	≈ 2N per gather, per layer	every layer, fwd & bwd	inter-node
TP	all-reduce activations	≈ 2·b·s·h bytes	4× per layer	in-node (must)
PP	point-to-point send	≈ 2·b·s·h at the cut	once / microbatch	inter-node (ok)
EP (MoE)	all-to-all dispatch + combine	≈ 2·b·s·h bytes	2× per MoE layer	in-node ideal

The split is the lesson in one table: DP/FSDP move the model (huge buffer, but only at gradient boundaries you can hide behind a whole backward pass), while TP/PP/EP move activations (smaller buffer, but many times per layer, often on the critical path). That difference — buffer size vs. frequency — is why each axis lives where it does.

3 · The overlap inequality — the heart of the lesson

A collective is free only if it finishes inside a window of compute that doesn't depend on its result. Write that as one inequality:

overlap holds ⟺ t_comm ≤ t_compute_window

Now apply it to the two regimes and watch two clean, model-size-independent thresholds fall out.

DP gradient sync — the "enough local tokens" threshold

In DDP the gradient all-reduce can overlap the backward pass: gradients become ready layer-by-layer, and you reduce each as it lands while later layers still compute. The window is the whole backward pass.

Comm: all-reduce the gradient buffer, S = 2N → t_comm ≈ 2·(2N)/BW_net = 4N/BW_net.
Compute: the backward pass is 4N FLOPs per local token (lesson 02's 6N = 2N fwd + 4N bwd) → t_bwd ≈ 4N·t_local /(peak·MFU), where t_local = micro-batch × seq tokens processed on this GPU.

Set t_comm ≤ t_bwd and the model size N cancels:

4N/BW_net ≤ 4N·t_local/(peak·MFU) ⟹ t_local ≥ peak·MFU / BW_net

Worked: how many local tokens hide a gradient all-reduce on H100 + IB?

t_local ≥ (990e12 · 0.4) / 50e9 ≈ 7,900 tokens per GPU per step. So if each GPU processes a micro-batch×seq of only ~1K tokens, the all-reduce can't hide and MFU bleeds; push the local token count to ~8K+ (bigger micro-batch, or gradient accumulation) and the sync disappears under the backward pass. This is why large local batches and gradient accumulation aren't just convergence knobs — they are what makes data-parallel sync free. Note what's absent: the threshold doesn't depend on model size or GPU count, only on the hardware ratio peak/BW. (FSDP/ZeRO-3 raises the bar — it also all-gathers params every layer in both passes, ~1.5× DDP's volume — so it leans harder on per-layer prefetch to stay hidden; system_ml 05.)

TP activation sync — the "wide enough model" threshold

TP's all-reduces sit on the critical path — the next matmul needs the reduced activation — so they overlap poorly. Here the right question isn't "does it hide" but "how big is the tax as a fraction of compute." Work it per pass to keep both sides consistent: the forward does 2 all-reduces per layer (one after attention, one after the MLP), each moving 2S = 4·b·s·h bytes, against the layer's forward compute of 2·12h² = 24h² FLOPs/token (attention 4h² + MLP 8h² params, ×2). The backward mirrors both, so the ratio is the same:

tax_TP = t_comm/t_compute ≈ (8·b·s·h / BW) ÷ (24·h²·b·s / peak) = ⅓ · peak / (BW · h)

Worked: why TP=8 is fine in-node and lethal across nodes

Take h = 8192. In-node (NVLink 900 GB/s): tax ≈ ⅓ · 990e12/(900e9·8192) ≈ 4.5% — a tolerable overhead, MFU dips a little. Across nodes (IB 50 GB/s): tax ≈ ⅓ · 990e12/(50e9·8192) ≈ 0.8, i.e. comm takes ~80% as long as the compute — the GPU spends nearly as much time talking as computing, and it's on the critical path so it can't hide. That 18× ratio between the two answers is lesson 07's "tp ≤ 8" rule, now derived rather than asserted. Notice the 1/h: wider models pay less TP tax, which is why TP scales gracefully on big models and chokes on small ones.

4 · PP and EP — the other two taxes, briefly

Pipeline parallel moves the smallest buffer of all — one activation tensor at each stage cut, a cheap point-to-point send, not a collective. That's why PP survives crossing nodes (lesson 07). Its tax isn't bandwidth, it's the bubble: (pp−1)/(M+pp−1) of the GPUs idle during fill/drain. PP trades a bandwidth problem for a scheduling problem — pay it down with more microbatches M, not more bandwidth.

Expert parallel moves activations too, but through an all-to-all (every GPU ships each token to whichever GPU holds its chosen expert, then ships results back). All-to-all is latency-sensitive and bursty — it doesn't pipeline as cleanly as a ring — so EP wants to stay in-node when it can, and its tax rises with how scattered the routing is. The buffer is activation-sized (∝ b·s·h), so like TP it benefits from being hidden behind the expert FFN's compute (system_ml 09).

The placement rule falls straight out of the buffer sizes

Rank the axes by how often × how big × on critical path? — TP (4×/layer, on path) is the chattiest, so it gets the fastest link and stays in-node. EP (2×/MoE-layer, all-to-all) wants in-node too. PP (1×/microbatch, tiny p2p) tolerates the slow inter-node link. DP/FSDP (1×/step, but huge buffer hidden behind a whole backward) fills whatever's left. This ordering — TP innermost, DP outermost — isn't taste; it's the buffer-size table sorted by communication intensity.

Interactive · the overlap calculator

Pick a model width, a parallelism degree per axis, and the links. The widget computes each axis's t_comm and the compute window it hides behind, then reports the tax and a rough MFU. Two flips to find: push TP across a node (set TP>8) and watch its tax explode 18×; shrink the local tokens below the peak·MFU/BW threshold and watch the DP all-reduce stop hiding. The binding tax is highlighted — that's the one to relax.

Communication tax / overlap calculator

Assumptions (lesson 02 precision, ±30%): ring collectives, bandwidth-bound (latency term ignored); grad buffer = 2N bf16; TP = 2 activation all-reduces/layer/pass vs 24h²/token forward compute (the ⅓·peak/(BW·h) ratio); NVLink 900 GB/s in-node, IB 50 GB/s/GPU inter-node; TP>8 forces the inter-node link. MFU starts at 50% and is docked by exposed comm and the PP bubble (M=32).

hidden size h 8192 layers L 80 TP degree 8 PP degree 1 local tokens (micro·seq) 8192

TP tax (in-node?)

–

DP all-reduce hides?

–

PP bubble

–

est. MFU

–

What carries forward

A collective costs bytes ÷ bandwidth, and a ring makes it ≈ independent of GPU count — so the tax is set by buffer size and which link it crosses, not by P.
Two buffer families: DP/FSDP move the model (∝ 2N, rare, hidden behind a backward); TP/PP/EP move activations (∝ b·s·h, frequent, often on the critical path).
The overlap inequality t_comm ≤ t_compute is the whole story. It yields two model-size-independent thresholds: DP hides when t_local ≥ peak·MFU/BW_net (~8K tok/GPU on IB); TP tax ≈ ⅓·peak/(BW·h) — ~5% in-node, ~80% across nodes.
The 18× NVLink-vs-IB gap, made quantitative, is why TP stays in-node and PP may cross it — and why wider models (1/h) pay less TP tax.
MFU is just the fraction of comm you buried. Most "why is MFU 20%?" bugs are one exposed collective — find it with this inequality before touching code.