The communication tax — when a collective hides behind compute
Lesson 07 named the parallelism axes and called their costs "overlappable" or "bandwidth-heavy." Those labels were doing a lot of work. This lesson cashes them out: every axis moves a computable number of bytes per step, and whether that traffic is free or fatal comes down to a single inequality, t_comm ≤ t_compute. With that inequality in hand, you can estimate the MFU of a parallelism plan before you launch it.
1 · The cost of a collective is bytes ÷ bandwidth — and it barely depends on P
Lesson 02 (system_ml 02) gave the surprising result we lean on here: a well-implemented ring collective moves a nearly fixed amount of data per GPU regardless of how many GPUs are in the ring. To all-reduce a buffer of S bytes across P GPUs, each GPU sends and receives ≈ 2S bytes total (a reduce-scatter then an all-gather, each ≈ S), so over a link of bandwidth BW:
t_comm ≈ 2·(P−1)/P · S/BW ≈ 2S/BW
So the entire communication tax reduces to two questions per axis: how big is S, and which link does it cross — the fast in-node NVLink (~900 GB/s) or the slow inter-node InfiniBand (~50 GB/s per GPU)? That 18× gap (lesson 02) is the single most important number on this page.
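The bytes ÷ bandwidth rule can be checked numerically. The following is a minimal sketch, assuming a bandwidth-only model of a ring all-reduce (latency and protocol overhead ignored); the function name and the 1 GB buffer are hypothetical, and the link speeds are the approximate figures quoted above.

```python
def ring_allreduce_seconds(size_bytes: float, bw_bytes_per_s: float, num_gpus: int) -> float:
    """Bandwidth-only time for a ring all-reduce of `size_bytes` (latency ignored)."""
    if num_gpus < 2:
        return 0.0  # nothing to exchange on a single GPU
    # Reduce-scatter and all-gather each move (P-1)/P of the buffer per GPU.
    bytes_per_gpu = 2.0 * (num_gpus - 1) / num_gpus * size_bytes
    return bytes_per_gpu / bw_bytes_per_s


NVLINK_BW = 900e9  # ~900 GB/s in-node (figure from the text)
IB_BW = 50e9       # ~50 GB/s per GPU inter-node (figure from the text)
BUFFER = 1e9       # hypothetical 1 GB buffer

for p in (2, 8, 64, 1024):
    t_nv = ring_allreduce_seconds(BUFFER, NVLINK_BW, p)
    t_ib = ring_allreduce_seconds(BUFFER, IB_BW, p)
    print(f"P={p:5d}  NVLink {t_nv * 1e3:7.3f} ms   IB {t_ib * 1e3:7.3f} ms")
```

Under this model, going from 8 to 1024 GPUs changes the time by less than 15%, while switching links changes it by the full 18×.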
2 · The buffer size for each axis
Two kinds of buffers move in a training step. Parameter/gradient-sized buffers scale with the model (∝ N); activation-sized buffers scale with the work in flight (∝ b·s·h, the local microbatch). Each axis moves one or the other:
| Axis | Collective | Buffer S | How often | Link |
|---|---|---|---|---|
| DP (DDP) | all-reduce grads | ≈ 2N bytes (whole model) | once / step | inter-node |
| FSDP / ZeRO-3 | all-gather params + reduce-scatter grads | ≈ 2N bytes per full pass, moved one layer's shard at a time | every layer, fwd & bwd | inter-node |
| TP | all-reduce activations | ≈ 2·b·s·h bytes | 4× per layer | in-node (must) |
| PP | point-to-point send | ≈ 2·b·s·h at the cut | once / microbatch | inter-node (ok) |
| EP (MoE) | all-to-all dispatch + combine | ≈ 2·b·s·h bytes | 2× per MoE layer | in-node ideal |
The split is the lesson in one table: DP/FSDP move the model (huge buffer, but only at gradient boundaries you can hide behind a whole backward pass), while TP/PP/EP move activations (smaller buffer, but many times per layer, often on the critical path). That difference — buffer size vs. frequency — is why each axis lives where it does.
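To make the two buffer families concrete, here is a small sketch that evaluates the table's buffer formulas for one configuration. The model size, layer count, and microbatch shape are hypothetical values chosen only for illustration, and 2 bytes per element (bf16) is assumed to match the table's factor of 2.

```python
BYTES_PER_ELEMENT = 2  # assumed bf16/fp16, the factor of 2 in the table


def model_buffer_bytes(n_params: float) -> float:
    """Parameter/gradient-sized buffer: scales with the model (2N bytes)."""
    return BYTES_PER_ELEMENT * n_params


def activation_buffer_bytes(b: int, s: int, h: int) -> int:
    """Activation-sized buffer: scales with the local microbatch (2*b*s*h bytes)."""
    return BYTES_PER_ELEMENT * b * s * h


# Hypothetical configuration, not taken from the lesson.
N_PARAMS = 70e9
LAYERS = 80
B, S, H = 1, 4096, 8192

act = activation_buffer_bytes(B, S, H)
plan = {
    "DP": (model_buffer_bytes(N_PARAMS), 1),  # whole model, once per step
    "TP": (act, 4 * LAYERS),                  # activations, 4x per layer
    "PP": (act, 1),                           # one send per microbatch (one microbatch here)
}
for axis, (buf, count) in plan.items():
    print(f"{axis}: buffer {buf / 2**20:10.1f} MiB, moved {count}x")
```

With these numbers the DP buffer is about 2000× larger than a single TP buffer, but the TP buffer moves 320 times per microbatch.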
3 · The overlap inequality — the heart of the lesson
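A collective is free only if it finishes inside a window of compute that doesn't depend on its result. Write that as one inequality:
t_comm ≤ t_compute
Here t_compute is the duration of the independent compute the collective can run behind. Whatever part of t_comm exceeds it is exposed on the critical path.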
Now apply it to the two regimes and watch two clean thresholds fall out, neither of which depends on the parameter count N.
DP gradient sync — the "enough local tokens" threshold
In DDP the gradient all-reduce can overlap the backward pass: gradients become ready layer-by-layer, and you reduce each as it lands while later layers still compute. The window is the whole backward pass.
- Comm: all-reduce the gradient buffer, S = 2N → t_comm ≈ 2·(2N)/BW_net = 4N/BW_net.
- Compute: the backward pass is 4N FLOPs per local token (lesson 02's 6N = 2N fwd + 4N bwd) → t_bwd ≈ 4N·t_local/(peak·MFU), where t_local is a token count, not a time: micro-batch × seq, the tokens processed on this GPU per step.
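Set t_comm ≤ t_bwd and the model size N cancels:
4N/BW_net ≤ 4N·t_local/(peak·MFU)  ⟹  t_local ≥ peak·MFU/BW_net
On InfiniBand that works out to roughly 8K tokens per GPU.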
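The DP threshold is easy to evaluate directly. This is a sketch under stated assumptions: the peak throughput (989 TFLOP/s) and the MFU (0.4) are hypothetical inputs, not values given in the lesson, picked because they land near the roughly 8K tokens per GPU the lesson quotes for InfiniBand. The function names are illustrative.

```python
def dp_min_local_tokens(peak_flops: float, mfu: float, net_bw: float) -> float:
    """Smallest local token count for which the gradient all-reduce hides."""
    return peak_flops * mfu / net_bw


def dp_allreduce_hidden(n_params: float, local_tokens: float,
                        peak_flops: float, mfu: float, net_bw: float) -> bool:
    """Evaluate t_comm <= t_bwd with both sides written out explicitly."""
    t_comm = 4.0 * n_params / net_bw                             # all-reduce of a 2N-byte buffer
    t_bwd = 4.0 * n_params * local_tokens / (peak_flops * mfu)   # backward pass, 4N FLOPs/token
    return t_comm <= t_bwd


PEAK = 989e12  # assumed peak FLOP/s (hypothetical accelerator)
MFU = 0.4      # assumed utilization
IB_BW = 50e9   # ~50 GB/s per GPU (figure from the text)

print(f"threshold: {dp_min_local_tokens(PEAK, MFU, IB_BW):.0f} tokens/GPU")
for n in (7e9, 70e9, 700e9):
    # The verdict is the same for every N: model size cancels.
    print(n, dp_allreduce_hidden(n, 8192, PEAK, MFU, IB_BW),
          dp_allreduce_hidden(n, 4096, PEAK, MFU, IB_BW))
```

Sweeping N while holding the local token count fixed is a quick way to confirm that model size does not move the threshold.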
TP activation sync — the "wide enough model" threshold
TP's all-reduces sit on the critical path: the next matmul needs the reduced activation, so they overlap poorly. Here the right question isn't "does it hide" but "how big is the tax as a fraction of compute." Work it per pass to keep both sides consistent: the forward does 2 all-reduces per layer (one after attention, one after the MLP), each moving 2S = 4·b·s·h bytes, against the layer's forward compute of 2·12h² = 24h² FLOPs/token (attention 4h² + MLP 8h² params, ×2). The backward mirrors both, so the ratio is the same:
TP tax = t_comm/t_compute ≈ (2 · 4·b·s·h / BW) ÷ (24h² · b·s / peak) = ⅓·peak/(BW·h)
The microbatch b·s cancels and the tax scales as 1/h: about 5% over NVLink, about 80% over InfiniBand.
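The TP ratio can be computed both from its per-layer terms and from the simplified closed form, which is a useful check that the microbatch really cancels. This sketch assumes a peak of 989 TFLOP/s and a width h = 8192; both are hypothetical inputs, not values stated in the lesson, chosen because they land near the lesson's ~5% and ~80% figures. It follows the lesson's formula as written (peak rather than peak·MFU in the compute term).

```python
def tp_tax_from_terms(b: int, s: int, h: int, peak_flops: float, bw: float) -> float:
    """Forward-pass TP tax for one layer, built from the comm and compute terms."""
    t_comm = 2 * (4 * b * s * h) / bw             # 2 all-reduces, each moving 4*b*s*h bytes
    t_compute = 24 * h * h * (b * s) / peak_flops  # 24*h^2 FLOPs per token, b*s tokens
    return t_comm / t_compute


def tp_tax(peak_flops: float, bw: float, h: int) -> float:
    """Closed form after b*s cancels: peak / (3 * BW * h)."""
    return peak_flops / (3.0 * bw * h)


PEAK = 989e12      # assumed peak FLOP/s (hypothetical accelerator)
H = 8192           # assumed model width
NVLINK_BW = 900e9  # figure from the text
IB_BW = 50e9       # figure from the text

print(f"in-node:    {tp_tax(PEAK, NVLINK_BW, H):.1%}")
print(f"cross-node: {tp_tax(PEAK, IB_BW, H):.1%}")
```

Doubling h halves the tax in this model, which is the quantitative form of "wider models pay less."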
4 · PP and EP — the other two taxes, briefly
Pipeline parallel moves the smallest buffer of all: one activation tensor at each stage cut, sent as a cheap point-to-point transfer rather than a collective. That's why PP survives crossing nodes (lesson 07). Its tax isn't bandwidth, it's the bubble: each GPU sits idle for a fraction (pp−1)/(M+pp−1) of the step during fill and drain, where pp is the number of pipeline stages and M the number of microbatches. PP trades a bandwidth problem for a scheduling problem, so pay it down with more microbatches M, not more bandwidth.
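The bubble formula above can be turned into a quick sizing aid. This is a minimal sketch; the helper names and the 10% target are hypothetical, and the formula is taken as given for a simple fill/drain schedule.

```python
def bubble_fraction(pp: int, microbatches: int) -> float:
    """Idle fraction of a pipeline step: (pp - 1) / (M + pp - 1)."""
    return (pp - 1) / (microbatches + pp - 1)


def min_microbatches(pp: int, target_bubble: float) -> int:
    """Smallest M whose bubble is at or below the target, found by linear search."""
    m = 1
    while bubble_fraction(pp, m) > target_bubble:
        m += 1
    return m


for m in (8, 32, 128):
    print(f"pp=8, M={m:3d}: bubble {bubble_fraction(8, m):.1%}")
print("M needed for <=10% bubble at pp=8:", min_microbatches(8, 0.10))
```

The search loop is used instead of solving for M algebraically to sidestep floating-point rounding at the boundary.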
Expert parallel moves activations too, but through an all-to-all (every GPU ships each token to whichever GPU holds its chosen expert, then ships results back). All-to-all is latency-sensitive and bursty, and it doesn't pipeline as cleanly as a ring, so EP wants to stay in-node when it can, and its tax rises with how scattered the routing is. The buffer is activation-sized (∝ b·s·h), as with TP, so the tax shrinks to the extent the all-to-all can be overlapped with the expert FFN's compute (system_ml 09).
Interactive · the overlap calculator
Pick a model width, a parallelism degree per axis, and the links. The widget computes each axis's t_comm and the compute window it hides behind, then reports the tax and a rough MFU. Two flips to find: push TP across a node (set TP>8) and watch its tax explode 18×; shrink the local tokens below the peak·MFU/BW threshold and watch the DP all-reduce stop hiding. The binding tax is highlighted — that's the one to relax.
What carries forward
- A collective costs bytes ÷ bandwidth, and a ring makes it ≈ independent of GPU count — so the tax is set by buffer size and which link it crosses, not by P.
- Two buffer families: DP/FSDP move the model (∝ 2N, rare, hidden behind a backward); TP/PP/EP move activations (∝ b·s·h, frequent, often on the critical path).
- The overlap inequality t_comm ≤ t_compute is the whole story. It yields two thresholds that don't depend on the parameter count N: DP hides when t_local ≥ peak·MFU/BW_net (~8K tok/GPU on IB); TP tax ≈ ⅓·peak/(BW·h), which depends only on the width h and comes to ~5% in-node, ~80% across nodes.
- The 18× NVLink-vs-IB gap, made quantitative, is why TP stays in-node and PP may cross it — and why wider models (1/h) pay less TP tax.
- MFU is just the fraction of comm you buried. Most "why is MFU 20%?" bugs are one exposed collective — find it with this inequality before touching code.