Collectives — the seven primitives

Every distributed-training cost in this series is a sum of collective costs. This lesson is the vocabulary and the ring-AllReduce derivation that you'll reuse for the rest of the series.

The setting

N processes (one per GPU, by convention). Each one holds a buffer; they need to combine their buffers somehow. There are essentially seven shapes of "combine", and learning them once removes 80% of the magic from FSDP, TP, EP, and pipeline parallel. Vocabulary:

Collective	Input on rank i	Output on rank i	Where it appears
Broadcast	(only on root)	Root's tensor	Init weights, checkpoint reload
Reduce	x_i	Σ x_j on root only	Log aggregation
AllReduce	x_i	Σ x_j on every rank	DDP gradient sync, TP output
Gather	x_i	[x₀, …, x_N-1] on root	Log gather
AllGather	x_i (a shard)	Full concatenation on every rank	FSDP forward (reconstruct full layer)
ReduceScatter	x_i (full)	i-th piece of Σ x_j	FSDP backward (shard the gradient)
AllToAll	[x_i,0, …, x_i,N-1]	[x_0,i, …, x_N-1,i]	MoE routing, transpose

AllToAll is the one that catches people. Read its row carefully: rank i begins holding one chunk destined for each rank, and ends holding one chunk from each rank. It is a transpose of the (rank × chunk) matrix. Cost-wise, each rank sends N-1 distinct messages instead of one, so it tends to be latency-bound rather than bandwidth-bound. Lesson 09 (MoE) is where it bites.
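The table's semantics can be written down as single-process reference functions. This is a sketch for checking your understanding, not how NCCL implements anything: the function names and the list-of-lists layout (`bufs[i]` is rank i's buffer) are illustrative assumptions.

```python
# Single-process reference semantics. bufs[i] is rank i's input buffer.
# Each function returns one output per rank (None where a rank receives nothing).

def _total(bufs):
    """Element-wise sum across ranks."""
    return [sum(col) for col in zip(*bufs)]

def broadcast(bufs, root=0):
    return [list(bufs[root]) for _ in bufs]

def reduce(bufs, root=0):
    return [_total(bufs) if i == root else None for i in range(len(bufs))]

def all_reduce(bufs):
    return [_total(bufs) for _ in bufs]

def gather(bufs, root=0):
    return [[list(b) for b in bufs] if i == root else None for i in range(len(bufs))]

def all_gather(shards):
    full = [v for shard in shards for v in shard]
    return [list(full) for _ in shards]

def reduce_scatter(bufs):
    n, total = len(bufs), _total(bufs)
    size = len(total) // n  # assumes the buffer length is divisible by n
    return [total[i * size:(i + 1) * size] for i in range(n)]

def all_to_all(chunks):
    # chunks[i][j] is the chunk rank i holds for rank j; the result is the transpose.
    n = len(chunks)
    return [[chunks[j][i] for j in range(n)] for i in range(n)]

bufs = [[1, 2, 3, 4], [10, 20, 30, 40]]            # N = 2 ranks
summed = all_reduce(bufs)                           # every rank gets the full sum
slices = reduce_scatter(bufs)                       # rank i gets slice i of the sum
routed = all_to_all([["a0", "a1"], ["b0", "b1"]])   # rank j collects chunk j from everyone
```

Note that `all_to_all` is its own inverse: applying it twice returns the original layout, which is what "transpose" means here.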

2D · the seven primitives, side by side

Each primitive is a different rearrangement of "rank × chunk" tiles. The animation below shows the input layout on the left, the output on the right, and the per-rank moves as colored arrows. Tab through them. Notice how AllGather and ReduceScatter are exact complements — they're two halves of one AllReduce.

The identity that explains FSDP

AllReduce(x) = AllGather( ReduceScatter(x) )

Stare at it. ReduceScatter reduces and then leaves each rank holding only its own slice of the sum. AllGather then redistributes those slices so every rank has the whole sum again. Two steps, same final answer as a one-step AllReduce.

Why care? Because if you only need your slice of the sum (say, because you only own that slice of the optimizer state, as in ZeRO-1), you can stop after ReduceScatter. You've now done half of an AllReduce. Conversely, if every rank starts with its own slice (because you sharded the weights, as in ZeRO-3) and you need the full thing for the next forward, you do an AllGather. The two halves of AllReduce can be charged separately by different optimizations. This is the trick lesson 05 turns into ZeRO.
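The identity is easy to check numerically. The sketch below uses plain Python lists as stand-ins for per-rank buffers; the helper names are illustrative and the buffer length is assumed to be divisible by the number of ranks.

```python
import random

def all_reduce(bufs):
    total = [sum(col) for col in zip(*bufs)]
    return [list(total) for _ in bufs]

def reduce_scatter(bufs):
    n = len(bufs)
    total = [sum(col) for col in zip(*bufs)]
    size = len(total) // n  # assumes len(total) is divisible by n
    return [total[i * size:(i + 1) * size] for i in range(n)]

def all_gather(shards):
    full = [v for shard in shards for v in shard]
    return [list(full) for _ in shards]

random.seed(0)
N, S = 4, 8  # 4 ranks, 8 elements per buffer
bufs = [[random.randint(-9, 9) for _ in range(S)] for _ in range(N)]

slices = reduce_scatter(bufs)  # "half an AllReduce": each rank keeps S/N summed elements
identity_holds = all_gather(slices) == all_reduce(bufs)
```

Stopping after `reduce_scatter` leaves each rank with `S // N` elements, which is the state a sharded optimizer wants.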

Ring AllReduce — the bandwidth-optimal derivation

The naive AllReduce: send your tensor to a central root, sum, send back. Cost per rank: 2 · S bytes (one up, one down), but the root is hammered and the link to it is the bottleneck. With all other ranks sending simultaneously, the root's link receives (N − 1) · S bytes, which scales linearly with cluster size. Bad.

Ring AllReduce does better. Arrange the N ranks in a logical ring: rank i sends only to rank (i+1) mod N. Each tensor is split into N chunks. Now two passes:

ReduceScatter pass (N-1 steps). At step t (t = 0, …, N−2), rank i sends chunk (i-t) mod N to its neighbour and receives chunk (i-t-1) mod N, which it adds to its local copy of that chunk. After N-1 steps, rank i has the fully reduced version of chunk (i+1) mod N (the chunk it'll be "responsible for" in the second pass), so every rank owns exactly one fully reduced chunk.
AllGather pass (N-1 steps). The chunk each rank now owns gets passed around the ring once. After N-1 more steps, every rank has every chunk fully reduced.
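The two passes above can be simulated in one process to check the chunk indexing. This is a minimal sketch, assuming steps are numbered from t = 0 and that the buffer length is divisible by N; `ring_all_reduce`, `data`, and `sent` are illustrative names, and real implementations pipeline and overlap these transfers.

```python
def ring_all_reduce(bufs):
    """Simulate ring AllReduce. Returns (per-rank outputs, elements sent per rank)."""
    n = len(bufs)
    size = len(bufs[0]) // n  # chunk length; assumes divisibility by n
    # data[i][c] is rank i's local copy of chunk c
    data = [[list(b[c * size:(c + 1) * size]) for c in range(n)] for b in bufs]
    sent = [0] * n

    # ReduceScatter pass: at step t, rank i sends chunk (i - t) mod n to rank i + 1.
    for t in range(n - 1):
        # Snapshot all outgoing messages first so the step is "simultaneous".
        msgs = [((i - t) % n, list(data[i][(i - t) % n])) for i in range(n)]
        for i, (c, payload) in enumerate(msgs):
            dst = (i + 1) % n
            data[dst][c] = [a + b for a, b in zip(data[dst][c], payload)]
            sent[i] += len(payload)
    # Here rank i holds the fully reduced chunk (i + 1) mod n.

    # AllGather pass: at step t, rank i forwards chunk (i + 1 - t) mod n.
    for t in range(n - 1):
        msgs = [((i + 1 - t) % n, list(data[i][(i + 1 - t) % n])) for i in range(n)]
        for i, (c, payload) in enumerate(msgs):
            dst = (i + 1) % n
            data[dst][c] = payload  # overwrite with the fully reduced chunk
            sent[i] += len(payload)

    out = [[v for chunk in data[i] for v in chunk] for i in range(n)]
    return out, sent

N = 4
bufs = [[10 * i + k for k in range(8)] for i in range(N)]
out, sent = ring_all_reduce(bufs)
expected = [sum(col) for col in zip(*bufs)]
```

With N = 4 and 8 elements per buffer, each rank sends 2 · (N − 1) = 6 messages of 2 elements, so `sent` is 12 per rank, matching 2(N − 1)/N · S.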

Per rank, in total: 2(N-1) messages, each of size S/N. Total bytes moved per rank:

bytes_per_rank = 2 · (N − 1) · S / N ≈ 2 · S (large N)

This is the punchline of the lesson. Per-rank traffic of a ring AllReduce is independent of cluster size for large N. Adding a 257th GPU doesn't change how many bytes each existing GPU has to send. The link load is balanced across all N ring edges.
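A few lines of arithmetic make the punchline concrete. The function names are illustrative; the naive-root figure counts the N − 1 incoming copies on the root's link.

```python
def ring_bytes_per_rank(n, s):
    """Bytes each rank sends in a ring AllReduce of an s-byte tensor."""
    return 2 * (n - 1) * s / n

def naive_root_bytes_received(n, s):
    """Bytes arriving at the root when every other rank sends its full tensor."""
    return (n - 1) * s

S = 1.0  # tensor size in GB, so results are in GB
ring_256 = ring_bytes_per_rank(256, S)
ring_257 = ring_bytes_per_rank(257, S)
root_256 = naive_root_bytes_received(256, S)

for n in (2, 8, 64, 256, 257):
    print(f"N={n:4d}  ring per rank: {ring_bytes_per_rank(n, S):.4f} GB  "
          f"naive root inbound: {naive_root_bytes_received(n, S):.0f} GB")
```

The ring column creeps toward 2 GB and then barely moves, while the naive root's inbound traffic grows with every rank added.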

Why the factor of 2

ReduceScatter contributes (N−1)/N · S bytes, and AllGather contributes the same. Adding them gives 2(N−1)/N · S. The "2" is doing real work: any AllReduce must move at least 2S(N−1)/N per rank (information-theoretic lower bound), and ring AllReduce hits it. Other topologies (tree, butterfly) trade against this — see below.

In-network reduction: NVLS & SHARP

That 2(N−1)/N · S lower bound assumes the reduction arithmetic happens on the GPUs. The lower bound dissolves if the switch itself can do arithmetic. Then the sum happens in the fabric, and a GPU no longer has to receive everyone else's data to compute it.

NVLink SHARP / NVLS. The NVSwitch can multicast and reduce in-switch. Each GPU sends its buffer once into the switch; the switch sums the streams and multicasts the reduced result back. So per rank: one send of S up, one receive of S down, roughly S each way instead of the ring's 2(N−1)/N · S. The 2(N−1)/N factor and its scan-around-the-ring serialization both vanish; the GPU moves roughly half the bytes of a flat ring, and it moves them in one shot.

InfiniBand SHARP does the same trick one tier up: the IB spine switches reduce inter-node AllReduce traffic in the fabric, so each node sends its partial once rather than ring-passing it around all the nodes.
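The "roughly half the bytes" claim follows directly from the two per-rank formulas. The sketch below is idealised byte counting only (no protocol overhead, no chunking), with illustrative function names.

```python
def ring_send_bytes(n, s):
    """Per-rank bytes sent by a flat ring AllReduce."""
    return 2 * (n - 1) * s / n

def in_switch_send_bytes(s):
    """Per-rank bytes sent when the switch does the reduction (idealised: one copy up)."""
    return s

ratio_8 = ring_send_bytes(8, 1.0) / in_switch_send_bytes(1.0)        # 8 GPUs in a node
ratio_1024 = ring_send_bytes(1024, 1.0) / in_switch_send_bytes(1.0)  # approaches 2 for large N
```

For an 8-GPU node the ring sends 1.75× what the in-switch path sends; the ratio tends to 2 as N grows.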

Why a switch-resident collective hits near peak

This is the mechanism behind the claim lesson 03 leans on: a switch-resident AllReduce runs near peak NVLink bandwidth (~900 GB/s) while a flat ring realises only ~150 GB/s. The ring pays the 2(N−1)/N algorithm-bandwidth factor (every byte crosses a link ≈ twice) plus protocol and pipelining overhead; in-network reduction moves each byte ≈ once and avoids both. Lesson 03 derives the full peak-vs-effective cost model.

Caveats: the in-switch arithmetic only helps reductions, meaning AllReduce and ReduceScatter (Broadcast benefits from the multicast half alone). AllGather and AllToAll move distinct data and get no arithmetic speedup. It needs hardware and driver support (NVSwitch with SHARP, recent NCCL + CUDA). And it is what NCCL's NVLS algorithm uses: NCCL selects that algorithm when it detects the capability, so you'll see it chosen automatically for intra-node AllReduce on H100-class nodes.

Animated · ring AllReduce on N=4 ranks, step by step

Four ranks arranged in a ring. The tensor is split into 4 chunks. Each rank's row shows what it holds; rows light up as chunks accumulate sums. Press step 6 times: three ReduceScatter steps fold the chunks down to one fully-summed chunk per rank, then three AllGather steps redistribute. By the end, every rank holds the entire summed tensor. The counter shows bytes moved per rank — which never exceeds 2(N-1)/N · S.

Time, not just bytes

Bandwidth says how fast you can stream data once a connection is open. Latency says how long it takes to open one. Real cost is a mix:

time = α · (number_of_messages) + β · (total_bytes)

For a ring AllReduce of size S with bandwidth BW = 1/β and per-message latency α:

T_ring = 2(N−1) · α + 2(N−1)/N · S / BW

Two regimes pop out:

Big tensors. The S/BW term dominates; cost is ~2S/BW regardless of N. Ring is great.
Small tensors. The 2(N−1) · α latency term dominates; ring is O(N) in latency. Bad. Use a tree (latency O(log N)) or, better, batch many small reduces together (this is exactly what DDP's bucket size knob in lesson 04 controls).

The crossover sets the "bucket size" you'll see in PyTorch DDP and similar: bundle gradients until the bucket is big enough that β · S dominates α · N. Default in PyTorch DDP is 25 MB (the bucket_cap_mb argument).
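Setting the two terms of T_ring equal gives the crossover size directly: 2(N−1) · α = 2(N−1)/N · S / BW, so S = α · N · BW. A small sketch, using the calculator defaults listed later in this lesson (N = 64, α = 10 μs, inter-node BW = 50 GB/s) as illustrative inputs; the function names are hypothetical.

```python
def ring_terms(n, s_bytes, alpha_s, bw_bytes_per_s):
    """Return (latency term, bandwidth term) of T_ring, both in seconds."""
    latency = 2 * (n - 1) * alpha_s
    bandwidth = 2 * (n - 1) / n * s_bytes / bw_bytes_per_s
    return latency, bandwidth

def crossover_bytes(n, alpha_s, bw_bytes_per_s):
    """Payload size at which the latency and bandwidth terms are equal."""
    return alpha_s * n * bw_bytes_per_s

N, ALPHA, BW = 64, 10e-6, 50e9  # illustrative values, not measurements
s_star = crossover_bytes(N, ALPHA, BW)
lat, bw_term = ring_terms(N, s_star, ALPHA, BW)

for s in (1e5, 1e6, s_star, 1e9):
    l, b = ring_terms(N, s, ALPHA, BW)
    regime = "latency-bound" if l > b else "bandwidth-bound"
    print(f"S={s / 1e6:9.1f} MB  latency={l * 1e3:.3f} ms  bandwidth={b * 1e3:.3f} ms  {regime}")
```

Under these assumptions the crossover lands at about 32 MB, the same order of magnitude as DDP's default bucket size. A different α or BW moves it proportionally.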

Hierarchical AllReduce — the production trick

Inside a node, NVLink + NVSwitch give each GPU ~900 GB/s of aggregate bidirectional bandwidth (the per-pair share depends on how much other traffic is using the switch). Between nodes, InfiniBand gives one IB port ~50 GB/s. Asymmetry of ~18× per link. A ring that spans both intra- and inter-node links is governed by the slow link.

Hierarchical AllReduce respects the asymmetry:

Intra-node Reduce. All 8 GPUs in a node ring-reduce their copies into one rank's local sum. Uses NVLink only.
Inter-node AllReduce. The chosen rank from each node participates in a ring across nodes (over IB). One participant per node, so IB sees N_nodes ranks, not N_total.
Intra-node Broadcast. The chosen rank propagates the global sum back to its 7 neighbours over NVLink.
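The three steps above produce the same result as a flat AllReduce, which a single-process sketch can confirm. It models only the data flow (who sums what), not ring mechanics or timing; the function names and the choice of the first rank in each node as the "chosen rank" are assumptions for illustration.

```python
def flat_all_reduce(bufs):
    total = [sum(col) for col in zip(*bufs)]
    return [list(total) for _ in bufs]

def hierarchical_all_reduce(bufs, gpus_per_node):
    n = len(bufs)
    assert n % gpus_per_node == 0, "ranks must fill whole nodes"
    nodes = [bufs[k:k + gpus_per_node] for k in range(0, n, gpus_per_node)]

    # Step 1: intra-node Reduce. Each node's chosen rank ends up with the local sum.
    node_sums = [[sum(col) for col in zip(*node)] for node in nodes]

    # Step 2: inter-node AllReduce among the chosen ranks only (one per node).
    global_sum = [sum(col) for col in zip(*node_sums)]

    # Step 3: intra-node Broadcast. Every rank receives the global sum.
    return [list(global_sum) for _ in range(n)]

# 2 nodes x 4 GPUs, 3 elements per buffer
bufs = [[rank + 100 * k for k in range(3)] for rank in range(8)]
hier = hierarchical_all_reduce(bufs, gpus_per_node=4)
flat = flat_all_reduce(bufs)
```

Only step 2 touches the inter-node fabric, and it involves one participant per node.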

Total inter-node bytes per node: 2 · S · (N_nodes − 1) / N_nodes, because only one rank per node talks over IB. Compare a flat ring in which every GPU's ring traffic crosses IB: each of the 8 GPUs in a node then pushes its own 2 · S · (N − 1) / N over the node's IB links. With 8 GPUs per node and 64 nodes, hierarchical is a ~8× reduction in IB traffic for one node. NCCL picks this automatically when it detects the topology. (It also has tree, double-binary tree, and several others, but ring vs hierarchical is the core mental model.)
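One way to arrive at the ~8× figure is to count IB bytes per node under an explicit worst-case assumption for the flat ring: every GPU's ring traffic leaves the node. That assumption, and the function names, are mine; a flat ring laid out so that consecutive ranks share a node would cross IB far less often.

```python
def hierarchical_ib_bytes_per_node(n_nodes, s):
    """One rank per node runs a ring over IB across n_nodes participants."""
    return 2 * s * (n_nodes - 1) / n_nodes

def flat_ring_ib_bytes_per_node_worst_case(n_total, gpus_per_node, s):
    """Assumes all of each GPU's ring traffic crosses IB (worst-case placement)."""
    return gpus_per_node * 2 * s * (n_total - 1) / n_total

GPUS_PER_NODE, N_NODES = 8, 64
S = 1.0  # payload in GB, so results are in GB

hier = hierarchical_ib_bytes_per_node(N_NODES, S)
flat = flat_ring_ib_bytes_per_node_worst_case(GPUS_PER_NODE * N_NODES, GPUS_PER_NODE, S)
reduction = flat / hier
```

Both per-participant formulas are close to 2 · S, so the reduction comes almost entirely from having one IB participant per node instead of eight.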

Animated · ring vs tree vs hierarchical, side by side

Three topologies racing the same payload over 8 ranks (or 16, or 32). The ring travels along a circle; the tree fans in and out in log N levels; the hierarchical runs two intra-node phases (reduce, then broadcast) around one inter-node ring. Press play. The counter shows messages-completed per algorithm.

What changes in TP, FSDP, MoE

Naming the collective each parallelism uses is the single most useful mnemonic in this series. From the table at the top:

DDP (lesson 04): one AllReduce of all gradients per step. Hidden behind backward via overlap.
FSDP / ZeRO-3 (lesson 05): AllGather to reconstruct a full layer (forward), AllGather + ReduceScatter (backward). The ReduceScatter plus one AllGather cost the same bytes as DDP's AllReduce, just split; the second AllGather adds another half, for roughly 1.5× DDP's traffic. The win is memory, not communication.
Tensor parallel (lesson 06): two AllReduces per transformer layer per forward (one in attention, one in MLP). High frequency, modest size — pinned intra-node.
Pipeline parallel (lesson 07): point-to-point Send/Recv only (not technically a "collective" — just NCCL P2P). The bubble cost dwarfs the communication cost.
Expert parallel (lesson 09): two AllToAlls per MoE forward, two per backward. AllToAll is the latency-sensitive collective in the table.

Interactive · ring vs tree vs hierarchical

Three collectives, same payload. Watch the total-time bars change as you scale ranks, message size, and the latency/bandwidth ratio. The crossover where tree beats ring is the message size at which gradient bucketing makes sense; the gap between flat ring and hierarchical is the value of "stay on one node when you can."

AllReduce cost · ring vs hierarchical vs tree

Cost in microseconds for a single AllReduce of S bytes on N ranks, parametrised by α (per-message latency, μs) and β = 1/BW (s/byte). Try N=512, S=14 GB (a 7B model's grads in bf16) — flat-ring is ~280 ms, hierarchical ~80 ms.

Interactive calculator (not reproduced here). Default inputs: ranks N = 64, payload S = 100 MB, intra-node BW = 900 GB/s, inter-node BW = 50 GB/s, GPUs/node = 8, α = 10 μs. Outputs: time for flat ring, tree, and hierarchical, plus the winner.

The two-line vocabulary check

If you hear …	… what's actually happening
"All gradients are summed across ranks"	AllReduce
"Reconstruct the full weight from shards"	AllGather
"Sum the grads, but only keep your slice"	ReduceScatter
"Each token goes to its expert"	AllToAll
"Activation crosses a pipeline stage"	Send/Recv (point-to-point)
"Broadcast the initial weights"	Broadcast (rank 0 → all)

Takeaway

Per-rank ring AllReduce cost is 2(N−1)/N · S / BW in the bandwidth-bound regime: roughly 2S/BW, essentially independent of N. Tree wins for small messages (latency). Hierarchical wins when intra- and inter-node bandwidth differ, which they always do. Bucketing exists to push small AllReduces into the bandwidth-bound regime where ring is optimal.