Interconnect — the bandwidth pyramid

The reason TP wants to stay on one node, the reason FSDP needs HSDP, and the reason PP can stretch across racks is a single chart of bandwidth-per-link. This lesson is that chart.

Four numbers to memorize

Almost every distributed-training argument is settled by comparing two numbers from this table. Memorize them and you can do the math in your head.

Layer	Hardware	Bandwidth (H100 numbers, rough)	Typical latency
Within an SM	SMEM / registers	~10 TB/s	~ns
Within a GPU	HBM3	3.35 TB/s	~100 ns
Within a node (8 GPUs)	NVLink 4 + NVSwitch	~900 GB/s bidir per GPU	~1 μs
Across nodes (1 NIC)	InfiniBand NDR (400 Gb/s)	~50 GB/s	~2 μs RDMA
Across nodes (8 NICs/node, balanced)	Aggregated IB	~400 GB/s aggregate	same per pair

Two ratios fall out immediately:

HBM : NVLink ≈ 3.7×. So if a kernel needs to read its operands from a neighbour over NVLink instead of its own HBM, it slows down by roughly that factor — painful, but tolerable given NVLink is still ~18× faster than the next tier down (IB).
NVLink : IB-per-NIC ≈ 18×. This is the gap that makes "stay intra-node" a hard rule.
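The two ratios can be reproduced directly from the table. A minimal sketch using the rough H100 figures above; the dictionary and its key names are illustrative and do not come from any library.

```python
# Rough H100-era bandwidths from the table above, in GB/s.
BANDWIDTH_GBPS = {
    "hbm3": 3350.0,      # within a GPU
    "nvlink": 900.0,     # within a node, bidirectional per GPU
    "ib_per_nic": 50.0,  # across nodes, one 400 Gb/s NDR NIC (400 / 8)
}

def ratio(faster: str, slower: str) -> float:
    """How many times more bandwidth the faster tier offers."""
    return BANDWIDTH_GBPS[faster] / BANDWIDTH_GBPS[slower]

hbm_to_nvlink = ratio("hbm3", "nvlink")
nvlink_to_ib = ratio("nvlink", "ib_per_nic")
print(f"HBM : NVLink = {hbm_to_nvlink:.1f}x")
print(f"NVLink : IB  = {nvlink_to_ib:.0f}x")
```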

Peak vs effective bandwidth — read this once, refer back forever

Peak is the raw link rate the hardware can clock — ~900 GB/s per GPU. Effective is what a real collective achieves end-to-end, and it is reliably below peak for three concrete reasons, not hand-waving:

Algorithm bandwidth factor. A ring AllReduce moves 2(N−1)/N · S bytes per rank (lesson 02 derives this), i.e. ≈ 2S for large N: every byte of the tensor crosses a link roughly twice (once in the ReduceScatter pass, once in the AllGather pass). So even a perfectly clocked ring delivers ≈ peak/2 of useful AllReduce throughput before any inefficiency.
Protocol / latency overhead (the α term). The ring is cut into N chunks of S/N; small chunks don't amortise per-message latency, and NCCL's low-latency wire protocols (LL, LL128) trade raw bandwidth for lower α. At realistic chunk sizes you lose another slice of peak.
Imperfect pipelining. The chunked ring schedule has fill/drain edges — the first chunk hasn't reached the far rank while the last is still being launched — so links are not saturated for the whole transfer.

Stack those up and a real flat-ring AllReduce lands at ≈ 50–70% of peak after the algorithm-bandwidth factor. One way to reach the working figure: the ~900 GB/s is bidirectional, so each GPU sends at most ~450 GB/s; the algorithm-bandwidth factor roughly halves that to ~225 GB/s; and 50–70% of that is about 110–160 GB/s. Call it ~150 GB/s of realised AllReduce on a ~900 GB/s fabric. That ~150 GB/s is the number to use for DDP gradient sync (lesson 04). A switch-resident collective sidesteps the worst of all three: NVLink SHARP / NVLS reduces in the NVSwitch crossbar, so each byte crosses the fabric ≈ once and many pairs move data simultaneously without sharing links, and the realised bandwidth approaches the ~900 GB/s peak. This is the number to use for the comparatively small, switch-resident AllReduces that TP (lesson 06) fires inside a node.

This is a real distinction, not a fudge: the same NVLink fabric delivers ~150 GB/s to a flat-ring grad AllReduce and ~900 GB/s to a switch-resident TP AllReduce. When a later lesson quotes one of these numbers, it's pointing back here. Memorise the pair: 900 peak / 150 effective-ring, plus 50 GB/s per IB NIC.
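The algorithm-bandwidth factor, the first of the three reasons, is easy to check numerically. The sketch below assumes an idealised ring with zero latency and perfect pipelining. It also assumes each GPU can send about 450 GB/s, half of the bidirectional ~900 GB/s figure; that split is an assumption made here, not a number from the lesson. The function names are illustrative.

```python
def ring_allreduce_bytes_per_rank(size_bytes: float, n_ranks: int) -> float:
    """Bytes each rank sends in a ring AllReduce: 2(N-1)/N * S."""
    return 2 * (n_ranks - 1) / n_ranks * size_bytes

def ideal_ring_algbw(send_gbps: float, n_ranks: int) -> float:
    """Upper bound on useful AllReduce throughput (tensor bytes / time)
    when each rank sends at send_gbps, ignoring latency and fill/drain."""
    return send_gbps * n_ranks / (2 * (n_ranks - 1))

SEND_GBPS = 450.0  # assumption: one direction of a ~900 GB/s bidirectional link
S = 1e9            # a 1 GB tensor

for n in (2, 8, 64):
    moved = ring_allreduce_bytes_per_rank(S, n)
    ceiling = ideal_ring_algbw(SEND_GBPS, n)
    print(f"N={n}: each rank sends {moved / S:.2f} x S, ceiling {ceiling:.0f} GB/s")
```

The ceiling falls toward half the send rate as N grows; protocol overhead and fill/drain then take their share on top of that.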

3D · the bandwidth pyramid, isometric

Each tier of the memory hierarchy is a tilted slab; slab area is proportional to bandwidth (in GB/s, log-scaled because the range is enormous). Hover or click a slab to see the numbers. The "drop" from one slab to the next is the cost a kernel pays whenever it has to fetch from one tier deeper.

What "intra-node" buys you

NVLink + NVSwitch on a DGX/HGX node gives all-pairs connectivity at full NVLink bandwidth. Eight GPUs, each with ~900 GB/s of bidirectional bandwidth — and crucially, no oversubscription: rank 0 ↔ rank 1 and rank 2 ↔ rank 3 can both run at full speed simultaneously because the NVSwitch is a non-blocking crossbar.

That property is what makes tensor parallel viable. A TP-sharded transformer layer does two AllReduces per layer per forward pass — one at the end of the attention block, one at the end of the MLP block (lesson 06 builds the column/row mechanism that produces them; here we just need their size). Each AllReduce moves one block's output activation, a tensor of shape (B, T, d) in bf16, so

S_TP = B · T · d · 2 bytes

Plug in Llama-70B's example numbers (B=4, T=4096, d=8192) and S_TP = 4 · 4096 · 8192 · 2 ≈ 270 MB. We derive the canonical TP message size here, and lesson 06 refers back to it. For an 80-layer model that's 2 AllReduces × 80 layers × 2 (forward + backward) = 320 AllReduces of ~270 MB each per step, about 86 GB in total. Because a TP AllReduce is switch-resident rather than a flat ring, it runs near peak (see the peak-vs-effective note above), so at ~900 GB/s effective the whole step's 320 AllReduces cost ≈ 95 ms of comm. That is roughly comparable to the kernel time, and much of it can usually be overlapped with compute.

The same calculation at inter-node bandwidth: 50 GB/s instead of 900 GB/s. The same 320 AllReduces now take ~18× as long: ~1.7 s. That is more than 10× the per-step compute. TP across nodes is hopeless for transformer training, full stop.
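The intra-node and inter-node figures can be reproduced in a few lines. This is a back-of-the-envelope sketch: it treats each AllReduce as S_TP bytes moved at a flat effective bandwidth and ignores latency, and the function names are made up for illustration.

```python
def tp_allreduce_bytes(batch: int, seq_len: int, d_model: int,
                       bytes_per_elem: int = 2) -> int:
    """Size of one TP AllReduce: a (B, T, d) activation in bf16."""
    return batch * seq_len * d_model * bytes_per_elem

def tp_comm_seconds(msg_bytes: int, n_layers: int, bw_bytes_per_s: float) -> float:
    """2 AllReduces per layer, forward and backward, at a flat bandwidth."""
    n_allreduces = 2 * n_layers * 2
    return n_allreduces * msg_bytes / bw_bytes_per_s

s_tp = tp_allreduce_bytes(batch=4, seq_len=4096, d_model=8192)
intra = tp_comm_seconds(s_tp, n_layers=80, bw_bytes_per_s=900e9)  # NVLink
inter = tp_comm_seconds(s_tp, n_layers=80, bw_bytes_per_s=50e9)   # one IB NIC

print(f"S_TP = {s_tp / 1e6:.0f} MB")
print(f"intra-node: {intra * 1e3:.0f} ms, inter-node: {inter:.2f} s")
```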

The "stay on one node" rule, derived

TP frequency is per-layer × 2 × 2 (forward + backward). Per-layer message size is roughly B · T · d · 2 bytes (the output of an attention block or MLP, in bf16). Step time budget is the per-rank forward time. TP across nodes blows the time budget by more than an order of magnitude. Hence: TP ≤ 8 in production, contained to one HGX/DGX node.

2D · the pod topology, click src/dst

The standard 4-node × 8-GPU pod. Within a node, all 8 GPUs hang off an NVSwitch (gold); across nodes, two leaf IB switches connect everything (blue dashed). Click a source GPU then a destination GPU; an animated packet shows the path it takes, and the KPIs report the bottleneck bandwidth along that path. Different-node traffic always goes through the IB leg, no matter how close the GPUs are visually.

What inter-node bandwidth can afford

Pipeline parallel sends activations across stage boundaries: one tensor per micro-batch, per stage boundary, per forward pass. Even with many micro-batches, that's far less traffic than TP's per-layer AllReduce. For 4-stage PP with 16 micro-batches, you send 16 activation tensors over each inter-stage link per forward, say 16 × 16 MB = 256 MB. At 50 GB/s, that's ~5 ms, hideable behind compute. Hence PP across nodes is fine.

Data parallel sends one AllReduce per step (the gradient sync). The gradients occupy 2 bytes per parameter in bf16, 14 GB for a 7B model, and a ring AllReduce moves asymptotically twice that per rank: 28 GB. At inter-node 50 GB/s that's ~560 ms. Without overlap that's a step-killer; with overlap (lesson 04), the AllReduce hides behind backward. Most importantly: DDP does one of these per step, not per layer. The frequency is the saving grace.

FSDP sits in the middle. It does an AllGather per layer (forward) plus an AllGather + ReduceScatter per layer (backward). Per-layer message size is params_per_layer · 2 bytes. For a 7B model with 32 layers, that's 7e9 / 32 · 2 ≈ 437 MB per layer. At inter-node 50 GB/s that's ~9 ms per collective, and you do 3 of those per layer per step across 32 layers: 96 collectives, roughly 840 ms per step. The math doesn't work inter-node. Hence HSDP: shard intra-node, replicate inter-node.
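The three inter-node estimates in this section follow the same pattern: bytes moved divided by 50 GB/s. A rough sketch under the section's own simplifications (one NIC, no latency, no overlap); the function names are illustrative.

```python
IB_BYTES_PER_S = 50e9  # one 400 Gb/s NIC, as in the table

def pp_seconds(n_microbatches: int, activation_bytes: float) -> float:
    """One activation tensor per micro-batch over one stage boundary."""
    return n_microbatches * activation_bytes / IB_BYTES_PER_S

def ddp_seconds(n_params: float, bytes_per_param: int = 2) -> float:
    """One ring AllReduce of all gradients: ~2x the gradient bytes per rank."""
    return 2 * n_params * bytes_per_param / IB_BYTES_PER_S

def fsdp_seconds(n_params: float, n_layers: int, bytes_per_param: int = 2) -> float:
    """3 collectives per layer (AllGather fwd, AllGather + ReduceScatter bwd),
    each modelled as moving one layer's parameters."""
    per_layer_bytes = n_params / n_layers * bytes_per_param
    return 3 * n_layers * per_layer_bytes / IB_BYTES_PER_S

pp = pp_seconds(n_microbatches=16, activation_bytes=16e6)
ddp = ddp_seconds(n_params=7e9)
fsdp = fsdp_seconds(n_params=7e9, n_layers=32)
print(f"PP {pp * 1e3:.1f} ms, DDP {ddp * 1e3:.0f} ms, FSDP {fsdp * 1e3:.0f} ms")
```

The totals alone do not settle the question: DDP's cost is a single transfer that can overlap with backward, while FSDP's is spread over 96 per-layer collectives.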

The decision table

Strategy	Comm pattern	Comm volume per step	OK intra?	OK inter?
TP (lesson 06)	2 AllReduce per layer × (fwd+bwd)	large · per-layer	yes	no
FSDP (lesson 05)	3 collectives per layer (1 fwd + 2 bwd)	moderate · per-layer	yes	HSDP only
DDP (lesson 04)	1 AllReduce per step (whole model)	2 · params bytes	yes	yes (with overlap)
PP (lesson 07)	Send/Recv per micro-batch per stage	small · per stage	yes	yes
EP (lesson 09)	2 AllToAll per MoE layer × (fwd+bwd)	large · latency-dominated	yes	painful

Read this table as a layout heuristic. TP on the innermost (fastest) axis. EP next to TP if your model is MoE. PP across nodes. DP/FSDP outermost, since DP's one AllReduce per step is the most overlap-friendly. This is the recipe used by Megatron-LM, Llama 3's training stack, DeepSpeed, and friends; lesson 12 makes it explicit.

Topology details that occasionally matter

NVSwitch ≠ NVLink mesh. NVSwitch is a crossbar; the 8 GPUs are fully connected, not in a ring. NCCL can build a ring on top of it for AllReduce, but pair-to-pair traffic doesn't share links.
Beyond 8 GPUs/node. Some systems (GH200 racks, GB200 NVL72) push the "node" boundary much further — 72 GPUs in one NVLink domain. This shifts what counts as "intra-node" and changes the math. Lesson 12 covers how the recipe responds.
PCIe. Without NVLink, GPU-to-GPU traffic goes over PCIe (~50 GB/s gen5). A "consumer GPU + PCIe" cluster is essentially "all inter-node": TP and FSDP don't work at scale. NVLink is the line that separates research-cluster economics from frontier-cluster economics.
RoCE vs IB. Both can give 200–400 Gb/s per port; RoCE runs over Ethernet, IB over a separate fabric. Different cost / failure profile, same physics for our purposes.
NIC topology inside a node. Each GPU usually has its own dedicated NIC on a balanced HGX. If two GPUs share a NIC, the per-pair inter-node bandwidth halves. Check your machine spec.

Interactive · which parallelism survives this fabric?

Set node count, intra-node BW, inter-node BW, and the model size. The widget computes the per-step communication cost of each parallelism strategy and overlays it on a "step time" line. Anything above the line is bottlenecked by communication.

Topology & which collective algorithm wins

The bandwidth pyramid tells you the per-link cost. But on a real cluster the shape of the network — and which NCCL algorithm runs on it — decides whether you actually get that bandwidth.

Rail-optimized fat-tree

A modern GPU cluster is not a flat IB fabric. It's rail-optimized: each GPU's NIC connects to its own "rail" — a dedicated leaf switch shared by the same-numbered GPU across every node. So GPU 3 in node 0 and GPU 3 in node 7 both hang off rail-3's leaf switch and get a one-hop path between them. NCCL knows this and places its ring/tree neighbours rail-aligned, so the inter-node legs of a collective ride single-hop rail links.

The catch is the layer above: the spine that connects rails to each other is usually oversubscribed (fewer uplinks than downlinks). Cross-rail or cross-pod traffic must climb to the spine and is slower than a same-rail hop. This is exactly why mapping parallelism groups onto the physical topology matters: you want a DP or PP communication group to fall on a single rail where possible, and you pay a spine tax whenever a group straddles rails. A bad rank-to-GPU assignment can silently route a hot collective across the oversubscribed spine.
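A quick way to catch a bad placement is to check whether an inter-node group stays on one rail. The sketch below assumes ranks are numbered node by node and that GPU i on every node attaches to rail i, as described above. Both helper functions are hypothetical, not part of NCCL or any launcher, and the check is meant for groups with one rank per node.

```python
GPUS_PER_NODE = 8  # assumption: one NIC, and hence one rail, per GPU

def node_and_rail(global_rank: int) -> tuple[int, int]:
    """Map a global rank to (node index, rail index)."""
    return divmod(global_rank, GPUS_PER_NODE)

def is_rail_aligned(group: list[int]) -> bool:
    """True if every rank in the group sits on the same rail."""
    rails = {node_and_rail(rank)[1] for rank in group}
    return len(rails) == 1

good_group = [3, 11, 19, 27]  # GPU 3 on nodes 0..3: all on rail 3
bad_group = [3, 12, 21, 30]   # rails 3, 4, 5, 6: traffic must cross the spine

print(is_rail_aligned(good_group), is_rail_aligned(bad_group))
```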

NCCL algorithm selection — latency vs bandwidth

NCCL doesn't run one AllReduce algorithm; it picks one per call. The choice is the α/β trade-off from lesson 02: cost = α · (#messages) + β · (bytes), where α is per-hop latency and β = 1/BW.

TREE wins for small messages and large N. A tree reaches every rank in ~log N hops, so its latency term is α · O(log N) instead of the ring's α · O(N). When the payload is tiny, latency dominates and the ring's 2(N−1) sequential steps lose badly.
RING is bandwidth-optimal for large messages: it hits the 2(N−1)/N · S lower bound and balances load across every link. Once β · S dominates, ring's worse latency is irrelevant.
NVLS (NVLink SHARP / in-network reduction, see lesson 02) wins when the switch can do the arithmetic: each GPU sends once and the NVSwitch reduces in-fabric, roughly halving bytes moved and running near peak. NCCL selects it for intra-node reductions when the hardware supports it.

Algorithm	Best regime	Cost (latency / bandwidth)
TREE	small messages, large N (latency-bound)	α · O(log N) + β · 2S
RING	large messages (bandwidth-bound)	α · 2(N−1) + β · 2(N−1)/N · S
NVLS	switch supports in-network reduction	α · O(1) + β · ~S (near peak)
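The table's cost expressions can be turned into a toy selector. This is a sketch of the α/β model only, not NCCL's actual tuner: the constant factor for the tree (2 · log2 N steps), the function names, and the example α and β values are assumptions chosen for illustration. NVLS is left out because its availability depends on the hardware.

```python
import math

def tree_cost(alpha: float, beta: float, n: int, size: float) -> float:
    """alpha * O(log N) + beta * 2S, assuming 2 * log2(N) latency steps."""
    return alpha * 2 * math.log2(n) + beta * 2 * size

def ring_cost(alpha: float, beta: float, n: int, size: float) -> float:
    """alpha * 2(N-1) + beta * 2(N-1)/N * S."""
    return alpha * 2 * (n - 1) + beta * 2 * (n - 1) / n * size

def cheapest(alpha: float, beta: float, n: int, size: float) -> str:
    """Name of the algorithm with the lower modelled cost."""
    costs = {
        "TREE": tree_cost(alpha, beta, n, size),
        "RING": ring_cost(alpha, beta, n, size),
    }
    return min(costs, key=costs.get)

def crossover_bytes(alpha: float, beta: float, n: int) -> float:
    """Message size at which the two modelled costs are equal."""
    return alpha * n * ((n - 1) - math.log2(n)) / beta

ALPHA = 2e-6     # assumed per-hop latency: 2 microseconds
BETA = 1 / 50e9  # seconds per byte at 50 GB/s
N = 64

small = cheapest(ALPHA, BETA, N, 1024)  # 1 KiB message
large = cheapest(ALPHA, BETA, N, 1e9)   # 1 GB message
print(small, large, f"crossover ~{crossover_bytes(ALPHA, BETA, N) / 1e6:.0f} MB")
```

With these assumed values the tree wins the small message and the ring wins the large one. The crossover moves with α, β, and N, which is why the choice is made per call.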

NCCL auto-tunes this: it estimates each algorithm's cost from a built-in model of your topology and the message size, then picks the cheapest. You can override the decision with NCCL_ALGO (e.g. NCCL_ALGO=Ring, Tree, or NVLS) and the wire protocol with NCCL_PROTO (Simple, LL, or LL128; LL and LL128 are the low-latency protocols that trade some bandwidth for much lower α on small messages, while Simple is the bandwidth-oriented one). The practical upshot: the same AllReduce call can be latency-optimal or bandwidth-optimal depending on message size and where the ranks physically sit, and that's a property of the topology, not just the code.
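For an A/B benchmark, the overrides can be set from the launching process. A minimal sketch, assuming a Python launcher; the variables need to be in the environment before NCCL creates its first communicator, and pinning an algorithm is normally a debugging or benchmarking aid rather than a production setting.

```python
import os

# Set these before any NCCL communicator is created
# (in PyTorch, before torch.distributed.init_process_group).
os.environ["NCCL_ALGO"] = "Ring"     # pin the ring algorithm
os.environ["NCCL_PROTO"] = "Simple"  # bandwidth-oriented wire protocol
os.environ["NCCL_DEBUG"] = "INFO"    # have NCCL log its setup details

for name in ("NCCL_ALGO", "NCCL_PROTO", "NCCL_DEBUG"):
    print(f"{name}={os.environ[name]}")
```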

Placement is a tuning knob

Two clusters with identical bandwidth tables can differ by 2× on the same collective purely because of rank-to-GPU placement: rail-aligned neighbours hit single-hop links and let NCCL choose its cheapest algorithm; mis-placed groups push traffic onto the oversubscribed spine and force a worse one.

Takeaway

Four bandwidth numbers explain almost every parallelism rule in this series. Each step down the pyramid costs a multiple of bandwidth: roughly 3.7× from HBM to NVLink, then roughly 18× from NVLink to a single IB NIC. Strategies that communicate per layer (TP, FSDP) want NVLink; strategies that communicate per step (DDP) or per stage (PP) tolerate IB.