Interconnect — the bandwidth pyramid
The reason TP wants to stay on one node, the reason FSDP needs HSDP, and the reason PP can stretch across racks is a single chart of bandwidth-per-link. This lesson is that chart.
Four numbers to memorize
Almost every distributed-training argument is settled by comparing two numbers from this table. Memorize them and you can do the math in your head.
| Layer | Hardware | Bandwidth (H100 numbers, rough) | Typical latency |
|---|---|---|---|
| Within an SM | SMEM / registers | ~10 TB/s | ~ns |
| Within a GPU | HBM3 | 3.35 TB/s | ~100 ns |
| Within a node (8 GPUs) | NVLink 4 + NVSwitch | ~900 GB/s bidir per GPU | ~1 μs |
| Across nodes (1 NIC) | Infiniband NDR (400 Gb/s) | ~50 GB/s | ~2 μs RDMA |
| Across nodes (8 NICs/node, balanced) | Aggregated IB | ~400 GB/s aggregate | same per pair |
Two ratios fall out immediately:
- HBM : NVLink ≈ 3.7×. So if a kernel needs to read its operands from a neighbour over NVLink instead of its own HBM, it slows down by roughly that factor — painful, but tolerable given NVLink is still ~18× faster than the next tier down (IB).
- NVLink : IB-per-NIC ≈ 18×. This is the gap that makes "stay intra-node" a hard rule.
3D · the bandwidth pyramid, isometric
Each tier of the memory hierarchy is a tilted slab; slab area is proportional to bandwidth (in GB/s, log-scaled because the range is enormous). Hover or click a slab to see the numbers. The "drop" from one slab to the next is the cost a kernel pays whenever it has to fetch from one tier deeper.
What "intra-node" buys you
NVLink + NVSwitch on a DGX/HGX node gives all-pairs connectivity at full NVLink bandwidth. Eight GPUs, each with ~900 GB/s of bidirectional bandwidth — and crucially, no oversubscription: rank 0 ↔ rank 1 and rank 2 ↔ rank 3 can both run at full speed simultaneously because the NVSwitch is a non-blocking crossbar.
That property is what makes tensor parallel viable. Recall from lesson 06 (next): a TP-sharded transformer layer does two AllReduces per layer per forward pass (one in attention, one in MLP). For a 70B model with 80 layers, that's 320 AllReduces in a forward + backward. At NVLink bandwidth and modest message size (~50 MB per AllReduce for the MLP output), each AllReduce takes ~100 μs. 320 × 100 μs ≈ 32 ms. Roughly comparable to the kernel time. The compute can usually be hidden behind this.
The same calculation at inter-node bandwidth: 50 GB/s instead of 900 GB/s. The same 320 AllReduces now take ~18× as long: ~580 ms. That is 100× the per-step compute. TP across nodes is hopeless for transformer training, full stop.
2D · the pod topology, click src/dst
The standard 4-node × 8-GPU pod. Within a node, all 8 GPUs hang off an NVSwitch (gold); across nodes, two leaf IB switches connect everything (blue dashed). Click a source GPU then a destination GPU; an animated packet shows the path it takes, and the KPIs report the bottleneck bandwidth along that path. Different-node traffic always goes through the IB leg, no matter how close the GPUs are visually.
What inter-node bandwidth can afford
Pipeline parallel sends activations across stage boundaries: one tensor per micro-batch, per layer-boundary, per forward pass. Even with many micro-batches, that's far less traffic than TP's per-layer AllReduce. For 4-stage PP with 16 micro-batches, you send 16 activation tensors over the inter-stage link per forward — say 16 × 16 MB = 256 MB. At 50 GB/s, that's ~5 ms, hideable behind compute. Hence PP across nodes is fine.
Data parallel sends one AllReduce per step (the gradient sync). The volume is 2 · params bytes — for a 7B model that's 28 GB ring-asymptotic. At inter-node 50 GB/s that's ~560 ms. Without overlap that's a step-killer; with overlap (lesson 04), the AllReduce hides behind backward. Most importantly: DDP does one of these per step, not per layer. The frequency is the saving grace.
FSDP sits in the middle. It does an AllGather per layer (forward) plus an AllGather + ReduceScatter per layer (backward). Per-layer message size is params_per_layer · 2 bytes. For a 7B model with 32 layers, that's 7e9 / 32 · 2 = 437 MB per layer. At inter-node 50 GB/s that's 9 ms per AllGather — and you do 3 of those per layer per step, across 32 layers. The math doesn't work inter-node. Hence HSDP: shard intra-node, replicate inter-node.
The decision table
| Strategy | Comm pattern | Comm volume per step | OK intra? | OK inter? |
|---|---|---|---|---|
| TP (lesson 06) | 2 AllReduce per layer × (fwd+bwd) | large · per-layer | yes | no |
| FSDP (lesson 05) | 3 collectives per layer × (fwd+bwd) | moderate · per-layer | yes | HSDP only |
| DDP (lesson 04) | 1 AllReduce per step (whole model) | 2 · params bytes | yes | yes (with overlap) |
| PP (lesson 07) | Send/Recv per micro-batch per stage | small · per stage | yes | yes |
| EP (lesson 09) | 2 AllToAll per MoE layer × (fwd+bwd) | large · latency-dominated | yes | painful |
Read this table as a layout heuristic. TP on the innermost (fastest) axis. EP next to TP if your model is MoE. PP across nodes. DP/FSDP outermost, since DP's one AllReduce per step is the most overlap-friendly. This is the recipe used by Megatron-LM, Llama 3's training stack, DeepSpeed, and friends; lesson 10 makes it explicit.
Topology details that occasionally matter
- NVSwitch ≠ NVLink mesh. NVSwitch is a crossbar; the 8 GPUs are fully connected, not in a ring. NCCL can build a ring on top of it for AllReduce, but pair-to-pair traffic doesn't share links.
- Beyond 8 GPUs/node. Some systems (GH200 racks, GB200 NVL72) push the "node" boundary much further — 72 GPUs in one NVLink domain. This shifts what counts as "intra-node" and changes the math. Lesson 10 covers how the recipe responds.
- PCIe. Without NVLink, GPU-to-GPU traffic goes over PCIe (~50 GB/s gen5). A "consumer GPU + PCIe" cluster is essentially "all inter-node": TP and FSDP don't work at scale. NVLink is the line that separates research-cluster economics from frontier-cluster economics.
- RoCE vs IB. Both can give 200–400 Gb/s per port; RoCE runs over Ethernet, IB over a separate fabric. Different cost / failure profile, same physics for our purposes.
- NIC topology inside a node. Each GPU usually has its own dedicated NIC on a balanced HGX. If two GPUs share a NIC, the per-pair inter-node bandwidth halves. Check your machine spec.
Interactive · which parallelism survives this fabric?
Set node count, intra-node BW, inter-node BW, and the model size. The widget computes the per-step communication cost of each parallelism strategy and overlays it on a "step time" line. Anything above the line is bottlenecked by communication.