
Interconnect — the bandwidth pyramid

The reason TP wants to stay on one node, the reason FSDP needs HSDP, and the reason PP can stretch across racks is a single chart of bandwidth-per-link. This lesson is that chart.

Four numbers to memorize

Almost every distributed-training argument is settled by comparing two numbers from this table. Memorize them and you can do the math in your head.

| Layer | Hardware | Bandwidth (H100 numbers, rough) | Typical latency |
|---|---|---|---|
| Within an SM | SMEM / registers | ~10 TB/s | ~ns |
| Within a GPU | HBM3 | 3.35 TB/s | ~100 ns |
| Within a node (8 GPUs) | NVLink 4 + NVSwitch | ~900 GB/s bidir per GPU | ~1 μs |
| Across nodes (1 NIC) | InfiniBand NDR (400 Gb/s) | ~50 GB/s | ~2 μs RDMA |
| Across nodes (8 NICs/node, balanced) | Aggregated IB | ~400 GB/s aggregate | same per pair |

Two ratios fall out immediately:

[Figure: the bandwidth pyramid, top to bottom]
SMEM / registers · ~10 TB/s · inside one SM · the kernel's working set
HBM3 · 3.35 TB/s · one GPU's working memory · the roofline's denominator
NVLink + NVSwitch · ~900 GB/s bidir per GPU · intra-node · TP & expert-parallel territory
InfiniBand NDR · 50 GB/s/NIC, ~400 GB/s aggregate · inter-node · PP & DP territory
Factor ~3× per layer · the steps you cross govern your bottleneck
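
The ratios between adjacent tiers can be read straight off the table. A minimal sketch, using the rough H100 figures above and the single-NIC number for InfiniBand; the names `TIERS_GBPS` and `step_ratios` are illustrative and not from any library:

```python
# Rough H100-era bandwidths from the table above, in GB/s.
TIERS_GBPS = {
    "SMEM / registers": 10_000.0,
    "HBM3": 3_350.0,
    "NVLink + NVSwitch": 900.0,
    "InfiniBand NDR (1 NIC)": 50.0,
}


def step_ratios(tiers):
    """Return (upper, lower, ratio) for each adjacent pair of tiers."""
    names = list(tiers)
    return [
        (upper, lower, tiers[upper] / tiers[lower])
        for upper, lower in zip(names, names[1:])
    ]


for upper, lower, ratio in step_ratios(TIERS_GBPS):
    print(f"{upper} -> {lower}: {ratio:.1f}x")
```

The last two ratios, HBM3 to NVLink and NVLink to a single IB NIC, are the ones the rest of the lesson leans on.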

3D · the bandwidth pyramid, isometric

Each tier of the memory hierarchy is a tilted slab; slab area is proportional to bandwidth (in GB/s, log-scaled because the range is enormous). Hover or click a slab to see the numbers. The "drop" from one slab to the next is the cost a kernel pays whenever it has to fetch from one tier deeper.

[Interactive figure: Bandwidth pyramid. Width ∝ log(bandwidth); color = relative latency (green = ns, blue = μs). Selecting a tier reports its bandwidth, latency, and ratio vs HBM.]
The wider the slab, the more bytes you can move per second; the slabs below NVLink are where the production rules ("intra-node only") come from.

What "intra-node" buys you

NVLink + NVSwitch on a DGX/HGX node gives all-pairs connectivity at full NVLink bandwidth. Eight GPUs, each with ~900 GB/s of bidirectional bandwidth — and crucially, no oversubscription: rank 0 ↔ rank 1 and rank 2 ↔ rank 3 can both run at full speed simultaneously because the NVSwitch is a non-blocking crossbar.

That property is what makes tensor parallel viable. As lesson 06 will show, a TP-sharded transformer layer does two AllReduces per forward pass (one in attention, one in MLP), and the same again in backward. For a 70B model with 80 layers, that's 80 × 2 × 2 = 320 AllReduces in a forward + backward. At NVLink bandwidth and modest message size (~50 MB per AllReduce for the MLP output), each AllReduce takes ~100 μs, so 320 × 100 μs ≈ 32 ms. That is roughly comparable to the kernel time, so the communication can usually be hidden behind compute.

The same calculation at inter-node bandwidth uses 50 GB/s instead of 900 GB/s. The same 320 AllReduces now take ~18× as long: ~580 ms. Since 32 ms was already comparable to the kernel time, that is well over 10× the per-step compute, far more than any overlap can hide. TP across nodes is hopeless for transformer training, full stop.
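
The arithmetic in the last two paragraphs can be checked in a few lines. This is a sketch under a simplifying assumption: AllReduce time scales inversely with link bandwidth, and per-message latency is ignored. The 100 μs figure is the lesson's own estimate, not a measurement.

```python
LAYERS = 80                      # 70B-class model, from the text
ALLREDUCES_PER_LAYER = 2         # one in attention, one in MLP
PASSES = 2                       # forward + backward
T_ALLREDUCE_NVLINK_S = 100e-6    # lesson's estimate for ~50 MB over NVLink
NVLINK_GBPS = 900.0
IB_GBPS = 50.0

n_allreduce = LAYERS * ALLREDUCES_PER_LAYER * PASSES
t_intra_s = n_allreduce * T_ALLREDUCE_NVLINK_S

# Assumption: time scales inversely with bandwidth; latency is ignored.
slowdown = NVLINK_GBPS / IB_GBPS
t_inter_s = t_intra_s * slowdown

print(f"{n_allreduce} AllReduces per forward + backward")
print(f"intra-node: {t_intra_s * 1e3:.0f} ms")
print(f"inter-node: {t_inter_s * 1e3:.0f} ms ({slowdown:.0f}x slower)")
```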

The "stay on one node" rule, derived
TP frequency is per-layer × 2 × 2 (forward + backward). Per-layer message size is roughly B · T · d · 2 bytes (the output of an attention block or MLP, in bf16). Step time budget is the per-rank forward time. TP across nodes blows the time budget by more than an order of magnitude. Hence: TP ≤ 8 in production, contained to one HGX/DGX node.
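
The message-size and frequency terms of this rule can be sketched as two small functions. The shapes below (B = 1, T = 4096, d = 8192) are hypothetical values chosen for illustration; only the 80-layer count comes from the text, and the function names are not from any library.

```python
def tp_allreduce_bytes(batch, seq_len, d_model, bytes_per_elem=2):
    """Payload of one TP AllReduce: B * T * d * bytes (bf16 => 2 bytes)."""
    return batch * seq_len * d_model * bytes_per_elem


def tp_allreduces_per_step(n_layers, per_layer=2, passes=2):
    """Two AllReduces per layer in each pass, over forward and backward."""
    return n_layers * per_layer * passes


# Hypothetical shapes for illustration; only the 80 layers come from the text.
B, T, D = 1, 4096, 8192
msg_bytes = tp_allreduce_bytes(B, T, D)
count = tp_allreduces_per_step(80)

print(f"message: {msg_bytes / 1e6:.0f} MB")
print(f"count:   {count} per step")
print(f"total:   {msg_bytes * count / 1e9:.1f} GB of AllReduce payload per step")
```

With these assumed shapes the per-message size lands in the same range as the ~50 MB used earlier.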

2D · the pod topology, click src/dst

The standard 4-node × 8-GPU pod. Within a node, all 8 GPUs hang off an NVSwitch (gold); across nodes, two leaf IB switches connect everything (blue dashed). Click a source GPU then a destination GPU; an animated packet shows the path it takes, and the KPIs report the bottleneck bandwidth along that path. Different-node traffic always goes through the IB leg, no matter how close the GPUs are visually.

[Interactive figure: Pod connectivity. Pick a source and a destination GPU; the readout shows source → dest, path type, bottleneck bandwidth, and transfer time.]
Same-node hops stay on NVLink (one hop through NVSwitch). Cross-node hops do NVLink-NIC-IB-NIC-NVLink.
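
The path rule the figure animates can be written down directly. This is a minimal sketch, assuming GPUs are numbered globally with 8 per node, that the bottleneck is the per-GPU NVLink figure inside a node and a single NDR NIC across nodes, and that latency is ignored. The function names are illustrative.

```python
GPUS_PER_NODE = 8
NVLINK_GBPS = 900.0   # bidirectional per GPU, from the table
IB_NIC_GBPS = 50.0    # one NDR NIC


def path(src, dst):
    """Return (path_type, bottleneck_GBps) between two global GPU indices."""
    if src == dst:
        raise ValueError("source and destination must differ")
    same_node = src // GPUS_PER_NODE == dst // GPUS_PER_NODE
    if same_node:
        return "NVLink via NVSwitch", NVLINK_GBPS
    return "NVLink-NIC-IB-NIC-NVLink", IB_NIC_GBPS


def transfer_time_s(src, dst, n_bytes):
    """Bandwidth-only estimate; per-hop latency is ignored."""
    _, gbps = path(src, dst)
    return n_bytes / (gbps * 1e9)


print(path(0, 7))   # same node
print(path(7, 8))   # adjacent indices, different nodes
print(f"{transfer_time_s(7, 8, 1e9) * 1e3:.0f} ms for 1 GB across nodes")
```

GPUs 7 and 8 are neighbors by index but sit on different nodes, so their traffic takes the IB leg.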

What inter-node bandwidth can afford

Pipeline parallel sends activations across stage boundaries: one tensor per micro-batch, per stage boundary, per forward pass. Even with many micro-batches, that's far less traffic than TP's per-layer AllReduce. For 4-stage PP with 16 micro-batches, you send 16 activation tensors over each inter-stage link per forward, say 16 × 16 MB = 256 MB. At 50 GB/s, that's ~5 ms, hideable behind compute. Hence PP across nodes is fine.
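
The pipeline-parallel estimate as a sketch, using the paragraph's own numbers (16 micro-batches, 16 MB per activation tensor, one IB NIC at ~50 GB/s). The function name is illustrative.

```python
def pp_forward_bytes(n_microbatches, activation_bytes):
    """Bytes crossing one stage boundary in one forward pass."""
    return n_microbatches * activation_bytes


total = pp_forward_bytes(16, 16e6)   # 16 micro-batches x 16 MB
t_s = total / 50e9                   # one IB NIC at ~50 GB/s

print(f"{total / 1e6:.0f} MB per forward, {t_s * 1e3:.1f} ms at 50 GB/s")
```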

Data parallel sends one AllReduce per step (the gradient sync). The gradient buffer is 2 · params bytes in bf16, which is 14 GB for a 7B model; a ring AllReduce moves about twice that per rank, so ~28 GB in the ring-asymptotic limit. At inter-node 50 GB/s that's ~560 ms. Without overlap that's a step-killer; with overlap (lesson 04), the AllReduce hides behind backward. Most importantly, DDP does one of these per step, not per layer. The frequency is the saving grace.
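
A sketch of the gradient-sync estimate, assuming bf16 gradients and a ring AllReduce in which each rank sends 2 · (n − 1) / n times the buffer size (which approaches 2× as the ring grows). The function name and its parameters are illustrative.

```python
def ddp_allreduce_seconds(n_params, link_gbps, bytes_per_grad=2, n_ranks=None):
    """Bandwidth-only time for one ring AllReduce over all gradients.

    With n_ranks=None the ring-asymptotic factor of 2 is used.
    """
    grad_bytes = n_params * bytes_per_grad
    factor = 2.0 if n_ranks is None else 2.0 * (n_ranks - 1) / n_ranks
    return factor * grad_bytes / (link_gbps * 1e9)


t_asymptotic = ddp_allreduce_seconds(7e9, 50.0)
t_32_ranks = ddp_allreduce_seconds(7e9, 50.0, n_ranks=32)

print(f"7B over 50 GB/s, asymptotic: {t_asymptotic * 1e3:.0f} ms")
print(f"7B over 50 GB/s, 32 ranks:   {t_32_ranks * 1e3:.0f} ms")
```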

FSDP sits in the middle. It does an AllGather per layer (forward) plus an AllGather + ReduceScatter per layer (backward). Per-layer message size is params_per_layer · 2 bytes. For a 7B model with 32 layers, that's 7e9 / 32 · 2 ≈ 437 MB per layer. At inter-node 50 GB/s that's ~9 ms per collective, and you do 3 of those per layer per step across 32 layers: 3 × 32 × 8.75 ms ≈ 840 ms of communication per step. The math doesn't work inter-node. Hence HSDP: shard intra-node, replicate inter-node.
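
The FSDP estimate can be compared across the two fabrics with one function. This is a bandwidth-only sketch that treats each of the three per-layer collectives as moving the full per-layer parameter bytes, as the paragraph does. The function name is illustrative.

```python
def fsdp_comm(n_params, n_layers, link_gbps,
              bytes_per_param=2, collectives_per_layer=3):
    """Return (per_layer_bytes, seconds_per_collective, seconds_per_step)."""
    per_layer_bytes = n_params / n_layers * bytes_per_param
    t_one = per_layer_bytes / (link_gbps * 1e9)
    return per_layer_bytes, t_one, t_one * collectives_per_layer * n_layers


for name, gbps in (("inter-node IB", 50.0), ("intra-node NVLink", 900.0)):
    size, t_one, t_step = fsdp_comm(7e9, 32, gbps)
    print(f"{name}: {size / 1e6:.1f} MB per layer, "
          f"{t_one * 1e3:.2f} ms per collective, "
          f"{t_step * 1e3:.0f} ms per step")
```

The same traffic costs roughly 18× less on NVLink, which is the gap HSDP exploits by keeping the sharded collectives inside the node.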

The decision table

| Strategy | Comm pattern | Comm volume per step | OK intra? | OK inter? |
|---|---|---|---|---|
| TP (lesson 06) | 2 AllReduce per layer × (fwd+bwd) | large · per-layer | yes | no |
| FSDP (lesson 05) | 3 collectives per layer (1 fwd + 2 bwd) | moderate · per-layer | yes | HSDP only |
| DDP (lesson 04) | 1 AllReduce per step (whole model) | 2 · params bytes | yes | yes (with overlap) |
| PP (lesson 07) | Send/Recv per micro-batch per stage | small · per stage | yes | yes |
| EP (lesson 09) | 2 AllToAll per MoE layer × (fwd+bwd) | large · latency-dominated | yes | painful |

Read this table as a layout heuristic. TP on the innermost (fastest) axis. EP next to TP if your model is MoE. PP across nodes. DP/FSDP outermost, since DP's one AllReduce per step is the most overlap-friendly. This is the recipe used by Megatron-LM, Llama 3's training stack, DeepSpeed, and friends; lesson 10 makes it explicit.
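
One way to picture "TP innermost, DP outermost" is as a mapping from global rank to a coordinate in a (DP, PP, TP) grid, with the TP index varying fastest so that each TP group occupies consecutive ranks. This is a minimal sketch with a hypothetical 32-GPU layout. Real frameworks expose their own ordering options, and `rank_to_coords` is not an API from any of them.

```python
def rank_to_coords(rank, tp, pp, dp):
    """Map a global rank to (dp, pp, tp) indices with TP varying fastest."""
    if not 0 <= rank < tp * pp * dp:
        raise ValueError("rank out of range")
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx


GPUS_PER_NODE = 8
TP, PP, DP = 8, 2, 2   # hypothetical 32-GPU layout, 4 nodes

for rank in (0, 7, 8, 16, 31):
    dp_i, pp_i, tp_i = rank_to_coords(rank, TP, PP, DP)
    node = rank // GPUS_PER_NODE
    print(f"rank {rank:2d} (node {node}) -> dp={dp_i} pp={pp_i} tp={tp_i}")
```

With TP equal to the number of GPUs per node, every TP group sits on a single node, and the PP and DP neighbors are the ones reached over InfiniBand.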

Topology details that occasionally matter

Interactive · which parallelism survives this fabric?

Set node count, intra-node BW, inter-node BW, and the model size. The widget computes the per-step communication cost of each parallelism strategy and overlays it on a "step time" line. Anything above the line is bottlenecked by communication.

[Interactive figure: Per-step comm cost as a fraction of step time. Stacked bars show comm cost for each strategy on a hypothetical 7B-class model with 32 layers, batch 16, sequence 4k: TP (intra), FSDP (intra), FSDP (across all), DDP (across all). Anything > 100% is comm-bound on this fabric.]

Takeaway
Four bandwidth numbers explain almost every parallelism rule in this series. NVLink is to IB what HBM is to NVLink: a large step down in bandwidth, about 4× from HBM to NVLink and about 18× from NVLink to a single IB NIC. Strategies that communicate per layer (TP, FSDP) want NVLink; strategies that communicate per step (DDP) or per stage (PP) tolerate IB.