Data parallel — the simplest fan-out
Every rank holds the full model. Every rank gets a different micro-batch. The only collective is one AllReduce of gradients per step. The whole lesson is how that AllReduce hides behind backward.
The setup in one paragraph
Pick N GPUs. Replicate the model on each one. Split your effective batch B into N micro-batches of size B/N. Each rank does forward + backward on its own micro-batch independently — no communication. At the end of backward, each rank holds its own local gradient. To match the math of single-GPU SGD with batch B, we need every rank to step on the average gradient:
That averaging is one AllReduce. The optimizer step is rank-local: every rank has the same averaged gradient, the same weights, and a deterministic optimizer, so they all arrive at the same post-step weights. Lockstep, no drift.
DistributedSampler shards data indices per rank but doesn't manage per-op RNG state — handle that yourself when correctness depends on it).
2D · the lockstep, four ranks sharing gradients
Four ranks, each holding the full model. They forward and backward on different data. Then, an animated AllReduce sweeps gradients into an average and broadcasts that average back. At the end of one step, every rank is back in sync. Watch the "diverged" indicator turn green at the AllReduce moment.
The basic cost, before any tricks
Per step: one AllReduce of all gradients. Gradients are bf16, so volume per rank moved through the ring is:
For a 7B model: ~28 GB per rank per step. At intra-node ring bandwidth of ~150 GB/s (effective, well below the NVLink peak because NCCL serialises ring steps): ~190 ms. Step time on a 7B forward+backward at batch 4 × sequence 4k is roughly 100 ms. So an unhidden AllReduce doubles the step time. Hidden behind backward — see below — it costs essentially nothing.
The overlap trick (the real lesson)
Gradients become available during backward, not after it. Backward proceeds layer by layer from the loss backward to the input: layer L's gradient is computed first, then L-1, then L-2, …, then layer 0. Each layer's gradient takes some compute time to materialise.
Naive approach: wait for all gradients, then AllReduce the whole flat buffer. Wall-clock = T_bwd + T_ar.
Overlap approach: as soon as a layer's gradient is ready, fire its AllReduce now on a separate CUDA stream. The main stream continues computing earlier layers' gradients. NCCL's collective runs concurrently with the compute kernels:
main stream: [bwd L=80] [bwd L=79] [bwd L=78] ... [bwd L=1]
nccl stream: [AR L=80] [AR L=79] ... [AR L=2] [AR L=1]
↑ fires when L=80 grad is ready
If the AllReduce of layer L finishes before backward of layer L-1 finishes, the comm is free. If it doesn't, the gap is exposed and counts. The condition for "free" is:
For inter-node DDP on a large model, this usually holds layer-by-layer if backward is bandwidth-rich enough; for intra-node DDP on a small model, it usually does not, because there's almost no compute to hide behind. The right thing in either case is to bucket.
Bucketing — the size knob you'll actually tune
Per-tensor AllReduce has a per-message latency floor (NCCL launch overhead ≈ tens of microseconds). If you fire one AllReduce per layer, on a 100-layer model that's 100 × tens-of-μs = milliseconds of latency tax, before any bytes flow. So we bundle consecutive layers' gradients into a flat bucket, and AllReduce the bucket as a unit.
Two competing pressures:
- Bigger buckets → fewer NCCL launches, higher effective bandwidth (the asymptotic ring rate kicks in), less latency tax.
- Smaller buckets → AllReduce can start earlier in backward (you don't wait for many layers to fill a big bucket), so there's more compute left to hide behind.
PyTorch DDP's default bucket size is 25 MB. The right value depends on bandwidth, message-latency, and model layer-count; it's a one-line constructor knob (bucket_cap_mb) and you can profile to find your sweet spot. The widget below makes the tradeoff visible.
Animated · backward + bucketed AllReduce, scrubbable
Same idea as the simulator below, but you can see it. The top lane is backward compute proceeding layer-by-layer from L-1 down to 0. The bottom lane is the NCCL stream; as each bucket fills, an AllReduce fires. Press play, then drag the bucket-size slider while it's playing — watch the orange bars resize and reposition in real time.
What "scaling efficiency" really measures
Scaling efficiency = throughput at N GPUs / (N × throughput at 1 GPU). 100% is impossible — there's always some unhidden communication. 95% is excellent. 80% is mediocre. Below 70% means you should look hard at:
- NCCL bucket size (above).
- Whether AllReduce is on a separate stream and overlapping (an
nsysprofile will show NCCL on a different stream than compute). - Whether some ranks are stragglers (per-rank step time should be flat ± 1%).
- Whether you're inter-node DDP on a model that should have been FSDP or 3D-parallel.
2D · step time vs bucket size — find the sweet spot
The same simulator the timeline above ran, swept across all bucket sizes from 1 MB to 2 GB. The curve is non-monotonic: too small → latency tax dominates (many tiny AllReduces, each paying α launch overhead); too large → one big AllReduce at the end of backward with no overlap. The valley between is your bucket-size choice. The vertical line shows where PyTorch's 25 MB default lands; drag the inputs to see the valley shift.
Gradient accumulation — fewer AllReduces, smaller effective batch per micro-step
If your effective batch is too small to be compute-bound (think tiny per-GPU batch on a small model), you can do K backward passes without sync, accumulating gradients on each rank locally, and only AllReduce on the K-th. PyTorch's no_sync() context manager does this:
for k in range(K - 1):
with model.no_sync():
loss = forward(model, micro_batch_k); loss.backward()
loss = forward(model, micro_batch_K); loss.backward() # this one syncs
optimizer.step()
This reduces AllReduce frequency by K×, lets each rank see effectively K · B/N tokens per step, and gives a numerically equivalent (modulo precision) gradient to running batch B · K. The cost is K backward passes' worth of activations on each rank — unless you activation-checkpoint.
The wall DDP eventually hits
DDP scales throughput linearly with GPUs (in the overlap regime), but each rank still holds the full model state. Recall lesson 01: a 70B model takes 1.1 TB of training state per rank under Adam. No amount of DDP fixes that — every rank has its own copy. DDP cannot solve the memory wall. When you hit it, the response is FSDP / ZeRO (lesson 05), which shards the state across ranks at the cost of more communication.
DDP is also unkind to inter-node bandwidth when scaled large: 4 · params bytes per step, all crossing the ring. At 1000 GPUs and a 7B model, the AllReduce still asymptotes to ~28 GB per rank but the per-link load on a flat ring grows. NCCL's hierarchical AllReduce (lesson 02) saves you there.
- The model state doesn't fit on one rank → use FSDP / ZeRO instead.
- You can't hide the AllReduce behind backward → check bucket size, check whether you should have been FSDP-HSDP.
- The AllReduce dominates the inter-node link → use HSDP or compose with PP across nodes.
Interactive · bucket size and overlap
One pane shows a simulated backward: layer-by-layer compute (blue) and the corresponding AllReduce (orange) on a separate stream. With a small bucket size, you see many small AllReduces — high latency tax. With a huge bucket size, you see one big AllReduce that fires only after backward is essentially done — no overlap. The sweet spot is in between. The KPIs show total step time, exposed comm, and effective scaling efficiency.