Data parallel — the simplest fan-out

Every rank holds the full model. Every rank gets a different micro-batch. The only collective is one AllReduce of gradients per step. The whole lesson is how that AllReduce hides behind backward.

The setup in one paragraph

Pick N GPUs. Replicate the model on each one. Split your effective batch B into N micro-batches of size B/N. Each rank does forward + backward on its own micro-batch independently — no communication. At the end of backward, each rank holds its own local gradient. To match the math of single-GPU SGD with batch B, we need every rank to step on the average gradient:

g_global = (1/N) · Σ_i g_local⁽ⁱ⁾

That averaging is one AllReduce. The optimizer step is rank-local: every rank has the same averaged gradient, the same weights, and a deterministic optimizer, so they all arrive at the same post-step weights. Lockstep, no drift.

A subtle correctness detail

AllReduce with op=SUM is exact (floating-point associativity issues aside). To get an average, you can either AllReduce + divide locally, or AllReduce with op=AVG (NCCL supports it). In practice NCCL does sum-then-divide; the tiny non-associativity error is bounded and ignored. RNG also matters: dropout, sampling, and so on must be seeded the same per rank if you want deterministic forward results across replicas (PyTorch's DistributedSampler shards data indices per rank but doesn't manage per-op RNG state — handle that yourself when correctness depends on it).

2D · the lockstep, four ranks sharing gradients

Four ranks, each holding the full model. They forward and backward on different data. Then, an animated AllReduce sweeps gradients into an average and broadcasts that average back. At the end of one step, every rank is back in sync. Watch the "diverged" indicator turn green at the AllReduce moment.

The basic cost, before any tricks

Per step: one AllReduce of all gradients. Gradients are bf16, so volume per rank moved through the ring is:

bytes_per_step = 2 · (N − 1)/N · |params| · 2 ≈ 4 · |params| (large N)

For a 7B model: ~28 GB per rank per step. At intra-node ring bandwidth of ~150 GB/s (effective, well below the NVLink peak because NCCL serialises ring steps — this is the flat-ring figure from lesson 03's peak-vs-effective note, the canonical source for these numbers): ~190 ms. Step time on a 7B forward+backward at batch 4 × sequence 4k is roughly 100 ms. So an unhidden AllReduce doubles the step time. Hidden behind backward — see below — it costs essentially nothing.

The overlap trick (the real lesson)

Gradients become available during backward, not after it. Backward proceeds layer by layer from the loss backward to the input: layer L's gradient is computed first, then L-1, then L-2, …, then layer 0. Each layer's gradient takes some compute time to materialise.

Naive approach: wait for all gradients, then AllReduce the whole flat buffer. Wall-clock = T_bwd + T_ar.

Overlap approach: as soon as a layer's gradient is ready, fire its AllReduce now on a separate CUDA stream. The main stream continues computing earlier layers' gradients. NCCL's collective runs concurrently with the compute kernels:

main stream:    [bwd L=80] [bwd L=79] [bwd L=78] ...  [bwd L=1]
nccl stream:                [AR L=80]  [AR L=79]  ...  [AR L=2] [AR L=1]
                            ↑ fires when L=80 grad is ready

If the AllReduce of layer L finishes before backward of layer L-1 finishes, the comm is free. If it doesn't, the gap is exposed and counts. The condition for "free" is:

T_AR(layer) ≤ T_bwd(earlier layers since AR fired)

For inter-node DDP on a large model, this usually holds layer-by-layer if backward is bandwidth-rich enough; for intra-node DDP on a small model, it usually does not, because there's almost no compute to hide behind. The right thing in either case is to bucket.

Bucketing — the size knob you'll actually tune

Per-tensor AllReduce has a per-message latency floor (NCCL launch overhead ≈ tens of microseconds). If you fire one AllReduce per layer, on a 100-layer model that's 100 × tens-of-μs = milliseconds of latency tax, before any bytes flow. So we bundle consecutive layers' gradients into a flat bucket, and AllReduce the bucket as a unit.

Two competing pressures:

Bigger buckets → fewer NCCL launches, higher effective bandwidth (the asymptotic ring rate kicks in), less latency tax.
Smaller buckets → AllReduce can start earlier in backward (you don't wait for many layers to fill a big bucket), so there's more compute left to hide behind.

PyTorch DDP's default bucket size is 25 MB. The right value depends on bandwidth, message-latency, and model layer-count; it's a one-line constructor knob (bucket_cap_mb) and you can profile to find your sweet spot. The widget below makes the tradeoff visible.

Animated · backward + bucketed AllReduce, scrubbable

Same idea as the simulator below, but you can see it. The top lane is backward compute proceeding layer-by-layer from L-1 down to 0. The bottom lane is the NCCL stream; as each bucket fills, an AllReduce fires. Press play, then drag the bucket-size slider while it's playing — watch the orange bars resize and reposition in real time.

What "scaling efficiency" really measures

Scaling efficiency = throughput at N GPUs / (N × throughput at 1 GPU). 100% is impossible — there's always some unhidden communication. 95% is excellent. 80% is mediocre. Below 70% means you should look hard at:

NCCL bucket size (above).
Whether AllReduce is on a separate stream and overlapping (an nsys profile will show NCCL on a different stream than compute).
Whether some ranks are stragglers (per-rank step time should be flat ± 1%).
Whether you're inter-node DDP on a model that should have been FSDP or 3D-parallel.

2D · step time vs bucket size — find the sweet spot

The same simulator the timeline above ran, swept across all bucket sizes from 1 MB to 2 GB. The curve is non-monotonic: too small → latency tax dominates (many tiny AllReduces, each paying α launch overhead); too large → one big AllReduce at the end of backward with no overlap. The valley between is your bucket-size choice. The vertical line shows where PyTorch's 25 MB default lands; drag the inputs to see the valley shift.

Gradient accumulation — fewer AllReduces, smaller effective batch per micro-step

If your effective batch is too small to be compute-bound (think tiny per-GPU batch on a small model), you can do K backward passes without sync, accumulating gradients on each rank locally, and only AllReduce on the K-th. PyTorch's no_sync() context manager does this:

for k in range(K - 1):
    with model.no_sync():
        loss = forward(model, micro_batch_k); loss.backward()
loss = forward(model, micro_batch_K); loss.backward()  # this one syncs
optimizer.step()

This reduces AllReduce frequency by K×, lets each rank see effectively K · B/N tokens per step, and gives a numerically equivalent (modulo precision) gradient to running batch B · K. The cost is K backward passes' worth of activations on each rank — unless you activation-checkpoint.

The wall DDP eventually hits

DDP scales throughput linearly with GPUs (in the overlap regime), but each rank still holds the full model state. Recall lesson 01: a 70B model takes 1.1 TB of training state per rank under Adam. No amount of DDP fixes that — every rank has its own copy. DDP cannot solve the memory wall. When you hit it, the response is FSDP / ZeRO (lesson 05), which shards the state across ranks at the cost of more communication.

DDP is also unkind to inter-node bandwidth when scaled large: 4 · params bytes per step, all crossing the ring. At 1000 GPUs and a 7B model, the AllReduce still asymptotes to ~28 GB per rank but the per-link load on a flat ring grows. NCCL's hierarchical AllReduce (lesson 02) saves you there.

When DDP "stops working"

The model state doesn't fit on one rank → use FSDP / ZeRO instead.
You can't hide the AllReduce behind backward → check bucket size, check whether you should have been FSDP-HSDP.
The AllReduce dominates the inter-node link → use HSDP or compose with PP across nodes.

Interactive · bucket size and overlap

One pane shows a simulated backward: layer-by-layer compute (blue) and the corresponding AllReduce (orange) on a separate stream. With a small bucket size, you see many small AllReduces — high latency tax. With a huge bucket size, you see one big AllReduce that fires only after backward is essentially done — no overlap. The sweet spot is in between. The KPIs show total step time, exposed comm, and effective scaling efficiency.

DDP step timeline · bucket size vs overlap

Each layer takes T_layer to backward. Each AllReduce takes α + S/BW where S is the bucket size. The orange bars are AllReduces fired on a separate stream; gaps after backward ends are exposed comm — pure overhead.

layers: 32 per-layer bwd (ms): 6

params/layer (MB): 200 bucket size (MB): 100 AllReduce BW (GB/s): 150 α (μs): 30

backward time

—

comm time

—

exposed comm

—

scaling efficiency

—

What breaks bucketing

Bucketing assumes a fixed graph. After iteration 1, DDP records the reverse order in which params become ready during backward, groups them into buckets, and fires each bucket's AllReduce the moment it fills. That autograd-hook machinery is fast but brittle — three production knobs decide whether it works at all.

find_unused_parameters=True — needed when control flow skips some params (a branch not taken, a head not used this step). DDP expects every param that was registered to receive a gradient; if one never does, its bucket never fills and its AllReduce never fires — so the collective is incomplete and the step hangs (every rank waits forever on a NCCL op that will never be matched). The flag fixes this by traversing the autograd graph each iteration to find which params actually got used, then marking the rest ready. That traversal costs real time — don't enable it unless you must.
static_graph=True — your promise that the set of used params (and the backward order) is identical every iteration. DDP then skips the per-iteration rebuild, can replay a cached bucket schedule, and unlocks compatibility with activation checkpointing (which otherwise confuses the hook-firing logic). It breaks the moment the used-param set changes between iterations — the regime where it fails is exactly dynamic control flow, which is what find_unused_parameters is for.
gradient_as_bucket_view=True — makes each param's .grad a view into the flat bucket buffer rather than a separate tensor. Backward writes gradients straight into the bucket, so there's no copy-into-bucket step and no second allocation. Removes one copy per param and cuts peak memory by roughly the gradient footprint.

The hang is silent

A missing AllReduce doesn't error — it deadlocks. If a DDP run wedges at step boundaries with no traceback, the first suspect is an unused parameter (a conditional branch, an auxiliary head turned off) that needs find_unused_parameters=True — or, if the graph really is static, a static_graph=True setting that was violated by a change in which params got used.

Takeaway

DDP is a one-AllReduce-per-step strategy that scales linearly until you hit either the memory wall (parameters won't replicate) or the inter-node link (AllReduce can't be hidden). Both walls are real and both are addressed by the lessons ahead — FSDP for the first, hierarchical/HSDP for the second.