system_ml / 04 · data parallel lesson 4 / 19

Data parallel — the simplest fan-out

Every rank holds the full model. Every rank gets a different micro-batch. The only collective is one AllReduce of gradients per step. The whole lesson is how that AllReduce hides behind backward.

The setup in one paragraph

Pick N GPUs. Replicate the model on each one. Split your effective batch B into N micro-batches of size B/N. Each rank does forward + backward on its own micro-batch independently — no communication. At the end of backward, each rank holds its own local gradient. To match the math of single-GPU SGD with batch B, we need every rank to step on the average gradient:

g_global  =  (1/N) · Σi g_local(i)

That averaging is one AllReduce. The optimizer step is rank-local: every rank has the same averaged gradient, the same weights, and a deterministic optimizer, so they all arrive at the same post-step weights. Lockstep, no drift.

A subtle correctness detail
AllReduce with op=SUM is exact (floating-point associativity issues aside). To get an average, you can either AllReduce + divide locally, or AllReduce with op=AVG (NCCL supports it). In practice NCCL does sum-then-divide; the tiny non-associativity error is bounded and ignored. RNG also matters: dropout, sampling, and so on must be seeded the same per rank if you want deterministic forward results across replicas (PyTorch's DistributedSampler shards data indices per rank but doesn't manage per-op RNG state — handle that yourself when correctness depends on it).

2D · the lockstep, four ranks sharing gradients

Four ranks, each holding the full model. They forward and backward on different data. Then, an animated AllReduce sweeps gradients into an average and broadcasts that average back. At the end of one step, every rank is back in sync. Watch the "diverged" indicator turn green at the AllReduce moment.

DDP lockstep · forward → backward → AllReduce → step
Each rank box shows: W (weights), G (local gradient). Different shades on G represent local-gradient values. After AllReduce, all G's converge to the average (orange).
phase
ranks agree on G?
ranks agree on W?
collective active

The basic cost, before any tricks

Per step: one AllReduce of all gradients. Gradients are bf16, so volume per rank moved through the ring is:

bytes_per_step  =  2 · (N − 1)/N · |params| · 2  ≈  4 · |params| (large N)

For a 7B model: ~28 GB per rank per step. At intra-node ring bandwidth of ~150 GB/s (effective, well below the NVLink peak because NCCL serialises ring steps): ~190 ms. Step time on a 7B forward+backward at batch 4 × sequence 4k is roughly 100 ms. So an unhidden AllReduce doubles the step time. Hidden behind backward — see below — it costs essentially nothing.

The overlap trick (the real lesson)

Gradients become available during backward, not after it. Backward proceeds layer by layer from the loss backward to the input: layer L's gradient is computed first, then L-1, then L-2, …, then layer 0. Each layer's gradient takes some compute time to materialise.

Naive approach: wait for all gradients, then AllReduce the whole flat buffer. Wall-clock = T_bwd + T_ar.

Overlap approach: as soon as a layer's gradient is ready, fire its AllReduce now on a separate CUDA stream. The main stream continues computing earlier layers' gradients. NCCL's collective runs concurrently with the compute kernels:

main stream:    [bwd L=80] [bwd L=79] [bwd L=78] ...  [bwd L=1]
nccl stream:                [AR L=80]  [AR L=79]  ...  [AR L=2] [AR L=1]
                            ↑ fires when L=80 grad is ready

If the AllReduce of layer L finishes before backward of layer L-1 finishes, the comm is free. If it doesn't, the gap is exposed and counts. The condition for "free" is:

TAR(layer)  ≤  Tbwd(earlier layers since AR fired)

For inter-node DDP on a large model, this usually holds layer-by-layer if backward is bandwidth-rich enough; for intra-node DDP on a small model, it usually does not, because there's almost no compute to hide behind. The right thing in either case is to bucket.

Bucketing — the size knob you'll actually tune

Per-tensor AllReduce has a per-message latency floor (NCCL launch overhead ≈ tens of microseconds). If you fire one AllReduce per layer, on a 100-layer model that's 100 × tens-of-μs = milliseconds of latency tax, before any bytes flow. So we bundle consecutive layers' gradients into a flat bucket, and AllReduce the bucket as a unit.

Two competing pressures:

PyTorch DDP's default bucket size is 25 MB. The right value depends on bandwidth, message-latency, and model layer-count; it's a one-line constructor knob (bucket_cap_mb) and you can profile to find your sweet spot. The widget below makes the tradeoff visible.

Animated · backward + bucketed AllReduce, scrubbable

Same idea as the simulator below, but you can see it. The top lane is backward compute proceeding layer-by-layer from L-1 down to 0. The bottom lane is the NCCL stream; as each bucket fills, an AllReduce fires. Press play, then drag the bucket-size slider while it's playing — watch the orange bars resize and reposition in real time.

Backward + AllReduce overlap timeline · play and tune buckets
When a layer's gradient is ready, it's appended to the current bucket. Bucket flushes (AllReduces) when it exceeds bucket size. Exposed comm (red shading) is wall-clock you pay over backward time.
step time
backward time
exposed comm
scaling efficiency

What "scaling efficiency" really measures

Scaling efficiency = throughput at N GPUs / (N × throughput at 1 GPU). 100% is impossible — there's always some unhidden communication. 95% is excellent. 80% is mediocre. Below 70% means you should look hard at:

2D · step time vs bucket size — find the sweet spot

The same simulator the timeline above ran, swept across all bucket sizes from 1 MB to 2 GB. The curve is non-monotonic: too small → latency tax dominates (many tiny AllReduces, each paying α launch overhead); too large → one big AllReduce at the end of backward with no overlap. The valley between is your bucket-size choice. The vertical line shows where PyTorch's 25 MB default lands; drag the inputs to see the valley shift.

Bucket sweep · the step-time valley
Each x is a bucket size; each y is the simulated total step time. The minimum is the sweet spot; defaults are decent but profiling can find a 5–15% improvement.
optimal bucket
optimal step time
default (25 MB) step time
gap vs optimal

Gradient accumulation — fewer AllReduces, smaller effective batch per micro-step

If your effective batch is too small to be compute-bound (think tiny per-GPU batch on a small model), you can do K backward passes without sync, accumulating gradients on each rank locally, and only AllReduce on the K-th. PyTorch's no_sync() context manager does this:

for k in range(K - 1):
    with model.no_sync():
        loss = forward(model, micro_batch_k); loss.backward()
loss = forward(model, micro_batch_K); loss.backward()  # this one syncs
optimizer.step()

This reduces AllReduce frequency by , lets each rank see effectively K · B/N tokens per step, and gives a numerically equivalent (modulo precision) gradient to running batch B · K. The cost is K backward passes' worth of activations on each rank — unless you activation-checkpoint.

The wall DDP eventually hits

DDP scales throughput linearly with GPUs (in the overlap regime), but each rank still holds the full model state. Recall lesson 01: a 70B model takes 1.1 TB of training state per rank under Adam. No amount of DDP fixes that — every rank has its own copy. DDP cannot solve the memory wall. When you hit it, the response is FSDP / ZeRO (lesson 05), which shards the state across ranks at the cost of more communication.

DDP is also unkind to inter-node bandwidth when scaled large: 4 · params bytes per step, all crossing the ring. At 1000 GPUs and a 7B model, the AllReduce still asymptotes to ~28 GB per rank but the per-link load on a flat ring grows. NCCL's hierarchical AllReduce (lesson 02) saves you there.

When DDP "stops working"
  1. The model state doesn't fit on one rank → use FSDP / ZeRO instead.
  2. You can't hide the AllReduce behind backward → check bucket size, check whether you should have been FSDP-HSDP.
  3. The AllReduce dominates the inter-node link → use HSDP or compose with PP across nodes.

Interactive · bucket size and overlap

One pane shows a simulated backward: layer-by-layer compute (blue) and the corresponding AllReduce (orange) on a separate stream. With a small bucket size, you see many small AllReduces — high latency tax. With a huge bucket size, you see one big AllReduce that fires only after backward is essentially done — no overlap. The sweet spot is in between. The KPIs show total step time, exposed comm, and effective scaling efficiency.

DDP step timeline · bucket size vs overlap
Each layer takes T_layer to backward. Each AllReduce takes α + S/BW where S is the bucket size. The orange bars are AllReduces fired on a separate stream; gaps after backward ends are exposed comm — pure overhead.
backward time
comm time
exposed comm
scaling efficiency
Takeaway
DDP is a one-AllReduce-per-step strategy that scales linearly until you hit either the memory wall (parameters won't replicate) or the inter-node link (AllReduce can't be hidden). Both walls are real and both are addressed by the lessons ahead — FSDP for the first, hierarchical/HSDP for the second.