FSDP / ZeRO — sharding what DDP replicates

Take the redundancy out of DDP. The three things every rank kept identical copies of — optimizer state, gradients, parameters — get sharded one at a time. Each stage trades a little more communication for a lot less memory.

The observation that starts it all

DDP makes N copies of optimizer state, N copies of gradients, and N copies of parameters. After the gradient AllReduce, all N ranks compute the same optimizer step on the same averaged gradient and arrive at the same new weights. That's redundant compute on identical data. If rank 0 does the optimizer step for the first 1/N of the weights, rank 1 does the next 1/N, …, every rank still ends up with its own slice of new weights — and if we agree to share those slices on demand, we never had to hold all N copies in the first place.

That's the entire idea. The "ZeRO" name is from Rajbhandari et al. (DeepSpeed, 2019); "FSDP" (PyTorch's name for the ZeRO-3 mechanic, plus a layer of usability) became the canonical implementation in production.

The three stages

State pieces, in increasing order of "how often you need them":

Optimizer state (fp32 master weights + Adam m/v) — touched only at the optimizer step, once per training step.
Gradients — touched at backward, once per training step.
Parameters — touched at every forward and backward, once per layer.

The optimizer state is the easiest to shard because you only need your shard to do your step. Gradients are next — you need the full grad to AllReduce, but if you ReduceScatter instead, every rank ends up with the slice of the summed gradient it needs. Parameters are the hardest, because every forward pass actually needs the full layer's weights — so we have to "reconstruct" the layer just-in-time and free it right after.

Stage	Param	Grad	Opt	Memory / rank	Comm / step vs DDP
0 (DDP)	full	full	full	2P + 2P + 12P = 16P	1.0× (one AllReduce)
1	full	full	1/N	2P + 2P + 12P/N	~1.0× (still one AllReduce)
2	full	1/N	1/N	2P + 2P/N + 12P/N	~1.0× (ReduceScatter + Broadcast ≡ AllReduce)
3 (FSDP)	1/N	1/N	1/N	(2P + 2P + 12P)/N = 16P/N	~1.5× (extra AllGather of params per layer)

P = number of parameters. Bytes-per-param: 2 (bf16 weight) + 2 (bf16 grad) + 4 (fp32 master) + 8 (Adam m,v) = 16 in mixed precision (we packaged "master + m + v" as the 12-byte "opt state" above).

FSDP's mechanic — what one forward step actually does

For each layer, in order:

AllGather the layer's parameters. Every rank holds 1/N of the layer's weights; before forward, they cooperatively reconstruct the full layer. Cost: |layer| · (N-1)/N bytes per rank, asymptotically |layer|.
Compute the layer's forward. Local computation, output saved for backward (or recomputed via activation checkpointing).
Free the AllGathered weights. We're done with the full layer; throw it away and reclaim HBM. We keep our 1/N shard.

Backward, in reverse layer order:

AllGather the layer's parameters again. We need the weights to compute the input-side gradient (the chain rule pulls the weights through the operation). Same cost as above.
Compute the layer's backward. Produces a gradient tensor of the same shape as the full layer.
ReduceScatter the gradient. Every rank ends up with the summed gradient of its 1/N slice. Same cost as AllReduce's two halves combined.
Free the AllGathered weights. Same as in forward.

So per layer: 2 AllGathers + 1 ReduceScatter. Versus DDP's 1 AllReduce (which is itself ≡ AllGather + ReduceScatter). FSDP is roughly 1.5× the communication. In exchange, each rank's memory drops from 16P to 16P/N.

The 1.5× is a communication-volume figure, not necessarily 1.5× wall-clock. FSDP overlaps the extra collectives with compute by prefetching: while layer L computes, it AllGathers layer L+1's weights (forward prefetch), and symmetrically in backward (backward prefetch). The limit_all_gathers knob (the "rate limiter") bounds how many AllGathers run ahead so prefetch doesn't blow up peak memory. When the network is fast enough to hide the AllGather behind the matmuls, the realized step time is far below 1.5× DDP — the extra bytes are real, but they need not be on the critical path.

Animated · one FSDP layer step, scrubbable

Below is a frame-by-frame view of what each of N=4 ranks holds across one forward+backward of a 4-layer transformer. The bottom strip is the per-rank HBM curve over time: notice the spikes at AllGather and the troughs after Free. The flat baseline is the permanent shard each rank keeps; everything above it is the AllGathered, transient full layer.

2D · the four ZeRO stages laid out across ranks

The same model, the same N=8 ranks, four different ways to slice its training state. Toggle a stage. Solid squares = stored on this rank; faded squares = held by some other rank. The total colored area on a given row is what that rank actually pays for in HBM.

Why is it only 1.5×, not 2× or 3×?

Two reasons. First, the AllReduce that DDP "saves" is itself decomposable as AllGather + ReduceScatter — same bytes total. Second, FSDP's extra AllGather (the second one, during backward) only adds one additional collective; the other two are ZeRO-2-and-below's "normal" cost rewritten as primitives.

What the unit of sharding is

FSDP doesn't shard the model parameter-by-parameter. It shards at the granularity of a FSDP unit, which is typically a transformer block. The unit is the thing that gets AllGathered and freed as a whole. Choosing the unit is a knob:

Bigger units (e.g. the whole model) → fewer collectives, but each unit must fit on one rank temporarily when AllGathered. Defeats the purpose of FSDP for very large models.
Smaller units (every nn.Linear) → many tiny collectives, latency-dominated, low effective bandwidth.
Per transformer block (the production default) → good balance. The unit is hundreds of MB to a few GB, large enough that AllGather is bandwidth-bound, small enough that you can free it after forward.

FSDP1 vs FSDP2 — per-parameter sharding is the 2024+ default

Everything above describes FSDP1, which flattens each unit's parameters into one big 1D buffer and shards that flat tensor. That flat-parameter design has sharp edges: every parameter in a unit must share one dtype (awkward for mixed-dtype modules), optimizer state lives on opaque flat slices rather than real parameters, and no_sync behaves subtly. FSDP2 (torch.distributed._composable.fsdp.fully_shard) instead shards each parameter individually as a DTensor (sharded on dim 0). This is the 2024+ default: it allows mixed dtypes per unit, gives clean per-parameter optimizer state, and has better-defined no_sync/gradient-accumulation semantics. The memory and communication accounting in this lesson is identical for both — FSDP2 just fixes the flat-parameter ergonomics.

HSDP — the production compromise

Pure FSDP across all N GPUs sends collectives across the slow inter-node link constantly. Recall from lesson 03: per-layer AllGather at 50 GB/s inter-node for a 70B model is ~5–10 ms; do 80 layers × 3 collectives = ~1.5 s. Useless.

HSDP (Hybrid Sharded Data Parallel) shards within a node and replicates across nodes:

Intra-node: FSDP across the 8 GPUs of a node. AllGathers use NVLink at 900 GB/s. Memory per GPU is 16P/8 = 2P.
Inter-node: DDP replication across nodes. One AllReduce of gradients per step (across the shards — each rank only holds 1/8 of the gradient, so the inter-node AllReduce is 8× smaller too).

The memory math is exact, not a vibe. Shard across a group of size G (one node, G=8) and replicate across the N/G groups. Per-rank training state is then

16P/G (HSDP) vs 16P/N (full FSDP),

so HSDP holds N/G× more sharded state than full FSDP. With G=8, N=512 that's 64× — full FSDP would shard each rank down to 16P/512, HSDP only to 16P/8. The saving relative to DDP is purely a function of G, and it is independent of how many nodes you add. That is the trade: you spend memory (a factor N/G more state per rank) to buy bandwidth (the per-layer AllGather becomes a fast intra-node NVLink collective instead of a slow inter-node one). Comm lands at ~1.1× DDP. It is the actual production layout used for most ≤ ~30B models.

ZeRO++ — attacking the inter-node bytes directly

HSDP fixes the inter-node problem by not sharding across nodes — it pays memory to avoid the slow link. ZeRO++ (Wang et al., 2023) takes the other route: keep sharding across all N ranks, but shrink the bytes that actually cross the wire. Recall the FSDP bill: 2 AllGather + 1 ReduceScatter per layer, the 1.5× figure. ZeRO++ has one trick per collective.

qwZ — quantized-weight AllGather. The forward AllGather moves bf16 weights. But the gathered copy is transient and only feeds a matmul, so it tolerates lower precision: gather in int8 (or fp8) instead, dequantize on arrival. That cuts the forward AllGather volume from 2P to 1P bytes — a 2× reduction on that collective (4× at fp4). The fp32 shard each rank keeps is untouched; only the on-the-wire representation is quantized, so it is not the same as training in int8.

hpZ — hierarchical weight partitioning. The second AllGather (in backward) is the one ZeRO++ removes from the inter-node path entirely. Keep a full bf16 replica of the weights within each node (sharded 1/8 across the node's GPUs, but the node collectively holds the whole model). Then the backward AllGather is satisfied from intra-node NVLink — zero inter-node bytes for that collective. This is exactly HSDP's memory-for-bandwidth trade applied to one of the three collectives rather than all of them: you spend the extra per-node weight replica, you buy back the slow backward gather.

qgZ — quantized-gradient ReduceScatter. The remaining collective is the gradient ReduceScatter. Quantize gradients to int4 for the inter-node hops, with a hierarchical (intra-node first, then inter-node) all-to-all-style reduction that keeps quantization error from compounding across many reduction steps. That shrinks the 1P ReduceScatter toward 0.25P on the wire.

Stack all three and the inter-node bill drops from the FSDP 1.5× (≈ 2P gather + 2P gather + 1P scatter ≈ 5P/3 per the earlier accounting) to roughly 1P qwZ-gather + 0 inter-node (hpZ) + 0.25P qgZ-scatter — a ~4× cut on cross-node traffic, which is the bandwidth that was killing pure FSDP at scale. The cost is precision risk (quantized comm) and the hpZ per-node replica; it is the answer when you want full 16P/N sharding and tolerable inter-node comm, i.e. the models too big for HSDP's 16P/G to fit.

3D · the HSDP topology, isometric

Four nodes, eight GPUs each — the canonical 32-GPU box. The eight GPUs within a node form an FSDP group (they cooperatively hold one full copy of the model, sharded 1/8). The four nodes form a DDP group (each node holds a complete replica). Click a GPU to highlight its FSDP group and its DDP peers across the other nodes; click a different shard color to walk through all eight slices.

HSDP topology · click a GPU

Within a node: NVLink (gold edges) connects the 8 FSDP shard-mates. Across nodes: IB (blue) connects DDP replicas at corresponding shard positions. The IB AllReduce only carries 1/8 of the gradient because each rank only owns 1/8.

nodes: 4 rotation: 0° show:

selected GPU

— click one —

FSDP group size

—

DDP group size

—

grads/step on IB

—

Sharding vs activation checkpointing — orthogonal axes

Two memory-saving techniques, often confused, addressing two different sources:

Technique	What it saves	What it costs
FSDP / ZeRO	Parameter + gradient + optimizer state	~1.5× communication
Activation checkpointing	Activations stored for backward	~33% extra compute (recompute forward during backward)

Production training of any large model uses both. FSDP-3 + activation checkpointing brings memory per rank to ~16P/N + small, which is what's been letting Llama-3 and friends fit on H100 clusters.

FSDP-specific gotcha: gradient accumulation under no_sync (which skips the per-micro-step reshard/ReduceScatter to save communication) keeps the unsharded, full-size gradients resident across micro-steps, so peak memory climbs back toward the DDP 2P grad cost instead of 2P/N — the opposite of what FSDP bought you. If you're memory-bound, accumulate without no_sync and pay the extra ReduceScatters (more on this tradeoff in lesson 11).

Compute–memory tradeoff, exactly

The right way to think about ZeRO stages is as a knob that smoothly converts memory pressure into communication pressure. Until the model fits, increasing ZeRO stage is free (the comm bound isn't hit). Once it fits, increasing the stage further is wasted communication — you'd be better off using the freed memory for a bigger batch or longer sequence.

Rough decision tree:

Single-GPU training state fits → use DDP. No ZeRO needed.
Optimizer state pushes you over → ZeRO-1.
Gradients also too big → ZeRO-2.
Parameters themselves too big → ZeRO-3 / FSDP.
Activations also too big → add activation checkpointing.
A single transformer layer too big to AllGather even temporarily → composing with TP (lesson 06).

Interactive · ZeRO stage vs memory vs comm

Set the model size, optimizer, world size N. The widget plots memory per rank and per-step communication across the four ZeRO stages, plus HSDP. The "you are here" marker on the memory axis is your chosen GPU's HBM budget. Watch the bars cross under the budget as you move up the stages.

Takeaway

ZeRO is a knob: at each click up, one more redundancy is shaved off in exchange for slightly more communication. FSDP (ZeRO-3) gives you N× memory reduction at ~1.5× DDP's comm. HSDP shards only within a group of size G (16P/G per rank vs FSDP's 16P/N), trading N/G× more resident state for a near-DDP, all-intra-node comm bill. Pure FSDP across nodes is wasteful; HSDP is the production default for everything that's not tensor-parallel.