
3D / nD parallelism — composition rules

Six strategies, one cluster. Stack them so the hottest communication lands on the fastest fabric, and the optimization stops being about any one strategy and becomes about which one runs where.

The orthogonality observation

Each strategy in Part II shards a different axis of the work:

Strategy        What it shards                 Comm pattern                          Comm scale
TP              Feature dim of weights         AllReduce per layer × 2               Large · high-frequency
FSDP / ZeRO-3   Param/grad/opt slices          AllGather + ReduceScatter per layer   Large · high-frequency
SP              Sequence in LN/Dropout zones   Same total as TP (reshaped)           Same as TP
CP              Sequence in attention          K/V ring rotation                     Linear in seq · scales with depth
EP              Experts (MoE)                  AllToAll × 2 per layer                Large · latency-bound
PP              Layers                         Send/Recv per stage boundary          Small · per-stage
DP              Data batch                     1 AllReduce of grads per step         Large · once per step

Because they shard different axes, you can use them simultaneously. A rank that's TP=8 × PP=4 × DP=16 has identity (t, p, d): it owns the t-th feature slice of the p-th pipeline stage of the d-th data replica. t ∈ [0,8), p ∈ [0,4), d ∈ [0,16) — 512 ranks total.
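The (t, p, d) identity can be made concrete with a little index arithmetic. This is a minimal sketch, assuming global ranks are numbered with t varying fastest, then p, then d, so that each TP group is 8 consecutive ranks; that ordering and the names `Coord`, `rank_to_coord`, and `coord_to_rank` are illustrative choices, not something the lesson prescribes.

```python
from typing import NamedTuple

TP, PP, DP = 8, 4, 16  # axis sizes from the example above


class Coord(NamedTuple):
    t: int  # feature-shard index within a layer
    p: int  # pipeline-stage index
    d: int  # data-replica index


def rank_to_coord(rank: int) -> Coord:
    """Unflatten a global rank, assuming t varies fastest, then p, then d."""
    if not 0 <= rank < TP * PP * DP:
        raise ValueError(f"rank {rank} out of range")
    rest, t = divmod(rank, TP)
    d, p = divmod(rest, PP)
    return Coord(t, p, d)


def coord_to_rank(c: Coord) -> int:
    """Inverse of rank_to_coord under the same ordering assumption."""
    return (c.d * PP + c.p) * TP + c.t


print(rank_to_coord(0))    # Coord(t=0, p=0, d=0)
print(rank_to_coord(13))   # Coord(t=5, p=1, d=0)
print(rank_to_coord(511))  # Coord(t=7, p=3, d=15)
```

With this ordering, ranks 0 to 7 share (p, d) = (0, 0) and differ only in t, which is what lets a TP group sit on a single 8-GPU node.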

The decision tree

Here is the recipe, in priority order — the order ranks are physically assigned to dimensions:

  1. TP first, intra-node only. TP fires the most-frequent, largest collectives. Pin it to NVLink. TP = 8 on an HGX/DGX node is the default; TP = 16 only on systems with extended NVLink domains (GH200, GB200 NVL72).
  2. EP next, ideally intra-node. AllToAll latency wants NVLink. Important caveat: TP and EP compete for the same intra-node slots — typical MoE recipes (DeepSeek-V3, Mixtral training) keep TP small or 1 inside experts to leave the node bandwidth for EP. Some frontier MoE runs (DeepSeek-V3) instead push EP across nodes (EP=64) and rely on a custom IB-aware AllToAll; this is the "I have read the topology carefully" exception, not the default.
  3. SP/CP "alongside" TP. SP is on whenever TP > 1 (free: its communication total matches TP's, only reshaped). CP is on for sequences of 32k tokens or longer.
  4. PP across nodes. Activation Send/Recv is small, infrequent, IB-tolerant.
  5. DP / FSDP outermost. One gradient sync per step is the most-overlap-friendly.
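The recipe above amounts to ordering the axes from innermost to outermost and seeing which ones still fit inside a node. A minimal sketch, assuming ranks are packed contiguously onto nodes and that an axis stays on NVLink only if the block of ranks it spans fits in (and divides) one node; `place_axes` is a hypothetical helper, not a framework API.

```python
def place_axes(axes: list[tuple[str, int]], gpus_per_node: int) -> dict[str, str]:
    """Classify each axis as intra-node or inter-node.

    `axes` is ordered innermost (fastest-varying rank index) first,
    i.e. in the priority order of the recipe.
    """
    placement = {}
    span = 1  # consecutive ranks covered by one group of the axes seen so far
    for name, size in axes:
        span *= size
        if size == 1:
            placement[name] = "unused"
        elif span <= gpus_per_node and gpus_per_node % span == 0:
            placement[name] = "intra-node"  # fits on NVLink
        else:
            placement[name] = "inter-node"  # has to cross IB
    return placement


dense = place_axes([("TP", 8), ("PP", 4), ("DP", 16)], gpus_per_node=8)
print(dense)  # {'TP': 'intra-node', 'PP': 'inter-node', 'DP': 'inter-node'}

too_wide = place_axes([("TP", 16), ("PP", 4), ("DP", 8)], gpus_per_node=8)
print(too_wide["TP"])  # inter-node
```

The second call shows the failure mode the recipe guards against: once TP exceeds the node size, its collectives spill onto the slower fabric.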

The 70B-on-512-GPUs example:

           rank id = (t, p, d) ∈ [0,8) × [0,4) × [0,16)
                       ▲       ▲       ▲
                       │       │       │
                  TP=8 │  PP=4 │  DP=16
                       │       │       │
                  intra-node   across   across
                  (NVLink)     nodes    DCN
                               (IB)     (IB)

Memory accounting per rank under this layout: model state divided by TP × PP = 32. For a 70B model in mixed precision, total param/grad/opt state is about 1120 GB (16 bytes per parameter), so each rank holds 1120 GB / 32 = 35 GB. With FSDP-HSDP layered on top of DP (shard intra-DP-group, replicate inter), you can push that further if memory's still tight.
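The accounting above is easy to reproduce. The sketch below assumes 16 bytes of state per parameter, which is what 1120 GB for 70B parameters implies; the usual breakdown (bf16 weights and grads plus fp32 master weights and two Adam moments) is an assumption here, as is the `fsdp_shard` parameter used to model an HSDP shard group.

```python
# Assumed breakdown: 2 (bf16 weights) + 2 (bf16 grads)
# + 4 + 4 + 4 (fp32 master weights, Adam m, Adam v) = 16 bytes per parameter.
BYTES_PER_PARAM = 16


def state_gb_per_rank(params: float, tp: int, pp: int, fsdp_shard: int = 1) -> float:
    """Param/grad/optimizer state per rank in GB (activations not included)."""
    total_gb = params * BYTES_PER_PARAM / 1e9
    return total_gb / (tp * pp * fsdp_shard)


print(state_gb_per_rank(70e9, tp=8, pp=4))                # 35.0
print(state_gb_per_rank(70e9, tp=8, pp=4, fsdp_shard=4))  # 8.75
```

The second line shows the effect of sharding across a hypothetical 4-way FSDP group inside DP: the same 35 GB drops to 8.75 GB per rank, at the price of extra AllGather and ReduceScatter traffic.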

3D · the (TP, PP, DP) cube, isometric

Each rank in the 512-GPU example lives at exactly one lattice point (t, p, d). The cube below has TP along one axis, PP along another, DP along the third. Click a cell to see which model slice that GPU holds — which layers (from PP), which intra-layer feature shard (from TP), which data replica (from DP). Toggle "compress DP" to render the full 16-replica depth (small dots) or collapse to a representative slice for legibility.

[Interactive figure: (TP, PP, DP) cube. TP=8 (feature shards · blue axis) · PP=4 (pipeline stages · orange axis) · DP=16 (data replicas · green axis). Same color across DP = same model slice replicated. Same color across TP = different feature shards of the same layer. Readouts: total GPUs, selected (t,p,d), model slice, DP peers (same slice).]

What about activation memory?

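Each parallelism strategy shards a different part of the activations: TP splits the feature dimension inside the attention and MLP blocks, SP splits the sequence in the LN/Dropout zones, CP splits the sequence inside attention, and PP confines each rank to the layers of its own stage.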

For long-context training, the layout often pivots to favor SP + CP + a smaller TP — because attention activation memory is what bites first.

Animated · one forward step through the 3D-parallel stack

Watch what one forward pass actually does in a TP × PP × DP layout. TP fires intra-stage AllReduces twice per layer (after attention and after MLP). PP sends activations to the next stage at every stage boundary. DP doesn't fire at all on the forward — it only kicks in during backward as a single gradient AllReduce. The bottom strip shows which fabric is "lit up" at each instant: NVLink for TP, IB for PP / DP.
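The sequence of events described above can be enumerated in a few lines. This is a sketch under stated assumptions: 80 layers split evenly over 4 stages (the layer count is borrowed from the TP cost discussion in this lesson), a single micro-batch, and a hypothetical `forward_events` generator that only labels events rather than performing any communication.

```python
from collections import Counter
from typing import Iterator


def forward_events(num_layers: int, pp: int) -> Iterator[tuple[str, str]]:
    """Yield (collective, fabric) for one micro-batch's forward pass."""
    layers_per_stage = num_layers // pp  # assumes an even split
    for stage in range(pp):
        for _ in range(layers_per_stage):
            yield ("TP AllReduce", "NVLink")  # after attention
            yield ("TP AllReduce", "NVLink")  # after MLP
        if stage < pp - 1:
            yield ("PP Send/Recv", "IB")      # hand activations to the next stage
    # DP emits nothing here: its gradient AllReduce happens after backward.


counts = Counter(forward_events(num_layers=80, pp=4))
print(counts[("TP AllReduce", "NVLink")])  # 160
print(counts[("PP Send/Recv", "IB")])      # 3
```

The counts make the imbalance visible: 160 NVLink collectives against 3 IB transfers in one forward pass.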

[Animated figure: forward step, which fabric is active when. Boxes = pipeline stages. Within each stage, layers are processed in order; each layer does compute + 2× TP AllReduce. Stage boundaries do a PP send/recv. The DP AllReduce only fires after the (not shown) backward. Readouts: phase, collective, fabric, running cost.]

Why not TP across nodes?

Recall the cost from lesson 06: TP fires 4 AllReduces per layer per step, each ~B · T · d · 2 bytes. At NVLink 900 GB/s, 80 layers × 4 = 320 AllReduces × 270 MB ≈ 86 GB, or ~95 ms of comm per step. At IB 50 GB/s: ~1.7 s. The same parallelism that's nearly free intra-node is an 18× slowdown inter-node (900 / 50 = 18). That's the rule. If your model needs TP > (GPUs per node), you have two bad options:
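As a check on the bandwidth arithmetic earlier in this paragraph, here is a small sketch that uses the same simplification as the text: total bytes divided by link bandwidth, ignoring latency and the AllReduce algorithm's own overhead factor. The constants are the ones quoted above; `tp_comm_seconds` is a hypothetical helper.

```python
LAYERS = 80
ALLREDUCES_PER_LAYER = 4     # per step, as quoted above
BYTES_PER_ALLREDUCE = 270e6  # ~B * T * d * 2 bytes


def tp_comm_seconds(bandwidth_gb_s: float) -> float:
    """Total TP communication time per step at a given link bandwidth."""
    total_bytes = LAYERS * ALLREDUCES_PER_LAYER * BYTES_PER_ALLREDUCE
    return total_bytes / (bandwidth_gb_s * 1e9)


nvlink = tp_comm_seconds(900)
ib = tp_comm_seconds(50)
print(f"NVLink: {nvlink * 1e3:.0f} ms")  # NVLink: 96 ms
print(f"IB:     {ib:.2f} s")             # IB:     1.73 s
print(f"ratio:  {ib / nvlink:.0f}x")     # ratio:  18x
```

Because the byte count is identical in both cases, the slowdown is just the bandwidth ratio, 900 / 50.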

Why not skip TP, just FSDP?

For a model where every layer fits on one GPU with FSDP+checkpointing, you should skip TP. FSDP is simpler (no per-block AllReduces, no GQA-vs-TP-size knot, no rank-aware kernels). The rule of thumb at this writing: pure FSDP-HSDP is fine up to ~30B params. Above that, the per-layer params (or activations under longer context) usually force TP into the mix.

A 4D example with EP

An illustrative MoE layout (DeepSeek-V3-style; 671B total, 37B active, 256 routed experts, 61 layers):

       (TP=1, EP=8, PP=16, DP=8) = 1024 GPUs · 128 nodes  [teaching layout]

       TP=1   per-expert MLP is small, don't fragment it further
       EP=8   shard 256 experts across 8 GPUs of one node (32 per GPU)
       PP=16  shard ~61 layers across 16 stages
       DP=8   replicate the (EP×PP) group 8× for batch throughput

(The actual published DeepSeek-V3 training uses a different shape — EP=64 spanning multiple nodes with an IB-aware AllToAll, plus DualPipe pipelining — but the structure above is the easier mental model for a first read.)

The hottest comm is the AllToAll on the EP axis — pinned to the fastest available fabric. The next-hottest is the gradient AllReduce on the DP axis — once per step, can tolerate IB. PP comm is point-to-point activations — IB-tolerant. TP is unused because each expert is small enough to fit on one GPU; larger-expert MoEs (Mixtral 8x22B, Grok) flip this and use TP inside each expert.
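The numbers in the teaching layout can be verified mechanically. A minimal sketch, assuming 8 GPUs per node (implied by 1024 GPUs on 128 nodes) and an as-even-as-possible split of the 61 layers over the pipeline stages; the split policy is an assumption, not something the layout specifies.

```python
from math import prod

layout = {"TP": 1, "EP": 8, "PP": 16, "DP": 8}
GPUS_PER_NODE = 8     # assumption implied by 1024 GPUs on 128 nodes
ROUTED_EXPERTS = 256
LAYERS = 61

total_gpus = prod(layout.values())
nodes = total_gpus // GPUS_PER_NODE
experts_per_gpu = ROUTED_EXPERTS // layout["EP"]

# 61 layers do not divide evenly by 16 stages: some stages get one extra layer.
base, extra = divmod(LAYERS, layout["PP"])
layers_per_stage = [base + 1] * extra + [base] * (layout["PP"] - extra)

print(total_gpus, nodes, experts_per_gpu)  # 1024 128 32
print(layers_per_stage.count(4), layers_per_stage.count(3))  # 13 3
```

The uneven 4-versus-3 layer split is why the layout says "~61 layers across 16 stages": real pipelines have to balance stages that cannot all be the same depth.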

3D parallelism is not the same as 3D model

A confusing thing: "3D parallel" doesn't refer to the model having three dimensions; it refers to the layout having three orthogonal parallelism axes. We're already up to 5D (TP, PP, DP, EP, CP) for MoE long-context training, and "nD" is sometimes the name in modern frameworks. The principle hasn't changed: each axis is an independent sharding decision, the ranks are a lattice.
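The "ranks are a lattice" view generalizes to any number of named axes. Below is a toy sketch, assuming the last axis listed varies fastest and using size 2 on every axis purely to keep the output small; `build_mesh` and `peers` are hypothetical helpers written for this illustration, not calls from any framework.

```python
from itertools import product


def build_mesh(axes: dict[str, int]) -> dict[int, dict[str, int]]:
    """Map each rank to named coordinates; the last axis varies fastest."""
    names = list(axes)
    coords = product(*(range(axes[n]) for n in names))
    return {rank: dict(zip(names, c)) for rank, c in enumerate(coords)}


def peers(mesh: dict[int, dict[str, int]], rank: int, axis: str) -> list[int]:
    """Ranks that differ from `rank` only along `axis` (one communication group)."""
    me = mesh[rank]
    return [r for r, c in mesh.items()
            if all(c[k] == me[k] for k in c if k != axis)]


# Toy 5D mesh: 2 ranks per axis, 32 ranks total.
mesh = build_mesh({"DP": 2, "PP": 2, "EP": 2, "CP": 2, "TP": 2})
print(len(mesh))             # 32
print(mesh[5])               # {'DP': 0, 'PP': 0, 'EP': 1, 'CP': 0, 'TP': 1}
print(peers(mesh, 0, "TP"))  # [0, 1]
print(peers(mesh, 0, "DP"))  # [0, 16]
```

Each axis yields its own family of groups: the innermost axis groups neighboring ranks, while the outermost groups ranks that are far apart in the numbering.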

Production frameworks

2D · decision tree for picking the layout

Below is the decision tree from the rules above, drawn out. Pick a model size and a cluster shape (GPUs per node × nodes); the tree walks itself, highlighting the leaf that matches and showing the rationale at each branch. Hover any node to see its check; click to "lock in" a path.

[Interactive figure: decision tree, which TP/PP/DP/FSDP combination fits? Branch conditions are heuristics, not gospel; at the boundaries (e.g. exactly 30B) reality dithers. The leaf shows the recommended (TP, PP, FSDP scope, DP) tuple. Readouts: recommended layout, total GPUs used, memory / rank (rough), rationale.]

Interactive · layout designer

Pick a model and a cluster. The widget lets you set TP, PP, DP, FSDP-on-DP, EP. It checks TP · PP · DP = total GPUs, computes memory per rank and rough per-step comm load, and flags violations (TP > GPUs/node, FSDP across all DP at IB, etc.).
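Two of the checks described above are simple enough to sketch directly. This is an illustrative approximation, not the widget's actual logic: the `Layout` dataclass, `derive_dp`, and `violations` are hypothetical names, and only the product check and the TP-versus-node-size check are implemented.

```python
from dataclasses import dataclass


@dataclass
class Layout:
    tp: int
    pp: int
    dp: int
    total_gpus: int
    gpus_per_node: int = 8  # assumption: 8-GPU nodes


def derive_dp(total_gpus: int, tp: int, pp: int) -> int:
    """DP is whatever is left once TP and PP are chosen."""
    if total_gpus % (tp * pp) != 0:
        raise ValueError("TP * PP must divide the total GPU count")
    return total_gpus // (tp * pp)


def violations(layout: Layout) -> list[str]:
    """Return human-readable problems; an empty list means the checks pass."""
    problems = []
    if layout.tp * layout.pp * layout.dp != layout.total_gpus:
        problems.append("TP * PP * DP does not equal total GPUs")
    if layout.tp > layout.gpus_per_node:
        problems.append("TP exceeds GPUs per node, so TP traffic crosses IB")
    return problems


print(derive_dp(512, tp=8, pp=4))                              # 16
print(violations(Layout(tp=8, pp=4, dp=16, total_gpus=512)))   # []
print(violations(Layout(tp=16, pp=4, dp=8, total_gpus=512)))
```

The last call still multiplies out to 512 GPUs but is flagged because TP=16 no longer fits inside an 8-GPU node.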

[Interactive figure: layout designer, "does this fit?" Memory and comm estimates are rough but directionally honest. Try the canonical 70B layout (TP=8, PP=4, DP=16, GPUs=512) and watch what happens when you push TP=16 (above intra-node limit) or move FSDP across all of DP. Readouts: DP derived, memory / rank, comm / step, verdict.]

Takeaway
Composition is about matching comm frequency to interconnect. The fastest fabric (NVLink) goes to the chattiest strategy (TP, then EP). The slower fabric (IB) accommodates DP (once per step) and PP (point-to-point, small). The whole 3D-or-nD layout is a single optimization: minimize total communication time per step (bytes moved ÷ fabric bandwidth, times how often each collective fires), subject to memory fitting. Get this right and the rest of the system tunes itself.