3D / nD parallelism — composition rules
Six strategies, one cluster. Stack them so the hottest communication lands on the fastest fabric, and the optimization stops being about any one strategy and becomes about which one runs where.
The orthogonality observation
Each strategy in Part II shards a different axis of the work:
| Strategy | What it shards | Comm pattern | Comm scale |
|---|---|---|---|
| TP | Feature dim of weights | 2 AllReduces per layer per pass (forward and backward) | Large · high-frequency |
| FSDP / ZeRO-3 | Param/grad/opt slices | AllGather + ReduceScatter per layer | Large · high-frequency |
| SP | Sequence in LN/Dropout zones | Same total as TP (reshaped) | Same as TP |
| CP | Sequence in attention | K/V ring rotation | Linear in seq · scales with depth |
| EP | Experts (MoE) | AllToAll × 2 per layer | Large · latency-bound |
| PP | Layers | Send/Recv per stage boundary | Small · per-stage |
| DP | Data batch | 1 AllReduce of grads per step | Large · once per step |
Because they shard different axes, you can use them simultaneously. A rank that's TP=8 × PP=4 × DP=16 has identity (t, p, d): it owns the t-th feature slice of the p-th pipeline stage of the d-th data replica. t ∈ [0,8), p ∈ [0,4), d ∈ [0,16) — 512 ranks total.
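One way to make the (t, p, d) identity concrete is to map a flat rank onto lattice coordinates. The sketch below assumes TP is the fastest-varying axis, so that consecutive ranks (which usually share a node) form a TP group. The function names are illustrative and not taken from any framework.

```python
TP, PP, DP = 8, 4, 16  # axis sizes from the example above


def rank_to_coords(rank: int) -> tuple[int, int, int]:
    """Map a flat rank to (t, p, d), with t varying fastest (assumed ordering)."""
    if not 0 <= rank < TP * PP * DP:
        raise ValueError(f"rank {rank} out of range")
    t = rank % TP
    p = (rank // TP) % PP
    d = rank // (TP * PP)
    return t, p, d


def coords_to_rank(t: int, p: int, d: int) -> int:
    """Inverse of rank_to_coords."""
    return t + TP * (p + PP * d)


world_size = TP * PP * DP
print(world_size)           # 512
print(rank_to_coords(0))    # (0, 0, 0)
print(rank_to_coords(45))   # (5, 1, 1)
```

Other axis orderings are possible; what matters is that the mapping is a bijection between ranks and lattice points.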
The decision tree
Here is the recipe, in priority order, which is also the order in which ranks are physically assigned to dimensions:
- TP first, intra-node only. TP fires the most-frequent, largest collectives. Pin it to NVLink. TP = 8 on an HGX/DGX node is the default; TP = 16 only on systems with extended NVLink domains (GH200, GB200 NVL72).
- EP next, ideally intra-node. AllToAll latency wants NVLink. Caveat: TP and EP compete for the same intra-node slots, so typical MoE recipes (DeepSeek-V3, Mixtral training) keep TP small or 1 inside experts to leave the node bandwidth for EP. Some frontier runs go further and push EP across nodes (DeepSeek-V3 uses EP=64), relying on a custom IB-aware AllToAll; this is the "I have read the topology carefully" exception, not the default.
- SP/CP alongside TP. SP is on whenever TP > 1, since it adds no communication volume beyond what TP already moves. CP is on for sequences ≥ 32k tokens.
- PP across nodes. Activation Send/Recv is small, infrequent, IB-tolerant.
- DP / FSDP outermost. One gradient sync per step is the easiest communication to overlap with compute.
The 70B-on-512-GPUs example:
rank id = (t, p, d) ∈ [0,8) × [0,4) × [0,16)
                        ▲       ▲       ▲
                        │       │       │
                      TP=8    PP=4    DP=16
                        │       │       │
                   intra-node across  across
                    (NVLink)  nodes    DCN
                              (IB)    (IB)
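A hedged sketch of how this layout could translate into communication groups. It assumes 8 GPUs per node, ranks numbered node by node, and TP as the fastest-varying axis. The helper names are hypothetical; real frameworks build equivalent groups internally.

```python
from itertools import product

TP, PP, DP = 8, 4, 16
GPUS_PER_NODE = 8  # assumption: HGX/DGX-style node, ranks numbered node by node


def rank_of(t: int, p: int, d: int) -> int:
    # TP varies fastest, then PP, then DP (assumed ordering).
    return t + TP * (p + PP * d)


def groups_along(axis: str) -> list[list[int]]:
    """All rank groups that communicate along one axis ('tp', 'pp' or 'dp')."""
    if axis == "tp":
        return [[rank_of(t, p, d) for t in range(TP)]
                for p, d in product(range(PP), range(DP))]
    if axis == "pp":
        return [[rank_of(t, p, d) for p in range(PP)]
                for t, d in product(range(TP), range(DP))]
    if axis == "dp":
        return [[rank_of(t, p, d) for d in range(DP)]
                for t, p in product(range(TP), range(PP))]
    raise ValueError(axis)


def nodes_spanned(group: list[int]) -> int:
    """Number of distinct nodes a group of ranks touches."""
    return len({r // GPUS_PER_NODE for r in group})


tp_groups = groups_along("tp")
print(len(tp_groups), max(nodes_spanned(g) for g in tp_groups))  # 64 1
print(groups_along("pp")[0])  # [0, 8, 16, 24]
```

Under these assumptions every TP group stays inside one node, while each PP group has one rank per node, so its Send/Recv traffic crosses the inter-node fabric.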
Memory accounting per rank under this layout: model state divided by TP × PP = 32. For a 70B model in mixed precision, assuming the usual 16 bytes per parameter (bf16 weights and gradients plus fp32 master weights and two Adam moments), that's 1120 GB / 32 = 35 GB of param/grad/opt state per rank. With FSDP-HSDP layered on top of DP (shard within a DP subgroup, replicate across subgroups), you can push that further if memory is still tight.
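The arithmetic behind the 35 GB figure, as a rough sketch. It assumes 16 bytes of model state per parameter (bf16 weight and gradient, fp32 master weight, two fp32 Adam moments), which is a common mixed-precision accounting but not the only one. Activations are excluded, and the `fsdp_shard` value of 4 is purely illustrative.

```python
PARAMS = 70e9
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4  # bf16 weight, bf16 grad, fp32 master, Adam m, Adam v (assumed)
TP, PP, DP = 8, 4, 16


def state_gb_per_rank(params: float, tp: int, pp: int, fsdp_shard: int = 1) -> float:
    """Param/grad/optimizer state per rank in GB (1 GB = 1e9 bytes)."""
    total_gb = params * BYTES_PER_PARAM / 1e9
    return total_gb / (tp * pp * fsdp_shard)


print(PARAMS * BYTES_PER_PARAM / 1e9)        # 1120.0
print(state_gb_per_rank(PARAMS, TP, PP))     # 35.0
print(state_gb_per_rank(PARAMS, TP, PP, 4))  # 8.75 (hypothetical: also sharded over 4 DP ranks)
```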
3D · the (TP, PP, DP) cube, isometric
Each rank in the 512-GPU example lives at exactly one lattice point (t, p, d). The cube below has TP along one axis, PP along another, DP along the third. Click a cell to see which model slice that GPU holds — which layers (from PP), which intra-layer feature shard (from TP), which data replica (from DP). Toggle "compress DP" to render the full 16-replica depth (small dots) or collapse to a representative slice for legibility.
What about activation memory?
Each parallelism strategy shards a different part of the activations:
- TP shards the feature dim of matmul activations.
- SP shards the LN/dropout activations along sequence.
- CP shards the attention activations along sequence.
- PP stores activations only for the layers in its stage, which is an implicit shard along depth.
For long-context training, the layout often pivots to favor SP + CP + a smaller TP — because attention activation memory is what bites first.
Animated · one forward step through the 3D-parallel stack
Watch what one forward pass actually does in a TP × PP × DP layout. TP fires intra-stage AllReduces twice per layer (after attention and after MLP). PP sends activations to the next stage at every stage boundary. DP doesn't fire at all on the forward — it only kicks in during backward as a single gradient AllReduce. The bottom strip shows which fabric is "lit up" at each instant: NVLink for TP, IB for PP / DP.
Why not TP across nodes?
Recall the cost from lesson 06: TP fires 4 AllReduces per layer per step (2 forward, 2 backward), each ~B · T · d · 2 bytes. At NVLink 900 GB/s, 80 layers × 4 = 320 AllReduces × 270 MB ≈ 86 GB, or ~96 ms of comm per step. At IB 50 GB/s: ~1.7 s. The same parallelism that's cheap intra-node is an 18× slowdown inter-node. That's the rule. If your model needs TP > (GPUs per node), you have two bad options:
- Live with the inter-node TP cost. Only viable if step time is dominated by some other resource (e.g. very compute-heavy MoE that hides the TP cost).
- Switch to PP for that extra factor of sharding. Memory: same. Cost: pipeline bubble. Often the right choice.
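The NVLink-versus-IB comparison can be reproduced in a few lines. This is a back-of-the-envelope model: it divides total bytes by link bandwidth and ignores AllReduce algorithm factors, latency, and overlap with compute. The 270 MB message size is taken from the text; it matches, for example, 16,384 tokens × d = 8192 × 2 bytes, which is an assumed shape.

```python
LAYERS = 80
ALLREDUCES_PER_LAYER = 4   # 2 forward + 2 backward
MSG_BYTES = 270e6          # ~B * T * d * 2 bytes, value from the text
NVLINK_BPS = 900e9         # bytes per second
IB_BPS = 50e9              # bytes per second


def tp_comm_seconds(bandwidth_bps: float) -> float:
    """Naive per-step TP comm time: total bytes / bandwidth."""
    total_bytes = LAYERS * ALLREDUCES_PER_LAYER * MSG_BYTES
    return total_bytes / bandwidth_bps


nvlink_s = tp_comm_seconds(NVLINK_BPS)
ib_s = tp_comm_seconds(IB_BPS)
print(f"NVLink: {nvlink_s * 1e3:.0f} ms")  # NVLink: 96 ms
print(f"IB:     {ib_s:.2f} s")             # IB:     1.73 s
print(f"ratio:  {ib_s / nvlink_s:.0f}x")   # ratio:  18x
```

In this simple model the ratio is just the bandwidth ratio (900 / 50), independent of model shape.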
Why not skip TP, just FSDP?
For a model where every layer fits on one GPU with FSDP plus activation checkpointing, you should skip TP. FSDP is simpler (no per-block AllReduces, no GQA-vs-TP-size knot, no rank-aware kernels). The rule of thumb as of this writing: pure FSDP-HSDP is fine up to ~30B params. Above that, the per-layer params (or activations under longer context) usually force TP into the mix.
A 4D example with EP
An illustrative MoE layout (DeepSeek-V3-style; 671B total, 37B active, 256 routed experts, 61 layers):
(TP=1, EP=8, PP=16, DP=8) = 1024 GPUs · 128 nodes [teaching layout]
TP=1    per-expert MLP is small, don't fragment it further
EP=8    shard 256 experts across 8 GPUs of one node (32 per GPU)
PP=16   shard 61 layers across 16 stages (3 to 4 layers each)
DP=8    replicate the (EP×PP) group 8× for batch throughput
(The actual published DeepSeek-V3 training uses a different shape — EP=64 spanning multiple nodes with an IB-aware AllToAll, plus DualPipe pipelining — but the structure above is the easier mental model for a first read.)
The hottest comm is the AllToAll on the EP axis — pinned to the fastest available fabric. The next-hottest is the gradient AllReduce on the DP axis — once per step, can tolerate IB. PP comm is point-to-point activations — IB-tolerant. TP is unused because each expert is small enough to fit on one GPU; larger-expert MoEs (Mixtral 8x22B, Grok) flip this and use TP inside each expert.
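A quick sanity check of the teaching layout's arithmetic. The uneven layer split (61 layers over 16 stages) is handled here by giving the first stages one extra layer. That is an assumption for illustration; real pipelines often balance stages differently, for example to account for the embedding and output layers.

```python
TP, EP, PP, DP = 1, 8, 16, 8
GPUS_PER_NODE = 8
ROUTED_EXPERTS = 256
LAYERS = 61

world_size = TP * EP * PP * DP
nodes = world_size // GPUS_PER_NODE
experts_per_gpu = ROUTED_EXPERTS // EP

# Spread 61 layers over 16 stages: the first (61 % 16) stages get one extra layer.
base, extra = divmod(LAYERS, PP)
layers_per_stage = [base + 1 if s < extra else base for s in range(PP)]

print(world_size, nodes, experts_per_gpu)  # 1024 128 32
print(layers_per_stage)                    # 13 stages of 4 layers, then 3 stages of 3
```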
3D parallelism is not the same as a 3D model
A common confusion: "3D parallel" doesn't refer to the model having three dimensions; it refers to the layout having three orthogonal parallelism axes. We're already up to 5D (TP, PP, DP, EP, CP) for MoE long-context training, and modern frameworks sometimes call this "nD". The principle hasn't changed: each axis is an independent sharding decision, and the ranks form a lattice.
Production frameworks
- Megatron-LM (NVIDIA). Pioneered TP + PP + DP. Now supports SP, CP, EP. Reference implementation that most others learned from.
- DeepSpeed (Microsoft). ZeRO + MoE machinery. Composable with Megatron via Megatron-DeepSpeed.
- FSDP (PyTorch native). Production-ready FSDP-HSDP. Better ergonomics than the Megatron stack at the cost of fewer knobs.
- NeMo (NVIDIA). Higher-level wrapper around Megatron with off-the-shelf recipes for 3D parallel + long context + MoE.
- TorchTitan (Meta). Newer FSDP-first stack; targets the "I want all axes but in clean PyTorch" niche.
2D · decision tree for picking the layout
Below is the decision tree from the rules above, drawn out. Pick a model size and a cluster shape (GPUs per node × nodes); the tree walks itself, highlighting the leaf that matches and showing the rationale at each branch. Hover any node to see its check; click to "lock in" a path.
Interactive · layout designer
Pick a model and a cluster. The widget lets you set TP, PP, DP, FSDP-on-DP, EP. It checks TP · PP · DP = total GPUs, computes memory per rank and rough per-step comm load, and flags violations (TP > GPUs/node, FSDP across all DP at IB, etc.).
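A minimal sketch of the kind of checks such a designer might run. The widget's actual logic is not shown in this text, so every name here is hypothetical, as are the 80 GB HBM default and the 16-bytes-per-parameter accounting. The per-step comm estimate and the FSDP-over-IB check are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Layout:
    tp: int
    pp: int
    dp: int
    fsdp_shard: int = 1  # how many DP ranks share one sharded copy (1 = plain DP)


def check_layout(layout: Layout, params: float, total_gpus: int,
                 gpus_per_node: int = 8, hbm_gb: float = 80.0,
                 bytes_per_param: int = 16) -> list[str]:
    """Return a list of human-readable violations (empty list = layout passes)."""
    problems = []
    if layout.tp * layout.pp * layout.dp != total_gpus:
        problems.append("TP * PP * DP != total GPUs")
    if layout.tp > gpus_per_node:
        problems.append("TP spans nodes")
    if layout.dp % layout.fsdp_shard != 0:
        problems.append("FSDP shard size does not divide DP")
    shards = layout.tp * layout.pp * layout.fsdp_shard
    state_gb = params * bytes_per_param / 1e9 / shards
    if state_gb > hbm_gb:
        problems.append(f"model state {state_gb:.0f} GB exceeds {hbm_gb:.0f} GB HBM")
    return problems


print(check_layout(Layout(tp=8, pp=4, dp=16), params=70e9, total_gpus=512))   # []
print(check_layout(Layout(tp=16, pp=2, dp=16), params=70e9, total_gpus=512))  # ['TP spans nodes']
```

Returning a list of violations instead of raising on the first one lets a caller display every problem with a candidate layout at once.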