3D / nD parallelism — composition rules

Six strategies, one cluster. Stack them so the hottest communication lands on the fastest fabric, and the optimization stops being about any one strategy and becomes about which one runs where.

The orthogonality observation

Each strategy in Part II shards a different axis of the work:

Strategy	What it shards	Comm pattern	Comm scale
TP	Feature dim of weights	AllReduce per layer × 2	Large · high-frequency
FSDP / ZeRO-3	Param/grad/opt slices	AllGather + ReduceScatter per layer	Large · high-frequency
SP	Sequence in LN/Dropout zones	Same total as TP (reshaped)	Same as TP
CP	Sequence in attention	K/V ring rotation	Linear in seq · scales with depth
EP	Experts (MoE)	AllToAll × 2 per layer	Large · latency-bound
PP	Layers	Send/Recv per stage boundary	Small · per-stage
DP	Data batch	1 AllReduce of grads per step	Large · once per step

Because they shard different axes, you can use them simultaneously. A rank in a TP=8 × PP=4 × DP=16 layout has identity (t, p, d): it owns the t-th feature slice of the p-th pipeline stage of the d-th data replica, with t ∈ [0,8), p ∈ [0,4), d ∈ [0,16), for 8 × 4 × 16 = 512 ranks total.
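To make the lattice concrete, here is a minimal sketch of the mapping between a flat rank id and its (t, p, d) coordinate. The ordering used (TP varies fastest, then PP, then DP) is an assumption chosen so that consecutive ranks share a TP group; real frameworks may order the axes differently, and the function names are hypothetical.

```python
from typing import NamedTuple

TP, PP, DP = 8, 4, 16  # layout from the text: 8 * 4 * 16 = 512 ranks


class Coord(NamedTuple):
    t: int  # feature-slice index (TP axis)
    p: int  # pipeline-stage index (PP axis)
    d: int  # data-replica index (DP axis)


def rank_to_coord(rank: int, tp: int = TP, pp: int = PP, dp: int = DP) -> Coord:
    """Map a flat rank id to (t, p, d). Assumed ordering: TP fastest, DP slowest."""
    if not 0 <= rank < tp * pp * dp:
        raise ValueError(f"rank {rank} outside [0, {tp * pp * dp})")
    return Coord(t=rank % tp, p=(rank // tp) % pp, d=rank // (tp * pp))


def coord_to_rank(c: Coord, tp: int = TP, pp: int = PP) -> int:
    """Inverse of rank_to_coord under the same assumed ordering."""
    return c.t + tp * (c.p + pp * c.d)
```

Under this ordering, ranks 0 through 7 all have p = 0 and d = 0 and differ only in t, which is what lets a TP group sit on one 8-GPU node.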

The decision tree

Here is the recipe, in priority order — the order ranks are physically assigned to dimensions:

TP first, intra-node only. TP fires the most-frequent, largest collectives. Pin it to NVLink. TP = 8 on an HGX/DGX node is the default; TP = 16 only on systems with extended NVLink domains (GH200, GB200 NVL72).
EP next, ideally intra-node. AllToAll latency wants NVLink. Important caveat: TP and EP compete for the same intra-node slots, so typical MoE recipes (DeepSeek-V3, Mixtral training) keep TP small or 1 inside experts to leave the node bandwidth for EP. DeepSeek-V3 additionally pushes EP across nodes (EP=64) and relies on a custom IB-aware AllToAll; this is the "I have read the topology carefully" exception, not the default.
SP/CP "alongside" TP. SP is on whenever TP > 1; it is free because its total communication is the same as TP's, only reshaped. CP is on for sequences ≥ 32k tokens.
PP across nodes. Activation Send/Recv is small, infrequent, IB-tolerant.
DP / FSDP outermost. One gradient sync per step is the easiest communication to overlap with compute.
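The assignment order above can be checked mechanically. The sketch below enumerates the communication groups along each axis and tests whether a group stays inside one node. It assumes ranks are numbered with TP varying fastest and that each node holds a contiguous block of `gpus_per_node` ranks; both are illustrative assumptions, and the helper names are hypothetical.

```python
from collections import defaultdict


def axis_groups(tp: int, pp: int, dp: int) -> dict[str, list[list[int]]]:
    """Return, per axis, the lists of ranks that communicate along that axis."""
    groups = {"tp": defaultdict(list), "pp": defaultdict(list), "dp": defaultdict(list)}
    for rank in range(tp * pp * dp):
        # Assumed ordering: TP fastest, then PP, then DP.
        t, p, d = rank % tp, (rank // tp) % pp, rank // (tp * pp)
        groups["tp"][(p, d)].append(rank)  # same stage and replica, varying t
        groups["pp"][(t, d)].append(rank)  # same slice and replica, varying p
        groups["dp"][(t, p)].append(rank)  # same slice and stage, varying d
    return {axis: list(g.values()) for axis, g in groups.items()}


def spans_one_node(group: list[int], gpus_per_node: int) -> bool:
    """True if every rank in the group lives on the same node."""
    return len({rank // gpus_per_node for rank in group}) == 1


groups = axis_groups(tp=8, pp=4, dp=16)
tp_on_nvlink = all(spans_one_node(g, 8) for g in groups["tp"])
pp_on_nvlink = any(spans_one_node(g, 8) for g in groups["pp"])
```

For the 512-GPU layout, `tp_on_nvlink` is true and `pp_on_nvlink` is false: every TP group fits inside a node, while every PP and DP group crosses node boundaries.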

The 70B-on-512-GPUs example:

           rank id = (t, p, d) ∈ [0,8) × [0,4) × [0,16)
                       ▲       ▲       ▲
                       │       │       │
                  TP=8 │  PP=4 │  DP=16
                       │       │       │
                  intra-node   across   across
                  (NVLink)     nodes    DCN
                               (IB)     (IB)

Memory accounting per rank under this layout: model state is divided by TP × PP = 32. For a 70B model in mixed precision, that's 1120 GB (16 bytes per parameter) / 32 = 35 GB of param/grad/opt state per rank. With FSDP-HSDP layered on top of DP (shard within a DP subgroup, replicate across subgroups), you can push that further if memory is still tight.
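The arithmetic above fits in a few lines. This is a rough sketch, not a precise memory model: the 16 bytes per parameter is inferred from 1120 GB / 70B, and its breakdown in the comment is an assumption about a typical mixed-precision Adam setup. The `fsdp_shard` parameter is a hypothetical knob for the HSDP shard-group size, and the function ignores activations, buffers, and uneven splits.

```python
# Assumed breakdown: bf16 weights (2) + bf16 grads (2)
# + fp32 master weights (4) + Adam first and second moments (4 + 4).
BYTES_PER_PARAM = 16


def model_state_gb_per_rank(params_b: float, tp: int, pp: int, fsdp_shard: int = 1) -> float:
    """Param/grad/optimizer state per rank in GB, assuming perfectly even sharding."""
    total_gb = params_b * BYTES_PER_PARAM  # billions of params * bytes/param = GB
    return total_gb / (tp * pp * fsdp_shard)


baseline = model_state_gb_per_rank(70, tp=8, pp=4)                  # 35.0
with_hsdp = model_state_gb_per_rank(70, tp=8, pp=4, fsdp_shard=8)   # 4.375
```

Sharding across an 8-way HSDP group inside the DP=16 axis would take the 35 GB figure down to about 4.4 GB per rank, at the price of per-layer AllGather and ReduceScatter traffic on that group.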

3D · the (TP, PP, DP) cube, isometric

Each rank in the 512-GPU example lives at exactly one lattice point (t, p, d). The cube below has TP along one axis, PP along another, DP along the third. Click a cell to see which model slice that GPU holds — which layers (from PP), which intra-layer feature shard (from TP), which data replica (from DP). Toggle "compress DP" to render the full 16-replica depth (small dots) or collapse to a representative slice for legibility.

(TP, PP, DP) cube · click a GPU

TP=8 (feature shards · blue axis) · PP=4 (pipeline stages · orange axis) · DP=16 (data replicas · green axis). Same color across DP = same model slice replicated. Same color across TP = different feature shards of the same layer.

[Interactive cube widget: controls for TP, PP, DP, rotation, and DP view; readouts for total GPUs, selected (t, p, d), model slice, and DP peers (same slice).]

What about activation memory?

Different parallelism strategies shard different parts of the activations:

TP shards the feature dim of matmul activations.
SP shards the LN/dropout activations along sequence.
CP shards the attention activations along sequence.
PP stores activations only for the layers in a rank's own stage, which acts as an implicit shard.

For long-context training, the layout often pivots to favor SP + CP + a smaller TP — because attention activation memory is what bites first.

Animated · one forward step through the 3D-parallel stack

Watch what one forward pass actually does in a TP × PP × DP layout. TP fires intra-stage AllReduces twice per layer (after attention and after MLP). PP sends activations to the next stage at every stage boundary. DP doesn't fire at all on the forward — it only kicks in during backward as a single gradient AllReduce. The bottom strip shows which fabric is "lit up" at each instant: NVLink for TP, IB for PP / DP.

Why not TP across nodes?

Recall the cost from lesson 06: TP fires 4 AllReduces per layer per step (2 forward, 2 backward), each ~B · T · d · 2 bytes. With 80 layers and ~270 MB per AllReduce, that is 80 × 4 = 320 AllReduces × 270 MB ≈ 86 GB of traffic per step. At NVLink 900 GB/s: ~95 ms of comm per step. At IB 50 GB/s: ~1.7 s. The same parallelism that's free intra-node is an ~18× slowdown inter-node. That's the rule. If your model needs TP > (GPUs per node), you have two bad options:

Live with the inter-node TP cost. Only viable if step time is dominated by some other resource (e.g. very compute-heavy MoE that hides the TP cost).
Switch to PP for that extra factor of sharding. Memory: same. Cost: pipeline bubble. Often the right choice.
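The NVLink-versus-IB comparison above is a bytes-over-bandwidth estimate, sketched below with the figures from the text (80 layers, 4 AllReduces per layer, 270 MB each). Like the text, it assumes the AllReduce cost is simply payload divided by link bandwidth, ignoring latency, the ring-algorithm traffic factor, and any overlap with compute. The function name is hypothetical.

```python
MB, GB = 1e6, 1e9


def tp_comm_seconds(layers: int, allreduces_per_layer: int,
                    bytes_per_allreduce: float, bandwidth_bytes_per_s: float) -> float:
    """Total TP communication time per step under a pure bandwidth model."""
    total_bytes = layers * allreduces_per_layer * bytes_per_allreduce
    return total_bytes / bandwidth_bytes_per_s


nvlink_s = tp_comm_seconds(80, 4, 270 * MB, 900 * GB)
ib_s = tp_comm_seconds(80, 4, 270 * MB, 50 * GB)

print(f"NVLink: {nvlink_s * 1e3:.0f} ms  IB: {ib_s:.2f} s  ratio: {ib_s / nvlink_s:.0f}x")
```

The ratio depends only on the two bandwidths (900 / 50 = 18), so it holds for any model size under this simplified cost model.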

Why not skip TP, just FSDP?

For a model where every layer fits on one GPU with FSDP+checkpointing, you should skip TP. FSDP is simpler (no per-block AllReduces, no GQA-vs-TP-size knot, no rank-aware kernels). The rule of thumb at this writing: pure FSDP-HSDP is fine up to ~30B params. Above that, the per-layer params (or activations under longer context) usually force TP into the mix.

A 4D example with EP

An illustrative MoE layout (DeepSeek-V3-style; 671B total, 37B active, 256 routed experts, 61 layers):

       (TP=1, EP=8, PP=16, DP=8) = 1024 GPUs · 128 nodes  [teaching layout]

       TP=1   per-expert MLP is small, don't fragment it further
       EP=8   shard 256 experts across 8 GPUs of one node (32 per GPU)
       PP=16  shard ~61 layers across 16 stages
       DP=8   replicate the (EP×PP) group 8× for batch throughput

(The actual published DeepSeek-V3 training uses a different shape — EP=64 spanning multiple nodes with an IB-aware AllToAll, plus DualPipe pipelining — but the structure above is the easier mental model for a first read.)
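The numbers in the teaching layout can be sanity-checked with a short sketch. The 8 GPUs per node follows from 1024 GPUs on 128 nodes. The way the 61 layers are spread over 16 stages is an assumption for illustration (earlier stages take the remainder); real runs often balance stages differently to account for the embedding and output layers.

```python
import math

layout = {"TP": 1, "EP": 8, "PP": 16, "DP": 8}  # teaching layout from the text
GPUS_PER_NODE = 8
ROUTED_EXPERTS = 256
LAYERS = 61

total_gpus = math.prod(layout.values())              # product of all axis sizes
nodes = total_gpus // GPUS_PER_NODE
experts_per_gpu = ROUTED_EXPERTS // layout["EP"]

# 61 layers do not divide evenly into 16 stages. Assumed split:
# the first `extra` stages hold one more layer than the rest.
base, extra = divmod(LAYERS, layout["PP"])
layers_per_stage = [base + 1 if s < extra else base for s in range(layout["PP"])]

print(total_gpus, nodes, experts_per_gpu, layers_per_stage)
```

With this split, 13 stages hold 4 layers and 3 stages hold 3, which is what "~61 layers across 16 stages" works out to.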

The hottest comm is the AllToAll on the EP axis — pinned to the fastest available fabric. The next-hottest is the gradient AllReduce on the DP axis — once per step, can tolerate IB. PP comm is point-to-point activations — IB-tolerant. TP is unused because each expert is small enough to fit on one GPU; larger-expert MoEs (Mixtral 8x22B, Grok) flip this and use TP inside each expert.

3D parallelism is not the same as 3D model

A confusing thing: "3D parallel" doesn't refer to the model having three dimensions; it refers to the layout having three orthogonal parallelism axes. We're already up to 5D (TP, PP, DP, EP, CP) for MoE long-context training, which is why modern frameworks sometimes call it "nD" parallelism. The principle hasn't changed: each axis is an independent sharding decision, and the ranks form a lattice.

Production frameworks

Megatron-LM (NVIDIA). Pioneered TP + PP + DP. Now supports SP, CP, EP. Reference implementation that most others learned from.
DeepSpeed (Microsoft). ZeRO + MoE machinery. Composable with Megatron via Megatron-DeepSpeed.
FSDP (PyTorch native). Production-ready FSDP-HSDP. Better ergonomics than the Megatron stack at the cost of fewer knobs.
NeMo (NVIDIA). Higher-level wrapper around Megatron with off-the-shelf recipes for 3D parallel + long context + MoE.
TorchTitan (Meta). Newer FSDP-first stack; targets the "I want all axes but in clean PyTorch" niche.

2D · decision tree for picking the layout

Below is the decision tree from the rules above, drawn out. Pick a model size and a cluster shape (GPUs per node × nodes); the tree walks itself, highlighting the leaf that matches and showing the rationale at each branch. Hover any node to see its check; click to "lock in" a path.

Interactive · layout designer

Pick a model and a cluster. The widget lets you set TP, PP, DP, FSDP-on-DP, EP. It checks TP · PP · DP = total GPUs, computes memory per rank and rough per-step comm load, and flags violations (TP > GPUs/node, FSDP across all DP at IB, etc.).
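A minimal sketch of the kind of checks described above is shown here. It is not the widget's actual logic: the function name, return shape, and messages are hypothetical, and it covers only the TP · PP · DP product and the TP-versus-node-size rule, not the memory or comm estimates.

```python
def check_layout(total_gpus: int, gpus_per_node: int, tp: int, pp: int):
    """Derive DP from the other axes and flag rule violations.

    Returns (dp, issues); dp is None when TP * PP does not divide the cluster.
    """
    issues = []
    if total_gpus % (tp * pp) != 0:
        issues.append(f"TP*PP = {tp * pp} does not divide {total_gpus} GPUs")
        return None, issues
    dp = total_gpus // (tp * pp)  # so that TP * PP * DP == total GPUs
    if tp > gpus_per_node:
        issues.append(f"TP={tp} exceeds {gpus_per_node} GPUs/node: TP AllReduces cross IB")
    elif gpus_per_node % tp != 0:
        issues.append(f"TP={tp} does not evenly divide a {gpus_per_node}-GPU node")
    return dp, issues


canonical = check_layout(total_gpus=512, gpus_per_node=8, tp=8, pp=4)
oversized_tp = check_layout(total_gpus=512, gpus_per_node=8, tp=16, pp=4)
```

The canonical 70B layout returns DP = 16 with no issues, while TP = 16 on 8-GPU nodes returns DP = 8 with one flagged violation.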

Layout designer · "does this fit?"

Memory and comm estimates are rough but directionally honest. Try the canonical 70B layout (TP=8, PP=4, DP=16, GPUs=512) and watch what happens when you push TP=16 (above intra-node limit) or move FSDP across all of DP.

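[Interactive layout-designer widget: inputs for params (B), total GPUs, GPUs/node, TP, PP, and FSDP-on-DP scope; readouts for derived DP, memory per rank, comm per step, and verdict. Defaults: 70B params, 512 GPUs, 8 GPUs/node, TP=8, PP=4.]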

Takeaway

Composition is about matching comm frequency to interconnect. The fastest fabric (NVLink) goes to the most chatty strategy (TP, then EP). The slower fabric (IB) accommodates DP (once per step) and PP (point-to-point, small). The whole 3D-or-nD layout is a single optimization: minimize total communication time per step (bytes per collective ÷ fabric bandwidth × collectives per step), subject to memory fitting. Get this right and the rest of the system tunes itself.