Expert parallel — MoE and AllToAll
Mixture-of-Experts replaces a single dense MLP with E experts and a router that picks k of them per token. Sharding the experts across ranks means tokens have to travel, twice per layer, and the dominant cost stops being matmul.
What an MoE layer is
In a dense transformer, every token passes through the same MLP weights. In an MoE layer, the MLP is replaced by:
- A router: a tiny linear that maps the token's hidden state to a score per expert.
- E expert MLPs, each typically the same size as the original dense MLP.
- A top-k selection: for each token, pick the top k experts by router score and route the token only to those. k is usually 1 (Switch Transformer) or 2 (Mixtral); larger models often pick more (DeepSeek-V3 uses k=8 routed experts plus shared experts).
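The selection step above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's router: the function names `softmax` and `route_top_k` are made up here, the logits are invented, and renormalising the selected gates so they sum to 1 is one common choice rather than something the text specifies.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return [(expert_index, gate_weight), ...] for one token, best expert first."""
    probs = softmax(router_logits)
    # Indices of the k largest router probabilities.
    top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    # Assumption: renormalise the k selected gates so they sum to 1.
    norm = sum(probs[e] for e in top)
    return [(e, probs[e] / norm) for e in top]

logits = [0.1, 2.0, -1.0, 1.5]      # hypothetical router output for E = 4
choice = route_top_k(logits, k=2)   # experts 1 and 3 are selected
```

The token's MoE output would then be the gate-weighted sum of the selected experts' outputs; the other E - k experts never see the token.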
Total parameters: the MLP block is E × larger than a dense MLP of the same shape. FLOPs per token: these scale with k, not E, because each token touches only k experts' worth of MLP weights (the same as dense when k = 1, roughly k × the dense MLP otherwise, plus a negligible router). So you get a model with much more "capacity" (more learnable parameters) at a per-token cost far below that of a dense model with the same parameter count. This is the main reason MoE is on the trajectory of every frontier lab: it lets capacity grow much faster than FLOPs.
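To make the parameter-versus-FLOPs argument concrete, here is a small counting sketch. It assumes a plain two-matrix MLP (up and down projection, no biases, no gating matrix), and the sizes `d_model = 4096`, `d_ff = 16384`, `E = 8`, `k = 2` are illustrative values chosen here, not figures from any specific model.

```python
def mlp_params(d_model, d_ff):
    # Two-matrix MLP: d_model -> d_ff -> d_model, biases ignored.
    return 2 * d_model * d_ff

def moe_layer_stats(d_model, d_ff, num_experts, k):
    """Return (total_params, active_params_per_token) for one MoE layer."""
    per_expert = mlp_params(d_model, d_ff)
    router = d_model * num_experts          # the "tiny linear" router
    total = num_experts * per_expert + router
    active = k * per_expert + router        # weights one token actually touches
    return total, active

dense = mlp_params(4096, 16384)
total, active = moe_layer_stats(4096, 16384, num_experts=8, k=2)
print(total / dense, active / dense)        # about 8x stored, about 2x touched
```

Per-token matmul FLOPs are proportional to the active count, so stored parameters grow with E while compute grows only with k.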
How sharding expert weights forces communication
The experts are huge: collectively, the MLP layer is now E × the dense size. Holding all E experts on every rank quickly becomes impractical. So we shard them: rank i holds expert i (or a small group of experts), and those parameters never leave its HBM.
But the tokens are not laid out by expert: every rank started with its own micro-batch (DP) or sequence shard (CP). Suppose a token's router output says "go to expert 3". Expert 3 lives on rank 3. So the token has to travel from its origin rank to the rank holding its assigned expert.
This is the dispatch AllToAll. Rank i packages up: "here's the chunk of tokens I have that need to go to rank 0, here's the chunk for rank 1, here's mine, …". After AllToAll, every rank holds the tokens routed to its expert from every origin rank. It runs the local expert MLP on those tokens. Then the combine AllToAll sends each token's output back to its origin rank so it can re-enter the residual stream.
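The dispatch, local compute, and combine steps can be simulated on a single process, treating AllToAll as a transpose of per-rank chunk lists. This is a toy sketch under simplifying assumptions: top-1 routing, one expert per rank, scalar "tokens", and hypothetical names (`all_to_all`, `moe_forward`, `expert_fn`). A real implementation would use a collective library rather than nested Python lists.

```python
def all_to_all(send):
    """send[src][dst] becomes recv[dst][src]: every rank gets one chunk from every rank."""
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]

def moe_forward(tokens_per_rank, assignment, expert_fn):
    n = len(tokens_per_rank)
    # Dispatch: each rank buckets its tokens by destination rank,
    # remembering the original position so results can be put back.
    send = [[[] for _ in range(n)] for _ in range(n)]
    for src, tokens in enumerate(tokens_per_rank):
        for pos, tok in enumerate(tokens):
            send[src][assignment[src][pos]].append((pos, tok))
    recv = all_to_all(send)
    # Local expert compute on everything that arrived at each rank.
    out = [[[(pos, expert_fn(rank, tok)) for pos, tok in chunk]
            for chunk in recv[rank]] for rank in range(n)]
    # Combine: the reverse AllToAll returns each result to its origin rank.
    back = all_to_all(out)
    result = [[None] * len(tokens) for tokens in tokens_per_rank]
    for origin in range(n):
        for chunk in back[origin]:
            for pos, val in chunk:
                result[origin][pos] = val
    return result

tokens = [[1.0, 2.0], [3.0, 4.0]]   # two ranks, two tokens each
assignment = [[1, 0], [1, 1]]       # chosen expert (= rank) per token
result = moe_forward(tokens, assignment, lambda rank, x: x * 10 + rank)
```

Note that the chunk sizes in `send` are data-dependent: here rank 1 receives three tokens and rank 0 only one, which is the imbalance discussed later.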
Two AllToAlls per MoE layer per forward pass. Two more in backward. Four per layer per step. Compare with TP (also four AllReduces per layer per step) — they are roughly the same frequency. But the collective is different.
Animated · the two AllToAlls, token by token
Below: N = 8 ranks (= 8 experts, one per rank). Every rank starts with its own batch of tokens, each token's router has already chosen k = 2 experts. Watch the dispatch AllToAll fly tokens from origin rank to expert-host rank, then the expert MLPs run, then the combine AllToAll flies the results back. The tokens are colored by origin rank, so you can see which rank they came from after the dispatch.
2D · routing matrix (tokens × experts)
Below is the heatmap that turns up when you log the router output during training. Rows = individual tokens, columns = experts. The intensity is the routing probability after top-k; only k cells per row are nonzero. The vertical red line per column is the capacity cap — anything above it gets dropped. Increase the capacity factor to relax the cap, increase the skew to see the bottleneck rank fill up first.
3D · experts as shards across ranks (isometric)
Each rank is a cube; on top sit its hosted experts (small flat slabs). The gold arrows show the dispatch AllToAll; the blue arrows show the combine AllToAll. The load bar under each rank is its expert's load relative to capacity — overflowing bars are clipped at the red line. Click a rank to highlight its hosted expert(s) and the routing arrows in and out.
Why AllToAll is the painful one
AllReduce is bandwidth-bound for large messages: with a ring algorithm, the data each rank moves approaches 2 × the message size no matter how many ranks participate. AllToAll is far more exposed to latency: each rank sends N - 1 distinct messages, one per other rank, each carrying a different chunk, and those chunks shrink as N grows. The per-message overhead (α in lesson 02) is paid N - 1 times per rank per call.
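A rough cost sketch makes the N - 1 latency terms visible. It assumes a simple latency-plus-bandwidth model in which a rank sends its N - 1 chunks one after another with no overlap; the values for α (5 µs) and link bandwidth (50 GB/s) are placeholders picked for illustration, not measurements, and `all_to_all_time` is a hypothetical helper.

```python
def all_to_all_time(n_ranks, bytes_per_rank, alpha_s, bw_bytes_per_s):
    """Return (latency_part, bandwidth_part) in seconds for one rank's sends."""
    messages = n_ranks - 1                  # one message per other rank
    chunk = bytes_per_rank / n_ranks        # equal split; real routing is uneven
    latency = messages * alpha_s            # alpha is paid once per message
    bandwidth = messages * chunk / bw_bytes_per_s
    return latency, bandwidth

for n in (8, 64):
    lat, bw = all_to_all_time(n, bytes_per_rank=8e6, alpha_s=5e-6, bw_bytes_per_s=50e9)
    print(n, f"latency share = {lat / (lat + bw):.2f}")
```

With the payload per rank held fixed, the bandwidth term barely moves while the latency term grows linearly in N, so the latency share rises as the EP group gets wider.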
Worse: the chunk sizes are unequal because the router is unpredictable. If 80% of tokens this batch chose expert 3, then with E = 8 and k = 1 rank 3 receives 6.4× the average workload (80% against an average share of 12.5%). The slowest rank determines the AllToAll completion time. This is the load imbalance problem and it has its own machinery (below).
Bandwidth math: each token is d elements × 2 bytes (bf16) and gets sent twice per layer (dispatch + combine, ignoring backward). For B · T = 16384 tokens and d = 4096, that's ~134 MB per AllToAll (taking k = 1; top-k routing multiplies this by k). Twice per layer × 32 layers = ~8.6 GB of AllToAll traffic per forward, on top of the inter-rank routing latency. On NVLink (assuming ~900 GB/s): ~10 ms. On IB (assuming ~50 GB/s): ~170 ms. EP is intra-node, with rare exceptions.
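The arithmetic above is easy to reproduce. In this sketch the link rates (900 GB/s for NVLink, 50 GB/s for IB) are assumptions chosen because they match the quoted times, k = 1 is assumed, and the whole volume is treated as flowing through a single link, which is a deliberate simplification.

```python
def all_to_all_bytes(num_tokens, d_model, bytes_per_elem=2, k=1):
    # Every routed copy of a token carries d_model elements.
    return num_tokens * k * d_model * bytes_per_elem

per_call = all_to_all_bytes(16384, 4096)   # one dispatch or one combine
per_forward = per_call * 2 * 32            # dispatch + combine, 32 layers

nvlink_ms = per_forward / 900e9 * 1e3      # assumed ~900 GB/s
ib_ms = per_forward / 50e9 * 1e3           # assumed ~50 GB/s

print(f"{per_call / 1e6:.0f} MB per call, {per_forward / 1e9:.2f} GB per forward")
print(f"NVLink ~{nvlink_ms:.1f} ms, IB ~{ib_ms:.0f} ms")
```

Passing `k=2` doubles every figure, which is worth keeping in mind when comparing top-1 and top-2 models.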
Load balancing — a loss term, not a flag
If the router learns to send 90% of tokens to one expert (a common failure mode early in training), that one rank is bottleneck while the others sit idle. Two mechanisms keep things balanced:
- Auxiliary loss. Add a term to the training loss that penalises imbalanced routing. The standard form (Fedus et al. 2021; Shazeer et al. 2017 used an earlier variant): for each expert e, let f_e = fraction of tokens routed to e, and p_e = mean router probability assigned to e. The aux loss is α · E · Σ f_e · p_e, where α here is a small loss coefficient (not the latency α above). It is minimised at uniform routing, where it equals α.
- Expert capacity. Cap each expert at C tokens per batch. Overflow tokens skip the expert and pass through on the residual connection only. C = (B · T · k / E) · capacity_factor where capacity_factor ∈ [1.0, 2.0]. Higher capacity factor means fewer drops but more wasted FLOPs (since most experts won't fill their slots).
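The auxiliary loss described in the list above can be computed directly from its definition. This is a minimal sketch assuming top-1 routing; `aux_loss` is a hypothetical helper, the coefficient default of 0.01 is just an illustrative value, and the two routing scenarios are invented.

```python
def aux_loss(assignments, router_probs, num_experts, alpha=0.01):
    """assignments[t] = expert chosen for token t (top-1).
    router_probs[t][e] = router softmax probability of expert e for token t."""
    n = len(assignments)
    # f_e: fraction of tokens actually routed to expert e.
    f = [sum(1 for a in assignments if a == e) / n for e in range(num_experts)]
    # p_e: mean router probability assigned to expert e.
    p = [sum(row[e] for row in router_probs) / n for e in range(num_experts)]
    return alpha * num_experts * sum(fe * pe for fe, pe in zip(f, p))

# Balanced routing: each of 4 tokens picks a different expert, flat probabilities.
uniform = aux_loss([0, 1, 2, 3], [[0.25] * 4] * 4, num_experts=4)

# Collapsed routing: every token picks expert 0 with high confidence.
collapsed = aux_loss([0, 0, 0, 0], [[0.97, 0.01, 0.01, 0.01]] * 4, num_experts=4)

print(uniform, collapsed)   # the collapsed case is several times larger
```

Only p_e carries gradient in practice, since f_e comes from a hard argmax; the product still pushes probability mass away from overloaded experts.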
Tradeoff: a capacity factor of 1.0 gives the tightest buffers and the highest drop rate (some loss in quality); 1.25–1.5 is typical. Mixtral and most earlier MoEs use the standard aux-loss recipe; DeepSeek-V3 introduced an auxiliary-loss-free approach (per-expert bias terms that adjust during training), a modern variant that avoids the aux-loss term's mild quality drag.
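The capacity formula and the drop-rate tradeoff can be checked with a short sketch. The per-expert loads below are an invented skewed example, the helper names are hypothetical, and rounding the capacity up with `ceil` is a choice made here; implementations differ on rounding.

```python
import math

def expert_capacity(num_tokens, k, num_experts, capacity_factor):
    # C = (B*T*k / E) * capacity_factor, rounded up (an assumption).
    return math.ceil(num_tokens * k / num_experts * capacity_factor)

def drop_rate(loads, capacity):
    """loads[e] = token slots routed to expert e before the cap is applied."""
    dropped = sum(max(0, load - capacity) for load in loads)
    return dropped / sum(loads)

loads = [40, 10, 10, 4]                   # hypothetical: 64 tokens, E = 4, k = 1
tight = expert_capacity(64, 1, 4, 1.0)    # 16 slots per expert
loose = expert_capacity(64, 1, 4, 2.0)    # 32 slots per expert

print(drop_rate(loads, tight), drop_rate(loads, loose))
```

Doubling the capacity factor cuts drops from 37.5% to 12.5% in this example, but the three lightly loaded experts now leave most of their 32 slots empty.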
Composing EP with the rest
EP shards the experts' parameters; it does not by itself dictate how the rest of the model is parallelised. In a typical layout EP is one more dimension of the device mesh, applied around the experts, like TP. Common compositions:
- EP = N_experts (one expert per rank). Pure expert parallel. AllToAlls span all ranks.
- EP × DP. Replicate the (EP, experts) group across DP replicas; each replica has its own AllToAll cluster. Standard.
- EP × TP. Inside an expert, also shard along feature dim. Useful when each expert is itself huge. Adds an AllReduce inside the expert MLP, on top of the AllToAlls.
EP is communication-heavy like TP, so pin it to NVLink. If you have 8 experts and an 8-GPU node, EP fits naturally on one node.
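For the EP × DP composition, the rank-to-group mapping can be sketched as below. It assumes contiguous global ranks share a node, so that each EP group (and therefore each AllToAll) stays on NVLink; `ep_groups` and `dp_groups` are hypothetical helpers, not functions from any framework.

```python
def ep_groups(world_size, ep_size):
    """Contiguous ranks form one EP group; each group is one DP replica of the experts."""
    assert world_size % ep_size == 0
    return [list(range(start, start + ep_size))
            for start in range(0, world_size, ep_size)]

def dp_groups(world_size, ep_size):
    """Ranks that hold the same expert shard in different replicas."""
    assert world_size % ep_size == 0
    return [list(range(offset, world_size, ep_size))
            for offset in range(ep_size)]

# 16 GPUs, 8 experts, one expert per rank: two EP groups of 8.
print(ep_groups(16, 8))   # AllToAll runs inside each of these
print(dp_groups(16, 8))   # expert gradients are reduced across each of these
```

With this layout, ranks 0 and 8 both host expert 0, so their expert gradients are averaged across the DP group while token traffic never leaves the 8-rank EP group.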
The serving cost (because this is also an inference question)
At serve time, an MoE model has E × the MLP weights to hold in HBM (every expert must be loaded even though only k are used per token). For DeepSeek-V3 (671B total params, 37B active per token), serving requires holding all 671B parameters in HBM somewhere, which in practice means every model replica is itself sharded across many GPUs. The compute per token is small (37B active), but the HBM cost is large. Inference for MoE wants:
- EP at serving time too, to spread the HBM cost across GPUs.
- Token batching across all experts so each expert's matmul has enough work to be compute-bound.
- Expert-aware scheduling: try to keep batches' expert routing diverse, so all ranks have work each forward.
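A quick weights-only estimate shows why EP is needed at serving time. The 80 GB of HBM per GPU and the bytes-per-parameter choices are assumptions made for this sketch, and it ignores KV cache, activations, and any runtime overhead, so real deployments need more GPUs than this lower bound.

```python
import math

def min_gpus_for_weights(total_params, bytes_per_param, hbm_bytes_per_gpu):
    # Lower bound: weights only, perfectly packed.
    return math.ceil(total_params * bytes_per_param / hbm_bytes_per_gpu)

total_params = 671e9     # total parameters, from the text
active_params = 37e9     # active per token, from the text
hbm = 80e9               # assumed 80 GB per GPU

gpus_1_byte = min_gpus_for_weights(total_params, 1, hbm)   # e.g. 8-bit weights
gpus_2_byte = min_gpus_for_weights(total_params, 2, hbm)   # e.g. bf16 weights

print(gpus_1_byte, gpus_2_byte)
print(f"active fraction = {active_params / total_params:.3f}")
```

Only about 5.5% of the stored parameters do work for any given token, which is why batching enough tokens per expert matters so much for utilisation.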
Interactive · routing imbalance and capacity
Set E, k, capacity factor, and a "routing skew" parameter (0 = uniform, 1 = all tokens to one expert). Watch the per-expert load bars and the drop rate. The takeaway: even modest skew makes the slowest rank the bottleneck, and the capacity-factor / aux-loss tradeoff is visible.