Expert parallel — MoE and AllToAll

Mixture-of-Experts replaces a single dense MLP with E experts and a router that picks k of them per token. Sharding the experts across ranks means tokens have to travel, twice per layer, and the dominant cost stops being matmul.

What an MoE layer is

In a dense transformer, every token passes through the same MLP weights. In an MoE layer, the MLP is replaced by:

A router: a tiny linear that maps the token's hidden state to a score per expert.
E expert MLPs, each typically the same size as the original dense MLP.
A top-k selection: for each token, pick the top k experts by router score and route the token only to those. k is usually 1 (Switch Transformer) or 2 (Mixtral); larger models often pick more (DeepSeek-V3 uses k=8 routed experts plus shared experts).

Total parameters: E × larger than a dense model of the same MLP shape. Total FLOPs per token: unchanged — each token still touches only k experts' worth of MLP weights. So you get a model with much more "capacity" (more learnable parameters) at the same inference cost per token. This is the whole reason MoE is on the trajectory of every frontier lab: it lets capacity grow without FLOPs growing.

How sharding expert weights forces communication

The experts are huge — collectively, the MLP layer is now E × the size. Holding all E experts on every rank is impossible. So we shard them: rank i holds expert i (or a small group of experts), the parameters for which never leave its HBM.

But each rank's input tokens are distributed differently: every rank started with its own micro-batch (DP) or sequence shard (CP). Each token's router output says "go to expert 3". Expert 3 lives on rank 3. So the token has to travel from its origin rank to the rank holding its assigned expert.

This is the dispatch AllToAll. Rank i packages up: "here's the chunk of tokens I have that need to go to rank 0, here's the chunk for rank 1, here's mine, …". After AllToAll, every rank holds the tokens routed to its expert from every origin rank. It runs the local expert MLP on those tokens. Then the combine AllToAll sends each token's output back to its origin rank so it can re-enter the residual stream.

Two AllToAlls per MoE layer per forward pass. Two more in backward. Four per layer per step. Compare with TP (also four AllReduces per layer per step) — they are roughly the same frequency. But the collective is different.

Animated · the two AllToAlls, token by token

Below: N = 8 ranks (= 8 experts, one per rank). Every rank starts with its own batch of tokens, each token's router has already chosen k = 2 experts. Watch the dispatch AllToAll fly tokens from origin rank to expert-host rank, then the expert MLPs run, then the combine AllToAll flies the results back. The tokens are colored by origin rank, so you can see which rank they came from after the dispatch.

2D · routing matrix (tokens × experts)

Below is the heatmap that turns up when you log the router output during training. Rows = individual tokens, columns = experts. The intensity is the routing probability after top-k; only k cells per row are nonzero. The vertical red line per column is the capacity cap — anything above it gets dropped. Increase the capacity factor to relax the cap, increase the skew to see the bottleneck rank fill up first.

3D · experts as shards across ranks (isometric)

Each rank is a cube; on top sit its hosted experts (small flat slabs). The gold arrows show the dispatch AllToAll; the blue arrows show the combine AllToAll. The load bar under each rank is its expert's load relative to capacity — overflowing bars are clipped at the red line. Click a rank to highlight its hosted expert(s) and the routing arrows in and out.

Why AllToAll is the painful one

AllReduce is bandwidth-bound; for large messages it asymptotes to a constant cost. AllToAll is latency-bound: each rank sends N - 1 distinct messages, one per other rank, each carrying a different chunk. The per-message overhead (α in lesson 02) is paid N - 1 times per rank per call.

Worse: the chunk sizes are unequal because the router is unpredictable. If 80% of tokens this batch chose expert 3, rank 3 receives 4× the workload of average. The slowest rank determines the AllToAll completion time. This is the load imbalance problem and it has its own machinery (below).

Bandwidth math: each token of size d bytes × 2 (bf16) gets sent twice per layer (dispatch + combine, ignoring backward). For B · T = 16384 tokens, d = 4096, that's ~134 MB per AllToAll. Twice per layer × 32 layers = ~8.5 GB of AllToAll traffic per forward, on top of the inter-rank routing latency. On NVLink: ~10 ms. On IB: ~170 ms. The ~17× gap is why the default instinct is to keep EP intra-node — but it is a preference, not a hard rule (see below).

Load balancing — a loss term, not a flag

If the router learns to send 90% of tokens to one expert (a common failure mode early in training), that one rank is bottleneck while the others sit idle. Two mechanisms keep things balanced:

Auxiliary loss. Add a term to the training loss that penalises imbalanced routing. The standard form (Shazeer et al. 2017, Fedus et al. 2021): for each expert e, let f_e = fraction of tokens routed to e, and p_e = mean router probability assigned to e. The aux loss is α · E · Σ f_e · p_e. Minimised at uniform routing.
Expert capacity. Cap each expert at C tokens per batch. Overflow tokens are dropped from the expert and processed by the residual only. C = (B · T · k / E) · capacity_factor where capacity_factor ∈ [1.0, 2.0]. Higher capacity factor means fewer drops but more wasted FLOPs (since most experts won't fill their slot).

Tradeoff: capacity factor 1.0 → tight pipelines, max drop rate (some loss in quality). 1.25–1.5 typical. Mixtral and most earlier MoEs use the standard aux-loss recipe; DeepSeek-V3 introduced an auxiliary-loss-free approach (per-expert bias terms that adjust during training) — a modern variant that avoids the aux-loss term's mild quality drag.

Modern caveat — "wasted FLOPs from capacity" is worst-case, not inherent

The capacity-factor picture above assumes each expert runs a fixed-size matmul and you pad to it, so the padding above the true token count is wasted compute. That is the pre-Megablocks framing. Modern MoE kernels are dropless: a grouped-GEMM (Megablocks-style block-sparse matmul) batches the variable per-expert token counts into a single kernel without padding every expert to a fixed capacity. So you neither drop tokens nor pad — the "wasted FLOPs" are mostly an artifact of the older fixed-capacity execution, not a law of MoE. Capacity factor is still the right mental model for the imbalance/quality tradeoff, but on a grouped-GEMM path it no longer directly buys you idle compute.

Composing EP with the rest

EP shards the experts' parameters; nothing forces it to also shard the per-token compute. In a typical layout EP is the dimension around the experts, like TP. Common compositions:

EP = N_experts (one expert per rank). Pure expert parallel. AllToAlls span all ranks.
EP × DP. Replicate the (EP, experts) group across DP replicas; each replica has its own AllToAll cluster. Standard.
EP × TP. Inside an expert, also shard along feature dim. Useful when each expert is itself huge. Adds an AllReduce inside the expert MLP, on top of the AllToAlls.

EP is bandwidth-hungry like TP, so the cheap default is to pin it to NVLink: if you have 8 experts and an 8-GPU node, EP fits naturally on one node. But this is no longer a hard ceiling. DeepSeek-V3 has 256 routed experts and deliberately runs EP across nodes (EP of 64+) over InfiniBand, hiding the cross-node AllToAll behind compute with a custom dispatch/combine kernel library (DeepEP) and a communication–computation overlap schedule. The intra-node intuition still tells you where the cheap bandwidth is; it does not mean cross-node EP is off the table once the expert count makes it necessary.

The serving cost (because this is also an inference question)

At serve time, an MoE model has E × the weights to hold in HBM (every expert must be loaded even though only k are used per token). For DeepSeek-V3 (671B total params, 37B active per token), serving requires holding all 671B in HBM somewhere — which means multiple replicas of a many-GPU shard. The "effective compute" per token is small (37B active), but the HBM cost is large. Inference for MoE wants:

EP at serving time too, to spread the HBM cost across GPUs.
Token batching across all experts so each expert's matmul has enough work to be compute-bound.
Expert-aware scheduling: try to keep batches' expert routing diverse, so all ranks have work each forward.

Interactive · routing imbalance and capacity

Set E, k, capacity factor, and a "routing skew" parameter (0 = uniform, 1 = all tokens to one expert). Watch the per-expert load bars and the drop rate. The takeaway: even modest skew makes the slowest rank the bottleneck, and the capacity-factor / aux-loss tradeoff is visible.

Expert-choice routing

Everything above assumes token-choice routing: each token picks its top-k experts. That is the source of every headache in this lesson — because tokens choose independently, nothing stops them piling onto one expert, so you need a capacity factor, a load-balance aux loss, and a drop policy for overflow.

Expert-choice routing (Zhou et al. 2022) inverts the selection. Instead of tokens picking experts, each expert picks its top-C tokens from the batch by router score, where C = B · T · k / E is the per-expert budget. The consequences flip every problem:

Perfect load balance by construction — every expert takes exactly C tokens, so no rank is ever the straggler in the AllToAll. The imbalance factor is 1.0 by definition.
Zero drops — there is no overflow, because capacity is the selection mechanism, not a cap bolted on after.
No aux loss — balance isn't something the router has to be coaxed toward; it's structural.

The cost is on the token side. Because experts choose, a given token may be picked by zero experts (no expert wanted it — it passes through on the residual only) or by many (variable compute per token). And critically, an expert's choice depends on the whole batch at once — the selection is non-causal. That's fine in training, where the full sequence is present, but it breaks autoregressive decode, where token t+1 doesn't exist yet when you route token t. So expert-choice is a training-time technique; serving still runs token-choice.

	Token-choice	Expert-choice
Who selects	each token picks top-k experts	each expert picks top-C tokens
Load balance	needs aux loss + capacity	perfect by construction
Token drops	overflow tokens dropped	none
Aux loss	required (or aux-free bias)	not needed
Compute per token	fixed (k experts)	variable (0..many experts)
Causal?	yes — usable at decode	no — training only

Practical caveat — MoE is bandwidth-hungry but not always feasible

AllToAll cost scales with N - 1 messages per call, so doubling EP doubles the latency floor. For modest expert counts the latency tax means there is little reason to push EP past one node. But "frontier MoE never goes wide" is no longer true: DeepSeek-V3's 256 experts run at EP of 64+ spanning nodes, and only stay efficient because a custom kernel library (DeepEP) overlaps the cross-node AllToAll with compute. The lesson is that very high EP is a kernel-engineering problem, not a hard wall — but you still don't get it for free with stock NCCL collectives.

Takeaway

MoE moves the parallelism cost from "shard weights" to "shard tokens": every layer, every token has to find its expert (dispatch) and come back (combine). Cost is dominated by AllToAll latency, not bandwidth — so EP wants the lowest-latency interconnect (NVLink), and load imbalance is a real correctness problem solved at the loss-function level, not the schedule level.