Mixture-of-Experts — sparse activation
Replace one big FFN with N small FFNs and a router that picks 2 of them per token. Total parameters scale; active parameters don't. The result: the quality of a much larger dense model at the per-token compute of a much smaller one. The cost is memory and AllToAll traffic on every MoE layer.
The motivating arithmetic
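From lesson 08: training FLOPs = 6 · N · D. Inference FLOPs per token = 2N. Both scale linearly in N, so a bigger model means more compute. The MoE trick: not all of N is "active" per token. If only K of E experts run per token, active parameters are roughly N_experts · (K/E) + N_shared, where N_experts is the parameter count summed over all experts and N_shared is everything every token uses (attention, embeddings, router).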
| Model | Total params | Active params per token | Inference compute per token (relative to 70B dense) |
|---|---|---|---|
| LLaMA-2 70B (dense) | 70B | 70B | 1.0 (baseline) |
| Mixtral 8×7B | 47B | ~13B | ~0.2 (≈5× cheaper at inference) |
| DBRX 16×12B | 132B | 36B | ~0.5 |
| DeepSeek-V3 | 671B | 37B | ~0.5 |
The bet: a 671B-parameter model with 37B active will outperform a 37B dense model (more total capacity for memorisation) while costing about the same compute per token at inference. This is true empirically. The cost shifts from compute to memory and communication.
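The arithmetic behind the table can be reproduced with a few lines. This is a minimal sketch that only applies the 2N and 6 · N · D rules of thumb to the active parameter count; the function names are illustrative, and the parameter figures are the rounded ones from the table, not exact model sizes.

```python
def forward_flops_per_token(active_params: float) -> float:
    """Approximate inference FLOPs per token: 2 * active parameters."""
    return 2.0 * active_params


def training_flops(active_params: float, tokens: float) -> float:
    """Approximate training FLOPs: 6 * active parameters * training tokens."""
    return 6.0 * active_params * tokens


def relative_cost(active_params: float, baseline_params: float = 70e9) -> float:
    """Per-token compute relative to a dense baseline (default: 70B dense)."""
    return active_params / baseline_params


# (total params, active params per token), rounded values from the table above.
MODELS = {
    "LLaMA-2 70B (dense)": (70e9, 70e9),
    "Mixtral 8x7B": (47e9, 13e9),
    "DBRX": (132e9, 36e9),
    "DeepSeek-V3": (671e9, 37e9),
}

for name, (total, active) in MODELS.items():
    gflops = forward_flops_per_token(active) / 1e9
    print(f"{name:<20} total={total / 1e9:>4.0f}B  active={active / 1e9:>3.0f}B  "
          f"GFLOPs/token={gflops:>4.0f}  relative={relative_cost(active):.2f}")
```

Total parameters never enter the FLOP estimate; they only show up later, in the memory budget.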
The MoE block — replace the FFN
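In a standard transformer, each layer applies the same FFN to every token. In MoE, the FFN is split into N parallel experts plus a router: the router scores every expert for each token, only the top-k experts run on that token, and their outputs are summed, weighted by the router scores.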
Each expert is a regular FFN (or SwiGLU). The router is a tiny linear layer. For Mixtral-8×7B: 8 experts of ~5.5B params each (summed over the MoE layers), the router is ~33K params per layer (4096 × 8, negligible), top-k = 2 → ~13B params active per token.
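A toy version of the block for a single token may make the data flow concrete. This is a sketch under stated assumptions: the experts are stand-in functions rather than real FFNs, the router is a bias-free linear layer, and the gate weights are a softmax over the selected top-k logits only (one common convention, not the only one). All names are illustrative.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x: Vector,
                router_weights: List[Vector],
                experts: List[Callable[[Vector], Vector]],
                k: int = 2) -> Vector:
    """Route one token through its top-k experts and mix their outputs."""
    # Router: one logit per expert (dot product of the token with that expert's row).
    logits = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in router_weights]
    # Keep the k highest-scoring experts; the others never run for this token.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Assumption: gate weights are renormalised over the selected experts only.
    gates = softmax([logits[i] for i in top])
    out = [0.0] * len(x)
    for gate, i in zip(gates, top):
        y = experts[i](x)
        out = [o + gate * y_j for o, y_j in zip(out, y)]
    return out


# Toy experts: expert i simply scales its input by (i + 1).
experts = [lambda v, s=s: [s * t for t in v] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [0.0, 0.0]]
y = moe_forward([1.0, 0.0], router_weights, experts, k=2)
print(y)  # experts 2 and 1 are selected; experts 0 and 3 are skipped
```

Only the two selected experts are evaluated, which is where the compute saving comes from.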
Load balancing — why this is hard
If the router learns to always pick the same 2 experts, you have 2 experts that contribute and 6 that are dead weight. Load imbalance is the central engineering challenge.
The standard fix: add an auxiliary loss that penalises imbalance:
L_balance = α · N · Σ_i f_i · P_i
where N is the number of experts, f_i is the fraction of tokens routed to expert i, and P_i is the mean router probability assigned to expert i. This loss is minimised when both f and P are uniform; it penalises both "the router strongly prefers a few experts" and "the assignments concentrate on few experts". Typical α = 0.01.
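The auxiliary loss, L_balance = α · N · Σ f_i · P_i, can be computed directly from the routing decisions and router probabilities of a batch. The sketch below is illustrative only: it normalises f_i by the total number of (token, expert) assignments so that f sums to 1 for any top-k, which is an assumption, and the function name is hypothetical.

```python
from typing import List, Sequence


def load_balance_loss(assignments: Sequence[Sequence[int]],
                      router_probs: Sequence[Sequence[float]],
                      num_experts: int,
                      alpha: float = 0.01) -> float:
    """alpha * N * sum_i f_i * P_i for one batch.

    assignments[t]  : expert ids chosen for token t (length k)
    router_probs[t] : full softmax over experts for token t
    """
    num_tokens = len(assignments)
    total_slots = sum(len(chosen) for chosen in assignments)
    counts: List[int] = [0] * num_experts
    for chosen in assignments:
        for e in chosen:
            counts[e] += 1
    # f_i: share of routed assignments that landed on expert i (assumed normalisation).
    f = [c / total_slots for c in counts]
    # P_i: mean router probability given to expert i across the batch.
    P = [sum(p[i] for p in router_probs) / num_tokens for i in range(num_experts)]
    return alpha * num_experts * sum(f_i * p_i for f_i, p_i in zip(f, P))


# Balanced routing: four tokens, four experts, one token each, uniform probabilities.
balanced = load_balance_loss([[0], [1], [2], [3]], [[0.25] * 4] * 4, num_experts=4)

# Collapsed routing: every token goes to expert 0 with high confidence.
collapsed = load_balance_loss([[0]] * 4, [[0.97, 0.01, 0.01, 0.01]] * 4, num_experts=4)

print(balanced, collapsed)
```

With uniform routing the loss equals α regardless of the expert count, which is the role of the factor N; the collapsed case comes out close to N times larger.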
Variants:
- Switch Transformer (Fedus et al. 2021): top-1 routing, simple aux loss.
- GShard / Mixtral: top-2 routing, aux loss above.
- Expert Choice (Zhou et al. 2022): the experts pick the top tokens instead of tokens picking the experts. Guarantees balance without aux loss; trickier with variable-length sequences.
- DeepSeek-MoE: top-k with N "shared" experts always active + K "routed" experts. Stabilises training.
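To see why Expert Choice balances load by construction, here is a minimal sketch of the selection step. The score matrix and the per-expert capacity are made-up toy values, and the function name is hypothetical; real implementations operate on batched tensors.

```python
from typing import List, Sequence


def expert_choice_route(scores: Sequence[Sequence[float]], capacity: int) -> List[List[int]]:
    """Each expert picks its `capacity` highest-scoring tokens.

    scores[t][e] is the router affinity of token t for expert e.
    Returns picks[e] = sorted token indices chosen by expert e.
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])
    picks = []
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        picks.append(sorted(ranked[:capacity]))
    return picks


# Every token prefers expert 0, so token-choice top-1 would overload it.
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
picks = expert_choice_route(scores, capacity=2)
print(picks)  # each expert ends up with exactly two tokens
```

The trade-off is visible in the sketch: every expert is exactly full, but nothing guarantees that a given token is picked by any expert, or by only one.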
Capacity factor and token dropping
In a batch, you route every token to its top-k experts. Some experts get assigned many tokens; some get few. To make the implementation fixed-shape (for GPU efficiency), you pre-allocate a fixed capacity per expert, typically capacity = capacity_factor · (tokens_in_batch · k) / N_experts.
If more tokens want to go to an expert than fit, the excess is dropped: their output is set to 0 (or routed to the residual without expert contribution). Capacity factor 1.0 = no slack; 1.25 means 25% extra buffer.
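Capacity and dropping can be illustrated with a small simulation. The sketch assumes the common convention capacity = ceil(capacity_factor · tokens · k / experts) and a first-come, first-served fill order; both are assumptions for illustration, and the helper names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def expert_capacity(num_tokens: int, num_experts: int, k: int, capacity_factor: float) -> int:
    """Slots pre-allocated per expert (assumed convention, rounded up)."""
    return math.ceil(capacity_factor * num_tokens * k / num_experts)


def dispatch_with_capacity(assignments: Sequence[Sequence[int]],
                           num_experts: int,
                           capacity: int) -> Tuple[List[List[int]], List[Tuple[int, int]]]:
    """Fill each expert in token order; overflow assignments are dropped."""
    kept: List[List[int]] = [[] for _ in range(num_experts)]
    dropped: List[Tuple[int, int]] = []
    for token, chosen in enumerate(assignments):
        for e in chosen:
            if len(kept[e]) < capacity:
                kept[e].append(token)
            else:
                dropped.append((token, e))  # this token gets no output from expert e
    return kept, dropped


# 8 tokens, 4 experts, top-1 routing, with expert 0 oversubscribed.
assignments = [[0], [0], [0], [1], [1], [2], [3], [0]]
for cf in (1.0, 1.25, 2.0):
    cap = expert_capacity(len(assignments), 4, k=1, capacity_factor=cf)
    kept, dropped = dispatch_with_capacity(assignments, 4, cap)
    print(f"capacity_factor={cf}: capacity={cap}, dropped={dropped}")
```

Raising the capacity factor trades padded, partly empty expert buffers for fewer dropped tokens.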
| Capacity factor | Throughput | Token drop rate (with aux loss) | Quality |
|---|---|---|---|
| 1.0 | Fastest | Low (~1-5%) if balanced; much higher if router collapses | Near-baseline when well-balanced; degrades fast if not |
| 1.25 | ~80% of 1.0 | Very low (~<1%) | Baseline |
| 2.0 | ~50% of 1.0 | ~0% | Baseline |
A common training setting is a capacity factor of about 1.25 with a strong load-balancing loss. Inference can often use ~1.0 because the load balance learned at training generalises.
Expert parallelism — an AllToAll on every layer
If you have 8 experts and 8 GPUs, distribute one expert per GPU. The forward pass at each MoE layer becomes:
- AllToAll #1: each GPU sends each token to the GPU that holds its target expert. Tokens shuffle across the network.
- Expert compute: each GPU runs its expert on its assigned tokens.
- AllToAll #2: send tokens back to their original GPU for the next layer.
The dominant cost is the AllToAll communication. For 8 experts within a single node connected via NVLink (~900 GB/s), this is feasible. Across multiple nodes (InfiniBand at 400 Gb/s, roughly 50 GB/s per link), it becomes a significant overhead.
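The three steps above can be mimicked in a single process to show the bookkeeping involved. This is an illustrative simulation only: `all_to_all` here is a plain Python stand-in for the collective, routing is top-1, expert i lives on rank i, and tokens are scalars. None of it is a real distributed API.

```python
from typing import Callable, List


def all_to_all(send: List[List[list]]) -> List[List[list]]:
    """In-process stand-in for the collective: recv[dst][src] = send[src][dst]."""
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]


def expert_parallel_layer(tokens_per_rank: List[List[float]],
                          expert_of: List[List[int]],
                          experts: List[Callable[[float], float]]) -> List[List[float]]:
    """One MoE layer with one expert per rank and top-1 routing."""
    n = len(tokens_per_rank)
    # AllToAll #1: bucket each local token by destination rank, tagged with its local index.
    send = [[[] for _ in range(n)] for _ in range(n)]
    for rank, tokens in enumerate(tokens_per_rank):
        for i, x in enumerate(tokens):
            send[rank][expert_of[rank][i]].append((i, x))
    recv = all_to_all(send)
    # Expert compute: rank e applies its expert to everything it received.
    processed = [[[(i, experts[e](x)) for i, x in bucket] for bucket in recv[e]]
                 for e in range(n)]
    # AllToAll #2: send results back and restore the original token order.
    back = all_to_all(processed)
    out = [[0.0] * len(tokens) for tokens in tokens_per_rank]
    for rank in range(n):
        for bucket in back[rank]:
            for i, y in bucket:
                out[rank][i] = y
    return out


experts = [lambda x: x + 100, lambda x: x * 2]
tokens = [[1, 2], [3, 4]]        # rank 0 holds tokens 1, 2; rank 1 holds 3, 4
routing = [[0, 1], [1, 0]]       # expert chosen for each token
result = expert_parallel_layer(tokens, routing, experts)
print(result)
```

The per-token index carried through both exchanges is what lets each rank put results back in the order the next layer expects.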
Memory cost — the trade-off
| Component | Dense 70B | MoE 8×7B (Mixtral) |
|---|---|---|
| Params (bf16) | 140 GB | 94 GB |
| Training state (weights + gradients + AdamW, ~16 bytes/param) | 1.12 TB | 752 GB |
| Active params for forward FLOPs | 70B | ~13B |
| Inference forward FLOPs / token | 140 GFLOPs | 26 GFLOPs |
| KV cache | same per layer | same per layer |
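So MoE saves compute (active params × tokens) but pays for total parameters in memory: Mixtral holds 47B parameters to do the compute of a ~13B model, so weights and optimizer state are several times larger than for a dense model of equal active size. At inference, you still need to hold all experts in GPU memory (or stream them) because any token might route to any expert.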
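The memory rows in the table follow from bytes per parameter. A quick check, assuming 2 bytes per parameter for bf16 weights and roughly 16 bytes per parameter for mixed-precision AdamW training (bf16 weights and gradients plus fp32 master weights and two fp32 moments); the 16-byte figure is an assumption that matches the table's numbers, and the 13B dense row is added only for comparison.

```python
def memory_gb(params: float, bytes_per_param: float) -> float:
    """Memory in GB (10^9 bytes) for a given parameter count."""
    return params * bytes_per_param / 1e9


BF16_BYTES = 2
# Assumption: 2 (bf16 weights) + 2 (bf16 grads) + 4 + 4 + 4 (fp32 master, AdamW m and v).
TRAINING_BYTES = 16

for name, total in [("Dense 70B", 70e9), ("Mixtral 8x7B", 47e9), ("Dense 13B", 13e9)]:
    print(f"{name:<13} weights={memory_gb(total, BF16_BYTES):6.0f} GB  "
          f"training state={memory_gb(total, TRAINING_BYTES):6.0f} GB")
```

The dense 13B row is the relevant comparison for the memory trade-off: similar per-token compute to Mixtral, but a fraction of the weights to hold.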
Production MoE serving uses expert parallelism: shard experts across GPUs, route requests via AllToAll. The serving infrastructure is more complex than dense; the per-token cost is much lower.
Routing dynamics — what experts learn
Empirically, experts specialise in surprising ways:
- Some experts handle syntactic patterns ("article-noun-verb" structure).
- Some handle code vs natural language.
- Some handle specific languages or domains.
- Routing is mostly positional: certain experts activate at the first few tokens of a sequence, others at later positions.
Anthropic / DeepSeek interpretability work shows that the per-expert specialisation is imperfect but real — experts have learned features but with significant overlap and redundancy.
The interview probes
- "Why does MoE need a load-balancing loss?" Without one, the router gradient prefers whichever expert is most-trained, which causes a positive feedback loop: most-used experts get more gradient → improve faster → get used more. After a few thousand steps, the model degenerates to using 1-2 experts. The auxiliary loss explicitly pushes back against concentration.
- "How do MoE FLOPs compare to dense?" Per-token forward FLOPs: dense = 2N. MoE = 2 · N_active. For Mixtral 8×7B: 2 · 13B = 26 GFLOPs per token vs LLaMA-2 70B's 140 GFLOPs. ~5× cheaper inference. Training follows the same ratio per token (6 · N_active instead of 6 · N).
- "What's the bottleneck in MoE training?" AllToAll communication between experts. On a single node (NVLink), it's tolerable. Multi-node MoE training on GPU clusters was hard to make efficient until DeepSpeed-MoE (2022) and PyTorch's native expert parallelism added overlap and compression tricks.
- "Why top-2, not top-1?" Switch Transformer used top-1 and showed that even one expert per token works. Top-2 gives a smoother training signal (each token's loss has gradient flowing through 2 experts → less variance) and slightly better quality. Top-2 remains a common default, though models with many small experts route to more (DeepSeek-V3 activates 8 routed experts per token).
- "What's the 'token dropping' problem?" Fixed-capacity experts can't accept more tokens than their capacity. Excess tokens get their expert output zeroed; they pass through the residual only. Capacity factor > 1.0 buffers this. High drop rate = quality degradation.
MoE vs dense — the decision matrix
| Use case | Dense or MoE | Why |
|---|---|---|
| Maximum inference throughput | MoE | 5× cheaper per token at similar quality |
| Maximum quality at fixed memory | Dense | MoE wastes memory on inactive experts |
| Small batch, latency-sensitive | Dense | MoE's AllToAll dominates at small batch |
| Edge / single-GPU deployment | Dense | MoE memory exceeds single-GPU even at small active count |
| Multilingual / multi-domain | MoE | Experts can specialise; reduces interference |
| Fine-tuning / RLHF | Either — MoE harder | MoE post-training is more delicate (keep aux loss, watch router drift). Mixtral-Instruct, DeepSeek-Chat ship in production, so it works — just with more care than dense. |
Where things go subtly wrong
| Bug | Symptom | Diagnosis |
|---|---|---|
| Router collapse | One expert handles 80% of tokens; loss is fine but model effectively has 1 expert. | Aux loss weight too small. Increase α to 0.01–0.1. Plot per-expert load every few hundred steps. |
| Token drop catastrophe | Quality drops sharply at high batch sizes. | Capacity factor too low for the batch. Either increase capacity or reduce batch. |
| Router gradient instability | Loss oscillates because router keeps re-routing tokens between similar experts. | Add routing noise during training (epsilon-greedy on top-k). Or use Expert Choice. Or warm up the router LR more slowly. |
| Expert parallelism mismatch | Training stalls every step. | AllToAll deadlock from inconsistent token-count broadcasts. Always pad to capacity and synchronise. |
| Fine-tuning destroys load balance | Fine-tuned model's router becomes lopsided. | Keep aux loss on during fine-tuning. Some recipes also freeze router weights during fine-tune. |
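The "plot per-expert load" advice in the table can start from a small helper like the one below. It is a sketch: the 0.5 collapse threshold and the normalised-entropy summary are illustrative choices, not established defaults, and the function name is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def expert_load_report(expert_ids: Sequence[int],
                       num_experts: int,
                       collapse_threshold: float = 0.5) -> Dict[str, object]:
    """Summarise how routed assignments are spread across experts.

    expert_ids: flat list of the expert chosen for each (token, slot) in a window of steps.
    """
    counts = Counter(expert_ids)
    total = len(expert_ids)
    load = [counts.get(e, 0) / total for e in range(num_experts)]
    entropy = -sum(p * math.log(p) for p in load if p > 0)
    return {
        "load": load,
        "max_load": max(load),
        # 1.0 means perfectly uniform routing, 0.0 means a single expert takes everything.
        "normalised_entropy": entropy / math.log(num_experts),
        # Hypothetical alert rule: one expert taking more than the threshold share.
        "collapsed": max(load) > collapse_threshold,
    }


healthy = expert_load_report([e for e in range(8) for _ in range(10)], num_experts=8)
skewed = expert_load_report([0] * 80 + [1, 2, 3, 4] * 5, num_experts=8)
print(healthy["max_load"], healthy["collapsed"])
print(skewed["max_load"], skewed["collapsed"])
```

Logging `max_load` and `normalised_entropy` per MoE layer every few hundred steps is usually enough to catch collapse and fine-tuning drift early, since the training loss alone can look fine.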
Interview prompts you should be ready for
- "How is Mixtral 8×7B cheaper to serve than LLaMA-2 70B at similar quality?" (Mixtral total 47B, active per token 13B. Inference cost ∝ active params: ~5× cheaper than 70B dense. Memory is that of a dense 47B model: ~94 GB in bf16 against 140 GB for the 70B, but far more than a 13B dense model needs. Big win on compute, modest win on memory.)
- "Derive the load-balancing loss." (For each expert i: f_i = fraction of tokens routed to expert i; P_i = mean of softmax probability assigned to expert i. L_balance = α · N · Σ f_i · P_i. The product is minimised when f and P are both uniform (each = 1/N) → uniform load. The factor of N keeps the magnitude independent of expert count.)
- "What's the AllToAll cost in expert-parallel MoE?" (Each token's hidden state (d elements, so 2d bytes in bf16) must move to the rank holding its expert and back. Total comm per layer, summed over all ranks = 2 · B · L · d · k elements (k experts per token, two-way trip). For B=64, L=4k, d=4096, k=2: ~4.3 billion elements, i.e. ~8.6 GB per layer in bf16. With many layers and slow inter-node fabric, this dominates step time.)
- "Why is fine-tuning MoE harder?" (Routing is non-differentiable through top-k; gradients only flow to selected experts. With a small fine-tune dataset, you might never route to some experts → they stay unchanged. Load balance can degrade quickly. Mitigations: keep aux loss, freeze router, or use lighter-touch methods (LoRA on selected experts only).)
- "Could MoE work without load-balancing loss?" (Expert Choice (Zhou 2022) inverts routing: experts pick tokens (top-k tokens per expert), guaranteeing balanced load. But it requires global info (all tokens visible to all experts) which is awkward in distributed settings. Token Choice + aux loss is the standard.)
- "Why do MoE models have shared experts (DeepSeek style)?" (Shared experts are always active for every token; routed experts are selected per token. The shared experts provide a stable baseline (every token gets some computation), while routed experts add specialisation. Improves training stability and lowers variance.)
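The AllToAll volume from the prompts above is easy to sanity-check numerically. The sketch below takes L = 4k to mean 4096 tokens and treats bytes per element as a parameter; the transfer-time helper is a deliberately crude estimate that assumes traffic is spread evenly over the ranks and ignores latency, overlap, and tokens that stay local. The 8-rank layout and the 50 GB/s inter-node figure (400 Gb/s converted to bytes) are assumptions for illustration.

```python
def alltoall_elements(batch: int, seq_len: int, d_model: int, k: int) -> int:
    """Hidden-state elements moved per MoE layer: there and back, k experts per token."""
    return 2 * batch * seq_len * d_model * k


def alltoall_bytes(batch: int, seq_len: int, d_model: int, k: int,
                   bytes_per_element: int = 2) -> int:
    """Total bytes per layer, summed over all ranks (default: bf16, 2 bytes/element)."""
    return alltoall_elements(batch, seq_len, d_model, k) * bytes_per_element


def rough_transfer_ms(total_bytes: float, num_ranks: int, link_gb_per_s: float) -> float:
    """Crude estimate: each rank moves an equal share over its own link."""
    per_rank_bytes = total_bytes / num_ranks
    return per_rank_bytes / (link_gb_per_s * 1e9) * 1e3


elements = alltoall_elements(batch=64, seq_len=4096, d_model=4096, k=2)
total = alltoall_bytes(batch=64, seq_len=4096, d_model=4096, k=2)
print(f"{elements / 1e9:.2f} G elements, {total / 1e9:.2f} GB in bf16")
print(f"intra-node (~900 GB/s): {rough_transfer_ms(total, 8, 900):.2f} ms per layer")
print(f"inter-node (~50 GB/s):  {rough_transfer_ms(total, 8, 50):.2f} ms per layer")
```

Whatever the exact constants, the ratio between the two link speeds carries straight through to the per-layer communication time.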