
Mixture-of-Experts — sparse activation

Replace one big FFN with N small FFNs and a router that picks 2 of them per token. Total parameters scale; active parameters don't. The result: quality closer to a large dense model at the per-token compute of a small one. The cost is memory and an AllToAll on every layer.

The motivating arithmetic

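From lesson 08: training FLOPs = 6 · N · D. Inference FLOPs per token = 2N. Both scale linearly in N, so a bigger model means more compute. The MoE trick: not all of N is "active" per token. If only k of n_experts experts run per token, active parameters are roughly N_expert · (k / n_experts) + N_shared, where N_expert is the parameter count summed over all experts and N_shared covers everything that runs for every token (attention, embeddings, router).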

| Model | Total params | Active params per token | Inference cost (relative) |
|---|---|---|---|
| LLaMA-2 70B (dense) | 70B | 70B | 1.0 (baseline) |
| Mixtral 8×7B | 47B | ~13B | ~0.2 (5× cheaper at inference) |
| DBRX 16×12B | 132B | 36B | ~0.5 |
| DeepSeek-V3 | 671B | 37B | ~0.5 |

The bet: a 671B-parameter model with 37B active will outperform a 37B dense model (more total capacity for memorisation) while costing about the same compute per token at inference. This is true empirically. The cost shifts from compute to memory and communication.
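The active-parameter arithmetic above can be sketched in a few lines. This is a rough illustration, not a precise accounting: the function names are hypothetical, and the `shared_b=1.6` figure for Mixtral's always-on parameters is an assumption chosen for the example, not a number from this lesson.

```python
def active_params_b(total_b, shared_b, n_experts, top_k):
    """Approximate active parameters per token, in billions.

    total_b:  total parameter count (billions)
    shared_b: parameters that run for every token (attention, embeddings, router)
    """
    expert_b = total_b - shared_b              # parameters held in routed experts
    return shared_b + expert_b * top_k / n_experts


def relative_inference_cost(active_b, dense_baseline_b=70.0):
    """Per-token forward cost relative to a dense baseline (cost scales with active params)."""
    return active_b / dense_baseline_b


# Mixtral 8x7B: 47B total, top-2 of 8 experts.
# shared_b=1.6 is an illustrative assumption, not a figure from the lesson.
mixtral_active = active_params_b(total_b=47.0, shared_b=1.6, n_experts=8, top_k=2)
mixtral_cost = relative_inference_cost(mixtral_active)

print(f"active ~ {mixtral_active:.1f}B, relative cost ~ {mixtral_cost:.2f}")
```

Under that assumption the estimate lands close to the table's ~13B active and ~0.2 relative cost.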

The MoE block — replace the FFN

In a standard transformer, each layer has a single FFN that is applied to every token. In MoE, the FFN is split into N parallel experts plus a router:

router_logits = x · W_r                          (W_r: d × n_experts)
top_k_experts = top_k(softmax(router_logits))    (pick k experts per token)
y = Σ_{i ∈ top_k} gate_i · Expert_i(x)           (weighted sum of expert outputs)

Each expert is a regular FFN (or SwiGLU). The router is a tiny linear layer. For Mixtral-8×7B: 8 experts of ~5.5B params each (summed over the MoE layers), the router is a 4096 × 8 matrix per layer (~33K params, negligible), and top-k = 2 → ~13B params active per token.

x ──┬─► W_r ─► router_logits ─► softmax ─► top_k indices + gates
    │
    ├──► Expert 1 ─┐
    ├──► Expert 2 ─┤
    ├──► Expert 3 ─┼─► weighted sum (only k=2 selected) ──► y
    │     ...      │
    ├──► Expert 7 ─┤
    └──► Expert 8 ─┘
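A minimal single-token sketch of the block above, in plain Python. The toy experts (simple scalings) and the function names are hypothetical. Renormalising the gates over the selected experts is an assumption: the formula above does not say whether gates are renormalised after top-k.

```python
import math


def softmax(logits):
    m = max(logits)                                   # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, w_r, experts, k=2):
    """x: list of d floats; w_r: d x n_experts matrix (list of rows);
    experts: list of callables mapping a d-vector to a d-vector."""
    n_experts = len(experts)
    # router_logits = x · W_r
    logits = [sum(x[i] * w_r[i][e] for i in range(len(x))) for e in range(n_experts)]
    probs = softmax(logits)
    # indices of the k highest-probability experts
    top = sorted(range(n_experts), key=lambda e: probs[e], reverse=True)[:k]
    gate_sum = sum(probs[e] for e in top)             # assumption: renormalise over top-k
    y = [0.0] * len(x)
    for e in top:
        gate = probs[e] / gate_sum
        out = experts[e](x)                           # only the selected experts run
        y = [yi + gate * oi for yi, oi in zip(y, out)]
    return y, top


# Toy setup: 4 experts that scale their input by 1, 2, 3, 4.
toy_experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
toy_w_r = [[0.0, 1.0, 2.0, 3.0],
           [0.0, 0.0, 0.0, 0.0]]
y, chosen = moe_forward([1.0, 0.0], toy_w_r, toy_experts, k=2)
print(chosen)  # [3, 2]
```

Real implementations batch this over all tokens and group tokens by expert; the loop here only shows the routing logic.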

Load balancing — why this is hard

If the router learns to always pick the same 2 experts, you have 2 experts that contribute and 6 that are dead weight. Load imbalance is the central engineering challenge.

The standard fix: add an auxiliary loss that penalises imbalance.

L_balance = α · n_experts · Σ_i f_i · P_i

where f_i is the fraction of tokens routed to expert i, and P_i is the mean router probability assigned to expert i. This loss is minimised when both f and P are uniform, where it equals α; it penalises both "the router strongly prefers a few experts" (peaked P) and "the assignments concentrate on few experts" (peaked f). Typical α = 0.01.
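The loss can be computed directly from the router outputs. The sketch below uses plain Python lists; the function name is hypothetical, and normalising f by the number of assignments (so that f sums to 1 even for k > 1) is an assumption, since some formulations leave f summing to k.

```python
def load_balance_loss(router_probs, top_k_indices, alpha=0.01):
    """router_probs:  per-token softmax probabilities over experts.
    top_k_indices: per-token list of the expert ids that were selected."""
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    k = len(top_k_indices[0])

    # f_i: fraction of assignments that went to expert i (assumption: sums to 1)
    f = [0.0] * n_experts
    for chosen in top_k_indices:
        for e in chosen:
            f[e] += 1.0 / (n_tokens * k)

    # P_i: mean router probability for expert i
    p = [sum(tok[e] for tok in router_probs) / n_tokens for e in range(n_experts)]

    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, p))


# Perfectly balanced: 4 tokens, 4 experts, uniform probabilities.
balanced = load_balance_loss([[0.25] * 4] * 4, [[0], [1], [2], [3]])
# Fully collapsed: every token sends all its probability to expert 0.
collapsed = load_balance_loss([[1.0, 0.0, 0.0, 0.0]] * 4, [[0]] * 4)
print(balanced, collapsed)
```

The balanced case gives α and the collapsed case gives α · n_experts, which is why the n_experts factor makes the scale easy to read.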

Variants that come up later in this lesson include Expert Choice routing, routing noise, and shared experts.

Capacity factor and token dropping

In a batch, you route every token to its top-k experts. Some experts get assigned many tokens; some get few. To make the implementation fixed-shape (for GPU efficiency), you pre-allocate a fixed capacity per expert:

capacity = (k × tokens_per_batch / n_experts) × capacity_factor   (k = experts per token; k = 1 gives the Switch Transformer form)

If more tokens want to go to an expert than fit, the excess is dropped: their output is set to 0 (or routed to the residual without expert contribution). Capacity factor 1.0 = no slack; 1.25 means 25% extra buffer.

| Capacity factor | Throughput | Token drop rate (with aux loss) | Quality |
|---|---|---|---|
| 1.0 | Fastest | Low (~1-5%) if balanced; much higher if router collapses | Near-baseline when well-balanced; degrades fast if not |
| 1.25 | ~80% of 1.0 | Very low (<1%) | Baseline |
| 2.0 | ~50% of 1.0 | ~0% | Baseline |

Modern MoE training: 1.25 with strong load-balancing loss. Inference can often use ~1.0 because the load balance learned at training generalises.
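To make the capacity and dropping mechanics concrete, here is a small sketch. Function names are hypothetical. Two assumptions: capacity is scaled by `top_k` assignments per token (with `top_k=1` this is just tokens / experts × capacity factor), and fractional capacities are rounded up, which varies between implementations.

```python
import math


def expert_capacity(n_tokens, n_experts, capacity_factor, top_k=1):
    """Fixed number of slots per expert (assumption: rounded up)."""
    return math.ceil(top_k * n_tokens / n_experts * capacity_factor)


def dispatch_with_capacity(assignments, n_experts, capacity):
    """assignments: list of (token_id, expert_id) in arrival order.
    Returns (buckets, dropped): tokens kept per expert and overflow assignments."""
    buckets = [[] for _ in range(n_experts)]
    dropped = []
    for token_id, expert_id in assignments:
        if len(buckets[expert_id]) < capacity:
            buckets[expert_id].append(token_id)
        else:
            dropped.append((token_id, expert_id))     # overflow: residual path only
    return buckets, dropped


# 8 tokens, 4 experts, top-1 routing; expert 0 is oversubscribed with 4 tokens.
assignments = [(t, e) for t, e in enumerate([0, 0, 0, 1, 1, 2, 3, 0])]
tight = expert_capacity(n_tokens=8, n_experts=4, capacity_factor=1.0)    # 2 slots
slack = expert_capacity(n_tokens=8, n_experts=4, capacity_factor=1.25)   # 3 slots
_, dropped_tight = dispatch_with_capacity(assignments, 4, tight)
_, dropped_slack = dispatch_with_capacity(assignments, 4, slack)
print(len(dropped_tight), len(dropped_slack))  # 2 1
```

The extra slot at capacity factor 1.25 rescues one of the two overflowing tokens, at the price of larger padded buffers on every expert.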

Expert parallelism — an AllToAll on every layer

If you have 8 experts and 8 GPUs, distribute one expert per GPU. The forward pass at each MoE layer becomes:

  1. AllToAll #1: each GPU sends each token to the GPU that holds its target expert. Tokens shuffle across the network.
  2. Expert compute: each GPU runs its expert on its assigned tokens.
  3. AllToAll #2: send tokens back to their original GPU for the next layer.

The dominant cost is the AllToAll communication. For 8 experts within a node connected via NVLink (~900 GB/s), this is feasible. Across multiple nodes (InfiniBand at 400 Gb/s per link, roughly 50 GB/s), it's a significant overhead.
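The three steps can be simulated in a single process to see what moves where. This is a toy model, not a distributed implementation: ranks are list indices, `all_to_all` is a plain transpose of send buffers, routing is top-1 for brevity, and all names are hypothetical.

```python
def all_to_all(send):
    """send[src][dst] is what rank src sends to rank dst; returns recv[dst][src]."""
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]


def moe_layer_expert_parallel(tokens_per_rank, route, expert_fns):
    """tokens_per_rank[r]: tokens held by rank r.
    route(token) -> id of the rank hosting that token's expert (top-1).
    expert_fns[r]: the expert hosted on rank r."""
    n = len(tokens_per_rank)

    # Step 1: bucket tokens by destination rank, then AllToAll #1.
    send = [[[] for _ in range(n)] for _ in range(n)]
    for src, toks in enumerate(tokens_per_rank):
        for pos, tok in enumerate(toks):
            send[src][route(tok)].append((pos, tok))  # keep position to restore order
    recv = all_to_all(send)

    # Step 2: each rank runs its local expert on the tokens it received.
    processed = [[[(pos, expert_fns[dst](tok)) for pos, tok in recv[dst][src]]
                  for src in range(n)] for dst in range(n)]

    # Step 3: AllToAll #2 returns results to the originating rank.
    back = all_to_all(processed)
    out = []
    for src in range(n):
        slots = [None] * len(tokens_per_rank[src])
        for dst in range(n):
            for pos, val in back[src][dst]:
                slots[pos] = val
        out.append(slots)
    return out


# Two ranks; even tokens go to expert 0 (x10), odd tokens to expert 1 (+100).
result = moe_layer_expert_parallel(
    [[1, 2, 3], [4, 5]],
    route=lambda tok: tok % 2,
    expert_fns=[lambda t: t * 10, lambda t: t + 100],
)
print(result)  # [[101, 20, 103], [40, 105]]
```

In a real system the send buffers have data-dependent sizes, which is why implementations either pad to capacity or exchange counts first.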

The communication cost
Each AllToAll moves ~B · L · d activations per layer (one value per token per dimension, counting each token once). For Mixtral with B=64, L=4k, d=4096, dtype=bf16 (2 bytes per value): ~2 GB per AllToAll, two per layer, 32 layers → ~130 GB of comm per forward pass. At 900 GB/s NVLink: ~150 ms per forward pass just for routing. Unless it is overlapped with compute, the AllToAll can dominate step time.
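The back-of-envelope numbers above can be reproduced with a few multiplications. The helper name is hypothetical, and the estimate makes the same simplifications as the text: each token is counted once (`top_k=1`), bandwidth is treated as a single aggregate figure, and no overlap with compute is assumed.

```python
def alltoall_bytes(batch, seq_len, d_model, bytes_per_value=2, top_k=1):
    """Bytes moved by one AllToAll: one value per token per hidden dimension."""
    return batch * seq_len * d_model * bytes_per_value * top_k


per_call = alltoall_bytes(batch=64, seq_len=4096, d_model=4096)  # 2**31 bytes, about 2.1 GB
per_pass = per_call * 2 * 32                                     # two AllToAlls x 32 layers
seconds_nvlink = per_pass / 900e9                                # at ~900 GB/s

print(f"{per_call / 1e9:.2f} GB per AllToAll")
print(f"{per_pass / 1e9:.0f} GB per forward pass")
print(f"{seconds_nvlink * 1e3:.0f} ms at 900 GB/s")
```

Setting `top_k=2` doubles every figure, since each token's hidden state is then sent to two experts.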

Memory cost — the trade-off

| Component | Dense 70B | MoE 8×7B (Mixtral) |
|---|---|---|
| Params (bf16) | 140 GB | 94 GB |
| Optimizer state (AdamW) | 1.12 TB | 752 GB |
| Active params for forward FLOPs | 70B | ~13B |
| Inference forward FLOPs / token | 140 GFLOPs | 26 GFLOPs |
| KV cache | same per layer | same per layer |

So MoE saves compute (active params × tokens), but memory and optimizer state scale with total parameters. Mixtral needs 94 GB of bf16 weights where a dense model with the same ~13B active parameters would need about 26 GB. At inference, you still need to hold all experts on chip (or stream them) because any token might route to any expert.

Production MoE serving uses expert parallelism: shard experts across GPUs and route tokens via AllToAll. The serving infrastructure is more complex than dense; the per-token compute cost is much lower.
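The memory and FLOPs rows of the table follow from two multipliers. In this sketch the function name is hypothetical, and the 16 bytes per parameter of training state is an assumption inferred from the table itself (1.12 TB / 70B), not a stated figure.

```python
def memory_report(total_params_b, active_params_b,
                  weight_bytes=2, train_state_bytes=16):
    """All parameter counts in billions, so byte products come out in GB.

    weight_bytes:      2 for bf16 weights
    train_state_bytes: assumption inferred from the table (1.12 TB / 70B = 16)
    """
    return {
        "weights_gb": total_params_b * weight_bytes,        # scales with TOTAL params
        "train_state_gb": total_params_b * train_state_bytes,
        "gflops_per_token": 2 * active_params_b,            # scales with ACTIVE params
    }


dense_70b = memory_report(total_params_b=70, active_params_b=70)
mixtral = memory_report(total_params_b=47, active_params_b=13)
dense_13b = memory_report(total_params_b=13, active_params_b=13)

print(dense_70b)
print(mixtral)
print(dense_13b)
```

Comparing `mixtral` with `dense_13b` shows the trade: identical FLOPs per token, but several times the weight memory.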

Routing dynamics — what experts learn

Empirically, experts specialise in surprising ways.

Anthropic / DeepSeek interpretability work shows that the per-expert specialisation is imperfect but real — experts have learned features but with significant overlap and redundancy.

The interview probes

  1. "Why does MoE need a load-balancing loss?" Without one, the router gradient prefers whichever expert is most-trained, which causes a positive feedback loop: most-used experts get more gradient → improve faster → get used more. After a few thousand steps, the model degenerates to using 1-2 experts. The auxiliary loss explicitly pushes back against concentration.
  2. "How do MoE FLOPs compare to dense?" Per-token forward FLOPs: dense = 2N. MoE = 2 · N_active. For Mixtral 8×7B: 2 · 13B = 26 GFLOPs per token vs LLaMA-2 70B's 140 GFLOPs. ~5× cheaper inference. Training follows the same ratio: 6 · N_active FLOPs per token instead of 6 · N.
  3. "What's the bottleneck in MoE training?" AllToAll communication between experts. On a single node (NVLink), it's tolerable. Multi-node MoE training on GPU clusters was painful until systems such as DeepSpeed-MoE (2022) and PyTorch's native expert parallelism added overlap and compression tricks.
  4. "Why top-2, not top-1?" Switch Transformer used top-1 and showed that even one expert per token works. Top-2 gives a smoother training signal (each token's loss has gradient flowing through 2 experts → less variance) and slightly better quality. Top-2 is standard now.
  5. "What's the 'token dropping' problem?" Fixed-capacity experts can't accept more tokens than their capacity. Excess tokens get their expert output zeroed; they pass through the residual only. Capacity factor > 1.0 buffers this. High drop rate = quality degradation.

MoE vs dense — the decision matrix

| Use case | Dense or MoE | Why |
|---|---|---|
| Maximum inference throughput | MoE | 5× cheaper per token at similar quality |
| Maximum quality at fixed memory | Dense | MoE wastes memory on inactive experts |
| Small batch, latency-sensitive | Dense | MoE's AllToAll dominates at small batch |
| Edge / single-GPU deployment | Dense | MoE memory exceeds single-GPU even at small active count |
| Multilingual / multi-domain | MoE | Experts can specialise; reduces interference |
| Fine-tuning / RLHF | Either (MoE harder) | MoE post-training is more delicate (keep aux loss, watch router drift). Mixtral-Instruct, DeepSeek-Chat ship in production, so it works, just with more care than dense. |


Where things go subtly wrong

| Bug | Symptom | Diagnosis and fix |
|---|---|---|
| Router collapse | One expert handles 80% of tokens; loss is fine but model effectively has 1 expert. | Aux loss weight too small. Increase α to 0.01–0.1. Plot per-expert load every few hundred steps. |
| Token drop catastrophe | Quality drops sharply at high batch sizes. | Capacity factor too low for the batch. Either increase capacity or reduce batch. |
| Router gradient instability | Loss oscillates because router keeps re-routing tokens between similar experts. | Add routing noise during training (epsilon-greedy on top-k). Or use Expert Choice. Or warm up the router LR more slowly. |
| Expert parallelism mismatch | Training stalls every step. | AllToAll deadlock from inconsistent token-count broadcasts. Always pad to capacity and synchronise. |
| Fine-tuning destroys load balance | Fine-tuned model's router becomes lopsided. | Keep aux loss on during fine-tuning. Some recipes also freeze router weights during fine-tune. |
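Monitoring per-expert load, as suggested for router collapse, needs only the selected expert ids from a batch. This is a minimal sketch with hypothetical names; the 0.5 threshold is an arbitrary illustrative choice, not a recommended value.

```python
from collections import Counter


def expert_load(top_k_indices, n_experts):
    """Fraction of all assignments that went to each expert."""
    counts = Counter(e for chosen in top_k_indices for e in chosen)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * n_experts
    return [counts.get(e, 0) / total for e in range(n_experts)]


def looks_collapsed(load, threshold=0.5):
    """Flag a batch where a single expert takes more than `threshold` of the load.
    The default threshold is an illustrative assumption."""
    return max(load) > threshold


skewed = expert_load([[0], [0], [0], [0], [1]], n_experts=4)
even = expert_load([[0], [1], [2], [3]], n_experts=4)
print(skewed, looks_collapsed(skewed))  # [0.8, 0.2, 0.0, 0.0] True
print(even, looks_collapsed(even))      # [0.25, 0.25, 0.25, 0.25] False
```

Logging this vector every few hundred steps makes collapse visible well before it shows up in the training loss.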

Interview prompts you should be ready for

  1. "How is Mixtral 8×7B cheaper to serve than LLaMA-2 70B at similar quality?" (Mixtral total 47B, active per token 13B. Inference cost ∝ active params: ~5× cheaper than 70B dense. Memory footprint is that of a dense 47B model, so the win is on compute, not on memory.)
  2. "Derive the load-balancing loss." (For each expert i: f_i = fraction of tokens routed to expert i; P_i = mean of softmax probability assigned to expert i. L_balance = α · n_experts · Σ f_i · P_i. The product is minimised when f and P are both uniform (each = 1/n_experts) → uniform load. The factor of n_experts keeps the magnitude independent of expert count: at uniform load the sum is 1/n_experts, so the loss is α.)
  3. "What's the AllToAll cost in expert-parallel MoE?" (Each token's hidden state (d values) must move to the rank holding its expert and back. Total comm per layer = 2 · B · L · d · k values (k experts per token, two-way trip). For B=64, L=4k, d=4096, k=2: ~4.3 billion values, i.e. ~8.6 GB per layer in bf16. With many layers and slow inter-node fabric, this dominates step time.)
  4. "Why is fine-tuning MoE harder?" (Routing is non-differentiable through top-k; gradients only flow to selected experts. With a small fine-tune dataset, you might never route to some experts → they stay unchanged. Load balance can degrade quickly. Mitigations: keep aux loss, freeze router, or use lighter-touch methods (LoRA on selected experts only).)
  5. "Could MoE work without load-balancing loss?" (Expert Choice (Zhou 2022) inverts routing: experts pick tokens (top-k tokens per expert), guaranteeing balanced load. But it requires global info (all tokens visible to all experts) which is awkward in distributed settings. Token Choice + aux loss is the standard.)
  6. "Why do MoE models have shared experts (DeepSeek style)?" (Shared experts are always active for every token; routed experts are selected per token. The shared experts provide a stable baseline (every token gets some computation), while routed experts add specialisation. Improves training stability and lowers variance.)
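The Expert Choice idea from prompt 5 can be sketched in a few lines: each expert ranks all tokens by router score and takes a fixed number of them. The function name and toy scores are hypothetical, and this is a simplified reading of the scheme, not the published implementation.

```python
def expert_choice(scores, capacity):
    """scores[t][e]: router affinity of token t for expert e.
    Each expert selects its `capacity` highest-scoring tokens.
    Returns picks[e] = sorted list of token ids chosen by expert e."""
    n_tokens, n_experts = len(scores), len(scores[0])
    picks = []
    for e in range(n_experts):
        ranked = sorted(range(n_tokens), key=lambda t: scores[t][e], reverse=True)
        picks.append(sorted(ranked[:capacity]))
    return picks


# 4 tokens, 2 experts, each expert takes exactly 2 tokens.
toy_scores = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.3, 0.7],
    [0.6, 0.4],
]
picks = expert_choice(toy_scores, capacity=2)
print(picks)  # [[0, 1], [2, 3]]
```

Load is balanced by construction because every expert fills exactly `capacity` slots. The flip side is that a token can be picked by several experts or by none, and ranking needs the scores of every token in the batch.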
Takeaway
MoE trades memory for compute. Total params increase (more capacity to memorise) but active params per token stay small (cheap inference). The cost is an AllToAll on every layer plus an auxiliary load-balancing loss to prevent router collapse. Modern recipes: top-2 routing, capacity factor ~1.25, aux loss α ≈ 0.01, sometimes shared experts. MoE wins when memory is plentiful and inference compute is the bottleneck; loses when memory or single-GPU latency matters. The interview signal is being able to do the back-of-envelope cost comparison and articulate when MoE is the right tool.
Where this track took you
Twelve lessons: backprop → optimizers → norms → attention → positional encodings → transformer block → tokenization → scaling laws → calibration → A/B testing → init → MoE. From "what is a gradient?" to "should I deploy MoE for this workload?". Each is a standalone interview question; collectively they're the math you re-derive every time someone asks "but why?". The next layer of depth lives in the other tracks — RL post-training, distributed systems, GPU kernels, serving, generative. They all assume what you've just learned.