Scaling laws — Kaplan, Chinchilla, μP
How loss falls with compute, data, and parameters — and which of those you should spend extra on. The interview's favourite back-of-envelope question and the right answer for "should I make the model 2× bigger or train it 2× longer".
The compute identity — derive 6N
Training compute (in FLOPs) for an autoregressive model:
C ≈ 6 · N · D
where N is total parameters and D is total training tokens. Why 6? A forward pass through a transformer is ~2N FLOPs per token (each parameter is touched once in a multiply-add, and matmul = 2 FLOPs per MAC). The backward pass is ~2× the forward (one matmul to get gradients w.r.t. inputs, one for gradients w.r.t. weights). Total: 2 (forward) + 4 (backward) = 6 FLOPs per parameter per token.
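As a quick sanity check on the counting argument above, here is a minimal sketch of the estimate. The function name is illustrative, and the approximation ignores attention-score FLOPs that grow with sequence length, so treat the result as a back-of-envelope figure.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute via C ≈ 6 * N * D."""
    forward = 2.0 * n_params   # one multiply-add (2 FLOPs) per parameter per token
    backward = 2.0 * forward   # gradients w.r.t. inputs plus gradients w.r.t. weights
    return (forward + backward) * n_tokens


if __name__ == "__main__":
    # A Chinchilla-sized run: 70B parameters, 1.4T tokens.
    print(f"{training_flops(70e9, 1.4e12):.2e} FLOPs")
```

For 70B parameters and 1.4T tokens this comes to roughly 5.9e23 FLOPs.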
Kaplan et al. (2020) — the first scaling law
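OpenAI's original result: loss as a power law in each of parameters, data, and compute, measured when the other two are not the bottleneck:
L(N) ≈ (N_c / N)^{α_N}, L(D) ≈ (D_c / D)^{α_D}, L(C) ≈ (C_c / C)^{α_C}
with fitted exponents of roughly α_N ≈ 0.076, α_D ≈ 0.095, α_C ≈ 0.050.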
The takeaways at the time:
- Loss falls as a power law in each variable.
- The exponents are small (~0.05–0.1), so doubling any one resource reduces loss by only ~3–7% (2^{-0.05} ≈ 0.97, 2^{-0.1} ≈ 0.93).
- Kaplan recommended: for fixed compute, prefer bigger model, less data. The compute-optimal allocation was roughly N ∝ C^{0.73}, D ∝ C^{0.27}.
Most large models trained 2020–2022 (GPT-3, OPT, BLOOM) followed Kaplan: huge models trained on relatively little data. It was wrong.
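To put the small exponents in perspective, the sketch below evaluates a pure power law L ∝ X^(-α) for the range of exponents quoted above. It assumes no irreducible loss term, and the helper names are illustrative.

```python
def loss_ratio(scale_factor: float, exponent: float) -> float:
    """Loss multiplier when one resource grows by scale_factor under L ∝ X^(-exponent)."""
    return scale_factor ** (-exponent)


def scale_needed_to_halve_loss(exponent: float) -> float:
    """Resource multiplier required to cut loss in half under the same power law."""
    return 2.0 ** (1.0 / exponent)


if __name__ == "__main__":
    for alpha in (0.05, 0.10):
        drop = 1.0 - loss_ratio(2.0, alpha)
        print(f"alpha={alpha:.2f}: doubling cuts loss by {drop:.1%}; "
              f"halving loss needs {scale_needed_to_halve_loss(alpha):.3g}x")
```

With α = 0.05, halving the loss takes about a million times more of the resource, which is why the allocation between N and D matters so much.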
Chinchilla (Hoffmann et al. 2022) — the correction
DeepMind redid the experiments with a wider grid and got a different answer:
N_opt ∝ C^{0.5}, D_opt ∝ C^{0.5}, which works out to roughly 20 training tokens per parameter.
So as compute grows, you should scale the model and the data in roughly equal proportion. Chinchilla itself: 70B parameters, 1.4T tokens, which is exactly the 20:1 ratio. It outperformed Gopher (280B parameters, 300B tokens) using 4× fewer parameters but ~4.7× more data, at roughly the same total compute.
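The 20:1 heuristic turns a compute budget into a model size in one line. A minimal sketch, assuming C = 6ND and a fixed tokens-per-parameter ratio (the function name and default are illustrative):

```python
import math

TOKENS_PER_PARAM = 20.0  # heuristic ratio quoted in the text


def chinchilla_allocation(compute_flops: float,
                          tokens_per_param: float = TOKENS_PER_PARAM):
    """Solve C = 6 * N * D with D = r * N, giving N = sqrt(C / (6 * r))."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    n, d = chinchilla_allocation(5.88e23)
    print(f"N ≈ {n / 1e9:.0f}B parameters, D ≈ {d / 1e12:.2f}T tokens")
```

Feeding in Chinchilla's own budget recovers its 70B / 1.4T configuration, which is a useful check that the ratio and the 6ND identity are being applied consistently.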
The Chinchilla loss formula:
L(N, D) = E + A/N^α + B/D^β, with fitted values E ≈ 1.69, A ≈ 406, B ≈ 411, α ≈ 0.34, β ≈ 0.28.
This gives you a 2D surface. Substituting D = C/(6N) and setting dL/dN = 0 gives the optimal allocation, which satisfies αA/N^α = βB/D^β. The 20:1 D/N ratio comes from Chinchilla's experimental Approaches 1 and 2 (fixed model sizes trained for varying token counts, and IsoFLOP profiles) at their training compute (~5.9e23 FLOPs). The parametric Approach 3 (the formula above) actually predicts a D/N that drifts up with compute: at Chinchilla's scale the published constants give D/N ≈ 90, not 20. This is a known internal disagreement in the paper. Modern practice cites 20:1 as a heuristic; the formula itself implies "even more tokens" at higher compute.
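The disagreement is easy to reproduce numerically. The sketch below uses the closed-form minimiser of the parametric loss under C = 6ND, with the rounded constants quoted in this section; the exact ratio is sensitive to that rounding, and the function name is illustrative.

```python
# Rounded constants of the parametric fit, as quoted in this section.
A, B, ALPHA, BETA = 406.0, 411.0, 0.34, 0.28


def parametric_optimum(compute_flops: float):
    """Minimise E + A/N^alpha + B/D^beta subject to C = 6 * N * D.

    Stationarity gives alpha*A/N^alpha = beta*B/D^beta, hence
    N_opt = G * (C/6)^(beta/(alpha+beta)) with G = (alpha*A/(beta*B))^(1/(alpha+beta)).
    """
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    n_opt = g * (compute_flops / 6.0) ** (BETA / (ALPHA + BETA))
    d_opt = (compute_flops / 6.0) / n_opt
    return n_opt, d_opt


if __name__ == "__main__":
    for c in (1e21, 1e23, 5.88e23, 1e25):
        n, d = parametric_optimum(c)
        print(f"C={c:.2e}: N≈{n / 1e9:.1f}B, D≈{d / 1e12:.2f}T, D/N≈{d / n:.0f}")
```

The printed D/N grows slowly with C (as C^((α−β)/(α+β)) ≈ C^0.1), which is the upward drift described above.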
Post-Chinchilla — the "overtrain" regime
Chinchilla-optimal is the right answer for minimising training loss given a fixed training compute budget. But that's the wrong objective for production. Production wants:
- Low inference cost. Inference compute = (forward FLOPs per token) × (tokens generated) ≈ 2N × (tokens generated). Smaller model = cheaper per query.
- Low memory footprint. Especially KV cache and weight memory at serving time.
- Reasonable training cost. Training compute is a one-time spend; inference is forever.
So in 2026 the practical recipe is to overtrain: train a smaller model on much more data than Chinchilla-optimal. LLaMA-3 8B was trained on ~15T tokens (~1,900 tokens/param, ~90× the Chinchilla ratio). Its training loss is slightly higher than a 70B at Chinchilla-optimal, but the model is ~9× cheaper to serve.
| Model | Params | Tokens | Tokens/Param | Recipe |
|---|---|---|---|---|
| GPT-3 | 175B | 300B | 1.7 | Kaplan (severely undertrained by Chinchilla standards) |
| Chinchilla | 70B | 1.4T | 20 | Compute-optimal |
| LLaMA-2 70B | 70B | 2T | 29 | Slightly overtrained |
| LLaMA-3 8B | 8B | 15T | 1875 | Massively overtrained (production-optimal) |
| LLaMA-3 70B | 70B | 15T | 214 | ~11× over Chinchilla |
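The table's ratios, and the training-versus-serving trade-off behind the overtrain recipe, can be sketched as follows. It assumes 6ND for training and 2N FLOPs per served token; the helper names are illustrative, and the comparison is on compute only and says nothing about whether the two models reach equal quality.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20.0  # heuristic ratio from the text

# (parameters, training tokens) as listed in the table above.
MODELS = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "LLaMA-2 70B": (70e9, 2e12),
    "LLaMA-3 8B": (8e9, 15e12),
    "LLaMA-3 70B": (70e9, 15e12),
}


def overtraining_factor(n_params: float, n_tokens: float) -> float:
    """Tokens-per-parameter ratio expressed as a multiple of the 20:1 heuristic."""
    return (n_tokens / n_params) / CHINCHILLA_TOKENS_PER_PARAM


def lifetime_flops(n_params: float, train_tokens: float, served_tokens: float) -> float:
    """Training compute (6*N*D) plus inference compute (2*N per served token)."""
    return 6.0 * n_params * train_tokens + 2.0 * n_params * served_tokens


def break_even_served_tokens(small, large) -> float:
    """Served tokens at which two (N, D) recipes have equal lifetime FLOPs.

    A negative result means the smaller recipe is cheaper from the first token.
    """
    (n_s, d_s), (n_l, d_l) = small, large
    return 6.0 * (n_s * d_s - n_l * d_l) / (2.0 * (n_l - n_s))


if __name__ == "__main__":
    for name, (n, d) in MODELS.items():
        print(f"{name:12s} {d / n:8.1f} tokens/param, "
              f"{overtraining_factor(n, d):6.1f}x the 20:1 ratio")
    t = break_even_served_tokens(MODELS["LLaMA-3 8B"], MODELS["Chinchilla"])
    print(f"break-even after ~{t:.2e} served tokens")
```

Under these assumptions the 8B / 15T recipe costs more to train than a 70B Chinchilla-optimal run but overtakes it on total compute after roughly a trillion served tokens.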
μP — hyperparameter transfer across scales
Maximal Update Parametrisation (Yang et al. 2021–22). The problem: typically the optimal learning rate changes with model size, so to find the LR for a 70B model you'd have to run a hyperparameter sweep at 70B scale (expensive). μP solves this by rescaling initialisations and per-layer learning rates with width so that the optimal hyperparameters stay approximately the same across scales.
The key rescalings:
| Component | Standard parameterization | μP |
|---|---|---|
| Input/embedding init | O(1) | O(1) |
| Hidden linear init | ~1/√d | ~1/√d (same) |
| Output / LM head init | ~1/√d | 1/d (smaller) |
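| LR on hidden layers (Adam) | constant in width | scales as 1/d |
| LR on output layer (Adam) | constant | scales as 1/d (an equivalent formulation keeps the head's init and LR constant and multiplies the logits by 1/d instead) |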
With these rescalings, the optimal base LR at width d stays approximately the same at width 2d. So you tune at d=512 (cheap) and reuse the same base LR at d=8192 (expensive), with the per-layer scalings applied. This is μTransfer.
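A minimal sketch of how such width rules might be turned into per-layer settings. It assumes Adam and the formulation in which the output head is initialised at 1/d and its learning rate also shrinks as 1/d. The function and class names are hypothetical, this is not the API of any μP library, and a real implementation also has to handle details not covered in this section.

```python
from dataclasses import dataclass


@dataclass
class LayerHParams:
    init_std: float
    lr: float


def mup_adam_hparams(width: int, base_width: int, base_lr: float) -> dict:
    """Per-layer init scale and Adam LR at `width`, given a base LR tuned at `base_width`."""
    m = width / base_width  # width multiplier relative to the small proxy model
    return {
        # Embeddings: O(1) init, LR unchanged with width.
        "embedding": LayerHParams(init_std=1.0, lr=base_lr),
        # Hidden matrices: ~1/sqrt(d) init, LR shrinks as 1/d.
        "hidden": LayerHParams(init_std=width ** -0.5, lr=base_lr / m),
        # Output head: ~1/d init; LR shrinks as 1/d in this formulation (assumption).
        "output": LayerHParams(init_std=1.0 / width, lr=base_lr / m),
    }


if __name__ == "__main__":
    for d in (512, 8192):
        print(d, mup_adam_hparams(width=d, base_width=512, base_lr=1e-3))
```

At the base width the multiplier is 1, so the tuned LR is used unchanged; at 8192 the hidden and output LRs are 16× smaller while the embedding LR is untouched.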
Emergent abilities — what scaling laws don't predict
Scaling laws give a smooth loss curve. But individual tasks are not smooth: some abilities emerge sharply at certain scales. Examples:
- 3-digit arithmetic. Near-zero accuracy at 6B params; roughly 80% or better at 175B in GPT-3's few-shot evaluation.
- Word unscrambling. Step function around 10B params.
- Chain-of-thought benefit. Helpful only above ~60B params.
The interpretation is debated. Schaeffer et al. (2023, "Are Emergent Abilities of Large Language Models a Mirage?") argued that "emergence" is partly an artifact of non-linear evaluation metrics: per-token loss is smooth, but exact-match accuracy ("got the whole answer right") is sharply non-linear in it. If each token is right with probability 1 − ε, a k-token answer is fully right with probability about (1 − ε)^k, which stays near zero until ε approaches ~1/k and then climbs quickly. Other emergent abilities seem more robust.
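The metric argument can be illustrated with a toy model: assume every token of a k-token answer is correct independently with the same probability (a simplifying assumption, not something the paper requires), and compare the smooth per-token error against exact-match accuracy.

```python
def exact_match_accuracy(per_token_error: float, answer_length: int) -> float:
    """Probability that all tokens are right, assuming independent per-token errors."""
    return (1.0 - per_token_error) ** answer_length


if __name__ == "__main__":
    k = 10  # hypothetical answer length in tokens
    for err in (0.5, 0.3, 0.2, 0.1, 0.05, 0.01):
        print(f"per-token error {err:4.2f} -> "
              f"exact match {exact_match_accuracy(err, k):.3f}")
```

The per-token error falls smoothly across the rows, while exact match stays close to zero for most of them and then rises steeply once the error is near or below 1/k.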
When scaling laws break
| Regime | What breaks | Why |
|---|---|---|
| Data-constrained | Adding compute by training longer fails | Repeating epochs of the same data has diminishing returns; eventually causes overfitting. |
| Low-resource languages | Fitted exponents and constants differ from the high-resource fit | Tokenizer and data distribution are biased against the language; per-token learning signal is weaker. |
| RL fine-tuning | Loss isn't the metric | RL maximises reward, not perplexity. Reward and loss can diverge (e.g., during RLHF the SFT loss often rises). |
| Reasoning tasks | Loss falls but task accuracy plateaus | Tasks that require multi-step reasoning need test-time compute (CoT, search) rather than more pretraining. |
| Tail of distribution | Loss on rare data is much higher | Average loss is dominated by common tokens; rare-token loss falls slower with scale. |
The interview probes
- "Derive the 6N rule." Per token, a transformer's forward FLOPs are 2N (each param is one MAC = 2 FLOPs, applied once). Backward is 2× forward, so 4N. Total: 6N FLOPs per training token, hence C ≈ 6ND over D tokens.
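- "What changed between Kaplan and Chinchilla?" Kaplan used one fixed LR schedule length regardless of how many tokens a run trained on, so intermediate losses overestimated what a properly scheduled shorter run would reach; Chinchilla matched the cosine schedule to each run's token budget. Chinchilla also explored a wider grid of (N, D). Kaplan's recommendation (bigger model, less data) was largely an artifact of the fixed schedule.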
- "Why is LLaMA-3 8B trained on 15T tokens — way more than Chinchilla-optimal?" Because Chinchilla-optimal is about training loss, not serving cost. An 8B model is much cheaper to serve than a 70B; overtraining it gets it as smart as it can be at 8B-size, which is what matters for production.
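- "You have 10²³ FLOPs. What model do you train?" Chinchilla: N = √(C/120) ≈ 29B, D = 20N ≈ 580B tokens. Or production: smaller (~6B), more tokens (~3T). On 1k H100s at 50% MFU (assuming ~989 TFLOP/s dense BF16 peak per GPU): ~2.3 days for the Chinchilla recipe, and about the same for the production one, since both spend roughly the same compute.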
- "What's μP and when does it matter?" A parameterisation under which the optimal LR doesn't change with model size. Matters when you can't afford an LR sweep at the target scale — you tune at small scale and transfer. The hard part is using μP throughout the codebase consistently; many "μP" implementations are partial and don't actually transfer.
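The budget question above reduces to two short calculations. A sketch under stated assumptions: the 20:1 heuristic for the split, and a per-GPU dense BF16 peak of about 989 TFLOP/s for an H100 (check the figure for your own hardware and precision). The names are illustrative.

```python
import math

H100_PEAK_FLOPS = 989e12  # assumed dense BF16 peak per GPU; verify against your hardware


def chinchilla_split(compute_flops: float, tokens_per_param: float = 20.0):
    """N = sqrt(C / (6 r)), D = r * N under C = 6 * N * D."""
    n = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n, tokens_per_param * n


def training_days(compute_flops: float, n_gpus: int, mfu: float,
                  peak_flops: float = H100_PEAK_FLOPS) -> float:
    """Wall-clock days at a given model FLOPs utilisation (MFU)."""
    sustained = n_gpus * peak_flops * mfu  # FLOP/s the cluster actually delivers
    return compute_flops / sustained / 86_400.0


if __name__ == "__main__":
    budget = 1e23
    n, d = chinchilla_split(budget)
    print(f"N ≈ {n / 1e9:.0f}B, D ≈ {d / 1e9:.0f}B tokens")
    print(f"{training_days(budget, n_gpus=1000, mfu=0.5):.1f} days on 1k GPUs at 50% MFU")
```

Wall-clock time depends only on the total budget, cluster size and MFU, so any (N, D) split with the same 6ND takes the same time in this model.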
Interview prompts you should be ready for
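- "Compute-optimal training: derive 20 tokens per parameter from L(N, D) = E + A/N^α + B/D^β." (Minimise L subject to C = 6ND. ∂L/∂N = -αA/N^(α+1), ∂L/∂D = -βB/D^(β+1). Substituting D = C/(6N) and setting dL/dN = 0 gives αA/N^α = βB/D^β, so N_opt = G · (C/6)^(β/(α+β)) and D_opt = (1/G) · (C/6)^(α/(α+β)) with G = (αA/(βB))^(1/(α+β)). Plugging in the published constants α=0.34, β=0.28, A=406, B=411 gives exponents ≈ 0.45 and 0.55 and D/N ≈ 90 at Chinchilla's compute, not 20. The 20:1 figure comes from the paper's Approaches 1 and 2; a good answer derives the condition and then flags the mismatch.)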
- "Your boss wants a faster model. Should you (a) reduce model size and increase tokens, or (b) reduce both?" (For inference speed, smaller is faster, since inference cost is ~2N FLOPs per token; the number of training tokens has no effect on serving speed. Option (a) protects quality by spending more training compute on the smaller model; option (b) also cuts training cost but gives up quality. The right question is how much quality you are willing to trade for per-token latency.)
- "You've trained a 7B model on 1T tokens. Loss is 1.85 on the validation set. Predict the loss of a 70B model on 10T tokens." (Chinchilla formula: L = 1.69 + 406/(70e9)^0.34 + 411/(10e12)^0.28 ≈ 1.69 + 0.083 + 0.094 ≈ 1.87. Note this is in Chinchilla's units, not directly comparable to your 1.85 unless you've calibrated the constants: the same formula gives ≈ 2.05 for 7B on 1T tokens, so what it really predicts is a drop of about 0.18 (~9%) for 100× more compute. The takeaway: 10× scale-up of both gives a small loss reduction, because exponents are small and returns are diminishing.)
- "When do you train a small model on too much data?" (When (a) inference cost matters more than training cost — production serving at scale; (b) you have abundant unique data — the 'too much' is actually a small fraction of available. The risk is repeating data: epochs > 1 on the same tokens give diminishing returns past ~4 epochs.)
- "What's data quality worth, scaling-law-wise?" (Roughly 2–4× the FLOPs, as a rule of thumb. Empirical observation: high-quality data (curated, deduplicated, instruction-aligned) delivers about the capability you would otherwise need 2–4× more compute to reach with random web crawl. This is why "data" is the moat: Anthropic / OpenAI / Mistral compete on data curation as much as architecture.)
- "Are emergent abilities real?" (Mixed. Some are evaluation artifacts (accuracy is a step function over per-token loss). Some are robust (chain-of-thought benefit, in-context learning's emergence). The careful interview answer: emergent abilities exist for some tasks under some metrics; calling them all 'emergent' is sloppy; the underlying loss curves are smooth.)