
Scaling laws — Kaplan, Chinchilla, μP

How loss falls with compute, data, and parameters — and which of those you should spend extra on. The interview's favourite back-of-envelope question and the right answer for "should I make the model 2× bigger or train it 2× longer".

The compute identity — derive 6N

Training compute (in FLOPs) for an autoregressive model:

FLOPs ≈ 6 · N · D

where N is total parameters and D is total training tokens. Why 6? A forward pass through a transformer is ~2N FLOPs per token (each parameter is touched once in a multiply-add, and matmul = 2 FLOPs per MAC). The backward pass is ~2× the forward (one matmul to get gradients w.r.t. inputs, one for gradients w.r.t. weights). Total: 2 (forward) + 4 (backward) = 6 FLOPs per parameter per token.
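As a quick sanity check on the identity, here is a minimal sketch that splits the count into its forward and backward parts. The function name is illustrative; the example plugs in Chinchilla's 70B parameters and 1.4T tokens.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute from the 6*N*D identity."""
    forward = 2 * n_params * n_tokens   # one multiply-add (2 FLOPs) per parameter per token
    backward = 4 * n_params * n_tokens  # roughly twice the forward pass
    return forward + backward


# Chinchilla: 70B parameters trained on 1.4T tokens.
chinchilla = training_flops(70e9, 1.4e12)
print(f"{chinchilla:.2e} FLOPs")  # about 5.9e23
```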

The senior's correction
6N·D is a good approximation that ignores: (a) the attention O(N_seq² · d) term, which matters at long context, (b) the embedding and LM head, which are non-trivial for small models. For most production LLMs in 2026 (8B–400B params) trained at contexts of a few thousand tokens, 6N·D is accurate to within roughly 10%. The attention term grows with context length, so for long-context (tens of thousands of tokens and up) or small (~1B) models, account for those terms explicitly.
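To get a feel for when the attention term stops being negligible, the sketch below compares the dense 2N term with a rough causal-attention cost of 2 · n_layer · n_ctx · d_model forward FLOPs per token. That attention estimate is an assumption (QK^T plus the weighted sum over values, averaged over positions), and the 32-layer, d_model = 4096 configuration is a hypothetical 8B-class model, not a specific release.

```python
def forward_flops_per_token(n_params, n_layer, d_model, n_ctx):
    """Return (dense, attention) forward FLOPs per token.

    dense: 2 FLOPs per parameter (the 2N term).
    attention: rough causal-attention estimate, 2 * n_layer * n_ctx * d_model.
    """
    dense = 2 * n_params
    attention = 2 * n_layer * n_ctx * d_model
    return dense, attention


def attention_share(n_params, n_layer, d_model, n_ctx):
    """Fraction of forward FLOPs spent in attention under the estimate above."""
    dense, attention = forward_flops_per_token(n_params, n_layer, d_model, n_ctx)
    return attention / (dense + attention)


# Hypothetical 8B-class config; compare a short and a long context.
for n_ctx in (4_096, 131_072):
    share = attention_share(8e9, 32, 4096, n_ctx)
    print(f"context {n_ctx:>7}: attention is {100 * share:.0f}% of forward FLOPs")
```

Under these assumptions the attention share is a few percent at 4k context and the majority of the forward pass at 128k, which is the regime where 6N·D alone undercounts badly.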

Kaplan et al. (2020) — the first scaling law

OpenAI's original result: loss follows a power law in each of parameters, data, and compute, with each law fitted when the other two quantities are not the bottleneck:

L(N) = (N_c / N)^α_N,   α_N ≈ 0.076
L(D) = (D_c / D)^α_D,   α_D ≈ 0.095
L(C) = (C_c / C)^α_C,   α_C ≈ 0.050

The takeaways at the time:

  1. Loss falls as a power law in each variable.
  2. The exponents are small (~0.05–0.1), so doubling compute reduces loss by only ~3% (2^-0.05 ≈ 0.966).
  3. Kaplan recommended: for fixed compute, prefer bigger model, less data. The compute-optimal allocation was roughly N ∝ C^{0.73}, D ∝ C^{0.27}.
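To make the "small exponents" point concrete, this short sketch evaluates how much a pure power law L ∝ X^(-α) falls when X doubles, using the three Kaplan exponents quoted above. The helper name is illustrative.

```python
def loss_ratio(scale_factor: float, alpha: float) -> float:
    """Ratio new_loss / old_loss when the variable grows by scale_factor."""
    return scale_factor ** (-alpha)


# Kaplan exponents for parameters (N), data (D), and compute (C).
for name, alpha in [("N", 0.076), ("D", 0.095), ("C", 0.050)]:
    drop = 1 - loss_ratio(2, alpha)
    print(f"doubling {name}: loss falls by {100 * drop:.1f}%")
```

Each doubling buys a single-digit percentage improvement, which is why scaling plans are stated in orders of magnitude rather than factors of two.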

Most large models trained 2020–2022 (GPT-3, OPT, BLOOM) followed Kaplan: huge models trained on relatively little data. It was wrong.

Chinchilla (Hoffmann et al. 2022) — the correction

DeepMind redid the experiments with a wider grid and got a different answer:

For compute-optimal training: N ∝ C^{0.5}, D ∝ C^{0.5}
Equivalently: D ≈ 20 · N   (20 training tokens per parameter)

So for fixed compute, you should split it ~equally between making the model bigger and feeding it more data. Chinchilla itself: 70B parameters, 1.4T tokens, exactly the 20:1 ratio. It outperformed Gopher (280B, 300B tokens) using 4× fewer parameters but ~4.7× more data, at roughly the same total compute.
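Combining C = 6ND with D = 20N gives C = 120N², so the allocation follows from a square root. A minimal sketch, with the tokens-per-parameter ratio left as a parameter so other recipes can be tried (the function name and the 200 tokens/param value are illustrative):

```python
import math


def allocate(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into (N, D) given a fixed D/N ratio and C = 6*N*D."""
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Recover Chinchilla's own configuration from its compute budget.
n, d = allocate(5.88e23)
print(f"N = {n / 1e9:.0f}B params, D = {d / 1e12:.1f}T tokens")

# Same budget with an illustrative overtraining ratio of 200 tokens/param.
n_over, d_over = allocate(5.88e23, tokens_per_param=200)
print(f"N = {n_over / 1e9:.0f}B params, D = {d_over / 1e12:.1f}T tokens")
```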

Why Kaplan was wrong
Kaplan used the same learning rate schedule length for every run, regardless of how many tokens that run was meant to see. Losses read off at shorter token counts therefore came from runs whose learning rate had not finished decaying, which overstates the loss achievable at that token count and biases the fit towards bigger models on less data. Chinchilla matched the cosine schedule length to each run's token budget and found a different optimum. The lesson: scaling experiments are extraordinarily sensitive to the implicit assumptions of the training recipe. Test the recipe before generalising.

The Chinchilla loss formula:

L(N, D) = E + A/N^α + B/D^β
α ≈ 0.34, β ≈ 0.28
E ≈ 1.69 (irreducible loss in their units)
A ≈ 406, B ≈ 411
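The formula is easy to evaluate directly. The sketch below uses the rounded constants quoted above and compares Chinchilla with Gopher; the outputs are in the paper's loss units and are only as accurate as the fitted constants.

```python
# Rounded constants from the parametric fit quoted above.
E, A, B = 1.69, 406.0, 411.0
ALPHA, BETA = 0.34, 0.28


def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


print(f"Chinchilla (70B, 1.4T): {chinchilla_loss(70e9, 1.4e12):.3f}")
print(f"Gopher (280B, 300B):    {chinchilla_loss(280e9, 300e9):.3f}")
```

The fit ranks the smaller, longer-trained model ahead, which matches the direction of the reported result.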

This gives you a 2D surface. Minimising L along the constraint C = 6ND (substitute D = C/(6N) and set dL/dN = 0) gives the optimal allocation, with the condition αA/N^α = βB/D^β. The 20:1 D/N ratio comes from Chinchilla's empirical Approaches 1 and 2 (training-curve envelopes at fixed model sizes, and IsoFLOP profiles) at their training compute (~5.8e23 FLOPs). The parametric Approach 3 (the formula above) actually predicts a D/N that drifts up with compute: at Chinchilla's scale it gives D/N ≈ 90, not 20. This is a known internal disagreement in the paper. Modern practice cites 20:1 as a heuristic; the formula itself implies "even more tokens" at higher compute.
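The closed-form optimum of the parametric fit can be checked numerically. Substituting D = C/(6N) into the loss and setting the derivative to zero gives N_opt = G · (C/6)^(β/(α+β)) and D_opt = (C/6)^(α/(α+β)) / G, with G = (αA / (βB))^(1/(α+β)). The sketch below evaluates this with the rounded constants from this lesson; the function name is illustrative.

```python
# Rounded constants from the parametric fit in this lesson.
A, B = 406.0, 411.0
ALPHA, BETA = 0.34, 0.28


def parametric_optimum(compute_flops: float):
    """Compute-optimal (N, D) implied by L = E + A/N^alpha + B/D^beta with C = 6*N*D."""
    g = (ALPHA * A / (BETA * B)) ** (1 / (ALPHA + BETA))
    k = compute_flops / 6                      # k = N * D
    n_opt = g * k ** (BETA / (ALPHA + BETA))
    d_opt = k ** (ALPHA / (ALPHA + BETA)) / g
    return n_opt, d_opt


for c in (1e21, 5.8e23, 1e25):
    n, d = parametric_optimum(c)
    print(f"C = {c:.1e}: N = {n / 1e9:.1f}B, D = {d / 1e12:.2f}T, D/N = {d / n:.0f}")
```

With these constants the ratio at ~5.8e23 FLOPs lands in the 90s and keeps rising with compute, which is the drift described above.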

Post-Chinchilla — the "overtrain" regime

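Chinchilla-optimal is the right answer for minimising training loss given a fixed training compute budget. But that's the wrong objective for production. Production wants the best quality at the lowest total cost, and total cost includes every token served after training: inference costs ~2N FLOPs per token for the life of the model, so a smaller N keeps paying off.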

So in 2026 the practical recipe is to overtrain: train a smaller model on much more data than Chinchilla-optimal. LLaMA-3 8B: trained on ~15T tokens (~1,875 tokens/param, ~94× over Chinchilla). The training loss is slightly higher than a 70B at Chinchilla optimal, but the model is ~9× cheaper to serve.
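The trade can be put in numbers by adding inference to the bill. The sketch below assumes ~2N FLOPs per served token and ignores everything else (memory bandwidth, batching, hardware differences); the served-token volume is a hypothetical input, and the function names are illustrative.

```python
def lifetime_flops(n_params, train_tokens, served_tokens):
    """Training (6*N*D) plus inference (2*N per served token) FLOPs."""
    return 6 * n_params * train_tokens + 2 * n_params * served_tokens


def break_even_served_tokens(small, large):
    """Served tokens at which the small model's lifetime cost matches the large one's.

    Each argument is (n_params, train_tokens). A negative result means the
    small model is already cheaper before serving a single token.
    """
    (n_s, d_s), (n_l, d_l) = small, large
    extra_training = 6 * n_s * d_s - 6 * n_l * d_l
    saving_per_token = 2 * n_l - 2 * n_s
    return extra_training / saving_per_token


overtrained_8b = (8e9, 15e12)      # 8B params on 15T tokens
chinchilla_70b = (70e9, 1.4e12)    # 70B params on 1.4T tokens

tokens = break_even_served_tokens(overtrained_8b, chinchilla_70b)
print(f"break-even after ~{tokens / 1e12:.1f}T served tokens")
```

Under these assumptions the extra training compute for the 8B model is repaid after roughly a trillion served tokens, and every token after that is cheaper.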

| Model | Params | Tokens | Tokens/Param | Recipe |
|---|---|---|---|---|
| GPT-3 | 175B | 300B | 1.7 | Kaplan (severely undertrained by Chinchilla standards) |
| Chinchilla | 70B | 1.4T | 20 | Compute-optimal |
| LLaMA-2 70B | 70B | 2T | 29 | Slightly overtrained |
| LLaMA-3 8B | 8B | 15T | 1875 | Massively overtrained (production-optimal) |
| LLaMA-3 70B | 70B | 15T | 214 | ~10× over Chinchilla |

μP — hyperparameter transfer across scales

Maximal Update Parametrisation (Yang et al. 2021–22). The problem: typically the optimal learning rate changes with model size, so to find LR for a 70B model you'd have to do a hyperparameter sweep at 70B scale (expensive). μP solves this by rescaling parameters and learning rates so that the optimal hyperparameters are the same across all scales.

The key rescalings:

| Component | Standard parameterization | μP |
|---|---|---|
| Input/embedding init | O(1) | O(1) |
| Hidden linear init | ~1/√d | ~1/√d (same) |
| Output / LM head init | ~1/√d | 1/d (smaller) |
| LR on hidden layers | constant in width | scales as 1/d |
| LR on output layer | constant | constant |

With these rescalings, the optimal LR at width d is identical to the optimal LR at width 2d. So you tune at d=512 (cheap) and use the same LR at d=8192 (expensive). This is μTransfer.
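As a rough illustration of how the table translates into per-layer settings, the sketch below computes init scales and learning rates for a target width from values tuned at a small base width. It follows the table in this lesson literally, with all unspecified constants set to 1; `base_width`, `base_lr`, and the dictionary keys are hypothetical names, and this is not the API of any μP library.

```python
import math


def mup_settings(width: int, base_width: int, base_lr: float) -> dict:
    """Per-layer init std and LR at `width`, transferring an LR tuned at `base_width`."""
    ratio = width / base_width
    return {
        "embedding_init_std": 1.0,                    # O(1), independent of width
        "hidden_init_std": 1.0 / math.sqrt(width),    # ~1/sqrt(d), as in standard init
        "output_init_std": 1.0 / width,               # 1/d, smaller than standard
        "hidden_lr": base_lr / ratio,                 # scales as 1/d relative to base
        "output_lr": base_lr,                         # constant in width
    }


tuned = mup_settings(width=512, base_width=512, base_lr=1e-2)
target = mup_settings(width=8192, base_width=512, base_lr=1e-2)
print(tuned["hidden_lr"], target["hidden_lr"])
```

The point of the sketch is the shape of the rule: one base learning rate is tuned once, and the per-layer values at the larger width are derived from it rather than swept again.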

Why this works (intuition)
Standard init keeps the magnitude of pre-activations stable in width at initialisation, but the effect of each gradient update on the activations still changes with width, so the best learning rate moves as the model grows. μP additionally scales initialisation and per-layer learning rates so that the size of each update's effect is stable in width. Because the training dynamics then look alike at every width, the optimal hyperparameters stay approximately the same. The math (Tensor Programs) is involved, but the recipe is simple to apply.

Emergent abilities — what scaling laws don't predict

Scaling laws give a smooth loss curve. But individual tasks are not smooth: some abilities emerge sharply at certain scales. Commonly cited examples include in-context learning and the benefit from chain-of-thought prompting.

The interpretation is debated. Schaeffer et al. (2023, "Are emergent abilities a mirage?") argued that "emergence" is partly an artifact of non-linear evaluation metrics: per-token loss is smooth, but accuracy ("got the whole answer right") is a step function. When the per-token error drops below ~1/sequence_length, the whole answer becomes correct. Other emergent abilities seem more robust.
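The metric argument can be reproduced in a few lines. Assuming each token is correct independently with probability p (a simplifying assumption), exact-match accuracy on an answer of length L is p^L, which turns a smooth improvement in p into a sharp jump.

```python
def exact_match(p_token: float, length: int) -> float:
    """Probability that all `length` tokens are correct, assuming independence."""
    return p_token ** length


# Smoothly improving per-token accuracy, scored on a 20-token answer.
for p in (0.80, 0.90, 0.95, 0.99, 0.999):
    print(f"per-token {p:.3f} -> exact match {exact_match(p, 20):.3f}")
```

Per-token accuracy moving from 0.90 to 0.99 looks modest, but exact match on a 20-token answer goes from roughly one in eight to roughly four in five.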

When scaling laws break

| Regime | What breaks | Why |
|---|---|---|
| Data-constrained | Adding compute by training longer fails | Repeating epochs of the same data has diminishing returns; eventually causes overfitting. |
| Low-resource languages | Loss curves are steeper / shallower | Tokenizer and data distribution are biased against the language; per-token learning signal is weaker. |
| RL fine-tuning | Loss isn't the metric | RL maximises reward, not perplexity. Reward and loss can diverge (e.g., during RLHF the SFT loss often rises). |
| Reasoning tasks | Loss falls but task accuracy plateaus | Tasks that require multi-step reasoning need test-time compute (CoT, search) rather than more pretraining. |
| Tail of distribution | Loss on rare data is much higher | Average loss is dominated by common tokens; rare-token loss falls slower with scale. |


The interview probes

  1. "Derive the 6N rule." Per token, a transformer's forward FLOPs are 2N (each param is one MAC = 2 FLOPs, applied once). Backward is 2× forward. Total: 6N FLOPs per training token.
  2. "What changed between Kaplan and Chinchilla?" Kaplan used one LR schedule length for all runs regardless of token count; Chinchilla matched the schedule to each run's token budget. Chinchilla also explored a wider grid of (N, D). Kaplan's recommendation (bigger model, less data) was largely an artifact of the fixed schedule.
  3. "Why is LLaMA-3 8B trained on 15T tokens — way more than Chinchilla-optimal?" Because Chinchilla-optimal is about training loss, not serving cost. An 8B model is much cheaper to serve than a 70B; overtraining it gets it as smart as it can be at 8B-size, which is what matters for production.
  4. "You have 10²³ FLOPs. What model do you train?" Chinchilla: N = √(C/120) ≈ 29B, D = 20N ≈ 580B tokens. Or production: smaller (~6B), more tokens (~3T). On 1k H100s at MFU 50% (taking ~989 TFLOP/s dense BF16 peak per GPU): ~2.3 days for the Chinchilla recipe.
  5. "What's μP and when does it matter?" A parameterisation under which the optimal LR doesn't change with model size. Matters when you can't afford an LR sweep at the target scale — you tune at small scale and transfer. The hard part is using μP throughout the codebase consistently; many "μP" implementations are partial and don't actually transfer.
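The wall-clock estimate in probe 4 is a one-line division once the sustained throughput is fixed. The sketch below assumes a dense BF16 peak of 989 TFLOP/s per H100 (the SXM datasheet figure) and 50% MFU; both are inputs, so other hardware or utilisation can be substituted. The function name is illustrative.

```python
SECONDS_PER_DAY = 86_400
H100_PEAK_FLOPS = 989e12  # assumed dense BF16 peak per GPU


def training_days(total_flops, n_gpus, peak_flops_per_gpu, mfu):
    """Wall-clock days to spend total_flops at the given utilisation."""
    sustained = n_gpus * peak_flops_per_gpu * mfu  # cluster FLOP/s actually achieved
    return total_flops / sustained / SECONDS_PER_DAY


days = training_days(1e23, n_gpus=1000, peak_flops_per_gpu=H100_PEAK_FLOPS, mfu=0.5)
print(f"{days:.1f} days")
```

Because the estimate depends only on total FLOPs, any (N, D) split that spends the same budget takes the same time under these assumptions.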

Interview prompts you should be ready for

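  1. "Compute-optimal training: derive 20 tokens per parameter from L(N, D) = E + A/N^α + B/D^β." (Lagrangian: minimise L subject to C = 6ND. ∂L/∂N = -αA/N^(α+1), ∂L/∂D = -βB/D^(β+1). Stationarity requires N · ∂L/∂N = D · ∂L/∂D, i.e. αA/N^α = βB/D^β, which with D = C/(6N) gives N_opt ∝ C^(β/(α+β)) and D_opt ∝ C^(α/(α+β)). Plugging in Chinchilla's values α=0.34, β=0.28, A=406, B=411 at C ≈ 5.8e23 gives D/N ≈ 90, not 20. The catch in the question: the 20:1 figure comes from the paper's empirical Approaches 1 and 2, so the careful answer derives the condition and then names the discrepancy.)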
  2. "Your boss wants a faster model. Should you (a) reduce model size and increase tokens, or (b) reduce both?" (Both options shrink N, and inference cost is ~2N FLOPs per token, so both make the model faster to serve. They differ in quality and training cost: (a) spends extra training tokens to recover some of the quality lost to the smaller N, while (b) cuts training time too but gives up more quality. The right question back is: how much quality can you trade for per-token latency?)
  3. "You've trained a 7B model on 1T tokens. Loss is 1.85 on the validation set. Predict the loss of a 70B model on 10T tokens." (Chinchilla formula: L = 1.69 + 406/(70e9)^0.34 + 411/(10e12)^0.28 ≈ 1.69 + 0.083 + 0.094 ≈ 1.87. The same formula gives ≈ 2.05 for 7B on 1T, so the predicted change is a drop of ~0.18, about 9%, for 100× the compute. Note these numbers are in Chinchilla's units, not directly comparable to your 1.85 unless you've calibrated the constants. The takeaway: a 10× scale-up of both gives a small loss reduction; exponents are small, returns are diminishing.)
  4. "When do you train a small model on too much data?" (When (a) inference cost matters more than training cost — production serving at scale; (b) you have abundant unique data — the 'too much' is actually a small fraction of available. The risk is repeating data: epochs > 1 on the same tokens give diminishing returns past ~4 epochs.)
  5. "What's data quality worth, scaling-law-wise?" (2–4× the FLOPs. Empirical observation: high-quality data (curated, deduplicated, instruction-aligned) trains a 2–4× larger model's worth of capability per FLOP than random web crawl. Hence why "data" is the moat — Anthropic / OpenAI / Mistral compete on data curation as much as architecture.)
  6. "Are emergent abilities real?" (Mixed. Some are evaluation artifacts (accuracy is a step function over per-token loss). Some are robust (chain-of-thought benefit, in-context learning's emergence). The careful interview answer: emergent abilities exist for some tasks under some metrics; calling them all 'emergent' is sloppy; the underlying loss curves are smooth.)
Takeaway
Compute = 6 · N · D. Chinchilla-optimal is ~20 tokens/param when you minimise training loss. Production overtrains by 10–100× because inference is forever. μP makes hyperparameter sweeps cheap by parameterising so the optimum doesn't move with size. Scaling laws are smooth in average loss but per-task accuracy can have step changes. The interview signal: do back-of-envelope estimates from 6N·D, articulate when Chinchilla-optimal is or isn't the right target, and name the limits.