SGD, momentum, Adam, AdamW, Lion, schedules
Every optimizer is a different bet about the loss curvature. Once you can re-derive Adam from "estimate the second moment and divide", you understand every modern variant in one diff.
The setup — what an "update rule" actually is
You have a parameter vector θ ∈ ℝ^P, a loss L(θ), and a stochastic gradient g_t = ∇_θ L(θ_t; x_t). An optimizer maps a history of (θ_s, g_s) to a parameter update:
θ_{t+1} = θ_t − η · u(g_1, …, g_t; θ_1, …, θ_t)
The function u(·) is the optimizer. The learning rate η is its scalar prefactor. Everything in this lesson is about choices for u.
The reason any of this is interesting: g_t is stochastic (we sampled a minibatch) and L is non-convex (deep nets). Plain gradient descent u = g_t is correct in expectation but has high variance, and it makes slow progress when curvature differs widely across directions (an ill-conditioned loss). Better optimizers are variance-reduction + preconditioning machines.
SGD — the floor
One scalar learning rate, no state. Cheap (no memory beyond the gradient), guaranteed to converge under mild conditions on convex losses, and almost always loses to its successors on deep nets. The reason: a single η must be small enough to not diverge in the most curved direction, and is therefore too small in the flat directions.
| Variant | Update | Why |
|---|---|---|
| SGD | θ ← θ − ηg | Baseline. Memory: 0 extra. |
| SGD + momentum (Polyak) | v ← βv + g ; θ ← θ − ηv | Exponentially weighted sum of past gradients smooths noise and adds inertia; for a steady gradient the effective step grows to η/(1 − β). Memory: 1× params (the velocity v). β typically 0.9. |
| Nesterov momentum | "Look-ahead" gradient at θ − ηβv, then update | Theoretically optimal among first-order methods on smooth convex losses. Practical gain on neural nets is small. |
Momentum is the simplest variance reduction: averaging gradients over a window kills noise. The cost is a lag of about 1/(1 − β) steps (10 steps at β = 0.9): the optimizer responds slowly to abrupt changes in the loss landscape, which is fine in the interior of training and annoying at warmup.
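The table rows for SGD and Polyak momentum can be sketched in a few lines. This is a minimal pure-Python illustration, not a library implementation: the function names, the toy quadratic and its curvatures (10 and 0.1), and the learning rate 0.15 are all assumptions chosen so that the single-η problem is visible.

```python
def sgd_step(theta, grad, lr):
    """Plain SGD: theta <- theta - lr * g. No optimizer state."""
    return [t - lr * g for t, g in zip(theta, grad)]


def momentum_step(theta, grad, velocity, lr, beta=0.9):
    """Polyak momentum: v <- beta * v + g ; theta <- theta - lr * v."""
    velocity = [beta * v + g for v, g in zip(velocity, grad)]
    theta = [t - lr * v for t, v in zip(theta, velocity)]
    return theta, velocity


def quadratic_grad(theta, curvatures=(10.0, 0.1)):
    """Gradient of the toy loss sum(0.5 * c_i * theta_i**2) (illustrative)."""
    return [c * t for c, t in zip(curvatures, theta)]


def run(steps=100, lr=0.15):
    """Run both optimizers from (1, 1) and return their final iterates."""
    theta_sgd = [1.0, 1.0]
    theta_mom, vel = [1.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        theta_sgd = sgd_step(theta_sgd, quadratic_grad(theta_sgd), lr)
        theta_mom, vel = momentum_step(
            theta_mom, quadratic_grad(theta_mom), vel, lr
        )
    return theta_sgd, theta_mom


if __name__ == "__main__":
    sgd, mom = run()
    print("SGD      final theta:", sgd)
    print("momentum final theta:", mom)
```

In this toy setup the learning rate is capped by the steep direction, so plain SGD shrinks the flat coordinate by only a factor of 0.985 per step; momentum gets the flat coordinate much closer to zero in the same number of steps.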
Adam — the per-parameter version
The key idea: track both the mean and the (uncentred) variance of the gradient, per parameter, and divide. Parameters with consistently large gradients get a small effective LR; parameters with consistently small gradients get a large effective LR. This is diagonal preconditioning.
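As a concrete reference for the symbols m̂ and v̂ used in this section, here is a minimal pure-Python sketch of one Adam step. It assumes the standard formulation (EMAs of g and g², bias correction by 1 − β^t, ε added outside the square root); the name `adam_step` and the list-of-floats representation are illustrative, not taken from any library.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on lists of floats; t is the 1-based step count."""
    # First moment: EMA of the gradient (the "mean").
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    # Second moment: EMA of the squared gradient (the uncentred variance).
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias correction: both EMAs start at zero and are biased low early on.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # Per-parameter step: divide the mean by the RMS.
    theta = [
        th - lr * mh / (math.sqrt(vh) + eps)
        for th, mh, vh in zip(theta, m_hat, v_hat)
    ]
    return theta, m, v


if __name__ == "__main__":
    # Two parameters whose gradients differ by four orders of magnitude.
    theta, m, v = adam_step([0.0, 0.0], [100.0, 0.01], [0.0, 0.0], [0.0, 0.0], t=1, lr=0.1)
    print(theta)
```

On the first step both parameters move by almost exactly the learning rate, even though their gradients differ by a factor of 10,000. That is the diagonal preconditioning at work.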
Three things to internalise:
- The effective LR per parameter is η / √v̂. If a parameter has a noisy gradient with RMS 1, its effective LR is η. If RMS 10, effective LR is η/10. Adam is doing automatic per-parameter LR scaling.
- The bias correction matters at warmup. At t = 1, m_1 = (1 − β₁) g_1 would be tiny without dividing by (1 − β₁). The correction term 1 − β_i^t tends to 1 as t grows, so it only matters in the first ~1/(1 − β) steps. Removing it gives ~10–30% slower initial progress.
- Adam is approximately scale-invariant in the gradient. If you multiply the loss by k, the gradient scales by k, but so does m̂ (linearly) and √v̂ (also linearly, since v scales by k²). The ratio m̂/√v̂ is invariant: Adam's update is the same regardless of loss scale (modulo ε). This is largely why gradient clipping changes little for Adam at steady state: Adam already normalises gradient scale per parameter. Clipping still helps against isolated outlier gradients that v̂ has not yet absorbed.
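The scale-invariance claim is easy to check numerically. The sketch below assumes standard bias-corrected Adam on a single scalar parameter; `adam_updates` is an illustrative helper and the gradient stream is made up.

```python
import math


def adam_updates(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the per-step Adam update (amount subtracted from theta)
    for a stream of scalar gradients."""
    m = v = 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        updates.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates


grads = [0.5, -0.2, 0.9, 0.4, -0.1, 0.7]             # made-up gradient stream
base = adam_updates(grads)
scaled = adam_updates([1000.0 * g for g in grads])   # same stream, loss multiplied by k = 1000

if __name__ == "__main__":
    for b, s in zip(base, scaled):
        print(f"{b:+.6e}  {s:+.6e}")
```

The two columns agree up to the small effect of ε, and the first update has magnitude lr (again up to ε) because bias correction makes m̂₁/√v̂₁ equal to sign(g₁).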
AdamW — the one-line fix that mattered
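Original Adam plus L2 regularisation adds the penalty's gradient λθ to the loss gradient before the moment estimates are computed, so the update looks like:
g_t ← ∇_θ L(θ_t) + λθ_t, then θ_{t+1} = θ_t − η · m̂_t / (√v̂_t + ε), with m̂_t and v̂_t built from this g_t.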
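Notice the bug: the L2 term λθ is now also being divided by √v̂. So parameters with large v̂ get less weight decay than parameters with small v̂, when the decay was meant to shrink every weight at the same relative rate. AdamW (Loshchilov & Hutter, 2017/2019) decouples the decay from the adaptive step:
θ_{t+1} = θ_t − η · (m̂_t / (√v̂_t + ε) + λθ_t), with m̂_t and v̂_t built from the loss gradient alone.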
Empirically: AdamW vs Adam-with-L2 is a 1–3% generalisation gap on most tasks. AdamW is the de facto LLM optimizer. The unsexy reason: with weight decay decoupled, the best λ depends much less on η, which turns a coupled 2-D search over (η, λ) into two nearly independent 1-D searches.
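The difference between the two variants is a single line, which a side-by-side sketch makes visible. The code assumes standard bias-corrected Adam on one scalar parameter; the function names and the toy values (θ = 1, λ = 0.01, η = 0.1, zero loss gradient) are illustrative.

```python
import math


def adam_direction(g, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Update the moment estimates for one scalar parameter.
    Returns (m_hat / (sqrt(v_hat) + eps), m, v)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v


def adam_l2_step(theta, g, m, v, t, lr, wd):
    """Adam + L2: wd * theta is added to the gradient, so it is preconditioned."""
    direction, m, v = adam_direction(g + wd * theta, m, v, t)
    return theta - lr * direction, m, v


def adamw_step(theta, g, m, v, t, lr, wd):
    """AdamW: wd * theta is applied directly, outside the preconditioner."""
    direction, m, v = adam_direction(g, m, v, t)
    return theta - lr * (direction + wd * theta), m, v


if __name__ == "__main__":
    # Zero loss gradient isolates the effect of the decay term.
    th_l2, _, _ = adam_l2_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1, wd=0.01)
    th_w, _, _ = adamw_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1, wd=0.01)
    print("Adam + L2:", th_l2)
    print("AdamW:    ", th_w)
```

With no loss gradient, the L2 variant normalises the decay term itself and moves θ by roughly η regardless of λ, while AdamW shrinks θ by η·λ·θ as intended.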
Lion, LAMB, Adafactor — what each is solving
| Optimizer | Update (roughly) | What it fixes | Trade-off |
|---|---|---|---|
| LAMB (You et al. 2019) | Adam direction · ‖θ_layer‖ / ‖update_layer‖ | Adam's per-parameter scaling can produce updates much larger than the parameter; LAMB rescales per-layer so the relative step is bounded. | Lets you push batch size to 32k–64k (BERT-large). At smaller batch, no benefit over AdamW. |
| Adafactor (Shazeer 2018) | Adam-like, but stores only row and column means of v, not the full matrix. | Adam's v for a (d × m) weight matrix is O(d·m); Adafactor stores O(d + m). | ~50% optimizer memory. Slightly worse convergence than AdamW on dense models; widely used for very large MoE because optimizer memory matters. |
| Lion (Chen et al. 2023) | θ ← θ − η · sign(β₁ m_{t−1} + (1 − β₁) g_t), m ← β₂ m_{t−1} + (1 − β₂) g_t | Drops the second moment entirely; uses sign of momentum. Halves optimizer memory vs Adam. | Requires smaller LR (~3× smaller than AdamW) and larger weight decay. Empirically competitive at scale; not yet displaced AdamW. |
| Shampoo | Approximate full-matrix preconditioner per layer using Kronecker factors of G^⊤G and GG^⊤. | Adam's diagonal preconditioner ignores correlations between parameters; Shampoo captures them within a layer. | Genuine accuracy gains on some benchmarks; expensive (matrix root/inverse every few steps). Niche. |
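Of the optimizers in the table, Lion is the shortest to write down. The sketch below follows the update in the Lion row; the defaults β₁ = 0.9 and β₂ = 0.99 and the decoupled weight-decay term `wd` are assumptions that go beyond what the row states, and `lion_step` is an illustrative name.

```python
def sign(x):
    """Return -1.0, 0.0 or 1.0."""
    return float((x > 0) - (x < 0))


def lion_step(theta, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion update on lists of floats. The momentum m is the only state."""
    new_theta, new_m = [], []
    for th, g, mi in zip(theta, grad, m):
        # Interpolate momentum and fresh gradient, then keep only the sign.
        update = sign(beta1 * mi + (1 - beta1) * g)
        # Decoupled weight decay (assumed here, in the style of AdamW).
        new_theta.append(th - lr * (update + wd * th))
        # Momentum is updated with a different coefficient than the one used above.
        new_m.append(beta2 * mi + (1 - beta2) * g)
    return new_theta, new_m


if __name__ == "__main__":
    theta, m = lion_step([1.0, -1.0], [100.0, -0.001], [0.0, 0.0], lr=0.01)
    print(theta, m)
```

Every coordinate moves by exactly lr per step (before weight decay), whatever the gradient magnitude, which is why Lion is usually run with a smaller learning rate than AdamW.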
Memory cost — the number every interviewer expects
Let P = number of parameters. Bytes of optimizer state per parameter:
| Optimizer | Optimizer state, fp32 (bytes/param) | Total: bf16 params + bf16 grads + state (bytes/param) |
|---|---|---|
| SGD | 0 | params(2) + grads(2) = 4 bytes/param (bf16) |
| SGD+momentum | 4 (fp32 v) | 4 + 4 = 8 bytes/param |
| AdamW | 4 (m) + 4 (v) + 4 (master fp32 θ) = 12 | 4 + 12 = 16 bytes/param |
| Adafactor | ~4 (per-row + per-column factored) | ~8 bytes/param |
| Lion | 4 (momentum) + 4 (master) = 8 | 4 + 8 = 12 bytes/param |
The 16 bytes/param figure is the load-bearing number for distributed-training memory planning. A 70B model in AdamW needs 70e9 × 16 = 1.12 TB for params, grads and optimizer state combined, of which 70e9 × 12 = 840 GB is optimizer state. This is why ZeRO-1 (shards optimizer state across N ranks) is a baseline efficiency win even before sharding params.
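The arithmetic is simple enough to put in a helper. This sketch hard-codes the bytes-per-parameter figures from the table above (bf16 params and grads, fp32 optimizer state); the dictionary keys and function name are illustrative, and Adafactor is left out because its figure is approximate.

```python
# (bytes for bf16 params + bf16 grads, bytes of fp32 optimizer state) per parameter,
# copied from the table above.
BYTES_PER_PARAM = {
    "sgd": (4, 0),
    "sgd_momentum": (4, 4),
    "adamw": (4, 12),
    "lion": (4, 8),
}


def training_memory_gb(n_params, optimizer="adamw"):
    """Return a GB breakdown (1 GB = 1e9 bytes) of params+grads and optimizer state."""
    weights_and_grads, state = BYTES_PER_PARAM[optimizer]
    billions = n_params / 1e9
    return {
        "params_and_grads": weights_and_grads * billions,
        "optimizer_state": state * billions,
        "total": (weights_and_grads + state) * billions,
    }


if __name__ == "__main__":
    print(training_memory_gb(70e9, "adamw"))
    print(training_memory_gb(70e9, "lion"))
```

For a 70B model this gives 280 GB of params and grads plus 840 GB of AdamW state, 1120 GB in total, before any activations.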
Schedules — the second hyperparameter that matters as much as LR
The learning rate η is rarely constant. The standard schedule is warmup → main schedule → decay. Each piece does something different:
| Phase | Purpose | Typical recipe |
|---|---|---|
| Warmup | In the first steps, Adam's v̂ is estimated from only a handful of gradients, so the denominator is unreliable and some coordinates take far too large a step. Warmup keeps η small until v̂ stabilises. | Linear ramp from 0 to peak LR over 1–10% of training steps. For LLMs: 2k steps is the standard starter. |
| Main | Most of the work. Higher LR → faster progress. | Constant, cosine (decays smoothly), or WSD (stay at peak, then decay). WSD is more checkpoint-friendly because you can resume the "stable" phase indefinitely. |
| Decay | Lower LR refines the solution. Empirically: most of the final accuracy comes from the last ~10% of training, during decay. | Cosine to η/10 or η/100. For WSD: linear decay over last 10–20%. |
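The two most common shapes in the table can be written as pure functions of the step index. This is a sketch under assumptions: linear warmup from 0, cosine decay to a fraction of peak, and for WSD a linear decay over the final fraction of steps. The function and parameter names (`warmup_cosine`, `wsd`, `min_lr_ratio`, `decay_fraction`) are illustrative, not from a specific library.

```python
import math


def warmup_cosine(step, total_steps, peak_lr, warmup_steps, min_lr_ratio=0.1):
    """Linear warmup from 0 to peak_lr, then cosine decay to peak_lr * min_lr_ratio."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    min_lr = peak_lr * min_lr_ratio
    return min_lr + (peak_lr - min_lr) * cosine


def wsd(step, total_steps, peak_lr, warmup_steps, decay_fraction=0.1, min_lr_ratio=0.1):
    """Warmup-stable-decay: linear warmup, constant plateau at peak_lr,
    then linear decay over the last decay_fraction of training."""
    decay_start = int(total_steps * (1.0 - decay_fraction))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < decay_start:
        return peak_lr
    progress = min(1.0, (step - decay_start) / max(1, total_steps - decay_start))
    return peak_lr * (1.0 - (1.0 - min_lr_ratio) * progress)


if __name__ == "__main__":
    for s in (0, 1000, 2000, 50_000, 95_000, 100_000):
        print(s, warmup_cosine(s, 100_000, 3e-4, 2000), wsd(s, 100_000, 3e-4, 2000))
```

The WSD plateau does not depend on `total_steps`, which is what makes it checkpoint-friendly: a run can be extended by moving `decay_start` later without changing any LR already used.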
Interactive · feel preconditioning
(Interactive demo, not reproduced here.) The demo uses an ill-conditioned 2D quadratic loss L(θ₁, θ₂) = a θ₁² + b θ₂² with a ≫ b. On it, SGD oscillates, momentum smooths the oscillation, and Adam rescales one direction more than the other.
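The same loss can be explored offline with a few lines of code. The sketch below is a stand-in for the demo, not a reproduction of it: the curvatures a = 50 and b = 0.5, the start point (1, 1), and the learning rates are assumptions chosen so that SGD sits just inside its stability limit.

```python
import math

A, B = 50.0, 0.5  # illustrative curvatures with a >> b (condition number 100)


def grad(theta):
    """Gradient of L(theta1, theta2) = A * theta1**2 + B * theta2**2."""
    return [2 * A * theta[0], 2 * B * theta[1]]


def simulate(optimizer, steps, lr, beta=0.9, beta2=0.999, eps=1e-8):
    """Return the list of iterates, starting from (1, 1)."""
    theta = [1.0, 1.0]
    m = [0.0, 0.0]  # velocity (momentum) or first moment (Adam)
    v = [0.0, 0.0]  # second moment (Adam only)
    path = [theta]
    for t in range(1, steps + 1):
        g = grad(theta)
        if optimizer == "sgd":
            step = g
        elif optimizer == "momentum":
            m = [beta * mi + gi for mi, gi in zip(m, g)]
            step = m
        elif optimizer == "adam":
            m = [beta * mi + (1 - beta) * gi for mi, gi in zip(m, g)]
            v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
            step = [
                (mi / (1 - beta ** t)) / (math.sqrt(vi / (1 - beta2 ** t)) + eps)
                for mi, vi in zip(m, v)
            ]
        else:
            raise ValueError(f"unknown optimizer: {optimizer}")
        theta = [th - lr * s for th, s in zip(theta, step)]
        path.append(theta)
    return path


if __name__ == "__main__":
    for name, lr in [("sgd", 0.018), ("momentum", 0.018), ("adam", 0.05)]:
        first = simulate(name, 1, lr)[1]
        print(f"{name:9s} after 1 step: theta1={first[0]:+.3f}, theta2={first[1]:+.3f}")
```

After one step SGD has overshot θ₁ to about −0.8 while θ₂ has barely moved (0.982); Adam moves both coordinates by about its learning rate. Printing more of `path` shows the oscillation along θ₁ for SGD and how momentum changes it.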
The peculiar facts every interviewer probes
- Adam's update is approximately η · sign(g) in steady state. Because m̂ ≈ E[g] and √v̂ ≈ RMS(g), the ratio m̂/√v̂ has magnitude at most about 1, so each coordinate moves by up to η per step in the direction of the gradient, regardless of the gradient's magnitude. Hence LRs are 10–100× smaller for Adam than SGD.
- The optimal LR for Adam on transformers is often quoted as ~3e-4 and moves much less with model size than SGD's. This is suspicious and important: it does still drift as width grows, which is why μP exists (later lesson). Roughly, Adam's per-parameter scaling absorbs most of the model-size dependence.
- Gradient clipping interacts differently with each optimizer. For SGD: clip → linearly cap step size. For Adam: barely matters at steady state (Adam already normalises). For Lion: matters more (the step has fixed magnitude, so a single outlier gradient that dominates the momentum can flip the sign of the update in many coordinates at once).
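- Adam's ε is not just a numerical hack. It sets a floor on the denominator: once √v̂ falls below ε, the update is η · m̂/ε rather than a normalised step. Typical ε = 1e-8. For mixed precision, some recipes use ε = 1e-6, because very small v̂ values are not represented reliably in low precision (in fp16, 1e-8 itself rounds to zero).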
- β₂ controls how fast the optimizer adapts. Standard β₂ = 0.999, effective window ~1000 steps. For RL training with non-stationary rewards, lower β₂ = 0.95 so the optimizer doesn't carry stale variance estimates from old policies.
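The "effective window" of an EMA can be made concrete. Assuming a long-running EMA with decay β (bias correction ignored), the weight on a sample k steps old is (1 − β)·β^k, so the share of the estimate owed to samples older than n steps is β^n. The helper names below are illustrative.

```python
def ema_weight(beta, age):
    """Weight an EMA with decay beta gives to a sample `age` steps old."""
    return (1 - beta) * beta ** age


def remaining_influence(beta, steps):
    """Fraction of the current estimate owed to samples older than `steps`
    (the tail sum of ema_weight from `steps` to infinity)."""
    return beta ** steps


if __name__ == "__main__":
    for beta, window in [(0.999, 1000), (0.95, 20), (0.9, 10)]:
        share = remaining_influence(beta, window)
        print(f"beta={beta}: {share:.3f} of the estimate is older than {window} steps")
```

At β₂ = 0.999 roughly 37% of v̂ still comes from gradients more than 1000 steps old; at β₂ = 0.95 a similar share (roughly 36%) is reached after only 20 steps, which is the faster forgetting that non-stationary settings want.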
The senior question: when does Adam fail?
Three failure modes that distinguish senior candidates:
- Adam can fail to converge on convex problems. Reddi et al. (2018, "On the Convergence of Adam and Beyond") gave a 1D counterexample: a periodically large gradient is downweighted in v̂ over time, so the frequent small gradients in between dominate and Adam keeps stepping in the wrong direction. Fix: AMSGrad (keep a running maximum, v̂_t = max(v̂_{t−1}, v_t), so the denominator never shrinks). Almost never used in practice; the failure is rare in deep learning.
- Adam can overfit through its preconditioner. On small datasets, Adam's per-parameter LR scaling can drive it to overfit specifically along the directions where the loss landscape is anisotropic. SGD's uniform step is empirically a better regulariser; this is why SGD+momentum sometimes generalises better despite slower training (Wilson et al. 2017, "The Marginal Value of Adaptive Gradient Methods in Machine Learning").
- Adam responds to batch size differently from SGD. Doubling the batch should ~halve the gradient noise variance. Adam's v̂ auto-adjusts to the new gradient scale, so its LR scale barely changes. SGD has no such normalisation, so you need to double the LR (linear scaling). This is why hyperparameter sensitivity profiles look so different.
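The AMSGrad fix from the first failure mode is a one-line change to Adam: the denominator uses a running maximum of the second moment instead of its current value. The sketch below follows the simple form without bias correction; the function name, the scalar-parameter setup, and the toy gradient stream are illustrative assumptions.

```python
import math


def amsgrad_step(theta, g, m, v, v_max, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update for a scalar parameter (no bias correction in this sketch)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    # The only difference from Adam: the denominator can never decrease.
    v_max = max(v_max, v)
    theta = theta - lr * m / (math.sqrt(v_max) + eps)
    return theta, m, v, v_max


def run_stream(grads, beta2=0.9):
    """Feed a stream of scalar gradients; return the final (v, v_max)."""
    theta, m, v, v_max = 0.0, 0.0, 0.0, 0.0
    for g in grads:
        theta, m, v, v_max = amsgrad_step(theta, g, m, v, v_max, beta2=beta2)
    return v, v_max


if __name__ == "__main__":
    # One large gradient followed by many small ones.
    v, v_max = run_stream([10.0] + [0.1] * 50)
    print("Adam's v has decayed to:", v)
    print("AMSGrad's v_max is still:", v_max)
```

After the large gradient, plain Adam's v decays back towards the scale of the small gradients, while `v_max` keeps the memory of the large one, so later steps stay conservatively small.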
Interview prompts you should be ready for
- "Why is Adam's LR ~3e-4 for transformers but SGD's is 0.1 for ResNets?" (Adam's update is approximately η · sign(g) in magnitude; SGD's is η · g. For typical gradient magnitudes |g| ~ 1e-2, SGD's step is 0.1 × 1e-2 = 1e-3 per coordinate. You want a similar step size from Adam, so Adam's η is smaller by roughly that factor of |g|.)
- "Walk me through the memory cost of training a 13B model in bf16 + AdamW." (Params bf16: 26 GB. Grads bf16: 26 GB. Adam m,v (each fp32): 52 + 52 = 104 GB. fp32 master copy of weights: 52 GB. Total: ~208 GB just for state, before activations. Doesn't fit on H100 80GB — need ZeRO/FSDP. With ZeRO-2 across 8 ranks (shards optimizer state + grads): per-rank state ≈ (26 params unsharded) + (26/8 grads) + (156/8 opt) ≈ 49 GB → fits.)
- "Your loss spikes ~5000 steps in. Walk me through three causes." ((1) Bad batch: log per-batch loss to find outliers. (2) Adam's v̂ has decayed to near-zero in some dimension and now responds violently to a fresh gradient. (3) LR schedule misconfigured: for example, a cosine schedule with restarts has just jumped back to its peak. Mitigations: gradient clipping ‖g‖ ≤ 1, β₂ → 0.95 for faster adaptation, lower peak LR.)
- "Why does AdamW separate weight decay from L2?" (L2 enters as a gradient term and is then divided by √v̂; weight decay should hit all params equally. Decoupling restores the intended uniform shrinkage. Practical gain: ~1–3% generalisation; the bigger gain is decoupled tuning of η and λ.)
- "What's the difference between Lion and Adam at the algorithmic level?" (Lion uses only first-moment EMA, then sign. Adam uses first- and second-moment EMAs, then divides. Lion uses ~half the memory and is sign-descent in disguise. Empirically: competitive but more sensitive to LR.)
- "You're training on TPU and Adam's optimizer state doesn't fit. What do you do?" ((1) Adafactor: factored 2nd moment, ~50% memory. (2) ZeRO/FSDP sharding. (3) Lower-precision optimizer state (8-bit Adam, bitsandbytes). (4) CPU offload of optimizer state (with overlap). Order: try sharding first, then Adafactor, then 8-bit.)
- "Why does warmup matter for Adam but not SGD?" (SGD's update η · g is bounded by the initial gradient size. Adam's update η · m̂/√v̂ has magnitude ~η in every coordinate from the first step, while v̂ still rests on a handful of noisy gradients, so early steps at peak LR can be far too large. Warmup lets v̂ stabilise before reaching peak LR. Alternative: lower β₂.)
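The 13B memory walk-through in the prompts above can be checked with a short helper. The sketch assumes bf16 params and grads (2 bytes each) plus 12 bytes/param of fp32 AdamW state (m, v, master weights), and the usual ZeRO staging: stage 1 shards optimizer state, stage 2 also shards grads, stage 3 also shards params. The function name is illustrative.

```python
def per_rank_state_gb(n_params, n_ranks=1, zero_stage=0):
    """Per-rank memory in GB (1 GB = 1e9 bytes) for params, grads and AdamW state.
    Activations and temporary buffers are not included."""
    params = 2 * n_params / 1e9   # bf16 weights
    grads = 2 * n_params / 1e9    # bf16 gradients
    opt = 12 * n_params / 1e9     # fp32 m + fp32 v + fp32 master weights
    if zero_stage >= 1:
        opt /= n_ranks
    if zero_stage >= 2:
        grads /= n_ranks
    if zero_stage >= 3:
        params /= n_ranks
    return params + grads + opt


if __name__ == "__main__":
    print("unsharded:       ", per_rank_state_gb(13e9))
    print("ZeRO-2, 8 ranks: ", per_rank_state_gb(13e9, n_ranks=8, zero_stage=2))
    print("ZeRO-3, 8 ranks: ", per_rank_state_gb(13e9, n_ranks=8, zero_stage=3))
```

For 13B parameters this reproduces the 208 GB unsharded total and 48.75 GB per rank under ZeRO-2 on 8 ranks, matching the ≈ 49 GB figure in the answer.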