SGD, momentum, Adam, AdamW, Lion, schedules

Every optimizer is a different bet about the loss curvature. Once you can re-derive Adam from "estimate the second moment and divide", you understand every modern variant in one diff.

The setup — what an "update rule" actually is

You have a parameter vector θ ∈ ℝ^P, a loss L(θ), and a stochastic gradient g_t = ∇_θ L(θ_t; x_t). An optimizer maps a history of (θ_s, g_s) to a parameter update:

θ_{t+1} = θ_t − η · u(θ_{0..t}, g_{0..t})

The function u(·) is the optimizer. The learning rate η is its scalar prefactor. Everything in this lesson is about choices for u.

The reason any of this is interesting: g_t is stochastic (we sampled a minibatch) and L is non-convex (deep nets). Plain gradient descent u = g_t is correct in expectation but has high variance, and it makes slow progress when curvature differs widely across directions (an ill-conditioned loss). Better optimizers are variance-reduction + preconditioning machines.

SGD — the floor

θ_{t+1} = θ_t − η · g_t

One scalar learning rate, no state. Cheap (no memory beyond the gradient), guaranteed to converge under mild conditions on convex losses, and almost always loses to its successors on deep nets. The reason: a single η must be small enough to not diverge in the most curved direction, and is therefore too small in the flat directions.

Variant	Update	Why
SGD	θ ← θ − ηg	Baseline. Memory: 0 extra.
SGD + momentum (Polyak)	v ← βv + g ; θ ← θ − ηv	Exponential moving average of gradients smooths noise and adds inertia. Memory: 1× params (the velocity v). β typically 0.9.
Nesterov momentum	"Look-ahead" gradient at θ − ηβv, then update	Theoretically optimal among first-order methods on smooth convex losses. Practical gain on neural nets is small.

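Momentum is the simplest variance reduction: averaging gradients over a window kills noise. The cost is a lag of roughly 1/(1 − β) steps: the optimizer responds slowly to abrupt loss-landscape changes, which is fine in the interior of training and annoying at warmup. Note that with the update written as v ← βv + g (no (1 − β) factor), v settles at g/(1 − β) for a steady gradient, so the effective learning rate is η/(1 − β), i.e. 10η at β = 0.9.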
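A minimal sketch of the variance-reduction claim, in plain Python. The noisy gradient stream (true gradient 1.0 plus unit Gaussian noise), the seed, and the helper name momentum_step are illustrative assumptions, not part of any library.

```python
import random
import statistics

def momentum_step(theta, v, g, lr, beta=0.9):
    """One Polyak-momentum step: v <- beta*v + g ; theta <- theta - lr*v."""
    v = beta * v + g
    return theta - lr * v, v

random.seed(0)
beta = 0.9
theta, v = 0.0, 0.0
raw, smoothed = [], []
for _ in range(5000):
    g = 1.0 + random.gauss(0.0, 1.0)        # true gradient 1.0 plus unit noise
    theta, v = momentum_step(theta, v, g, lr=0.0, beta=beta)  # lr=0: only track v
    raw.append(g)
    smoothed.append((1 - beta) * v)         # rescale v so it is comparable to g

noise_raw = statistics.pstdev(raw)
noise_smoothed = statistics.pstdev(smoothed[100:])  # skip the start-up transient
print(f"std of raw gradient:       {noise_raw:.3f}")
print(f"std of rescaled velocity:  {noise_smoothed:.3f}")
```

The rescaled velocity should come out several times less noisy than the raw gradient while keeping the same mean, which is the averaging effect described above.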

SGD's last stand

In 2026, SGD-momentum still wins for ResNets on ImageNet and for some small-batch convex problems. It is also worth trying in RL, where Adam's per-parameter scaling can interact badly with bootstrapped targets, although the reference implementations of TD3 and SAC use Adam by default. Don't dismiss it. The interview tell is being able to name a regime where SGD > Adam.

Adam — the per-parameter version

The key idea: track both the mean and the (uncentred) variance of the gradient, per parameter, and divide. Parameters with consistently large gradients get a small effective LR; parameters with consistently small gradients get a large effective LR. This is diagonal preconditioning.

m_t = β₁ m_{t−1} + (1 − β₁) g_t        (first moment: EMA of gradient)
v_t = β₂ v_{t−1} + (1 − β₂) g_t²       (second moment: EMA of squared gradient)
m̂_t = m_t / (1 − β₁^t)                 (bias correction)
v̂_t = v_t / (1 − β₂^t)                 (bias correction)
θ_{t+1} = θ_t − η · m̂_t / (√v̂_t + ε)

Three things to internalise:

The effective LR per parameter is η / √v̂. If a parameter has a noisy gradient with RMS 1, its effective LR is η. If RMS 10, effective LR is η/10. Adam is doing automatic per-parameter LR scaling.
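The bias correction matters at warmup. Both EMAs start at zero, so at t = 1, m_1 = (1 − β₁) g_1 and v_1 = (1 − β₂) g_1² are far smaller than the quantities they estimate. Dividing by 1 − β_i^t undoes this; the correction tends to 1 as t grows, so it only matters in the first ~1/(1 − β_i) steps. Because β₂ is closer to 1 than β₁, v is shrunk more than m, and the uncorrected ratio m/√v is too large early on: about 0.1/√0.001 ≈ 3.2× at t = 1 with the usual β₁ = 0.9, β₂ = 0.999.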
Adam is approximately scale-invariant in the gradient. If you multiply the loss by k, the gradient scales by k, but so does m̂ (linearly) and √v̂ (also linearly, since v scales by k²). The ratio m̂/√v̂ is invariant, so Adam's update is the same regardless of loss scale (modulo ε). This is why clipping that merely rescales the gradient by a steady factor does almost nothing to Adam at steady state: Adam already normalises gradient scale per parameter. Clipping still earns its keep on isolated spikes, which would otherwise inflate v̂ for the next ~1/(1 − β₂) steps.
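The update rule and the scale-invariance claim can be checked with a short scalar sketch. The function name adam_step, the toy loss k · θ², and the hyperparameter values are assumptions chosen for illustration.

```python
import math

def adam_step(theta, m, v, g, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step for a scalar parameter (t counts from 1)."""
    m = b1 * m + (1 - b1) * g              # first moment: EMA of gradient
    v = b2 * v + (1 - b2) * g * g          # second moment: EMA of squared gradient
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

def run(loss_scale, steps=200):
    """Minimise loss_scale * theta^2 from theta = 5 and return the final theta."""
    theta, m, v = 5.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = loss_scale * 2.0 * theta       # gradient of loss_scale * theta^2
        theta, m, v = adam_step(theta, m, v, g, t)
    return theta

theta_1x = run(1.0)
theta_1000x = run(1000.0)
print(theta_1x, theta_1000x)               # the two trajectories should agree closely
```

Multiplying the loss by 1000 leaves the trajectory essentially unchanged; the only difference comes from ε, which is negligible at these gradient magnitudes.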

The interview soundbite

Adam's effective step in steady state is approximately η · sign(g) when the gradient is on the order of its EMA std, so it acts like per-parameter sign descent. This is why Adam's LR is set 10–100× smaller than SGD's: for the small gradients typical of deep nets (|g| ≪ 1), η · sign(g) is a much larger raw step than η · g.

AdamW — the one-line fix that mattered

Original Adam plus L2 regularisation looks like:

g_t = ∇L(θ_t) + λ θ_t                  (L2 absorbed into gradient)
m_t = β₁ m_{t−1} + (1 − β₁) g_t
v_t = β₂ v_{t−1} + (1 − β₂) g_t²
θ_{t+1} = θ_t − η · m̂_t / (√v̂_t + ε)

Notice the bug: the L2 term λθ is now also being divided by √v̂. So parameters with large v̂ get less weight decay than parameters with small v̂ — exactly the opposite of what you want. AdamW (Loshchilov & Hutter, 2017/2019) decouples:

θ_{t+1} = θ_t − η · ( m̂_t / (√v̂_t + ε) + λ θ_t ) — weight decay applied directly to params

Empirically: AdamW vs Adam-with-L2 is a 1–3% generalisation gap on most tasks. AdamW is the de facto LLM optimizer. The unsexy reason: with weight decay decoupled, the best λ depends much less on η, which makes hyperparameter tuning closer to two 1-D searches than one joint 2-D search. (In the form above the shrinkage per step is still η · λ, so the two are not fully independent.)
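A rough way to see the difference between the two forms is to train a single weight whose loss gradient carries no net signal, so that any systematic shrinkage comes from the decay term alone. The function train, the alternating ±grad_amp gradient, and all hyperparameter values are assumptions made for this sketch.

```python
import math

def train(decoupled, grad_amp, steps=1000, lr=1e-3, wd=0.1,
          b1=0.9, b2=0.999, eps=1e-8):
    """Return the final value of one weight that starts at 1.0.

    The loss gradient alternates between +grad_amp and -grad_amp, so it
    averages to zero; systematic shrinkage comes from the decay term.
    """
    theta, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_amp if t % 2 else -grad_amp
        if not decoupled:
            g += wd * theta                  # Adam + L2: decay enters the gradient
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        update = (m / (1 - b1 ** t)) / (math.sqrt(v / (1 - b2 ** t)) + eps)
        if decoupled:
            update += wd * theta             # AdamW: decay bypasses the preconditioner
        theta -= lr * update
    return theta

l2_small, l2_large = train(False, 0.01), train(False, 10.0)
adamw_small, adamw_large = train(True, 0.01), train(True, 10.0)
print("Adam + L2:", l2_small, l2_large)      # decay depends on gradient scale
print("AdamW:    ", adamw_small, adamw_large)  # decay is the same for both
```

With L2 in the gradient, the weight with large gradients is barely decayed while the weight with small gradients is shrunk hard; with decoupled decay both weights shrink by the same factor.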

Lion, LAMB, Adafactor — what each is solving

Optimizer	Update (roughly)	What it fixes	Trade-off
LAMB (You et al. 2019)	Adam direction · ‖θ_layer‖ / ‖update_layer‖	Adam's per-parameter scaling can produce updates much larger than the parameter; LAMB rescales per-layer so the relative step is bounded.	Lets you push batch size to 32k–64k (BERT-large). At smaller batch, no benefit over AdamW.
Adafactor (Shazeer 2018)	Adam-like, but stores only row and column means of v, not the full matrix.	Adam's v for a (d × m) weight matrix is O(d·m); Adafactor stores O(d + m).	~50% optimizer memory. Slightly worse convergence than AdamW on dense models; widely used for very large MoE because optimizer memory matters.
Lion (Chen et al. 2023)	θ ← θ − η · sign(β₁ m_{t−1} + (1 − β₁) g_t), m ← β₂ m_{t−1} + (1 − β₂) g_t	Drops the second moment entirely; uses sign of momentum. Halves optimizer memory vs Adam.	Requires smaller LR (~3–10× smaller than AdamW) and correspondingly larger weight decay. Empirically competitive at scale; has not yet displaced AdamW.
Shampoo	Approximate full-matrix preconditioner per layer using Kronecker factors of G^⊤G and GG^⊤.	Adam's diagonal preconditioner ignores correlations between parameters; Shampoo captures them within a layer.	Genuine accuracy gains on some benchmarks; expensive (matrix root/inverse every few steps). Niche.
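The Lion row of the table can be sketched for a scalar parameter as follows. The helper names, the default values for lr, b1 and b2, and the decoupled weight-decay term are assumptions for illustration rather than a reference implementation.

```python
def sign(x):
    """Return -1, 0, or +1."""
    return (x > 0) - (x < 0)

def lion_step(theta, m, g, lr=1e-4, b1=0.9, b2=0.99, wd=0.0):
    """One Lion step for a scalar parameter, following the update in the table."""
    direction = sign(b1 * m + (1 - b1) * g)        # sign of an interpolated momentum
    theta = theta - lr * (direction + wd * theta)  # decoupled decay (assumed form)
    m = b2 * m + (1 - b2) * g                      # the only state Lion keeps
    return theta, m

# The step size is lr regardless of the gradient's magnitude.
tiny, _ = lion_step(1.0, 0.0, g=1e-6)
huge, _ = lion_step(1.0, 0.0, g=1e6)
print(tiny, huge)
```

Because only the sign survives, a gradient of 1e-6 and a gradient of 1e6 move the parameter by the same amount, which is why Lion's learning rate has to be set lower than AdamW's.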

Memory cost — the number every interviewer expects

Let P = number of parameters. Bytes of optimizer state per parameter:

Optimizer	Optimizer state (bytes/param, fp32)	Total bytes/param (bf16 params + bf16 grads + state)
SGD	0	params(2) + grads(2) = 4 bytes/param (bf16)
SGD+momentum	4 (fp32 v)	4 + 4 = 8 bytes/param
AdamW	4 (m) + 4 (v) + 4 (master fp32 θ) = 12	4 + 12 = 16 bytes/param
Adafactor	~4 (per-row + per-column factored)	~8 bytes/param
Lion	4 (momentum) + 4 (master) = 8	4 + 8 = 12 bytes/param

The 16 bytes/param figure is the load-bearing number for distributed-training memory planning. A 70B model in AdamW needs 70e9 × 16 = 1.12 TB for parameters, gradients, and optimizer state combined; the optimizer state alone (12 bytes/param) is 840 GB. This is why ZeRO-1 (shards optimizer state across N ranks) is a baseline efficiency win even before sharding params.

Mental model

Adam-class optimizers cost ~16 bytes/param. Lion ~12. Adafactor ~8. SGD-momentum ~8. The savings from picking Lion over AdamW are real but second-order to FSDP/ZeRO sharding, which can divide all of these by N (number of ranks). Picking the optimizer is a 1.5× memory lever; picking the parallelism strategy is a 100× memory lever.
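The bytes-per-parameter arithmetic is easy to turn into a small planning helper. This is a sketch under the table's assumptions (bf16 params and grads, fp32 state, activations ignored); the function name per_rank_gb and the 13B, 8-rank example are chosen here for illustration.

```python
# Optimizer-state bytes per parameter, taken from the table above.
STATE_BYTES = {"sgd": 0, "sgd_momentum": 4, "adamw": 12, "lion": 8}
PARAM_BYTES = 2   # bf16 weights
GRAD_BYTES = 2    # bf16 gradients

def per_rank_gb(n_params, optimizer="adamw", ranks=1, zero_stage=0):
    """Approximate per-rank memory in GB, ignoring activations.

    ZeRO stage 1 shards optimizer state, stage 2 also shards gradients,
    stage 3 also shards parameters.
    """
    params = PARAM_BYTES * n_params
    grads = GRAD_BYTES * n_params
    state = STATE_BYTES[optimizer] * n_params
    if zero_stage >= 1:
        state /= ranks
    if zero_stage >= 2:
        grads /= ranks
    if zero_stage >= 3:
        params /= ranks
    return (params + grads + state) / 1e9

total_70b = per_rank_gb(70e9)                        # unsharded AdamW
lion_70b = per_rank_gb(70e9, optimizer="lion")       # unsharded Lion
zero2_13b = per_rank_gb(13e9, ranks=8, zero_stage=2)
print(total_70b, lion_70b, zero2_13b)
```

Switching optimizer changes the total by a modest factor, while raising the ZeRO stage or the rank count divides the sharded terms outright, which is the lever comparison made above.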

Schedules — the second hyperparameter that matters as much as LR

The learning rate η is rarely constant. The standard schedule is warmup → main schedule → decay. Each piece does something different:

[Figure: learning rate vs. step. A linear warmup ramps up to the peak, the main phase (cosine / linear / WSD plateau) follows, and a final decay brings the LR down to its end value. Step axis ticks: 0, 1k, 100k.]

Phase	Purpose	Typical recipe
Warmup	In the first steps, Adam's v̂ is estimated from only a few gradients, so the per-parameter scaling is noisy and every coordinate moves by roughly the full η. Warmup keeps η small until v̂ stabilises.	Linear ramp from 0 to peak LR over 1–10% of training steps. For LLMs: 2k steps is the standard starter.
Main	Most of the work. Higher LR → faster progress.	Constant, cosine (decays smoothly), or WSD (stay at peak, then decay). WSD is more checkpoint-friendly because you can resume the "stable" phase indefinitely.
Decay	Lower LR refines the solution. Empirically: most of the final accuracy comes from the last ~10% of training, during decay.	Cosine to η/10 or η/100. For WSD: linear decay over last 10–20%.
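The three phases can be written as one function of the step. This is a minimal sketch: the name lr_at, its parameters, and the example values (peak 3e-4, 2k warmup, 100k steps) are assumptions, and real training frameworks expose their own scheduler classes.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_steps,
          kind="cosine", final_ratio=0.1, decay_frac=0.1):
    """Learning rate at a given step: linear warmup, then cosine or WSD."""
    final_lr = peak_lr * final_ratio
    if step < warmup_steps:
        return peak_lr * step / warmup_steps            # linear ramp from 0
    if kind == "cosine":
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        progress = min(1.0, progress)
        return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))
    if kind == "wsd":
        decay_start = total_steps * (1 - decay_frac)
        if step < decay_start:
            return peak_lr                               # stable plateau
        progress = min(1.0, (step - decay_start) / (total_steps - decay_start))
        return peak_lr + (final_lr - peak_lr) * progress  # linear decay
    raise ValueError(f"unknown schedule kind: {kind}")

for s in (0, 1000, 2000, 50_000, 95_000, 100_000):
    print(s, lr_at(s, 100_000, 3e-4, 2000), lr_at(s, 100_000, 3e-4, 2000, kind="wsd"))
```

The WSD variant stays at the peak until the last decay_frac of training, which is what makes it convenient to resume: the plateau can be extended and the decay applied only when a final checkpoint is needed.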

Why warmup is non-negotiable for Adam at high LR

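Without warmup, v̂ during the first steps is estimated from only a handful of gradients. Bias correction keeps it from being near zero (at t = 1, v̂ = g₁², so m̂/√v̂ = sign(g₁)), but the estimate is very noisy, and every coordinate moves by roughly the full η, whereas at steady state noise-dominated coordinates have |m̂|/√v̂ well below 1. The peak LR is calibrated to steady-state statistics; using it from step 0 means early steps that are too large and poorly scaled, and with β₂ = 0.999 it takes on the order of 1/(1 − β₂) ≈ 1000 steps for v̂ to average out. The result is loss spikes or NaNs. The fix is either warmup or β₂ = 0.95 (faster v̂ stabilisation, used in modern LLM recipes).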

Feel preconditioning

An ill-conditioned 2D quadratic loss L(θ₁, θ₂) = a θ₁² + b θ₂² with a ≫ b. SGD oscillates along the steep θ₁ axis while crawling along the flat θ₂ axis, momentum smooths the path, and Adam rescales each axis so that both make similar progress.
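The same experiment can be run offline with a few lines of Python. The curvatures (a = 100, b = 1), the starting point (1, 1), the learning rates, and the function names are all assumptions picked so that the behaviour is easy to see.

```python
import math

A, B = 100.0, 1.0                        # steep axis vs flat axis

def grad(theta):
    """Gradient of L = A*theta1^2 + B*theta2^2."""
    return [2 * A * theta[0], 2 * B * theta[1]]

def run_sgd(steps, lr, beta=0.0):
    """SGD (beta=0) or Polyak momentum (beta>0); returns the list of iterates."""
    theta, vel, path = [1.0, 1.0], [0.0, 0.0], []
    for _ in range(steps):
        g = grad(theta)
        vel = [beta * v + gi for v, gi in zip(vel, g)]
        theta = [th - lr * v for th, v in zip(theta, vel)]
        path.append(theta)
    return path

def run_adam(steps, lr, b1=0.9, b2=0.999, eps=1e-8):
    """Adam applied independently to each coordinate; returns the iterates."""
    theta, m, v, path = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0], []
    for t in range(1, steps + 1):
        g = grad(theta)
        m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
        v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, g)]
        theta = [th - lr * (mi / (1 - b1 ** t)) / (math.sqrt(vi / (1 - b2 ** t)) + eps)
                 for th, mi, vi in zip(theta, m, v)]
        path.append(theta)
    return path

sgd = run_sgd(100, lr=0.009)             # lr is capped by the steep axis
mom = run_sgd(100, lr=0.009, beta=0.9)
adam = run_adam(100, lr=0.05)

print("SGD  first steps on steep axis:", [round(p[0], 3) for p in sgd[:4]])
print("SGD  flat axis after 100 steps:", sgd[-1][1])
print("Mom. flat axis after 100 steps:", mom[-1][1])
print("Adam first step (both axes):   ", adam[0])
```

SGD's steep coordinate flips sign on every step while its flat coordinate has barely moved; momentum gets further along the flat axis at the same learning rate; Adam's first step moves both coordinates by the same amount even though their gradients differ by a factor of 100.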

The peculiar facts every interviewer probes

Adam's effective LR is η · sign(g) in steady state. Because m̂ ≈ E[g] and √v̂ ≈ RMS(g), the update is "step η in the direction of the gradient", regardless of magnitude. Hence LRs are 10–100× smaller for Adam than SGD.
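A common default LR for Adam on transformers is ~3e-4, and it moves far less with model size than SGD's does, because Adam's per-parameter scaling absorbs much of the model-size dependence. It is not constant, though: under the standard parametrisation the best LR drifts downward as models grow (the GPT-3 family goes from 6e-4 for the smallest model to 0.6e-4 for the largest), which is the gap μP is designed to close (later lesson).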
Gradient clipping interacts differently with each optimizer. For SGD: clip → linearly cap step size. For Adam: barely matters at steady state (Adam already normalises). For Lion: the step size is fixed by the sign, but a single large gradient can dominate the momentum and pin the sign of the update for many steps, so clipping matters more.
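Adam's ε is not a numerical hack. It's the smallest √v̂ below which "divide by sqrt" stops being safe. Typical ε = 1e-8. In fp16, 1e-8 is below the smallest positive value (~6e-8) and rounds to zero, so fp16 recipes use a larger ε such as 1e-6. bf16 shares fp32's exponent range, so 1e-8 is representable there; some bf16 recipes still raise ε to 1e-6 so that parameters with tiny v̂ do not take outsized steps.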
β₂ controls how fast the optimizer adapts. Standard β₂ = 0.999, effective window ~1000 steps. For RL training with non-stationary rewards, lower β₂ = 0.95 so the optimizer doesn't carry stale variance estimates from old policies.
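A quick way to check which 16-bit format can actually hold ε = 1e-8 is to round-trip the value through each format. This sketch uses Python's struct module, whose "e" format is IEEE half precision (fp16); bfloat16 is emulated here by truncating an fp32 value to its top 16 bits, which is a simplification (real conversions usually round to nearest). The helper names are illustrative.

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE half precision (fp16) and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x):
    """Truncate a Python float to bfloat16 (top 16 bits of fp32) and back."""
    top_two_bytes = struct.pack(">f", x)[:2]
    return struct.unpack(">f", top_two_bytes + b"\x00\x00")[0]

eps = 1e-8
eps_fp16 = to_fp16(eps)          # below fp16's smallest subnormal, so it vanishes
eps_bf16 = to_bf16(eps)          # bf16 keeps fp32's exponent range
bigger_eps_fp16 = to_fp16(1e-6)  # survives in fp16 as a subnormal

print("1e-8 in fp16:", eps_fp16)
print("1e-8 in bf16:", eps_bf16)
print("1e-6 in fp16:", bigger_eps_fp16)
```

The underflow problem is specific to fp16: there ε = 1e-8 silently becomes zero, while bf16 keeps it at reduced precision.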

The senior question: when does Adam fail?

Three failure modes that distinguish senior candidates:

Adam can fail to converge on convex problems. Reddi et al. (2018, "On the Convergence of Adam and Beyond") gave a 1D counterexample: a rare, large gradient is forgotten by v̂ too quickly, so the frequent small gradients pointing the other way dominate and Adam drifts in the wrong direction. Fix: AMSGrad (keep a running maximum, v̂_t = max(v̂_{t−1}, v_t), so the denominator never shrinks). Almost never used in practice, because the failure is rare in deep learning.
Adam can overfit through its preconditioner. On small datasets, Adam's per-parameter LR scaling can drive it to overfit along exactly the directions it rescales most, i.e. where the loss landscape is most anisotropic. SGD's uniform step is empirically a better regulariser; this is why SGD+momentum sometimes generalises better despite slower training (Wilson et al. 2017, "The Marginal Value of Adaptive Gradient Methods in Machine Learning").
Adam and SGD respond differently to batch size. Doubling the batch halves the variance of the gradient noise. Adam's v̂ absorbs part of that change, so its LR needs a smaller adjustment (square-root scaling is a common heuristic). SGD has no such normalisation, so the usual rule is to double the LR (linear scaling). This is why hyperparameter sensitivity profiles look so different.
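The AMSGrad fix from the first failure mode amounts to one extra line of state. The sketch below is a scalar version that omits bias correction for brevity; the function name, hyperparameters, and the gradient sequence (one large gradient followed by many small ones) are assumptions for illustration.

```python
import math

def amsgrad_step(theta, m, v, v_max, g, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One AMSGrad step for a scalar parameter (no bias correction)."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    v_max = max(v_max, v)                  # the denominator can never shrink
    theta = theta - lr * m / (math.sqrt(v_max) + eps)
    return theta, m, v, v_max

theta, m, v, v_max = 0.0, 0.0, 0.0, 0.0
grads = [10.0] + [0.1] * 500               # one rare large gradient, then small ones
history = []
for g in grads:
    theta, m, v, v_max = amsgrad_step(theta, m, v, v_max, g)
    history.append((v, v_max))

print("plain v after first / last step:", history[0][0], history[-1][0])
print("v_max after first / last step:  ", history[0][1], history[-1][1])
```

Plain Adam's v forgets the large gradient as the small ones arrive, while v_max holds on to it, so the effective learning rate cannot grow back after a rare informative gradient.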

Interview prompts you should be ready for

"Why is Adam's LR ~3e-4 for transformers but SGD's is 0.1 for ResNets?" (Adam's update is approximately η · sign(g) in magnitude; SGD's is η · g. For typical gradient magnitudes |g| ~ 1e-2, matching the step size means Adam's η must be smaller by roughly that factor of |g|, i.e. about two orders of magnitude.)
"Walk me through the memory cost of training a 13B model in bf16 + AdamW." (Params bf16: 26 GB. Grads bf16: 26 GB. Adam m,v (each fp32): 52 + 52 = 104 GB. fp32 master copy of weights: 52 GB. Total: ~208 GB just for state, before activations. Doesn't fit on an 80 GB H100; need ZeRO/FSDP. With ZeRO-2 across 8 ranks (shards optimizer state + grads): per-rank state ≈ (26 params unsharded) + (26/8 grads) + (156/8 opt) ≈ 49 GB → fits.)
"Your loss spikes ~5000 steps in. Walk me through three causes." (Causes: (1) a bad batch, so log per-batch loss to find outliers; (2) Adam's v̂ has decayed to near-zero in some dimension and now responds violently to a fresh gradient; (3) a misconfigured LR schedule, for example a cosine schedule with restarts that has just jumped back to its peak. Mitigations: gradient clipping ‖g‖ ≤ 1, β₂ → 0.95 for faster adaptation, lower peak LR.)
"Why does AdamW separate weight decay from L2?" (L2 enters as a gradient term and is then divided by √v̂; weight decay should hit all params equally. Decoupling restores the intended uniform shrinkage. Practical gain: ~1–3% generalisation; bigger gain is decoupled tuning of η and λ.)
"What's the difference between Lion and Adam at the algorithmic level?" (Lion uses only first-moment EMA, then sign. Adam uses first- and second-moment EMAs, then divides. Lion uses ~half the memory and is sign-descent in disguise. Empirically: competitive but more sensitive to LR.)
"You're training on TPU and Adam's optimizer state doesn't fit. What do you do?" (Options: (1) Adafactor, with its factored second moment and about half the optimizer memory; (2) ZeRO/FSDP sharding; (3) lower-precision optimizer state (8-bit Adam, bitsandbytes); (4) CPU offload of optimizer state (with overlap). Order: try sharding first, then Adafactor, then 8-bit.)
"Why does warmup matter for Adam but not SGD?" (SGD's update η · g shrinks automatically when gradients are small. Adam's update η · m̂/√v̂ has magnitude ~η per coordinate whatever the gradient size, and in the first steps v̂ rests on only a few samples, so the per-parameter scaling is unreliable exactly when the relative steps are largest. Warmup lets v̂ stabilise before reaching peak LR. Alternative: lower β₂.)

Takeaway

Optimizers are variance-reduction + preconditioning machines. SGD does neither; momentum does variance reduction; Adam does both diagonally; Shampoo does both with full Kronecker structure. The trade-off is always memory (state per parameter) and compute (per step) vs convergence rate. For LLMs in 2026 the default is AdamW with cosine/WSD schedule and 1–2k warmup steps. Knowing why those specific defaults — not just memorising them — is what an interviewer is testing.