Deep Learning Foundations — interview prep

The math you get drilled on once the interviewer stops asking about XGBoost. Backprop, optimizers, normalization, attention, positional encodings, tokenization, scaling laws, calibration, A/B testing, init, MoE — derived from first principles, with the trade-offs an interviewer actually probes.

Why a separate track for DL fundamentals

The traditional ML track answers "how do I fit a model to tabular data?". The search/ads/recsys track answers "how do I rank at scale?". Neither covers the questions that a research-flavoured MLE / Applied Scientist / Research Scientist interview spends most of its time on once you say the word "transformer":

Derive backprop on a two-layer MLP. Where do the dimensions go?
Why does Adam use a second moment? What does the update converge to in steady state?
Why is LayerNorm preferred over BatchNorm in transformers? Why does RMSNorm replace LayerNorm in modern LLMs?
Derive scaled dot-product attention. Why √d_k? What goes wrong if you remove it?
RoPE vs ALiBi vs sinusoidal vs learned — explain when each generalises to longer contexts.
BPE vs WordPiece vs SentencePiece — and why the choice matters for RL credit assignment.
Chinchilla says 20 tokens/parameter. Why? What changes the constant?
You have a model with 60% raw accuracy and ECE = 0.12. What do you do — and why is "just retrain" wrong?
An A/B test ran for 3 days, p = 0.04, lift = 0.3%. Ship it? Why not?
You change activation function and the model diverges. Walk through three first-principles diagnoses.
MoE: why does a 50B model with 8 experts have ~6B active parameters? What's the routing loss?

Every one of those is a single 10-minute interview question. Most of them require deriving something from E[x] = 0, Var[x] = 1 or from argmax over an exponential family. The skill being tested is "can you re-derive what you were taught instead of repeating slogans". That's what this track drills.

How the track is structured

Strictly linear: each lesson assumes only the lessons before it. The first three lessons are the foundation; they reappear, in some form, in every subsequent lesson.

Stage	Lessons	What you can do after
Foundation	01–03: backprop, optimizers, normalization	Derive any gradient. Explain why Adam, AdamW, Lion, and LAMB differ. Diagnose a divergence by activation/gradient statistics.
Architectures	04–06: attention, positional encodings, transformer block	Implement a transformer from scratch in pseudocode in an interview. Defend every architectural choice.
Inputs & scale	07–08: tokenization, scaling laws	Make compute / data / parameter trade-offs from first principles. Predict when a finetune will help vs hurt.
Production rigour	09–10: calibration, A/B testing	Distinguish a calibrated model from an accurate one. Compute MDE, choose between offline and online evaluation, and read variance-reduced experiment results.
Stability & scale-up	11–12: initialization, MoE	Explain why deep nets used to be untrainable. Reason about routing losses, capacity factors, and conditional compute.

What an interviewer is testing

Skill	Strong	Weak
Re-derive	Asked "why √d_k?", you compute Var(Q·K) under standard init and show it grows linearly in d_k. Then you explain the downstream effect on softmax saturation.	"It normalises". Cannot say what would happen without it.
Connect layers of abstraction	You move fluently between the math (gradient flow), the kernel (HBM traffic), and the metric (ECE, accuracy, p-value). You see a divergence and propose three causes from three layers.	Stays in one register (math-only, systems-only, or metric-only) and can't relate them.
Know which lever moves which metric	"Throughput halved, latency up 30%. Which of: batch size, kv-cache dtype, attention impl, learning rate?" You rule out three from first principles and probe the fourth.	Tries every knob without knowing which metric each knob touches.
Quantitative trade-offs with numbers	"How long does a 70B Chinchilla-optimal run take on 1k H100s?" You back-of-envelope it: 6 × 70e9 × 1.4e12 / (1000 × 990e12) ≈ 5.9e5 s ≈ 7 days at peak FLOPs. Then you correct for MFU.	Cannot estimate. "Depends" without a sketch.
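
The back-of-envelope estimate in the last row can be written down as a few lines of arithmetic. This is a minimal sketch: the function name `training_days` and the 40% MFU figure are illustrative assumptions, not numbers from this track; the 6 × N × D rule and the 990e12 peak figure are taken from the row above.

```python
def training_days(n_params, n_tokens, n_gpus, peak_flops_per_gpu, mfu=1.0):
    """Estimate wall-clock days using the C = 6 * N * D approximation."""
    total_flops = 6.0 * n_params * n_tokens               # forward + backward
    sustained = n_gpus * peak_flops_per_gpu * mfu         # FLOP/s actually achieved
    seconds = total_flops / sustained
    return seconds / 86_400.0

# 70B parameters, 1.4T tokens, 1000 GPUs at 990e12 FLOP/s peak each.
at_peak = training_days(70e9, 1.4e12, 1000, 990e12)
# Hypothetical 40% model FLOPs utilisation, chosen only for illustration.
at_40_mfu = training_days(70e9, 1.4e12, 1000, 990e12, mfu=0.4)
print(f"{at_peak:.1f} days at peak, {at_40_mfu:.1f} days at 40% MFU")
```

At peak throughput this comes out just under 7 days; dividing by the MFU is the correction the strong answer mentions.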

The lessons

Backprop & autodiff from first principles

Forward / backward / Jacobian-vector products. Why reverse mode is O(forward). Memory cost of activations and the recompute / save trade-off. Common gradient-shape bugs and how to spot them in 30 seconds.
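
A reverse-mode pass on a two-layer MLP might look like the sketch below, with a finite-difference check on every entry of the first weight matrix. The tanh nonlinearity, the squared-error loss, the scalar output, and all names (`forward`, `backward`, `numeric_grad_W1`) are assumptions chosen to keep the check smooth and short; the lesson itself may use a different setup.

```python
import math
import random

def forward(x, W1, W2, target):
    """Two-layer MLP: h = tanh(W1 x), y = W2 . h, loss = 0.5 * (y - target)^2."""
    z = [sum(w * xi for w, xi in zip(row, x)) for row in W1]   # shape (H,)
    h = [math.tanh(zi) for zi in z]                            # shape (H,)
    y = sum(w * hi for w, hi in zip(W2, h))                    # scalar
    loss = 0.5 * (y - target) ** 2
    return loss, (x, h, y)

def backward(W2, target, cache):
    """Reverse-mode pass: push dL/dy back through each op in reverse order."""
    x, h, y = cache
    dy = y - target                                            # dL/dy
    dW2 = [dy * hi for hi in h]                                # shape (H,), same as W2
    dh = [dy * w for w in W2]                                  # shape (H,)
    dz = [dhi * (1.0 - hi * hi) for dhi, hi in zip(dh, h)]     # tanh' = 1 - tanh^2
    dW1 = [[dzi * xj for xj in x] for dzi in dz]               # outer product, (H, D)
    return dW1, dW2

def numeric_grad_W1(x, W1, W2, target, i, j, eps=1e-6):
    """Central finite difference for a single entry of W1."""
    W1[i][j] += eps
    loss_plus, _ = forward(x, W1, W2, target)
    W1[i][j] -= 2 * eps
    loss_minus, _ = forward(x, W1, W2, target)
    W1[i][j] += eps                                            # restore the weight
    return (loss_plus - loss_minus) / (2 * eps)

rng = random.Random(0)
D, H = 3, 4
x = [rng.gauss(0, 1) for _ in range(D)]
W1 = [[rng.gauss(0, 0.5) for _ in range(D)] for _ in range(H)]
W2 = [rng.gauss(0, 0.5) for _ in range(H)]
target = 1.0

loss, cache = forward(x, W1, W2, target)
dW1, dW2 = backward(W2, target, cache)
max_err = max(abs(dW1[i][j] - numeric_grad_W1(x, W1, W2, target, i, j))
              for i in range(H) for j in range(D))
print("max |analytic - numeric| on W1:", max_err)
```

The quick shape check is that every gradient has the shape of the thing it differentiates: `dW1` is (H, D) like `W1`, and `dW2` is (H,) like `W2`.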

SGD, momentum, Adam, AdamW, Lion, schedules

Each optimizer as a different bet about loss curvature. Adam's effective LR per parameter. Why decoupled weight decay matters. Warmup, cosine, WSD schedules and what each one is solving for. Memory cost: Adam stores two moment estimates per parameter, so optimizer state alone is 2× the parameter count.
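
To make "effective LR per parameter" concrete, here is a minimal scalar sketch of one AdamW update. The function name `adamw_step` and the default hyperparameters are assumptions for illustration; it is not a substitute for a library optimizer.

```python
import math

def adamw_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.0):
    """One AdamW update for a scalar parameter; returns (p, m, v)."""
    m = b1 * m + (1 - b1) * g          # first moment: EMA of gradients
    v = b2 * v + (1 - b2) * g * g      # second moment: EMA of squared gradients
    m_hat = m / (1 - b1 ** t)          # bias correction for the zero init of m
    v_hat = v / (1 - b2 ** t)          # bias correction for the zero init of v
    p = p - lr * wd * p                # decoupled decay: not divided by sqrt(v_hat)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

# Two parameters whose gradients differ by six orders of magnitude.
small, _, _ = adamw_step(p=0.0, g=1e-3, m=0.0, v=0.0, t=1)
large, _, _ = adamw_step(p=0.0, g=1e+3, m=0.0, v=0.0, t=1)
print(small, large)
```

Both parameters move by about `lr` on the first step, regardless of gradient scale, because the update is the gradient divided by its own running magnitude. The two state values `m` and `v` per parameter are the optimizer-state memory cost.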

BatchNorm, LayerNorm, RMSNorm, GroupNorm

BN as a statistical trick (and its serving headache). LN as a per-token fix. RMSNorm as the simplification that ate the LLM stack. Pre-norm vs post-norm and why most modern transformers are pre-norm. Numerical traps in mixed precision.
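
The relationship between LayerNorm and RMSNorm can be seen on a single vector. The sketch below omits the learned gain and bias, and the function names are illustrative assumptions: RMSNorm is LayerNorm without the mean subtraction, so the two agree exactly when the input is already centered.

```python
import math

def layer_norm(x, eps=1e-6):
    """Subtract the mean, divide by the standard deviation (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def rms_norm(x, eps=1e-6):
    """Divide by the root-mean-square only; no mean subtraction."""
    ms = sum(xi * xi for xi in x) / len(x)
    return [xi / math.sqrt(ms + eps) for xi in x]

x = [2.0, -1.0, 0.5, 3.5]
centered = [xi - sum(x) / len(x) for xi in x]

ln_out = layer_norm(x)                 # zero mean, unit variance
rms_out = rms_norm(x)                  # unit RMS, mean left in place
rms_on_centered = rms_norm(centered)   # matches ln_out
```

RMSNorm skips one reduction (the mean) per token, which is the simplification the lesson title refers to.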

Scaled dot-product attention from scratch

QKV from a single learned projection. Softmax over the right axis. Why √d_k. Causal masking. Multi-head as a low-rank trick on Q,K,V. Complexity (O(N²d)) and where it bites. MHA vs MQA vs GQA — the KV-cache calculation you'll be quizzed on.
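
The "why √d_k" argument can be checked numerically. The sketch below assumes q and k have independent standard-normal entries (an idealisation of standard init) and estimates Var(q·k) by Monte Carlo; `dot_variance` is a hypothetical helper name.

```python
import random

def dot_variance(d_k, n_samples=4000, scale=1.0, seed=0):
    """Monte Carlo variance of scale * (q . k) with q, k ~ N(0, I)."""
    rng = random.Random(seed)
    vals = []
    for _ in range(n_samples):
        s = sum(rng.gauss(0, 1) * rng.gauss(0, 1) for _ in range(d_k))
        vals.append(scale * s)
    mean = sum(vals) / n_samples
    return sum((v - mean) ** 2 for v in vals) / n_samples

raw_16 = dot_variance(16)                         # close to 16
raw_64 = dot_variance(64)                         # close to 64
scaled_64 = dot_variance(64, scale=64 ** -0.5)    # close to 1
print(raw_16, raw_64, scaled_64)
```

Each of the d_k products has variance 1, so the sum has variance d_k; dividing by √d_k brings the logits back to unit variance and keeps the softmax out of its saturated regime.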

Positional encodings — sinusoidal, learned, RoPE, ALiBi

Why attention is permutation-equivariant and needs a position signal. Sinusoidal as Fourier features. RoPE as a rotation on Q,K (relative position for free). ALiBi as a per-head linear bias. NTK / YaRN extrapolation tricks. What "context-length extension" actually changes.
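
The "relative position for free" property of RoPE can be demonstrated on a toy vector. This is a minimal sketch assuming the common pairing of consecutive dimensions and a base of 10000; the function names and the example vectors are illustrative only.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by angle pos * theta_i."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)                  # one frequency per pair
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]     # 2D rotation
    return out

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

near = dot(rope(q, 5), rope(k, 3))         # positions 5 and 3, offset 2
far = dot(rope(q, 105), rope(k, 103))      # positions 105 and 103, offset 2
other = dot(rope(q, 5), rope(k, 4))        # offset 1
```

`near` and `far` agree to floating-point precision because rotating q by m and k by n leaves a net rotation of m − n inside the dot product; `other` differs because the offset differs.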

The transformer block — full forward pass

Pre-norm residual stream as a "highway with edits". Why the FFN is ~4× wider. SwiGLU vs GeLU and the FLOP accounting that makes it free. Weight tying. KV cache shapes. The exact parameter count and FLOP count of a single block.
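
A rough parameter count for one block, sketched under stated assumptions: plain multi-head attention with square projections, no biases, norm gains ignored, a GeLU FFN of width 4 × d_model, and a SwiGLU FFN of width 8/3 × d_model (a common convention, assumed here rather than taken from this track).

```python
def attention_params(d_model):
    """W_Q, W_K, W_V, W_O, each d_model x d_model (plain MHA, no biases)."""
    return 4 * d_model * d_model

def gelu_ffn_params(d_model, d_ff):
    """Up and down projections."""
    return 2 * d_model * d_ff

def swiglu_ffn_params(d_model, d_ff):
    """Gate, up, and down projections."""
    return 3 * d_model * d_ff

d_model = 3072   # chosen so that 8 * d_model / 3 is an integer
gelu_block = attention_params(d_model) + gelu_ffn_params(d_model, 4 * d_model)
swiglu_block = attention_params(d_model) + swiglu_ffn_params(d_model, 8 * d_model // 3)
print(gelu_block, swiglu_block, 12 * d_model ** 2)
```

Under these assumptions both variants land on 12 × d_model² parameters per block: SwiGLU adds a third matrix but shrinks the hidden width by 2/3, which is the accounting that makes the swap roughly free.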

Tokenization — BPE, WordPiece, SentencePiece, byte-level

Why we don't train on characters or words. BPE greedy merge algorithm. The "leading-space" / "Ġ" trick that GPT-2 introduced. Why tokenization breaks math and code. RL credit-assignment at the subword level. Tokenizer-dependent benchmark gotchas.
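
The greedy merge loop of BPE fits in a few functions. The sketch below is a toy version: it has no end-of-word marker and no byte-level fallback, ties are broken alphabetically (an arbitrary choice), and the four-word corpus and function names are made up for illustration.

```python
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(vocab, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def learn_bpe(word_freqs, n_merges):
    """Start from characters and greedily merge the most frequent pair."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(n_merges):
        counts = pair_counts(vocab)
        if not counts:
            break
        # Highest count wins; ties are broken alphabetically.
        best = min(counts, key=lambda p: (-counts[p], p))
        merges.append(best)
        vocab = merge_pair(vocab, best)
    return merges, vocab

merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, n_merges=3)
print(merges)   # [('e', 's'), ('es', 't'), ('l', 'o')]
```

The learned merge list, applied in order, is the tokenizer: encoding a new word replays the same merges on its characters.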

Scaling laws — Kaplan, Chinchilla, μP

Compute = 6 × N × D, the back-of-envelope every interviewer expects. Chinchilla's 20:1 tokens-to-parameters ratio and why Kaplan got it wrong. μP / μ-Transfer for hyperparameter scaling without grid search. When the recipe breaks (data-constrained regime, low-resource languages).
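
Combining Compute = 6 × N × D with a fixed tokens-per-parameter ratio gives a closed form for the compute-optimal split. A minimal sketch, treating the 20:1 ratio as a tunable constant; the function name `chinchilla_split` is an assumption.

```python
import math

def chinchilla_split(compute_flops, tokens_per_param=20.0):
    """Solve C = 6 * N * D with D = tokens_per_param * N for (N, D)."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A budget of 5.88e23 FLOPs recovers the 70B-parameter, 1.4T-token point.
n_params, n_tokens = chinchilla_split(5.88e23)
print(f"N = {n_params:.3g} params, D = {n_tokens:.3g} tokens")
```

Because N and D both scale as the square root of compute under this rule, a 4× larger budget buys a 2× larger model trained on 2× the tokens.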

Calibration & uncertainty

Confidence ≠ accuracy. Reliability diagrams, ECE, Brier score, proper scoring rules. Why deep nets are overconfident. Temperature scaling vs Platt vs isotonic. Calibration after RLHF (it typically degrades). Predictive entropy and selective prediction.
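
ECE and temperature scaling are both short enough to write out. The sketch below assumes equal-width confidence bins with half-open edges and a single scalar temperature; the function names and toy inputs are illustrative, and fitting the temperature on a validation set is left out.

```python
import math

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: sum over bins of (n_b / N) * |acc_b - conf_b|."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)    # conf == 1.0 goes in the last bin
        bins[idx].append((conf, hit))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by T before softmax; T > 1 flattens, argmax is unchanged."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)                                 # subtract max for stability
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Ten predictions, all at 90% confidence, only six of them correct.
overconfident = expected_calibration_error([0.9] * 10, [1] * 6 + [0] * 4)
p_raw = softmax_with_temperature([4.0, 1.0, 0.0])
p_soft = softmax_with_temperature([4.0, 1.0, 0.0], temperature=2.0)
```

Temperature scaling lowers the top probability without changing which class wins, so it can reduce ECE while leaving accuracy untouched.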

Statistical testing & A/B testing for ML

Type I / II errors, power, MDE. The four formulas every PM-facing MLE memorises. CUPED variance reduction. Sequential testing pitfalls (peeking inflates Type I). Heterogeneous effects, ratio metrics, the delta method. Counterfactual evaluation when an A/B isn't possible.
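
One commonly used sample-size formula relates power, significance level, and MDE. The sketch below assumes a two-sided z-test on a difference in means with equal arm sizes and a known, shared standard deviation; the function names are assumptions, and real experiments usually need the variance estimated from data.

```python
import math
from statistics import NormalDist

def samples_per_arm(sigma, mde, alpha=0.05, power=0.8):
    """n = 2 * (z_{1-alpha/2} + z_{power})^2 * sigma^2 / mde^2, rounded up."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 * sigma ** 2 / mde ** 2)

def minimum_detectable_effect(sigma, n_per_arm, alpha=0.05, power=0.8):
    """Invert the same formula for the smallest detectable absolute difference."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * sigma * math.sqrt(2.0 / n_per_arm)

n = samples_per_arm(sigma=1.0, mde=0.1)
mde_back = minimum_detectable_effect(sigma=1.0, n_per_arm=n)
print(n, mde_back)
```

Halving the MDE quadruples the required sample size, which is why variance reduction (shrinking sigma) is often cheaper than running longer.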

Initialization & gradient stability

Why a deep ReLU network with N(0,1) init fails to train: activation variance grows by roughly fan_in/2 per layer. Xavier and Kaiming derived from "preserve variance under linear + nonlinear". Why residuals + LayerNorm rescued deep training. Spectral norm and weight singular values. Per-layer LR scaling under μP. Loss spikes: diagnosing one in a real run.
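
The variance argument behind Kaiming init can be simulated directly. This sketch pushes one random vector through a stack of bias-free ReLU layers and compares unit-variance weights against weights with variance 2 / fan_in; the width, depth, seed, and the helper name `relu_depth_ratio` are arbitrary assumptions.

```python
import math
import random

def relu_depth_ratio(weight_std_fn, width=128, depth=8, seed=0):
    """Push one random vector through `depth` ReLU layers and return the ratio
    of output mean-square activation to input mean-square."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(width)]
    ms_in = sum(v * v for v in x) / width
    std = weight_std_fn(width)
    for _ in range(depth):
        # Each output unit draws a fresh row of weights ~ N(0, std^2).
        x = [max(0.0, sum(rng.gauss(0, std) * v for v in x)) for _ in range(width)]
    return (sum(v * v for v in x) / width) / ms_in

unit_init = relu_depth_ratio(lambda fan_in: 1.0)                          # W ~ N(0, 1)
kaiming_init = relu_depth_ratio(lambda fan_in: math.sqrt(2.0 / fan_in))   # Var = 2 / fan_in
print(f"N(0,1): {unit_init:.3g}   Kaiming: {kaiming_init:.3g}")
```

With unit-variance weights the activation scale grows by about fan_in / 2 per layer and overflows any useful range within a few layers; with variance 2 / fan_in the expected scale is preserved, up to noise that shrinks as the layers get wider.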

Mixture-of-Experts — sparse activation

Top-k routing. Active vs total parameters. Auxiliary load-balancing loss and why it's necessary. Capacity factor and token dropping. Expert parallelism: an AllToAll on every layer. Why MoE is throughput-optimal but memory-heavy. Mixtral, DBRX, DeepSeek-MoE design choices.
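
One widely used form of the auxiliary load-balancing loss is the Switch-Transformer-style term for top-1 routing. The sketch below assumes that formulation (other MoE designs use different variants) and omits the scaling coefficient; the function names and the toy router logits are illustrative.

```python
import math

def softmax(row):
    top = max(row)
    exps = [math.exp(z - top) for z in row]
    total = sum(exps)
    return [e / total for e in exps]

def load_balance_loss(router_logits):
    """Switch-style auxiliary loss for top-1 routing: N * sum_i f_i * P_i, where
    f_i is the fraction of tokens sent to expert i and P_i is the mean router
    probability assigned to expert i."""
    probs = [softmax(row) for row in router_logits]
    n_tokens, n_experts = len(probs), len(probs[0])
    f = [0.0] * n_experts
    for p in probs:
        f[p.index(max(p))] += 1.0 / n_tokens           # top-1 assignment
    P = [sum(p[i] for p in probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(fi * Pi for fi, Pi in zip(f, P))

# Four tokens, four experts.
balanced = [[10.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
collapsed = [[10.0, 0.0, 0.0, 0.0] for _ in range(4)]

loss_balanced = load_balance_loss(balanced)      # 1.0: the minimum under uniform load
loss_collapsed = load_balance_loss(collapsed)    # close to 4: every token on one expert
print(loss_balanced, loss_collapsed)
```

The loss sits at 1 when tokens are spread evenly and approaches the number of experts when routing collapses onto a single expert, so minimising it pushes the router toward balanced load.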

What this track does NOT cover

Out of scope on purpose — they're already deep in other tracks:

Distributed training (DDP, FSDP, TP, PP, EP). See system_ml.
Kernels & GPU internals (warps, tensor cores, FlashAttention internals). See gpu_kernels.
Post-training RL (PPO, GRPO, DPO, RLHF). See RL.
Serving (KV cache, continuous batching, paged attention). See vllm.
Generative continuous (diffusion, flow matching, DiT). See generative_continuous.

This track is the prerequisite for all of the above. If you can't derive backprop on a two-layer MLP, the FSDP communication-volume calculation in system_ml/05 will not stick.

How to use this

Linearly the first time. Lessons 01–03 are load-bearing for everything downstream. Skipping them makes the attention-derivation lesson harder than it needs to be.
Re-derive on paper. Each lesson has a "derive this in 60 seconds" box. The interviewer wants to see the derivation, not hear the slogan.
Touch the widget. Each lesson has at least one interactive visualisation. They make abstract trade-offs concrete.
Read the "interview prompts" box. The questions in those boxes are the questions you will actually be asked. The prose tells you the depth of answer that distinguishes a hire from a strong-hire.

A note on what makes deep learning "fundamental"

Architectures come and go. The Mamba paper has not displaced the transformer; one day something will. What does not change is the small set of operations underneath: matmul, normalisation, residual connections, softmax, cross-entropy, autograd. If you understand those mechanically, every new architecture is a one-page diff. If you don't, every new architecture is a new thing to memorise. This track is about the mechanical understanding.