Deep Learning Foundations — interview prep
The math you get drilled on once the interviewer stops asking about XGBoost. Backprop, optimizers, normalization, attention, positional encodings, tokenization, scaling laws, calibration, A/B testing, init, MoE — derived from first principles, with the trade-offs an interviewer actually probes.
Why a separate track for DL fundamentals
The traditional ML track answers "how do I fit a model to tabular data?". The search/ads/recsys track answers "how do I rank at scale?". Neither covers the questions that a research-flavoured MLE / Applied Scientist / Research Scientist interview spends most of its time on once you say the word "transformer":
- Derive backprop on a two-layer MLP. Where do the dimensions go?
- Why does Adam use a second moment? What does it converge to in steady state?
- Why is LayerNorm preferred over BatchNorm in transformers? Why does RMSNorm replace LayerNorm in modern LLMs?
- Derive scaled dot-product attention. Why √d_k? What goes wrong if you remove it?
- RoPE vs ALiBi vs sinusoidal vs learned — explain when each generalises to longer contexts.
- BPE vs WordPiece vs SentencePiece — and why the choice matters for RL credit assignment.
- Chinchilla says 20 tokens/parameter. Why? What changes the constant?
- You have a model with 60% raw accuracy and ECE = 0.12. What do you do — and why is "just retrain" wrong?
- An A/B test ran for 3 days, p = 0.04, lift = 0.3%. Ship it? Why not?
- You change activation function and the model diverges. Walk through three first-principles diagnoses.
- MoE: why does a 50B model with 8 experts and top-1 routing have ~6B active parameters? What's the routing loss?
Every one of those is a single 10-minute interview question. Most of them require deriving something from E[x] = 0, Var[x] = 1 or from argmax over an exponential family. The skill being tested is "can you re-derive what you were taught instead of repeating slogans". That's what this track drills.
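As an illustration of what "deriving from E[x] = 0, Var[x] = 1" looks like, take the √d_k question. If the components of q and k are independent with zero mean and unit variance, each product q_i·k_i has variance 1, so the dot product over d_k components has variance d_k, and dividing by √d_k brings it back to 1. The sketch below checks this empirically. It assumes Gaussian entries, and `dot_product_variance` is a hypothetical helper name, not something defined by the lessons.

```python
import numpy as np

def dot_product_variance(d_k, n_samples=10_000, seed=0):
    """Return (variance of raw q.k, variance of q.k / sqrt(d_k))."""
    rng = np.random.default_rng(seed)
    # Assumption: entries are i.i.d. standard normal (zero mean, unit variance).
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    scores = np.einsum("nd,nd->n", q, k)   # one dot product per sample row
    return float(scores.var()), float((scores / np.sqrt(d_k)).var())

for d_k in (16, 64, 256):
    raw, scaled = dot_product_variance(d_k)
    print(f"d_k={d_k:4d}  Var(q.k)={raw:8.2f}  Var(q.k/sqrt(d_k))={scaled:.3f}")
```

The raw variance should track d_k while the scaled variance stays near 1. Without the scaling, the logits grow with √d_k, which pushes softmax toward one-hot outputs with small gradients.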
How the track is structured
Strictly linear: each lesson assumes only the lessons before it. The first three lessons are the foundation — they appear, in some form, in every subsequent lesson.
| Stage | Lessons | What you can do after |
|---|---|---|
| Foundation | 01–03: backprop, optimizers, normalization | Derive any gradient. Explain why Adam, AdamW, Lion, and LAMB differ. Diagnose a divergence by activation/gradient statistics. |
| Architectures | 04–06: attention, positional encodings, transformer block | Implement a transformer from scratch in pseudocode in an interview. Defend every architectural choice. |
| Inputs & scale | 07–08: tokenization, scaling laws | Make compute / data / parameter trade-offs from first principles. Predict when a finetune will help vs hurt. |
| Production rigour | 09–10: calibration, A/B testing | Distinguish a calibrated model from an accurate one. Compute MDE, choose between offline and online evaluation, and read variance-reduced experiment results. |
| Stability & scale-up | 11–12: initialization, MoE | Explain why deep nets used to be untrainable. Reason about routing losses, capacity factors, and conditional compute. |
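
To make the "calibrated vs accurate" distinction in the Production rigour row concrete, here is a minimal sketch of expected calibration error (ECE) with equal-width confidence bins. The function name, the bin count, and the toy inputs are illustrative assumptions. The calibration lesson may use a different binning scheme.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin weight) * |accuracy - mean confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitizing against the interior edges yields bin indices 0 .. n_bins - 1.
    bin_ids = np.digitize(confidences, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap       # weight the gap by the bin's share of samples
    return float(ece)

# Toy data (made up): 60% accurate but always 90% confident.
overconfident = expected_calibration_error([0.9] * 10, [1] * 6 + [0] * 4)
# Toy data (made up): 75% accurate and 75% confident.
calibrated = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
print(overconfident, calibrated)
```

The first toy model has a calibration gap of about 0.3 and the second has none, even though neither accuracy is remarkable. Accuracy and calibration are separate properties.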
What an interviewer is testing
| Skill | Strong | Weak |
|---|---|---|
| Re-derive | Asked "why √d_k?", you compute Var(Q·K) under standard init and show it grows linearly in d_k. Then you explain the downstream effect on softmax saturation. | "It normalises". Cannot say what would happen without it. |
| Connect layers of abstraction | You move fluently between the math (gradient flow), the kernel (HBM traffic), and the metric (ECE, accuracy, p-value). You see a divergence and propose three causes, one from each layer. | Stays in one register (math-only, systems-only, or metric-only) and can't relate them. |
| Know which lever moves which metric | "Throughput halved, latency up 30%. Which of: batch size, kv-cache dtype, attention impl, learning rate?" You rule out three from first principles and probe the fourth. | Tries every knob without knowing which metric each knob moves. |
| Quantitative trade-offs with numbers | "How long does a 70B Chinchilla-optimal run take on 1k H100s?" You back-of-envelope it: ~6 × 70e9 × 1.4e12 / (1000 × 990e12) ≈ 5.9e5 s ≈ 7 days at peak FLOPs. Then you correct for MFU. | Cannot estimate. "Depends" without a sketch. |
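
The back-of-envelope estimate in the last row can be written as a tiny helper. This is a rough sketch: `training_days` is a hypothetical name, the 6·N·D rule is the usual approximation for dense-transformer training FLOPs, 990e12 FLOP/s is the per-GPU peak figure used in the row above, and the 40% MFU value is an illustrative assumption rather than a measurement.

```python
def training_days(params, tokens, n_gpus, peak_flops_per_gpu, mfu=1.0):
    """Wall-clock days for a dense training run under the ~6 * N * D FLOPs rule."""
    total_flops = 6.0 * params * tokens                     # forward + backward
    sustained_flops_per_s = n_gpus * peak_flops_per_gpu * mfu
    return total_flops / sustained_flops_per_s / 86_400     # seconds per day

# 70B parameters, 20 tokens per parameter, 1000 GPUs.
ideal = training_days(70e9, 1.4e12, 1000, 990e12)
# Assumed 40% model FLOPs utilisation (illustrative, not measured).
realistic = training_days(70e9, 1.4e12, 1000, 990e12, mfu=0.4)
print(f"100% MFU: {ideal:.1f} days, 40% MFU: {realistic:.1f} days")
```

At peak throughput this comes out a little under 7 days. At the assumed 40% MFU it stretches to roughly 17 days, which is why the MFU correction is part of the answer.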
The lessons
What this track does NOT cover
Out of scope on purpose — they're already deep in other tracks:
- Distributed training (DDP, FSDP, TP, PP, EP). See system_ml.
- Kernels & GPU internals (warps, tensor cores, FlashAttention internals). See gpu_kernel_serving.
- Post-training RL (PPO, GRPO, DPO, RLHF). See RL.
- Serving (KV cache, continuous batching, paged attention). See vllm.
- Continuous generative models (diffusion, flow matching, DiT). See generative_continuous.
This track is the prerequisite for all of the above. If you can't derive backprop on a two-layer MLP, the FSDP communication-volume calculation in system_ml/05 will not stick.
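For reference, a compact version of that two-layer MLP derivation with every shape annotated might look like the following. It is a sketch under stated assumptions: no biases, ReLU hidden activation, mean squared error loss, and arbitrary small layer sizes. None of these choices are prescribed by the track, and the finite-difference check is included only to confirm the hand-derived gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
B, d_in, d_h, d_out = 4, 3, 5, 2           # batch, input, hidden, output sizes (arbitrary)
X = rng.standard_normal((B, d_in))
Y = rng.standard_normal((B, d_out))
W1 = rng.standard_normal((d_in, d_h)) * 0.5
W2 = rng.standard_normal((d_h, d_out)) * 0.5

def loss_fn(W1, W2):
    H = np.maximum(X @ W1, 0.0)            # (B, d_h)
    return 0.5 * np.sum((H @ W2 - Y) ** 2) / B

# Forward pass, keeping intermediates for the backward pass.
Z1 = X @ W1                                # (B, d_in) @ (d_in, d_h) -> (B, d_h)
H = np.maximum(Z1, 0.0)                    # (B, d_h)
out = H @ W2                               # (B, d_h) @ (d_h, d_out) -> (B, d_out)

# Backward pass: every gradient has the shape of the tensor it belongs to.
g_out = (out - Y) / B                      # (B, d_out)
dW2 = H.T @ g_out                          # (d_h, B) @ (B, d_out) -> (d_h, d_out)
dH = g_out @ W2.T                          # (B, d_out) @ (d_out, d_h) -> (B, d_h)
dZ1 = dH * (Z1 > 0)                        # ReLU passes gradient only where Z1 > 0
dW1 = X.T @ dZ1                            # (d_in, B) @ (B, d_h) -> (d_in, d_h)

def numerical_grad(f, W, eps=1e-6):
    """Central finite differences of scalar f() with respect to W (perturbed in place)."""
    g = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + eps
        plus = f()
        W[idx] = old - eps
        minus = f()
        W[idx] = old
        g[idx] = (plus - minus) / (2 * eps)
    return g

f = lambda: loss_fn(W1, W2)
max_err = max(np.abs(dW1 - numerical_grad(f, W1)).max(),
              np.abs(dW2 - numerical_grad(f, W2)).max())
print("max abs gradient error:", max_err)
```

The pattern to notice is that the transpose always lands on whichever factor makes the shapes line up: activations are transposed for weight gradients, weights are transposed for activation gradients.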
How to use this
- Linearly the first time. Lessons 01–03 are load-bearing for everything downstream. Skipping them makes the attention-derivation lesson harder than it needs to be.
- Re-derive on paper. Each lesson has a "derive this in 60 seconds" box. The interviewer wants to see the derivation, not hear the slogan.
- Touch the widget. Each lesson has at least one interactive visualisation. They make abstract trade-offs concrete.
- Read the "interview prompts" box. The questions in those boxes are the questions you will actually be asked. The prose tells you the depth of answer that distinguishes a hire from a strong-hire.