12 lessons · ~3h read

Deep Learning Foundations — interview prep

The math you get drilled on once the interviewer stops asking about XGBoost. Backprop, optimizers, normalization, attention, positional encodings, tokenization, scaling laws, calibration, A/B testing, init, MoE — derived from first principles, with the trade-offs an interviewer actually probes.

Why a separate track for DL fundamentals

The traditional ML track answers "how do I fit a model to tabular data?". The search/ads/recsys track answers "how do I rank at scale?". Neither covers the questions that a research-flavoured MLE / Applied Scientist / Research Scientist interview spends most of its time on once you say the word "transformer":

Every one of those is a single 10-minute interview question. Most of them require deriving something from E[x] = 0, Var[x] = 1 or from argmax over an exponential family. The skill being tested is "can you re-derive what you were taught instead of repeating slogans". That's what this track drills.
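
As an illustration of the kind of derivation meant here (a sketch, not taken from the lessons themselves), consider the dot product of a query q and a key k in d_k dimensions. The assumptions are that all components q_i and k_i are mutually independent with zero mean and unit variance; real activations only approximate this.

```latex
\begin{align*}
q \cdot k &= \sum_{i=1}^{d_k} q_i k_i \\
\mathbb{E}[q_i k_i] &= \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0 \\
\operatorname{Var}(q_i k_i) &= \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] - 0 = 1 \\
\operatorname{Var}(q \cdot k) &= \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k
  \qquad \text{(the terms are pairwise uncorrelated)} \\
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) &= \frac{d_k}{d_k} = 1
\end{align*}
```

Under these assumptions the logit variance grows linearly in d_k, and dividing by √d_k brings it back to 1, which keeps the softmax out of its saturated regime.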

How the track is structured

Strictly linear: each lesson assumes only the lessons before it. The first three lessons are the foundation; they reappear, in some form, in every subsequent lesson.

Stage | Lessons | What you can do after
Foundation | 01–03: backprop, optimizers, normalization | Derive any gradient. Explain why Adam, AdamW, Lion, and LAMB differ. Diagnose a divergence by activation/gradient statistics.
Architectures | 04–06: attention, positional encodings, transformer block | Implement a transformer from scratch in pseudocode in an interview. Defend every architectural choice.
Inputs & scale | 07–08: tokenization, scaling laws | Make compute / data / parameter trade-offs from first principles. Predict when a finetune will help vs hurt.
Production rigour | 09–10: calibration, A/B testing | Distinguish a calibrated model from an accurate one. Compute MDE, choose between offline and online evaluation, and read variance-reduced experiment results.
Stability & scale-up | 11–12: initialization, MoE | Explain why deep nets used to be untrainable. Reason about routing losses, capacity factors, and conditional compute.

What an interviewer is testing

Skill | Strong | Weak
Re-derive | Asked "why √d_k?", you compute Var(Q·K) under standard init and show it grows linearly in d_k. Then you explain the downstream effect on softmax saturation. | "It normalises". Cannot say what would happen without it.
Connect layers of abstraction | You move fluently between the math (gradient flow), the kernel (HBM traffic), and the metric (ECE, accuracy, p-value). You see a divergence and propose three causes from three layers. | Stays in one register (math-only, systems-only, or metric-only) and can't relate them.
Know which lever moves which metric | "Throughput halved, latency up 30%. Which of: batch size, kv-cache dtype, attention impl, learning rate?" You rule out three on first principles and probe the fourth. | Tries every knob without knowing which metric each knob touches.
Quantitative trade-offs with numbers | "How long does a 70B Chinchilla-optimal run take on 1k H100s?" You back-of-envelope it: ~6 × 70e9 × 1.4e12 / (1000 × 990e12) ≈ 5.9e5 s ≈ 7 days at peak throughput. Then you correct for MFU. | Cannot estimate. "Depends" without a sketch.
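
The back-of-envelope estimate in the last row can be sketched as a small function. This is a minimal sketch, assuming the 6 × N × D rule for training FLOPs and the 990e12 FLOP/s per-GPU peak quoted above; the function name `training_days` and the 40% MFU figure are illustrative assumptions, not values from this track.

```python
SECONDS_PER_DAY = 86_400


def training_days(n_params, n_tokens, n_gpus, peak_flops_per_gpu, mfu=1.0):
    """Rough wall-clock days for a dense training run.

    Uses the approximation: total training FLOPs ~= 6 * params * tokens.
    `mfu` (model FLOPs utilisation) scales peak throughput down to what
    the run is assumed to sustain; 1.0 means perfect utilisation.
    """
    total_flops = 6 * n_params * n_tokens
    sustained_flops_per_sec = n_gpus * peak_flops_per_gpu * mfu
    return total_flops / sustained_flops_per_sec / SECONDS_PER_DAY


# 70B parameters, 1.4T tokens (20 tokens per parameter), 1000 GPUs.
ideal = training_days(70e9, 1.4e12, 1000, 990e12)
# Assumed 40% MFU, purely for illustration.
corrected = training_days(70e9, 1.4e12, 1000, 990e12, mfu=0.4)

print(f"at peak throughput: {ideal:.1f} days")
print(f"at 40% MFU (assumed): {corrected:.1f} days")
```

The MFU correction is a plain division, so halving utilisation doubles the wall-clock estimate; that is usually the first follow-up an interviewer asks for.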

The lessons

01
Backprop & autodiff from first principles
Forward / backward / Jacobian-vector products. Why reverse mode is O(forward). Memory cost of activations and the recompute / save trade-off. Common gradient-shape bugs and how to spot them in 30 seconds.
02
SGD, momentum, Adam, AdamW, Lion, schedules
Each optimizer as a different bet about loss curvature. Adam's effective LR per parameter. Why decoupled weight decay matters. Warmup, cosine, WSD schedules and what each one is solving for. Memory cost: Adam keeps two extra values per parameter (first and second moments), so 2× the parameter count in optimizer state alone.
03
BatchNorm, LayerNorm, RMSNorm, GroupNorm
BN as a statistical trick (and its serving headache). LN as a per-token fix. RMSNorm as the simplification that ate the LLM stack. Pre-norm vs post-norm and why most modern LLMs are pre-norm. Numerical traps in mixed precision.
04
Scaled dot-product attention from scratch
QKV from a single learned projection. Softmax over the right axis. Why √d_k. Causal masking. Multi-head as a low-rank trick on Q,K,V. Complexity (O(N²d)) and where it bites. MHA vs MQA vs GQA — the KV-cache calculation you'll be quizzed on.
05
Positional encodings — sinusoidal, learned, RoPE, ALiBi
Why attention is permutation-equivariant and needs a position signal. Sinusoidal as Fourier features. RoPE as a rotation on Q,K (relative position for free). ALiBi as a per-head linear bias. NTK / YaRN extrapolation tricks. What "context-length extension" actually changes.
06
The transformer block — full forward pass
Pre-norm residual stream as a "highway with edits". Why the FFN is ~4× wider. SwiGLU vs GeLU and the FLOP accounting that makes it free. Weight tying. KV cache shapes. The exact parameter count and FLOP count of a single block.
07
Tokenization — BPE, WordPiece, SentencePiece, byte-level
Why we don't train on characters or words. BPE greedy merge algorithm. The "leading-space" / "Ġ" trick that GPT-2 introduced. Why tokenization breaks math and code. RL credit-assignment at the subword level. Tokenizer-dependent benchmark gotchas.
08
Scaling laws — Kaplan, Chinchilla, μP
Compute ≈ 6 × N × D training FLOPs (N parameters, D training tokens), the back-of-envelope every interviewer expects. Chinchilla's 20:1 tokens-to-parameters ratio and why Kaplan got it wrong. μP / μ-Transfer for hyperparameter scaling without grid search. When the recipe breaks (data-constrained regime, low-resource languages).
09
Calibration & uncertainty
Confidence ≠ accuracy. Reliability diagrams, ECE, Brier score, proper scoring rules. Why deep nets are overconfident. Temperature scaling vs Platt vs isotonic. Calibration after RLHF (it tends to degrade). Predictive entropy and selective prediction.
10
Statistical testing & A/B testing for ML
Type I / II errors, power, MDE. The four formulas every PM-facing MLE memorises. CUPED variance reduction. Sequential testing pitfalls (peeking inflates Type I). Heterogeneous effects, ratio metrics, the delta method. Counterfactual evaluation when an A/B isn't possible.
11
Initialization & gradient stability
Why a deep ReLU network with N(0,1) init fails to train (activation variance grows by roughly fan_in/2 per layer). Xavier and Kaiming derived from "preserve variance under linear + nonlinear". Why residuals + LayerNorm rescued deep training. Spectral norm and weight singular values. Per-layer LR scaling under μP. Loss spikes: diagnosing one in a real run.
12
Mixture-of-Experts — sparse activation
Top-k routing. Active vs total parameters. Auxiliary load-balancing loss and why it's necessary. Capacity factor and token dropping. Expert parallelism: an AllToAll on every layer. Why MoE is throughput-optimal but memory-heavy. Mixtral, DBRX, DeepSeek-MoE design choices.
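
The KV-cache calculation mentioned under lesson 04 can be sketched as follows. This is a minimal sketch, assuming the cache stores one key and one value vector per layer, per KV head, per token; the function name `kv_cache_bytes` and the model shape (32 layers, 32 query heads, head dimension 128, 4096-token context, 2-byte elements) are hypothetical values chosen for illustration, not the configuration of any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for `batch_size` sequences.

    The leading factor 2 accounts for storing both K and V.
    Only the number of KV heads matters; query heads are not cached.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape, for illustration only.
shape = dict(n_layers=32, head_dim=128, seq_len=4096)

# MHA: one KV head per query head. GQA: query heads share a few KV heads.
# MQA: all query heads share a single KV head.
variants = {"MHA": 32, "GQA": 8, "MQA": 1}

sizes_gib = {
    name: kv_cache_bytes(n_kv_heads=kv_heads, **shape) / 2**30
    for name, kv_heads in variants.items()
}

for name, gib in sizes_gib.items():
    print(f"{name}: {gib:.4f} GiB per sequence")
```

The cache scales linearly with the number of KV heads, so under these assumptions moving from 32 KV heads to 8 cuts the per-sequence cache by 4×, and to a single head by 32×.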

What this track does NOT cover

Out of scope on purpose — they're already deep in other tracks:

This track is the prerequisite for all of the above. If you can't derive backprop on a two-layer MLP, the FSDP communication-volume calculation in system_ml/05 will not stick.

How to use this

  1. Linearly the first time. Lessons 01–03 are load-bearing for everything downstream. Skipping them makes the attention-derivation lesson harder than it needs to be.
  2. Re-derive on paper. Each lesson has a "derive this in 60 seconds" box. The interviewer wants to see the derivation, not hear the slogan.
  3. Touch the widget. Each lesson has at least one interactive visualisation. They make abstract trade-offs concrete.
  4. Read the "interview prompts" box. The questions in those boxes are the questions you will actually be asked. The prose tells you the depth of answer that distinguishes a hire from a strong-hire.

A note on what makes deep learning "fundamental"

Architectures come and go. The Mamba paper has not displaced the transformer; one day something will. What does not change is the small set of operations underneath: matmul, normalisation, residual connections, softmax, cross-entropy, autograd. If you understand those mechanically, every new architecture is a one-page diff. If you don't, every new architecture is a new thing to memorise. This track is about the mechanical understanding.