A small model can learn more from a big model's answer than from the right answer. Two lessons: first the foundations and classic techniques, then distilling generative models and the frontier. Every claim is derived, and every trade-off is stated honestly.
Three knobs organize the whole field: what the student matches (the teacher's output responses, its internal features, or the sequences it generates), how the gap is measured (temperature, forward vs reverse KL), and where the training data comes from (a fixed set, the teacher's generations, or the student's own on-policy rollouts). A concrete method is one choice per knob: Hinton KD is response × forward-KL × fixed-set; MiniLLM is sequences × reverse-KL × on-policy.
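To make one setting of the knobs concrete, here is a minimal NumPy sketch of the response × forward-KL × fixed-set corner (Hinton-style KD) for a single example. The function names, the default temperature T=2.0, and the mixing weight alpha are illustrative assumptions, not values taken from the lessons.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; a larger T gives a softer distribution."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def hinton_kd_loss(student_logits, teacher_logits, label, T=2.0, alpha=0.5):
    """Soft-target term (forward KL at temperature T) mixed with hard-label CE.

    T and alpha are illustrative defaults (assumptions), not recommendations.
    """
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # Forward KL: the expectation is taken under the teacher.
    # The T**2 factor keeps the soft-term gradient scale comparable as T changes.
    soft = (T ** 2) * kl(p_teacher, p_student)
    # Ordinary cross-entropy against the hard label, at T = 1.
    hard = -np.log(softmax(student_logits)[label] + 1e-12)
    return alpha * soft + (1 - alpha) * hard

loss = hinton_kd_loss([2.0, 0.5, -1.0], [3.0, 1.0, -2.0], label=0)
print(loss)
```

Changing the second knob means swapping the argument order in kl (reverse KL); changing the third means changing where the inputs come from, not the loss itself.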
Who this is for
You know cross-entropy, softmax, and what an autoregressive language model is. By the end you can: derive the T² gradient correction in a few lines; explain why reverse KL makes a student commit and forward KL makes it hedge; say why sequence-level KD works through a black-box API but word-level KD does not; and lay out a distill-then-prune-then-quantize deployment stack.
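As a preview of the commit-versus-hedge outcome, the toy comparison below scores two capacity-limited candidate students against a bimodal teacher. The four-token distributions are made up for illustration (an assumption, not data from the lessons); the point is only which candidate each direction of KL prefers.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions with strictly positive entries."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Hypothetical teacher with two modes (tokens 0 and 3) and a low-mass valley.
teacher = [0.48, 0.02, 0.02, 0.48]

# Two hypothetical students that cannot represent both modes sharply.
hedge = [0.25, 0.25, 0.25, 0.25]   # spreads mass over everything
commit = [0.94, 0.02, 0.02, 0.02]  # puts nearly all mass on one mode

for name, student in [("hedge", hedge), ("commit", commit)]:
    forward = kl(teacher, student)  # expectation under the teacher
    reverse = kl(student, teacher)  # expectation under the student
    print(f"{name}: forward KL = {forward:.3f}, reverse KL = {reverse:.3f}")
```

Forward KL penalizes the student for missing mass wherever the teacher has it, so it scores the hedging student better; reverse KL penalizes the student for placing mass where the teacher has little, so it scores the committing student better.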
The one picture
A hard label is a single point; a teacher is a whole distribution. Distillation moves the teacher's function into a smaller student — and the three knobs are the degrees of freedom in how you do it.
In its simplest form, distillation is SFT on a teacher's outputs, so the gpt_mini SFT lesson is the prerequisite in spirit, and on-policy distillation borrows its machinery from the RL track. Few-step diffusion samplers are distilled too (see the generative series), and a speculative-decoding draft model is often a distilled student (see the vLLM / SGLang serving tracks).