Knowledge Distillation, From First Principles

A small model can learn more from a big model's answer than from the right answer. Two lessons: the foundations and classic techniques, then distilling generative models and the frontier — every claim derived, every trade-off made honest.

Three knobs organize the whole field: what the student matches (the teacher's outputs, its internal features, or the sequences it generates), how the gap is measured (temperature, forward vs reverse KL), and where the training data comes from (a fixed set, the teacher's generations, or the student's own rollouts). A concrete method is one choice per knob — Hinton KD is response × forward-KL × fixed-set; MiniLLM is sequences × reverse-KL × on-policy.
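As a rough illustration of "one choice per knob", the sketch below writes the Hinton-style combination (response × forward-KL × fixed-set) for a single example in plain Python. The function names, the mixing weight alpha, and the default temperature are assumptions made for this sketch; they are not taken from any particular library.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def forward_kl(p, q):
    """KL(p || q): expectation under p of log(p / q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kd_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """Hypothetical single-example KD loss: soft-target term plus hard-label term."""
    # WHAT knob: match the teacher's output distribution (response distillation).
    p_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    # HOW knob: forward KL at temperature T, rescaled by T^2 so the soft-target
    # gradients stay comparable in size to the hard-label gradients.
    soft = temperature ** 2 * forward_kl(p_teacher, q_student)
    # Ordinary cross-entropy against the ground-truth label, at temperature 1.
    hard = -math.log(softmax(student_logits)[label])
    return alpha * soft + (1 - alpha) * hard

# WHERE knob: the example comes from a fixed dataset (toy logits, made up here).
print(kd_loss([1.0, 0.2, -0.5], [2.0, 0.5, -1.0], label=0))
```

Swapping forward_kl(p_teacher, q_student) for forward_kl(q_student, p_teacher) changes only the HOW knob to reverse KL; the other two choices stay as they are.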

Who this is for

You know cross-entropy, softmax, and what an autoregressive language model is. By the end you can: derive the T² gradient correction in a few lines; explain why reverse KL makes a student commit and forward KL makes it hedge; say why sequence-level KD works through a black-box API but word-level KD does not; and lay out a distill-then-prune-then-quantize deployment stack.
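For a first feel of the hedge-versus-commit claim, here is a minimal numerical sketch. The teacher distribution and the two candidate students are invented for illustration: the teacher is bimodal, and the student is assumed too small to represent both modes sharply, so it must either spread out or pick one.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Made-up bimodal teacher over five classes: mass at both ends, little in between.
teacher = [0.48, 0.01, 0.02, 0.01, 0.48]

# Two hypothetical capacity-limited students.
students = {
    "hedging": [0.20, 0.20, 0.20, 0.20, 0.20],    # covers everything
    "committed": [0.96, 0.01, 0.01, 0.01, 0.01],  # picks one mode
}

# Forward KL = KL(teacher || student): punishes the student for missing teacher mass.
forward = {name: kl(teacher, q) for name, q in students.items()}
# Reverse KL = KL(student || teacher): punishes the student for putting mass
# where the teacher has almost none.
reverse = {name: kl(q, teacher) for name, q in students.items()}

print("forward KL prefers:", min(forward, key=forward.get))
print("reverse KL prefers:", min(reverse, key=reverse.get))
```

With these numbers forward KL scores the hedging student lower and reverse KL scores the committed student lower, which is the mode-covering versus mode-seeking behavior in miniature.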

The one picture

A hard label is a single point; a teacher's output is a whole distribution over the classes. Distillation moves the teacher's function into a smaller student, and the three knobs are the degrees of freedom in how you do it.

The two lessons

Distillation foundations

The core idea and dark knowledge; soft targets and temperature, with the T² gradient correction derived from scratch and the high-T logit-matching limit; the four lenses on why it works and the capacity-gap paradox; and the WHAT knob — response, feature, and relational distillation, plus born-again / self-distillation.
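The T² correction and the logit-matching limit can be checked numerically before deriving them. The sketch below relies on the standard result that the gradient of the soft-target cross-entropy with respect to student logit z_i is (q_i - p_i) / T, where p and q are the teacher and student softmaxes at temperature T. The logit values are made up and chosen to be zero-mean, which is the assumption under which the high-T limit reduces to (z_i - v_i) / N.

```python
import math

def softmax(logits, temperature):
    """Softmax of logits / temperature, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_grad(student_logits, teacher_logits, temperature):
    """Gradient of CE(teacher_T, student_T) w.r.t. the student logits: (q - p) / T."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return [(qi - pi) / temperature for qi, pi in zip(q, p)]

student = [1.0, 0.0, -1.0]    # z, zero-mean toy logits
teacher = [2.0, -0.5, -1.5]   # v, zero-mean toy logits
n = len(student)

# High-temperature limit of the T^2-corrected gradient: plain logit matching.
logit_matching = [(z - v) / n for z, v in zip(student, teacher)]

for T in (1.0, 10.0, 100.0):
    raw = soft_target_grad(student, teacher, T)
    corrected = [T ** 2 * g for g in raw]
    print(f"T={T:6.1f}  raw={raw}  T^2 * raw={corrected}")

print("logit-matching target:", logit_matching)
```

The raw gradient shrinks roughly as 1/T² as T grows, while the T²-scaled gradient settles near the logit-matching value; that is why the soft-target term is multiplied by T² when it is mixed with a hard-label loss.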

Distilling generative models & the frontier

Word-level vs sequence-level KD and the exposure bias both inherit; the HOW knob — forward vs reverse KL, mode-covering vs mode-seeking; the WHERE knob — on-policy distillation, GKD, MiniLLM, and distillation as RL with the teacher as a dense reward; how it composes with quantization and pruning; and the frontier — reasoning traces, black-box style-vs-substance, and weak-to-strong.
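The difference between word-level and sequence-level KD shows up in what each loss needs from the teacher. The sketch below is a simplified illustration with invented function names and toy numbers: the student's per-position distributions are given directly rather than produced by a model, and the teacher's sequence is assumed to have been generated beforehand (for example by beam search).

```python
import math

def word_level_kd_loss(teacher_dists, student_dists):
    """Per-position cross-entropy against the teacher's full next-token distributions.

    Needs white-box access: one probability vector per position from the teacher.
    """
    loss = 0.0
    for p_t, q_t in zip(teacher_dists, student_dists):
        loss -= sum(p * math.log(q) for p, q in zip(p_t, q_t) if p > 0)
    return loss

def sequence_level_kd_loss(teacher_tokens, student_dists):
    """Negative log-likelihood of a teacher-generated sequence under the student.

    Needs only the token ids the teacher emitted, so a text-only API suffices.
    """
    return -sum(math.log(q_t[tok]) for tok, q_t in zip(teacher_tokens, student_dists))

# Toy setup: vocabulary of 3 tokens, sequence of 2 positions.
student_dists = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
teacher_dists = [[0.8, 0.15, 0.05], [0.05, 0.7, 0.25]]  # white-box signal
teacher_tokens = [0, 1]                                  # black-box signal

print("word-level loss:    ", word_level_kd_loss(teacher_dists, student_dists))
print("sequence-level loss:", sequence_level_kd_loss(teacher_tokens, student_dists))
```

If the teacher's distributions are replaced by one-hot vectors on the tokens it emitted, the two losses coincide, so sequence-level KD can be read as word-level KD with the teacher's uncertainty thrown away. Both variants train the student on prefixes it did not generate itself, which is where the shared exposure bias comes from.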

Where this connects

In its simplest form, distillation is SFT on a teacher's outputs, so the gpt_mini SFT lesson is the prerequisite in spirit; on-policy distillation borrows its machinery from the RL track. Few-step diffusion samplers are distilled too (the generative series), and the speculative-decoding draft model is often a distilled student (the vLLM / SGLang serving tracks).