Distilling generative models & the frontier
An LLM's output isn't one distribution — it's a tree of them. The HOW knob (which KL), the WHERE knob (whose samples), how distillation composes with quantization and pruning, and where the frontier is.
Lesson 01 covered the WHAT knob with a fixed teacher and forward KL. Generative models force the other two knobs into the open, because an autoregressive model factorizes its output over positions:
p(y|x) = ∏_{t=1..L} p(y_t | y<t, x)
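For vocabulary size V ≈ 50,000 and length L ≈ 200 there are V^L ≈ 10^940 possible sequences, far too many to enumerate. The student therefore cannot match the teacher's distribution over whole sequences directly, and there are two ways to constrain the tree instead.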
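The size of that sequence space is easy to check with a few lines of arithmetic. This is only a sanity check on the numbers quoted above; the function name is made up for illustration.

```python
import math

def log10_num_sequences(vocab_size: int, length: int) -> float:
    """Return log10(vocab_size ** length), i.e. the order of magnitude
    of the number of distinct token sequences of the given length."""
    return length * math.log10(vocab_size)

V, L = 50_000, 200
magnitude = log10_num_sequences(V, L)

# Python integers are arbitrary precision, so the exact count is also available.
num_digits = len(str(V ** L))

print(f"log10(V^L) = {magnitude:.1f}, exact count has {num_digits} digits")
```

For comparison, the number of atoms in the observable universe is usually estimated at around 10^80, so enumerating the tree is not a matter of buying more hardware.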
Word-level vs sequence-level KD
Word-level KD matches the teacher's full conditional at each position along a given prefix:
L_word = −Σ_t Σ_{v=1..V} p(v | y<t, x) log qθ(v | y<t, x)
It is cheap (one teacher forward pass) and dense (a full V-dim target per position), but white-box: you need the teacher's logits. Sequence-level KD matches the joint, approximated by sampling whole sequences from the teacher:
L_seq = −Σ_y p(y|x) log qθ(y|x) ≈ −(1/N) Σ_{n=1..N} log qθ(y^(n)|x), with y^(n) ~ p(·|x)
Its hard variant (Kim & Rush) keeps just the beam-search output ŷ, an approximation to the teacher's mode, and treats it as a label: L ≈ −log qθ(ŷ|x). That is exactly SFT on teacher-generated text: sparse, but it works black-box, through nothing but an API that returns text. This is one reason so many small chat models are trained on a bigger model's generations.
| | word-level KD | sequence-level KD |
|---|---|---|
| matches | local conditionals (each node) | the joint via samples (paths) |
| teacher access | full logits — white-box | generated text — black-box OK |
| signal density | dense (V-dim/pos) | sparse (hard labels) |
| storage | logits per token (large) | token ids only (tiny) |
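To make the contrast in the table concrete, here is a minimal sketch of the two losses on a toy example with two positions and a three-token vocabulary. The logits and helper names are invented for illustration, and both models are assumed to have been scored along the same prefix; a real implementation would use a tensor library and batch over many sequences.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def word_level_kd_loss(teacher_logits, student_logits):
    """Sum over positions of forward KL(p_t || q_t) between full conditionals.
    Needs the teacher's logits at every position (white-box)."""
    loss = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p, q = softmax(t_row), softmax(s_row)
        loss += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return loss

def hard_sequence_kd_loss(student_logits, teacher_tokens):
    """-log q(y_hat | x): ordinary cross-entropy on the teacher's decoded
    token ids. Needs only the teacher's text (black-box)."""
    return -sum(math.log(softmax(row)[tok])
                for row, tok in zip(student_logits, teacher_tokens))

# Toy data: 2 positions, vocabulary of 3 tokens (values are arbitrary).
teacher_logits = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
student_logits = [[1.0, 1.0, 0.0], [0.0, 0.5, 0.5]]
teacher_tokens = [0, 1]  # the teacher's greedy decode of the rows above

wl = word_level_kd_loss(teacher_logits, student_logits)
sl = hard_sequence_kd_loss(student_logits, teacher_tokens)
print(f"word-level KD loss: {wl:.4f}   hard sequence-level KD loss: {sl:.4f}")
```

The word-level loss consumes a full row of teacher logits per position, while the sequence-level loss consumes a single integer per position, which is the storage and access difference the table describes.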
The HOW knob: forward vs reverse KL
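Which direction of KL you minimize decides the student's personality. The two integrands differ in whose distribution the expectation is taken under:
forward: KL(p‖qθ) = Σ_y p(y) log( p(y) / qθ(y) ), an expectation under the teacher p
reverse: KL(qθ‖p) = Σ_y qθ(y) log( qθ(y) / p(y) ), an expectation under the student qθ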
- Forward KL is zero-avoiding → mode-covering. Where p>0 but qθ→0, the term blows up, so the student is punished for missing any teacher mass. It spreads to cover everything, filling the valleys between modes → hedgy, sometimes incoherent. This is the MLE / word-level default.
- Reverse KL is zero-forcing → mode-seeking. Where qθ>0 but p→0, the term blows up, so the student is punished for putting mass outside the teacher's support. It commits to a high-probability subset → crisp, confident, but can drop modes (MiniLLM's argument for higher-precision generations).
Temperature and direction are not independent: softening the teacher (T > 1) flattens its modes, which doubly hedges a forward-KL student, whereas sharpening it (T < 1) hands a reverse-KL student a cleaner peak to lock onto. JS divergence (½KL(p‖m)+½KL(qθ‖m), m=½(p+qθ)) is the symmetric middle ground that GKD uses by default.
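A small numeric example makes the mode-covering versus mode-seeking behaviour visible. The sketch below uses a made-up bimodal teacher over five tokens and two hand-picked candidate students, one that spreads mass everywhere and one that commits to a single mode; none of these numbers come from a real model.

```python
import math

def kl(a, b):
    """KL(a || b) for discrete distributions given as lists; assumes b > 0 wherever a > 0."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def jsd(a, b):
    """Jensen-Shannon divergence: symmetric, bounded above by log 2."""
    m = [(ai + bi) / 2 for ai, bi in zip(a, b)]
    return 0.5 * kl(a, m) + 0.5 * kl(b, m)

# Hypothetical teacher with two modes (tokens 0 and 4) and a valley between them.
p = [0.45, 0.04, 0.02, 0.04, 0.45]

q_cover = [0.20, 0.20, 0.20, 0.20, 0.20]  # spreads out, fills the valley
q_seek = [0.90, 0.04, 0.02, 0.02, 0.02]   # commits to one mode, drops the other

for name, q in [("cover", q_cover), ("seek", q_seek)]:
    print(f"{name:5s}  forward KL(p||q) = {kl(p, q):.3f}   "
          f"reverse KL(q||p) = {kl(q, p):.3f}   JSD = {jsd(p, q):.3f}")
```

On this toy teacher, forward KL scores the covering student as the better fit while reverse KL prefers the one-mode student, so the choice of divergence alone changes which student wins.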
The WHERE knob: on-policy distillation
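Word-level and sequence-level KD both train on prefixes the student did not write. At test time the student conditions on its own outputs, so an early mistake leads to prefixes it never saw in training and errors compound. This is exposure bias. The cure is to train the student on the sequences the student actually produces, which is DAgger with an LLM teacher: (1) sample y ~ qθ(·|x); (2) label it with the teacher's conditionals p(·|y<t,x) at every visited position; (3) train the student to match there. Train and test prefixes now match by construction, so errors get corrected instead of compounded.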
GKD (Generalized Knowledge Distillation) exposes exactly two orthogonal knobs — the divergence D and the on-policy fraction λ:
for each step:
    if rand() < λ:  y ~ q_θ(·|x)          # on-policy: student's own rollout
    else:           y ~ fixed_data(·|x)   # off-policy: gold / teacher text
    loss = Σ_t D( p(·|y<t, x) ‖ q_θ(·|y<t, x) )   # match the teacher at every visited prefix
The whole spectrum is special cases: λ=0 with D=forward KL is word-level KD; λ=1 with D=reverse KL is the MiniLLM regime; λ≈0.5 with D=JSD is the robust default. MiniLLM takes the honest reverse-KL policy gradient and tames its variance with single-step decomposition, length normalization, and teacher-mixed sampling.
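The data-source coin flip and the per-position divergence can be sketched as runnable Python. Everything here is a toy stand-in: `toy_conditional`, `teacher`, and `student` are hypothetical functions that return a distribution over a four-token vocabulary, the prompt x is omitted, and no gradient step is taken. The sketch only shows how λ selects the prefixes and how the loss is summed over them, not how GKD is implemented in any particular library.

```python
import math
import random

VOCAB = 4  # hypothetical toy vocabulary size

def toy_conditional(prefix, sharpness):
    """Hypothetical stand-in for a model: next-token distribution given a prefix."""
    logits = [sharpness * math.cos(tok + len(prefix) + sum(prefix)) for tok in range(VOCAB)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def teacher(prefix):
    return toy_conditional(prefix, sharpness=3.0)  # peakier

def student(prefix):
    return toy_conditional(prefix, sharpness=1.0)  # flatter

def forward_kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def gkd_loss(fixed_sequence, lam, divergence, rng, length=5):
    """One GKD-style loss evaluation: pick the data source, then sum D over positions."""
    if rng.random() < lam:
        # On-policy: roll out the student one token at a time.
        y = []
        for _ in range(length):
            y.append(rng.choices(range(VOCAB), weights=student(y))[0])
    else:
        # Off-policy: reuse a fixed (gold or teacher-written) sequence.
        y = list(fixed_sequence)
    # Compare teacher and student conditionals at every visited prefix y[:t].
    return sum(divergence(teacher(y[:t]), student(y[:t])) for t in range(len(y)))

rng = random.Random(0)
fixed = [0, 1, 2, 3, 0]
loss_off = gkd_loss(fixed, lam=0.0, divergence=forward_kl, rng=rng)  # always off-policy
loss_on = gkd_loss(fixed, lam=1.0, divergence=forward_kl, rng=rng)   # always on-policy
print(f"off-policy loss: {loss_off:.4f}   on-policy loss: {loss_on:.4f}")
```

Swapping `forward_kl` for a reverse-KL or JSD function changes the HOW knob without touching the sampling branch, which is the sense in which the two knobs are orthogonal.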
How distillation composes with quantization & pruning
All three shrink a model, but they cut different factors of a rough cost equation cost ≈ (#params) × (bits/param) × (work/param):
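- Quantization lowers bits/param (FP16 → INT8 → INT4). PTQ needs no retraining (calibration only); QAT recovers accuracy at low bits. Watch for activation outliers, which complicate activation quantization even at INT8, and for the sharp accuracy drop in weights below roughly 3 to 4 bits.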
- Pruning lowers #params. Unstructured pruning can reach 80–90% sparsity on some models but needs special kernels to see any speedup; structured pruning removes whole heads/channels/layers for real wall-clock gains on stock hardware.
- Distillation changes the architecture / function class itself — the only one of the three that can. Most flexible and most expensive (you train a whole new model).
Because they touch different factors they are largely orthogonal and compose multiplicatively: a production stack might distill 70B → 7B, structured-prune + fine-tune to 5B, then quantize to INT4. Distillation also doubles as the recovery step that heals the damage from aggressive quant/prune (QAT with a distillation loss; "prune then distill from the original"). Pick by bottleneck: memory-bound → quantize first; latency-bound → fewer layers (prune / distill shallower); only the teacher's text available → sequence-level distillation; absolute smallest → the full stack.
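The multiplicative claim can be checked on the example stack with back-of-the-envelope arithmetic. The sketch below counts weight storage only, as params × bits / 8 bytes; it ignores activations, the KV cache, and quantization metadata such as scales, so the real footprint will be somewhat larger. The stage labels are taken from the example above and the function name is made up.

```python
def weight_bytes(num_params: float, bits_per_param: float) -> float:
    """Weight storage only: params × bits / 8. Ignores activations, KV cache, and quantization metadata."""
    return num_params * bits_per_param / 8

stages = [
    ("teacher, 70B @ FP16",          70e9, 16),
    ("distilled, 7B @ FP16",          7e9, 16),
    ("structured-pruned, 5B @ FP16",  5e9, 16),
    ("quantized, 5B @ INT4",          5e9, 4),
]

for name, n_params, bits in stages:
    print(f"{name:30s} {weight_bytes(n_params, bits) / 1e9:7.1f} GB")

# Each technique contributes its own factor, and the factors multiply.
total_ratio = weight_bytes(70e9, 16) / weight_bytes(5e9, 4)
factor_product = (70 / 7) * (7 / 5) * (16 / 4)  # distill × prune × quantize
print(f"overall weight compression: {total_ratio:.0f}x")
```

Distillation contributes 10x, pruning 1.4x, and quantization 4x, for 56x overall on weights; none of the three could reach that alone without a much larger accuracy hit.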
The frontier
- Reasoning-trace distillation (canonical). Take a strong reasoning model (DeepSeek-R1), generate full chain-of-thought traces, and SFT a small model on them. R1-distilled small models beat same-size models trained with RL directly, because a trace is dense supervision over every intermediate step — it turns a sparse-reward problem into the next-token problem transformers learn efficiently. This is sequence-level KD, usually black-box. Limit: the student inherits the teacher's reasoning ceiling — it copies competence and mistakes, it doesn't discover new strategies.
- Black-box / synthetic-data distillation. Self-Instruct / Alpaca, Evol-Instruct, Orca: prompt a strong teacher to generate instructions, responses, and explanations, then fine-tune. The honest caveat (Gudibande et al., "The False Promise…"): imitation copies style — tone, formatting, confidence — fast, while underlying capability (factuality, reasoning) closes far more slowly, if at all. Useful, but it can fool weak judges before it is truly capable.
- Weak-to-strong generalization. Invert the setup — a weak teacher supervises a strong student. Measured by PGR (performance-gap recovered), strong students partially transcend their weak supervisor's mistakes. It is a toy model of humans overseeing superhuman models (superalignment); encouraging but well below 100% and an open research program.
- Distillation ⇄ serving. The draft model in speculative decoding is trained to match the target's distribution — a distillation objective whose acceptance rate is its quality metric. Few-step diffusion samplers are distilled the same way (progressive / consistency distillation).
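For the speculative-decoding item, the link between distribution matching and acceptance rate can be made explicit. Assuming the standard rejection rule from the speculative sampling literature (accept a drafted token x with probability min(1, p(x)/q(x)), where p is the target and q the draft), the chance that a drafted token is accepted at a given position is Σ_x min(p(x), q(x)). The rule is an assumption here, since the text above does not spell it out, and the distributions are invented for illustration.

```python
def acceptance_probability(target_probs, draft_probs):
    """Per-position acceptance probability under the accept rule min(1, p/q):
    sum_x q(x) * min(1, p(x)/q(x)) = sum_x min(p(x), q(x))."""
    return sum(min(p, q) for p, q in zip(target_probs, draft_probs))

target = [0.7, 0.2, 0.1]       # hypothetical target (large model) distribution
good_draft = [0.6, 0.3, 0.1]   # draft that tracks the target closely
poor_draft = [0.1, 0.2, 0.7]   # draft that disagrees about the top token

print(f"good draft: {acceptance_probability(target, good_draft):.2f}")
print(f"poor draft: {acceptance_probability(target, poor_draft):.2f}")
```

The quantity equals one minus the total variation distance between the two distributions, so a draft model that is better distilled toward the target gets more of its tokens accepted.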
Open questions stay open: where the capacity-gap knee sits, when a student can exceed its teacher, how to measure true capability transfer vs surface mimicry, and how much self-generated data triggers model collapse (recursive training on synthetic data narrows the distribution and erases rare events).
The whole field on three knobs
Strip away the variations and every method is one choice per knob — what to match, how to measure the gap, where the data comes from:
| method | WHAT | HOW | WHERE |
|---|---|---|---|
| Hinton KD | response (logits) | forward KL + T | fixed set |
| FitNets / RKD | features / relations | L2 / relational | fixed set |
| Sequence KD (R1-distill) | sequences | forward KL (hard) | teacher generations |
| GKD | sequences | any D (JSD default) | mix (λ) |
| MiniLLM | sequences | reverse KL (PG) | on-policy (student) |