Distilling generative models & the frontier

An LLM's output isn't one distribution — it's a tree of them. The HOW knob (which KL), the WHERE knob (whose samples), how distillation composes with quantization and pruning, and where the frontier is.

Lesson 01 covered the WHAT knob with a fixed teacher and forward KL. Generative models force the other two knobs into the open, because an autoregressive model factorizes its output over positions:

p(y | x) = ∏_{t=1}^{|y|} p(y_t | y_<t, x)

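For a vocabulary of V ≈ 50,000 tokens and a length of L ≈ 200 there are V^L possible sequences, far too many to enumerate. So there are two ways to constrain the tree: match it node by node (word-level), or match whole sampled paths (sequence-level).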
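
As a minimal sketch of the factorization above, the snippet below scores one sequence by summing per-position log-conditionals and reports the order of magnitude of V^L. The bigram table `NEXT` and both function names are hypothetical; a real LLM conditions on the whole prefix, not just the previous token.

```python
import math

# Hypothetical toy model: the next-token distribution depends only on the
# previous token. None stands for "start of sequence".
NEXT = {
    None: {"a": 0.6, "b": 0.3, "<eos>": 0.1},
    "a": {"a": 0.2, "b": 0.5, "<eos>": 0.3},
    "b": {"a": 0.4, "b": 0.1, "<eos>": 0.5},
}


def sequence_log_prob(tokens):
    """log p(y|x) = sum_t log p(y_t | y_<t, x): one tree node per position."""
    total, prev = 0.0, None
    for tok in tokens:
        total += math.log(NEXT[prev][tok])
        prev = tok
    return total


def log10_num_sequences(vocab_size, length):
    """log10 of V^L, the number of length-L paths through the tree."""
    return length * math.log10(vocab_size)


seq = ["a", "b", "<eos>"]
print(math.exp(sequence_log_prob(seq)))         # product 0.6 * 0.5 * 0.5
print(round(log10_num_sequences(50_000, 200)))  # digits in V^L, roughly
```

Working in log space keeps the product of 200 small probabilities from underflowing, which is why training losses are written as sums of log-conditionals.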

Word-level vs sequence-level KD

Word-level KD matches the teacher's full conditional at each position the prefix visits:

L_word = ∑_t KL( p(·|y_<t,x) ‖ q_θ(·|y_<t,x) )

It is cheap (one teacher forward pass), dense (a full V-dim target per position), but white-box — you need the teacher's logits. Sequence-level KD matches the joint, approximated by sampling whole sequences from the teacher:

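L_seq = KL( p(y|x) ‖ q_θ(y|x) ) ≈ −(1/N) ∑_i log q_θ( y⁽ⁱ⁾ | x ) + const, y⁽ⁱ⁾ ~ p(·|x)

The constant is the teacher's negative entropy −H(p(·|x)), which does not depend on θ, so minimizing the sampled negative log-likelihood minimizes the KL.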

Its hard variant (Kim & Rush) keeps just the beam-search mode ŷ and treats it as a label: L ≈ −log q_θ(ŷ|x). That is exactly SFT on teacher-generated text — sparse, but it works black-box, through nothing but an API that returns text. This is why most small chat models are trained on a bigger model's generations.

	word-level KD	sequence-level KD
matches	local conditionals (each node)	the joint via samples (paths)
teacher access	full logits — white-box	generated text — black-box OK
signal density	dense (V-dim/pos)	sparse (hard labels)
storage	logits per token (large)	token ids only (tiny)
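
The two objectives compared in the table can be sketched side by side. All numbers below are made up, and the function names are illustrative rather than any library's API: word-level KD needs the teacher's full distribution at each position, while the hard sequence-level variant needs only the token ids the teacher emitted.

```python
import math


def kl(p, q):
    """KL(p || q) for two distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def word_level_loss(teacher_steps, student_steps):
    """Sum over positions of KL(teacher conditional || student conditional)."""
    return sum(kl(p, q) for p, q in zip(teacher_steps, student_steps))


def sequence_level_hard_loss(student_steps, teacher_tokens):
    """-log q(y_hat | x): plain NLL on the token ids the teacher produced."""
    return -sum(math.log(q[tok]) for q, tok in zip(student_steps, teacher_tokens))


# Made-up conditionals over a 3-token vocabulary at two positions.
teacher_steps = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student_steps = [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2]]
teacher_tokens = [0, 1]  # the teacher's decoded output, as token ids

print(word_level_loss(teacher_steps, student_steps))
print(sequence_level_hard_loss(student_steps, teacher_tokens))
```

The storage row of the table shows up directly in the arguments: `teacher_steps` holds V numbers per position, `teacher_tokens` holds one integer per position.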

The crack both inherit: exposure bias

Both train on prefixes drawn from the gold or teacher distribution, but at inference the prefixes come from the student. A small early error pushes the student onto a prefix it was never trained on, the next prediction is an extrapolation, and errors compound over the sequence. This is behavior cloning, and it has a structural train/test gap — the reason the WHERE knob matters.
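
A rough back-of-the-envelope for the compounding, under a strong simplifying assumption that is not in the lesson: each step leaves the training distribution independently with the same probability, and never recovers. Real errors are correlated, so treat this as intuition only.

```python
def prob_still_on_track(per_step_error, length):
    """Chance that none of `length` steps errs, assuming independent steps."""
    return (1.0 - per_step_error) ** length


for eps in (0.001, 0.01, 0.05):
    print(eps, prob_still_on_track(eps, 200))
```

Even a 1% per-step slip rate leaves only a small chance of a clean 200-token rollout under this model, which is the gap on-policy training is meant to close.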

The HOW knob: forward vs reverse KL

Which direction of KL you minimize decides the student's personality. The two integrands differ in whose distribution the expectation is taken under:

forward: KL(p‖q_θ) = ∑ p log(p/q_θ) | reverse: KL(q_θ‖p) = ∑ q_θ log(q_θ/p)

Forward KL is zero-avoiding → mode-covering. Where p>0 but q_θ→0, the term blows up, so the student is punished for missing any teacher mass. It spreads to cover everything, filling the valleys between modes → hedgy, sometimes incoherent. This is the MLE / word-level default.
Reverse KL is zero-forcing → mode-seeking. Where q_θ>0 but p→0, the term blows up, so the student is punished for putting mass outside the teacher's support. It commits to a high-probability subset → crisp, confident, but can drop modes (MiniLLM's argument for higher-precision generations).
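
The two behaviors can be checked on a toy example. The teacher below has two modes separated by a valley; the two candidate students (both made up for illustration) either spread over everything or commit to one mode.

```python
import math


def kl(p, q):
    """KL(p || q); terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [0.495, 0.01, 0.495]   # two modes with a valley between them
q_cover = [1 / 3, 1 / 3, 1 / 3]  # spreads over everything, fills the valley
q_seek = [0.98, 0.01, 0.01]      # commits to the left mode, drops the right

for name, q in (("cover", q_cover), ("seek", q_seek)):
    forward = kl(teacher, q)  # expectation under the teacher
    reverse = kl(q, teacher)  # expectation under the student
    print(name, "forward", round(forward, 3), "reverse", round(reverse, 3))
```

With these numbers forward KL ranks the covering student as the better fit, while reverse KL ranks the mode-seeking student as the better fit: the same pair of candidates, opposite verdicts.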

Temperature and direction are not independent: softening the teacher flattens its modes, which doubly hedges a forward-KL student but hands a reverse-KL student a cleaner peak to lock onto. JS divergence (½KL(p‖m) + ½KL(q_θ‖m), m = ½(p + q_θ)) is the symmetric middle ground that GKD uses by default.
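
A small sketch of the two ingredients in this paragraph: temperature softening of the teacher and the JS divergence. The logits and the student distribution are made up; `softmax`, `kl`, and `jsd` are plain helper functions written for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / T; a larger T gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


teacher_logits = [4.0, 1.0, 0.0]
sharp = softmax(teacher_logits, temperature=1.0)
soft = softmax(teacher_logits, temperature=4.0)
student = [0.2, 0.5, 0.3]

print(max(sharp), max(soft))                       # the peak shrinks as T grows
print(jsd(sharp, student), jsd(student, sharp))    # same value both ways
```

Because the mixture m is nonzero wherever either input is, JSD stays finite (at most ln 2) even when one distribution puts zero mass where the other does not, which neither KL direction guarantees.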

The catch

Forward KL takes its expectation under the fixed teacher: sample once, done. Reverse KL takes it under the student, the very thing being trained, so the samples must be regenerated as the student changes (this is what makes reverse-KL distillation on-policy) and the gradient needs a policy-gradient (REINFORCE) estimator with r(z) = log(p(z)/q_θ(z)) as the reward, the negative of the reverse-KL integrand. That estimator is high-variance, which is why it needs the machinery below.
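
To make the estimator concrete, here is a minimal sketch on a single categorical distribution (one tree node, not a full sequence). It assumes the student is a softmax over free parameters `theta`, samples z from the student, and weights the score ∇log q_θ(z) by log(q_θ(z)/p(z)) to estimate the gradient of KL(q_θ‖p). The teacher, the parameters, and the function names are all made up for illustration.

```python
import math
import random


def softmax(theta):
    top = max(theta)
    exps = [math.exp(t - top) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def exact_reverse_kl_grad(theta, p):
    """Closed form for a softmax student: dKL/dtheta_k = q_k * (c_k - KL)."""
    q = softmax(theta)
    c = [math.log(qk / pk) for qk, pk in zip(q, p)]
    kl_value = sum(qk * ck for qk, ck in zip(q, c))
    return [qk * (ck - kl_value) for qk, ck in zip(q, c)]


def reinforce_reverse_kl_grad(theta, p, num_samples, rng):
    """Score-function estimate of the same gradient from student samples."""
    q = softmax(theta)
    grad = [0.0] * len(theta)
    for _ in range(num_samples):
        z = rng.choices(range(len(q)), weights=q)[0]  # sample from the student
        weight = math.log(q[z] / p[z])                # reverse-KL integrand at z
        for k in range(len(theta)):
            score_k = (1.0 if k == z else 0.0) - q[k]  # d log q(z) / d theta_k
            grad[k] += weight * score_k / num_samples
    return grad


teacher = [0.7, 0.2, 0.1]
theta = [0.0, 0.5, -0.5]
rng = random.Random(0)
print(exact_reverse_kl_grad(theta, teacher))
print(reinforce_reverse_kl_grad(theta, teacher, 20_000, rng))  # many samples
print(reinforce_reverse_kl_grad(theta, teacher, 20, rng))      # few samples: noisy
```

On a three-way choice the exact gradient is available, so the noise is easy to see by comparison. Over V^L sequences only the sampled estimate is feasible, and its variance grows with sequence length.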

The WHERE knob: on-policy distillation

The cure for exposure bias is to train the student on the sequences the student actually produces — this is DAgger with an LLM teacher: (1) sample y ~ q_θ(·|x); (2) label it with the teacher's conditionals p(·|y_<t,x) at every visited position; (3) train the student to match there. Train and test prefixes now match by construction, so errors get corrected instead of compounded.

GKD (Generalized Knowledge Distillation) exposes exactly two orthogonal knobs — the divergence D and the on-policy fraction λ:

for each step:
    if rand() < λ:   y ~ q_θ(·|x)        # on-policy: student's own rollout
    else:            y ~ fixed_data(·|x) # off-policy: gold / teacher text
    loss = Σ_t D( p(·|y_<t,x) ‖ q_θ(·|y_<t,x) )   # teacher vs student at every visited prefix



The whole spectrum is special cases: λ=0, D=forward KL is word-level KD; λ=1, D=reverse KL is the MiniLLM regime; λ≈0.5, D=JSD is the robust default. MiniLLM takes the honest reverse-KL policy gradient and tames its variance with single-step decomposition, length normalization, and teacher-mixed sampling.
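
The loop above can be sketched end to end on a toy problem. Everything here is hypothetical: `teacher_probs` and `student_probs` are hand-written lookup rules standing in for real models, and the sketch stops at the loss value rather than taking a gradient step.

```python
import math
import random

VOCAB_SIZE = 3
SEQ_LEN = 4


def teacher_probs(prefix):
    """Hypothetical teacher: strongly prefers repeating the previous token."""
    if not prefix:
        return [0.6, 0.3, 0.1]
    probs = [0.1] * VOCAB_SIZE
    probs[prefix[-1]] = 0.8
    return probs


def student_probs(prefix):
    """Hypothetical student: a blurrier version of the same habit."""
    if not prefix:
        return [0.4, 0.4, 0.2]
    probs = [0.25] * VOCAB_SIZE
    probs[prefix[-1]] = 0.5
    return probs


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def forward_kl(p, q):
    return kl(p, q)


def reverse_kl(p, q):
    return kl(q, p)


def jsd(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def rollout(probs_fn, rng):
    """Sample a sequence token by token from the given model."""
    y = []
    for _ in range(SEQ_LEN):
        y.append(rng.choices(range(VOCAB_SIZE), weights=probs_fn(y))[0])
    return y


def gkd_loss(lam, divergence, fixed_data, rng):
    """Pick WHERE the prefixes come from, then apply D at every position."""
    if rng.random() < lam:
        y = rollout(student_probs, rng)  # on-policy: student's own rollout
    else:
        y = rng.choice(fixed_data)       # off-policy: gold / teacher text
    return sum(
        divergence(teacher_probs(y[:t]), student_probs(y[:t]))
        for t in range(len(y))
    )


rng = random.Random(0)
fixed_data = [rollout(teacher_probs, rng) for _ in range(8)]
print(gkd_loss(0.0, forward_kl, fixed_data, rng))  # word-level KD corner
print(gkd_loss(1.0, reverse_kl, fixed_data, rng))  # MiniLLM-style corner
print(gkd_loss(0.5, jsd, fixed_data, rng))         # mixed setting
```

The divergence and the data source are separate arguments, which is the point of the two-knob framing: changing one never requires touching the other.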


  Distillation as RL without a reward model
  On-policy distillation is RL with the teacher as a dense, per-token reward (r_t ≈ log p(ŷ_t|ŷ_<t,x), or the negative per-step KL). It is RLHF where the teacher's logits replace the reward model: a dense signal at every token instead of one sparse end-of-sequence score. The cost is real: generating from the student inside the training loop is slow and sequential, whereas off-policy teacher data is generated once and reused. You buy exposure-bias robustness with wall-clock time.
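
A small illustration of what "dense" buys, using a hypothetical teacher rule and a hand-picked student rollout. The sparse version below is only a stand-in for an end-of-sequence score: it delivers the same total once, at the last token.

```python
import math


def teacher_probs(prefix):
    """Hypothetical teacher over 3 tokens: prefers repeating the last token."""
    if not prefix:
        return [0.6, 0.3, 0.1]
    probs = [0.1, 0.1, 0.1]
    probs[prefix[-1]] = 0.8
    return probs


def dense_rewards(student_tokens):
    """r_t = log p(y_t | y_<t): the teacher scores every token of the rollout."""
    return [
        math.log(teacher_probs(student_tokens[:t])[tok])
        for t, tok in enumerate(student_tokens)
    ]


def sparse_rewards(student_tokens):
    """The same total, delivered only at the final position."""
    total = sum(dense_rewards(student_tokens))
    return [0.0] * (len(student_tokens) - 1) + [total]


rollout = [0, 0, 2, 2]  # the student switches tokens at position 2
print(dense_rewards(rollout))
print(sparse_rewards(rollout))
```

The dense list singles out position 2 as the step the teacher found unlikely; the sparse list carries the same total but gives no hint about where the rollout went wrong.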

How distillation composes with quantization & pruning

All three shrink a model, but they cut different factors of a rough cost equation cost ≈ (#params) × (bits/param) × (work/param):


  Quantization lowers bits/param (FP16 → INT8 → INT4). No retraining for PTQ (calibration only); QAT recovers accuracy at low bits. Watch the activation-outlier problem below ~3 bits.
  Pruning lowers #params. Unstructured pruning hits 80–90% sparsity but needs special kernels to see any speedup; structured pruning removes whole heads/channels/layers for real wall-clock gains on stock hardware.
  Distillation changes the architecture / function class itself — the only one of the three that can. Most flexible and most expensive (you train a whole new model).


Because they touch different factors they are largely orthogonal and compose multiplicatively: a production stack might distill 70B → 7B, structured-prune + fine-tune to 5B, then quantize to INT4. Distillation also doubles as the recovery step that heals the damage from aggressive quant/prune (QAT with a distillation loss; "prune then distill from the original"). Pick by bottleneck: memory-bound → quantize first; latency-bound → fewer layers (prune / distill shallower); only the teacher's text available → sequence-level distillation; absolute smallest → the full stack.
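
The multiplicative composition is easy to check with arithmetic on the example stack in this paragraph. The sketch counts weight storage only; it ignores activations, the KV cache, and the per-group scales that real INT4 formats store, so actual footprints will be somewhat larger.

```python
def weight_gigabytes(num_params, bits_per_param):
    """Weight storage in GB (10^9 bytes): params * bits / 8."""
    return num_params * bits_per_param / 8 / 1e9


stages = [
    ("teacher, FP16", 70e9, 16),
    ("distilled, FP16", 7e9, 16),
    ("pruned, FP16", 5e9, 16),
    ("quantized, INT4", 5e9, 4),
]

for name, params, bits in stages:
    print(f"{name:>16}: {weight_gigabytes(params, bits):6.1f} GB")

first = weight_gigabytes(stages[0][1], stages[0][2])
last = weight_gigabytes(stages[-1][1], stages[-1][2])
print("total reduction:", first / last)  # (70/7) * (7/5) * (16/4)
```

Each stage contributes its own factor (10× from distillation, 1.4× from pruning, 4× from quantization), and the factors multiply because they act on different terms of the cost equation.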

The frontier


  Reasoning-trace distillation (canonical). Take a strong reasoning model (DeepSeek-R1), generate full chain-of-thought traces, and SFT a small model on them. R1-distilled small models beat same-size models trained with RL directly, because a trace is dense supervision over every intermediate step — it turns a sparse-reward problem into the next-token problem transformers learn efficiently. This is sequence-level KD, usually black-box. Limit: the student inherits the teacher's reasoning ceiling — it copies competence and mistakes, it doesn't discover new strategies.
  Black-box / synthetic-data distillation. Self-Instruct / Alpaca, Evol-Instruct, Orca: prompt a strong teacher to generate instructions, responses, and explanations, then fine-tune. The honest caveat (Gudibande et al., "The False Promise…"): imitation copies style — tone, formatting, confidence — fast, while underlying capability (factuality, reasoning) closes far more slowly, if at all. Useful, but it can fool weak judges before it is truly capable.
  Weak-to-strong generalization. Invert the setup — a weak teacher supervises a strong student. Measured by PGR (performance-gap recovered), strong students partially transcend their weak supervisor's mistakes. It is a toy model of humans overseeing superhuman models (superalignment); encouraging but well below 100% and an open research program.
  Distillation ⇄ serving. The draft model in speculative decoding is trained to match the target's distribution — a distillation objective whose acceptance rate is its quality metric. Few-step diffusion samplers are distilled the same way (progressive / consistency distillation).
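
Two of the metrics named in the list above have short closed forms, sketched here with made-up numbers. PGR is the fraction of the weak-to-ceiling gap that the weakly supervised strong student closes. For the standard speculative-sampling accept/reject rule, the per-token acceptance probability is the overlap between the draft and target distributions. The function names are illustrative.

```python
def performance_gap_recovered(weak, weak_to_strong, strong_ceiling):
    """PGR = (weak-to-strong - weak) / (strong ceiling - weak)."""
    return (weak_to_strong - weak) / (strong_ceiling - weak)


def expected_acceptance(target, draft):
    """Per-token acceptance probability: sum_x min(p(x), q(x))."""
    return sum(min(p, q) for p, q in zip(target, draft))


# Made-up accuracies: weak supervisor, strong student trained on weak labels,
# and the same strong model trained on ground truth.
print(performance_gap_recovered(weak=0.60, weak_to_strong=0.70, strong_ceiling=0.80))

# Made-up next-token distributions for a target model and its draft.
target = [0.7, 0.2, 0.1]
draft = [0.5, 0.3, 0.2]
print(expected_acceptance(target, draft))
```

The overlap equals one minus the total-variation distance between the two distributions, so any distillation objective that pulls the draft toward the target raises the acceptance rate.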


Open questions stay open: where the capacity-gap knee sits, when a student can exceed its teacher, how to measure true capability transfer vs surface mimicry, and how much self-generated data triggers model collapse (recursive training on synthetic data narrows the distribution and erases rare events).
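
The collapse mechanism in the last clause can be seen in a toy simulation: each generation draws a finite dataset from the current model and refits by counting. This is a deliberately crude stand-in for recursive training (a categorical distribution, no real model), with made-up starting probabilities.

```python
import random


def next_generation(probs, num_samples, rng):
    """Sample a finite dataset from the current model, then refit by counting."""
    counts = [0] * len(probs)
    for idx in rng.choices(range(len(probs)), weights=probs, k=num_samples):
        counts[idx] += 1
    return [c / num_samples for c in counts]


def support_sizes(probs, num_samples, generations, rng):
    """Number of outcomes with nonzero probability after each generation."""
    sizes = [sum(p > 0 for p in probs)]
    for _ in range(generations):
        probs = next_generation(probs, num_samples, rng)
        sizes.append(sum(p > 0 for p in probs))
    return sizes


start = [0.5, 0.3, 0.15, 0.04, 0.01]  # two rare events in the tail
print(support_sizes(start, num_samples=50, generations=30, rng=random.Random(0)))
```

Once an outcome draws zero samples its probability becomes exactly zero and it can never return, so the support can only shrink. Rare events are the first to go because they are the most likely to be missed in a finite sample.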

The whole field on three knobs

Strip away the variations and every method is one choice per knob — what to match, how to measure the gap, where the data comes from:


method	WHAT	HOW	WHERE
Hinton KD	response (logits)	forward KL + T	fixed set
FitNets / RKD	features / relations	L2 / relational	fixed set
Sequence KD (R1-distill)	sequences	forward KL (hard)	teacher generations
GKD	sequences	any D (JSD default)	mix (λ)
MiniLLM	sequences	reverse KL (PG)	on-policy (student)



  Takeaway
  A generative teacher is a tree of distributions. Word-level KD matches each node (dense, white-box); sequence-level KD is SFT on the teacher's text (sparse, black-box) — both off-policy, both exposure-biased. Forward KL makes the student hedge (mode-covering); reverse KL makes it commit (mode-seeking) but needs student samples, so it is on-policy. GKD unifies the space with two knobs (divergence × on-policy fraction); MiniLLM is its reverse-KL corner, and the whole thing is RL with the teacher as a dense reward. Distillation changes the architecture, so it composes multiplicatively with quantization (bits) and pruning (params) and heals their damage. At the frontier, reasoning traces are dense supervision that works, black-box imitation transfers style before substance, and weak-to-strong inverts the arrow for alignment.


  
