Distilling generative models & the frontier

An LLM's output isn't one distribution — it's a tree of them. The HOW knob (which KL), the WHERE knob (whose samples), how distillation composes with quantization and pruning, and where the frontier is.

Lesson 01 covered the WHAT knob with a fixed teacher and forward KL. Generative models force the other two knobs into the open, because an autoregressive model factorizes its output over positions:

$$p(y \mid x) = \prod_{t=1}^{|y|} p(y_t \mid y_{<t}, x)$$

For vocab V ≈ 50,000 and length L ≈ 200 there are $V^L$ possible sequences, far too many to enumerate. Since the tree cannot be matched exhaustively, you can constrain it in two ways.
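To get a feel for the size of that tree, the count can be computed exactly with Python's arbitrary-precision integers. This is only a back-of-the-envelope check: V and L are the rough figures quoted above, not properties of any particular model.

```python
import math

V = 50_000   # approximate vocabulary size quoted in the text
L = 200      # approximate sequence length quoted in the text

num_sequences = V ** L                  # exact big integer
num_digits = len(str(num_sequences))    # number of decimal digits
log10_count = L * math.log10(V)         # same magnitude via logarithms

print(f"V^L has {num_digits} decimal digits (log10 is about {log10_count:.1f})")
```

For scale, the number of atoms in the observable universe is usually quoted at around 10^80, so no amount of sampling covers more than a vanishing fraction of the tree.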

Word-level vs sequence-level KD

Word-level KD matches the teacher's full next-token conditional at every position along a fixed training sequence:

$$L_{\text{word}} = \sum_t \mathrm{KL}\big(p(\cdot \mid y_{<t}, x) \,\|\, q_\theta(\cdot \mid y_{<t}, x)\big)$$

It is cheap (one teacher forward pass), dense (a full V-dim target per position), but white-box — you need the teacher's logits. Sequence-level KD matches the joint, approximated by sampling whole sequences from the teacher:

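$$L_{\text{seq}} = \mathrm{KL}\big(p(y \mid x) \,\|\, q_\theta(y \mid x)\big) \approx -\frac{1}{N} \sum_{i=1}^{N} \log q_\theta\big(y^{(i)} \mid x\big) + \text{const}, \qquad y^{(i)} \sim p(\cdot \mid x)$$

The constant is the teacher's negative entropy, which does not depend on θ and so drops out of the gradient.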

Its hard variant (Kim & Rush) keeps just the beam-search output ŷ, an approximation of the teacher's mode, and treats it as a label: L ≈ −log qθ(ŷ|x). That is exactly SFT on teacher-generated text. The signal is sparse, but it works black-box, through nothing but an API that returns text. This is why many small chat models are trained on a bigger model's generations.
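The two losses can be compared side by side on a toy example. The sketch below is illustrative only: the distributions are made up, the vocabulary has three tokens, and the helper names (`word_level_loss`, `hard_sequence_loss`) are hypothetical rather than taken from any library.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def word_level_loss(teacher_conds, student_conds):
    """Sum over positions of KL(teacher conditional || student conditional)."""
    return sum(kl(p, q) for p, q in zip(teacher_conds, student_conds))

def hard_sequence_loss(student_conds, teacher_tokens):
    """Negative log-likelihood of the teacher's tokens under the student (SFT)."""
    return -sum(math.log(q[tok]) for q, tok in zip(student_conds, teacher_tokens))

# Hypothetical next-token distributions at 2 positions over a 3-token vocab.
teacher_conds = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
student_conds = [[0.5, 0.3, 0.2], [0.2, 0.5, 0.3]]
teacher_tokens = [0, 1]   # the text the teacher generated: all a black-box API returns

print("word-level KD loss:      ", word_level_loss(teacher_conds, student_conds))
print("hard sequence-level loss:", hard_sequence_loss(student_conds, teacher_tokens))
```

The word-level loss consumes all three teacher probabilities at each position, while the hard sequence-level loss only ever sees one token id per position. That is the dense versus sparse distinction in miniature.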

| | word-level KD | sequence-level KD |
|---|---|---|
| matches | local conditionals (each node) | the joint via samples (paths) |
| teacher access | full logits (white-box) | generated text (black-box OK) |
| signal density | dense (V-dim/pos) | sparse (hard labels) |
| storage | logits per token (large) | token ids only (tiny) |

The crack both inherit: exposure bias
Both train on prefixes drawn from the gold or teacher distribution, but at inference the prefixes come from the student. A small early error pushes the student onto a prefix it was never trained on, the next prediction is an extrapolation, and errors compound over the sequence. This is behavior cloning, and it has a structural train/test gap — the reason the WHERE knob matters.
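A rough calculation shows how quickly small errors add up. The sketch below assumes each token independently has probability `eps` of pushing the student off the training prefixes. That independence assumption is a simplification introduced here for illustration, not a claim from the lesson, and real errors are correlated.

```python
def prob_still_on_distribution(eps: float, length: int) -> float:
    """Chance that no step has left the training prefixes, assuming an
    independent per-token slip probability eps (a simplifying assumption)."""
    return (1.0 - eps) ** length

for eps in (0.001, 0.01, 0.05):
    p = prob_still_on_distribution(eps, 200)
    print(f"eps={eps}: P(no slip in 200 tokens) = {p:.3f}")
```

Even a 1% per-token slip rate leaves only about a 13% chance of finishing a 200-token sequence without ever visiting an unseen prefix, which is why the mismatch cannot be ignored for long generations.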

The HOW knob: forward vs reverse KL

Which direction of KL you minimize decides the student's personality. The two integrands differ in whose distribution the expectation is taken under:

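$$\text{forward: } \mathrm{KL}(p \,\|\, q_\theta) = \sum_z p(z) \log \frac{p(z)}{q_\theta(z)} \qquad\qquad \text{reverse: } \mathrm{KL}(q_\theta \,\|\, p) = \sum_z q_\theta(z) \log \frac{q_\theta(z)}{p(z)}$$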

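Forward KL weights the log-ratio by the teacher, so the student is penalized wherever the teacher has mass that it fails to cover: it spreads out and hedges (mode-covering). Reverse KL weights it by the student, so the student is penalized for putting mass where the teacher has little: it commits to a peak (mode-seeking). Temperature and direction are not independent: softening the teacher flattens its modes, which doubly hedges a forward-KL student but hands a reverse-KL student a cleaner peak to lock onto. JS divergence (½KL(p‖m)+½KL(qθ‖m), m=½(p+qθ)) is the symmetric middle ground that GKD uses by default.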
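The difference between the two directions shows up even on a three-outcome toy problem. In the sketch below the teacher has two modes separated by a valley, and two hand-picked candidate students are scored under each divergence. All distributions are made up for illustration.

```python
import math

def kl(a, b):
    """KL(a || b) for discrete distributions given as equal-length lists."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the midpoint m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

teacher = [0.49, 0.02, 0.49]     # two modes with a valley between them
hedger = [0.30, 0.40, 0.30]      # spreads mass over both modes and the valley
committer = [0.96, 0.02, 0.02]   # locks onto the first mode only

for name, q in (("hedger", hedger), ("committer", committer)):
    print(f"{name:9s} forward={kl(teacher, q):.3f} "
          f"reverse={kl(q, teacher):.3f} jsd={jsd(teacher, q):.3f}")
```

Forward KL scores the hedger as the better student because the committer ignores half of the teacher's mass. Reverse KL scores the committer as better because the hedger puts most of its mass in the valley where the teacher has almost none.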

The catch
Forward KL takes its expectation under the fixed teacher: sample once, done. Reverse KL takes it under the student, the very thing being trained, so samples must be regenerated as the student changes and the gradient needs a policy-gradient (REINFORCE) estimator with r(z) = log(p(z)/qθ(z)) as the reward (equivalently, log(qθ(z)/p(z)) as the cost to minimize). That estimator is high-variance, which is why reverse-KL distillation, on-policy by construction, needs the machinery below.
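The score-function estimator can be checked on a single categorical distribution, where the exact gradient is available in closed form. The sketch below is a toy with made-up numbers: the student is a softmax over three logits, and the per-sample signal is the log-ratio log(qθ(z)/p(z)), treated as a cost because reverse KL is being minimized. Function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def exact_reverse_kl_grad(logits, p):
    """Exact d KL(q || p) / d logits, where q = softmax(logits)."""
    q = softmax(logits)
    kl = sum(qk * math.log(qk / pk) for qk, pk in zip(q, p))
    return [qk * (math.log(qk / pk) - kl) for qk, pk in zip(q, p)]

def score_function_grad(logits, p, z):
    """Single-sample REINFORCE estimate from one student sample z ~ q.

    cost(z) = log(q(z) / p(z)); grad of log q(z) w.r.t. logits = onehot(z) - q.
    """
    q = softmax(logits)
    cost = math.log(q[z] / p[z])
    return [cost * ((1.0 if k == z else 0.0) - qk) for k, qk in enumerate(q)]

teacher = [0.6, 0.3, 0.1]    # hypothetical teacher distribution
logits = [0.0, 0.5, -0.5]    # hypothetical student parameters
q = softmax(logits)

# Averaging the estimator over z ~ q (done exactly here, by enumeration)
# recovers the exact gradient, so the estimator is unbiased.
expected = [sum(q[z] * score_function_grad(logits, teacher, z)[k] for z in range(3))
            for k in range(3)]
print("exact   :", exact_reverse_kl_grad(logits, teacher))
print("expected:", expected)

# The individual single-sample estimates differ a lot from one another;
# that spread is the variance the text refers to.
for z in range(3):
    print(f"z={z}    :", score_function_grad(logits, teacher, z))
```

With only three outcomes the expectation can be enumerated, so no sampling noise is involved. For sequences the sum runs over the whole tree and has to be estimated from a handful of rollouts, which is where the variance becomes a practical problem.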

The WHERE knob: on-policy distillation

The cure for exposure bias is to train the student on the sequences the student actually produces — this is DAgger with an LLM teacher: (1) sample y ~ qθ(·|x); (2) label it with the teacher's conditionals p(·|y<t,x) at every visited position; (3) train the student to match there. Train and test prefixes now match by construction, so errors get corrected instead of compounded.

GKD (Generalized Knowledge Distillation) exposes exactly two orthogonal knobs — the divergence D and the on-policy fraction λ:

for each step:
    if rand() < λ:   y ~ q_θ(·|x)        # on-policy: student's own rollout
    else:            y ~ fixed_data(·|x) # off-policy: gold / teacher text
    loss = Σ_t D( p(·|y_<t, x) ‖ q_θ(·|y_<t, x) )   # match the teacher at every visited position

The familiar methods are special cases: λ=0, D=forward KL is word-level KD; λ=1, D=reverse KL is the MiniLLM regime; λ≈0.5, D=JSD is the robust default. MiniLLM takes the full sequence-level reverse-KL policy gradient and tames its variance with single-step decomposition, length normalization, and teacher-mixed sampling.
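The data-source switch and the per-position loss can be made concrete with a toy. The sketch below is not the GKD reference implementation: it uses hypothetical bigram "models" (the next-token distribution depends only on the previous token), a three-token vocabulary, and made-up probabilities, and it computes the loss for one step without any gradient update.

```python
import math
import random

VOCAB = 3  # toy vocabulary; token 0 doubles as the start symbol

def kl(a, b):
    """KL(a || b) for discrete distributions given as lists."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def sample_rollout(model, length, rng):
    """Sample tokens autoregressively from a bigram model: model[prev] -> probs."""
    prev, out = 0, []
    for _ in range(length):
        prev = rng.choices(range(VOCAB), weights=model[prev], k=1)[0]
        out.append(prev)
    return out

def gkd_step_loss(teacher, student, fixed_data, lam, divergence, rng, length=4):
    """Loss for one step: choose the data source, then sum D over visited prefixes."""
    if rng.random() < lam:
        y, source = sample_rollout(student, length, rng), "on-policy"
    else:
        y, source = rng.choice(fixed_data), "off-policy"
    prefixes = [0] + y[:-1]   # in a bigram model the prefix is just the previous token
    loss = sum(divergence(teacher[prev], student[prev]) for prev in prefixes)
    return loss, source

teacher = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
student = [[0.6, 0.2, 0.2], [0.3, 0.4, 0.3], [0.2, 0.2, 0.6]]
fixed_data = [[0, 0, 1, 1], [1, 1, 2, 2]]   # stand-in for gold / teacher text
rng = random.Random(0)

def forward_kl(p, q):
    return kl(p, q)

def reverse_kl(p, q):
    return kl(q, p)

print(gkd_step_loss(teacher, student, fixed_data, lam=0.0, divergence=forward_kl, rng=rng))
print(gkd_step_loss(teacher, student, fixed_data, lam=1.0, divergence=reverse_kl, rng=rng))
```

In a real training loop the loss would be differentiated with respect to the student's parameters at the visited positions, with the sampled tokens treated as fixed data. The toy only shows which prefixes get visited and what is compared there.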

Distillation as RL without a reward model
On-policy distillation is RL with the teacher as a dense, per-token reward (r_t ≈ log p(ŷ_t | ŷ_<t, x), or the negative per-step KL). It is RLHF where the teacher's logits replace the reward model: a dense signal at every token instead of one sparse end-of-sequence score. The cost is real: generating from the student inside the training loop is slow and sequential, whereas off-policy teacher data is generated once and reused. You buy exposure-bias robustness with wall-clock time.
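The contrast between a dense and a sparse signal is easy to see on a toy rollout. The sketch below scores a pretend student sample under a hypothetical bigram teacher (made-up probabilities, token 0 as the start symbol). The per-token rewards are the teacher's log-probabilities, and their sum is the single sequence-level score.

```python
import math

def per_token_rewards(teacher, rollout, start=0):
    """Teacher log-probability of each token in a student rollout (bigram toy)."""
    rewards, prev = [], start
    for tok in rollout:
        rewards.append(math.log(teacher[prev][tok]))
        prev = tok
    return rewards

teacher = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]  # hypothetical
student_rollout = [0, 0, 2, 2]   # pretend the student sampled this

dense = per_token_rewards(teacher, student_rollout)
sparse = sum(dense)              # one end-of-sequence score: log p(y | x)

print("per-token rewards:   ", [round(r, 3) for r in dense])
print("sequence-level score:", round(sparse, 3))
```

The per-token view pins the low reward on the third token, where the rollout took a transition the teacher finds unlikely. The single sequence-level number only says that something, somewhere, went wrong.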

How distillation composes with quantization & pruning

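All three shrink a model, but they cut different factors of a rough cost equation, cost ≈ (#params) × (bits/param) × (work/param): quantization cuts bits/param, pruning cuts #params, and distillation swaps in a different, smaller architecture.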

Because they touch different factors they are largely orthogonal and compose multiplicatively: a production stack might distill 70B → 7B, structured-prune + fine-tune to 5B, then quantize to INT4. Distillation also doubles as the recovery step that heals the damage from aggressive quant/prune (QAT with a distillation loss; "prune then distill from the original"). Pick by bottleneck: memory-bound → quantize first; latency-bound → fewer layers (prune / distill shallower); only the teacher's text available → sequence-level distillation; absolute smallest → the full stack.
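The multiplicative effect on weight memory for the example stack can be worked out directly. The sketch below counts weight storage only and assumes 16-bit weights before quantization. That precision is an assumption made here for illustration, since the text gives only the parameter counts and the final INT4 format.

```python
def weight_gigabytes(params: float, bits_per_param: int) -> float:
    """Weight storage only, in GB (1 GB = 1e9 bytes); ignores activations and KV cache."""
    return params * bits_per_param / 8 / 1e9

stages = [
    ("teacher, 70B @ 16-bit (assumed)", 70e9, 16),
    ("distilled, 7B @ 16-bit (assumed)", 7e9, 16),
    ("pruned, 5B @ 16-bit (assumed)", 5e9, 16),
    ("quantized, 5B @ INT4", 5e9, 4),
]
for name, params, bits in stages:
    print(f"{name:34s} {weight_gigabytes(params, bits):7.1f} GB")

shrink = weight_gigabytes(70e9, 16) / weight_gigabytes(5e9, 4)
print(f"overall weight-memory reduction: {shrink:.0f}x")
```

The overall factor is the product of the parameter reduction (70B to 5B, 14x) and the precision reduction (16 bits to 4 bits, 4x). Latency does not scale this neatly, since it depends on the hardware and kernels involved.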

The frontier

Several questions remain open: where the capacity-gap knee sits, when a student can exceed its teacher, how to measure true capability transfer versus surface mimicry, and how much self-generated data it takes to trigger model collapse (recursive training on synthetic data narrows the distribution and erases rare events).
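The narrowing mechanism behind model collapse can be illustrated with a deliberately crude toy: a categorical distribution that is repeatedly re-estimated from a finite sample of its own output. This is an illustration under strong simplifying assumptions (no model, no optimizer, made-up probabilities), not a simulation of LLM training.

```python
import random

def refit_on_own_samples(probs, n_samples, rng):
    """Draw n_samples from probs, then return the empirical distribution."""
    draws = rng.choices(range(len(probs)), weights=probs, k=n_samples)
    return [draws.count(k) / n_samples for k in range(len(probs))]

rng = random.Random(0)
probs = [0.5, 0.3, 0.15, 0.04, 0.01]   # hypothetical distribution with a rare tail
support_sizes = []
for generation in range(30):
    support_sizes.append(sum(p > 0 for p in probs))
    probs = refit_on_own_samples(probs, n_samples=50, rng=rng)

print("support size per generation:", support_sizes)
```

Once an outcome fails to appear in a sample its estimated probability is zero, and it can never be drawn again. The support can therefore only shrink from one generation to the next, and the rare outcomes are the most likely to go first.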

The whole field on three knobs

Strip away the variations and every method is one choice per knob — what to match, how to measure the gap, where the data comes from:

| method | WHAT | HOW | WHERE |
|---|---|---|---|
| Hinton KD | response (logits) | forward KL + T | fixed set |
| FitNets / RKD | features / relations | L2 / relational | fixed set |
| Sequence KD (R1-distill) | sequences | forward KL (hard) | teacher generations |
| GKD | sequences | any D (JSD default) | mix (λ) |
| MiniLLM | sequences | reverse KL (PG) | on-policy (student) |

Takeaway
A generative teacher is a tree of distributions. Word-level KD matches each node (dense, white-box); sequence-level KD is SFT on the teacher's text (sparse, black-box) — both off-policy, both exposure-biased. Forward KL makes the student hedge (mode-covering); reverse KL makes it commit (mode-seeking) but needs student samples, so it is on-policy. GKD unifies the space with two knobs (divergence × on-policy fraction); MiniLLM is its reverse-KL corner, and the whole thing is RL with the teacher as a dense reward. Distillation changes the architecture, so it composes multiplicatively with quantization (bits) and pruning (params) and heals their damage. At the frontier, reasoning traces are dense supervision that works, black-box imitation transfers style before substance, and weak-to-strong inverts the arrow for alignment.