The LLM era (Part 2): DPO & GRPO, derived
Lesson 11 drew the map: PPO is the cheap TRPO, DPO skips the RL loop, GRPO drops the critic. A map is not a derivation. Here we earn both of the new methods from one equation each — the KL-regularized objective for DPO, the policy-gradient baseline theorem for GRPO — so neither feels like a trick you have to memorize.
Part A · DPO — inverting the KL-regularized optimum
The LLM-era RL objective (the one PPO optimizes, the one we'll wire into RLHF next lesson) is "maximize reward, but don't drift too far from a trusted reference policy πref":
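In symbols, a standard way to write that objective (the prompt distribution 𝒟 is notation assumed here, not defined above):

```latex
\max_{\pi}\;
\mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi(\cdot\mid x)}\big[\, r(x,y) \,\big]
\;-\;
\beta\,\mathbb{E}_{x\sim\mathcal{D}}\Big[\,
\mathrm{KL}\big(\pi(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)\Big]
```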
The KL anchor (with temperature β) is the same idea as TRPO's trust region from L10 — keep the new policy near a reference — but here it lives inside the objective as a penalty rather than as a hard constraint. PPO chases this maximum with sampling and clipping. DPO's insight is that for any fixed reward r, the maximizer has a closed form, so we never have to chase it.
Step 1 — the optimal policy is a tilted reference
Write the per-prompt objective and add a Lagrange term for normalization. Expand the KL as 𝔼y∼π[ log π − log πref ], take the functional derivative w.r.t. π(y|x), set it to zero. Two lines of calculus give the Gibbs/Boltzmann form:
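Spelled out for one prompt x, a sketch of those two lines (λ is the assumed name for the normalization multiplier, and the constant exp(−1 − λ/β) is absorbed into 1/Z(x)):

```latex
\begin{aligned}
\mathcal{L}(\pi,\lambda) &= \sum_y \pi(y\mid x)\,r(x,y)
 \;-\; \beta \sum_y \pi(y\mid x)\,\big[\log \pi(y\mid x) - \log \pi_{\mathrm{ref}}(y\mid x)\big]
 \;+\; \lambda\Big(1 - \sum_y \pi(y\mid x)\Big) \\[4pt]
0 = \frac{\partial \mathcal{L}}{\partial \pi(y\mid x)}
 &= r(x,y) \;-\; \beta\,\log\frac{\pi(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)} \;-\; \beta \;-\; \lambda \\[4pt]
\Longrightarrow\quad \pi^{*}(y\mid x)
 &= \frac{1}{Z(x)}\;\pi_{\mathrm{ref}}(y\mid x)\,\exp\!\Big(\frac{r(x,y)}{\beta}\Big)
\end{aligned}
```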
Read it: the optimal policy is the reference, reweighted by how much reward each response earns, with sharpness set by 1/β. Small β → sharp tilt toward high reward; large β → barely move off πref. The normalizer Z(x) = Σy πref(y|x) · exp(r(x,y)/β) sums over all possible responses — astronomically expensive to compute, which is exactly why people thought you needed sampling. Hold that thought; it is about to disappear.
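On a problem small enough to enumerate, the closed form can be checked directly. The sketch below assumes a toy prompt with four responses and made-up values for πref, r, and β; it compares the tilted reference against random policies and against the known optimum value β·log Z(x).

```python
import math
import random

def objective(pi, ref, rewards, beta):
    """E_pi[r] - beta * KL(pi || ref) for one prompt with finitely many responses."""
    return sum(p * (r - beta * math.log(p / q))
               for p, q, r in zip(pi, ref, rewards) if p > 0)

def tilted_reference(ref, rewards, beta):
    """Closed form: pi*(y) = ref(y) * exp(r(y) / beta) / Z. Returns (pi*, Z)."""
    weights = [q * math.exp(r / beta) for q, r in zip(ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z

def random_policy(rng, n):
    """A random strictly positive distribution over n responses."""
    raw = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(raw)
    return [v / total for v in raw]

# Toy numbers, chosen only for illustration.
ref = [0.4, 0.3, 0.2, 0.1]
rewards = [1.0, 0.0, 2.0, -1.0]
beta = 0.5

pi_star, z = tilted_reference(ref, rewards, beta)
best = objective(pi_star, ref, rewards, beta)

rng = random.Random(0)
best_random = max(objective(random_policy(rng, 4), ref, rewards, beta)
                  for _ in range(2000))

print("objective at pi*      :", best)
print("beta * log Z          :", beta * math.log(z))
print("best of 2000 random pi:", best_random)
```

The first two printed numbers agree because plugging π* back into the objective collapses it to β·log Z(x); no random policy should exceed them, since the objective is strictly concave in π.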
Step 2 — invert it: reward is a log-ratio
Take logs and solve the closed form for r. The reward we'd need for π* to be optimal is exactly:
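Taking logs of π* = πref · exp(r/β) / Z(x) and rearranging gives, in the same symbols:

```latex
r(x,y) \;=\; \beta\,\log\frac{\pi^{*}(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)} \;+\; \beta\,\log Z(x)
```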
This is the pivot. We started wanting "the policy that maximizes a given reward." We now have "the reward that a given policy is implicitly optimal for." The reward model and the policy are the same object viewed from two sides — a log-ratio against the reference, plus a per-prompt constant.
Step 3 — plug into Bradley–Terry; Z(x) cancels
We don't have numeric rewards — we have human preferences: pairs (x, yw, yl) where yw (winner/chosen) was preferred to yl (loser/rejected). The Bradley–Terry model says the probability a human prefers yw depends only on the reward difference, squashed through a sigmoid:
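In the usual notation (σ is the logistic sigmoid and ≻ reads "is preferred to"; both symbols are assumed here):

```latex
P\big(y_w \succ y_l \mid x\big) \;=\; \sigma\big(r(x,y_w) - r(x,y_l)\big),
\qquad \sigma(u) = \frac{1}{1+e^{-u}}
```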
Substitute the inverted reward from Step 2 into that difference. Both responses share the same prompt x, so they share the same β · log Z(x) — and a difference erases any term common to both. The intractable partition function cancels completely.
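A sketch of that cancellation, using the Step 2 reward for each response:

```latex
\begin{aligned}
r(x,y_w) - r(x,y_l)
 &= \Big[\beta\log\frac{\pi^{*}(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} + \beta\log Z(x)\Big]
  - \Big[\beta\log\frac{\pi^{*}(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)} + \beta\log Z(x)\Big] \\
 &= \beta\log\frac{\pi^{*}(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
  - \beta\log\frac{\pi^{*}(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
\end{aligned}
```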
Now rename the unknown optimal policy as the thing we are training, πθ, take the negative log-likelihood of the observed preferences, and you have the entire DPO loss:
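Assembled from Steps 2 and 3, this is the standard form of the loss, with 𝒟 the preference dataset (assumed notation):

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
 \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
 \left[\,\log \sigma\!\left(
 \beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
 \;-\;
 \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
 \right)\right]
```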
Stare at it. There is no reward model (it folded into the policy), no sampling (the loss reads the policy and reference log-probs of two fixed responses off a static dataset), no clipping, no value head, and no separate KL penalty term: the πref log-ratios do the anchoring. It is a binary classifier: "make the chosen response more likely than the rejected one, measured relative to the reference." That is why DPO is summarized as "RL without RL."
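As a minimal sketch, the per-pair loss is a few lines once the sequence log-probs are in hand. The function and argument names below are hypothetical, and the inputs are assumed to be log-probs already summed over each response's tokens; a real trainer would compute them in batches with an autodiff framework.

```python
import math

def log_sigmoid(u):
    """Numerically stable log(sigmoid(u)) for any real u."""
    if u >= 0:
        return -math.log1p(math.exp(-u))
    return u - math.log1p(math.exp(u))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """DPO loss for one (chosen, rejected) pair.

    logp_*     : log pi_theta(y | x), summed over the response tokens
    ref_logp_* : log pi_ref(y | x), same responses, frozen reference
    """
    # Implicit rewards, up to the shared beta * log Z(x) that cancels.
    reward_w = beta * (logp_w - ref_logp_w)
    reward_l = beta * (logp_l - ref_logp_l)
    return -log_sigmoid(reward_w - reward_l)

# At initialization the policy equals the reference, so the margin is zero.
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0, beta=0.1)

# Raising the chosen response's log-prob relative to the reference lowers the loss.
loss_after = dpo_loss(-10.0, -15.0, -12.0, -15.0, beta=0.1)
```

At initialization the loss is log 2 ≈ 0.693 for every pair, whatever the raw log-probs are, because only the log-ratios against the reference enter the margin.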
Interactive · a DPO trainer on preference pairs
The smallest honest DPO: four candidate responses A,B,C,D to one prompt, with hidden true quality. The reference πref is uniform; the policy is four logits whose softmax is πθ. Each step samples a Bradley–Terry preference pair and applies one DPO gradient step. Watch the chosen response's log-prob climb above the rejected one — and watch what β does to the gap and to the drift from πref.
What you should observe: with a moderate β (≈0.5) the chosen-vs-rejected ordering snaps into place and the implicit reward r̂ = β·log(πθ/πref) tracks the hidden quality order. Crank β down to 0.1 and the gap Δr̂ runs away while the KL to πref balloons — that is the overfitting/drift failure the sibling lesson warns about (both πθ(yw) and πθ(yl) can fall together; the loss only cares about the difference). Crank it to 2.0 and the bars barely leave uniform — the anchor is too strong to learn.
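For readers without the widget, the same experiment can be sketched offline. Everything concrete below is an assumption for illustration: the hidden quality values, the learning rate, the step count, and the function name `train_dpo`. It is not the page's own implementation.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def train_dpo(quality, beta=0.5, lr=0.1, steps=5000, seed=0):
    """Tabular DPO on one prompt: uniform reference, one logit per response."""
    rng = random.Random(seed)
    n = len(quality)
    log_ref = [math.log(1.0 / n)] * n   # uniform pi_ref
    logits = [0.0] * n                  # pi_theta starts at pi_ref
    for _ in range(steps):
        a, b = rng.sample(range(n), 2)
        # Bradley-Terry label drawn from the hidden quality.
        p_a_wins = sigmoid(quality[a] - quality[b])
        w, l = (a, b) if rng.random() < p_a_wins else (b, a)
        # DPO margin; the softmax normalizer cancels in log pi(w) - log pi(l).
        margin = beta * ((logits[w] - logits[l]) - (log_ref[w] - log_ref[l]))
        # Gradient of -log sigmoid(margin): push the chosen logit up, the rejected down.
        step = lr * beta * sigmoid(-margin)
        logits[w] += step
        logits[l] -= step
    pi = softmax(logits)
    r_hat = [beta * (math.log(p) - q) for p, q in zip(pi, log_ref)]
    kl = sum(p * (math.log(p) - q) for p, q in zip(pi, log_ref))
    return pi, r_hat, kl

quality = [2.0, 1.0, 0.0, -1.0]         # hidden quality of A, B, C, D (made up)
for beta in (0.1, 0.5, 2.0):
    pi, r_hat, kl = train_dpo(quality, beta=beta)
    print(f"beta={beta}: pi={[round(p, 3) for p in pi]}  KL(pi||ref)={kl:.3f}")
```

In this tabular toy only the two sampled logits move on each step, since the margin depends on nothing but their difference. Running it for the three β values shows the drift pattern described above: the KL to πref is largest at small β and smallest at large β.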
Part B · GRPO — the critic was a baseline all along
Now the on-policy branch. PPO (L11) optimizes the clipped surrogate min(ρ·A, clip(ρ,1±ε)·A) with the importance ratio ρ = πθ/πold from L09. To compute the advantage A it trains a second network — a value critic Vφ — as the baseline. For a 7B+ LLM that critic roughly doubles training memory and, on a reward that only arrives at the final token of a long response, it is a noisy mess for thousands of steps. GRPO asks: do we actually need a learned baseline?
The baseline is free to be anything that depends on the prompt
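Recall the policy-gradient baseline theorem from L03/L07: subtracting any function b(x) that depends on the state (here the prompt) but not the sampled response leaves the gradient unbiased, because the subtracted term has zero expectation: 𝔼y∼πθ[ b(x) · ∇θ log πθ(y|x) ] = b(x) · Σy ∇θ πθ(y|x) = b(x) · ∇θ 1 = 0.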
The critic's whole job was to estimate 𝔼y∼πθ[r(x,y)] — the prompt's expected reward — so it could be subtracted. But we can estimate that expectation by Monte Carlo: sample a group of K responses to the same prompt and average their rewards. The group mean is a one-line, network-free baseline.
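The baseline theorem can be verified exactly on a small softmax policy, where the expectation is a finite sum. The sketch below uses made-up logits and rewards and a hypothetical helper `exact_policy_gradient`; it computes the gradient with no baseline, with the expected reward as baseline, and with an arbitrary constant.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def exact_policy_gradient(logits, rewards, baseline):
    """sum_y pi(y) * (r(y) - b) * d log pi(y) / d logits, computed exactly."""
    pi = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    for y in range(n):
        for j in range(n):
            # Softmax score function: d log pi(y) / d z_j = 1[j == y] - pi(j)
            score = (1.0 if j == y else 0.0) - pi[j]
            grad[j] += pi[y] * (rewards[y] - baseline) * score
    return grad

logits = [0.3, -0.2, 1.1, 0.0]          # toy policy parameters
rewards = [1.0, 0.0, 2.0, -1.0]         # toy rewards for the four responses
expected_reward = sum(p * r for p, r in zip(softmax(logits), rewards))

g_none = exact_policy_gradient(logits, rewards, 0.0)
g_mean = exact_policy_gradient(logits, rewards, expected_reward)
g_odd = exact_policy_gradient(logits, rewards, 42.0)
```

All three gradients coincide up to floating-point error. What the baseline changes is the variance of the sampled estimator, not its expectation, which is why a cheap group mean can stand in for the critic.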
The GRPO advantage
For prompt x, sample K rollouts y1, …, yK, score each to get rewards r1, …, rK, and set Ai = (ri − mean(r1, …, rK)) / std(r1, …, rK),
then run the exact PPO-clip surrogate with Ai broadcast to every token of rollout i. Everything else (ρ, the clip [1−ε, 1+ε], the β·KL anchor to πref) is PPO unchanged. GRPO = PPO − value head. The trade: K sampled rollouts per prompt to build the baseline, against an entire second network you no longer train or shard. Memory drops from ~2× to ~1× the policy.
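A minimal sketch of the group advantage and the clipped term, in plain Python. Several details are assumptions rather than anything fixed by the text above: the small `eps` guard in the denominator, the per-rollout token averaging, the default clip of 0.2, and the omission of the β·KL term for brevity. All function names are hypothetical.

```python
import math

def group_advantages(rewards, eps=1e-8):
    """A_i = (r_i - mean) / (std + eps) over the K rollouts of one prompt."""
    k = len(rewards)
    mean = sum(rewards) / k
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / k)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-clip term min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for one token."""
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)

def grpo_surrogate(token_ratios, rewards, clip_eps=0.2):
    """Mean clipped surrogate for one prompt (KL anchor omitted in this sketch).

    token_ratios[i] holds rho = pi_theta / pi_old for each token of rollout i;
    the rollout's single advantage A_i is broadcast to all of its tokens.
    """
    advantages = group_advantages(rewards)
    per_rollout = [
        sum(clipped_surrogate(rho, adv, clip_eps) for rho in ratios) / len(ratios)
        for ratios, adv in zip(token_ratios, advantages)
    ]
    return sum(per_rollout) / len(per_rollout)

# Four rollouts with a verifiable 0/1 reward: two correct, two wrong.
advs = group_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout scores the same carries no learning signal.
flat = group_advantages([1.0, 1.0, 1.0, 1.0])
```

With two correct and two wrong rollouts the advantages come out at roughly +1 and −1; when all rewards are equal every advantage is zero, so that prompt contributes no gradient.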
Cross-link · where these go next
- DPO — RL without RL: the same closed-form derivation, plus the preference-loss family (IPO, KTO, ORPO, SimPO), what DPO gives up (no exploration), and why frontier pipelines still end with an on-policy RL stage.
- GRPO — drop the critic: the group baseline in production code, the degenerate-group (all-equal rewards) edge case, and the PPO-vs-GRPO memory accounting.
- Dr.GRPO — GRPO done right: the derivation of the /std scale bias flagged above, plus a second length-normalization bias, and why dropping both matters for long chain-of-thought reasoning.
The two derivations, side by side
| | DPO | GRPO |
|---|---|---|
| Starts from | KL-regularized objective max 𝔼[r] − β·KL | PPO clipped surrogate (L11) |
| Key identity | optimal policy ⇒ reward = β·log(π/πref) + β·log Z | PG baseline theorem: any b(x) is unbiased |
| What cancels / drops | Z(x) cancels in the BT difference | the value critic Vφ drops out |
| Removes | reward model + on-policy sampling | a second network (~half the memory) |
| Cost paid | cannot explore; offline only | K rollouts/prompt; small /std & 1/K bias |
| Best for | style/tone/helpfulness from preference pairs | verifiable reward (math, code, reasoning) |