Lesson 12 of 32

The LLM era (part 2): DPO & GRPO, derived

Lesson 11 drew the map: PPO is the cheap TRPO, DPO skips the RL loop, GRPO drops the critic. A map is not a derivation. Here we earn both of the new methods from one equation each — the KL-regularized objective for DPO, the policy-gradient baseline theorem for GRPO — so neither feels like a trick you have to memorize.

Where we are
We have the full lineage now: REINFORCE (L03/L07) → advantage & GAE (L08) → importance sampling (L09) → TRPO's KL trust region (L10) → PPO's clipped surrogate (L11). Lesson 11 also named two LLM-era descendants without proving them. DPO removes the on-policy loop entirely; GRPO removes the value critic. This lesson derives each from machinery you already have, then hands the systems-level depth to the sibling course.

Part A · DPO — inverting the KL-regularized optimum

The LLM-era RL objective (the one PPO optimizes, the one we'll wire into RLHF next lesson) is "maximize reward, but don't drift too far from a trusted reference policy πref":

maxπ   𝔼x∼D [ 𝔼y∼π(·|x)[ r(x, y) ]  −  β · KL( π(·|x) ‖ πref(·|x) ) ]

The KL anchor (with temperature β) is the same idea as TRPO's trust region from L10 — keep the new policy near a reference — but here it lives inside the objective as a penalty rather than as a hard constraint. PPO chases this maximum with sampling and clipping. DPO's insight is that for any fixed reward r, the maximizer has a closed form, so we never have to chase it.

Step 1 — the optimal policy is a tilted reference

Write the per-prompt objective and add a Lagrange term for normalization. Expand the KL as 𝔼y∼π[ log π − log πref ], take the functional derivative w.r.t. π(y|x), set it to zero. Two lines of calculus give the Gibbs/Boltzmann form:

π*(y | x)  =  (1 / Z(x)) · πref(y | x) · exp( r(x, y) / β )
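The two lines of calculus, written out for a single prompt x. The symbols 𝓛 (the Lagrangian) and λ (the multiplier for the constraint Σy π(y|x) = 1) are notation introduced here for illustration; the constant factor in the last line is what the text calls 1/Z(x).

```latex
\begin{aligned}
\mathcal{L}[\pi] &= \sum_y \pi(y\mid x)\, r(x,y)
  \;-\; \beta \sum_y \pi(y\mid x)\,\log\frac{\pi(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}
  \;+\; \lambda\Big(1 - \sum_y \pi(y\mid x)\Big) \\[4pt]
\frac{\partial \mathcal{L}}{\partial \pi(y\mid x)} &= r(x,y)
  - \beta\Big(\log\frac{\pi(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)} + 1\Big) - \lambda \;=\; 0 \\[4pt]
\Rightarrow\quad \pi^*(y\mid x) &= \pi_{\mathrm{ref}}(y\mid x)\,
  \exp\!\Big(\frac{r(x,y)}{\beta}\Big)\cdot
  \underbrace{\exp\!\Big(-1 - \frac{\lambda}{\beta}\Big)}_{1/Z(x)}
\end{aligned}
```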

Read it: the optimal policy is the reference, reweighted by how much reward each response earns, with sharpness set by 1/β. Small β → sharp tilt toward high reward; large β → barely move off πref. The normalizer Z(x) = Σy πref(y|x) · exp(r(x,y)/β) sums over all possible responses — astronomically expensive to compute, which is exactly why people thought you needed sampling. Hold that thought; it is about to disappear.
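To make the tilt concrete, here is a minimal sketch on a four-response toy problem. The function names `tiltedPolicy` and `objective` are hypothetical, and the reward values are borrowed from the widget's hidden quality purely for illustration. It tilts a uniform reference by exp(r/β) for a few values of β and evaluates the KL-regularized objective, which at the optimum should equal β·log Z(x).

```javascript
// Tilt a discrete reference policy by exp(r / beta) and renormalize.
function tiltedPolicy(ref, rewards, beta) {
  const weights = ref.map((p, i) => p * Math.exp(rewards[i] / beta));
  const Z = weights.reduce((a, b) => a + b, 0); // partition function Z(x)
  return { pi: weights.map((w) => w / Z), Z };
}

// KL-regularized objective for one prompt: E_pi[r] - beta * KL(pi || ref).
function objective(pi, ref, rewards, beta) {
  let value = 0;
  for (let i = 0; i < pi.length; i++) {
    if (pi[i] > 0) {
      value += pi[i] * (rewards[i] - beta * Math.log(pi[i] / ref[i]));
    }
  }
  return value;
}

const ref = [0.25, 0.25, 0.25, 0.25];   // uniform reference (assumption)
const rewards = [0.0, 1.0, 2.0, 2.8];   // toy rewards (assumption)

for (const beta of [0.1, 0.5, 2.0]) {
  const { pi, Z } = tiltedPolicy(ref, rewards, beta);
  console.log(
    `beta=${beta}`,
    "pi* =", pi.map((p) => p.toFixed(3)).join(" "),
    "objective =", objective(pi, ref, rewards, beta).toFixed(4),
    "beta*logZ =", (beta * Math.log(Z)).toFixed(4)
  );
}
```

Small β concentrates almost all mass on the best response; large β leaves the policy close to uniform. The enumeration over all responses is only possible because the toy has four of them.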

Step 2 — invert it: reward is a log-ratio

Take logs and solve the closed form for r. The reward we'd need for π* to be optimal is exactly:

r(x, y)  =  β · log( π*(y | x) / πref(y | x) )  +  β · log Z(x)

This is the pivot. We started wanting "the policy that maximizes a given reward." We now have "the reward that a given policy is implicitly optimal for." The reward model and the policy are the same object viewed from two sides — a log-ratio against the reference, plus a per-prompt constant.

Step 3 — plug into Bradley–Terry; Z(x) cancels

We don't have numeric rewards — we have human preferences: pairs (x, yw, yl) where yw (winner/chosen) was preferred to yl (loser/rejected). The Bradley–Terry model says the probability a human prefers yw depends only on the reward difference, squashed through a sigmoid:

P(yw ≻ yl | x)  =  σ( r(x, yw) − r(x, yl) )

Substitute the inverted reward from Step 2 into that difference. Both responses share the same prompt x, so they share the same β · log Z(x) — and a difference erases any term common to both. The intractable partition function cancels completely.

r(x, yw) − r(x, yl)  =  β · log( π*(yw|x) / πref(yw|x) )  −  β · log( π*(yl|x) / πref(yl|x) )
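The cancellation can be checked numerically. The sketch below uses a hypothetical non-uniform reference and hypothetical rewards. It builds π* by brute-force normalization, then recovers the reward gap from log-ratios alone, without ever using Z(x) in the recovery.

```javascript
const beta = 0.5;
const ref = [0.1, 0.2, 0.3, 0.4];       // hypothetical reference policy
const reward = [0.0, 1.0, 2.0, 2.8];    // hypothetical rewards

// Build pi* the expensive way: tilt, then normalize by Z(x).
const weights = ref.map((p, i) => p * Math.exp(reward[i] / beta));
const Z = weights.reduce((a, b) => a + b, 0);
const piStar = weights.map((w) => w / Z);

// Implicit reward WITHOUT the partition term: beta * log(pi* / pi_ref).
const implicit = piStar.map((p, i) => beta * Math.log(p / ref[i]));

// Every implicit reward is off from r by the same constant, beta * log Z ...
const offsets = reward.map((r, i) => r - implicit[i]);

// ... so any pairwise difference matches the true reward difference.
const w = 3, l = 0; // indices of the chosen and rejected responses
const trueGap = reward[w] - reward[l];
const implicitGap = implicit[w] - implicit[l];

console.log({ offsets, betaLogZ: beta * Math.log(Z), trueGap, implicitGap });
```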

Now rename the unknown optimal policy as the thing we are training, πθ, take the negative log-likelihood of the observed preferences, and you have the entire DPO loss:

LDPO(θ)  =  − 𝔼(x, yw, yl) [ log σ( β · log( πθ(yw|x) / πref(yw|x) )  −  β · log( πθ(yl|x) / πref(yl|x) ) ) ]

Stare at it. There is no separate reward model (it folded into the policy), no sampling from the policy (the loss reads the log-probs of two fixed responses, under πθ and under πref, off a static dataset), no clipping, no value head, and no explicit KL penalty term: the πref log-ratios carry the anchor that the KL term provided. It is a binary classifier: "make the chosen response more likely than the rejected one, measured relative to the reference." That is why DPO is summarized as "RL without RL."
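As a sketch of how little machinery the loss needs, the function below computes it from precomputed sequence log-probs. The record fields (`logpW`, `refLogpW`, `logpL`, `refLogpL`) and the numbers in `batch` are hypothetical; in a real pipeline each would be the summed token log-prob of a response under the policy or the frozen reference.

```javascript
// Numerically stable log(sigmoid(t)).
function logSigmoid(t) {
  return t >= 0 ? -Math.log1p(Math.exp(-t)) : t - Math.log1p(Math.exp(t));
}

// Mean DPO loss over a batch of preference pairs.
// Each pair holds log-probs of the chosen (W) and rejected (L) responses
// under the policy (logp*) and under the reference (refLogp*).
function dpoLoss(pairs, beta) {
  let total = 0;
  for (const p of pairs) {
    const margin =
      beta * ((p.logpW - p.refLogpW) - (p.logpL - p.refLogpL));
    total += -logSigmoid(margin);
  }
  return total / pairs.length;
}

// Hypothetical batch: made-up log-probs, for illustration only.
const batch = [
  { logpW: -12.0, refLogpW: -12.5, logpL: -15.0, refLogpL: -14.0 },
  { logpW: -20.0, refLogpW: -20.0, logpL: -18.0, refLogpL: -18.0 },
];
console.log(dpoLoss(batch, 0.1));
```

When the policy equals the reference on both responses the margin is zero and the per-pair loss is log 2, the loss of a coin-flip classifier.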

The role of β — the surprise the widget shows
β is the same KL temperature as in the objective. It does not behave like a learning rate. Too small a β weakens the anchor: the sigmoid does not saturate until the log-ratio gap is enormous, so the gap keeps growing, the policy overfits the chosen/rejected distinction and drifts off πref (often both probabilities collapse, just at different rates). Too large a β over-anchors: a tiny move off the reference already saturates the sigmoid, the gradient vanishes, and the policy barely leaves πref. The widget below lets you feel both failure modes.

Interactive · a DPO trainer on preference pairs

The smallest honest DPO: four candidate responses A,B,C,D to one prompt, with hidden true quality. The reference πref is uniform; the policy is four logits whose softmax is πθ. Each step samples a Bradley–Terry preference pair and applies one DPO gradient step. Watch the chosen response's log-prob climb above the rejected one — and watch what β does to the gap and to the drift from πref.

DPO trainer · 4 responses, πref = uniform
Hidden quality: A=0.0, B=1.0, C=2.0, D=2.8 (D is best). Blue/green = πθ, dim grey = πref, orange = implicit reward β·log(πθ/πref). Try β=0.1 (watch the gap explode and probabilities drift) vs β=2.0 (watch nothing move).
Readouts: pairs seen · implicit Δr̂ (D − A) · KL(πθ ‖ πref) · status
The JS that runs the DPO step:
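// Helpers the excerpt relies on (assumed; the original snippet does not show them).
const softmax = (z) => {
  const e = z.map((v) => Math.exp(v)), s = e.reduce((a, b) => a + b, 0);
  return e.map((v) => v / s);
};
const sigmoid = (t) => 1 / (1 + Math.exp(-t));
// One DPO step on a sampled BT preference pair (w = chosen index, l = rejected index).
// π_θ = softmax(logits); π_ref = softmax(refLogits), uniform when refLogits are all equal.
function dpoStep(logits, refLogits, w, l, beta, eta) {
  const pi = softmax(logits), ref = softmax(refLogits);
  const margin = beta * (Math.log(pi[w] / ref[w]) - Math.log(pi[l] / ref[l]));
  const g = sigmoid(-margin);                 // d log σ(margin) / d(margin)
  // ∂ log π_k / ∂θ_m = δ_km − π_m   (softmax score function)
  for (let m = 0; m < logits.length; m++) {
    const dW = (m === w ? 1 : 0) - pi[m];
    const dL = (m === l ? 1 : 0) - pi[m];
    logits[m] += eta * g * beta * (dW - dL);  // ascent on log σ(margin), updates logits in place
  }
}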
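A short worked step explains why this update only ever moves two logits. Because πref does not depend on θ, only the two log πθ terms contribute, and the −π_m pieces of the softmax score function cancel between the chosen and rejected responses (notation follows the code comment, with "margin" the quantity computed in the snippet):

```latex
\begin{aligned}
\frac{\partial}{\partial \theta_m} \log\sigma(\text{margin})
  &= \sigma(-\text{margin}) \cdot \beta
     \Big[ (\delta_{mw} - \pi_m) - (\delta_{ml} - \pi_m) \Big] \\
  &= \beta\,\sigma(-\text{margin})\,(\delta_{mw} - \delta_{ml})
\end{aligned}
```

So each step raises the chosen logit and lowers the rejected logit by the same amount, β·η·σ(−margin), and leaves the other logits untouched.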

What you should observe: with a moderate β (≈0.5) the chosen-vs-rejected ordering snaps into place and the implicit reward r̂ = β·log(πθ/πref) tracks the hidden quality order. Crank β down to 0.1 and the gap Δr̂ runs away while the KL to πref balloons. That is the overfitting/drift failure the sibling lesson warns about (both πθ(yw) and πθ(yl) can fall together; the loss only cares about the difference). Crank it to 2.0 and the bars barely leave uniform: the anchor is too strong to learn.

Part B · GRPO — the critic was a baseline all along

Now the on-policy branch. PPO (L11) optimizes the clipped surrogate min(ρ·A, clip(ρ, 1−ε, 1+ε)·A) with the importance ratio ρ = πθ/πθold from L09. To compute the advantage A it trains a second network, a value critic Vφ, as the baseline. For a 7B+ LLM that critic roughly doubles training memory and, on a reward that only arrives at the final token of a long response, its estimates stay noisy for thousands of steps. GRPO asks: do we actually need a learned baseline?

The baseline is free to be anything that depends on the prompt

Recall the policy-gradient baseline theorem from L03/L07: subtracting any function b(x) that depends on the state (here the prompt) but not the sampled response leaves the gradient unbiased, because

𝔼y∼πθ[ b(x) · ∇θ log πθ(y|x) ]  =  b(x) · Σy πθ(y|x) · ∇θ log πθ(y|x)  =  b(x) · ∇θ Σy πθ(y|x)  =  b(x) · ∇θ 1  =  0

The critic's whole job was to estimate 𝔼y∼πθ[r(x,y)] — the prompt's expected reward — so it could be subtracted. But we can estimate that expectation by Monte Carlo: sample a group of K responses to the same prompt and average their rewards. The group mean is a one-line, network-free baseline.
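The baseline theorem can be verified exactly on a small softmax policy, where the expectation is a finite sum. In this sketch the logits, rewards, and the baseline value 1.7 are all hypothetical; the point is only that the expected gradient is the same with and without the constant.

```javascript
function softmax(z) {
  const m = Math.max(...z);
  const e = z.map((v) => Math.exp(v - m));
  const s = e.reduce((a, b) => a + b, 0);
  return e.map((v) => v / s);
}

// Exact E_{y~pi}[ (r_y - b) * grad_theta log pi(y) ] for a softmax policy,
// using d log pi_k / d theta_m = delta_km - pi_m.
function expectedGradient(logits, rewards, b) {
  const pi = softmax(logits);
  const grad = new Array(logits.length).fill(0);
  for (let k = 0; k < pi.length; k++) {
    for (let m = 0; m < pi.length; m++) {
      grad[m] += pi[k] * (rewards[k] - b) * ((k === m ? 1 : 0) - pi[m]);
    }
  }
  return grad;
}

const logits = [0.3, -1.2, 0.8, 0.0];   // hypothetical policy parameters
const rewards = [0.0, 1.0, 2.0, 2.8];   // hypothetical per-response rewards

const noBaseline = expectedGradient(logits, rewards, 0);
const withBaseline = expectedGradient(logits, rewards, 1.7); // any constant b(x)
console.log(noBaseline, withBaseline);
```

The baseline leaves the expectation alone; what it changes is the variance of the sampled estimate, which is the whole reason to subtract one.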

The GRPO advantage

For prompt x, sample K rollouts, score each, and set

Ai  =  ( ri − meanj rj ) / ( stdj rj + ε )

then run the exact PPO-clip surrogate with Ai broadcast to every token of rollout i. (The ε in the denominator of Ai is a small numerical-stability constant, unrelated to the clip range ε.) Everything else (ρ, the clip [1−ε, 1+ε], the β·KL anchor to πref) is PPO unchanged. GRPO = PPO − value head. The trade: K sampled rollouts per prompt to build the baseline, against an entire second network you no longer train or shard. Memory drops from ~2× to ~1× the policy.
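A minimal sketch of the two pieces, under stated assumptions: `grpoAdvantages` uses the population standard deviation (dividing by K; implementations may differ on this), `clipRange` stands in for the clip ε, and the binary rewards are a made-up example of a verifier's pass/fail output. The KL anchor is omitted to keep the sketch short.

```javascript
// Group-relative advantages: standardize rewards within one prompt's group.
function grpoAdvantages(rewards, eps = 1e-6) {
  const K = rewards.length;
  const mean = rewards.reduce((a, b) => a + b, 0) / K;
  const variance = rewards.reduce((a, r) => a + (r - mean) ** 2, 0) / K;
  const std = Math.sqrt(variance); // population std (assumption)
  return rewards.map((r) => (r - mean) / (std + eps));
}

// PPO-clip surrogate for one token:
// min(rho * A, clip(rho, 1 - clipRange, 1 + clipRange) * A).
function clippedSurrogate(rho, advantage, clipRange = 0.2) {
  const clipped = Math.min(Math.max(rho, 1 - clipRange), 1 + clipRange);
  return Math.min(rho * advantage, clipped * advantage);
}

const rewards = [1, 0, 0, 1, 1, 0, 0, 0]; // hypothetical pass/fail rewards, K = 8
const adv = grpoAdvantages(rewards);
console.log(adv.map((a) => a.toFixed(3)));

// The same advantage adv[i] would be applied to every token of rollout i.
console.log(clippedSurrogate(1.5, adv[0]), clippedSurrogate(0.5, adv[1]));
```

A group whose rewards are all equal yields all-zero advantages, so prompts the policy always solves or always fails contribute no gradient.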

Two subtleties — one is a forward reference
(1) A small finite-K bias. Each ri appears inside its own baseline (via the mean), so the baseline is not strictly independent of the sampled response. Algebraically, ri − meanj rj = (1 − 1/K) · (ri − mean of the other K−1 rewards), so the group-mean estimator is the leave-one-out estimator shrunk by a factor (1 − 1/K); the discrepancy is O(1/K) and vanishes as K grows. The strictly unbiased fix uses the mean of the other K−1 rollouts (RLOO).
(2) The /std introduces a scale bias. Dividing by the group std makes the loss reward-scale-invariant, but it amplifies gradients from low-variance groups (the easiest and hardest prompts), which is not where the most learning signal is. The policy-gradient theorem contains no /std factor, so this is a genuine bias, and removing it is exactly what Dr.GRPO does. We forward-reference it here; the sibling course derives it.
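The relationship between the group-mean baseline and the leave-one-out (RLOO) baseline is easy to check numerically. This is a sketch with hypothetical function names and made-up rewards; it leaves out the /std normalization so that only the baseline choice differs.

```javascript
// Baseline that includes the sample itself: the plain group mean.
function meanBaselineAdvantages(rewards) {
  const mean = rewards.reduce((a, b) => a + b, 0) / rewards.length;
  return rewards.map((r) => r - mean);
}

// Leave-one-out baseline: mean of the other K - 1 rewards (RLOO).
function rlooAdvantages(rewards) {
  const K = rewards.length;
  const sum = rewards.reduce((a, b) => a + b, 0);
  return rewards.map((r) => r - (sum - r) / (K - 1));
}

const rewards = [0.2, 1.0, 0.0, 0.7]; // hypothetical group of K = 4 rewards
const K = rewards.length;
const meanAdv = meanBaselineAdvantages(rewards);
const looAdv = rlooAdvantages(rewards);

// meanAdv[i] equals (K - 1) / K * looAdv[i] for every i.
const rescaledLoo = looAdv.map((a) => (a * (K - 1)) / K);
console.log(meanAdv, rescaledLoo);
```

With K = 4 the group-mean advantages are three quarters of the leave-one-out ones; at K = 64 the factor is already 63/64.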

Cross-link · where these go next

WHERE THIS GOES NEXT (SYSTEMS)
This theory course derives the algorithms; the sibling RL Post-Training course runs them on a GPU cluster and dissects the failure modes. Same symbols throughout: β (KL/DPO temperature), ρ = πθ/πθold (IS ratio), A (advantage), KL(·‖·).

The two derivations, side by side

|  | DPO | GRPO |
| --- | --- | --- |
| Starts from | KL-regularized objective max 𝔼[r] − β·KL | PPO clipped surrogate (L11) |
| Key identity | optimal policy ⇒ reward = β·log(π/πref) + β·log Z | PG baseline theorem: any b(x) is unbiased |
| What cancels / drops | Z(x) cancels in the BT difference | the value critic Vφ drops out |
| Removes | reward model + on-policy sampling | a second network (~half the memory) |
| Cost paid | cannot explore; offline only | K rollouts/prompt; small /std & 1/K bias |
| Best for | style/tone/helpfulness from preference pairs | verifiable reward (math, code, reasoning) |
Takeaway
DPO and GRPO are not new ideas bolted onto PPO; they are two inversions of machinery you already had. DPO inverts the KL-regularized optimum (π* ∝ πref·exp(r/β)) so the reward becomes a log-ratio; plugged into Bradley–Terry the partition function Z(x) cancels, leaving a reward-model-free, sampling-free binary classifier. GRPO replaces PPO's learned critic with the mean of K grouped rollouts, a valid prompt-only baseline (up to an O(1/K) correction) that roughly halves memory, at the price of K samples and a small /std bias. Next we wire these together into the end-to-end post-training recipe: RLHF.