The LLM era (Part 2): DPO & GRPO, derived

Lesson 11 drew the map: PPO is the cheap TRPO, DPO skips the RL loop, GRPO drops the critic. A map is not a derivation. Here we earn both of the new methods from one equation each — the KL-regularized objective for DPO, the policy-gradient baseline theorem for GRPO — so neither feels like a trick you have to memorize.

Where we are

We have the full lineage now: REINFORCE (L03/L07) → advantage & GAE (L08) → importance sampling (L09) → TRPO's KL trust region (L10) → PPO's clipped surrogate (L11). Lesson 11 also named two LLM-era descendants without proving them. DPO removes the on-policy loop entirely; GRPO removes the value critic. This lesson derives each from machinery you already have, then hands the systems-level depth to the sibling course.

Part A · DPO — inverting the KL-regularized optimum

The LLM-era RL objective (the one PPO optimizes, the one we'll wire into RLHF next lesson) is "maximize reward, but don't drift too far from a trusted reference policy π_ref":

max_π 𝔼_{x ∼ D} [ 𝔼_{y ∼ π(·|x)} [ r(x, y) ] − β · KL( π(·|x) ‖ π_ref(·|x) ) ]

The KL anchor (with temperature β) is the same idea as TRPO's trust region from L10 — keep the new policy near a reference — but here it lives inside the objective as a penalty rather than as a hard constraint. PPO chases this maximum with sampling and clipping. DPO's insight is that for any fixed reward r, the maximizer has a closed form, so we never have to chase it.

Step 1 — the optimal policy is a tilted reference

Write the per-prompt objective and add a Lagrange term for normalization. Expand the KL as 𝔼_{y∼π}[ log π − log π_ref ], take the functional derivative w.r.t. π(y|x), and set it to zero. Two lines of calculus give the Gibbs/Boltzmann form:

π^*(y | x) = (1 / Z(x)) · π_ref(y | x) · exp( r(x, y) / β )
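
Spelled out as a sketch for a single prompt x, the two lines of calculus look like this. The block assumes a discrete set of responses and writes λ for the normalization multiplier, a symbol not used elsewhere in this lesson.

```latex
\begin{aligned}
\mathcal{L}(\pi,\lambda)
  &= \sum_y \pi(y\mid x)\Big[\, r(x,y) - \beta \log\frac{\pi(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)} \Big]
     + \lambda\Big(\sum_y \pi(y\mid x) - 1\Big), \\
\frac{\partial \mathcal{L}}{\partial \pi(y\mid x)}
  &= r(x,y) - \beta \log\frac{\pi(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)} - \beta + \lambda = 0 \\
\Longrightarrow\quad
\pi^*(y\mid x)
  &= \pi_{\mathrm{ref}}(y\mid x)\,\exp\!\Big(\frac{r(x,y)}{\beta}\Big)\,\exp\!\Big(\frac{\lambda-\beta}{\beta}\Big)
   = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y\mid x)\,\exp\!\Big(\frac{r(x,y)}{\beta}\Big).
\end{aligned}
```

The factor exp((λ − β)/β) does not depend on y, so normalization fixes it; it is what the closed form calls 1/Z(x).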

Read it: the optimal policy is the reference, reweighted by how much reward each response earns, with sharpness set by 1/β. Small β → sharp tilt toward high reward; large β → barely move off π_ref. The normalizer Z(x) = Σ_y π_ref(y|x) · exp(r(x,y)/β) sums over all possible responses — astronomically expensive to compute, which is exactly why people thought you needed sampling. Hold that thought; it is about to disappear.
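
To see the tilt numerically, here is a minimal sketch on a four-response toy: a uniform π_ref and rewards 0.0, 1.0, 2.0, 2.8 (the hidden qualities of the widget later in this lesson, used here as r). The helper name tiltedPolicy is ours, and summing Z(x) by brute force only works because there are four responses.

```javascript
// Sketch: pi*(y|x) = pi_ref(y|x) * exp(r(x,y)/beta) / Z(x) on a tiny response set.
// tiltedPolicy is a hypothetical helper name; the rewards reuse the toy's hidden quality.
function tiltedPolicy(piRef, rewards, beta) {
  // Subtract the max reward before exponentiating for numerical stability;
  // the shift cancels when we divide by the normalizer.
  const rMax = Math.max(...rewards);
  const unnorm = piRef.map((p, i) => p * Math.exp((rewards[i] - rMax) / beta));
  const Z = unnorm.reduce((a, b) => a + b, 0); // shifted partition function
  return unnorm.map((u) => u / Z);
}

const piRef = [0.25, 0.25, 0.25, 0.25]; // uniform reference over A, B, C, D
const quality = [0.0, 1.0, 2.0, 2.8];   // stands in for r(x, y)

for (const beta of [0.1, 0.5, 2.0]) {
  const pi = tiltedPolicy(piRef, quality, beta);
  console.log(`beta=${beta}:`, pi.map((p) => p.toFixed(3)).join(" "));
}
```

With β = 0.1 nearly all the mass should land on the best response D, while β = 2.0 stays much closer to uniform.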

Step 2 — invert it: reward is a log-ratio

Take logs and solve the closed form for r. The reward we'd need for π^* to be optimal is exactly:

r(x, y) = β · log( π^*(y | x) / π_ref(y | x) ) + β · log Z(x)

This is the pivot. We started wanting "the policy that maximizes a given reward." We now have "the reward that a given policy is implicitly optimal for." The reward model and the policy are the same object viewed from two sides — a log-ratio against the reference, plus a per-prompt constant.

Step 3 — plug into Bradley–Terry; Z(x) cancels

We don't have numeric rewards — we have human preferences: pairs (x, y_w, y_l) where y_w (winner/chosen) was preferred to y_l (loser/rejected). The Bradley–Terry model says the probability a human prefers y_w depends only on the reward difference, squashed through a sigmoid:

P(y_w ≻ y_l | x) = σ( r(x, y_w) − r(x, y_l) )

Substitute the inverted reward from Step 2 into that difference. Both responses share the same prompt x, so they share the same β · log Z(x) — and a difference erases any term common to both. The intractable partition function cancels completely.

r(x, y_w) − r(x, y_l) = β · log( π^*(y_w|x) / π_ref(y_w|x) ) − β · log( π^*(y_l|x) / π_ref(y_l|x) )
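
Steps 2 and 3 can be checked numerically on the same kind of toy. This is a sketch under stated assumptions: four responses, uniform π_ref, rewards 0.0, 1.0, 2.0, 2.8, and β = 0.5; all variable names are illustrative.

```javascript
// Sketch: check Step 2 (reward = beta*log-ratio + beta*log Z) and Step 3 (Z cancels).
// All names are illustrative; the numbers reuse the lesson's four-response toy.
const beta = 0.5;
const piRef = [0.25, 0.25, 0.25, 0.25];
const r = [0.0, 1.0, 2.0, 2.8];

// Step 1: build pi* by brute-force enumeration (feasible only for a toy).
const unnorm = piRef.map((p, i) => p * Math.exp(r[i] / beta));
const Z = unnorm.reduce((a, b) => a + b, 0);
const piStar = unnorm.map((u) => u / Z);

// Step 2: invert. The log-ratio is the reward up to the constant beta*log Z.
const logRatio = piStar.map((p, i) => beta * Math.log(p / piRef[i]));
const recovered = logRatio.map((v) => v + beta * Math.log(Z));

// Step 3: in a difference, beta*log Z never has to be computed.
const w = 3, l = 0; // D preferred over A
const diffFromPolicy = logRatio[w] - logRatio[l];
const diffFromReward = r[w] - r[l];
const sigmoid = (t) => 1 / (1 + Math.exp(-t));

console.log("recovered rewards:", recovered.map((v) => v.toFixed(6)));
console.log("difference via log-ratios:", diffFromPolicy.toFixed(6));
console.log("difference via rewards:   ", diffFromReward.toFixed(6));
console.log("Bradley-Terry P(D beats A):", sigmoid(diffFromPolicy).toFixed(4));
```

The two differences agree up to rounding even though the second path never touches Z.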

Now rename the unknown optimal policy as the thing we are training, π_θ, take the negative log-likelihood of the observed preferences, and you have the entire DPO loss:

L_DPO(θ) = − 𝔼_{(x, y_w, y_l)} log σ( β · log( π_θ(y_w|x) / π_ref(y_w|x) ) − β · log( π_θ(y_l|x) / π_ref(y_l|x) ) )

Stare at it. There is no reward model (it folded into the policy), no sampling (the loss reads the log-probs of two fixed responses off a static dataset, under π_θ and under the frozen π_ref), no clipping, no value head, and no explicit KL penalty term: the log-ratios against π_ref do the anchoring. It is a binary classifier: "make the chosen response more likely than the rejected one, measured relative to the reference." That is why DPO is summarized as "RL without RL."
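
In practice the loss is computed from summed token log-probabilities. A minimal sketch, assuming you already have the four sequence log-probs for one pair; the function and field names are illustrative, not a library API, and the numbers are made up.

```javascript
// Sketch: per-pair DPO loss from four sequence log-probabilities.
// dpoPairLoss and its field names are illustrative, not from any library.
function logSigmoid(t) {
  // Stable log(sigmoid(t)): avoids overflow in exp for large |t|.
  return t >= 0 ? -Math.log1p(Math.exp(-t)) : t - Math.log1p(Math.exp(t));
}

function dpoPairLoss({ policyChosen, policyRejected, refChosen, refRejected }, beta) {
  // Implicit rewards: beta * log(pi_theta / pi_ref), computed in log space.
  const rewardChosen = beta * (policyChosen - refChosen);
  const rewardRejected = beta * (policyRejected - refRejected);
  const margin = rewardChosen - rewardRejected;
  return { loss: -logSigmoid(margin), margin };
}

// At initialization pi_theta = pi_ref, so the margin is 0 and the loss is log 2.
const atInit = dpoPairLoss(
  { policyChosen: -12.0, policyRejected: -15.0, refChosen: -12.0, refRejected: -15.0 }, 0.1);
console.log(atInit.loss, Math.LN2);

// Once the chosen log-prob has risen relative to the reference, the loss drops.
const later = dpoPairLoss(
  { policyChosen: -10.0, policyRejected: -18.0, refChosen: -12.0, refRejected: -15.0 }, 0.1);
console.log(later.margin, later.loss);
```

Working in log space keeps the ratio π_θ/π_ref from underflowing, which matters because sequence probabilities are tiny.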

The role of β — the surprise the widget shows

β is the same KL temperature as in the objective. It does not behave like a learning rate. Too small a β weakens the anchor: every unit of implicit-reward margin now costs a log-ratio change of 1/β, so the policy overfits the chosen/rejected distinction and drifts far off π_ref (often both probabilities collapse, just at different rates). Too large a β over-anchors: the margin β·(log-ratio gap) saturates the sigmoid after only a small move, the gradient factor σ(−margin) shrinks toward zero, and the policy barely leaves the reference at all. The widget below lets you feel both failure modes.

Interactive · a DPO trainer on preference pairs

The smallest honest DPO: four candidate responses A,B,C,D to one prompt, with hidden true quality. The reference π_ref is uniform; the policy is four logits whose softmax is π_θ. Each step samples a Bradley–Terry preference pair and applies one DPO gradient step. Watch the chosen response's log-prob climb above the rejected one — and watch what β does to the gap and to the drift from π_ref.

DPO trainer · 4 responses, π_ref = uniform

Hidden quality: A=0.0, B=1.0, C=2.0, D=2.8 (D is best). Blue/green = π_θ, dim grey = π_ref, orange = implicit reward β·log(π_θ/π_ref). Try β=0.1 (watch the gap explode and probabilities drift) vs β=2.0 (watch nothing move).

[Interactive widget: sliders for β (initially 0.50) and η (initially 0.20); readouts for pairs seen, implicit Δr̂ (D − A), KL(π_θ ‖ π_ref), and status.]

The JS that runs the DPO step:

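// Helpers the step relies on (stable softmax, logistic sigmoid).
const softmax = (z) => { const m = Math.max(...z), e = z.map(v => Math.exp(v - m)), s = e.reduce((a, b) => a + b, 0); return e.map(v => v / s); };
const sigmoid = (t) => 1 / (1 + Math.exp(-t));

// One DPO step on a sampled BT preference pair: w = chosen index, l = rejected index.
// π_ref is uniform (refLogits all equal); π_θ = softmax(logits). Updates logits in place.
// (The wrapper name dpoStep is ours; the body is the widget's step.)
function dpoStep(logits, refLogits, w, l, beta, eta) {
  const pi = softmax(logits), ref = softmax(refLogits);
  const margin = beta * (Math.log(pi[w] / ref[w]) - Math.log(pi[l] / ref[l]));
  const g = sigmoid(-margin);                 // d log σ(margin) / d(margin)
  // ∂ log π_k / ∂θ_m = δ_km − π_m   (softmax score function)
  for (let m = 0; m < logits.length; m++) {
    const dW = (m === w ? 1 : 0) - pi[m];     // ∂ log π_w / ∂θ_m
    const dL = (m === l ? 1 : 0) - pi[m];     // ∂ log π_l / ∂θ_m
    logits[m] += eta * g * beta * (dW - dL);  // ascent on log σ(margin)
  }
}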

What you should observe: with a moderate β (≈0.5) the chosen-vs-rejected ordering snaps into place and the implicit reward r̂ = β·log(π_θ/π_ref) tracks the hidden quality order. Crank β down to 0.1 and the gap Δr̂ runs away while the KL to π_ref balloons — that is the overfitting/drift failure the sibling lesson warns about (both π_θ(y_w) and π_θ(y_l) can fall together; the loss only cares about the difference). Crank it to 2.0 and the bars barely leave uniform — the anchor is too strong to learn.
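
If you want to reproduce the experiment offline, here is a minimal sketch of the whole toy trainer. It makes assumptions the lesson does not state: pairs are drawn uniformly at random, the winner is labeled with Bradley–Terry on the hidden quality, and a small seeded generator stands in for Math.random so runs repeat. The names trainToyDpo and mulberry32 are ours, and the numbers it prints need not match the widget.

```javascript
// Sketch of the toy trainer: 4 responses, uniform pi_ref, Bradley-Terry labels.
// ASSUMPTIONS (not in the lesson): pairs are drawn uniformly at random, and a
// small seeded PRNG replaces Math.random so that runs are repeatable.
function mulberry32(seed) {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const softmax = (z) => {
  const m = Math.max(...z);
  const e = z.map((v) => Math.exp(v - m));
  const s = e.reduce((x, y) => x + y, 0);
  return e.map((v) => v / s);
};
const sigmoid = (t) => 1 / (1 + Math.exp(-t));

function trainToyDpo(beta, eta, steps, seed = 1) {
  const rand = mulberry32(seed);
  const quality = [0.0, 1.0, 2.0, 2.8]; // hidden quality of A, B, C, D
  const ref = [0.25, 0.25, 0.25, 0.25]; // uniform reference
  const logits = [0, 0, 0, 0];          // pi_theta starts at pi_ref
  for (let s = 0; s < steps; s++) {
    // Draw two distinct responses, then label the winner with Bradley-Terry.
    const i = Math.floor(rand() * 4);
    const j = (i + 1 + Math.floor(rand() * 3)) % 4;
    const iWins = rand() < sigmoid(quality[i] - quality[j]);
    const [w, l] = iWins ? [i, j] : [j, i];
    const pi = softmax(logits);
    const margin = beta * (Math.log(pi[w] / ref[w]) - Math.log(pi[l] / ref[l]));
    const step = eta * beta * sigmoid(-margin);
    logits[w] += step; // (dW - dL) is +1 at w, -1 at l, 0 elsewhere
    logits[l] -= step;
  }
  const pi = softmax(logits);
  const kl = pi.reduce((acc, p, k) => acc + (p > 0 ? p * Math.log(p / ref[k]) : 0), 0);
  const gapDA = beta * (Math.log(pi[3] / ref[3]) - Math.log(pi[0] / ref[0]));
  return { pi, kl, gapDA };
}

for (const beta of [0.1, 0.5, 2.0]) {
  const { pi, kl, gapDA } = trainToyDpo(beta, 0.2, 5000);
  console.log(`beta=${beta}`, pi.map((p) => p.toFixed(3)).join(" "),
    `KL=${kl.toFixed(3)}`, `gap(D-A)=${gapDA.toFixed(2)}`);
}
```

Under these assumptions the β = 0.1 run should end much further from π_ref (larger KL) than the β = 2.0 run. Exact values depend on the seed, the step count, and the pair-sampling rule.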

Part B · GRPO — the critic was a baseline all along

Now the on-policy branch. PPO (L11) optimizes the clipped surrogate min(ρ·A, clip(ρ,1±ε)·A) with the importance ratio ρ = π_θ/π_old from L09. To compute the advantage A it trains a second network — a value critic V_φ — as the baseline. For a 7B+ LLM that critic roughly doubles training memory and, on a reward that only arrives at the final token of a long response, it is a noisy mess for thousands of steps. GRPO asks: do we actually need a learned baseline?

The baseline is free to be anything that depends on the prompt

Recall the policy-gradient baseline theorem from L03/L07: subtracting any function b(x) that depends on the state (here the prompt) but not the sampled response leaves the gradient unbiased, because

𝔼_{y∼π_θ}[ b(x) · ∇_θ log π_θ(y|x) ] = b(x) · Σ_y π_θ(y|x) · ∇_θ log π_θ(y|x) = b(x) · ∇_θ Σ_y π_θ(y|x) = b(x) · ∇_θ 1 = 0
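
The identity is easy to confirm by exact enumeration for a small softmax policy. A minimal sketch, with arbitrary illustrative logits and an arbitrary baseline value b:

```javascript
// Sketch: verify E_{y~pi}[ b * grad log pi(y) ] = 0 exactly for a softmax policy.
// The logits and the baseline value b are arbitrary illustrative numbers.
const softmax = (z) => {
  const m = Math.max(...z);
  const e = z.map((v) => Math.exp(v - m));
  const s = e.reduce((x, y) => x + y, 0);
  return e.map((v) => v / s);
};

const logits = [0.3, -1.2, 2.0, 0.5];
const pi = softmax(logits);
const b = 7.5; // any value that does not depend on the sampled response

// Score function of a softmax: d log pi_k / d theta_m = delta_km - pi_m.
// Component m of the expectation sums pi_k * b * (delta_km - pi_m) over k.
const expectedTerm = logits.map((_, m) =>
  pi.reduce((acc, pk, k) => acc + pk * b * ((k === m ? 1 : 0) - pi[m]), 0));

console.log(expectedTerm); // every component is zero up to floating-point rounding
```

Changing b to any other number leaves every component at zero, which is the whole content of the theorem.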

The critic's whole job was to estimate 𝔼_{y∼π_θ}[r(x,y)] — the prompt's expected reward — so it could be subtracted. But we can estimate that expectation by Monte Carlo: sample a group of K responses to the same prompt and average their rewards. The group mean is a one-line, network-free baseline.

The GRPO advantage

For prompt x, sample K rollouts, score each, and set

A_i = ( r_i − mean_j r_j ) / ( std_j r_j + ε )

then run the exact PPO-clip surrogate with A_i broadcast to every token of rollout i. Everything else is PPO unchanged: ρ, the clip range [1−ε, 1+ε] (a different ε from the small stabilizer in the denominator above), and the β·KL anchor to π_ref. GRPO = PPO − value head. The trade: K sampled rollouts per prompt to build the baseline, against an entire second network you no longer train or shard. Memory drops from ~2× to ~1× the policy.
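
The advantage and the clipped term fit in a few lines. This is a sketch under assumptions the text does not fix: a population standard deviation (divide by K), a stabilizer of 1e-6, and a clip range of 0.2. The function and parameter names are illustrative.

```javascript
// Sketch: group-relative advantages and the PPO-clip term they feed into.
// ASSUMPTIONS: population std (divide by K) and stdEps = 1e-6; the text fixes neither.
function groupAdvantages(rewards, stdEps = 1e-6) {
  const K = rewards.length;
  const mean = rewards.reduce((a, b) => a + b, 0) / K;
  const variance = rewards.reduce((a, r) => a + (r - mean) ** 2, 0) / K;
  const std = Math.sqrt(variance);
  return rewards.map((r) => (r - mean) / (std + stdEps));
}

// Per-token clipped surrogate: min(rho * A, clip(rho, 1 - clipEps, 1 + clipEps) * A).
function clippedSurrogate(rho, advantage, clipEps = 0.2) {
  const clipped = Math.min(Math.max(rho, 1 - clipEps), 1 + clipEps);
  return Math.min(rho * advantage, clipped * advantage);
}

// K = 4 rollouts for one prompt with a 0/1 verifiable reward.
const adv = groupAdvantages([1, 0, 0, 1]);
console.log(adv);                           // close to [1, -1, -1, 1]
console.log(clippedSurrogate(1.5, adv[0])); // ratio capped at 1.2 for a positive advantage
console.log(groupAdvantages([1, 1, 1, 1])); // all-equal group: every advantage is 0
```

The last call shows the degenerate case: when every rollout earns the same reward, the group carries no gradient signal at all.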

Two subtleties — one is a forward reference

(1) A small finite-K bias. Each r_i appears inside its own baseline (via the mean), so strictly the estimator is biased at finite K. The bias is O(1/K) and vanishes as K grows. The strictly unbiased fix uses the mean of the other K−1 rollouts (RLOO).

(2) The /std introduces a scale bias. Dividing by the group std makes the loss reward-scale-invariant, but it amplifies gradients from low-variance groups (the easiest and hardest prompts), which is not where the most learning signal is. The policy-gradient theorem has no /std factor, so this is a genuine bias, and removing it is exactly what Dr.GRPO does. We forward-reference it here; the sibling course derives it.
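
Subtlety (1) can be made concrete with a short algebra check: before any /std, subtracting the full group mean equals the leave-one-out advantage shrunk by (K−1)/K. A minimal sketch with made-up rewards and illustrative function names:

```javascript
// Sketch: the group-mean baseline versus the leave-one-out (RLOO) baseline.
// Function names are illustrative. No /std is applied, to isolate the mean's effect.
function meanCentered(rewards) {
  const K = rewards.length;
  const mean = rewards.reduce((a, b) => a + b, 0) / K;
  return rewards.map((r) => r - mean);
}

function leaveOneOut(rewards) {
  const K = rewards.length;
  const total = rewards.reduce((a, b) => a + b, 0);
  // The baseline for rollout i is the mean of the other K - 1 rewards.
  return rewards.map((r) => r - (total - r) / (K - 1));
}

const rewards = [0.2, 1.0, 0.0, 0.7, 0.4]; // made-up scores for K = 5 rollouts
const K = rewards.length;
const centered = meanCentered(rewards);
const loo = leaveOneOut(rewards);

// Identity: r_i - mean_j r_j = ((K - 1) / K) * (r_i - mean_{j != i} r_j).
const rescaled = loo.map((a) => ((K - 1) / K) * a);
console.log(centered, rescaled); // the two arrays agree up to rounding
```

So, for the mean alone, the finite-K effect is a uniform shrink of relative size 1/K, which is why it fades as K grows.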

Cross-link · where these go next


This theory course derives the algorithms; the sibling RL Post-Training course runs them on a GPU cluster and dissects the failure modes:

DPO — RL without RL: the same closed-form derivation, plus the preference-loss family (IPO, KTO, ORPO, SimPO), what DPO gives up (no exploration), and why frontier pipelines still end with an on-policy RL stage.
GRPO — drop the critic: the group baseline in production code, the degenerate-group (all-equal rewards) edge case, and the PPO-vs-GRPO memory accounting.
Dr.GRPO — GRPO done right: the derivation of the /std scale bias flagged above, plus a second length-normalization bias, and why dropping both matters for long chain-of-thought reasoning.

Same symbols throughout: β (KL/DPO temperature), ρ = π_θ/π_old (IS ratio), A (advantage), KL(·‖·).

The two derivations, side by side

Aspect	DPO	GRPO
Starts from	KL-regularized objective max 𝔼[r] − β·KL	PPO clipped surrogate (L11)
Key identity	optimal policy ⇒ reward = β·log(π/π_ref) + β·log Z	PG baseline theorem: any b(x) is unbiased
What cancels / drops	Z(x) cancels in the BT difference	the value critic V_φ drops out
Removes	reward model + on-policy sampling	a second network (~half the memory)
Cost paid	cannot explore; offline only	K rollouts/prompt; small /std & 1/K bias
Best for	style/tone/helpfulness from preference pairs	verifiable reward (math, code, reasoning)

Takeaway

DPO and GRPO are not new ideas bolted onto PPO; they are two inversions of machinery you already had. DPO inverts the KL-regularized optimum (π^* ∝ π_ref · exp(r/β)) so the reward becomes a log-ratio; plugged into Bradley–Terry, the partition function Z(x) cancels, leaving a reward-model-free, sampling-free binary classifier. GRPO replaces PPO's learned critic with the mean of K grouped rollouts: a valid prompt-only baseline that roughly halves memory, at the price of K samples per prompt and small 1/K and /std biases. Next we wire these together into the end-to-end post-training recipe: RLHF.