rl_lessons / 16 · DPO & family lesson 2 / 9 · part III

DPO: RL without RL (Rafailov et al. '23)

RLHF compresses preferences into a reward model, then optimizes the policy against it. DPO observes that the optimal policy of that RL problem has a closed form in the reward — so you can skip the reward model and directly fit the policy from preferences with one supervised-style loss.

Where this lesson sits
Lesson 15 introduced the three-stage RLHF ladder: SFT → reward model → PPO. DPO collapses the last two stages into one supervised-style loss. The trade-off is stark: you give up the on-policy sampling loop entirely in exchange for a single, debuggable training objective. The other members of the preference-loss family (IPO, KTO, ORPO, SimPO) are all small perturbations of the same idea. After this lesson you have both halves of "reward without a verifier" (RLHF and DPO); lesson 17 then asks what happens when the reward is per-step instead of per-response.

The trick, stated up front

Start from the RLHF objective (lesson 15):

maxπ   𝔼y ∼ π(·|x) [ r(x, y) ]  −  β · KL( π ‖ πref )

For any fixed reward r, this problem has a known optimal policy. (It's a textbook variational result; the Lagrangian gives it in two lines.) The optimum is:

π*(y | x)  =  πref(y | x) · exp( r(x, y) / β )  /  Z(x)

where Z(x) is the per-prompt normalizer. Solve this for r:

r(x, y)  =  β · log( π*(y | x) / πref(y | x) )  +  β · log Z(x)
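The closed form and its inversion can be checked numerically on a single prompt with a handful of responses. The sketch below is only an illustration: the reference distribution, the rewards, and β are hypothetical toy numbers, and the function names are made up for this example.

```python
import math

def objective(pi, pi_ref, r, beta):
    """E_{y~pi}[r(y)] - beta * KL(pi || pi_ref) for one prompt, finite response set."""
    expected_reward = sum(p * ri for p, ri in zip(pi, r))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    return expected_reward - beta * kl

def optimal_policy(pi_ref, r, beta):
    """Closed form: pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z. Returns (pi*, Z)."""
    weights = [q * math.exp(ri / beta) for q, ri in zip(pi_ref, r)]
    z = sum(weights)
    return [w / z for w in weights], z

# Hypothetical toy numbers for a single prompt with four responses.
pi_ref = [0.4, 0.3, 0.2, 0.1]
r = [0.0, 1.0, 2.0, 2.5]
beta = 0.5

pi_star, z = optimal_policy(pi_ref, r, beta)

# Invert the closed form: r(y) = beta * log(pi*(y) / pi_ref(y)) + beta * log Z.
recovered_r = [beta * math.log(p / q) + beta * math.log(z)
               for p, q in zip(pi_star, pi_ref)]

print("pi*:", [round(p, 4) for p in pi_star])
print("objective at pi*:   ", objective(pi_star, pi_ref, r, beta))
print("objective at pi_ref:", objective(pi_ref, pi_ref, r, beta))
print("recovered r:", [round(v, 6) for v in recovered_r])
```

The objective at π* should exceed the objective at any other distribution you try (πref is used here as one comparison point), and the recovered rewards should match r up to floating-point error.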

Now plug this expression into the Bradley–Terry preference likelihood from lesson 15. The Z(x) term cancels because BT depends only on the score difference, and both responses share the same prompt:

LDPO(θ)  =  − 𝔼(x, yw, yl) [ log σ( β · log( πθ(yw|x) / πref(yw|x) )  −  β · log( πθ(yl|x) / πref(yl|x) ) ) ]

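That's it. The reward model is gone; the PPO loop is gone; the only thing being trained is the policy πθ itself, against a static dataset of preference pairs. Each response needs one forward pass through πθ and one through the frozen πref (the latter can be precomputed). No rollouts. No clipping. No separate KL penalty term: the log-ratios against πref carry the KL regularization implicitly. No explicit reward model to hack, although the implicit reward can still be over-optimized on the preference data.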
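As a minimal sketch of the loss above, assuming you already have summed sequence log-probabilities for the chosen and rejected responses under both the policy and the frozen reference. The function names and the default β are illustrative choices, not taken from any particular library.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def bt_loss_from_rewards(r_w, r_l):
    """Bradley-Terry negative log-likelihood: depends only on r_w - r_l."""
    return -log_sigmoid(r_w - r_l)

def implicit_reward(logp, ref_logp, beta):
    """beta * log(pi_theta(y|x) / pi_ref(y|x)); the beta * log Z(x) term is dropped."""
    return beta * (logp - ref_logp)

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probs (beta=0.1 is an assumed default)."""
    r_w = implicit_reward(logp_w, ref_logp_w, beta)
    r_l = implicit_reward(logp_l, ref_logp_l, beta)
    return bt_loss_from_rewards(r_w, r_l)

# Hypothetical log-probs: the policy has moved toward the chosen response.
print(dpo_loss(logp_w=-12.0, logp_l=-15.0, ref_logp_w=-13.0, ref_logp_l=-14.0))

# Adding the same per-prompt constant (the role of beta * log Z(x)) to both
# rewards leaves the loss unchanged, which is why Z(x) never has to be computed.
print(bt_loss_from_rewards(0.3, -0.1), bt_loss_from_rewards(0.3 + 5.0, -0.1 + 5.0))
```

When the policy equals the reference, both implicit rewards are zero and the loss is log 2, the value for a coin-flip preference.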

What just happened, in one sentence
DPO re-parameterizes the policy as the implicit reward model. Training the policy to maximize the BT likelihood of the observed preferences targets the same optimum as training a BT reward model and then PPO-optimizing the KL-regularized objective against it, but with one trained network instead of two and no on-policy sampling.

What you give up

The trick only works because three things hold at once: the data is off-policy, the reference is fixed, and the objective is the specific KL-regularized RL objective. If you bend any of those, you bend the equivalence.

Interactive · DPO on a 4-response toy

Same setup as lesson 15: four responses, hidden true scores, BT-simulated human pairs. But now we're training a policy: four logits whose softmax is πθ(y|x). The reference πref is uniform. Each step samples one preference pair and applies the DPO loss. Watch the policy concentrate on the highest-scored response while the implicit reward gap blows up.

DPO on 4 responses (πref = uniform)
Hidden true scores: A=0.0, B=1.0, C=2.0, D=2.5. β controls implicit-reward sharpness. Notice that absolute probabilities of both responses in a pair can drop together — the loss only cares about the difference.
Widget readouts (initial state): Pairs seen: 0 · Avg π(D): 0.25 · Implicit Δr (D − A): 0.00
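For readers without the interactive widget, the toy can be approximated in a few lines. This is a sketch under stated assumptions, not the widget's actual code: pairs are drawn uniformly at random, labels come from a Bradley-Terry coin flip on the hidden scores, and the learning rate, step count, and seed are arbitrary choices.

```python
import math
import random

TRUE_SCORES = [0.0, 1.0, 2.0, 2.5]  # hidden scores for A, B, C, D (from the text)

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def run_dpo_toy(steps=2000, beta=1.0, lr=0.1, seed=0):
    """SGD on the DPO loss for a 4-logit policy with a uniform reference."""
    rng = random.Random(seed)
    n = len(TRUE_SCORES)
    logits = [0.0] * n  # policy starts equal to the uniform reference
    for _ in range(steps):
        i, j = rng.sample(range(n), 2)
        # Simulated annotator: i beats j with probability sigma(s_i - s_j).
        p_i_wins = sigmoid(TRUE_SCORES[i] - TRUE_SCORES[j])
        w, l = (i, j) if rng.random() < p_i_wins else (j, i)
        # With a uniform reference the pi_ref terms cancel, and the softmax
        # normalizer cancels in the difference, so h = beta * (logit_w - logit_l).
        h = beta * (logits[w] - logits[l])
        # Gradient of -log sigma(h) w.r.t. logit_w is -beta * sigma(-h),
        # and +beta * sigma(-h) w.r.t. logit_l; take one descent step.
        step = lr * beta * sigmoid(-h)
        logits[w] += step
        logits[l] -= step
    pi = softmax(logits)
    delta_r = beta * (math.log(pi[3]) - math.log(pi[0]))  # implicit r(D) - r(A)
    return pi, delta_r

pi, delta_r = run_dpo_toy()
print("policy:", [round(p, 3) for p in pi])
print("implicit reward gap D - A:", round(delta_r, 3))
```

Changing `beta` or the seed is a quick way to probe how sharply the policy concentrates; replacing the noisy annotator with a deterministic one (always prefer the higher score) is the setting in which the implicit gap keeps growing.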

The preference-loss family · same skeleton, different innards

Once you accept DPO's frame — "treat the policy as the implicit reward and fit it directly to preference data" — there is a small family of variations, each one a swap of the loss function or the data interpretation. The skeleton is the same; the cost is the same; the trade-offs differ.

Method | Year | Loss / change | Why use it
DPO | '23 | −log σ(β · Δlogratio) | The baseline. Cheap, offline, no RM. Overconfident; both probs may drop.
IPO | '23 | (Δlogratio − 1/(2β))² | Replaces the log-sigmoid with a regression on the log-ratio gap. The finite target prevents the BT-induced overconfidence, fixing one of DPO's main pathologies.
KTO | '24 | Per-response Kahneman–Tversky utility (no pairs) | Trained from "this response is good / bad" labels rather than pairwise preferences. Useful when you have thumbs-up/down data instead of A-vs-B.
ORPO | '24 | NLL on yw + odds-ratio penalty on yl | Fuses SFT and preference learning in one stage; no πref needed. Saves one model copy.
SimPO | '24 | Length-normalized DPO, no πref | Drops the reference model and divides log-prob by length. Cheaper still; competitive with DPO on length-balanced benchmarks.

IPO, in one paragraph

DPO's pathology: when a pair is always labeled the same way, BT pushes the score gap toward infinity. The log-sigmoid loss has no finite minimizer, so the gradient never vanishes. IPO swaps the BT log-likelihood for a squared-error regression: the log-ratio gap should equal a finite target, 1/(2β), rather than grow without bound. Now the loss has a minimum, and the model stops widening the chosen-vs-rejected gap once it is reached. In practice IPO is a one-line swap and a meaningful robustness win on noisy or partially tied preferences.

The most intuitive way to see the difference is to plot the per-pair loss as a function of the (already-learned) log-ratio gap h = Δlog-ratio, so that the implicit reward gap is β·h. DPO's −log σ(β·h) reaches zero only in the limit h → ∞; its gradient never vanishes. IPO's (h − 1/(2β))² has a clean minimum at the target gap h = 1/(2β); its gradient flips sign past it. Drag β below and watch IPO's minimum slide along the axis.

Per-pair loss vs. log-ratio gap h
Same axis for both: h = (log π/π_ref of winner) − (log π/π_ref of loser). The DPO loss decays toward zero asymptotically; IPO's bowl has a finite minimum that anchors training. Drag β.
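The two curves can also be tabulated without the plot. In this sketch h denotes the raw log-ratio gap (winner minus loser), matching the IPO entry in the table above, so DPO's loss is −log σ(β·h) and IPO's is (h − 1/(2β))². The function names and the β value are illustrative assumptions.

```python
import math

def dpo_pair_loss(h, beta):
    """-log sigma(beta * h), computed stably."""
    z = beta * h
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

def dpo_pair_grad(h, beta):
    """d/dh of -log sigma(beta * h) = -beta * sigma(-beta * h); always negative."""
    return -beta / (1.0 + math.exp(beta * h))

def ipo_pair_loss(h, beta):
    """Squared distance of the log-ratio gap from the target 1 / (2 * beta)."""
    return (h - 1.0 / (2.0 * beta)) ** 2

def ipo_pair_grad(h, beta):
    """Zero at the target, negative below it, positive above it."""
    return 2.0 * (h - 1.0 / (2.0 * beta))

beta = 0.5  # assumed value; the IPO target gap is then 1 / (2 * 0.5) = 1.0
print(" h     DPO loss  DPO grad   IPO loss  IPO grad")
for h in [-2.0, 0.0, 1.0, 2.0, 5.0, 10.0]:
    print(f"{h:5.1f}  {dpo_pair_loss(h, beta):8.4f}  {dpo_pair_grad(h, beta):8.4f}"
          f"  {ipo_pair_loss(h, beta):8.4f}  {ipo_pair_grad(h, beta):8.4f}")
```

Reading down the gradient columns shows the contrast: the DPO gradient shrinks but keeps its sign, while the IPO gradient crosses zero at the target gap and then pushes the gap back down.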

KTO, when you don't have pairs

Real product data is rarely pairwise — it's thumbs-up / thumbs-down on single responses. KTO reframes the loss in terms of per-response utility under prospect theory: the model is rewarded for high implicit reward on liked responses and penalized for high implicit reward on disliked ones, with an asymmetric loss that penalizes losses more than equivalent gains. The math is heavier than DPO; the data requirement is much lighter.
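A rough per-example sketch of the idea follows, written from the commonly cited form of the KTO objective rather than from this lesson. Treat every detail as an assumption: the reference point `z_ref` (in the paper an estimate of the policy-to-reference KL) is passed in as a plain number, and `lambda_d` / `lambda_u` are the weights that make the treatment of liked and disliked responses asymmetric.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def kto_loss(logp, ref_logp, desirable, z_ref=0.0, beta=0.1,
             lambda_d=1.0, lambda_u=1.0):
    """Per-response KTO-style loss (sketch; all defaults are assumed values).

    logp, ref_logp: summed sequence log-probs under the policy and the reference.
    desirable:      True for a thumbs-up response, False for thumbs-down.
    z_ref:          reference point that the log-ratio is compared against.
    """
    log_ratio = logp - ref_logp
    if desirable:
        # Loss falls as the liked response gains probability relative to pi_ref.
        return lambda_d * (1.0 - sigmoid(beta * (log_ratio - z_ref)))
    # Loss falls as the disliked response loses probability relative to pi_ref.
    return lambda_u * (1.0 - sigmoid(beta * (z_ref - log_ratio)))

# Hypothetical single-response labels: no pairing is needed.
print(kto_loss(logp=-10.0, ref_logp=-12.0, desirable=True))
print(kto_loss(logp=-10.0, ref_logp=-12.0, desirable=False, lambda_u=1.5))
```

Each labeled response contributes its own term, which is what lets this style of loss consume unpaired thumbs-up/down logs.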

ORPO and SimPO, when you want even fewer moving parts

Both eliminate the frozen reference model. ORPO fuses SFT (cross-entropy on the chosen response) with an odds-ratio penalty (push the rejected response's relative odds down) in one loss. SimPO drops πref by length-normalizing the policy log-prob and adding a margin γ: L = −log σ( β/|yw| · log πθ(yw|x) − β/|yl| · log πθ(yl|x) − γ ). Both avoid holding πref in GPU memory, which matters when VRAM is already tight at 70B+ scale.
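Both reference-free losses can be sketched from summed log-probs and token counts. The SimPO function follows the formula in the paragraph above. The ORPO function is an assumption-laden reading of "NLL plus odds-ratio penalty": it uses length-averaged log-probs to define the odds and a hypothetical weight `lam`. The default hyperparameters are placeholders, not recommended values.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def simpo_loss(logp_w, len_w, logp_l, len_l, beta=2.0, gamma=0.5):
    """-log sigma(beta/|yw| * log pi(yw) - beta/|yl| * log pi(yl) - gamma)."""
    margin = beta * logp_w / len_w - beta * logp_l / len_l - gamma
    return -log_sigmoid(margin)

def log_odds(avg_logp):
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))

def orpo_loss(logp_w, len_w, logp_l, len_l, lam=0.1):
    """SFT term on the chosen response plus a weighted odds-ratio penalty (sketch)."""
    avg_w = logp_w / len_w  # length-averaged log-prob of the chosen response
    avg_l = logp_l / len_l  # length-averaged log-prob of the rejected response
    nll = -avg_w
    ratio_penalty = -log_sigmoid(log_odds(avg_w) - log_odds(avg_l))
    return nll + lam * ratio_penalty

# Hypothetical pair: chosen response has 10 tokens, rejected has 20.
print(simpo_loss(logp_w=-8.0, len_w=10, logp_l=-24.0, len_l=20))
print(orpo_loss(logp_w=-8.0, len_w=10, logp_l=-24.0, len_l=20))
```

Neither function takes reference log-probs as input, which is the memory saving the paragraph describes.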

When DPO & family beat PPO, and when they don't

Pick DPO/IPO/KTO/ORPO/SimPO if… | Pick PPO (lesson 15) if…
You already have a fixed preference dataset and won't re-sample. | You can run rollouts during training and want fresh data.
Compute is tight (no RM, no rollout infra). | You can afford 4× model copies + rollout engine + RM serving.
The reachable optimum lies inside the data distribution. | The good responses don't exist in your preference set; you need to discover them.
You want a reproducible, debuggable supervised-style loss. | You're willing to debug a multi-component training loop.
Your task is style, tone, formatting, helpfulness. | Your task is reasoning, math, code with a verifier (then use GRPO/DAPO instead).

A common surprise in practice
Teams often try DPO first because it's cheap, get a 60–70% solution, then hit a ceiling: the model has learned the chosen-vs-rejected distinction, but absolute response quality plateaus. The fix is usually one of (a) iterative DPO: re-collect preference pairs from the new model and repeat, (b) a switch to IPO to avoid the overconfidence trap, or (c) biting the bullet on a PPO/GRPO setup. Many published frontier post-training pipelines include at least one on-policy RL stage.

Why this matters for reasoning RL

The reasoning-RL algorithms in Part II (GRPO, RLOO, DAPO, Dr.GRPO) are all on-policy like PPO. DPO and friends are an alternative track that works well when (1) the reward signal is a human preference, not a verifier, and (2) you have or can synthesize lots of static preference pairs. Modern post-training pipelines often combine preference-based and verifier-based signals: for example, DeepSeek-R1's final RL stage mixes rule-based rewards on reasoning data with preference-based reward models on helpfulness and harmlessness data. The two families are complementary, not competitive.

Takeaway
DPO is RLHF with the reward model folded into the policy and the on-policy sampling deleted. You save infrastructure but lose exploration. The IPO/KTO/ORPO/SimPO variants are small perturbations on the same idea, each trading away a different RLHF assumption.