DPO: RL without RL (Rafailov et al., '23)
RLHF compresses preferences into a reward model, then optimizes the policy against it. DPO observes that the optimal policy of that RL problem has a closed form in the reward — so you can skip the reward model and directly fit the policy from preferences with one supervised-style loss.
The trick, stated up front
Start from the RLHF objective (lesson 15):
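Written out in the DPO paper's usual notation (assumed here to match lesson 15: πθ is the policy being trained, πref the frozen reference, β the KL coefficient, 𝒟 the prompt distribution), the KL-regularized objective is:

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D}}
\Big[
  \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \big[ r(x, y) \big]
  \;-\; \beta \, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
\Big]
```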
For any fixed reward r, this problem has a known optimal policy. (It's a textbook variational result; the Lagrangian gives it in two lines.) The optimum is:
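In the same assumed notation (πref the reference policy, β the KL coefficient), the closed form is the reference policy reweighted by the exponentiated reward:

```latex
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)} \, \pi_{\mathrm{ref}}(y \mid x) \,
\exp\!\Big( \tfrac{1}{\beta} \, r(x, y) \Big),
\qquad
Z(x) \;=\; \sum_{y} \pi_{\mathrm{ref}}(y \mid x) \,
\exp\!\Big( \tfrac{1}{\beta} \, r(x, y) \Big)
```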
where Z(x) is the per-prompt normalizer. Solve this for r:
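For an optimum of the form π* ∝ πref · exp(r/β), taking logs of both sides and rearranging gives the reward in terms of the policy (same assumed notation as the DPO paper):

```latex
r(x, y) \;=\; \beta \, \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
\;+\; \beta \, \log Z(x)
```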
Now plug this expression into the Bradley–Terry preference likelihood from lesson 15. The Z(x) term cancels because BT depends only on the score difference, and both responses share the same prompt:
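Assuming the standard Bradley–Terry form p(yw ≻ yl | x) = σ(r(x, yw) − r(x, yl)), with yw the chosen and yl the rejected response, substituting the implicit reward β·log(π/πref) + β·log Z(x) and replacing π* with the trainable πθ gives the negative log-likelihood usually quoted as the DPO loss:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
\;=\;
- \, \mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    \;-\;
    \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)
\right]
```

The β·log Z(x) terms appear once with each sign inside σ(·), which is why they drop out.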
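That's it. The reward model is gone; the PPO loop is gone; the only thing being trained is the policy πθ itself, against a static dataset of preference pairs. Each response needs one forward pass through πθ and one through the frozen πref (the πref log-probs can be precomputed once). No rollouts. No clipping. No separate KL penalty term (the log-ratios against πref play the role of the KL constraint). No explicit reward model to hack, although the implicit reward can still be over-optimized on the training pairs.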
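To make the "supervised-style loss" concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. It assumes the four inputs are summed token log-probabilities of each full response under the policy and the reference; the function names and the example numbers are illustrative, not taken from any particular library.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss.

    logp_w / logp_l: log pi_theta(y|x) for the chosen / rejected response.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference.
    """
    # Difference of log-ratios: the "delta logratio" of the loss.
    delta = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * delta)


# Policy identical to the reference: delta = 0, so the loss is log(2).
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy that raised the chosen response and lowered the rejected one.
loss_after = dpo_loss(-11.0, -18.0, -12.0, -15.0)

print(loss_at_init, loss_after)
```

In a real training loop the same expression would be computed on batched tensors and differentiated with respect to the policy parameters only; the reference log-probs are constants.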
What you give up
The trick only works because we collapsed three things into one: off-policy data, a fixed reference, and the specific KL-regularized RL objective. If you bend any of those, you bend the equivalence.
- It's offline. Preference pairs are sampled from some policy at collection time — usually πSFT. Once collected, you cannot use the in-training policy's own samples. You're learning from a fixed dataset, like SFT.
- It cannot explore. If the preference set doesn't contain the response you'd want the model to produce, no amount of DPO will get there. RLHF can sample a new response, score it, and reinforce it; DPO cannot.
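- β is a loss temperature, not a live KL penalty. In PPO the KL coefficient penalizes divergence measured on fresh samples at every step. In DPO, β only scales the implicit reward β·log(πθ/πref) on the fixed pairs: a smaller β still lets the policy drift further from πref, but nothing measures or bounds the actual KL during training.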
- The implicit reward is overconfident. Empirically DPO drives log πθ(yw) − log πθ(yl) very high on the training pairs and overfits chosen-vs-rejected structure; both πθ(yw) and πθ(yl) typically decrease in absolute probability. IPO (below) was designed to address exactly this overfitting.
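The last bullet follows directly from the shape of the loss: it only sees the difference of log-ratios, so it can fall even when both responses lose probability. A small numeric sketch (all log-prob values are made up for illustration):

```python
import math


def dpo_pair_loss(logp_w, logp_l, ref_w, ref_l, beta):
    """-log sigmoid(beta * delta); fine for the moderate values used here."""
    delta = (logp_w - ref_w) - (logp_l - ref_l)
    return math.log1p(math.exp(-beta * delta))


ref_w, ref_l = -10.0, -10.0   # hypothetical reference log-probs
beta = 0.5

# Before training: the policy equals the reference.
before = dpo_pair_loss(-10.0, -10.0, ref_w, ref_l, beta)

# After training: BOTH responses are less likely than before,
# but the rejected one fell much further than the chosen one.
chosen_after, rejected_after = -12.0, -20.0
after = dpo_pair_loss(chosen_after, rejected_after, ref_w, ref_l, beta)

print(f"loss before={before:.4f}  after={after:.4f}")
print(f"chosen log-prob moved by {chosen_after - (-10.0):+.1f}")
```

The loss drops sharply even though the chosen response's absolute log-probability went down, which is the behavior the bullet describes.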
Interactive · DPO on a 4-response toy
Same setup as lesson 15: four responses, hidden true scores, BT-simulated human pairs. But now we're training a policy: four logits whose softmax is πθ(y|x). The reference πref is uniform. Each step samples one preference pair and applies the DPO loss. Watch the policy concentrate on the highest-scored response while the implicit reward gap blows up.
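The toy described above can be approximated offline with a short script. This is a sketch under stated assumptions, not the page's actual widget: the true scores, learning rate, β, and step count are arbitrary choices, and the gradient uses the fact that with a uniform reference and a softmax policy the log-ratio gap reduces to the difference of the two logits.

```python
import math
import random


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def run_toy_dpo(true_scores, beta=0.5, lr=0.2, steps=5000, seed=0):
    rng = random.Random(seed)
    n = len(true_scores)
    theta = [0.0] * n  # policy logits; softmax(theta) starts at the uniform reference
    for _ in range(steps):
        i, j = rng.sample(range(n), 2)
        # BT-simulated annotator: i is preferred with prob sigmoid(s_i - s_j).
        if rng.random() < sigmoid(true_scores[i] - true_scores[j]):
            w, l = i, j
        else:
            w, l = j, i
        # Uniform reference => delta logratio = theta[w] - theta[l].
        gap = theta[w] - theta[l]
        # d/d(gap) of -log sigmoid(beta * gap) is -beta * sigmoid(-beta * gap).
        g = beta * sigmoid(-beta * gap)
        theta[w] += lr * g   # gradient descent step raises the chosen logit
        theta[l] -= lr * g   # and lowers the rejected logit
    return theta, softmax(theta)


true_scores = [2.0, 1.0, 0.0, -1.0]   # hidden "human" scores (made up)
theta, policy = run_toy_dpo(true_scores)
print("policy:", [round(p, 3) for p in policy])
print("implicit reward gap (best vs worst):", round(0.5 * (theta[0] - theta[3]), 3))
```

With noisy BT labels the gap settles around a finite value; replacing the sampled label with "higher score always wins" is one way to see the gap keep growing.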
The preference-loss family · same skeleton, different innards
Once you accept DPO's frame — "treat the policy as the implicit reward and fit it directly to preference data" — there is a small family of variations, each one a swap of the loss function or the data interpretation. The skeleton is the same; the cost is the same; the trade-offs differ.
| Method | Year | Loss / change | Why use it |
|---|---|---|---|
| DPO | '23 | −log σ(β · Δlogratio) | The baseline. Cheap, offline, no RM. Overconfident; both probs may drop. |
| IPO | '23 | (Δlogratio − 1/(2β))² | Replaces the log-sigmoid with a squared-error regression on the log-ratio gap. The finite target 1/(2β) removes the BT-induced overconfidence, one of DPO's main pathologies. |
| KTO | '24 | Per-response Kahneman–Tversky utility (no pairs) | Trained from "this response is good / bad" labels rather than pairwise preferences. Useful when you have thumbs-up/down data instead of A-vs-B. |
| ORPO | '24 | NLL on yw + odds-ratio penalty on yl | Fuses SFT and preference learning in one stage — no πref needed. Saves one model copy. |
| SimPO | '24 | Length-normalized DPO, no πref | Drops the reference model and divides log-prob by length. Cheaper still; competitive with DPO on length-balanced benchmarks. |
IPO, in one paragraph
DPO's pathology: when a pair is always labeled the same way, the BT likelihood is maximized only by pushing the score gap toward infinity. The log-sigmoid loss has no finite minimizer, so the gradient never vanishes. IPO swaps the BT log-likelihood for a squared-error regression: the log-ratio gap Δlogratio should equal a finite target 1/(2β). Now the loss has a minimum, and the model stops widening the chosen-vs-rejected gap once it is reached. In practice IPO is a one-line swap and a meaningful robustness win on noisy or partially-tied preferences.
The most intuitive way to see the difference is to plot the per-pair loss as a function of the (already-learned) log-ratio gap Δ = Δlogratio. DPO's −log σ(β·Δ) approaches zero only as Δ → ∞; the gradient never goes away. IPO's (Δ − 1/(2β))² has a clean minimum at the target gap Δ = 1/(2β); the gradient flips sign past it. As β increases, IPO's minimum slides toward zero along the axis.
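The comparison can also be checked numerically. The sketch below uses the table's parameterization, with both losses written as functions of the raw log-ratio gap; the function names and the sample gap values are illustrative.

```python
import math


def dpo_loss_and_grad(gap: float, beta: float):
    """-log sigmoid(beta * gap) and its derivative with respect to gap."""
    z = beta * gap
    loss = math.log1p(math.exp(-z))
    grad = -beta / (1.0 + math.exp(z))   # always negative: keep widening the gap
    return loss, grad


def ipo_loss_and_grad(gap: float, beta: float):
    """(gap - 1/(2*beta))**2 and its derivative with respect to gap."""
    target = 1.0 / (2.0 * beta)
    return (gap - target) ** 2, 2.0 * (gap - target)


beta = 0.1   # IPO target gap is 1 / (2 * 0.1) = 5
for gap in (0.0, 4.0, 5.0, 6.0, 50.0):
    d_loss, d_grad = dpo_loss_and_grad(gap, beta)
    i_loss, i_grad = ipo_loss_and_grad(gap, beta)
    print(f"gap={gap:5.1f}  DPO grad={d_grad:+.5f}  IPO grad={i_grad:+.2f}")
```

DPO's gradient shrinks but keeps its sign at every gap, while IPO's gradient is zero at the target and pushes back once the gap overshoots.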
KTO, when you don't have pairs
Real product data is rarely pairwise — it's thumbs-up / thumbs-down on single responses. KTO reframes the loss in terms of per-response utility under prospect theory: the model is rewarded for high implicit reward on liked responses and penalized for high implicit reward on disliked ones, with an asymmetric loss that penalizes losses more than equivalent gains. The math is heavier than DPO; the data requirement is much lighter.
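A rough per-example sketch of a KTO-style loss follows. It assumes the form commonly quoted from the KTO paper, where the implicit reward r = log πθ(y|x) − log πref(y|x) is compared against a reference point (a batch-level KL estimate, passed here as the hypothetical argument `kl_ref`). The weights `lambda_d` and `lambda_u` are illustrative knobs for the liked/disliked asymmetry, not recommended settings.

```python
import math


def sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def kto_loss(logp: float, ref_logp: float, desirable: bool,
             kl_ref: float = 0.0, beta: float = 0.1,
             lambda_d: float = 1.0, lambda_u: float = 1.0) -> float:
    """Loss for ONE labeled response (no pair needed).

    logp / ref_logp: summed log-prob of the response under policy / reference.
    desirable: True for a thumbs-up, False for a thumbs-down.
    kl_ref: reference point; assumed to be estimated elsewhere and treated as a constant.
    """
    r = logp - ref_logp   # implicit reward, up to the factor beta
    if desirable:
        # Liked response: loss falls as the implicit reward rises above the reference point.
        return lambda_d * (1.0 - sigmoid(beta * (r - kl_ref)))
    # Disliked response: loss falls as the implicit reward drops below it.
    return lambda_u * (1.0 - sigmoid(beta * (kl_ref - r)))


neutral = kto_loss(-10.0, -10.0, desirable=True)
liked_up = kto_loss(-5.0, -10.0, desirable=True)
disliked_up = kto_loss(-5.0, -10.0, desirable=False)
print(neutral, liked_up, disliked_up)
```

Raising the probability of a liked response lowers the loss; raising it on a disliked response increases the loss, and unequal `lambda_d` / `lambda_u` values make the two directions asymmetric.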
ORPO and SimPO, when you want even fewer moving parts
Both eliminate the frozen reference model. ORPO fuses SFT (cross-entropy on the chosen response) with an odds-ratio penalty (push the rejected response's relative odds down) in one loss. SimPO drops πref by length-normalizing the policy log-prob and adding a margin: L = −log σ( β/|yw| · log π(yw) − β/|yl| · log π(yl) − γ ). Both save the memory of holding πref in GPU RAM — which matters when you're already tight on VRAM at 70B+ scale.
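Both reference-free losses fit in a few lines. The SimPO function follows the formula above directly. The ORPO function is a sketch that assumes length-averaged log-probs for both the NLL term and the odds, with `lam` as a hypothetical name for the penalty weight; the default values for β, γ, and `lam` are placeholders.

```python
import math


def log_sigmoid(z: float) -> float:
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def simpo_loss(logp_w, len_w, logp_l, len_l, beta=2.0, gamma=1.0):
    """-log sigmoid(beta/|yw| * log pi(yw) - beta/|yl| * log pi(yl) - gamma)."""
    margin = beta * logp_w / len_w - beta * logp_l / len_l - gamma
    return -log_sigmoid(margin)


def orpo_loss(logp_w, len_w, logp_l, len_l, lam=0.1):
    """NLL on the chosen response plus an odds-ratio penalty on the rejected one."""
    avg_w = logp_w / len_w          # per-token log-prob; must be strictly negative
    avg_l = logp_l / len_l
    # log odds(y) = log p - log(1 - p), with p = exp(avg log-prob)
    log_odds_w = avg_w - math.log1p(-math.exp(avg_w))
    log_odds_l = avg_l - math.log1p(-math.exp(avg_l))
    nll = -avg_w                    # the SFT term on the chosen response
    return nll - lam * log_sigmoid(log_odds_w - log_odds_l)


# Two responses with the same per-token log-prob (-1.0) but different lengths.
simpo_tied = simpo_loss(-10.0, 10, -20.0, 20)
orpo_tied = orpo_loss(-10.0, 10, -20.0, 20)
# Chosen response clearly more likely per token than the rejected one.
simpo_sep = simpo_loss(-5.0, 10, -40.0, 20)
print(simpo_tied, simpo_sep, orpo_tied)
```

Neither function takes reference log-probs as input, which is where the memory saving comes from.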
When DPO & family beat PPO, and when they don't
| Pick DPO/IPO/KTO/ORPO/SimPO if… | Pick PPO (lesson 15) if… |
|---|---|
| You already have a fixed preference dataset and won't re-sample. | You can run rollouts during training and want fresh data. |
| Compute is tight (no RM, no rollout infra). | You can afford 4 model copies (policy, reference, reward model, value model) + rollout engine + RM serving. |
| The reachable optimum lies inside the data distribution. | The good responses don't exist in your preference set; you need to discover them. |
| You want a reproducible, debuggable supervised-style loss. | You're willing to debug a multi-component training loop. |
| Your task is style, tone, formatting, helpfulness. | Your task is reasoning, math, code with a verifier (then use GRPO/DAPO instead). |
Why this matters for reasoning RL
The reasoning-RL algorithms in Part II (GRPO, RLOO, DAPO, Dr.GRPO) are all on-policy like PPO. DPO and friends are an alternative track that works well when (1) the reward signal is a human preference, not a verifier, and (2) you have or can synthesize lots of static preference pairs. In modern post-training pipelines you often see DPO used alongside RL stages: for example, Tülu 3 runs SFT, then DPO on preference data, then RLVR on verifiable tasks. The two families are complementary, not competitive.