DPO: RL without RL (Rafailov et al., '23)

RLHF compresses preferences into a reward model, then optimizes the policy against it. DPO observes that the optimal policy of that RL problem has a closed form in the reward — so you can skip the reward model and directly fit the policy from preferences with one supervised-style loss.

Where this lesson sits

Lesson 15 introduced the three-stage RLHF ladder: SFT → reward model → PPO. DPO collapses the last two stages into one supervised-style loss. The trade-off is dramatic: you give up the on-policy sampling loop entirely, in exchange for a single, debuggable training objective. The rest of the preference-loss family (IPO, KTO, ORPO, SimPO) consists of small perturbations on the same idea. After this you have both halves of "reward without a verifier", RLHF and DPO, and lesson 17 will then ask what happens when the reward is per-step instead of per-response.

The trick, stated up front

Start from the RLHF objective (lesson 15):

max_π 𝔼_{y ∼ π(·|x)} [ r(x, y) ] − β · KL( π ‖ π_ref )

For any fixed reward r, this problem has a known optimal policy. (It's a textbook variational result; the Lagrangian gives it in two lines.) The optimum is:

π^*(y | x) = π_ref(y | x) · exp( r(x, y) / β ) / Z(x)

where Z(x) is the per-prompt normalizer. Solve this for r:

r(x, y) = β · log( π^*(y | x) / π_ref(y | x) ) + β · log Z(x)
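For readers who want the short argument behind the closed form, here is one standard route, sketched under the assumption of a fixed prompt x and a finite response set. It rewrites the objective as a KL divergence to π^* instead of writing out the Lagrangian; the intermediate steps are not part of the original lesson.

```latex
\begin{aligned}
\mathbb{E}_{y \sim \pi}\big[r(x,y)\big] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}})
  &= -\beta\,\mathbb{E}_{y \sim \pi}\!\left[\log \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} - \frac{r(x,y)}{\beta}\right] \\
  &= -\beta\,\mathbb{E}_{y \sim \pi}\!\left[\log \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)\,\exp\!\big(r(x,y)/\beta\big)/Z(x)}\right] + \beta \log Z(x) \\
  &= -\beta\,\mathrm{KL}(\pi \,\|\, \pi^{*}) + \beta \log Z(x),
  \qquad Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\,\exp\!\big(r(x,y)/\beta\big).
\end{aligned}
% Z(x) does not depend on \pi, and KL >= 0 with equality iff \pi = \pi^{*},
% so the maximizer is \pi^{*}. Taking logs of \pi^{*} and rearranging gives
\[
r(x,y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x).
\]
```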

Now plug this expression into the Bradley–Terry preference likelihood from lesson 15. The Z(x) term cancels because BT depends only on the score difference, and both responses share the same prompt:

L_DPO(θ) = − 𝔼_{(x, y_w, y_l)} log σ( β · log[ π_θ(y_w|x) / π_ref(y_w|x) ] − β · log[ π_θ(y_l|x) / π_ref(y_l|x) ] )

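That's it. The reward model is gone; the PPO loop is gone; the only thing being trained is the policy π_θ itself, against a static dataset of preference pairs, with one forward pass per response through π_θ and one through the frozen π_ref (whose log-probs can be precomputed). No rollouts. No clipping. No separate KL penalty term (the π_ref log-ratios carry the KL regularization implicitly). No explicit reward model to hack, although the implicit reward can still be over-optimized, as discussed below.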
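A minimal sketch of the per-pair loss, assuming you already have summed sequence log-probabilities for the chosen and rejected responses under both the policy and the frozen reference. The function and argument names are illustrative, not taken from any library.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss from summed sequence log-probs (hypothetical inputs)."""
    logratio_w = policy_logp_w - ref_logp_w   # log pi_theta(y_w|x) / pi_ref(y_w|x)
    logratio_l = policy_logp_l - ref_logp_l   # log pi_theta(y_l|x) / pi_ref(y_l|x)
    margin = beta * (logratio_w - logratio_l)
    return -log_sigmoid(margin)


# Policy identical to the reference: the margin is 0, so the loss is log 2.
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy has moved toward the chosen response and away from the rejected one.
loss_after = dpo_loss(-10.0, -18.0, -12.0, -15.0)

print(loss_at_init, loss_after)
```

In a real training loop the same arithmetic would run on batched tensors with autograd; the scalar version here only shows which four numbers the loss consumes.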

What just happened, in one sentence

DPO re-parameterizes the reward model through the policy: the policy's log-ratio against π_ref is the implicit reward. Training the policy to maximize the BT likelihood on observed preferences targets the same optimum as training a BT reward model and then PPO-optimizing the KL-regularized objective against it (assuming both optimizations succeed exactly), but with one trained network instead of two and no on-policy sampling.
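The "implicit reward" claim can be checked numerically on a toy prompt. The sketch below assumes four candidate responses with made-up rewards and a made-up reference policy, builds π^* from the closed form, and confirms that β · log(π^*/π_ref) recovers the reward up to the shared offset β · log Z(x).

```python
import math

beta = 0.5
rewards = [1.0, 0.2, -0.3, 2.0]   # hypothetical r(x, y) for four responses
pi_ref = [0.4, 0.3, 0.2, 0.1]     # hypothetical reference policy

# Closed-form optimum: pi*(y|x) = pi_ref(y|x) * exp(r / beta) / Z(x)
unnormalized = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
Z = sum(unnormalized)
pi_star = [u / Z for u in unnormalized]

# Implicit reward read back off the policy.
implicit = [beta * math.log(ps / pr) for ps, pr in zip(pi_star, pi_ref)]

# Each true reward differs from the implicit one by the same constant, beta * log Z.
offsets = [r - h for r, h in zip(rewards, implicit)]
print(offsets, beta * math.log(Z))
```

Because the offset is identical for every response to the same prompt, it drops out of any pairwise difference, which is why the Bradley–Terry likelihood never needs Z(x).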

What you give up

The trick only works because we collapsed three things into one: off-policy data, a fixed reference, and the specific KL-regularized RL objective. If you bend any of those, you bend the equivalence.

It's offline. Preference pairs are sampled from some policy at collection time — usually π_SFT. Once collected, you cannot use the in-training policy's own samples. You're learning from a fixed dataset, like SFT.
It cannot explore. If the preference set doesn't contain the response you'd want the model to produce, no amount of DPO will get there. RLHF can sample a new response, score it, and reinforce it; DPO cannot.
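It turns β into a loss temperature. In PPO the KL coefficient weights a KL term that is measured on fresh samples at every step; in DPO β only scales the log-ratio gap inside the sigmoid, so it sets how large a gap is needed to fit a preference, and nothing measures or bounds the actual KL from π_ref during training.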
The implicit reward is overconfident. Empirically DPO drives the gap log π_θ(y_w|x) − log π_θ(y_l|x) very high on the training pairs and overfits chosen-vs-rejected structure, while both π_θ(y_w|x) and π_θ(y_l|x) often decrease in absolute probability. IPO below was designed to address exactly this overfitting.

Interactive · DPO on a 4-response toy

Same setup as lesson 15: four responses, hidden true scores, BT-simulated human pairs. But now we're training a policy: four logits whose softmax is π_θ(y|x). The reference π_ref is uniform. Each step samples one preference pair and applies the DPO loss. Watch the policy concentrate on the highest-scored response while the implicit reward gap blows up.
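The toy can also be reproduced offline. The sketch below is one possible reading of the setup, not the widget's actual code: the true scores, β, learning rate, and step count are all assumed values. With a uniform reference and a softmax over four logits, the reference ratios and the softmax normalizer both cancel, so the DPO margin reduces to β times the logit difference and the gradient can be written by hand.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_toy_dpo(true_scores, beta=0.5, lr=0.1, steps=5000, seed=0):
    """SGD on the DPO loss for a 4-logit policy with a uniform reference."""
    rng = random.Random(seed)
    n = len(true_scores)
    logits = [0.0] * n  # policy starts equal to the uniform reference
    for _ in range(steps):
        i, j = rng.sample(range(n), 2)
        # BT-simulated annotator: i is preferred with prob sigmoid(s_i - s_j).
        if rng.random() < sigmoid(true_scores[i] - true_scores[j]):
            w, l = i, j
        else:
            w, l = j, i
        # Uniform pi_ref and the shared softmax normalizer cancel in the margin.
        margin = beta * (logits[w] - logits[l])
        step = lr * beta * sigmoid(-margin)  # magnitude of -d(loss)/d(logit_w)
        logits[w] += step
        logits[l] -= step
    return logits


true_scores = [2.0, 1.0, 0.0, -1.0]  # hypothetical hidden scores
logits = train_toy_dpo(true_scores)
policy = softmax(logits)
print([round(p, 3) for p in policy])
print("implicit reward gap (best vs worst):", 0.5 * (logits[0] - logits[3]))
```

Making the simulated annotator more deterministic (larger gaps between the true scores) is a simple way to see the implicit reward gap grow faster.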

The preference-loss family · same skeleton, different innards

Once you accept DPO's frame — "treat the policy as the implicit reward and fit it directly to preference data" — there is a small family of variations, each one a swap of the loss function or the data interpretation. The skeleton is the same; the cost is the same; the trade-offs differ.

Method	Year	Loss / change	Why use it
DPO	'23	−log σ(β · Δlogratio)	The baseline. Cheap, offline, no RM. Overconfident; both probs may drop.
IPO	'23	(Δlogratio − 1/(2β))²	Replaces the log-sigmoid with a squared-error regression on the log-ratio gap. The finite target gap prevents the BT-induced overconfidence, fixing one of DPO's main pathologies.
KTO	'24	Per-response Kahneman–Tversky utility (no pairs)	Trained from "this response is good / bad" labels rather than pairwise preferences. Useful when you have thumbs-up/down data instead of A-vs-B.
ORPO	'24	NLL on y_w + odds-ratio penalty on y_l	Fuses SFT and preference learning in one stage — no π_ref needed. Saves one model copy.
SimPO	'24	Length-normalized DPO, no π_ref	Drops the reference model and divides log-prob by length. Cheaper still; competitive with DPO on length-balanced benchmarks.

IPO, in one paragraph

DPO's pathology: when a preference is deterministic (one response always wins the pair), BT pushes the score gap toward infinity. The log-sigmoid loss has no finite minimizer, so the gradient never vanishes. IPO swaps the BT log-likelihood for a squared-error regression: the log-ratio gap should equal a finite target value 1/(2β), not grow without bound. Now the loss has a minimum, and the model stops widening the chosen-vs-rejected gap once the target is reached. In practice IPO is a one-line swap and a meaningful robustness win on noisy or partially-tied preferences.

The most intuitive way to see the difference is to plot the per-pair loss as a function of the log-ratio gap Δ (the Δlogratio of the table above). DPO's −log σ(β·Δ) approaches zero only as Δ → ∞; the gradient never goes away. IPO's (Δ − 1/(2β))² has a clean minimum at the target gap Δ = 1/(2β); the gradient flips sign past it. As β varies, IPO's minimum slides along the axis.
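The comparison can be tabulated without a plot. This sketch takes the log-ratio gap as the variable, using the two loss formulas from the table; the function names and the sampled gap values are illustrative choices.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dpo_pair_loss(gap, beta):
    """-log sigmoid(beta * gap), computed as a stable softplus(-beta * gap)."""
    z = beta * gap
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_pair_grad(gap, beta):
    """d/d(gap) of the DPO pair loss; strictly negative for every finite gap."""
    return -beta * sigmoid(-beta * gap)


def ipo_pair_loss(gap, beta):
    """Squared distance of the log-ratio gap from the target 1 / (2 * beta)."""
    return (gap - 1.0 / (2.0 * beta)) ** 2


def ipo_pair_grad(gap, beta):
    """d/d(gap) of the IPO pair loss; changes sign at the target gap."""
    return 2.0 * (gap - 1.0 / (2.0 * beta))


beta = 0.5  # target gap for IPO is 1 / (2 * 0.5) = 1.0
for gap in [0.0, 0.5, 1.0, 2.0, 5.0, 20.0]:
    print(f"gap={gap:5.1f}  "
          f"dpo_loss={dpo_pair_loss(gap, beta):.4f}  dpo_grad={dpo_pair_grad(gap, beta):+.4f}  "
          f"ipo_loss={ipo_pair_loss(gap, beta):.4f}  ipo_grad={ipo_pair_grad(gap, beta):+.4f}")
```

The DPO gradient column shrinks but stays negative, so gradient descent keeps widening the gap; the IPO gradient column crosses zero at the target and then pushes the gap back.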

KTO, when you don't have pairs

Real product data is rarely pairwise — it's thumbs-up / thumbs-down on single responses. KTO reframes the loss in terms of per-response utility under prospect theory: the model is rewarded for high implicit reward on liked responses and penalized for high implicit reward on disliked ones, with an asymmetric loss that penalizes losses more than equivalent gains. The math is heavier than DPO; the data requirement is much lighter.
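A simplified sketch of a KTO-style per-response loss, assuming the commonly described form in which the implicit reward is compared with a reference point and passed through a sigmoid. Here the reference point `z_ref` is just a number supplied by the caller (in practice it is estimated during training), and `lambda_d`, `lambda_u`, and the default values are illustrative assumptions.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def kto_style_loss(policy_logp, ref_logp, desirable,
                   z_ref=0.0, beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Loss for ONE labeled response (thumbs-up or thumbs-down), no pair needed.

    policy_logp / ref_logp: summed sequence log-probs under policy and reference.
    z_ref: assumed reference point for the implicit reward (hypothetical input).
    lambda_d / lambda_u: weights for desirable / undesirable examples; setting
    lambda_u > lambda_d weights bad outcomes more heavily.
    """
    log_ratio = policy_logp - ref_logp  # implicit reward, up to the factor beta
    if desirable:
        return lambda_d * (1.0 - sigmoid(beta * (log_ratio - z_ref)))
    return lambda_u * (1.0 - sigmoid(beta * (z_ref - log_ratio)))


# A liked response: raising its log-prob relative to the reference lowers the loss.
liked_up = kto_style_loss(-10.0, -12.0, desirable=True)
liked_down = kto_style_loss(-14.0, -12.0, desirable=True)

# A disliked response: raising its log-prob raises the loss.
disliked_up = kto_style_loss(-10.0, -12.0, desirable=False, lambda_u=1.5)
disliked_down = kto_style_loss(-14.0, -12.0, desirable=False, lambda_u=1.5)

print(liked_up, liked_down, disliked_up, disliked_down)
```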

ORPO and SimPO, when you want even fewer moving parts

Both eliminate the frozen reference model. ORPO fuses SFT (cross-entropy on the chosen response) with an odds-ratio penalty (push the rejected response's relative odds down) in one loss. SimPO drops π_ref by length-normalizing the policy log-prob and adding a margin: L = −log σ( β/|y_w| · log π(y_w) − β/|y_l| · log π(y_l) − γ ). Both save the memory of holding π_ref in GPU RAM — which matters when you're already tight on VRAM at 70B+ scale.
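Both reference-free losses fit in a few lines. The SimPO function follows the formula in the paragraph above; the ORPO function is a sketch that assumes the odds are computed from the length-normalized (per-token average) log-probability and that the penalty weight is a hyperparameter called `lam` here. All names and default values are illustrative.

```python
import math


def log_sigmoid(z):
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def simpo_loss(logp_w, len_w, logp_l, len_l, beta=2.0, gamma=0.5):
    """-log sigmoid(beta/|y_w| * log pi(y_w) - beta/|y_l| * log pi(y_l) - gamma)."""
    margin = beta * logp_w / len_w - beta * logp_l / len_l - gamma
    return -log_sigmoid(margin)


def log_odds(avg_logp):
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(logp_w, len_w, logp_l, len_l, lam=0.1):
    """SFT term on the chosen response plus an odds-ratio penalty (assumed form)."""
    avg_w = logp_w / len_w
    avg_l = logp_l / len_l
    nll_chosen = -avg_w
    odds_penalty = -log_sigmoid(log_odds(avg_w) - log_odds(avg_l))
    return nll_chosen + lam * odds_penalty


# Same per-token log-prob (-2.0) at different lengths: SimPO sees no preference
# signal beyond the margin gamma, even though the summed log-probs differ.
simpo_equal_rate = simpo_loss(-20.0, 10, -40.0, 20)

# Chosen response has the better per-token log-prob: lower loss.
simpo_better_chosen = simpo_loss(-20.0, 10, -60.0, 20)

print(simpo_equal_rate, simpo_better_chosen)
print(orpo_loss(-10.0, 10, -30.0, 10), orpo_loss(-20.0, 10, -30.0, 10))
```

Neither function takes a reference log-prob, which is the memory saving the paragraph describes.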

When DPO & family beat PPO, and when they don't

Pick DPO/IPO/KTO/ORPO/SimPO if…	Pick PPO (lesson 15) if…
You already have a fixed preference dataset and won't re-sample.	You can run rollouts during training and want fresh data.
Compute is tight (no RM, no rollout infra).	You can afford four models in memory (policy, reference, reward, value) + rollout engine + RM serving.
The reachable optimum lies inside the data distribution.	The good responses don't exist in your preference set; you need to discover them.
You want a reproducible, debuggable supervised-style loss.	You're willing to debug a multi-component training loop.
Your task is style, tone, formatting, helpfulness.	Your task is reasoning, math, code with a verifier (then use GRPO/DAPO instead).

A common surprise in practice

Teams often try DPO first because it's cheap, get a 60–70% solution, then hit a ceiling: the model has learned to imitate the chosen-vs-rejected distinction, but the absolute response quality plateaus. The fix is usually one of (a) iterative DPO, i.e. re-collect preference pairs from the new model and repeat, (b) switch to IPO to avoid the overconfidence trap, or (c) bite the bullet on a PPO/GRPO setup. Many published frontier-lab post-training pipelines include at least one on-policy RL stage.

Why this matters for reasoning RL

The reasoning-RL algorithms in Part II (GRPO, RLOO, DAPO, Dr.GRPO) are all on-policy like PPO. DPO and friends are an alternative track that works well when (1) the reward signal is a human preference, not a verifier, and (2) you have or can synthesize lots of static preference pairs. In modern post-training pipelines DPO shows up as one stage among several; Llama 3's post-training, for example, ran repeated rounds of SFT and DPO on freshly collected preference data. The two families are complementary, not competitive.

Takeaway

DPO is RLHF with the reward model folded into the policy and the on-policy sampling deleted. You save infrastructure but lose exploration. The IPO/KTO/ORPO/SimPO variants are small perturbations on the same idea, each trading away a different RLHF assumption.