DPO — preferences without a reward model

Five algebraic moves that erase the reward model and leave behind a single closed-form loss.

The gap SFT (and CoT-SFT) leaves

SFT and its CoT cousin both have the same shape: every example is a prompt x paired with one gold response y*, and the loss pushes log π_θ(y* | x) upward. That works as long as the world hands you a single canonical answer per prompt. The trouble is that most real human feedback does not look like that. It looks like relative judgments:

"y' was close but worse than y*" — a rater's near-miss, not wrong enough to throw away.
"any of {y_a, y_b, y_c} are fine but y_d is not" — a partial-order signal with multiple acceptable winners.
"y_w was rated higher than y_l" — the canonical pairwise comparison, the format most RLHF labeling pipelines actually produce.

SFT cannot consume any of these. A cross-entropy loss against a single target by construction treats every non-target token as "wrong" and every target as "right". It has no syntax for "A beats B and we don't care about the rest". You can try ugly hacks — pretend the winner is the gold and the loser doesn't exist — but you throw away the comparison information that made the pair valuable in the first place, and you ignore the strength of the rejection.

That gap is the reason a post-SFT stage exists at all. We want a training objective whose native input is the triple (x, y_w, y_l): a prompt, a preferred response, a rejected response. The history of how the field arrived at the modern answer is short, and it explains why DPO feels almost suspiciously clean.

The classical RLHF path — and why it hurts

The original recipe, dating to InstructGPT-era papers, takes two networks and a sampler. It is two steps:

1. Fit a reward model r_φ(x, y) on the preference dataset. The training objective is the Bradley–Terry likelihood: the probability that y_w beats y_l under a latent score is the logistic of their score difference,
P(y_w ≻ y_l | x) = σ(r_φ(x, y_w) − r_φ(x, y_l)).
Minimize the negative log-likelihood of the preference labels and you get a learned scalar function that scores any response.
2. Maximize expected reward under a KL anchor. With r_φ in hand, treat it as the reward signal and optimize the policy with PPO (or another on-policy RL method) against
max_θ 𝔼_{y ∼ π_θ(·|x)} [ r_φ(x, y) ] − β · KL( π_θ(·|x) ‖ π_ref(·|x) ).
The KL term to a frozen reference policy keeps π_θ from drifting into degenerate distributions that score high under r_φ but produce gibberish.
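
As a rough illustration of the reward-model objective in the first step, here is a minimal sketch of the Bradley–Terry negative log-likelihood on plain floats. The scores are made-up numbers standing in for r_φ outputs, and the helper names are hypothetical.

```python
import math

def log_sigmoid(u: float) -> float:
    """Numerically stable log(sigmoid(u))."""
    return min(u, 0.0) - math.log1p(math.exp(-abs(u)))

def bradley_terry_nll(score_w: float, score_l: float) -> float:
    """Negative log-likelihood of 'w beats l' under Bradley-Terry."""
    return -log_sigmoid(score_w - score_l)

# Hypothetical reward-model scores for three (chosen, rejected) pairs.
pairs = [(2.0, 0.5), (0.1, 0.0), (-1.0, 1.0)]
losses = [bradley_terry_nll(w, l) for w, l in pairs]
mean_nll = sum(losses) / len(losses)

for (w, l), loss in zip(pairs, losses):
    print(f"score_w={w:+.1f}  score_l={l:+.1f}  nll={loss:.4f}")
print(f"mean nll = {mean_nll:.4f}")
```

The pair the scorer ranks backwards (the third) carries the largest loss, which is the signal that reward-model training descends on.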

This works. It also has a long list of operational headaches:

Two networks. You train, evaluate, and ship a reward model alongside the policy. They have to agree on tokenization, special tokens, and chat template.
On-policy sampling. PPO needs fresh rollouts from the current policy at every update, which means you're running inference inside your training loop. The systems work alone is a project.
Moving value head. PPO's advantage estimator uses a value baseline that you also have to train, often with a separate head on the policy. Its loss can dominate gradients early in training.
Reward hacking. The reward model is a learned function with finite training data. The policy is a powerful optimizer that will find points in response space where r_φ is large but the actual response is bad. Classic failure modes: length hacking (longer outputs systematically score higher), sycophancy (agreeing with the user pleases the reward model), and broader reward overoptimization where validation performance falls as training reward climbs.

The motivating question

Can we get the benefit of preference-based training — a policy that ranks y_w above y_l — without standing up a reward model and an on-policy RL loop? The answer turns out to be yes, and the path is algebraic, not engineering.

DPO's central trick — erasing the reward model

The DPO paper observes that if you write down the KL-regularized RL objective above, its optimum has a closed form. That closed form contains all the information about the reward you would need. And when you plug that form back into the Bradley–Terry preference likelihood, the reward function literally cancels out, leaving a loss expressed only in terms of π_θ and π_ref. We can train on preferences without ever materializing r.

The derivation is five steps. Read it slowly the first time. Every move is elementary; the surprise is only that they combine.

(A) The KL-regularized objective

Drop the explicit expectation over x for a moment and look at the per-prompt problem. For a fixed x, we want to choose the distribution π(·|x) that maximizes

J(π) = 𝔼_{y ∼ π(·|x)} [ r(x, y) ] − β · KL( π(·|x) ‖ π_ref(·|x) ).

Think of it as a tug-of-war: the first term pulls π toward responses with high reward; the second term pulls it back toward π_ref. The strength of the pullback is β. With β = 0 there is no constraint and π collapses onto the highest-reward response (a point mass). With β → ∞ the constraint dominates and π is forced back to π_ref. Real β sits between.
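
A minimal sketch of this tug-of-war on a toy three-response problem, assuming a finite response set so the expectation and the KL are plain sums. The distributions, rewards, and the helper name kl_objective are illustrative and not taken from 03_dpo.py.

```python
import math

def kl_objective(pi, pi_ref, reward, beta):
    """J(pi) = E_pi[r] - beta * KL(pi || pi_ref) over a finite response set."""
    expected_reward = sum(p * r for p, r in zip(pi, reward))
    # Terms with p == 0 contribute 0 to the KL, so they are skipped.
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0.0)
    return expected_reward - beta * kl

pi_ref = [0.5, 0.3, 0.2]   # toy reference over three responses
reward = [0.0, 1.0, 2.0]   # toy rewards, highest for the last response
greedy = [0.0, 0.0, 1.0]   # point mass on the highest-reward response

j_ref_small_beta    = kl_objective(pi_ref, pi_ref, reward, beta=0.1)
j_greedy_small_beta = kl_objective(greedy, pi_ref, reward, beta=0.1)
j_ref_large_beta    = kl_objective(pi_ref, pi_ref, reward, beta=10.0)
j_greedy_large_beta = kl_objective(greedy, pi_ref, reward, beta=10.0)

print("beta=0.1 :", j_ref_small_beta, j_greedy_small_beta)
print("beta=10  :", j_ref_large_beta, j_greedy_large_beta)
```

With a small β the point mass scores higher than staying at the reference; with a large β the KL penalty outweighs the reward gain and the reference wins.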

(B) The optimum is a Gibbs distribution

For each x this is a variational problem: pick the function π(·|x) that maximizes J(π), subject to π being a probability distribution (sums to 1, non-negative). Expand the KL,

J(π) = Σ_y π(y|x) · r(x, y) − β · Σ_y π(y|x) · log(π(y|x) / π_ref(y|x)),

and add a Lagrange multiplier λ(x) for the normalization constraint Σ_y π(y|x) = 1:

ℒ(π, λ) = J(π) − λ(x) · ( Σ_y π(y|x) − 1 ).

Take the functional derivative with respect to π(y|x) for a fixed y, and set it to zero (first-order condition):

r(x, y) − β · ( log(π(y|x) / π_ref(y|x)) + 1 ) − λ(x) = 0.

Solve algebraically for log π(y|x):

log π(y|x) = log π_ref(y|x) + r(x, y) / β − 1 − λ(x) / β.

Exponentiate, and collect the x-only constants into a single normalizer Z(x) chosen so the result sums to 1:

π*(y|x) = (1 / Z(x)) · π_ref(y|x) · exp( r(x, y) / β ). (*)

This is the Gibbs distribution: take the reference, multiply each response's probability by its reward-exponential, then renormalize. The normalizer Z(x) = Σ_y π_ref(y|x) · exp(r(x, y) / β) depends only on the prompt, not the response. Hold that fact — it is the load-bearing part of the next step.
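
One way to sanity-check (*) numerically: build the Gibbs distribution on a toy three-response problem and confirm that no randomly drawn distribution scores a higher J. This is a sketch under the assumption of a finite response set; the numbers and helper names are hypothetical.

```python
import math
import random

def kl_objective(pi, pi_ref, reward, beta):
    """J(pi) = E_pi[r] - beta * KL(pi || pi_ref) over a finite response set."""
    expected_reward = sum(p * r for p, r in zip(pi, reward))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0.0)
    return expected_reward - beta * kl

def gibbs_optimum(pi_ref, reward, beta):
    """pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z, returned together with Z."""
    weights = [q * math.exp(r / beta) for q, r in zip(pi_ref, reward)]
    z = sum(weights)
    return [w / z for w in weights], z

pi_ref = [0.5, 0.3, 0.2]
reward = [0.0, 1.0, 2.0]
beta = 0.5

pi_star, z = gibbs_optimum(pi_ref, reward, beta)
j_star = kl_objective(pi_star, pi_ref, reward, beta)

# Compare against random points on the probability simplex.
rng = random.Random(0)
best_random = -math.inf
for _ in range(2000):
    raw = [rng.random() + 1e-9 for _ in pi_ref]
    total = sum(raw)
    candidate = [v / total for v in raw]
    best_random = max(best_random, kl_objective(candidate, pi_ref, reward, beta))

print("pi* =", [round(p, 4) for p in pi_star])
print("J(pi*) =", j_star, " best random J =", best_random)
```

Plugging (*) back into J also gives J(π*) = β · log Z(x), which the sketch reproduces up to floating-point error.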

(C) Invert for the reward

Take logs of (*) and solve for r:

r(x, y) = β · log( π*(y|x) / π_ref(y|x) ) + β · log Z(x).

Every reward function compatible with the KL-regularized optimum factors this way: a log-ratio of "the optimal policy versus the reference" scaled by β, plus an x-dependent constant. The constant is unavoidable — different rewards that differ only by a prompt-dependent additive shift produce the same optimal policy, because the Gibbs construction in (B) absorbs the shift into Z.
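
A small numeric check of both claims, again on a hypothetical three-response toy problem: the reward is recovered exactly from the log-ratio plus β · log Z, and shifting every reward by the same constant leaves the optimal policy untouched.

```python
import math

def gibbs_optimum(pi_ref, reward, beta):
    """pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z, returned together with Z."""
    weights = [q * math.exp(r / beta) for q, r in zip(pi_ref, reward)]
    z = sum(weights)
    return [w / z for w in weights], z

pi_ref = [0.5, 0.3, 0.2]
reward = [0.0, 1.0, 2.0]
beta = 0.5

pi_star, z = gibbs_optimum(pi_ref, reward, beta)

# Step (C): r(y) = beta * log(pi*(y) / pi_ref(y)) + beta * log Z
recovered = [beta * math.log(p / q) + beta * math.log(z)
             for p, q in zip(pi_star, pi_ref)]

# A prompt-level additive shift changes Z but not the optimal policy.
shifted_reward = [r + 7.0 for r in reward]
pi_star_shifted, z_shifted = gibbs_optimum(pi_ref, shifted_reward, beta)

print("recovered reward:", [round(r, 6) for r in recovered])
print("same policy after shift:",
      all(abs(a - b) < 1e-12 for a, b in zip(pi_star, pi_star_shifted)))
```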

(D) Substitute into Bradley–Terry — the cancellation

Now apply the Bradley–Terry likelihood from step 1 of the classical recipe. The probability that y_w beats y_l is the sigmoid of their reward difference:

P(y_w ≻ y_l | x) = σ( r(x, y_w) − r(x, y_l) ).

Substitute the form of r from (C) into both terms:

r(x, y_w) − r(x, y_l) = β · log( π*(y_w|x) / π_ref(y_w|x) ) + ~~β · log Z(x)~~
− β · log( π*(y_l|x) / π_ref(y_l|x) ) − ~~β · log Z(x)~~.

The two β · log Z(x) terms are identical — both depend only on the prompt x, which is fixed across the pair — so they cancel in the difference. This is the move that makes the whole thing work. We are computing a difference of rewards on the same prompt, and the only piece of the reward we cannot evaluate without knowing the full response distribution is the partition function, which depends only on the prompt and therefore appears identically on both sides.
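
To see the cancellation with numbers, the sketch below computes the Bradley–Terry probability twice on a toy problem: once from the true rewards, and once from log-ratios of the optimal policy with no log Z term anywhere in the formula. The setup is hypothetical and assumes a finite response set.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

pi_ref = [0.5, 0.3, 0.2]
reward = [0.0, 1.0, 2.0]
beta = 0.5

# Optimal policy from step (B).
weights = [q * math.exp(r / beta) for q, r in zip(pi_ref, reward)]
z = sum(weights)
pi_star = [w / z for w in weights]

def log_ratio(y):
    """log(pi*(y) / pi_ref(y)) for response index y."""
    return math.log(pi_star[y] / pi_ref[y])

w, l = 2, 0   # indices of the chosen and rejected responses

p_from_reward = sigmoid(reward[w] - reward[l])
p_from_policy = sigmoid(beta * log_ratio(w) - beta * log_ratio(l))

print(p_from_reward, p_from_policy)
```

The two probabilities agree to floating-point precision, even though the second expression never evaluates the partition function on its own.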

Putting the surviving pieces together:

P(y_w ≻ y_l | x) = σ( β · log(π*(y_w|x) / π_ref(y_w|x)) − β · log(π*(y_l|x) / π_ref(y_l|x)) ).

(E) Treat the policy as the optimum, minimize preference NLL

We have an expression for the preference probability in terms of π* and π_ref — no reward function in sight. The last move is a bit of intellectual sleight of hand: we don't have π* either, but we have π_θ, the policy we are training. Treat π_θ as our parameterization of π*, plug it in, and minimize the negative log-likelihood of the observed preferences:

L_DPO(θ) = − 𝔼_{(x, y_w, y_l)} log σ ( β · log(π_θ(y_w|x) / π_ref(y_w|x)) − β · log(π_θ(y_l|x) / π_ref(y_l|x)) ).

What we just gained

One closed-form loss. No reward model to train. No value head. No on-policy sampling. The reference model stays — it's frozen, and it's the KL anchor that made the closed form exist in the first place. Take it away and step (B) has no Gibbs solution; the whole derivation collapses.

The pieces and their jobs

Three objects appear in the loss. Each does exactly one thing.

Object	Role	Training state
π_θ	The policy being trained. Initialized from the SFT (or CoT-SFT) checkpoint.	updated
π_ref	Frozen copy of the SFT checkpoint. Acts as the KL anchor — defines "the policy's behavioral neighborhood" so updates can't drift to degenerate distributions that score preferences well while destroying everything else.	frozen
β	Inverse "temperature" on the implicit reward. Sets how tightly the policy is anchored to the reference.	hyperparameter

The behavior of β deserves a sentence. Larger β amplifies the implicit reward: the same log-ratio produces a larger reward magnitude, so the loss saturates at a smaller log-ratio gap and the trained policy stays closer to π_ref. This matches the underlying KL objective, where a larger β penalizes deviation from π_ref harder. Smaller β needs a larger log-ratio gap before the loss saturates, so the policy drifts further from the reference to satisfy the same preferences. Typical values sit between 0.1 and 0.5; the 03_dpo.py default is 0.1.
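
A quick way to get a feel for β is to look at two per-pair quantities: how large a log-ratio gap is needed before σ(β · gap) reaches a given confidence, and how large the gradient with respect to that gap is at the start of training. The sketch below is illustrative; the 90% threshold and the helper names are arbitrary choices.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def gap_for_confidence(beta, target=0.9):
    """Log-ratio gap at which sigmoid(beta * gap) reaches `target`."""
    return math.log(target / (1.0 - target)) / beta

def grad_wrt_gap(beta, gap):
    """Magnitude of d(-log sigmoid(beta * gap)) / d(gap)."""
    return beta * sigmoid(-beta * gap)

for beta in (0.05, 0.1, 0.5):
    print(f"beta={beta:<5} gap for 90% = {gap_for_confidence(beta):6.2f}   "
          f"grad at gap 0 = {grad_wrt_gap(beta, 0.0):.3f}")
```

Smaller β needs a much larger gap in log-ratios to reach the same preference probability, which is where the extra drift from the reference comes from.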

The implicit reward

Even though we never train a reward model, the algebra above tells us what the policy is implicitly treating as the reward at any moment of training:

r̂(x, y) = β · log( π_θ(y | x) / π_ref(y | x) ).

With this notation the DPO loss is exactly

L_DPO(θ) = − 𝔼 log σ( r̂(x, y_w) − r̂(x, y_l) ).

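The quantity r̂_w − r̂_l is the margin. It starts near zero (because π_θ begins as a copy of π_ref, so all log-ratios are zero) and should grow positive as training proceeds. It is the single most useful diagnostic you can log during a DPO run: if it never grows, suspect degenerate data (for example, chosen and rejected responses that are identical), a β or learning rate that is too small, or a bug in the masking or log-prob computation. A reference that already ranks the pairs correctly is not an excuse: the margin is measured relative to π_ref, so it still starts at zero and still has room to grow.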

The fraction of the batch where r̂_w > r̂_l, the preference accuracy, climbs toward 1.0 as the policy learns to rank pairs correctly. In 03_dpo.py we log both every 100 steps.
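
Both diagnostics are a few lines of arithmetic on the per-pair implicit rewards. The following is a sketch on plain Python lists with made-up reward values; it is not the logging code from 03_dpo.py, and preference_metrics is a hypothetical helper name.

```python
def preference_metrics(r_w, r_l):
    """Mean margin and preference accuracy from per-pair implicit rewards."""
    margins = [w - l for w, l in zip(r_w, r_l)]
    mean_margin = sum(margins) / len(margins)
    accuracy = sum(m > 0 for m in margins) / len(margins)
    return mean_margin, accuracy

# Hypothetical implicit rewards for a batch of four pairs.
r_w = [0.30, 0.05, -0.10, 0.20]
r_l = [-0.10, 0.10, -0.40, 0.00]

mean_margin, accuracy = preference_metrics(r_w, r_l)
print(f"margin={mean_margin:+.4f}  pref-acc={accuracy:.2f}")
```

Note that the third pair counts as correct even though both rewards are negative: only the sign of the difference matters.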

Interactive · DPO margin simulator

One preference pair, one knob for β, and a step button. Each click applies a single gradient update — the policy nudges log π_θ(y_w) up and log π_θ(y_l) down by an amount proportional to σ(−margin), which is the gradient of −log σ(r̂_w − r̂_l) with respect to the log-ratios. Watch the implicit reward bars spread and the margin curve grow. Slide β first to see how it scales the per-step change in margin: larger β amplifies each log-ratio nudge into a bigger reward swing.

DPO update on one pair (chosen vs. rejected)

π_θ starts equal to π_ref → both log-ratios are 0 → margin is 0. Each "Step" performs one preference-NLL gradient update on the log-ratios. The margin curve is r̂_w − r̂_l over steps.

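[Interactive widget. Controls: Step button, β slider (0.10). Readouts: margin r̂_w − r̂_l (+0.000) and pref-acc. Bars: implicit reward r̂ = β · (log π_θ − log π_ref) for y_w (chosen) and y_l (rejected), both +0.000 at the start. Plot: margin over steps, x = step, y = r̂_w − r̂_l. Defaults: β = 0.10, η = 0.20.]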

Why the update has this shape

The loss is L = − log σ(r̂_w − r̂_l). Using d/du [−log σ(u)] = −σ(−u), we get dL / d(r̂_w − r̂_l) = −σ(−(r̂_w − r̂_l)). Since r̂ = β · Δlogp, the gradient pushes log π_θ(y_w) up and log π_θ(y_l) down, each by η · β · σ(−margin). When the margin is large and positive (we've already won), σ(−margin) ≈ 0 and updates shrink — the loss saturates.
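
The same update rule can be simulated in a few lines. This sketch treats the two log-ratios as free parameters, which is a simplification: in a real policy they are coupled through shared weights. The defaults β = 0.10 and η = 0.20 follow the simulator above; simulate_pair is a hypothetical name.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def simulate_pair(beta=0.10, eta=0.20, steps=50):
    """Gradient descent on -log sigmoid(margin) over one pair's log-ratios."""
    log_ratio_w = 0.0   # log pi_theta(y_w) - log pi_ref(y_w); zero at init
    log_ratio_l = 0.0   # log pi_theta(y_l) - log pi_ref(y_l); zero at init
    margins = []
    for _ in range(steps):
        margin = beta * (log_ratio_w - log_ratio_l)
        margins.append(margin)
        step = eta * beta * sigmoid(-margin)   # shrinks as the margin grows
        log_ratio_w += step                    # push the chosen response up
        log_ratio_l -= step                    # push the rejected response down
    return margins

margins = simulate_pair()
print("margin after 1 step  :", margins[1])
print("margin after 49 steps:", margins[-1])
print("with beta=0.5        :", simulate_pair(beta=0.5)[-1])
```

Each step adds 2 · η · β² · σ(−margin) to the margin, so the curve rises quickly at first and flattens as σ(−margin) decays.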

Computing log π(y | x) in practice

The loss reads as if it needs sequence-level probabilities. We compute them by running a single forward pass on the concatenated sequence [x; y], taking log_softmax, and gathering the log-probability assigned to the actual target token at each response position. The response mask is the same as in SFT: 1 on response tokens, 0 on prompt and pad.

import torch.nn.functional as F

def gather_logprobs(model, x, response_mask):
    """Sum of log pi(y_t | x, y_<t) over response tokens; returns shape (B,)."""
    logits, _ = model(x)              # (B, T, V); the model returns (logits, extra)
    logits = logits[:, :-1, :]        # predictions made at positions 0..T-2
    targets = x[:, 1:]                # actual next tokens, positions 1..T-1
    mask    = response_mask[:, 1:]    # shifted to align with targets; 1 = response token
    logp    = F.log_softmax(logits, dim=-1)
    per_tok = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)   # (B, T-1)
    return (per_tok * mask).sum(dim=-1)   # (B,): sum-log-prob of y given x

The shift-by-one convention is identical to every other stage in this pipeline: predictions at position t are scored against the token at position t+1. Masking matters for two reasons. First, summing log-probs over the entire sequence computes log π(x, y) rather than log π(y|x); the extra log π(x) term is the same for chosen and rejected (same prompt), so it cancels in the margin and contributes nothing useful. Second, once responses have different lengths, padding positions differ between chosen and rejected, and any unmasked pad tokens add spurious log-probability that shows up as a length bias.
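
The alignment is easy to get wrong by one position, so here is a tensor-free walk-through on a single toy sequence. The tokens and log-prob values are invented for illustration; only the indexing mirrors gather_logprobs.

```python
# Toy sequence: 3 prompt tokens, 2 response tokens, 1 pad token.
tokens        = ["<bos>", "2+2", "=", "4", "<eos>", "<pad>"]
response_mask = [0,       0,     0,   1,   1,       0]

# Hypothetical log-probs the model assigns to the *next* token at each
# position 0..T-2, so next_logp[t] scores tokens[t + 1].
next_logp = [-3.0, -1.5, -0.2, -0.7, -9.0]

targets = tokens[1:]          # what each position is asked to predict
mask    = response_mask[1:]   # shifted so mask[t] flags targets[t]

scored = [(tok, lp) for tok, lp, m in zip(targets, next_logp, mask) if m]
seq_logp = sum(lp for _, lp in scored)
unmasked = sum(next_logp)

print("scored tokens  :", scored)
print("log pi(y | x)  :", seq_logp)
print("unmasked total :", unmasked)
```

Only the two response tokens are scored; the prompt predictions and the pad prediction are dropped, and the unmasked total shows how much they would otherwise distort the sum.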

Run this function four times per batch: π_θ on y_w, π_θ on y_l, π_ref on y_w, π_ref on y_l. The two π_ref calls sit inside a torch.no_grad() block: the reference is frozen, so no backward pass through it is needed. Pass the four (B,)-shaped tensors to dpo_loss and you're done.

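def dpo_loss(policy_lp_w, policy_lp_l, ref_lp_w, ref_lp_l, beta):
    # Implicit rewards: beta times the policy-vs-reference log-ratio, shape (B,)
    r_w  = beta * (policy_lp_w - ref_lp_w)
    r_l  = beta * (policy_lp_l - ref_lp_l)
    # Preference NLL on the margin; F is torch.nn.functional
    loss = -F.logsigmoid(r_w - r_l).mean()
    return loss, r_w, r_l   # rewards are returned for logging margin and accuracy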
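
For a feel of the magnitudes involved, here is the same arithmetic on one pair with plain floats. The four sequence log-probs are made-up values, and pair_dpo_loss is a hypothetical scalar stand-in for the batched function above.

```python
import math

def pair_dpo_loss(policy_lp_w, policy_lp_l, ref_lp_w, ref_lp_l, beta):
    """Scalar version of the DPO loss for a single preference pair."""
    r_w = beta * (policy_lp_w - ref_lp_w)
    r_l = beta * (policy_lp_l - ref_lp_l)
    margin = r_w - r_l
    # -log sigmoid(m) = log(1 + exp(-m)); fine for the small margins used here
    loss = math.log1p(math.exp(-margin))
    return loss, r_w, r_l

# The policy has raised y_w by 1 nat and lowered y_l by 1 nat versus the reference.
loss, r_w, r_l = pair_dpo_loss(
    policy_lp_w=-12.0, policy_lp_l=-15.0,
    ref_lp_w=-13.0,    ref_lp_l=-14.0,
    beta=0.1,
)
print(f"r_w={r_w:+.2f}  r_l={r_l:+.2f}  margin={r_w - r_l:+.2f}  loss={loss:.4f}")
```

At initialization the margin is 0 and the loss is log 2 ≈ 0.693, so a loss below that value means the policy is, on average, ranking pairs better than the reference does.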

Why DPO actually works — gradient intuition

Take the gradient of the loss with respect to the parameters. The chain rule gives

∇_θ L_DPO = −σ(−(r̂_w − r̂_l)) · β · ( ∇_θ log π_θ(y_w|x) − ∇_θ log π_θ(y_l|x) ).

Read this slowly. The bracketed term is "push the log-prob of y_w up, push the log-prob of y_l down". The leading coefficient σ(−margin) measures how wrong the preference is right now:

When the policy ranks the pair badly backwards (r̂_l ≫ r̂_w, margin large and negative), σ(−margin) ≈ 1 and the update is at full strength. At zero margin, the starting point, the coefficient is 0.5.
When the policy has already learned the preference (r̂_w ≫ r̂_l, margin large positive), σ(−margin) ≈ 0 and the update fades. The loss saturates — pairs the policy already understands stop contributing gradient.

That saturation is a feature: it makes the loss self-balancing across a dataset where some pairs are easy and others are hard. The hard pairs get the gradient. Compare to vanilla cross-entropy in SFT, which keeps pushing log-probabilities even when the target is already the argmax.
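
The gradient formula can be checked against finite differences on a tiny policy. The sketch below assumes a softmax over three whole responses, a stand-in for a real autoregressive model, for which ∇ log π(y) = onehot(y) − π. All numbers and the name toy_dpo_loss are hypothetical.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def toy_dpo_loss(theta, theta_ref, w, l, beta):
    """DPO loss for one pair when the policy is a softmax over responses."""
    lp, lp_ref = log_softmax(theta), log_softmax(theta_ref)
    margin = beta * ((lp[w] - lp_ref[w]) - (lp[l] - lp_ref[l]))
    return -math.log(sigmoid(margin)), margin

theta_ref = [0.2, -0.1, 0.4]
theta     = [0.5, -0.3, 0.1]   # policy has drifted from the reference
w, l, beta = 0, 2, 0.1

loss, margin = toy_dpo_loss(theta, theta_ref, w, l, beta)

# Analytic gradient: the pi terms cancel in grad log pi(w) - grad log pi(l),
# leaving onehot(w) - onehot(l).
coeff = -sigmoid(-margin) * beta
analytic = [coeff * ((i == w) - (i == l)) for i in range(len(theta))]

# Central finite differences on each logit.
eps = 1e-6
numeric = []
for i in range(len(theta)):
    up   = [t + eps * (j == i) for j, t in enumerate(theta)]
    down = [t - eps * (j == i) for j, t in enumerate(theta)]
    numeric.append((toy_dpo_loss(up, theta_ref, w, l, beta)[0]
                    - toy_dpo_loss(down, theta_ref, w, l, beta)[0]) / (2 * eps))

print("analytic:", analytic)
print("numeric :", numeric)
```

The gradient is negative on the chosen logit and positive on the rejected one, so a descent step raises y_w and lowers y_l; the logit of the uninvolved response gets zero gradient in this toy parameterization.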

The role of the β · log Z(x) cancellation now reads as: we are doing this push/pull on log-probabilities without needing to know the per-prompt normalization. Z(x) sums over every possible response, which is intractable for an autoregressive model, so the cancellation is what makes the loss computable at all.

What DPO cannot do — the gap RLVR closes

DPO is a beautiful tool, but it has structural limits that come straight from its inputs.

It consumes static pair datasets. The training signal is whatever pairs you wrote down before training started. The policy cannot improve beyond what's in the pair distribution, because it never sees its own samples scored.
It cannot discover new response styles by exploration. DPO re-ranks within the support of π_ref. If the reference assigns near-zero probability to a hypothetically correct response, log π_ref(y|x) is hugely negative, the log-ratio is poorly conditioned, and DPO has nothing stable to grab onto. In contrast, an on-policy method can sample, succeed, and reinforce.
For verifiable tasks, preference labels are wasteful. If you have a ground-truth checker — does the program pass the tests? does the math add up? — converting that to pairs and labeling them is throwing away signal. You want to score every rollout directly with the verifier, which is exactly the RLVR setting.

Lesson 6 picks up here. We keep the KL anchor and the frozen reference; we drop the static pairs and replace them with policy rollouts scored by a verifier.

Variants — the DPO family

Once the core trick was published, a small zoo of variants followed. You don't need to know any of them in depth; just know they exist and what shape problem each addresses.

Variant	One-line idea
IPO	Replaces the log-sigmoid with a squared-error loss that regresses the log-ratio margin toward a finite target, so confidently separated pairs do not push margins to infinity.
KTO	Trained on point labels (good / bad) instead of pairs. Useful when raters give thumbs-up / thumbs-down rather than A-vs-B.
ORPO	Drops the reference model entirely and folds the preference signal into a single combined SFT+preference loss.
SimPO	No reference; length-normalized log-likelihoods. Cheaper at training time (no reference forward passes, no reference weights in memory) at the cost of the formal derivation.

Practical gotchas from `03_dpo.py`

Four things the toy file gets right that beginners often get wrong

Use F.logsigmoid. Computing log(sigmoid(.)) directly is numerically unstable on both tails: for large negative arguments the sigmoid underflows to 0 and the log returns −inf; for large positive arguments the sigmoid rounds to 1 and the log returns exactly 0, losing the small residual. F.logsigmoid is the stable primitive that handles both.
Drop weight decay during DPO. SFT uses weight_decay=0.1. DPO uses weight_decay=0.0. The KL anchor to π_ref already pulls the policy toward a non-zero reference; weight decay adds a second regularizer that pulls the weights toward zero, and therefore away from the reference rather than keeping the policy near it.
Use a smaller LR than SFT. The toy uses lr=5e-5 for DPO versus lr=3e-4 for SFT. Log-ratios are higher-variance than raw cross-entropy losses — they amplify whatever noise lives in the model's token-level probabilities — so the same effective gradient asks for a smaller step. The common wisdom is 5–10× smaller than the SFT learning rate.
Same-length pairs in the toy. The toy constructs y_w and y_l to have identical lengths so batches stack without padding. Real DPO needs proper padding plus per-token response masks, and you should also watch for length bias — chosen responses being systematically longer or shorter than rejected ones can confound the signal. The token mask in gather_logprobs is exactly the lever to handle padding cleanly.
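
The first gotcha is easy to reproduce without a tensor library. The sketch below compares a naive log(sigmoid(u)) with a stable rewrite on plain Python floats. One difference from the tensor case is worth flagging: here the naive version raises on very negative inputs because math.exp overflows, whereas the same formula on float tensors would silently return -inf. The function names are hypothetical.

```python
import math

def naive_log_sigmoid(u):
    return math.log(1.0 / (1.0 + math.exp(-u)))

def stable_log_sigmoid(u):
    # log sigmoid(u) = min(u, 0) - log(1 + exp(-|u|)); exp never overflows here
    return min(u, 0.0) - math.log1p(math.exp(-abs(u)))

for u in (-800.0, -40.0, 0.0, 40.0):
    try:
        naive = naive_log_sigmoid(u)
    except (OverflowError, ValueError) as err:
        naive = f"fails ({type(err).__name__})"
    print(f"u={u:+7.1f}  naive={naive}  stable={stable_log_sigmoid(u)}")
```

On the positive tail the naive version returns exactly 0.0, while the stable version keeps the tiny negative residual that the gradient depends on.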

Stage 3 in the pipeline

Where we are: the policy can follow instructions (SFT), reason out loud (CoT), and now rank responses according to a frozen preference dataset (DPO). What it still cannot do is improve through interaction with a verifier on its own samples. That is the move lesson 6 makes.

PRETRAIN → SFT → CoT → DPO → RLVR

The same algebra you just walked through reappears in RLVR / GRPO: the KL-regularized objective from step (A), the frozen π_ref, the per-token log-ratios. The difference is the source of the reward signal. In DPO, the reward is the implicit log-ratio fit through preference pairs. In RLVR, the reward is a programmatic verifier evaluated on policy rollouts. Same KL anchor, different question.

Takeaway

DPO collapses the two-step RLHF recipe (reward model, then PPO) into one closed-form loss by exploiting that the KL-regularized optimum is a Gibbs distribution whose partition function cancels across Bradley–Terry pair differences. The reference model is the only non-policy network that survives — frozen, used to compute log-ratios. No reward model. No value head. No on-policy sampling.