DPO — preferences without a reward model

Five algebraic moves that erase the reward model and leave behind a single closed-form loss.

The gap SFT (and CoT-SFT) leaves

SFT and its CoT cousin both have the same shape: every example is a prompt x paired with one gold response y*, and the loss pushes log πθ(y* | x) upward. That works as long as the world hands you a single canonical answer per prompt. The trouble is that most real human feedback does not look like that. It looks like relative judgments:

SFT cannot consume any of these. A cross-entropy loss against a single target by construction treats every non-target token as "wrong" and every target as "right". It has no syntax for "A beats B and we don't care about the rest". You can try ugly hacks — pretend the winner is the gold and the loser doesn't exist — but you throw away the comparison information that made the pair valuable in the first place, and you ignore the strength of the rejection.

That gap is the reason a post-SFT stage exists at all. We want a training objective whose native input is the triple (x, yw, yl): a prompt, a preferred response, a rejected response. The history of how the field arrived at the modern answer is short, and it explains why DPO feels almost suspiciously clean.

The classical RLHF path — and why it hurts

The original recipe, dating to InstructGPT-era papers, takes two networks and a sampler. It is two steps:

  1. Fit a reward model rφ(x, y) on the preference dataset. The training objective is the Bradley–Terry likelihood: the probability that yw beats yl under a latent score is the logistic of their score difference,
    P(yw ≻ yl | x)  =  σ(rφ(x, yw) − rφ(x, yl)).
    Minimize the negative log-likelihood of the preference labels and you get a learned scalar function that scores any response.
  2. Maximize expected reward under a KL anchor. With rφ in hand, treat it as the reward signal and optimize the policy with PPO (or another on-policy RL method) against
    maxθ   𝔼y ∼ πθ(·|x) [ rφ(x, y) ]  −  β · KL( πθ(·|x) ‖ πref(·|x) ).
    The KL term to a frozen reference policy keeps πθ from drifting into degenerate distributions that score high under rφ but produce gibberish.
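As a rough illustration of step 1, the Bradley–Terry negative log-likelihood for a single pair can be sketched in plain Python. The function names and the example scores are hypothetical; a real reward model would produce the two scores from (x, yw) and (x, yl).

```python
import math

def log_sigmoid(u: float) -> float:
    """Numerically stable log(sigmoid(u))."""
    # log sigma(u) = -log(1 + e^{-u}); branch so exp never overflows.
    if u >= 0:
        return -math.log1p(math.exp(-u))
    return u - math.log1p(math.exp(u))

def bradley_terry_nll(score_w: float, score_l: float) -> float:
    """-log P(y_w beats y_l) under Bradley-Terry with scalar scores."""
    return -log_sigmoid(score_w - score_l)

# Hypothetical reward-model scores for one pair.
tied = bradley_terry_nll(1.0, 1.0)     # equal scores, so P = 0.5
right = bradley_terry_nll(2.0, -1.0)   # preferred response scored higher
wrong = bradley_terry_nll(-1.0, 2.0)   # preferred response scored lower
```

With equal scores the loss is log 2; it falls below that when the preferred response is scored higher and rises above it when the ranking is inverted.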

This works. It also has a long list of operational headaches:

The motivating question
Can we get the benefit of preference-based training — a policy that ranks yw above yl — without standing up a reward model and an on-policy RL loop? The answer turns out to be yes, and the path is algebraic, not engineering.

DPO's central trick — erasing the reward model

The DPO paper observes that if you write down the KL-regularized RL objective above, its optimum has a closed form. That closed form contains all the information about the reward you would need. And when you plug that form back into the Bradley–Terry preference likelihood, the reward function literally cancels out, leaving a loss expressed only in terms of πθ and πref. We can train on preferences without ever materializing r.

The derivation is five steps. Read it slowly the first time. Every move is elementary; the surprise is only that they combine.

(A) The KL-regularized objective

Drop the explicit expectation over x for a moment and look at the per-prompt problem. For a fixed x, we want to choose the distribution π(·|x) that maximizes

J(π)  =  𝔼y ∼ π(·|x) [ r(x, y) ]  −  β · KL( π(·|x) ‖ πref(·|x) ).

Think of it as a tug-of-war: the first term pulls π toward responses with high reward; the second term pulls it back toward πref. The strength of the pullback is β. With β = 0 there is no constraint and π collapses onto the highest-reward response (a point mass). With β → ∞ the constraint dominates and π is forced back to πref. Real β sits between.
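The tug-of-war can be made concrete on a toy problem. The sketch below assumes a finite set of three responses with made-up reference probabilities and rewards, and compares J for the reference itself against a near point mass on the best response.

```python
import math

def objective(pi, ref, rewards, beta):
    """J(pi) = E_pi[r] - beta * KL(pi || ref) over a finite response set."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, ref) if p > 0)
    return expected_reward - beta * kl

# Toy setup (all numbers are made up): three candidate responses.
ref = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
greedy = [0.001, 0.001, 0.998]   # nearly a point mass on the best response

for beta in (0.01, 1.0, 100.0):
    j_ref = objective(ref, ref, rewards, beta)
    j_greedy = objective(greedy, ref, rewards, beta)
    print(f"beta={beta:>6}: J(ref)={j_ref:.3f}  J(greedy)={j_greedy:.3f}")
```

For the smallest β the near point mass scores higher; for the largest β the KL penalty dominates and staying at the reference wins.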

(B) The optimum is a Gibbs distribution

For each x this is a variational problem: pick the function π(·|x) that maximizes J(π), subject to π being a probability distribution (sums to 1, non-negative). Expand the KL,

J(π)  =  Σy π(y|x) · r(x, y)  −  β · Σy π(y|x) · log(π(y|x) / πref(y|x)),

and add a Lagrange multiplier λ(x) for the normalization constraint Σy π(y|x) = 1:

ℒ  =  Σy π(y|x) · r(x, y)  −  β · Σy π(y|x) · log(π(y|x) / πref(y|x))  −  λ(x) · (Σy π(y|x) − 1).

Take the functional derivative with respect to π(y|x) for a fixed y, and set it to zero (first-order condition):

r(x, y)  −  β · ( log(π(y|x) / πref(y|x)) + 1 )  −  λ(x)  =  0.

Solve algebraically for log π(y|x):

log π(y|x)  =  log πref(y|x)  +  r(x, y) / β  −  1  −  λ(x) / β.

Exponentiate, and collect the x-only constants into a single normalizer Z(x) chosen so the result sums to 1:

π*(y|x)  =  (1 / Z(x)) · πref(y|x) · exp( r(x, y) / β ).     (*)

This is the Gibbs distribution: take the reference, multiply each response's probability by its reward-exponential, then renormalize. The normalizer Z(x) = Σy πref(y|x) · exp(r(x, y) / β) depends only on the prompt, not the response. Hold that fact — it is the load-bearing part of the next step.
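A quick numerical sanity check of (*) is possible on a toy problem. The sketch below assumes three responses with made-up reference probabilities and rewards, builds the Gibbs policy, and checks that shifting a little probability mass between any two responses does not increase J.

```python
import math

def gibbs_policy(ref, rewards, beta):
    """pi*(y) = ref(y) * exp(r(y)/beta) / Z, with Z the sum of the weights."""
    weights = [q * math.exp(r / beta) for q, r in zip(ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z

def objective(pi, ref, rewards, beta):
    """J(pi) = E_pi[r] - beta * KL(pi || ref)."""
    return sum(p * r - beta * p * math.log(p / q)
               for p, q, r in zip(pi, ref, rewards) if p > 0)

ref = [0.5, 0.3, 0.2]          # made-up reference probabilities
rewards = [0.0, 1.0, 2.0]      # made-up rewards
beta = 1.0

pi_star, z = gibbs_policy(ref, rewards, beta)
j_star = objective(pi_star, ref, rewards, beta)

# Moving a little mass between any two responses should not improve J.
eps = 1e-3
perturbed_scores = []
for i in range(3):
    for j in range(3):
        if i != j:
            pi = list(pi_star)
            pi[i] += eps
            pi[j] -= eps
            perturbed_scores.append(objective(pi, ref, rewards, beta))
```

Every entry of perturbed_scores comes out below j_star, which is what the first-order condition predicts for a strictly concave objective.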

Figure: π*(y|x) = πref(y|x) · exp(r(x, y) / β) / Z(x), shown for three responses y₁, y₂, y₃. The optimal policy tilts the reference toward high-reward responses; Z renormalizes the result.

(C) Invert for the reward

Take logs of (*) and solve for r:

r(x, y)  =  β · log( π*(y|x) / πref(y|x) )  +  β · log Z(x).

Every reward function compatible with the KL-regularized optimum factors this way: a log-ratio of "the optimal policy versus the reference" scaled by β, plus an x-dependent constant. The constant is unavoidable — different rewards that differ only by a prompt-dependent additive shift produce the same optimal policy, because the Gibbs construction in (B) absorbs the shift into Z.
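Both claims in this step can be checked numerically. The sketch below reuses a toy three-response setup (all numbers made up): it recovers the reward from the optimal policy, then shows that a constant shift of the reward changes Z but leaves the policy untouched.

```python
import math

def gibbs_policy(ref, rewards, beta):
    """Return (pi*, Z) for a finite response set."""
    weights = [q * math.exp(r / beta) for q, r in zip(ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z

ref = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 0.5

pi_star, z = gibbs_policy(ref, rewards, beta)

# Step (C): r(y) = beta * log(pi*(y) / ref(y)) + beta * log Z
recovered = [beta * math.log(p / q) + beta * math.log(z)
             for p, q in zip(pi_star, ref)]

# Adding a constant to every reward (a prompt-only shift) changes Z, not pi*.
shift = 7.0
shifted = [r + shift for r in rewards]
pi_shifted, z_shifted = gibbs_policy(ref, shifted, beta)
```

Here recovered matches rewards up to floating-point error, and pi_shifted matches pi_star even though z_shifted is much larger than z.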

(D) Substitute into Bradley–Terry — the cancellation

Now apply the Bradley–Terry likelihood from step 1 of the classical recipe. The probability that yw beats yl is the sigmoid of their reward difference:

P(yw ≻ yl | x)  =  σ( r(x, yw) − r(x, yl) ).

Substitute the form of r from (C) into both terms:

r(x, yw) − r(x, yl)  =  β · log( π*(yw|x) / πref(yw|x) )  +  β · log Z(x)
                         − β · log( π*(yl|x) / πref(yl|x) )  −  β · log Z(x).

The two β · log Z(x) terms are identical — both depend only on the prompt x, which is fixed across the pair — so they cancel in the difference. This is the move that makes the whole thing work. We are computing a difference of rewards on the same prompt, and the only piece of the reward we cannot evaluate without knowing the full response distribution is the partition function, which depends only on the prompt and therefore appears identically on both sides.

Figure: the cancellation in r(x, yw) − r(x, yl). log Z(x) depends only on x, so the identical β · log Z(x) terms cancel across the pair. What survives is a difference of log-ratios: no reward, no partition function.

Putting the surviving pieces together:

P(yw ≻ yl | x)  =  σ (  β · log(π*(yw|x) / πref(yw|x))  −  β · log(π*(yl|x) / πref(yl|x))  ).
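The cancellation is easy to verify numerically. In this sketch (toy numbers, three responses, indices chosen arbitrarily) the Bradley–Terry probability is computed once from the true rewards and once from log-ratios of the Gibbs policy alone.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

ref = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 0.5
w, l = 2, 0   # indices of the preferred and rejected responses (arbitrary choice)

# Build the optimal policy from step (B).
weights = [q * math.exp(r / beta) for q, r in zip(ref, rewards)]
z = sum(weights)
pi_star = [wt / z for wt in weights]

def log_ratio(y):
    """log(pi*(y) / ref(y)) for response index y."""
    return math.log(pi_star[y] / ref[y])

# Bradley-Terry using the true rewards.
p_from_rewards = sigmoid(rewards[w] - rewards[l])

# The same probability using only log-ratios: no reward, no Z.
p_from_ratios = sigmoid(beta * log_ratio(w) - beta * log_ratio(l))
```

The two probabilities agree to floating-point precision, even though the second expression never touches rewards or z.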

(E) Treat the policy as the optimum, minimize preference NLL

We have an expression for the preference probability in terms of π* and πref — no reward function in sight. The last move is a bit of intellectual sleight of hand: we don't have π* either, but we have πθ, the policy we are training. Treat πθ as our parameterization of π*, plug it in, and minimize the negative log-likelihood of the observed preferences:

LDPO(θ)  =  − 𝔼(x, yw, yl)   log σ (  β · log(πθ(yw|x) / πref(yw|x))  −  β · log(πθ(yl|x) / πref(yl|x))  ).
What we just gained
One closed-form loss. No reward model to train. No value head. No on-policy sampling. The reference model stays — it's frozen, and it's the KL anchor that made the closed form exist in the first place. Take it away and step (B) has no Gibbs solution; the whole derivation collapses.

The pieces and their jobs

Three objects appear in the loss. Each does exactly one thing.

Object | Role | Training state
------ | ---- | --------------
πθ | The policy being trained. Initialized from the SFT (or CoT-SFT) checkpoint. | updated
πref | Frozen copy of the SFT checkpoint. Acts as the KL anchor: it defines "the policy's behavioral neighborhood" so updates can't drift to degenerate distributions that score preferences well while destroying everything else. | frozen
β | Inverse "temperature" on the implicit reward. Sets how tightly the policy is anchored to the reference. | hyperparameter

The behavior of β deserves a sentence. Larger β amplifies the implicit reward: the same log-ratio produces a larger reward magnitude, so a pair reaches a comfortable margin after a smaller departure from πref. Equivalently, deviation from πref is penalized harder in the underlying KL objective, and the trained policy ends up closer to the reference. Smaller β weakens the anchor: the preference signal dominates, larger log-ratio gaps are needed to reach the same margin, and the policy drifts further from the reference. A commonly used range is 0.1 to 0.5; the 03_dpo.py default is 0.1.
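To see the effect of β in numbers, the sketch below fixes a gap between the chosen and rejected log-ratios and reports the resulting margin, loss, and gradient magnitude for two values of β. The helper name and the gap values are illustrative only.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def margin_and_pull(beta, log_ratio_gap):
    """Return (margin, gradient magnitude with respect to the log-ratio gap)."""
    margin = beta * log_ratio_gap
    pull = beta * sigmoid(-margin)
    return margin, pull

for beta in (0.1, 0.5):
    for gap in (0.0, 5.0, 20.0):
        m, pull = margin_and_pull(beta, gap)
        loss = -math.log(sigmoid(m))
        print(f"beta={beta}  gap={gap:>4}  margin={m:5.2f}  loss={loss:.3f}  pull={pull:.4f}")
```

At the same log-ratio gap, the larger β yields the larger margin, and its gradient has decayed much further relative to its starting value, so the policy stops moving earlier.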

The implicit reward

Even though we never train a reward model, the algebra above tells us what the policy is implicitly treating as the reward at any moment of training:

r̂(x, y)  =  β · log( πθ(y | x) / πref(y | x) ).

With this notation the DPO loss is exactly

LDPO(θ)  =  − 𝔼   log σ( r̂(x, yw) − r̂(x, yl) ).

The quantity r̂w − r̂l is the margin. It starts near zero (because πθ begins as a copy of πref, so all log-ratios are zero) and should grow positive as training proceeds. It is the single most useful diagnostic you can log during a DPO run: if it never grows, the data is degenerate, β is too small, or your πref already separates the pairs perfectly.

The fraction of the batch where r̂w > r̂l, the preference accuracy, climbs toward 1.0 as the policy learns to rank pairs correctly. In 03_dpo.py we log both every 100 steps.
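As a rough sketch of these two diagnostics, the snippet below computes the mean margin and the preference accuracy from lists of sequence log-probabilities. The function names and the batch values are made up for illustration and are not taken from 03_dpo.py.

```python
def implicit_rewards(policy_lp, ref_lp, beta):
    """r_hat = beta * (log pi_theta - log pi_ref), one value per example."""
    return [beta * (p - r) for p, r in zip(policy_lp, ref_lp)]

def margin_stats(policy_lp_w, policy_lp_l, ref_lp_w, ref_lp_l, beta):
    """Return (mean margin, fraction of pairs with a positive margin)."""
    r_w = implicit_rewards(policy_lp_w, ref_lp_w, beta)
    r_l = implicit_rewards(policy_lp_l, ref_lp_l, beta)
    margins = [a - b for a, b in zip(r_w, r_l)]
    mean_margin = sum(margins) / len(margins)
    pref_acc = sum(m > 0 for m in margins) / len(margins)
    return mean_margin, pref_acc

# Made-up sequence log-probs for a batch of three pairs.
policy_lp_w = [-10.0, -12.0, -9.0]
policy_lp_l = [-11.0, -11.5, -14.0]
ref_lp_w = [-10.5, -12.0, -9.5]
ref_lp_l = [-10.5, -12.0, -13.0]

mean_margin, pref_acc = margin_stats(
    policy_lp_w, policy_lp_l, ref_lp_w, ref_lp_l, beta=0.1)
```

In this batch two of the three pairs have a positive margin. Passing the reference log-probs as the policy log-probs reproduces the starting point of training: every margin is zero.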

Interactive · DPO margin simulator

One preference pair, one knob for β, and a step button. Each click applies a single gradient update: the policy nudges log πθ(yw) up and log πθ(yl) down by an amount proportional to β · σ(−margin), where σ(−margin) is the magnitude of the gradient of −log σ(r̂w − r̂l) with respect to the margin. Watch the implicit reward bars spread and the margin curve grow. Slide β first to see how it scales the per-step change in margin: larger β amplifies each log-ratio nudge into a bigger reward swing.

Figure (interactive): DPO update on one pair (chosen vs. rejected). πθ starts equal to πref, so both log-ratios are 0 and the margin is 0. Each "Step" performs one preference-NLL gradient update on the log-ratios. Readouts: step, β, margin r̂w − r̂l, pref-acc, and the implicit reward r̂ = β · (log πθ − log πref) for yw (chosen) and yl (rejected). Plot: margin over steps (x = step, y = r̂w − r̂l).

Why the update has this shape

The loss is L = − log σ(r̂w − r̂l). Using d/du [−log σ(u)] = −σ(−u), we get dL / d(r̂w − r̂l) = −σ(−(r̂w − r̂l)). Since r̂ = β · (log πθ − log πref), a gradient step with learning rate η pushes log πθ(yw) up and log πθ(yl) down, each by η · β · σ(−margin). When the margin is large and positive (we've already won), σ(−margin) ≈ 0 and updates shrink: the loss saturates.
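The update rule can be simulated in a few lines. This is a simplified sketch: the two log-ratios are treated as free scalar parameters, whereas a real model shares weights across responses, and the learning rate and step count are arbitrary.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def simulate(beta, lr=1.0, steps=50):
    """Gradient descent on -log sigmoid(margin) for a single pair.

    Returns the margin recorded before each update.
    """
    ratio_w, ratio_l = 0.0, 0.0        # pi_theta starts as a copy of pi_ref
    margins = []
    for _ in range(steps):
        margin = beta * (ratio_w - ratio_l)
        margins.append(margin)
        step = lr * beta * sigmoid(-margin)   # eta * beta * sigma(-margin)
        ratio_w += step                        # push the chosen log-ratio up
        ratio_l -= step                        # push the rejected log-ratio down
    return margins

slow = simulate(beta=0.1)
fast = simulate(beta=0.5)
```

Both curves start at zero and rise monotonically. With the larger β the margin grows faster at first and its per-step increments shrink as σ(−margin) decays.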

Computing log π(y | x) in practice

The loss reads as if it needs sequence-level probabilities. We compute them by running a single forward pass on the concatenated sequence [x; y], taking log_softmax, and gathering the log-probability assigned to the actual target token at each response position. The response mask is the same as in SFT: 1 on response tokens, 0 on prompt and pad.

import torch.nn.functional as F

def gather_logprobs(model, x, response_mask):
    """Return the summed log-prob of the response tokens, shape (B,)."""
    logits, _ = model(x)              # (B, T, V); model is assumed to return (logits, extra)
    logits = logits[:, :-1, :]        # predictions for positions 0..T-2
    targets = x[:, 1:]                # actual next tokens
    mask    = response_mask[:, 1:].to(logits.dtype)  # response positions only (shift by 1)
    logp    = F.log_softmax(logits, dim=-1)
    per_tok = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return (per_tok * mask).sum(dim=-1)   # (B,): sum-log-prob of y given x

The shift-by-one convention is identical to every other stage in this pipeline: predictions at position t are scored against the token at position t+1. Masking out the prompt and padding is essential. If you sum log-probs over the entire sequence you mix log π(y|x) with log π(x) and with whatever the model assigns to pad positions. The prompt term is the same for chosen and rejected (same prompt), so it cancels in the margin but still distorts the per-response implicit rewards you log; the pad term does not cancel, because tokens-per-response varies, and the mismatch becomes a length bias.
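The shift and the mask are easier to see on a single toy sequence without tensors. The sketch below is a plain-Python analogue of the function above, not a replacement for it; the uniform logits stand in for a model's output, and the token ids are arbitrary.

```python
import math

def log_softmax(row):
    """Log-softmax of one score vector, computed stably."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]

def sequence_logprob(logits, tokens, response_mask):
    """Sum of log p(tokens[t+1] | tokens[0..t]) over response positions.

    logits[t] is the score vector produced after reading tokens[0..t].
    response_mask[t] is 1 if tokens[t] belongs to the response.
    """
    total = 0.0
    for t in range(len(tokens) - 1):
        target = tokens[t + 1]          # prediction at t is scored against t+1
        if response_mask[t + 1]:        # the mask is shifted along with the targets
            total += log_softmax(logits[t])[target]
    return total

# Toy sequence over a 3-token vocabulary: 2 prompt tokens, 2 response tokens, 1 pad.
tokens = [0, 2, 1, 1, 0]
response_mask = [0, 0, 1, 1, 0]
uniform_logits = [[0.0, 0.0, 0.0] for _ in tokens]   # stand-in for a model's output

lp = sequence_logprob(uniform_logits, tokens, response_mask)
lp_unmasked = sequence_logprob(uniform_logits, tokens, [1] * len(tokens))
```

With uniform logits each scored token contributes log(1/3), so lp counts exactly the two response tokens while lp_unmasked also picks up the prompt and pad positions.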

Run this function four times per batch: πθ on yw, πθ on yl, πref on yw, πref on yl. The two πref calls go inside a torch.no_grad() block because the reference is frozen, so there is no backward pass through it. Plug the four (B,)-shaped results into dpo_loss and you're done.

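def dpo_loss(policy_lp_w, policy_lp_l, ref_lp_w, ref_lp_l, beta):
    # All four inputs are (B,) sequence log-probs; F is torch.nn.functional.
    r_w  = beta * (policy_lp_w - ref_lp_w)      # implicit reward of the chosen response
    r_l  = beta * (policy_lp_l - ref_lp_l)      # implicit reward of the rejected response
    loss = -F.logsigmoid(r_w - r_l).mean()      # preference NLL averaged over the batch
    return loss, r_w, r_l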

Why DPO actually works — gradient intuition

Take the gradient of the loss with respect to the parameters. The chain rule gives

∇θ LDPO  =  −σ(−(r̂w − r̂l)) · β · ( ∇θ log πθ(yw|x)  −  ∇θ log πθ(yl|x) ).

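Read this slowly. The bracketed term is "push the log-prob of yw up, push the log-prob of yl down". The leading coefficient σ(−margin) measures how wrong the preference is right now: it is close to 1 when the pair is ranked badly wrong (margin very negative), exactly 1/2 at margin zero, and close to 0 once the margin is large and positive, at which point the update saturates.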

That saturation is a feature: it makes the loss self-balancing across a dataset where some pairs are easy and others are hard. The hard pairs get the gradient. Compare to vanilla cross-entropy in SFT, which keeps pushing log-probabilities even when the target is already the argmax.
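The gradient formula can be checked against a finite difference on a single pair. In this sketch the sequence log-probabilities are treated as plain numbers (made-up values), so the check covers the coefficient −β · σ(−margin) and its sign, not the backward pass through a network.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def pair_loss(lp_w, lp_l, ref_w, ref_l, beta):
    """-log sigmoid(margin) for one preference pair."""
    margin = beta * ((lp_w - ref_w) - (lp_l - ref_l))
    return -math.log(sigmoid(margin))

def analytic_grads(lp_w, lp_l, ref_w, ref_l, beta):
    """dL/d(lp_w) and dL/d(lp_l) from the gradient formula."""
    margin = beta * ((lp_w - ref_w) - (lp_l - ref_l))
    coeff = beta * sigmoid(-margin)
    return -coeff, coeff

args = dict(lp_w=-10.0, lp_l=-9.0, ref_w=-10.5, ref_l=-10.5, beta=0.1)
g_w, g_l = analytic_grads(**args)

# Central finite difference on lp_w as an independent check.
h = 1e-6
plus = pair_loss(**{**args, "lp_w": args["lp_w"] + h})
minus = pair_loss(**{**args, "lp_w": args["lp_w"] - h})
numeric_g_w = (plus - minus) / (2 * h)
```

The gradient with respect to the chosen log-prob is negative, so gradient descent raises it, and the gradient with respect to the rejected log-prob is its mirror image.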

The role of the β · log Z(x) cancellation now reads as: we are doing this push/pull on log-probabilities without needing to know the per-prompt normalization, which we could not evaluate in practice because it sums over all possible responses. The cancellation is what makes the loss computable for an autoregressive model.

What DPO cannot do — the gap RLVR closes

DPO is a beautiful tool, but it has structural limits that come straight from its inputs.

Lesson 6 picks up here. We keep the KL anchor and the frozen reference; we drop the static pairs and replace them with policy rollouts scored by a verifier.

Variants — the DPO family

Once the core trick was published, a small zoo of variants followed. You don't need to know any of them in depth; just know they exist and what shape problem each addresses.

Variant | One-line idea
------- | -------------
IPO | Saturating loss prevents over-confident pairs from dominating; σ is replaced with a function that doesn't push margins to infinity.
KTO | Trained on point labels (good / bad) instead of pairs. Useful when raters give thumbs-up / thumbs-down rather than A-vs-B.
ORPO | Drops the reference model entirely and folds the preference signal into a single combined SFT+preference loss.
SimPO | No reference; length-normalized log-likelihoods. Cheaper at training time (no ref to keep around) at the cost of the formal derivation.

Practical gotchas from 03_dpo.py

Four things the toy file gets right that beginners often get wrong

Stage 3 in the pipeline

Where we are: the policy can follow instructions (SFT), reason out loud (CoT), and now rank responses according to a frozen preference dataset (DPO). What it still cannot do is improve through interaction with a verifier on its own samples. That is the move lesson 6 makes.

PRETRAIN → SFT → CoT → DPO → RLVR

The same algebra you just walked through reappears in RLVR / GRPO: the KL-regularized objective from step (A), the frozen πref, the per-token log-ratios. The difference is the source of the reward signal. In DPO, the reward is the implicit log-ratio fit through preference pairs. In RLVR, the reward is a programmatic verifier evaluated on policy rollouts. Same KL anchor, different question.

Takeaway
DPO collapses the two-step RLHF recipe (reward model, then PPO) into one closed-form loss by exploiting that the KL-regularized optimum is a Gibbs distribution whose partition function cancels across Bradley–Terry pair differences. The reference model is the only non-policy network that survives — frozen, used to compute log-ratios. No reward model. No value head. No on-policy sampling.