
RLHF — the post-training workflow

Lessons 11–12 gave you the algorithms — PPO, DPO, GRPO. RLHF is the end-to-end recipe that wires them together, and its job is to answer one question: where does the reward come from for an open-ended task that no verifier can score?

What lessons 11–12 left open

We now have a toolbox of policy-optimizers. PPO (lesson 11) improves a policy safely with a clipped surrogate and a KL anchor; DPO (lesson 12) skips the RL loop given preference pairs; GRPO drops the critic for a group baseline. All three assume what the very first lesson assumed: that a reward r(x, y) already exists.

For verifiable tasks it does — a unit-test runner or a math checker returns 1 or 0. But most of what we want from an assistant is not verifiable: "write a helpful reply," "summarize this paper well," "be polite." No program scores helpfulness. So the question lessons 11–12 quietly postponed is the whole subject of this lesson:

The question RLHF answers
When you cannot write a verifier, and you cannot write down "the correct answer" to imitate, where does the scalar reward come from? RLHF's answer: ask humans which of two responses is better, fit a reward model to those comparisons, and optimize against it with the machinery you already have.

RLHF (Reinforcement Learning from Human Feedback) is not a new algorithm — it is a workflow built from three stages, each consuming the output of the one before. The optimizer in stage 3 is just PPO from lesson 11.

The three stages

[Figure: the three-stage pipeline. 1 · SFT (imitate demonstrations, next-token cross-entropy) outputs π_SFT. 2 · Reward model (Bradley–Terry on preference pairs) outputs r̂_φ. 3 · PPO (max r̂_φ − β·KL vs π_SFT) outputs π_RLHF. Arrow labels: "inits", "scores", and "π_SFT also anchors PPO (the KL term)".]

Read the dashed arrows: πSFT initializes the reward model and anchors the PPO policy; the reward model scores PPO's rollouts. The stages are a ladder, not a menu — break a rung and a specific pathology appears downstream.

Stage 1 · Supervised fine-tuning

Start from a pretrained base model and show it a few thousand (prompt, ideal response) demonstrations. Train with ordinary next-token cross-entropy — plain supervised learning, the instructive feedback of orientation, not the evaluative feedback of RL. The output πSFT is a competent, bland assistant that has learned the format: follow instructions, stop at the end, don't echo the prompt.

Why bother, if RL is coming next? Because the later stages both need a sane starting point: the reward model is built on top of πSFT, and the PPO policy is initialized from it and KL-anchored to it. RL is a local search — it polishes a good policy, it does not conjure one from noise.
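As a rough illustration of the stage-1 loss, the sketch below computes next-token cross-entropy for one demonstration. The logits, token ids, and function names (`logSoftmax`, `sftLoss`) are assumptions made for this example; in a real pipeline the logits come from the model, and averaging over response positions only is one possible choice rather than something this lesson specifies.

```javascript
// Numerically stable log-softmax over one vector of logits.
function logSoftmax(logits) {
  const m = Math.max(...logits);
  const logZ = m + Math.log(logits.reduce((s, z) => s + Math.exp(z - m), 0));
  return logits.map(z => z - logZ);
}

// Mean next-token cross-entropy over the demonstration's response tokens.
// stepLogits[t] : hypothetical model logits over the vocabulary at position t
// targetIds[t]  : the demonstration's token id at that position
function sftLoss(stepLogits, targetIds) {
  let total = 0;
  for (let t = 0; t < targetIds.length; t++) {
    total -= logSoftmax(stepLogits[t])[targetIds[t]];
  }
  return total / targetIds.length;
}

// Toy vocabulary of 3 tokens, response of 2 tokens.
const uniformLogits = [[0, 0, 0], [0, 0, 0]];
const confidentLogits = [[5, 0, 0], [0, 5, 0]];
const demo = [0, 1];
console.log(sftLoss(uniformLogits, demo));   // log(3) ≈ 1.0986: the model is guessing
console.log(sftLoss(confidentLogits, demo)); // close to 0: the model predicts the demonstration
```

The loss only rewards matching the demonstration token by token, which is why stage 1 teaches format and imitation rather than a preference between alternatives.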

Stage 2 · The reward model, from preference pairs

Collect human comparisons: a prompt x and two responses, of which the human marks one the winner yw and one the loser yl. Comparisons are far easier and more reliable for humans than absolute scores — "B is better than A" beats "A is a 6.5/10."

We want a scalar r̂φ(x, y) that ranks yw above yl. The bridge from preferences to a number is the Bradley–Terry model (1952), the same model DPO used in lesson 12. It says the probability a human prefers yw is the sigmoid of the score difference:

P(y_w ≻ y_l | x)  =  σ( r̂_φ(x, y_w) − r̂_φ(x, y_l) )

Maximize the likelihood of the observed clicks — equivalently minimize its negative log — and you get the reward-model loss:

L_RM(φ)  =  − 𝔼_{(x, y_w, y_l)} [ log σ( r̂_φ(x, y_w) − r̂_φ(x, y_l) ) ]
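The two formulas above can be sketched in a few lines. This is a minimal illustration on raw scores, with no network behind them; the names `prefProb`, `rewardModelLoss`, and `pairGradient` are made up for the example.

```javascript
const sigmoid = z => 1 / (1 + Math.exp(-z));

// P(y_w ≻ y_l | x) under Bradley–Terry: sigmoid of the score difference.
const prefProb = (rW, rL) => sigmoid(rW - rL);

// Mean negative log-likelihood over a batch of scored pairs.
// Each pair holds the reward model's current scores for winner and loser.
function rewardModelLoss(pairs) {
  const nll = pairs.map(({ rW, rL }) => -Math.log(prefProb(rW, rL)));
  return nll.reduce((s, v) => s + v, 0) / pairs.length;
}

// Gradient of one pair's loss with respect to the two scores:
// d/d rW = −σ(rL − rW), d/d rL = +σ(rL − rW).
// Gradient descent therefore pushes the winner up and the loser down.
function pairGradient(rW, rL) {
  const g = sigmoid(rL - rW);
  return { dW: -g, dL: g };
}

console.log(prefProb(0, 0));                        // 0.5: equal scores, a coin flip
console.log(rewardModelLoss([{ rW: 0, rL: 0 }]));   // log(2) ≈ 0.693
console.log(prefProb(2, 0) === prefProb(102, 100)); // true: only the difference matters
```

The gradient magnitude σ(r̂_l − r̂_w) shrinks as the model already ranks the pair correctly, so confidently correct pairs contribute little further update.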

Three properties to keep in mind, because the widget below makes all three visible:

This is the same Bradley–Terry from lesson 12
DPO took the BT model and, using the closed form of the KL-regularized optimum, substituted the reward with a log-ratio of policies — turning the whole pipeline into one supervised classification loss with no separate reward model and no sampling. RLHF keeps the reward model explicit and the RL loop intact. Same starting equation; the fork is whether you solve for the policy analytically (DPO) or optimize it with PPO (RLHF).

Stage 3 · PPO against the reward model, anchored to SFT

Now r̂φ is a frozen scalar function. We are back on familiar ground from lesson 11: maximize expected reward with PPO, but subtract a KL penalty that pulls the policy toward the SFT reference.

max_θ   𝔼_{y ∼ π_θ(·|x)} [ r̂_φ(x, y) ]  −  β · KL( π_θ ‖ π_SFT )

This is exactly the regularized objective the sibling course derives in detail — reward says what to maximize, the KL anchor says don't wander too far from where you started. The coefficient β sets the leash length.
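To make the trade-off concrete, here is a small sketch that evaluates this objective for a discrete policy over three candidate responses. The scores, the value of β, and the three policies (`piSFT`, `moderate`, `greedy`) are illustrative assumptions, not numbers from the lesson.

```javascript
// Expected proxy reward under a discrete policy.
const expectedReward = (pi, rhat) => pi.reduce((s, p, i) => s + p * rhat[i], 0);

// KL(pi ‖ ref) for discrete distributions; terms with p = 0 contribute 0.
const kl = (pi, ref) =>
  pi.reduce((s, p, i) => (p > 0 ? s + p * Math.log(p / ref[i]) : s), 0);

// Stage-3 objective: E[r̂] − β · KL(π ‖ π_SFT).
const objective = (pi, ref, rhat, beta) =>
  expectedReward(pi, rhat) - beta * kl(pi, ref);

// Hypothetical setup: three candidate responses, uniform SFT policy.
const piSFT = [1 / 3, 1 / 3, 1 / 3];
const rhat = [0.0, 0.5, 1.0];
const moderate = [0.2, 0.3, 0.5];  // leans toward the higher-scored responses
const greedy = [0, 0, 1];          // all mass on the top-scored response
const beta = 0.5;

console.log(objective(piSFT, piSFT, rhat, beta));    // ≈ 0.5: average reward, zero KL cost
console.log(objective(moderate, piSFT, rhat, beta)); // ≈ 0.616: more reward for a small KL cost
console.log(objective(greedy, piSFT, rhat, beta));   // ≈ 0.451: reward 1, but KL = log(3) costs more than it gains
```

With this β the greedy policy scores worse than staying at the reference, which is the leash at work; lowering β makes the greedy policy win.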

Reward hacking: the proxy is not the goal

Here is the central failure mode, and the thing the widget is built to show. The reward model is a proxy for human preference, fit on finite data near πSFT. PPO is an optimizer, and optimizers are merciless: they will find wherever the proxy is highest, including the regions where the proxy is wrong.

This is Goodhart's law in miniature — "when a measure becomes a target, it ceases to be a good measure." Early RLHF runs without an anchor reliably degenerated: the policy discovered that the reward model assigned spuriously high scores to flattering openings, certain punctuation, or sheer length, and it spammed those patterns. The reward number climbed while actual quality fell.

Proxy reward vs true reward
Picture two curves. Proxy reward is what r̂φ reports, the quantity PPO optimizes. True reward is what a fresh human would actually say. On-distribution they agree. As the policy drifts off-distribution they diverge: proxy keeps climbing into the model's blind spot while true reward turns and falls. Reward hacking is exactly that gap opening up.

The KL anchor has two jobs

It is tempting to read β · KL(πθ ‖ πSFT) as a single "stay close" penalty. It is doing two distinct jobs — separate them in your head:

  1. Stay fluent. πSFT is a known-coherent language model. Anchoring to it stops the policy collapsing into a high-scoring but degenerate mode (the infamous "the the the…" or one canned response for every prompt).
  2. Stay near a trustworthy reward region. The reward model is only reliable where its training data lived, near πSFT. The KL term bounds how far the policy may walk from that region; beyond it the proxy stops being trustworthy. The anchor does not fix the proxy's blind spots; it keeps the policy from marching into them.

And so β has a sweet spot, with failure on both sides:

β setting | What happens | Why
Too low | Reward hacking: proxy reward soars, true reward crashes. | The leash is loose; PPO drifts off-distribution into the proxy's blind spots and exploits them.
Just right | Both proxy and true reward rise together. | The policy improves inside the region where the proxy still tracks the truth.
Too high | Almost no learning: the policy barely moves off πSFT. | The leash is so tight the KL penalty dominates any reward gain; the gradient can't push the policy anywhere.
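The three regimes in the table can be reproduced with a small sweep. This is a sketch under stated assumptions: `trueR` reuses the hidden true rewards from this lesson's widget code, `proxyR` is a made-up proxy that is accurate on four responses and over-optimistic on the fifth, and instead of running PPO the code jumps straight to the closed-form optimum of the KL-regularized objective, π* ∝ π_SFT · exp(r̂ / β).

```javascript
// Hidden true rewards, taken from this lesson's widget code.
const trueR = [0.20, 0.45, 0.70, 0.95, 0.55];
// Hypothetical proxy: accurate on responses 0 to 3, over-optimistic on response 4,
// which stands in for a response the reward model never saw compared.
const proxyR = [0.20, 0.45, 0.70, 0.95, 1.50];
const piSFT = [0.2, 0.2, 0.2, 0.2, 0.2];

const dot = (p, r) => p.reduce((s, v, i) => s + v * r[i], 0);
const kl = (p, q) => p.reduce((s, v, i) => (v > 0 ? s + v * Math.log(v / q[i]) : s), 0);

// Assumption: use the closed-form maximizer of E[r̂] − β·KL(π ‖ π_SFT)
// rather than an iterative optimizer.
function optimum(beta) {
  const m = Math.max(...proxyR);
  const w = piSFT.map((p, i) => p * Math.exp((proxyR[i] - m) / beta)); // shifted for stability
  const z = w.reduce((s, v) => s + v, 0);
  return w.map(v => v / z);
}

function report(beta) {
  const pi = optimum(beta);
  return { beta, proxy: dot(pi, proxyR), truth: dot(pi, trueR), kl: kl(pi, piSFT) };
}

for (const beta of [0.05, 1, 50]) console.log(report(beta));
```

With these numbers the uniform SFT policy has true reward 0.57. At β = 0.05 nearly all mass lands on the blind-spot response: proxy approaches 1.5 while true reward drops to about 0.55. At β = 1 both rise (true reward about 0.61), and at β = 50 the policy stays almost uniform.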

Interactive · train a reward model, then optimize against it

Two acts. Act 1: click preferences on a handful of response pairs to train a toy Bradley–Terry reward model — each click is one gradient step on −log σ(r̂chosen − r̂rejected). Act 2: run KL-anchored policy improvement against your trained model and watch two gauges: the proxy reward the policy is optimizing and the true reward a fresh human would give. Drive β too low and they split apart — the bug is the lesson.

Bradley–Terry reward model → KL-anchored policy improvement
There are 5 candidate responses with hidden true rewards. The proxy reward model only learns from the pairs you click — so it is reliable for the responses you compared and unreliable for the rest. Click several pairs, then run the policy and slide β.
The widget's core JS:
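// 5 responses, hidden TRUE reward. RM = learned proxy scores r̂[i].
const trueR = [0.20, 0.45, 0.70, 0.95, 0.55];
let rhat = [0,0,0,0,0];                 // proxy scores (learned)
let seen = [false,false,false,false,false];   // which responses we compared

// The excerpt uses the names below without defining them. These values and
// helper bodies are ASSUMPTIONS for a runnable sketch, not the widget's settings.
const LR = 0.5, ETA = 0.5;              // step sizes for the RM and the policy
const BLIND = 1.5;                      // optimistic score for never-compared responses
const sigm = z => 1 / (1 + Math.exp(-z));
function softmax(v){
  const m = Math.max(...v), e = v.map(x => Math.exp(x - m));
  const s = e.reduce((a, b) => a + b, 0);
  return e.map(x => x / s);
}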

function clickPref(w, l){               // Bradley–Terry SGD step
  const g = sigm(rhat[l] - rhat[w]);    // = σ(r̂_l − r̂_w)
  rhat[w] += LR * g;  rhat[l] -= LR * g;
  seen[w] = seen[l] = true;
}

// Policy = softmax(theta). PPO ~ ascent on  E[r̂] − β·KL(π‖π_SFT).
let theta = [0,0,0,0,0];                // π_SFT is uniform → theta 0
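function policyStep(beta){
  const pi = softmax(theta), piS = softmax([0,0,0,0,0]);
  // KL(π‖π_SFT) = Σ_k π_k·log(π_k/πS_k), needed for the KL gradient below
  const kl = pi.reduce((s,p,k)=>s+p*Math.log(p/piS[k]),0);
  for (let i=0;i<5;i++){
    let g = 0;
    for (let k=0;k<5;k++)                // d E[r̂]/dθ_i
      g += pi[k]*(((i===k)?1:0)-pi[i])*proxy(k);
    // d(−β KL)/dθ_i = −β·π_i·(log(π_i/πS_i) − KL), summed over the full softmax Jacobian
    g -= beta*pi[i]*(Math.log(pi[i]/piS[i])-kl);
    theta[i] += ETA*g;
  }
}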
// proxy(k): trained r̂ if compared, else an OPTIMISTIC blind-spot value
function proxy(k){ return seen[k] ? rhat[k] : BLIND; }
const trueReward = pi => pi.reduce((s,p,i)=>s+p*trueR[i],0);

The mechanism behind the bug: any response you never compared is, to the reward model, an uncalibrated blind spot — and like a real over-optimistic RM it scores those high. With β low, PPO pours probability onto exactly those unseen responses (proxy reward looks great), but their true reward is whatever the hidden value happens to be — usually mediocre. With β high, the policy stays pinned to uniform SFT and learns nothing. Somewhere in between, it climbs the responses you actually verified.

RLHF vs RLVR vs RLAIF — three sources of reward

RLHF is one answer to "where does reward come from." It has siblings that differ only in the source of the signal; the optimizer (PPO, GRPO) is unchanged.

Recipe | Reward source | Best for | Main risk
RLVR (verifiable reward) | A program: unit tests, a math checker, a regex. Returns 0/1, no learned model. | Math, code, puzzles: anything with a checkable answer. | Limited to tasks you can verify; can still hack a loose verifier.
RLHF (human feedback) | A reward model fit to human preference pairs (this lesson). | Open-ended quality: helpfulness, tone, style, safety. | Reward hacking of the proxy; expensive human labels.
RLAIF (AI feedback) | A reward model (or an LLM judge) fed preferences generated by another model against a written principle / constitution. | Scaling preference data cheaply; "constitutional" alignment. | Inherits and can amplify the judge model's biases and blind spots.

RLAIF is the natural move once labeling humans becomes the bottleneck: replace the human annotator in stage 2 with a model that compares responses according to a stated principle. Everything else in the pipeline is identical — same Bradley–Terry loss, same PPO stage 3.
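A rough sketch of how the three reward sources differ only at the edge of the pipeline. Everything here is hypothetical: `makeVerifier`, `labelPair`, `humanClick`, and `toyJudge` are invented for illustration, and the toy judge is a one-line rule standing in for what would really be a model call.

```javascript
// RLVR: the reward is a program. Hypothetical exact-match checker.
const makeVerifier = expected => response =>
  response.trim() === String(expected) ? 1 : 0;

// RLHF and RLAIF: the reward model is trained on pairs like these.
// `prefersFirst` is whoever does the comparing: a person (RLHF) or a judge model (RLAIF).
function labelPair(prefersFirst, prompt, a, b) {
  return prefersFirst(prompt, a, b)
    ? { prompt, winner: a, loser: b }
    : { prompt, winner: b, loser: a };
}

// Toy stand-ins; neither is a real annotator or a real model call.
const humanClick = (prompt, a, b) => true;   // the person picked the first response
const toyJudge = (prompt, a, b) =>           // crude "prefer the politer reply" rule
  /thank/i.test(a) && !/thank/i.test(b);

const checkSum = makeVerifier(4);
console.log(checkSum(" 4 "), checkSum("5")); // 1 0
console.log(labelPair(toyJudge, "Reply to the user.", "No.", "Thanks, will do."));
// winner is "Thanks, will do." because the toy judge does not prefer the first response
```

Both labelers emit the same `{ prompt, winner, loser }` record, so the Bradley–Terry loss and the stage-3 optimizer downstream do not need to know which one produced it.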

Where DPO short-circuits the loop

Lesson 12 derived DPO precisely so you could skip the expensive parts of this workflow. Lining the two up:

Stage | RLHF (this lesson) | DPO (lesson 12)
SFT | required | required
Reward model | train an explicit r̂φ via Bradley–Terry | none: reward is implicit in the policy log-ratio
Optimization | PPO: sample rollouts, score, clip, KL-anchor | one supervised classification loss on the pairs, no sampling
The β / KL anchor | an explicit penalty term in stage 3 | baked into the loss (the β in the log-ratio)

DPO uses the closed form of the KL-regularized optimum to express the reward as β log( πθ(y|x) / πref(y|x) ) plus a β log Z(x) term that depends only on the prompt. Plugging this into the same Bradley–Terry equation from stage 2, the partition function Z(x) cancels in the score difference, collapsing stages 2 and 3 into a single offline loss. The cost: no online sampling means DPO only sees the preference pairs you collected; it can't explore for new high-reward responses the way PPO can. That trade, simplicity and stability vs. on-policy exploration, is the live debate in modern post-training.
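The substitution can be written out for a single preference pair. This is a minimal sketch: the field names and the log-probability values are hypothetical, and each log-probability is assumed to be the summed log π(y | x) of a whole response.

```javascript
const sigmoid = z => 1 / (1 + Math.exp(-z));

// DPO loss for one preference pair.
function dpoLoss({ logpW, logpL, refLogpW, refLogpL }, beta) {
  // Implicit rewards β·log(π_θ / π_ref). The Z(x) term is identical for both
  // responses to the same prompt, so it drops out of the difference.
  const rW = beta * (logpW - refLogpW);
  const rL = beta * (logpL - refLogpL);
  return -Math.log(sigmoid(rW - rL));   // same Bradley–Terry form as the reward-model loss
}

// Hypothetical log-probabilities for a winner and a loser.
const atReference = { logpW: -12, logpL: -10, refLogpW: -12, refLogpL: -10 };
const afterTraining = { logpW: -9, logpL: -14, refLogpW: -12, refLogpL: -10 };

console.log(dpoLoss(atReference, 0.1));   // log(2) ≈ 0.693: policy equals the reference
console.log(dpoLoss(afterTraining, 0.1)); // ≈ 0.403: the winner gained probability relative to the reference
```

Note that the loss depends on how the policy moved relative to the reference, not on which response has the higher raw log-probability: at the reference the loser is actually more likely, yet the loss sits at the neutral value log(2).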

Where this goes next (systems)
This lesson is the theory of the workflow. The sibling course works it as a production system: 15 · RLHF — the original recipe (the deep systems treatment, with the InstructGPT data scales and failure-mode table) and 03 · Reward & Reference (the KL anchor in code — Schulman's k3 estimator, why the reference is a separate engine, πold vs πref). Browse the full systems track from the RL Post-Training index.

What still has no answer

RLHF assumes we can at least compare two responses — humans express a preference and Bradley–Terry turns it into a reward. But what if even that is too much? What if all we have is an expert's behavior — demonstrations, with no preference labels and no reward at all? Then we must infer the reward from the behavior itself. That is imitation learning and inverse RL, and the reward model you just trained is a close cousin of what they build.

Takeaway
RLHF is a workflow, not an algorithm: SFT → a Bradley–Terry reward model from human preference pairs → PPO against that model with a KL-to-SFT anchor. The reward model is a proxy, so an unconstrained optimizer will hack it — proxy reward climbs while true reward falls. The KL anchor's two jobs (stay fluent, stay in the region where the proxy is trustworthy) are what hold the line, and β has failure on both sides. RLVR, RLHF, and RLAIF differ only in where the reward comes from; DPO short-circuits the whole loop by making the reward implicit.