RLHF — the post-training workflow

Lessons 11–12 gave you the algorithms — PPO, DPO, GRPO. RLHF is the end-to-end recipe that wires them together, and its job is to answer one question: where does the reward come from for an open-ended task that no verifier can score?

What lessons 11–12 left open

We now have a toolbox of policy-optimizers. PPO (lesson 11) improves a policy safely with a clipped surrogate and a KL anchor; DPO (lesson 12) skips the RL loop given preference pairs; GRPO drops the critic for a group baseline. All three assume what the very first lesson assumed: that a reward r(x, y) already exists.

For verifiable tasks it does — a unit-test runner or a math checker returns 1 or 0. But most of what we want from an assistant is not verifiable: "write a helpful reply," "summarize this paper well," "be polite." No program scores helpfulness. So the question lessons 11–12 quietly postponed is the whole subject of this lesson:

The question RLHF answers

When you cannot write a verifier, and you cannot write down "the correct answer" to imitate, where does the scalar reward come from? RLHF's answer: ask humans which of two responses is better, fit a reward model to those comparisons, and optimize against it with the machinery you already have.

RLHF (Reinforcement Learning from Human Feedback) is not a new algorithm — it is a workflow built from three stages, each consuming the output of the one before. The optimizer in stage 3 is just PPO from lesson 11.

The three stages

The dashed arrows in the diagram show how the stages connect: π_SFT initializes the reward model and anchors the PPO policy; the reward model scores PPO's rollouts. The stages are a ladder, not a menu: break a rung and a specific pathology appears downstream.

Stage 1 · Supervised fine-tuning

Start from a pretrained base model and show it a few thousand (prompt, ideal response) demonstrations. Train with ordinary next-token cross-entropy — plain supervised learning, the instructive feedback of orientation, not the evaluative feedback of RL. The output π_SFT is a competent, bland assistant that has learned the format: follow instructions, stop at the end, don't echo the prompt.
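As a rough illustration of the stage 1 objective, the sketch below computes next-token cross-entropy for one demonstration from the probabilities a model assigned to each correct token. The function name `sftLoss`, the toy probabilities, and the choice to average only over response positions (masking the prompt) are assumptions made for illustration; the lesson does not specify them.

```javascript
// Next-token cross-entropy for one (prompt, response) demonstration.
// probs[t]  = probability the model gave to the correct token at position t
// isResp[t] = true if position t is part of the response (assumed masking choice)
function sftLoss(probs, isResp) {
  let total = 0, count = 0;
  for (let t = 0; t < probs.length; t++) {
    if (!isResp[t]) continue;            // skip prompt positions
    total += -Math.log(probs[t]);        // negative log-likelihood of the target token
    count += 1;
  }
  return count > 0 ? total / count : 0;  // mean over response tokens
}

// Hypothetical numbers: the same demonstration before and after some training.
const isResp = [false, false, true, true, true];
const before = sftLoss([0.9, 0.8, 0.10, 0.20, 0.05], isResp);
const after  = sftLoss([0.9, 0.8, 0.70, 0.80, 0.60], isResp);
console.log(before.toFixed(3), after.toFixed(3));
```

The loss falls as the model puts more probability on the demonstrated tokens. Nothing here evaluates a sampled response, which is the sense in which SFT feedback is instructive rather than evaluative.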

Why bother, if RL is coming next? Because both later stages need a sane starting point: the reward model is built on top of π_SFT, and the PPO policy is initialized from it and KL-anchored to it. RL is a local search: it polishes a good policy; it does not conjure one from noise.

Stage 2 · The reward model, from preference pairs

Collect human comparisons: a prompt x and two responses, of which the human marks one the winner y_w and one the loser y_l. Comparisons are far easier and more reliable for humans than absolute scores — "B is better than A" beats "A is a 6.5/10."

We want a scalar r̂_φ(x, y) that ranks y_w above y_l. The bridge from preferences to a number is the Bradley–Terry model (1952) — the same model DPO used in lesson 12. It says the probability a human prefers y_w is the sigmoid of the score difference:

P(y_w ≻ y_l | x) = σ( r̂_φ(x, y_w) − r̂_φ(x, y_l) )

Maximize the likelihood of the observed clicks — equivalently minimize its negative log — and you get the reward-model loss:

L_RM(φ) = − 𝔼_{(x, y_w, y_l)} [ log σ( r̂_φ(x, y_w) − r̂_φ(x, y_l) ) ]

Three properties to keep in mind, because the widget below makes all three visible:

Only differences matter. Add a constant to every score and the loss is unchanged. The reward model has a free additive offset — never read raw values, only gaps.
Hard pairs do the work. The gradient weight on a pair is σ(r̂_l − r̂_w): pairs already ranked correctly contribute almost nothing; close or wrong pairs dominate. This is just logistic regression on the score gap.
It is a proxy. The reward model is trained on a finite sample of comparisons from responses near π_SFT. It is right on-distribution and unknown off-distribution — remember this for stage 3.
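The first two properties can be checked numerically. The sketch below is a minimal illustration with made-up scores; `btLoss` and `btWeight` are hypothetical helper names, not part of the widget.

```javascript
const sigmoid = z => 1 / (1 + Math.exp(-z));

// Bradley–Terry loss for one pair: −log σ(r_w − r_l).
const btLoss = (rW, rL) => -Math.log(sigmoid(rW - rL));

// Size of the gradient with respect to the score gap: σ(r_l − r_w).
const btWeight = (rW, rL) => sigmoid(rL - rW);

// 1. Only differences matter: shifting both scores by a constant changes nothing.
const base    = btLoss(1.2, 0.4);
const shifted = btLoss(1.2 + 100, 0.4 + 100);

// 2. Hard pairs do the work: compare the weight on three kinds of pair.
const easy  = btWeight(4.0, 0.0);   // already ranked correctly with a large gap
const close = btWeight(0.1, 0.0);   // nearly tied
const wrong = btWeight(0.0, 3.0);   // ranked the wrong way round

console.log({ base, shifted, easy, close, wrong });
```

The two losses agree up to floating-point rounding, and the weights come out at roughly 0.02, 0.48, and 0.95: the confidently correct pair barely moves the scores, while the tied and misranked pairs drive the update.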

This is the same Bradley–Terry from lesson 12

DPO took the BT model and, using the closed form of the KL-regularized optimum, substituted the reward with a log-ratio of policies — turning the whole pipeline into one supervised classification loss with no separate reward model and no sampling. RLHF keeps the reward model explicit and the RL loop intact. Same starting equation; the fork is whether you solve for the policy analytically (DPO) or optimize it with PPO (RLHF).

Stage 3 · PPO against the reward model, anchored to SFT

Now r̂_φ is a frozen scalar function. We are back on familiar ground from lesson 11: maximize expected reward with PPO, but subtract a KL penalty that pulls the policy toward the SFT reference.

max_θ 𝔼_{y ∼ π_θ(·|x)} [ r̂_φ(x, y) ] − β · KL( π_θ ‖ π_SFT )

This is exactly the regularized objective the sibling course derives in detail — reward says what to maximize, the KL anchor says don't wander too far from where you started. The coefficient β sets the leash length.
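To make the trade-off concrete, here is a small sketch that evaluates the stage 3 objective for a single prompt with four candidate responses. The policies, scores, and helper names (`kl`, `objective`) are toy assumptions; a real policy is a distribution over token sequences, not four slots.

```javascript
// KL(p ‖ q) for two categorical distributions over the same responses.
function kl(p, q) {
  return p.reduce((s, pk, k) => (pk > 0 ? s + pk * Math.log(pk / q[k]) : s), 0);
}

// Stage 3 objective for one prompt: E_{y~π}[r̂(y)] − β · KL(π ‖ π_SFT).
function objective(pi, piSft, rhat, beta) {
  const expectedReward = pi.reduce((s, pk, k) => s + pk * rhat[k], 0);
  return expectedReward - beta * kl(pi, piSft);
}

const piSft  = [0.25, 0.25, 0.25, 0.25];   // reference policy (toy: uniform)
const rhat   = [0.1, 0.3, 0.6, 1.0];       // toy reward-model scores
const greedy = [0.01, 0.01, 0.01, 0.97];   // nearly all mass on the top-scored response
const mild   = [0.15, 0.20, 0.28, 0.37];   // a gentle tilt toward higher scores

for (const beta of [0.05, 1.0]) {
  console.log(beta, objective(greedy, piSft, rhat, beta), objective(mild, piSft, rhat, beta));
}
```

With the short leash (β = 1.0) the gentle tilt scores higher because the greedy policy pays a large KL cost; with the long leash (β = 0.05) the greedy policy wins. The same reward model gives opposite recommendations depending on β.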

Reward hacking: the proxy is not the goal

Here is the central failure mode, and the thing the widget is built to show. The reward model is a proxy for human preference, fit on finite data near π_SFT. PPO is an optimizer, and optimizers are merciless: they will find wherever the proxy is highest, including the regions where the proxy is wrong.

This is Goodhart's law in miniature — "when a measure becomes a target, it ceases to be a good measure." Early RLHF runs without an anchor reliably degenerated: the policy discovered that the reward model assigned spuriously high scores to flattering openings, certain punctuation, or sheer length, and it spammed those patterns. The reward number climbed while actual quality fell.

Proxy reward vs true reward

Picture two curves. Proxy reward is what r̂_φ reports — what PPO optimizes. True reward is what a fresh human would actually say. On-distribution they agree. As the policy drifts off-distribution they diverge: proxy keeps climbing into the model's blind spot while true reward turns and falls. Reward hacking is exactly that gap opening up.

The KL anchor has two jobs

It is tempting to read β · KL(π_θ ‖ π_SFT) as a single "stay close" penalty. It is doing two distinct jobs — separate them in your head:

Stay fluent. π_SFT is a known-coherent language model. Anchoring to it stops the policy from collapsing into a high-reward but degenerate mode (the infamous "the the the…" or one canned response for every prompt).
Stay near a trustworthy reward region. The reward model is only reliable where its training data lived, near π_SFT. The KL term bounds how far the policy may walk away from that region, into territory where the proxy stops being trustworthy. It does not fix the proxy's blind spots; it keeps the policy from marching into them.

And so β has a sweet spot, with failure on both sides:

β setting	What happens	Why
Too low	Reward hacking — proxy reward soars, true reward crashes.	The leash is loose; PPO drifts off-distribution into the proxy's blind spots and exploits them.
Just right	Both proxy and true reward rise together.	The policy improves inside the region where the proxy still tracks the truth.
Too high	Almost no learning — the policy barely moves off π_SFT.	The leash is so tight the KL penalty dominates any reward gain; the gradient can't push the policy anywhere.
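The three rows of the table can be reproduced in a few lines. This sketch does not run PPO; it swaps in the closed-form maximizer of the KL-regularized objective for a categorical policy, π(k) ∝ π_SFT(k) · exp(proxy(k)/β), and sweeps β. Every number is an illustrative assumption: response 4 is rare under π_SFT, its true reward is low, and the proxy scores it too high.

```javascript
// Toy setup (all numbers are illustrative assumptions).
const piSft  = [0.39, 0.30, 0.20, 0.10, 0.01]; // response 4 is rare under SFT
const trueR  = [0.2, 0.4, 0.6, 0.8, 0.1];      // what a fresh human would say
const proxyR = [0.2, 0.4, 0.6, 0.8, 1.2];      // RM is right on 0..3, over-optimistic on 4

const dot = (p, r) => p.reduce((s, pk, k) => s + pk * r[k], 0);
const kl  = (p, q) => p.reduce((s, pk, k) => (pk > 0 ? s + pk * Math.log(pk / q[k]) : s), 0);

// Maximizer of E[proxy] − β·KL(π ‖ π_SFT): π(k) ∝ π_SFT(k) · exp(proxy(k) / β).
function optimalPolicy(beta) {
  const logits = piSft.map((p, k) => Math.log(p) + proxyR[k] / beta);
  const m = Math.max(...logits);                // subtract the max for numerical stability
  const w = logits.map(l => Math.exp(l - m));
  const z = w.reduce((a, b) => a + b, 0);
  return w.map(x => x / z);
}

function report(beta) {
  const pi = optimalPolicy(beta);
  return {
    beta,
    proxyReward: dot(pi, proxyR),
    trueReward: dot(pi, trueR),
    klToSft: kl(pi, piSft),
  };
}

console.log("SFT baseline true reward:", dot(piSft, trueR));
for (const beta of [0.02, 0.3, 10]) console.log(report(beta));
```

In this toy, β = 0.02 puts almost all mass on the blind-spot response (proxy reward near 1.2, true reward near 0.1), β = 0.3 lifts both rewards above the SFT baseline, and β = 10 leaves the policy essentially at π_SFT.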

Interactive · train a reward model, then optimize against it

Two acts. Act 1: click preferences on a handful of response pairs to train a toy Bradley–Terry reward model — each click is one gradient step on −log σ(r̂_chosen − r̂_rejected). Act 2: run KL-anchored policy improvement against your trained model and watch two gauges: the proxy reward the policy is optimizing and the true reward a fresh human would give. Drive β too low and they split apart — the bug is the lesson.

Bradley–Terry reward model → KL-anchored policy improvement

There are 5 candidate responses with hidden true rewards. The proxy reward model only learns from the pairs you click — so it is reliable for the responses you compared and unreliable for the rest. Click several pairs, then run the policy and slide β.

Interactive widget (not reproduced here). Act 1: click the better response. Act 2: optimize the policy at the chosen β (shown as 0.100). Readouts: clicks (RM steps), proxy reward, true reward, KL to SFT, verdict.

The core JS

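// 5 responses with hidden TRUE rewards. RM = learned proxy scores r̂[i].
const trueR = [0.20, 0.45, 0.70, 0.95, 0.55];
const N = trueR.length;
// LR, ETA and BLIND are not defined in this excerpt; the values below are assumed.
const LR = 0.5, ETA = 0.5, BLIND = 1.5;
let rhat = new Array(N).fill(0);        // proxy scores (learned)
let seen = new Array(N).fill(false);    // which responses we compared

// Helpers the excerpt relies on.
const sigm = z => 1 / (1 + Math.exp(-z));
function softmax(v){
  const m = Math.max(...v), e = v.map(x => Math.exp(x - m));
  const s = e.reduce((a, b) => a + b, 0);
  return e.map(x => x / s);
}

function clickPref(w, l){               // Bradley–Terry SGD step
  const g = sigm(rhat[l] - rhat[w]);    // = σ(r̂_l − r̂_w)
  rhat[w] += LR * g;  rhat[l] -= LR * g;
  seen[w] = seen[l] = true;
}

// proxy(k): trained r̂ if compared, else an OPTIMISTIC blind-spot value
function proxy(k){ return seen[k] ? rhat[k] : BLIND; }

// Policy = softmax(theta). PPO ~ ascent on  E[r̂] − β·KL(π‖π_SFT).
let theta = new Array(N).fill(0);       // π_SFT is uniform → theta 0
function policyStep(beta){
  const pi = softmax(theta), piS = softmax(new Array(N).fill(0));
  const meanR = pi.reduce((s, p, k) => s + p * proxy(k), 0);           // E[r̂]
  const kl = pi.reduce((s, p, k) => s + p * Math.log(p / piS[k]), 0);  // KL(π‖π_SFT)
  for (let i = 0; i < N; i++){
    const gR  = pi[i] * (proxy(i) - meanR);               // d E[r̂]/dθ_i
    // d KL/dθ_i through the full softmax Jacobian (not only the k = i term)
    const gKL = pi[i] * (Math.log(pi[i] / piS[i]) - kl);
    theta[i] += ETA * (gR - beta * gKL);
  }
}
const trueReward = pi => pi.reduce((s,p,i)=>s+p*trueR[i],0);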

The mechanism behind the bug: any response you never compared is, to the reward model, an uncalibrated blind spot — and like a real over-optimistic RM it scores those high. With β low, PPO pours probability onto exactly those unseen responses (proxy reward looks great), but their true reward is whatever the hidden value happens to be — usually mediocre. With β high, the policy stays pinned to uniform SFT and learns nothing. Somewhere in between, it climbs the responses you actually verified.

RLHF vs RLVR vs RLAIF — three sources of reward

RLHF is one answer to "where does reward come from." It has siblings that differ only in the source of the signal; the optimizer (PPO, GRPO) is unchanged.

Recipe	Reward source	Best for	Main risk
RLVR (verifiable reward)	A program (unit tests, a math checker, a regex). Returns 0/1, no learned model.	Math, code, puzzles: anything with a checkable answer.	Limited to tasks you can verify; can still hack a loose verifier.
RLHF (human feedback)	A reward model fit to human preference pairs (this lesson).	Open-ended quality: helpfulness, tone, style, safety.	Reward hacking of the proxy; expensive human labels.
RLAIF (AI feedback)	A reward model (or an LLM judge) fed preferences generated by another model against a written principle / constitution.	Scaling preference data cheaply; "constitutional" alignment.	Inherits and can amplify the judge model's biases and blind spots.

RLAIF is the natural move once human labeling becomes the bottleneck: replace the human annotator in stage 2 with a model that compares responses according to a stated principle. Everything else in the pipeline is identical: same Bradley–Terry loss, same PPO stage 3.

Where DPO short-circuits the loop

Lesson 12 derived DPO precisely so you could skip the expensive parts of this workflow. Lining the two up:

Stage	RLHF (this lesson)	DPO (lesson 12)
SFT	required	required
Reward model	train an explicit r̂_φ via Bradley–Terry	none — reward is implicit in the policy log-ratio
Optimization	PPO: sample rollouts, score, clip, KL-anchor	one supervised classification loss on the pairs, no sampling
The β / KL anchor	an explicit penalty term in stage 3	baked into the loss (the β in the log-ratio)

DPO uses the closed form of the KL-regularized optimum to express the reward as β log(π_θ/π_ref), plug it into the same Bradley–Terry equation from stage 2, and cancel the partition function Z(x) — collapsing stages 2 and 3 into a single offline loss. The cost: no online sampling means DPO only sees the preference pairs you collected; it can't explore for new high-reward responses the way PPO can. That trade — simplicity and stability vs. on-policy exploration — is the live debate in modern post-training.
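The cancellation of Z(x) described above can be verified numerically on a toy problem. The sketch below is not a DPO training loop; it only checks the identity at the optimum, for one prompt with three responses and made-up values of `beta`, `piRef`, and `r`.

```javascript
const sigmoid = z => 1 / (1 + Math.exp(-z));

const beta  = 0.5;
const piRef = [0.5, 0.3, 0.2];      // reference policy over three responses (toy)
const r     = [0.2, 1.0, 0.4];      // toy rewards

// Closed-form optimum of E[r] − β·KL(π ‖ π_ref): π*(y) = π_ref(y)·exp(r(y)/β) / Z.
const unnorm = piRef.map((p, y) => p * Math.exp(r[y] / beta));
const Z = unnorm.reduce((a, b) => a + b, 0);
const piStar = unnorm.map(u => u / Z);

// Implicit reward recovered from the two policies alone:
// β·log(π*(y)/π_ref(y)) = r(y) − β·log Z.
const implicit = piStar.map((p, y) => beta * Math.log(p / piRef[y]));

// Bradley–Terry only sees differences, so the −β·log Z offset cancels.
const w = 1, l = 2;
const pFromReward = sigmoid(r[w] - r[l]);
const pFromPolicy = sigmoid(implicit[w] - implicit[l]);
console.log(pFromReward, pFromPolicy);
```

The two probabilities match up to rounding, which is why DPO can write the stage 2 loss directly in terms of policy log-ratios and never needs Z(x) or an explicit reward model.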

Where this goes next (systems)

This lesson is the theory of the workflow. The sibling course works it as a production system: 15 · RLHF — the original recipe (the deep systems treatment, with the InstructGPT data scales and failure-mode table) and 03 · Reward & Reference (the KL anchor in code — Schulman's k3 estimator, why the reference is a separate engine, π_old vs π_ref). Browse the full systems track from the RL Post-Training index.

What still has no answer

RLHF assumes we can at least compare two responses — humans express a preference and Bradley–Terry turns it into a reward. But what if even that is too much? What if all we have is an expert's behavior — demonstrations, with no preference labels and no reward at all? Then we must infer the reward from the behavior itself. That is imitation learning and inverse RL, and the reward model you just trained is a close cousin of what they build.

Takeaway

RLHF is a workflow, not an algorithm: SFT → a Bradley–Terry reward model from human preference pairs → PPO against that model with a KL-to-SFT anchor. The reward model is a proxy, so an unconstrained optimizer will hack it — proxy reward climbs while true reward falls. The KL anchor's two jobs (stay fluent, stay in the region where the proxy is trustworthy) are what hold the line, and β has failure on both sides. RLVR, RLHF, and RLAIF differ only in where the reward comes from; DPO short-circuits the whole loop by making the reward implicit.