RLHF — the post-training workflow
Lessons 11–12 gave you the algorithms — PPO, DPO, GRPO. RLHF is the end-to-end recipe that wires them together, and its job is to answer one question: where does the reward come from for an open-ended task that no verifier can score?
What lessons 11–12 left open
We now have a toolbox of policy optimizers. PPO (lesson 11) improves a policy safely with a clipped surrogate and a KL anchor; DPO (lesson 12) skips the RL loop given preference pairs; GRPO drops the critic for a group baseline. All three assume their training signal is already in hand: PPO and GRPO need a reward r(x, y), just as the very first lesson assumed, and DPO needs preference pairs that someone has already labeled.
For verifiable tasks it does: a unit-test runner or a math checker returns 1 or 0. But most of what we want from an assistant is not verifiable: "write a helpful reply," "summarize this paper well," "be polite." No program scores helpfulness. So the question lessons 11–12 quietly postponed is the whole subject of this lesson: where does the reward come from when no verifier can score the task?
RLHF (Reinforcement Learning from Human Feedback) is not a new algorithm — it is a workflow built from three stages, each consuming the output of the one before. The optimizer in stage 3 is just PPO from lesson 11.
The three stages
Read the dashed arrows in the diagram as dependencies: πSFT initializes the reward model and anchors the PPO policy; the reward model scores PPO's rollouts. The stages are a ladder, not a menu: break a rung and a specific pathology appears downstream.
Stage 1 · Supervised fine-tuning
Start from a pretrained base model and show it a few thousand (prompt, ideal response) demonstrations. Train with ordinary next-token cross-entropy — plain supervised learning, the instructive feedback of orientation, not the evaluative feedback of RL. The output πSFT is a competent, bland assistant that has learned the format: follow instructions, stop at the end, don't echo the prompt.
Why bother, if RL is coming next? Because the later stages both need a sane starting point: the reward model is built on top of πSFT, and the PPO policy is initialized from it and KL-anchored to it. RL is a local search — it polishes a good policy, it does not conjure one from noise.
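A minimal sketch of the stage 1 loss, assuming NumPy and a single tokenized example. The function name `sft_loss` and the `response_mask` argument are illustrative assumptions; restricting the loss to response tokens is a common choice that the lesson itself does not specify.

```python
import numpy as np

def sft_loss(logits, targets, response_mask):
    """Mean next-token cross-entropy over response positions only.

    logits:        (T, V) unnormalized scores for the next token at each position
    targets:       (T,)   index of the token that actually came next
    response_mask: (T,)   1.0 where the target belongs to the response, else 0.0
                          (hypothetical convention: prompt tokens are not trained on)
    """
    logits = logits - logits.max(axis=-1, keepdims=True)  # stabilize the softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(targets)), targets]
    return float((token_nll * response_mask).sum() / response_mask.sum())
```

With all-zero logits the model is uniform over the vocabulary, so the loss equals log V; training pushes it toward zero on the demonstrated tokens.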
Stage 2 · The reward model, from preference pairs
Collect human comparisons: a prompt x and two responses, of which the human marks one the winner yw and one the loser yl. Comparisons are far easier and more reliable for humans than absolute scores — "B is better than A" beats "A is a 6.5/10."
We want a scalar r̂φ(x, y) that ranks yw above yl. The bridge from preferences to a number is the Bradley–Terry model (1952) — the same model DPO used in lesson 12. It says the probability a human prefers yw is the sigmoid of the score difference:
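Written in this section's symbols, with σ the logistic sigmoid, the standard Bradley–Terry form is:

```latex
P(y_w \succ y_l \mid x)
  = \sigma\big(\hat r_\phi(x, y_w) - \hat r_\phi(x, y_l)\big),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}
```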
Maximize the likelihood of the observed clicks — equivalently minimize its negative log — and you get the reward-model loss:
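In the same notation, and assuming the symbol 𝒟 for the dataset of labeled comparisons (a name introduced here for illustration), the negative log-likelihood reads:

```latex
\mathcal{L}(\phi)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \Big[ \log \sigma\big(\hat r_\phi(x, y_w) - \hat r_\phi(x, y_l)\big) \Big]
```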
Three properties to keep in mind, because the widget below makes all three visible:
- Only differences matter. Add a constant to every score and the loss is unchanged. The reward model has a free additive offset — never read raw values, only gaps.
- Hard pairs do the work. The gradient weight on a pair is σ(r̂l − r̂w): pairs already ranked correctly contribute almost nothing; close or wrong pairs dominate. This is just logistic regression on the score gap.
- It is a proxy. The reward model is trained on a finite sample of comparisons from responses near πSFT. It is right on-distribution and unknown off-distribution — remember this for stage 3.
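The first two properties can be checked numerically. The sketch below assumes NumPy and three hand-picked pairs (one already ranked correctly by a wide margin, one close, one mis-ranked); the scores and function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bt_loss(r_w, r_l):
    """Bradley-Terry negative log-likelihood, averaged over pairs."""
    gap = np.asarray(r_w) - np.asarray(r_l)
    return float(np.mean(np.logaddexp(0.0, -gap)))  # equals -log sigmoid(gap)

def bt_grad_weight(r_w, r_l):
    """Magnitude of d(loss)/d(gap) for each pair: sigmoid(r_l - r_w)."""
    return sigmoid(np.asarray(r_l) - np.asarray(r_w))

# Hypothetical scores: an easy pair, a close pair, and a mis-ranked pair.
r_w = np.array([3.0, 0.1, -1.0])
r_l = np.array([-3.0, 0.0, 1.0])

base = bt_loss(r_w, r_l)
shifted = bt_loss(r_w + 7.0, r_l + 7.0)   # add a constant to every score
weights = bt_grad_weight(r_w, r_l)

print(f"loss={base:.4f}  loss after +7 offset={shifted:.4f}")
print("per-pair gradient weights:", np.round(weights, 3))
```

The two losses agree because only the gap enters the loss, and the weights grow from the easy pair to the mis-ranked one, which is where the learning signal concentrates.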
Stage 3 · PPO against the reward model, anchored to SFT
Now r̂φ is a frozen scalar function. We are back on familiar ground from lesson 11: maximize expected reward with PPO, but subtract a KL penalty that pulls the policy toward the SFT reference.
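Written out with this section's symbols, and assuming 𝒟 denotes the prompt distribution (a name introduced here for illustration), a standard form of that objective is:

```latex
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D}}
\Big[
  \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[ \hat r_\phi(x, y) \big]
  \;-\; \beta\, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{SFT}}(\cdot \mid x) \big)
\Big]
```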
This is exactly the regularized objective the sibling course derives in detail — reward says what to maximize, the KL anchor says don't wander too far from where you started. The coefficient β sets the leash length.
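For a single sampled response, the KL term is typically estimated from the log-probability ratio of that response under the policy and the frozen reference. A minimal sketch of the resulting shaped reward, assuming NumPy and per-token log-probabilities as inputs (the function name and argument layout are hypothetical):

```python
import numpy as np

def kl_shaped_reward(rm_score, logp_policy, logp_sft, beta):
    """Reward-model score minus beta times a single-sample KL estimate.

    logp_policy, logp_sft: (T,) per-token log-probs of the sampled response
    under the current policy and under the frozen SFT reference.
    """
    log_ratio = float(np.sum(logp_policy) - np.sum(logp_sft))
    return rm_score - beta * log_ratio
```

This version applies the penalty once per sequence; some implementations instead spread it over tokens, which leaves the total unchanged but alters credit assignment.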
Reward hacking: the proxy is not the goal
Here is the central failure mode, and the thing the widget is built to show. The reward model is a proxy for human preference, fit on finite data near πSFT. PPO is an optimizer, and optimizers are merciless: they will find wherever the proxy is highest, including the regions where the proxy is wrong.
This is Goodhart's law in miniature — "when a measure becomes a target, it ceases to be a good measure." Early RLHF runs without an anchor reliably degenerated: the policy discovered that the reward model assigned spuriously high scores to flattering openings, certain punctuation, or sheer length, and it spammed those patterns. The reward number climbed while actual quality fell.
The KL anchor has two jobs
It is tempting to read β · KL(πθ ‖ πSFT) as a single "stay close" penalty. It is doing two distinct jobs — separate them in your head:
- Stay fluent. πSFT is a known-coherent language model. Anchoring to it stops the policy from collapsing into a degenerate mode (the infamous "the the the…" or one canned response for every prompt).
- Stay near a trustworthy reward region. The reward model is only reliable where its training data lived, near πSFT. The KL term bounds how far the policy may walk from that region; beyond it, the proxy stops being trustworthy. The anchor does not fix the proxy's blind spots; it keeps the policy from marching into them.
And so β has a sweet spot, with failure on both sides:
| β setting | What happens | Why |
|---|---|---|
| Too low | Reward hacking — proxy reward soars, true reward crashes. | The leash is loose; PPO drifts off-distribution into the proxy's blind spots and exploits them. |
| Just right | Both proxy and true reward rise together. | The policy improves inside the region where the proxy still tracks the truth. |
| Too high | Almost no learning — the policy barely moves off πSFT. | The leash is so tight the KL penalty dominates any reward gain; the gradient can't push the policy anywhere. |
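The three rows can be reproduced in a toy setting. The sketch below is not PPO: it assumes a hypothetical set of six responses and uses the known closed-form optimum of the KL-regularized objective, π(y) ∝ πSFT(y) · exp(r̂(y)/β), to show where each β ends up. All numbers are invented for illustration: five responses are on-distribution and scored correctly by the proxy, and one rare response is a blind spot that the proxy over-scores.

```python
import numpy as np

# Hypothetical toy: five responses the reward model saw, plus one rare,
# off-distribution response it never saw and scores far too high.
pi_sft  = np.array([0.1998, 0.1998, 0.1998, 0.1998, 0.1998, 0.0010])
true_r  = np.array([0.9, 0.7, 0.5, 0.3, 0.1, 0.0])
proxy_r = np.array([0.9, 0.7, 0.5, 0.3, 0.1, 1.5])  # last entry is the blind spot

def solve(beta):
    """Closed-form KL-regularized optimum; returns (proxy, true, KL to SFT)."""
    log_w = np.log(pi_sft) + proxy_r / beta
    log_w -= log_w.max()                          # work in log space for stability
    log_pi = log_w - np.log(np.exp(log_w).sum())
    pi = np.exp(log_pi)
    kl = float(np.sum(pi * (log_pi - np.log(pi_sft))))
    return float(pi @ proxy_r), float(pi @ true_r), kl

sft_true = float(pi_sft @ true_r)                 # true reward of the SFT policy
print(f"SFT baseline true reward: {sft_true:.3f}")
for beta in (0.02, 0.2, 100.0):
    proxy, true, kl = solve(beta)
    print(f"beta={beta:<6} proxy={proxy:.3f} true={true:.3f} KL={kl:.3f}")
```

In this toy, the smallest β sends almost all probability to the blind spot (proxy reward near its maximum, true reward near zero), the middle β raises both rewards above the SFT baseline, and the largest β leaves the policy essentially at πSFT.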
Interactive · train a reward model, then optimize against it
Two acts. Act 1: click preferences on a handful of response pairs to train a toy Bradley–Terry reward model — each click is one gradient step on −log σ(r̂chosen − r̂rejected). Act 2: run KL-anchored policy improvement against your trained model and watch two gauges: the proxy reward the policy is optimizing and the true reward a fresh human would give. Drive β too low and they split apart — the bug is the lesson.
The mechanism behind the bug: any response you never compared is, to the reward model, an uncalibrated blind spot — and like a real over-optimistic RM it scores those high. With β low, PPO pours probability onto exactly those unseen responses (proxy reward looks great), but their true reward is whatever the hidden value happens to be — usually mediocre. With β high, the policy stays pinned to uniform SFT and learns nothing. Somewhere in between, it climbs the responses you actually verified.
RLHF vs RLVR vs RLAIF — three sources of reward
RLHF is one answer to "where does reward come from." It has siblings that differ only in the source of the signal; the optimizer (PPO, GRPO) is unchanged.
| Recipe | Reward source | Best for | Main risk |
|---|---|---|---|
| RLVR (verifiable reward) | A program: unit tests, a math checker, a regex. Returns 0/1, no learned model. | Math, code, puzzles: anything with a checkable answer. | Limited to tasks you can verify; can still hack a loose verifier. |
| RLHF (human feedback) | A reward model fit to human preference pairs (this lesson). | Open-ended quality: helpfulness, tone, style, safety. | Reward hacking of the proxy; expensive human labels. |
| RLAIF (AI feedback) | A reward model (or an LLM judge) fed preferences generated by another model against a written principle / constitution. | Scaling preference data cheaply; "constitutional" alignment. | Inherits and can amplify the judge model's biases and blind spots. |
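To make the RLVR row concrete, here is a deliberately loose verifier, sketched under the assumption that the task is integer arithmetic and that the final integer in the response is the answer. The function name and the parsing rule are hypothetical.

```python
import re

def verifiable_reward(response: str, expected: int) -> float:
    """Return 1.0 if the last integer in the response equals `expected`, else 0.0."""
    numbers = re.findall(r"-?\d+", response)
    return 1.0 if numbers and int(numbers[-1]) == expected else 0.0

print(verifiable_reward("17 + 25 = 42", 42))        # correct working, correct answer
print(verifiable_reward("I think it is 41", 42))    # wrong answer
print(verifiable_reward("It is 41. Or maybe 42", 42))  # passes despite the hedging
```

The third call passes because the check only looks at the last number, a small instance of the "loose verifier" risk in the table.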
RLAIF is the natural move once labeling humans becomes the bottleneck: replace the human annotator in stage 2 with a model that compares responses according to a stated principle. Everything else in the pipeline is identical — same Bradley–Terry loss, same PPO stage 3.
Where DPO short-circuits the loop
Lesson 12 derived DPO precisely so you could skip the expensive parts of this workflow. Lining the two up:
| Stage | RLHF (this lesson) | DPO (lesson 12) |
|---|---|---|
| SFT | required | required |
| Reward model | train an explicit r̂φ via Bradley–Terry | none — reward is implicit in the policy log-ratio |
| Optimization | PPO: sample rollouts, score, clip, KL-anchor | one supervised classification loss on the pairs, no sampling |
| The β / KL anchor | an explicit penalty term in stage 3 | baked into the loss (the β in the log-ratio) |
DPO uses the closed form of the KL-regularized optimum to write the reward as β log(πθ/πref) plus a term β log Z(x) that depends only on the prompt. Plugging this into the same Bradley–Terry equation from stage 2 makes the partition function Z(x) cancel in the score difference, which collapses stages 2 and 3 into a single offline loss. The cost: no online sampling means DPO only sees the preference pairs you collected; it can't explore for new high-reward responses the way PPO can. That trade, simplicity and stability vs. on-policy exploration, is the live debate in modern post-training.
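The collapsed loss can be sketched in a few lines, assuming NumPy and sequence-level log-probabilities of each response under the policy and the frozen reference (the function and argument names are hypothetical). It is the stage 2 Bradley–Terry loss with the implicit reward β log(πθ/πref) substituted for r̂φ; Z(x) never appears because it cancels in the difference.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """Bradley-Terry loss on implicit rewards beta * log(pi_theta / pi_ref).

    Each argument is a sequence-level log-probability (scalar or array of
    pairs): winner and loser under the policy, then under the reference.
    """
    implicit_r_w = beta * (np.asarray(logp_w) - np.asarray(ref_logp_w))
    implicit_r_l = beta * (np.asarray(logp_l) - np.asarray(ref_logp_l))
    gap = implicit_r_w - implicit_r_l
    return float(np.mean(np.logaddexp(0.0, -gap)))  # equals -log sigmoid(gap)

# Policy identical to the reference: the gap is zero and the loss is log 2.
at_init = dpo_loss([-5.0], [-5.0], [-5.0], [-5.0], beta=0.1)
# Policy that raised the winner and lowered the loser relative to the reference.
improved = dpo_loss([-4.0], [-6.0], [-5.0], [-5.0], beta=0.1)
print(f"loss at init={at_init:.4f}  after improvement={improved:.4f}")
```

Note that everything here is computed from fixed preference pairs: no response is sampled from the policy, which is the exploration trade-off described above.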
What still has no answer
RLHF assumes we can at least compare two responses — humans express a preference and Bradley–Terry turns it into a reward. But what if even that is too much? What if all we have is an expert's behavior — demonstrations, with no preference labels and no reward at all? Then we must infer the reward from the behavior itself. That is imitation learning and inverse RL, and the reward model you just trained is a close cousin of what they build.