RLHF: the original recipe (Christiano '17 · InstructGPT '22)
Before verifiable rewards there were human rewards. Preferences become a scalar reward model; the reward model trains a policy with PPO. Three stages, one ladder. Every modern preference algorithm is a shortcut around at least one rung.
The problem RLHF solves
For most tasks worth doing — write a helpful email, summarize a paper, follow a vague instruction — there is no verifier. You cannot unit-test "is this response helpful?" You can, however, ask a human: between these two responses, which is better? That comparison is almost always easier than scoring either response in isolation.
RLHF (Reinforcement Learning from Human Feedback) is the canonical pipeline for turning that comparison signal into a policy update. It has three stages, each consuming the previous one's output: supervised fine-tuning produces πSFT, reward-model training produces r̂φ, and PPO produces the final policy πθ.
The order of the stages is not arbitrary: each one consumes its own data as well as the output of the stage before it. πSFT initializes r̂φ; r̂φ scores rollouts during PPO; πSFT also anchors the KL term in PPO. Break any one stage and a specific pathology shows up downstream.
Stage 1 · Supervised fine-tuning (SFT)
Start with a pretrained base model. Show it a few thousand (prompt, ideal response) demonstrations written by skilled humans. Train with the standard next-token cross-entropy loss until it learns the response format — that it should follow instructions, end with EOS, not regurgitate the prompt. The result πSFT is a competent-but-bland assistant.
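A minimal sketch of the Stage 1 objective in plain Python, assuming per-position logits are already available. The `loss_mask` that excludes prompt positions is a common convention assumed here, not something this lesson specifies, and `sft_loss` and its argument names are illustrative.

```python
import math

def sft_loss(logits, targets, loss_mask):
    """Mean next-token cross-entropy over positions where loss_mask is 1.

    logits:    list of per-position score lists, one score per vocabulary entry
    targets:   list of correct next-token ids, one per position
    loss_mask: 1 for response positions, 0 for prompt positions (assumed convention)
    """
    total, count = 0.0, 0
    for row, target, keep in zip(logits, targets, loss_mask):
        if not keep:
            continue
        # log-sum-exp with max subtraction for numerical stability
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[target]  # equals -log softmax(row)[target]
        count += 1
    return total / count

# A model with no preference among 4 tokens pays log(4) per scored position.
uniform = [[0.0] * 4 for _ in range(3)]
print(sft_loss(uniform, targets=[1, 2, 3], loss_mask=[0, 1, 1]))
```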
Stage 2 · The reward model
Here is the core idea, and the one place this lesson asks you to do the math.
You collect a dataset of human pairwise judgements: prompt x, two responses yw (winner) and yl (loser). You want a scalar function r̂φ(x, y) such that r̂φ(x, yw) > r̂φ(x, yl) on as many pairs as possible.
The classical model from Bradley & Terry (1952) says: if two items have latent scores sw and sl, the probability the human picks yw is the sigmoid of the score difference:
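Written out, the sentence above corresponds to the following. This is a reconstruction from the prose; the symbol ≻ for "is preferred to" is notation assumed here.

```latex
P(y_w \succ y_l \mid x)
  \;=\; \sigma(s_w - s_l)
  \;=\; \frac{1}{1 + e^{-(s_w - s_l)}}
  \;=\; \frac{e^{s_w}}{e^{s_w} + e^{s_l}}
```

The last form is the classical Bradley–Terry ratio of exponentiated scores; the sigmoid form is the same quantity rearranged.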
Take the negative log-likelihood, average over the preference dataset, and you have the reward-model loss in its production form:
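With r̂φ(x, y) standing in for the latent score, the loss described above can be written as follows. This is a reconstruction from the surrounding definitions; 𝒟 is an assumed symbol for the preference dataset.

```latex
\mathcal{L}(\phi)
  \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l) \sim \mathcal{D}}
  \Big[ \log \sigma\big( \hat r_\phi(x, y_w) - \hat r_\phi(x, y_l) \big) \Big]
```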
Concretely, r̂φ is the SFT model with a scalar regression head bolted to the final hidden state of the last token. Two forward passes per training example. The loss has three properties worth pausing on:
- It only depends on the difference. Adding a constant to every reward leaves the loss unchanged. The reward model has a free additive offset; do not interpret raw reward values, only differences.
- It is convex in the score difference. The gradient pushes r̂w up and r̂l down with weight σ(r̂l − r̂w), so pairs the model already gets right contribute almost no gradient and hard pairs contribute the most. In effect this is logistic regression on the difference between the two response embeddings.
- The score scale is meaningful (but only relative). σ(Δ = 1) ≈ 0.73; σ(Δ = 3) ≈ 0.95. A well-trained RM puts most preferences at |Δ| in [0.5, 4]. If |Δ| is much bigger, the RM is overconfident; much smaller, undertrained.
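The three properties above can be checked numerically. The sketch below is plain Python on scalar rewards rather than a real reward model; `bt_loss` and `bt_grads` are illustrative names.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bt_loss(r_w, r_l):
    """Bradley-Terry negative log-likelihood, -log(sigmoid(r_w - r_l))."""
    d = r_w - r_l
    # -log(sigmoid(d)) = softplus(-d), written in its numerically stable form
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))

def bt_grads(r_w, r_l):
    """Return (dL/dr_w, dL/dr_l); gradient descent moves r_w up and r_l down."""
    weight = sigmoid(r_l - r_w)
    return -weight, weight

# Property 1: adding a constant to both rewards leaves the loss unchanged.
print(bt_loss(2.0, 0.5), bt_loss(102.0, 100.5))

# Property 2: a pair the model already ranks correctly gets a far smaller
# gradient weight than a pair it ranks the wrong way round.
print(bt_grads(4.0, 0.0)[1], bt_grads(0.0, 4.0)[1])

# Property 3: preference probabilities implied by score gaps of 1 and 3.
print(round(sigmoid(1.0), 2), round(sigmoid(3.0), 2))  # 0.73 0.95
```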
Interactive · fit a Bradley–Terry reward on a 4-response toy
The setup: four responses to one prompt, each with a hidden ground-truth score. Each "round" samples a random pair, the simulated human picks the winner with probability σ(Δ), and one gradient step is taken on the BT loss. The model's scores converge toward the hidden ones (up to the free additive offset), while the agreement rate plateaus short of 100% because some pairs are genuinely close.
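The same experiment can be reproduced offline. This is a minimal sketch in plain Python; the ground-truth scores, learning rate, and round count are hypothetical choices, not values taken from the lesson's widget.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_toy(true_scores, rounds=20000, lr=0.02, seed=0):
    """Fit one scalar score per response from simulated pairwise picks."""
    rng = random.Random(seed)
    est = [0.0] * len(true_scores)
    agreements = 0
    for _ in range(rounds):
        i, j = rng.sample(range(len(true_scores)), 2)
        # Simulated human: prefers i with probability sigmoid(true score gap).
        if rng.random() < sigmoid(true_scores[i] - true_scores[j]):
            w, l = i, j
        else:
            w, l = j, i
        # Count agreement using the scores held before this update.
        if est[w] > est[l]:
            agreements += 1
        # One SGD step on -log(sigmoid(est[w] - est[l])).
        weight = sigmoid(est[l] - est[w])
        est[w] += lr * weight
        est[l] -= lr * weight
    return est, agreements / rounds

true_scores = [2.0, 1.0, 0.5, -1.0]  # hypothetical hidden scores
est, rate = fit_toy(true_scores)
print([round(s, 2) for s in est], round(rate, 3))
```

Every update adds and subtracts the same amount, so the estimates always sum to zero: the fit recovers score differences, not the absolute offset. With these particular scores, even a perfect model would agree with the simulated human only about 80% of the time, since that is the average of σ(|Δ|) over the six pairs.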
Stage 3 · PPO against the learned reward
Now r̂φ is a frozen scalar function over (prompt, response). We're back in familiar territory from lessons 04 and 10: optimize a policy πθ to maximize expected reward, with PPO clipping and a KL anchor to πSFT:
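The objective this paragraph points to is usually written in the form below. Treat it as a sketch of the standard KL-regularized objective rather than the lesson's exact notation: PPO's clipped surrogate is the optimizer applied to it and is omitted here, and 𝒟 is an assumed symbol for the prompt distribution.

```latex
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D}}
\Big[
  \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
  \big[ \hat r_\phi(x, y) \big]
  \;-\; \beta\,
  \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{SFT}}(\cdot \mid x) \big)
\Big]
```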
The KL term is doing two distinct jobs and it's worth separating them in your head:
- Distribution anchor. Keep πθ close to a known-reasonable policy so it doesn't collapse into a fluent-but-degenerate mode.
- Off-distribution defence. The reward model was trained on responses sampled from πSFT. If πθ drifts far away, r̂φ is being asked to score inputs it has never seen. That's where reward hacking happens — the policy finds inputs where the reward model is miscalibrated upward, and exploits them. β tunes how loose the leash is.
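One common way the KL anchor shows up in code is as a per-token penalty folded into the reward, with the reward-model score added on the final token. The sketch below assumes that convention (the lesson does not say which variant it uses); `shaped_rewards` and its arguments are hypothetical names, and the log-probabilities are those of the sampled tokens under each policy.

```python
def shaped_rewards(rm_score, logp_policy, logp_sft, beta=0.1):
    """Per-token rewards for one sampled response.

    rm_score:    scalar reward-model score for the whole response
    logp_policy: log pi_theta of each sampled token
    logp_sft:    log pi_SFT of the same tokens
    beta:        KL coefficient; larger values keep the policy closer to pi_SFT
    """
    # Penalize each token by beta times its log-probability ratio.
    rewards = [-beta * (lp - lq) for lp, lq in zip(logp_policy, logp_sft)]
    # The reward model scores the complete response, so add it at the end.
    rewards[-1] += rm_score
    return rewards

# No drift from pi_SFT: the policy keeps the full reward-model score.
print(sum(shaped_rewards(1.0, [-1.0, -2.0], [-1.0, -2.0])))
# The policy has made its own samples more likely than pi_SFT does: it pays for the drift.
print(sum(shaped_rewards(1.0, [-1.0, -2.0], [-2.0, -3.0])))
```

Summed over tokens, the penalty is a single-sample estimate of β·KL(πθ ‖ πSFT), which is why raising `beta` tightens the leash.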
Why this is three stages and not one
The natural question: why not skip Stage 2 and do "RL from human feedback" by polling humans for every rollout? Answer: economics. A human comparison takes 30–120 seconds; a reward-model forward pass takes milliseconds. PPO needs millions of rollout evaluations per training run. The reward model is a learned amortization of the human — slow and expensive to build, then cheap to query.
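The economics are easy to check with back-of-the-envelope arithmetic. The sketch below uses one million evaluations as a hypothetical lower bound for "millions" and the 30–120 second range from the text; `human_hours` is an illustrative helper.

```python
def human_hours(num_evals, seconds_per_comparison):
    """Total annotator time, in hours, if a human judged every rollout."""
    return num_evals * seconds_per_comparison / 3600

num_evals = 1_000_000  # hypothetical lower bound for one training run
print(round(human_hours(num_evals, 30)))   # 8333 hours at the fast end
print(round(human_hours(num_evals, 120)))  # 33333 hours at the slow end
```

Even at the fast end, that is several person-years of continuous annotation for a single run, which is the case for amortizing the human into a reward model.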
This same logic — "make the slow signal cheap by training a model to imitate it" — repeats in every preference algorithm we'll meet in lesson 16, and in every search-based reasoning recipe in lesson 17.
Trade-offs and known failure modes
| Cost / risk | What goes wrong | Mitigation |
|---|---|---|
| Three sequential stages | Bugs compound; iteration is slow; preference data must be re-collected for each new task. | Reuse preference data; cache RM checkpoints; or skip to DPO (lesson 16). |
| RM is wrong off-distribution | Policy reward-hacks RM blind spots. | KL anchor (β), early stopping on KL, sample-efficient RLHF. |
| Preference data is biased | Annotators reward fluency, length, confident tone — not correctness. | Length penalties, multi-objective rewards, AI-feedback bootstrapping. |
| RM scale doesn't match policy scale | A 7B RM scoring a 70B policy gets fooled by sophistication it can't evaluate. | Scale RM with policy; ensemble RMs; constitutional / rule-based supplements. |
| Mode collapse | Policy finds one high-reward response and emits it for every prompt. | Entropy bonus, larger β, more diverse prompts during RL. |
Why this pipeline still matters
For verifiable tasks (math, code) we don't need RLHF — that's what RLVR (lessons 09–14) is for. But the majority of an assistant's behavior is not verifiable: tone, helpfulness, harmlessness, format, style. Every modern frontier model still uses preference signal somewhere — even DeepSeek-R1 mixes RLVR for reasoning with preference-based RL for general helpfulness in its final stage (lesson 23). The RLHF ladder isn't obsolete; it's been factored into pieces that DPO, KTO, and friends rearrange in lesson 16.