RLHF: the original recipe (Christiano '17 · InstructGPT '22)

Before verifiable rewards there were human rewards. Preferences become a scalar reward model; the reward model trains a policy with PPO. Three stages, one ladder. Every modern preference algorithm is a shortcut around at least one rung.

Where this lesson sits

Part I built the framework, Part II built the algorithms — all assuming a verifier hands you a 0/1 reward. Part III opens by asking: what if there is no verifier? RLHF is the foundational answer. It introduces the reward model, a learned scalar function trained from human preferences. Lessons 16 (DPO), 17 (PRM), and 18 (environments) are all variations on this same question — where the reward signal comes from.

The problem RLHF solves

For most tasks worth doing — write a helpful email, summarize a paper, follow a vague instruction — there is no verifier. You cannot unit-test "is this response helpful?" You can, however, ask a human: between these two responses, which is better? That comparison is almost always easier than scoring either response in isolation.

RLHF (Reinforcement Learning from Human Feedback) is the canonical pipeline for turning that comparison signal into a policy update. It has three stages, each consuming the previous one's output: supervised fine-tuning (SFT), reward-model training, and PPO against the learned reward.

The order of the three stages is not arbitrary: each consumes its own dataset plus the output of the previous stage. π_SFT initializes r̂_φ; r̂_φ scores rollouts during PPO; π_SFT also initializes the PPO policy and anchors its KL term. Break any one stage and a specific pathology shows up downstream.

Stage 1 · Supervised fine-tuning (SFT)

Start with a pretrained base model. Show it a few thousand (prompt, ideal response) demonstrations written by skilled humans. Train with the standard next-token cross-entropy loss until it learns the response format — that it should follow instructions, end with EOS, not regurgitate the prompt. The result π_SFT is a competent-but-bland assistant.
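To make the Stage 1 objective concrete, here is a minimal pure-Python sketch of next-token cross-entropy on a single demonstration. The names `sft_loss` and `loss_mask` are illustrative, and masking out the prompt tokens so that only response tokens contribute is an assumption of this sketch (a common choice, not something the recipe above specifies).

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def sft_loss(logits_per_step, target_ids, loss_mask):
    """Mean next-token cross-entropy over positions where loss_mask is 1.

    logits_per_step[t] are the model's logits for predicting target_ids[t].
    loss_mask[t] is 0 for prompt tokens (assumption of this sketch) and
    1 for response tokens.
    """
    total, count = 0.0, 0
    for logits, target, keep in zip(logits_per_step, target_ids, loss_mask):
        if keep:
            total -= log_softmax(logits)[target]
            count += 1
    return total / max(count, 1)

# Toy vocabulary of 3 tokens; token 2 plays the role of EOS.
logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 4.0]]
targets = [0, 1, 2]
mask = [0, 1, 1]  # the first position belongs to the prompt
loss = sft_loss(logits, targets, mask)
print(f"SFT loss on the toy demonstration: {loss:.4f}")
```

The loss falls as the model puts more probability on each demonstrated token, including the final EOS, which is how the response format gets learned.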

Why this stage exists at all

Two reasons, both load-bearing. (a) The reward model in Stage 2 is initialized from π_SFT. It needs the response format already in place, otherwise the scalar head is trying to score gibberish. (b) The PPO policy in Stage 3 is also initialized from π_SFT and KL-anchored to it. If π_SFT does not already produce well-formed responses, PPO starts from rollouts the reward model cannot score sensibly, and the KL anchor ties the policy to a poor reference.

Stage 2 · The reward model

Here is the core idea, and the one place this lesson asks you to do the math.

You collect a dataset of human pairwise judgements: prompt x, two responses y_w (winner) and y_l (loser). You want a scalar function r̂_φ(x, y) such that r̂_φ(x, y_w) > r̂_φ(x, y_l) on as many pairs as possible.

The classical model from Bradley & Terry (1952) says: if two items have latent scores s_w and s_l, the probability the human picks y_w is the sigmoid of the score difference:

P(y_w ≻ y_l | x) = σ( r̂_φ(x, y_w) − r̂_φ(x, y_l) )
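For readers who want the intermediate step: the Bradley-Terry model is usually stated as a ratio of exponentiated scores (this exponential parameterization is the standard one, though it is not spelled out above). Dividing numerator and denominator by e^{s_w} gives the sigmoid form, with the learned reward playing the role of the latent score:

```latex
\begin{aligned}
P(y_w \succ y_l \mid x)
  &= \frac{e^{s_w}}{e^{s_w} + e^{s_l}} \\
  &= \frac{1}{1 + e^{-(s_w - s_l)}} \\
  &= \sigma(s_w - s_l),
  \qquad s_w = \hat r_\phi(x, y_w),\quad s_l = \hat r_\phi(x, y_l).
\end{aligned}
```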

Take the negative log-likelihood, average over the preference dataset, and you have the reward-model loss in its production form:

L_RM(φ) = − 𝔼_{(x, y_w, y_l)} [ log σ( r̂_φ(x, y_w) − r̂_φ(x, y_l) ) ]
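As a rough illustration, the loss above can be computed from precomputed scalar rewards in a few lines. This is a pure-Python sketch, not a training loop: `reward_model_loss` is an illustrative name, and the reward values are made up.

```python
import math

def log_sigmoid(z):
    """Stable log(sigmoid(z)) for any real z."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def reward_model_loss(rewards_w, rewards_l):
    """Average Bradley-Terry negative log-likelihood over a batch of pairs.

    rewards_w[i] and rewards_l[i] are the scalar rewards of the winning
    and losing response for the i-th preference pair.
    """
    pairs = list(zip(rewards_w, rewards_l))
    return -sum(log_sigmoid(rw - rl) for rw, rl in pairs) / len(pairs)

# Hypothetical rewards for three (winner, loser) pairs.
r_w = [1.2, 0.3, -0.5]
r_l = [0.1, 0.4, -2.0]
loss = reward_model_loss(r_w, r_l)
print(f"reward-model loss: {loss:.4f}")
```

A pair with equal rewards contributes log 2; the loss approaches zero only as every winner is scored well above its loser.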

Concretely, r̂_φ is the SFT model with a scalar regression head attached to the final-layer hidden state of the last token. Each training pair costs two forward passes, one per response. The loss has three properties worth pausing on:

It only depends on the difference. Adding a constant to every reward leaves the loss unchanged. The reward model has a free additive offset; do not interpret raw reward values, only differences.
It is convex in the score difference. The gradient pushes r̂_w up and r̂_l down with weight σ(r̂_l − r̂_w), so pairs the model already ranks correctly contribute almost no gradient and confidently misranked pairs contribute the most. With the backbone frozen, this is just logistic regression on the difference of the two response embeddings.
The score scale is meaningful (but only relative). σ(Δ = 1) ≈ 0.73; σ(Δ = 3) ≈ 0.95. A well-trained RM puts most preferences at |Δ| in [0.5, 4]. If |Δ| is much bigger, the RM is overconfident; much smaller, undertrained.
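The three properties can be checked numerically on a single pair. The sketch below uses only the standard library; the function names and the specific reward values are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pair_loss(r_w, r_l):
    """Bradley-Terry loss for one (winner, loser) pair."""
    return -math.log(sigmoid(r_w - r_l))

def grad_wrt_winner(r_w, r_l):
    """d(pair_loss)/d(r_w) = -sigmoid(r_l - r_w).

    The gradient with respect to r_l is the negative of this value.
    """
    return -sigmoid(r_l - r_w)

# 1. Only the difference matters: shifting both rewards changes nothing.
base = pair_loss(2.0, 0.5)
shifted = pair_loss(2.0 + 100.0, 0.5 + 100.0)

# 2. Gradient weight: small for a correctly ranked pair, near 1 for a
#    confidently misranked one.
easy = abs(grad_wrt_winner(5.0, 0.0))
hard = abs(grad_wrt_winner(0.0, 5.0))

# 3. What a score gap means as a preference probability.
p1, p3 = sigmoid(1.0), sigmoid(3.0)

print(f"shift invariance: {base:.6f} vs {shifted:.6f}")
print(f"gradient weight, easy pair: {easy:.4f}, hard pair: {hard:.4f}")
print(f"sigma(1) = {p1:.3f}, sigma(3) = {p3:.3f}")
```

Descending the gradient raises r̂_w and lowers r̂_l by the same amount, which is why the sum of the two rewards is untouched by a pairwise update.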

Toy example · fit a Bradley–Terry reward on four responses

The setup: four responses to one prompt, each with a hidden ground-truth score. Each round samples a random pair, a simulated human picks the winner with probability σ(Δ) of the hidden score gap, and one gradient step is taken on the BT loss. The learned scores converge toward the hidden ones up to an additive offset, while the agreement rate plateaus short of 100%, because on close pairs the simulated human's own choice is close to a coin flip.
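The four-response toy can be reproduced offline with a short simulation. This is a sketch under stated assumptions: the hidden scores, learning rate, round count, and seed are all made up, and each response gets one free scalar in place of a neural reward model.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_bradley_terry(true_scores, rounds=5000, lr=0.05, seed=0):
    """Fit one scalar reward per response from simulated noisy comparisons."""
    rng = random.Random(seed)
    n = len(true_scores)
    scores = [0.0] * n  # learned rewards, all starting at zero
    agreements = 0
    for _ in range(rounds):
        i, j = rng.sample(range(n), 2)
        # Simulated human prefers i with probability sigmoid(hidden gap).
        if rng.random() < sigmoid(true_scores[i] - true_scores[j]):
            w, l = i, j
        else:
            w, l = j, i
        # Did the model rank this pair the same way before the update?
        agreements += int(scores[w] > scores[l])
        # One SGD step on -log sigmoid(scores[w] - scores[l]).
        weight = sigmoid(scores[l] - scores[w])
        scores[w] += lr * weight
        scores[l] -= lr * weight
    return scores, agreements / rounds

true_scores = [2.0, 1.0, 0.5, -1.0]  # hidden ground truth, made up
learned, agreement = fit_bradley_terry(true_scores)

# Compare centered scores, since the additive offset is free.
mean_true = sum(true_scores) / len(true_scores)
for t, s in zip(true_scores, learned):
    print(f"hidden (centered) {t - mean_true:+.2f}   learned {s:+.2f}")
print(f"agreement with the simulated human: {agreement:.2%}")
```

Because every update adds to one score what it subtracts from the other, the learned scores always sum to zero, so they are directly comparable with the centered hidden scores. With a constant learning rate the learned values keep jittering around their targets instead of settling exactly.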

Stage 3 · PPO against the learned reward

Now r̂_φ is a frozen scalar function over (prompt, response). We're back in familiar territory from lessons 04 and 10: optimize a policy π_θ to maximize expected reward, with PPO clipping and a KL anchor to π_SFT:

max_θ 𝔼_{y ∼ π_θ(·|x)} [ r̂_φ(x, y) ] − β · KL( π_θ ‖ π_SFT )
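In practice the objective is usually optimized by folding the KL term into the reward of each sampled response. The sketch below shows the sequence-level version: for y sampled from π_θ, the log-probability ratio log π_θ(y|x) − log π_SFT(y|x) is a single-sample estimate of the KL term. The function name and the numbers are illustrative, and applying the penalty per token instead is a common variant not shown here.

```python
def kl_shaped_reward(rm_score, logp_policy, logp_sft, beta):
    """Single-sample estimate of r_hat(x, y) - beta * KL(pi_theta || pi_SFT).

    logp_policy and logp_sft hold the per-token log-probabilities of the
    sampled response y under the current policy and the frozen SFT model.
    """
    log_ratio = sum(lp - ls for lp, ls in zip(logp_policy, logp_sft))
    return rm_score - beta * log_ratio

# Hypothetical 3-token response that the policy likes more than SFT does.
logp_policy = [-0.2, -0.1, -0.3]
logp_sft = [-0.9, -0.8, -1.0]
shaped = kl_shaped_reward(rm_score=1.5, logp_policy=logp_policy,
                          logp_sft=logp_sft, beta=0.1)
print(f"shaped reward: {shaped:.2f}")
```

A response the policy has made much more likely than the SFT model would pays a penalty proportional to β, so the reward model's score has to justify the drift.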

The KL term is doing two distinct jobs and it's worth separating them in your head:

Distribution anchor. Keep π_θ close to a known-reasonable policy so it doesn't collapse into a fluent-but-degenerate mode.
Off-distribution defence. The reward model was trained on responses sampled from π_SFT. If π_θ drifts far away, r̂_φ is being asked to score inputs it has never seen. That's where reward hacking happens — the policy finds inputs where the reward model is miscalibrated upward, and exploits them. β tunes how loose the leash is.

Reward hacking, explained

RLHF policies trained without a KL anchor tend to collapse into degenerate outputs such as "the the the the …" or extremely flattering openings. The RM assigns spuriously high scores to those token patterns, which lie outside the responses it was trained to compare, and the policy exploits the bug. The KL anchor doesn't fix the RM's miscalibration; it bounds how far the policy is allowed to walk into it.
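How β bounds the damage can be seen in a toy with a handful of candidate responses. Over a finite set, the maximizer of E[r̂] − β·KL(π ‖ π_SFT) has the well-known closed form π*(y) ∝ π_SFT(y)·exp(r̂(y)/β); that result is not derived in this lesson, and the probabilities and rewards below are invented for illustration.

```python
import math

def kl_regularized_optimum(prior, rewards, beta):
    """Closed-form maximizer of E[r] - beta * KL(pi || prior) on a finite set.

    pi*(y) is proportional to prior(y) * exp(r(y) / beta).
    """
    logits = [math.log(p) + r / beta for p, r in zip(prior, rewards)]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(v - m) for v in logits]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical toy: response 3 is degenerate and rare under the SFT
# policy, but the reward model mistakenly scores it highest.
prior = [0.50, 0.30, 0.199, 0.001]
rewards = [1.0, 1.5, 0.5, 4.0]

mass_on_degenerate = {}
for beta in (0.2, 1.0, 5.0):
    pi = kl_regularized_optimum(prior, rewards, beta)
    mass_on_degenerate[beta] = pi[3]
    print(f"beta={beta}: mass on degenerate response = {pi[3]:.4f}")
```

With a small β the optimum puts almost all of its mass on the miscalibrated response; with a large β it stays close to the SFT prior. The reward model is equally wrong in every case; only the leash changes.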

Why this is three stages and not one

The natural question: why not skip Stage 2 and do "RL from human feedback" by polling humans for every rollout? Answer: economics. A human comparison takes 30–120 seconds; a reward-model forward pass takes milliseconds. PPO needs millions of rollout evaluations per training run. The reward model is a learned amortization of the human — slow and expensive to build, then cheap to query.
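A back-of-the-envelope calculation makes the gap concrete. The figures below are assumptions for illustration: one million evaluations (the low end of "millions"), a single annotator working around the clock, and 10 ms per reward-model forward pass.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def days_needed(num_evaluations, seconds_per_evaluation):
    """Wall-clock days for one evaluator to process all evaluations serially."""
    return num_evaluations * seconds_per_evaluation / SECONDS_PER_DAY

n = 1_000_000                        # assumed number of rollout evaluations
human_fast = days_needed(n, 30)      # 30 s per human comparison
human_slow = days_needed(n, 120)     # 120 s per human comparison
rm = days_needed(n, 0.01)            # assumed 10 ms per RM forward pass

print(f"human, 30 s each:  {human_fast:,.0f} days")
print(f"human, 120 s each: {human_slow:,.0f} days")
print(f"reward model:      {rm * 24:.1f} hours")
```

Under these assumptions a single human needs roughly one to four years of uninterrupted labeling, while the reward model finishes in a few hours on one device.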

This same logic — "make the slow signal cheap by training a model to imitate it" — repeats in every preference algorithm we'll meet in lesson 16, and in every search-based reasoning recipe in lesson 17.

Trade-offs and known failure modes

Cost / risk	What goes wrong	Mitigation
Three sequential stages	Bugs compound; iteration is slow; preference data must be re-collected for each new task.	Reuse preference data; cache RM checkpoints; or skip to DPO (lesson 16).
RM is wrong off-distribution	Policy reward-hacks RM blind spots.	KL anchor (β), early stopping on KL, sample-efficient RLHF.
Preference data is biased	Annotators reward fluency, length, confident tone — not correctness.	Length penalties, multi-objective rewards, AI-feedback bootstrapping.
RM scale doesn't match policy scale	A 7B RM scoring a 70B policy gets fooled by sophistication it can't evaluate.	Scale RM with policy; ensemble RMs; constitutional / rule-based supplements.
Mode collapse	Policy finds one high-reward response and emits it for every prompt.	Entropy bonus, larger β, more diverse prompts during RL.
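Two of the mitigations in the table are simple enough to sketch as reward post-processing. Both functions are hypothetical illustrations, not a prescribed implementation: the linear length penalty, its default constants, and the "mean minus k standard deviations" rule for ensembles are each just one possible choice.

```python
import statistics

def length_penalized_reward(rm_score, num_tokens,
                            target_tokens=200, penalty_per_token=0.002):
    """Subtract a linear penalty for every token beyond a target length."""
    excess = max(0, num_tokens - target_tokens)
    return rm_score - penalty_per_token * excess

def conservative_ensemble_reward(rm_scores, k=1.0):
    """Mean of several reward models' scores minus k standard deviations.

    Disagreement between ensemble members lowers the reward, which makes
    it harder to profit from a single model's blind spot.
    """
    return statistics.fmean(rm_scores) - k * statistics.pstdev(rm_scores)

# A long response loses reward; a short one is untouched.
print(length_penalized_reward(2.0, num_tokens=700))
print(length_penalized_reward(2.0, num_tokens=150))

# Same mean score, different agreement between ensemble members.
print(conservative_ensemble_reward([1.0, 1.0, 1.0]))
print(conservative_ensemble_reward([0.0, 2.0]))
```

Both adjustments change only what the policy is rewarded for during RL; neither repairs the underlying preference data or the reward model itself.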

Why this pipeline still matters

For verifiable tasks (math, code) we don't need RLHF — that's what RLVR (lessons 09–14) is for. But the majority of an assistant's behavior is not verifiable: tone, helpfulness, harmlessness, format, style. Every modern frontier model still uses preference signal somewhere — even DeepSeek-R1 mixes RLVR for reasoning with preference-based RL for general helpfulness in its final stage (lesson 23). The RLHF ladder isn't obsolete; it's been factored into pieces that DPO, KTO, and friends rearrange in lesson 16.

Takeaway

RLHF is: SFT → Bradley-Terry RM → PPO with KL-to-SFT. Each rung exists because the next rung needs its output. The RM is not just a scorer — it's the bottleneck that decides what "good" means, and where the policy is allowed to look for it.