RLHF: the original recipe
Christiano '17 · InstructGPT '22

Before verifiable rewards there were human rewards. Preferences become a scalar reward model; the reward model trains a policy with PPO. Three stages, one ladder. Every modern preference algorithm is a shortcut around at least one rung.

Where this lesson sits
Part I built the framework, Part II built the algorithms — all assuming a verifier hands you a 0/1 reward. Part III opens by asking: what if there is no verifier? RLHF is the foundational answer. It introduces the reward model, a learned scalar function trained from human preferences. Lessons 16 (DPO), 17 (PRM), and 18 (environments) are all variations on this same question — where the reward signal comes from.

The problem RLHF solves

For most tasks worth doing — write a helpful email, summarize a paper, follow a vague instruction — there is no verifier. You cannot unit-test "is this response helpful?" You can, however, ask a human: between these two responses, which is better? That comparison is almost always easier than scoring either response in isolation.

RLHF (Reinforcement Learning from Human Feedback) is the canonical pipeline for turning that comparison signal into a policy update. It has three stages, each consuming the previous one's output:

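Figure: the three-stage RLHF pipeline.
  Stage 1 · SFT (cross-entropy, imitate demos). Data: demonstrations, ~13k (x, y*) pairs. Output: π_SFT.
  Stage 2 · Reward model (Bradley–Terry NLL, fit preferences). Data: preferences, ~33k (x, y_w ≻ y_l). Output: r̂_φ.
  Stage 3 · PPO (max r̂_φ − β·KL, on-policy RL). Data: prompts only, ~31k x's, no labels. Output: π_RLHF.
  Arrows: π_SFT inits the reward model; r̂_φ scores PPO rollouts; π_SFT also anchors PPO (KL).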

The three stages cannot be reordered: each consumes its own dataset plus the output of the previous stage. π_SFT initializes r̂_φ; r̂_φ scores rollouts during PPO; π_SFT also anchors the KL term in PPO. Break any one stage and a specific pathology shows up downstream.

Stage 1 · Supervised fine-tuning (SFT)

Start with a pretrained base model. Show it on the order of ten thousand (prompt, ideal response) demonstrations written by skilled humans. Train with the standard next-token cross-entropy loss until it learns the response format: follow instructions, end with EOS, do not regurgitate the prompt. The result, π_SFT, is a competent-but-bland assistant.
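A minimal sketch of the Stage 1 loss, in plain Python so the arithmetic is visible. It assumes the loss is averaged over response tokens only, with prompt tokens masked out; that masking is a common choice but an assumption here, not something the recipe above specifies. The function name `sft_loss` and the toy numbers are illustrative.

```python
import math

def sft_loss(logits, targets, loss_mask):
    """Mean next-token cross-entropy over positions where loss_mask is 1.

    logits:    one list of vocabulary logits per position
    targets:   the target (next) token id at each position
    loss_mask: 1 for response tokens, 0 for prompt tokens (assumed masking)
    """
    total, count = 0.0, 0
    for row, target, keep in zip(logits, targets, loss_mask):
        if not keep:
            continue
        # log-softmax via log-sum-exp for numerical stability
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += -(row[target] - log_z)
        count += 1
    return total / max(count, 1)

# Toy vocabulary of 3 tokens, 3 positions; the first position is a prompt token.
logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 0.0]]
targets = [0, 1, 2]
mask = [0, 1, 1]
loss = sft_loss(logits, targets, mask)
print(f"SFT loss on response tokens: {loss:.4f}")
```

The confident, correct second position contributes little loss; the uniform third position contributes log 3. Training pushes both toward zero on the demonstration tokens.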

Why this stage exists at all
Two reasons, both load-bearing. (a) The reward model in Stage 2 is initialized from π_SFT: it needs the response format already in place, otherwise the scalar head is trying to score gibberish. (b) The PPO policy in Stage 3 is also initialized from π_SFT and KL-anchored to it. If π_SFT is a poor starting point, PPO drifts wildly looking for any plateau.

Stage 2 · The reward model

Here is the core idea, and the one place this lesson asks you to do the math.

You collect a dataset of human pairwise judgements: a prompt x and two responses, y_w (winner) and y_l (loser). You want a scalar function r̂_φ(x, y) such that r̂_φ(x, y_w) > r̂_φ(x, y_l) on as many pairs as possible.

The classical model from Bradley & Terry (1952) says: if two items have latent scores s_w and s_l, the probability that the human picks y_w is the sigmoid of the score difference. Let the reward model supply the scores, s = r̂_φ(x, y), and you get:

P(y_w ≻ y_l | x)  =  σ( r̂_φ(x, y_w) − r̂_φ(x, y_l) )

Take the negative log-likelihood, average over the preference dataset, and you have the reward-model loss in its production form:

L_RM(φ)  =  − 𝔼_{(x, y_w, y_l)} [ log σ( r̂_φ(x, y_w) − r̂_φ(x, y_l) ) ]
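The per-pair loss above, and its gradient with respect to the two scores, can be sketched in a few lines. This is an illustration on raw scalar scores, not a training loop; the helper names `bt_loss` and `bt_grads` are made up for this example.

```python
import math

def sigmoid(z):
    # numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def bt_loss(r_w, r_l):
    """-log sigmoid(r_w - r_l), computed stably as softplus(-(r_w - r_l))."""
    d = r_w - r_l
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))

def bt_grads(r_w, r_l):
    """Gradient of the pair loss with respect to (r_w, r_l)."""
    p = sigmoid(r_w - r_l)  # model's probability that the human picks y_w
    return -(1.0 - p), (1.0 - p)

print(bt_loss(0.0, 0.0))   # equal scores: the loss is log 2
print(bt_grads(2.0, 1.0))  # winner pushed up, loser pushed down, equal magnitude
```

Two things are visible directly in `bt_grads`: the gradients on the two scores always cancel, so only the difference between scores is ever learned, and the push shrinks toward zero as the model becomes confident in the human's choice.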

Concretely, r̂_φ is the SFT model with a scalar regression head bolted to the final hidden state of the last token. Each training example costs two forward passes, one per response. The loss has three properties worth pausing on:

Interactive · fit a Bradley–Terry reward on a 4-response toy

Below: four responses to one prompt with hidden ground-truth scores. Each "round" samples a random pair, the simulated human picks the winner with probability σ(Δ), and one gradient step is taken on the BT loss. Watch the model's scores converge — and watch the agreement rate plateau short of 100% because some pairs are genuinely close.

Bradley–Terry reward model on 4 responses
Each step samples a pair (i, j), observes the human pick under BT, and takes one SGD step on −log σ(r̂_i − r̂_j). Hidden true scores (revealed below): A=0.0, B=1.0, C=2.0, D=2.5. The learned scores are pinned to mean zero on each render (BT is offset-invariant).
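For readers without the interactive widget, the same experiment can be approximated offline. The sketch below follows the description above (random pair, human pick sampled with probability σ(Δ), one SGD step on the BT loss, scores re-centred to mean zero). The step count, learning rate, and seed are arbitrary choices for this illustration, not the widget's settings.

```python
import math
import random

TRUE = {"A": 0.0, "B": 1.0, "C": 2.0, "D": 2.5}  # hidden scores from the text

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_bt(steps=50_000, lr=0.01, seed=0):
    rng = random.Random(seed)
    names = list(TRUE)
    learned = {n: 0.0 for n in names}
    agree = 0
    for _ in range(steps):
        i, j = rng.sample(names, 2)
        # simulated human: picks i with probability sigma(true_i - true_j)
        if rng.random() < sigmoid(TRUE[i] - TRUE[j]):
            w, l = i, j
        else:
            w, l = j, i
        # does the current model already rank the human's pick higher?
        agree += learned[w] > learned[l]
        # one SGD step on -log sigma(r_w - r_l)
        g = 1.0 - sigmoid(learned[w] - learned[l])
        learned[w] += lr * g
        learned[l] -= lr * g
    # BT is offset-invariant, so report scores pinned to mean zero
    mean = sum(learned.values()) / len(learned)
    centered = {n: v - mean for n, v in learned.items()}
    return centered, agree / steps

scores, agreement = fit_bt()
print({n: round(v, 2) for n, v in scores.items()}, f"agreement={agreement:.1%}")
```

Even a perfectly fitted model cannot agree with this simulated human every time. With these true scores the best achievable rate is the average of max(p, 1 − p) over the six pairs, about 78%, which is why the agreement curve plateaus well short of 100%.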

Stage 3 · PPO against the learned reward

Now r̂_φ is a frozen scalar function over (prompt, response). We're back in familiar territory from lessons 04 and 10: optimize a policy π_θ to maximize expected reward, with PPO clipping and a KL anchor to π_SFT:

max_θ   𝔼_{y ∼ π_θ(·|x)} [ r̂_φ(x, y) ]  −  β · KL( π_θ ‖ π_SFT )
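In practice the KL term is usually estimated from the sampled response itself: for y drawn from π_θ, the summed per-token log-ratio log π_θ − log π_SFT is a single-sample estimate of the KL. The sketch below applies that estimate at the sequence level. The function name and numbers are hypothetical, and real implementations often spread the penalty across tokens instead.

```python
def kl_shaped_reward(rm_score, logp_policy, logp_sft, beta):
    """Sequence-level reward: r_hat - beta * sum_t (log pi_theta - log pi_SFT).

    logp_policy / logp_sft: per-token log-probs of the SAME sampled response
    under the current policy and under the frozen SFT model.
    """
    log_ratio = sum(p - q for p, q in zip(logp_policy, logp_sft))
    return rm_score - beta * log_ratio

# The policy is more confident than pi_SFT on both tokens, so it pays a penalty.
shaped = kl_shaped_reward(
    rm_score=1.2,
    logp_policy=[-0.1, -0.2],
    logp_sft=[-0.5, -0.7],
    beta=0.1,
)
print(round(shaped, 4))
```

When the policy has not moved from π_SFT the log-ratio is zero and the shaped reward is just the reward-model score.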

The KL term is doing two distinct jobs and it's worth separating them in your head:

  1. Distribution anchor. Keep π_θ close to a known-reasonable policy so it doesn't collapse into a fluent-but-degenerate mode.
  2. Off-distribution defence. The reward model was trained on responses sampled from π_SFT. If π_θ drifts far away, r̂_φ is being asked to score inputs it has never seen. That's where reward hacking happens: the policy finds inputs where the reward model is miscalibrated upward and exploits them. β tunes how loose the leash is; a larger β means a tighter leash.

Reward hacking, explained
Policies trained against a learned RM without a KL anchor tend to converge to degenerate responses such as "the the the the …" or extremely flattering openings, because the RM assigns spuriously high scores to those token patterns and the policy gleefully exploits the bug. The KL anchor doesn't fix the RM's miscalibration; it bounds how far the policy is allowed to walk into it.
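To see the leash at work, it helps to use a standard fact: over a finite set of responses, the KL-regularized objective has a closed-form maximizer, π*(y) ∝ π_SFT(y) · exp(r̂(y)/β). The toy below is entirely hypothetical. It has three candidate responses, the third being degenerate text that π_SFT almost never emits but the RM over-scores.

```python
import math

def kl_regularized_optimum(p_sft, rm_scores, beta):
    """Maximizer of E_pi[r] - beta * KL(pi || pi_SFT) over a finite response set:
    pi*(y) is proportional to pi_SFT(y) * exp(r(y) / beta)."""
    logits = [math.log(p) + r / beta for p, r in zip(p_sft, rm_scores)]
    m = max(logits)  # subtract the max before exponentiating, for stability
    weights = [math.exp(v - m) for v in logits]
    z = sum(weights)
    return [w / z for w in weights]

p_sft = [0.60, 0.399, 0.001]  # pi_SFT probabilities (made up)
rm = [1.0, 1.5, 4.0]          # RM scores; the last one is the miscalibrated one

tight = kl_regularized_optimum(p_sft, rm, beta=2.0)
loose = kl_regularized_optimum(p_sft, rm, beta=0.2)
print("beta=2.0:", [round(p, 4) for p in tight])
print("beta=0.2:", [round(p, 4) for p in loose])
```

With the larger β the degenerate response stays rare; with the smaller β nearly all probability mass moves onto it. The RM's error is identical in both runs. Only the distance the policy may travel toward it has changed.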

Why this is three stages and not one

The natural question: why not skip Stage 2 and do "RL from human feedback" by polling humans for every rollout? Answer: economics. A human comparison takes 30–120 seconds; a reward-model forward pass takes milliseconds. PPO needs millions of rollout evaluations per training run. The reward model is a learned amortization of the human — slow and expensive to build, then cheap to query.
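A back-of-envelope version of that argument. The 30 to 120 second range comes from the text; the one-million evaluation count (the low end of "millions") and the 10 ms reward-model latency are assumptions picked for illustration.

```python
SECONDS_PER_DAY = 86_400

def wall_clock_days(n_evals, seconds_per_eval):
    """Total serial time, in days, to score n_evals rollouts."""
    return n_evals * seconds_per_eval / SECONDS_PER_DAY

n = 1_000_000                          # assumed: low end of "millions"
human_fast = wall_clock_days(n, 30)    # 30 s per human comparison
human_slow = wall_clock_days(n, 120)   # 120 s per human comparison
rm = wall_clock_days(n, 0.010)         # assumed: 10 ms per RM forward pass

print(f"human: {human_fast:.0f} to {human_slow:.0f} labeler-days")
print(f"reward model: {rm * 24:.1f} hours")
```

Under these assumptions a single training run would need roughly one to four labeler-years of serial human time, against a few hours of reward-model compute.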

This same logic — "make the slow signal cheap by training a model to imitate it" — repeats in every preference algorithm we'll meet in lesson 16, and in every search-based reasoning recipe in lesson 17.

Trade-offs and known failure modes

| Cost / risk | What goes wrong | Mitigation |
|---|---|---|
| Three sequential stages | Bugs compound; iteration is slow; preference data must be re-collected for each new task. | Reuse preference data; cache RM checkpoints; or skip to DPO (lesson 16). |
| RM is wrong off-distribution | Policy reward-hacks RM blind spots. | KL anchor (β), early stopping on KL, sample-efficient RLHF. |
| Preference data is biased | Annotators reward fluency, length, and confident tone rather than correctness. | Length penalties, multi-objective rewards, AI-feedback bootstrapping. |
| RM scale doesn't match policy scale | A 7B RM scoring a 70B policy gets fooled by sophistication it can't evaluate. | Scale RM with policy; ensemble RMs; constitutional / rule-based supplements. |
| Mode collapse | Policy finds one high-reward response and emits it for every prompt. | Entropy bonus, larger β, more diverse prompts during RL. |

Why this pipeline still matters

For verifiable tasks (math, code) we don't need RLHF — that's what RLVR (lessons 09–14) is for. But the majority of an assistant's behavior is not verifiable: tone, helpfulness, harmlessness, format, style. Every modern frontier model still uses preference signal somewhere — even DeepSeek-R1 mixes RLVR for reasoning with preference-based RL for general helpfulness in its final stage (lesson 23). The RLHF ladder isn't obsolete; it's been factored into pieces that DPO, KTO, and friends rearrange in lesson 16.

Takeaway
RLHF is: SFT → Bradley-Terry RM → PPO with KL-to-SFT. Each rung exists because the next rung needs its output. The RM is not just a scorer — it's the bottleneck that decides what "good" means, and where the policy is allowed to look for it.