
Reward & Reference

Where the reward signal comes from, and why we need a frozen anchor or the policy will reward-hack itself off a cliff.

Two flavors of reward

Modern post-training RL splits cleanly into two reward regimes:

| Flavor | Reward source | Example tasks | Famous example |
| --- | --- | --- | --- |
| RLVR (verifiable rewards) | A program: unit-test runner, math checker, regex match. | Math, code, puzzles. | DeepSeek-R1, o-series reasoning models |
| RLHF (learned reward model) | A neural network trained on human preference comparisons. | Helpfulness, harmlessness, style. | InstructGPT, Claude, the original ChatGPT |

This framework is RLVR by default — the toy task is arithmetic, and the verifier is "does the last integer in the response equal the gold answer." From the framework's perspective the two flavors are interchangeable: both are a function that takes a response and returns a scalar. RLHF would simply swap env.verify for a RewardModelEngine.score (forward-only, same shape as the reference engine — that's why the framework calls them out as a shared "forward-only scorer" pattern).

Interactive · the verifier

Type a response to <3+4+5>. The verifier is intentionally lenient — it picks the last integer in the response, ignoring any chain-of-thought before it. This matches how real verifiable-reward systems work: extract the final answer, compare to the gold answer, give 1 or 0.
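The lenient rule described above fits in a few lines. The sketch below is one possible reading of it, not the framework's actual env.verify: the function name, its signature, and the choice to treat an optional leading minus sign as part of the integer are all assumptions made for illustration.

```python
import re


def verify(response: str, gold: int) -> float:
    """Return 1.0 if the last integer in `response` equals `gold`, else 0.0."""
    # Assumption: an "integer" is an optional minus sign followed by digits.
    integers = re.findall(r"-?\d+", response)
    if not integers:
        return 0.0  # a response with no integer at all counts as wrong
    return 1.0 if int(integers[-1]) == gold else 0.0


# The variations suggested for the prompt <3+4+5> with gold answer 12.
for response in ["12#", "let me think... 12#", "11#", "120#"]:
    print(repr(response), verify(response, gold=12))
```

Because only the last integer is read, any reasoning text before it is ignored, while "120#" is scored as the answer 120 and gets 0. One caveat of this particular regex: a trailing expression such as "7-5" would be read as -5, so a real verifier would need a stricter answer format.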

Verifier on arithmetic
Prompt is <3+4+5>, gold answer is 12. Try variations: 12#, let me think... 12#, 11#, 120#.

Why a frozen reference

The minimal loop in lesson 1 has a problem we haven't named yet: reward hacking. If we maximize reward without any constraint, the policy will discover degenerate outputs that score well according to the verifier but are useless. On verifiable arithmetic this is hard to demonstrate (the verifier is tight), but on RLHF it is everywhere: "I love the previous response, here is more of it" reward-hacks helpfulness; repeated apologetic phrases reward-hack harmlessness; etc.

The fix that every modern recipe converges on: keep the policy close to a frozen reference. The reference is just a copy of the SFT checkpoint, taken before any RL touched it. We subtract a KL-divergence penalty from the expected reward, which gives the objective:

J(θ) = 𝔼 [ r(y) ]  −  β · KL( πθ ‖ πref )

Maximizing this trades reward against drift from the reference. The hyperparameter β controls the trade-off; reasoning-RL recipes typically use small values (1e-2 to 1e-1). Too small → reward hacking; too large → the policy can't move enough to learn.
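A small worked example may make the role of β concrete. The two candidate policies and their (reward, KL) numbers below are invented purely for illustration; only the formula J = r − β · KL comes from the text.

```python
def objective(reward: float, kl: float, beta: float) -> float:
    """Regularized objective J = r - beta * KL for one candidate policy."""
    return reward - beta * kl


# Hypothetical candidates as (expected reward, KL from the reference).
candidates = {
    "hacky": (0.9, 5.0),     # slightly higher reward, far from the reference
    "anchored": (0.7, 0.5),  # slightly lower reward, close to the reference
}


def preferred(beta: float) -> str:
    """Name of the candidate with the highest objective at this beta."""
    return max(candidates, key=lambda name: objective(*candidates[name], beta))


for beta in (0.01, 0.05):
    scores = {name: round(objective(r, kl, beta), 3) for name, (r, kl) in candidates.items()}
    print(beta, scores, "->", preferred(beta))
```

With β = 0.01 the far-drifting candidate scores 0.85 against 0.695 and wins; with β = 0.05 the scores are 0.65 against 0.675 and the anchored candidate wins. The same pair of policies is ranked differently depending only on β.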

The KL term in practice — Schulman's k3 estimator

Computing KL(πθ ‖ πref) over the full vocabulary at every position is expensive. Schulman's k3 estimator is a per-token unbiased estimate that uses only the log-probabilities of the sampled token under each policy:

k3_t  =  exp(log πref(y_t) − log πθ(y_t)) − (log πref(y_t) − log πθ(y_t)) − 1

Two important properties: (i) k3 ≥ 0 for every token, no matter the sign of log πref − log πθ; (ii) its expectation under y ∼ πθ equals the true KL: 𝔼_{y∼πθ}[k3] = KL(πθ ‖ πref). For any single sampled token, k3 can differ from the true KL; only the average matches. The cost is one extra forward pass through the reference for every trajectory, which is exactly the job of reference.py.
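Both properties can be checked on a toy distribution. The sketch below uses a made-up three-token vocabulary and computes the expectation exactly by summing over tokens, rather than by sampling, so the comparison with the true KL is deterministic. The function name k3 and the two distributions are assumptions for illustration, not the framework's loss code.

```python
import math


def k3(logp_ref: float, logp_policy: float) -> float:
    """Per-token k3 estimate from the sampled token's two log-probabilities."""
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - log_ratio - 1.0


# Hypothetical next-token distributions over a three-token vocabulary.
policy = [0.5, 0.3, 0.2]
ref = [0.4, 0.4, 0.2]

# Exact KL(policy || ref), summed over the whole vocabulary.
true_kl = sum(p * math.log(p / q) for p, q in zip(policy, ref))

# k3 for each possible sampled token, then its expectation under the policy.
per_token = [k3(math.log(q), math.log(p)) for p, q in zip(policy, ref)]
expected_k3 = sum(p * k for p, k in zip(policy, per_token))

print("per-token k3:", per_token)
print("true KL:", true_kl, "expected k3:", expected_k3)
```

Every per-token value is non-negative, none of them needs to equal the true KL on its own, and the policy-weighted average agrees with the full-vocabulary KL up to floating-point error.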

Argument order is critical
KL(πθ ‖ πref) and KL(πref ‖ πθ) are not the same quantity. We want the first ("the policy's KL from the anchor"). Swapping the arguments, or flipping the sign of log πref − log πθ in the formula above, silently penalizes a different quantity; the estimator stays non-negative either way, so nothing obviously breaks.
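A quick numeric check shows how easy it is to get this wrong without noticing. The two-token distributions and helper names below are assumptions chosen for illustration; the expectation is again computed exactly over the vocabulary.

```python
import math


def kl(p, q):
    """Exact KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def expected_k3(policy, ref, flip_sign=False):
    """Expectation of k3 under the policy; flip_sign simulates the sign bug."""
    total = 0.0
    for p, q in zip(policy, ref):
        log_ratio = math.log(q) - math.log(p)  # log pi_ref - log pi_theta
        if flip_sign:
            log_ratio = -log_ratio  # the mistake: log pi_theta - log pi_ref
        total += p * (math.exp(log_ratio) - log_ratio - 1.0)
    return total


policy = [0.9, 0.1]
ref = [0.5, 0.5]

forward = kl(policy, ref)   # KL(pi_theta || pi_ref), the quantity we want
reverse = kl(ref, policy)   # KL(pi_ref || pi_theta), a different number
correct = expected_k3(policy, ref)
flipped = expected_k3(policy, ref, flip_sign=True)

print(forward, reverse, correct, flipped)
```

With the correct sign the expectation matches the forward KL. With the sign flipped, the result is still non-negative but matches neither direction of the KL, so no assertion or NaN would reveal the bug in training.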

Interactive · β knob — anchor strength

Below we simulate one training run with arbitrary β. The blue curve is reward; the orange curve is KL(πθ ‖ πref). The mock training uses the regularized objective J = r − β · KL with a synthetic reward model that has a reward-hack: high reward concentrated on outputs that drift far from ref. Slide β to see the tension.

β trades reward against KL drift
At low β the policy zooms toward reward but blows out KL — that's reward hacking. At high β the policy is anchored too hard and reward never moves. Real recipes pick β in the cooperative middle. (The simulation uses a closed-form KL surrogate KL ≈ θ² — Gaussian-style toy — not the k3 sample estimator, which only appears in the actual loss code.)
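The tension can be reproduced in a few lines. The sketch below keeps the KL ≈ θ² surrogate from the text but makes its own assumptions about everything else: a scalar policy parameter θ that starts at the reference (θ = 0), a stand-in "hackable" reward r(θ) = θ that keeps growing with drift, and plain gradient ascent. It is not the code behind the interactive widget.

```python
def simulate(beta: float, steps: int = 5000, lr: float = 0.1):
    """Gradient ascent on J(theta) = theta - beta * theta**2, from theta = 0."""
    theta = 0.0  # start exactly at the reference
    for _ in range(steps):
        grad = 1.0 - 2.0 * beta * theta  # dJ/dtheta
        theta += lr * grad
    reward = theta     # assumed hackable reward: more drift, more reward
    kl = theta ** 2    # the closed-form KL surrogate from the text
    return reward, kl


# Step size and step count are chosen so these beta values all converge.
for beta in (0.01, 0.1, 1.0):
    reward, kl = simulate(beta)
    print(f"beta={beta}: final reward={reward:.2f}, final KL={kl:.2f}")
```

In this toy the optimum sits at θ = 1 / (2β), so the final KL scales as 1 / (4β²): dividing β by 10 multiplies the drift by 100. That is the blow-out at low β described above, and at β = 1.0 the policy barely leaves the reference.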

Why reference is a separate role in the framework

You could just keep a frozen copy of the model in the trainer. People do. It breaks in three predictable ways:

  1. Memory shape. The reference never needs optimizer state, activation checkpointing, or backward. Its optimal sharding is different from the trainer's. Making it a separate role lets the deployment give it its own (smaller, cheaper) layout — e.g. 2-way tensor parallel for ref versus 8-way FSDP for the trainer.
  2. Correctness. If requires_grad is True or the ref shares the optimizer, weight decay and mixed-precision casting silently drift it. The anchor then ties the policy to a moving target. reference.ReferenceEngine.__init__ calls freeze() on its model so that this is enforced at construction time rather than left to convention.
  3. Composability. DPO, reward-model scoring, and ref scoring are all "forward-only, score a batch" roles. With a common interface, swapping a reference model for a reward model is a one-line change in the controller.
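The composability point in the list above can be sketched as a shared interface. Everything here is hypothetical: ForwardOnlyScorer, the two toy engines, and score_batch are illustrative names, the scoring rules are placeholders rather than real models, and returning one float per response is a simplification.

```python
from typing import List, Protocol, Sequence


class ForwardOnlyScorer(Protocol):
    """Anything that scores a batch without gradients or optimizer state."""

    def score(self, responses: Sequence[str]) -> List[float]: ...


class ToyReferenceEngine:
    """Placeholder for a frozen reference: returns a fake sequence log-prob."""

    def score(self, responses: Sequence[str]) -> List[float]:
        return [-0.5 * len(r) for r in responses]


class ToyRewardModelEngine:
    """Placeholder for a learned reward model: returns a fake preference score."""

    def score(self, responses: Sequence[str]) -> List[float]:
        return [1.0 if r.endswith("#") else 0.0 for r in responses]


def score_batch(scorer: ForwardOnlyScorer, responses: Sequence[str]) -> List[float]:
    """Controller-side call site: it only depends on the shared interface."""
    return scorer.score(responses)


batch = ["12#", "let me think"]
print(score_batch(ToyReferenceEngine(), batch))
print(score_batch(ToyRewardModelEngine(), batch))
```

The controller code in score_batch does not change between the two calls; only the object passed in does, which is the one-line swap the list item describes.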

πold vs πref — they are different things

Two log-probability tensors appear in the loss. New readers conflate them constantly. Pin this distinction in your head now:

| | πold | πref |
| --- | --- | --- |
| What policy is it? | The policy at the moment y was sampled. | The frozen SFT checkpoint. |
| When is it computed? | During rollout (lesson 2). | After rollout, by the reference engine. |
| Used in | PPO ratio πθ / πold. | KL anchor KL(πθ ‖ πref). |
| Changes over training? | Yes: equals the trainer's policy at last weight-sync. | No: frozen for the whole RL run. |

At step 0 they happen to coincide (both equal the SFT initialization). After step 1 they diverge — πold tracks the trainer's recent past, πref stays put.
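To make the distinction concrete, the sketch below computes both per-token quantities side by side from three lists of log-probabilities. The function name and the numbers are invented for illustration; the point is only which tensor feeds which term.

```python
import math


def per_token_terms(logp_new, logp_old, logp_ref):
    """Return (PPO ratios, k3 values) for one sampled sequence."""
    # The PPO ratio compares the current policy with the policy that sampled y.
    ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
    # The KL anchor compares the current policy with the frozen reference.
    k3s = [math.exp(r - n) - (r - n) - 1.0 for n, r in zip(logp_new, logp_ref)]
    return ratios, k3s


# Step 0: trainer, rollout policy and reference are all the SFT checkpoint.
sft = [-1.2, -0.7, -2.0]
print(per_token_terms(sft, sft, sft))

# Later, right after a weight sync: pi_old equals the current policy again,
# but pi_ref is still the SFT checkpoint.
current = [-0.9, -0.4, -2.5]
print(per_token_terms(current, current, sft))
```

At step 0 every ratio is 1 and every k3 is 0. After training has moved the policy, the ratios can return to 1 at each weight sync while the k3 terms stay positive, because only πold is refreshed.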

Takeaway
Reward says what to maximize; the reference says don't move too far from where you started. Their per-token interplay — reward in the advantage, KL in the loss — is what keeps modern RL from degenerating into reward-hack soup.