Reward & Reference

Where the reward signal comes from, and why we need a frozen anchor or the policy will reward-hack itself off a cliff.

Two flavors of reward

Modern post-training RL splits cleanly into two reward regimes:

Flavor	Reward source	Example tasks	Famous example
RLVR (verifiable rewards)	A program: unit-test runner, math checker, regex match.	Math, code, puzzles.	DeepSeek-R1, o-series reasoning models
RLHF (learned reward model)	A neural network trained on human preference comparisons.	Helpfulness, harmlessness, style.	InstructGPT, Claude, the original ChatGPT

This framework is RLVR by default — the toy task is arithmetic, and the verifier is "does the last integer in the response equal the gold answer." From the framework's perspective the two flavors are interchangeable: both are a function that takes a response and returns a scalar. RLHF would simply swap env.verify for a RewardModelEngine.score (forward-only, same shape as the reference engine — that's why the framework calls them out as a shared "forward-only scorer" pattern).

Interactive · the verifier

Type a response to <3+4+5>. The verifier is intentionally lenient — it picks the last integer in the response, ignoring any chain-of-thought before it. This matches how real verifiable-reward systems work: extract the final answer, compare to the gold answer, give 1 or 0.
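The "last integer wins" rule described above can be sketched in a few lines. This is a minimal illustration, not the framework's actual env.verify: the function names extract_last_integer and verify and the regex are assumptions made for this example.

```python
import re


def extract_last_integer(response: str):
    """Return the last integer found in the response, or None if there is none.

    Assumption: a '-' directly before digits is read as a sign, so "10-2"
    yields -2 as its last integer. A stricter parser may be preferable.
    """
    matches = re.findall(r"-?\d+", response)
    return int(matches[-1]) if matches else None


def verify(response: str, gold: int) -> float:
    """Binary reward: 1.0 if the last integer equals the gold answer, else 0.0."""
    return 1.0 if extract_last_integer(response) == gold else 0.0


# Chain-of-thought before the final answer is ignored.
reward_cot = verify("3+4=7, then 7+5=12", gold=12)
# A trailing wrong number overrides an earlier correct one.
reward_wrong = verify("12 is tempting, but I say 13", gold=12)
# No integer at all scores zero.
reward_empty = verify("no idea", gold=12)
```

The leniency cuts both ways: any text before the final integer is ignored, including a correct answer that is later contradicted.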

Why a frozen reference

The minimal loop in lesson 1 has a problem we haven't named yet: reward hacking. If we maximize reward without any constraint, the policy will discover degenerate outputs that score well according to the verifier but are useless. On verifiable arithmetic this is hard to demonstrate (the verifier is tight), but on RLHF it is everywhere: "I love the previous response, here is more of it" reward-hacks helpfulness; repeated apologetic phrases reward-hack harmlessness; etc.

The fix that every modern recipe converges on: keep the policy close to a frozen reference. The reference is just a copy of the SFT checkpoint, before any RL touched it. We subtract a KL-divergence penalty from the objective:

J(θ) = 𝔼 [ r(y) ] − β · KL( π_θ ‖ π_ref )

Maximizing this trades reward against drift from the reference (equivalently, the loss being minimized is −J, so the KL term enters the loss with a plus sign). The hyperparameter β controls the trade-off; reasoning-RL recipes typically use small values (1e-2 to 1e-1). Too small → reward hacking; too large → the policy can't move enough to learn.

The KL term in practice — Schulman's k3 estimator

Computing KL(π_θ ‖ π_ref) over the full vocabulary at every position is expensive. Schulman's k3 estimator is a per-token unbiased estimate that uses only the log-probabilities of the sampled token under each policy:

k3_t = exp(log π_ref(y_t) − log π_θ(y_t)) − (log π_ref(y_t) − log π_θ(y_t)) − 1

Two important properties: (i) k3 ≥ 0 for every token, no matter the sign of log π_ref − log π_θ; (ii) its expectation under y ∼ π_θ equals the true KL: 𝔼_{y∼π_θ}[k3] = KL(π_θ ‖ π_ref). Per realization k3 can deviate from the true KL — only the average matches. The cost is one extra forward pass through the reference for every trajectory — exactly the job of reference.py.
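Both properties can be checked numerically on a toy three-token vocabulary. The sketch below is pure Python; the distributions pi_theta and pi_ref and the helper names are invented for illustration and are not taken from reference.py.

```python
import math
import random


def k3(logp_ref: float, logp_theta: float) -> float:
    """k3 estimate for one sampled token: exp(d) - d - 1, d = log pi_ref - log pi_theta."""
    log_ratio = logp_ref - logp_theta
    return math.exp(log_ratio) - log_ratio - 1.0


def exact_kl(p, q):
    """KL(p || q) computed over the full (toy) vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token distributions over a 3-token vocabulary.
pi_theta = [0.7, 0.2, 0.1]
pi_ref = [0.5, 0.3, 0.2]

# Sample tokens from the policy, as the rollout would.
rng = random.Random(0)
samples = rng.choices(range(3), weights=pi_theta, k=50_000)

estimates = [k3(math.log(pi_ref[y]), math.log(pi_theta[y])) for y in samples]
mc_kl = sum(estimates) / len(estimates)   # Monte Carlo average of k3
true_kl = exact_kl(pi_theta, pi_ref)      # KL(pi_theta || pi_ref)

print(f"k3 average: {mc_kl:.4f}  exact KL: {true_kl:.4f}")
```

Each individual estimate is non-negative, while only the average over many samples approaches the exact value.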

Argument order is critical

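KL(π_θ ‖ π_ref) and KL(π_ref ‖ π_θ) are not the same quantity. We want the first, the policy's divergence from the anchor, with the expectation taken over samples from π_θ. Swapping the arguments, or flipping the sign of the log-ratio in the k3 formula, silently yields a different penalty with no error raised. Watch the sign of log π_ref − log π_θ in the formula above.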
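To make the asymmetry concrete, the sketch below computes exact expectations (no sampling) on a hypothetical three-token vocabulary. It compares the two KL directions and shows what the k3 formula averages to when the sign of the log-ratio is flipped by mistake. All names and numbers are assumptions for illustration.

```python
import math


def kl(p, q):
    """Exact KL(p || q) over a small vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def expected_k3(p_sample, num, den):
    """Exact expectation of exp(d) - d - 1 with d = log num - log den, for y ~ p_sample."""
    total = 0.0
    for weight, n, d in zip(p_sample, num, den):
        delta = math.log(n) - math.log(d)
        total += weight * (math.exp(delta) - delta - 1.0)
    return total


pi_theta = [0.7, 0.2, 0.1]   # hypothetical policy distribution
pi_ref = [0.5, 0.3, 0.2]     # hypothetical reference distribution

forward_kl = kl(pi_theta, pi_ref)   # the quantity we want
reverse_kl = kl(pi_ref, pi_theta)   # a different quantity

# Correct sign: d = log pi_ref - log pi_theta, sampled under pi_theta.
correct = expected_k3(pi_theta, num=pi_ref, den=pi_theta)
# Flipped sign: d = log pi_theta - log pi_ref, still sampled under pi_theta.
flipped = expected_k3(pi_theta, num=pi_theta, den=pi_ref)

print(forward_kl, reverse_kl, correct, flipped)
```

With the correct sign the expectation matches KL(π_θ ‖ π_ref). With the flipped sign it matches neither direction of KL in this example, yet it is still non-negative, so the mistake would not be obvious from a training curve.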

Interactive · β knob — anchor strength

Below we simulate one training run with arbitrary β. The blue curve is reward; the orange curve is KL(π_θ ‖ π_ref). The mock training uses the regularized objective J = r − β · KL with a synthetic reward model that has a reward-hack: high reward concentrated on outputs that drift far from ref. Slide β to see the tension.

β trades reward against KL drift

At low β the policy zooms toward reward but blows out KL — that's reward hacking. At high β the policy is anchored too hard and reward never moves. Real recipes pick β in the cooperative middle. (The simulation uses a closed-form KL surrogate KL ≈ θ² — Gaussian-style toy — not the k3 sample estimator, which only appears in the actual loss code.)
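The tension can be reproduced with a one-parameter toy. The sketch below is not the simulation behind the widget. It assumes a scalar drift θ, the KL surrogate KL ≈ θ² mentioned above, and a hypothetical reward r(θ) = θ that pays more the further the policy drifts. Under those assumptions the regularized objective J = θ − β·θ² has its maximum at θ = 1/(2β).

```python
def simulate(beta: float, steps: int = 5000, lr: float = 0.05):
    """Gradient ascent on J(theta) = theta - beta * theta**2 (toy, assumed reward)."""
    theta = 0.0  # step 0: the policy equals the reference, so drift is zero
    for _ in range(steps):
        grad = 1.0 - 2.0 * beta * theta  # dJ/dtheta
        theta += lr * grad
    return {"reward": theta, "kl": theta ** 2}


low = simulate(beta=0.01)   # weak anchor: large drift
mid = simulate(beta=0.1)
high = simulate(beta=1.0)   # strong anchor: reward barely moves

for name, run in [("low", low), ("mid", mid), ("high", high)]:
    print(f"beta {name}: reward={run['reward']:.2f}  kl={run['kl']:.2f}")
```

Because the toy reward is unbounded, a smaller β always buys more reward at a quadratically growing KL cost, which is the reward-hacking regime in miniature.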

Why reference is a separate role in the framework

You could just keep a frozen copy of the model in the trainer. People do. It breaks in three predictable ways:

Memory shape. The reference never needs optimizer state, activation checkpointing, or backward. Its optimal sharding is different from the trainer's. Making it a separate role lets the deployment give it its own (smaller, cheaper) layout — e.g. 2-way tensor parallel for ref versus 8-way FSDP for the trainer.
Correctness. If requires_grad is True or the ref shares the optimizer, weight decay and mixed-precision casting silently drift it. The anchor then anchors the policy to a moving target. reference.ReferenceEngine.__init__ calls freeze() on its model precisely so that this guarantee holds by construction rather than by convention.
Composability. DPO, reward-model scoring, and ref scoring are all "forward-only, score a batch" roles. With a common interface, swapping a reference model for a reward model is a one-line change in the controller.
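The composability point can be sketched as a shared interface. The Protocol and the two toy engines below are hypothetical stand-ins with fake scoring rules. They are not the framework's ReferenceEngine or RewardModelEngine, and only show why the controller does not need to know which one it holds.

```python
from typing import List, Protocol, Sequence


class ForwardOnlyScorer(Protocol):
    """Anything that scores a batch with a forward pass and no gradients."""

    def score(self, responses: Sequence[str]) -> List[float]:
        ...


class ToyReferenceEngine:
    """Stand-in for a frozen reference: returns a fake sequence log-probability."""

    def score(self, responses: Sequence[str]) -> List[float]:
        return [-0.5 * len(r.split()) for r in responses]


class ToyRewardModelEngine:
    """Stand-in for a learned reward model: returns a fake scalar reward."""

    def score(self, responses: Sequence[str]) -> List[float]:
        return [1.0 if "12" in r else 0.0 for r in responses]


def score_batch(scorer: ForwardOnlyScorer, responses: Sequence[str]) -> List[float]:
    """Controller-side call: identical for any forward-only role."""
    return scorer.score(responses)


batch = ["the answer is 12", "13"]
ref_scores = score_batch(ToyReferenceEngine(), batch)
rm_scores = score_batch(ToyRewardModelEngine(), batch)  # the one-line swap
```

Structural typing via Protocol means neither engine has to inherit from a common base class; matching the score signature is enough.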

π_old vs π_ref — they are different things

Two log-probability tensors appear in the loss. New readers conflate them constantly. Pin this distinction in your head now:

	π_old	π_ref
What policy is it?	The policy at the moment y was sampled.	The frozen SFT checkpoint.
When is it computed?	During rollout (lesson 2).	After rollout, by the reference engine.
Used in	PPO ratio π_θ / π_old.	KL anchor KL(π_θ ‖ π_ref).
Changes over training?	Yes — equals the trainer's policy at last weight-sync.	No — frozen for the whole RL run.

At step 0 they happen to coincide (both equal the SFT initialization). After step 1 they diverge — π_old tracks the trainer's recent past, π_ref stays put.
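A per-token sketch shows where each tensor enters. It assumes the standard PPO clipped surrogate (the text above only names the ratio π_θ / π_old, so the clipping is an assumption) plus a k3 penalty weighted by β. The function name and default values are hypothetical.

```python
import math


def per_token_loss(logp_theta, logp_old, logp_ref, advantage,
                   beta=0.05, clip_eps=0.2):
    """Loss for one sampled token: -clipped PPO surrogate + beta * k3."""
    # pi_old enters only through the importance ratio.
    ratio = math.exp(logp_theta - logp_old)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    surrogate = min(ratio * advantage, clipped * advantage)

    # pi_ref enters only through the KL anchor (k3 estimator).
    log_ratio_ref = logp_ref - logp_theta
    k3 = math.exp(log_ratio_ref) - log_ratio_ref - 1.0

    return -surrogate + beta * k3


# Step 0: all three log-probs coincide, so ratio = 1 and k3 = 0.
loss_step0 = per_token_loss(-1.2, -1.2, -1.2, advantage=0.5)

# Later: the policy has moved away from both pi_old and pi_ref.
loss_later = per_token_loss(-1.0, -1.1, -1.2, advantage=0.5)
loss_later_no_anchor = per_token_loss(-1.0, -1.1, -1.2, advantage=0.5, beta=0.0)
```

Mixing up the two tensors would not raise an error, since both are log-probabilities of the same tokens; keeping them in separately named arguments is the cheap safeguard.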

Takeaway

Reward says what to maximize; the reference says don't move too far from where you started. Their per-token interplay — reward in the advantage, KL in the loss — is what keeps modern RL from degenerating into reward-hack soup.