Reward & Reference
Where the reward signal comes from, and why we need a frozen anchor or the policy will reward-hack itself off a cliff.
Two flavors of reward
Modern post-training RL splits cleanly into two reward regimes:
| Flavor | Reward source | Example tasks | Famous example |
|---|---|---|---|
| RLVR (verifiable rewards) | A program: unit-test runner, math checker, regex match. | Math, code, puzzles. | DeepSeek-R1, o-series reasoning models |
| RLHF (learned reward model) | A neural network trained on human preference comparisons. | Helpfulness, harmlessness, style. | InstructGPT, Claude, the original ChatGPT |
This framework is RLVR by default — the toy task is arithmetic, and the verifier is "does the last integer in the response equal the gold answer." From the framework's perspective the two flavors are interchangeable: both are a function that takes a response and returns a scalar. RLHF would simply swap env.verify for a RewardModelEngine.score (forward-only, same shape as the reference engine — that's why the framework calls them out as a shared "forward-only scorer" pattern).
Interactive · the verifier
Type a response to <3+4+5>. The verifier is intentionally lenient — it picks the last integer in the response, ignoring any chain-of-thought before it. This matches how real verifiable-reward systems work: extract the final answer, compare to the gold answer, give 1 or 0.
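The "last integer wins" rule described above can be sketched in a few lines. This is a minimal illustration, not the framework's actual `env.verify`: the function name `verify`, its signature, and the regex are assumptions made for the example.

```python
import re


def verify(response: str, gold: int) -> float:
    """Hypothetical verifier: 1.0 if the last integer in `response` equals `gold`, else 0.0."""
    # Collect every (optionally signed) integer; earlier ones are treated as scratch work.
    matches = re.findall(r"-?\d+", response)
    if not matches:
        return 0.0  # no integer anywhere in the response, so no reward
    return 1.0 if int(matches[-1]) == gold else 0.0


print(verify("3+4 is 7, plus 5 gives 12", 12))  # 1.0: chain-of-thought is ignored
print(verify("It is 12. Or maybe 13", 12))      # 0.0: only the final integer counts
print(verify("twelve", 12))                     # 0.0: nothing to extract
```

A pattern this simple reads "7-5" as the integers 7 and -5, so a stricter verifier would want a dedicated answer delimiter; leniency of this kind is a design choice, not a requirement.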
Why a frozen reference
The minimal loop in lesson 1 has a problem we haven't named yet: reward hacking. If we maximize reward without any constraint, the policy will discover degenerate outputs that score well according to the verifier but are useless. On verifiable arithmetic this is hard to demonstrate (the verifier is tight), but on RLHF it is everywhere: "I love the previous response, here is more of it" reward-hacks helpfulness; repeated apologetic phrases reward-hack harmlessness; etc.
The fix that most modern recipes converge on: keep the policy close to a frozen reference. The reference is just a copy of the SFT checkpoint, taken before any RL touched it. We add a KL-divergence term to the loss:
L(θ) = −𝔼[r(y)] + β · KL(πθ ‖ πref), with y sampled from πθ
Minimizing this trades reward against drift from the reference. The hyperparameter β controls the trade-off; reasoning-RL recipes typically use small values (1e-2 to 1e-1). Too small → reward hacking; too large → the policy can't move enough to learn.
The KL term in practice — Schulman's k3 estimator
Computing KL(πθ ‖ πref) exactly requires a sum over the full vocabulary at every position, which is expensive. Schulman's k3 estimator is a per-token unbiased estimate that uses only the log-probabilities of the sampled token y under each policy:
k3 = exp(δ) − δ − 1, where δ = log πref(y) − log πθ(y)
Two important properties: (i) k3 ≥ 0 for every token, no matter the sign of log πref − log πθ; (ii) its expectation under y ∼ πθ equals the true KL: 𝔼y∼πθ[k3] = KL(πθ ‖ πref). Per realization k3 can deviate from the true KL — only the average matches. The cost is one extra forward pass through the reference for every trajectory — exactly the job of reference.py.
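Both properties can be checked numerically on a toy distribution. The sketch below uses the standard k3 form, exp(δ) − δ − 1 with δ = log πref(y) − log πθ(y); the 3-token vocabulary and its probabilities are made-up numbers, and the helper name `k3` is an assumption for the example, not the framework's API.

```python
import math
import random


def k3(logp_theta: float, logp_ref: float) -> float:
    """k3 estimate for one sampled token, from its log-probs under each policy."""
    delta = logp_ref - logp_theta
    return math.exp(delta) - delta - 1.0


# Toy 3-token vocabulary (illustrative numbers only).
pi_theta = [0.7, 0.2, 0.1]
pi_ref = [0.5, 0.3, 0.2]

# Exact KL(pi_theta || pi_ref): a full sum over the vocabulary.
true_kl = sum(p * math.log(p / q) for p, q in zip(pi_theta, pi_ref))

# Property (i): k3 is non-negative for every token.
per_token = [k3(math.log(p), math.log(q)) for p, q in zip(pi_theta, pi_ref)]

# Property (ii): the expectation of k3 under pi_theta equals the true KL.
exact_mean = sum(p * k for p, k in zip(pi_theta, per_token))

# Monte Carlo version: sample tokens from pi_theta and average k3.
rng = random.Random(0)
samples = rng.choices(range(3), weights=pi_theta, k=100_000)
mc_mean = sum(per_token[y] for y in samples) / len(samples)

print(f"true KL    = {true_kl:.5f}")
print(f"E[k3]      = {exact_mean:.5f}")
print(f"MC average = {mc_mean:.5f}")
```

In training only the Monte Carlo path is available: the sampled token's log-prob under πθ comes from the trainer, and its log-prob under πref comes from the extra reference forward pass.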
Interactive · β knob — anchor strength
Below we simulate one training run with arbitrary β. The blue curve is reward; the orange curve is KL(πθ ‖ πref). The mock training uses the regularized objective J = r − β · KL with a synthetic reward model that has a reward-hack: high reward concentrated on outputs that drift far from ref. Slide β to see the tension.
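The same tension can be seen without any training loop. For a finite set of outputs, the maximizer of J = r − β · KL has a known closed form, π*(y) ∝ πref(y) · exp(r(y)/β). The sketch below evaluates that optimum for a few β values. It is not the simulation behind the widget: the four outputs, their reference probabilities, and their rewards are made-up numbers, with the last output standing in for a reward hack that is unlikely under the reference but highly rewarded.

```python
import math


def optimal_policy(pi_ref, rewards, beta):
    """Maximizer of E_pi[r] - beta * KL(pi || pi_ref) over a finite output set."""
    weights = [q * math.exp(r / beta) for q, r in zip(pi_ref, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_reward(pi, rewards):
    return sum(p * r for p, r in zip(pi, rewards))


def kl_divergence(pi, pi_ref):
    # Terms with p == 0 contribute nothing to the sum.
    return sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0.0)


# Illustrative numbers: the last output is the "hack".
pi_ref = [0.60, 0.30, 0.09, 0.01]
rewards = [0.0, 1.0, 1.0, 5.0]

results = []
for beta in (0.1, 1.0, 10.0):
    pi = optimal_policy(pi_ref, rewards, beta)
    results.append((beta, expected_reward(pi, rewards), kl_divergence(pi, pi_ref), pi[-1]))

for beta, reward, kl, p_hack in results:
    print(f"beta={beta:<5} reward={reward:.3f}  KL={kl:.3f}  P(hack)={p_hack:.3f}")
```

With small β almost all probability mass moves onto the hack and KL approaches log(1/0.01); with large β the policy barely leaves the reference and the extra reward is mostly forgone.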
Why reference is a separate role in the framework
You could just keep a frozen copy of the model in the trainer. People do. It breaks in three predictable ways:
- Memory shape. The reference never needs optimizer state, activation checkpointing, or backward. Its optimal sharding is different from the trainer's. Making it a separate role lets the deployment give it its own (smaller, cheaper) layout — e.g. 2-way tensor parallel for ref versus 8-way FSDP for the trainer.
- Correctness. If `requires_grad` is True or the ref shares the optimizer, weight decay and mixed-precision casting silently drift it. The anchor then anchors the policy to a moving target. `reference.ReferenceEngine.__init__` calls `freeze()` on its model precisely to make this a type-system-level guarantee.
- Composability. DPO, reward-model scoring, and ref scoring are all "forward-only, score a batch" roles. With a common interface, swapping a reference model for a reward model is a one-line change in the controller.
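The composability point can be sketched with a shared interface. Everything below is hypothetical: `ForwardOnlyScorer`, the two toy scorer classes, and `score_batch` are names invented for this illustration and do not come from the framework, and the scores they return are placeholders rather than real model outputs.

```python
from typing import List, Protocol, Sequence


class ForwardOnlyScorer(Protocol):
    """Hypothetical shared role: take a batch of responses, return one scalar each."""

    def score(self, responses: Sequence[str]) -> List[float]: ...


class ToyReferenceScorer:
    """Stand-in for a frozen reference: returns a fake sequence log-prob."""

    def score(self, responses: Sequence[str]) -> List[float]:
        # Placeholder: -0.5 per whitespace-separated token, not a real model.
        return [-0.5 * len(r.split()) for r in responses]


class ToyRewardScorer:
    """Stand-in for a learned reward model: returns a fake preference score."""

    def score(self, responses: Sequence[str]) -> List[float]:
        # Placeholder rule, not a trained network.
        return [1.0 if "12" in r else 0.0 for r in responses]


def score_batch(scorer: ForwardOnlyScorer, responses: Sequence[str]) -> List[float]:
    """Controller-side call: it does not care which kind of scorer it was handed."""
    return scorer.score(responses)


batch = ["the answer is 12", "13"]
print(score_batch(ToyReferenceScorer(), batch))
print(score_batch(ToyRewardScorer(), batch))  # swapping roles is a one-line change
```

Neither scorer holds optimizer state or runs a backward pass, which is what lets a deployment place them on a different, cheaper layout than the trainer.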
πold vs πref — they are different things
Two log-probability tensors appear in the loss. New readers conflate them constantly. Pin this distinction in your head now:
| | πold | πref |
|---|---|---|
| What policy is it? | The policy at the moment y was sampled. | The frozen SFT checkpoint. |
| When is it computed? | During rollout (lesson 2). | After rollout, by the reference engine. |
| Used in | PPO ratio πθ / πold. | KL anchor KL(πθ ‖ πref). |
| Changes over training? | Yes: it equals the trainer's policy at the last weight sync. | No: it stays frozen for the whole RL run. |
At step 0 they happen to coincide (both equal the SFT initialization). After step 1 they diverge — πold tracks the trainer's recent past, πref stays put.
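A small numeric sketch makes the separation concrete: the same sampled tokens carry three log-probabilities, and each term in the loss reads a different pair. The log-prob values are made up, and the helper names `ppo_ratio` and `k3_to_ref` are assumptions for this example.

```python
import math


def ppo_ratio(logp_theta: float, logp_old: float) -> float:
    """Importance ratio pi_theta / pi_old for one sampled token."""
    return math.exp(logp_theta - logp_old)


def k3_to_ref(logp_theta: float, logp_ref: float) -> float:
    """Per-token k3 estimate of KL(pi_theta || pi_ref)."""
    delta = logp_ref - logp_theta
    return math.exp(delta) - delta - 1.0


# Log-probs of the same three sampled tokens under each policy (illustrative).
logp_theta = [-0.9, -1.8, -0.4]  # current trainer weights
logp_old = [-1.0, -1.6, -0.4]    # recorded at rollout time
logp_ref = [-1.3, -1.2, -0.7]    # frozen SFT checkpoint

ratios = [ppo_ratio(t, o) for t, o in zip(logp_theta, logp_old)]
kl_terms = [k3_to_ref(t, r) for t, r in zip(logp_theta, logp_ref)]

print("PPO ratios (theta vs old):", [round(x, 3) for x in ratios])
print("k3 terms   (theta vs ref):", [round(x, 3) for x in kl_terms])

# At step 0 all three policies coincide: every ratio is 1 and every KL term is 0.
step0 = [-1.0, -2.0, -0.5]
print([ppo_ratio(x, x) for x in step0], [k3_to_ref(x, x) for x in step0])
```

The ratio measures how far the trainer has moved since the last rollout; the k3 terms measure how far it has moved since SFT. Passing the wrong tensor to either function still produces plausible-looking numbers, which is why the two are easy to conflate.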