RLVR — exploration under a verifier GRPO · DeepSeek-R1
The final stage. A programmatic checker replaces the human rater, fresh samples from the policy replace the static dataset, and a one-line group statistic replaces the value network.
Where DPO ran out of road
Lesson 5 left us with a model that has been pushed toward chosen responses and away from rejected ones under a frozen reference anchor. DPO is elegant — the reward model dissolves algebraically, the loss is closed-form — but it inherits the rigidity of any supervised method. Three concrete gaps:
- Static pair distribution. The training signal is bounded by what is in the pair file. DPO can re-rank within the support of (yw, yl) seen at training time, but it cannot reward a response style it has never seen.
- No exploration. The model never produces samples during DPO training — every gradient comes from log-probabilities of pre-canned strings. There is no mechanism for discovering a new reasoning pattern that scores well; only for tilting probability toward what some human, or some prior model, has already written down.
- Wasteful for verifiable tasks. For arithmetic, code, puzzles, or anything with a programmatic correctness oracle, paying humans to compare responses is silly. The oracle is the reward. Spending preference labels on tasks with a ground truth is like hiring a panel of judges to score a stopwatch.
RLVR (Reinforcement Learning with Verifiable Rewards) addresses all three problems at once: replace the static dataset with on-policy rollouts, replace the human rater with a verifier, and use a clean group statistic in place of any learned value function. The instantiation we will study is GRPO (Group-Relative Policy Optimization), the algorithm behind DeepSeek-R1 and widely adopted across open o1-style reasoning models.
The setting in five symbols
Three of the symbols carry over verbatim from lesson 5; two are new.
- πθ — the policy we are training. Same MiniGPT, same parameters, same forward pass.
- πref — the frozen reference. As in DPO, this is the SFT-warmed-up checkpoint at the moment RLVR starts. Its parameters are set to requires_grad_(False). It anchors the KL term.
- x — a prompt, here a chat-templated arithmetic problem such as <3+4+5>.
- y — a response sampled from πθ(·|x) token-by-token. New every step.
- R(x, y) — the verifier. A function from (x, y) to ℝ that returns 1 if the response decodes to the gold answer and 0 otherwise. No learning, no parameters, just code:
def verify(response_text: str, gold: int) -> float:
    matches = _DIGITS_RE.findall(response_text)
    if not matches:
        return 0.0
    return 1.0 if int(matches[-1]) == gold else 0.0
For math the verifier parses the last integer in the response and compares it to the gold sum. For code it would run unit tests. For puzzles it would diff against a known solution. The defining property is that the verifier is cheap and deterministic: it never gives partial credit it does not believe in, and it never disagrees with itself. On tasks where such a checker exists, that makes it a cleaner signal than a reward model trained on preferences, though only as faithful as the check it implements.
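A self-contained version of the verifier makes its behaviour easy to probe. This is a minimal sketch: the lesson does not show how _DIGITS_RE is defined, so the pattern below (runs of ASCII digits) is an assumption that fits non-negative sums, and the sample responses are made up.

```python
import re

# Assumed pattern: the lesson does not show the real definition of _DIGITS_RE.
_DIGITS_RE = re.compile(r"\d+")


def verify(response_text: str, gold: int) -> float:
    """Return 1.0 if the last integer in the response equals gold, else 0.0."""
    matches = _DIGITS_RE.findall(response_text)
    if not matches:
        return 0.0  # nothing parseable counts as wrong
    return 1.0 if int(matches[-1]) == gold else 0.0


# Hypothetical responses to the prompt <3+4+5> (gold = 12).
for text in ["12#", "3+4+5=12#", "12 or maybe 13#", "+_##"]:
    print(repr(text), verify(text, gold=12))
```

Only the last integer matters: earlier digits in the response are ignored, and a response with no digits at all scores zero.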
GRPO in six lines
Here is the entire algorithm. Walk through it once now; we will dissect each line below.
1. Sample a prompt x ~ D
2. Roll out K samples y_1, …, y_K ~ π_θ(·|x)
3. Verify each r_i = R(x, y_i) ∈ ℝ
4. Group advantage A_i = (r_i − mean_j r_j) / (std_j r_j + ε)
5. PG loss L_PG = − (1/K) Σ_i A_i · log π_θ(y_i | x)
6. KL anchor L_KL = β · (1/K) Σ_i KL( π_θ ‖ π_ref ) over response tokens
Total L = L_PG + L_KL
That is the whole loop. One optimizer step per group of K rollouts. The toy implementation in 04_rlvr.py uses K=4 and β=0.05 for 800 steps; production recipes use K=16…64 and tens of thousands of steps.
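Steps 4 to 6 can be traced with plain numbers. The sketch below computes only the scalar loss value for one group; it assumes per-rollout log-probabilities and KL estimates have already been summed over response tokens, and every number in the example call is hypothetical. In real training the advantages are treated as constants and autograd differentiates through the log-probabilities.

```python
from statistics import fmean, pstdev


def grpo_loss(seq_logps, rewards, seq_kls, beta=0.05, eps=1e-6):
    """Scalar GRPO loss for one group of K rollouts (steps 4 to 6).

    seq_logps[i]: log pi_theta(y_i | x), summed over response tokens
    rewards[i]:   verifier output r_i
    seq_kls[i]:   KL estimate for rollout i, summed over response tokens
    """
    k = len(rewards)
    mean_r = fmean(rewards)
    std_r = pstdev(rewards)  # population std over the group
    advantages = [(r - mean_r) / (std_r + eps) for r in rewards]
    pg_loss = -sum(a * lp for a, lp in zip(advantages, seq_logps)) / k
    kl_loss = beta * sum(seq_kls) / k
    return pg_loss + kl_loss, advantages


# Hypothetical values for K = 4 rollouts of a single prompt.
loss, advantages = grpo_loss(
    seq_logps=[-2.0, -2.5, -3.0, -2.0],
    rewards=[1.0, 1.0, 0.0, 0.0],
    seq_kls=[0.02, 0.03, 0.01, 0.04],
)
print("advantages:", [round(a, 3) for a in advantages])
print("loss:", round(loss, 5))
```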
Why group-relative — the disappearance of the value network
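The score-function gradient is
∇_θ J(θ) = 𝔼_{x∼D, y∼π_θ(·|x)} [ R(x, y) · ∇_θ log π_θ(y | x) ]
This is the formula from lesson 1, lifted to sequences. Estimated naively (one sample), it has crushing variance: a single response touches hundreds of tokens, and a binary reward gives you one bit of signal scattered across all of them. The standard cure is a baseline b(x). Subtracting any function of x alone from the reward leaves the gradient unbiased:
𝔼_{y∼π_θ(·|x)} [ b(x) · ∇_θ log π_θ(y | x) ] = b(x) · ∇_θ Σ_y π_θ(y | x) = b(x) · ∇_θ 1 = 0
because the policy distribution sums to one no matter θ. So the centred gradient
∇_θ J(θ) = 𝔼_{x∼D, y∼π_θ(·|x)} [ (R(x, y) − b(x)) · ∇_θ log π_θ(y | x) ]
is unbiased for any b that does not depend on y. PPO-style RLHF realizes b(x) as a learned value head (a second copy of the trunk with a scalar projection) trained by regressing returns. GRPO realizes b(x) as the sample mean of K rollouts for the same prompt:
b̂(x) = mean_j r_j = (1/K) Σ_j R(x, y_j)
This is a Monte-Carlo estimate of 𝔼y∼πθ[R(x,y)]. Strictly, it includes r_i itself and so depends on y_i. But r_i − mean_j r_j equals (K−1)/K times the gap between r_i and the mean of the other K−1 rollouts, and that leave-one-out mean is independent of y_i given x. The baseline argument therefore applies up to a constant factor (K−1)/K, which only rescales the step. The baseline costs zero extra parameters, has zero warmup, and cannot go stale relative to the policy because it is computed from the policy itself, this step. The standard-deviation denominator
σ̂(x) = std_j r_j + ε = sqrt( (1/K) Σ_j (r_j − b̂(x))² ) + ε
rescales each group's advantages. Because it is computed from the sampled rewards it is not a pure constant: it reweights prompts by how mixed their outcomes are, a bias revisited in the trade-offs section below. In exchange it makes the loss invariant to reward shifts and, up to ε, to multiplicative reward rescaling. Switch your verifier from {0,1} to {0,10} and the optimizer behaves near-identically.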
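Two properties of the group-relative advantage can be checked numerically: centring on the group mean is the same as centring on the leave-one-out mean scaled by (K−1)/K, and normalizing by the standard deviation removes reward shifts and rescalings up to ε. A minimal sketch with hypothetical rewards; the helper names are illustrative and not taken from 04_rlvr.py.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Mean-centred, std-normalized advantages for one group."""
    mean_r = fmean(rewards)
    std_r = pstdev(rewards)
    return [(r - mean_r) / (std_r + eps) for r in rewards]


def leave_one_out_centred(rewards):
    """r_i minus the mean of the other K - 1 rewards."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]  # hypothetical group, K = 8
k = len(rewards)

mean_centred = [r - fmean(rewards) for r in rewards]
loo = leave_one_out_centred(rewards)

base = group_advantages(rewards)
scaled = group_advantages([10.0 * r for r in rewards])   # verifier in {0, 10}
shifted = group_advantages([r + 3.0 for r in rewards])   # verifier in {3, 4}

print("mean-centred:", mean_centred[:2], " (K-1)/K * LOO:", [(k - 1) / k * v for v in loo[:2]])
print("max |base - scaled| :", max(abs(a - b) for a, b in zip(base, scaled)))
print("max |base - shifted|:", max(abs(a - b) for a, b in zip(base, shifted)))
```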
The cost of going group-relative: K forward passes per prompt instead of one. That is the basic GRPO-vs-PPO trade. PPO pays in extra parameters and a value-target signal; GRPO pays in extra compute per prompt. For verifiable tasks with sparse terminal reward — where the value head is hardest to train anyway — the GRPO trade is the better one.
A worked group, by hand
Suppose we draw K=4 rollouts for prompt <3+4+5> (gold = 12). The verifier returns:
(r_1, r_2, r_3, r_4) = (1, 1, 0, 0)
Mean is 0.5. The population standard deviation is also 0.5. Advantages:
(A_1, A_2, A_3, A_4) ≈ (+1, +1, −1, −1)
Half the group gets pushed up, half pushed down. The policy gradient is now centred: the optimizer will increase log-probability of y1 and y2, decrease log-probability of y3 and y4, by exactly the magnitude their advantage prescribes. This is the entire learning signal.
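The same arithmetic in a few lines of Python, as a sketch. It also shows one detail worth knowing when moving to tensors: the worked example uses the population standard deviation (divide by K), while torch.Tensor.std applies Bessel's correction (divide by K−1) by default, which gives slightly smaller advantages for the same group.

```python
from statistics import fmean, pstdev, stdev

rewards = [1.0, 1.0, 0.0, 0.0]  # two correct rollouts, two wrong
eps = 1e-6

mean_r = fmean(rewards)
pop_std = pstdev(rewards)     # divides by K
sample_std = stdev(rewards)   # divides by K - 1 (Bessel's correction)

adv_pop = [(r - mean_r) / (pop_std + eps) for r in rewards]
adv_sample = [(r - mean_r) / (sample_std + eps) for r in rewards]

print("mean:", mean_r)
print("population std:", pop_std, "advantages:", [round(a, 3) for a in adv_pop])
print("sample std:", round(sample_std, 4), "advantages:", [round(a, 3) for a in adv_sample])
```

Either convention produces the same signs and the same ordering; only the step scale differs.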
The KL anchor — why we keep πref around
Without a KL penalty, the policy is free to walk wherever the reward gradient points. Sounds desirable. It is not. Verifiers have edge cases. The arithmetic verifier in our toy parses the last integer in the response — so a policy that learns to emit the string "42#" always, and ignores the prompt, will get reward 1 on every problem whose gold answer happens to be 42, and reward 0 elsewhere. Across a uniform-difficulty stream of two- and three-summand problems, this is a winning strategy compared to most random outputs. A pure reward-maximizing policy with no anchor will find it. We are then proudly training a model that thinks the universal answer is 42 and have lost everything SFT gave us.
The KL term against the frozen πref stops this. It says: at every token, your distribution over next tokens has to stay near the SFT model's. Local exploration is fine; a wholesale collapse onto a degenerate sequence is not.
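To put a number on the constant-answer exploit, the sketch below enumerates a small problem stream and finds the best prompt-ignoring answer. The stream (every ordered pair of single-digit summands) is an assumption chosen for easy enumeration; it is not the distribution used in 04_rlvr.py.

```python
from collections import Counter
from itertools import product

# Assumed problem stream: all ordered pairs of single-digit summands.
problems = list(product(range(10), repeat=2))
gold_counts = Counter(a + b for a, b in problems)

# The best constant answer is the most frequent gold sum.
best_constant, hits = gold_counts.most_common(1)[0]
constant_reward = hits / len(problems)

print("best constant answer:", best_constant)
print("mean reward of always emitting it:", constant_reward)
```

A policy that ignores the prompt still collects a nonzero mean reward here, which is enough to attract a reward-maximizing optimizer whenever the current policy scores lower than that.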
The KL is computed only over response tokens, not the prompt. Prompt tokens are given as conditioning and are never sampled from πθ, so they are not part of the distribution KL(πθ(·|x) ‖ πref(·|x)) being regularized; masking them out keeps the penalty on the tokens the policy actually generated. In the code:
target_mask = torch.zeros((K, T - 1), dtype=torch.float32, device=device)
target_mask[:, P - 1 : P - 1 + resp_mask.size(1)] = resp_mask
P is the prompt length; the mask is 1 only at target positions corresponding to response tokens that are not past-EOS padding.
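The off-by-one in the slice is easier to see on a single sequence with plain lists. This is a sketch of the indexing only: target position j scores token j+1, so the first response token (index P) is scored at target position P−1. The function name and the example sizes are illustrative.

```python
def build_target_mask(prompt_len, total_len, resp_mask):
    """Mask over the T - 1 next-token targets of one sequence.

    Target position j scores token j + 1, so response tokens, which start
    at index prompt_len, are scored from target position prompt_len - 1.
    """
    mask = [0.0] * (total_len - 1)
    start = prompt_len - 1
    for offset, keep in enumerate(resp_mask):
        mask[start + offset] = float(keep)
    return mask


# Hypothetical sizes: 3 prompt tokens plus 4 response slots, the last of
# which is past-EOS padding and therefore masked out.
P, T = 3, 7
resp_mask = [1, 1, 1, 0]
mask = build_target_mask(P, T, resp_mask)
print(mask)
```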
Schulman's k3 KL estimator
How do we actually compute the KL? We have access to per-token log-probabilities log πθ(yt|·) and log πref(yt|·) from the two model forwards. The naive estimator is
KL̂_naive = log π_θ(y_t | ·) − log π_ref(y_t | ·)
which is unbiased (its expectation under y ∼ πθ is exactly KL(πθ ‖ πref)) but signed and high-variance. Per token it can be negative, which is uncomfortable both interpretationally and as a loss term you might want to clip or log. Schulman's "k3" estimator fixes this. Let r = log πref − log πθ. Then
KL̂_k3 = e^r − r − 1
Two properties to verify:
- Nonnegativity per token. The function f(r) = e^r − r − 1 has minimum 0 at r=0, since f′(r) = e^r − 1 vanishes there and f″(r) = e^r > 0 everywhere. So f ≥ 0 for all real r, with equality iff πθ and πref assign the same probability to that token.
- Unbiased in expectation. Under y ∼ πθ, 𝔼[e^r] = Σ_y πθ(y) · πref(y)/πθ(y) = Σ_y πref(y) = 1. So 𝔼[KL̂_k3] = 1 − 𝔼[r] − 1 = −𝔼[r] = 𝔼[log πθ − log πref] = KL(πθ ‖ πref). The expectations of the naive and k3 estimators agree exactly.
So k3 gives up nothing in expectation, and every sample is nonnegative. When πθ stays close to πref, which is the regime the anchor enforces, it also has far lower variance than the naive estimator; that reduction is typical rather than guaranteed for distant distributions.
The toy implementation is two Python lines:
r = ref_logp - pol_logp
kl_per_tok = torch.exp(r) - r - 1.0 # >= 0 elementwise
gpt_mini lists this as the canonical RLVR gotcha for a reason.
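Both properties can be confirmed exactly on a small categorical distribution, with no sampling. The sketch below enumerates every outcome under the policy and compares the naive and k3 estimators against the exact KL. The two distributions are hypothetical stand-ins for πθ and πref at a single position.

```python
import math


def estimator_stats(p, q):
    """Exact (mean, variance, minimum) of each estimator under y ~ p."""
    estimators = {
        "naive": lambda r: -r,
        "k3": lambda r: math.exp(r) - r - 1.0,
    }
    stats = {}
    for name, fn in estimators.items():
        # r = log q(y) - log p(y), matching r = ref_logp - pol_logp above.
        values = [fn(math.log(qi) - math.log(pi)) for pi, qi in zip(p, q)]
        mean = sum(pi * v for pi, v in zip(p, values))
        var = sum(pi * (v - mean) ** 2 for pi, v in zip(p, values))
        stats[name] = (mean, var, min(values))
    return stats


policy = [0.5, 0.3, 0.2]      # hypothetical pi_theta over three tokens
reference = [0.4, 0.4, 0.2]   # hypothetical pi_ref over the same tokens

exact_kl = sum(pi * math.log(pi / qi) for pi, qi in zip(policy, reference))
stats = estimator_stats(policy, reference)

print("exact KL:", exact_kl)
for name, (mean, var, lowest) in stats.items():
    print(f"{name:>5}: mean={mean:.6f} var={var:.6f} min={lowest:.6f}")
```

For this pair of nearby distributions the two means coincide with the exact KL, the naive estimator takes a negative value on one token, and k3 stays nonnegative with a much smaller variance.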
Why REINFORCE-style here — no PPO clipping
Real GRPO, as published, wraps the policy-gradient term in PPO-style importance-ratio clipping. The reason is off-policy reuse: in production, rollouts are produced by an inference engine on one set of GPUs while the trainer updates the policy on another set. Between the moment a rollout is sampled (using policy weights πold) and the moment it contributes to a gradient (using the current weights πθ), the policy has moved. The two distributions differ; the rollout is no longer on-policy. The PPO clip bounds the resulting update.
Our toy in 04_rlvr.py sidesteps all of that. The loop is strictly synchronous: roll out → score → update → next prompt. The rollout policy is the current policy, so the importance ratio πθ(y|x) / πold(y|x) = 1 by construction and the clip is inactive. We skip the clip term for clarity; it would not change a single gradient. In a multi-epoch or asynchronous setting you'd insert a few lines:
# where it would go, in the loss expression:
ratio = (pol_logp - old_logp.detach()).exp() # importance ratio
clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) # PPO-clip
pg_loss = -torch.min(ratio * adv, clipped * adv).mean()
Everything else — group-relative advantage, KL anchor, k3 estimator, mask over response tokens — is unchanged.
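A scalar version of the clipped surrogate shows when the clip is inactive and when it bites. This is a sketch for a single (ratio, advantage) pair; eps = 0.2 is a common illustrative value, not one taken from the toy.

```python
def clipped_surrogate(ratio, adv, eps=0.2):
    """PPO-style objective term: min(ratio * A, clip(ratio) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * adv, clipped_ratio * adv)


cases = [
    (1.0, 2.0),    # on-policy: ratio is 1, the clip changes nothing
    (1.5, 1.0),    # good sample already up-weighted: gain capped at 1 + eps
    (0.5, -1.0),   # bad sample already down-weighted: capped at 1 - eps
    (0.5, 1.0),    # good sample down-weighted: unclipped, still pulled up
    (1.5, -1.0),   # bad sample up-weighted: unclipped, still pushed down
]
for ratio, adv in cases:
    print(f"ratio={ratio:.1f} adv={adv:+.1f} -> {clipped_surrogate(ratio, adv):+.2f}")
```

The maximized objective is this term; the loss in the snippet above is its negative mean over tokens.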
Degenerate groups — when the step does nothing
If all K rollouts in a group receive the same reward, then σ̂ = 0 and every r_i − mean_j r_j = 0. The advantage vector is all zeros. The PG loss vanishes; only the KL term remains; the gradient is purely "stay close to πref", which is uninformative. The toy file detects this and short-circuits:
if rewards.std().item() < 1e-8:
    return 0.0, rewards.mean().item(), 0.0  # skip the optimizer step
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
Degenerate groups are not a bug — they are a fact of the training dynamics. They show up in two regimes:
- Early. A bad policy fails on most prompts. For an arithmetic problem with gold answer 14, all four rollouts say "7#", "3#", "99#", "5#": all wrong. Rewards are (0,0,0,0). No signal.
- Late. A mastered policy succeeds on most prompts. All four rollouts say "14#". Rewards are (1,1,1,1). No signal: there is nothing to differentiate by.
The interesting middle is where some rollouts succeed and some fail, and the centred advantage tells the policy which sample style worked. The dead-group rate is a useful real-time diagnostic: if it stays near 100% you have a hopeless initial policy, a verifier that everyone passes, or a policy that has already mastered the prompt distribution; if it is near zero, almost every group is mixed and almost every step carries signal.
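If each rollout is modelled as an independent coin flip that is correct with probability p (an idealization, since real rollouts for one prompt share that prompt's difficulty), the chance that a group of K is dead has a closed form: p^K + (1−p)^K. A short sketch tabulating it:

```python
def dead_group_rate(p, k):
    """Probability that k independent rollouts are all right or all wrong."""
    return p ** k + (1.0 - p) ** k


for p in (0.05, 0.5, 0.95):
    row = ", ".join(f"K={k}: {dead_group_rate(p, k):.4f}" for k in (2, 4, 8, 16))
    print(f"p={p:.2f} -> {row}")
```

The rate is lowest at p = 0.5 and climbs toward 1 at both extremes, which is the early and late regime described above.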
Interactive · a GRPO rollout group, step by step
The widget below simulates one full GRPO step at a time. The "policy correctness" parameter p is a 1-D stand-in for the policy's competence: rollouts are sampled to be correct with probability p. Hit Resample to draw a fresh group; hit Step to apply one fake gradient update — the step nudges p up by an amount proportional to Σ Ai · [correcti] (the empirical group signal), tempered by the KL coefficient β (larger β means slower adaptation). The "mean reward EMA" line chart and the dead-group counter give you a feel for how RLVR actually unfolds.
Play with three things and notice the dynamics:
- Hold β = 0 and step repeatedly. The policy correctness climbs fast — and you may notice it crosses 1.0 (impossible for a real policy) because there is nothing pulling it back. That is the collapse failure mode.
- Crank β high. Each step barely moves the policy. The mean reward EMA grows linearly instead of accelerating. Strong anchor, slow learning.
- Reduce K to 2. Dead-group rate spikes — half your steps do nothing. Increase to K=8 and the rate plummets at K× the compute.
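For readers without the widget, the dynamics can be approximated offline. The sketch below is a rough stand-in, not the widget's code: the update rule, the step size lr, and the way β damps the step are all assumptions made for illustration, and p is clamped to [0, 1] so it cannot overshoot the way the widget does at β = 0.

```python
import random
from statistics import fmean, pstdev


def simulate(p0=0.3, k=4, beta=0.05, steps=200, lr=0.02, seed=0):
    """Toy 1-D GRPO dynamics; returns (final p, number of dead groups)."""
    rng = random.Random(seed)
    p, dead = p0, 0
    for _ in range(steps):
        correct = [1.0 if rng.random() < p else 0.0 for _ in range(k)]
        std = pstdev(correct)
        if std < 1e-8:
            dead += 1  # all right or all wrong: no learning signal
            continue
        mean = fmean(correct)
        adv = [(c - mean) / (std + 1e-6) for c in correct]
        signal = sum(a * c for a, c in zip(adv, correct)) / k
        # Assumed damping: larger beta shrinks the step toward zero.
        step = lr * signal / (1.0 + 10.0 * beta)
        p = min(1.0, max(0.0, p + step))
    return p, dead


for k in (2, 4, 8):
    final_p, dead = simulate(k=k)
    print(f"K={k}: final p={final_p:.3f}, dead groups={dead}/200")
```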
What β and K actually control
| Knob | Small | Large | Where it bites |
|---|---|---|---|
| β | weak anchor | strong anchor | Small → faster adaptation but risk of collapse onto a verifier exploit. Large → slow but stable. Typical: 0.01–0.1. |
| K | high-variance advantage; many dead groups | low variance, K× compute | Small → step contributes nothing on dead groups, gradients noisy. Large → reliable advantage estimate. Papers use 16–64. |
| temperature | greedy, low entropy | diverse, exploratory | Too low collapses the group (every sample is the same response); too high makes everything garbage. Match to expected response length. |
| learning rate | safe | fast but unstable | RLVR LR is typically 10×–100× smaller than SFT LR — the gradient is already a high-variance Monte Carlo estimate; piling LR on top makes it worse. Toy uses 1e-5. |
Warmup matters — the cold-start problem
RL needs a base that occasionally succeeds. A randomly initialized policy emits gibberish: for the toy task it would produce strings like "+5_##" that the verifier can't even parse. Every reward is 0. Every group is degenerate. Every gradient is zero. Training cannot start.
That is why 04_rlvr.py runs a brief SFT warmup before the RLVR loop:
print("── SFT warmup (so the initial policy produces parseable output) ──")
sft = train_warmup(tok, T_max=T_max, steps=600, B=64, device=device)
policy = copy.deepcopy(sft)
ref = sft
After 600 steps of warmup the policy reliably emits digits + #. The verifier can now grade it. Some answers are right, some wrong — a productive split. The RLVR loop starts there. πref is set to the same warmed-up model and frozen.
This is a general lesson about RL post-training: you need a base that occasionally succeeds, otherwise the reward signal is too sparse to bootstrap. Real R1 starts from a strong base model. Our 4-layer toy starts from a tiny SFT. Same principle. The "Zero" variants (R1-Zero, etc.) demonstrate that with a sufficiently capable base and a verifier, you can skip explicit SFT entirely — but you can never skip the implicit prior that the base provides.
DPO vs RLVR — same anchor, different oracle
DPO and RLVR share more than they differ. Both keep a frozen πref as a KL anchor. Both train πθ via a weighted log-probability objective. Both inherit their initial weights from an SFT checkpoint. The difference is in the oracle they consult and the dataset shape that follows from it.
| | DPO (Lesson 5) | RLVR (this lesson) |
|---|---|---|
| Dataset shape | static (x, yw, yl) | x only; ys sampled fresh from πθ |
| Oracle | human (or model) preference | programmatic verifier |
| Number of samples per x | 2 (the pair) | K (typically 16–64) |
| Exploration | none — re-ranks within pair distribution | yes — model generates novel responses |
| Gradient flavor | closed-form, supervised-like | Monte Carlo policy gradient |
| KL anchor | implicit (β·log(πθ/πref) margin) | explicit (KL term added to loss) |
| Failure mode | over-fits the pairs; can't go beyond them | reward hacking; collapse onto an exploit without the KL anchor |
| Best for | general assistant tuning, style | math, code, puzzles, reasoning |
Modern recipes use both, often in sequence: DPO for assistant style, then RLVR for hard reasoning targets. The choice is not ideological; it is "do you have a verifier?". If yes, RLVR. If no, DPO (or RLHF with a learned reward model).
Trade-offs you have to think about
Reward hacking
Any time you replace a goal ("be a good math solver") with a proxy ("pass this verifier"), you invite Goodhart's law. The arithmetic verifier in 04_rlvr.py parses the last integer. A policy could learn to emit "5+7=12; the answer is 12" for prompts whose gold is 12, which is fine, but it could also learn to emit a long reasoning trace that contains a wrong intermediate step yet ends with the right final integer. The verifier cannot tell the difference. For code, a policy can write code that passes the tests but fails on every input the tests do not cover. Mitigations include test-suite breadth, multiple verifiers, and the KL anchor itself: staying close to a sensible-looking reference makes exotic exploits less reachable.
Verifier strength
The verifier is the ceiling on what RLVR can teach. If the verifier accepts wrong-but-confident answers, the policy will learn to give them. If the verifier rejects right-but-unusual answers, the policy will learn to avoid them. For math, "compare to the gold integer" is rock-solid. For code, "pass these unit tests" is solid only if the tests cover edge cases. For reasoning, "agree with the answer key" works for short answers but breaks for open-ended outputs — which is why preferences (DPO, RLHF) still dominate for those.
KL coefficient tuning
Too small and the policy collapses; too large and it never moves. There is no principled value of β; it depends on the geometry of the verifier landscape near πref. A practical procedure: start at β=0.1, monitor mean KL per rollout, and tune. If KL grows unboundedly, increase β. If KL stays at zero and reward does not move, decrease β. The toy uses β=0.05, which is on the looser side appropriate for a small model with a forgiving task.
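The tuning procedure can be written down as a small rule. This is a sketch only: the thresholds, the doubling factor, and the function name are hypothetical, and the toy in 04_rlvr.py keeps β fixed.

```python
def adjust_beta(beta, mean_kl, reward_gain,
                kl_ceiling=0.5, kl_floor=1e-3, min_gain=1e-3, factor=2.0):
    """Return an updated KL coefficient from recent training statistics.

    mean_kl:     mean KL per rollout over a recent window
    reward_gain: change in mean reward over the same window
    All thresholds are illustrative assumptions.
    """
    if mean_kl > kl_ceiling:
        return beta * factor   # policy is drifting: tighten the anchor
    if mean_kl < kl_floor and reward_gain < min_gain:
        return beta / factor   # policy is pinned and not improving: loosen it
    return beta                # otherwise leave it alone


print(adjust_beta(0.1, mean_kl=1.0, reward_gain=0.0))
print(adjust_beta(0.1, mean_kl=0.0, reward_gain=0.0))
print(adjust_beta(0.1, mean_kl=0.05, reward_gain=0.01))
```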
K vs steps
Compute budget says you can do N total rollouts. Should you do N/4 steps with K=4, or N/16 steps with K=16? At K=4 you take more, noisier steps. At K=16 you take fewer, cleaner steps. Published recipes sit in the K=16–64 range quoted above. Smaller groups lose a large share of their steps to dead groups; larger groups have diminishing returns from variance reduction.
On-policy vs off-policy
Strictly on-policy (each rollout consumed once, immediately) is the safest. It is also the slowest because rollouts cannot be reused. Production systems run several gradient updates per batch of rollouts; that is "off-policy" reuse and is where the PPO clip earns its keep. Our toy is strictly on-policy.
Temperature
Lower the sampling temperature and you get less group diversity — more degenerate groups. Raise it and you get more garbage rollouts. For verifiable tasks, temperature 0.7–1.0 is typical. The toy uses 1.0.
Difficulty bias of advantage normalization
Dividing by the group standard deviation amplifies signals from low-variance groups. Groups with exactly zero variance contribute nothing, so the amplified ones are the nearly-easy prompts (almost everyone right) and the nearly-hopeless ones (almost everyone wrong), and neither kind is the most informative. Dr.GRPO drops the std denominator to remove this bias; DAPO keeps it but adjusts elsewhere. The toy keeps it, in the interest of staying close to the canonical formulation. Detailed treatment lives in the sibling lesson RL/lessons/14_drgrpo.html.
What modern recipes do
State-of-the-art reasoning models — R1, R1-Zero, Qwen2.5-Math, the o-series sketch — share a recognizable backbone:
- Strong pretrain. A capable base model that already does some arithmetic and code.
- SFT warmup on reasoning-style demonstrations: chain-of-thought traces, code with explanations.
- RLVR with a verifier on math, code, puzzles. Often with curriculum: start on easy problems, raise difficulty as accuracy stabilizes.
- Optional DPO on the side, for general-assistant style and instruction-following niceties.
The RLVR stage is where the largest reasoning-capability gains come from. R1-Zero famously skipped SFT entirely and went straight from base model to RLVR — a stress test for the cold-start argument. It worked because the base model was already strong enough to sometimes succeed; the verifier and the group-relative gradient did the rest.
Looking back at the whole pipeline
You've now seen all five stages. Re-read them as deltas:
- Pretrain. Loss is −Σ_t log p_θ(x_t | x_{<t}) over raw text. Output: a model of P(text).
- SFT. Same loss shape, but mask the prompt: −Σ_{t∈response} log p_θ(x_t | x_{<t}). Output: P(response | prompt).
- CoT. Same loss again. Only the response shape changes — it now includes a reasoning trace before the answer. Output: a model that can spend more tokens on harder problems.
- DPO. Pairwise reformulation: −log σ(β·(logratio_w − logratio_l)). Output: preferences embedded in the policy with a closed-form objective, no reward model.
- RLVR. Drop the static dataset; sample rollouts from πθ; score them with a verifier; weight log-probabilities by group-relative advantages; anchor with KL. Output: a model improved by exploration.
One observation worth pausing on: every loss in the pipeline is, up to a sign and a weighting, a sum of log-probabilities of tokens under the policy with the prompt masked out. Pretraining sets the weights to 1 over all tokens. SFT sets them to 1 over response tokens, 0 elsewhere. CoT keeps the same scheme on richer response sequences. DPO sets them to +β for winner-response tokens and −β for loser-response tokens, with a sigmoid wrapping the difference. RLVR sets them to the per-rollout advantage Ai, sampled fresh, plus a per-token KL term.
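The weighted-sum view can be made concrete with one helper and three weight vectors. This is a sketch with hypothetical per-token log-probabilities; DPO is left out because its sigmoid makes the weights data-dependent, and the RLVR line shows only the policy-gradient part, not the KL term.

```python
def weighted_nll(token_logps, weights):
    """Negative weighted sum of per-token log-probabilities."""
    return -sum(w * lp for w, lp in zip(weights, token_logps))


# Hypothetical sequence: 2 prompt tokens followed by 3 response tokens.
token_logps = [-1.2, -0.7, -0.3, -2.0, -0.5]
is_response = [0.0, 0.0, 1.0, 1.0, 1.0]

pretrain = weighted_nll(token_logps, [1.0] * len(token_logps))
sft = weighted_nll(token_logps, is_response)

advantage = -1.0  # this rollout scored below its group mean
rlvr_pg = weighted_nll(token_logps, [advantage * m for m in is_response])

print("pretrain:", round(pretrain, 3))
print("sft:     ", round(sft, 3))
print("rlvr pg: ", round(rlvr_pg, 3))
```

With a negative advantage the RLVR term has the opposite sign to the SFT loss on the same tokens, so minimizing it lowers the probability of that rollout instead of raising it.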
That's the whole game. Five stages, one underlying machine. The architecture in model.py never knew about any of it.