RLVR: exploration under a verifier
GRPO · DeepSeek-R1

The final stage. A programmatic checker replaces the human rater, fresh samples from the policy replace the static dataset, and a one-line group statistic replaces the value network.

Where DPO ran out of road

Lesson 5 left us with a model that has been pushed toward chosen responses and away from rejected ones under a frozen reference anchor. DPO is elegant — the reward model dissolves algebraically, the loss is closed-form — but it inherits the rigidity of any supervised method. Three concrete gaps:

Static pair distribution. The training signal is bounded by what is in the pair file. DPO can re-rank within the support of (y_w, y_l) seen at training time, but it cannot reward a response style it has never seen.
No exploration. The model never produces samples during DPO training — every gradient comes from log-probabilities of pre-canned strings. There is no mechanism for discovering a new reasoning pattern that scores well; only for tilting probability toward what some human, or some prior model, has already written down.
Wasteful for verifiable tasks. For arithmetic, code, puzzles, or anything with a programmatic correctness oracle, paying humans to compare responses is silly. The oracle is the reward. Spending preference labels on tasks with a ground truth is like hiring a panel of judges to score a stopwatch.

RLVR (Reinforcement Learning with Verifiable Rewards) addresses all three gaps at once: replace the static dataset with on-policy rollouts, replace the human rater with a verifier, and use a simple group statistic in place of any learned value function. The instantiation we will study is GRPO (Group-Relative Policy Optimization), the algorithm behind DeepSeek-R1 and a representative of the recipes associated with the broader o1-style reasoning-model family.

One-sentence reading of the whole pipeline so far

Pretrain learns P(text). SFT carves out P(response | prompt). CoT injects reasoning tokens to buy serial compute. DPO tilts P(response | prompt) using pairwise preferences with a frozen KL anchor. RLVR draws fresh samples, scores them with a verifier, and pushes the policy uphill in expected reward — still under a KL anchor.

The setting in five symbols

Three of the symbols carry over verbatim from lesson 5; two are new.

π_θ — the policy we are training. Same MiniGPT, same parameters, same forward pass.
π_ref — the frozen reference. As in DPO, this is the SFT-warmed-up checkpoint at the moment RLVR starts. requires_grad_(False). It anchors the KL term.
x — a prompt, here a chat-templated arithmetic problem such as <3+4+5>.
y — a response sampled from π_θ(·|x) token-by-token. New every step.
R(x, y) — the verifier. A function from (x, y) to ℝ that returns 1 if the response decodes to the gold answer and 0 otherwise. No learning, no parameters, just code:

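import re

# Assumed pattern; the toy file's own _DIGITS_RE definition is not shown in this lesson.
_DIGITS_RE = re.compile(r"\d+")

def verify(response_text: str, gold: int) -> float:
    matches = _DIGITS_RE.findall(response_text)
    if not matches:
        return 0.0
    return 1.0 if int(matches[-1]) == gold else 0.0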

For math the verifier parses the last integer in the response and compares to the gold sum. For code it would run unit tests. For puzzles it would diff against a known solution. The defining property is that the verifier is cheap, faithful, and noiseless — it never gives partial credit it does not believe in, and it never disagrees with itself. That is what makes it a strictly better signal than a reward model trained on preferences.

Why a verifier dominates a reward model when you have one

A reward model is a finite-capacity neural network trained on a finite preference dataset. It hallucinates. It is biased toward styles its raters preferred. It can be reward-hacked: the policy finds adversarial responses that score high under the model and low under any human. A verifier has none of these failure modes within its domain — it cannot be fooled by a response that does not actually parse to the gold answer. The cost of that strength is generality: a verifier only exists where correctness is programmatically checkable.

GRPO in six lines

Here is the entire algorithm. Walk through it once now; we will dissect each line below.

1. Sample a prompt    x  ~ D
2. Roll out K samples y_1, …, y_K  ~  π_θ(·|x)
3. Verify each        r_i  =  R(x, y_i)        ∈ ℝ
4. Group advantage    A_i  =  (r_i − mean_j r_j) / (std_j r_j + ε)
5. PG loss            L_PG =  − (1/K) Σ_i  A_i · log π_θ(y_i | x)
6. KL anchor          L_KL =   β · (1/K) Σ_i  KL( π_θ ‖ π_ref )   over response tokens
   Total              L    =  L_PG + L_KL

That is the whole loop. One optimizer step per group of K rollouts. The toy implementation in 04_rlvr.py uses K=4 and β=0.05 for 800 steps; production recipes use K=16…64 and tens of thousands of steps.
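The six steps can be sketched end to end on a deliberately tiny stand-in for the policy: a softmax over four candidate answers rather than a token-level MiniGPT. Everything named here (`ANSWERS`, `grpo_step`, the learning rate, the uniform reference) is an assumption made for illustration, and the gradient is written out by hand instead of using autograd, so treat it as a sketch of the loop's shape rather than the contents of 04_rlvr.py.

```python
import math
import random

ANSWERS = [10, 11, 12, 13]   # hypothetical answer vocabulary for the prompt <3+4+5>
GOLD = 12

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grpo_step(logits, ref_logits, rng, K=4, beta=0.05, lr=0.3, eps=1e-6):
    """One synchronous GRPO update on a softmax policy over ANSWERS (in place)."""
    probs, ref_probs = softmax(logits), softmax(ref_logits)
    # Steps 1-3: roll out K samples from the current policy and verify each one.
    picks = rng.choices(range(len(ANSWERS)), weights=probs, k=K)
    rewards = [1.0 if ANSWERS[i] == GOLD else 0.0 for i in picks]
    # Step 4: group-relative advantage, using the population std of the group.
    mean = sum(rewards) / K
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / K)
    if std < 1e-8:
        return False                      # degenerate group: skip the update
    advs = [(r - mean) / (std + eps) for r in rewards]
    # Steps 5-6: gradient of L_PG + L_KL with respect to the logits.
    grad = [0.0] * len(logits)
    for i, adv in zip(picks, advs):
        ratio = ref_probs[i] / probs[i]   # e^r with r = log pi_ref - log pi_theta
        # d/d(log pi) of [-adv * log pi + beta * (e^r - r - 1)], averaged over K.
        coeff = (-adv + beta * (1.0 - ratio)) / K
        for j in range(len(logits)):
            # d log pi(y_i) / d logit_j = 1[j == i] - pi_j
            grad[j] += coeff * ((1.0 if j == i else 0.0) - probs[j])
    for j in range(len(logits)):
        logits[j] -= lr * grad[j]
    return True

rng = random.Random(0)
ref_logits = [0.0, 0.0, 0.0, 0.0]        # frozen reference: uniform over ANSWERS
logits = list(ref_logits)                 # policy starts at the reference
live = sum(grpo_step(logits, ref_logits, rng) for _ in range(200))
p_gold = softmax(logits)[ANSWERS.index(GOLD)]
print(f"live steps: {live}/200, P(gold) = {p_gold:.3f}")
```

In this sketch the policy mass on the gold answer should climb well above its starting value of 0.25, and the number of live steps falls short of 200 because groups with identical rewards are skipped.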

Why group-relative — the disappearance of the value network

The score-function gradient is

∇_θ J(θ) = 𝔼_{y∼π_θ} [ R(x, y) · ∇_θ log π_θ(y|x) ].

This is the formula from lesson 1, lifted to sequences. Estimated naively (one sample), it has crushing variance: a single response touches hundreds of tokens, and a binary reward gives you one bit of signal scattered across all of them. The standard cure is a baseline b(x). Subtracting any function of x alone from the reward leaves the gradient unbiased:

𝔼_{y∼π_θ} [ b(x) · ∇_θ log π_θ(y|x) ] = b(x) · ∇_θ 𝔼_y[1] = b(x) · ∇_θ 1 = 0,

because the policy distribution integrates to one no matter θ. So the centred gradient

∇_θ J = 𝔼_{y∼π_θ} [ ( R(x, y) − b(x) ) · ∇_θ log π_θ(y|x) ]

is unbiased for any b that does not depend on y. PPO-style RLHF realizes b(x) as a learned value head — a second copy of the trunk with a scalar projection — trained by regressing returns. GRPO realizes b(x) as the sample mean of K rollouts for the same prompt:

b̂(x) = (1/K) Σ_j=1..K r_j.

This is a Monte-Carlo estimate of 𝔼_{y∼π_θ}[R(x,y)]. Its expectation depends only on x (the rollouts are i.i.d. given x), so the baseline argument applies up to the finite-K caveat discussed under "A subtlety" below. It costs zero extra parameters, has zero warmup, and cannot go stale relative to the policy because it is computed from the policy itself, this step. The standard-deviation denominator

A_i = ( r_i − b̂(x) ) / ( σ̂(x) + ε )

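is a data-dependent, per-group rescaling. It never flips the sign of an advantage, but it does reweight groups against one another, so it is not a bias-free change (the normalization-bias trade-off later in this lesson returns to this). What it buys is a loss that is invariant to reward shifts and, up to ε, to multiplicative reward rescaling. Switch your verifier from {0,1} to {0,10} and the optimizer behaves essentially identically.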
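The shift and scale invariance can be checked directly. This is a minimal sketch assuming the population standard deviation and a hypothetical helper name `group_advantages`; the only thing that keeps the three results from matching exactly is the small ε in the denominator.

```python
import math

def group_advantages(rewards, eps=1e-6):
    """A_i = (r_i - mean) / (std + eps), using the population std of the group."""
    k = len(rewards)
    mean = sum(rewards) / k
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / k)
    return [(r - mean) / (std + eps) for r in rewards]

base = group_advantages([0.0, 1.0, 1.0, 0.0])
scaled = group_advantages([0.0, 10.0, 10.0, 0.0])    # verifier switched to {0, 10}
shifted = group_advantages([5.0, 6.0, 6.0, 5.0])     # every reward shifted by +5
print(base)
print(scaled)
print(shifted)
```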

The cost of going group-relative: K forward passes per prompt instead of one. That is the basic GRPO-vs-PPO trade. PPO pays in extra parameters and a value-target signal; GRPO pays in extra compute per prompt. For verifiable tasks with sparse terminal reward — where the value head is hardest to train anyway — the GRPO trade is the better one.
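The baseline argument from earlier in this section rests on the expected score being zero. A quick numerical check, assuming for illustration that the whole response is a single draw from a softmax over three options (the logits are arbitrary and the function names are hypothetical):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_score(logits):
    """Exact E_y[ d log pi(y) / d logits ] for a softmax policy; should be all zeros."""
    probs = softmax(logits)
    n = len(logits)
    out = [0.0] * n
    for y in range(n):                      # sum over every possible response y
        for j in range(n):
            score_j = (1.0 if j == y else 0.0) - probs[j]
            out[j] += probs[y] * score_j
    return out

print(expected_score([0.3, -1.2, 2.0]))
```

Each component comes out as zero up to floating-point rounding, so multiplying the score by any b(x) that does not depend on y adds nothing to the expected gradient.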

A subtlety

Strictly, A_i uses r_i in its own baseline through the group mean, so the estimator is biased at finite K. The bias has a simple form: r_i − mean_j r_j = ((K−1)/K) · (r_i − mean_{j≠i} r_j), so before the std division the in-group version is the leave-one-out estimator shrunk by a factor (1 − 1/K). That is an O(1/K) effect that vanishes asymptotically and is largely absorbed by the learning rate. The strictly unbiased fix, using the mean of the other K−1 rollouts, is RLOO (Leave-One-Out), and it costs the same compute. The DeepSeek papers use the in-group mean and accept the bias.
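The relationship between the in-group baseline and the leave-one-out baseline is easy to check numerically. A small sketch with made-up rewards and hypothetical helper names, before any std normalization:

```python
def mean_centered(rewards):
    """r_i minus the mean of the whole group (the GRPO baseline)."""
    k = len(rewards)
    mean = sum(rewards) / k
    return [r - mean for r in rewards]

def leave_one_out(rewards):
    """r_i minus the mean of the other K - 1 rewards (the RLOO baseline)."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

rewards = [1.0, 0.0, 0.0, 1.0, 1.0]
k = len(rewards)
# Largest deviation from: mean_centered_i == (K - 1) / K * leave_one_out_i
gap = max(abs(a - (k - 1) / k * b)
          for a, b in zip(mean_centered(rewards), leave_one_out(rewards)))
print(gap)
```

The gap is zero up to rounding: the two centred rewards differ only by the constant factor (K−1)/K.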

A worked group, by hand

Suppose we draw K=4 rollouts for prompt <3+4+5> (gold = 12). The verifier returns:

r₁ = 1 r₂ = 1 r₃ = 0 r₄ = 0

Mean is 0.5. The population standard deviation is also 0.5. Advantages:

A = ( 1 − 0.5, 1 − 0.5, 0 − 0.5, 0 − 0.5 ) / ( 0.5 + ε ) ≈ ( +1, +1, −1, −1 ).

Half the group gets pushed up, half pushed down. The policy gradient is now centred: the optimizer will increase log-probability of y₁ and y₂, decrease log-probability of y₃ and y₄, by exactly the magnitude their advantage prescribes. This is the entire learning signal.
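One practical wrinkle with the hand calculation: it uses the population standard deviation (divide by K), whereas `torch.Tensor.std()` applies Bessel's correction (divide by K − 1) unless told otherwise. The sketch below uses only the standard library to show how much the two conventions differ on this group; whether the toy file intends one or the other is not stated in this lesson.

```python
import statistics

rewards = [1.0, 1.0, 0.0, 0.0]
mean = statistics.fmean(rewards)

pop_std = statistics.pstdev(rewards)     # divides by K
sample_std = statistics.stdev(rewards)   # divides by K - 1 (Bessel's correction)

adv_pop = [(r - mean) / pop_std for r in rewards]
adv_sample = [(r - mean) / sample_std for r in rewards]
print(adv_pop)
print(adv_sample)
```

With the population convention the advantages are ±1 as in the worked example; with the Bessel-corrected convention they shrink to roughly ±0.87. The signs, and therefore the direction of the update, are the same either way.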

The KL anchor — why we keep π_ref around

Without a KL penalty, the policy is free to walk wherever the reward gradient points. Sounds desirable. It is not. Verifiers have edge cases. The arithmetic verifier in our toy parses the last integer in the response — so a policy that learns to emit the string "42#" always, and ignores the prompt, will get reward 1 on every problem whose gold answer happens to be 42, and reward 0 elsewhere. Across a uniform-difficulty stream of two- and three-summand problems, this is a winning strategy compared to most random outputs. A pure reward-maximizing policy with no anchor will find it. We are then proudly training a model that thinks the universal answer is 42 and have lost everything SFT gave us.

The KL term against the frozen π_ref stops this. It says: at every token, your distribution over next tokens has to stay near the SFT model's. Local exploration is fine; a wholesale collapse onto a degenerate sequence is not.

The KL is computed only over response tokens, not the prompt. The prompt tokens are given, not sampled from π_θ, so they are not actions the policy took in this rollout; penalizing the two models' disagreement on them would regularize something RLVR never optimizes. Masking them out keeps the KL term aligned with the policy-gradient term. In the code:

target_mask = torch.zeros((K, T - 1), dtype=torch.float32, device=device)
target_mask[:, P - 1 : P - 1 + resp_mask.size(1)] = resp_mask

P is the prompt length; the mask is 1 only at target positions corresponding to response tokens that are not past-EOS padding.
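The off-by-one in the slice is the part that is easy to get wrong, so here is a single-rollout, pure-Python analogue of the two tensor lines above. The function name and the example sizes are hypothetical; the indexing convention (the token at position t is predicted from position t − 1) is the usual next-token setup and is assumed to match the toy.

```python
def build_target_mask(P, T, resp_mask):
    """Mask over the T - 1 next-token targets; 1.0 only where a response token is predicted.

    P: prompt length, T: total padded sequence length,
    resp_mask: 1.0 for real response tokens, 0.0 for past-EOS padding.
    """
    target_mask = [0.0] * (T - 1)
    # The token at position t is predicted from position t - 1, so the first
    # response token (position P) lands at target index P - 1.
    for i, keep in enumerate(resp_mask):
        target_mask[P - 1 + i] = keep
    return target_mask

# Hypothetical rollout: 4 prompt tokens, 3 response slots, the last one is padding.
mask = build_target_mask(P=4, T=7, resp_mask=[1.0, 1.0, 0.0])
print(mask)
```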

Schulman's k3 KL estimator

How do we actually compute the KL? We have access to per-token log-probabilities log π_θ(y_t|·) and log π_ref(y_t|·) from the two model forwards. The naive estimator is

KL̂_naive = log π_θ − log π_ref,

which is unbiased — its expectation under y ∼ π_θ is exactly KL(π_θ ‖ π_ref) — but signed and high-variance. Per token it can be negative, which is uncomfortable both interpretationally and as a loss term you might want to clip or log. Schulman's "k3" estimator fixes this. Let r = log π_ref − log π_θ. Then

KL̂_k3 = e^r − r − 1.

Two properties to verify:

Nonnegativity per token. The function f(r) = e^r − r − 1 has minimum 0 at r=0, since f′(r) = e^r − 1 vanishes there and f″(r) = e^r > 0 is positive everywhere. So f ≥ 0 for all real r, with equality iff π_θ = π_ref at that token.
Unbiased in expectation. Under y ∼ π_θ, 𝔼[e^r] = Σ_y π_θ(y) · π_ref(y)/π_θ(y) = Σ_y π_ref(y) = 1. So 𝔼[KL̂_k3] = 1 − 𝔼[r] − 1 = −𝔼[r] = 𝔼[log π_θ − log π_ref] = KL(π_θ ‖ π_ref). The expectations of the naive and k3 estimators agree exactly.

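So k3 trades nothing in expectation for always-nonnegative samples. When π_θ stays close to π_ref, which is the regime the anchor is meant to enforce, k3 also has much lower variance than the naive estimator. That advantage is not guaranteed when the two distributions drift far apart, because the e^r term can become large.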

The toy implementation is two Python lines:

r = ref_logp - pol_logp
kl_per_tok = torch.exp(r) - r - 1.0     # >= 0 elementwise
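How much variance k3 saves depends on how far π_θ has moved from π_ref. The sketch below computes the exact per-token variance of both estimators under y ∼ π_θ for two made-up pairs of next-token distributions, one close together and one far apart. The distributions and the function name are assumptions for illustration only.

```python
import math

def estimator_variances(pi_theta, pi_ref):
    """Exact variance under y ~ pi_theta of the naive and k3 per-token estimators."""
    naive, k3 = [], []
    for p, q in zip(pi_theta, pi_ref):
        r = math.log(q) - math.log(p)           # r = log pi_ref - log pi_theta
        naive.append(-r)
        k3.append(math.exp(r) - r - 1.0)

    def variance(samples):
        mean = sum(p * s for p, s in zip(pi_theta, samples))
        return sum(p * (s - mean) ** 2 for p, s in zip(pi_theta, samples))

    return variance(naive), variance(k3), min(k3)

# Hypothetical distributions: one pair close together, one pair far apart.
close = estimator_variances([0.7, 0.2, 0.1], [0.5, 0.3, 0.2])
far = estimator_variances([0.98, 0.01, 0.01], [0.01, 0.01, 0.98])
print("close:", close)
print("far:  ", far)
```

In the close pair k3 has far lower variance than the naive estimator; in the far pair the ordering reverses, because a rare token with a large π_ref/π_θ ratio dominates. The smallest k3 sample is nonnegative in both cases.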

Debugging gotcha — sign of r matters

We are computing KL(π_θ ‖ π_ref) with samples from π_θ. The unbiased k3 form uses r = log π_ref − log π_θ. Flip the sign by accident and you get the k3 estimator for KL(π_ref ‖ π_θ) evaluated on samples from the wrong distribution. The result is still nonnegative, so nothing looks broken, but it is a biased estimate of the KL you meant to penalize, and its exponential term becomes π_θ/π_ref, which blows up on tokens the policy favors and the reference finds unlikely. Training can look fine for a few steps and then destabilize. The README of gpt_mini lists this as the canonical RLVR gotcha for a reason.
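A cheap guard against the sign mistake is to compare the estimator's exact expectation with the true KL on a tiny distribution where both can be enumerated. This sketch uses made-up three-token distributions and a hypothetical `flipped` flag to reproduce the bug on purpose.

```python
import math

def k3_expectation(pi_theta, pi_ref, flipped=False):
    """Exact E_{y ~ pi_theta}[e^r - r - 1]; flipped=True uses r = log pi_theta - log pi_ref."""
    total = 0.0
    for p, q in zip(pi_theta, pi_ref):
        r = math.log(q) - math.log(p)       # correct sign: log pi_ref - log pi_theta
        if flipped:
            r = -r                          # the bug described above
        total += p * (math.exp(r) - r - 1.0)
    return total

pi_theta, pi_ref = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2]
true_kl = sum(p * math.log(p / q) for p, q in zip(pi_theta, pi_ref))
correct = k3_expectation(pi_theta, pi_ref)
wrong = k3_expectation(pi_theta, pi_ref, flipped=True)
print(true_kl, correct, wrong)
```

The correctly signed estimator matches KL(π_θ ‖ π_ref) to rounding error; the flipped one lands on a different number while still being positive, which is why the bug does not announce itself.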

Why REINFORCE-style here — no PPO clipping

Real GRPO, as published, wraps the policy-gradient term in PPO-style importance-ratio clipping. The reason is off-policy reuse: in production, rollouts are produced by an inference engine on one set of GPUs while the trainer updates the policy on another set. Between the moment a rollout is sampled (using policy weights π_old) and the moment it contributes to a gradient (using the current weights π_θ), the policy has moved. The two distributions differ; the rollout is no longer on-policy. The PPO clip bounds the resulting update.

Our toy in 04_rlvr.py sidesteps all of that. The loop is strictly synchronous: roll out → score → update → next prompt. The rollout policy is the current policy, so the importance ratio π_θ(y|x) / π_old(y|x) = 1 by construction and the clip is inactive. We skip the clip term for clarity; it would not change a single gradient. In a multi-epoch or asynchronous setting you'd insert three lines:

# where it would go, in the loss expression:
ratio = (pol_logp - old_logp.detach()).exp()            # importance ratio
clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)      # PPO-clip
pg_loss = -torch.min(ratio * adv, clipped * adv).mean()

Everything else — group-relative advantage, KL anchor, k3 estimator, mask over response tokens — is unchanged.
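To see why the clip is inert on-policy and what it does once the ratio drifts, here is a scalar version of the same expression. The clip width `eps=0.2` is an assumed, commonly used value, not one taken from the toy, and the function name is hypothetical.

```python
def clipped_surrogate(ratio, adv, eps=0.2):
    """Per-sample PPO-clip objective: min(ratio * adv, clip(ratio, 1-eps, 1+eps) * adv)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * adv, clipped * adv)

# On-policy: ratio == 1, so the objective is just the advantage and the clip is inert.
on_policy = (clipped_surrogate(1.0, adv=+1.0), clipped_surrogate(1.0, adv=-1.0))
# Off-policy: the clip caps how much credit a drifted ratio can claim.
off_policy = (clipped_surrogate(1.5, adv=+1.0), clipped_surrogate(0.5, adv=-1.0))
print(on_policy, off_policy)
```

With ratio 1.5 and a positive advantage the objective is capped at 1.2 instead of 1.5; with ratio 0.5 and a negative advantage it is held at −0.8 instead of −0.5, so a sample the policy has already moved away from cannot keep pushing it further.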

Degenerate groups — when the step does nothing

If all K rollouts in a group receive the same reward, then σ̂ = 0 and every r_i − mean = 0. The advantage vector is all zeros. The PG loss vanishes; only the KL term remains; the gradient is purely "stay close to π_ref", which is uninformative. The toy file detects this and short-circuits:

if rewards.std().item() < 1e-8:
    return 0.0, rewards.mean().item(), 0.0      # skip the optimizer step
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

Degenerate groups are not a bug — they are a fact of the training dynamics. They show up in two regimes:

Early. A bad policy fails on most prompts. For an arithmetic problem with gold answer 14, all four rollouts say "7#", "3#", "99#", "5#" — all wrong. Rewards are (0,0,0,0). No signal.
Late. A mastered policy succeeds on most prompts. All four rollouts say "14#". Rewards are (1,1,1,1). No signal — there is nothing to differentiate by.

The interesting middle is where some rollouts succeed and some fail, and the centred advantage tells the policy which sample style worked. The dead-group rate is a useful real-time diagnostic. If it stays near 100% from the start, you have either a hopeless initial policy (mean reward near 0) or a verifier that everyone passes (mean reward near 1). If it falls and then climbs back toward 100% while mean reward is high, you have either converged or run out of room to grow.
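Under the simplifying assumption that each rollout is independently correct with probability p, the dead-group rate has a closed form, p^K + (1−p)^K. A short sketch tabulating it for a few group sizes (the function name and the chosen values of p are illustrative):

```python
def dead_group_rate(p, K):
    """P(all K rewards equal) when each rollout is independently correct with probability p."""
    return p ** K + (1.0 - p) ** K

for K in (2, 4, 8, 16):
    rates = [round(dead_group_rate(p, K), 3) for p in (0.05, 0.5, 0.95)]
    print(K, rates)
```

The rate is symmetric in p and 1 − p, which is the early/late picture above: weak and mastered policies both starve the optimizer, and larger K only helps in between.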

Interactive · a GRPO rollout group, step by step

The widget below simulates one full GRPO step at a time. The "policy correctness" parameter p is a 1-D stand-in for the policy's competence: rollouts are sampled to be correct with probability p. Hit Resample to draw a fresh group; hit Step to apply one fake gradient update — the step nudges p up by an amount proportional to Σ A_i · [correct_i] (the empirical group signal), tempered by the KL coefficient β (larger β means slower adaptation). The "mean reward EMA" line chart and the dead-group counter give you a feel for how RLVR actually unfolds.

GRPO rollout group simulator · prompt = "<3+4+5>" · gold = 12

Each row is one of the K rollouts. Green chip = verifier returned 1. The advantages are computed live from the K rewards. Step takes a fake gradient step; over time the policy correctness parameter climbs and dead groups drop, then come back as the policy nears mastery.

Readouts: mean(r), std(r), degenerate?, policy p (initially 0.20), steps, mean reward EMA, dead-group rate, KL / rollout. Controls: K (default 4), β (default 0.05).

Play with three things and notice the dynamics:

Hold β = 0 and step repeatedly. The policy correctness climbs fast — and you may notice it crosses 1.0 (impossible for a real policy) because there is nothing pulling it back. That is the collapse failure mode.
Crank β high. Each step barely moves the policy. The mean reward EMA grows linearly instead of accelerating. Strong anchor, slow learning.
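Reduce K to 2. Dead-group rate spikes: with independent rollouts it is p² + (1−p)² ≥ 0.5, so at least half your steps do nothing. Increase to K=8 and the rate plummets, at the price of proportionally more rollout compute per step.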

What β and K actually control

Knob	Small	Large	Where it bites
β	weak anchor	strong anchor	Small → faster adaptation but risk of collapse onto a verifier exploit. Large → slow but stable. Typical: 0.01–0.1.
K	high-variance advantage; many dead groups	low variance, K× compute	Small → step contributes nothing on dead groups, gradients noisy. Large → reliable advantage estimate. Papers use 16–64.
temperature	greedy, low entropy	diverse, exploratory	Too low collapses the group (every sample is the same response); too high makes everything garbage. Match to expected response length.
learning rate	safe	fast but unstable	RLVR LR is typically 10×–100× smaller than SFT LR — the gradient is already a high-variance Monte Carlo estimate; piling LR on top makes it worse. Toy uses 1e-5.

Warmup matters — the cold-start problem

RL needs a base that occasionally succeeds. A randomly initialized policy emits gibberish: for the toy task it would produce strings like "+5_##" that the verifier can't even parse. Every reward is 0. Every group is degenerate. Every gradient is zero. Training cannot start.

That is why 04_rlvr.py runs a brief SFT warmup before the RLVR loop:

print("── SFT warmup (so the initial policy produces parseable output) ──")
sft = train_warmup(tok, T_max=T_max, steps=600, B=64, device=device)
policy = copy.deepcopy(sft)
ref    = sft

After 600 steps of warmup the policy reliably emits digits + #. The verifier can now grade it. Some answers are right, some wrong — a productive split. The RLVR loop starts there. π_ref is set to the same warmed-up model and frozen.

This is a general lesson about RL post-training: you need a base that occasionally succeeds, otherwise the reward signal is too sparse to bootstrap. Real R1 starts from a strong base model. Our 4-layer toy starts from a tiny SFT. Same principle. The "Zero" variants (R1-Zero, etc.) demonstrate that with a sufficiently capable base and a verifier, you can skip explicit SFT entirely — but you can never skip the implicit prior that the base provides.

A pragmatic test

Before you start an RLVR run, sample 50 rollouts from your base model on prompts from your training distribution and pass them through the verifier. If fewer than ~5% pass, you do not have enough signal to bootstrap — go back and SFT more. If more than ~95% pass, your verifier is too weak or your task is too easy and RLVR will not teach the model much. The productive band is 10%–80% pass rate.
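The check is simple enough to script. A minimal sketch, assuming you already have a list of verifier outputs for the sampled rollouts; the function name, the verdict strings, and the "borderline" label for pass rates that fall between the stated thresholds and the productive band are additions for illustration.

```python
def bootstrap_check(rewards, low=0.05, high=0.95, band=(0.10, 0.80)):
    """Classify a base model's verifier pass rate using the thresholds from the text."""
    pass_rate = sum(1 for r in rewards if r > 0) / len(rewards)
    if pass_rate < low:
        verdict = "too sparse: SFT more before RLVR"
    elif pass_rate > high:
        verdict = "too easy: verifier or task gives little to learn"
    elif band[0] <= pass_rate <= band[1]:
        verdict = "productive band"
    else:
        verdict = "borderline: usable, but expect many dead groups"
    return pass_rate, verdict

# Hypothetical verifier outputs for 50 rollouts: 12 passes, 38 failures.
print(bootstrap_check([1.0] * 12 + [0.0] * 38))
```

With only 50 rollouts the estimate is coarse (one rollout is two percentage points), so a result near a threshold is worth re-sampling before acting on it.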

DPO vs RLVR — same anchor, different oracle

DPO and RLVR share more than they differ. Both keep a frozen π_ref as a KL anchor. Both train π_θ via a weighted log-probability objective. Both inherit their initial weights from an SFT checkpoint. The difference is in the oracle they consult and the dataset shape that follows from it.

	DPO (Lesson 5)	RLVR (this lesson)
Dataset shape	static (x, y_w, y_l)	x only; ys sampled fresh from π_θ
Oracle	human (or model) preference	programmatic verifier
Number of samples per x	2 (the pair)	K (typically 16–64)
Exploration	none — re-ranks within pair distribution	yes — model generates novel responses
Gradient flavor	closed-form, supervised-like	Monte Carlo policy gradient
KL anchor	implicit (β·log(π_θ/π_ref) margin)	explicit (KL term added to loss)
Failure mode	over-fits the pairs; can't go beyond them	reward hacking; KL collapse without anchor
Best for	general assistant tuning, style	math, code, puzzles, reasoning

Modern recipes use both, often in sequence: DPO for assistant style, then RLVR for hard reasoning targets. The choice is not ideological; it is "do you have a verifier?". If yes, RLVR. If no, DPO (or RLHF with a learned reward model).

Trade-offs you have to think about

Reward hacking

Any time you replace a goal ("be a good math solver") with a proxy ("pass this verifier"), you invite Goodhart's law. The arithmetic verifier in 04_rlvr.py parses the last integer. A policy could learn to emit "5+7=12; the answer is 12" for prompts whose gold is 12, which is fine, but it could also learn to emit a long reasoning trace whose intermediate steps are wrong and whose final integer happens to be right. The verifier cannot tell the difference. For code, a policy can write tests-passing code that fails on every input not in the tests. Mitigations include test-suite breadth, multiple verifiers, and the KL anchor itself: staying close to a sensible-looking reference makes exotic exploits less reachable.

Verifier strength

The verifier is the ceiling on what RLVR can teach. If the verifier accepts wrong-but-confident answers, the policy will learn to give them. If the verifier rejects right-but-unusual answers, the policy will learn to avoid them. For math, "compare to the gold integer" is rock-solid. For code, "pass these unit tests" is solid only if the tests cover edge cases. For reasoning, "agree with the answer key" works for short answers but breaks for open-ended outputs — which is why preferences (DPO, RLHF) still dominate for those.

KL coefficient tuning

Too small and the policy collapses; too large and it never moves. There is no principled value of β; it depends on the geometry of the verifier landscape near π_ref. A practical procedure: start at β=0.1, monitor mean KL per rollout, and tune. If KL grows unboundedly, increase β. If KL stays at zero and reward does not move, decrease β. The toy uses β=0.05, which is on the looser side appropriate for a small model with a forgiving task.

K vs steps

Compute budget says you can do N total rollouts. Should you do N/4 steps with K=4, or N/16 steps with K=16? At K=4 you take more, noisier steps. At K=16 you take fewer, cleaner steps. Published DeepSeek configurations sit in the K=16 to 64 range quoted earlier. Smaller groups lose a large share of their steps to dead groups; larger groups see diminishing returns from variance reduction.

On-policy vs off-policy

Strictly on-policy (each rollout consumed once, immediately) is the safest. It is also the slowest because rollouts cannot be reused. Production systems run several gradient updates per batch of rollouts; that is "off-policy" reuse and is where the PPO clip earns its keep. Our toy is strictly on-policy.

Temperature

Lower the sampling temperature and you get less group diversity — more degenerate groups. Raise it and you get more garbage rollouts. For verifiable tasks, temperature 0.7–1.0 is typical. The toy uses 1.0.

Difficulty bias of advantage normalization

Dividing by the group standard deviation amplifies signals from low-variance groups. The lowest-variance live groups are the easy ones (almost everyone right) and the hardest ones (almost everyone wrong), and neither is the most informative. Dr.GRPO drops the std denominator to remove this bias; DAPO keeps it but adjusts elsewhere. The toy keeps it, in the interest of staying close to the canonical formulation. Detailed treatment lives in the sibling lesson reinforcement_learning/lessons/35_drgrpo.html.
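The amplification is easy to see on two made-up groups of K=8: one evenly split and one where a single rollout succeeds. The sketch compares the advantage of a correct rollout with and without the std denominator. It assumes the population standard deviation, and `group_advantages` is a hypothetical helper name.

```python
import math

def group_advantages(rewards, eps=1e-6):
    """A_i = (r_i - mean) / (std + eps), using the population std of the group."""
    k = len(rewards)
    mean = sum(rewards) / k
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / k)
    return [(r - mean) / (std + eps) for r in rewards]

balanced = [1.0] * 4 + [0.0] * 4       # 4 of 8 rollouts correct
lopsided = [1.0] * 1 + [0.0] * 7       # 1 of 8 rollouts correct (a hard prompt)

for name, rewards in (("balanced", balanced), ("lopsided", lopsided)):
    with_std = group_advantages(rewards)[0]
    mean_only = rewards[0] - sum(rewards) / len(rewards)
    print(f"{name}: std-normalized {with_std:.3f}, mean-centred only {mean_only:.3f}")
```

Dividing by the std doubles the correct rollout's weight in the balanced group but roughly triples it in the lopsided one, which is the reweighting toward low-variance groups described above.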

What modern recipes do

State-of-the-art reasoning models (R1, R1-Zero, Qwen2.5-Math, and, as far as public sketches go, the o-series) share a recognizable backbone:

Strong pretrain. A capable base model that already does some arithmetic and code.
SFT warmup on reasoning-style demonstrations: chain-of-thought traces, code with explanations.
RLVR with a verifier on math, code, puzzles. Often with curriculum: start on easy problems, raise difficulty as accuracy stabilizes.
Optional DPO on the side, for general-assistant style and instruction-following niceties.

The RLVR stage is where the largest reasoning-capability gains come from. R1-Zero famously skipped SFT entirely and went straight from base model to RLVR — a stress test for the cold-start argument. It worked because the base model was already strong enough to sometimes succeed; the verifier and the group-relative gradient did the rest.

Looking back at the whole pipeline

You've now seen all five stages. Re-read them as deltas:

Pretrain. Loss is −Σ_t log p_θ(x_t|x_<t) over raw text. Output: a model of P(text).
SFT. Same loss shape, but mask the prompt: −Σ_t∈response log p_θ(x_t|x_<t). Output: P(response | prompt).
CoT. Same loss again. Only the response shape changes — it now includes a reasoning trace before the answer. Output: a model that can spend more tokens on harder problems.
DPO. Pairwise reformulation: −log σ(β·(logratio_w − logratio_l)). Output: preferences embedded in the policy with a closed-form objective, no reward model.
RLVR. Drop the static dataset; sample rollouts from π_θ; score them with a verifier; weight log-probabilities by group-relative advantages; anchor with KL. Output: a model improved by exploration.

One observation worth pausing on: every loss in the pipeline is, up to a sign and a weighting, a sum of log-probabilities of tokens under the policy, with the prompt tokens masked out once there is a prompt. Pretraining has no prompt and sets the weights to 1 over all tokens. SFT sets them to 1 over response tokens, 0 elsewhere. CoT keeps the same scheme on richer response sequences. DPO sets them to +β for winner-response tokens and −β for loser-response tokens, with a sigmoid wrapping the difference. RLVR sets them to the per-rollout advantage A_i, sampled fresh, plus a per-token KL term.

That's the whole game. Five stages, one underlying machine. The architecture in model.py never knew about any of it.

Closing takeaway

RLVR = rollouts + verifier + grouped advantages + KL anchor. No value network, no reward model, no human preferences — just an oracle that says correct or incorrect and K samples whose relative performance defines the signal. The policy improves through exploration under reward pressure, anchored to π_ref so it cannot run away from itself. Everything past this point in the literature — RLOO, DAPO, Dr.GRPO, off-policy variants, curriculum schedulers, length normalization — is a tweak to one of these four ingredients. The recipe is the lesson.