
RLVR — exploration under a verifier GRPO · DeepSeek-R1

The final stage. A programmatic checker replaces the human rater, fresh samples from the policy replace the static dataset, and a one-line group statistic replaces the value network.

Where DPO ran out of road

Lesson 5 left us with a model that has been pushed toward chosen responses and away from rejected ones under a frozen reference anchor. DPO is elegant — the reward model dissolves algebraically, the loss is closed-form — but it inherits the rigidity of any supervised method. Three concrete gaps:

  1. Static pair distribution. The training signal is bounded by what is in the pair file. DPO can re-rank within the support of the (y_w, y_l) pairs seen at training time, but it cannot reward a response style it has never seen.
  2. No exploration. The model never produces samples during DPO training — every gradient comes from log-probabilities of pre-canned strings. There is no mechanism for discovering a new reasoning pattern that scores well; only for tilting probability toward what some human, or some prior model, has already written down.
  3. Wasteful for verifiable tasks. For arithmetic, code, puzzles, or anything with a programmatic correctness oracle, paying humans to compare responses is silly. The oracle is the reward. Spending preference labels on tasks with a ground truth is like hiring a panel of judges to score a stopwatch.

RLVR (Reinforcement Learning with Verifiable Rewards) addresses all three problems at once: replace the static dataset with on-policy rollouts, replace the human rater with a verifier, and use a simple group statistic in place of any learned value function. The instantiation we will study is GRPO (Group-Relative Policy Optimization), the algorithm behind DeepSeek-R1 and a common choice for open reasoning models in the o1 style.

One-sentence reading of the whole pipeline so far
Pretrain learns P(text). SFT carves out P(response | prompt). CoT injects reasoning tokens to buy serial compute. DPO tilts P(response | prompt) using pairwise preferences with a frozen KL anchor. RLVR draws fresh samples, scores them with a verifier, and pushes the policy uphill in expected reward — still under a KL anchor.

The setting in five symbols

Three of the symbols carry over verbatim from lesson 5; two are new.

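import re

# Assumed pattern (the definition is not shown in this excerpt): runs of decimal digits.
_DIGITS_RE = re.compile(r"\d+")

def verify(response_text: str, gold: int) -> float:
    matches = _DIGITS_RE.findall(response_text)
    if not matches:
        return 0.0
    return 1.0 if int(matches[-1]) == gold else 0.0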

For math the verifier parses the last integer in the response and compares to the gold sum. For code it would run unit tests. For puzzles it would diff against a known solution. The defining property is that the verifier is cheap, faithful, and noiseless — it never gives partial credit it does not believe in, and it never disagrees with itself. That is what makes it a strictly better signal than a reward model trained on preferences.
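To make the last-integer rule concrete, here is a small self-contained sketch of the verifier applied to a few responses. The pattern `\d+` is an assumption (the definition of `_DIGITS_RE` is not shown in this lesson), and the sample responses are invented for illustration.

```python
import re

# Assumption: _DIGITS_RE matches runs of decimal digits; the real pattern is not shown.
_DIGITS_RE = re.compile(r"\d+")

def verify(response_text: str, gold: int) -> float:
    """Binary reward: 1.0 iff the last integer in the response equals gold."""
    matches = _DIGITS_RE.findall(response_text)
    if not matches:
        return 0.0
    return 1.0 if int(matches[-1]) == gold else 0.0

# Hypothetical responses to the prompt <3+4+5> (gold = 12).
examples = [
    "3+4=7, 7+5=12#",    # correct final integer
    "12, or maybe 13#",  # the right answer appears, but the last integer is 13
    "no idea#",          # nothing to parse
]
rewards = [verify(text, gold=12) for text in examples]
print(rewards)
```

Only the final integer counts, so a response that mentions the right answer and then changes its mind scores 0.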

Why a verifier dominates a reward model when you have one
A reward model is a finite-capacity neural network trained on a finite preference dataset. It hallucinates. It is biased toward styles its raters preferred. It can be reward-hacked: the policy finds adversarial responses that score high under the model and low under any human. A verifier has none of these failure modes within its domain — it cannot be fooled by a response that does not actually parse to the gold answer. The cost of that strength is generality: a verifier only exists where correctness is programmatically checkable.

GRPO in six lines

Here is the entire algorithm. Walk through it once now; we will dissect each line below.

1. Sample a prompt    x  ~ D
2. Roll out K samples y_1, …, y_K  ~  π_θ(·|x)
3. Verify each        r_i  =  R(x, y_i)        ∈ ℝ
4. Group advantage    A_i  =  (r_i − mean_j r_j) / (std_j r_j + ε)
5. PG loss            L_PG =  − (1/K) Σ_i  A_i · log π_θ(y_i | x)
6. KL anchor          L_KL =   β · (1/K) Σ_i  KL( π_θ ‖ π_ref )   over response tokens
   Total              L    =  L_PG + L_KL

That is the whole loop. One optimizer step per group of K rollouts. The toy implementation in 04_rlvr.py uses K=4 and β=0.05 for 800 steps; production recipes use K=16…64 and tens of thousands of steps.
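The six lines can be sketched for a deliberately tiny case: a policy that answers with a single token, parameterized directly by its logits. This is not the code in 04_rlvr.py. The helper names (`softmax`, `group_advantages`, `grpo_grad`) are hypothetical, the KL term uses the exact categorical KL rather than a sampled estimator, and the update is plain SGD rather than Adam.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def group_advantages(rewards, eps=1e-6):
    k = len(rewards)
    mean = sum(rewards) / k
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / k)  # population std
    return [(r - mean) / (std + eps) for r in rewards]

def grpo_grad(logits, ref_logits, samples, rewards, beta=0.05):
    """Gradient of L_PG + beta * KL(pi || ref) w.r.t. the logits of a one-token policy.

    samples: indices of the K sampled answers; rewards: their verifier scores.
    """
    pi, ref = softmax(logits), softmax(ref_logits)
    adv = group_advantages(rewards)
    k, n = len(samples), len(logits)
    grad = [0.0] * n
    # Policy-gradient term: derivative of -(1/K) sum_i A_i log pi(y_i),
    # using d log pi(y) / d z_j = 1[j == y] - pi_j.
    for y, a in zip(samples, adv):
        for j in range(n):
            grad[j] -= a * ((1.0 if j == y else 0.0) - pi[j]) / k
    # Exact KL term: d KL / d z_j = pi_j * (log(pi_j / ref_j) - KL).
    log_ratio = [math.log(p / q) for p, q in zip(pi, ref)]
    kl = sum(p * l for p, l in zip(pi, log_ratio))
    for j in range(n):
        grad[j] += beta * pi[j] * (log_ratio[j] - kl)
    return grad

# Three candidate answers; index 0 is the gold answer. Policy starts at the reference.
logits = [0.0, 0.0, 0.0]
grad = grpo_grad(logits, ref_logits=[0.0, 0.0, 0.0],
                 samples=[0, 0, 1, 1], rewards=[1.0, 1.0, 0.0, 0.0])
new_logits = [z - 0.1 * g for z, g in zip(logits, grad)]  # one plain SGD step
print(grad, softmax(new_logits))
```

With two correct and two incorrect rollouts, the step raises the logit of the rewarded answer, lowers the logit of the sampled wrong answer, and leaves the unsampled answer's logit unchanged. A group with identical rewards produces a zero policy-gradient term.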

[Figure: the GRPO loop. 1. rollout K samples y_i → 2. verify r_i ∈ ℝ → 3. advantage A_i = (r − μ)/σ → 4. loss L_PG + L_KL → 5. Adam step → 6. repeat with a new prompt and fresh on-policy rollouts. π_ref (frozen) participates only in the KL term inside step 4. No value head, no reward model, no off-policy buffer.]

Why group-relative — the disappearance of the value network

The score-function gradient is

∇_θ J(θ)  =  𝔼_{y∼π_θ} [ R(x, y) · ∇_θ log π_θ(y|x) ].

This is the formula from lesson 1, lifted to sequences. Estimated naively (one sample), it has crushing variance: a single response touches hundreds of tokens, and a binary reward gives you one bit of signal scattered across all of them. The standard cure is a baseline b(x). Subtracting any function of x alone from the reward leaves the gradient unbiased:

𝔼_{y∼π_θ} [ b(x) · ∇_θ log π_θ(y|x) ]  =  b(x) · Σ_y ∇_θ π_θ(y|x)  =  b(x) · ∇_θ Σ_y π_θ(y|x)  =  b(x) · ∇_θ 1  =  0,

because the policy distribution integrates to one no matter θ. So the centred gradient

∇_θ J  =  𝔼_{y∼π_θ} [ ( R(x, y) − b(x) ) · ∇_θ log π_θ(y|x) ]

is unbiased for any b that does not depend on y. PPO-style RLHF realizes b(x) as a learned value head — a second copy of the trunk with a scalar projection — trained by regressing returns. GRPO realizes b(x) as the sample mean of K rollouts for the same prompt:

b̂(x)  =  (1/K) Σ_{j=1..K} r_j.

This is a Monte-Carlo estimate of 𝔼_{y∼π_θ}[R(x, y)]. Its expectation depends only on x (the rollouts are i.i.d. given x), so the baseline-zero-mean argument applies, up to the finite-K caveat discussed under "A subtlety". It costs zero extra parameters, has zero warmup, and cannot go stale relative to the policy because it is computed from the policy itself, this step. The standard-deviation denominator

A_i  =  ( r_i − b̂(x) ) / ( σ̂(x) + ε )

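does not change the direction of the per-group update, since it rescales every advantage in the group by the same factor. Together with the centring it makes the loss invariant to reward shifts and, up to ε, to multiplicative reward rescaling. Switch your verifier from {0,1} to {0,10} and the optimizer behaves essentially identically.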
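The baseline argument can be checked exactly on a small softmax policy by enumerating every possible response instead of sampling. The sketch below is illustrative only: the logits, rewards, and baseline value are arbitrary, and `expected_score_gradient` is a hypothetical helper.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def expected_score_gradient(logits, weights):
    """Exact E_{y~pi}[ weights[y] * d log pi(y) / d logits ] for a softmax policy."""
    pi = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    for y in range(n):
        for j in range(n):
            score = (1.0 if j == y else 0.0) - pi[j]   # d log pi(y) / d z_j
            grad[j] += pi[y] * weights[y] * score
    return grad

logits = [0.3, -1.2, 0.8]   # arbitrary policy parameters
rewards = [1.0, 0.0, 0.0]   # R(x, y) for each possible response y
b = 0.37                    # any constant baseline b(x)

baseline_term = expected_score_gradient(logits, [b, b, b])
plain = expected_score_gradient(logits, rewards)
centred = expected_score_gradient(logits, [r - b for r in rewards])
print(baseline_term, plain, centred)
```

The baseline term is zero up to floating-point error, so the plain and centred expected gradients coincide. Only the variance of the sampled estimate changes.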

The cost of going group-relative: K forward passes per prompt instead of one. That is the basic GRPO-vs-PPO trade. PPO pays in extra parameters and a value-target signal; GRPO pays in extra compute per prompt. For verifiable tasks with sparse terminal reward — where the value head is hardest to train anyway — the GRPO trade is the better one.

A subtlety
Strictly, A_i uses r_i in its own baseline through the group mean, so the estimator is biased at finite K. The bias is O(1/K) and vanishes asymptotically. In practice, at K ≥ 4 it is small compared with the sampling noise of the gradient. The strictly unbiased fix, using the mean of the other K−1 rollouts, is RLOO (Leave-One-Out), and it costs the same compute. The DeepSeek papers use the in-group mean and accept the bias.
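Before the division by the standard deviation, the two centring schemes differ only by a constant factor: r_i minus the in-group mean equals (K−1)/K times r_i minus the leave-one-out mean. A short check, with made-up rewards and hypothetical function names:

```python
def in_group_centred(rewards):
    """r_i minus the mean of all K rewards (the GRPO numerator)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def leave_one_out_centred(rewards):
    """r_i minus the mean of the other K-1 rewards (the RLOO baseline)."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]   # made-up verifier outputs
k = len(rewards)
grpo = in_group_centred(rewards)
rloo = leave_one_out_centred(rewards)
scale = (k - 1) / k
print([round(g, 4) for g in grpo])
print([round(scale * l, 4) for l in rloo])
```

The shrinkage factor (K−1)/K is where the O(1/K) effect comes from: it tends to 1 as the group grows.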

A worked group, by hand

Suppose we draw K=4 rollouts for prompt <3+4+5> (gold = 12). The verifier returns:

r_1 = 1,  r_2 = 1,  r_3 = 0,  r_4 = 0

Mean is 0.5. The population standard deviation is also 0.5. Advantages:

A  =  ( 1 − 0.5, 1 − 0.5, 0 − 0.5, 0 − 0.5 ) / ( 0.5 + ε )  ≈  ( +1, +1, −1, −1 ).
[Figure: rewards r_i = (1, 1, 0, 0) shown next to advantages A_i = (+1.0, +1.0, −1.0, −1.0) after centering and dividing by the std.]

Half the group gets pushed up, half pushed down. The policy gradient is now centred: the optimizer will increase the log-probability of y_1 and y_2 and decrease the log-probability of y_3 and y_4, each in proportion to its advantage. This is the entire learning signal.
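The hand calculation can be reproduced in a few lines. This sketch assumes the population standard deviation (as in the worked example) and ε = 1e-6; `group_advantages` is a hypothetical helper name. It also runs the same group with rewards in {0, 10} to show the rescaling invariance.

```python
import math

def group_advantages(rewards, eps=1e-6):
    """Centre by the group mean and divide by the population std plus eps."""
    k = len(rewards)
    mean = sum(rewards) / k
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / k)
    return [(r - mean) / (std + eps) for r in rewards]

adv_01 = group_advantages([1.0, 1.0, 0.0, 0.0])      # verifier in {0, 1}
adv_010 = group_advantages([10.0, 10.0, 0.0, 0.0])   # same outcomes, rewards in {0, 10}
print([round(a, 3) for a in adv_01])
print([round(a, 3) for a in adv_010])
```

Both calls give advantages of about (+1, +1, −1, −1); the tiny gap from exactly ±1 comes from ε.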

The KL anchor — why we keep πref around

Without a KL penalty, the policy is free to walk wherever the reward gradient points. Sounds desirable. It is not. Verifiers have edge cases. The arithmetic verifier in our toy parses the last integer in the response — so a policy that learns to emit the string "42#" always, and ignores the prompt, will get reward 1 on every problem whose gold answer happens to be 42, and reward 0 elsewhere. Across a uniform-difficulty stream of two- and three-summand problems, this is a winning strategy compared to most random outputs. A pure reward-maximizing policy with no anchor will find it. We are then proudly training a model that thinks the universal answer is 42 and have lost everything SFT gave us.

The KL term against the frozen πref stops this. It says: at every token, your distribution over next tokens has to stay near the SFT model's. Local exploration is fine; a wholesale collapse onto a degenerate sequence is not.

[Figure: two panels. Without KL: the policy collapses from the spread-out π_ref onto a single token sequence ("42"). With the KL anchor: π_θ stays a spread-out distribution close to π_ref.]

The KL is computed only over response tokens, not the prompt. Prompt tokens are fed identically to π_θ and π_ref as conditioning input and are never sampled from the policy, so they are not part of what the penalty should regularize; masking them out restricts the KL to the tokens the policy actually generated. In the code:

target_mask = torch.zeros((K, T - 1), dtype=torch.float32, device=device)
target_mask[:, P - 1 : P - 1 + resp_mask.size(1)] = resp_mask

P is the prompt length; the mask is 1 only at target positions corresponding to response tokens that are not past-EOS padding.

Schulman's k3 KL estimator

How do we actually compute the KL? We have access to per-token log-probabilities log π_θ(y_t|·) and log π_ref(y_t|·) from the two model forwards. The naive estimator is

KL̂_naive  =  log π_θ − log π_ref,

which is unbiased — its expectation under y ∼ πθ is exactly KL(πθ ‖ πref) — but signed and high-variance. Per token it can be negative, which is uncomfortable both interpretationally and as a loss term you might want to clip or log. Schulman's "k3" estimator fixes this. Let r = log πref − log πθ. Then

KL̂_k3  =  e^r − r − 1.

Two properties to verify:

  1. Nonnegativity per token. The function f(r) = e^r − r − 1 has minimum 0 at r = 0, since f′(r) = e^r − 1 vanishes there and f″(r) = e^r > 0 everywhere. So f ≥ 0 for all real r, with equality iff π_θ and π_ref assign the same probability to that token.
  2. Unbiased in expectation. Under y ∼ π_θ, 𝔼[e^r] = Σ_y π_θ(y) · π_ref(y)/π_θ(y) = Σ_y π_ref(y) = 1. So 𝔼[KL̂_k3] = 1 − 𝔼[r] − 1 = −𝔼[r] = 𝔼[log π_θ − log π_ref] = KL(π_θ ‖ π_ref). The expectations of the naive and k3 estimators agree exactly.

So k3 gives up nothing in expectation, and every sample is nonnegative. When π_θ is close to π_ref, which is the regime the anchor is meant to maintain, it also has much lower variance than the naive estimator. Close to a free lunch.
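Both properties can be verified exactly on a small example by enumerating the next-token distribution rather than sampling from it. The two distributions below are made up for illustration.

```python
import math

pi_theta = [0.6, 0.3, 0.1]   # hypothetical next-token distribution of the policy
pi_ref = [0.4, 0.4, 0.2]     # hypothetical next-token distribution of the reference

def k3(log_ref, log_pol):
    r = log_ref - log_pol
    return math.exp(r) - r - 1.0

exact_kl = sum(p * math.log(p / q) for p, q in zip(pi_theta, pi_ref))

# Per-token values of each estimator, one entry per possible sampled token.
per_token_k3 = [k3(math.log(q), math.log(p)) for p, q in zip(pi_theta, pi_ref)]
per_token_naive = [math.log(p) - math.log(q) for p, q in zip(pi_theta, pi_ref)]

# Expectations under y ~ pi_theta.
expected_k3 = sum(p * v for p, v in zip(pi_theta, per_token_k3))
expected_naive = sum(p * v for p, v in zip(pi_theta, per_token_naive))
print(exact_kl, expected_k3, expected_naive)
print(per_token_k3, per_token_naive)
```

Both expectations match the exact KL, but only the naive per-token values go negative.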

[Figure: per-token estimate KL̂ as a function of r over [−2, +2]. The naive estimator is signed and unbounded below; k3 has its minimum of 0 at r = 0.]

The toy implementation is two lines of Python:

r = ref_logp - pol_logp
kl_per_tok = torch.exp(r) - r - 1.0     # >= 0 elementwise

Debugging gotcha: the sign of r matters
We are computing KL(π_θ ‖ π_ref) with samples from π_θ. The unbiased k3 form uses r = log π_ref − log π_θ. Flip the sign by accident and you get the k3 expression for KL(π_ref ‖ π_θ) evaluated on samples from the wrong distribution. It is biased, and its exponential term is now π_θ/π_ref, which blows up on tokens where π_ref is small. Training looks fine for a few steps and then the policy explodes. The README of gpt_mini lists this as the canonical RLVR gotcha for a reason.
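The effect of the sign flip on the estimate can be seen with an exact expectation over a small, made-up pair of distributions that are far apart. This is only a sketch of the bias; it says nothing about the training dynamics of the toy.

```python
import math

pi_theta = [0.9, 0.05, 0.05]   # hypothetical policy distribution
pi_ref = [0.2, 0.4, 0.4]       # hypothetical reference distribution

def k3_from_r(r):
    return math.exp(r) - r - 1.0

exact_kl = sum(p * math.log(p / q) for p, q in zip(pi_theta, pi_ref))

# Correct convention: r = log pi_ref - log pi_theta, expectation under pi_theta.
correct = sum(p * k3_from_r(math.log(q) - math.log(p))
              for p, q in zip(pi_theta, pi_ref))

# Flipped convention: r = log pi_theta - log pi_ref, same samples.
flipped = sum(p * k3_from_r(math.log(p) - math.log(q))
              for p, q in zip(pi_theta, pi_ref))
print(exact_kl, correct, flipped)
```

The correct convention reproduces the exact KL. The flipped one does not, because its exponential term has expectation Σ π_θ²/π_ref rather than 1.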

Why REINFORCE-style here — no PPO clipping

Real GRPO, as published, wraps the policy-gradient term in PPO-style importance-ratio clipping. The reason is off-policy reuse: in production, rollouts are produced by an inference engine on one set of GPUs while the trainer updates the policy on another set. Between the moment a rollout is sampled (using policy weights πold) and the moment it contributes to a gradient (using the current weights πθ), the policy has moved. The two distributions differ; the rollout is no longer on-policy. The PPO clip bounds the resulting update.

Our toy in 04_rlvr.py sidesteps all of that. The loop is strictly synchronous: roll out → score → update → next prompt. The rollout policy is the current policy, so the importance ratio π_θ(y|x) / π_old(y|x) = 1 by construction and the clip is inactive. We skip the clip term for clarity; it would not change a single gradient. In a multi-epoch or asynchronous setting you'd insert a few lines:

# where it would go, in the loss expression:
ratio = (pol_logp - old_logp.detach()).exp()            # importance ratio
clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)      # PPO-clip
pg_loss = -torch.min(ratio * adv, clipped * adv).mean()

Everything else — group-relative advantage, KL anchor, k3 estimator, mask over response tokens — is unchanged.
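To see when the clip is active, here is a scalar version of the same surrogate in plain Python. The clip range eps = 0.2 and the ratio values are assumptions chosen for illustration; `clipped_pg_loss` is a hypothetical helper, not a function from the toy.

```python
def clipped_pg_loss(ratios, advantages, eps=0.2):
    """PPO-clip surrogate averaged over samples (a loss, to be minimized)."""
    terms = []
    for ratio, adv in zip(ratios, advantages):
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        # Pessimistic choice between the unclipped and clipped objective.
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms)

adv = [1.0, 1.0, -1.0, -1.0]

# Strictly on-policy: every ratio is 1, so the value equals the unclipped -mean(ratio * adv).
on_policy = clipped_pg_loss([1.0, 1.0, 1.0, 1.0], adv)

# Stale rollouts: ratios have drifted, and two of the four terms hit the clip.
stale = clipped_pg_loss([1.5, 0.9, 0.5, 1.1], adv)
print(on_policy, stale)
```

In the stale case the ratio 1.5 with a positive advantage is capped at 1.2, and the ratio 0.5 with a negative advantage is evaluated at 0.8, so neither sample can pull the update arbitrarily far.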

Degenerate groups — when the step does nothing

If all K rollouts in a group receive the same reward, then σ̂ = 0 and every r_i − mean = 0. The advantage vector is all zeros. The PG loss vanishes; only the KL term remains; the gradient is purely "stay close to π_ref", which is uninformative. The toy file detects this and short-circuits:

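# Dead group: every rollout got the same reward, so all advantages would be zero.
if rewards.std().item() < 1e-8:
    return 0.0, rewards.mean().item(), 0.0      # skip the optimizer step
# Note: Tensor.std() defaults to the sample (N-1) estimator, while the worked
# example above uses the population std; pass unbiased=False to match it exactly.
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)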

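Degenerate groups are not a bug; they are a fact of the training dynamics. They show up in two regimes:

  1. All rollouts fail. The prompt is too hard for the current policy (typical early in training), so every r_i = 0.
  2. All rollouts succeed. The prompt is too easy for the current policy (typical late in training), so every r_i = 1.

The interesting middle is where some rollouts succeed and some fail, and the centred advantage tells the policy which sample style worked. The dead-group rate is a useful real-time diagnostic: if it stays at 100% you have either a hopeless initial policy or a verifier that everyone passes; if it falls and then climbs back while mean reward approaches 1, the policy has mastered the prompt distribution and has run out of room to grow.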
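If each rollout is modelled as an independent Bernoulli(p) success, the dead-group probability has a closed form: p^K + (1 − p)^K. The sketch below tabulates it for a few values. The independence assumption and the choice of p values are illustrative only.

```python
def dead_group_probability(p, k):
    """P(all K independent Bernoulli(p) rewards are equal) = p^K + (1 - p)^K."""
    return p ** k + (1.0 - p) ** k

for p in (0.05, 0.2, 0.5, 0.8, 0.95):
    print(p,
          round(dead_group_probability(p, 4), 3),    # K = 4
          round(dead_group_probability(p, 16), 3))   # K = 16
```

The rate is highest at both extremes of p and lowest near p = 0.5, and a larger K shrinks it everywhere in between.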

Interactive · a GRPO rollout group, step by step

The widget below simulates one full GRPO step at a time. The "policy correctness" parameter p is a 1-D stand-in for the policy's competence: rollouts are sampled to be correct with probability p. Hit Resample to draw a fresh group; hit Step to apply one fake gradient update. The step nudges p up by an amount proportional to Σ_i A_i · [correct_i] (the empirical group signal), tempered by the KL coefficient β (larger β means slower adaptation). The "mean reward EMA" line chart and the dead-group counter give you a feel for how RLVR actually unfolds.

[Interactive widget: GRPO rollout group simulator · prompt = "<3+4+5>" · gold = 12. Readouts: mean(r), std(r), degenerate?, policy p (initially 0.20), steps, mean reward EMA, dead-group rate, KL / rollout.]
Each row is one of the K rollouts. Green chip = verifier returned 1. The advantages are computed live from the K rewards. Step takes a fake gradient step; over time the policy correctness parameter climbs and dead groups drop, then come back as the policy nears mastery.

Play with three things and notice the dynamics:

What β and K actually control

Knob | Small | Large | Where it bites
β | weak anchor | strong anchor | Small → faster adaptation but risk of collapse onto a verifier exploit. Large → slow but stable. Typical: 0.01–0.1.
K | high-variance advantage; many dead groups | low variance, K× compute | Small → step contributes nothing on dead groups, gradients noisy. Large → reliable advantage estimate. Papers use 16–64.
temperature | greedy, low entropy | diverse, exploratory | Too low collapses the group (every sample is the same response); too high makes everything garbage. Match to expected response length.
learning rate | safe | fast but unstable | RLVR LR is typically 10×–100× smaller than SFT LR: the gradient is already a high-variance Monte Carlo estimate, and a large LR amplifies the noise. Toy uses 1e-5.

Warmup matters — the cold-start problem

RL needs a base that occasionally succeeds. A randomly initialized policy emits gibberish: for the toy task it would produce strings like "+5_##" that the verifier can't even parse. Every reward is 0. Every group is degenerate. Every gradient is zero. Training cannot start.

That is why 04_rlvr.py runs a brief SFT warmup before the RLVR loop:

print("── SFT warmup (so the initial policy produces parseable output) ──")
sft = train_warmup(tok, T_max=T_max, steps=600, B=64, device=device)
policy = copy.deepcopy(sft)
ref    = sft

After 600 steps of warmup the policy reliably emits digits + #. The verifier can now grade it. Some answers are right, some wrong — a productive split. The RLVR loop starts there. πref is set to the same warmed-up model and frozen.

This is a general lesson about RL post-training: you need a base that occasionally succeeds, otherwise the reward signal is too sparse to bootstrap. Real R1 starts from a strong base model. Our 4-layer toy starts from a tiny SFT. Same principle. The "Zero" variants (R1-Zero, etc.) demonstrate that with a sufficiently capable base and a verifier, you can skip explicit SFT entirely — but you can never skip the implicit prior that the base provides.

A pragmatic test
Before you start an RLVR run, sample 50 rollouts from your base model on prompts from your training distribution and pass them through the verifier. If fewer than ~5% pass, you do not have enough signal to bootstrap — go back and SFT more. If more than ~95% pass, your verifier is too weak or your task is too easy and RLVR will not teach the model much. The productive band is 10%–80% pass rate.
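The check above is easy to script. The sketch below hard-codes the rule-of-thumb thresholds from the text; the function name, the "borderline" label for pass rates between the stated bands, and the sample outcomes are all hypothetical.

```python
def bootstrap_readiness(rewards, low=0.05, high=0.95, band=(0.10, 0.80)):
    """Classify a base model's verifier pass rate using the thresholds above."""
    pass_rate = sum(1 for r in rewards if r == 1.0) / len(rewards)
    if pass_rate < low:
        verdict = "too sparse: more SFT first"
    elif pass_rate > high:
        verdict = "too easy or verifier too weak"
    elif band[0] <= pass_rate <= band[1]:
        verdict = "productive band"
    else:
        verdict = "borderline"   # between the stated bands; a label added here
    return pass_rate, verdict

# Hypothetical verifier outcomes for 50 sampled rollouts: 12 pass, 38 fail.
rewards = [1.0] * 12 + [0.0] * 38
print(bootstrap_readiness(rewards))
```

In a real run, `rewards` would come from passing sampled base-model rollouts through the verifier.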

DPO vs RLVR — same anchor, different oracle

DPO and RLVR share more than they differ. Both keep a frozen πref as a KL anchor. Both train πθ via a weighted log-probability objective. Both inherit their initial weights from an SFT checkpoint. The difference is in the oracle they consult and the dataset shape that follows from it.

Aspect | DPO (Lesson 5) | RLVR (this lesson)
Dataset shape | static (x, y_w, y_l) | x only; ys sampled fresh from π_θ
Oracle | human (or model) preference | programmatic verifier
Number of samples per x | 2 (the pair) | K (typically 16–64)
Exploration | none: re-ranks within pair distribution | yes: model generates novel responses
Gradient flavor | closed-form, supervised-like | Monte Carlo policy gradient
KL anchor | implicit (β·log(π_θ/π_ref) margin) | explicit (KL term added to loss)
Failure mode | over-fits the pairs; can't go beyond them | reward hacking; collapse without the KL anchor
Best for | general assistant tuning, style | math, code, puzzles, reasoning

Modern recipes use both, often in sequence: DPO for assistant style, then RLVR for hard reasoning targets. The choice is not ideological; it is "do you have a verifier?". If yes, RLVR. If no, DPO (or RLHF with a learned reward model).

Trade-offs you have to think about

Reward hacking

Any time you replace a goal ("be a good math solver") with a proxy ("pass this verifier"), you invite Goodhart's law. The arithmetic verifier in 04_rlvr.py parses the last integer. A policy could learn to emit "5+7=12; the answer is 12" for prompts whose gold is 12, which is fine, but it could also learn to emit a long reasoning trace whose intermediate steps are wrong and whose final integer happens to be right. The verifier cannot tell the difference. For code, a policy can write code that passes the tests and fails on every input not covered by them. Mitigations include test-suite breadth, multiple verifiers, and the KL anchor itself: staying close to a sensible-looking reference makes exotic exploits less reachable.

Verifier strength

The verifier is the ceiling on what RLVR can teach. If the verifier accepts wrong-but-confident answers, the policy will learn to give them. If the verifier rejects right-but-unusual answers, the policy will learn to avoid them. For math, "compare to the gold integer" is rock-solid. For code, "pass these unit tests" is solid only if the tests cover edge cases. For reasoning, "agree with the answer key" works for short answers but breaks for open-ended outputs — which is why preferences (DPO, RLHF) still dominate for those.

KL coefficient tuning

Too small and the policy collapses; too large and it never moves. There is no principled value of β; it depends on the geometry of the verifier landscape near πref. A practical procedure: start at β=0.1, monitor mean KL per rollout, and tune. If KL grows unboundedly, increase β. If KL stays at zero and reward does not move, decrease β. The toy uses β=0.05, which is on the looser side appropriate for a small model with a forgiving task.

K vs steps

Compute budget says you can do N total rollouts. Should you do N/4 steps with K=4, or N/16 steps with K=16? At K=4 you take more, noisier steps. At K=16 you take fewer, cleaner steps. Published recipes sit in the K=16 to 64 range. Smaller groups lose many steps to dead groups; larger groups see diminishing returns from variance reduction.
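One side of this trade can be put in numbers: how many steps in a fixed rollout budget carry any policy-gradient signal at all. The sketch assumes independent Bernoulli(p) rewards, so a group is dead with probability p^K + (1 − p)^K. The budget and the value of p are made up, and the calculation ignores the variance of the surviving steps.

```python
def informative_steps(total_rollouts, k, p):
    """Steps affordable in the budget, and the expected number that are not dead."""
    steps = total_rollouts // k
    dead = p ** k + (1.0 - p) ** k
    return steps, steps * (1.0 - dead)

results = {}
for k in (4, 16, 64):
    results[k] = informative_steps(total_rollouts=6400, k=k, p=0.2)
    steps, useful = results[k]
    print(k, steps, round(useful, 1))
```

At p = 0.2, roughly four in ten of the K=4 steps are expected to be dead, while almost every K=16 step carries signal, at a quarter of the step count.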

On-policy vs off-policy

Strictly on-policy (each rollout consumed once, immediately) is the safest. It is also the slowest because rollouts cannot be reused. Production systems run several gradient updates per batch of rollouts; that is "off-policy" reuse and is where the PPO clip earns its keep. Our toy is strictly on-policy.

Temperature

Lower the sampling temperature and you get less group diversity — more degenerate groups. Raise it and you get more garbage rollouts. For verifiable tasks, temperature 0.7–1.0 is typical. The toy uses 1.0.
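The link between temperature and group diversity can be illustrated on a single next-token distribution: dividing the logits by a smaller temperature concentrates probability on the top token and lowers the entropy. The logits below are made up.

```python
import math

def softmax_with_temperature(logits, temperature):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

logits = [2.0, 1.0, 0.5, 0.0]   # hypothetical next-token logits
temperatures = (0.3, 0.7, 1.0, 1.5)
top_probs, entropies = [], []
for t in temperatures:
    probs = softmax_with_temperature(logits, t)
    top_probs.append(probs[0])
    entropies.append(entropy(probs))
    print(t, round(probs[0], 3), round(entropies[-1], 3))
```

When the top token dominates at every position, the K rollouts tend to coincide and the group goes dead.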

Difficulty bias of advantage normalization

Dividing by the group standard deviation amplifies signals from low-variance groups. The lowest-variance groups that still carry a signal are the very easy ones (almost everyone right) and the very hard ones (almost everyone wrong), and neither is the most informative. Dr.GRPO drops the std denominator to remove this bias; DAPO keeps it but adjusts elsewhere. The toy keeps it, in the interest of staying close to the canonical formulation. Detailed treatment lives in the sibling lesson RL/lessons/14_drgrpo.html.
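The amplification can be made concrete by comparing the total squared advantage of a balanced group with that of a lopsided one, with and without the std denominator. This is a sketch with made-up groups of K = 8, the population std, and hypothetical helper names.

```python
import math

def centred(rewards):
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def std_normalized(rewards, eps=1e-6):
    c = centred(rewards)
    std = math.sqrt(sum(x * x for x in c) / len(c))   # population std
    return [x / (std + eps) for x in c]

def energy(advantages):
    """Sum of squared advantages: a rough measure of how hard the group pushes."""
    return sum(a * a for a in advantages)

balanced = [1.0] * 4 + [0.0] * 4   # 4 of 8 rollouts pass
lopsided = [1.0] * 1 + [0.0] * 7   # 1 of 8 rollouts passes

for name, group in (("balanced", balanced), ("lopsided", lopsided)):
    print(name,
          round(energy(centred(group)), 3),          # without the std denominator
          round(energy(std_normalized(group)), 3))   # with the std denominator
```

Without the denominator the balanced group pushes harder (2.0 versus 0.875). With it, every non-dead group pushes with the same total strength of about K, and the lone success in the lopsided group receives an advantage of about √7 ≈ 2.65.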

What modern recipes do

State-of-the-art reasoning models (R1, R1-Zero, Qwen2.5-Math, and what has been publicly sketched of the o-series) share a recognizable backbone:

  1. Strong pretrain. A capable base model that already does some arithmetic and code.
  2. SFT warmup on reasoning-style demonstrations: chain-of-thought traces, code with explanations.
  3. RLVR with a verifier on math, code, puzzles. Often with curriculum: start on easy problems, raise difficulty as accuracy stabilizes.
  4. Optional DPO on the side, for general-assistant style and instruction-following niceties.

The RLVR stage is where the largest reasoning-capability gains come from. R1-Zero famously skipped SFT entirely and went straight from base model to RLVR — a stress test for the cold-start argument. It worked because the base model was already strong enough to sometimes succeed; the verifier and the group-relative gradient did the rest.

Looking back at the whole pipeline

You've now seen all five stages. Re-read them as deltas:

One observation worth pausing on: every loss in the pipeline is, up to a sign and a weighting, a sum of log-probabilities of tokens under the policy with the prompt masked out. Pretraining sets the weights to 1 over all tokens. SFT sets them to 1 over response tokens, 0 elsewhere. CoT keeps the same scheme on richer response sequences. DPO sets them to +β for winner-response tokens and −β for loser-response tokens, with a sigmoid wrapping the difference. RLVR sets them to the per-rollout advantage A_i, sampled fresh, plus a per-token KL term.

That's the whole game. Five stages, one underlying machine. The architecture in model.py never knew about any of it.

Closing takeaway
RLVR = rollouts + verifier + grouped advantages + KL anchor. No value network, no reward model, no human preferences — just an oracle that says correct or incorrect and K samples whose relative performance defines the signal. The policy improves through exploration under reward pressure, anchored to π_ref so it cannot run away from itself. Everything past this point in the literature — RLOO, DAPO, Dr.GRPO, off-policy variants, curriculum schedulers, length normalization — is a tweak to one of these four ingredients. The recipe is the lesson.