Dr.GRPO: "GRPO Done Right" (Liu et al., 2025)
Two divisors in GRPO are doing nothing for variance reduction and silently biasing the gradient. Drop both. The result is the honest REINFORCE-with-group-baseline-and-PPO-clip estimator that the math actually asks for.
The two biases
From the GRPO loss in lesson 11:
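For reference, the standard GRPO loss can be written as follows. This is a reconstruction in notation assumed here (ρ_{i,t} for the per-token probability ratio, Â_i for the normalized advantage, KL_{i,t} for the per-token KL estimate against the reference policy), not a copy of lesson 11's formula:

```latex
\mathcal{L}_{\text{GRPO}}(\theta)
  = -\frac{1}{K}\sum_{i=1}^{K}\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}
    \Big[\min\!\big(\rho_{i,t}\,\hat{A}_i,\;
      \operatorname{clip}(\rho_{i,t},\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}_i\big)
      \;-\;\beta\,\mathrm{KL}_{i,t}\Big],
\qquad
\hat{A}_i = \frac{r_i - \operatorname{mean}_j(r_j)}{\operatorname{std}_j(r_j)},
\qquad
\rho_{i,t} = \frac{\pi_\theta(y_{i,t}\mid x,\,y_{i,<t})}{\pi_{\text{old}}(y_{i,t}\mid x,\,y_{i,<t})}
```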
Two divisors that look like cosmetic normalization:
- Bias A: dividing the advantage by std_j(r_j).
- Bias B: dividing the per-rollout loss by its length |y_i|.
Both turn out to push the gradient in directions that have nothing to do with reward. Dr.GRPO's fix is to delete them. Same rollouts, same PPO-clip, same KL anchor — strictly fewer divisions.
Bias A · why /std is wrong
Std-normalization makes the loss reward-scale-invariant, which sounds like a win — until you realize what it does to relative gradient magnitudes across prompts.
Consider two training prompts (binary verifier, K=4):
| Prompt | Rewards | mean | std | GRPO advantages |
|---|---|---|---|---|
| X (informative) | [1, 0, 1, 0] | 0.50 | 0.50 | [+1.00, −1.00, +1.00, −1.00] |
| Y (mostly easy) | [1, 1, 1, 0] | 0.75 | 0.43 | [+0.58, +0.58, +0.58, −1.73] |
An unbiased policy-gradient estimator weights each rollout by its advantage in reward units: ±0.50 on Prompt X and +0.25 / −0.75 on Prompt Y. Prompt X is balanced (50% success), so every rollout is informative. Prompt Y is mostly easy (75% success), so only the one failed rollout carries much signal.
Std-normalization multiplies those advantages by 1/std, which is 2.00 on X but about 2.31 on Y: the lower the reward variance on a prompt, the larger the multiplier. The single failed rollout on Y ends up with an advantage of −1.73, larger in magnitude than any rollout on X. The algorithm silently learns "punish the rare failure on Y" very aggressively, even though Y is an "easy" prompt the policy is already mostly solving. That is a scale bias: the policy-gradient theorem contains no /std term.
Dr.GRPO's fix: just delete the divisor.
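The two-prompt table can be checked numerically. The snippet below is a minimal sketch using only the standard library; `group_advantages` is a hypothetical helper name, and it assumes the population standard deviation (which is what reproduces the 0.50 and 0.43 in the table) and a group whose rewards are not all equal.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, normalize_std):
    """Mean-centred advantages for one prompt's group of K rollouts.

    normalize_std=True  -> GRPO-style:    (r - mean) / std
    normalize_std=False -> Dr.GRPO-style:  r - mean
    """
    mean = fmean(rewards)
    centred = [r - mean for r in rewards]
    if not normalize_std:
        return centred
    # Population std; assumes the rewards are not all identical (std > 0).
    std = pstdev(rewards)
    return [c / std for c in centred]


if __name__ == "__main__":
    prompts = {"X": [1, 0, 1, 0], "Y": [1, 1, 1, 0]}
    for name, rewards in prompts.items():
        raw = group_advantages(rewards, normalize_std=False)
        normed = group_advantages(rewards, normalize_std=True)
        print(
            name,
            "raw:", [round(a, 2) for a in raw],
            "normalized:", [round(a, 2) for a in normed],
            "multiplier 1/std:", round(1 / pstdev(rewards), 2),
        )
```

The raw advantages are what Dr.GRPO feeds into the surrogate; the normalized ones are the same numbers scaled by a per-prompt factor that grows as the reward variance shrinks.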
Bias B · why /|y| is wrong (the catastrophic one for reasoning)
This is the bias that motivates the paper's title. GRPO's per-rollout length normalization means each rollout contributes the same total gradient mass regardless of length. Suppose two correct rollouts on the same prompt:
| Rollout | Length \|y\| | A | Per-token contribution |
|---|---|---|---|
| y_short | 10 tokens | +0.5 | +0.5 · (1/10) · log π = 0.05 · log π per token |
| y_long | 100 tokens | +0.5 | +0.5 · (1/100) · log π = 0.005 · log π per token |
Per-token, the long correct rollout gets 10× less gradient signal than the short one. Aggregated, both contribute the same total. But the long rollout represents 10× more reasoning work that the policy needs to reinforce — and we're explicitly suppressing the gradient on that reasoning.
Over many updates, the policy is biased toward shorter responses on the same prompt. For long-CoT reasoning tasks where the winning policies are precisely the ones that produce long chains of thought, GRPO has a structural anti-reasoning bias.
Dr.GRPO's fix: drop the per-rollout divisor. Sum over all response tokens in the group, divide by K (the group size):
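Written out, the loss takes roughly the following form. This is a reconstruction from the code later in this lesson (`pg_tok.sum() / float(K)`, with the KL term aggregated the same way); the symbols ρ_{i,t} (per-token probability ratio) and KL_{i,t} (per-token KL estimate against the reference policy) are notation assumed here:

```latex
\mathcal{L}_{\text{Dr.GRPO}}(\theta)
  = -\frac{1}{K}\sum_{i=1}^{K}\sum_{t=1}^{|y_i|}
    \Big[\min\!\big(\rho_{i,t}\,A_i,\;
      \operatorname{clip}(\rho_{i,t},\,1-\varepsilon,\,1+\varepsilon)\,A_i\big)
      \;-\;\beta\,\mathrm{KL}_{i,t}\Big],
\qquad
A_i = r_i - \operatorname{mean}_j(r_j)
```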
Every token contributes equally to the loss regardless of which rollout it lived in. Long correct rollouts now produce 10× the gradient of short ones, matching the per-token effort they represent.
Compared with DAPO: where DAPO divides by Σ|y_i| (the total number of tokens in the group, which varies across batches), Dr.GRPO divides by K (a constant). DAPO's form is effectively a length-adaptive learning rate; Dr.GRPO's is a plain gradient with a fixed scale. The two are interchangeable at reasonable LRs.
Interactive · length bias, made concrete
[Interactive widget not reproduced here.] It shows two correct rollouts on the same prompt with equal advantage and adjustable lengths, and reports the per-token gradient under three aggregation schemes. The number to watch is the per-token gradient under GRPO: for long rollouts it stays small, regardless.
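The comparison can also be reproduced with a few lines of arithmetic. The sketch below assumes the three aggregation schemes are GRPO (per-rollout mean, then group mean), DAPO (one mean over every token in the group), and Dr.GRPO (sum over tokens, divided by K); `per_token_weights` is a hypothetical helper, and the weights exclude the advantage and the clipped ratio, which are identical across schemes.

```python
def per_token_weights(lengths, scheme):
    """Weight that multiplies each token's surrogate term, one entry per rollout.

    lengths: list of response lengths |y_i| for the K rollouts in a group.
    scheme:  "grpo", "dapo", or "drgrpo" (labels assumed for this sketch).
    """
    K = len(lengths)
    if scheme == "grpo":
        # (1/K) * (1/|y_i|): per-rollout mean, then mean over the group.
        return [1.0 / (K * n) for n in lengths]
    if scheme == "dapo":
        # 1 / sum_j |y_j|: a single mean over all tokens in the group.
        total = sum(lengths)
        return [1.0 / total for _ in lengths]
    if scheme == "drgrpo":
        # 1/K: plain sum over tokens with a constant divisor.
        return [1.0 / K for _ in lengths]
    raise ValueError(f"unknown scheme: {scheme}")


if __name__ == "__main__":
    lengths = [10, 100]  # short and long rollout, as in the table above
    for scheme in ("grpo", "dapo", "drgrpo"):
        weights = per_token_weights(lengths, scheme)
        # Total gradient mass per rollout = per-token weight * number of tokens.
        totals = [w * n for w, n in zip(weights, lengths)]
        print(scheme, "per-token:", weights, "per-rollout total:", totals)
```

Only the GRPO scheme gives the long rollout a smaller per-token weight than the short one; under the other two, every token in the group carries the same weight, so the long rollout's total grows with its length.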
The Dr.GRPO loss, in code
# From RL/algorithms/05_drgrpo.py — drgrpo_step (annotated diff vs GRPO)
# Fix A: advantages WITHOUT /std.
A = rewards - rewards.mean() # (K,)
# Standard PPO-clip surrogate (unchanged).
ratio = torch.exp(new_logp - old_logp)
A_tok = A.detach().unsqueeze(-1).expand_as(ratio)
s1, s2 = ratio * A_tok, torch.clamp(ratio, 1-ε, 1+ε) * A_tok
pg_tok = -torch.min(s1, s2) * target_mask
# Fix B: NO per-rollout length normalization. Sum across all response
# tokens in the group; divide by K. Every token contributes equally to
# the loss, regardless of which rollout it belongs to or how long that
# rollout is.
pg_loss = pg_tok.sum() / float(K) # NOT (1/K) Σ (1/|y_i|) Σ
# KL anchor with the same aggregation.
kl_tok = compute_kl_k3(new_logp, ref_logp) * target_mask
kl_loss = β * kl_tok.sum() / float(K)
loss = pg_loss + kl_loss
That's the whole diff. Two divisions removed. Same rollouts, same PPO-clip, same KL anchor, same hyperparameters.
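To see both sides of the diff next to each other, here is a dependency-free sketch that computes the policy-gradient part of the loss for either variant on ragged Python lists. It is an illustration only, not the repository's `drgrpo_step`: the function names are hypothetical, the KL anchor is omitted for brevity, and the small `1e-8` added to the std is an assumption made to avoid division by zero.

```python
import math


def clipped_token_loss(new_lp, old_lp, adv, eps):
    """Negative PPO-clip surrogate for a single token."""
    ratio = math.exp(new_lp - old_lp)
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return -min(ratio * adv, clipped * adv)


def policy_loss(new_logp, old_logp, rewards, eps=0.2, variant="drgrpo"):
    """Policy-gradient loss for one group of K rollouts.

    new_logp, old_logp: K lists of per-token log-probs (lengths may differ).
    rewards:            K scalar rewards.
    variant:            "grpo" applies both divisors, "drgrpo" applies neither.
    """
    K = len(rewards)
    mean = sum(rewards) / K
    adv = [r - mean for r in rewards]
    if variant == "grpo":
        # Divisor A: scale advantages by the group's (population) std.
        std = math.sqrt(sum(a * a for a in adv) / K)
        adv = [a / (std + 1e-8) for a in adv]
    total = 0.0
    for new_i, old_i, a in zip(new_logp, old_logp, adv):
        rollout = sum(
            clipped_token_loss(n, o, a, eps) for n, o in zip(new_i, old_i)
        )
        if variant == "grpo":
            # Divisor B: average over this rollout's own length.
            rollout /= len(new_i)
        total += rollout
    # Both variants divide by the group size K.
    return total / K


if __name__ == "__main__":
    # A short correct rollout (2 tokens) and a long incorrect one (4 tokens).
    logp = [[-1.0, -1.0], [-1.0, -1.0, -1.0, -1.0]]
    rewards = [1.0, 0.0]
    for variant in ("grpo", "drgrpo"):
        print(variant, policy_loss(logp, logp, rewards, variant=variant))
```

The two `if variant == "grpo"` branches are the only difference between the variants, which mirrors the two removed divisions.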
What you give up
Two things, both manageable in practice:
- Reward-scale sensitivity. Without /std, switching from {0, 1} to {0, 100} rewards changes the effective LR by 100×. Pick your reward scale once and forget it; or scale the LR if you change rewards.
- Slightly noisier per-step gradients on prompts with low reward variance. The corresponding bias-reduction usually wins on long-CoT tasks; on short-output preference-style tasks, GRPO's std-normalization is a defensible choice.
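The reward-scale point is easy to verify. The sketch below, with a hypothetical `advantages` helper and the population standard deviation assumed, compares a {0, 1} reward scale against a {0, 100} one for the same outcomes.

```python
from statistics import fmean, pstdev


def advantages(rewards, normalize_std):
    """Mean-centred advantages, optionally divided by the population std."""
    mean = fmean(rewards)
    centred = [r - mean for r in rewards]
    if not normalize_std:
        return centred
    std = pstdev(rewards)  # assumes the rewards are not all identical
    return [c / std for c in centred]


if __name__ == "__main__":
    binary = [1, 0, 1, 1]               # rewards in {0, 1}
    scaled = [100 * r for r in binary]  # same outcomes, rewards in {0, 100}

    # Without /std the advantages, and hence the gradient, grow with the scale.
    print("raw, {0,1}:   ", advantages(binary, normalize_std=False))
    print("raw, {0,100}: ", advantages(scaled, normalize_std=False))

    # With /std the two scales produce the same advantages.
    print("norm, {0,1}:  ", advantages(binary, normalize_std=True))
    print("norm, {0,100}:", advantages(scaled, normalize_std=True))
```

Since the raw advantages scale linearly with the reward, dividing the learning rate by the same factor restores the original step size.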
The lineage at a glance
| Algorithm | Baseline | Clip | Value head | /std | /\|y_i\| |
|---|---|---|---|---|---|
| REINFORCE | none | — | — | — | — |
| PPO | learned V_φ | ✓ (sym) | ✓ | batch-norm | — (already token-level) |
| GRPO | group mean | ✓ (sym) | — | ✓ | ✓ |
| RLOO | LOO mean | — | — | — | — |
| DAPO | group mean | ✓ (asym) | — | ✓ | token-level (replaces /\|y_i\|) |
| Dr.GRPO | group mean | ✓ (sym) | — | — | — |
Six rows, each a short list of deltas. Read across the columns and you can recover any of these algorithms from a neighbor by toggling a handful of features; Dr.GRPO, for instance, is GRPO with the last two columns switched off.
That closes Part II — six algorithms, one verifiable task, one variance-reduction lineage. Part III (lessons 15–23) moves to production: where reward comes from when there is no verifier (RLHF, DPO), how environments and verifiers are designed, how the cluster is wired, what the inference engine is actually doing, and how the famous recipes (R1, Tülu 3, Qwen, o-series) combine all of the above.