Dr.GRPO — "GRPO Done Right" Liu et al. 2025

Two divisors in GRPO are doing nothing for variance reduction and silently biasing the gradient. Drop both. The result is the honest REINFORCE-with-group-baseline-and-PPO-clip estimator that the math actually asks for.

The two biases

From the GRPO loss in lesson 11:

A_i^GRPO = (r_i − mean) / (std + ε)  ,   L_GRPO = (1/K) Σ_i (1/|y_i|) Σ_t L_{i,t}

Two divisors that look like cosmetic normalization: the /(std + ε) in the advantage and the 1/|y_i| in the loss.

Both turn out to push the gradient in directions that have nothing to do with reward. Dr.GRPO's fix is to delete them. Same rollouts, same PPO-clip, same KL anchor — strictly fewer divisions.

Bias A · why /std is wrong

Std-normalization makes the loss reward-scale-invariant, which sounds like a win — until you realize what it does to relative gradient magnitudes across prompts.

Consider two training prompts (binary verifier, K=4):

| Prompt | Rewards | mean | std | GRPO advantages |
| --- | --- | --- | --- | --- |
| X (informative) | [1, 0, 1, 0] | 0.50 | 0.50 | [+1.00, −1.00, +1.00, −1.00] |
| Y (mostly easy) | [1, 1, 1, 0] | 0.75 | 0.43 | [+0.58, +0.58, +0.58, −1.73] |

The policy wants larger gradient on the prompts where the advantage in reward units is larger. Prompt X is balanced (50% success), so every rollout is informative. Prompt Y is mostly easy (75% success), so only the one failed rollout has signal.

But std-normalization rescales Y so that its single failed rollout gets an advantage of magnitude 1.73, larger than any of Prompt X's rollouts. The algorithm silently learns "punish the rare failure on Y" very aggressively, even though Y is an "easy" prompt the policy is already mostly solving. That is a scale bias: the policy-gradient theorem contains no /std term.
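The two prompts can be checked in a few lines. This is a minimal sketch that assumes the population standard deviation (divide by K) and leaves out ε; the function names are illustrative and not taken from the lesson's codebase.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards):
    """Mean-centered, std-normalized advantages (population std, eps omitted)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / sigma for r in rewards]

def drgrpo_advantages(rewards):
    """Mean-centered advantages only: no /std."""
    mu = mean(rewards)
    return [r - mu for r in rewards]

prompts = {"X": [1, 0, 1, 0], "Y": [1, 1, 1, 0]}
for name, rewards in prompts.items():
    grpo = [round(a, 2) for a in grpo_advantages(rewards)]
    raw = [round(a, 2) for a in drgrpo_advantages(rewards)]
    print(name, "GRPO:", grpo, "mean-centered only:", raw)
```

Printing both versions side by side shows how much of each advantage comes from the reward itself and how much from the per-prompt rescaling.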

Dr.GRPO's fix: just delete the divisor.

A_i^Dr.GRPO  =  r_i − mean_j r_j
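For a binary verifier, the factor being deleted has a closed form. If a fraction p of a prompt's rollouts succeed, the population std of the rewards is sqrt(p(1 − p)), so the /std step multiplies every advantage on that prompt by 1/sqrt(p(1 − p)). The sketch below tabulates that multiplier; it ignores ε, and the function name is hypothetical.

```python
import math

def std_multiplier(p):
    """Factor by which /std rescales a binary-reward prompt with success rate p."""
    return 1.0 / math.sqrt(p * (1.0 - p))

for p in (0.5, 0.75, 0.9, 0.99):
    print(f"success rate {p:.2f}: advantages scaled by {std_multiplier(p):.2f}x")
```

The multiplier is smallest at p = 0.5 and grows without bound as p approaches 0 or 1, which is the "amplifies low-variance groups" effect in numeric form.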

Bias B · why /|y| is wrong (the catastrophic one for reasoning)

This is the bias that motivates the paper's title. GRPO's per-rollout length normalization means each rollout contributes the same total gradient mass regardless of length. Suppose two correct rollouts on the same prompt:

| Rollout | \|y\| | A | Per-token contribution |
| --- | --- | --- | --- |
| y_short | 10 tokens | +0.5 | +0.5 · (1/10) · log π = 0.05 · log π per token |
| y_long | 100 tokens | +0.5 | +0.5 · (1/100) · log π = 0.005 · log π per token |

Per-token, the long correct rollout gets 10× less gradient signal than the short one. Aggregated, both contribute the same total. But the long rollout represents 10× more reasoning work that the policy needs to reinforce — and we're explicitly suppressing the gradient on that reasoning.

Over many updates, the policy is biased toward shorter responses on the same prompt. For long-CoT reasoning tasks where the winning policies are precisely the ones that produce long chains of thought, GRPO has a structural anti-reasoning bias.

Dr.GRPO's fix: drop the per-rollout divisor. Sum over all response tokens in the group, divide by K (the group size):

L_Dr.GRPO  =  (1/K) Σ_i Σ_t L_{i,t} · mask_{i,t}

Every token contributes equally to the loss regardless of which rollout it lived in. Long correct rollouts now produce 10× the gradient of short ones, matching the per-token effort they represent.

Dr.GRPO vs. DAPO Fix 3
Both DAPO's "token-level loss" (Fix 3 in lesson 13) and Dr.GRPO's no-/|y_i| fix attack the same bias. The difference is in the normalizer: DAPO divides by Σ|y_i| (total tokens in the group, which varies across batches), Dr.GRPO divides by K (a constant). DAPO's form is effectively length-adaptive LR; Dr.GRPO's is a cleaner plain gradient. The two are interchangeable at reasonable LRs.
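The "length-adaptive LR" reading can be made concrete: for the same summed token loss, the two normalizers differ by the factor K / Σ|y_i|, which is the reciprocal of the group's mean response length. A minimal sketch with made-up lengths and hypothetical function names:

```python
def dapo_normalizer(lengths):
    """DAPO's token-level loss divides by the total token count in the group."""
    return sum(lengths)

def drgrpo_normalizer(lengths):
    """Dr.GRPO divides by the group size K, independent of the lengths."""
    return len(lengths)

for lengths in ([10, 20, 30, 40], [100, 200, 300, 400]):
    scale = drgrpo_normalizer(lengths) / dapo_normalizer(lengths)
    print(lengths, "-> DAPO loss = Dr.GRPO loss x", scale)
```

When responses get ten times longer, the DAPO loss shrinks by a further factor of ten relative to Dr.GRPO, which acts like a step size that adapts to batch length.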

Interactive · length bias, made concrete

[Interactive widget: per-token gradient contribution for two correct rollouts on the same prompt as their lengths vary. Both rollouts have A = +0.5, broadcast to every response token. Three aggregations are compared: GRPO (per-rollout), DAPO (token-level), Dr.GRPO (over K).]

The "gradient per token" tells you how strongly the optimizer reinforces each token's log-probability, which is what the policy actually learns. The number to watch is the GRPO value for the long rollout: it stays small regardless.
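The same comparison can be tabulated offline. The sketch below computes the weight each token's surrogate term receives under the three aggregations, for a group in which every rollout has the same advantage. It is an illustration under stated assumptions: the function name is hypothetical, the weights include the 1/K factor (so with K = 2 they are half the values in the two-rollout table above), and clipping and KL are ignored.

```python
def per_token_weights(lengths, advantage=0.5):
    """Weight on each token's surrogate term, per rollout, under three aggregations."""
    K, total = len(lengths), sum(lengths)
    return {
        "GRPO (per-rollout)": [advantage / (K * n) for n in lengths],
        "DAPO (token-level)": [advantage / total for _ in lengths],
        "Dr.GRPO (over K)": [advantage / K for _ in lengths],
    }

for lengths in ([10, 100], [10, 1000]):
    print("lengths:", lengths)
    for scheme, weights in per_token_weights(lengths).items():
        print(f"  {scheme:<20}", [f"{w:.5f}" for w in weights])
```

Only the GRPO row changes from rollout to rollout within a group; under DAPO and Dr.GRPO every token in the group carries the same weight.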

The Dr.GRPO loss, in code

# From RL/algorithms/05_drgrpo.py — drgrpo_step (annotated diff vs GRPO)

# Fix A: advantages WITHOUT /std.
A = rewards - rewards.mean()                              # (K,)

# Standard PPO-clip surrogate (unchanged).
ratio  = torch.exp(new_logp - old_logp)
A_tok  = A.detach().unsqueeze(-1).expand_as(ratio)
s1, s2 = ratio * A_tok, torch.clamp(ratio, 1-ε, 1+ε) * A_tok
pg_tok = -torch.min(s1, s2) * target_mask

# Fix B: NO per-rollout length normalization. Sum across all response
# tokens in the group; divide by K. Every token contributes equally to
# the loss, regardless of which rollout it belongs to or how long that
# rollout is.
pg_loss = pg_tok.sum() / float(K)                         # NOT (1/K) Σ (1/|y_i|) Σ

# KL anchor with the same aggregation.
kl_tok  = compute_kl_k3(new_logp, ref_logp) * target_mask
kl_loss = β * kl_tok.sum() / float(K)

loss = pg_loss + kl_loss

That's the whole diff. Two divisions removed. Same rollouts, same PPO-clip, same KL anchor, same hyperparameters.
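To see the aggregation at work on concrete numbers without a tensor library, here is a forward-only, pure-Python sketch of the same loss over ragged per-rollout lists of token log-probabilities. Several things are assumptions rather than the lesson's code: the function names, the defaults eps=0.2 and beta=0.01, and the reading of compute_kl_k3 as the k3 estimator (r − 1) − log r with r = π_ref / π_θ. It computes the loss value only; there is no autograd.

```python
import math

def kl_k3(new_logp, ref_logp):
    """k3 estimate of KL(pi_theta || pi_ref) for one token: (r - 1) - log r."""
    log_r = ref_logp - new_logp            # log(pi_ref / pi_theta)
    return math.exp(log_r) - 1.0 - log_r

def clipped_term(new_logp, old_logp, adv, eps):
    """Negative PPO-clip surrogate for one token."""
    ratio = math.exp(new_logp - old_logp)
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return -min(ratio * adv, clipped * adv)

def drgrpo_loss(rewards, new_logp, old_logp, ref_logp, eps=0.2, beta=0.01):
    """Forward value of the loss; each *_logp is a list of per-rollout token lists."""
    K = len(rewards)
    mean_r = sum(rewards) / K
    pg_sum = kl_sum = 0.0
    for r, new, old, ref in zip(rewards, new_logp, old_logp, ref_logp):
        adv = r - mean_r                       # Fix A: no /std
        for n, o, f in zip(new, old, ref):     # Fix B: plain sum over tokens
            pg_sum += clipped_term(n, o, adv, eps)
            kl_sum += kl_k3(n, f)
    return pg_sum / K + beta * kl_sum / K

# One correct 3-token rollout and one incorrect 1-token rollout, on-policy
# (new == old == ref), so every ratio is 1 and the KL term is 0.
logp = [[-1.0, -2.0, -0.5], [-0.7]]
print(drgrpo_loss([1.0, 0.0], logp, logp, logp))
```

With per-rollout length normalization the two rollouts in this toy group would cancel exactly; with the plain token sum, the longer correct rollout dominates.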

What you give up

Two things, both manageable in practice:

The lineage at a glance

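| Algorithm | Baseline | Clip | Value head | /std | /\|y_i\| |
| --- | --- | --- | --- | --- | --- |
| REINFORCE | none | | | | |
| PPO | learned V_φ | ✓ (sym) | | batch-norm | n/a (already token-level) |
| GRPO | group mean | ✓ (sym) | | | |
| RLOO | LOO mean | | | | |
| DAPO | group mean | ✓ (asym) | | | token-level (replaces /\|y_i\|) |
| Dr.GRPO | group mean | ✓ (sym) | | | |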

Six rows of one-line deltas. Read top to bottom and you can recover any of these algorithms from its predecessor by adding or removing one feature.

Takeaway
Dr.GRPO is GRPO minus two divisors that looked harmless but bias the gradient: /std (amplifies low-variance groups) and /|y_i| (under-reinforces long correct responses). Drop both. The result is the honest REINFORCE-with-group-baseline-and-PPO-clip estimator the derivation actually says you want, and on long-CoT reasoning the length-bias fix is the one that actually moves benchmark numbers.

That closes Part II — six algorithms, one verifiable task, one variance-reduction lineage. Part III (lessons 15–23) moves to production: where reward comes from when there is no verifier (RLHF, DPO), how environments and verifiers are designed, how the cluster is wired, what the inference engine is actually doing, and how the famous recipes (R1, Tülu 3, Qwen, o-series) combine all of the above.