Dr.GRPO — "GRPO Done Right" Liu et al. 2025

Two divisors in GRPO are doing nothing for variance reduction and silently biasing the gradient. Drop both. The result is the honest REINFORCE-with-group-baseline-and-PPO-clip estimator that the math actually asks for.

The two biases

From the GRPO loss in lesson 11:

A_i^GRPO = (r_i − mean_j r_j) / (std_j(r_j) + ε),   L^GRPO = (1/K) Σ_i (1/|y_i|) Σ_t L_i,t

Two divisors that look like cosmetic normalization:

Bias A: dividing the advantage by std_j(r_j).
Bias B: dividing the per-rollout loss by its length |y_i|.

Both turn out to push the gradient in directions that have nothing to do with reward. Dr.GRPO's fix is to delete them. Same rollouts, same PPO-clip, same KL anchor — strictly fewer divisions.

Bias A · why /std is wrong

Std-normalization makes the loss reward-scale-invariant, which sounds like a win — until you realize what it does to relative gradient magnitudes across prompts.

Consider two training prompts (binary verifier, K=4):

Prompt	Rewards	mean	std	GRPO advantages
X (informative)	[1, 0, 1, 0]	0.50	0.50	[+1.00, −1.00, +1.00, −1.00]
Y (mostly easy)	[1, 1, 1, 0]	0.75	0.43	[+0.58, +0.58, +0.58, −1.73]

The policy-gradient theorem weights each rollout by its advantage in reward units, so gradient magnitude across prompts should track the raw advantages. Prompt X is balanced (50% success), so every rollout is informative. Prompt Y is mostly easy (75% success), so the one failed rollout carries most of the signal.

But std-normalization multiplies X's advantages by 1/0.50 = 2 and Y's by 1/0.43 ≈ 2.3, so the lower-variance group is amplified more. The single failed rollout on Y ends up with an advantage of about −1.73 (−0.75 in reward units), well above the ±1.00 on Prompt X. The algorithm silently learns to punish the rare failure on Y very aggressively, even though Y is an easy prompt the policy is already mostly solving. That is a scale bias: the policy-gradient theorem contains no /std term.

Dr.GRPO's fix: just delete the divisor.

A_i^Dr.GRPO = r_i − mean_j r_j
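
The two advantage definitions can be compared directly on prompts X and Y from the table above. This is a minimal sketch, not the lesson's code: the function names are hypothetical, and it assumes the population standard deviation, which is what the table's std = 0.50 for X implies (note that `torch.std` applies Bessel's correction by default, which would give different numbers).

```python
from statistics import fmean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-mean baseline divided by the population std (GRPO-style)."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def drgrpo_advantages(rewards):
    """Group-mean baseline only (Dr.GRPO-style): no std divisor."""
    mu = fmean(rewards)
    return [r - mu for r in rewards]

prompt_x = [1, 0, 1, 0]   # balanced prompt from the table
prompt_y = [1, 1, 1, 0]   # mostly easy prompt from the table

for name, rewards in [("X", prompt_x), ("Y", prompt_y)]:
    print(name, "GRPO   ", [round(a, 2) for a in grpo_advantages(rewards)])
    print(name, "Dr.GRPO", [round(a, 2) for a in drgrpo_advantages(rewards)])
```

Without the divisor, the advantages stay in reward units, so the relative weighting of X and Y is whatever the rewards themselves say.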

Bias B · why /|y| is wrong (the catastrophic one for reasoning)

This is the bias that motivates the paper's title. GRPO's per-rollout length normalization means each rollout contributes the same total gradient mass regardless of length. Suppose two correct rollouts on the same prompt:

Rollout	|y|	A	per-token contribution
y_short	10 tokens	+0.5	+0.5 · (1/10) · log π = 0.05 · log π per token
y_long	100 tokens	+0.5	+0.5 · (1/100) · log π = 0.005 · log π per token

Per-token, the long correct rollout gets 10× less gradient signal than the short one. Aggregated, both contribute the same total. But the long rollout represents 10× more reasoning work that the policy needs to reinforce — and we're explicitly suppressing the gradient on that reasoning.

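Over many updates, this biases the policy toward the shorter of two correct responses to the same prompt. The divisor cuts the other way for negative advantages: a long incorrect rollout is penalized less per token than a short one. For long-CoT reasoning tasks, where the winning policies are precisely the ones that produce long chains of thought, under-reinforcing long correct responses gives GRPO a structural anti-reasoning bias.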

Dr.GRPO's fix: drop the per-rollout divisor. Sum over all response tokens in the group, divide by K (the group size):

L^Dr.GRPO = (1/K) Σ_i Σ_t L_i,t · mask_i,t

Every response token now carries the same 1/K weight regardless of which rollout it belongs to. In the example above, the 100-token correct rollout produces 10× the total gradient of the 10-token one, matching the number of tokens it has to reinforce.
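
To make the role of the mask and the two normalizers concrete, here is a small sketch that applies both aggregations to a padded matrix of per-token losses. The helper names and the toy values are assumptions for illustration; the per-token loss of −0.5 simply stands in for L_i,t.

```python
def grpo_aggregate(tok_loss, mask):
    """(1/K) * sum_i (1/|y_i|) * sum_t L[i][t] * mask[i][t]."""
    K = len(tok_loss)
    total = 0.0
    for row, m in zip(tok_loss, mask):
        length = sum(m)  # |y_i| = number of response tokens in rollout i
        total += sum(l * w for l, w in zip(row, m)) / length
    return total / K

def drgrpo_aggregate(tok_loss, mask):
    """(1/K) * sum_i sum_t L[i][t] * mask[i][t], with no per-rollout divisor."""
    K = len(tok_loss)
    return sum(l * w for row, m in zip(tok_loss, mask) for l, w in zip(row, m)) / K

# Two rollouts padded to T = 4: the first has 2 response tokens, the second 4.
tok_loss = [[-0.5, -0.5, 0.0, 0.0],
            [-0.5, -0.5, -0.5, -0.5]]
mask     = [[1, 1, 0, 0],
            [1, 1, 1, 1]]

print("GRPO aggregate:   ", grpo_aggregate(tok_loss, mask))
print("Dr.GRPO aggregate:", drgrpo_aggregate(tok_loss, mask))
```

Under the GRPO form each rollout contributes −0.5 in total before the 1/K, whatever its length; under the Dr.GRPO form the 4-token rollout contributes twice what the 2-token rollout does.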

Dr.GRPO vs. DAPO Fix 3

Both DAPO's "token-level loss" (Fix 3 in lesson 13) and Dr.GRPO's no-/|y_i| fix attack the same bias. The difference is in the normalizer: DAPO divides by Σ|y_i| (total tokens in the group, which varies across batches), Dr.GRPO divides by K (a constant). DAPO's form acts like a learning rate that adapts to the total token count; Dr.GRPO's is a plain sum with a fixed scale. Within a single group the two gradients point in the same direction and differ only by the scalar K / Σ|y_i|, so at reasonable LRs they are largely interchangeable.
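
The difference between the two normalizers shows up only across groups with different total lengths. The sketch below computes the per-token weight under each and their ratio; the function names and the two example groups are hypothetical.

```python
def dapo_token_weight(lengths):
    """Per-token weight under a token-level mean: 1 / sum_i |y_i|."""
    return 1.0 / sum(lengths)

def drgrpo_token_weight(lengths):
    """Per-token weight under Dr.GRPO-style aggregation: 1 / K."""
    return 1.0 / len(lengths)

group_short = [10, 20, 10, 20]      # hypothetical group of short rollouts
group_long  = [100, 200, 100, 200]  # hypothetical group of long rollouts

for lengths in (group_short, group_long):
    ratio = dapo_token_weight(lengths) / drgrpo_token_weight(lengths)
    print(lengths, "DAPO / Dr.GRPO per-token weight:", ratio)
```

The ratio is K / Σ|y_i|, the reciprocal of the group's mean response length, so it shrinks as responses get longer. That is the sense in which the token-level mean behaves like a length-adaptive learning rate.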

Length bias, made concrete

[Interactive widget, not reproduced here: per-token gradient contribution under three aggregations, GRPO (per-rollout), DAPO (token-level), and Dr.GRPO (over K), with sliders for |y_short| (shown at 10) and |y_long| (shown at 200).]

The setup is two correct rollouts on the same prompt with equal advantage, A = +0.5, broadcast to every response token. The "gradient per token" tells you how strongly the optimizer reinforces each token's log-probability, which is what the policy actually learns. The number to watch is the long rollout's value under GRPO: it stays small regardless of how the lengths are set.
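
The three-way comparison can be reproduced as a static calculation. This sketch assumes the per-token weights implied by the formulas earlier in the lesson (A / (K·|y_i|) for GRPO, A / Σ|y_j| for DAPO, A / K for Dr.GRPO); the function name is hypothetical, and the original widget may scale its numbers differently.

```python
def per_token_weight(scheme, own_length, all_lengths, advantage):
    """Coefficient multiplying each token's log-prob gradient in one rollout."""
    K = len(all_lengths)
    if scheme == "grpo":      # (1/K) * (1/|y_i|)
        return advantage / (K * own_length)
    if scheme == "dapo":      # 1 / sum_j |y_j|
        return advantage / sum(all_lengths)
    if scheme == "drgrpo":    # 1 / K
        return advantage / K
    raise ValueError(f"unknown scheme: {scheme}")

lengths = {"y_short": 10, "y_long": 200}
A = 0.5  # same advantage for both rollouts

for scheme in ("grpo", "dapo", "drgrpo"):
    row = {name: per_token_weight(scheme, n, list(lengths.values()), A)
           for name, n in lengths.items()}
    print(f"{scheme:7s}", row)
```

Only the GRPO row gives the two rollouts different per-token weights; the other two treat every token in the group alike and differ from each other only in overall scale.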

The Dr.GRPO loss, in code

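# From RL/algorithms/05_drgrpo.py, drgrpo_step (annotated diff vs GRPO)
# Shapes implied by the code: rewards is (K,); new_logp, old_logp, ref_logp,
# and target_mask are (K, T). ε is the clip range, β the KL coefficient, and
# compute_kl_k3 is a helper not shown in this excerpt.
import torch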

# Fix A: advantages WITHOUT /std.
A = rewards - rewards.mean()                              # (K,)

# Standard PPO-clip surrogate (unchanged).
ratio  = torch.exp(new_logp - old_logp)
A_tok  = A.detach().unsqueeze(-1).expand_as(ratio)
s1, s2 = ratio * A_tok, torch.clamp(ratio, 1-ε, 1+ε) * A_tok
pg_tok = -torch.min(s1, s2) * target_mask

# Fix B: NO per-rollout length normalization. Sum across all response
# tokens in the group; divide by K. Every token contributes equally to
# the loss, regardless of which rollout it belongs to or how long that
# rollout is.
pg_loss = pg_tok.sum() / float(K)                         # NOT (1/K) Σ (1/|y_i|) Σ

# KL anchor with the same aggregation.
kl_tok  = compute_kl_k3(new_logp, ref_logp) * target_mask
kl_loss = β * kl_tok.sum() / float(K)

loss = pg_loss + kl_loss

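That's the whole diff. Two divisions removed. Same rollouts, same PPO-clip, same KL anchor, same set of hyperparameters. The loss scale does change, since summing over tokens instead of averaging makes it grow with response length, so the learning rate may need retuning.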
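
The excerpt calls `compute_kl_k3` without showing it. The name suggests the "k3" estimator of KL divergence, the same form used for the KL term in the GRPO paper: exp(d) − d − 1 with d = log π_ref − log π_new. The scalar sketch below is an assumption about what such a helper computes, not the repo's implementation, and `k3_estimate` is a hypothetical name.

```python
import math

def k3_estimate(new_logp, ref_logp):
    """Per-token k3 estimate of KL(pi_new || pi_ref) for a token sampled from pi_new."""
    d = ref_logp - new_logp       # log(pi_ref / pi_new) at the sampled token
    return math.exp(d) - d - 1.0  # non-negative; zero when the log-probs match

print(k3_estimate(-1.0, -1.0))  # identical log-probs
print(k3_estimate(-1.0, -1.5))  # reference assigns the token lower probability
print(k3_estimate(-1.5, -1.0))  # reference assigns the token higher probability
```

Because the estimate is non-negative for every token, summing it under the same mask and 1/K normalizer as the policy term keeps the KL anchor on the same scale as the rest of the loss.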

What you give up

Two things, both manageable in practice:

Reward-scale sensitivity. Without /std, switching from {0, 1} to {0, 100} rewards changes the effective LR by 100×. Pick your reward scale once and forget it; or scale the LR if you change rewards.
Slightly noisier per-step gradients on prompts with low reward variance. The corresponding bias-reduction usually wins on long-CoT tasks; on short-output preference-style tasks, GRPO's std-normalization is a defensible choice.
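
The reward-scale point is easy to check numerically. This sketch (hypothetical helper, population std assumed) multiplies a binary reward vector by 100 and compares how much the first advantage grows with and without the std divisor.

```python
from statistics import fmean, pstdev

def advantages(rewards, use_std, eps=1e-6):
    """Group-mean-centered rewards, optionally divided by the population std."""
    mu = fmean(rewards)
    centered = [r - mu for r in rewards]
    if not use_std:
        return centered
    sigma = pstdev(rewards)
    return [c / (sigma + eps) for c in centered]

base   = [1, 1, 1, 0]            # rewards in {0, 1}
scaled = [100 * r for r in base] # same outcomes, rewards in {0, 100}

for use_std in (True, False):
    growth = advantages(scaled, use_std)[0] / advantages(base, use_std)[0]
    label = "with /std   " if use_std else "without /std"
    print(label, "advantage growth factor:", round(growth, 3))
```

With the divisor the advantages barely move; without it they scale with the rewards, so the gradient, and hence the effective learning rate, scales by the same factor.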

The lineage at a glance

Algorithm	Baseline	Clip	Value head	/std	/|y_i|
REINFORCE	none	—	—	—	—
PPO	learned V_φ	✓ (sym)	✓	batch-norm	— (already token-level)
GRPO	group mean	✓ (sym)	—	✓	✓
RLOO	LOO mean	—	—	—	—
DAPO	group mean	✓ (asym)	—	✓	token-level (replaces /|y_i|)
Dr.GRPO	group mean	✓ (sym)	—	—	—

Six rows of small deltas. Read top to bottom and you can recover most of these algorithms from an earlier row by toggling a few features.

Takeaway

Dr.GRPO is GRPO minus two divisors that looked harmless but bias the gradient: /std (amplifies low-variance groups) and /|y_i| (under-reinforces long correct responses). Drop both. The result is the honest REINFORCE-with-group-baseline-and-PPO-clip estimator the derivation actually says you want — and on long-CoT reasoning the length-bias fix is the one that actually moves benchmark numbers.

That closes Part II — six algorithms, one verifiable task, one variance-reduction lineage. Part III (lessons 15–23) moves to production: where reward comes from when there is no verifier (RLHF, DPO), how environments and verifiers are designed, how the cluster is wired, what the inference engine is actually doing, and how the famous recipes (R1, Tülu 3, Qwen, o-series) combine all of the above.