
RLOO: leave-one-out, strictly unbiased
Ahmadian et al. 2024 (Cohere)

GRPO's group mean uses $r_i$ in its own baseline. RLOO uses the mean of the other $K-1$ rollouts. Same cost, strictly unbiased, and it lets you drop the clip and the std divisor too.

The flaw RLOO fixes

From the GRPO lesson:

$$A_i^{\text{GRPO}} = r_i - \operatorname{mean}(r_1, \dots, r_K)$$

The mean is over all $K$ rollouts including $i$, so the baseline itself depends on $r_i$, and therefore on the very action $y_i$ whose log-prob we're scaling. The policy-gradient theorem's baseline-unbiasedness argument requires a baseline independent of $y_i$. So GRPO is biased at finite $K$, with a bias that decays as $O(1/K)$.

RLOO's fix is one line: use the mean of the other K−1 rollouts.

$$A_i^{\text{RLOO}} = r_i - \frac{1}{K-1} \sum_{j \neq i} r_j$$

Since the other $K-1$ rollouts are i.i.d. samples from $\pi_\theta(\cdot \mid x)$, independent of $y_i$, the baseline is a function of $x$ and quantities independent of $y_i$. Strictly unbiased.
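The bias claim can be checked exactly on a toy problem rather than taken on faith. The sketch below is an illustration, not anything from the paper: it assumes a two-action policy with $\pi(y=1) = p = \mathrm{sigmoid}(\theta)$, a made-up reward table, and a hypothetical helper `expected_gradient` that enumerates every possible group of $K$ rollouts to compute the expected gradient estimate without sampling noise.

```python
from itertools import product
import math


def expected_gradient(p, rewards, K, leave_one_out):
    """Exact expectation of (1/K) * sum_i A_i * dlogpi(y_i)/dtheta over K i.i.d. rollouts.

    Toy setup (an assumption for illustration): y is 0 or 1 and
    pi(y=1) = p = sigmoid(theta), so the score is dlogpi(y)/dtheta = y - p.
    """
    total = 0.0
    for ys in product((0, 1), repeat=K):  # every possible group of K rollouts
        prob = math.prod(p if y == 1 else 1.0 - p for y in ys)
        rs = [rewards[y] for y in ys]
        estimate = 0.0
        for i, y in enumerate(ys):
            if leave_one_out:
                baseline = (sum(rs) - rs[i]) / (K - 1)  # mean of the other K-1
            else:
                baseline = sum(rs) / K  # mean of all K, including r_i
            estimate += (rs[i] - baseline) * (y - p)
        total += prob * estimate / K
    return total


p, K = 0.3, 4
rewards = {0: 0.2, 1: 1.0}  # illustrative reward per action
true_grad = (rewards[1] - rewards[0]) * p * (1.0 - p)  # d/dtheta of E[r]

loo_grad = expected_gradient(p, rewards, K, leave_one_out=True)
full_mean_grad = expected_gradient(p, rewards, K, leave_one_out=False)
print(true_grad, loo_grad, full_mean_grad)
```

In this toy setting the leave-one-out estimate matches the true gradient, while the include-self baseline returns it shrunk by $(K-1)/K$, which is the $O(1/K)$ bias in its simplest form.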

The algebra: RLOO is GRPO times K/(K−1)

RLOO and GRPO advantages are not as different as they look. A few lines of algebra:

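$$
\begin{aligned}
A_i^{\text{RLOO}} &= r_i - \frac{\sum_j r_j - r_i}{K-1} \\
&= \frac{K\, r_i - \sum_j r_j}{K-1} \\
&= \frac{K}{K-1}\,\bigl(r_i - \operatorname{mean}(r)\bigr) \\
&= \frac{K}{K-1}\, A_i^{\text{GRPO, pre-std}}
\end{aligned}
$$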

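So at $K=4$: $A_i^{\text{RLOO}} = \tfrac{4}{3} \cdot A_i^{\text{GRPO, pre-std}} \approx 1.33\times$. Against the centered, un-normalized GRPO advantage, the difference is a constant scale factor, absorbable into the learning rate. Put differently, the $O(1/K)$ bias of the centered GRPO estimator is a uniform shrinkage of the gradient by $(K-1)/K$, not a change in its direction.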
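A small worked example makes the scale factor concrete. The reward vector is chosen here purely for illustration: $r = (1, 0, 0, 0)$ with $K = 4$, so $\operatorname{mean}(r) = \tfrac{1}{4}$.

```latex
\begin{aligned}
i = 1:&\quad A_1^{\text{GRPO, pre-std}} = 1 - \tfrac{1}{4} = \tfrac{3}{4},
  &\quad A_1^{\text{RLOO}} &= 1 - \tfrac{0 + 0 + 0}{3} = 1 = \tfrac{4}{3} \cdot \tfrac{3}{4} \\
i = 2:&\quad A_2^{\text{GRPO, pre-std}} = 0 - \tfrac{1}{4} = -\tfrac{1}{4},
  &\quad A_2^{\text{RLOO}} &= 0 - \tfrac{1 + 0 + 0}{3} = -\tfrac{1}{3} = \tfrac{4}{3} \cdot \bigl(-\tfrac{1}{4}\bigr)
\end{aligned}
```

Rollouts 3 and 4 have the same reward as rollout 2, so they get the same advantages. Every entry differs by exactly $K/(K-1) = 4/3$.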

What's actually new about RLOO
Not the magnitude: the unbiasedness and the philosophical cleanliness. RLOO doesn't divide by std (which, as lesson 14 will show, introduces a bias of its own). It doesn't use the PPO clip (Cohere's argument: variance is already well-controlled). It doesn't divide by response length. Everything is the honest REINFORCE-with-LOO-baseline gradient, with a KL anchor on top.

Interactive · GRPO vs. RLOO side by side

[Interactive widget: GRPO vs. RLOO advantage, K rollouts. Sliders set K and the rewards. Table columns: i, r_i, A_GRPO (centered, no /std), A_RLOO, ratio. Readouts: K / (K-1) and mean(r).]

Numerator is the same; denominator/centering differs. At K=2 the ratio is 2 (LOO advantage is twice GRPO's centered advantage); at K=8 it's 8/7 ≈ 1.14. Asymptotically they coincide.
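The side-by-side comparison can also be reproduced offline. The following is a minimal sketch with a hypothetical helper `advantage_table` and an arbitrary reward vector; it prints the centered GRPO advantage (no std division) next to the RLOO advantage for each rollout. A per-row ratio is left out because it is undefined whenever a rollout's reward equals the group mean.

```python
def advantage_table(rewards):
    """Return rows (i, r_i, A_grpo_centered, A_rloo) for one group of rollouts."""
    K = len(rewards)
    if K < 2:
        raise ValueError("leave-one-out needs at least 2 rollouts")
    total = sum(rewards)
    rows = []
    for i, r in enumerate(rewards, start=1):
        a_grpo = r - total / K              # baseline: mean of all K rollouts
        a_rloo = r - (total - r) / (K - 1)  # baseline: mean of the other K-1
        rows.append((i, r, a_grpo, a_rloo))
    return rows


rewards = [1.0, 0.0, 0.5, 0.25]  # illustrative values, not from the lesson
K = len(rewards)
print(f"K / (K-1) = {K / (K - 1):.4f}   mean(r) = {sum(rewards) / K:.4f}")
print(f"{'i':>2} {'r_i':>6} {'A_GRPO':>8} {'A_RLOO':>8}")
for i, r, a_grpo, a_rloo in advantage_table(rewards):
    print(f"{i:>2} {r:>6.2f} {a_grpo:>8.4f} {a_rloo:>8.4f}")
```

Changing the length of `rewards` plays the role of the K slider: with two rewards the RLOO column is exactly twice the GRPO column, and the gap narrows as the group grows.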

The Cohere recipe — and why it's so spare

Ahmadian et al. (2024, "Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs") argue that for LLMs, PPO's clip and value function are over-engineered.

So RLOO is REINFORCE + LOO group baseline + KL anchor. Nothing else. The loss is literally

$$L = -\frac{1}{K} \sum_{i=1}^{K} A_i^{\text{RLOO}} \sum_t \log \pi_\theta(y_{i,t} \mid x, y_{i,<t}) + \beta \cdot \mathrm{KL}$$

Every token inside a rollout contributes equally, weighted by its rollout's advantage. There's no clip, no std normalization, and (notably) no per-rollout length normalization: RLOO never had the $/|y_i|$ divisor that Dr.GRPO (lesson 14) will argue against.
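To make the loss concrete, here is a minimal sketch that evaluates it as a plain number from precomputed per-token log-probs. Several things are assumptions made for illustration: the function name `rloo_loss`, the list-of-lists input format, and in particular the KL term, which the lesson leaves unspecified and which is estimated here as the group average of the sequence-level log-ratio between the policy and a reference model. In real training the same expression would be built in an autodiff framework, with the advantages treated as constants so that gradients flow only through the log-probs.

```python
def rloo_loss(rewards, token_logps, ref_token_logps, beta):
    """Scalar RLOO loss for one prompt with K rollouts.

    rewards:         K scalar rewards, one per rollout
    token_logps:     K lists of per-token log pi_theta(y_t | x, y_<t)
    ref_token_logps: K lists of per-token log-probs under a reference policy
    beta:            KL coefficient
    """
    K = len(rewards)
    if K < 2:
        raise ValueError("leave-one-out needs at least 2 rollouts")
    total = sum(rewards)
    pg_term = 0.0
    kl_term = 0.0
    for r, logps, ref_logps in zip(rewards, token_logps, ref_token_logps):
        advantage = r - (total - r) / (K - 1)  # leave-one-out baseline
        seq_logp = sum(logps)                  # summed over tokens, no /len(logps)
        pg_term -= advantage * seq_logp
        kl_term += seq_logp - sum(ref_logps)   # assumed sequence-level KL estimate
    return pg_term / K + beta * kl_term / K


loss = rloo_loss(
    rewards=[1.0, 0.0],
    token_logps=[[-0.5, -0.5], [-2.0]],
    ref_token_logps=[[-0.5, -0.7], [-2.0]],
    beta=0.1,
)
print(loss)
```

Note that the two rollouts have different lengths and nothing rescales for that: each sequence's summed log-prob is multiplied by its advantage as is.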

Interpretation · counterfactual baseline

The LOO baseline answers: "what reward would a typical rollout from this prompt get, given everything I know except the rollout I'm crediting?" It is the same hold-one-out principle as LOO cross-validation and the jackknife estimator: the sample being scored never contributes to the quantity it is scored against. The result is a variance-reducing baseline with a clean derivation.

RLOO vs. GRPO · when to prefer which

| Aspect | GRPO | RLOO |
| --- | --- | --- |
| Baseline bias | O(1/K) | Strictly unbiased |
| Std normalization | Yes | No |
| PPO clip | Yes | No |
| Reward-scale sensitivity | Low (std-normalized) | Higher; need to control reward magnitudes |
| Multi-step / async training | Composes with clip + IS ratio | On-policy as published; could add IS+clip but then it's just GRPO without /std |
| Memory | 1× policy | 1× policy |

For production: pick RLOO if rewards are well-controlled and you want the cleanest math; pick GRPO/DAPO if you want the PPO-style safety net (clip + std-normalization) for stability in the face of reward outliers or multi-step async training.
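The reward-scale row of the comparison is easy to see numerically. This sketch uses hypothetical helpers and assumes a population standard deviation plus a small `eps` for the GRPO normalization; actual implementations differ on both choices. It multiplies an illustrative reward vector by 10 and compares the resulting advantages.

```python
from statistics import pstdev


def rloo_advantages(rewards):
    """Leave-one-out advantages: r_i minus the mean of the other K-1 rewards."""
    K = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (K - 1) for r in rewards]


def grpo_normalized_advantages(rewards, eps=1e-8):
    """Group-mean-centered advantages divided by the group std.

    Assumption: population std and an additive eps; libraries vary on both.
    """
    mean = sum(rewards) / len(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


rewards = [1.0, 0.0, 0.5, 0.5]        # illustrative values
scaled = [10.0 * r for r in rewards]  # same ranking, 10x the magnitude

print("RLOO:", rloo_advantages(rewards), rloo_advantages(scaled))
print("GRPO:", grpo_normalized_advantages(rewards), grpo_normalized_advantages(scaled))
```

The RLOO advantages grow by the same factor of 10, so the effective step size moves with the reward scale, while the std-normalized GRPO advantages stay essentially unchanged.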

Subtle but important · LOO is the cleanest derivation, not necessarily the fastest convergence

Empirically, GRPO and RLOO land within about 1 percentage point of each other on most LLM benchmarks at comparable hyperparameters. The choice is more about philosophical cleanness and which infrastructure quirks you can tolerate than about which one wins on a leaderboard. People who insist on unbiased estimators use RLOO; people who prefer the safety net of clipping and std-normalization use GRPO. Both work.

Takeaway
RLOO is GRPO's clean version: replace the biased "mean of all K" baseline with the unbiased "mean of other K−1", then drop everything that variance-reduction-via-clipping was patching. The result is REINFORCE + LOO baseline + KL, full stop. At K=4 its advantages equal GRPO's centered (pre-std) advantages times 4/3, a factor the learning rate can absorb.