RLOO: leave-one-out, strictly unbiased · Ahmadian et al. 2024 (Cohere)

GRPO's group-mean baseline includes r_i itself. RLOO instead uses the mean of the other K−1 rollouts. The cost is the same and the estimator is strictly unbiased, and the recipe also drops the clip and the std divisor.

The flaw RLOO fixes

From the GRPO lesson:

A_i^GRPO = r_i − mean(r₁, …, r_K)

The mean is over all K rollouts including i, so the baseline itself depends on r_i, and hence on the very action y_i whose log-prob we're scaling. The policy-gradient theorem's baseline-unbiasedness argument requires a baseline independent of y_i. So the GRPO estimator is biased at finite K, with a bias that decays as O(1/K).

RLOO's fix is one line: use the mean of the other K−1 rollouts.

A_i^RLOO = r_i − (1/(K−1)) Σ_{j ≠ i} r_j
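A minimal sketch of this formula in plain Python. The function name `rloo_advantages` is illustrative and not taken from the paper or any library. It relies on the fact that the sum over the other rollouts equals the group total minus r_i, so the cost stays O(K).

```python
def rloo_advantages(rewards: list[float]) -> list[float]:
    """Leave-one-out advantages for K rollouts of the same prompt.

    The baseline for rollout i is the mean reward of the other K-1 rollouts.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out needs at least two rollouts")
    total = sum(rewards)
    # (total - r) is the sum over j != i, so no inner loop is needed.
    return [r - (total - r) / (k - 1) for r in rewards]


print(rloo_advantages([1.0, 0.0, 0.0, 0.0]))
```

For the rewards [1, 0, 0, 0], the first rollout gets advantage 1 (its peers all scored 0) and each of the others gets −1/3.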

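Since the other K−1 rollouts are i.i.d. samples from π_θ(·|x), they are independent of y_i given x, so the baseline is a function only of x and of quantities independent of y_i. Its product with ∇ log π_θ(y_i|x) therefore has zero expectation, and the estimator is strictly unbiased.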
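The unbiasedness claim can be checked exactly on a toy problem. The sketch below assumes a three-action softmax bandit with arbitrary logits and rewards (none of this setup comes from the paper) and enumerates every K-tuple of sampled actions, so the expectations are exact rather than Monte Carlo estimates. All function names are hypothetical.

```python
import itertools
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_gradient(logits, rewards, k, leave_one_out):
    """Exact expectation of the K-sample REINFORCE estimator on a toy bandit.

    Enumerates every K-tuple of actions, so it is only feasible for tiny K.
    """
    probs = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    for actions in itertools.product(range(n), repeat=k):
        weight = math.prod(probs[a] for a in actions)
        total = sum(rewards[a] for a in actions)
        for a in actions:
            if leave_one_out:
                baseline = (total - rewards[a]) / (k - 1)
            else:
                baseline = total / k
            advantage = rewards[a] - baseline
            # Softmax score function: d log pi(a) / d logit_b = 1[a == b] - pi_b.
            for b in range(n):
                score = (1.0 if a == b else 0.0) - probs[b]
                grad[b] += weight * advantage * score / k
    return grad


def true_gradient(logits, rewards):
    """Gradient of the expected reward with respect to the logits."""
    probs = softmax(logits)
    value = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - value) for p, r in zip(probs, rewards)]


logits = [0.2, -0.5, 1.0]   # arbitrary toy values
rewards = [1.0, 0.0, 0.5]   # arbitrary toy values
K = 3

exact = true_gradient(logits, rewards)
loo = expected_gradient(logits, rewards, K, leave_one_out=True)
full_mean = expected_gradient(logits, rewards, K, leave_one_out=False)

print("true gradient        :", exact)
print("E[LOO estimator]     :", loo)
print("E[full-mean estimator]:", full_mean)
```

On this toy problem the leave-one-out expectation matches the true gradient, while the full-mean expectation comes out as (K−1)/K times the true gradient: same direction, smaller magnitude.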

The algebra: RLOO is GRPO times K/(K−1)

RLOO and GRPO advantages are not as different as they look. A few lines of algebra:

A_i^RLOO = r_i − (Σ_j r_j − r_i)/(K−1)
         = (K · r_i − Σ_j r_j)/(K−1)
         = (K/(K−1)) · (r_i − mean(r))
         = (K/(K−1)) · A_i^GRPO    (centered, before the /std step)

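So at K=4: A_i^RLOO = (4/3) · A_i^GRPO ≈ 1.33 × the centered, pre-std GRPO advantage. The difference is a constant scale factor, absorbable into the learning rate. Read the other way, the mean-of-all-K baseline shrinks the expected gradient by (K−1)/K without changing its direction; that shrinkage is the O(1/K) bias.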
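A worked instance of the identity, using hypothetical binary rewards chosen only for illustration:

```latex
\begin{aligned}
K &= 4, \quad r = (1, 0, 0, 1), \quad \operatorname{mean}(r) = \tfrac{1}{2} \\
A_1^{\text{GRPO, centered}} &= 1 - \tfrac{1}{2} = \tfrac{1}{2} \\
A_1^{\text{RLOO}} &= 1 - \tfrac{0 + 0 + 1}{3} = \tfrac{2}{3} = \tfrac{4}{3} \cdot \tfrac{1}{2} \\
A_2^{\text{GRPO, centered}} &= 0 - \tfrac{1}{2} = -\tfrac{1}{2} \\
A_2^{\text{RLOO}} &= 0 - \tfrac{1 + 0 + 1}{3} = -\tfrac{2}{3} = \tfrac{4}{3} \cdot \left(-\tfrac{1}{2}\right)
\end{aligned}
```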

What's actually new about RLOO

Not the magnitude — the unbiasedness and the philosophical cleanliness. RLOO doesn't divide by std (which we'll see in lesson 14 is a bias). It doesn't use the PPO clip (Cohere's argument: variance is already well-controlled). It doesn't divide by response length. Everything is the honest REINFORCE-with-LOO-baseline gradient, with a KL anchor on top.

Interactive · GRPO vs. RLOO side by side

Slide K and the rewards. The widget shows both advantage formulas and the difference in their per-rollout values.

GRPO vs. RLOO advantage, K rollouts

Written over the common numerator K · r_i − Σ r, GRPO's centered advantage divides by K and RLOO's divides by K−1. At K=2 the ratio is 2 (the LOO advantage is twice GRPO's centered advantage); at K=8 it's 8/7 ≈ 1.14. As K grows the ratio tends to 1 and the two coincide.

[Interactive widget, shown at K = 4 and accuracy 0.50: a per-rollout table with columns i, r_i, A^GRPO (centered, no /std), A^RLOO, and ratio, plus readouts for K/(K−1) and mean(r).]
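A static stand-in for the side-by-side comparison, as a rough sketch. The reward pattern (first half of the rollouts correct, so accuracy 0.5) is a hypothetical choice, and `advantage_table` is an illustrative name.

```python
def advantage_table(rewards):
    """Rows of (i, r_i, centered GRPO advantage, RLOO advantage, ratio)."""
    k = len(rewards)
    total = sum(rewards)
    mean = total / k
    rows = []
    for i, r in enumerate(rewards, start=1):
        centered = r - mean                    # GRPO advantage before the /std step
        loo = r - (total - r) / (k - 1)        # RLOO advantage
        # The ratio is undefined when a reward sits exactly on the group mean.
        ratio = loo / centered if centered != 0 else float("nan")
        rows.append((i, r, centered, loo, ratio))
    return rows


# Hypothetical rewards: the first half of the rollouts is correct.
for k in (2, 4, 8):
    rewards = [1.0] * (k // 2) + [0.0] * (k - k // 2)
    print(f"K = {k}, K/(K-1) = {k / (k - 1):.3f}, mean(r) = {sum(rewards) / k:.2f}")
    print("  i   r_i   A^GRPO   A^RLOO   ratio")
    for i, r, centered, loo, ratio in advantage_table(rewards):
        print(f"  {i}   {r:.1f}   {centered:+.3f}   {loo:+.3f}   {ratio:.3f}")
```

Every row in a given group shows the same ratio K/(K−1), which shrinks toward 1 as K grows.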

The Cohere recipe — and why it's so spare

Ahmadian et al. (2024, "Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs") argue that for LLMs, PPO's clip and value function are over-engineered:

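The policy starts from a strong pretrained and fine-tuned checkpoint rather than a random initialization, and the reward is attached only to the full response, so return variance is small enough that REINFORCE with a group baseline is already low-variance.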
PPO's clip was designed for continuous-control tasks where log π can change wildly step-to-step; for LLMs with a careful LR it rarely fires.
The value head is a second model that costs memory and learns slowly from sparse terminal rewards.

So RLOO is REINFORCE + LOO group baseline + KL anchor. Nothing else. The loss is literally

L = − (1/K) Σ_i A_i^RLOO · Σ_t log π_θ(y_{i,t} | x, y_{i,<t}) + β · KL

Every token inside a rollout contributes equally at the rollout's advantage weight. There's no clip, no std normalization, and (notably) no per-rollout length normalization — RLOO never had the /|y_i| divisor that Dr.GRPO (lesson 14) will argue against.
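A sketch of the loss above for a single prompt, evaluated as a plain number. It assumes the per-token log-probabilities and a KL estimate are already available; the function name, argument names, and the way the KL term is supplied are all hypothetical. In real training an autodiff framework would differentiate through the log-probabilities while treating the advantages as constants.

```python
def rloo_loss(token_logprobs, rewards, beta=0.0, kl_estimate=0.0):
    """Scalar value of -(1/K) * sum_i A_i * sum_t log pi(y_i,t | ...) + beta * KL.

    token_logprobs: one list of per-token log-probabilities per rollout.
    rewards:        one scalar reward per rollout.
    kl_estimate:    a precomputed KL estimate; how it is obtained is left open here.
    """
    k = len(rewards)
    if k < 2 or len(token_logprobs) != k:
        raise ValueError("need K >= 2 rollouts with one reward each")
    total = sum(rewards)
    loss = 0.0
    for logps, r in zip(token_logprobs, rewards):
        advantage = r - (total - r) / (k - 1)   # leave-one-out baseline
        sequence_logprob = sum(logps)           # summed over tokens, not averaged
        loss -= advantage * sequence_logprob / k
    return loss + beta * kl_estimate


# Two hypothetical rollouts of different lengths.
logps = [[-0.1, -0.2], [-0.3, -0.4, -0.5]]
print(rloo_loss(logps, rewards=[1.0, 0.0]))
```

Because the token log-probabilities are summed rather than averaged, the longer rollout carries more weight in the loss, which is the absence of length normalization described above.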

Interpretation · counterfactual baseline

The LOO baseline answers: "what reward would a typical rollout from this prompt get, given everything I know except the rollout I'm crediting?" Same principle as LOO cross-validation and the jackknife estimator — a reliable variance-reduction technique with a clean derivation.

RLOO vs. GRPO · when to prefer which

Aspect	GRPO	RLOO
Baseline bias	O(1/K)	Strictly unbiased
Std normalization	Yes	No
PPO clip	Yes	No
Reward-scale sensitivity	Low (std-normalized)	Higher — need to control reward magnitudes
Multi-step / async training	Composes with clip + IS ratio	On-policy as published; could add IS+clip but then it's just GRPO without /std
Memory	1× policy	1× policy
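The reward-scale row can be illustrated with a short sketch. It assumes GRPO normalizes by the population standard deviation plus a small `eps` stabilizer; both choices, along with the function names and reward values, are assumptions made for illustration.

```python
import statistics


def rloo_advantages(rewards):
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def grpo_advantages(rewards, eps=1e-8):
    """Centered and std-normalized advantages; eps is an assumed stabilizer."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


rewards = [1.0, 0.0, 0.5, 0.0]          # hypothetical rewards
scaled = [10.0 * r for r in rewards]    # same ranking, 10x the magnitude

print("RLOO:", rloo_advantages(rewards), "->", rloo_advantages(scaled))
print("GRPO:", grpo_advantages(rewards), "->", grpo_advantages(scaled))
```

Multiplying the rewards by 10 multiplies the RLOO advantages by 10, and with them the gradient step, while the std-normalized GRPO advantages stay essentially unchanged.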

For production: pick RLOO if rewards are well-controlled and you want the cleanest math; pick GRPO/DAPO if you want the PPO-style safety net (clip + std normalization) for stability in the face of reward outliers or multi-step async training.

Subtle but important · LOO is the cleanest derivation, not necessarily the fastest convergence

Empirically GRPO and RLOO are within about 1 percentage point of each other on most LLM benchmarks at comparable hyperparameters. The choice is more about philosophical cleanliness and which infrastructure quirks you can tolerate than about which one wins on a leaderboard. People who insist on unbiased estimators use RLOO; people who prefer the safety net of clipping and std normalization use GRPO. Both work.

Takeaway

RLOO is GRPO's clean version: replace the biased "mean of all K" baseline with the unbiased "mean of the other K−1", then drop the extras (clip, std divisor) that were there to tame variance. The result is REINFORCE + LOO baseline + KL, full stop. At K=4 its advantages are exactly 4/3 times GRPO's centered, pre-std advantages, so without the clip and std divisor the two differ only by a 4/3 learning-rate rescaling.