RLOO: leave-one-out, strictly unbiased
Ahmadian et al. 2024 (Cohere)
GRPO's group mean includes r_i in its own baseline. RLOO uses the mean of the other K−1 rollouts instead. The cost is the same, the baseline is strictly unbiased, and the published recipe drops the clip and the std divisor as well.
The flaw RLOO fixes
From the GRPO lesson:
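A reconstruction of that advantage, assuming the usual GRPO notation: r_1, …, r_K are the rewards of K rollouts for one prompt, and the symbols \bar r and \sigma_r (labels chosen here) are the group mean and the group standard deviation, whose exact convention varies between implementations.

```latex
A_i^{\mathrm{GRPO}} = \frac{r_i - \bar r}{\sigma_r},
\qquad
\bar r = \frac{1}{K}\sum_{j=1}^{K} r_j
```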
The mean is over all K rollouts including i, so A_i is correlated with the very action y_i whose log-prob we're scaling. The policy-gradient theorem's baseline-unbiasedness argument requires a baseline independent of y_i. So GRPO is biased at finite K, with bias that decays as O(1/K).
RLOO's fix is one line: use the mean of the other K−1 rollouts.
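Written out in the same notation as the GRPO rewards (r_1, …, r_K for K rollouts of one prompt), the leave-one-out advantage would read as follows. This is a sketch of the standard form, not a quotation from the lesson.

```latex
A_i^{\mathrm{RLOO}} = r_i - \frac{1}{K-1}\sum_{j \neq i} r_j
```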
Since the other K−1 rollouts are i.i.d. samples from πθ(·|x), and therefore independent of y_i, the baseline depends only on x and on quantities independent of y_i. Its contribution to the expected gradient is then zero, so the estimator is strictly unbiased.
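The unbiasedness claim can be checked exactly on a toy problem. The sketch below assumes a hypothetical softmax policy over three single-token "responses" with made-up rewards, and it uses the group-mean baseline without the std divisor as a stand-in for GRPO. It enumerates every possible K-tuple of samples, so the expectations are exact rather than Monte Carlo estimates. All function names are illustrative.

```python
import itertools
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def true_gradient(logits, rewards):
    # Exact gradient of E_{y ~ pi}[r(y)] with respect to the logits.
    p = softmax(logits)
    return p * (rewards - p @ rewards)

def expected_estimate(logits, rewards, K, leave_one_out):
    # Exact expectation of (1/K) * sum_i A_i * grad log pi(y_i),
    # obtained by enumerating every K-tuple of sampled responses.
    p = softmax(logits)
    n = p.size
    total = np.zeros(n)
    for ys in itertools.product(range(n), repeat=K):
        idx = list(ys)
        prob = np.prod(p[idx])            # probability of this K-tuple
        r = rewards[idx]
        if leave_one_out:
            baseline = (r.sum() - r) / (K - 1)   # mean of the other K-1 rewards
        else:
            baseline = np.full(K, r.mean())      # group mean, includes r_i
        scores = np.eye(n)[idx] - p       # grad of log softmax at each sample
        total += prob * ((r - baseline) @ scores) / K
    return total

# Hypothetical toy values, chosen only for illustration.
logits = np.array([0.2, -0.1, 0.4])
rewards = np.array([1.0, 0.0, 0.5])
K = 3

print("true gradient    :", true_gradient(logits, rewards))
print("E[RLOO estimate] :", expected_estimate(logits, rewards, K, True))
print("E[group-mean est]:", expected_estimate(logits, rewards, K, False))
```

In this toy setting the leave-one-out expectation matches the true gradient, while the group-mean expectation comes out as the true gradient shrunk by (K−1)/K, which is one concrete form the O(1/K) bias can take.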
The algebra: RLOO is GRPO times K/(K−1)
RLOO and GRPO advantages are not as different as they look. A few lines of algebra:
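A sketch of that algebra, writing \bar r for the mean over all K rewards (a label chosen here) and using the identity \sum_{j \neq i} r_j = K\bar r - r_i:

```latex
\begin{aligned}
A_i^{\mathrm{RLOO}}
  &= r_i - \frac{1}{K-1}\sum_{j \neq i} r_j
   = r_i - \frac{K\bar r - r_i}{K-1} \\
  &= \frac{(K-1)\,r_i - K\bar r + r_i}{K-1}
   = \frac{K}{K-1}\,\bigl(r_i - \bar r\bigr)
   = \frac{K}{K-1}\,A_i^{\mathrm{GRPO\text{-}no\text{-}std}}
\end{aligned}
```

Here A_i^{GRPO-no-std} denotes the GRPO advantage with the std divisor removed, that is, r_i − \bar r.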
So at K=4: A_i^RLOO = (4/3) · A_i^GRPO-no-std ≈ 1.33 × A_i^GRPO-no-std. For a fixed K the difference is a constant scale factor, which can be absorbed into the learning rate.
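The scale-factor relation is easy to confirm numerically. The following is a minimal sketch with hypothetical rewards for K=4 rollouts; the function names are illustrative and do not come from any library.

```python
import numpy as np

def grpo_no_std_advantages(rewards):
    # Group-mean baseline: the mean includes r_i itself.
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

def rloo_advantages(rewards):
    # Leave-one-out baseline: mean of the other K-1 rewards.
    r = np.asarray(rewards, dtype=float)
    K = r.size
    return r - (r.sum() - r) / (K - 1)

# Hypothetical rewards for K = 4 rollouts of one prompt.
rewards = [1.0, 0.0, 0.5, 0.2]
a_grpo = grpo_no_std_advantages(rewards)
a_rloo = rloo_advantages(rewards)

print("GRPO (no std):", a_grpo)
print("RLOO         :", a_rloo)
print("ratio        :", a_rloo / a_grpo)
```

Every entry of the ratio equals K/(K−1), regardless of the reward values, as long as the group-mean advantage is nonzero.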
Interactive · GRPO vs. RLOO side by side
Slide K and the rewards. The widget shows both advantage formulas and the difference in their per-rollout values.
The Cohere recipe — and why it's so spare
Ahmadian et al. (2024, "Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs") argue that for LLMs, PPO's clip and value function are over-engineered:
- LLM rollouts are short compared to classical RL episodes; return variance is small enough that REINFORCE-with-a-group-baseline is already low-variance.
- PPO's clip was designed for continuous-control tasks where log π can change wildly step-to-step; for LLMs with a careful LR it rarely fires.
- The value head is a second model that costs memory and learns slowly from sparse terminal rewards.
So RLOO is REINFORCE + LOO group baseline + KL anchor. Nothing else. The loss is literally
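One way to write it out, under stated assumptions: the advantage A_i^{RLOO} is treated as a constant (no gradient flows through it), the KL anchor appears as a separate penalty with a hypothetical coefficient \beta, and \pi_{\mathrm{ref}} is the reference policy. Folding the KL term into the reward instead is another common choice.

```latex
\mathcal{L}(\theta)
  = -\frac{1}{K}\sum_{i=1}^{K} A_i^{\mathrm{RLOO}}
     \sum_{t=1}^{|y_i|} \log \pi_\theta\bigl(y_{i,t} \mid x,\, y_{i,<t}\bigr)
    \;+\; \beta\,\mathrm{KL}\bigl(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\bigr)
```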
Every token in a rollout is weighted by that rollout's advantage. There's no clip, no std normalization, and (notably) no per-rollout length normalization: RLOO never had the /|y_i| divisor that Dr. GRPO (lesson 14) will argue against.
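A minimal sketch of the loss value for one prompt group is given below. It is not the authors' implementation. It assumes the per-token log-probs are already available as arrays, and it takes the KL term as a precomputed scalar `kl_estimate` with a hypothetical coefficient `beta`. NumPy has no autograd, so this only shows how the scalar is assembled; in a real trainer the advantages would be detached from the graph.

```python
import numpy as np

def rloo_loss(token_logps, rewards, beta=0.0, kl_estimate=0.0):
    """Scalar RLOO loss for K rollouts of one prompt.

    token_logps : list of K 1-D arrays, log pi(y_t | x, y_<t) for each token
    rewards     : K scalar rewards, one per rollout
    """
    r = np.asarray(rewards, dtype=float)
    K = r.size
    adv = r - (r.sum() - r) / (K - 1)        # leave-one-out advantage
    # Sum (not mean) over tokens: no /|y_i| length normalization.
    seq_logps = np.array([lp.sum() for lp in token_logps])
    # No clip and no std divisor: plain REINFORCE with a baseline.
    return -(adv * seq_logps).mean() + beta * kl_estimate

# Hypothetical group of K = 2 rollouts with different lengths.
token_logps = [np.array([-1.0, -1.0, -1.0]), np.array([-2.0])]
rewards = [1.0, 0.0]
loss = rloo_loss(token_logps, rewards)
print("loss:", loss)
```

Because the token log-probs are summed, a longer rollout contributes more terms at the same advantage weight, which is the behavior the paragraph above describes.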
Interpretation · counterfactual baseline
The LOO baseline answers: "what reward would a typical rollout from this prompt get, given everything I know except the rollout I'm crediting?" Same principle as LOO cross-validation and the jackknife estimator — a reliable variance-reduction technique with a clean derivation.
RLOO vs. GRPO · when to prefer which
| Aspect | GRPO | RLOO |
|---|---|---|
| Baseline bias | O(1/K) | Strictly unbiased |
| Std normalization | Yes | No |
| PPO clip | Yes | No |
| Reward-scale sensitivity | Low (std-normalized) | Higher — need to control reward magnitudes |
| Multi-step / async training | Composes with clip + IS ratio | On-policy as published; could add IS+clip but then it's just GRPO without /std |
| Memory | 1× policy | 1× policy |
For production: pick RLOO if rewards are well-controlled and you want the cleanest math; pick GRPO/DAPO if you want the PPO-style safety net (clip + std-normalization) for stability in the face of reward outliers or multi-step async training.
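The reward-scale row of the table can be illustrated with a small experiment. The sketch below uses hypothetical rewards and a hypothetical `eps` stabilizer in the std divisor. It multiplies all rewards by 10 and compares how the two advantage definitions respond.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    # Group-mean baseline with std normalization.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def rloo_advantages(rewards):
    # Leave-one-out baseline, no normalization.
    r = np.asarray(rewards, dtype=float)
    K = r.size
    return r - (r.sum() - r) / (K - 1)

# Hypothetical rewards, then the same rewards on a 10x larger scale.
rewards = np.array([1.0, 0.0, 0.5, 0.2])
scaled = 10.0 * rewards

print("GRPO:", grpo_advantages(rewards), "->", grpo_advantages(scaled))
print("RLOO:", rloo_advantages(rewards), "->", rloo_advantages(scaled))
```

The std-normalized advantages are essentially unchanged by the rescaling, while the RLOO advantages grow by the same factor of 10, so the effective step size follows the reward magnitude.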
Subtle but important · LOO is the cleanest derivation, not necessarily the fastest convergence
Empirically, GRPO and RLOO are within about 1 percentage point of each other on most LLM benchmarks at comparable hyperparameters. The choice is more about conceptual cleanliness and which infrastructure quirks you can tolerate than about which one wins on a leaderboard. People who insist on unbiased estimators use RLOO; people who prefer the safety net of clipping and std-normalization use GRPO. Both work.