DAPO: four practical fixes on GRPO (ByteDance-Seed, 2025)
Open-source SOTA for verifiable-reward reasoning. Take GRPO. Add four targeted patches addressing four observed failure modes at scale. ~5 lines of code each.
What "practical fix" means here
DAPO doesn't introduce a new algorithm in the formal sense: every advantage is still (r − mean)/(std + ε), the surrogate is still PPO-clip, and this lesson's implementation still anchors with a KL term against π_ref (the DAPO paper itself drops the KL penalty). What DAPO contributes is a list of training failure modes that were holding open-source frontier RL back, and a small patch for each. The four fixes are additive: you can toggle them individually and watch each one's effect.
| # | Fix | Failure mode it addresses |
|---|---|---|
| 1 | Clip-higher | Entropy collapse: the upper clip bound caps how far a low-probability token can be boosted, so mass concentrates on already-likely tokens and exploration dies. |
| 2 | Dynamic sampling | Wasted compute: at 90% accuracy with K = 4, ~66% of groups have zero reward variance → zero advantage → useless gradient. |
| 3 | Token-level loss | Length bias: per-rollout mean weights short and long rollouts equally, underweighting reasoning-heavy long correct answers. |
| 4 | Overlong soft penalty | False negatives: hard-zero reward for rollouts that exceeded the length budget penalizes correctly-reasoning-but-slow responses. |
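The shared baseline that all four fixes build on is the group-normalized advantage. A minimal sketch in plain Python follows; the function name, the `eps` value, and the choice of population standard deviation are assumptions for illustration, not taken from the lesson's code.

```python
from statistics import fmean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantage: (r - mean) / (std + eps) over one prompt's K rollouts."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std (assumption); a trainer may use the sample std
    return [(r - mean) / (std + eps) for r in rewards]

# A group with mixed outcomes yields non-zero advantages...
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])
# ...while a group where every rollout got the same reward yields all zeros.
uniform = group_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(uniform)
```

The all-zero case is the one Fix 2 targets: a group like `uniform` contributes nothing to the policy gradient.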
Fix 1 · Clip-higher
Standard PPO clip is symmetric in the ratio: clip(ρ, 1−ε, 1+ε) with ε = 0.2. The bounds treat every token alike, but they limit the ratio, not the probability, so the headroom a token gets in probability space depends on where it starts:
- A low-probability winning token (say p = 0.01) reaches the +ε ceiling at p = 0.012. Past that point the clip zeroes its gradient, so a rare but useful "exploration" token can gain almost no mass per update.
- A high-probability winning token (p ≈ 0.9) would need p = 1.08 to reach a ratio of 1.2, which is impossible, so the upper clip never binds and the token is reinforced freely.
- Net effect: already-likely tokens keep getting sharper while unlikely ones barely move. Entropy shrinks, and over many updates rollouts become near-deterministic.
Fix: asymmetric bounds, clip(ρ, 1−ε_low, 1+ε_high), with ε_low = 0.2 (unchanged) and ε_high = 0.28 (looser).
One extra scalar, same code path. Entropy decays more slowly, exploration survives longer.
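To make the effect of the extra scalar concrete, here is a small sketch in plain Python. The helper names are hypothetical; `clipped_surrogate` is the usual per-token PPO-clip objective with the two bounds separated, and `max_boosted_prob` reports the probability at which a positive-advantage token reaches the upper bound and stops receiving gradient.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token PPO-clip objective (to be maximized) with separate lower and upper bounds."""
    clipped_ratio = max(1.0 - eps_low, min(ratio, 1.0 + eps_high))
    return min(ratio * advantage, clipped_ratio * advantage)

def max_boosted_prob(p_old, eps_high):
    """Probability at which a positive-advantage token reaches the upper clip bound."""
    return min(1.0, p_old * (1.0 + eps_high))

# Compare the symmetric bound (0.2) with the looser DAPO upper bound (0.28).
for p_old in (0.01, 0.9):
    print(p_old, max_boosted_prob(p_old, 0.2), max_boosted_prob(p_old, 0.28))
```

For the rare token the ceiling moves from roughly 0.012 to roughly 0.0128 per update, while the likely token is capped only by p = 1 in both cases, so the looser bound mainly helps low-probability tokens.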
Fix 2 · Dynamic sampling
If all K rewards are equal, the advantage is zero for every rollout, the policy gradient is zero, and the optimizer step is just AdamW's variance estimator absorbing numerical noise. With binary rewards and independent rollouts at 90% accuracy, this happens with probability 0.9^K + 0.1^K ≈ 0.66 at K = 4: two thirds of compute wasted.
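The arithmetic can be checked directly. The sketch below assumes binary rewards and independent rollouts with a fixed per-rollout accuracy; the function name is made up for illustration.

```python
def degenerate_fraction(accuracy, k):
    """Probability that all k independent binary-reward rollouts agree (all right or all wrong)."""
    return accuracy ** k + (1.0 - accuracy) ** k

# How the share of zero-variance groups changes with group size at 90% accuracy.
for k in (4, 8, 16):
    print(k, round(degenerate_fraction(0.9, k), 4))
```

Larger groups reduce the degenerate share, but each group then costs proportionally more rollouts.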
Fix: keep resampling prompts (or refilling the batch from a pool) until you have a group with at least one correct and one incorrect rollout. Every optimizer step then corresponds to a group with real signal.
The trade is more rollouts per optimizer step, but each step is gradient-rich. On AIME-level problems where most prompts are uniformly easy or uniformly hard for the current policy, this is a 3–5× wall-clock speedup versus wasting steps on degenerate groups.
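A minimal sketch of the refill loop, in plain Python. `sample_group`, `fill_batch`, and the toy sampler are hypothetical stand-ins, not the lesson's `rollout_with_dynamic_sampling`; a real trainer would carry the trajectories along with the rewards.

```python
import random

def has_signal(rewards):
    """A group is useful only if its rewards are not all identical."""
    return max(rewards) != min(rewards)

def fill_batch(sample_group, batch_size, max_attempts=10_000):
    """Keep drawing groups until batch_size of them have non-zero reward variance.

    sample_group is a hypothetical callable returning (prompt_id, rewards)
    for one prompt's K rollouts.
    """
    batch, attempts = [], 0
    while len(batch) < batch_size and attempts < max_attempts:
        attempts += 1
        prompt_id, rewards = sample_group()
        if has_signal(rewards):
            batch.append((prompt_id, rewards))
    return batch, attempts

# Toy sampler: K = 4 binary rewards at 90% per-rollout accuracy.
rng = random.Random(0)

def toy_sample_group(k=4, accuracy=0.9):
    rewards = [1.0 if rng.random() < accuracy else 0.0 for _ in range(k)]
    return rng.randrange(1000), rewards

batch, attempts = fill_batch(toy_sample_group, batch_size=8)
print(len(batch), attempts)
```

The `attempts` count makes the trade visible: it is the number of groups that had to be rolled out to obtain eight groups with usable signal.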
Fix 3 · Token-level loss
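The most subtle of the four. GRPO averages the per-token loss within each rollout, then averages across the K rollouts:
L_GRPO = (1/K) Σ_i (1/|y_i|) Σ_t ℓ_{i,t}
where |y_i| is the length of rollout i and ℓ_{i,t} is the loss on its t-th token.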
This means each rollout contributes the same total gradient mass regardless of length. A 200-token winning rollout and a 2000-token winning rollout have the same total gradient — but the longer one represents 10× more reasoning the policy should reinforce. Long correct rollouts are underweighted; long bad rollouts are underpenalized; the policy drifts toward longer-and-less-rewarded responses.
Fix: sum over tokens across the whole group, divide by total token count:
L_DAPO = (1 / Σ_i |y_i|) Σ_i Σ_t ℓ_{i,t}   (ℓ_{i,t}: loss on token t of rollout i)
Now every token contributes equally regardless of which rollout it lived in. Long rollouts dominate the gradient when long rollouts are what the policy actually emitted. (Dr.GRPO, lesson 14, will take this one step further.)
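The difference between the two aggregations shows up on a toy group. The sketch below uses plain Python lists of per-token losses; the function names and the numbers are illustrative assumptions.

```python
def sample_level_loss(token_losses):
    """GRPO-style: mean over tokens inside each rollout, then mean over rollouts."""
    per_rollout = [sum(toks) / len(toks) for toks in token_losses]
    return sum(per_rollout) / len(per_rollout)

def token_level_loss(token_losses):
    """DAPO-style: sum every token in the group, divide by the total token count."""
    total_tokens = sum(len(toks) for toks in token_losses)
    return sum(sum(toks) for toks in token_losses) / total_tokens

# Toy group: a 2-token rollout with loss 1.0 per token and an 8-token rollout with loss 0.0.
group = [[1.0] * 2, [0.0] * 8]

print(sample_level_loss(group))  # each rollout carries weight 1/2
print(token_level_loss(group))   # each token carries weight 1/10
```

Under the sample-level mean the short rollout supplies half of the loss; under the token-level mean it supplies two tenths, in proportion to the tokens it actually contains.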
Fix 4 · Overlong soft penalty
If the model rolls past max_new tokens without emitting EOS, classical GRPO gives it reward 0. But the response might have been about to be correct — just slow. The model receives a penalty that has nothing to do with reasoning quality, just with an arbitrary length cap.
Fix: replace the hard zero with a soft, length-dependent penalty:
overlong_penalty(L) = max(0, min(1, (L − L_thresh) / (L_max − L_thresh)))
reward_i = raw_reward_i − λ · overlong_penalty(L_i)
If the response finishes within L_thresh tokens, the penalty is 0 and the verifier reward passes through. Between L_thresh and L_max the penalty ramps linearly from 0 to λ, and at or beyond L_max it stays at the full λ. The model learns "try to finish on time, but don't collapse your answer just to emit EOS faster".
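The two formulas above translate almost line for line into code. In this sketch the threshold of 3072 tokens, the cap of 4096 tokens, and λ = 1.0 are illustrative assumptions, not values from the lesson.

```python
def overlong_penalty(length, l_thresh, l_max):
    """Linear ramp from 0 at l_thresh to 1 at l_max, clamped to [0, 1]."""
    return max(0.0, min(1.0, (length - l_thresh) / (l_max - l_thresh)))

def shaped_reward(raw_reward, length, l_thresh, l_max, lam=1.0):
    """Verifier reward minus the scaled overlong penalty."""
    return raw_reward - lam * overlong_penalty(length, l_thresh, l_max)

# A correct answer (raw reward 1.0) at several response lengths.
for length in (1000, 3072, 3584, 4096, 5000):
    print(length, shaped_reward(1.0, length, l_thresh=3072, l_max=4096))
```

A correct response halfway through the ramp keeps half of its reward, so the gradient still favors it over an incorrect one while nudging the policy toward finishing earlier.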
Interactive · toggle the four fixes
[Interactive demo, not reproduced here: a simulated DAPO run with each fix toggleable, plotting reward EMA over 200 steps, to show which fixes the toy task is sensitive to.]
DAPO is GRPO plus four conditionals
```python
# From RL/algorithms/04_dapo.py: dapo_step (annotated excerpt; `...` marks elided arguments)

# Fix 2: keep resampling until the group has non-zero reward variance.
trajs, rewards = rollout_with_dynamic_sampling(...)

# Fix 4: soft penalty on overlong responses.
rewards = rewards - λ_overlong * overlong_penalty(lengths, max_new)

# (otherwise standard GRPO advantage)
adv = (rewards - rewards.mean()) / (rewards.std() + ε)

# Fix 1: asymmetric clip. A_tok is presumably adv broadcast to every token of its rollout.
s1 = ratio * A_tok  # unclipped surrogate (assumed; the excerpt elides its definition)
s2 = torch.clamp(ratio, 1 - clip_low, 1 + clip_high) * A_tok  # 1-0.20, 1+0.28
pg_tok = -torch.min(s1, s2) * target_mask  # zero out padding and prompt tokens

# Fix 3: token-level normalization (sum, then divide by total tokens).
pg_loss = pg_tok.sum() / total_tokens  # NOT (1/K) Σ (1/|y_i|) Σ
kl_loss = β * (compute_kl_k3(...) * target_mask).sum() / total_tokens
```
Each fix is a few lines and can be toggled with a flag. Together they close most of the open-source gap to DeepSeek-R1's unpublished post-training tricks. On AIME-style problems DAPO gives a clean 5–10 pp accuracy improvement over vanilla GRPO at equal compute.