
DAPO: four practical fixes on GRPO (ByteDance-Seed, 2025)

Open-source SOTA for verifiable-reward reasoning. Take GRPO. Add four targeted patches addressing four observed failure modes at scale. ~5 lines of code each.

What "practical fix" means here

DAPO doesn't introduce a new algorithm in the formal sense: every advantage is still (r − mean)/(std + ε) and the surrogate is still PPO-clip. This lesson's implementation also keeps the KL anchor against π_ref, although the DAPO paper itself drops the KL term. What DAPO contributes is a list of training failure modes that were holding open-source frontier RL back, and a small patch for each. The four fixes are additive: you can toggle them individually and watch each one's effect.
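As a reference point for the fixes below, here is a minimal sketch of the group-normalized advantage in plain Python. The function name and the choice of population standard deviation are assumptions for illustration (torch's `.std()` defaults to the sample standard deviation), not the lesson's actual code.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Advantage of each rollout within one prompt's group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ here
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group with mixed outcomes gets signed advantages.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout got the same reward gets all-zero advantages.
degenerate = group_advantages([1.0, 1.0, 1.0, 1.0])
```

The all-zero case is the one Fix 2 targets: a group like `degenerate` contributes no policy-gradient signal at all.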

# | Fix | Failure mode it addresses
1 | Clip-higher | Entropy collapse: symmetric clip caps "boost winning tokens" faster than "suppress losing tokens", so entropy decays and exploration dies.
2 | Dynamic sampling | Wasted compute: at 90% accuracy, ~66% of groups have zero reward variance → zero advantage → useless gradient.
3 | Token-level loss | Length bias: per-rollout mean weights short and long rollouts equally, underweighting reasoning-heavy long correct answers.
4 | Overlong soft penalty | False negatives: hard-zero reward for rollouts that exceeded the length budget penalizes correctly-reasoning-but-slow responses.

Fix 1 · Clip-higher

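Standard PPO clip is symmetric: clip(ρ, 1−ε, 1+ε) with ε = 0.2. The bounds look symmetric in ratio space, and they are (in log-space the lower bound is actually slightly looser). The asymmetry that matters is state-dependent: the absolute headroom under the upper bound scales with the token's current probability. With ε = 0.2, a token at probability 0.01 can rise to at most 0.012 before the clip removes its gradient, while a token at 0.9 has a nominal cap of 1.08 and is effectively unconstrained. The upper clip therefore bites hardest on low-probability exploration tokens, which is what drives entropy down.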

Fix: asymmetric bounds. ε_low = 0.2 (unchanged), ε_high = 0.28 (looser).

L_t^CLIP-HIGH  =  − min(  ρ · A,   clip(ρ, 1 − ε_low, 1 + ε_high) · A  )

One extra scalar, same code path. Entropy decays more slowly, exploration survives longer.
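A scalar sketch of the asymmetric clip, assuming a single token with importance ratio `ratio` and advantage `adv`. The helper names are hypothetical; a real implementation would apply the same arithmetic to tensors.

```python
def clip_higher_loss(ratio, adv, eps_low=0.2, eps_high=0.28):
    """Per-token loss: -min(ratio * A, clip(ratio, 1 - eps_low, 1 + eps_high) * A)."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return -min(ratio * adv, clipped * adv)

def max_prob_inside_clip(p_old, eps_high):
    """Largest new token probability that still lies inside the upper clip bound."""
    return min(1.0, p_old * (1.0 + eps_high))

# A winning token whose ratio has reached 1.25:
symmetric = clip_higher_loss(1.25, 1.0, eps_high=0.2)   # clipped at 1.2, gradient is cut off
asymmetric = clip_higher_loss(1.25, 1.0)                # still inside 1.28, gradient flows

# Headroom for a rare token at probability 0.01 under each upper bound:
rare_symmetric = max_prob_inside_clip(0.01, 0.2)
rare_asymmetric = max_prob_inside_clip(0.01, 0.28)
```

The lower bound is untouched, so suppression of losing tokens behaves exactly as in the symmetric case; only the room to boost tokens grows.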

Fix 2 · Dynamic sampling

If all K rewards are equal, the advantage is zero for every rollout, the gradient is zero, and the optimizer step is just AdamW's variance estimator absorbing numerical noise. At 90% accuracy this happens for about 0.9^K + 0.1^K ≈ 0.66 of groups at K=4: two thirds of compute wasted.
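The arithmetic can be checked directly. This sketch assumes binary rewards and K independent rollouts that each succeed with the same probability, which is a simplification of real per-prompt behavior; the function name is hypothetical.

```python
def degenerate_fraction(p_correct, k):
    """Probability that all k rollouts receive the same binary reward (all right or all wrong)."""
    return p_correct ** k + (1.0 - p_correct) ** k

at_k4 = degenerate_fraction(0.9, 4)    # 0.6561 + 0.0001
at_k16 = degenerate_fraction(0.9, 16)  # larger groups are degenerate less often
```

Under this model, raising K lowers the wasted fraction, but at the price of more rollouts per prompt.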

Fix: keep resampling prompts (or refilling the batch from a pool) until you have a group with at least one correct and one incorrect rollout. Every optimizer step then corresponds to a group with real signal.

The trade is more rollouts per optimizer step, but each step is gradient-rich. On AIME-level problems where most prompts are uniformly easy or uniformly hard for the current policy, this is a 3–5× wall-clock speedup versus wasting steps on degenerate groups.
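A toy sketch of the resampling loop, with a random number generator standing in for real rollouts and a verifier. All names here are hypothetical, and `max_groups` is an added safety cap so the loop terminates even if every prompt in the pool is degenerate.

```python
import random

def sample_group(p_correct, k, rng):
    """Hypothetical stand-in for k rollouts scored by a binary verifier."""
    return [1.0 if rng.random() < p_correct else 0.0 for _ in range(k)]

def fill_batch_with_dynamic_sampling(prompt_accuracies, batch_size, k, rng, max_groups=10_000):
    """Collect batch_size groups with mixed rewards; also return how many groups were sampled."""
    batch, sampled = [], 0
    while len(batch) < batch_size and sampled < max_groups:
        p = rng.choice(prompt_accuracies)   # draw a prompt from the pool
        rewards = sample_group(p, k, rng)
        sampled += 1
        if min(rewards) != max(rewards):    # at least one correct and one incorrect rollout
            batch.append(rewards)
    return batch, sampled

rng = random.Random(0)
batch, sampled = fill_batch_with_dynamic_sampling([0.9], batch_size=8, k=4, rng=rng)
```

The ratio `sampled / len(batch)` is the rollout overhead paid for a batch in which every group carries signal.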

Fix 3 · Token-level loss

The most subtle of the four. GRPO aggregates the per-token loss per-rollout, then averages across K:

L_SEQ  =  (1/K) Σ_i (1/|y_i|) Σ_t L_{i,t}

This means each rollout contributes the same total gradient mass regardless of length. A 200-token winning rollout and a 2000-token winning rollout have the same total gradient — but the longer one represents 10× more reasoning the policy should reinforce. Long correct rollouts are underweighted; long bad rollouts are underpenalized; the policy drifts toward longer-and-less-rewarded responses.

Fix: sum over tokens across the whole group, divide by total token count:

L_TOK  =  ( Σ_i Σ_t L_{i,t} ) / ( Σ_i |y_i| )

Now every token contributes equally regardless of which rollout it lived in. Long rollouts dominate the gradient when long rollouts are what the policy actually emitted. (Dr.GRPO, lesson 14, will take this one step further.)
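The two aggregations can be compared on a toy group. This is a sketch with made-up constant per-token losses and hypothetical function names; it only illustrates how the weighting changes.

```python
def sequence_level_loss(token_losses):
    """Mean over tokens inside each rollout, then mean over the K rollouts."""
    per_rollout = [sum(toks) / len(toks) for toks in token_losses]
    return sum(per_rollout) / len(per_rollout)

def token_level_loss(token_losses):
    """Sum over every token in the group, divided by the total token count."""
    total = sum(sum(toks) for toks in token_losses)
    count = sum(len(toks) for toks in token_losses)
    return total / count

# One 200-token rollout and one 2000-token rollout (illustrative numbers only).
group = [[1.0] * 200, [3.0] * 2000]

seq = sequence_level_loss(group)   # (1.0 + 3.0) / 2 = 2.0
tok = token_level_loss(group)      # (200 * 1.0 + 2000 * 3.0) / 2200
```

Under the per-rollout mean, each token of the long rollout carries one tenth the weight of a token in the short one; under the token-level mean every token carries weight 1/2200.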

Fix 4 · Overlong soft penalty

If the model rolls past max_new tokens without emitting EOS, classical GRPO gives it reward 0. But the response might have been about to be correct — just slow. The model receives a penalty that has nothing to do with reasoning quality, just with an arbitrary length cap.

Fix: replace the hard zero with a soft, length-dependent penalty:

overlong_penalty(L) = max(0, min(1, (L − L_thresh) / (L_max − L_thresh)))
reward_i            = raw_reward_i  −  λ · overlong_penalty(L_i)

If the response finishes within L_thresh tokens, the penalty is 0 and the verifier reward passes through. Between L_thresh and L_max the penalty ramps linearly from 0 to 1, scaled by λ, and it stays at 1 beyond L_max. The model learns "try to finish on time, but don't collapse your answer just to emit EOS faster".
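A direct Python transcription of the two formulas above. The threshold values in the usage lines are made up for illustration, and the argument check is an addition not present in the formula.

```python
def overlong_penalty(length, l_thresh, l_max):
    """Linear ramp: 0 up to l_thresh, 1 at l_max and beyond."""
    if l_max <= l_thresh:
        raise ValueError("l_max must be greater than l_thresh")
    return max(0.0, min(1.0, (length - l_thresh) / (l_max - l_thresh)))

def shaped_reward(raw_reward, length, l_thresh, l_max, lam=1.0):
    """Verifier reward minus the scaled overlong penalty."""
    return raw_reward - lam * overlong_penalty(length, l_thresh, l_max)

# Illustrative budget: soft threshold at 3000 tokens, hard cap at 4000.
on_time = shaped_reward(1.0, 2000, 3000, 4000)    # no penalty
halfway = shaped_reward(1.0, 3500, 3000, 4000)    # half the penalty
at_cap = shaped_reward(1.0, 4000, 3000, 4000)     # full penalty
```

With λ = 1, a correct answer that lands halfway into the ramp still earns more than an incorrect on-time answer, which is the point of the soft penalty.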

Interactive · toggle the four fixes

[Interactive widget: DAPO ablation, each fix individually. A simulated DAPO run over 200 steps with all four fixes on by default and each one toggleable. The plot shows the reward EMA, with readouts for final reward EMA, useful steps, and an entropy estimate. The simulation models entropy collapse (Fix 1), wasted steps (Fix 2), and length-bias damage (Fix 3) as separate damping factors on the reward EMA. Fix 2's "wasted-step rate" is held at the steady-state ~66% for visual clarity; in real training it ramps up as accuracy rises.]

DAPO is GRPO plus four conditionals

# From RL/algorithms/04_dapo.py — dapo_step (annotated)

# Fix 2: keep resampling until the group has non-zero reward variance.
trajs, rewards = rollout_with_dynamic_sampling(...)

# Fix 4: soft penalty on overlong responses.
rewards = rewards - λ_overlong * overlong_penalty(lengths, max_new)

# (otherwise standard GRPO advantage)
adv = (rewards - rewards.mean()) / (rewards.std() + ε)

# Fix 1: asymmetric clip.
s1 = ratio * A_tok                                               # unclipped surrogate (A_tok: adv broadcast per token)
s2 = torch.clamp(ratio, 1 - clip_low, 1 + clip_high) * A_tok    # 1-0.20, 1+0.28
pg_tok = -torch.min(s1, s2) * target_mask

# Fix 3: token-level normalization (sum, then divide by total tokens).
pg_loss = pg_tok.sum() / total_tokens                            # NOT (1/K) Σ (1/|y_i|) Σ
kl_loss = β * (compute_kl_k3(...) * target_mask).sum() / total_tokens

Each fix is a few lines and can be toggled with a flag. Together they close most of the open-source gap to DeepSeek-R1's unpublished post-training tricks. The DAPO paper reports 50 points on AIME 2024 with a Qwen2.5-32B base model, against 30 for its naive GRPO baseline.

Takeaway
DAPO = GRPO + four patches: asymmetric clip (entropy), dynamic sampling (degenerate groups), token-level loss (length bias), overlong soft penalty (false negatives). Each is targeted at a specific observed failure mode and is ~5 lines of code. Together they are the open-source state-of-the-art for verifiable-reward reasoning as of 2025.