
PPO: the classical RLHF workhorse (Schulman 2017)

Three patches on REINFORCE: a learned baseline, off-policy correction, and a trust region. PPO powered InstructGPT, the original ChatGPT, and early Claude, and it is still the reference algorithm every reasoning-RL paper compares to.

The three problems PPO solves at once

REINFORCE's pain point | PPO's patch
No baseline → enormous gradient variance. | Learned value function V_φ(s). Subtract its prediction from the reward.
One gradient step per rollout → expensive sampling wasted. | Importance sampling ρ = π_θ / π_old. Re-use the rollouts across multiple SGD epochs.
No trust region → one rare-token update can destabilize the policy. | Clipped ratio clip(ρ, 1−ε, 1+ε) + a pessimistic min.

Patch 1 · the value head

Add a second head to the network — typically a single linear projection from the final hidden state to a scalar. Train it with mean-squared error against the observed Monte-Carlo return:

$$L_V(\phi) = \mathbb{E}_t\left[\left(V_\phi(s_t) - R_t\right)^2\right]$$

Subtracting V_φ(s_t) from the return gives the per-token advantage A_t = R_t − V_φ(s_t). Because the baseline depends only on the state (the prefix) and not on the sampled token, it cuts variance without biasing the gradient. For a task with a binary terminal reward, V_φ ends up learning the probability that this prefix leads to a correct response: it is a success-probability head.
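The advantage and value-loss definitions above can be sketched in a few lines. This is a minimal illustration in plain Python rather than PyTorch, assuming an undiscounted return (γ = 1) and made-up value-head predictions; the function names are hypothetical and are not taken from the lesson's codebase.

```python
# Sketch: per-token advantages for one response with a terminal reward.
# Assumptions: gamma = 1, and `values` are invented value-head outputs.

def returns_to_go(rewards):
    """R_t = sum of rewards from step t to the end (undiscounted)."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]

def advantages(rewards, values):
    """A_t = R_t - V(s_t), the baseline-subtracted return."""
    return [R - v for R, v in zip(returns_to_go(rewards), values)]

def value_loss(rewards, values):
    """Mean squared error between V(s_t) and the observed return R_t."""
    R = returns_to_go(rewards)
    return sum((v - r) ** 2 for v, r in zip(values, R)) / len(R)

rewards = [0.0, 0.0, 0.0, 1.0]   # verifier says "correct" at the last token
values  = [0.5, 0.6, 0.8, 0.9]   # hypothetical success-probability estimates
adv = advantages(rewards, values)
```

With a terminal reward of 1, every R_t equals 1, so each advantage is simply one minus the predicted success probability: early tokens, where the head was least sure, receive the most credit.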

GAE in one line
Generalized Advantage Estimation (Schulman 2016) interpolates, with a parameter λ, between one-step bootstrapped estimates from V_φ (λ = 0) and full Monte-Carlo returns (λ = 1). We use λ = 1 here, the cleanest case: the verifier reward is terminal, so the Monte-Carlo return is known exactly and the advantage carries no bootstrapping bias. Production RLHF often runs at λ ∈ [0.95, 0.99] to trade a little bias for lower variance.
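A small sketch can make the role of λ concrete. The function below is an illustrative plain-Python version of the standard GAE recursion, not the lesson's implementation; it assumes a finished episode whose value after the last step is zero, and the example numbers are invented.

```python
def gae(rewards, values, gamma=1.0, lam=1.0):
    """Generalized Advantage Estimation for one finished episode.

    values[t] is V(s_t); the value after the final step is taken to be 0.
    """
    adv = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_v = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_v - values[t]   # one-step TD error
        running = delta + gamma * lam * running           # discounted sum of TD errors
        adv[t] = running
    return adv

rewards = [0.0, 0.0, 0.0, 1.0]   # terminal verifier reward
values  = [0.5, 0.6, 0.8, 0.9]   # hypothetical value-head predictions

mc = gae(rewards, values, lam=1.0)   # collapses to R_t - V(s_t)
td = gae(rewards, values, lam=0.0)   # collapses to the one-step TD errors
```

At λ = 1 the TD errors telescope, so the result matches the Monte-Carlo advantage R_t − V(s_t) used in this lesson; at λ = 0 each token sees only the change in predicted value across one step.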

Memory cost. The value head itself is small, but training it is not free at scale: you either backprop through the shared transformer trunk a second time (if the head and policy use separate optimizers) or hold extra activation memory (if they share one). On a 7B+ model this is the dominant reason GRPO (next lesson) drops the critic.

Patch 2 · importance sampling

Naive REINFORCE uses each rollout once: sample, score, step, discard. Rollouts are expensive — a 7B model generating a 500-token response per prompt eats more compute than the gradient step that follows. PPO re-uses each rollout for N SGD epochs by correcting for the off-policy bias:

$$\rho_t(\theta) = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\mathrm{old}}(y_t \mid x, y_{<t})}$$

The estimator becomes 𝔼_{y∼π_old}[ρ · A], which is an unbiased estimate of 𝔼_{y∼π_θ}[A] by the importance-sampling identity. Now you can take 4–10 SGD steps per rollout batch, which cuts the rollout compute per update by the same factor.
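The importance-sampling identity is easy to check by exact enumeration on a toy problem. The sketch below assumes a single decision over three tokens, with invented logits and advantages; it only illustrates the identity and is not part of the lesson's code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

pi_old   = softmax([1.0, 0.0, -1.0])   # policy that generated the rollouts
pi_theta = softmax([1.5, -0.5, 0.2])   # policy after some gradient steps
A        = [1.0, -0.5, 2.0]            # made-up advantage for each token choice

# Target quantity: expectation of A under the current policy.
on_policy = sum(p * a for p, a in zip(pi_theta, A))

# Same quantity written as an expectation under pi_old, reweighted by rho.
rho = [pt / po for pt, po in zip(pi_theta, pi_old)]
off_policy = sum(po * r * a for po, r, a in zip(pi_old, rho, A))
```

The two sums agree because π_old · ρ = π_θ term by term. With sampled rollouts the agreement holds only in expectation, and the variance grows as ρ moves away from 1.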

Patch 3 · the clipped surrogate

The danger with importance sampling: if π_θ drifts far from π_old, ρ blows up on rare tokens and one bad gradient wrecks the model. PPO's fix is the clipped surrogate:

$$L^{\mathrm{CLIP}}_t = -\min\left(\rho_t A_t,\ \operatorname{clip}(\rho_t, 1-\epsilon, 1+\epsilon)\, A_t\right)$$

The min does the real work. It gives asymmetric protection: clipping kills the gradient only when the update would push ρ further outside the trust region in the direction the advantage already favors (raising ρ above 1+ε when A > 0, or lowering it below 1−ε when A < 0). In the corrective direction (snapping back from a rare-token excursion) the unclipped term wins and the gradient flows.

The PPO clip table

This is the most important diagnostic to internalize about PPO: for each cell, which branch of min() wins, and whether the clip is doing real work.

Which branch wins, for each (sign of A) × (position of ρ)
Each cell gives the resulting surrogate, the active branch, and the verbal interpretation. (Convention: clip is helpful only when it caps further drift past the trust region; otherwise the unclipped branch passes through.)

Position of ρ | A > 0 (good rollout) | A < 0 (bad rollout)
ρ < 1−ε (trainer thinks token is much rarer than rollout did) | ρ·A, unclipped branch; gradient flows (corrective) | (1−ε)·A, clipped branch; gradient is zero
1−ε ≤ ρ ≤ 1+ε (in-band; clip inactive) | ρ·A, branches equal; gradient flows | ρ·A, branches equal; gradient flows
ρ > 1+ε (trainer thinks token is much more likely than rollout) | (1+ε)·A, clipped branch; gradient is zero | ρ·A, unclipped branch; gradient flows (corrective)

Read the table as: the clip only kills the gradient in the one corner per sign where the update would push further past the trust region. In the opposite corner for the same sign (the corrective direction) the gradient flows freely.
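The branch logic can be sketched as a small function that reports, for one token, the surrogate value, which branch of min() is active, and whether any gradient reaches the policy. This is an illustrative plain-Python helper with a hypothetical name (`clip_branch`) and an assumed ε = 0.2, not code from the lesson.

```python
def clip_branch(rho, adv, eps=0.2):
    """Return (surrogate, active_branch, gradient_flows) for one token."""
    clipped_rho = min(max(rho, 1.0 - eps), 1.0 + eps)
    s1 = rho * adv            # unclipped term, depends on the policy via rho
    s2 = clipped_rho * adv    # clipped term, constant in rho once clipping bites
    surrogate = min(s1, s2)   # pessimistic choice between the two
    if s1 <= s2:
        # The min selects the unclipped term (or both agree in-band).
        return surrogate, "unclipped", adv != 0.0
    # The min selects the clipped term: no gradient with respect to rho.
    return surrogate, "clipped", False

# Walk the six combinations: rho below, inside, and above the band, for each sign of A.
for rho in (0.5, 1.0, 1.5):
    for adv in (1.0, -1.0):
        print(rho, adv, clip_branch(rho, adv))
```

Only two of the six combinations report the clipped branch: ρ above the band with A > 0, and ρ below the band with A < 0.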

The full PPO loss

# From RL/algorithms/01_ppo.py — ppo_step (one of N_EPOCHS)
log_ratio = new_logp - old_logp                          # (K, T-1)
ratio     = torch.exp(log_ratio)

# Patch 3: clipped surrogate, pessimistic min.
s1 = ratio * A_t
s2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * A_t
pg_per_tok = -torch.min(s1, s2) * target_mask
pg_loss    = pg_per_tok.sum() / target_mask.sum().clamp(min=1.0)

# Patch 1: value head MSE against Monte-Carlo returns.
v_err  = (new_values - R_t) ** 2 * target_mask
v_loss = v_err.sum() / target_mask.sum().clamp(min=1.0)

# Universal: KL anchor against the frozen reference.
kl_tok  = compute_kl_k3(new_logp, ref_logp.detach())
kl_loss = beta * (kl_tok * target_mask).sum() / target_mask.sum().clamp(min=1.0)

loss = pg_loss + c_v * v_loss + kl_loss

Three terms: policy clipped surrogate, value MSE, KL anchor. Run it for N_EPOCHS on the same rollout batch (patch 2). That's PPO. Compare it line-for-line with REINFORCE — patches 1, 2, 3 are visible as three independent additions.
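The KL term relies on a helper, `compute_kl_k3`, whose body is not shown in the excerpt. As an assumption, the sketch below uses the commonly cited "k3" estimator, (r − 1) − log r with r = π_ref / π_θ, written in plain Python; the function name `kl_k3` and the toy distributions are hypothetical, and the lesson's own helper may differ.

```python
import math

def kl_k3(new_logp, ref_logp):
    """Per-token k3 estimate of KL(pi_theta || pi_ref) from log-probabilities.

    Assumed form: (r - 1) - log r, with r = pi_ref / pi_theta.
    """
    out = []
    for lp_new, lp_ref in zip(new_logp, ref_logp):
        log_r = lp_ref - lp_new                  # log of pi_ref / pi_theta
        out.append(math.exp(log_r) - log_r - 1.0)
    return out

# Toy check by exact enumeration over a three-token vocabulary.
pi_theta = [0.7, 0.2, 0.1]
pi_ref   = [0.5, 0.3, 0.2]
k3 = kl_k3([math.log(p) for p in pi_theta], [math.log(p) for p in pi_ref])

expected_k3 = sum(p * k for p, k in zip(pi_theta, k3))                   # E under pi_theta
exact_kl    = sum(p * math.log(p / q) for p, q in zip(pi_theta, pi_ref))
```

Under these assumptions every per-token value is non-negative, and its expectation under π_θ equals the exact KL divergence, which is why this form is a popular choice for a per-token penalty.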

Why PPO is still the reference algorithm

Even though GRPO has displaced PPO on the reasoning-RL frontier, PPO remains the reference algorithm in this space: it is the baseline every reasoning-RL paper compares to.

What you give up: a second network's worth of memory and the credit-assignment burden of training the value head on sparse terminal rewards. For verifiable-reward tasks this trade looks worse and worse as the policy gets bigger — which is the entire economic case for GRPO, next lesson.

Takeaway
PPO = REINFORCE + (learned baseline) + (importance ratio for off-policy re-use) + (clipped pessimistic surrogate for trust region). Each patch is independent and additive; you can ablate any of them and watch what breaks. On verifiable tasks GRPO will drop the first patch; everything else carries over unchanged.