PPO: the classical RLHF workhorse (Schulman et al., 2017)
Three patches on REINFORCE: a learned baseline, off-policy correction, and a trust region. Powered InstructGPT, the original ChatGPT, early Claude — and is still the reference algorithm every reasoning-RL paper compares to.
The three problems PPO solves at once
| REINFORCE's pain point | PPO's patch |
|---|---|
| No baseline → enormous gradient variance. | Learned value function Vφ(s). Subtract its prediction from the return. |
| One gradient step per rollout → expensive sampling wasted. | Importance sampling ρ = πθ/π_old. Re-use the rollouts across multiple SGD epochs. |
| No trust region → one rare-token update can destabilize the policy. | Clipped ratio clip(ρ, 1−ε, 1+ε) + a pessimistic min. |
Patch 1 · the value head
Add a second head to the network, typically a single linear projection from the final hidden state to a scalar. Train it with mean-squared error against the observed Monte-Carlo return R_t:
$$L_V(\phi) = \mathbb{E}_t\big[(V_\phi(s_t) - R_t)^2\big]$$
Subtracting Vφ(s_t) from the return gives the per-token advantage A_t = R_t − Vφ(s_t). Because the baseline depends only on the state (the prefix) and not on the token sampled there, it cuts variance without biasing the gradient. For a task with a single 0/1 terminal reward, Vφ ends up learning the probability that this prefix ends in a correct response: it is a success-probability head.
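As a minimal sketch of the quantities above, the following pure-Python snippet computes Monte-Carlo returns, per-token advantages, and the value MSE for one rollout. The function names, the toy rewards, and the critic predictions are illustrative assumptions, and the `gamma` discount is not part of the lesson (it defaults to 1.0, meaning no discounting).

```python
def returns_to_go(rewards, gamma=1.0):
    """Monte-Carlo return R_t = sum over k >= t of gamma^(k-t) * r_k."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def advantages(returns, values):
    """Per-token advantage A_t = R_t - V(s_t)."""
    return [R - v for R, v in zip(returns, values)]


def value_mse(values, returns):
    """Mean-squared error of the value predictions against the returns."""
    return sum((v - R) ** 2 for v, R in zip(values, returns)) / len(returns)


# Hypothetical 4-token rollout with a single terminal reward of 1 (a correct answer).
rewards = [0.0, 0.0, 0.0, 1.0]
# Made-up critic outputs: estimated success probability at each prefix.
values = [0.2, 0.4, 0.7, 0.9]

R = returns_to_go(rewards)   # every prefix leads to the same terminal reward
A = advantages(R, values)    # largest where the critic was most pessimistic
v_loss = value_mse(values, R)
print(R, A, v_loss)
```

With a terminal-only reward and no discounting, every position shares the same return, so the advantage is simply how much the outcome beat the critic's prediction at that prefix.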
Memory cost. The value head itself is small, but the critic is not free at scale. If the critic is a separate network, a common choice for large models, it needs its own trunk with its own weights, gradients, and optimizer state, plus a second forward/backward pass per step. If it shares the policy trunk, the value loss backpropagates through the same transformer and competes with the policy gradient. On a 7B+ model this overhead is the main reason GRPO (next lesson) drops the critic.
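A rough back-of-the-envelope for that cost, under assumptions that are not from the lesson: the critic is a separate model the same size as the policy, training uses mixed-precision Adam without any sharding or offloading, and activation memory is ignored. The 16 bytes per parameter figure is the usual rule of thumb for that setup; real numbers depend heavily on the training stack.

```python
# Assumed per-parameter training state for mixed-precision Adam (bytes).
BYTES_PER_PARAM = {
    "bf16 weights": 2,
    "bf16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}


def training_state_gb(n_params, bytes_per_param=sum(BYTES_PER_PARAM.values())):
    """Weights + gradients + optimizer state in decimal GB, excluding activations."""
    return n_params * bytes_per_param / 1e9


policy_gb = training_state_gb(7e9)   # hypothetical 7B policy
critic_gb = training_state_gb(7e9)   # hypothetical same-size separate critic
print(f"policy: {policy_gb:.0f} GB, with separate critic: {policy_gb + critic_gb:.0f} GB")
```

Under these assumptions the critic doubles the training-state footprint, which is the overhead a critic-free method avoids.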
Patch 2 · importance sampling
Naive REINFORCE uses each rollout once: sample, score, step, discard. Rollouts are expensive: a 7B model generating a 500-token response per prompt eats more compute than the gradient step that follows. PPO re-uses each rollout for N SGD epochs by correcting for the off-policy bias with a per-token probability ratio:
$$\rho_t = \frac{\pi_\theta(y_t \mid s_t)}{\pi_{\text{old}}(y_t \mid s_t)}$$
The estimator becomes 𝔼_{y∼π_old}[ρ · A], which equals 𝔼_{y∼πθ}[A] by the importance-sampling identity, so it is unbiased as long as π_old puts nonzero probability wherever πθ does. Now you can take 4–10 SGD steps per rollout batch, which cuts rollout compute per update by roughly the same factor.
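The importance-sampling identity can be checked numerically on a toy problem. The sketch below assumes a hypothetical three-token vocabulary with made-up probabilities and advantages; it compares the expectation under the new policy with the reweighted expectation under the old one, first exactly and then from samples.

```python
import random

# Hypothetical single-step policies over a 3-token vocabulary.
pi_old = [0.5, 0.3, 0.2]   # policy that generated the rollouts
pi_new = [0.2, 0.3, 0.5]   # policy after some gradient steps
adv = [1.0, -1.0, 2.0]     # made-up advantage for each token

# What we want: the expected advantage under the new policy.
target = sum(p * a for p, a in zip(pi_new, adv))

# What we can compute from old rollouts: E over pi_old of rho * A, here in closed form.
exact_is = sum(q * (p / q) * a for q, p, a in zip(pi_old, pi_new, adv))

# The same quantity estimated from samples drawn only from pi_old.
rng = random.Random(0)
n = 200_000
samples = rng.choices(range(3), weights=pi_old, k=n)
mc_is = sum(pi_new[y] / pi_old[y] * adv[y] for y in samples) / n

print(target, exact_is, mc_is)
```

The sampled estimate converges to the target even though no sample was ever drawn from the new policy. Its variance grows with the size of the ratios, which is the problem the next patch addresses.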
Patch 3 · the clipped surrogate
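The danger with importance sampling: if πθ drifts far from π_old, ρ blows up on rare tokens and one bad gradient wrecks the model. PPO's fix is the clipped surrogate, an objective to be maximized:
$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\big[\min\big(\rho_t A_t,\ \operatorname{clip}(\rho_t, 1-\epsilon, 1+\epsilon)\, A_t\big)\big]$$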
The min is the magic. It gives asymmetric protection: clipping kills the gradient only when the update would push ρ further outside [1−ε, 1+ε] in the direction the advantage favors (ρ > 1+ε with A > 0, or ρ < 1−ε with A < 0). In the corrective direction (snapping back from a rare-token excursion) the unclipped term wins and the gradient flows.
The PPO clip table
This is the most important diagnostic to internalize about PPO: for each sign of the advantage and each position of ρ relative to [1−ε, 1+ε], work out which branch of min() wins and whether the clip is doing real work.
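The cells of that table can be enumerated with a few lines of plain Python. This is a sketch for a single token with scalar inputs; the function name, the ε = 0.2 default, and the sample ratios are illustrative choices rather than values taken from the lesson.

```python
def clipped_surrogate(ratio, adv, eps=0.2):
    """Return (objective, winning_branch, d_objective/d_ratio) for one token."""
    unclipped = ratio * adv
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps)) * adv
    if unclipped <= clipped:
        # The unclipped term wins the min, so the gradient w.r.t. the ratio is adv.
        return unclipped, "unclipped", adv
    # The clipped term wins; it is constant in the ratio, so the gradient is zero.
    return clipped, "clipped", 0.0


table = {}
for adv in (1.0, -1.0):
    for ratio in (0.5, 1.0, 1.5):
        table[(adv, ratio)] = clipped_surrogate(ratio, adv)
        value, branch, grad = table[(adv, ratio)]
        print(f"A={adv:+.0f}  rho={ratio:.1f}  branch={branch:9s}  grad={grad:+.1f}")
```

Only two of the six cells zero the gradient: a positive advantage with ρ already above 1+ε, and a negative advantage with ρ already below 1−ε. In every other cell, including the two where ρ is outside the range but the update would pull it back, the gradient passes through unchanged.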
The full PPO loss
```python
# From RL/algorithms/01_ppo.py, ppo_step (one of N_EPOCHS)
log_ratio = new_logp - old_logp                        # (K, T-1)
ratio = torch.exp(log_ratio)

# Patch 3: clipped surrogate, pessimistic min.
s1 = ratio * A_t
s2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * A_t
pg_per_tok = -torch.min(s1, s2) * target_mask
pg_loss = pg_per_tok.sum() / target_mask.sum().clamp(min=1.0)

# Patch 1: value head MSE against Monte-Carlo returns.
v_err = (new_values - R_t) ** 2 * target_mask
v_loss = v_err.sum() / target_mask.sum().clamp(min=1.0)

# Universal: KL anchor against the frozen reference.
kl_tok = compute_kl_k3(new_logp, ref_logp.detach())
kl_loss = beta * (kl_tok * target_mask).sum() / target_mask.sum().clamp(min=1.0)

loss = pg_loss + c_v * v_loss + kl_loss
```
Three terms: policy clipped surrogate, value MSE, KL anchor. Run it for N_EPOCHS on the same rollout batch (patch 2). That's PPO. Compare it line-for-line with REINFORCE — patches 1, 2, 3 are visible as three independent additions.
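The listing calls `compute_kl_k3` without showing its body. The sketch below is an assumption about what such a helper computes, based on the commonly used "k3" estimator of KL(π ‖ π_ref) for tokens sampled from π: exp(log r) − 1 − log r with r = π_ref/π. It is written for scalars in plain Python (the name `kl_k3` and the toy distributions are hypothetical), not taken from the lesson's repository.

```python
import math


def kl_k3(logp, ref_logp):
    """Per-token k3 estimate of KL(pi || pi_ref) for a token sampled from pi."""
    log_r = ref_logp - logp              # log(pi_ref / pi) at the sampled token
    return math.exp(log_r) - 1.0 - log_r


# Hypothetical distributions over a 3-token vocabulary.
pi = [0.5, 0.3, 0.2]
pi_ref = [0.2, 0.3, 0.5]

# Exact KL(pi || pi_ref), and the expectation of the k3 estimator under pi.
exact_kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref))
expected_k3 = sum(p * kl_k3(math.log(p), math.log(q)) for p, q in zip(pi, pi_ref))

print(exact_kl, expected_k3)
```

The estimator is non-negative for every token and matches the true KL in expectation, which makes it a convenient per-token penalty to mask and average like the other two terms.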
Why PPO is still the reference algorithm
Even though GRPO has displaced PPO on the reasoning-RL frontier, PPO remains the reference algorithm in this space because:
- It handles dense per-step rewards (process reward models, multi-turn agentic rewards) naturally via the value function. GRPO in its standard outcome-reward form can't: it broadcasts one scalar advantage to every token.
- It scales to arbitrary reward signals, not just verifiable 0/1. A reward model that outputs a real-valued preference score plugs in unchanged.
- The clip + trust region is well-studied and behaves predictably across a huge range of hyperparameters. GRPO's group-baseline behaves well too, but inherits PPO's clip rather than replacing it.
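To make the first bullet concrete, here is a toy contrast between per-token and broadcast advantages. Everything in it is an illustrative assumption: the step rewards, the zero-valued critic, the group of sibling rewards, and the group-normalized scalar (reward minus group mean, divided by group standard deviation), which follows the commonly described outcome-reward GRPO recipe rather than anything defined in this lesson.

```python
import statistics


def per_token_advantages(step_rewards, values):
    """PPO-style: undiscounted return-to-go from dense rewards minus the critic."""
    running, returns = 0.0, []
    for r in reversed(step_rewards):
        running += r
        returns.append(running)
    returns.reverse()
    return [R - v for R, v in zip(returns, values)]


def broadcast_advantages(total_reward, group_rewards, n_tokens, tiny=1e-8):
    """Group-baseline style: one normalized scalar copied to every token."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(total_reward - mean) / (std + tiny)] * n_tokens


# Hypothetical rollout: a good step at position 1, a bad step at position 3.
step_rewards = [0.0, 1.0, 0.0, -1.0]
dense = per_token_advantages(step_rewards, values=[0.0] * 4)
flat = broadcast_advantages(sum(step_rewards), [0.0, 1.0, 2.0, 1.0], n_tokens=4)

print(dense)   # varies by position
print(flat)    # identical at every position
```

The per-token version gives later tokens a different signal from earlier ones, while the broadcast version assigns every token the same value regardless of where the reward was earned.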
What you give up: a second network's worth of memory and the credit-assignment burden of training the value head on sparse terminal rewards. For verifiable-reward tasks this trade looks worse and worse as the policy gets bigger — which is the entire economic case for GRPO, next lesson.