PPO: the classical RLHF workhorse (Schulman 2017)

Three patches on REINFORCE: a learned baseline, off-policy correction, and a trust region. PPO powered InstructGPT, the original ChatGPT, and early Claude, and it is still the reference algorithm every reasoning-RL paper compares to.

The three problems PPO solves at once

REINFORCE's pain point	PPO's patch
No baseline → enormous gradient variance.	Learned value function V_φ(s). Subtract its prediction from the return.
One gradient step per rollout → expensive sampling wasted.	Importance sampling ρ = π_θ/π_old. Re-use the rollouts across multiple SGD epochs.
No trust region → one rare-token update can destabilize the policy.	Clipped ratio clip(ρ, 1−ε, 1+ε) + a pessimistic min.

Patch 1 · the value head

Add a second head to the network — typically a single linear projection from the final hidden state to a scalar. Train it with mean-squared error against the observed Monte-Carlo return:

L_V(φ) = 𝔼_t ( V_φ(s_t) − R_t )²

Subtracting V_φ(s_t) from the return gives the per-token advantage A_t = R_t − V_φ(s_t). Because the baseline depends only on the state (the prefix) and not on the token sampled next, it cuts variance without biasing the gradient. For a terminal 0/1 reward, V_φ ends up learning the probability that this prefix ends in a correct response: it is effectively a success-probability head.
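As a small worked example of the two formulas above, the sketch below computes Monte-Carlo returns, per-token advantages, and the value loss for one response. The reward and value numbers are made up for illustration, and `monte_carlo_returns` is a hypothetical helper, not a function from the lesson's code.

```python
# Toy numbers (assumed): one 4-token response, terminal verifier reward of 1.0.
rewards = [0.0, 0.0, 0.0, 1.0]        # reward arrives only at the last token
values = [0.30, 0.45, 0.70, 0.90]     # assumed value-head outputs V_phi(s_t)


def monte_carlo_returns(rewards, gamma=1.0):
    """R_t = sum of (discounted) rewards from step t to the end of the response."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


returns = monte_carlo_returns(rewards)

# Per-token advantage A_t = R_t - V_phi(s_t).
advantages = [R - v for R, v in zip(returns, values)]

# Value loss L_V = mean over t of (V_phi(s_t) - R_t)^2.
value_loss = sum((v - R) ** 2 for v, R in zip(values, returns)) / len(values)

print(returns, advantages, value_loss)
```

With an undiscounted terminal reward every R_t equals the final reward, so the advantage shrinks along the response exactly as fast as the value head grows more confident.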

GAE in one line

Generalized Advantage Estimation (Schulman 2016) interpolates, via a parameter λ, between the one-step bootstrapped TD error (λ=0: low variance, but biased by errors in V_φ) and the full Monte-Carlo advantage R_t − V_φ(s_t) (λ=1: no bootstrap bias, higher variance). We use λ=1 here, the cleanest case, because the verifier reward is terminal and the math is then exact. Production RLHF often runs at λ ∈ [0.95, 0.99] for the bias-variance trade-off.
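The recursion behind GAE is short enough to sketch directly. The version below is a minimal single-episode implementation under stated assumptions: the episode has ended, so the value after the last token is taken to be 0, and the toy rewards and values are invented for illustration. It shows that λ=1 (with γ=1) reproduces R_t − V_φ(s_t), while λ=0 gives the one-step TD errors.

```python
def gae(rewards, values, gamma=1.0, lam=1.0):
    """Generalized Advantage Estimation over one finished episode.

    values[t] is V(s_t); the value after the final step is assumed to be 0.
    """
    advantages = [0.0] * len(rewards)
    next_value, running = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t).
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


rewards = [0.0, 0.0, 0.0, 1.0]       # toy terminal reward (assumed)
values = [0.30, 0.45, 0.70, 0.90]    # toy value-head outputs (assumed)

adv_mc = gae(rewards, values, lam=1.0)   # Monte-Carlo case: R_t - V(s_t)
adv_td = gae(rewards, values, lam=0.0)   # pure one-step TD errors
print(adv_mc, adv_td)
```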

Memory cost. The value head itself is small, but training it is not free at scale: either you backprop through the shared transformer trunk a second time (if the head and the policy use separate optimizers), or you hold extra activation memory (if they share one). On a 7B+ model this is the main reason GRPO (next lesson) drops the critic.

Patch 2 · importance sampling

Naive REINFORCE uses each rollout once: sample, score, step, discard. Rollouts are expensive — a 7B model generating a 500-token response per prompt eats more compute than the gradient step that follows. PPO re-uses each rollout for N SGD epochs by correcting for the off-policy bias:

ρ_t(θ) = π_θ(y_t | x, y_<t) / π_old(y_t | x, y_<t)

The estimator becomes 𝔼_{y∼π_old}[ρ · A], which is an unbiased estimate of 𝔼_{y∼π_θ}[A] by the importance-sampling identity. That lets you take 4–10 SGD steps per rollout batch instead of one, so the rollout compute per update drops by the same factor.
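The importance-sampling identity can be checked exactly on a toy problem. The sketch below assumes a single decision with three possible tokens; the two distributions and the advantages are invented numbers. It also shows the usual way of forming ρ from log-probabilities.

```python
import math

# Toy single-token setting (all numbers assumed for illustration).
pi_old = [0.5, 0.3, 0.2]       # policy that generated the rollouts
pi_theta = [0.4, 0.4, 0.2]     # current policy after some gradient steps
advantage = [1.0, -0.5, 2.0]   # advantage of each possible token

# What we want: expectation of A under the current policy.
on_policy = sum(p * a for p, a in zip(pi_theta, advantage))

# What we can compute: expectation of rho * A under the old policy.
off_policy = sum(
    q * (p / q) * a for q, p, a in zip(pi_old, pi_theta, advantage)
)

# In practice rho comes from stored log-probs: rho = exp(new_logp - old_logp).
rho_from_logp = [
    math.exp(math.log(p) - math.log(q)) for p, q in zip(pi_theta, pi_old)
]

print(on_policy, off_policy, rho_from_logp)
```

The two expectations agree exactly here because the sum runs over every token. With sampled rollouts the agreement holds only on average, and the variance grows as π_θ moves away from π_old.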

Patch 3 · the clipped surrogate

The danger with importance sampling: if π_θ drifts far from π_old, ρ blows up on rare tokens and one bad gradient wrecks the model. PPO's fix is the clipped surrogate:

L^CLIP_t = − min( ρ_t · A_t, clip(ρ_t, 1−ε, 1+ε) · A_t )

The min is the magic. It gives asymmetric protection: clipping kills the gradient only when the update would push ρ further outside the band [1−ε, 1+ε] in the direction the advantage favors, that is, ρ > 1+ε with A > 0 or ρ < 1−ε with A < 0. In the corrective direction (snapping back from a rare-token excursion) the unclipped term wins and the gradient flows.

The PPO clip table

This is the most important diagnostic to internalize about PPO. For each cell, work out which branch of min() wins and whether the clip is doing real work.

Which branch wins, for each (sign of A) × (position of ρ)

Each cell gives the active branch of min(ρ·A, clip(ρ, 1−ε, 1+ε)·A) and whether any gradient reaches the policy. (Convention: clip is helpful only when it caps further drift past the trust region; otherwise the unclipped branch passes through.)

	A > 0 (good rollout)	A < 0 (bad rollout)
ρ < 1−ε trainer thinks token is much rarer than rollout did	unclipped wins (ρ·A < (1−ε)·A); gradient flows and pushes ρ back up	clipped wins ((1−ε)·A < ρ·A); gradient is zero
1−ε ≤ ρ ≤ 1+ε in-band; clip inactive	both branches equal ρ·A; gradient flows	both branches equal ρ·A; gradient flows
ρ > 1+ε trainer thinks token is much more likely than rollout	clipped wins ((1+ε)·A < ρ·A); gradient is zero	unclipped wins (ρ·A < (1+ε)·A); gradient flows and pushes ρ back down

Read the table as: for each sign of A, the clip kills the gradient in exactly one corner, the one where the update would push ρ further past the trust region (ρ > 1+ε for A > 0, ρ < 1−ε for A < 0). In the opposite corner (the corrective direction) the gradient flows freely.
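The six cells of the table can be checked numerically. The sketch below is a scalar, plain-Python illustration rather than the lesson's training code: `clipped_surrogate` is a hypothetical helper, ε = 0.2 is an assumed setting, and the ρ values 0.5, 1.0, and 1.5 are picked only to land below, inside, and above the band.

```python
def clipped_surrogate(rho, adv, eps=0.2):
    """Evaluate min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for scalars.

    Returns (objective value, active branch, whether gradient w.r.t. rho is nonzero).
    """
    unclipped = rho * adv
    clipped = min(max(rho, 1.0 - eps), 1.0 + eps) * adv
    if 1.0 - eps <= rho <= 1.0 + eps:
        # Inside the band the clip is a no-op, so both branches coincide.
        return unclipped, "both", True
    if unclipped <= clipped:
        # The min picks the unclipped term, which still depends on rho.
        return unclipped, "unclipped", True
    # The min picks the clipped term, which is constant in rho.
    return clipped, "clipped", False


table = {}
for rho in (0.5, 1.0, 1.5):        # below band, in band, above band
    for adv in (1.0, -1.0):        # good rollout, bad rollout
        table[(rho, adv)] = clipped_surrogate(rho, adv)
        print(f"rho={rho:<4} A={adv:+.0f} -> {table[(rho, adv)]}")
```

Only two of the six cells report a zero gradient: (ρ above the band, A > 0) and (ρ below the band, A < 0).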

The full PPO loss

# From RL/algorithms/01_ppo.py — ppo_step (one of N_EPOCHS)
log_ratio = new_logp - old_logp                          # (K, T-1)
ratio     = torch.exp(log_ratio)

# Patch 3: clipped surrogate, pessimistic min.
s1 = ratio * A_t
s2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * A_t
pg_per_tok = -torch.min(s1, s2) * target_mask
pg_loss    = pg_per_tok.sum() / target_mask.sum().clamp(min=1.0)

# Patch 1: value head MSE against Monte-Carlo returns.
v_err  = (new_values - R_t) ** 2 * target_mask
v_loss = v_err.sum() / target_mask.sum().clamp(min=1.0)

# Universal: KL anchor against the frozen reference.
kl_tok  = compute_kl_k3(new_logp, ref_logp.detach())
kl_loss = beta * (kl_tok * target_mask).sum() / target_mask.sum().clamp(min=1.0)

loss = pg_loss + c_v * v_loss + kl_loss

Three terms: policy clipped surrogate, value MSE, KL anchor. Run it for N_EPOCHS on the same rollout batch (patch 2). That's PPO. Compare it line-for-line with REINFORCE — patches 1, 2, 3 are visible as three independent additions.
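The excerpt calls `compute_kl_k3` without showing its body. A plausible reading, offered here as an assumption rather than the file's actual implementation, is the "k3" estimator of KL(π_θ ‖ π_ref): with r = π_ref/π_θ on a sampled token, k3 = r − 1 − log r. The scalar sketch below (function name `kl_k3_sketch` and all numbers are hypothetical) checks on a toy distribution that its expectation under π_θ equals the exact KL.

```python
import math


def kl_k3_sketch(new_logp, ref_logp):
    """Per-token k3 estimate of KL(pi_theta || pi_ref) for one sampled token.

    Assumed form: r - 1 - log(r) with r = pi_ref / pi_theta.
    It is non-negative and zero when the two log-probs agree.
    """
    log_r = ref_logp - new_logp
    return math.exp(log_r) - 1.0 - log_r


# Toy distributions over three tokens (numbers assumed for illustration).
pi_theta = [0.5, 0.3, 0.2]
pi_ref = [0.4, 0.4, 0.2]

# Exact KL(pi_theta || pi_ref).
exact_kl = sum(p * math.log(p / q) for p, q in zip(pi_theta, pi_ref))

# Expectation of the k3 estimate over tokens drawn from pi_theta.
expected_k3 = sum(
    p * kl_k3_sketch(math.log(p), math.log(q)) for p, q in zip(pi_theta, pi_ref)
)

print(exact_kl, expected_k3)
```

The match relies on the tokens being drawn from π_θ. During later PPO epochs the tokens actually come from π_old, so the per-token value is only an approximation of the KL to the reference.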

Why PPO is still the reference algorithm

Even though GRPO has displaced PPO on the reasoning-RL frontier, PPO remains the reference algorithm in this space because:

It handles dense per-step rewards (process reward models, multi-turn agentic rewards) naturally via the value function. GRPO can't — it broadcasts a scalar advantage to every token.
It scales to arbitrary reward signals, not just verifiable 0/1. A reward model that outputs a real-valued preference score plugs in unchanged.
The clip + trust region is well-studied and behaves predictably across a huge range of hyperparameters. GRPO's group-baseline behaves well too, but inherits PPO's clip rather than replacing it.

What you give up: a second network's worth of memory and the credit-assignment burden of training the value head on sparse terminal rewards. For verifiable-reward tasks this trade looks worse and worse as the policy gets bigger — which is the entire economic case for GRPO, next lesson.

Takeaway

PPO = REINFORCE + (learned baseline) + (importance ratio for off-policy re-use) + (clipped pessimistic surrogate for trust region). Each patch is independent and additive; you can ablate any of them and watch what breaks. On verifiable tasks GRPO will drop the first patch; everything else carries over unchanged.