The LLM era (Part 1) - PPO, DPO, GRPO: the map

Lesson 10 ended at PPO's clipped surrogate. Here is the surprise that organizes the next three lessons: training a large language model with RL is PPO, unchanged. The "actions" are tokens, the "policy" is the LLM, and a full generated response is a trajectory. Once you see the mapping, PPO, DPO, and GRPO are just three answers to three different complaints about the same loop.

The hook: nothing new, just renamed

For ten lessons we built one machine. An MDP (lesson 01): states s, actions a, rewards r, a policy π_θ(a|s). A policy gradient (lessons 03, 07): ∇J = 𝔼[A · ∇log π_θ]. An advantage A = Q − V to cut variance (lesson 08). An importance ratio ρ = π_θ/π_old to reuse old rollouts (lesson 09). And a trust region to keep each step safe — TRPO's hard KL constraint, then PPO's cheap clip (lesson 10).

The claim of this lesson is that the entire "RLHF / reasoning-RL" revolution in LLMs reuses exactly this machinery. There is no new optimization theory. What changed is the shape of the MDP: the action set is the vocabulary, the trajectory is a sentence, and the reward arrives once, at the end.

The mapping: MDP-RL → token-RL

Read this table as a dictionary. Every row is something you already know from the left column, renamed for the right.

Classical RL (lessons 01–10)	LLM / token-RL (this lesson on)
State s_t	The prompt plus all tokens generated so far: s_t = (x, y_<t).
Action a_t	The next token y_t, drawn from the vocabulary (≈10⁵ discrete actions).
Policy π_θ(a|s)	The LLM's next-token distribution π_θ(y_t | x, y_<t). The weights θ are the whole network.
Transition P(s'|s,a)	Deterministic and known: appending a token gives the next state. The "environment dynamics" are just string concatenation.
Trajectory τ = (s₀,a₀,…)	One full generated response y = (y₁,…,y_T) — a rollout.
Reward r(s,a)	Usually terminal only: a single scalar r(x,y) scoring the whole response (verifier says correct/incorrect, or a reward model scores helpfulness). Zero at every intermediate token.
Return G = Σ γ^k r_k	With terminal reward and γ=1, the return of every prefix is just the final score r(x,y).

The two structural facts that make token-RL special

(1) The dynamics are known and deterministic. Unlike a robot or a game, there is no stochastic environment to model — appending token y_t to the prefix is the entire transition. All the randomness lives in the policy. (2) The reward is terminal and sparse. One scalar for a 500-token response. That single fact is what makes the baseline (lesson 03) and the advantage estimator (lesson 08) the central design choice — and it is exactly where PPO, GRPO, and DPO diverge.
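
The two structural facts above can be sketched in a few lines of JavaScript. This is a toy illustration rather than a real LLM: the three-token `VOCAB`, the uniform `toyPolicy`, and the `verifier` reward are hypothetical stand-ins invented for the example.

```javascript
// Toy token-MDP. VOCAB, toyPolicy and verifier are hypothetical stand-ins
// for illustration; none of them comes from a real library.
const VOCAB = ["4", "5", "<eos>"];

// Transition: deterministic concatenation. A state is { prompt, tokens }.
function step(state, token) {
  return { prompt: state.prompt, tokens: [...state.tokens, token] };
}

// Policy: a next-token distribution over VOCAB. Uniform here; an LLM would
// condition on state.prompt and state.tokens.
function toyPolicy(state) {
  return VOCAB.map(() => 1 / VOCAB.length);
}

// Sample an index from a probability vector. rand is injectable for testing.
function sampleIndex(probs, rand = Math.random) {
  let u = rand();
  for (let i = 0; i < probs.length; i++) {
    u -= probs[i];
    if (u < 0) return i;
  }
  return probs.length - 1; // guard against floating-point round-off
}

// Terminal reward: one scalar for the whole response.
function verifier(prompt, tokens) {
  return tokens[0] === "4" ? 1 : 0;
}

// Roll out one response. All randomness comes from the policy.
function rollout(prompt, maxLen = 8, rand = Math.random) {
  let state = { prompt, tokens: [] };
  while (state.tokens.length < maxLen) {
    const token = VOCAB[sampleIndex(toyPolicy(state), rand)];
    state = step(state, token);
    if (token === "<eos>") break;
  }
  const r = verifier(prompt, state.tokens);
  const last = state.tokens.length - 1;
  // Per-token rewards: zero everywhere except the final token.
  const rewards = state.tokens.map((_, t) => (t === last ? r : 0));
  // With γ = 1 the return from step t is the sum of the remaining rewards,
  // which equals the terminal score r for every t.
  const returns = rewards.map((_, t) => rewards.slice(t).reduce((a, b) => a + b, 0));
  return { tokens: state.tokens, reward: r, rewards, returns };
}
```

Calling `rollout("2+2=")` returns the sampled tokens together with a `returns` array whose entries all equal the terminal reward, which is the last row of the table in code form.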

So a response is a trajectory of length T through a known-dynamics MDP, and the policy-gradient objective from lesson 01 is, verbatim:

J(θ) = 𝔼_{x ∼ D} 𝔼_{y ∼ π_θ(·|x)} [ r(x, y) ]

and its gradient factorizes over tokens, because log π_θ(y|x) = Σ_t log π_θ(y_t | x, y_<t):

∇_θ J = 𝔼 [ Σ_t A_t · ∇_θ log π_θ(y_t | x, y_<t) ]

That is lesson 07's policy-gradient theorem, with a per-token advantage A_t. Everything below is about (a) how to keep the step safe and (b) where A_t comes from.
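
To make the per-token sum concrete, here is a minimal sketch that assumes a softmax policy whose logits are handed in one array per step. Differentiating with respect to the logits (instead of the network weights θ) is a simplification chosen for the example; in a real LLM these logit gradients would be backpropagated into θ. The function names are illustrative.

```javascript
// Numerically stable softmax over one step's logits.
function softmax(logits) {
  const m = Math.max(...logits);
  const exps = logits.map((z) => Math.exp(z - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// log π(y | x) = Σ_t log π(y_t | x, y_<t): the sequence log-prob is a sum.
// stepLogits[t] are the logits at step t; tokens[t] is the sampled token id.
function sequenceLogProb(stepLogits, tokens) {
  return tokens.reduce(
    (acc, y, t) => acc + Math.log(softmax(stepLogits[t])[y]),
    0
  );
}

// Single-sample policy-gradient estimate with respect to each step's logits:
//   A_t · ∇_logits log π(y_t | ·) = A_t · (onehot(y_t) − p_t)
function policyGradientWrtLogits(stepLogits, tokens, advantages) {
  return tokens.map((y, t) => {
    const p = softmax(stepLogits[t]);
    return p.map((pk, k) => advantages[t] * ((k === y ? 1 : 0) - pk));
  });
}
```

Each row of the returned gradient sums to zero, because raising one token's probability must lower the others; the advantage only sets the sign and size of that push.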

PPO, re-derived as the cheap TRPO — for tokens

Recall the chain from the last two lessons. Importance sampling (lesson 09) lets us reuse a batch of rollouts sampled from π_old for several gradient steps, by reweighting with the per-token ratio

ρ_t(θ) = π_θ(y_t | x, y_<t) / π_old(y_t | x, y_<t)

giving the surrogate objective L(θ) = 𝔼[ ρ_t · A_t ]. But lesson 09 also showed the danger: when π_θ drifts from π_old, ρ explodes on rare tokens and the estimate's variance blows up. TRPO (lesson 10) fixed this with a hard constraint, 𝔼[KL(π_old ‖ π_θ)] ≤ δ, solved with a natural gradient and Fisher-vector products — correct, but expensive.
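
A small sketch of the ratio and of the rare-token problem. It assumes the per-token log-probabilities under both policies are already available and works in log space, which is a common convention rather than something this lesson prescribes. The probabilities are made up for illustration.

```javascript
// Per-token ratio ρ_t = π_θ(y_t | ·) / π_old(y_t | ·), computed in log space.
function importanceRatios(logpNew, logpOld) {
  return logpNew.map((lp, t) => Math.exp(lp - logpOld[t]));
}

// Hypothetical numbers: token 0 is common under π_old, token 1 is rare.
const logpOld = [Math.log(0.5), Math.log(0.001)];
// After some drift, π_θ assigns 0.4 and 0.01 to the same two tokens.
const logpNew = [Math.log(0.4), Math.log(0.01)];

const rho = importanceRatios(logpNew, logpOld);
// The common token's ratio stays near 1; the rare token's ratio is roughly
// ten times larger, even though its probability moved by less than 0.01.
console.log(rho);
```

The absolute change in probability is far smaller for the rare token, yet its ratio dominates the surrogate. That is the variance problem the trust region is meant to contain.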

PPO replaces the hard trust region with a first-order trick: clip the ratio so the surrogate stops rewarding drift past [1−ε, 1+ε], and take the pessimistic min:

L^CLIP_t(θ) = min( ρ_t · A_t, clip(ρ_t, 1−ε, 1+ε) · A_t )

The min is asymmetric protection. When the ratio has already moved past the trust region in the reward-increasing direction (ρ > 1+ε with A > 0, or ρ < 1−ε with A < 0), the clipped branch is constant in ρ, the min selects it, and its gradient w.r.t. θ is zero: the step is capped. In the corrective direction (snapping a rare-token excursion back toward π_old), the unclipped branch wins and the gradient flows. The clip is TRPO's trust region, implemented as one min and one clamp instead of a constrained optimization. The widget below is exactly this function; play with it until the asymmetry is obvious.

The KL-to-reference anchor

The clip keeps π_θ near π_old — the policy from a few gradient steps ago. But in the LLM era there is a second, separate anchor that classical RL did not need: a penalty pulling π_θ toward a frozen reference π_ref, almost always the SFT model we started from.

J(θ) = 𝔼[ r(x,y) ] − β · KL( π_θ(·|x) ‖ π_ref(·|x) )

Why both? They have different jobs. The clip (vs π_old) is a numerical guardrail on the importance ratio: it keeps the off-policy estimate valid step to step. The KL-to-π_ref is a semantic guardrail across the whole run: it stops the policy from wandering far from a fluent, general model just to chase reward. That failure mode is called reward hacking, where the policy finds degenerate high-reward text the reward function never meant to endorse. The coefficient β sets the trade between reward and closeness to π_ref, and, as the widget shows for ε, the wrong setting breaks learning at either extreme: too small and nothing restrains reward hacking, too large and the policy cannot leave π_ref. We unpack this anchor's two jobs fully in lesson 13 (RLHF).
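
The anchored objective above can be sketched numerically. For simplicity this example computes the exact KL for a single next-token distribution over a two-token vocabulary; the objective in the text is over whole responses, so treat this as an illustration of the penalty's shape, not of how a training loop estimates it. The distributions and reward are made-up values.

```javascript
// Exact KL(p ‖ q) between two distributions over the same vocabulary.
// Terms with p_k = 0 contribute nothing; q_k is assumed > 0 wherever p_k > 0.
function klDivergence(p, q) {
  return p.reduce(
    (acc, pk, k) => (pk > 0 ? acc + pk * Math.log(pk / q[k]) : acc),
    0
  );
}

// Anchored objective for one sample: r − β · KL(π_θ ‖ π_ref).
function anchoredObjective(reward, klToRef, beta) {
  return reward - beta * klToRef;
}

// Hypothetical next-token distributions.
const piRef = [0.5, 0.5];   // frozen reference (for example the SFT model)
const piTheta = [0.9, 0.1]; // current policy, drifted toward token 0
const kl = klDivergence(piTheta, piRef);

for (const beta of [0, 0.1, 10]) {
  // Same reward and same drift; only the price of the drift changes.
  console.log(`beta=${beta} objective=${anchoredObjective(1.0, kl, beta).toFixed(3)}`);
}
```

With β = 0 the anchor disappears and only the reward counts; with a large β the same drift costs more than the reward it earned.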

Interactive · the PPO clip visualizer

This plots the two branches of L^CLIP as a function of the ratio ρ, for a fixed advantage. The dashed blue line is the unclipped surrogate ρ·A; the solid orange line is the clipped objective min(ρ·A, clip(ρ,1−ε,1+ε)·A) that PPO actually optimizes. Flip the sign of A, and drag ε. Where the orange line goes flat, the gradient is zero — the update is capped.

PPO clipped objective vs. the unclipped surrogate

Blue dashed = ρ·A (no trust region). Orange = the PPO objective min(ρ·A, clip(ρ,1−ε,1+ε)·A). The shaded band is the trust region [1−ε, 1+ε]. Watch where orange goes flat: that is where the clip kills the gradient. Try ε=0 (flat everywhere → no learning) and ε large (clip never engages → no trust region → unstable).

Widget readout at the default ε = 0.20 (the A > 0 case): trust region [0.80, 1.20]; clip active for ρ > 1.20; gradient when clipped: 0 (capped).

The JS that computes the curve (≈14 lines):

// The two branches of the PPO objective, for advantage A and clip ε.
function unclipped(rho, A){ return rho * A; }                 // ρ·A
function clipped(rho, A, eps){
  const c = Math.max(1 - eps, Math.min(1 + eps, rho));        // clamp ρ
  return c * A;                                               // clip(ρ)·A
}
// PPO optimizes the pessimistic min of the two.
function ppoObjective(rho, A, eps){
  return Math.min(unclipped(rho, A), clipped(rho, A, eps));
}
// The gradient w.r.t. θ is zero wherever ppoObjective is flat in ρ:
//   A>0 and ρ>1+ε   → clipped branch wins, flat → capped
//   A<0 and ρ<1−ε   → clipped branch wins, flat → capped
// elsewhere the unclipped branch passes the gradient through.
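
As a usage sketch, the four regimes in the comments above can be checked with a finite difference. The block restates the objective compactly so that it runs on its own; `slopeInRho` and the step size `h` are illustrative choices, not part of the widget.

```javascript
// Compact restatement of the PPO objective from the listing above.
function ppoObjective(rho, A, eps) {
  const clippedRho = Math.max(1 - eps, Math.min(1 + eps, rho));
  return Math.min(rho * A, clippedRho * A);
}

// Central finite difference: slope of the objective with respect to ρ.
function slopeInRho(rho, A, eps, h = 1e-6) {
  return (ppoObjective(rho + h, A, eps) - ppoObjective(rho - h, A, eps)) / (2 * h);
}

const eps = 0.2;
const cases = [
  { rho: 1.5, A: +1 }, // A>0, already above 1+ε: capped
  { rho: 0.5, A: +1 }, // A>0, below 1−ε: gradient flows
  { rho: 0.5, A: -1 }, // A<0, already below 1−ε: capped
  { rho: 1.5, A: -1 }, // A<0, above 1+ε: gradient flows
];
for (const { rho, A } of cases) {
  console.log(`rho=${rho} A=${A} slope=${slopeInRho(rho, A, eps).toFixed(3)}`);
}
```

The slope is zero in the two capped cases and equals A in the other two, which is the asymmetry the min provides.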

The family: one loop, three complaints

PPO works, and it is still the reference algorithm. But it carries three burdens, and the two most influential LLM-RL methods each delete one of them. Here is the whole map on one page; lessons 12–13 derive the pieces.

Method	The ONE complaint it answers	What it deletes	What it keeps
PPO	(the baseline) — variance, off-policy reuse, and trust region, all at once.	Nothing. The full machine.	Clip + ρ + KL-anchor + a learned critic V_φ for the advantage.
GRPO	"The value critic is a whole second network — too much memory, and it's unstable on sparse terminal reward."	The critic V_φ. Replaces it with a group baseline: sample K responses per prompt, use their mean reward as the baseline, A_i = (r_i − mean)/std.	Everything else: the same clip, the same ρ, the same KL anchor.
DPO	"Why run an online RL loop — rollouts, a reward model, sampling — at all, when I have a fixed set of human preference pairs?"	The entire online loop: no rollouts, no reward model, no clip, no sampling.	Only the KL-to-reference idea — and even that becomes implicit in a closed-form supervised loss.

The shape of the family is: PPO is the full actor–critic with a trust region (the reunion from lesson 03, scaled). GRPO removes the critic by computing the baseline from a group of samples instead of learning it — pure policy-gradient with a clever Monte-Carlo baseline. DPO removes the actor–critic loop entirely by exploiting a closed-form solution: the optimal KL-regularized policy is a known function of the reward, so the reward can be folded into the policy and fit with a single classification-style loss on preference pairs. Three points on the value↔policy fork that has organized this whole course.
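
The group baseline from the GRPO row can be sketched directly. Two details here are assumptions made for the example, not something this lesson specifies: the standard deviation is the population one (divide by K), and a small constant `tiny` is added to the denominator so that a group with identical rewards gives zeros instead of NaN.

```javascript
// Group-relative advantages: A_i = (r_i − mean) / std over the K responses
// sampled for one prompt.
// ASSUMPTIONS: population std (divide by K) and a `tiny` stabilizer.
function groupAdvantages(rewards, tiny = 1e-8) {
  const K = rewards.length;
  const mean = rewards.reduce((a, b) => a + b, 0) / K;
  const variance = rewards.reduce((a, r) => a + (r - mean) ** 2, 0) / K;
  const std = Math.sqrt(variance);
  return rewards.map((r) => (r - mean) / (std + tiny));
}

// Four sampled responses to one prompt, scored correct (1) or incorrect (0).
console.log(groupAdvantages([1, 0, 0, 1]));
// A group where every response got the same score carries no learning signal.
console.log(groupAdvantages([1, 1, 1, 1]));
```

No value network appears anywhere in this computation: the K samples themselves supply the baseline, at the price of K forward passes per prompt.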

The catch behind each deletion (previewed; derived in lesson 12)

GRPO: using r_i inside its own group mean makes the baseline slightly biased at finite K (an O(1/K) effect), and dividing by std quietly re-weights easy and hard prompts. DPO: because it never samples, it cannot explore: it can only re-rank responses already present in the preference data, and it tends to drive the chosen-vs-rejected gap overconfidently high. These are trade-offs, not bugs; lesson 12 makes them precise.
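
Both catches can be previewed numerically. This is a hedged sketch ahead of lesson 12's derivation. The first part compares a baseline that includes r_i with a leave-one-out baseline, which is one way to see the O(1/K) effect. The second part uses the standard per-pair DPO loss, −log σ(β · margin), to show why the gap keeps growing. All function names and input values are illustrative.

```javascript
function meanOf(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// (1) GRPO: centering with a mean that includes r_i itself...
function centeredWithSelf(rewards) {
  const m = meanOf(rewards);
  return rewards.map((r) => r - m);
}

// ...versus a leave-one-out mean that excludes r_i (needs K >= 2).
function centeredLeaveOneOut(rewards) {
  return rewards.map((r, i) => r - meanOf(rewards.filter((_, j) => j !== i)));
}

// (2) DPO: per-pair loss −log σ(β · margin), where
//   margin = [log π_θ(y_w) − log π_ref(y_w)] − [log π_θ(y_l) − log π_ref(y_l)]
// and y_w / y_l are the chosen / rejected responses.
function dpoLoss(logpChosen, logpRejected, refLogpChosen, refLogpRejected, beta) {
  const margin = (logpChosen - refLogpChosen) - (logpRejected - refLogpRejected);
  const z = beta * margin;
  // −log σ(z) = log(1 + e^(−z)), written in a form that avoids overflow.
  return Math.max(-z, 0) + Math.log1p(Math.exp(-Math.abs(z)));
}

const rewards = [1, 0, 0, 1];
const K = rewards.length;
const withSelf = centeredWithSelf(rewards);
const loo = centeredLeaveOneOut(rewards);
// Including r_i shrinks every centered reward by the factor (K − 1) / K.
console.log(withSelf.map((a, i) => a / loo[i]), (K - 1) / K);

// The DPO loss keeps falling as the margin grows and never reaches zero,
// so the gradient always asks for a larger gap.
for (const gap of [0, 2, 20]) {
  console.log(`margin=${gap} loss=${dpoLoss(gap, 0, 0, 0, 1)}`);
}
```

The shrink factor (K − 1)/K tends to 1 as K grows, consistent with an O(1/K) effect. The DPO loss is strictly positive at every finite margin, which is the mechanism behind the overconfident gap.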

WHERE THIS GOES NEXT (SYSTEMS)

This course is the theory: where these algorithms come from, as patches on the policy gradient and the trust region. The sibling course, RL Post-Training, From First Principles, is how they run on a GPU cluster — the rollout engine, weight sync, KL estimators, and the memory accounting that actually decides PPO-vs-GRPO in practice. Same symbols (ρ = π_θ/π_old, A_t, β·KL(π_θ‖π_ref), clip ε), engineering depth instead of derivation:

PPO, systems view — the same clipped surrogate as "three patches on REINFORCE" (learned baseline + importance ratio + clip), with the full loss in code and the value-head memory cost spelled out.
GRPO, systems view — dropping the critic, the group-relative advantage, and the K forward-passes-vs-second-network trade in GPU memory.
DPO, systems view — the closed-form derivation in full, plus the IPO/KTO/ORPO/SimPO family and when DPO beats PPO.

Read those after lessons 12–13 here, where we derive the math they take as given.

Where this leaves us

We have the map. PPO is TRPO's clip, applied to a token-MDP with terminal reward and a KL anchor to the SFT model. DPO and GRPO are each one deletion from PPO, motivated by one concrete cost. Lesson 12 (Part 2) earns both deletions: it derives DPO's closed form from the KL-regularized optimum and the Bradley–Terry likelihood (watching Z(x) cancel), and derives GRPO's group baseline from the policy-gradient theorem's freedom to subtract any function of the state. Lesson 13 then wires PPO into the full RLHF recipe (SFT → reward model → PPO with the KL anchor).

Takeaway

Training an LLM with RL is the exact machine from lessons 01–10 with a renamed MDP: state = prompt + tokens-so-far, action = next token, policy = the LLM, trajectory = the full response, reward = one terminal score. PPO is TRPO's trust region implemented as a clipped min, plus a KL anchor to the frozen SFT model. From there, GRPO deletes the value critic (group baseline instead) and DPO deletes the online loop entirely (closed-form preference loss) — two ends of the value↔policy fork, each fixing one specific cost. Next: we derive both.