The LLM era (part 1): a map of PPO, DPO, and GRPO
Lesson 10 ended at PPO's clipped surrogate. Here is the surprise that organizes the next three lessons: training a large language model with RL is PPO, unchanged. The "actions" are tokens, the "policy" is the LLM, and a full generated response is a trajectory. Once you see the mapping, PPO, DPO, and GRPO are just three answers to three different complaints about the same loop.
The hook: nothing new, just renamed
For ten lessons we built one machine. An MDP (lesson 01): states s, actions a, rewards r, a policy πθ(a|s). A policy gradient (lessons 03, 07): ∇J = 𝔼[A · ∇log πθ]. An advantage A = Q − V to cut variance (lesson 08). An importance ratio ρ = πθ/πold to reuse old rollouts (lesson 09). And a trust region to keep each step safe — TRPO's hard KL constraint, then PPO's cheap clip (lesson 10).
The claim of this lesson is that the entire "RLHF / reasoning-RL" revolution in LLMs reuses exactly this machinery. There is no new optimization theory. What changed is the shape of the MDP: the action set is the vocabulary, the trajectory is a sentence, and the reward arrives once, at the end.
The mapping: MDP-RL → token-RL
Read this table as a dictionary. Every row is something you already know from the left column, renamed for the right.
| Classical RL (lessons 01–10) | LLM / token-RL (this lesson on) |
|---|---|
| State st | The prompt plus all tokens generated so far: st = (x, y<t). |
| Action at | The next token yt, drawn from the vocabulary (≈10^5 discrete actions). |
| Policy πθ(a|s) | The LLM's next-token distribution πθ(yt | x, y<t). The weights θ are the whole network. |
| Transition P(s'|s,a) | Deterministic and known: appending a token gives the next state. The "environment dynamics" are just string concatenation. |
| Trajectory τ = (s0,a0,…) | One full generated response y = (y1,…,yT) — a rollout. |
| Reward r(s,a) | Usually terminal only: a single scalar r(x,y) scoring the whole response (verifier says correct/incorrect, or a reward model scores helpfulness). Zero at every intermediate token. |
| Return G = Σk γ^k rk | With terminal reward and γ=1, the return from every prefix is just the final score r(x,y). |
So a response is a trajectory of length T through a known-dynamics MDP, and the policy-gradient objective from lesson 01 is, verbatim:
J(θ) = 𝔼[ r(x, y) ], with the expectation taken over prompts x and responses y ∼ πθ(· | x)
and its gradient factorizes over tokens, because log πθ(y|x) = Σt log πθ(yt | x, y<t):
∇θ J(θ) = 𝔼[ Σt At · ∇θ log πθ(yt | x, y<t) ]
That is lesson 07's policy-gradient theorem, with a per-token advantage At. Everything below is about (a) how to keep the step safe and (b) where At comes from.
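The two facts used above can be checked on a toy response. This is a minimal sketch: the function names and the numbers are illustrative assumptions, not part of the lesson. It shows that with a terminal-only reward and γ = 1 the return from every token position equals the final score, and that the log-probability of a response is the sum of its per-token log-probabilities.

```python
import math

def returns_to_go(rewards, gamma=1.0):
    """Discounted return from each step t: G_t = sum_k gamma^k * r_{t+k}."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]

def sequence_logprob(token_probs):
    """log pi(y|x) as the sum of per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)

# Toy 4-token response: reward is zero at every token except the last.
rewards = [0.0, 0.0, 0.0, 1.0]
print(returns_to_go(rewards))             # gamma = 1: every prefix sees the final score
print(returns_to_go(rewards, gamma=0.5))  # gamma < 1: earlier tokens see a discounted score

# Hypothetical per-token probabilities pi(y_t | x, y_<t) for one response.
token_probs = [0.5, 0.25, 0.8, 0.1]
print(sequence_logprob(token_probs), math.log(math.prod(token_probs)))
```

With γ = 1 all four positions share the same return, which is why the question of where a per-token advantage At comes from is not trivial in this setting.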
PPO, re-derived as the cheap TRPO — for tokens
Recall the chain from the last two lessons. Importance sampling (lesson 09) lets us reuse a batch of rollouts sampled from πold for several gradient steps, by reweighting with the per-token ratio
ρt = πθ(yt | x, y<t) / πold(yt | x, y<t)
giving the surrogate objective L(θ) = 𝔼[ ρt · At ]. But lesson 09 also showed the danger: when πθ drifts from πold, ρ explodes on rare tokens and the estimate's variance blows up. TRPO (lesson 10) fixed this with a hard constraint, 𝔼[KL(πold ‖ πθ)] ≤ δ, solved with a natural gradient and Fisher-vector products — correct, but expensive.
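To make the rare-token danger concrete, here is a small sketch of the per-token ratio computed from log-probabilities. The probabilities are made-up illustrative values and `importance_ratio` is a hypothetical helper name.

```python
import math

def importance_ratio(logp_new, logp_old):
    """rho_t = pi_theta(y_t | .) / pi_old(y_t | .), computed from log-probs."""
    return math.exp(logp_new - logp_old)

# A common token: both policies assign it a similar probability.
common = importance_ratio(math.log(0.42), math.log(0.40))

# A rare token: pi_old gave it 0.1%, the drifted pi_theta gives it 5%.
rare = importance_ratio(math.log(0.05), math.log(0.001))

print(f"common-token ratio: {common:.2f}")
print(f"rare-token ratio:   {rare:.1f}")
```

A shift of a few percentage points in absolute probability barely moves the ratio on a common token, but multiplies it many times over on a rare one, and that single term can dominate the batch average of ρt · At.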
PPO replaces the hard trust region with a first-order trick: clip the ratio so the surrogate stops rewarding drift past [1−ε, 1+ε], and take the pessimistic min:
LCLIP(θ) = 𝔼[ min( ρt · At, clip(ρt, 1−ε, 1+ε) · At ) ]
The min is asymmetric protection. When the update would push the policy further past the trust region in the reward-increasing direction (ρ > 1+ε with A > 0, or ρ < 1−ε with A < 0), the clipped branch is constant in ρ, the min selects it, and its gradient w.r.t. θ is zero: the step is capped. In the corrective direction (snapping a rare-token excursion back toward πold), the unclipped branch wins and the gradient flows. The clip is TRPO's trust region, implemented as one min and one clamp instead of a constrained optimization. The widget below is exactly this function; play with it until the asymmetry is obvious.
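The asymmetry can also be checked numerically. The sketch below evaluates the per-token clipped objective and a finite-difference slope with respect to ρ; the function names and the default ε = 0.2 are assumptions made for illustration.

```python
def ppo_clip_term(rho, adv, eps=0.2):
    """Per-token PPO objective: min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    clipped_rho = max(1.0 - eps, min(rho, 1.0 + eps))
    return min(rho * adv, clipped_rho * adv)

def slope_in_rho(rho, adv, eps=0.2, h=1e-6):
    """Central finite-difference slope of the objective with respect to rho."""
    upper = ppo_clip_term(rho + h, adv, eps)
    lower = ppo_clip_term(rho - h, adv, eps)
    return (upper - lower) / (2 * h)

for adv in (+1.0, -1.0):
    for rho in (0.5, 1.0, 1.5):
        value = ppo_clip_term(rho, adv)
        slope = slope_in_rho(rho, adv)
        print(f"A={adv:+.0f}  rho={rho:.1f}  value={value:+.2f}  slope={slope:+.2f}")
```

Since the gradient with respect to θ is this slope times ∂ρ/∂θ, a zero slope means the token contributes no update. In the printed table that happens only at ρ = 1.5 with A > 0 and at ρ = 0.5 with A < 0, the two cases where the policy has already moved past the band in the direction the advantage rewards.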
The KL-to-reference anchor
The clip keeps πθ near πold — the policy from a few gradient steps ago. But in the LLM era there is a second, separate anchor that classical RL did not need: a penalty pulling πθ toward a frozen reference πref, almost always the SFT model we started from.
Why both? They have different jobs. The clip (vs πold) is a numerical guardrail on the importance ratio: it keeps the off-policy estimate valid step to step. The KL-to-πref is a semantic guardrail across the whole run: it stops the policy from wandering far from a fluent, general model just to chase reward. That failure mode is called reward hacking, where the policy finds degenerate high-reward text the reward function never meant to endorse. The coefficient β on the KL penalty sets how strongly the policy is held to πref, trading reward against closeness to the reference. As the widget shows for ε, the wrong setting breaks learning at either extreme: too small and the policy is free to reward-hack, too large and it cannot move away from πref. We unpack this anchor's two jobs fully in lesson 13 (RLHF).
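One way to picture the β trade-off is to subtract the penalty from the terminal reward. The sketch below is an assumption-laden illustration, not the lesson's definition: it estimates KL(πθ‖πref) from a single sampled response as the sum of per-token log-ratios, and it folds the penalty into the reward, whereas implementations differ on both the estimator and where the penalty enters. All numbers are made up.

```python
def kl_penalized_reward(reward, logp_policy, logp_ref, beta):
    """Terminal reward minus beta times a single-sample KL(pi_theta || pi_ref) estimate.

    logp_policy and logp_ref hold per-token log-probs of the SAME sampled
    response under the current policy and under the frozen reference.
    """
    kl_estimate = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return reward - beta * kl_estimate, kl_estimate

# Hypothetical response the policy likes but the reference finds unlikely.
logp_policy = [-0.1, -0.2, -0.1]
logp_ref = [-1.1, -2.2, -0.6]

for beta in (0.0, 0.1, 1.0):
    shaped, kl = kl_penalized_reward(1.0, logp_policy, logp_ref, beta)
    print(f"beta={beta}: kl_estimate={kl:.2f}  shaped_reward={shaped:+.2f}")
```

At β = 0 the drifted response keeps its full reward; at β = 1 the same response scores negative, so the policy is pushed back toward text the reference also considers likely.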
Interactive · the PPO clip visualizer
This plots the two branches of LCLIP as a function of the ratio ρ, for a fixed advantage. The dashed blue line is the unclipped surrogate ρ·A; the solid orange line is the clipped objective min(ρ·A, clip(ρ,1−ε,1+ε)·A) that PPO actually optimizes. Flip the sign of A, and drag ε. Where the orange line goes flat, the gradient is zero — the update is capped.
The family: one loop, three complaints
PPO works, and it is still the reference algorithm. But it carries three burdens, and the two most influential LLM-RL methods each delete one of them. Here is the whole map on one page; lessons 12–13 derive the pieces.
| Method | The ONE complaint it answers | What it deletes | What it keeps |
|---|---|---|---|
| PPO | (the baseline) — variance, off-policy reuse, and trust region, all at once. | Nothing. The full machine. | Clip + ρ + KL-anchor + a learned critic Vφ for the advantage. |
| GRPO | "The value critic is a whole second network — too much memory, and it's unstable on sparse terminal reward." | The critic Vφ. Replaces it with a group baseline: sample K responses per prompt, use their mean reward as the baseline, Ai = (ri − mean)/std. | Everything else: the same clip, the same ρ, the same KL anchor. |
| DPO | "Why run an online RL loop — rollouts, a reward model, sampling — at all, when I have a fixed set of human preference pairs?" | The entire online loop: no rollouts, no reward model, no clip, no sampling. | Only the KL-to-reference idea — and even that becomes implicit in a closed-form supervised loss. |
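The group baseline in the GRPO row, Ai = (ri − mean)/std, fits in a few lines. This is a minimal sketch: the small `eps` guard against a zero standard deviation and the use of the population standard deviation are assumptions of the example, and the rewards are illustrative.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantage A_i = (r_i - mean) / std over one prompt's K samples."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std: an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# K = 4 sampled responses to one prompt, scored 1 (correct) or 0 (incorrect).
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in mixed])

# If every sample gets the same reward, all advantages are zero.
uniform = group_advantages([1.0, 1.0, 1.0, 1.0])
print(uniform)
```

No second network is involved: the baseline comes entirely from the K sampled rewards. The second call shows the price, since a group whose samples all score the same carries no learning signal.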
The shape of the family is: PPO is the full actor–critic with a trust region (the reunion from lesson 03, scaled). GRPO removes the critic by computing the baseline from a group of samples instead of learning it — pure policy-gradient with a clever Monte-Carlo baseline. DPO removes the actor–critic loop entirely by exploiting a closed-form solution: the optimal KL-regularized policy is a known function of the reward, so the reward can be folded into the policy and fit with a single classification-style loss on preference pairs. Three points on the value↔policy fork that has organized this whole course.
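For orientation, here is what a "classification-style loss on preference pairs" looks like for a single pair. The sketch uses the commonly published form of the DPO loss, −log σ(β · margin); the derivation is left to lesson 12. The sequence log-probabilities and the default β = 0.1 are hypothetical values chosen for illustration.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * margin) for one pair (w = preferred, l = rejected).

    Each argument is the summed log-prob of a whole response under the
    policy (logp_*) or under the frozen reference (ref_logp_*).
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    z = beta * margin
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: margin is 0, loss is log 2.
at_reference = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has raised the preferred response and lowered the rejected one.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)

# Policy has moved the wrong way.
worse = dpo_loss(-12.0, -10.0, -10.0, -12.0)

print(f"{at_reference:.4f}  {improved:.4f}  {worse:.4f}")
```

The only inputs are log-probabilities of fixed responses under the policy and the reference, so no rollouts or reward model appear. The reference enters through the log-ratios, which is the sense in which the KL anchor becomes implicit.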
This course is the theory: where these algorithms come from, as patches on the policy gradient and the trust region. The sibling course, RL Post-Training, From First Principles, is how they run on a GPU cluster — the rollout engine, weight sync, KL estimators, and the memory accounting that actually decides PPO-vs-GRPO in practice. Same symbols (ρ = πθ/πold, At, β·KL(πθ‖πref), clip ε), engineering depth instead of derivation:
- PPO, systems view — the same clipped surrogate as "three patches on REINFORCE" (learned baseline + importance ratio + clip), with the full loss in code and the value-head memory cost spelled out.
- GRPO, systems view — dropping the critic, the group-relative advantage, and the K forward-passes-vs-second-network trade in GPU memory.
- DPO, systems view — the closed-form derivation in full, plus the IPO/KTO/ORPO/SimPO family and when DPO beats PPO.
Read those after lessons 12–13 here, where we derive the math they take as given.
Where this leaves us
We have the map. PPO is TRPO's trust region implemented as a clip, applied to a token-MDP with terminal reward and a KL anchor to the SFT model. DPO and GRPO are each one deletion from PPO, motivated by one concrete cost. Lesson 12 (part 2) earns both deletions: it derives DPO's closed form from the KL-regularized optimum and the Bradley–Terry likelihood (watching Z(x) cancel), and derives GRPO's group baseline from the policy-gradient theorem's freedom to subtract any function of the state. Lesson 13 then wires PPO into the full RLHF recipe (SFT → reward model → PPO with the KL anchor).
In one line: PPO is the clipped surrogate with a pessimistic min, plus a KL anchor to the frozen SFT model. From there, GRPO deletes the value critic (group baseline instead) and DPO deletes the online loop entirely (closed-form preference loss): two ends of the value↔policy fork, each fixing one specific cost. Next: we derive both.