The LLM era (part 1). PPO, DPO, GRPO: the map

Lesson 10 ended at PPO's clipped surrogate. Here is the surprise that organizes the next three lessons: training a large language model with RL is PPO, unchanged. The "actions" are tokens, the "policy" is the LLM, and a full generated response is a trajectory. Once you see the mapping, PPO, DPO, and GRPO are just three answers to three different complaints about the same loop.

The hook: nothing new, just renamed

For ten lessons we built one machine. An MDP (lesson 01): states s, actions a, rewards r, a policy πθ(a|s). A policy gradient (lessons 03, 07): ∇J = 𝔼[A · ∇log πθ]. An advantage A = Q − V to cut variance (lesson 08). An importance ratio ρ = πθ / πold to reuse old rollouts (lesson 09). And a trust region to keep each step safe: TRPO's hard KL constraint, then PPO's cheap clip (lesson 10).

The claim of this lesson is that the entire "RLHF / reasoning-RL" revolution in LLMs reuses exactly this machinery. There is no new optimization theory. What changed is the shape of the MDP: the action set is the vocabulary, the trajectory is a sentence, and the reward arrives once, at the end.

The mapping: MDP-RL → token-RL

Read this table as a dictionary. Every row is something you already know from the left column, renamed for the right.

Classical RL (lessons 01–10) | LLM / token-RL (this lesson on)
--- | ---
State s_t | The prompt plus all tokens generated so far: s_t = (x, y_{<t}).
Action a_t | The next token y_t, drawn from the vocabulary (≈10^5 discrete actions).
Policy πθ(a|s) | The LLM's next-token distribution πθ(y_t | x, y_{<t}). The weights θ are the whole network.
Transition P(s'|s,a) | Deterministic and known: appending a token gives the next state. The "environment dynamics" are just string concatenation.
Trajectory τ = (s_0, a_0, …) | One full generated response y = (y_1, …, y_T), i.e. a rollout.
Reward r(s,a) | Usually terminal only: a single scalar r(x,y) scoring the whole response (verifier says correct/incorrect, or a reward model scores helpfulness). Zero at every intermediate token.
Return G = Σ_k γ^k r_k | With terminal reward and γ=1, the return of every prefix is just the final score r(x,y).

The two structural facts that make token-RL special
(1) The dynamics are known and deterministic. Unlike a robot or a game, there is no stochastic environment to model: appending token y_t to the prefix is the entire transition. All the randomness lives in the policy.
(2) The reward is terminal and sparse. One scalar for a 500-token response. That single fact is what makes the baseline (lesson 03) and the advantage estimator (lesson 08) the central design choice, and it is exactly where PPO, GRPO, and DPO diverge.
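The dictionary above can be sketched as a toy token-MDP in JS. This is an illustration under stated assumptions, not how any real training stack is written: `policy`, `rng`, the `<eos>` marker, and the shape of the `state` object are hypothetical names introduced here, and a real LLM would replace `policy` with a forward pass over the full vocabulary.

```javascript
// Toy token-MDP (all names here are illustrative assumptions).
const EOS = "<eos>";

// Transition: deterministic string concatenation, no environment randomness.
function step(state, token) {
  return { prompt: state.prompt, tokens: [...state.tokens, token] };
}

// Sample one response (a trajectory). `policy(state)` returns {token: prob};
// `rng()` returns a number in [0, 1). All randomness lives in the policy.
function rollout(prompt, policy, rng, maxLen = 16) {
  let state = { prompt, tokens: [] };
  while (state.tokens.length < maxLen) {
    const probs = policy(state);
    let u = rng();
    let chosen = null;
    for (const [tok, p] of Object.entries(probs)) {
      chosen = tok;
      u -= p;
      if (u < 0) break;
    }
    state = step(state, chosen);
    if (chosen === EOS) break;
  }
  return state.tokens;
}

// Terminal reward: one scalar on the last token, zero everywhere else.
function perTokenRewards(tokens, terminalReward) {
  return tokens.map((_, t) => (t === tokens.length - 1 ? terminalReward : 0));
}

// Discounted return from each step; with gamma = 1 every entry is the final score.
function returnsToGo(rewards, gamma = 1) {
  const G = new Array(rewards.length).fill(0);
  let acc = 0;
  for (let t = rewards.length - 1; t >= 0; t--) {
    acc = rewards[t] + gamma * acc;
    G[t] = acc;
  }
  return G;
}
```

Passing `rng` in as an argument keeps the sketch deterministic under test; the only stochastic object is the policy's sampling step, which mirrors structural fact (1).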

So a response is a trajectory of length T through a known-dynamics MDP, and the policy-gradient objective from lesson 01 is, verbatim:

\[ J(\theta) = \mathbb{E}_{x \sim D}\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \big[\, r(x, y) \,\big] \]

and its gradient factorizes over tokens, because log πθ(y|x) = Σ_t log πθ(y_t | x, y_{<t}):

\[ \nabla_\theta J = \mathbb{E}\Big[\, \sum_t A_t \cdot \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}) \,\Big] \]

That is lesson 07's policy-gradient theorem, with a per-token advantage At. Everything below is about (a) how to keep the step safe and (b) where At comes from.
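For readers who want the intermediate step, here is a sketch of how the token-level form follows from the objective via the log-derivative identity. It assumes a fixed prompt x (the outer expectation over x is omitted) and a reward that does not depend on θ. The baseline b(x, y_{<t}) is a symbol introduced here for illustration: any function of the state can be subtracted without changing the expectation, and choosing it as a value estimate turns r − b into a per-token advantage A_t.

```latex
\begin{aligned}
\nabla_\theta J(\theta)
  &= \nabla_\theta \sum_{y} \pi_\theta(y \mid x)\, r(x, y) \\
  &= \sum_{y} \pi_\theta(y \mid x)\, \nabla_\theta \log \pi_\theta(y \mid x)\, r(x, y) \\
  &= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\Big[\, r(x, y) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}) \Big] \\
  &= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\Big[\, \sum_{t=1}^{T} \big( r(x, y) - b(x, y_{<t}) \big)\, \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}) \Big]
\end{aligned}
```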

PPO, re-derived as the cheap TRPO — for tokens

Recall the chain from the last two lessons. Importance sampling (lesson 09) lets us reuse a batch of rollouts sampled from πold for several gradient steps, by reweighting with the per-token ratio

\[ \rho_t(\theta) = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\mathrm{old}}(y_t \mid x, y_{<t})} \]

giving the surrogate objective L(θ) = 𝔼[ ρt · At ]. But lesson 09 also showed the danger: when πθ drifts from πold, ρ explodes on rare tokens and the estimate's variance blows up. TRPO (lesson 10) fixed this with a hard constraint, 𝔼[KL(πold ‖ πθ)] ≤ δ, solved with a natural gradient and Fisher-vector products — correct, but expensive.
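A small numeric sketch of the ratio and the unclipped surrogate, to make the rare-token blow-up concrete. The function names and the example probabilities are assumptions chosen for illustration; the only convention borrowed from practice is computing the ratio from log-probabilities rather than raw probabilities.

```javascript
// Per-token importance ratio from log-probabilities:
//   rho_t = exp( log pi_theta(y_t | ...) - log pi_old(y_t | ...) )
function tokenRatios(logpNew, logpOld) {
  return logpNew.map((lp, t) => Math.exp(lp - logpOld[t]));
}

// Unclipped surrogate L = mean over tokens of rho_t * A_t.
function surrogate(ratios, advantages) {
  let sum = 0;
  for (let t = 0; t < ratios.length; t++) sum += ratios[t] * advantages[t];
  return sum / ratios.length;
}

// Token 0 is common and barely moves (0.5 -> 0.55).
// Token 1 is rare under pi_old (0.001) and drifts to 0.05.
const logpOld = [Math.log(0.5), Math.log(0.001)];
const logpNew = [Math.log(0.55), Math.log(0.05)];
const ratios = tokenRatios(logpNew, logpOld); // approximately [1.1, 50]
```

An absolute change of under 0.05 in probability produces a ratio of about 50 on the rare token, so that one token dominates the average in `surrogate`. This is the variance problem that the trust region is meant to contain.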

PPO replaces the hard trust region with a first-order trick: clip the ratio so the surrogate stops rewarding drift past [1−ε, 1+ε], and take the pessimistic min:

\[ L^{\mathrm{CLIP}}_t(\theta) = \min\big(\, \rho_t \cdot A_t,\;\; \mathrm{clip}(\rho_t,\, 1-\epsilon,\, 1+\epsilon) \cdot A_t \,\big) \]

The min is asymmetric protection. When the update would push the policy further past the trust region in the reward-increasing direction, the clipped branch is flatter, the min selects it, and its gradient w.r.t. θ is zero — the step is capped. In the corrective direction (snapping a rare-token excursion back toward πold), the unclipped branch wins and the gradient flows. The clip is TRPO's trust region, implemented as one min and one clamp instead of a constrained optimization. The widget below is exactly this function; play with it until the asymmetry is obvious.
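The asymmetry can also be read off by splitting the objective on the sign of the advantage. The following is a worked case analysis of the clipped objective as a function of ρ_t, valid away from the kinks at ρ_t = 1 ± ε where the function is not differentiable:

```latex
L^{\mathrm{CLIP}}_t =
\begin{cases}
A_t \cdot \min(\rho_t,\ 1+\epsilon), & A_t > 0 \\
A_t \cdot \max(\rho_t,\ 1-\epsilon), & A_t < 0
\end{cases}
\qquad
\frac{\partial L^{\mathrm{CLIP}}_t}{\partial \rho_t} =
\begin{cases}
0, & A_t > 0 \ \text{and}\ \rho_t > 1+\epsilon \\
0, & A_t < 0 \ \text{and}\ \rho_t < 1-\epsilon \\
A_t, & \text{otherwise}
\end{cases}
```

Only one edge of the interval is active for each sign of A_t: the upper edge caps gains when A_t > 0, the lower edge caps them when A_t < 0, and the opposite side is left unclipped so corrective gradients still flow.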

The KL-to-reference anchor

The clip keeps πθ near πold — the policy from a few gradient steps ago. But in the LLM era there is a second, separate anchor that classical RL did not need: a penalty pulling πθ toward a frozen reference πref, almost always the SFT model we started from.

J(θ)  =  𝔼[ r(x,y) ]  −  β · KL( πθ(·|x) ‖ πref(·|x) )

Why both? They have different jobs. The clip (vs πold) is a numerical guardrail on the importance ratio: it keeps the off-policy estimate valid step to step. The KL-to-πref is a semantic guardrail across the whole run: it stops the policy from wandering far from a fluent, general model just to chase reward. That failure mode is called reward hacking, where the policy finds degenerate high-reward text the reward function never meant to endorse. The coefficient β sets the trade-off between reward and staying close to πref, and like ε in the widget, the wrong setting breaks learning at either extreme: too small and nothing restrains reward hacking, too large and the policy barely moves from πref. We unpack this anchor's two jobs fully in lesson 13 (RLHF).
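A minimal sketch of the KL-penalized objective for one sample. As a simplification, a single next-token distribution over a tiny vocabulary stands in for the full response distribution in the formula; `kl` and `penalizedReward` are names introduced here, and the numbers in the usage note are made up for illustration.

```javascript
// KL(p || q) for two categorical distributions over the same vocabulary.
// Terms with p[i] === 0 contribute zero by convention.
function kl(p, q) {
  let sum = 0;
  for (let i = 0; i < p.length; i++) {
    if (p[i] > 0) sum += p[i] * Math.log(p[i] / q[i]);
  }
  return sum;
}

// Reward minus beta times the KL from the policy to the frozen reference.
function penalizedReward(reward, piTheta, piRef, beta) {
  return reward - beta * kl(piTheta, piRef);
}
```

For example, with `piTheta = [0.5, 0.5]` and `piRef = [0.9, 0.1]`, the KL is about 0.51 nats, so a reward of 1 with `beta = 0.1` is reduced to about 0.95. If the policy equals the reference, the penalty is exactly zero.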

Interactive · the PPO clip visualizer

This plots the two branches of L^CLIP as a function of the ratio ρ, for a fixed advantage. The dashed blue line is the unclipped surrogate ρ·A; the solid orange line is the clipped objective min(ρ·A, clip(ρ,1−ε,1+ε)·A) that PPO actually optimizes. Flip the sign of A, and drag ε. Where the orange line goes flat, the gradient is zero and the update is capped.

PPO clipped objective vs. the unclipped surrogate
Blue dashed = ρ·A (no trust region). Orange = the PPO objective min(ρ·A, clip(ρ,1−ε,1+ε)·A). The shaded band is the trust region [1−ε, 1+ε]. Watch where orange goes flat: that is where the clip kills the gradient. Try ε=0 (orange is flat for any move in the reward-increasing direction → no learning) and ε large (clip never engages → no trust region → unstable).
Example readout (ε = 0.20, A > 0): trust region [0.80, 1.20]; clip active for ρ > 1.20; gradient when clipped: 0 (capped).

The JS that computes the curve (≈14 lines):
// The two branches of the PPO objective, for advantage A and clip ε.
function unclipped(rho, A){ return rho * A; }                 // ρ·A
function clipped(rho, A, eps){
  const c = Math.max(1 - eps, Math.min(1 + eps, rho));        // clamp ρ
  return c * A;                                               // clip(ρ)·A
}
// PPO optimizes the pessimistic min of the two.
function ppoObjective(rho, A, eps){
  return Math.min(unclipped(rho, A), clipped(rho, A, eps));
}
// The gradient w.r.t. θ is zero wherever ppoObjective is flat in ρ:
//   A>0 and ρ>1+ε   → clipped branch wins, flat → capped
//   A<0 and ρ<1−ε   → clipped branch wins, flat → capped
// elsewhere the unclipped branch passes the gradient through.
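Extending the per-token function above to a batch, here is a sketch that averages the clipped objective over tokens and also reports what fraction of tokens had their gradient zeroed by the clip. The function name `ppoBatch`, the parallel-array layout, and the default ε = 0.2 are assumptions made for this example.

```javascript
// Mean clipped objective over a batch of tokens, plus the fraction of tokens
// sitting on a flat (zero-gradient) segment. Inputs are parallel arrays.
function ppoBatch(ratios, advantages, eps = 0.2) {
  let total = 0;
  let clippedCount = 0;
  for (let i = 0; i < ratios.length; i++) {
    const rho = ratios[i];
    const A = advantages[i];
    const clamped = Math.max(1 - eps, Math.min(1 + eps, rho)); // clip(rho)
    total += Math.min(rho * A, clamped * A);                   // pessimistic min
    // Same two capped cases as in the comments above.
    if ((A > 0 && rho > 1 + eps) || (A < 0 && rho < 1 - eps)) clippedCount++;
  }
  const n = ratios.length;
  return { objective: total / n, clipFraction: clippedCount / n };
}
```

With `ratios = [1.0, 1.5, 0.5, 0.5]` and `advantages = [1, 1, 1, -1]`, the second and fourth tokens are capped, so `clipFraction` is 0.5. The third token (ρ = 0.5, A > 0) is outside the band but in the corrective direction, so it still passes gradient.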

The family: one loop, three complaints

PPO works, and it is still the reference algorithm. But it carries three burdens, and the two most influential LLM-RL methods each delete one of them. Here is the whole map on one page; lessons 12–13 derive the pieces.

Method | The ONE complaint it answers | What it deletes | What it keeps
--- | --- | --- | ---
PPO | (the baseline): variance, off-policy reuse, and trust region, all at once. | Nothing. The full machine. | Clip + ρ + KL-anchor + a learned critic Vφ for the advantage.
GRPO | "The value critic is a whole second network: too much memory, and it's unstable on sparse terminal reward." | The critic Vφ. Replaces it with a group baseline: sample K responses per prompt, use their mean reward as the baseline, A_i = (r_i − mean)/std. | Everything else: the same clip, the same ρ, the same KL anchor.
DPO | "Why run an online RL loop (rollouts, a reward model, sampling) at all, when I have a fixed set of human preference pairs?" | The entire online loop: no rollouts, no reward model, no clip, no sampling. | Only the KL-to-reference idea, and even that becomes implicit in a closed-form supervised loss.

The shape of the family is: PPO is the full actor–critic with a trust region (the reunion from lesson 03, scaled). GRPO removes the critic by computing the baseline from a group of samples instead of learning it — pure policy-gradient with a clever Monte-Carlo baseline. DPO removes the actor–critic loop entirely by exploiting a closed-form solution: the optimal KL-regularized policy is a known function of the reward, so the reward can be folded into the policy and fit with a single classification-style loss on preference pairs. Three points on the value↔policy fork that has organized this whole course.
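The group baseline is small enough to sketch directly. This follows the formula A_i = (r_i − mean)/std from the table. Two details are assumptions of this sketch rather than anything the lesson specifies: the standard deviation is the population form (divide by K), and a small `eps` is added to the denominator so that a group with identical rewards does not divide by zero.

```javascript
// Group-relative advantages for K responses sampled from the same prompt.
// rewards: array of K terminal scores r_i. No learned critic is involved.
function groupAdvantages(rewards, eps = 1e-8) {
  const K = rewards.length;
  const mean = rewards.reduce((a, r) => a + r, 0) / K;
  const variance = rewards.reduce((a, r) => a + (r - mean) ** 2, 0) / K;
  const std = Math.sqrt(variance);
  // eps is an assumed safeguard for the all-equal case (std = 0).
  return rewards.map((r) => (r - mean) / (std + eps));
}
```

For a group with verifier rewards `[1, 0, 0, 1]`, the mean is 0.5 and the std is 0.5, so the advantages come out close to `[1, -1, -1, 1]`. If every response in the group gets the same reward, all advantages are zero and that prompt contributes no gradient.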

The catch behind each deletion (previewed; derived in lesson 12)
GRPO: using ri inside its own group mean makes the baseline slightly biased at finite K (an O(1/K) effect), and dividing by std quietly re-weights easy and hard prompts. DPO: because it never samples, it cannot explore — it can only re-rank responses already present in the preference data, and it tends to drive the chosen-vs-rejected gap overconfidently high. These are the trade-offs, not bugs; lesson 12 makes them precise.
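To make the DPO side concrete, here is a sketch of the per-pair loss in its standard published form, ahead of the derivation in lesson 12. The function and argument names are assumptions of this example; the inputs are sequence log-probabilities (sums of token log-probs) of the chosen and rejected responses under the policy and under the frozen reference.

```javascript
// Numerically stable log(sigmoid(z)).
function logSigmoid(z) {
  return z >= 0 ? -Math.log1p(Math.exp(-z)) : z - Math.log1p(Math.exp(z));
}

// DPO loss for one preference pair:
//   -log sigmoid( beta * [ (log pi(y_w) - log pi_ref(y_w))
//                        - (log pi(y_l) - log pi_ref(y_l)) ] )
function dpoLoss(logpChosen, logpRejected, refLogpChosen, refLogpRejected, beta) {
  const margin =
    (logpChosen - refLogpChosen) - (logpRejected - refLogpRejected);
  return -logSigmoid(beta * margin);
}
```

When the policy equals the reference the margin is zero and the loss is log 2. The loss is strictly decreasing in the margin and never reaches zero, so nothing in the loss itself stops the chosen-vs-rejected gap from growing, which is one way to see the overconfidence noted above. There is also no sampling step anywhere in the computation.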
WHERE THIS GOES NEXT (SYSTEMS)

This course is the theory: where these algorithms come from, as patches on the policy gradient and the trust region. The sibling course, RL Post-Training, From First Principles, is how they run on a GPU cluster: the rollout engine, weight sync, KL estimators, and the memory accounting that actually decides PPO-vs-GRPO in practice. It uses the same symbols (ρ = πθ / πold, A_t, β·KL(πθ‖πref), clip ε), with engineering depth instead of derivation.

Read it after lessons 12–13 here, where we derive the math it takes as given.

Where this leaves us

We have the map. PPO is TRPO's clip, applied to a token-MDP with terminal reward and a KL anchor to the SFT model. DPO and GRPO are each one deletion from PPO, motivated by one concrete cost. Lesson 12 (part 2) earns both deletions: it derives DPO's closed form from the KL-regularized optimum and the Bradley–Terry likelihood (watching Z(x) cancel), and derives GRPO's group baseline from the policy-gradient theorem's freedom to subtract any function of the state. Lesson 13 then wires PPO into the full RLHF recipe (SFT → reward model → PPO with the KL anchor).

Takeaway
Training an LLM with RL is the exact machine from lessons 01–10 with a renamed MDP: state = prompt + tokens-so-far, action = next token, policy = the LLM, trajectory = the full response, reward = one terminal score. PPO is TRPO's trust region implemented as a clipped min, plus a KL anchor to the frozen SFT model. From there, GRPO deletes the value critic (group baseline instead) and DPO deletes the online loop entirely (closed-form preference loss) — two ends of the value↔policy fork, each fixing one specific cost. Next: we derive both.