TRPO — natural gradient, the KL trust region → PPO
Lesson 09 said "keep πθ close to πold" so the importance ratio doesn't explode. TRPO stops asking nicely: it makes closeness a hard constraint and proves that, inside that region, improving the surrogate provably improves the real policy. This is the theoretical peak of the policy-gradient lineage — and the thing PPO will cheapen into a clip.
Where we are
We have, from the last three lessons, all the parts on the table:
- The advantage Aπold(s,a) = Q(s,a) − V(s) (lesson 08): how much better an action is than the policy's own average.
- The importance ratio ρ = πθ(a|s) / πold(a|s) (lesson 09): the reweighting that lets us score a new policy πθ using rollouts drawn from the old one.
- The warning: when πθ drifts far from πold, ρ blows up and the estimate becomes garbage.
Lesson 09 left "close" as a vibe. TRPO's contribution is to make it precise and load-bearing.
The surrogate objective
We want to know how much better a candidate policy πθ is than πold, using only rollouts we already collected from πold. The natural object is the importance-weighted advantage — the surrogate objective:
Read it: "under the old policy's data, reweight each action by how much more (or less) likely πθ makes it, and credit it by its advantage." If πθ puts more mass on above-average actions (A>0) and less on below-average ones, L goes up. At θ = θold we have ρ = 1, so L = 0, and — crucially — its gradient equals the true policy gradient there. So L is a local model of the true objective, exact to first order at πold.
The performance-difference lemma
Why should maximizing L improve the actual return J(πθ) = 𝔼[ Σ γt rt ]? Because of an exact identity — the performance-difference lemma:
The true improvement of πθ equals the old policy's advantage, averaged over states and actions visited by the new policy. This is exact — no approximation. The trouble is the subscript: the expectation is over πθ's own state-visitation distribution, which we cannot sample without already running πθ. The surrogate L is exactly this quantity with one cheat: it uses πold's state distribution instead. So L and the true improvement agree only while the two policies visit similar states — i.e. while they are close.
Bounding the error: from total variation to KL
How wrong can the surrogate be? TRPO bounds the gap. If the two policies differ by at most DTV in total-variation distance at every state, then
with C a constant in γ and the advantage scale. This is a genuine lower bound on the true objective: the surrogate, minus a penalty for moving too far. Maximize the right-hand side and you are guaranteed to raise the true return — a monotonic improvement guarantee. That penalty term is the entire reason the method is safe.
Total variation is awkward to estimate from samples; KL divergence is the natural quantity (it's what log-probabilities give you). Pinsker's inequality bridges them — DTV(p‖q)2 ≤ ½ KL(p‖q) — so we can replace TV by KL and keep a valid bound:
The constrained problem
In principle you'd just maximize that penalized bound. In practice the theoretical C′ is so conservative that the steps are uselessly tiny. TRPO's pragmatic move: turn the penalty into a constraint with a tunable budget δ:
"Improve the surrogate as much as you can, but don't move the policy distribution more than δ in average KL." δ is the radius of the trust region. Small δ = cautious, reliable steps; large δ = the bound goes slack and the surrogate stops predicting the truth. The widget below is exactly this dial.
The natural gradient — why Euclidean geometry is wrong
Now solve the constrained problem. Linearize the objective and quadratically approximate the constraint around θold:
where g = ∇θL is the ordinary policy gradient and F is the Fisher information matrix — the second-order (curvature) approximation of the KL divergence. The constrained solution is the natural gradient:
Why F−1 and not plain g? Because a vanilla gradient step θ ← θ + α g moves a fixed distance in parameter space — and equal steps in θ can mean wildly unequal changes in the policy. Tweaking a logit when the policy is near-uniform barely moves the distribution; the same tweak when the policy is nearly deterministic can flip its behavior. Euclidean distance in θ is the wrong ruler. The Fisher matrix is the right ruler: it measures distance in terms of how much the output distribution changes (KL), so F−1g takes the steepest-ascent step in policy space, not parameter space. The trust region is a KL ball, and the natural gradient is the gradient measured in that ball's geometry.
Interactive · the trust region, and what happens when it's too big
Below is a 1-D slice through policy-parameter space: the horizontal axis is a step Δθ away from πold (at the center). The blue curve is the true objective J(θ) — what we actually care about. The orange curve is the surrogate L(θ) — our local model, which (being first-order at the center) keeps rising even where the truth turns down. The shaded band is the trust region of radius δ. The rule of TRPO: take the step that maximizes the surrogate, but only inside the band.
Drag δ. Small: the chosen step lands where surrogate and truth still agree — the true objective rises every time. Large: the surrogate lures you out to a step where the true objective has already peaked and rolled over — the "improvement" step actually drops real performance. That overshoot is the bug TRPO's constraint exists to prevent.
The lesson the widget hammers: the surrogate is not wrong — it is a faithful local model that becomes untrustworthy with distance. The constraint isn't there to slow you down for its own sake; it is there because beyond the trust region the bound goes slack and the guarantee evaporates. Overshoot the region and you optimize a model of the objective right off the edge of where it predicts anything.
TRPO is correct but expensive → PPO
TRPO delivers the monotonic-improvement guarantee, but the price is steep: per update you run conjugate gradient (many Fisher-vector products) and a backtracking line search, all on top of the normal forward/backward passes. It's fiddly to implement and slow at scale. The field wanted TRPO's safety without its second-order machinery.
PPO (next lesson) is the answer, and it's almost embarrassingly simple. Drop the hard KL constraint, drop the Fisher matrix, drop conjugate gradient and the line search. Replace all of it with a first-order surrogate that clips the importance ratio so it can never stray far from 1:
The clip is a crude, cheap stand-in for the KL ball: instead of measuring distribution drift and solving a constrained optimization, it just refuses to reward the surrogate for pushing ρ past 1±ε in the rewarding direction. Same intent as the trust region — don't move too far per step — but enforced with a clamp and plain SGD, no curvature, no F−1. You trade TRPO's provable guarantee for an approximation that works well enough and is trivial to run on enormous models. That trade is exactly what made it the default for the LLM era.