TRPO — natural gradient, the KL trust region → PPO

Lesson 09 said "keep π_θ close to π_old" so the importance ratio doesn't explode. TRPO stops asking nicely: it makes closeness a hard constraint and proves that, inside that region, improving the surrogate provably improves the real policy. This is the theoretical peak of the policy-gradient lineage — and the thing PPO will cheapen into a clip.

Where we are

We have, from the last three lessons, all the parts on the table:

The advantage A^π_old(s,a) = Q(s,a) − V(s) (lesson 08): how much better an action is than the policy's own average.
The importance ratio ρ = π_θ(a|s) / π_old(a|s) (lesson 09): the reweighting that lets us score a new policy π_θ using rollouts drawn from the old one.
The warning: when π_θ drifts far from π_old, ρ blows up and the estimate becomes garbage.

Lesson 09 left "close" as a vibe. TRPO's contribution is to make it precise and load-bearing.

The surrogate objective

We want to know how much better a candidate policy π_θ is than π_old, using only rollouts we already collected from π_old. The natural object is the importance-weighted advantage — the surrogate objective:

L_{π_old}(θ) = 𝔼_{s,a ∼ π_old} [ ρ · A^π_old(s,a) ] = 𝔼 [ (π_θ/π_old) · A^π_old ]

Read it: "under the old policy's data, reweight each action by how much more (or less) likely π_θ makes it, and credit it by its advantage." If π_θ puts more mass on above-average actions (A>0) and less on below-average ones, L goes up. At θ = θ_old we have ρ = 1, so L = 0, and — crucially — its gradient equals the true policy gradient there. So L is a local model of the true objective, exact to first order at π_old.

The performance-difference lemma

Why should maximizing L improve the actual return J(π_θ) = 𝔼[ Σ γ^t r_t ]? Because of an exact identity — the performance-difference lemma:

J(π_θ) − J(π_old) = 𝔼_{s,a ∼ π_θ} [ A^π_old(s,a) ]

The true improvement of π_θ equals the old policy's advantage, averaged over states and actions visited by the new policy. This is exact — no approximation. The trouble is the subscript: the expectation is over π_θ's own state-visitation distribution, which we cannot sample without already running π_θ. The surrogate L is exactly this quantity with one cheat: it uses π_old's state distribution instead. So L and the true improvement agree only while the two policies visit similar states — i.e. while they are close.

The whole game in one sentence

L(θ) is the true performance improvement computed on the wrong state distribution. It is trustworthy only in a neighborhood of π_old. So: maximize L, but only take a step small enough that the state distributions haven't drifted apart. That neighborhood is the trust region.

Bounding the error: from total variation to KL

How wrong can the surrogate be? TRPO bounds the gap. If the two policies differ by at most D_TV in total-variation distance at every state, then

J(π_θ) ≥ L_{π_old}(θ) − C · ( max_s D_TV(π_old ‖ π_θ)[s] )²

with C a constant in γ and the advantage scale. This is a genuine lower bound on the true objective: the surrogate, minus a penalty for moving too far. Maximize the right-hand side and you are guaranteed to raise the true return — a monotonic improvement guarantee. That penalty term is the entire reason the method is safe.

Total variation is awkward to estimate from samples; KL divergence is the natural quantity (it's what log-probabilities give you). Pinsker's inequality bridges them — D_TV(p‖q)² ≤ ½ KL(p‖q) — so we can replace TV by KL and keep a valid bound:

J(π_θ) ≥ L_{π_old}(θ) − C′ · 𝔼_s [ KL( π_old(·|s) ‖ π_θ(·|s) ) ]

The constrained problem

In principle you'd just maximize that penalized bound. In practice the theoretical C′ is so conservative that the steps are uselessly tiny. TRPO's pragmatic move: turn the penalty into a constraint with a tunable budget δ:

max_θ L_{π_old}(θ) subject to 𝔼_{s ∼ π_old} [ KL( π_old(·|s) ‖ π_θ(·|s) ) ] ≤ δ

"Improve the surrogate as much as you can, but don't move the policy distribution more than δ in average KL." δ is the radius of the trust region. Small δ = cautious, reliable steps; large δ = the bound goes slack and the surrogate stops predicting the truth. The widget below is exactly this dial.

The natural gradient — why Euclidean geometry is wrong

Now solve the constrained problem. Linearize the objective and quadratically approximate the constraint around θ_old:

max_Δθ g^⊤Δθ subject to ½ Δθ^⊤ F Δθ ≤ δ

where g = ∇_θL is the ordinary policy gradient and F is the Fisher information matrix — the second-order (curvature) approximation of the KL divergence. The constrained solution is the natural gradient:

Δθ ∝ F⁻¹ g

Why F⁻¹ and not plain g? Because a vanilla gradient step θ ← θ + α g moves a fixed distance in parameter space — and equal steps in θ can mean wildly unequal changes in the policy. Tweaking a logit when the policy is near-uniform barely moves the distribution; the same tweak when the policy is nearly deterministic can flip its behavior. Euclidean distance in θ is the wrong ruler. The Fisher matrix is the right ruler: it measures distance in terms of how much the output distribution changes (KL), so F⁻¹g takes the steepest-ascent step in policy space, not parameter space. The trust region is a KL ball, and the natural gradient is the gradient measured in that ball's geometry.

The cost that kills TRPO

F is an n × n matrix where n is the parameter count — millions to billions. You can never form it, let alone invert it. TRPO instead computes F⁻¹g with conjugate gradient, which needs only Fisher-vector products F v (cheap via a double backprop on the KL), never F itself. Then a backtracking line search shrinks the step until the KL constraint actually holds (the quadratic model can be wrong). Per update: many Fisher-vector products plus a line search. Correct, principled — and painfully expensive.

Interactive · the trust region, and what happens when it's too big

Below is a 1-D slice through policy-parameter space: the horizontal axis is a step Δθ away from π_old (at the center). The blue curve is the true objective J(θ) — what we actually care about. The orange curve is the surrogate L(θ) — our local model, which (being first-order at the center) keeps rising even where the truth turns down. The shaded band is the trust region of radius δ. The rule of TRPO: take the step that maximizes the surrogate, but only inside the band.

Drag δ. Small: the chosen step lands where surrogate and truth still agree — the true objective rises every time. Large: the surrogate lures you out to a step where the true objective has already peaked and rolled over — the "improvement" step actually drops real performance. That overshoot is the bug TRPO's constraint exists to prevent.

Surrogate vs. true objective over a 1-D policy slice

Center = π_old. Orange = surrogate L(θ) (our model). Blue = true J(θ). Shaded = trust region of radius δ. The dashed marker is the step TRPO takes: the surrogate's max inside the band. Watch the "true gain" KPI flip negative when δ gets greedy.

δ (trust-region radius): 0.35

step Δθ taken

—

surrogate L at step

—

true gain ΔJ

—

verdict

—

Show the core JS (≈22 lines)

// True objective: rises then falls (the surrogate can't see the fall).
const J = dt => 1.15*dt - 0.95*dt*dt - 0.18*dt*dt*dt;
// Surrogate: first-order-exact at 0, but keeps climbing (concave-but-optimistic).
const L = dt => 1.15*dt - 0.32*dt*dt;

function takeStep(delta){
  // TRPO picks the step maximizing L within the KL ball |Δθ| ≤ delta.
  // L is increasing on [0, ~1.8], so the constrained max is the band edge.
  const step = Math.min(delta, 1.8);     // surrogate's argmax inside trust region
  return { step, surr: L(step), gain: J(step) - J(0) };
}

The lesson the widget hammers: the surrogate is not wrong — it is a faithful local model that becomes untrustworthy with distance. The constraint isn't there to slow you down for its own sake; it is there because beyond the trust region the bound goes slack and the guarantee evaporates. Overshoot the region and you optimize a model of the objective right off the edge of where it predicts anything.

TRPO is correct but expensive → PPO

TRPO delivers the monotonic-improvement guarantee, but the price is steep: per update you run conjugate gradient (many Fisher-vector products) and a backtracking line search, all on top of the normal forward/backward passes. It's fiddly to implement and slow at scale. The field wanted TRPO's safety without its second-order machinery.

PPO (next lesson) is the answer, and it's almost embarrassingly simple. Drop the hard KL constraint, drop the Fisher matrix, drop conjugate gradient and the line search. Replace all of it with a first-order surrogate that clips the importance ratio so it can never stray far from 1:

L^CLIP(θ) = 𝔼 [ min( ρ · A, clip(ρ, 1−ε, 1+ε) · A ) ]

The clip is a crude, cheap stand-in for the KL ball: instead of measuring distribution drift and solving a constrained optimization, it just refuses to reward the surrogate for pushing ρ past 1±ε in the rewarding direction. Same intent as the trust region — don't move too far per step — but enforced with a clamp and plain SGD, no curvature, no F⁻¹. You trade TRPO's provable guarantee for an approximation that works well enough and is trivial to run on enormous models. That trade is exactly what made it the default for the LLM era.

The lineage, compressed

REINFORCE (07) → subtract a baseline / use the advantage (08) → reuse data via the importance ratio (09) → constrain the ratio's drift with a KL trust region and a natural-gradient solve = TRPO (here) → cheapen the constraint to a clipped first-order surrogate = PPO (11). Each step is the smallest patch that fixes the previous one's failure mode.

Takeaway

The surrogate L(θ) = 𝔼[ρ·A] is the true performance improvement evaluated on the wrong state distribution — so it's only trustworthy near π_old. TRPO bounds the error (TV → KL via Pinsker), turns it into a hard constraint 𝔼[KL] ≤ δ, and steps along the natural gradient F⁻¹g because KL — not Euclidean distance in θ — is the right ruler for "how much did the policy move." It's provably monotone but second-order and expensive; PPO keeps the trust-region intent and throws away the machinery, clipping the ratio instead. That clip is where the LLM era begins — next lesson.