
Policy gradient, rigorously

Lesson 03 stated the policy-gradient theorem ∇J = 𝔼[G · ∇log π] and lesson 06 leaned on it to run A3C across many actors. We took it on faith. Now we prove it from the objective up, dissect every piece, and pin down the one thing that controls whether any of it works in practice: the variance of the estimator.

What we are standing on

By now the symbols are reflexes: state s, action a, reward r, transition P(s'|s,a), discount γ, the return G_t = ∑_{k≥0} γ^k r_{t+k}, a parameterized policy π_θ(a|s), and the value V^π(s) = 𝔼[G_t | s_t = s]. Lesson 03 wrote the gradient down and used it; lesson 06 distributed it across workers. Neither derived it. The whole lineage that follows (advantages and GAE in 08, importance sampling in 09, trust regions and PPO in 10–11) is variance-reduction surgery on the object we are about to build, so we had better know exactly what it is.

The plan
Four moves. (1) Write the objective and the probability of a trajectory. (2) Differentiate with the log-derivative trick — and watch the transition terms vanish. (3) Sharpen it twice for free: causality / reward-to-go, then a baseline. (4) Decompose the variance to see why each fix helps, and add an entropy bonus so we keep exploring while we do it.

1 · The objective and the trajectory distribution

We want the policy that maximizes expected return. Let τ = (s_0, a_0, r_0, s_1, a_1, r_1, …) be a trajectory rolled out under π_θ, and let R(τ) = ∑_t γ^t r_t be its discounted return. The objective is

J(θ) = 𝔼_{τ∼π_θ}[ R(τ) ] = ∑_τ P_θ(τ) · R(τ) .

The only thing that depends on θ here is Pθ(τ), the probability the policy assigns to that trajectory. By the Markov property it factorizes — the start state, then alternating policy choices and environment transitions:

P_θ(τ) = ρ_0(s_0) · ∏_{t=0}^{T−1} π_θ(a_t|s_t) · P(s_{t+1}|s_t,a_t) .

Take the log and a product becomes a sum — this is the move that makes the gradient tractable:

log P_θ(τ) = log ρ_0(s_0) + ∑_t [ log π_θ(a_t|s_t) + log P(s_{t+1}|s_t,a_t) ] .

Stare at the right-hand side and mark which terms touch θ. The start-state distribution ρ_0 does not. The transitions P(s_{t+1}|s_t,a_t) do not: those are the environment's physics, not the policy's. Only the log π_θ terms do.

2 · Differentiate: the log-derivative trick

We cannot differentiate J by pushing ∇_θ inside the expectation naively, because the distribution we average over is the very thing we differentiate. The escape is one identity, valid for any parameterized distribution because ∇_θ P_θ = P_θ · ∇_θ log P_θ:

∇_θ J(θ) = ∑_τ ∇_θ P_θ(τ) · R(τ) = ∑_τ P_θ(τ) · ∇_θ log P_θ(τ) · R(τ) = 𝔼_{τ∼π_θ}[ R(τ) · ∇_θ log P_θ(τ) ] .

Now substitute the factorized log from move 1. The ∇_θ annihilates every term that does not contain θ: ∇_θ log ρ_0 = 0 and ∇_θ log P(s_{t+1}|s_t,a_t) = 0. The transition model drops out. What survives is purely the policy's own log-probabilities:

∇_θ log P_θ(τ) = ∑_t ∇_θ log π_θ(a_t|s_t)  ⟹  ∇_θ J(θ) = 𝔼_τ[ R(τ) · ∑_t ∇_θ log π_θ(a_t|s_t) ] .
Why this is the whole point of model-free RL
We just differentiated the expected return without knowing the dynamics P. They cancelled. That is what lets a robot or an LLM learn from rollouts alone — you never need a model of the world, only the ability to sample from it and read off the log-probability of the actions you took.
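To see the cancellation concretely, here is a minimal sketch on a made-up tabular MDP (two states, two actions, horizon 2, γ = 1). The names rho0, P, Rw, theta0 and all their numbers are illustrative assumptions, not part of the lesson's widget. The sketch enumerates every trajectory to compute J(θ) exactly, then compares a finite-difference gradient of J (which does pass through the dynamics) against the score-function form (which only ever reads the policy's probabilities).

```javascript
// Hypothetical toy MDP: 2 states, 2 actions, horizon T = 2, gamma = 1.
const rho0 = [0.7, 0.3];                       // start-state distribution
const P = [[[0.9, 0.1], [0.2, 0.8]],           // P[s][a][s'] transition probabilities
           [[0.5, 0.5], [0.1, 0.9]]];
const Rw = [[0.0, 1.0], [2.0, -1.0]];          // reward r(s, a)
const T = 2;

function softmax(z) {
  const m = Math.max(...z);
  const e = z.map(v => Math.exp(v - m));
  const Z = e.reduce((s, v) => s + v, 0);
  return e.map(v => v / Z);
}

// Enumerate every trajectory and call visit(prob, ret, steps) for each one.
function enumerate(theta, visit) {
  function rec(t, s, prob, ret, steps) {
    if (t === T) { visit(prob, ret, steps); return; }
    const pi = softmax(theta[s]);
    for (let a = 0; a < 2; a++) {
      for (let s2 = 0; s2 < 2; s2++) {
        rec(t + 1, s2, prob * pi[a] * P[s][a][s2], ret + Rw[s][a],
            steps.concat([{ s, a, pi }]));
      }
    }
  }
  for (let s0 = 0; s0 < 2; s0++) rec(0, s0, rho0[s0], 0, []);
}

// J(theta) = sum over trajectories of P_theta(tau) * R(tau), computed exactly.
function J(theta) {
  let total = 0;
  enumerate(theta, (prob, ret) => { total += prob * ret; });
  return total;
}

// Score-function form: E[ R(tau) * sum_t grad log pi(a_t|s_t) ].
// Note that no derivative of rho0 or P appears anywhere in this function.
function policyGradient(theta) {
  const g = [[0, 0], [0, 0]];
  enumerate(theta, (prob, ret, steps) => {
    for (const { s, a, pi } of steps) {
      for (let i = 0; i < 2; i++) g[s][i] += prob * ret * ((i === a ? 1 : 0) - pi[i]);
    }
  });
  return g;
}

// Central finite differences on J as an independent check.
function finiteDiff(theta, h = 1e-5) {
  return theta.map((row, s) => row.map((_, i) => {
    const up = theta.map(r => r.slice()), dn = theta.map(r => r.slice());
    up[s][i] += h; dn[s][i] -= h;
    return (J(up) - J(dn)) / (2 * h);
  }));
}

const theta0 = [[0.3, -0.2], [0.1, 0.4]];      // arbitrary logits, one row per state
const gScore = policyGradient(theta0);
const gFd = finiteDiff(theta0);
console.log(gScore, gFd);
```

If the derivation holds, gScore and gFd should agree to within finite-difference error, even though policyGradient never differentiates rho0 or P.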

The score function

The vector ∇_θ log π_θ(a|s) has a name borrowed from statistics: the score function. It points in the direction in parameter space that most increases the log-probability of action a in state s. The policy gradient is just this score, sampled at every action you took, weighted by how good the outcome was. Two facts about the score earn their keep below.

Its expectation is zero. Averaged over actions drawn from πθ itself, the score vanishes — because probabilities always sum to one and the gradient of a constant is zero:

𝔼_{a∼π_θ}[ ∇_θ log π_θ(a|s) ] = ∑_a π_θ(a|s) ∇_θ log π_θ(a|s) = ∑_a ∇_θ π_θ(a|s) = ∇_θ ∑_a π_θ(a|s) = ∇_θ 1 = 0 .

This single line is the engine behind the baseline trick (move 3b). Concretely, for a softmax policy π_θ = softmax(θ) over discrete actions the score has a clean closed form, ∇_θ log π_θ(a) = e_a − π_θ (the one-hot of the chosen action minus the policy's own probabilities). That is exactly the (i===a?1:0) − pi[i] you saw in the lesson 01 and 03 widgets, and the one we use again below.
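The zero-mean property is easy to confirm by brute force. The sketch below uses arbitrary illustrative logits (an assumption, any values would do) and sums the softmax score e_a − π over all actions, weighted by π itself.

```javascript
function softmax(z) {
  const m = Math.max(...z);
  const e = z.map(v => Math.exp(v - m));
  const Z = e.reduce((s, v) => s + v, 0);
  return e.map(v => v / Z);
}

// Score of a softmax policy with respect to its logits: one-hot(a) minus pi.
function score(pi, a) {
  return pi.map((p, i) => (i === a ? 1 : 0) - p);
}

// E_{a ~ pi}[ score ] computed exactly by summing over every action.
function expectedScore(pi) {
  const mean = new Array(pi.length).fill(0);
  pi.forEach((pa, a) => {
    score(pi, a).forEach((v, i) => { mean[i] += pa * v; });
  });
  return mean;
}

const pi = softmax([1.5, -0.3, 0.2]);   // illustrative logits
const mean = expectedScore(pi);
console.log(mean);                       // every entry is zero up to round-off
```

Each entry works out to π_i − π_i · ∑_a π_a, which is zero because the probabilities sum to one.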

3a · Causality: replace the return with the reward-to-go

The estimator above multiplies every action's score by the whole-trajectory return R(τ) — including reward earned before that action was taken. That is absurd on its face: an action at time t cannot influence rewards at times < t; the past is already written. Including those rewards adds noise and nothing else.

Causality lets us drop them. Conditioned on everything that happened up to and including s_t, an earlier reward r_{t'} with t' < t is already fixed, while the score at time t still has zero mean over a_t ∼ π_θ(·|s_t). By the tower rule the cross-terms vanish: 𝔼[ r_{t'} · ∇_θ log π_θ(a_t|s_t) ] = 𝔼[ r_{t'} · 𝔼_{a_t}[ ∇_θ log π_θ(a_t|s_t) | history up to s_t ] ] = 0. So we may replace the R(τ) attached to each action with only the rewards that came after it, the reward-to-go:

∇_θ J(θ) = 𝔼_τ[ ∑_t G_t · ∇_θ log π_θ(a_t|s_t) ] ,    G_t = ∑_{k≥0} γ^k r_{t+k} .

Same expectation, still unbiased, but each score is now weighted by a less noisy number. (Strictly, the rewards from time t onward in R(τ) sum to γ^t G_t; dropping the leading γ^t is the usual convention and is exact when γ = 1, as in the widget below.) This is the form lesson 03 wrote down; we have just earned the right to write G_t (reward-to-go) instead of R(τ) (full return). It is the first free variance cut.
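Computing the reward-to-go for a whole episode takes one backward pass, using the recursion G_t = r_t + γ · G_{t+1}. A minimal sketch follows; the function name rewardToGo and the sample rewards are illustrative assumptions.

```javascript
// Reward-to-go for every timestep via a single backward pass:
// G_t = r_t + gamma * G_{t+1}, with G past the last step taken as 0.
function rewardToGo(rewards, gamma = 1) {
  const G = new Array(rewards.length).fill(0);
  let running = 0;
  for (let t = rewards.length - 1; t >= 0; t--) {
    running = rewards[t] + gamma * running;
    G[t] = running;
  }
  return G;
}

const G = rewardToGo([1, 0, 2], 0.5);
console.log(G);   // → [1.5, 1, 2]
```

The backward pass costs O(T), whereas re-summing the tail at every step (as a slice-and-reduce does) costs O(T²). For the 3-step episodes used in this lesson the difference is irrelevant, but it matters for long rollouts.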

3b · Baselines: subtract without bias

The second free cut. Subtract from each reward-to-go any quantity b(st) that depends on the state but not on the action:

∇_θ J(θ) = 𝔼_τ[ ∑_t ( G_t − b(s_t) ) · ∇_θ log π_θ(a_t|s_t) ] .

This changes nothing in expectation. The proof is just the zero-mean score from above, applied per state: the baseline factors out of the action-expectation, and what is left is b(s) · 𝔼a[∇log π] = b(s) · 0 = 0.

𝔼_{a∼π_θ}[ b(s) · ∇_θ log π_θ(a|s) ] = b(s) · ∑_a π_θ(a|s) ∇_θ log π_θ(a|s) = b(s) · 0 = 0 .

So we can subtract any state-dependent baseline for free, and we should choose the one that shrinks variance most. The near-optimal choice is the value b(s) = V^π(s). Then G_t − V(s_t) is a single-sample estimate of the advantage A(s_t,a_t), the object lesson 08 will spend its entire length estimating well. Without a baseline, when all returns are large and positive, every sampled action's log-prob is shoved up and the only signal is the small gap between them, buried in noise. With b = V, better-than-average actions rise and worse-than-average ones fall, with the weights centered on zero.
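The "large positive returns" failure mode can be made exact on a one-state problem. The sketch below assumes a hypothetical 3-armed bandit with deterministic rewards 10, 11, 12 and arbitrary logits (all illustrative choices), and computes the mean and total variance of the single-sample gradient (r − b) · (e_a − π) in closed form by summing over actions.

```javascript
function softmax(z) {
  const m = Math.max(...z);
  const e = z.map(v => Math.exp(v - m));
  const Z = e.reduce((s, v) => s + v, 0);
  return e.map(v => v / Z);
}

// Exact mean vector and total variance (summed over components) of the
// single-sample estimator g = (r_a - b) * (one-hot(a) - pi), with a ~ pi.
function moments(pi, rewards, b) {
  const mean = new Array(pi.length).fill(0);
  let second = 0;                                   // E[ |g|^2 ]
  pi.forEach((pa, a) => {
    const g = pi.map((p, i) => (rewards[a] - b) * ((i === a ? 1 : 0) - p));
    g.forEach((v, i) => { mean[i] += pa * v; });
    second += pa * g.reduce((s, v) => s + v * v, 0);
  });
  const variance = second - mean.reduce((s, v) => s + v * v, 0);
  return { mean, variance };
}

const pi = softmax([0.2, -0.1, 0.4]);               // illustrative logits
const rewards = [10, 11, 12];                       // all large and positive
const V = pi.reduce((s, p, a) => s + p * rewards[a], 0);  // value of the single state

const noBase = moments(pi, rewards, 0);
const withV = moments(pi, rewards, V);
console.log(noBase.mean, withV.mean);               // same direction
console.log(noBase.variance, withV.variance);       // very different noise
```

The two mean vectors should match to round-off, since the baseline term has zero expectation, while the variance with b = V should be far smaller because the weights are now of order 1 instead of order 10.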

4 · The variance decomposition — why the villain is variance, not bias

All three forms above are unbiased: average enough rollouts and each converges to the true ∇J. They differ only in variance. And variance is what kills you, because you never have infinite rollouts — you have a batch of N, and the error of a Monte-Carlo estimate shrinks only as

Var[ ĝ_N ] = Var[ ĝ_1 ] / N  ⟹  standard error ∝ 1/√N .

Halving the noise (the standard deviation) of a single-sample estimate is worth as much as quadrupling the batch. That is the entire economic case for the two free cuts above and for everything in lessons 08–10. Write the single-sample estimator as ĝ = ∑_t Ψ_t · ∇_θ log π_θ(a_t|s_t), where Ψ_t is the weight we attach to each score. The whole story is a contest over how noisy Ψ_t is:
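The 1/N scaling can be checked empirically. This is a rough sketch under stated assumptions: a one-step, 3-action problem with a uniform policy, one gradient component only, and a small seeded generator (the well-known mulberry32 routine) so the run is reproducible. None of these choices come from the lesson's widget.

```javascript
// Small seeded PRNG (mulberry32) so results are reproducible; any RNG would do.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}
const rand = mulberry32(42);

// One single-sample gradient component (for logit 0) under a uniform policy.
function sampleG() {
  const a = Math.floor(rand() * 3);                    // action ~ uniform{0,1,2}
  const r = (a === 0 ? 1 : 0) + 0.6 * (rand() - 0.5);  // noisy reward, action 0 is best
  return r * ((a === 0 ? 1 : 0) - 1 / 3);              // r * (e_a - pi)[0]
}

// Average of N single-sample estimates.
function batchMean(N) {
  let s = 0;
  for (let n = 0; n < N; n++) s += sampleG();
  return s / N;
}

// Variance of the batch estimate, measured across many independent batches.
function varianceOfBatchMean(N, batches) {
  let sum = 0, sumSq = 0;
  for (let b = 0; b < batches; b++) {
    const g = batchMean(N);
    sum += g;
    sumSq += g * g;
  }
  const mean = sum / batches;
  return sumSq / batches - mean * mean;
}

const v1 = varianceOfBatchMean(1, 20000);
const v4 = varianceOfBatchMean(4, 20000);
const v16 = varianceOfBatchMean(16, 20000);
console.log(v1 / v4, v1 / v16);   // should land near 4 and 16, up to sampling error
```

The ratios are themselves Monte-Carlo estimates, so expect them to hover around 4 and 16 and not hit them exactly.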

weight Ψ_t | name | bias | variance
R(τ) | full return (vanilla) | none | highest: includes past rewards + all noise
G_t | reward-to-go | none | lower: drops the irrelevant past
G_t − V(s_t) | advantage (baseline) | none | lower still: centered on zero
A^GAE_t | GAE (lesson 08) | tunable | lowest, trades a little bias

Intuitively the single-sample variance has two sources: noise in the weight Ψt (which return did this roll happen to get?) and noise in the score (which action did we happen to sample?). Reward-to-go and baselines attack the first; they shrink the magnitude and re-center the weight without touching the unbiased direction. The batch size N averages down whatever variance remains. The widget below lets you watch all three knobs fight the same number.

The villain of lessons 08–10
Vanilla policy gradient is correct (unbiased) and useless (high-variance) at the same time. Bias is not the problem; variance is. Lesson 08 (advantages / GAE) trades a sliver of bias for a large variance cut; lesson 09 (importance sampling) shows how variance explodes when you reuse off-policy data; lesson 10 (TRPO/PPO) bounds the step so that variance cannot throw the policy off a cliff. Keep your eye on this one number.

The entropy bonus: don't collapse while you reduce variance

A subtle failure: a low-variance gradient can drive the policy to commit too early. As soon as one action looks slightly best, the score keeps pushing its probability toward 1; the policy goes near-deterministic, stops sampling alternatives, and can lock onto a local optimum it never explored its way out of. The standard guard is to add the policy's entropy H[πθ(·|s)] = −∑a πθ(a|s) log πθ(a|s) to the objective, scaled by a small coefficient β:

J_ent(θ) = J(θ) + β · 𝔼_s[ H[π_θ(·|s)] ] .

Its gradient gently pushes πθ back toward uniform, keeping a floor of randomness so exploration (lesson 05) survives. Too much β and the policy refuses to commit; too little and it collapses early. A3C in lesson 06 already carried this term; now you know why it was there.
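For a softmax policy the entropy gradient has a closed form with respect to the logits, ∂H/∂θ_i = −π_i · (log π_i + H), which follows from the softmax Jacobian ∂π_a/∂θ_i = π_a(δ_ai − π_i). The sketch below applies one ascent step on β · H alone (no reward term) to a deliberately peaked policy. The logits, β, and function names are illustrative assumptions.

```javascript
function softmax(z) {
  const m = Math.max(...z);
  const e = z.map(v => Math.exp(v - m));
  const Z = e.reduce((s, v) => s + v, 0);
  return e.map(v => v / Z);
}

// Shannon entropy H[pi] = -sum_a pi_a log pi_a (terms with pi_a = 0 contribute 0).
function entropy(pi) {
  return -pi.reduce((s, p) => s + (p > 0 ? p * Math.log(p) : 0), 0);
}

// Gradient of H[softmax(theta)] with respect to the logits:
// dH/dtheta_i = -pi_i * (log pi_i + H).
function entropyGrad(theta) {
  const pi = softmax(theta);
  const H = entropy(pi);
  return pi.map(p => -p * (Math.log(p) + H));
}

// One gradient-ascent step on beta * H only, to isolate the bonus's effect.
function entropyStep(theta, beta) {
  const g = entropyGrad(theta);
  return theta.map((v, i) => v + beta * g[i]);
}

const peaked = [3, 0, 0];                       // policy already leaning hard on action 0
const after = entropyStep(peaked, 0.5);
console.log(entropy(softmax(peaked)), entropy(softmax(after)));
```

The dominant logit gets pulled down and the others pushed up, so the entropy after the step should be higher. At a uniform policy the gradient vanishes, which is why the bonus stops pushing once the policy is already maximally random.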

Interactive · the gradient-variance explorer

A tiny 3-step episodic MDP: at each of three timesteps the policy π_θ = softmax(θ) picks one of three actions and earns a noisy reward (one action is best at each step). We estimate the policy gradient over a batch of N rollouts, then measure the variance of that gradient estimate across many independent batches, which is the actual quantity that decides whether learning is smooth or jittery. Three knobs, each a thing we proved above: reward-to-go, the baseline, and the batch size N.

Start with full-return + no-baseline + N=1 — that is vanilla REINFORCE, and the variance bar pins to the top. The bug is the lesson. Then flip reward-to-go on, then the baseline on, then drag N up, and watch the same bar fall at every step.

Gradient-estimate variance — three knobs, one villain
Top: variance of the policy-gradient estimate (measured over 60 independent batches; lower = smoother, more confident learning) — bar + rolling history. Bottom-left: the per-knob breakdown. Each knob you proved unbiased above only ever lowers this number.
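The core JS:
// 3-step MDP, 3 actions/step; one action per step is best (noisy reward).
const T = 3, A = 3, best = [0,1,2];
let theta = [[0,0,0],[0,0,0],[0,0,0]];      // logits per step
let bRtg  = [0,0,0];                         // running baseline ≈ V per step (its update is not shown in this excerpt)
let useRtg = false, useBase = false;         // widget toggles: reward-to-go, baseline

// softmax and sample are referenced below but not shown in the original excerpt;
// these are assumed, conventional definitions.
function softmax(z){                         // numerically stable softmax over logits
  const m = Math.max(...z), e = z.map(v => Math.exp(v - m));
  const Z = e.reduce((s,v)=>s+v, 0);
  return e.map(v => v/Z);
}
function sample(p){                          // draw index i with probability p[i]
  let u = Math.random();
  for (let i=0;i<p.length;i++){ u -= p[i]; if (u < 0) return i; }
  return p.length-1;                         // guard against floating-point round-off
}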

function rollout(){                          // one episode → per-step (a, r, score)
  const steps = [];
  for (let t=0;t<T;t++){
    const pi = softmax(theta[t]);
    const a  = sample(pi);
    const r  = (a===best[t] ? 1 : 0) + 0.6*(Math.random()-0.5);  // noisy reward
    const score = pi.map((p,i)=> (i===a?1:0) - p);               // ∇log π = e_a − π
    steps.push({a, r, score});
  }
  return steps;
}

function estimateGrad(N){                     // average ĝ over N rollouts
  const g = [[0,0,0],[0,0,0],[0,0,0]];
  for (let n=0;n<N;n++){
    const ep = rollout();
    const Rtot = ep.reduce((s,x)=>s+x.r, 0);
    for (let t=0;t<T;t++){
      let Gt = useRtg ? ep.slice(t).reduce((s,x)=>s+x.r,0) : Rtot;  // reward-to-go vs full return
      if (useBase) Gt -= bRtg[t];                                  // baseline (no bias)
      for (let i=0;i<A;i++) g[t][i] += Gt * ep[t].score[i] / N;
    }
  }
  return g;                                   // variance of this g over many batches = the KPI
}

The takeaway you can see on screen: turning on reward-to-go drops the bar a notch, the baseline drops it again, and the batch slider drives it toward zero like 1/N — all without moving the policy's average update, because every one of these is unbiased. You are watching the estimator get quieter, not different.
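The measurement itself (variance of the batch estimate across many batches) can be sketched in a self-contained way. The version below is an approximation of the widget's setup under explicit assumptions: the policy is frozen at θ = 0 (uniform), the baseline is the exact value V_t = (T − t)/3 under that policy instead of a running estimate, and randomness comes from a seeded mulberry32 generator. The names singleSampleGrad and gradVariance are hypothetical.

```javascript
// Seeded PRNG (mulberry32) for reproducibility; any RNG would do.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}
const rand = mulberry32(7);

const T = 3, A = 3, best = [0, 1, 2];
const pi = [1 / 3, 1 / 3, 1 / 3];     // theta = 0, so the policy is uniform at every step
const V = [1, 2 / 3, 1 / 3];          // exact expected reward-to-go under that policy

// One rollout turned into a flat gradient vector of length T * A.
function singleSampleGrad(useRtg, useBase) {
  const a = [], r = [];
  for (let t = 0; t < T; t++) {
    a.push(Math.floor(rand() * A));
    r.push((a[t] === best[t] ? 1 : 0) + 0.6 * (rand() - 0.5));
  }
  const total = r.reduce((s, x) => s + x, 0);
  const g = [];
  for (let t = 0; t < T; t++) {
    let w = useRtg ? r.slice(t).reduce((s, x) => s + x, 0) : total;
    if (useBase) w -= V[t];
    for (let i = 0; i < A; i++) g.push(w * ((i === a[t] ? 1 : 0) - pi[i]));
  }
  return g;
}

// Total variance (summed over components) of the N-rollout batch estimate.
function gradVariance(useRtg, useBase, N, batches) {
  const D = T * A;
  const sum = new Array(D).fill(0), sumSq = new Array(D).fill(0);
  for (let b = 0; b < batches; b++) {
    const g = new Array(D).fill(0);
    for (let n = 0; n < N; n++) {
      singleSampleGrad(useRtg, useBase).forEach((v, i) => { g[i] += v / N; });
    }
    g.forEach((v, i) => { sum[i] += v; sumSq[i] += v * v; });
  }
  let total = 0;
  for (let i = 0; i < D; i++) {
    const m = sum[i] / batches;
    total += sumSq[i] / batches - m * m;
  }
  return total;
}

const vanilla = gradVariance(false, false, 1, 5000);
const rtg = gradVariance(true, false, 1, 5000);
const rtgBase = gradVariance(true, true, 1, 5000);
console.log(vanilla, rtg, rtgBase);
```

For this frozen-policy setup a hand calculation gives total variances of roughly 3.3, 1.8, and 0.8 for the three configurations, so the sampled numbers should land near those values without matching them exactly.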

Where this points next

We now have an unbiased gradient and two free variance cuts, and we have named the enemy. The single most powerful weapon against variance turned out to be the baseline — and the best baseline is the value V(s), turning the weight into the advantage A = G − V. But Gt is itself a noisy, full-Monte-Carlo estimate of Q, and V has to be learned. Lesson 08 formalizes the advantage: it shows the one-step TD error δ as a low-variance (but biased) advantage estimate, the n-step returns in between, and GAE(λ) as the geometric knob that blends them — the exact bias↔variance dial this lesson has been circling. After that, lesson 09 asks the dangerous question: can we reuse the rollouts instead of throwing them away each step? — and finds the variance bomb that motivates trust regions.

Takeaway
Write J(θ)=𝔼τ[R(τ)]; the trajectory probability factorizes over steps, so the log-derivative trick turns ∇J into 𝔼[∑t Ψt ∇log πθ(at|st)] — and the transition model cancels, which is why model-free RL is possible at all. Two refinements cost no bias because the score has zero mean: reward-to-go (causality drops past rewards) and a baseline b(s) (best choice V(s) → the advantage). All forms are unbiased; they differ only in variance, which falls as 1/N with batch size and is the true villain of the next three lessons. Add an entropy bonus β H[π] so the policy keeps exploring instead of collapsing.