Policy gradient, rigorously

Lesson 03 stated the policy-gradient theorem ∇J = 𝔼[G · ∇log π] and lesson 06 leaned on it to run A3C across many actors. We took it on faith. Now we prove it from the objective up, dissect every piece, and pin down the one thing that controls whether any of it works in practice: the variance of the estimator.

What we are standing on

By now the symbols are reflexes: state s, action a, reward r, transition P(s'|s,a), discount γ, the return G_t = ∑_{k≥0} γ^k r_{t+k}, a parameterized policy π_θ(a|s), and the value V^π(s) = 𝔼[G_t|s_t=s]. Lesson 03 wrote the gradient down and used it; lesson 06 distributed it across workers. Neither derived it. The whole lineage that follows (advantages and GAE in 08, importance sampling in 09, trust regions and PPO in 10–11) is variance-reduction surgery on the object we are about to build, so we had better know exactly what it is.

The plan

Four moves. (1) Write the objective and the probability of a trajectory. (2) Differentiate with the log-derivative trick — and watch the transition terms vanish. (3) Sharpen it twice for free: causality / reward-to-go, then a baseline. (4) Decompose the variance to see why each fix helps, and add an entropy bonus so we keep exploring while we do it.

1 · The objective and the trajectory distribution

We want the policy that maximizes expected return. Let τ = (s₀, a₀, r₀, s₁, a₁, r₁, …) be a trajectory rolled out under π_θ, and let R(τ) = ∑_t γ^t r_t be its discounted return. The objective is

J(θ) = 𝔼_{τ ∼ π_θ} [ R(τ) ] = ∑_τ P_θ(τ) · R(τ) .

The only thing that depends on θ here is P_θ(τ), the probability of that trajectory under the policy and the environment together. For an episode of T steps, the Markov property lets it factorize into the start state, then alternating policy choices and environment transitions:

P_θ(τ) = ρ₀(s₀) · ∏_{t=0}^{T−1} π_θ(a_t|s_t) · P(s_{t+1}|s_t,a_t) .

Take the log and a product becomes a sum — this is the move that makes the gradient tractable:

log P_θ(τ) = log ρ₀(s₀) + ∑_{t=0}^{T−1} [ log π_θ(a_t|s_t) + log P(s_{t+1}|s_t,a_t) ] .

Stare at the right-hand side and mark which terms touch θ. The start-state distribution ρ₀ does not. The transitions P(s_{t+1}|s_t,a_t) do not: those are the environment's physics, not the policy's. Only the log π_θ terms do.
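The split can be made concrete with a small sketch. Everything here is hypothetical scaffolding rather than part of the lesson's widget: the helper `logProbParts`, the trajectory layout `{s, a, sNext}`, a tabular softmax policy with one row of logits per state, and the callbacks `logRho0` and `logTrans` that stand in for the environment.

```javascript
// Numerically stable softmax over an array of logits.
function softmax(z) {
  const m = Math.max(...z);
  const e = z.map(v => Math.exp(v - m));
  const s = e.reduce((a, b) => a + b, 0);
  return e.map(v => v / s);
}

// Hypothetical helper: splits log P_θ(τ) into the part that depends on θ
// (policy log-probabilities) and the part that does not (start state + transitions).
function logProbParts(traj, theta, logRho0, logTrans) {
  let policyPart = 0;
  let envPart = logRho0(traj[0].s);
  for (const { s, a, sNext } of traj) {
    policyPart += Math.log(softmax(theta[s])[a]);   // log π_θ(a_t|s_t)
    envPart += logTrans(s, a, sNext);               // log P(s_{t+1}|s_t,a_t)
  }
  return { policyPart, envPart };
}

// Toy 2-state, 2-action example with made-up environment probabilities.
const traj = [{ s: 0, a: 1, sNext: 1 }, { s: 1, a: 0, sNext: 0 }];
const logRho0 = () => Math.log(0.5);
const logTrans = () => Math.log(0.5);
const before = logProbParts(traj, [[0, 0], [0, 0]], logRho0, logTrans);
const after = logProbParts(traj, [[1, 0], [0, 2]], logRho0, logTrans);
console.log(before, after);
```

Changing the logits moves `policyPart` and leaves `envPart` untouched, which is the observation the next move exploits.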

2 · Differentiate: the log-derivative trick

We cannot simply push ∇_θ inside the expectation, because the distribution we average over is the very thing we are differentiating. The escape is one identity, valid wherever P_θ is differentiable and positive: ∇_θ P_θ = P_θ · ∇_θ log P_θ, which is just the chain rule for log P_θ read backwards. Applying it term by term:

∇_θ J(θ) = ∑_τ ∇_θ P_θ(τ) · R(τ) = ∑_τ P_θ(τ) · ∇_θ log P_θ(τ) · R(τ) = 𝔼_{τ ∼ π_θ} [ R(τ) · ∇_θ log P_θ(τ) ] .

Now substitute the factorized log from move 1. The ∇_θ annihilates every term that does not contain θ: ∇_θ log ρ₀ = 0 and ∇_θ log P(s_{t+1}|s_t,a_t) = 0. The transition model drops out. What survives is purely the policy's own log-probabilities:

∇_θ log P_θ(τ) = ∑_t ∇_θ log π_θ(a_t|s_t) ⟹ ∇_θ J(θ) = 𝔼_τ [ R(τ) · ∑_t ∇_θ log π_θ(a_t|s_t) ] .

Why this is the whole point of model-free RL

We just differentiated the expected return without knowing the dynamics P. They dropped out because they do not depend on θ. That is what lets a robot or an LLM learn from rollouts alone: you never need a model of the world, only the ability to sample from it and to read off the log-probability of the actions you took.
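A quick numerical sanity check of the log-derivative trick, as a sketch on the simplest possible case: a one-step problem with three actions, where the expectation can be summed exactly. The reward vector `R`, the logits `theta`, and the helper `numGrad` are illustrative assumptions, not taken from the lesson.

```javascript
function softmax(z) {
  const m = Math.max(...z);
  const e = z.map(v => Math.exp(v - m));
  const s = e.reduce((a, b) => a + b, 0);
  return e.map(v => v / s);
}

// Central finite-difference gradient of a scalar function f at theta.
function numGrad(f, theta, h = 1e-5) {
  return theta.map((_, i) => {
    const up = theta.slice(); up[i] += h;
    const dn = theta.slice(); dn[i] -= h;
    return (f(up) - f(dn)) / (2 * h);
  });
}

const R = [1.0, 0.2, -0.5];        // hypothetical fixed reward per action
const theta = [0.3, -0.1, 0.8];    // hypothetical logits

// Left-hand side: differentiate J(θ) = Σ_a π_θ(a) R(a) directly.
const J = th => softmax(th).reduce((s, p, a) => s + p * R[a], 0);
const direct = numGrad(J, theta);

// Right-hand side: Σ_a π_θ(a) · R(a) · ∇_θ log π_θ(a), the score-function form.
const pi = softmax(theta);
const viaScore = theta.map(() => 0);
for (let a = 0; a < R.length; a++) {
  const score = numGrad(th => Math.log(softmax(th)[a]), theta);
  score.forEach((g, i) => { viaScore[i] += pi[a] * R[a] * g; });
}

console.log(direct, viaScore);     // the two vectors should agree to numerical precision
```

In a real rollout the sum over actions is replaced by sampling, which is where the variance discussed later comes from.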

The score function

The vector ∇_θ log π_θ(a|s) has a name borrowed from statistics: the score function. It points in the direction in parameter space that most increases the log-probability of action a in state s. The policy gradient is just this score, sampled at every action you took, weighted by how good the outcome was. Two facts about the score earn their keep below.

Its expectation is zero. Averaged over actions drawn from π_θ itself, the score vanishes — because probabilities always sum to one and the gradient of a constant is zero:

𝔼_{a ∼ π_θ} [ ∇_θ log π_θ(a|s) ] = ∑_a π_θ(a|s) ∇_θ log π_θ(a|s) = ∑_a ∇_θ π_θ(a|s) = ∇_θ ∑_a π_θ(a|s) = ∇_θ 1 = 0 .

This single line is the engine behind the baseline trick (move 3b). Concretely, for a softmax policy π_θ = softmax(θ) over discrete actions the score has a clean closed form, ∇_θ log π_θ(a) = e_a − π_θ (the one-hot of the chosen action minus the policy's own probabilities) — exactly the (i===a?1:0) − pi[i] you saw in the lesson 01 and 03 widgets, and the one we use again below.
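Both facts about the softmax score can be checked in a few lines. This is a sketch with made-up logits; `score` and `logPi` are hypothetical helper names.

```javascript
function softmax(z) {
  const m = Math.max(...z);
  const e = z.map(v => Math.exp(v - m));
  const s = e.reduce((a, b) => a + b, 0);
  return e.map(v => v / s);
}

// Closed-form score of a softmax policy: ∇_θ log π_θ(a) = e_a − π_θ.
function score(pi, a) {
  return pi.map((p, i) => (i === a ? 1 : 0) - p);
}

function logPi(theta, a) {
  return Math.log(softmax(theta)[a]);
}

const theta = [0.5, -1.0, 2.0];    // hypothetical logits
const pi = softmax(theta);

// (1) Closed form vs. a central finite difference of log π_θ(a) for a = 2.
const h = 1e-5;
const numeric = theta.map((_, i) => {
  const up = theta.slice(); up[i] += h;
  const dn = theta.slice(); dn[i] -= h;
  return (logPi(up, 2) - logPi(dn, 2)) / (2 * h);
});
const closed = score(pi, 2);

// (2) Expected score under the policy itself: Σ_a π(a) · (e_a − π).
const meanScore = pi.map(() => 0);
pi.forEach((p, a) => score(pi, a).forEach((g, i) => { meanScore[i] += p * g; }));

console.log(closed, numeric, meanScore);   // meanScore should be zero up to rounding
```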

3a · Causality: replace the return with the reward-to-go

The estimator above multiplies every action's score by the whole-trajectory return R(τ) — including reward earned before that action was taken. That is absurd on its face: an action at time t cannot influence rewards at times < t; the past is already written. Including those rewards adds noise and nothing else.

Causality lets us drop them. Condition on the history up to and including s_t: every earlier reward r_t' with t' < t is already fixed, while the score of a_t still has zero mean over a_t ∼ π_θ(·|s_t). The cross-terms therefore vanish, 𝔼[ r_t' · ∇log π_θ(a_t|s_t) ] = 0, and we may replace the R(τ) attached to each action with only the rewards that came after it, the reward-to-go:

∇_θ J(θ) = 𝔼_τ [ ∑_t G_t · ∇_θ log π_θ(a_t|s_t) ] , G_t = ∑_{k≥0} γ^k r_{t+k} .

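Each score is now weighted by a number with fewer noisy terms in it, and dropping the zero-mean cross-terms leaves the expectation unchanged, so the estimator is still unbiased. (Strictly, for γ < 1 the exact gradient of the discounted objective carries an extra factor γ^t on the t-th term; the standard estimator written here drops it, and the two coincide when γ = 1, as in the widget below.) This is the form lesson 03 wrote down; we have just earned the right to write G_t (reward-to-go) instead of R(τ) (full return). It is the first free variance cut.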
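Computing the reward-to-go for a whole episode takes one backward pass, using G_t = r_t + γ · G_{t+1}. A minimal sketch, where `rewardToGo` is a hypothetical helper name and the reward lists are made up:

```javascript
// Reward-to-go for every timestep of one episode, in O(T):
// walk backwards and reuse G_{t+1} when computing G_t.
function rewardToGo(rewards, gamma = 1) {
  const G = new Array(rewards.length).fill(0);
  let running = 0;
  for (let t = rewards.length - 1; t >= 0; t--) {
    running = rewards[t] + gamma * running;
    G[t] = running;
  }
  return G;
}

console.log(rewardToGo([1, 0, 2]));        // → [3, 2, 2]
console.log(rewardToGo([1, 0, 2], 0.5));   // → [1.5, 1, 2]
```

Note that G_0 equals the full return R(τ), so the first action's weight is unchanged; only later actions lose the rewards that preceded them.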

3b · Baselines: subtract without bias

The second free cut. Subtract from each reward-to-go any quantity b(s_t) that depends on the state but not on the action:

∇_θ J(θ) = 𝔼_τ [ ∑_t ( G_t − b(s_t) ) · ∇_θ log π_θ(a_t|s_t) ] .

This changes nothing in expectation. The proof is just the zero-mean score from above, applied per state: the baseline factors out of the action-expectation, and what is left is b(s) · 𝔼_a[∇log π] = b(s) · 0 = 0.

𝔼_{a ∼ π_θ} [ b(s) · ∇_θ log π_θ(a|s) ] = b(s) · ∑_a π_θ(a|s) ∇_θ log π_θ(a|s) = b(s) · 0 = 0 .

So we can subtract any state-dependent baseline for free, and we should choose the one that shrinks variance most. A near-optimal choice is the value b(s) = V^π(s); then G_t − V(s_t) is a single-sample estimate of the advantage A(s_t,a_t) = Q(s_t,a_t) − V(s_t), the object lesson 08 will spend its entire length estimating well. Without a baseline, when all returns are large and positive, every sampled action's log-prob is shoved up and the only signal is the tiny gap between them, buried in noise. With b = V, better-than-average actions rise and worse-than-average fall, with weights centered on zero.
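The "large positive returns" scenario can be worked out exactly on a one-step problem. This is a sketch under illustrative assumptions: the rewards 10, 11, 12, the logits, and the helper `momentsWithBaseline` are all made up for the example.

```javascript
function softmax(z) {
  const m = Math.max(...z);
  const e = z.map(v => Math.exp(v - m));
  const s = e.reduce((a, b) => a + b, 0);
  return e.map(v => v / s);
}

// Exact mean and total variance (sum over components) of the single-sample
// estimator ĝ(a) = (R(a) − b) · (e_a − π), with a drawn from π.
function momentsWithBaseline(pi, R, b) {
  const mean = new Array(pi.length).fill(0);
  let second = 0;
  for (let a = 0; a < pi.length; a++) {
    const g = pi.map((p, i) => (R[a] - b) * ((i === a ? 1 : 0) - p));
    g.forEach((v, i) => { mean[i] += pi[a] * v; });
    second += pi[a] * g.reduce((s, v) => s + v * v, 0);
  }
  const variance = second - mean.reduce((s, v) => s + v * v, 0);
  return { mean, variance };
}

const pi = softmax([0.2, 0.0, -0.3]);            // hypothetical policy
const R = [10, 11, 12];                          // all returns large and positive
const V = pi.reduce((s, p, a) => s + p * R[a], 0);

const noBaseline = momentsWithBaseline(pi, R, 0);
const withValue = momentsWithBaseline(pi, R, V);
console.log(noBaseline, withValue);              // same mean, very different variance
```

The two means agree to rounding error, while the variance with b = V is a small fraction of the variance with no baseline.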

4 · The variance decomposition — why the villain is variance, not bias

All three forms above are unbiased: average enough rollouts and each converges to the true ∇J. They differ only in variance. And variance is what kills you, because you never have infinite rollouts — you have a batch of N, and the error of a Monte-Carlo estimate shrinks only as

Var[ ĝ_N ] = Var[ ĝ₁ ] / N ⟹ standard error ∝ 1/√N .

Halving the standard deviation of a single-sample estimate is worth as much as quadrupling the batch. That is the entire economic case for the two free cuts above and for everything in lessons 08–10. Write the single-sample estimator as ĝ = ∑_t Ψ_t · ∇log π_θ(a_t|s_t), where Ψ_t is the weight we attach to each score. The whole story is a contest over how noisy Ψ_t is:

weight Ψ_t	name	bias	variance
R(τ)	full return (vanilla)	none	highest — includes past rewards + all noise
G_t	reward-to-go	none	lower — drops the irrelevant past
G_t − V(s_t)	advantage (baseline)	none	lower still — centered on zero
A^GAE_t	GAE (lesson 08)	tunable	lowest, trades a little bias

Intuitively the single-sample variance has two sources: noise in the weight Ψ_t (which return did this roll happen to get?) and noise in the score (which action did we happen to sample?). Reward-to-go and baselines attack the first: they shrink the magnitude of the weight and re-center it without changing the expected gradient. The batch size N averages down whatever variance remains. The widget below lets you watch all three knobs fight the same number.
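The first three rows of the table can be checked exactly on a tiny problem by enumerating every action sequence instead of sampling. The sketch below assumes a 3-step, 3-action setup with a deterministic reward of 1 for the best action and 0 otherwise, and γ = 1; `exactMoments` and its `mode` strings are hypothetical names.

```javascript
function softmax(z) {
  const m = Math.max(...z);
  const e = z.map(v => Math.exp(v - m));
  const s = e.reduce((a, b) => a + b, 0);
  return e.map(v => v / s);
}

// Exact mean and total variance of the single-rollout estimator
// ĝ[t][i] = Ψ_t · (1{i = a_t} − π_t[i]), by enumerating all A^T action sequences.
// mode: "full" → Ψ_t = R(τ), "rtg" → Ψ_t = G_t, "adv" → Ψ_t = G_t − V_t.
function exactMoments(theta, best, mode) {
  const T = theta.length, A = theta[0].length;
  const pis = theta.map(softmax);
  // V_t = expected reward-to-go from step t (the state here is just the timestep).
  const V = pis.map((_, t) => pis.slice(t).reduce((s, p, k) => s + p[best[t + k]], 0));
  const mean = new Array(T * A).fill(0);
  let second = 0;
  for (let code = 0; code < A ** T; code++) {
    const acts = [];                                   // decode base-A index into actions
    for (let t = 0, c = code; t < T; t++, c = Math.floor(c / A)) acts.push(c % A);
    const r = acts.map((a, t) => (a === best[t] ? 1 : 0));
    const prob = acts.reduce((p, a, t) => p * pis[t][a], 1);
    const R = r.reduce((s, x) => s + x, 0);
    for (let t = 0; t < T; t++) {
      const G = r.slice(t).reduce((s, x) => s + x, 0);
      const psi = mode === "full" ? R : mode === "rtg" ? G : G - V[t];
      for (let i = 0; i < A; i++) {
        const g = psi * ((i === acts[t] ? 1 : 0) - pis[t][i]);
        mean[t * A + i] += prob * g;
        second += prob * g * g;
      }
    }
  }
  return { mean, variance: second - mean.reduce((s, m) => s + m * m, 0) };
}

const theta0 = [[0, 0, 0], [0, 0, 0], [0, 0, 0]];      // uniform starting policy
const bestAct = [0, 1, 2];
const results = {};
for (const mode of ["full", "rtg", "adv"]) {
  results[mode] = exactMoments(theta0, bestAct, mode);
  console.log(mode, results[mode].variance, "with N = 4:", results[mode].variance / 4);
}
```

For this uniform starting policy the three means are identical and the total variances work out by hand to 28/9, 46/27, and 2/3, so each step down the table cuts the variance while leaving the expected gradient alone. Dividing by N gives the variance of a batch average.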

The villain of lessons 08–10

Vanilla policy gradient is correct (unbiased) and useless (high-variance) at the same time. Bias is not the problem; variance is. Lesson 08 (advantages / GAE) trades a sliver of bias for a large variance cut; lesson 09 (importance sampling) shows how variance explodes when you reuse off-policy data; lesson 10 (TRPO/PPO) bounds the step so that variance cannot throw the policy off a cliff. Keep your eye on this one number.

The entropy bonus: don't collapse while you reduce variance

A subtle failure: a low-variance gradient can drive the policy to commit too early. As soon as one action looks slightly best, the score keeps pushing its probability toward 1; the policy goes near-deterministic, stops sampling alternatives, and can lock onto a local optimum it never explored its way out of. The standard guard is to add the policy's entropy H[π_θ(·|s)] = −∑_a π_θ(a|s) log π_θ(a|s) to the objective, scaled by a small coefficient β:

J^ent(θ) = J(θ) + β · 𝔼_s [ H[π_θ(·|s)] ] .

Its gradient gently pushes π_θ back toward uniform, keeping a floor of randomness so exploration (lesson 05) survives. Too much β and the policy refuses to commit; too little and it collapses early. A3C in lesson 06 already carried this term; now you know why it was there.
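For a softmax policy over logits θ, the entropy gradient has a closed form, ∂H/∂θ_i = −π_i · (log π_i + H). The lesson does not state this formula; it follows from differentiating H through the softmax, and the sketch below checks it against finite differences. The helper names and logits are illustrative.

```javascript
function softmax(z) {
  const m = Math.max(...z);
  const e = z.map(v => Math.exp(v - m));
  const s = e.reduce((a, b) => a + b, 0);
  return e.map(v => v / s);
}

// H[π] = −Σ_a π(a) log π(a)
function entropy(pi) {
  return -pi.reduce((s, p) => s + (p > 0 ? p * Math.log(p) : 0), 0);
}

// Gradient of H[softmax(θ)] with respect to the logits θ.
function entropyGrad(theta) {
  const pi = softmax(theta);
  const H = entropy(pi);
  return pi.map(p => -p * (Math.log(p) + H));
}

const peaked = [2, 0, 0];                      // policy already leaning on action 0
const grad = entropyGrad(peaked);

// Finite-difference check of the closed form.
const h = 1e-5;
const numeric = peaked.map((_, i) => {
  const up = peaked.slice(); up[i] += h;
  const dn = peaked.slice(); dn[i] -= h;
  return (entropy(softmax(up)) - entropy(softmax(dn))) / (2 * h);
});

console.log(grad, numeric, entropyGrad([0, 0, 0]));
```

On the peaked policy the gradient is negative for the dominant logit and positive for the others, which is the push back toward uniform; at the uniform policy it vanishes. In training, β times this vector would be added to the policy-gradient step.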

Interactive · the gradient-variance explorer

A tiny 3-step episodic MDP: at each of three timesteps the policy π_θ = softmax(θ) picks one of three actions and earns a noisy reward (one action is best at each step). We estimate the policy gradient over a batch of N rollouts, then measure the variance of that gradient estimate across many independent batches — the actual quantity that decides whether learning is smooth or jittery. Three knobs, each a thing we proved above:

Reward-to-go — weight each action by G_t (future only) instead of the full-episode return R(τ).
Baseline — subtract the running mean reward-to-go b(s_t) (a learned-V stand-in).
Batch size N — average more rollouts per estimate; variance falls like 1/N.

Start with full-return + no-baseline + N=1 — that is vanilla REINFORCE, and the variance bar pins to the top. The bug is the lesson. Then flip reward-to-go on, then the baseline on, then drag N up, and watch the same bar fall at every step.

Gradient-estimate variance — three knobs, one villain

Top: variance of the policy-gradient estimate (measured over 60 independent batches; lower = smoother, more confident learning) — bar + rolling history. Bottom-left: the per-knob breakdown. Each knob you proved unbiased above only ever lowers this number.

Readouts: config (initially vanilla), batch N (initially 1), gradient variance, and the ratio vs vanilla (initially 1.00×).

The core JS:

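// 3-step MDP, 3 actions/step; one action per step is best (noisy reward).
const T = 3, A = 3, best = [0,1,2];
let theta = [[0,0,0],[0,0,0],[0,0,0]];      // logits per step
let bRtg  = [0,0,0];                         // running baseline ≈ V per step (its update is not part of this excerpt)
let useRtg = false, useBase = false;         // widget toggles, normally set by the UI; declared here so the excerpt runs

// Minimal stand-ins for helpers the excerpt calls but does not show.
function softmax(z){                         // numerically stable softmax
  const m = Math.max(...z), e = z.map(v => Math.exp(v - m));
  const s = e.reduce((a,b) => a+b, 0);
  return e.map(v => v/s);
}
function sample(p){                          // draw index i with probability p[i]
  let u = Math.random();
  for (let i=0;i<p.length;i++){ u -= p[i]; if (u < 0) return i; }
  return p.length - 1;                       // guard against rounding
}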

function rollout(){                          // one episode → per-step (a, r, score)
  const steps = [];
  for (let t=0;t<T;t++){
    const pi = softmax(theta[t]);
    const a  = sample(pi);
    const r  = (a===best[t] ? 1 : 0) + 0.6*(Math.random()-0.5);  // noisy reward
    const score = pi.map((p,i)=> (i===a?1:0) - p);               // ∇log π = e_a − π
    steps.push({a, r, score});
  }
  return steps;
}

function estimateGrad(N){                     // average ĝ over N rollouts
  const g = [[0,0,0],[0,0,0],[0,0,0]];
  for (let n=0;n<N;n++){
    const ep = rollout();
    const Rtot = ep.reduce((s,x)=>s+x.r, 0);
    for (let t=0;t<T;t++){
      let Gt = useRtg ? ep.slice(t).reduce((s,x)=>s+x.r,0) : Rtot;  // reward-to-go vs full return
      if (useBase) Gt -= bRtg[t];                                  // baseline (no bias)
      for (let i=0;i<A;i++) g[t][i] += Gt * ep[t].score[i] / N;
    }
  }
  return g;                                   // variance of this g over many batches = the KPI
}

The takeaway you can see on screen: turning on reward-to-go drops the bar a notch, the baseline drops it again, and the batch slider drives it toward zero like 1/N — all without moving the policy's average update, because every one of these is unbiased. You are watching the estimator get quieter, not different.
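The excerpt returns one gradient estimate per call; the number on the bar is the spread of that estimate across batches. How the widget aggregates the spread is not shown, so the sketch below makes an assumption: it sums the per-component sample variances over 60 batches (the count the caption mentions). `gradVariance` is a hypothetical helper that accepts any estimator function, for example `() => estimateGrad(N)`.

```javascript
// Total variance of a gradient estimator across independent batches.
// estimate: () => number[][]  (one T×A gradient estimate per call)
// Assumption: "variance" means the sum of per-component sample variances.
function gradVariance(estimate, batches = 60) {
  const samples = [];
  for (let b = 0; b < batches; b++) samples.push(estimate().flat());
  const d = samples[0].length;
  let total = 0;
  for (let i = 0; i < d; i++) {
    const mean = samples.reduce((s, g) => s + g[i], 0) / batches;
    total += samples.reduce((s, g) => s + (g[i] - mean) ** 2, 0) / (batches - 1);
  }
  return total;
}

// Deterministic stand-in estimator, used only to show the call shape:
// its first component alternates +1, −1 and its second component is constant.
let k = 0;
const fakeEstimate = () => [[k++ % 2 === 0 ? 1 : -1, 0]];
console.log(gradVariance(fakeEstimate, 4));   // sample variance of (1, −1, 1, −1), i.e. 4/3
```

With the excerpt's globals in scope, `gradVariance(() => estimateGrad(8))` would measure the same quantity for a batch of 8 under whatever `useRtg` and `useBase` are set to.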

Where this points next

We now have an unbiased gradient and two free variance cuts, and we have named the enemy. The single most powerful weapon against variance turned out to be the baseline — and the best baseline is the value V(s), turning the weight into the advantage A = G − V. But G_t is itself a noisy, full-Monte-Carlo estimate of Q, and V has to be learned. Lesson 08 formalizes the advantage: it shows the one-step TD error δ as a low-variance (but biased) advantage estimate, the n-step returns in between, and GAE(λ) as the geometric knob that blends them — the exact bias↔variance dial this lesson has been circling. After that, lesson 09 asks the dangerous question: can we reuse the rollouts instead of throwing them away each step? — and finds the variance bomb that motivates trust regions.

Takeaway

Write J(θ)=𝔼_τ[R(τ)]; the trajectory probability factorizes over steps, so the log-derivative trick turns ∇J into 𝔼[∑_t Ψ_t ∇log π_θ(a_t|s_t)], and the transition model drops out because it does not depend on θ, which is why model-free RL is possible at all. Two refinements cost no bias because the score has zero mean: reward-to-go (causality drops past rewards) and a baseline b(s) (best choice V(s), which turns the weight into the advantage). All forms are unbiased; they differ only in variance, which falls as 1/N with batch size and is the true villain of the next three lessons. Add an entropy bonus β H[π] so the policy keeps exploring instead of collapsing.