Policy gradient, rigorously
Lesson 03 stated the policy-gradient theorem ∇J = 𝔼[G · ∇log π] and lesson 06 leaned on it to run A3C across many actors. We took it on faith. Now we prove it from the objective up, dissect every piece, and pin down the one thing that controls whether any of it works in practice: the variance of the estimator.
What we are standing on
By now the symbols are reflexes: state $s$, action $a$, reward $r$, transition $P(s' \mid s, a)$, discount $\gamma$, the return $G_t = \sum_{k \ge 0} \gamma^k r_{t+k}$, a parameterized policy $\pi_\theta(a \mid s)$, and the value $V^\pi(s) = \mathbb{E}[G_t \mid s_t = s]$. Lesson 03 wrote the gradient down and used it; lesson 06 distributed it across workers. Neither derived it. The whole lineage that follows (advantages and GAE in 08, importance sampling in 09, trust regions and PPO in 10–11) is variance-reduction surgery on the object we are about to build, so we had better know exactly what it is.
1 · The objective and the trajectory distribution
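We want the policy that maximizes expected return. Let $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots)$ be a trajectory rolled out under $\pi_\theta$, and let $R(\tau) = \sum_t \gamma^t r_t$ be its discounted return. The objective is
$$J(\theta) = \mathbb{E}_{\tau \sim P_\theta}[R(\tau)] = \sum_\tau P_\theta(\tau)\, R(\tau).$$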
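The only thing that depends on $\theta$ here is $P_\theta(\tau)$, the probability of that trajectory under the policy and the environment together. By the Markov property it factorizes into the start state, then alternating policy choices and environment transitions:
$$P_\theta(\tau) = \rho_0(s_0) \prod_{t \ge 0} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t).$$
Take the log and the product becomes a sum, which is the move that makes the gradient tractable:
$$\log P_\theta(\tau) = \log \rho_0(s_0) + \sum_{t \ge 0} \log \pi_\theta(a_t \mid s_t) + \sum_{t \ge 0} \log P(s_{t+1} \mid s_t, a_t).$$
Stare at the right-hand side and mark which terms touch $\theta$. The start-state distribution $\rho_0$ does not. The transitions $P(s_{t+1} \mid s_t, a_t)$ do not; those are the environment's physics, not the policy's. Only the $\log \pi_\theta$ terms do.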
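The factorization can be made concrete with a small sketch. Everything here is a hypothetical toy: a 2-state, 2-action environment with made-up `RHO0` and `P` tables and a tabular softmax policy. The function splits the trajectory log-probability into its policy part and its environment part, so that changing θ visibly moves only the former.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical toy environment: 2 states, 2 actions.
RHO0 = [0.6, 0.4]                      # start-state distribution rho_0(s)
P = [[[0.9, 0.1], [0.2, 0.8]],         # P[s][a][s_next]
     [[0.5, 0.5], [0.3, 0.7]]]

def log_prob_parts(theta, states, actions):
    """Split log P_theta(tau) into (policy_part, env_part).

    theta[s] holds the softmax logits for state s; `states` has one more
    entry than `actions` (the final state has no action attached).
    """
    policy_part = sum(math.log(softmax(theta[s])[a])
                      for s, a in zip(states, actions))
    env_part = math.log(RHO0[states[0]]) + sum(
        math.log(P[s][a][s_next])
        for s, a, s_next in zip(states, actions, states[1:]))
    return policy_part, env_part

states, actions = [0, 1, 1, 0], [1, 0, 1]
theta_a = [[0.0, 0.0], [0.0, 0.0]]
theta_b = [[0.5, -0.3], [1.0, 0.2]]
pol_a, env_a = log_prob_parts(theta_a, states, actions)
pol_b, env_b = log_prob_parts(theta_b, states, actions)
# env_a and env_b are identical: the environment terms never see theta.
```

Because `env_part` is computed without touching `theta`, its gradient with respect to θ is zero by construction, which is the observation the next section relies on.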
2 · Differentiate: the log-derivative trick
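We cannot differentiate $J$ by naively pushing $\nabla_\theta$ inside the expectation, because the distribution we average over is the very thing we differentiate. The escape is one identity, valid for any differentiable parameterized distribution because $\nabla_\theta P_\theta = P_\theta \nabla_\theta \log P_\theta$:
$$\nabla_\theta J(\theta) = \nabla_\theta \sum_\tau P_\theta(\tau)\, R(\tau) = \sum_\tau P_\theta(\tau)\, \nabla_\theta \log P_\theta(\tau)\, R(\tau) = \mathbb{E}_{\tau \sim P_\theta}\big[ R(\tau)\, \nabla_\theta \log P_\theta(\tau) \big].$$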
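Now substitute the factorized log from move 1. The $\nabla_\theta$ annihilates every term that did not contain $\theta$: $\nabla_\theta \log \rho_0 = 0$ and $\nabla_\theta \log P(s_{t+1} \mid s_t, a_t) = 0$. The transition model drops out. What survives is purely the policy's own log-probabilities:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim P_\theta}\Big[ R(\tau) \sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big].$$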
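The log-derivative identity can be checked numerically on a case small enough to enumerate. The sketch below assumes a hypothetical one-step problem (three actions, fixed rewards, a softmax policy over logits `theta`), where $J(\theta) = \sum_a \pi_\theta(a)\, r(a)$. It compares the score-function expectation, summed exactly over actions, against a central finite difference of $J$.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def J(theta, rewards):
    """Expected reward of a one-step softmax policy."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))

def score(theta, a):
    """grad_theta log softmax(theta)[a], i.e. one-hot(a) minus pi."""
    pi = softmax(theta)
    return [(1.0 if i == a else 0.0) - pi[i] for i in range(len(theta))]

def score_function_gradient(theta, rewards):
    """E_{a ~ pi}[ r(a) * grad log pi(a) ], summed exactly over actions."""
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for a, (p, r) in enumerate(zip(pi, rewards)):
        s = score(theta, a)
        for i in range(len(theta)):
            grad[i] += p * r * s[i]
    return grad

def finite_difference_gradient(theta, rewards, eps=1e-6):
    """Central-difference approximation of grad J, for comparison only."""
    grad = []
    for i in range(len(theta)):
        up, dn = list(theta), list(theta)
        up[i] += eps
        dn[i] -= eps
        grad.append((J(up, rewards) - J(dn, rewards)) / (2 * eps))
    return grad

theta = [0.2, -0.5, 1.0]       # hypothetical logits
rewards = [1.0, 3.0, 2.0]      # hypothetical per-action rewards
g_score = score_function_gradient(theta, rewards)
g_fd = finite_difference_gradient(theta, rewards)
```

In a real rollout the sum over actions is replaced by sampling, which is where the variance discussed later comes from.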
The score function
The vector ∇θ log πθ(a|s) has a name borrowed from statistics: the score function. It points in the direction in parameter space that most increases the log-probability of action a in state s. The policy gradient is just this score, sampled at every action you took, weighted by how good the outcome was. Two facts about the score earn their keep below.
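Its expectation is zero. Averaged over actions drawn from $\pi_\theta$ itself, the score vanishes, because probabilities always sum to one and the gradient of a constant is zero:
$$\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\big] = \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) = \sum_a \nabla_\theta \pi_\theta(a \mid s) = \nabla_\theta \sum_a \pi_\theta(a \mid s) = \nabla_\theta 1 = 0.$$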
This single line is the engine behind the baseline trick (move 3b). Concretely, for a softmax policy $\pi_\theta = \mathrm{softmax}(\theta)$ over discrete actions the score has a clean closed form, $\nabla_\theta \log \pi_\theta(a) = e_a - \pi_\theta$ (the one-hot vector of the chosen action minus the policy's own probabilities). This is exactly the (i===a?1:0) − pi[i] you saw in the lesson 01 and 03 widgets, and the one we use again below.
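Both facts about the score are easy to verify numerically. The sketch below uses arbitrary, hypothetical logits. It checks the closed form $e_a - \pi_\theta$ against a finite difference of $\log \pi_\theta(a)$, and confirms that the policy-weighted average of the score is zero up to rounding.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def score(theta, a):
    """Closed-form softmax score: one-hot(a) minus pi."""
    pi = softmax(theta)
    return [(1.0 if i == a else 0.0) - pi[i] for i in range(len(theta))]

def numeric_score(theta, a, eps=1e-6):
    """Central finite difference of log softmax(theta)[a]."""
    grad = []
    for i in range(len(theta)):
        up, dn = list(theta), list(theta)
        up[i] += eps
        dn[i] -= eps
        grad.append((math.log(softmax(up)[a]) - math.log(softmax(dn)[a])) / (2 * eps))
    return grad

theta = [0.3, -1.2, 0.8]       # hypothetical logits
pi = softmax(theta)
n = len(theta)

# E_{a ~ pi}[score], one entry per parameter; each entry should be ~0.
mean_score = [sum(pi[a] * score(theta, a)[i] for a in range(n)) for i in range(n)]

# Largest gap between closed form and finite difference over all actions.
max_gap = max(abs(c - f)
              for a in range(n)
              for c, f in zip(score(theta, a), numeric_score(theta, a)))
```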
3a · Causality: replace the return with the reward-to-go
The estimator above multiplies every action's score by the whole-trajectory return R(τ) — including reward earned before that action was taken. That is absurd on its face: an action at time t cannot influence rewards at times < t; the past is already written. Including those rewards adds noise and nothing else.
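Causality lets us drop them. Take a reward $r_{t'}$ earned at an earlier time $t' < t$. It is already fixed by the time $a_t$ is sampled, so it factors out of the expectation over $a_t$, and what remains is the zero-mean score: $\mathbb{E}[r_{t'}\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)] = \mathbb{E}\big[r_{t'}\, \mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}[\nabla_\theta \log \pi_\theta(a_t \mid s_t)]\big] = 0$. So we may replace the $R(\tau)$ attached to each action with only the rewards that came after it. Those surviving terms sum to $\gamma^t G_t$, the discounted reward-to-go:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim P_\theta}\Big[ \sum_{t \ge 0} \gamma^t G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big], \qquad G_t = \sum_{k \ge 0} \gamma^k r_{t+k}.$$
Most implementations drop the leading $\gamma^t$ and weight each score by $G_t$ alone, which is exact when $\gamma = 1$; that is the convention the rest of this lesson follows.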
Same expectation — still unbiased — but each score is now weighted by a smaller, less noisy number. This is the form lesson 03 wrote down; we have just earned the right to write Gt (reward-to-go) instead of R(τ) (full return). It is the first free variance cut.
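Computing the reward-to-go is a single backward pass over the episode's rewards. The helper below is a minimal sketch; the name `rewards_to_go` and the example numbers are illustrative, not taken from the lesson's widgets.

```python
def rewards_to_go(rewards, gamma):
    """Return [G_0, G_1, ...] with G_t = sum_{k >= 0} gamma**k * rewards[t + k]."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each G_t reuses G_{t+1}: G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

rewards = [1.0, 0.0, 2.0]      # hypothetical episode
gamma = 0.5
G = rewards_to_go(rewards, gamma)
full_return = sum(gamma ** t * r for t, r in enumerate(rewards))
print(G)             # [1.5, 1.0, 2.0]
print(full_return)   # 1.5
```

Here $G_0$ coincides with the full return $R(\tau)$, while the later entries no longer carry the rewards that preceded them.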
3b · Baselines: subtract without bias
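The second free cut. Subtract from each reward-to-go any quantity $b(s_t)$ that depends on the state but not on the action:
$$\mathbb{E}_{\tau \sim P_\theta}\Big[ \sum_{t \ge 0} \big(G_t - b(s_t)\big)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big] = \mathbb{E}_{\tau \sim P_\theta}\Big[ \sum_{t \ge 0} G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big].$$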
This changes nothing in expectation. The proof is just the zero-mean score from above, applied per state: the baseline factors out of the action-expectation, and what is left is b(s) · 𝔼a[∇log π] = b(s) · 0 = 0.
So we can subtract any state-dependent baseline for free, and we should choose the one that shrinks variance most. The near-optimal choice is the value $b(s) = V^\pi(s)$. Then $G_t - V^\pi(s_t)$ is a single-sample estimate of the advantage $A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$, the object lesson 08 will spend its entire length estimating well. Without a baseline, when all returns are large and positive, every sampled action's log-probability is shoved up and the only signal is the tiny gap between them, buried in noise. With $b = V^\pi$, better-than-average actions rise and worse-than-average ones fall, with the weights centered on zero.
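The "large positive returns" scenario can be worked out exactly on a one-step problem. The sketch below makes several assumptions for illustration: three actions, deterministic rewards of 100, 101 and 102, and a uniform softmax policy. It enumerates the actions to get the exact mean and total variance of the single-sample estimator $(r(a) - b)(e_a - \pi_\theta)$, once with $b = 0$ and once with $b = V$.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def estimator_stats(theta, rewards, baseline):
    """Exact mean and total variance of g = (r(a) - b) * (one_hot(a) - pi), a ~ pi."""
    pi = softmax(theta)
    n = len(theta)
    mean = [0.0] * n
    second_moment = 0.0
    for a in range(n):
        g = [(rewards[a] - baseline) * ((1.0 if i == a else 0.0) - pi[i])
             for i in range(n)]
        for i in range(n):
            mean[i] += pi[a] * g[i]
        second_moment += pi[a] * sum(x * x for x in g)
    # Total variance = E[|g|^2] - |E[g]|^2, summed over components.
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance

theta = [0.0, 0.0, 0.0]                # uniform policy (assumption)
rewards = [100.0, 101.0, 102.0]        # all large and positive (assumption)
value = sum(p * r for p, r in zip(softmax(theta), rewards))   # V = 101

mean_none, var_none = estimator_stats(theta, rewards, 0.0)
mean_v, var_v = estimator_stats(theta, rewards, value)
# The two means agree; the variance with b = V is smaller by orders of magnitude.
```

The mean is the same in both cases, so the baseline leaves the expected update alone and only removes noise.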
4 · The variance decomposition — why the villain is variance, not bias
All three forms above are unbiased: average enough rollouts and each converges to the true ∇J. They differ only in variance. And variance is what kills you, because you never have infinite rollouts. You have a batch of N, and the error of a Monte-Carlo estimate shrinks only as
$$\text{error} \;\propto\; \frac{\sigma}{\sqrt{N}},$$
where $\sigma$ is the standard deviation of a single-sample estimate.
Halving the noise of a single-sample estimate is worth as much as quadrupling the batch. That is the entire economic case for the two free cuts above and for everything in lessons 08–10. Write the single-sample estimator as ĝ = ∑t Ψt · ∇log πθ(at|st), where Ψt is the weight we attach to each score. The whole story is a contest over how noisy Ψt is:
| weight Ψt | name | bias | variance |
|---|---|---|---|
| R(τ) | full return (vanilla) | none | highest — includes past rewards + all noise |
| Gt | reward-to-go | none | lower — drops the irrelevant past |
| Gt − V(st) | advantage (baseline) | none | lower still — centered on zero |
| $A^{\mathrm{GAE}}_t$ | GAE (lesson 08) | tunable | lowest, trades a little bias |
Intuitively the single-sample variance has two sources: noise in the weight Ψt (which return did this roll happen to get?) and noise in the score (which action did we happen to sample?). Reward-to-go and baselines attack the first; they shrink the magnitude and re-center the weight without touching the unbiased direction. The batch size N averages down whatever variance remains. The widget below lets you watch all three knobs fight the same number.
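The batch-size scaling is the standard Monte-Carlo one: the standard deviation of a batch mean falls like $1/\sqrt{N}$, so its variance falls like $1/N$. A quick seeded simulation illustrates this. It uses plain Gaussian noise as a stand-in for single-rollout gradient noise, an assumption made purely for simplicity.

```python
import random

def batch_mean_std(n, sigma, num_batches, rng):
    """Empirical std of the mean of n Gaussian samples, over num_batches batches."""
    means = []
    for _ in range(num_batches):
        batch = [rng.gauss(0.0, sigma) for _ in range(n)]
        means.append(sum(batch) / n)
    center = sum(means) / num_batches
    return (sum((m - center) ** 2 for m in means) / num_batches) ** 0.5

rng = random.Random(0)   # fixed seed so the run is repeatable
stds = {n: batch_mean_std(n, 1.0, 2000, rng) for n in (1, 4, 16)}
for n, s in stds.items():
    print(n, round(s, 3))
# Each 4x increase in batch size should roughly halve the std.
```

Halving the per-sample noise `sigma` has the same effect on the result as multiplying `n` by four, which is the trade the section describes.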
The entropy bonus: don't collapse while you reduce variance
A subtle failure: a low-variance gradient can drive the policy to commit too early. As soon as one action looks slightly best, the update keeps pushing its probability toward 1; the policy goes near-deterministic, stops sampling alternatives, and can lock onto a local optimum it never explores its way out of. The standard guard is to add the policy's entropy $H[\pi_\theta(\cdot \mid s)] = -\sum_a \pi_\theta(a \mid s) \log \pi_\theta(a \mid s)$ to the objective, scaled by a small coefficient $\beta$:
$$J_\beta(\theta) = J(\theta) + \beta\, \mathbb{E}_{\tau \sim P_\theta}\Big[ \sum_{t \ge 0} H[\pi_\theta(\cdot \mid s_t)] \Big].$$
Its gradient gently pushes πθ back toward uniform, keeping a floor of randomness so exploration (lesson 05) survives. Too much β and the policy refuses to commit; too little and it collapses early. A3C in lesson 06 already carried this term; now you know why it was there.
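For a softmax policy over logits the entropy gradient has a closed form, $\partial H / \partial \theta_i = -\pi_i (\log \pi_i + H)$, which follows from the softmax Jacobian. The sketch below uses hypothetical logits and a hypothetical step size. It checks that formula against finite differences and takes one ascent step on the entropy alone, to show the push toward uniform.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(theta):
    return -sum(p * math.log(p) for p in softmax(theta))

def entropy_gradient(theta):
    """dH/dtheta_i = -pi_i * (log pi_i + H) for pi = softmax(theta)."""
    pi = softmax(theta)
    h = entropy(theta)
    return [-p * (math.log(p) + h) for p in pi]

def numeric_gradient(f, theta, eps=1e-6):
    grad = []
    for i in range(len(theta)):
        up, dn = list(theta), list(theta)
        up[i] += eps
        dn[i] -= eps
        grad.append((f(up) - f(dn)) / (2 * eps))
    return grad

theta = [2.0, 0.0, -1.0]      # hypothetical logits: a fairly peaked policy
grad = entropy_gradient(theta)
grad_fd = numeric_gradient(entropy, theta)

step = 0.5                    # hypothetical step size
theta_new = [t + step * g for t, g in zip(theta, grad)]
# The dominant action's probability shrinks and the entropy grows.
```

In training this term is added to the policy gradient with weight β rather than applied on its own, so it tempers the update instead of replacing it.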
Interactive · the gradient-variance explorer
A tiny 3-step episodic MDP: at each of three timesteps the policy πθ = softmax(θ) picks one of three actions and earns a noisy reward (one action is best at each step). We estimate the policy gradient over a batch of N rollouts, then measure the variance of that gradient estimate across many independent batches — the actual quantity that decides whether learning is smooth or jittery. Three knobs, each a thing we proved above:
- Reward-to-go — weight each action by Gt (future only) instead of the full-episode return R(τ).
- Baseline — subtract the running mean reward-to-go b(st) (a learned-V stand-in).
- Batch size N — average more rollouts per estimate; variance falls like 1/N.
Start with full-return + no-baseline + N=1: that is vanilla REINFORCE, and the variance bar pins to the top. That pinned bar is the lesson. Then flip reward-to-go on, then the baseline on, then drag N up, and watch the same bar fall at every step.
The takeaway you can see on screen: turning on reward-to-go drops the bar a notch, the baseline drops it again, and the batch slider drives it toward zero like 1/N — all without moving the policy's average update, because every one of these is unbiased. You are watching the estimator get quieter, not different.
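For readers without the widget, the same experiment can be approximated offline. The sketch below is not the widget's code and makes its own assumptions: a different action is best at each step, mean rewards are 2 for the best action and 1 otherwise, the reward noise is Gaussian with standard deviation 0.5, γ = 1, the policy is uniform, and the baseline is the exact expected reward-to-go rather than a running mean.

```python
import math
import random

STEPS, ACTIONS = 3, 3
BEST = [0, 1, 2]      # hypothetical: index of the best action at each step
NOISE_STD = 0.5       # hypothetical reward noise

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def mean_reward(t, a):
    return 2.0 if a == BEST[t] else 1.0

def rollout(pis, rng):
    """Sample one episode; return the chosen actions and their noisy rewards."""
    actions, rewards = [], []
    for t in range(STEPS):
        a = rng.choices(range(ACTIONS), weights=pis[t])[0]
        actions.append(a)
        rewards.append(mean_reward(t, a) + rng.gauss(0.0, NOISE_STD))
    return actions, rewards

def batch_gradient(theta, n, reward_to_go, baseline, rng):
    """Average of n single-rollout gradient estimates, as a flat list."""
    pis = [softmax(row) for row in theta]
    # Exact expected reward-to-go under the policy (stand-in for a learned V).
    step_value = [sum(p * mean_reward(t, a) for a, p in enumerate(pis[t]))
                  for t in range(STEPS)]
    values = [sum(step_value[t:]) for t in range(STEPS)]
    grad = [0.0] * (STEPS * ACTIONS)
    for _ in range(n):
        actions, rewards = rollout(pis, rng)
        for t, a in enumerate(actions):
            weight = sum(rewards[t:]) if reward_to_go else sum(rewards)
            if baseline:
                weight -= values[t]
            for i in range(ACTIONS):
                s = (1.0 if i == a else 0.0) - pis[t][i]   # softmax score
                grad[t * ACTIONS + i] += weight * s / n
    return grad

def gradient_variance(theta, n, reward_to_go, baseline, num_batches=2000, seed=0):
    """Total variance (summed over components) of the batch gradient estimate."""
    rng = random.Random(seed)
    samples = [batch_gradient(theta, n, reward_to_go, baseline, rng)
               for _ in range(num_batches)]
    total = 0.0
    for j in range(STEPS * ACTIONS):
        column = [g[j] for g in samples]
        mean = sum(column) / num_batches
        total += sum((x - mean) ** 2 for x in column) / num_batches
    return total

theta = [[0.0] * ACTIONS for _ in range(STEPS)]   # uniform policy
var_vanilla = gradient_variance(theta, 1, False, False)
var_rtg = gradient_variance(theta, 1, True, False)
var_rtg_baseline = gradient_variance(theta, 1, True, True)
var_rtg_baseline_n8 = gradient_variance(theta, 8, True, True)
```

Under these assumptions the four numbers should come out in decreasing order, mirroring the bar in the widget; the exact values depend on the reward scale and noise level chosen here.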
Where this points next
We now have an unbiased gradient and two free variance cuts, and we have named the enemy. The single most powerful weapon against variance turned out to be the baseline — and the best baseline is the value V(s), turning the weight into the advantage A = G − V. But Gt is itself a noisy, full-Monte-Carlo estimate of Q, and V has to be learned. Lesson 08 formalizes the advantage: it shows the one-step TD error δ as a low-variance (but biased) advantage estimate, the n-step returns in between, and GAE(λ) as the geometric knob that blends them — the exact bias↔variance dial this lesson has been circling. After that, lesson 09 asks the dangerous question: can we reuse the rollouts instead of throwing them away each step? — and finds the variance bomb that motivates trust regions.