Advantage functions — Actor-Critic, GAE, and the road to TRPO
Lesson 07's single best variance tool was the baseline, which turned the policy gradient into an advantage-weighted update: A = Q − V. But there is no free lunch — estimating A itself trades bias against variance, and one knob, λ, dials between the two extremes.
Where we left off: the baseline became an advantage
Lesson 07 proved the policy-gradient theorem and then sharpened it. Reward-to-go replaced the full return with future reward only; a state-dependent baseline b(s) was subtracted without introducing bias. The natural baseline is the state-value Vπ(s) (the critic from lesson 03), and subtracting it from the action-value gives the quantity the gradient really cares about:

$$A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$$
The advantage Aπ(s,a) answers exactly one question: was this action better or worse than what the policy would do on average from this state? Positive advantage pushes the action's log-probability up; negative pushes it down. This is the Actor-Critic update in one line — the actor πθ moves, the critic Vφ supplies the baseline.
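The push-up and push-down behaviour described above can be sketched for a tiny softmax policy. This is an illustrative sketch only: the three-action logits, the learning rate `lr`, and the helper names `softmax` and `advantage_step` are assumptions made for the example, not part of the lesson.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def advantage_step(logits, action, advantage, lr=0.5):
    """One gradient-ascent step on advantage * log pi(action).

    For a softmax policy, d log pi(a) / d logit_i = 1[i == a] - pi_i.
    """
    probs = softmax(logits)
    return [
        z + lr * advantage * ((1.0 if i == action else 0.0) - p)
        for i, (z, p) in enumerate(zip(logits, probs))
    ]

logits = [0.0, 0.0, 0.0]  # hypothetical uniform starting policy
before = softmax(logits)[1]
after_pos = softmax(advantage_step(logits, action=1, advantage=+2.0))[1]
after_neg = softmax(advantage_step(logits, action=1, advantage=-2.0))[1]
print(before, after_pos, after_neg)
```

With a positive advantage the probability of the chosen action rises; with a negative one it falls. The magnitude of the advantage scales the size of the move.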
The one-step TD-error is the cheapest advantage estimate
Start at the low-variance end. Suppose we trust the critic Vφ. Then the simplest estimate of how good action a in state s was is the TD-error, the same δ from value-based learning in lesson 02:

$$\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$
Why is this an advantage estimate? Because $\mathbb{E}[\,r_t + \gamma V^\pi(s_{t+1}) \mid s_t, a_t\,] = Q^\pi(s_t, a_t)$, so if the critic were exact, $\mathbb{E}[\delta_t \mid s_t, a_t] = Q^\pi(s_t, a_t) - V^\pi(s_t) = A^\pi(s_t, a_t)$. It uses exactly one real reward and then immediately defers to the critic for everything after. That makes it:
- Low variance: only one noisy reward sample enters; everything after it is summarized by the critic's prediction $V_\phi(s_{t+1})$ instead of further sampled rewards.
- Biased — it is only as correct as the critic. If Vφ is wrong (and early in training it always is), δ is systematically off, and bootstrapping propagates that error.
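As a minimal sketch of the one-step estimate, the function below computes the TD-error for every step of a short trajectory. The layout is an assumption made for the example: `values` holds one more entry than `rewards`, with the last entry being the critic's value for the state reached after the final reward (0.0 if that state is terminal). The numbers are made up.

```python
def td_errors(rewards, values, gamma=0.99):
    """Return delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) for each step t."""
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    return [
        r + gamma * values[t + 1] - values[t]
        for t, r in enumerate(rewards)
    ]

rewards = [1.0, 0.0, 2.0]       # hypothetical rewards r_0..r_2
values = [0.5, 0.4, 1.0, 0.0]   # hypothetical critic outputs V(s_0)..V(s_3), s_3 terminal
deltas = td_errors(rewards, values, gamma=0.9)
print(deltas)
```

Each entry uses a single real reward; any error in `values` flows straight into the estimate, which is the bias described above.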
n-step returns interpolate
The opposite extreme is to trust no critic at all: roll the actual trajectory out and sum the real discounted rewards (the Monte-Carlo return from lesson 03). In between sits the family of n-step advantage estimators, which use n real rewards and then bootstrap with the critic:

$$\hat{A}^{(n)}_t = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V_\phi(s_{t+n}) - V_\phi(s_t)$$
As n grows you trade away bias (fewer terms come from the approximate critic) but take on variance (more random rewards accumulate). At the limits:
| Estimator | Real rewards used | Bias | Variance |
|---|---|---|---|
| $\hat{A}^{(1)}_t = \delta_t$ (one-step TD) | 1 | high (all critic) | low |
| $\hat{A}^{(n)}_t$ | n | middling | middling |
| $\hat{A}^{(\infty)}_t$ = Monte-Carlo | all (to episode end) | none | high |
A neat identity makes the next step possible: each n-step advantage is just a discounted sum of TD-errors:

$$\hat{A}^{(n)}_t = \sum_{k=0}^{n-1} \gamma^k \delta_{t+k}$$
(The intermediate Vφ terms telescope.) Which n should you pick? Any single choice is arbitrary. The better idea is to not pick — average them all.
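The telescoping claim can be checked numerically. The sketch below computes the n-step advantage two ways, directly from rewards plus a bootstrap and as a discounted sum of TD-errors, and compares them. The trajectory values and both helper names are hypothetical, and truncating n at the end of the trajectory is a choice made for this example.

```python
def n_step_advantage(rewards, values, t, n, gamma):
    """Use n real rewards starting at step t, then bootstrap with values[t + n]."""
    n = min(n, len(rewards) - t)  # truncate at the end of the trajectory
    ret = sum(gamma ** k * rewards[t + k] for k in range(n))
    return ret + gamma ** n * values[t + n] - values[t]

def discounted_td_sum(rewards, values, t, n, gamma):
    """Same quantity written as sum_k gamma^k * delta_{t+k}."""
    n = min(n, len(rewards) - t)
    return sum(
        gamma ** k * (rewards[t + k] + gamma * values[t + k + 1] - values[t + k])
        for k in range(n)
    )

rewards = [1.0, 0.0, 2.0, -1.0]      # hypothetical rewards
values = [0.5, 0.4, 1.0, 0.3, 0.0]   # hypothetical critic values, last state terminal
for n in range(1, 5):
    direct = n_step_advantage(rewards, values, 0, n, gamma=0.9)
    via_td = discounted_td_sum(rewards, values, 0, n, gamma=0.9)
    print(n, direct, via_td)
```

The two columns agree up to floating-point rounding for every n, because each intermediate critic value is added once and subtracted once.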
GAE(λ): an exponentially-weighted blend of every n-step estimator
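Generalized Advantage Estimation (Schulman et al., 2016) takes an exponentially-weighted average of all the n-step estimators, with normalized weight $(1-\lambda)\lambda^{n-1}$ on the n-step term. Because of the telescoping identity, the whole infinite average collapses into a strikingly simple form, a geometric, γλ-discounted sum of TD-errors:

$$\hat{A}^{\mathrm{GAE}(\gamma,\lambda)}_t = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1}\,\hat{A}^{(n)}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l\,\delta_{t+l} = \delta_t + \gamma\lambda\,\hat{A}^{\mathrm{GAE}(\gamma,\lambda)}_{t+1}$$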
The right-hand recursion is what you actually code: walk the trajectory backwards, accumulating δt + γλ · (the advantage already computed for the next step). One pass, no extra storage.
The single new knob is λ ∈ [0,1], and it is the bias↔variance dial:
| λ | GAE reduces to | Bias | Variance |
|---|---|---|---|
| λ = 0 | δt — pure one-step TD | maximal (all critic) | minimal |
| 0 < λ < 1 | blend of all n-step | tunable | tunable |
| λ = 1 | Monte-Carlo advantage $\sum_{k} \gamma^k r_{t+k} - V_\phi(s_t)$ | none (unbiased) | maximal |
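A minimal sketch of the backward recursion, with a check of the two endpoints in the table. It assumes a single finite trajectory whose final state is terminal (so the infinite sum is simply cut off there) and the same hypothetical `rewards`/`values` layout as a plain list with one extra value at the end; the function name `gae` is also just for illustration.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Backward pass: A_t = delta_t + gamma * lam * A_{t+1}, with A_T = 0."""
    advantages = [0.0] * len(rewards)
    next_adv = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        next_adv = delta + gamma * lam * next_adv
        advantages[t] = next_adv
    return advantages

rewards = [1.0, 0.0, 2.0]       # hypothetical rewards
values = [0.5, 0.4, 1.0, 0.0]   # hypothetical critic values, last state terminal

adv_td = gae(rewards, values, gamma=0.9, lam=0.0)  # reduces to the one-step TD-errors
adv_mc = gae(rewards, values, gamma=0.9, lam=1.0)  # reduces to discounted return minus V(s_t)
print(adv_td)
print(adv_mc)
```

Setting `lam=0.0` drops everything but the current TD-error, while `lam=1.0` lets the critic terms telescope away so that only the real rewards and the baseline at step t remain.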
A2C with GAE — the practical recipe
Advantage Actor-Critic (A2C, the synchronous sibling of A3C from lesson 06) with GAE is the workhorse on-policy loop:
    collect a batch of trajectories with the current π_θ
    for each trajectory of length T:
        Â_T = 0                                      # nothing to accumulate past the last step
        for t = T−1 down to 0:                       # one backward pass
            δ_t = r_t + γ V_φ(s_{t+1}) − V_φ(s_t)    # take V_φ = 0 at a terminal state
            Â_t = δ_t + γλ · Â_{t+1}                 # GAE
            R_t = Â_t + V_φ(s_t)                     # value target (= bootstrapped return)
    actor loss:  − Σ log π_θ(a_t|s_t) · Â_t          # push up high-advantage actions; Â_t held constant
    critic loss:   Σ ( V_φ(s_t) − R_t )²             # fit the value to its target; R_t held constant
    (+ entropy bonus on the actor, from lesson 07)
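The recipe can be traced on one trajectory with plain numbers. The sketch below is not a training loop: it takes already-computed log-probabilities and critic values (all made up) and returns the two loss values as floats, so there is no autodiff and no entropy term. In a real framework the advantages and targets would be treated as constants when differentiating. The function name `a2c_losses` is hypothetical.

```python
def a2c_losses(log_probs, rewards, values, gamma=0.99, lam=0.95):
    """Return (actor_loss, critic_loss, advantages, targets) for one trajectory.

    log_probs[t] is log pi_theta(a_t | s_t); values has len(rewards) + 1 entries.
    """
    steps = len(rewards)
    advantages = [0.0] * steps
    running = 0.0
    for t in reversed(range(steps)):  # GAE, one backward pass
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    # Value target: advantage plus the critic's own estimate for that state.
    targets = [advantages[t] + values[t] for t in range(steps)]
    actor_loss = -sum(lp * a for lp, a in zip(log_probs, advantages))
    critic_loss = sum((values[t] - targets[t]) ** 2 for t in range(steps))
    return actor_loss, critic_loss, advantages, targets

log_probs = [-0.7, -1.2, -0.3]   # hypothetical log pi(a_t | s_t)
rewards = [1.0, 0.0, 2.0]        # hypothetical rewards
values = [0.5, 0.4, 1.0, 0.0]    # hypothetical critic values, last state terminal
actor, critic, adv, targets = a2c_losses(log_probs, rewards, values, gamma=0.9, lam=0.95)
print(actor, critic)
```

Because each target is the advantage plus the current value, the critic loss computed here equals the sum of squared advantages; the critic improves only because the target is held fixed while the value network moves toward it.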
In practice γ ≈ 0.99 and λ ≈ 0.95 are the canonical defaults: λ sits high but not at 1, accepting a little bias to cut a lot of variance. The widget below shows why that sweet spot exists.
Interactive · the λ dial on a toy trajectory
Below is a fixed short trajectory with a deliberately imperfect critic Vφ (so bias is real). For each λ from 0 to 1 we estimate the GAE advantage of the first step many times over noisy reward draws, then measure two things against the true advantage: the bias (squared, how far the average estimate is off — driven by the bad critic) and the variance (how much the estimate jitters across draws). Drag λ and watch the curves cross.
Two things to notice. At λ=0 the estimate barely jitters (low variance) but sits well off the true advantage, because the biased critic is the only thing speaking. At λ=1 the bias vanishes (the critic is no longer used for bootstrapping) but the variance balloons as eight noisy rewards pile up. The total-error curve is a valley with its floor near λ ≈ 0.95 in this toy setup, which is consistent with that value being the usual default.
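A rough offline stand-in for the experiment described above might look like the sketch below. Every number in it is an assumption made for illustration, not the widget's actual configuration: eight mean rewards, Gaussian reward noise with standard deviation 1, a true first-step advantage of 0.7, and a critic that is exact at the first state but overestimates every later non-terminal state by 1.5 (so that all bias comes from bootstrapping).

```python
import random

GAMMA = 0.99
NOISE_STD = 1.0
MEAN_REWARDS = [1.0, 0.5, 0.0, 1.0, -0.5, 0.5, 1.0, 2.0]  # hypothetical, eight steps
TRUE_ADVANTAGE = 0.7                                      # hypothetical A(s_0, a_0)

def rewards_to_go(mean_rewards, gamma):
    """Noise-free discounted reward-to-go for each step, plus 0.0 at the terminal state."""
    out = [0.0] * (len(mean_rewards) + 1)
    for t in reversed(range(len(mean_rewards))):
        out[t] = mean_rewards[t] + gamma * out[t + 1]
    return out

TRUE_V = rewards_to_go(MEAN_REWARDS, GAMMA)
# Critic: exact at s_0 (Q minus the true advantage), too high by 1.5 afterwards, 0 at terminal.
CRITIC_V = [TRUE_V[0] - TRUE_ADVANTAGE] + [v + 1.5 for v in TRUE_V[1:-1]] + [0.0]

def gae_first_step(rewards, values, gamma, lam):
    """GAE advantage of step 0 via the backward recursion."""
    adv = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        adv = delta + gamma * lam * adv
    return adv

def bias_and_variance(lam, draws=4000, seed=0):
    """Estimate (bias, variance) of the first-step GAE estimate over noisy reward draws."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(draws):
        rewards = [rng.gauss(m, NOISE_STD) for m in MEAN_REWARDS]
        estimates.append(gae_first_step(rewards, CRITIC_V, GAMMA, lam))
    mean = sum(estimates) / draws
    var = sum((e - mean) ** 2 for e in estimates) / (draws - 1)
    return mean - TRUE_ADVANTAGE, var

results = {lam: bias_and_variance(lam) for lam in (0.0, 0.5, 0.9, 0.95, 1.0)}
for lam, (bias, var) in results.items():
    print(f"lambda={lam:.2f}  bias^2={bias ** 2:.3f}  variance={var:.3f}  total={bias ** 2 + var:.3f}")
```

Where the total-error minimum lands in this sketch depends entirely on the assumed noise level and critic error, so it should not be read as reproducing the widget's curve; it only illustrates the opposing trends at the two ends of the dial.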
A great advantage estimate is still not enough
Suppose GAE hands us a near-perfect advantage. We plug it into the policy gradient and take a step. Here is the problem that the next three lessons exist to solve: the gradient tells us a direction in parameter space, but says nothing about how far to move. Plain policy gradient takes a raw step θ ← θ + α ∇θJ — and a small step in parameter space can be a huge, catastrophic change in the policy's actual behavior (the mapping from θ to πθ is wildly nonlinear). One overshoot and the policy collapses; on-policy, you cannot un-collapse it, because the data you would need to recover is generated by the broken policy.
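A deliberately extreme toy case shows how a small parameter step can become a large behavioural change. The one-parameter Bernoulli policy, the large state feature, and the step size below are all hypothetical choices picked to make the nonlinearity visible.

```python
import math

def prob_action_one(theta, feature):
    """Bernoulli policy: pi(a=1 | s) = sigmoid(theta * feature)."""
    return 1.0 / (1.0 + math.exp(-theta * feature))

feature = 50.0   # hypothetical large state feature
theta = 0.0      # current parameter: the policy is a coin flip
step = 0.1       # alpha * gradient, a small move in parameter space

p_before = prob_action_one(theta, feature)
p_after = prob_action_one(theta + step, feature)
print(p_before, p_after)
```

A change of 0.1 in θ takes the policy from an even coin flip to choosing action 1 almost every time, so the size of a step in θ says little about the size of the resulting change in πθ.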