Advantage functions — Actor-Critic, GAE, and the road to TRPO

Lesson 07's single best variance tool was the baseline, which turned the policy gradient into an advantage-weighted update: A = Q − V. But there is no free lunch — estimating A itself trades bias against variance, and one knob, λ, dials between the two extremes.

Where we left off: the baseline became an advantage

Lesson 07 proved the policy-gradient theorem and then sharpened it. Reward-to-go replaced the full return with future reward only; a state-dependent baseline b(s) was subtracted without introducing bias. The natural baseline is the state-value V^π(s) (the critic from lesson 03), and subtracting it from the action-value gives the quantity the gradient really cares about:

∇_θ J(θ) = 𝔼_π [ ∇_θ log π_θ(a|s) · A^π(s,a) ] , A^π(s,a) = Q^π(s,a) − V^π(s)

The advantage A^π(s,a) answers exactly one question: was this action better or worse than what the policy would do on average from this state? Positive advantage pushes the action's log-probability up; negative pushes it down. This is the Actor-Critic update in one line — the actor π_θ moves, the critic V_φ supplies the baseline.
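To make the sign behaviour concrete, here is a minimal sketch of one advantage-weighted step for a single state. It assumes a hypothetical tabular softmax policy over three actions, parameterized directly by its logits; the names `softmax` and `advantageStep` and all numbers are illustrative and not part of the lesson's code.

```javascript
// Softmax over a vector of logits (shifted by the max for numerical stability).
function softmax(logits) {
  const m = Math.max(...logits);
  const exps = logits.map((z) => Math.exp(z - m));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// One advantage-weighted policy-gradient step for a single state.
// For a softmax policy, d log pi(a) / d logit_i = 1{i == a} - pi_i.
function advantageStep(logits, action, advantage, stepSize) {
  const probs = softmax(logits);
  return logits.map(
    (z, i) => z + stepSize * advantage * ((i === action ? 1 : 0) - probs[i])
  );
}

// Same action, same state, opposite-signed advantage (illustrative values).
const before = softmax([0, 0, 0]);
const afterGood = softmax(advantageStep([0, 0, 0], 1, +2.0, 0.1));
const afterBad = softmax(advantageStep([0, 0, 0], 1, -2.0, 0.1));
console.log(before[1], afterGood[1], afterBad[1]);
```

With a positive advantage the probability of the chosen action rises above its starting value of 1/3; with a negative advantage it drops below.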

The catch nobody mentioned

We do not have A^π. We have to estimate it. And every way of estimating it sits somewhere on a bias↔variance spectrum: the more we lean on sampled future rewards, the higher the variance; the more we lean on a learned value function V_φ, the lower the variance but the more we inherit the critic's bias (it is only approximate, and bootstrapping compounds its errors). This lesson is about that spectrum and the one knob that slides along it.

The one-step TD-error is the cheapest advantage estimate

Start at the low-variance end. Suppose we trust the critic V_φ. Then the simplest estimate of how good action a in state s was is the TD-error — the same δ from value-based learning in lesson 02:

δ_t = r_t + γ V_φ(s_{t+1}) − V_φ(s_t)

Why is this an advantage estimate? Because 𝔼[ r_t + γ V^π(s_{t+1}) | s_t, a_t ] = Q^π(s_t, a_t), so if the critic were exact, 𝔼[ δ_t | s_t, a_t ] = Q^π(s_t, a_t) − V^π(s_t) = A^π(s_t, a_t). It uses exactly one real reward and then immediately defers to the critic for everything after. That makes it:

Low variance — only one noisy reward sample enters; the rest is the deterministic V_φ.
Biased — it is only as correct as the critic. If V_φ is wrong (and early in training it always is), δ is systematically off, and bootstrapping propagates that error.
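As a minimal sketch, the TD-error for every step of one trajectory can be computed in a single pass. The function name `tdErrors`, the array layout (one more value than rewards, the last being the value of the state after the final step) and the sample numbers are assumptions made for illustration.

```javascript
// One-step TD-errors for a whole trajectory.
// rewards[t] is r_t; values[t] is the critic's V(s_t), with one extra entry
// values[rewards.length] for the state after the last step (0 if terminal).
function tdErrors(rewards, values, gamma) {
  if (values.length !== rewards.length + 1) {
    throw new Error("values must have exactly one more entry than rewards");
  }
  return rewards.map((r, t) => r + gamma * values[t + 1] - values[t]);
}

// Illustrative numbers: three steps, terminal state valued at 0.
const deltas = tdErrors([1, 0, 2], [0.5, 1.0, 1.5, 0.0], 0.9);
console.log(deltas); // approximately [1.4, 0.35, 0.5]
```

Each entry uses exactly one reward and two critic values, which is why its noise is small and its accuracy depends entirely on the critic.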

n-step returns interpolate

The opposite extreme is to trust no critic at all: roll the actual trajectory out and sum the real discounted rewards — the Monte-Carlo return from lesson 03. In between sits the family of n-step advantage estimators: use n real rewards, then bootstrap with the critic:

Â⁽ⁿ⁾_t = ( r_t + γ r_{t+1} + ⋯ + γⁿ⁻¹ r_{t+n−1} + γⁿ V_φ(s_{t+n}) ) − V_φ(s_t)

As n grows you trade away bias (fewer terms come from the approximate critic) but take on variance (more random rewards accumulate). At the limits:

Estimator	Real rewards used	Bias	Variance
Â⁽¹⁾ = δ_t (one-step TD)	1	high (all critic)	low
Â⁽ⁿ⁾	n	middling	middling
Â^(∞) = Monte-Carlo	all (to episode end)	none	high

A neat identity makes the next step possible: each n-step advantage is just a discounted sum of TD-errors,

Â⁽ⁿ⁾_t = δ_t + γ δ_{t+1} + γ² δ_{t+2} + ⋯ + γⁿ⁻¹ δ_{t+n−1}

(The intermediate V_φ terms telescope.) Which n should you pick? Any single choice is arbitrary. The better idea is to not pick — average them all.
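The telescoping identity is easy to check numerically. The sketch below computes the n-step advantage both ways, directly from rewards plus a bootstrap and as a discounted sum of TD-errors. Function names and sample numbers are illustrative assumptions, and both functions assume t + n does not run past the end of the trajectory.

```javascript
// Direct form: n real rewards, then bootstrap with the critic, minus V(s_t).
function nStepAdvantage(rewards, values, gamma, t, n) {
  let ret = 0;
  for (let k = 0; k < n; k++) ret += Math.pow(gamma, k) * rewards[t + k];
  ret += Math.pow(gamma, n) * values[t + n];
  return ret - values[t];
}

// Same quantity written as a discounted sum of n TD-errors.
function nStepFromDeltas(rewards, values, gamma, t, n) {
  let adv = 0;
  for (let k = 0; k < n; k++) {
    const delta = rewards[t + k] + gamma * values[t + k + 1] - values[t + k];
    adv += Math.pow(gamma, k) * delta;
  }
  return adv;
}

// Illustrative trajectory: three steps, terminal state valued at 0.
const sampleRewards = [1, 0, 2];
const sampleValues = [0.5, 1.0, 1.5, 0.0];
for (let n = 1; n <= 3; n++) {
  console.log(
    n,
    nStepAdvantage(sampleRewards, sampleValues, 0.9, 0, n),
    nStepFromDeltas(sampleRewards, sampleValues, 0.9, 0, n)
  );
}
```

The two columns agree up to floating-point rounding for every n, because each intermediate γ^k V_φ(s_{t+k}) is added by one TD-error and subtracted by the next.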

GAE(λ): an exponentially-weighted blend of every n-step estimator

Generalized Advantage Estimation (Schulman et al., 2016) takes an exponentially-weighted average of all the n-step estimators, with normalized weight (1 − λ)λⁿ⁻¹ on the n-step term. Because of the telescoping identity, the whole infinite average collapses into a strikingly simple form, a geometric, γλ-discounted sum of TD-errors:

Â^GAE(γ,λ)_t = ∑_{k=0}^{∞} (γλ)^k δ_{t+k} = δ_t + γλ · Â^GAE(γ,λ)_{t+1}

The right-hand recursion is what you actually code: start from Â = 0 just past the final step, then walk the trajectory backwards, accumulating δ_t + γλ · (the advantage already computed for the next step). One pass, and the only state carried between steps is that running advantage.
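A minimal sketch of that backward pass, assuming a single trajectory with no episode boundary in the middle. The function name `gaeAdvantages` and the array layout (values has one extra entry for the state after the last step) are illustrative choices, not a fixed API.

```javascript
// GAE(gamma, lambda) advantages for one trajectory, computed backwards.
// rewards[t] is r_t; values[t] is V(s_t); values[rewards.length] is the value
// of the state after the last step (0 if that state is terminal).
function gaeAdvantages(rewards, values, gamma, lam) {
  const T = rewards.length;
  const adv = new Array(T).fill(0);
  let next = 0; // advantage "after the end" is defined as 0
  for (let t = T - 1; t >= 0; t--) {
    const delta = rewards[t] + gamma * values[t + 1] - values[t]; // TD-error
    next = delta + gamma * lam * next;                            // recursion
    adv[t] = next;
  }
  return adv;
}

// Illustrative numbers; lam = 0 and lam = 1 recover the two extremes.
const r = [1, 0, 2];
const v = [0.5, 1.0, 1.5, 0.0];
console.log(gaeAdvantages(r, v, 0.9, 0));    // the one-step TD-errors
console.log(gaeAdvantages(r, v, 0.9, 1));    // discounted return minus V(s_t)
console.log(gaeAdvantages(r, v, 0.9, 0.95)); // something in between
```

Real rollouts usually also carry a per-step "done" flag that resets the running advantage at episode boundaries; that detail is left out here to keep the recursion visible.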

The single new knob is λ ∈ [0,1], and it is the bias↔variance dial:

λ	GAE reduces to	Bias	Variance
λ = 0	δ_t — pure one-step TD	maximal (all critic)	minimal
0 < λ < 1	blend of all n-step	tunable	tunable
λ = 1	Monte-Carlo advantage ∑_k γ^k r_{t+k} − V_φ(s_t)	none (unbiased)	maximal
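For readers who want the missing algebra, the collapse of the weighted average and the two limiting rows of the table can be written out as follows. This is a sketch that assumes 0 ≤ λ < 1 for the geometric series, normalized weights (1 − λ)λ^{n−1}, and a trajectory that either terminates or has bounded values so the bootstrap term vanishes in the λ = 1 limit.

```latex
\begin{align*}
\hat{A}^{\mathrm{GAE}(\gamma,\lambda)}_t
  &= (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}\,\hat{A}^{(n)}_t
   = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}\sum_{k=0}^{n-1}\gamma^{k}\,\delta_{t+k} \\
  &= (1-\lambda)\sum_{k=0}^{\infty}\gamma^{k}\,\delta_{t+k}\sum_{n=k+1}^{\infty}\lambda^{n-1}
   = (1-\lambda)\sum_{k=0}^{\infty}\gamma^{k}\,\delta_{t+k}\,\frac{\lambda^{k}}{1-\lambda} \\
  &= \sum_{k=0}^{\infty}(\gamma\lambda)^{k}\,\delta_{t+k} \\[4pt]
\lambda = 0:\quad \hat{A}_t &= \delta_t \\
\lambda = 1:\quad \hat{A}_t &= \sum_{k=0}^{\infty}\gamma^{k}\,\delta_{t+k}
   = \sum_{k=0}^{\infty}\gamma^{k}\,r_{t+k} - V_\phi(s_t)
\end{align*}
```

The second line swaps the order of summation: the TD-error δ_{t+k} appears in every n-step estimator with n ≥ k + 1, and the weights of those estimators sum to λ^k.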

Yes, it is the same λ as TD(λ)

The exponentially-weighted blend of n-step estimators is precisely the eligibility-trace idea behind TD(λ). GAE is TD(λ) applied to the advantage rather than the value. λ=0 is the one-step bootstrap, λ=1 is the full Monte-Carlo return — and everything in between trades bias for variance, exactly as it does in TD(λ). Two discount-like factors are at play: γ discounts the future (a property of the problem), while λ discounts how far we trust real rewards before deferring to the critic (a property of the estimator).
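One rough way to compare the two factors is the total geometric weight each places on future terms, sometimes read as an effective horizon. This is a heuristic rather than a formal property of GAE, and the values γ = 0.99 and λ = 0.95 are used purely as an example.

```latex
\sum_{k=0}^{\infty}\gamma^{k} = \frac{1}{1-\gamma} = \frac{1}{1-0.99} = 100,
\qquad
\sum_{k=0}^{\infty}(\gamma\lambda)^{k} = \frac{1}{1-\gamma\lambda} = \frac{1}{1-0.9405} \approx 16.8
```

Under this reading the problem cares about roughly a hundred steps of future reward, while the estimator gives meaningful weight to only about seventeen TD-errors before the critic takes over.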

A2C with GAE — the practical recipe

Advantage Actor-Critic (A2C, the synchronous sibling of A3C from lesson 06) with GAE is the workhorse on-policy loop:

collect a batch of trajectories with the current π_θ
for each trajectory (backwards, starting from Â_T = 0 past the last step):
    δ_t   = r_t + γ V_φ(s_{t+1}) − V_φ(s_t)
    Â_t   = δ_t + γλ · Â_{t+1}          # GAE, one backward pass
    R_t   = Â_t + V_φ(s_t)              # value target (= bootstrapped return)
actor  loss:  − Σ log π_θ(a_t|s_t) · Â_t      # push up high-advantage actions
critic loss:  Σ ( V_φ(s_t) − R_t )²           # fit the value to its target
(+ entropy bonus on the actor, from lesson 07)
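The recipe above can be sketched as plain JavaScript that turns one trajectory into the two scalar losses. This is only the bookkeeping: the function name `a2cLosses`, the input layout and the sample numbers are assumptions, there is no autodiff here, and the entropy bonus is omitted. In a real training loop Â_t and R_t would be treated as constants when differentiating.

```javascript
// Advantages, value targets and the two A2C losses for one trajectory.
// rewards[t] = r_t, values[t] = V(s_t) with one extra bootstrap entry at the end,
// logProbs[t] = log pi(a_t | s_t) under the current policy.
function a2cLosses(rewards, values, logProbs, gamma, lam) {
  const T = rewards.length;
  const advantages = new Array(T);
  const returns = new Array(T);
  let next = 0; // running GAE advantage, 0 past the last step
  for (let t = T - 1; t >= 0; t--) {
    const delta = rewards[t] + gamma * values[t + 1] - values[t];
    next = delta + gamma * lam * next;
    advantages[t] = next;
    returns[t] = next + values[t]; // value target R_t
  }
  let actorLoss = 0;
  let criticLoss = 0;
  for (let t = 0; t < T; t++) {
    actorLoss -= logProbs[t] * advantages[t];      // push up high-advantage actions
    criticLoss += (values[t] - returns[t]) ** 2;   // fit the value to its target
  }
  return { advantages, returns, actorLoss, criticLoss };
}

// Illustrative inputs only.
const out = a2cLosses([1, 0, 2], [0.5, 1.0, 1.5, 0.0], [-0.5, -1.0, -0.2], 0.9, 0.95);
console.log(out.actorLoss, out.criticLoss);
```

Many implementations additionally normalize the advantages within a batch and average rather than sum the losses; those are common conventions, not requirements of the recipe.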

In practice γ ≈ 0.99 and λ ≈ 0.95 are the canonical defaults: λ sits high but not at 1, accepting a little bias to cut a lot of variance. The widget below shows why that sweet spot exists.

Interactive · the λ dial on a toy trajectory

Below is a fixed short trajectory with a deliberately imperfect critic V_φ (so bias is real). For each λ from 0 to 1 we estimate the GAE advantage of the first step many times over noisy reward draws, then measure two things against the true advantage: the bias (squared, how far the average estimate is off — driven by the bad critic) and the variance (how much the estimate jitters across draws). Drag λ and watch the curves cross.

GAE(λ): bias falls, variance rises — the sweet spot lives in the middle

λ = 0 is pure one-step TD (low variance, but the biased critic dominates). λ = 1 is Monte-Carlo (unbiased, but the rewards' noise dominates). Total error (bias² + variance) is minimized near λ ≈ 0.95. Drag the slider; the white marker tracks your λ on the curves.

[Interactive widget: a λ slider (shown at 0.95) with reward noise σ = 0.15, and live readouts for Bias², Variance, and Total error.]

The JS that runs this widget (≈25 lines):

// Toy trajectory: T steps, fixed rewards, a critic V that is WRONG on the
// later states (so leaning on it = bias). V(s0) is correct ⇒ λ=1 is unbiased.
const gamma = 0.99, T = 30;
const Vhat  = Vtrue.slice();
for (let i = 25; i < T - 1; i++) Vhat[i] += 3.5 * (i % 2 ? -1 : 1);  // bad critic

// GAE advantage of step 0 from one noisy reward draw, given λ and the critic V.
function gae(lam, V, sigma) {
  let A = 0;
  for (let t = T - 2; t >= 0; t--) {
    const r     = rewards[t] + sigma * randn();      // noisy reward
    const delta = r + gamma * V[t + 1] - V[t];       // TD-error δ_t
    A = delta + gamma * lam * A;                      // GAE recursion (backwards)
  }
  return A;                                           // advantage of step 0
}
// Fixed unbiased target = MC return − exact V(s0). Many draws → bias & variance.
const est  = Array.from({length: 4000}, () => gae(lam, Vhat, sigma));
const mean = avg(est);
const bias2 = (mean - trueAdvantage)**2, variance = varOf(est);

Two things to notice. At λ=0 the estimate barely jitters (low variance) but sits well off the true advantage, because the biased critic is the only thing speaking. At λ=1 the bias vanishes (the critic is no longer trusted) but the variance balloons as every noisy reward along the trajectory piles up. The total-error curve is a valley with its floor near λ ≈ 0.95 on this toy problem, which is consistent with that value being the usual default.
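Because the GAE estimate is linear in the rewards, the same bias and variance can be computed exactly instead of sampled, provided the reward noise is independent across steps with standard deviation σ. The sketch below does that on a separate, made-up toy: nine steps with mean reward 1, and a critic that is exact at s_0 and at the terminal state but off by ±1 (alternating) everywhere else. None of these numbers come from the widget, so the floor of the valley need not land at 0.95 here.

```javascript
// GAE advantage of step 0 via the backward recursion.
function gaeFirstStep(rewards, values, gamma, lam) {
  let adv = 0;
  for (let t = rewards.length - 1; t >= 0; t--) {
    const delta = rewards[t] + gamma * values[t + 1] - values[t];
    adv = delta + gamma * lam * adv;
  }
  return adv;
}

// Hypothetical toy setup (all numbers are illustrative assumptions).
const gamma = 0.99;
const sigma = 0.15;
const meanRewards = new Array(9).fill(1.0);
// Exact values by backward recursion; the terminal value is 0.
const vTrue = new Array(meanRewards.length + 1).fill(0);
for (let t = meanRewards.length - 1; t >= 0; t--) {
  vTrue[t] = meanRewards[t] + gamma * vTrue[t + 1];
}
// Imperfect critic: exact at s_0 and at the terminal state, off by +/-1 elsewhere.
const last = vTrue.length - 1;
const vHat = vTrue.map((x, i) => (i === 0 || i === last ? x : x + (i % 2 ? -1 : 1)));

// Bias: the mean estimate is just GAE run on the mean rewards (linearity).
function biasSquared(lam) {
  const trueAdv = gaeFirstStep(meanRewards, vTrue, gamma, 1);
  const meanEstimate = gaeFirstStep(meanRewards, vHat, gamma, lam);
  return (meanEstimate - trueAdv) ** 2;
}

// Variance: reward k enters with weight (gamma*lam)^k, so the variances add up.
function variance(lam) {
  let sum = 0;
  for (let k = 0; k < meanRewards.length; k++) sum += Math.pow(gamma * lam, 2 * k);
  return sigma * sigma * sum;
}

for (const lam of [0, 0.5, 0.8, 0.95, 1]) {
  const b2 = biasSquared(lam);
  const va = variance(lam);
  console.log(lam, b2.toFixed(4), va.toFixed(4), (b2 + va).toFixed(4));
}
```

On this toy the bias term is largest at λ = 0 and vanishes at λ = 1, while the variance term grows steadily with λ, so the total error is lower at an intermediate λ than at either end.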

A great advantage estimate is still not enough

Suppose GAE hands us a near-perfect advantage. We plug it into the policy gradient and take a step. Here is the problem that the next three lessons exist to solve: the gradient tells us a direction in parameter space, but says nothing about how far to move. Plain policy gradient takes a raw step θ ← θ + α ∇_θJ — and a small step in parameter space can be a huge, catastrophic change in the policy's actual behavior (the mapping from θ to π_θ is wildly nonlinear). One overshoot and the policy collapses; on-policy, you cannot un-collapse it, because the data you would need to recover is generated by the broken policy.

The villain of lessons 09–10

Accurate gradient direction + uncontrolled step size = a policy that can destroy itself in a single update. We need to bound how much the policy is allowed to change per step, measured in policy space (not parameter space). That bounded region is a trust region — the heart of TRPO (lesson 10). And to take a big step safely we will first need to reuse data from the old policy, which means reweighting it: importance sampling (lesson 09).
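A small worked example of the gap between parameter space and policy space, under loud assumptions: a hypothetical one-dimensional Gaussian policy whose only parameter is its mean, with a fixed standard deviation, and KL divergence used as one common way to measure how far the policy moved. For two Gaussians with the same standard deviation σ whose means differ by Δμ, the KL divergence is Δμ² / (2σ²).

```javascript
// KL( N(mu, sigma^2) || N(mu + dMu, sigma^2) ) = dMu^2 / (2 * sigma^2)
function klSameStdGaussians(dMu, sigma) {
  return (dMu * dMu) / (2 * sigma * sigma);
}

// The same parameter step of 0.1 applied to a wide and a narrow policy.
const step = 0.1;
const klWide = klSameStdGaussians(step, 1.0);    // about 0.005
const klNarrow = klSameStdGaussians(step, 0.01); // about 50
console.log(klWide, klNarrow, klNarrow / klWide);
```

The identical step in θ changes the narrow policy about ten thousand times more than the wide one, which is the sense in which a fixed learning rate cannot by itself bound the change in behavior.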

Takeaway

The advantage A = Q − V is the right thing to weight the policy gradient by, but we can only estimate it, and every estimate trades bias for variance. The TD-error δ is the cheap, biased, low-variance end; the Monte-Carlo return is the expensive, unbiased, high-variance end; GAE(λ) is the exponentially-weighted blend of all the n-step estimators in between, with λ — the same λ as TD(λ) — as the bias↔variance dial (sweet spot ≈ 0.95). But even a perfect advantage leaves the step-size problem unsolved: plain PG can change the policy catastrophically, which is why we need a trust region.