
Advantage functions — Actor-Critic, GAE, and the road to TRPO

Lesson 07's single best variance-reduction tool was the baseline, which turned the policy gradient into an advantage-weighted update: A = Q − V. But there is no free lunch: estimating A itself trades bias against variance, and one knob, λ, dials between the two extremes.

Where we left off: the baseline became an advantage

Lesson 07 proved the policy-gradient theorem and then sharpened it. Reward-to-go replaced the full return with future reward only; a state-dependent baseline b(s) was subtracted without introducing bias. The natural baseline is the state-value Vπ(s) (the critic from lesson 03), and subtracting it from the action-value gives the quantity the gradient really cares about:

∇_θ J(θ) = 𝔼_π [ ∇_θ log π_θ(a|s) · A^π(s,a) ] ,    A^π(s,a) = Q^π(s,a) − V^π(s)

The advantage Aπ(s,a) answers exactly one question: was this action better or worse than what the policy would do on average from this state? Positive advantage pushes the action's log-probability up; negative pushes it down. This is the Actor-Critic update in one line — the actor πθ moves, the critic Vφ supplies the baseline.

The catch nobody mentioned
We do not have Aπ. We have to estimate it. And every way of estimating it sits somewhere on a bias↔variance spectrum: the more we lean on sampled future rewards, the higher the variance; the more we lean on a learned value function Vφ, the lower the variance but the more we inherit the critic's bias (it is only approximate, and bootstrapping compounds its errors). This lesson is about that spectrum and the one knob that slides along it.

The one-step TD-error is the cheapest advantage estimate

Start at the low-variance end. Suppose we trust the critic Vφ. Then the simplest estimate of how good action a in state s was is the TD-error — the same δ from value-based learning in lesson 02:

δ_t = r_t + γ V_φ(s_{t+1}) − V_φ(s_t)

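Why is this an advantage estimate? Because 𝔼[ r_t + γ V^π(s_{t+1}) | s_t, a_t ] = Q^π(s_t, a_t), so if the critic were exact, 𝔼[ δ_t | s_t, a_t ] = Q^π(s_t, a_t) − V^π(s_t) = A^π(s_t, a_t). It uses exactly one real reward and then immediately defers to the critic for everything after. That makes it the low-variance end of the spectrum (only one random reward enters) and also the most biased one, since any error in V_φ passes straight into the estimate.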

n-step returns interpolate

The opposite extreme is to trust no critic at all: roll the actual trajectory out and sum the real discounted rewards — the Monte-Carlo return from lesson 03. In between sits the family of n-step advantage estimators: use n real rewards, then bootstrap with the critic:

Â_t^(n) = ( r_t + γ r_{t+1} + ⋯ + γ^{n−1} r_{t+n−1} + γ^n V_φ(s_{t+n}) ) − V_φ(s_t)

As n grows you trade away bias (fewer terms come from the approximate critic) but take on variance (more random rewards accumulate). At the limits:

| Estimator | Real rewards used | Bias | Variance |
|---|---|---|---|
| Â^(1) = δ_t (one-step TD) | 1 | high (all critic) | low |
| Â^(n) | n | middling | middling |
| Â^(∞) = Monte-Carlo | all (to episode end) | none | high |

A neat identity makes the next step possible: each n-step advantage is just a discounted sum of TD-errors,

Â_t^(n) = δ_t + γ δ_{t+1} + γ^2 δ_{t+2} + ⋯ + γ^{n−1} δ_{t+n−1}

(The intermediate Vφ terms telescope.) Which n should you pick? Any single choice is arbitrary. The better idea is to not pick — average them all.
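The telescoping identity is easy to check numerically. The sketch below computes the n-step advantage twice, once from its definition and once as a discounted sum of TD-errors, and prints both. The rewards, critic values, and γ are made-up illustrative numbers, and the function names are hypothetical, not part of any library.

```javascript
// Hypothetical toy data (not from the lesson): rewards r_0..r_4 and critic
// values V(s_0)..V(s_5) along one trajectory.
const gamma = 0.9;
const rewards = [1.0, 0.0, 2.0, -1.0, 0.5];
const values = [0.8, 1.1, 0.3, 1.5, -0.2, 0.4];

// One-step TD-error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
function tdError(t) {
  return rewards[t] + gamma * values[t + 1] - values[t];
}

// n-step advantage from its definition: n real rewards, then bootstrap.
function nStepDirect(t, n) {
  let ret = 0;
  for (let k = 0; k < n; k++) ret += Math.pow(gamma, k) * rewards[t + k];
  ret += Math.pow(gamma, n) * values[t + n];
  return ret - values[t];
}

// The same quantity written as a discounted sum of TD-errors.
function nStepFromTd(t, n) {
  let adv = 0;
  for (let k = 0; k < n; k++) adv += Math.pow(gamma, k) * tdError(t + k);
  return adv;
}

// Print both versions for every n that fits in the trajectory.
for (let n = 1; n <= rewards.length; n++) {
  console.log(n, nStepDirect(0, n).toFixed(6), nStepFromTd(0, n).toFixed(6));
}
```

The two columns should agree up to floating-point rounding for every n, whatever the critic values are, because the identity is purely algebraic and does not require the critic to be accurate.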

GAE(λ): an exponentially-weighted blend of every n-step estimator

Generalized Advantage Estimation (Schulman et al., 2016) takes an exponentially-weighted average of all the n-step estimators, with weight (1 − λ) λ^{n−1} on the n-step term (the factor 1 − λ makes the weights sum to one for λ < 1). Because of the telescoping identity, the whole infinite average collapses into a strikingly simple form, a geometric, γλ-discounted sum of TD-errors:

Â_t^{GAE(γ,λ)} = ∑_{k=0}^{∞} (γλ)^k δ_{t+k} = δ_t + γλ · Â_{t+1}^{GAE(γ,λ)}

The right-hand recursion is what you actually code: walk the trajectory backwards, accumulating δt + γλ · (the advantage already computed for the next step). One pass, no extra storage.
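A minimal sketch of that backward pass, next to the explicit forward sum it is meant to equal. The trajectory numbers are invented for illustration, the last state is assumed terminal (so its value is 0 and the sum is truncated there), and `gaeBackward` / `gaeForward` are hypothetical names.

```javascript
// Hypothetical toy trajectory (illustrative numbers only).
const gamma = 0.99;
const rewards = [1.0, 0.0, 2.0, -1.0, 0.5];      // r_0 .. r_{T-1}
const values = [0.8, 1.1, 0.3, 1.5, -0.2, 0.0];  // V(s_0) .. V(s_T); s_T assumed terminal

// Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}, with A_T = 0.
function gaeBackward(rewards, values, gamma, lam) {
  const adv = new Array(rewards.length).fill(0);
  let next = 0; // advantage of the step after the current one
  for (let t = rewards.length - 1; t >= 0; t--) {
    const delta = rewards[t] + gamma * values[t + 1] - values[t];
    next = delta + gamma * lam * next;
    adv[t] = next;
  }
  return adv;
}

// Forward definition, truncated at the end of the trajectory:
// A_t = sum_k (gamma * lam)^k * delta_{t+k}
function gaeForward(rewards, values, gamma, lam) {
  const deltas = rewards.map((r, t) => r + gamma * values[t + 1] - values[t]);
  return deltas.map((_, t) => {
    let sum = 0;
    for (let k = 0; t + k < deltas.length; k++) {
      sum += Math.pow(gamma * lam, k) * deltas[t + k];
    }
    return sum;
  });
}

console.log(gaeBackward(rewards, values, gamma, 0.95));
console.log(gaeForward(rewards, values, gamma, 0.95));
```

The backward version does O(T) work per trajectory, while the forward sum is O(T²); both should print the same numbers up to rounding. Setting `lam` to 0 returns the raw TD-errors, and setting it to 1 returns the discounted return minus V(s_t).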

The single new knob is λ ∈ [0,1], and it is the bias↔variance dial:

| λ | GAE reduces to | Bias | Variance |
|---|---|---|---|
| λ = 0 | δ_t, pure one-step TD | maximal (all critic) | minimal |
| 0 < λ < 1 | blend of all n-step estimators | tunable | tunable |
| λ = 1 | Monte-Carlo advantage ∑_k γ^k r_{t+k} − V_φ(s_t) | none (unbiased) | maximal |

Yes, it is the same λ as TD(λ)
The exponentially-weighted blend of n-step estimators is precisely the eligibility-trace idea behind TD(λ). GAE is TD(λ) applied to the advantage rather than the value. λ=0 is the one-step bootstrap, λ=1 is the full Monte-Carlo return — and everything in between trades bias for variance, exactly as it does in TD(λ). Two discount-like factors are at play: γ discounts the future (a property of the problem), while λ discounts how far we trust real rewards before deferring to the critic (a property of the estimator).
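The collapse of the exponentially-weighted blend into a γλ-discounted sum of TD-errors can be checked in a few lines. This is a sketch under stated assumptions: 0 ≤ λ < 1, an infinite horizon (no truncation), and weights normalized by (1 − λ) so that they sum to one. It uses the n-step identity Â_t^(n) = ∑_{k=0}^{n−1} γ^k δ_{t+k} and swaps the order of summation.

```latex
\begin{aligned}
\hat{A}^{\mathrm{GAE}(\gamma,\lambda)}_t
  &= (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} \hat{A}^{(n)}_t
   = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} \sum_{k=0}^{n-1} \gamma^{k} \delta_{t+k} \\
  &= (1-\lambda) \sum_{k=0}^{\infty} \gamma^{k} \delta_{t+k} \sum_{n=k+1}^{\infty} \lambda^{n-1}
   = (1-\lambda) \sum_{k=0}^{\infty} \gamma^{k} \delta_{t+k} \, \frac{\lambda^{k}}{1-\lambda} \\
  &= \sum_{k=0}^{\infty} (\gamma\lambda)^{k} \, \delta_{t+k}
\end{aligned}
```

Each TD-error δ_{t+k} appears in every n-step estimator with n > k, which is why its total weight is the geometric tail λ^k / (1 − λ).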

A2C with GAE — the practical recipe

Advantage Actor-Critic (A2C, the synchronous sibling of A3C from lesson 06) with GAE is the workhorse on-policy loop:

collect a batch of trajectories with the current π_θ
for each trajectory (backwards, starting from Â_T = 0):
    δ_t   = r_t + γ V_φ(s_{t+1}) − V_φ(s_t)     # use V_φ(s_T) = 0 if s_T is terminal
    Â_t   = δ_t + γλ · Â_{t+1}          # GAE, one backward pass
    R_t   = Â_t + V_φ(s_t)              # value target (= bootstrapped λ-return)
actor  loss:  − Σ log π_θ(a_t|s_t) · Â_t      # push up high-advantage actions; Â_t is held constant
critic loss:  Σ ( V_φ(s_t) − R_t )²           # fit the value to its target; R_t is held constant
(+ entropy bonus on the actor, from lesson 07)
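The recipe can be sketched as a forward computation of the two loss values for one trajectory. This is only an illustration: plain JavaScript has no automatic differentiation, so a real training loop would hand these expressions to an autodiff framework; the numbers and the `a2cLosses` name are hypothetical, and the entropy bonus is left out.

```javascript
// Hypothetical single trajectory (illustrative numbers only).
const gamma = 0.99, lam = 0.95;
const rewards = [1.0, 0.0, 2.0];       // r_0 .. r_{T-1}
const values = [0.5, 0.7, 1.2, 0.0];   // V_phi(s_0) .. V_phi(s_T); last entry is the bootstrap value
const logProbs = [-0.7, -1.2, -0.4];   // log pi_theta(a_t | s_t) of the actions taken

function a2cLosses(rewards, values, logProbs, gamma, lam) {
  const T = rewards.length;
  const adv = new Array(T).fill(0);
  const targets = new Array(T).fill(0);
  let next = 0; // GAE advantage of the following step, 0 past the end
  for (let t = T - 1; t >= 0; t--) {
    const delta = rewards[t] + gamma * values[t + 1] - values[t];
    next = delta + gamma * lam * next;
    adv[t] = next;
    targets[t] = adv[t] + values[t]; // value target R_t
  }
  let actorLoss = 0, criticLoss = 0;
  for (let t = 0; t < T; t++) {
    actorLoss -= logProbs[t] * adv[t];                 // advantages act as constants here
    criticLoss += Math.pow(values[t] - targets[t], 2); // squared error against the fixed target
  }
  return { adv, targets, actorLoss, criticLoss };
}

console.log(a2cLosses(rewards, values, logProbs, gamma, lam));
```

Because R_t = Â_t + V_φ(s_t), the critic's squared error at each step is just Â_t² when evaluated with the same critic that produced the targets; the loss is still useful because R_t is treated as a fixed target while V_φ is updated.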

In practice γ ≈ 0.99 and λ ≈ 0.95 are the canonical defaults: λ sits high but not at 1, accepting a little bias to cut a lot of variance. The widget below shows why that sweet spot exists.

Interactive · the λ dial on a toy trajectory

Below is a fixed short trajectory with a deliberately imperfect critic Vφ (so bias is real). For each λ from 0 to 1 we estimate the GAE advantage of the first step many times over noisy reward draws, then measure two things against the true advantage: the bias (squared, how far the average estimate is off — driven by the bad critic) and the variance (how much the estimate jitters across draws). Drag λ and watch the curves cross.

GAE(λ): bias falls, variance rises — the sweet spot lives in the middle
λ = 0 is pure one-step TD (low variance, but the biased critic dominates). λ = 1 is Monte-Carlo (unbiased, but the rewards' noise dominates). Total error (bias² + variance) is minimized near λ ≈ 0.95. Drag the slider; the white marker tracks your λ on the curves.
[Interactive plot: Bias², Variance, and Total error as functions of λ; slider shown at λ = 0.95]
The JS that runs this widget (≈25 lines):
// Excerpt: Vtrue, rewards, randn, avg, varOf, trueAdvantage, lam and sigma are
// assumed to be defined elsewhere in the widget and are not shown here.
// Toy trajectory: T states, fixed rewards, a critic V that is WRONG on every
// state after s0 (so leaning on it = bias). V(s0) is correct ⇒ λ=1 is unbiased.
// The error has to include V[1]: at λ=0 the estimate reads only V[0] and V[1].
const gamma = 0.99, T = 30;
const Vhat  = Vtrue.slice();
for (let i = 1; i < T - 1; i++) Vhat[i] += 3.5 * (i % 2 ? -1 : 1);  // bad critic

// GAE advantage of step 0 from one noisy reward draw, given λ and the critic V.
function gae(lam, V, sigma) {
  let A = 0;
  for (let t = T - 2; t >= 0; t--) {
    const r     = rewards[t] + sigma * randn();      // noisy reward
    const delta = r + gamma * V[t + 1] - V[t];       // TD-error δ_t
    A = delta + gamma * lam * A;                      // GAE recursion (backwards)
  }
  return A;                                           // advantage of step 0
}
// Fixed unbiased target = MC return − exact V(s0). Many draws → bias & variance.
const est  = Array.from({length: 4000}, () => gae(lam, Vhat, sigma));
const mean = avg(est);
const bias2 = (mean - trueAdvantage)**2, variance = varOf(est);

Two things to notice. At λ=0 the estimate barely jitters (low variance) but sits well off the true advantage, because the biased critic is the only thing speaking. At λ=1 the bias vanishes (the critic is no longer trusted) but the variance balloons as all T − 1 = 29 noisy rewards pile up. The total-error curve is a valley with its floor near λ ≈ 0.95, in line with the usual default; where exactly the floor sits depends on how noisy the rewards are and how wrong the critic is.
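For readers who want to reproduce the experiment outside the page, here is a self-contained sketch of the same idea. Everything numeric in it is an assumption chosen for illustration rather than the widget's actual setup: a 10-state trajectory, mean reward 1 per step, reward noise σ = 1, a critic that is off by +2 on every non-terminal state after s0, and a small seeded random generator so runs are repeatable.

```javascript
// Seeded PRNG (mulberry32) so repeated runs give the same numbers.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}
const rand = mulberry32(12345);

// Standard normal draw via Box-Muller; 1 - u keeps the log argument above zero.
function randn() {
  const u1 = 1 - rand(), u2 = rand();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// ASSUMED toy setup (not the widget's actual values).
const gamma = 0.99, T = 10, sigma = 1.0;
const rewards = new Array(T - 1).fill(1.0); // mean reward of each step

// Exact values under the mean rewards; the last state is terminal (V = 0).
const Vtrue = new Array(T).fill(0);
for (let t = T - 2; t >= 0; t--) Vtrue[t] = rewards[t] + gamma * Vtrue[t + 1];

// Imperfect critic: exact at s0 and at the terminal state, off by +2 in between.
const Vhat = Vtrue.map((v, i) => (i >= 1 && i < T - 1 ? v + 2.0 : v));

// With exact values and mean rewards every TD-error is zero, so the true
// advantage of step 0 in this setup is 0.
const trueAdvantage = 0;

// GAE advantage of step 0 from one noisy reward draw.
function gaeStep0(lam) {
  let A = 0;
  for (let t = T - 2; t >= 0; t--) {
    const r = rewards[t] + sigma * randn();
    const delta = r + gamma * Vhat[t + 1] - Vhat[t];
    A = delta + gamma * lam * A;
  }
  return A;
}

// Squared bias and sample variance of the estimate over many draws.
function biasVariance(lam, nDraws) {
  const est = Array.from({ length: nDraws }, () => gaeStep0(lam));
  const mean = est.reduce((s, x) => s + x, 0) / nDraws;
  const variance = est.reduce((s, x) => s + (x - mean) ** 2, 0) / (nDraws - 1);
  return { bias2: (mean - trueAdvantage) ** 2, variance };
}

for (const lam of [0, 0.25, 0.5, 0.75, 0.95, 1]) {
  const { bias2, variance } = biasVariance(lam, 4000);
  console.log(`lambda=${lam} bias2=${bias2.toFixed(3)} variance=${variance.toFixed(3)} total=${(bias2 + variance).toFixed(3)}`);
}
```

In this setup the squared bias should shrink and the variance should grow as λ moves from 0 to 1. Changing `sigma` or the size of the critic error moves the λ that minimizes the total, which is a useful reminder that the best λ is a property of the problem and the critic, not a universal constant.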

A great advantage estimate is still not enough

Suppose GAE hands us a near-perfect advantage. We plug it into the policy gradient and take a step. Here is the problem that the next three lessons exist to solve: the gradient tells us a direction in parameter space, but says nothing about how far to move. Plain policy gradient takes a raw step θ ← θ + α ∇θJ — and a small step in parameter space can be a huge, catastrophic change in the policy's actual behavior (the mapping from θ to πθ is wildly nonlinear). One overshoot and the policy collapses; on-policy, you cannot un-collapse it, because the data you would need to recover is generated by the broken policy.

The villain of lessons 09–10
Accurate gradient direction + uncontrolled step size = a policy that can destroy itself in a single update. We need to bound how much the policy is allowed to change per step, measured in policy space (not parameter space). That bounded region is a trust region — the heart of TRPO (lesson 10). And to take a big step safely we will first need to reuse data from the old policy, which means reweighting it: importance sampling (lesson 09).
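A tiny numeric sketch makes the step-size problem concrete. It assumes a single state with three actions, a softmax policy whose parameters are the logits, made-up advantages, and KL divergence as the measure of change in policy space. All names and numbers are hypothetical.

```javascript
// Softmax over logits, shifted by the max for numerical stability.
function softmax(logits) {
  const m = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - m));
  const z = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / z);
}

// KL(p || q) for two discrete distributions on the same support.
function kl(p, q) {
  return p.reduce((s, pi, i) => s + (pi > 0 ? pi * Math.log(pi / q[i]) : 0), 0);
}

// Assumed setup: uniform starting policy and illustrative advantages.
const theta = [0, 0, 0];
const advantages = [1.0, -0.5, -0.5];

// Exact policy gradient w.r.t. the logits for a one-state softmax policy:
// g_i = pi_i * (A_i - sum_j pi_j * A_j)
function policyGradient(theta, advantages) {
  const pi = softmax(theta);
  const meanAdv = pi.reduce((s, p, j) => s + p * advantages[j], 0);
  return pi.map((p, i) => p * (advantages[i] - meanAdv));
}

// Same gradient direction, different step sizes: how far does the policy move?
const grad = policyGradient(theta, advantages);
const piOld = softmax(theta);
const stepSizes = [0.1, 1, 10, 100];
const klPerStep = stepSizes.map((alpha) => {
  const piNew = softmax(theta.map((th, i) => th + alpha * grad[i]));
  return kl(piOld, piNew);
});

stepSizes.forEach((alpha, i) => console.log(`alpha=${alpha} KL(old||new)=${klPerStep[i]}`));
```

The direction is identical in all four cases; only the step size differs. The smallest step barely changes the policy, while the largest drives it to be almost deterministic, at which point the other actions are essentially never sampled again.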

Takeaway
The advantage A = Q − V is the right thing to weight the policy gradient by, but we can only estimate it, and every estimate trades bias for variance. The TD-error δ is the cheap, biased, low-variance end; the Monte-Carlo return is the expensive, unbiased, high-variance end; GAE(λ) is the exponentially-weighted blend of all the n-step estimators in between, with λ — the same λ as TD(λ) — as the bias↔variance dial (sweet spot ≈ 0.95). But even a perfect advantage leaves the step-size problem unsolved: plain PG can change the policy catastrophically, which is why we need a trust region.