Importance sampling — on-policy vs off-policy
A3C and policy gradient throw their data away every step. Here is the one identity that lets you reuse samples drawn from an old policy — and the variance bomb hidden inside it that will shape the next two lessons.
What broke: on-policy is sample-hungry
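Look back at the last three lessons. The policy-gradient estimator (lesson 07) and A3C's parallel actors (lesson 06) share one inconvenient property: the gradient is an expectation under the current policy, with A_t the advantage estimate for action a_t:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_t\Big]$$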
The trajectories τ must come from πθ — the policy as it is right now. The moment you take one gradient step, θ changes, the data you collected was generated by the old πθ, and it is now stale. You sample a batch, take exactly one step, and discard it. That is what on-policy means, and it is brutally expensive: in the LLM world a "sample" is a full generated response from a billion-parameter model.
We would love to take several gradient steps on one batch, or train on data from a previous iteration. But the expectation is over the wrong distribution. The fix is a single identity from statistics.
The importance-sampling identity
Suppose you want the expected value of some function f(x) under a distribution p, but you can only draw samples from a different distribution q. Multiply and divide by q:
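$$\mathbb{E}_{x \sim p}[f(x)] = \int p(x)\, f(x)\, dx = \int q(x)\, \frac{p(x)}{q(x)}\, f(x)\, dx = \mathbb{E}_{x \sim q}\Big[\frac{p(x)}{q(x)}\, f(x)\Big]$$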
That is the whole trick. The expectation under p equals an expectation under q, provided you reweight every sample by the ratio p(x)/q(x). Samples that p finds likely but q rarely produces get amplified; samples q over-produces get shrunk. The estimate is unbiased for any valid q — as long as q(x) > 0 wherever p(x) f(x) ≠ 0 (you cannot reweight a sample you never draw).
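The identity is easy to check numerically. The sketch below is an illustration under assumed choices, not part of the lesson's own code: the target p = N(1, 1), the sampling distribution q = N(0, 1), the test function f(x) = x², and the helper names `gaussian_logpdf` and `importance_sampling_estimate` are all picked for the example. With these choices the exact answer is 𝔼p[x²] = 1² + 1 = 2.

```python
import numpy as np


def gaussian_logpdf(x, mean, std):
    """Log density of the normal distribution N(mean, std^2)."""
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std * np.sqrt(2.0 * np.pi))


def importance_sampling_estimate(f, samples, logp, logq):
    """Estimate E_p[f] from samples drawn under q, reweighting each by p/q."""
    ratio = np.exp(logp(samples) - logq(samples))
    return np.mean(ratio * f(samples))


rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=200_000)  # samples from q = N(0, 1)

estimate = importance_sampling_estimate(
    f=lambda v: v ** 2,
    samples=x,
    logp=lambda v: gaussian_logpdf(v, 1.0, 1.0),  # assumed target p = N(1, 1)
    logq=lambda v: gaussian_logpdf(v, 0.0, 1.0),  # assumed sampler q = N(0, 1)
)
naive = np.mean(x ** 2)  # no reweighting: this estimates E_q[x^2] = 1 instead

print(f"IS estimate: {estimate:.3f}  (exact E_p[x^2] = 2)")
print(f"naive mean:  {naive:.3f}  (exact E_q[x^2] = 1)")
```

Working with log densities and exponentiating the difference is a common way to form the ratio without underflow when the densities themselves are tiny.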
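For policy gradients, the target p is the current policy πθ, the sampling distribution q is the data-collecting policy πold, and the weight is the probability ratio

$$\rho = \frac{\pi_\theta(a \mid s)}{\pi_{\text{old}}(a \mid s)}$$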
Now the data does not have to be fresh. Collect a batch from πold, then take several improvement steps, each one correcting for the mismatch with the factor ρ. This single move is what turns the sample-hungry on-policy gradient into something you can reuse.
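In practice ρ is usually computed from log-probabilities of the actions that were actually taken. The following is a minimal sketch, assuming a discrete-action softmax policy whose logits are given directly as arrays; `log_softmax` and `probability_ratio` are hypothetical helper names, and the numbers are made up for illustration.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def probability_ratio(new_logits, old_logits, actions):
    """rho = pi_theta(a|s) / pi_old(a|s) for the actions stored in the batch."""
    idx = np.arange(len(actions))
    logp_new = log_softmax(new_logits)[idx, actions]
    logp_old = log_softmax(old_logits)[idx, actions]
    return np.exp(logp_new - logp_old)


# Two states, two actions. pi_old is uniform in both states.
old_logits = np.array([[0.0, 0.0], [0.0, 0.0]])
# pi_theta has moved to probabilities (0.75, 0.25) in the first state only.
new_logits = np.array([[np.log(3.0), 0.0], [0.0, 0.0]])
actions = np.array([0, 1])  # actions that pi_old sampled

rho = probability_ratio(new_logits, old_logits, actions)
print(rho)  # 0.75 / 0.5 = 1.5 in the first state, 0.5 / 0.5 = 1.0 in the second
```

Before any update, when the new logits equal the old ones, every ratio is exactly 1 and the reweighted batch coincides with the on-policy one.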
On-policy vs off-policy, defined cleanly
This identity is the precise statement of a distinction the orientation flagged and lesson 02 already exploited:
| | On-policy | Off-policy |
|---|---|---|
| Definition | Learn about the policy you are following. Data must come from the current π. | Learn about a target policy from data generated by a different behavior policy. |
| Examples | REINFORCE, vanilla policy gradient (L07), A3C/A2C (L06), SARSA. | Q-learning / DQN (L02), and policy gradient with importance sampling. |
| Sample reuse | None — discard each batch after one step. | Yes — replay buffers (L02), or IS-reweighted batches. |
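Why is Q-learning off-policy? Recall its update from lesson 02, with learning rate α and discount factor γ:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\big[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\big]$$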
The bootstrap target uses maxa' Q(s',a'): the value of the greedy action under the current Q estimate, which is the action the target policy would take. But the action that actually generated the transition (s,a,r,s') came from some exploratory behavior policy (ε-greedy). The data is collected under one policy; the thing being learned is a different one. That decoupling, learning about the greedy target while behaving exploratorily, is exactly off-policy learning, and it is why a replay buffer is even legal in DQN. Vanilla policy gradient has no such max; its expectation is tied to the sampling policy, so it is on-policy until we bolt on ρ.
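The difference shows up in a single line of a tabular update. The sketch below contrasts Q-learning with SARSA (the on-policy counterpart listed in the table) on one transition; the Q-table values, the step size 0.5, and the discount 0.9 are arbitrary illustration choices, and terminal-state handling is omitted for brevity.

```python
import numpy as np


def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """Off-policy: bootstrap from the greedy action in s_next."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """On-policy: bootstrap from the action the behavior policy really took."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])


# Rows are states, columns are actions.
Q_off = np.array([[0.0, 0.0], [1.0, 3.0]])
Q_on = Q_off.copy()

# Transition s=0, a=0, r=1, s'=1. The exploratory behavior policy then
# picked a'=0 in s', which is not the greedy action (a'=1 has value 3).
q_learning_update(Q_off, s=0, a=0, r=1.0, s_next=1)
sarsa_update(Q_on, s=0, a=0, r=1.0, s_next=1, a_next=0)

print(Q_off[0, 0])  # 0.5 * (1 + 0.9 * 3), about 1.85: ignores the exploratory a'
print(Q_on[0, 0])   # 0.5 * (1 + 0.9 * 1), about 0.95: depends on the exploratory a'
```

Because `q_learning_update` never looks at `a_next`, the same stored transition gives the same target regardless of which policy generated it.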
The catch: variance blows up when the policies disagree
The IS estimate is unbiased, which sounds free. It is not. Its variance depends on how far πθ has drifted from πold. The reweighted estimator is
$$\hat{\mu} = \frac{1}{N} \sum_i \rho(x_i)\, f(x_i), \qquad x_i \sim q$$
and its variance scales with 𝔼q[ρ² f²]. When the two distributions overlap well, every ρ ≈ 1 and the estimator behaves like ordinary Monte Carlo. But when they disagree, a handful of samples land where p ≫ q and receive enormous weights, while the rest get weights near zero. The estimate is then effectively driven by one or two lucky draws — high variance, sometimes catastrophically so.
A clean diagnostic is the effective sample size: out of your N drawn samples, how many are actually "doing work" after reweighting?
$$\mathrm{ESS} = \frac{\big(\sum_i w_i\big)^2}{\sum_i w_i^2}, \qquad w_i = \rho(x_i)$$
If all weights are equal, ESS = N. If one weight dominates, ESS → 1 — you paid for N samples but got the statistical power of one. The widget below makes this collapse visceral: slide the gap between the two policies and watch ESS crater while the estimate's variance explodes. Large gap = unusable estimate. That is the bug, and the bug is the lesson.
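The two limiting cases of the effective sample size can be confirmed in a few lines. This is a small sketch; `effective_sample_size` is a hypothetical helper name and the weight vectors are made up to hit the two extremes.

```python
import numpy as np


def effective_sample_size(weights):
    """ESS = (sum of weights)^2 / (sum of squared weights)."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.square(w).sum()


ess_equal = effective_sample_size(np.ones(100))               # all weights equal
ess_dominated = effective_sample_size([1000.0] + [1e-3] * 99)  # one huge weight

print(ess_equal)      # equals N = 100
print(ess_dominated)  # barely above 1
```

Note that ESS is invariant to rescaling all weights by a constant, so it measures only how unevenly the weight is spread, not its overall size.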
Interactive · the importance-sampling variance bomb
Two Gaussian "policies": πold (orange, fixed — the data source) and πθ (blue, the target). We want 𝔼πθ[f] for a simple test function f(x)=x², but we only get to sample from πold and reweight by ρ = πθ/πold. Drag the gap to push the policies apart, then resample.
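For readers without the widget, the same experiment can be approximated offline. The sketch below is not the widget's code; it assumes unit-variance Gaussians with πold = N(0, 1) and πθ = N(gap, 1), a batch of 500 samples, and 200 repeated trials per gap, all chosen for illustration. For these Gaussians the log-ratio has the closed form gap·x − gap²/2 and the exact target is 𝔼πθ[x²] = 1 + gap².

```python
import numpy as np


def log_ratio(x, gap):
    """log(pi_theta / pi_old) for pi_old = N(0, 1) and pi_theta = N(gap, 1)."""
    return gap * x - 0.5 * gap ** 2


def run_trial(gap, n, rng):
    """Draw one batch from pi_old; return the IS estimate of E[x^2] and the ESS."""
    x = rng.normal(0.0, 1.0, size=n)
    w = np.exp(log_ratio(x, gap))
    estimate = np.mean(w * x ** 2)
    ess = w.sum() ** 2 / np.square(w).sum()
    return estimate, ess


def sweep(gaps, n=500, trials=200, seed=0):
    """For each gap, repeat the experiment and summarise spread and ESS."""
    rng = np.random.default_rng(seed)
    rows = []
    for gap in gaps:
        results = np.array([run_trial(gap, n, rng) for _ in range(trials)])
        estimates, ess = results[:, 0], results[:, 1]
        exact = 1.0 + gap ** 2
        rows.append((gap, exact, estimates.mean(), estimates.std(), ess.mean()))
    return rows


rows = sweep([0.0, 1.0, 2.0, 3.0])
print("gap  exact  mean_est  std_est  mean_ESS")
for gap, exact, mean_est, std_est, mean_ess in rows:
    print(f"{gap:3.1f}  {exact:5.1f}  {mean_est:8.3f}  {std_est:7.3f}  {mean_ess:8.1f}")
```

At gap 0 every weight is 1 and ESS equals the batch size; as the gap grows, the mean ESS falls toward a handful of samples and the spread of the estimate across trials widens.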
Why this matters: it forces the trust region
The variance bomb is not a nuisance to engineer around — it is a constraint on how we are allowed to learn. Importance sampling only buys us sample reuse when πθ stays close to πold, i.e. while ρ ≈ 1. So the recipe writes itself:
- Optimize the IS-reweighted objective L(θ) = 𝔼πold[ ρ · A ] — the surrogate objective.
- But constrain the policy so it cannot wander far from πold, keeping ρ well-behaved.
That constraint is a trust region, and turning "keep πθ close to πold" into a hard, measurable bound — a limit on the KL divergence KL[πold ‖ πθ] — is exactly what TRPO does in the next lesson. And the very ratio ρ you just watched explode is the quantity that PPO (lesson 11) will simply clip to [1−ε, 1+ε]: a cheap, brute-force way to forbid the large weights instead of constraining the distribution. Every stabilization trick in the modern policy-gradient lineage is, at heart, a way to keep this one fraction near 1.
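The three quantities named above can be sketched together for a tiny batch. This is an illustration under assumptions, not TRPO or PPO themselves: the policies are given as explicit discrete probability tables, the advantages are made-up numbers, and `clip_ratio` shows only the bare clamp on ρ, not PPO's full objective. The function names are hypothetical.

```python
import numpy as np


def surrogate(rho, advantages):
    """IS-reweighted surrogate L = mean(rho * A) over a batch from pi_old."""
    return np.mean(rho * advantages)


def kl_old_new(p_old, p_new):
    """Mean KL[pi_old || pi_theta] over a batch of discrete action distributions."""
    return np.mean(np.sum(p_old * (np.log(p_old) - np.log(p_new)), axis=-1))


def clip_ratio(rho, eps=0.2):
    """Clamp each ratio to [1 - eps, 1 + eps]."""
    return np.clip(rho, 1.0 - eps, 1.0 + eps)


# Action probabilities in two states (rows) over two actions (columns).
p_old = np.array([[0.5, 0.5], [0.5, 0.5]])
p_new = np.array([[0.75, 0.25], [0.5, 0.5]])
actions = np.array([0, 1])          # actions sampled by pi_old
advantages = np.array([2.0, -1.0])  # assumed advantage estimates

idx = np.arange(len(actions))
rho = p_new[idx, actions] / p_old[idx, actions]  # [1.5, 1.0]

objective = surrogate(rho, advantages)  # (1.5 * 2.0 + 1.0 * -1.0) / 2
divergence = kl_old_new(p_old, p_new)   # 0.25 * ln(4/3), about 0.072
clipped = clip_ratio(rho)               # [1.2, 1.0] with eps = 0.2

print(objective, divergence, clipped)
```

A trust-region method would accept an update only while `divergence` stays under a chosen bound, whereas the clipping approach limits each entry of `rho` directly.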