
Importance sampling — on-policy vs off-policy

A3C and policy gradient throw their data away every step. Here is the one identity that lets you reuse samples drawn from an old policy — and the variance bomb hidden inside it that will shape the next two lessons.

What broke: on-policy is sample-hungry

Look back at the last three lessons. The policy-gradient estimator (lesson 07) and A3C's parallel actors (lesson 06) both share one inconvenient property: the gradient is an expectation under the current policy,

∇θ J(θ) = 𝔼τ∼πθ [ ∑t A(st, at) · ∇θ log πθ(at|st) ]

The trajectories τ must come from πθ — the policy as it is right now. The moment you take one gradient step, θ changes, the data you collected was generated by the old πθ, and it is now stale. You sample a batch, take exactly one step, and discard it. That is what on-policy means, and it is brutally expensive: in the LLM world a "sample" is a full generated response from a billion-parameter model.

We would love to take several gradient steps on one batch, or train on data from a previous iteration. But the expectation is over the wrong distribution. The fix is a single identity from statistics.

The importance-sampling identity

Suppose you want the expected value of some function f(x) under a distribution p, but you can only draw samples from a different distribution q. Multiply and divide by q:

𝔼x∼p[ f(x) ] = ∑x p(x) f(x) = ∑x q(x) · ( p(x) / q(x) ) · f(x) = 𝔼x∼q[ ( p(x) / q(x) ) · f(x) ]

That is the whole trick. The expectation under p equals an expectation under q, provided you reweight every sample by the ratio p(x)/q(x). Samples that p finds likely but q rarely produces get amplified; samples q over-produces get shrunk. The estimate is unbiased for any valid q — as long as q(x) > 0 wherever p(x) f(x) ≠ 0 (you cannot reweight a sample you never draw).
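As a quick sanity check of the identity, here is a minimal sketch on a three-outcome distribution. The probability tables `p` and `q`, the values `f`, and the helper names (`mulberry32`, `sampleIndex`, `isEstimate`) are illustrative assumptions, not part of the lesson's widget. It computes the same expectation three ways: directly under p, as the exact reweighted sum under q, and as a Monte Carlo average over draws from q.

```javascript
// Illustrative three-outcome example (all numbers are assumptions chosen for the demo).
const p = [0.1, 0.2, 0.7];   // target distribution
const q = [0.5, 0.3, 0.2];   // sampling distribution; q > 0 wherever p * f != 0
const f = [1, 2, 3];         // function values f(x) for x = 0, 1, 2

// Small seeded PRNG (mulberry32) so the Monte Carlo run is reproducible.
function mulberry32(seed) {
  let a = seed | 0;
  return function () {
    a = (a + 0x6D2B79F5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Draw an index from a discrete distribution by walking its CDF.
function sampleIndex(probs, rand) {
  const u = rand();
  let cum = 0;
  for (let i = 0; i < probs.length; i++) {
    cum += probs[i];
    if (u < cum) return i;
  }
  return probs.length - 1;   // guard against floating-point round-off
}

// 1) Direct expectation under p.
const direct = p.reduce((s, px, x) => s + px * f[x], 0);

// 2) Exact reweighted sum under q: sum_x q(x) * (p(x)/q(x)) * f(x).
const reweighted = q.reduce((s, qx, x) => s + qx * (p[x] / qx) * f[x], 0);

// 3) Monte Carlo: sample x ~ q, then average (p(x)/q(x)) * f(x).
function isEstimate(N, seed) {
  const rand = mulberry32(seed);
  let total = 0;
  for (let i = 0; i < N; i++) {
    const x = sampleIndex(q, rand);
    total += (p[x] / q[x]) * f[x];
  }
  return total / N;
}

const mc = isEstimate(200000, 1);
console.log({ direct, reweighted, mc });
```

The first two numbers agree up to floating-point round-off because the reweighted sum is the same sum rearranged; the third only approaches them as N grows, which is where variance enters.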

The policy ratio ρ
Set p = πθ (the policy we want to improve) and q = πold (the policy that actually generated the data). The reweighting factor is the importance-sampling ratio, which we will write the same way for the rest of the course:
ρ(s,a) = πθ(a|s) / πold(a|s)
With it, the gradient/objective under the new policy can be estimated from old samples:
𝔼a ∼ πθ[ A(s,a) ] = 𝔼a ∼ πold[ ρ(s,a) · A(s,a) ]

Now the data does not have to be fresh. Collect a batch from πold, then take several improvement steps, each one correcting for the mismatch with the factor ρ. This single move is what turns the sample-hungry on-policy gradient into something you can reuse.
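A common way to compute ρ in practice is from stored log-probabilities, since exponentiating a difference of logs is numerically gentler than dividing two small probabilities. The sketch below assumes per-sample arrays named `logpNew`, `logpOld`, and `advantages`; those names and the three-sample batch are hypothetical, chosen only to illustrate the reweighted average 𝔼πold[ ρ · A ].

```javascript
// rho_i = pi_theta(a_i|s_i) / pi_old(a_i|s_i), computed as exp(logp_new - logp_old).
function importanceRatios(logpNew, logpOld) {
  if (logpNew.length !== logpOld.length) throw new Error("length mismatch");
  return logpNew.map((lp, i) => Math.exp(lp - logpOld[i]));
}

// Sample average of rho * A over a batch that was collected under pi_old.
function surrogate(logpNew, logpOld, advantages) {
  const rho = importanceRatios(logpNew, logpOld);
  const total = rho.reduce((s, r, i) => s + r * advantages[i], 0);
  return total / rho.length;
}

// Hypothetical batch of three (state, action) samples.
const logpOld = [Math.log(0.5), Math.log(0.25), Math.log(0.25)];
const logpNew = [Math.log(0.6), Math.log(0.3), Math.log(0.1)];
const advantages = [1.0, -0.5, 2.0];

console.log(importanceRatios(logpNew, logpOld)); // ratios 0.6/0.5, 0.3/0.25, 0.1/0.25
console.log(surrogate(logpNew, logpOld, advantages));
```

When `logpNew` equals `logpOld` every ratio is exactly 1 and the function reduces to the plain batch mean of the advantages, which is the on-policy case.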

On-policy vs off-policy, defined cleanly

This identity is the precise statement of a distinction the orientation flagged and lesson 02 already exploited:

| | On-policy | Off-policy |
|---|---|---|
| Definition | Learn about the policy you are following. Data must come from the current π. | Learn about a target policy from data generated by a different behavior policy. |
| Examples | REINFORCE, vanilla policy gradient (L07), A3C/A2C (L06), SARSA. | Q-learning / DQN (L02), and policy gradient with importance sampling. |
| Sample reuse | None: discard each batch after one step. | Yes: replay buffers (L02), or IS-reweighted batches. |

Why is Q-learning off-policy? Recall its update from lesson 02:

Q(s,a) ← Q(s,a) + α [ r + γ maxa' Q(s', a') − Q(s,a) ]

The bootstrap target uses maxa' Q(s',a'): the value of the greedy action under the current Q estimate, which is the action the optimal policy would take once Q has converged. But the action that actually generated the transition (s,a,r,s') came from some exploratory behavior policy (ε-greedy). The data is collected under one policy; the thing being learned is a different one. That decoupling, learning about the greedy target while behaving exploratorily, is exactly off-policy learning, and it is why a replay buffer is even legal in DQN. Vanilla policy gradient has no such max; its expectation is tied to the sampling policy, so it is on-policy until we bolt on ρ.
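The decoupling is easy to see in a tabular sketch of the update. The function names `qLearningUpdate` and `epsilonGreedy`, the array-of-rows layout for Q, and the numbers are assumptions for illustration; terminal states are not handled. The point to notice is that the behavior action only chooses which entry gets updated, while the target always takes the max over next-state actions.

```javascript
// Tabular Q-learning step. Q is an array of rows: Q[state][action].
// The behavior action `a` only selects which entry to update;
// the bootstrap target uses the max over next-state actions regardless.
function qLearningUpdate(Q, s, a, r, sNext, alpha, gamma) {
  const target = r + gamma * Math.max(...Q[sNext]);
  Q[s][a] += alpha * (target - Q[s][a]);
  return Q[s][a];
}

// Hypothetical epsilon-greedy behavior policy that generates the data.
function epsilonGreedy(Q, s, epsilon, rand = Math.random) {
  if (rand() < epsilon) return Math.floor(rand() * Q[s].length); // explore
  return Q[s].indexOf(Math.max(...Q[s]));                        // exploit
}

// Two states, two actions; one transition (s=0, a=0, r=1, s'=1).
const Q = [[0, 0], [1, 2]];
qLearningUpdate(Q, 0, 0, 1, 1, 0.5, 0.9);
console.log(Q[0][0]); // 0 + 0.5 * (1 + 0.9 * 2 - 0), about 1.4
```

Nothing in `qLearningUpdate` asks how `a` was chosen, which is why transitions from any behavior policy, including old ones sitting in a replay buffer, remain usable.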

The catch: variance blows up when the policies disagree

The IS estimate is unbiased, which sounds free. It is not. Its variance depends on how far πθ has drifted from πold. The reweighted estimator is

Ê = (1/N) ∑i ρ(xi) f(xi),   xi ∼ q

and its variance scales with 𝔼q[ρ² f²]. When the two distributions overlap well, every ρ ≈ 1 and the estimator behaves like ordinary Monte Carlo. But when they disagree, a handful of samples land where p ≫ q and receive enormous weights, while the rest get weights near zero. The estimate is then effectively driven by one or two lucky draws — high variance, sometimes catastrophically so.
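To make "scales with 𝔼q[ρ² f²]" concrete, the variance can be written out and the second moment of the weights evaluated in closed form. The worked sketch below assumes the same setting as the widget later in this lesson, two unit-variance Gaussians q = N(0,1) and p = N(g,1) with g the gap; the symbol g is introduced here only for this derivation.

```latex
\begin{aligned}
\operatorname{Var}_q[\hat E] &= \frac{1}{N}\Big(\mathbb{E}_q[\rho^2 f^2] - \big(\mathbb{E}_p[f]\big)^2\Big), \\
\rho(x) &= \frac{e^{-(x-g)^2/2}}{e^{-x^2/2}} = e^{\,gx - g^2/2}, \\
\mathbb{E}_q[\rho^2] &= e^{-g^2}\,\mathbb{E}_q\big[e^{2gx}\big] = e^{-g^2}\,e^{2g^2} = e^{g^2}.
\end{aligned}
```

Under that assumption the second moment of the weights grows like e^{g²}: about 1.3 at g = 0.5, but roughly 8100 at g = 3.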

A clean diagnostic is the effective sample size: out of your N drawn samples, how many are actually "doing work" after reweighting?

ESS = ( ∑i wi )² / ∑i wi²,   wi = ρ(xi)

If all weights are equal, ESS = N. If one weight dominates, ESS → 1 — you paid for N samples but got the statistical power of one. The widget below makes this collapse visceral: slide the gap between the two policies and watch ESS crater while the estimate's variance explodes. Large gap = unusable estimate. That is the bug, and the bug is the lesson.
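The two limiting cases above can be checked directly with a few lines of code. This is a minimal sketch; the function name `effectiveSampleSize` and the example weight vectors are assumptions for illustration.

```javascript
// Effective sample size: (sum of w)^2 / (sum of w^2).
// Rescaling every weight by the same constant leaves the result unchanged.
function effectiveSampleSize(weights) {
  const sum = weights.reduce((a, w) => a + w, 0);
  const sumSq = weights.reduce((a, w) => a + w * w, 0);
  return (sum * sum) / sumSq;
}

console.log(effectiveSampleSize([1, 1, 1, 1]));     // equal weights: 4
console.log(effectiveSampleSize([1000, 1, 1, 1]));  // one dominant weight: just above 1
```

Because the formula is scale-invariant, it gives the same answer whether the weights are raw ratios ρ or have been normalised to sum to 1.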

Interactive · the importance-sampling variance bomb

Two Gaussian "policies": πold (orange, fixed, the data source) and πθ (blue, the target). We want 𝔼πθ[f] for a simple test function f(x)=x², but we only get to sample from πold and reweight by ρ = πθ/πold. Drag the gap to push the policies apart, then resample.

Reweighting across a policy gap
Orange = πold (samples drawn here). Blue = πθ (what we want the average under). Each sample's dot size is its IS weight ρ. As the gap grows, one or two fat-weighted samples hijack the estimate — variance explodes, effective sample size collapses toward 1.
Readouts: IS estimate Ê; true 𝔼πθ[f]; estimator std (100 batches); effective sample size / N; max single weight.
The JavaScript that runs this widget:
// π_old ~ N(0,1) is the sampler; π_θ ~ N(gap,1) is the target.
const OLD_MU = 0, SIG = 1;
function gauss(x, mu){ return Math.exp(-((x-mu)**2)/(2*SIG*SIG)); }  // unnormalised; norm cancels in ρ
function f(x){ return x*x; }                                          // test function

// Standard normal draw via Box-Muller. Assumed implementation: the widget's own randn() is not shown.
function randn(){
  const u = 1 - Math.random();             // in (0,1], so Math.log(u) stays finite
  const v = Math.random();
  return Math.sqrt(-2*Math.log(u)) * Math.cos(2*Math.PI*v);
}

function batch(gap, N){
  const xs=[], ws=[];
  for (let i=0;i<N;i++){
    const x = OLD_MU + SIG*randn();        // x ~ π_old
    const w = gauss(x, gap) / gauss(x, OLD_MU);   // ρ = π_θ(x)/π_old(x)
    xs.push(x); ws.push(w);
  }
  const W  = ws.reduce((a,b)=>a+b,0);                      // sum of weights
  const est = ws.reduce((s,w,i)=>s + w*f(xs[i]), 0) / W;   // self-normalised IS estimate
  const ess = (W*W) / ws.reduce((s,w)=>s + w*w, 0);        // effective sample size
  return { est, ess, maxw: Math.max(...ws)/W };            // maxw = largest weight as a share of W
}
Read the dial
At a small gap the dots are uniformly sized, ESS/N sits near 1, and Ê tracks the truth across resamples. Push the gap past ~3 and watch: a single orange sample out in the blue tail gets a giant weight, ESS/N plunges toward 0, and the estimate jumps wildly from batch to batch. The plain estimator (1/N) ∑i ρ(xi) f(xi) stays unbiased at any gap (the self-normalised version the widget computes is only asymptotically unbiased), yet at this gap neither is usable. Unbiased and useless are not mutually exclusive.
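The behavior described above can also be reproduced without the interactive page. The sketch below is a self-contained variant of the widget logic, not the widget's actual code: it assumes a seeded PRNG (`mulberry32`), a Box-Muller normal sampler, and hypothetical helper names `makeRandn`, `batch`, and `summarize`. It repeats the batch experiment 100 times per gap and reports the spread of the estimate and the mean ESS/N.

```javascript
// Seeded PRNG (mulberry32) so runs are reproducible.
function mulberry32(seed) {
  let a = seed | 0;
  return function () {
    a = (a + 0x6D2B79F5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Standard normal via Box-Muller, driven by the supplied uniform generator.
function makeRandn(rand) {
  return function () {
    const u = 1 - rand();   // in (0, 1], keeps Math.log finite
    const v = rand();
    return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
  };
}

// One batch: sample from N(0,1), reweight toward N(gap,1),
// return the self-normalised estimate of E[x^2] and the ESS.
function batch(gap, N, randn) {
  let W = 0, W2 = 0, num = 0;
  for (let i = 0; i < N; i++) {
    const x = randn();
    const w = Math.exp(gap * x - 0.5 * gap * gap);   // rho = N(x; gap, 1) / N(x; 0, 1)
    W += w; W2 += w * w; num += w * x * x;
  }
  return { est: num / W, ess: (W * W) / W2 };
}

// Repeat over many batches: std of the estimate and mean ESS/N.
function summarize(gap, N, batches, seed) {
  const randn = makeRandn(mulberry32(seed));
  const ests = [];
  let essSum = 0;
  for (let b = 0; b < batches; b++) {
    const { est, ess } = batch(gap, N, randn);
    ests.push(est);
    essSum += ess;
  }
  const mean = ests.reduce((a, e) => a + e, 0) / batches;
  const variance = ests.reduce((a, e) => a + (e - mean) ** 2, 0) / (batches - 1);
  // For x ~ N(gap, 1), the true value of E[x^2] is gap^2 + 1.
  return { truth: gap * gap + 1, mean, std: Math.sqrt(variance), essFrac: essSum / batches / N };
}

for (const gap of [0, 0.5, 1.5, 3]) console.log(gap, summarize(gap, 200, 100, 42));
```

Comparing the `std` and `essFrac` columns across gaps gives a numeric version of the dial: the spread grows and the effective fraction shrinks as the gap widens.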

Why this matters: it forces the trust region

The variance bomb is not a nuisance to engineer around — it is a constraint on how we are allowed to learn. Importance sampling only buys us sample reuse when πθ stays close to πold, i.e. while ρ ≈ 1. So the recipe writes itself:

  1. Optimize the IS-reweighted objective L(θ) = 𝔼πold[ ρ · A ] — the surrogate objective.
  2. But constrain the policy so it cannot wander far from πold, keeping ρ well-behaved.

That constraint is a trust region, and turning "keep πθ close to πold" into a hard, measurable bound — a limit on the KL divergence KL[πold ‖ πθ] — is exactly what TRPO does in the next lesson. And the very ratio ρ you just watched explode is the quantity that PPO (lesson 11) will simply clip to [1−ε, 1+ε]: a cheap, brute-force way to forbid the large weights instead of constraining the distribution. Every stabilization trick in the modern policy-gradient lineage is, at heart, a way to keep this one fraction near 1.
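The two quantities named above, the KL divergence between old and new policy and a ratio clamped to [1−ε, 1+ε], can each be measured in a few lines. The sketch below is neither TRPO nor PPO; it only computes those two measurements for a discrete action distribution. The function names `klDivergence` and `clipRatio`, and the example distributions, are hypothetical.

```javascript
// KL[pOld || pNew] for two discrete action distributions over the same actions.
// Terms with pOld = 0 contribute nothing; pNew must be positive wherever pOld is.
function klDivergence(pOld, pNew) {
  return pOld.reduce((s, po, i) => (po > 0 ? s + po * Math.log(po / pNew[i]) : s), 0);
}

// Clamp a ratio into the interval [1 - eps, 1 + eps].
function clipRatio(rho, eps) {
  return Math.min(Math.max(rho, 1 - eps), 1 + eps);
}

const piOld = [0.5, 0.3, 0.2];
const piNear = [0.45, 0.33, 0.22];   // small policy change
const piFar = [0.05, 0.05, 0.9];     // large policy change

console.log(klDivergence(piOld, piNear), klDivergence(piOld, piFar));

// Per-action ratios for the large change, before and after clamping with eps = 0.2.
const ratios = piFar.map((pn, i) => pn / piOld[i]);
console.log(ratios, ratios.map((r) => clipRatio(r, 0.2)));
```

The first measurement bounds how far the whole distribution has moved; the second simply caps each individual weight, which is the contrast the paragraph above draws.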

Takeaway
Importance sampling reweights data from an old policy by ρ = πθ(a|s)/πold(a|s), converting the sample-hungry on-policy gradient into something you can reuse off-policy — that decoupling of "the policy you follow" from "the policy you learn about" is exactly what makes Q-learning's bootstrap legal too. The catch: the estimate's variance explodes as the policies diverge (effective sample size → 1), so we must keep πθ close to πold. That requirement is the surrogate objective plus a trust region — TRPO next, and the clip PPO puts on this very ρ two lessons on.