Importance sampling — on-policy vs off-policy

A3C and policy gradient throw their data away every step. Here is the one identity that lets you reuse samples drawn from an old policy — and the variance bomb hidden inside it that will shape the next two lessons.

What broke: on-policy is sample-hungry

Look back at the last three lessons. The policy-gradient estimator (lesson 07) and A3C's parallel actors (lesson 06) both share one inconvenient property: the gradient is an expectation under the current policy,

∇_θ J(θ) = 𝔼_{τ ∼ π_θ} [ ∑_t A(s_t, a_t) · ∇_θ log π_θ(a_t|s_t) ]

The trajectories τ must come from π_θ — the policy as it is right now. The moment you take one gradient step, θ changes, the data you collected was generated by the old π_θ, and it is now stale. You sample a batch, take exactly one step, and discard it. That is what on-policy means, and it is brutally expensive: in the LLM world a "sample" is a full generated response from a billion-parameter model.

We would love to take several gradient steps on one batch, or train on data from a previous iteration. But the expectation is over the wrong distribution. The fix is a single identity from statistics.

The importance-sampling identity

Suppose you want the expected value of some function f(x) under a distribution p, but you can only draw samples from a different distribution q. Multiply and divide by q:

𝔼_{x ∼ p}[ f(x) ] = ∑_x p(x) f(x) = ∑_x q(x) · (p(x)/q(x)) · f(x) = 𝔼_{x ∼ q}[ (p(x)/q(x)) · f(x) ]

That is the whole trick. The expectation under p equals an expectation under q, provided you reweight every sample by the ratio p(x)/q(x). Samples that p finds likely but q rarely produces get amplified; samples q over-produces get shrunk. The estimate is unbiased for any valid q — as long as q(x) > 0 wherever p(x) f(x) ≠ 0 (you cannot reweight a sample you never draw).
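The identity can be checked exactly on a tiny discrete example. The sketch below is only an illustration: the three-outcome distributions `p` and `q`, the values in `f`, and the helper names are all made up for this purpose.

```javascript
// Hypothetical three-outcome example; p, q and f are invented numbers.
const p = [0.7, 0.2, 0.1];   // target distribution
const q = [0.2, 0.3, 0.5];   // sampling distribution, q(x) > 0 everywhere
const f = [1.0, 4.0, 10.0];  // f(x) for outcomes x = 0, 1, 2

// Direct expectation under a distribution: sum_x dist(x) * values(x).
function expectationUnder(dist, values) {
  return dist.reduce((sum, px, x) => sum + px * values[x], 0);
}

// The same quantity written as an expectation under the sampler
// of the reweighted function (target/sampler) * values.
function reweightedExpectation(target, sampler, values) {
  return sampler.reduce(
    (sum, qx, x) => sum + qx * (target[x] / qx) * values[x],
    0
  );
}

const direct = expectationUnder(p, f);             // 0.7*1 + 0.2*4 + 0.1*10
const reweighted = reweightedExpectation(p, q, f); // same sum, routed through q
console.log(direct, reweighted);                   // both 2.5 up to floating-point rounding
```

Setting any entry of `q` to zero where `p` times `f` is nonzero makes the ratio divide by zero, which is the support condition stated above showing up in code.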

The policy ratio ρ

Set p = π_θ (the policy we want to improve) and q = π_old (the policy that actually generated the data). The reweighting factor is the importance-sampling ratio, which we will write the same way for the rest of the course:

ρ(s,a) = π_θ(a|s) / π_old(a|s)

With it, the expected advantage under the new policy at a given state, the quantity the policy gradient tries to push up, can be estimated from actions sampled by the old policy:

𝔼_{a ∼ π_θ}[ A(s,a) ] = 𝔼_{a ∼ π_old}[ ρ(s,a) · A(s,a) ]

Now the data does not have to be fresh. Collect a batch from π_old, then take several improvement steps, each one correcting for the mismatch with the factor ρ. This single move is what turns the sample-hungry on-policy gradient into something you can reuse.
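In practice the ratio is usually computed from stored log-probabilities rather than raw probabilities. The following is a minimal sketch under that assumption; the field names `logpNew`, `logpOld` and `advantage`, the function names, and the two-sample batch are hypothetical.

```javascript
// rho = pi_theta(a|s) / pi_old(a|s), computed as exp(log pi_theta - log pi_old).
function importanceRatio(logpNew, logpOld) {
  return Math.exp(logpNew - logpOld);
}

// Monte Carlo estimate of E_{pi_old}[ rho * A ] from a batch collected under pi_old.
function surrogateEstimate(samples) {
  let total = 0;
  for (const { logpNew, logpOld, advantage } of samples) {
    total += importanceRatio(logpNew, logpOld) * advantage;
  }
  return total / samples.length;
}

// Invented batch: logpOld was recorded at collection time,
// logpNew comes from re-scoring the same action with the current policy.
const demoBatch = [
  { logpNew: Math.log(0.5), logpOld: Math.log(0.25), advantage: 1.0 },  // rho = 2
  { logpNew: Math.log(0.1), logpOld: Math.log(0.2), advantage: -2.0 },  // rho = 0.5
];

const surrogate = surrogateEstimate(demoBatch); // (2*1 + 0.5*(-2)) / 2
console.log(surrogate);
```

Only `logpNew` changes between gradient steps on the same batch; `logpOld` and `advantage` stay fixed, which is what makes several updates per batch possible.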

On-policy vs off-policy, defined cleanly

This identity is the precise statement of a distinction the orientation flagged and lesson 02 already exploited:

	On-policy	Off-policy
Definition	Learn about the policy you are following. Data must come from the current π.	Learn about a target policy from data generated by a different behavior policy.
Examples	REINFORCE, vanilla policy gradient (L07), A3C/A2C (L06), SARSA.	Q-learning / DQN (L02), and policy gradient with importance sampling.
Sample reuse	None — discard each batch after one step.	Yes — replay buffers (L02), or IS-reweighted batches.

Why is Q-learning off-policy? Recall its update from lesson 02:

Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s', a') − Q(s,a) ]

The bootstrap target uses max_a' Q(s',a'): the value of the greedy action under the current Q estimate, which is the action the optimal policy would take once Q has converged. But the action that actually generated the transition (s,a,r,s') came from some exploratory behavior policy (ε-greedy). The data is collected under one policy; the thing being learned is a different one. That decoupling, learning about the greedy target policy while acting with an exploratory one, is exactly off-policy learning, and it is why a replay buffer is even legal in DQN. Vanilla policy gradient has no such max; its expectation is tied to the sampling policy, so it is on-policy until we bolt on ρ.
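To see the decoupling in code, here is a small tabular sketch. It assumes a `Map` from state name to an array of action values; the state names, the numbers, and both function names are invented for illustration and are not taken from lesson 02.

```javascript
// Behavior policy: epsilon-greedy over the current Q values.
// rand is injectable so the choice can be made deterministic.
function epsilonGreedy(qTable, s, epsilon, rand = Math.random) {
  const values = qTable.get(s);
  if (rand() < epsilon) return Math.floor(rand() * values.length); // explore
  return values.indexOf(Math.max(...values));                      // exploit
}

// One Q-learning step. The target uses the greedy value at sNext,
// no matter which policy picked the action a that was actually taken.
function qLearningUpdate(qTable, s, a, r, sNext, alpha, gamma) {
  const target = r + gamma * Math.max(...qTable.get(sNext));
  const tdError = target - qTable.get(s)[a];
  qTable.get(s)[a] += alpha * tdError;
  return tdError;
}

// Hypothetical two-state table.
const qTable = new Map([
  ["s0", [0.0, 0.0]],
  ["s1", [1.0, 3.0]],
]);

// Transition (s0, a=0, r=1, s1) could have come from any behavior policy.
const tdError = qLearningUpdate(qTable, "s0", 0, 1.0, "s1", 0.5, 0.9);
console.log(tdError, qTable.get("s0"));
```

Nothing in `qLearningUpdate` refers to `epsilonGreedy`, so a transition stored long ago under a different ε can be replayed without any reweighting.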

The catch: variance blows up when the policies disagree

The IS estimate is unbiased, which sounds free. It is not. Its variance depends on how far π_θ has drifted from π_old. The reweighted estimator is

Ê = (1/N) ∑_i ρ(x_i) f(x_i),   x_i ∼ q

and its variance scales with 𝔼_q[ρ² f²]. When the two distributions overlap well, every ρ ≈ 1 and the estimator behaves like ordinary Monte Carlo. But when they disagree, a handful of samples land where p ≫ q and receive enormous weights, while the rest get weights near zero. The estimate is then effectively driven by one or two lucky draws — high variance, sometimes catastrophically so.
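For a sense of scale, the variance can be written out and then evaluated in closed form for one simple case. The Gaussian pair below is an assumption chosen to match the demo later in this lesson (a fixed sampler and a mean-shifted target with equal width); the symbols g and σ are introduced here for the shift and the shared standard deviation.

```latex
\operatorname{Var}_q\!\big[\hat{E}\big]
  = \frac{1}{N}\Big( \mathbb{E}_q\big[\rho^2 f^2\big] - \big(\mathbb{E}_p[f]\big)^2 \Big)

% Assumed case: q = \mathcal{N}(0,\sigma^2), \; p = \mathcal{N}(g,\sigma^2)
\rho(x) = \frac{p(x)}{q(x)} = \exp\!\Big( \frac{2gx - g^2}{2\sigma^2} \Big),
\qquad
\mathbb{E}_q\big[\rho^2\big] = \exp\!\Big( \frac{g^2}{\sigma^2} \Big),
\qquad
\operatorname{Var}_q[\rho] = \exp\!\Big( \frac{g^2}{\sigma^2} \Big) - 1
```

With σ = 1, a shift of g = 1 gives 𝔼_q[ρ²] = e ≈ 2.7, while g = 3 gives e⁹ ≈ 8100: the second moment of the weights grows exponentially in the squared gap, not linearly.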

A clean diagnostic is the effective sample size: out of your N drawn samples, how many are actually "doing work" after reweighting?

ESS = ( ∑_i w_i )² / ∑_i w_i²,   w_i = ρ(x_i)

If all weights are equal, ESS = N. If one weight dominates, ESS → 1 — you paid for N samples but got the statistical power of one. The widget below makes this collapse visceral: slide the gap between the two policies and watch ESS crater while the estimate's variance explodes. Large gap = unusable estimate. That is the bug, and the bug is the lesson.
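The two limiting cases are easy to confirm numerically. This is a small sketch; the function name and the weight vectors are invented for illustration.

```javascript
// Effective sample size: (sum of weights)^2 / (sum of squared weights).
function effectiveSampleSize(weights) {
  const sum = weights.reduce((acc, w) => acc + w, 0);
  const sumSq = weights.reduce((acc, w) => acc + w * w, 0);
  return (sum * sum) / sumSq;
}

// Equal weights: every sample counts fully, so ESS equals N.
const essEqual = effectiveSampleSize([1, 1, 1, 1]);

// One dominant weight: ESS falls to just above 1.
const essSkewed = effectiveSampleSize([100, 0.01, 0.01, 0.01]);

console.log(essEqual, essSkewed);
```

Because the formula is a ratio, rescaling all weights by a constant leaves ESS unchanged, so it works equally well on unnormalised ratios.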

Interactive · the importance-sampling variance bomb

Two Gaussian "policies": π_old (orange, fixed — the data source) and π_θ (blue, the target). We want 𝔼_{π_θ}[f] for a simple test function f(x)=x², but we only get to sample from π_old and reweight by ρ = π_θ/π_old. Drag the gap to push the policies apart, then resample.

Reweighting across a policy gap

Orange = π_old (samples drawn here). Blue = π_θ (what we want the average under). Each sample's dot size is its IS weight ρ. As the gap grows, one or two fat-weighted samples hijack the estimate — variance explodes, effective sample size collapses toward 1.

Widget controls and readouts: a slider for the gap (the π_θ mean shift, shown at 0.40), plus live values for the IS estimate Ê, the true 𝔼_{π_θ}[f], the estimator std over 100 batches, the effective sample size / N, and the max single weight.

The JS that runs this widget:

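// π_old ~ N(0,1) is the sampler; π_θ ~ N(gap,1) is the target.
const OLD_MU = 0, SIG = 1;

// Unnormalised Gaussian density; the normalising constant cancels in ρ because both share SIG.
function gauss(x, mu) { return Math.exp(-((x - mu) ** 2) / (2 * SIG * SIG)); }

// Test function whose expectation under π_θ we want.
function f(x) { return x * x; }

// Standard-normal draw via Box-Muller (assumed helper; the widget's own randn is not shown).
function randn() {
  const u = 1 - Math.random();   // in (0, 1], so Math.log(u) is finite
  const v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

function batch(gap, N) {
  const xs = [], ws = [];
  for (let i = 0; i < N; i++) {
    const x = OLD_MU + SIG * randn();             // x ~ π_old
    const w = gauss(x, gap) / gauss(x, OLD_MU);   // ρ = π_θ(x)/π_old(x)
    xs.push(x); ws.push(w);
  }
  const W = ws.reduce((a, b) => a + b, 0);                      // sum of weights
  const est = ws.reduce((s, w, i) => s + w * f(xs[i]), 0) / W;  // self-normalised IS estimate
  const ess = (W * W) / ws.reduce((s, w) => s + w * w, 0);      // effective sample size
  return { est, ess, maxw: Math.max(...ws) / W };               // maxw = largest weight's share of W
}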

Read the dial

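At a small gap the dots are uniformly sized, ESS/N sits near 1, and Ê tracks the truth across resamples. Push the gap past ~3 and watch: a single orange sample out in the blue tail gets a giant weight, ESS/N plunges toward 0, and the estimate jumps wildly from batch to batch. The plain estimator Ê = (1/N) ∑_i ρ(x_i) f(x_i) is still unbiased at any gap; the self-normalised variant the widget code computes (dividing by ∑_i w_i instead of N) is biased at finite N and only consistent as N grows. Unbiased and useless are not mutually exclusive.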
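The batch-to-batch spread can be reproduced outside the browser. The sketch below is an assumption-laden stand-in for the widget, not its actual code: it uses a seeded generator (`mulberry32`) so runs repeat, N = 200 samples per batch, 100 batches, and the names `makeRandn`, `isBatch` and `summarize` are hypothetical.

```javascript
// Small seeded PRNG so results are repeatable (the widget's RNG is not shown).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Standard-normal sampler (Box-Muller) built on a uniform generator.
function makeRandn(rand) {
  return function () {
    const u = 1 - rand(); // in (0, 1], avoids log(0)
    const v = rand();
    return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
  };
}

// One self-normalised IS batch: sample from N(0,1), target N(gap,1), f(x) = x^2.
function isBatch(gap, n, randn) {
  let sumW = 0, sumW2 = 0, sumWF = 0;
  for (let i = 0; i < n; i++) {
    const x = randn();
    const w = Math.exp(gap * x - 0.5 * gap * gap); // N(x; gap, 1) / N(x; 0, 1)
    sumW += w;
    sumW2 += w * w;
    sumWF += w * x * x;
  }
  return { est: sumWF / sumW, essFrac: (sumW * sumW) / sumW2 / n };
}

// Repeat the batch many times and summarise the spread of the estimate.
function summarize(gap, n, batches, seed) {
  const randn = makeRandn(mulberry32(seed));
  const ests = [], fracs = [];
  for (let b = 0; b < batches; b++) {
    const { est, essFrac } = isBatch(gap, n, randn);
    ests.push(est);
    fracs.push(essFrac);
  }
  const mean = ests.reduce((acc, e) => acc + e, 0) / batches;
  const variance = ests.reduce((acc, e) => acc + (e - mean) ** 2, 0) / batches;
  const meanEssFrac = fracs.reduce((acc, e) => acc + e, 0) / batches;
  // For a unit-variance Gaussian with mean gap, E[x^2] = 1 + gap^2.
  return { truth: 1 + gap * gap, meanEst: mean, std: Math.sqrt(variance), meanEssFrac };
}

const nearby = summarize(0.4, 200, 100, 1);
const farApart = summarize(3.0, 200, 100, 1);
console.log(nearby, farApart);
```

Comparing `std` and `meanEssFrac` between the two results shows the same pattern as the dial: a tight estimate with most samples contributing at the small gap, and a scattered estimate carried by a handful of samples at the large one.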

Why this matters: it forces the trust region

The variance bomb is not a nuisance to engineer around — it is a constraint on how we are allowed to learn. Importance sampling only buys us sample reuse when π_θ stays close to π_old, i.e. while ρ ≈ 1. So the recipe writes itself:

Optimize the IS-reweighted objective L(θ) = 𝔼_{π_old}[ ρ · A ] — the surrogate objective.
But constrain the policy so it cannot wander far from π_old, keeping ρ well-behaved.

That constraint is a trust region, and turning "keep π_θ close to π_old" into a hard, measurable bound — a limit on the KL divergence KL[π_old ‖ π_θ] — is exactly what TRPO does in the next lesson. And the very ratio ρ you just watched explode is the quantity that PPO (lesson 11) will simply clip to [1−ε, 1+ε]: a cheap, brute-force way to forbid the large weights instead of constraining the distribution. Every stabilization trick in the modern policy-gradient lineage is, at heart, a way to keep this one fraction near 1.
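As a preview of the clipping idea, here is a minimal per-sample sketch. It uses the form in which PPO's clipped objective is commonly written, min(ρ·A, clip(ρ, 1−ε, 1+ε)·A); lesson 11 may present it differently, and the function names and the value ε = 0.2 are assumptions for illustration.

```javascript
// Restrict x to the interval [lo, hi].
function clip(x, lo, hi) {
  return Math.min(Math.max(x, lo), hi);
}

// Clipped surrogate term for one sample:
// min( rho * A, clip(rho, 1 - eps, 1 + eps) * A ).
function clippedSurrogateTerm(rho, advantage, epsilon) {
  const clippedRho = clip(rho, 1 - epsilon, 1 + epsilon);
  return Math.min(rho * advantage, clippedRho * advantage);
}

const eps = 0.2; // assumed clip range
const hugeRatio = clippedSurrogateTerm(5.0, 1.0, eps);   // weight capped at 1 + eps
const tinyRatio = clippedSurrogateTerm(0.1, -1.0, eps);  // weight floored at 1 - eps
const nearOne = clippedSurrogateTerm(1.05, 2.0, eps);    // inside the range, unchanged
console.log(hugeRatio, tinyRatio, nearOne);
```

A sample with ρ = 5 contributes as if its weight were 1.2, so no single draw can dominate the batch the way it does in the unclipped estimator.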

Takeaway

Importance sampling reweights data from an old policy by ρ = π_θ(a|s)/π_old(a|s), converting the sample-hungry on-policy gradient into something you can reuse off-policy. That decoupling of "the policy you follow" from "the policy you learn about" is also what makes Q-learning's replay buffer legal, though Q-learning gets it from the max in its target rather than from ρ. The catch: the estimate's variance explodes as the policies diverge (effective sample size → 1), so we must keep π_θ close to π_old. Meeting that requirement takes the surrogate objective plus a trust region: TRPO next, and the clip PPO puts on this very ρ two lessons on.