
Offline RL — BCQ (Batch-Constrained Q-learning)

Lesson 19 named the disease: the Bellman backup queries Q(s', a') at actions the dataset never showed, those out-of-distribution estimates blow up, and there is no fresh interaction to correct them. BCQ's cure is almost embarrassingly direct — never consider an action the data doesn't support.

What broke (and where the error enters)

Recall the offline Q-learning target (lesson 05's bootstrap, now run on a frozen dataset D with no environment):

y = r + γ · max_{a'} Q_{θ'}(s', a')

That max over a' is the leak. It searches over all actions, and a neural critic that has only ever been trained on the dataset's actions returns garbage, usually large and positive, for actions far from the data (extrapolation error, lesson 19). The max then selects exactly those over-estimated phantoms and writes them into the target; the next backup propagates them; and with no real rollout to ever observe "no, that action was bad," the error compounds. The value diverges.
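To see the compounding in isolation, here is a minimal sketch on a one-state, self-looping toy problem. Everything in it is an assumption made for illustration: the reward R, the discount GAMMA, and especially OOD_INFLATION, a made-up factor standing in for a critic that extrapolates 20% too high on an action it never saw. It is not how a real critic misbehaves, only the smallest model in which the feedback loop is visible.

```javascript
const GAMMA = 0.9;
const R = 1.0;               // assumed reward for the one logged action
const OOD_INFLATION = 1.2;   // assumed: critic reads 20% too high on the unseen action

// Repeatedly apply the backup q <- R + GAMMA * max(...) from q = 0.
function iterateBackup(allowOOD, steps) {
  let q = 0;
  for (let k = 0; k < steps; k++) {
    const qOOD = OOD_INFLATION * q;                 // phantom estimate, never corrected
    const best = allowOOD ? Math.max(q, qOOD) : q;  // unconstrained vs in-support max
    q = R + GAMMA * best;
  }
  return q;
}

const qInSupport = iterateBackup(false, 200);
const qWithOOD = iterateBackup(true, 200);
console.log({ qInSupport, qWithOOD });
```

With the unseen action excluded, the backup contracts toward R / (1 − GAMMA) = 10. With it included, each backup multiplies the estimate by GAMMA × OOD_INFLATION = 1.08, which is greater than 1, so the value grows without bound.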

The single sentence from lesson 19
Offline RL fails because bootstrapping queries Q off the data manifold. Off-policy learning (lesson 12) is necessary to reuse a fixed dataset, but on its own it is not sufficient: the unconstrained max over a' walks straight off the support and trusts whatever it finds there.

The fix: starve the error at its source

If the problem is that the max ranges over OOD actions, then shrink what the max is allowed to range over. BCQ replaces the unconstrained maximization with a maximization over only those actions that look like they came from the dataset. Formally, restrict the next action to the support of the behavior policy πb (the policy that generated D):

y = r + γ · max_{a' : πb(a'|s') > 0} Q_{θ'}(s', a')

Read it: take the max over actions the behavior policy could plausibly have produced. If the critic is never queried off the manifold, it is never trained on a leaked target, and the lesson-19 blowup has nothing to feed on. This is the policy-constraint pole of offline RL: we constrain the actor (the set of actions it may ever propose) and leave the critic's learning rule untouched.
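A small worked example of the two targets side by side, assuming a discrete action set and a behavior policy whose probabilities are known (in practice they are not, which is the next section's problem). The Q values, probabilities, reward and discount are hypothetical numbers chosen so that the two unseen actions carry inflated estimates.

```javascript
const GAMMA = 0.9;
const r = 0.5;
const qNext = [1.2, 0.9, 7.5, 1.0, 6.1];         // hypothetical critic estimates Q(s', a')
const behaviorProb = [0.5, 0.3, 0.0, 0.2, 0.0];  // hypothetical πb(a'|s'); zero = never logged

// Max of `values` restricted to the indices that `allowed` accepts.
function maxOver(values, allowed) {
  let best = -Infinity;
  values.forEach((v, i) => {
    if (allowed(i) && v > best) best = v;
  });
  return best;
}

const yUnconstrained = r + GAMMA * maxOver(qNext, () => true);
const yConstrained = r + GAMMA * maxOver(qNext, (i) => behaviorProb[i] > 0);
console.log({ yUnconstrained, yConstrained });
```

The unconstrained target is built on the 7.5 estimate of an action the data never contained, giving about 7.25. The constrained target only sees the three supported actions and comes out at about 1.58.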

How do you "restrict to the support" without the simulator?

We don't know πb in closed form; we only have its samples — the dataset's (s,a) pairs. So BCQ learns a generative model of it. Concretely:

  1. A VAE Gω models the behavior policy. Train a conditional variational auto-encoder on the dataset to reconstruct the action a given the state s. Once trained, sampling a ∼ Gω(·|s) produces actions that look like the data at that state: a differentiable stand-in for πb(·|s).
  2. Propose n candidates per state. At a state s, draw {a_1,…,a_n} ∼ Gω(·|s). These are the only actions BCQ will ever score, and all of them sit on the data manifold by construction.
  3. A small perturbation network ξφ nudges each candidate. The VAE may not have sampled the exact best in-support action, so a learned perturbation ξφ(s, a) ∈ [−Φ, Φ] adds a bounded tweak. The bound Φ is the whole safety story: small Φ stays glued to the data; large Φ lets the perturbation wander off-support and re-opens the leak.
  4. The policy picks the candidate with the highest Q:
π(s) = argmax_{a_i + ξφ(s, a_i)} Qθ(s, a_i + ξφ(s, a_i)),    a_i ∼ Gω(·|s), i = 1,…,n
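The four steps above can be sketched as a single action-selection routine. This is a minimal sketch with hand-written stand-ins, not trained networks: sampleGenerator plays the role of Gω (here it just samples uniformly inside a hypothetical support band [LO, HI]), perturb plays the role of ξφ (a tanh squashed into [−Φ, Φ]), and Q is a toy critic that peaks outside the band. All of these names and shapes are assumptions.

```javascript
const LO = 0.1, HI = 0.5;   // hypothetical support band of the logged actions

// Tiny linear congruential generator so the sketch is reproducible.
function makeRng(seed) {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Stand-in for the trained VAE: an action that looks like the data at state s.
function sampleGenerator(s, rng) {
  return LO + (HI - LO) * rng();
}

// Stand-in for the perturbation network: bounded to [-phi, phi] by the tanh.
function perturb(s, a, phi) {
  return phi * Math.tanh(3 * (0.8 - a));
}

// Stand-in critic: prefers actions near 0.8, which lies outside the band.
function Q(s, a) {
  return -((a - 0.8) ** 2);
}

// Steps 2 to 4: propose n candidates, nudge each one, return the highest-Q candidate.
function bcqAction(s, n, phi, rng) {
  let bestA = null, bestQ = -Infinity;
  for (let i = 0; i < n; i++) {
    const ai = sampleGenerator(s, rng);
    const candidate = ai + perturb(s, ai, phi);
    const q = Q(s, candidate);
    if (q > bestQ) { bestQ = q; bestA = candidate; }
  }
  return bestA;
}

const PHI = 0.05;
const action = bcqAction(0, 10, PHI, makeRng(1));
console.log(action);
```

Even though this toy critic would like to jump to 0.8, the returned action can never leave [LO − Φ, HI + Φ], because the only things ever scored are generator samples plus a bounded nudge.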

The critic is then trained with this constrained argmax inside the target (BCQ also softens it with a twin-critic min, the TD3 trick from lesson 18, to fight residual overestimation). The Bellman backup now only ever asks Q about actions the data supports — so the extrapolation error from lesson 19 is cut off at the source, not patched after the fact.
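A minimal sketch of that target, assuming two critics q1 and q2 and a list of already-perturbed candidate actions for s'. The weight lambda is an assumption beyond what this lesson states: lambda = 1 gives the pure twin-critic min described above, while values below 1 blend in the max of the two critics, which is how the original BCQ formulation is usually written. The toy critics and numbers are hypothetical.

```javascript
// Constrained target: max over candidates of a (weighted) pessimistic twin-critic value.
function bcqTarget(r, gamma, candidates, q1, q2, lambda) {
  let best = -Infinity;
  for (const a of candidates) {
    const low = Math.min(q1(a), q2(a));
    const high = Math.max(q1(a), q2(a));
    best = Math.max(best, lambda * low + (1 - lambda) * high);
  }
  return r + gamma * best;
}

// Hypothetical target critics evaluated at a fixed next state s'.
const q1 = (a) => 1.0 - (a - 0.3) ** 2;
const q2 = (a) => 0.8 - (a - 0.4) ** 2;
const candidates = [0.2, 0.3, 0.4];   // in-support proposals after perturbation

const yMin = bcqTarget(0.5, 0.9, candidates, q1, q2, 1.0);   // pure min
const yMax = bcqTarget(0.5, 0.9, candidates, q1, q2, 0.0);   // pure max, for contrast
console.log({ yMin, yMax });
```

The min variant yields the lower target of the two, which is the point: where the critics disagree, the backup trusts the more pessimistic one.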

Two dials, one interpolation
BCQ has a hidden knob in Φ (and in n). At Φ=0, n=1 the policy is behavioral cloning — it just replays the VAE, pure imitation (lesson 17). As Φ and n grow, it slides toward unconstrained Q-learning and its divergence. BCQ lives deliberately near the imitation end: it will only improve on the data where the data gives it room to.
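The interpolation can be checked on a toy case. The sketch below assumes a "VAE" that simply cycles through three logged actions, a perturbation that always pushes as far as Φ allows, and a critic that wrongly believes larger actions are always better. All three are hypothetical stand-ins picked to make the effect of each dial easy to read.

```javascript
// xi(a) is a raw direction in [-1, 1]; phi scales it into [-phi, phi].
function bcqPolicy(sample, xi, Q, n, phi) {
  const candidates = Array.from({ length: n }, () => {
    const a = sample();
    return a + phi * xi(a);
  });
  return candidates.reduce((best, c) => (Q(c) > Q(best) ? c : best));
}

// Stand-in generator: replays the logged actions 0.2, 0.3, 0.4 in order.
function makeSampler() {
  const logged = [0.2, 0.3, 0.4];
  let i = 0;
  return () => logged[i++ % logged.length];
}
const greedyXi = () => 1;   // always perturb upward by the full bound
const Q = (a) => a;         // critic with a phantom "bigger is better" ridge

const cloned = bcqPolicy(makeSampler(), greedyXi, Q, 1, 0);          // one sample, no nudge
const bestInSupport = bcqPolicy(makeSampler(), greedyXi, Q, 3, 0);   // best logged action
const offSupport = bcqPolicy(makeSampler(), greedyXi, Q, 3, 0.3);    // nudged past the data
console.log({ cloned, bestInSupport, offSupport });
```

With Φ = 0 and n = 1 the policy just returns the generator's sample. Raising n lets it pick the best logged action, and raising Φ lets the critic's phantom ridge pull the choice beyond anything in the data.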

Why this is the actor-side cure

Stay oriented on the fork (lesson 00). Offline RL has two distinct ways to keep the backup honest, and they touch different halves of actor–critic:

| Pole | What it constrains | Mechanism | Lesson |
| --- | --- | --- | --- |
| Policy-constraint | the actor: which actions may be proposed | generative model of πb + bounded perturbation; the max only ranges over in-support candidates | BCQ (here) |
| Value-regularization | the critic: what value OOD actions get | push Q down on the policy's actions so OOD spikes can never win the max in the first place | CQL (next) |

BCQ keeps the standard Bellman/TD update and clamps the action set. It is the cleaner fix to state but takes more machinery to build: a whole VAE plus a perturbation net plus twin critics. Lesson 21's CQL will ask whether we can get the same protection by editing the value function directly, with no generative model at all.

Interactive · BCQ vs vanilla offline Q-learning, same dataset

One state slice, a continuous action a ∈ [−1, 1]. The fixed dataset (blue ticks) only ever sampled actions inside a narrow support band near the good region; nothing was logged in the wide OOD zones on either side. Both learners share that exact dataset and the same true return curve (grey). The only difference is the action constraint:

Offline value learning on a frozen dataset — toggle the action constraint
Top: learned Qθ(a) (red) vs true return (grey), with the dataset's logged actions as ticks and the shaded support band. Bottom: max |Q| over training. Turn the constraint OFF and watch the red curve grow a phantom ridge in the unseen zones — that is the bug, and it is the lesson.
Readouts: train steps, max |Q|, chosen a, on support?
The core JS of the demo:

```javascript
// Q(a) is a small RBF regressor over the action axis (the "critic").
// DATA: logged actions live only inside the support band [lo,hi].
// The Bellman target maxes over a candidate set:
function candidates(constraint){
  if(constraint){                       // BCQ: VAE proposes only in-support actions
    return sampleVAE(n).map(a => clip(a + perturb(a, PHI), lo-PHI, hi+PHI));
  } else {                              // vanilla: max ranges over the WHOLE axis
    return linspace(-1, 1, n);
  }
}
function trainStep(constraint){
  for(const [a, r] of dataset){          // fit Q toward bootstrapped target
    const cs = candidates(constraint);
    const aMax = argmaxBy(cs, c => Q(c)); // <-- the leak (or its plug)
    const y = r + GAMMA * Q(aMax);        // self-referential backup
    Q.fitToward(a, y);                    // gradient step on the RBF weights
  }
}
// constraint OFF -> aMax can be an OOD action with phantom-high Q -> max|Q| blows up.
// constraint ON  -> aMax is always on the data manifold -> Q stays bounded.
```
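The demo code relies on helpers that are not shown (sampleVAE, perturb, Q.fitToward and so on), so it cannot be run on its own. Below is a self-contained analogue under loudly simplified assumptions: the RBF critic is swapped for a linear one, Q(a) = w0 + w1·a; the gradient step is swapped for an exact least-squares refit each iteration; the perturbation is dropped, so the constrained candidates are just the logged actions; and the reward is assumed to be r(a) = a on five logged actions in [0.1, 0.5]. None of this is the demo's actual implementation.

```javascript
const GAMMA = 0.9;
// [action, reward] pairs; assumed reward r(a) = a inside the support band.
const dataset = [0.1, 0.2, 0.3, 0.4, 0.5].map((a) => [a, a]);

// Exact least-squares fit of Q(a) = w0 + w1 * a to (a, y) pairs.
function fitLine(as, ys) {
  const n = as.length;
  const meanA = as.reduce((s, x) => s + x, 0) / n;
  const meanY = ys.reduce((s, x) => s + x, 0) / n;
  let cov = 0, varA = 0;
  for (let i = 0; i < n; i++) {
    cov += (as[i] - meanA) * (ys[i] - meanY);
    varA += (as[i] - meanA) ** 2;
  }
  const w1 = cov / varA;
  return { w0: meanY - w1 * meanA, w1 };
}

// Fitted Q iteration where the backup maxes over `candidates`.
function train(candidates, iters) {
  let w = { w0: 0, w1: 0 };
  const Q = (a) => w.w0 + w.w1 * a;        // always reads the latest weights
  let aMax = candidates[0];
  for (let k = 0; k < iters; k++) {
    aMax = candidates.reduce((best, c) => (Q(c) > Q(best) ? c : best));
    const boot = GAMMA * Q(aMax);          // bootstrapped part of the target
    w = fitLine(dataset.map((d) => d[0]), dataset.map((d) => d[1] + boot));
  }
  return { aMax, qAtBestLogged: Q(0.5) };
}

const inSupport = dataset.map((d) => d[0]);                          // constraint ON
const wholeAxis = Array.from({ length: 21 }, (_, i) => -1 + (2 * i) / 20); // constraint OFF
const on = train(inSupport, 300);
const off = train(wholeAxis, 300);
console.log({ on, off });
```

Under these assumptions the constrained run settles at Q(0.5) ≈ 5, which matches the true value of repeating the best logged action, 0.5 / (1 − 0.9). The unconstrained run bootstraps from a = 1, an action that was never logged, and reports roughly 9.5 for the same in-support action. A linear critic only overestimates here; it does not run away the way the demo's curve does, so treat this as an illustration of the mechanism rather than a reproduction of the widget.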

Tip: train with the constraint ON first and note max |Q| settling to a small number with the chosen action sitting in the green band. Then hit Reset, turn it OFF, and retrain — the same dataset now produces a runaway max |Q| and an action that bolts into a zone the data never visited. Sliding Φ up while ON shows the in-between: the perturbation slowly lets the policy peek outside the band, and the protection degrades smoothly.

The bug is the lesson
Nothing about the dataset changed between ON and OFF — same transitions, same rewards, same critic. The only difference is whether the backup is allowed to query OOD actions. That single permission is the entire gap between a stable offline value and a divergent one.

Where this goes next

BCQ constrains the actor: build a generative model of the behavior policy, propose only in-support candidates, and let a bounded perturbation improve them. The cost is the machinery — a VAE and a perturbation network on top of twin critics. Lesson 21's CQL attacks the same OOD-overestimation from the opposite pole: leave the action set alone and instead push the value function down on out-of-distribution actions, so the unconstrained max can never be fooled by a phantom in the first place. Policy-constraint vs value-pessimism — the two poles of offline RL.

Takeaway
BCQ cures lesson 19's extrapolation error by starving it at the source: a VAE models the behavior policy and proposes only data-supported candidate actions, a small bounded perturbation ξφ (size Φ) refines them, and the policy maximizes Q over only those candidates. Because the Bellman backup never queries Q at an out-of-distribution action, max|Q| stays bounded and the learned policy stays on the data manifold. This is the policy-constraint pole of offline RL, in contrast to CQL's value-pessimism coming next.