Offline RL — BCQ (Batch-Constrained Q-learning)
Lesson 19 named the disease: the Bellman backup queries Q(s', a') at actions the dataset never showed, those out-of-distribution estimates blow up, and there is no fresh interaction to correct them. BCQ's cure is almost embarrassingly direct — never consider an action the data doesn't support.
What broke (and where the error enters)
Recall the offline Q-learning target (lesson 05's bootstrap, now run on a frozen dataset D with no environment):
$$y = r + \gamma \max_{a'} Q(s', a'), \qquad (s, a, r, s') \in D$$
That max over a' is the leak. It searches over all actions, and a neural critic that has only ever been trained on the dataset's actions returns arbitrary values, some of them far too high, for actions far from the data (extrapolation error, lesson 19). The max then selects exactly those over-estimated phantoms and writes them into the target, the next backup propagates them, and with no real rollout to ever observe "no, that action was bad," the error compounds. The value estimate can diverge.
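To make the leak concrete, here is a minimal sketch of a single unconstrained max. It assumes a made-up return curve, a made-up five-point dataset, and a linear critic fit by least squares; none of these come from the lesson, they are only a stand-in for "a critic trained on a narrow band of actions".

```python
# Toy illustration of the leak. All numbers and function names are hypothetical.

def true_return(a):
    # Assumed ground truth: the best action is a = 0.3.
    return 1.0 - (a - 0.3) ** 2

def fit_line(xs, ys):
    # Ordinary least squares for y = w0 + w1 * x.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    w1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    w1 /= sum((x - mx) ** 2 for x in xs)
    return my - w1 * mx, w1

# The dataset only ever logged actions in the narrow band [0.0, 0.2].
data_actions = [0.00, 0.05, 0.10, 0.15, 0.20]
w0, w1 = fit_line(data_actions, [true_return(a) for a in data_actions])

def q_hat(a):
    # Linear critic: accurate inside the band, pure extrapolation outside it.
    return w0 + w1 * a

# Unconstrained max over every action in [-1, 1].
grid = [i / 100 for i in range(-100, 101)]
a_star = max(grid, key=q_hat)

print("chosen action:", a_star)
print("critic says:", q_hat(a_star), "truth:", true_return(a_star))
```

The return curve rises across the band, so the fitted line keeps rising outside it. The max therefore lands on the far edge of the action range, where the critic's estimate exceeds the true return.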
The fix: starve the error at its source
If the problem is that the max ranges over OOD actions, then shrink what the max is allowed to range over. BCQ replaces the unconstrained maximization with a maximization over only those actions that look like they came from the dataset. Formally, restrict the next action to the support of the behavior policy πb (the policy that generated D):
$$y = r + \gamma \max_{a' \,:\, \pi_b(a' \mid s') > 0} Q(s', a')$$
Read it: take the max over actions the behavior policy could plausibly have produced. If the critic is never queried off the manifold, it is never trained on a leaked target, and the lesson-19 blowup has nothing to feed on. This is the policy-constraint pole of offline RL: we constrain the actor (the set of actions it may ever propose) and leave the critic's learning rule untouched.
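In a small discrete problem the support restriction can be read straight off the dataset. The sketch below assumes a tabular setting with hypothetical state and action names, and it treats a next state with no logged actions as "no bootstrap"; both are assumptions for illustration, not part of the lesson.

```python
from collections import defaultdict

def constrained_target(r, s_next, q, seen_actions, gamma=0.99):
    """r + gamma * max of Q(s', a') over actions logged at s' in the dataset."""
    candidates = seen_actions.get(s_next, ())
    if not candidates:
        # Assumption: with no logged action at s', do not bootstrap at all.
        return r
    return r + gamma * max(q[(s_next, a)] for a in candidates)

# Hypothetical dataset of (s, a, r, s') tuples; "right" is never logged at s1.
dataset = [("s0", "left", 0.0, "s1"), ("s1", "left", 1.0, "s1")]

seen_actions = defaultdict(set)
for s, a, _, _ in dataset:
    seen_actions[s].add(a)

q = defaultdict(float)
q[("s1", "left")] = 2.0
q[("s1", "right")] = 50.0   # spurious estimate for an action the data never showed

y_constrained = constrained_target(0.0, "s1", q, seen_actions)
y_unconstrained = 0.0 + 0.99 * max(q[("s1", a)] for a in ("left", "right"))

print(y_constrained, y_unconstrained)
```

The constrained target never reads the spurious entry, so that entry cannot enter a backup. The unconstrained target copies it straight into the next update.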
How do you "restrict to the support" without the simulator?
We don't know πb in closed form; we only have its samples — the dataset's (s,a) pairs. So BCQ learns a generative model of it. Concretely:
- A VAE Gω models the behavior policy. Train a conditional variational auto-encoder on the dataset's (s, a) pairs to reconstruct the action a given the state s. Once trained, sampling a ∼ Gω(·|s) produces actions that look like the data at that state: a differentiable stand-in for πb(·|s).
- Propose n candidates per state. At a state s, draw {a1,…,an} ∼ Gω(·|s). These are the only actions BCQ will ever score — all of them sit on the data manifold by construction.
- A small perturbation network ξφ nudges each candidate. The VAE may not have sampled the exact best in-support action, so a learned perturbation ξφ(s, a) ∈ [−Φ, Φ] adds a bounded tweak. The bound Φ is the whole safety story: small Φ stays glued to the data; large Φ lets the perturbation wander off-support and re-opens the leak.
- The policy picks the perturbed candidate with the highest Q: π(s) = argmax over i of Q(s, ai + ξφ(s, ai)).
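The four steps above can be sketched as one action-selection routine. Everything below is a stand-in: `sample_generator` plays the trained VAE decoder, `perturbation` plays ξφ, `q_value` plays the critic (with a deliberately spurious ridge for a > 0.6), and the band [0.1, 0.4], the bound `PHI`, and the clip to [−1, 1] are assumed values chosen for illustration.

```python
import math
import random

PHI = 0.05  # assumed perturbation bound

def sample_generator(s, rng):
    # Stand-in for a ~ G_w(.|s): uniform over a hypothetical support band.
    return rng.uniform(0.1, 0.4)

def perturbation(s, a):
    # Stand-in for xi_phi(s, a): tanh keeps the output inside (-PHI, PHI).
    return PHI * math.tanh(3.0 * (0.3 - a))

def q_value(s, a):
    # Stand-in critic: sensible near the data, spurious rising ridge for a > 0.6.
    return 1.0 - (a - 0.3) ** 2 + 5.0 * max(0.0, a - 0.6)

def bcq_select_action(s, n=10, rng=None):
    if rng is None:
        rng = random.Random(0)
    # 1. Propose n in-support candidates.
    candidates = [sample_generator(s, rng) for _ in range(n)]
    # 2. Nudge each one by a bounded amount, clipped to the action range [-1, 1].
    perturbed = [max(-1.0, min(1.0, a + perturbation(s, a))) for a in candidates]
    # 3. Score only these candidates and return the best.
    return max(perturbed, key=lambda a: q_value(s, a))

action = bcq_select_action(s=None, n=10, rng=random.Random(0))
unconstrained = max((i / 100 for i in range(-100, 101)),
                    key=lambda a: q_value(None, a))

print("BCQ-style choice:", action)
print("unconstrained argmax:", unconstrained)
```

Because the critic is scored only at generated candidates, the spurious ridge is never visited. Raising `PHI` far enough would let perturbed candidates reach it, which is the trade-off the perturbation bullet describes.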
The critic is then trained with this constrained max inside the target. BCQ also softens it with a twin-critic min (the clipped double-Q trick from TD3, lesson 18; BCQ uses a weighted blend of the min and the max of the two critics) to fight residual overestimation. The Bellman backup now only ever asks Q about actions the data supports, so the extrapolation error from lesson 19 is cut off at the source, not patched after the fact.
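A sketch of how the twin critics might enter the target, assuming the weighted form in which each candidate is scored by λ·min(Q1, Q2) + (1 − λ)·max(Q1, Q2) before the max over candidates. The function name, the `done` flag, and the default λ = 0.75 are illustrative assumptions; setting λ = 1 recovers the plain twin-critic min.

```python
def bcq_target(r, done, q1_next, q2_next, gamma=0.99, lam=0.75):
    """Target for one transition.

    q1_next, q2_next: values of the two target critics at the n perturbed,
    in-support candidates for the next state s'.
    """
    per_candidate = [
        lam * min(a, b) + (1.0 - lam) * max(a, b)   # pessimistic blend of the twins
        for a, b in zip(q1_next, q2_next)
    ]
    # The max ranges over the generated candidates only, never all actions.
    return r + gamma * (1.0 - done) * max(per_candidate)

# Two candidates: the critics disagree strongly on both.
y = bcq_target(r=1.0, done=0.0, q1_next=[1.0, 2.0], q2_next=[3.0, 0.0], gamma=0.5)
print(y)
```

The blend leans toward the lower of the two estimates, so a candidate scores highly only when both critics rate it well.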
Why this is the actor-side cure
Stay oriented on the fork (lesson 00). Offline RL has two distinct ways to keep the backup honest, and they touch different halves of actor–critic:
| Pole | What it constrains | Mechanism | Lesson |
|---|---|---|---|
| Policy-constraint | the actor — which actions may be proposed | generative model of πb + bounded perturbation; the max only ranges over in-support candidates | BCQ (here) |
| Value-regularization | the critic — what value OOD actions get | push Q down on the policy's actions so OOD spikes can never win the max in the first place | CQL (next) |
BCQ keeps the standard Bellman/TD update and clamps the action set. It is the cleaner fix to state but takes more machinery to build: a whole VAE plus a perturbation net plus twin critics. Lesson 21's CQL will ask whether we can get the same protection by editing the value function directly, with no generative model at all.
Interactive · BCQ vs vanilla offline Q-learning, same dataset
One state slice, a continuous action a ∈ [−1, 1]. The fixed dataset (blue ticks) only ever sampled actions inside a narrow support band near the good region; nothing was logged in the wide OOD zones on either side. Both learners share that exact dataset and the same true return curve (grey). The only difference is the action constraint:
- Constraint OFF (vanilla): the target's max over a' ranges over all of [−1,1]. The critic extrapolates a rising ridge into the unseen zones, the max latches onto it, and max |Q| explodes while the chosen action flees off the data: the lesson-19 disease, live.
- Constraint ON (BCQ): the VAE proposes candidates only inside the support band (green); the max is taken over those alone. Q stays bounded near the true return and the chosen action (orange) stays on the data manifold.
Tip: train with the constraint ON first and note max |Q| settling to a small number with the chosen action sitting in the green band. Then hit Reset, turn it OFF, and retrain — the same dataset now produces a runaway max |Q| and an action that bolts into a zone the data never visited. Sliding Φ up while ON shows the in-between: the perturbation slowly lets the policy peek outside the band, and the protection degrades smoothly.
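For readers without the widget, the ON/OFF comparison can be approximated offline. This is a rough sketch under strong assumptions, not the widget's actual code: one self-looping state, a made-up reward curve, a linear critic refit by least squares at every sweep, and a hand-picked band standing in for the VAE's proposals. With such a simple critic the unconstrained value settles at an inflated fixed point rather than running away; a deep critic has no such guarantee.

```python
# Toy re-creation of the ON/OFF comparison. Every number here is made up.

def true_return(a):
    # Hypothetical per-step reward, at most 1.0 (reached at a = 0.3).
    return 1.0 - (a - 0.3) ** 2

def fit_line(xs, ys):
    # Ordinary least squares for y = w0 + w1 * x.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    w1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    w1 /= sum((x - mx) ** 2 for x in xs)
    return my - w1 * mx, w1

def fitted_q(constrained, iters=200, gamma=0.9):
    data = [0.00, 0.05, 0.10, 0.15, 0.20]        # logged actions (support band)
    grid = [i / 100 for i in range(-100, 101)]   # every action in [-1, 1]
    if constrained:
        # Stand-in for VAE proposals: only actions inside the logged band.
        candidates = [a for a in grid if min(data) <= a <= max(data)]
    else:
        candidates = grid
    w0, w1 = 0.0, 0.0
    for _ in range(iters):
        # Bootstrap from the best candidate under the current critic.
        bootstrap = max(w0 + w1 * a for a in candidates)
        targets = [true_return(a) + gamma * bootstrap for a in data]
        w0, w1 = fit_line(data, targets)
    chosen = max(candidates, key=lambda a: w0 + w1 * a)
    return chosen, w0 + w1 * chosen

off_action, off_q = fitted_q(constrained=False)
on_action, on_q = fitted_q(constrained=True)
best_possible = 1.0 / (1.0 - 0.9)   # reward <= 1 per step, discounted forever

print("OFF:", off_action, off_q)
print("ON: ", on_action, on_q)
print("upper bound on any true value:", best_possible)
```

With the constraint off, the chosen action sits at the edge of the action range and its estimated value (about 13.15) exceeds what any policy could earn (10). With the constraint on, the chosen action stays inside the band and the estimate (about 9.95) stays below that bound.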
Where this goes next
BCQ constrains the actor: build a generative model of the behavior policy, propose only in-support candidates, and let a bounded perturbation improve them. The cost is the machinery — a VAE and a perturbation network on top of twin critics. Lesson 21's CQL attacks the same OOD-overestimation from the opposite pole: leave the action set alone and instead push the value function down on out-of-distribution actions, so the unconstrained max can never be fooled by a phantom in the first place. Policy-constraint vs value-pessimism — the two poles of offline RL.