Offline RL — Conservative Q-Learning
BCQ (lesson 20) fixed the disease from the actor side — it forbade the policy from naming out-of-distribution actions. CQL attacks the same disease from the critic side: push Q down on OOD actions so the policy is never tempted by them in the first place.
What broke, recapped in one breath
Offline RL (lesson 19) learns from a frozen dataset D = {(s, a, r, s′)} collected by some unknown behavior policy β. The killer was extrapolation error: the Bellman target bootstraps through

$$y = r + \gamma \max_{a'} Q_\theta(s', a')$$

(with $\gamma$ the discount factor), and that $\max_{a'}$ happily reaches for actions the dataset never contains. On an OOD action $Q_\theta$ is a pure hallucination, the max seeks out the largest hallucination, and with no fresh interaction to falsify it the error compounds every iteration until Q blows up to fantasy values.
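The "max seeks out the largest hallucination" effect can be seen in a few lines. The sketch below assumes a deliberately simple noise model (every action has true value 0 and the estimate adds independent Gaussian error); the function name `max_bias_demo` and all parameters are illustrative, not part of the lesson's code.

```python
import numpy as np

def max_bias_demo(n_actions=10, n_trials=10_000, noise_std=1.0, seed=0):
    """Compare the average estimation error with the average max over actions."""
    rng = np.random.default_rng(seed)
    true_q = np.zeros(n_actions)  # assumption: every action is truly worth 0
    # Each row is one noisy estimate of Q(s, .) with zero-mean error per action.
    noisy_q = true_q + rng.normal(0.0, noise_std, size=(n_trials, n_actions))
    mean_error = noisy_q.mean()               # individual errors average out
    mean_of_max = noisy_q.max(axis=1).mean()  # the max picks the luckiest error
    return mean_error, mean_of_max

mean_error, mean_of_max = max_bias_demo()
print(f"average estimation error: {mean_error:+.3f}")
print(f"average max over actions: {mean_of_max:+.3f}  (true max is 0)")
```

The individual errors are unbiased, yet the max over them is clearly positive. Online, new data would correct the overestimate; offline, nothing does, and the bootstrap feeds it back into the next target.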
BCQ's answer (lesson 20): build a generative model of β, only ever let the policy consider actions that model would plausibly emit, so the dangerous maxa′ is taken over in-distribution actions only. It works — but it needs a VAE, a perturbation network, and the assumption that the generative model faithfully covers the data manifold. That is a lot of moving parts to fix one number that is too big.
The regularizer
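Standard offline fitted-Q minimizes the Bellman error on the dataset (with $\gamma$ the discount factor):

$$\min_\theta \; \mathbb{E}_{(s,a,r,s') \sim D}\Big[\big(Q_\theta(s,a) - (r + \gamma \max_{a'} Q_\theta(s',a'))\big)^2\Big]$$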
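CQL bolts one extra term in front of it. In words: minimize Q on actions sampled from the current policy (these are the suspect, possibly-OOD actions the policy would chase), while maximizing Q on the actions that actually appear in the dataset (these we trust, since they have real returns behind them):

$$\min_\theta \; \alpha \Big[\, \mathbb{E}_{s \sim D,\; a \sim \pi(\cdot \mid s)}\, Q_\theta(s,a) \;-\; \mathbb{E}_{(s,a) \sim D}\, Q_\theta(s,a) \Big] \;+\; \mathbb{E}_{(s,a,r,s') \sim D}\Big[\big(Q_\theta(s,a) - (r + \gamma \max_{a'} Q_\theta(s',a'))\big)^2\Big]$$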
Stare at the bracket. The first expectation samples actions from the policy we are learning — exactly the actions whose Q the policy will try to maximize, so exactly the ones at risk of being inflated. CQL pushes those down. The second expectation samples actions from the data — the trustworthy ones — and CQL pulls those up (the minus sign). The net effect is a wedge: data actions are held high, policy-proposed actions are squashed, and the wider the gap between "what the policy wants" and "what the data shows," the harder the squash.
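The two pieces of the objective can be written out for a tabular Q as a minimal sketch. It assumes discrete states and actions, a policy given as a probability table, and no terminal-state handling; `cql_loss` and the batch layout are illustrative names, not an established API.

```python
import numpy as np

def cql_loss(q, policy, batch, alpha, gamma=0.99):
    """Tabular CQL objective: alpha * (E_pi[Q] - E_data[Q]) + Bellman error.

    q:      [n_states, n_actions] table of Q-values
    policy: [n_states, n_actions] action probabilities, rows sum to 1
    batch:  dict of integer arrays "s", "a", "s_next" and float array "r"
    """
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]
    # Bellman error on dataset transitions (the term you already had).
    target = r + gamma * q[s_next].max(axis=1)
    bellman = np.mean((q[s, a] - target) ** 2)
    # First expectation: Q on actions the policy proposes at dataset states.
    q_policy = np.sum(policy[s] * q[s], axis=1).mean()
    # Second expectation: Q on the actions actually taken in the dataset.
    q_data = q[s, a].mean()
    penalty = alpha * (q_policy - q_data)
    return penalty + bellman, penalty, bellman

# One transition: in state 0 the data took action 0, but Q inflates action 1.
q = np.array([[1.0, 5.0], [0.0, 0.0]])
policy = np.array([[0.0, 1.0], [0.5, 0.5]])  # policy chases the inflated action
batch = {"s": np.array([0]), "a": np.array([0]),
         "r": np.array([1.0]), "s_next": np.array([1])}
total, penalty, bellman = cql_loss(q, policy, batch, alpha=0.5)
print(total, penalty, bellman)
```

Here the Bellman error is zero (the data action is fit perfectly), so the whole loss comes from the gap between what the policy wants (Q = 5) and what the data shows (Q = 1). Gradient descent on this loss would shrink that gap by lowering the inflated entry.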
Why this gives a conservative lower bound
The point of the wedge is provable: for a large enough α, the learned $Q_\theta$ lower-bounds the true $Q^\pi$ in expectation under the policy, that is, $\mathbb{E}_{a \sim \pi}\, Q_\theta(s,a) \le V^\pi(s)$ for every state $s$. Intuitively: we deliberately bias Q downward wherever the policy strays from the data, so the estimate can no longer be optimistic about the unknown. An agent that underestimates the value of actions it cannot verify will simply avoid them, which is precisely the behavior we want offline. Pessimism is the safe default when you cannot collect more data.
Contrast lesson 19's failure: vanilla offline Q-learning is an upper-biased estimator (the max selects the rosiest hallucination). CQL flips the sign of the bias. Over-pessimism never explodes; it only makes the agent timid. And timid is recoverable; divergent is not.
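The lower bound can be checked by hand in an idealized case: one state, exact rewards, no bootstrapping, and a behavior policy that covers every action. Under those assumptions (they are not the general proof), minimizing α[𝔼π Q − 𝔼β Q] + 𝔼β (Q − r)² action by action gives Q(a) = r(a) − (α/2)(π(a)/β(a) − 1). The numbers and the name `cql_bandit_q` below are illustrative.

```python
import numpy as np

def cql_bandit_q(reward, beta, pi, alpha):
    """Closed-form minimizer of the one-state CQL objective (requires beta > 0)."""
    return reward - 0.5 * alpha * (pi / beta - 1.0)

reward = np.array([1.0, 2.0, 1.5])  # true reward of each action
beta = np.array([0.5, 0.3, 0.2])    # how often the dataset took each action
pi = np.array([0.1, 0.7, 0.2])      # policy being evaluated; over-weights action 1

q = cql_bandit_q(reward, beta, pi, alpha=1.0)
v_true = float(pi @ reward)  # true value of the policy
v_cql = float(pi @ q)        # value under the conservative Q
print("Q under CQL:", q)
print("true V:", v_true, " conservative V:", v_cql)
```

With these numbers the true value is 1.8 and the conservative estimate is about 1.37. Note that the bound holds for the policy's expected value, not pointwise: action 0, which the policy under-weights relative to the data, ends up with Q = 1.4, above its true reward of 1.0.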
BCQ vs CQL — the two poles of offline RL
Lessons 20 and 21 are not rival hacks; they are the two clean ways to attack one disease. Memorize this table: it is the whole shape of offline RL.
| Axis | BCQ (lesson 20) | CQL (lesson 21) |
|---|---|---|
| Side of the fork it touches | Policy (actor) — constrain which a are allowed | Value (critic) — reshape Q itself |
| Mechanism | Generative model of β proposes candidates; act only over those | Regularizer pushes Q down on OOD, up on data |
| Core mantra | "Stay on the data manifold." | "Be pessimistic off the data manifold." |
| Extra machinery | VAE + perturbation net | One scalar penalty term α |
| Failure if overdone | Policy = clone of β (no improvement) | Q crushed flat (no improvement) |
| Failure if underdone | OOD actions leak in → blowup | α→0 → reverts to lesson 19 blowup |
They are policy-constraint vs value-pessimism — the same destination (don't trust unseen actions) reached from opposite ends of the actor–critic fork (lesson 00). CQL's appeal is simplicity: no second network to train, no manifold to model, just one penalty weight added to a loss you already had.
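The contrast can be caricatured at action-selection time. This is a sketch only: a boolean support mask stands in for BCQ's generative model, and a fixed per-action markdown stands in for the effect CQL's regularizer has on the learned Q. Both function names and all numbers are hypothetical.

```python
import numpy as np

def constrained_greedy(q_row, in_support):
    """Policy-constraint side: take the argmax over supported actions only."""
    masked = np.where(in_support, q_row, -np.inf)
    return int(np.argmax(masked))

def pessimistic_greedy(q_row, ood_markdown):
    """Value-pessimism side: plain argmax, but over a Q that was pushed down off-data."""
    return int(np.argmax(q_row - ood_markdown))

q_row = np.array([1.0, 2.0, 9.0])              # action 2 is an OOD hallucination
in_support = np.array([True, True, False])     # assumption: stand-in for the model of beta
ood_markdown = np.array([0.0, 0.0, 10.0])      # assumption: stand-in for the CQL penalty

naive = int(np.argmax(q_row))
bcq_like = constrained_greedy(q_row, in_support)
cql_like = pessimistic_greedy(q_row, ood_markdown)
print(naive, bcq_like, cql_like)
```

The naive argmax picks the hallucinated action; both fixes land on the best covered action, one by restricting the candidate set and the other by changing the values being compared.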
The conservatism weight α
Everything rides on one scalar. α trades blowup against blindness:
- α → 0: the regularizer vanishes; you are back to plain offline fitted-Q. OOD Q-values are free to inflate, the max chases them, and the value estimates blow up, exactly the lesson-19 failure.
- α just right: OOD Q-values are suppressed to around or below the data Q-values. The policy stops being lured off-manifold but can still distinguish good in-data actions from bad ones — so it improves.
- α huge: the push-down term dominates the Bellman term. Every Q — even for good data actions — gets crushed toward the floor. The value landscape goes flat, no action looks better than any other, and the policy cannot improve at all (over-pessimism).
Interactive · the conservatism dial
A toy with one state and a row of actions. The shaded green region marks actions the dataset covers (in-distribution); outside it is OOD. The grey dashed line is the true Q (which CQL never sees off-data). The blue bars are the learned Qθ after applying the CQL penalty. Drag α and watch: at α=0 the OOD bars spike far above truth (blowup, lesson 19); raise α and the OOD bars get suppressed toward / below the data bars; crank it to the max and every bar collapses flat — no useful policy left.
The lesson is in the orange marker. Near α=0 the greedy action sits outside the green band, chasing a hallucinated spike — the agent would execute an action no one ever tried, with no evidence it is safe. At a good α the marker snaps back inside the band and lands on the genuinely best covered action. Push α to the ceiling and every bar is the same height — the argmax is now arbitrary noise, and the "policy" is meaningless.
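A static stand-in for the dial can be sketched with a tabular one-state problem. This is not the widget's code; it makes several assumptions: rewards are known exactly on covered actions, OOD entries start at a hallucinated value of 5, the policy is a softmax over the current Q, and the step size is scaled by 1/(1 + α) to keep gradient descent stable. The name `fit_cql_toy` is illustrative.

```python
import numpy as np

def fit_cql_toy(alpha, steps=2000):
    """Gradient descent on a one-state CQL objective; returns (q, covered)."""
    covered = np.array([False, False, True, True, True, False, False])
    beta = covered / covered.sum()  # empirical data distribution
    reward = np.array([0.0, 0.0, 1.0, 2.0, 1.5, 0.0, 0.0])  # used on covered actions only
    q = np.where(covered, reward, 5.0)  # OOD entries start as hallucinated spikes
    lr = 0.5 / (1.0 + alpha)            # assumption: smaller steps for larger alpha
    for _ in range(steps):
        pi = np.exp(q - q.max())
        pi /= pi.sum()                  # current policy: softmax over Q
        # Penalty gradient (policy treated as fixed) plus Bellman-fit gradient.
        grad = alpha * (pi - beta) + 2.0 * beta * (q - reward)
        q = q - lr * grad
    return q, covered

for alpha in (0.0, 1.0, 50.0):
    q, covered = fit_cql_toy(alpha)
    greedy = int(np.argmax(q))
    spread = q[covered].max() - q[covered].min()
    print(f"alpha={alpha:5.1f}  greedy={greedy}  in_data={bool(covered[greedy])}  "
          f"spread_on_data={spread:.3f}")
```

In this sketch the greedy action sits on an OOD spike at α = 0, moves to the best covered action at α = 1, and at α = 50 the covered Q-values are squeezed to nearly the same height, so the ranking carries almost no signal. Unlike the dial described above, the OOD entries here keep sinking rather than flattening, since nothing in a tabular Q ties them to the covered entries.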
Where this leaves us
We have now closed the offline arc. Lesson 19 named the disease (extrapolation error from the OOD max); lesson 20 cured it from the policy side (BCQ, stay on the manifold); lesson 21 cured it from the value side (CQL, be pessimistic off the manifold). Both are instances of one principle — do not trust what the data cannot support — applied to the two halves of the actor–critic fork. That principle returns with real money on the line in lesson 60 (offline/conservative portfolio optimization) and behind the wheel in lesson 58 (offline RL for autonomous driving).