
Offline RL — Conservative Q-Learning

BCQ (lesson 20) fixed the disease from the actor side — it forbade the policy from naming out-of-distribution actions. CQL attacks the same disease from the critic side: push Q down on OOD actions so the policy is never tempted by them in the first place.

What broke, recapped in one breath

Offline RL (lesson 19) learns from a frozen dataset D = {(s, a, r, s′)} collected by some unknown behavior policy β. The killer was extrapolation error: the Bellman target bootstraps through

y = r + γ · max_{a′} Qθ(s′, a′)

and that max over a′ happily reaches for actions the dataset never contains. On an OOD action Qθ is a pure hallucination, the max seeks out the largest hallucination, and with no fresh interaction to falsify it the error compounds every iteration until Q blows up to fantasy values.
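
To make the failure concrete, here is a minimal sketch of one bootstrapped target, assuming a single next state with eight discrete actions. The Q-values, the coverage mask, and the names `nextQ`, `covered`, and `bellmanTarget` are illustrative assumptions, not values from any real dataset.

```javascript
// Hypothetical learned Q-values at one next state s′, one entry per action.
// Entries the dataset never covers are inflated guesses (hallucinations).
const nextQ   = [4.5, 1.4, 1.8, 1.6, 1.1, 4.6, 3.7, 3.2];
const covered = [false, true, true, true, true, false, false, false];

// y = r + γ · max over a′ of Q(s′, a′), optionally restricted by a mask.
function bellmanTarget(r, gamma, q, mask) {
  const candidates = mask ? q.filter((_, i) => mask[i]) : q;
  return r + gamma * Math.max(...candidates);
}

const r = 1.0, gamma = 0.9;
const naiveTarget  = bellmanTarget(r, gamma, nextQ, null);    // max over ALL actions
const inDataTarget = bellmanTarget(r, gamma, nextQ, covered); // max over covered only

console.log(naiveTarget.toFixed(2), inDataTarget.toFixed(2)); // prints "5.14 2.62"
```

The unrestricted max lands on an uncovered action, so the target inherits the hallucinated value, and the next round of regression copies it into Q at the current state.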

BCQ's answer (lesson 20): build a generative model of β and only ever let the policy consider actions that model would plausibly emit, so the dangerous max over a′ is taken over in-distribution actions only. It works, but it needs a VAE, a perturbation network, and the assumption that the generative model faithfully covers the data manifold. That is a lot of moving parts to fix one number that is too big.

The CQL idea in one sentence
If the problem is that Q is over-optimistic on OOD actions, don't police which actions the policy may pick — just train Q to be pessimistic there. Add a regularizer that pushes Q down on actions the current policy favors (suspected OOD) and pulls it back up on actions actually in the dataset. No generative model required.

The regularizer

Standard offline fitted-Q minimizes the Bellman error on the dataset:

𝓛Bellman(θ) = 𝔼_{(s,a,r,s′)∼D} [ ( Qθ(s,a) − (r + γ · Qθ̄(s′, π(s′))) )^2 ]

CQL bolts one extra term in front of it. In words: minimize Q on actions sampled from the current policy (these are the suspect, possibly-OOD actions the policy would chase), while maximizing Q on the actions that actually appear in the dataset (these we trust — they have real returns behind them):

min_θ   α · 𝔼_{s∼D} [ 𝔼_{a∼π(·|s)} Qθ(s,a)  −  𝔼_{a∼D} Qθ(s,a) ]  +  𝓛Bellman(θ)

Stare at the bracket. The first expectation samples actions from the policy we are learning — exactly the actions whose Q the policy will try to maximize, so exactly the ones at risk of being inflated. CQL pushes those down. The second expectation samples actions from the data — the trustworthy ones — and CQL pulls those up (the minus sign). The net effect is a wedge: data actions are held high, policy-proposed actions are squashed, and the wider the gap between "what the policy wants" and "what the data shows," the harder the squash.
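
The bracket can be sketched for a single state with a handful of discrete actions. This is a minimal illustration, not a training loop: the function name `cqlPenalty`, the arrays `policyProbs` and `dataProbs` (empirical action frequencies at that state), and all numbers are assumptions made up for the example.

```javascript
// Expectation of q under a discrete distribution p (both arrays over actions).
const expectation = (p, q) => p.reduce((sum, pi, i) => sum + pi * q[i], 0);

// The bracket for one state: E_{a~π} Q(s,a) − E_{a~D} Q(s,a).
// policyProbs: the current policy's action probabilities (assumed given).
// dataProbs:   empirical action frequencies in the dataset at this state.
function cqlPenalty(q, policyProbs, dataProbs) {
  return expectation(policyProbs, q) - expectation(dataProbs, q);
}

// Hypothetical numbers: action 3 never appears in the data but has an inflated Q.
const q = [1.0, 1.5, 1.2, 4.0];
const dataProbs   = [0.3, 0.5, 0.2, 0.0];
const greedyProbs = [0.0, 0.0, 0.0, 1.0]; // policy chasing the OOD spike
const cloneProbs  = dataProbs;            // policy that matches the data

const alpha = 1.0;
console.log(alpha * cqlPenalty(q, greedyProbs, dataProbs)); // large positive penalty
console.log(alpha * cqlPenalty(q, cloneProbs, dataProbs));  // 0: nothing to squash
```

The gradient of this term with respect to each Q entry is π(a|s) minus the data frequency of a, so gradient descent lowers Q where the policy puts more mass than the data and raises it where the data puts more mass than the policy.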

A note on the practical form
Papers usually write the push-down term as log Σ_a exp Qθ(s,a), a log-sum-exp (soft maximum) over actions, rather than a raw expectation over π. Same spirit: it is large exactly where some action's Q is large, so minimizing it flattens the optimistic peaks. We use the plain 𝔼_π Q − 𝔼_D Q form here because it makes the "down on policy actions, up on data actions" intuition transparent.
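
For comparison, here is a minimal sketch of the log-sum-exp form for one state with discrete actions. The helper names `logSumExp` and `cqlPenaltyLse`, the array `dataFreq` (empirical action frequencies), and the numbers are assumptions for illustration only.

```javascript
// Numerically stable log Σ_a exp(q[a]): subtract the max before exponentiating.
function logSumExp(q) {
  const m = Math.max(...q);
  return m + Math.log(q.reduce((sum, x) => sum + Math.exp(x - m), 0));
}

// Log-sum-exp push-down term for one state, minus the data term.
// dataFreq (empirical action frequencies at this state) is an assumed input.
function cqlPenaltyLse(q, dataFreq) {
  const dataValue = dataFreq.reduce((sum, p, i) => sum + p * q[i], 0);
  return logSumExp(q) - dataValue;
}

const flatQ   = [1.0, 1.0, 1.0, 1.0];
const spikedQ = [1.0, 1.0, 1.0, 4.0]; // one inflated, uncovered action
const dataFreq = [0.3, 0.5, 0.2, 0.0];

console.log(cqlPenaltyLse(flatQ, dataFreq));   // ln 4 ≈ 1.386
console.log(cqlPenaltyLse(spikedQ, dataFreq)); // larger: the spike dominates
```

Because log-sum-exp is never smaller than the largest entry, a single inflated action is enough to raise the penalty, with no need to sample actions from the policy.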

Why this gives a conservative lower bound

The point of the wedge is provable: for a large enough α, the learned Qθ lower-bounds the true value of the policy. The guarantee is on the state value, 𝔼_{a∼π} Qθ(s,a) ≤ V^π(s), rather than on every individual Q^π(s,a). Intuitively: we deliberately bias Q downward wherever the policy strays from the data, so the estimate can no longer be optimistic about the unknown. An agent that underestimates the value of actions it cannot verify will simply avoid them, which is precisely the behavior we want offline. Pessimism is the safe default when you cannot collect more data.
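
A short calculation shows where the downward bias comes from. It is a sketch under simplifying assumptions that are not part of the lesson's setup: a tabular Q, a fixed Bellman target y(s,a), and a behavior policy with β(a|s) > 0 for every action π can pick. Setting the derivative of the CQL objective with respect to a single entry Q(s,a) to zero gives the first line; averaging over a ∼ π gives the second.

```latex
\alpha\,\bigl(\pi(a\mid s) - \beta(a\mid s)\bigr)
  + 2\,\beta(a\mid s)\,\bigl(Q(s,a) - y(s,a)\bigr) = 0
\quad\Longrightarrow\quad
Q(s,a) = y(s,a) - \frac{\alpha}{2}\left(\frac{\pi(a\mid s)}{\beta(a\mid s)} - 1\right)

\mathbb{E}_{a\sim\pi}\,Q(s,a)
  = \mathbb{E}_{a\sim\pi}\,y(s,a)
  - \frac{\alpha}{2}\left(\sum_a \frac{\pi(a\mid s)^2}{\beta(a\mid s)} - 1\right)
  \;\le\; \mathbb{E}_{a\sim\pi}\,y(s,a)
```

The inequality holds because Σ_a π²/β ≥ 1 by Cauchy-Schwarz, with equality only when π = β. The correction is therefore zero when the policy matches the data and grows as π concentrates on actions β rarely takes. Individual actions with π < β are pushed up rather than down, which is why the bound is stated for the state value and not per action. This covers one regression step against a fixed target; it is not a proof of the full guarantee.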

Contrast lesson 19's failure: vanilla offline Q-learning is an upper-biased estimator (the max selects the rosiest hallucination). CQL flips the sign of the bias. Over-pessimism never explodes; it only makes the agent timid. And timid is recoverable; divergent is not.

BCQ vs CQL — the two poles of offline RL

Lessons 20 and 21 are not rival hacks; they are the two clean ways to attack one disease. Memorize this table: it is the whole shape of offline RL.

| Axis | BCQ (lesson 20) | CQL (lesson 21) |
| --- | --- | --- |
| Side of the fork it touches | Policy (actor): constrain which a are allowed | Value (critic): reshape Q itself |
| Mechanism | Generative model of β proposes candidates; act only over those | Regularizer pushes Q down on OOD, up on data |
| Core mantra | "Stay on the data manifold." | "Be pessimistic off the data manifold." |
| Extra machinery | VAE + perturbation net | One scalar penalty term α |
| Failure if overdone | Policy = clone of β (no improvement) | Q crushed flat (no improvement) |
| Failure if underdone | OOD actions leak in → blowup | α→0 → reverts to lesson 19 blowup |

They are policy-constraint vs value-pessimism — the same destination (don't trust unseen actions) reached from opposite ends of the actor–critic fork (lesson 00). CQL's appeal is simplicity: no second network to train, no manifold to model, just one penalty weight added to a loss you already had.

The conservatism weight α

Everything rides on one scalar: α trades blowup against blindness.

The knife-edge
CQL has no free lunch: it converts BCQ's "which actions" question into a "how much pessimism" question. Tune α too low and you keep the divergence; too high and you trade a divergent agent for a useless one. The widget below is that exact dial.

Interactive · the conservatism dial

A toy with one state and a row of actions. The shaded green region marks actions the dataset covers (in-distribution); outside it is OOD. The grey dashed line is the true Q (which CQL never sees off-data). The blue bars are the learned Qθ after applying the CQL penalty. Drag α and watch: at α=0 the OOD bars spike far above truth (blowup, lesson 19); raise α and the OOD bars get suppressed toward / below the data bars; crank it to the max and every bar collapses flat — no useful policy left.

CQL: suppressing OOD Q-values with the conservatism weight α
Green band = actions the dataset covers. Grey dashed = true Q. Blue = learned Qθ. The orange marker shows the action the greedy policy (argmax Qθ) would pick — watch it leave the data band when α is too small, and become meaningless when α is too large.
The JS that runs this widget:
// One state. Each action has a TRUE Q; the dataset only COVERS some of them.
const trueQ   = [0.9, 1.4, 1.8, 1.6, 1.1, 0.7, 0.5, 0.4];   // ground truth
const covered = [false,true, true, true, true, false,false,false]; // in dataset?

// Vanilla offline fitted-Q HALLUCINATES on OOD actions: optimistic spikes.
const hallucination = [3.6, 0, 0, 0, 0, 3.9, 3.2, 2.8];  // big where uncovered

function learnedQ(alpha){
  return trueQ.map((q,i) => {
    if (covered[i]) return q - 0.12*alpha;        // data actions: gently pressed
    const optimistic = q + hallucination[i];      // OOD: starts inflated (α=0 ⇒ blowup)
    return optimistic - alpha*(hallucination[i]+1.6); // CQL pushes OOD DOWN ∝ α
  });
}                                                  // α huge ⇒ everything crushed flat

The lesson is in the orange marker. Near α=0 the greedy action sits outside the green band, chasing a hallucinated spike — the agent would execute an action no one ever tried, with no evidence it is safe. At a good α the marker snaps back inside the band and lands on the genuinely best covered action. Push α to the ceiling and every bar is the same height — the argmax is now arbitrary noise, and the "policy" is meaningless.
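
The marker's behavior can be reproduced without the widget by sweeping α over the toy model. The sketch below copies the arrays and `learnedQ` from the snippet above so that it runs on its own; `argmax`, `probe`, and the particular α values swept are hypothetical additions for illustration.

```javascript
// Copies of the widget's toy arrays and learnedQ, so this block runs on its own.
const trueQ   = [0.9, 1.4, 1.8, 1.6, 1.1, 0.7, 0.5, 0.4];
const covered = [false, true, true, true, true, false, false, false];
const hallucination = [3.6, 0, 0, 0, 0, 3.9, 3.2, 2.8];

function learnedQ(alpha) {
  return trueQ.map((q, i) =>
    covered[i] ? q - 0.12 * alpha
               : q + hallucination[i] - alpha * (hallucination[i] + 1.6));
}

// Index of the largest entry (first one wins on ties).
const argmax = (xs) => xs.reduce((best, x, i) => (x > xs[best] ? i : best), 0);

// Hypothetical helper: greedy action plus the "max OOD Q − max data Q" gap.
function probe(alpha) {
  const q = learnedQ(alpha);
  const maxOf = (keep) => Math.max(...q.filter((_, i) => covered[i] === keep));
  const greedy = argmax(q);
  return { alpha, greedy, inData: covered[greedy], gap: maxOf(false) - maxOf(true) };
}

for (const alpha of [0, 0.25, 0.5, 0.75, 1]) {
  const { greedy, inData, gap } = probe(alpha);
  console.log(`α=${alpha} greedy=${greedy} inData=${inData} gap=${gap.toFixed(2)}`);
}
```

With these numbers the greedy action is uncovered at α = 0 (action 5) and still uncovered at α = 0.5 (action 0), then moves to covered action 2 by α = 0.75, at which point the gap has turned negative. Note that `learnedQ` as excerpted only lowers the bars linearly in α, so the flat collapse at the top of the dial presumably comes from widget code not shown here.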

Where this leaves us

We have now closed the offline arc. Lesson 19 named the disease (extrapolation error from the OOD max); lesson 20 cured it from the policy side (BCQ, stay on the manifold); lesson 21 cured it from the value side (CQL, be pessimistic off the manifold). Both are instances of one principle — do not trust what the data cannot support — applied to the two halves of the actor–critic fork. That principle returns with real money on the line in lesson 60 (offline/conservative portfolio optimization) and behind the wheel in lesson 58 (offline RL for autonomous driving).

Takeaway
CQL adds a single regularizer to the Bellman loss (push Q down on the policy's suspected-OOD actions, up on the dataset's actions), yielding a conservative lower bound on Q with no generative model needed. It is the value-pessimism pole of offline RL, the mirror image of BCQ's policy-constraint pole. Its whole difficulty collapses into one scalar α: too small reverts to the lesson-19 blowup, too large crushes every Q flat so the policy can't improve. Conservatism is a dial you must aim, not a switch you flip.