
Frontier — offline (batch) RL

Every method so far needed a live environment to try things in. But you cannot let a learning agent experiment on a patient, a self-driving car in traffic, or a trading book. Offline RL learns a policy from a fixed, pre-collected dataset — no new interaction at all — and that single restriction breaks the Bellman backup in a subtle, compounding way.

What broke

DDPG, TD3, SAC (lesson 18), DQN (lesson 05), every policy-gradient method (lessons 06, 07) — all of them assume the same thing: the agent acts in the world, sees what happens, and tries again. That online loop is how it discovers that a bad action is bad. Take the loop away and you are in the offline (or batch) setting:

The offline setting
You are handed a static dataset D = { (s, a, r, s′) } collected by some behavior policy πβ (a human, an old controller, a logging system). You must output a good policy π using only D. You may never call env.step(). No exploration, no corrections, no new data — ever.
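As a rough sketch of what "using only D" looks like in code: the dataset is just an array of logged transitions, and the training loop can sample from it but has nothing else to call. The field names (`s`, `a`, `r`, `sNext`), the three made-up rows, and the `sampleBatch` helper are illustrative assumptions, not part of any particular library.

```javascript
// Hypothetical logged transitions (s, a, r, s′); the values are made up for illustration.
const D = [
  { s: 0, a: 1, r: 0.0, sNext: 1 },
  { s: 1, a: 1, r: 0.5, sNext: 2 },
  { s: 2, a: 0, r: 1.0, sNext: 2 },
];

// Draw a minibatch uniformly, with replacement, from the fixed dataset.
// There is no environment object anywhere in this file: D is all we have.
function sampleBatch(dataset, batchSize, rand = Math.random) {
  const batch = [];
  for (let i = 0; i < batchSize; i++) {
    batch.push(dataset[Math.floor(rand() * dataset.length)]);
  }
  return batch;
}

const batch = sampleBatch(D, 4);
console.log(batch.length); // 4 rows, each one a row that was already in D
```

However long training runs, every row it sees is one of the rows the behavior policy already logged.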

This is exactly what you want for medicine (learn a treatment policy from hospital records), driving (learn from millions of logged human-driven miles), and finance (learn from historical trades). It is also where our cleanest algorithm quietly detonates.

"But I already learned this — it's just off-policy"

A reasonable first thought: we solved learning-from-other-data back in lesson 12. Off-policy learning lets us train π from data generated by a different policy πβ. Q-learning is off-policy by construction (the max in its target bootstraps toward the greedy action, not the one that was logged). So just run Q-learning on D. Done?

Off-policy is necessary but NOT sufficient
Off-policy methods were designed to learn from data while still collecting more. When a bad estimate appears, the next rollout visits that region and the error gets corrected. Offline RL removes the collection. The error-correction channel is gone — and that is the whole problem.

The importance-sampling fix from lesson 12 does not save us either. To reweight returns from πβ to π you multiply by the ratio ρ = π/πβ at every step of the trajectory. When the learned policy wanders far from the behavior policy (precisely what we want it to do, to be better than πβ), that product of ratios explodes and the variance of the estimate grows until the estimate is useless. This is lesson 12's blowup, now with no fresh samples to tame it.
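To put rough numbers on that blowup, here is a minimal sketch that multiplies per-step ratios along one trajectory. The setup is assumed purely for illustration: πβ is uniform over 4 actions, the learned policy puts 0.9 on its favorite action, and the horizon is 10 steps. `trajectoryWeight` is a hypothetical helper name.

```javascript
// Product of per-step ratios π(a|s) / πβ(a|s) along one logged trajectory.
// piProbs[t] and betaProbs[t] are each policy's probability of the action logged at step t.
function trajectoryWeight(piProbs, betaProbs) {
  let rho = 1;
  for (let t = 0; t < piProbs.length; t++) {
    rho *= piProbs[t] / betaProbs[t];
  }
  return rho;
}

// Assumed toy numbers: πβ uniform over 4 actions (0.25 each); π puts 0.9 on its
// favorite action and splits the remaining 0.1 evenly over the other three.
const H = 10;
const beta = Array(H).fill(0.25);
const allFavorite = Array(H).fill(0.9); // every logged action is π's favorite
const oneMiss = allFavorite.slice();
oneMiss[0] = 0.1 / 3;                   // a single non-favorite action at step 0

const wHit = trajectoryWeight(allFavorite, beta); // 3.6^10, roughly 3.7e5
const wMiss = trajectoryWeight(oneMiss, beta);    // 27 times smaller than wHit
console.log(wHit.toExponential(2), wMiss.toExponential(2));
```

Under the uniform behavior policy, an all-favorite trajectory of length 10 has probability 0.25^10, about one in a million, yet it carries a weight in the hundreds of thousands. An estimate dominated by a few such rare, enormous weights is what "variance explodes" means in practice.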

The real villain: distributional shift & extrapolation error

Recall the Q-learning / TD target — the bootstrap from lesson 05, in its continuous-control form (lesson 18):

y = r + γ · Qθ(s′, a′)    where  a′ = argmax_a Qθ(s′, a)  (or a′ ∼ π(·|s′))

Read where the action a′ comes from. It is not from the dataset. It is whatever action the current learned policy thinks is best at s′. The reward r and the state s′ are real (logged), but a′ is chosen by the policy we are training.

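Now suppose a′ is an action the dataset never showed at s′: an out-of-distribution (OOD) action. The network Qθ has never received a single gradient telling it what that action is worth. A neural net asked to evaluate an input far from its training data does not say "I don't know"; it extrapolates, and the extrapolated value is essentially arbitrary, too low for some actions and too high for others. The max then singles out whichever errors fall on the high side, so the error that enters the target is biased upward: some OOD action gets an absurdly high Q-value by accident, and that is the one the backup picks.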
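One way to see why the errors that matter end up on the high side is to take the max over estimates whose noise is zero-mean. The sketch below assumes every action has a true value of 0 and adds Gaussian noise to each estimate; the seeded generator (`mulberry32`), the noise level, and the trial count are all illustrative choices.

```javascript
// Small seeded PRNG (mulberry32) so the experiment is reproducible.
function mulberry32(a) {
  return function () {
    let t = (a += 0x6d2b79f5);
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Standard normal sample via the Box-Muller transform.
function gaussian(rand) {
  const u = 1 - rand(); // in (0, 1], so Math.log(u) is finite
  const v = rand();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Average of max_a (0 + noise_a) over many trials: the true value of every action is 0.
function meanMaxOfNoisyEstimates(numActions, noiseStd, trials, rand) {
  let total = 0;
  for (let i = 0; i < trials; i++) {
    let best = -Infinity;
    for (let a = 0; a < numActions; a++) {
      best = Math.max(best, noiseStd * gaussian(rand));
    }
    total += best;
  }
  return total / trials;
}

const rand = mulberry32(42);
const oneAction = meanMaxOfNoisyEstimates(1, 1.0, 20000, rand);     // near 0: nothing to max over
const elevenActions = meanMaxOfNoisyEstimates(11, 1.0, 20000, rand); // clearly above 0
console.log(oneAction.toFixed(2), elevenActions.toFixed(2));
```

Each individual estimate is unbiased, yet the max over eleven of them sits well above the true value of 0. The backup does not need the network to err upward on average; it only needs some action to be over-estimated, and the max will find it.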

Online RL would catch this instantly: the policy, now greedy toward that fantasy action, would try it, receive the real (low) reward, and the Q-value would snap back down. Offline, there is no trying. So the lie persists — and worse, it feeds itself:

  OOD action a′ gets an over-estimated Q (extrapolation)
        │
        ▼
  argmax / policy now PREFERS that fantasy action
        │
        ▼
  it becomes the bootstrap target  y = r + γ·Q(s′, a′)   ← target inflates
        │
        ▼
  fit Q(s,a) toward the inflated y   → over-estimate spreads to (s,a)
        │
        └──────────► next backup queries an even more OOD action ...  Q → ∞
Extrapolation error compounds through the Bellman backup
Each Bellman iteration takes the max over actions and copies the most over-estimated value backward one step. With no environment to refute it, the over-estimate is never corrected — it accumulates across iterations and propagates across states. The values diverge to nonsense, the policy chases the largest hallucination, and the agent that looked brilliant on paper is catastrophic in the world.

The trigger is distributional shift: the distribution of (s, a) pairs the policy queries drifts away from the distribution the dataset covers. The more the learned policy π deviates from the behavior policy πβ, the more OOD actions it queries, and the worse the extrapolation.
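For small discrete problems, this drift can be measured directly by counting how often the backup asks about a (state, action) pair that was never logged. The sketch below is a hypothetical diagnostic, not a standard API: `oodQueryRate`, the transition field names, and the two-row dataset are assumptions. With continuous states or actions, an exact lookup like this would need to be replaced by some density or support estimate.

```javascript
// Fraction of bootstrap queries (s′, a′) whose action was never logged in that state.
// dataset: array of { s, a, r, sNext }; greedyAction: function mapping a state to an action.
function oodQueryRate(dataset, greedyAction) {
  // Support of the data: which actions were logged in which states.
  const seen = new Set(dataset.map((t) => `${t.s}|${t.a}`));
  let ood = 0;
  for (const t of dataset) {
    const aPrime = greedyAction(t.sNext);
    if (!seen.has(`${t.sNext}|${aPrime}`)) ood++;
  }
  return ood / dataset.length;
}

// Made-up two-state dataset in which the behavior policy always took action 0.
const logged = [
  { s: 0, a: 0, r: 0, sNext: 1 },
  { s: 1, a: 0, r: 1, sNext: 0 },
];

const stayClose = oodQueryRate(logged, () => 0); // policy copies πβ: rate 0
const drift = oodQueryRate(logged, () => 1);     // policy prefers an unlogged action: rate 1
console.log(stayClose, drift);
```

A rate near 0 means the backup is mostly asking about things the data can answer; a rate near 1 means almost every target rests on extrapolation.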

Interactive · offline Q-learning detonating on a fixed dataset

No environment here, just a fixed dataset and the Bellman backup. There are 11 actions per state. The behavior policy only ever logged actions in a central band, and the coverage knob sets how wide that band is. We run offline fitted Q-iteration: each sweep refits Q toward r + γ·max_a Q(s′, a), taking the max over all actions, including ones the data never showed.

Watch the orange bars (OOD actions, outside the logged band). At low coverage they get queried by the max, extrapolate upward, and blow up over iterations, dragging the whole value estimate far above the true Q. Slide coverage to full and the same algorithm is perfectly fine. The bug is the lesson: it is not the algorithm, it is the missing data.

Fitted Q-iteration on a static dataset · the OOD blow-up
Each Bellman sweep refits Q(s,a) ← r + γ·max_a′ Q(s′,a′) for the one fixed state shown. Green bars = actions the dataset covers; orange = OOD actions the behavior policy never logged. The dashed line is the true Q. Lower the coverage and the orange bars run away.
The JS that runs this widget:
// 11 actions; behavior policy only logged a central band of width = coverage.
// trueQ is a smooth hump peaking at the middle action.
// N, Q, trueQ, inData, gamma, LR and OOD_BIAS are widget state defined elsewhere (not shown).
function sweep() {
  // bootstrap term: max over ALL actions of Q at the next state.
  // (one-state toy: next state ≈ this state, so we max over our own Q)
  const maxNext = Math.max(...Q);             // ← the lesson 05 max
  const next = Q.slice();
  for (let a = 0; a < N; a++) {
    if (inData[a]) {
      // covered: logged rewards anchor the fit, so the toy pulls Q[a] toward trueQ[a]
      // and leaves the bootstrap term out for covered actions
      next[a] = (1 - LR) * Q[a] + LR * trueQ[a];
    } else {
      // OOD: no reward to anchor. The toy models the net extrapolating the bootstrap upward.
      next[a] = (1 - LR) * Q[a] + LR * (gamma * maxNext + OOD_BIAS);
    }
  }
  Q = next;                                   // no env.step() ever, so the error can't be refuted
}
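The excerpt above depends on widget state that is not shown, so here is a self-contained variant that can be run as-is. Every constant in it (`GAMMA`, `LR`, `OOD_BIAS`, the shape of `trueQ`, the sweep count) is an assumed value chosen for illustration, not the widget's actual configuration, and `runToy` is a hypothetical name.

```javascript
// All constants are assumptions for illustration; the widget's real values are not shown.
const N = 11;         // actions per state
const GAMMA = 0.9;    // discount
const LR = 0.5;       // step size of each refit
const OOD_BIAS = 0.5; // assumed upward extrapolation on unlogged actions

// Smooth hump peaking at the middle action, with peak value 1.
const trueQ = Array.from({ length: N }, (_, a) => Math.exp(-((a - 5) ** 2) / 8));

// Run `sweeps` backups with a central band of `coverage` logged actions; return max Q.
function runToy(coverage, sweeps) {
  const lo = Math.floor((N - coverage) / 2);
  const inData = trueQ.map((_, a) => a >= lo && a < lo + coverage);
  let Q = new Array(N).fill(0);
  for (let k = 0; k < sweeps; k++) {
    const maxNext = Math.max(...Q); // max over ALL actions, logged or not
    Q = Q.map((q, a) =>
      inData[a]
        ? (1 - LR) * q + LR * trueQ[a]                     // covered: anchored by data
        : (1 - LR) * q + LR * (GAMMA * maxNext + OOD_BIAS) // OOD: inflated bootstrap
    );
  }
  return Math.max(...Q);
}

const lowCoverage = runToy(3, 200);   // 8 of the 11 actions are OOD
const fullCoverage = runToy(11, 200); // every action is covered
console.log(lowCoverage.toFixed(2), fullCoverage.toFixed(2)); // 5.00 1.00
```

With full coverage the max settles at the true peak of 1. With a band of three actions it climbs to about 5, five times the true value. In this simplified variant the bias is a fixed constant and GAMMA is below 1, so the runaway saturates near OOD_BIAS / (1 − GAMMA) rather than growing without bound; with a real function approximator the extrapolation error need not stay constant.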

Three families of fixes

Every offline-RL method is, at heart, a way to stop the backup from trusting OOD actions. They split into three families — and the next two lessons take the first two in depth.

Family | Idea | Where it acts | Lesson
(a) Policy constraint | Force π to stay close to πβ so it only ever proposes in-data actions. Never let the backup query an OOD a′. | the actor / action selection | BCQ (lesson 20)
(b) Value pessimism | Don't trust optimistic Q-values; push Q down on OOD actions so the policy is never lured toward them. Learn a conservative lower bound. | the critic / value function | CQL (lesson 21)
(c) Uncertainty penalty | Estimate how uncertain Q is (e.g. disagreement across an ensemble) and subtract that uncertainty from the value. OOD regions are uncertain, so they get penalized automatically. | the value target | (model-based offline RL; touched in 22)

The two poles
Families (a) and (b) are the same fight from opposite ends. BCQ constrains the policy so the OOD actions are never asked about; CQL makes the value conservative so OOD actions are never worth asking about. Either way the principle is pessimism in the face of missing data, the offline mirror image of the optimism that drove exploration in lesson 08. Online RL is optimistic so it explores; offline RL must be pessimistic because it cannot.
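To make the contrast concrete, the sketch below applies two of these ideas to a single bootstrap step. It is a toy illustration of the principle, not BCQ or CQL themselves: the Q-values, the support mask, the three-member ensemble, and the function names are all assumptions.

```javascript
// Assumed toy values: estimated Q for 5 actions at one next state s′.
// Actions 1..3 were logged at s′; actions 0 and 4 are OOD and over-estimated.
const qNext = [9.0, 0.8, 1.0, 0.7, 7.5];
const inData = [false, true, true, true, false];

// Naive backup: max over every action, including the OOD ones.
function naiveMax(q) {
  return Math.max(...q);
}

// (a) Policy-constraint flavor: only actions with data support may be proposed.
function inSupportMax(q, supported) {
  return Math.max(...q.filter((_, a) => supported[a]));
}

// (c) Uncertainty-penalty flavor: score each action by ensemble mean minus
// beta times the ensemble standard deviation, then take the max.
// ensemble[m][a] is member m's estimate for action a.
function penalizedMax(ensemble, beta) {
  const numActions = ensemble[0].length;
  let best = -Infinity;
  for (let a = 0; a < numActions; a++) {
    const vals = ensemble.map((member) => member[a]);
    const mean = vals.reduce((x, y) => x + y, 0) / vals.length;
    const variance = vals.reduce((x, y) => x + (y - mean) ** 2, 0) / vals.length;
    best = Math.max(best, mean - beta * Math.sqrt(variance));
  }
  return best;
}

// Hypothetical ensemble: members agree where there is data and disagree on OOD actions.
const ensemble = [
  [9.0, 0.8, 1.0, 0.7, 7.5],
  [1.0, 0.9, 1.1, 0.6, -2.0],
  [-4.0, 0.7, 0.9, 0.8, 3.0],
];

console.log(naiveMax(qNext));             // 9, taken from OOD action 0
console.log(inSupportMax(qNext, inData)); // 1, taken from logged action 2
console.log(penalizedMax(ensemble, 2));   // about 0.84, again from logged action 2
```

Both variants end up bootstrapping from a logged action, one by refusing to look at unsupported actions and the other by making disagreement expensive. Value pessimism (family b) aims for the same outcome by pushing the OOD entries of Q down during training rather than filtering or penalizing them at backup time.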

Why this is the right next chapter

We have now removed the last comfortable assumption. Lesson 17 removed the reward signal (imitation/IRL); lesson 18 removed the discrete action set (continuous control); lesson 19 removes the environment itself. What remains is data and the Bellman equation — and we have just seen that the Bellman max, the engine that made Q-learning work online, is exactly what makes it fail offline.

Takeaway
Offline RL learns from a fixed dataset with no new interaction, which is essential when acting to learn is dangerous or impossible. The core failure is distributional shift / extrapolation error: the Bellman bootstrap (the max from lesson 05) queries Q at out-of-distribution actions the data never showed; the max favors whichever of those extrapolated values are too high, and with no environment to refute them the over-estimate compounds through every backup until the values diverge. Off-policy learning (lesson 12) is necessary but not sufficient. The fix is pessimism, in one of three flavors: constrain the policy (BCQ, lesson 20), regularize the value (CQL, lesson 21), or penalize by uncertainty.