Frontier — offline (batch) RL
Every method so far needed a live environment to try things in. But you cannot let a learning agent experiment on a patient, a self-driving car in traffic, or a trading book. Offline RL learns a policy from a fixed, pre-collected dataset — no new interaction at all — and that single restriction breaks the Bellman backup in a subtle, compounding way.
What broke
DDPG, TD3, SAC (lesson 18), DQN (lesson 05), every policy-gradient method (lessons 06, 07) — all of them assume the same thing: the agent acts in the world, sees what happens, and tries again. That online loop is how it discovers that a bad action is bad. Take the loop away and you are in the offline (or batch) setting:
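You are handed a fixed dataset D of logged transitions (s, a, r, s′), collected by some behavior policy πβ, and you may never call env.step(). No exploration, no corrections, no new data, ever.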
This is exactly what you want for medicine (learn a treatment policy from hospital records), driving (learn from millions of logged human-driven miles), and finance (learn from historical trades). It is also where our cleanest algorithm quietly detonates.
"But I already learned this — it's just off-policy"
A reasonable first thought: we solved learning-from-other-data back in lesson 12. Off-policy learning lets us train π from data generated by a different policy πβ. Q-learning is off-policy by construction (the max in its target bootstraps toward the greedy action, not the one that was logged). So just run Q-learning on D. Done? Not quite: off-policy methods still assume the data covers the actions the learned policy wants to evaluate, and a fixed dataset gives no such guarantee.
The importance-sampling fix from lesson 12 does not save us either. To reweight returns from πβ to π you multiply by the ratio ρ = π/πβ at every step along the trajectory. When the learned policy wanders far from the behavior policy (precisely what we want it to do, to be better than πβ), that product of ratios explodes, and the variance of the estimate grows until the estimate is useless. This is lesson 12's blowup, now with no fresh samples to tame it.
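To put a number on that blowup, here is a minimal sketch under strong simplifying assumptions: both policies are state-independent distributions over two actions, and actions are drawn independently from πβ at every step. The function name `is_weight_variance` and the example probabilities are illustrative, not taken from lesson 12. Under those assumptions the per-step ratio has mean 1, so the variance of the trajectory weight has a closed form.

```python
def is_weight_variance(pi, beta, horizon):
    """Exact variance of prod_t pi(a_t)/beta(a_t) with a_t drawn i.i.d. from beta.

    pi, beta: lists of action probabilities (assumed state-independent).
    """
    # E_beta[rho^2] = sum_a beta(a) * (pi(a)/beta(a))^2 = sum_a pi(a)^2 / beta(a)
    second_moment = sum(p * p / b for p, b in zip(pi, beta))
    # Steps are independent and E_beta[rho] = 1, so Var = E[rho^2]^T - 1.
    return second_moment ** horizon - 1.0


behavior = [0.5, 0.5]   # logged policy: a coin flip
learned = [0.9, 0.1]    # learned policy: strongly prefers action 0

for horizon in (1, 10, 50):
    print(horizon, is_weight_variance(learned, behavior, horizon))
```

The per-step second moment here is 1.64, so the variance grows geometrically with the horizon. When `learned` equals `behavior` the second moment is exactly 1 and the variance stays at zero for any horizon.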
The real villain: distributional shift & extrapolation error
Recall the Q-learning / TD target, the bootstrap from lesson 05, in its continuous-control form (lesson 18):
y = r + γ · Qθ(s′, a′),  where a′ = π(s′)
Read where the action a′ comes from. It is not from the dataset. It is whatever action the current learned policy thinks is best at s′. The reward r and the state s′ are real (logged), but a′ is chosen by the policy we are training.
Now suppose a′ is an action the dataset never showed at s′: an out-of-distribution (OOD) action. The network Qθ has never received a single gradient telling it what that action is worth. A neural net asked to evaluate an input far from its training data does not say "I don't know". It extrapolates, and the error can land in either direction. The max (or the greedy policy) then seeks out whichever errors happen to be positive, so the net effect is an upward bias: some OOD action gets an absurdly high Q-value by accident, and that is the one the backup picks.
Online RL would catch this instantly: the policy, now greedy toward that fantasy action, would try it, receive the real (low) reward, and the Q-value would snap back down. Offline, there is no trying. So the lie persists — and worse, it feeds itself:
OOD action a′ gets an over-estimated Q (extrapolation)
│
▼
argmax / policy now PREFERS that fantasy action
│
▼
it becomes the bootstrap target y = r + γ·Q(s′, a′) ← target inflates
│
▼
fit Q(s,a) toward the inflated y → over-estimate spreads to (s,a)
│
└──────────► next backup queries an even more OOD action ... Q → ∞
The trigger is distributional shift: the distribution of (s, a) pairs the policy queries drifts away from the distribution the dataset covers. The more the learned policy π deviates from the behavior policy πβ, the more OOD actions it queries, and the worse the extrapolation.
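One way to make distributional shift measurable is to count how often the bootstrap asks about an action that was never logged at s′. The sketch below assumes a small discrete problem, a dataset stored as `(s, a, r, s_next)` tuples, and a critic exposed as a plain callable `q(s, a)`. All of these names and the tiny dataset are hypothetical.

```python
def ood_query_rate(dataset, q, n_actions):
    """Fraction of bootstrap queries (s', a') whose action was never logged at s'.

    dataset: list of (s, a, r, s_next) tuples (hypothetical layout).
    q:       callable q(s, a) -> float, standing in for the learned critic.
    """
    logged = {(s, a) for (s, a, _r, _s_next) in dataset}
    ood = 0
    for (_s, _a, _r, s_next) in dataset:
        # a' is picked by the greedy policy, not read from the log.
        a_greedy = max(range(n_actions), key=lambda a: q(s_next, a))
        if (s_next, a_greedy) not in logged:
            ood += 1
    return ood / len(dataset)


def optimistic(s, a):
    return float(a)          # prefers action 2, which was never logged


def cautious(s, a):
    return -abs(a - 1.0)     # prefers action 1, which is logged in both states


dataset = [(0, 0, 1.0, 1), (1, 1, 0.0, 0), (0, 1, 0.5, 1)]
print(ood_query_rate(dataset, optimistic, n_actions=3))  # 1.0
print(ood_query_rate(dataset, cautious, n_actions=3))    # 0.0
```

A rate near zero means the backup is only ever asking questions the data can answer. A rate near one means almost every target rests on an extrapolated value.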
Interactive · offline Q-learning detonating on a fixed dataset
No environment here, just a fixed dataset and the Bellman backup. There are 11 actions per state. The behavior policy only ever logged actions in a central band, and the width of that band is the coverage knob. We run offline fitted Q-iteration: each sweep refits Q toward r + γ·max_a Q(s′, a), taking the max over all actions, including ones the data never showed.
Watch the orange bars (OOD actions, outside the logged band). At low coverage they get queried by the max, extrapolate upward, and blow up over iterations — dragging the whole value estimate to infinity. Slide coverage to full and the same algorithm is perfectly fine. The bug is the lesson: it is not the algorithm, it is the missing data.
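The same effect can be reproduced in a few lines. The sketch below is not the code behind the interactive demo; it is a stand-in built on assumptions made here for illustration. The toy dynamics (`logged_outcome`), the two-state layout, and the extrapolation rule are all hypothetical: Q is tabular on logged actions and uses a straight line through the two nearest logged actions for anything unlogged, as a crude substitute for how a function approximator generalizes.

```python
GAMMA = 0.9
N_ACTIONS = 11
CENTER = 5


def logged_outcome(s, a):
    """Toy dynamics, used only to build the log (an assumption of this sketch).

    In state 0, action 6 pays 1 and stays in state 0. Everything else pays 0
    and moves to the absorbing state 1.
    """
    if s == 0 and a == 6:
        return 1.0, 0
    return 0.0, 1


def build_dataset(half_width):
    """Log every action in the central band [CENTER - w, CENTER + w] in both states."""
    lo = max(0, CENTER - half_width)
    hi = min(N_ACTIONS - 1, CENTER + half_width)
    return [(s, a, *logged_outcome(s, a)) for s in (0, 1) for a in range(lo, hi + 1)]


def q_value(table, s, a):
    """Tabular on logged actions, linear extrapolation on unlogged ones."""
    if (s, a) in table:
        return table[(s, a)]
    logged = sorted(b for (t, b) in table if t == s)
    # Nearest logged action (edge) and its inward neighbour define the line.
    if a < logged[0]:
        edge, inner = logged[0], logged[1]
    else:
        edge, inner = logged[-1], logged[-2]
    slope = (table[(s, edge)] - table[(s, inner)]) / (edge - inner)
    return table[(s, edge)] + slope * (a - edge)


def fitted_q_iteration(dataset, sweeps):
    """Return max_a Q(0, a) after each sweep; the max ranges over ALL actions."""
    table = {(s, a): 0.0 for (s, a, _, _) in dataset}
    history = []
    for _ in range(sweeps):
        new_table = {}
        for (s, a, r, s_next) in dataset:
            bootstrap = max(q_value(table, s_next, b) for b in range(N_ACTIONS))
            new_table[(s, a)] = r + GAMMA * bootstrap
        table = new_table
        history.append(max(q_value(table, 0, b) for b in range(N_ACTIONS)))
    return history


low = fitted_q_iteration(build_dataset(half_width=1), sweeps=30)
full = fitted_q_iteration(build_dataset(half_width=5), sweeps=200)
print("low coverage :", low[:3], "...", low[-1])
print("full coverage:", full[-1])
```

With `half_width=1` the log shows that action 6 beats action 5, and the straight line concludes that actions 7 to 10 must be better still. The extrapolated value at action 10 is five times Q(0, 6), so each sweep applies Q ← 1 + 4.5·Q and the estimate grows without bound. With full coverage nothing is extrapolated and the value settles at 1/(1 − γ) = 10. In this toy, `half_width=2` is already enough, because the log then contains action 7 and its zero reward contradicts the trend.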
Three families of fixes
Every offline-RL method is, at heart, a way to stop the backup from trusting OOD actions. They split into three families — and the next two lessons take the first two in depth.
| Family | Idea | Where it acts | Lesson |
|---|---|---|---|
| (a) Policy constraint | Force π to stay close to πβ so it only ever proposes in-data actions. Never let the backup query an OOD a′. | the actor / action selection | BCQ — lesson 20 |
| (b) Value pessimism | Don't trust optimistic Q-values; push Q down on OOD actions so the policy is never lured toward them. Learn a conservative lower bound. | the critic / value function | CQL — lesson 21 |
| (c) Uncertainty penalty | Estimate how uncertain Q is (e.g. disagreement across an ensemble) and subtract that uncertainty from the value. OOD regions are uncertain, so they get penalized automatically. | the value target | (model-based offline RL; touched on in lesson 22) |
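To make the three rows concrete, here is a toy sketch of how each family would change the bootstrap value at a single next state. These are deliberately crude stand-ins written for this illustration, not BCQ, CQL, or any published algorithm. The function names, the penalty sizes `alpha` and `beta`, and the three-member ensemble are all assumptions.

```python
from statistics import mean, pstdev


def constrained_value(q, logged_actions):
    """Family (a): only maximise over actions the data actually contains."""
    return max(q[a] for a in logged_actions)


def pessimistic_value(q, logged_actions, alpha):
    """Family (b): push Q down by alpha on unlogged actions before the max."""
    return max(q[a] - (0.0 if a in logged_actions else alpha) for a in range(len(q)))


def uncertainty_penalized_value(ensemble, beta):
    """Family (c): ensemble mean minus beta times ensemble spread, then max."""
    n_actions = len(ensemble[0])
    scores = []
    for a in range(n_actions):
        values = [member[a] for member in ensemble]
        scores.append(mean(values) - beta * pstdev(values))
    return max(scores)


# Three critics that agree on the logged actions {1, 2, 3} and disagree elsewhere.
logged = {1, 2, 3}
q1 = [0.0, 1.0, 2.0, 1.0, 9.0]
q2 = [0.5, 1.0, 2.0, 1.0, 3.0]
q3 = [-0.5, 1.0, 2.0, 1.0, 0.0]
ensemble = [q1, q2, q3]

print("naive max        :", max(q1))
print("policy constraint:", constrained_value(q1, logged))
print("value pessimism  :", pessimistic_value(q1, logged, alpha=10.0))
print("uncertainty      :", uncertainty_penalized_value(ensemble, beta=1.0))
```

The naive max is lured by the inflated value at action 4, while all three variants fall back to action 2, the best action the data supports. Variant (c) gets there without being told which actions were logged: the ensemble's disagreement on action 4 is what penalizes it.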
Why this is the right next chapter
We have now removed the last comfortable assumption. Lesson 17 removed the reward signal (imitation/IRL); lesson 18 removed the discrete action set (continuous control); lesson 19 removes the environment itself. What remains is data and the Bellman equation — and we have just seen that the Bellman max, the engine that made Q-learning work online, is exactly what makes it fail offline.