Robot control (part 2): autonomous driving
A self-driving car is a robot whose mistakes are measured in human lives. That single fact rules out the one thing every online RL method takes for granted — the freedom to try a bad action and see what happens. This lesson is about how the autonomous-driving stack is built almost entirely out of the methods we introduced precisely because online trial-and-error is forbidden.
- Imitation learning & inverse RL (lesson 17): behavioral cloning from human-driving logs (and its compounding-error caveat, which is why DAgger-style on-policy correction matters), plus inverse RL to recover a human-like reward.
- Offline (batch) RL (lesson 19): learn from a fixed fleet log, never env.step() a real car.
- BCQ & CQL (lessons 20–21): constrain the policy to safe, in-distribution actions so the Bellman backup never dreams up a maneuver the data never showed.
Why pure online RL is unacceptable here
Recall the loop from orientation: act, observe reward, learn, repeat. The thing that makes RL work is that the agent gets to be wrong — repeatedly, cheaply — until it discovers what is good. DQN needed millions of frames of Atari; SAC needed millions of simulator steps on a pendulum (lesson 18). Exploration is not an optional flourish; it is the engine. And on a public road, the engine is the problem.
So the field made a trade that runs through this whole part of the course: give up online exploration, and pay for it with data you already have. Humans have driven trillions of miles, much of it logged. That is not a live environment; it is a fixed dataset. Which is to say: autonomous driving is, structurally, an imitation + offline-RL problem. The methods of lessons 17 and 19–21 were not academic exercises; they are the load-bearing walls of the AD stack.
Layer 1 — imitation from human-driving logs
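The cheapest, oldest, and still most-used building block is behavioral cloning (lesson 17): treat each logged frame as a supervised example (camera/LiDAR state in, the human's steering and pedal out) and fit a policy by maximum likelihood. In standard notation, with $\pi_\theta$ the parameterized policy and $D$ the set of logged state-action pairs:

$$\max_{\theta} \sum_{(s,a) \in D} \log \pi_\theta(a \mid s)$$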
This is how the first end-to-end driving nets (ALVINN in 1989; NVIDIA's PilotNet in 2016) worked: map pixels straight to steering. It is fast and needs no simulator. And it carries the exact pathology we diagnosed in lesson 17.
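The supervised fit can be sketched in a few lines. This is a toy under stated assumptions, not any production pipeline: the state is a single lateral offset, the "human" is a hypothetical proportional controller with gain `EXPERT_GAIN`, and the policy is linear, so maximum likelihood under Gaussian action noise reduces to least squares.

```python
import random

def fit_bc_policy(states, actions):
    """Least-squares fit of a = w*s + b (the MLE under Gaussian action noise)."""
    n = len(states)
    mean_s = sum(states) / n
    mean_a = sum(actions) / n
    cov = sum((s - mean_s) * (a - mean_a) for s, a in zip(states, actions))
    var = sum((s - mean_s) ** 2 for s in states)
    w = cov / var
    b = mean_a - w * mean_s
    return w, b

rng = random.Random(0)
EXPERT_GAIN = -0.5   # hypothetical human: steer back toward the lane center

# Hypothetical log: lateral offset (m) -> steering command.
# Note how narrow the logged states are: humans rarely drift far from center.
states = [rng.uniform(-0.3, 0.3) for _ in range(500)]
actions = [EXPERT_GAIN * s + rng.gauss(0.0, 0.02) for s in states]

w, b = fit_bc_policy(states, actions)
print(f"cloned gain w={w:.3f}, bias b={b:.3f}")
print(f"logged offsets span [{min(states):.2f}, {max(states):.2f}] m")
```

The fit recovers the human's gain well inside the logged range; nothing in the data says what to do at an offset of, say, 1.5 m, which is where the trouble starts.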
The lesson-17 cure was DAgger: let the learner drive, and have the expert label the recovery states the learner actually drifts into ("you're drifting right, steer left"). For driving this is realized as on-policy correction: run the policy (in sim, or with a safety driver), collect the off-center states it produces, and add the human's correct response to the training set. NVIDIA's system did a static version of this by deliberately augmenting logs with synthetic off-center camera views labeled with the recovery steering, manufacturing the very recovery states pure BC never sees.
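The on-policy correction loop can be sketched as below. Everything here is an illustrative assumption: a one-dimensional lateral offset, a hypothetical `expert` function standing in for the human labeler, and a binned lookup table as the policy class (an unseen bin returns zero correction, which is what makes drift visible). It is a DAgger-style loop without the expert/learner mixing schedule of the full algorithm.

```python
import random

def expert(s):
    """Hypothetical expert label: steer back toward lane center."""
    return -0.5 * s

def bin_of(s, width=0.1):
    return round(s / width)

def fit_table(data):
    """Tabular policy: mean expert action per state bin present in the data."""
    sums, counts = {}, {}
    for s, a in data:
        b = bin_of(s)
        sums[b] = sums.get(b, 0.0) + a
        counts[b] = counts.get(b, 0) + 1
    return {b: sums[b] / counts[b] for b in sums}

def act(table, s):
    return table.get(bin_of(s), 0.0)   # unseen bin: the learner applies no correction

def rollout(table, rng, steps=50):
    """Drive with the learner's own policy and return the states it visits."""
    s, visited = 0.0, []
    for _ in range(steps):
        visited.append(s)
        s = s + act(table, s) + rng.gauss(0.0, 0.1)   # disturbance pushes it off-center
    return visited

rng = random.Random(0)
# Initial log: expert driving, so only near-center states.
data = [(s, expert(s)) for s in (rng.uniform(-0.1, 0.1) for _ in range(200))]
table = fit_table(data)
bins_before = len(table)

for _ in range(5):
    visited = rollout(table, rng)                 # learner drives
    data += [(s, expert(s)) for s in visited]     # expert labels the learner's states
    table = fit_table(data)                       # retrain on the aggregate

bins_after = len(table)
print(f"state bins with a label: {bins_before} -> {bins_after}")
```

The point of the design is where the new labels come from: they are attached to states the learner itself reached, so coverage grows exactly where the cloned policy was blind.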
Layer 2 — offline RL on the fleet
Imitation only ever reaches human quality, and it copies moves rather than optimizing outcomes. To actually improve (smoother, safer, more efficient than the average logged driver) we want value-based or actor-critic learning. But online is forbidden. This is precisely the offline RL setting of lesson 19: a static dataset $D = \{(s, a, r, s')\}$ from a behavior policy $\pi_\beta$ (the fleet of human drivers), and you must output a good policy using only $D$, never calling env.step().
And we already know how that breaks. Run plain Q-learning on the log and the $\max_{a'} Q(s', a')$ in the Bellman target will query the value of actions the data never contained, out-of-distribution (OOD) actions, where the network's estimate is pure, unchecked extrapolation. With no online step to ever feel the consequence, those phantom values feed back through the bootstrap and inflate without bound. The car's "value function" comes to believe an impossible maneuver is brilliant.
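A tabular toy makes the mechanism concrete. The setup is an assumption for illustration: one recurring state, two logged actions, and a third action `swerve` that never appears in the log and whose Q entry starts at an arbitrary 50.0, standing in for a network's unchecked guess.

```python
GAMMA, LR = 0.9, 0.1
ACTIONS = ["keep_lane", "brake", "swerve"]         # "swerve" never appears in the log
LOG = [("keep_lane", 1.0), ("brake", 0.5)] * 2000  # (action, reward), one recurring state

# Entries for unseen actions are never updated, so they keep their initial value.
q = {"keep_lane": 0.0, "brake": 0.0, "swerve": 50.0}

for action, reward in LOG:
    # The max ranges over every action, including the one the data never showed.
    target = reward + GAMMA * max(q[b] for b in ACTIONS)
    q[action] += LR * (target - q[action])

# What the best logged action is really worth if repeated forever: 1 / (1 - GAMMA).
best_logged_value = 1.0 / (1.0 - GAMMA)

print({a: round(v, 2) for a, v in q.items()})
print("value of best logged action:", round(best_logged_value, 2))
```

The logged actions end up valued at roughly 1 + 0.9 × 50, several times what the data can justify, purely because the bootstrap trusted the phantom entry. In this table the phantom value is frozen, so the inflation is bounded; with function approximation the unseen estimate can itself move, which is the unbounded case described above.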
| Fix family | Idea | For driving |
|---|---|---|
| BCQ: policy constraint (L20) | Only ever consider actions the data supports: a generative model of $\pi_\beta$ proposes candidates, then maximize Q over those plus a small perturbation. | "Only choose maneuvers human drivers actually performed in states like this." |
| CQL: value pessimism (L21) | Push Q down on the policy's own (possibly OOD) actions and up on dataset actions, giving a conservative lower bound. | "Refuse to believe a maneuver is good until the logs prove it." |
Either way the principle is the same and it is exactly what a car needs: stay in-distribution. The constraint that made offline RL correct in lessons 19–21 is, in driving, the constraint that makes it safe: it forbids the policy from confidently doing something no human ever did.
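Both fix families can be sketched on a small tabular toy. These are simplified stand-ins under stated assumptions, not the published algorithms: one recurring state, a log of two actions, and an unseen action `swerve` whose Q entry starts at an arbitrary 50.0. The BCQ-flavored version replaces the generative model with the literal set of logged actions; the CQL-flavored version uses a semi-gradient TD step plus the log-sum-exp penalty with an assumed weight `ALPHA`.

```python
import math

GAMMA, LR, ALPHA = 0.9, 0.1, 1.0
ACTIONS = ["keep_lane", "brake", "swerve"]           # "swerve" never appears in the log
LOG = [("keep_lane", 1.0), ("brake", 0.5)] * 10000   # (action, reward), one recurring state

def initial_q():
    # 50.0 stands in for an unchecked guess about the unseen action.
    return {"keep_lane": 0.0, "brake": 0.0, "swerve": 50.0}

def support_constrained(log):
    """BCQ-flavored: the backup maximizes only over actions the log contains."""
    q = initial_q()
    supported = {a for a, _ in log}
    for a, r in log:
        target = r + GAMMA * max(q[b] for b in supported)
        q[a] += LR * (target - q[a])
    return q

def conservative(log):
    """CQL-flavored: TD step plus a penalty that lowers a soft maximum of Q
    over all actions and raises Q on the logged action."""
    q = initial_q()
    for a_data, r in log:
        top = max(q.values())
        target = r + GAMMA * top
        weights = {a: math.exp(q[a] - top) for a in ACTIONS}   # stable softmax
        total = sum(weights.values())
        td_error = q[a_data] - target
        for a in ACTIONS:
            is_logged = 1.0 if a == a_data else 0.0
            # Gradient of ALPHA * (logsumexp(Q) - Q[a_data]) + 0.5 * td_error**2
            grad = ALPHA * (weights[a] / total - is_logged) + is_logged * td_error
            q[a] -= LR * grad
    return q

q_bcq = support_constrained(LOG)
q_cql = conservative(LOG)
print("support-constrained:", {a: round(v, 2) for a, v in q_bcq.items()})
print("conservative:       ", {a: round(v, 2) for a, v in q_cql.items()})
```

The two routes differ in where the pessimism lives. The first never looks at `swerve`, so its stale entry is harmless; the second keeps the full max but drags the unseen entry down until the logged actions win the comparison.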
Layer 3 — inverse RL for a human-like reward
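What reward should the offline learner even maximize? "Don't crash" is too sparse to shape smooth, courteous, human-like driving; hand-engineering a reward that trades off progress, comfort, lane-keeping, gap-acceptance and politeness is brittle and endless. So we borrow the other half of lesson 17: inverse RL. Instead of writing the reward, recover the one that good human drivers appear to optimize from their demonstrations. In the maximum-entropy form, written in standard notation (the trajectory symbol $\tau$, the demonstration set $D$, and the partition function $Z(R)$ are assumed here):

$$R^{*} = \arg\max_{R} \sum_{\tau \in D} \log p(\tau \mid R), \qquad p(\tau \mid R) = \frac{\exp R(\tau)}{Z(R)}$$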
MaxEnt IRL picks the least-committal reward consistent with the logs; the recovered R then drives the offline learner. Because we recover intent rather than copy moves, the resulting policy generalizes to states the log never showed — the whole reason IRL beat plain BC in lesson 17. (This is the same move RLHF's reward model makes from preferences instead of demonstrations.)
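A minimal sketch of the MaxEnt fit, under heavy simplifying assumptions: each "trajectory" is a single merge maneuver, so the partition function is an explicit sum over three options; the reward is linear in two hand-picked features; and the maneuver names, feature values, and demonstration counts are all hypothetical. The update is the standard log-likelihood gradient: demonstrated feature mean minus the model's expected feature mean.

```python
import math

# Hypothetical maneuvers with assumed features: (progress, discomfort).
FEATURES = {
    "cut_in":       (1.0, 1.0),
    "wait_for_gap": (0.6, 0.1),
    "stop":         (0.0, 0.3),
}
# Hypothetical demonstration counts from human drivers.
DEMOS = {"cut_in": 15, "wait_for_gap": 80, "stop": 5}

def policy(theta):
    """MaxEnt distribution: p(a) proportional to exp(theta . phi(a))."""
    scores = {a: sum(t * f for t, f in zip(theta, phi)) for a, phi in FEATURES.items()}
    top = max(scores.values())
    weights = {a: math.exp(s - top) for a, s in scores.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

def feature_mean(dist):
    return [sum(dist[a] * FEATURES[a][i] for a in FEATURES) for i in range(2)]

n = sum(DEMOS.values())
empirical = feature_mean({a: c / n for a, c in DEMOS.items()})

theta = [0.0, 0.0]   # reward weights on (progress, discomfort)
for _ in range(5000):
    expected = feature_mean(policy(theta))
    # Gradient ascent on the demonstration log-likelihood.
    theta = [t + 1.0 * (e - m) for t, e, m in zip(theta, empirical, expected)]

learned = policy(theta)
print("reward weights:", [round(t, 2) for t in theta])
print("model maneuver frequencies:", {a: round(p, 3) for a, p in learned.items()})
```

The recovered weights reward progress and penalize discomfort, which is a statement about intent rather than about any particular logged move; that is the property the forward learner then exploits in states the log never covered.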
The thing simulation can't fix: other cars react to you
Lesson 57 leaned on the simulator to be sample-efficient and to bridge sim-to-real. Driving breaks an assumption the simulator can't paper over: the other cars are agents, not scenery. When you nose into a merge, the driver in the target lane reacts — speeds up to close the gap, or lifts off to let you in. Your action changes their policy, which changes your next state's distribution. The environment is non-stationary from your point of view: this is the multi-agent setting, and it quietly violates the fixed-P(s′|s,a) MDP that lessons 01–21 assumed.
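A deliberately tiny sketch of what "non-stationary from your point of view" means for the transition model. The reaction rule in `other_driver_yields` is a made-up assumption, as are the push rates and the reward; the only point is that the probability attached to the same state and action depends on the ego policy the other driver is reacting to.

```python
def other_driver_yields(ego_push_rate):
    """Hypothetical reaction model: drivers yield less often the more
    aggressively the ego vehicle is known to push into gaps."""
    return max(0.0, 0.9 - 0.6 * ego_push_rate)

# P(gap opens | state = merging, action = nose_in) under two ego policies.
p_logged = other_driver_yields(ego_push_rate=0.1)     # human fleet that rarely forces a merge
p_deployed = other_driver_yields(ego_push_rate=0.9)   # learned policy that almost always does

# A value estimate built on the logged transition overstates the payoff of merging.
MERGE_REWARD = 1.0
value_from_log = p_logged * MERGE_REWARD
value_when_deployed = p_deployed * MERGE_REWARD

print(f"P(gap opens) in the log: {p_logged:.2f}, after deployment: {p_deployed:.2f}")
```

A single fixed P(s′|s,a) estimated from the log cannot represent both numbers, which is the sense in which the MDP assumption is violated.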
The safety layer — don't trust the learner with the brakes
No amount of conservatism makes a learned policy verifiable enough to be the last word. Production stacks wrap the learned controller in a separate, simple, checkable safety layer: a non-learned shield (a reachability or responsibility model such as Responsibility-Sensitive Safety (RSS), or a control-barrier function) that takes the policy's proposed action and clips it whenever it would violate a hard guarantee such as minimum following distance, a feasible braking profile, or staying inside the drivable surface. The learner proposes; the shield disposes.
    learned policy ──► proposed action a
                               │
                       ┌───────▼────────┐
                       │  SAFETY LAYER  │  is a still safe? (min gap, braking feasible?)
                       └───────┬────────┘
                     safe │         │ unsafe
                          ▼         ▼
                      execute a   clip → safe fallback (brake / hold lane)
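The check in the diagram can be sketched with the RSS longitudinal safe-distance rule. The formula is the published RSS minimum following distance; the parameter values, the function names, and the choice of fallback (brake at the guaranteed rate) are assumptions made for this illustration, not a description of any production stack.

```python
RESPONSE_TIME = 0.5   # s, assumed reaction delay
A_MAX = 2.0           # m/s^2, assumed max ego acceleration during the delay
B_MIN = 4.0           # m/s^2, assumed braking the ego car can always deliver
B_MAX = 8.0           # m/s^2, assumed hardest braking of the lead car

def rss_min_gap(v_ego, v_lead):
    """Smallest gap from which the ego car can still stop behind the lead,
    even if the lead brakes at B_MAX and the ego reacts late."""
    v_after_delay = v_ego + RESPONSE_TIME * A_MAX
    gap = (v_ego * RESPONSE_TIME
           + 0.5 * A_MAX * RESPONSE_TIME ** 2
           + v_after_delay ** 2 / (2.0 * B_MIN)
           - v_lead ** 2 / (2.0 * B_MAX))
    return max(0.0, gap)

def shield(gap, v_ego, v_lead, proposed_accel):
    """Pass the learner's action through if the gap is safe, else fall back to braking."""
    if gap >= rss_min_gap(v_ego, v_lead):
        return min(proposed_accel, A_MAX)
    return -B_MIN

print(rss_min_gap(20.0, 20.0))                 # required gap at 20 m/s behind a 20 m/s lead
print(shield(60.0, 20.0, 20.0, 1.0))           # safe: the proposal is executed
print(shield(30.0, 20.0, 20.0, 1.0))           # unsafe: clipped to the braking fallback
```

Note that the shield contains no learned component at all: every quantity in it is a physical bound someone can audit.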
This is the same instinct as PPO's clip and offline RL's in-distribution constraint — bound the action you'll actually commit to — but enforced by a hand-verified rule rather than learned values, because here the cost of being wrong is unbounded. The widget below is exactly this: a learned policy that occasionally drifts (pure BC, the bug from lesson 17) versus the same policy behind a shield that clips unsafe accelerations.
Interactive · lane-merge: behavioral cloning vs RL + safety layer
Your car (orange) follows a lead car (blue) that brakes and accelerates unpredictably — a one-dimensional car-following / merge gap. The state is the gap to the lead and the relative speed; the action is your acceleration. Two controllers drive it:
- Pure BC — a cloned controller fit to logged "good gaps." Like lesson 17, it only saw near-ideal states, so when noise pushes the gap into a region it never trained on, it commands the wrong acceleration, compounds, and occasionally rear-ends the lead. The bug is the lesson.
- RL + safety layer — the same imperfect controller, but every proposed acceleration passes through a shield that clips it whenever the resulting gap would be unsafe given the closing speed (a feasible-braking check). Drift still happens; collisions essentially don't.
Run many episodes and watch the collision-rate KPI diverge. Turn the noise up to stress both; turn the safety margin down to watch the shield's protection erode toward BC.
Why does BC collide while the shield doesn't, given they share the same controller? BC's gain was fit on logged near-ideal gaps; in the small-gap region it never saw, its response is too weak (the conditioning failure from lesson 17), so under noise it occasionally fails to brake in time and closes the gap to zero. The safety layer never relies on the learned response being right in that region — it clips to a guaranteed feasible brake — so the rare drift is caught before it becomes a collision. That is the entire architectural argument for driving: learn the nominal behavior, but bound it with something you can verify.
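The same argument can be replayed offline in a stripped-down form. This sketch is not the widget's code: it swaps the noisy many-episode setup for one deterministic scenario (the lead brakes to a stop), and every constant, the weak cloned gain `K_BC`, and the one-step lookahead check in `shield` are assumptions chosen for illustration. The shield plans for a worst-case lead braking rate `B_LEAD` that is at least as hard as the ego's guaranteed `B_EGO`.

```python
DT = 0.1                     # s, control period (assumed)
B_EGO = 6.0                  # m/s^2, braking the ego car is assumed to always deliver
B_LEAD = 8.0                 # m/s^2, worst-case lead braking the shield plans for
MARGIN = 2.0                 # m, smallest gap the shield will accept
K_BC, GAP_DES = 0.05, 30.0   # weak cloned gain around the logged "good gap"

def advance(v, a, dt=DT):
    """Exact constant-acceleration step; a braking car stops at v = 0.
    Returns (distance travelled, new speed)."""
    if a < 0.0 and v + a * dt < 0.0:
        return v * v / (2.0 * -a), 0.0
    return v * dt + 0.5 * a * dt * dt, v + a * dt

def bc_policy(gap):
    """Cloned controller: its response near small gaps is too weak."""
    return max(-B_EGO, min(2.0, K_BC * (gap - GAP_DES)))

def shield(gap, v_ego, v_lead, a):
    """Accept `a` only if, after one step of it, full braking still leaves
    MARGIN behind a lead that brakes at B_LEAD from now on."""
    d_ego, v_next = advance(v_ego, a)
    worst_gap = (gap + v_lead ** 2 / (2.0 * B_LEAD)
                 - d_ego - v_next ** 2 / (2.0 * B_EGO))
    return a if worst_gap >= MARGIN else -B_EGO

def min_gap(use_shield, steps=150):
    gap, v_ego, v_lead = 30.0, 20.0, 20.0
    smallest = gap
    for _ in range(steps):
        a = bc_policy(gap)
        if use_shield:
            a = shield(gap, v_ego, v_lead, a)
        d_ego, v_ego = advance(v_ego, a)
        d_lead, v_lead = advance(v_lead, -4.0)   # lead brakes steadily to a stop
        gap += d_lead - d_ego
        smallest = min(smallest, gap)
    return smallest

bc_gap = min_gap(use_shield=False)
shielded_gap = min_gap(use_shield=True)
print(f"smallest gap, pure BC: {bc_gap:.1f} m (negative means a collision)")
print(f"smallest gap, BC behind the shield: {shielded_gap:.1f} m")
```

Both runs use the identical cloned controller; the only difference is whether its output is checked against a braking bound before it is executed.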
Mapping back to the spine
Hold the value / policy / model map from orientation in your head and place each layer:
| Driving layer | Core method | Place on the map |
|---|---|---|
| Clone human steering from logs | Behavioral cloning (L17) | policy, learned by supervised MLE; no value, no model |
| Fix the drift | DAgger / on-policy correction (L17) | label the policy's own off-distribution states |
| Improve past human level | Offline RL: BCQ / CQL (L19–21) | value learning, constrained to in-distribution actions |
| Where the reward comes from | Inverse RL (L17) | recover R from demonstrations, then forward RL |
| Other cars react | Multi-agent / non-stationary P | breaks the fixed-model MDP assumption |
| Don't crash, ever | Safety layer (clip) | bound the committed action — the PPO-clip / offline-constraint instinct, hard-verified |
Read top to bottom, the AD stack is the answer to one question: how do you do RL when you are never allowed to explore? Imitation gives you a starting policy from data you already have; offline RL improves it without ever touching the road, by refusing to leave the data distribution; inverse RL supplies the reward; and a verified shield catches the residual mistakes. This is the safety-critical reason offline and imitation methods exist at all: driving is the domain where "you cannot afford to be wrong even once" is literally true, and the entire frontier of lessons 17 and 19–21 is the response.