Robot control (part 2): autonomous driving

A self-driving car is a robot whose mistakes are measured in human lives. That single fact rules out the one thing every online RL method takes for granted — the freedom to try a bad action and see what happens. This lesson is about how the autonomous-driving stack is built almost entirely out of the methods we introduced precisely because online trial-and-error is forbidden.

What this lesson reuses

This is an applications lesson — it invents no new algorithm. It is the safety-critical payoff of three earlier ideas:

Imitation learning & inverse RL (lesson 17): behavioral cloning from human-driving logs, with its compounding-error caveat (the reason DAgger-style on-policy correction matters), plus inverse RL to recover a human-like reward.
Offline (batch) RL (lesson 19): learn from a fixed fleet log, never env.step() a real car.
BCQ & CQL (lessons 20–21): constrain the policy to safe, in-distribution actions so the Bellman backup never dreams up a maneuver the data never showed.

Why pure online RL is unacceptable here

Recall the loop from orientation: act, observe reward, learn, repeat. The thing that makes RL work is that the agent gets to be wrong — repeatedly, cheaply — until it discovers what is good. DQN needed millions of frames of Atari; SAC needed millions of simulator steps on a pendulum (lesson 18). Exploration is not an optional flourish; it is the engine. And on a public road, the engine is the problem.

The arithmetic that ends the discussion

An RL agent learns that crashing is bad by crashing. To estimate Q(s, a) for "swerve hard at 60 mph" it must, in effect, sample that action and feel the negative reward. There is no version of "explore a little" that is acceptable when one bad sample is a fatality. You cannot crash a real car a million times to fill in the value function. The free trial-and-error that powers every online method is exactly what driving cannot afford.

So the field made a trade that runs through this whole part of the course: give up online exploration; pay for it with data you already have. Humans have driven trillions of miles, and a large volume of that driving is logged. That is not a live environment; it is a fixed dataset. Which is to say: autonomous driving is, structurally, an imitation + offline-RL problem. The methods of lessons 17 and 19–21 were not academic exercises; they are the load-bearing walls of the AD stack.

Layer 1 — imitation from human-driving logs

The cheapest, oldest, and still most-used building block is behavioral cloning (lesson 17): treat each logged frame as a supervised example — camera/LiDAR state in, the human's steering and pedal out — and fit a policy by maximum likelihood:

min_θ 𝔼_{(s,a) ∼ D_human} [ −log π_θ(a | s) ]

This is how the first end-to-end driving nets (ALVINN in 1989; NVIDIA's PilotNet in 2016) worked: map pixels straight to steering. It is fast and needs no simulator. And it carries the exact pathology we diagnosed in lesson 17.
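To make the maximum-likelihood objective concrete, here is a minimal sketch under strong simplifying assumptions: a one-dimensional state (lane offset), a linear policy, and a Gaussian policy with fixed variance, in which case minimizing the negative log-likelihood reduces to least squares. The log, the "human" gain of -0.8, and the names `fitBC` and `negLogLik` are hypothetical and exist only for this illustration.

```javascript
// Hypothetical log of (lane offset s in meters, human steering a) pairs.
// The "human" is a made-up proportional driver: a = -0.8 * s.
const HUMAN_GAIN = -0.8;
const log = [];
for (let i = 0; i < 41; i++) {
  const s = -0.2 + 0.01 * i;            // only near-centered states, as in real logs
  log.push({ s, a: HUMAN_GAIN * s });
}

// Mean negative log-likelihood of the log under pi_w(a|s) = Normal(w*s, sigma^2).
function negLogLik(w, data, sigma = 0.1) {
  let nll = 0;
  for (const { s, a } of data) {
    const z = (a - w * s) / sigma;
    nll += 0.5 * z * z + Math.log(sigma * Math.sqrt(2 * Math.PI));
  }
  return nll / data.length;
}

// With sigma fixed, minimizing the NLL is least squares: w = sum(s*a) / sum(s*s).
function fitBC(data) {
  let sa = 0, ss = 0;
  for (const { s, a } of data) { sa += s * a; ss += s * s; }
  return sa / ss;
}

const wHat = fitBC(log);
console.log("cloned steering gain:", wHat.toFixed(3));
console.log("NLL at the fit vs. at w = -0.5:", negLogLik(wHat, log), negLogLik(-0.5, log));
```

Note that every logged state lies within 20 cm of the lane center; the fit says nothing about what the policy does outside that range.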

Compounding error, on a road this time

BC is trained only on states a good human visited — the car centered in its lane. The moment the cloned policy makes a small error it drifts to a pose no human ever demonstrated (slightly off-center, slightly mis-angled). There the policy is extrapolating, so it errs again, drifting further out. Errors don't cancel; they compound: total cost grows like ε·T² instead of ε·T. On a road, "off-distribution state" is a polite name for "heading toward the barrier."
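The gap between ε·T and ε·T² can be checked with a few lines of arithmetic. The sketch below uses a deliberately pessimistic toy model, which is an assumption of this example: the policy errs with probability `eps` per step while on-distribution, and without recovery every step after the first error is lost. Function names are illustrative.

```javascript
// Worst-case toy model: once the policy makes its first error it never recovers,
// so that step and every remaining step cost 1.
function expectedCostNoRecovery(eps, T) {
  let cost = 0;
  let pStillOnTrack = 1;                    // probability of no error before step t
  for (let t = 1; t <= T; t++) {
    const pFirstErrorHere = pStillOnTrack * eps;
    cost += pFirstErrorHere * (T - t + 1);  // this step plus all remaining ones
    pStillOnTrack *= 1 - eps;
  }
  return cost;
}

// If every error is corrected immediately, each one costs a single step.
function expectedCostWithRecovery(eps, T) {
  return eps * T;
}

const eps = 0.001;
for (const T of [10, 100, 1000]) {
  console.log(
    `T=${T}`,
    "with recovery:", expectedCostWithRecovery(eps, T).toFixed(3),
    "no recovery:", expectedCostNoRecovery(eps, T).toFixed(3),
    "bound eps*T*(T+1)/2:", (eps * T * (T + 1) / 2).toFixed(3)
  );
}
```

For small ε the no-recovery cost tracks ε·T(T+1)/2, which is where the quadratic dependence on the horizon comes from.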

The lesson-17 cure was DAgger: let the learner drive, and have the expert label the recovery states the learner actually drifts into ("you're drifting right, steer left"). For driving this is realized as on-policy correction: run the policy (in sim, or with a safety driver), collect the off-center states it produces, and add the human's correct response to the training set. NVIDIA's system did a static version of this by deliberately augmenting logs with synthetic off-center camera views labeled with the recovery steering, manufacturing the very recovery states pure BC never sees.
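The loop just described can be sketched on a toy lane-keeping problem. Everything here is an assumption made for illustration: a one-dimensional lane offset, a constant sideways drift, a made-up proportional expert, and a lookup-table learner that outputs zero steering in any state bin it has never seen. The loop is a simplified DAgger (the learner drives alone from the first iteration), not a full implementation.

```javascript
const DRIFT = 0.06;                       // constant sideways push per step (e.g. road camber)
const expert = (x) => -0.5 * x;           // hypothetical human: steer back toward center
const bin = (x) => Math.round(x * 10);    // 10 cm state bins for the lookup-table policy

// "Training" = average the expert labels per bin; unseen bins get action 0.
function fit(data) {
  const table = new Map();
  for (const { x, a } of data) {
    const cell = table.get(bin(x)) || { sum: 0, n: 0 };
    cell.sum += a;
    cell.n += 1;
    table.set(bin(x), cell);
  }
  return (x) => {
    const c = table.get(bin(x));
    return c ? c.sum / c.n : 0;
  };
}

// Drive for T steps from the lane center and return the visited offsets.
function rollout(policy, T) {
  const xs = [];
  let x = 0;
  for (let t = 0; t < T; t++) {
    xs.push(x);
    x += policy(x) + DRIFT;
  }
  return xs;
}

const label = (xs) => xs.map((x) => ({ x, a: expert(x) }));  // ask the expert what it would do
const maxDev = (xs) => Math.max(...xs.map((x) => Math.abs(x)));

// Pure BC: train once on a short expert demonstration.
let data = label(rollout(expert, 20));
const bcPolicy = fit(data);

// Simplified DAgger: roll out the learner, label ITS states, aggregate, refit.
let policy = bcPolicy;
for (let iter = 0; iter < 3; iter++) {
  data = data.concat(label(rollout(policy, 100)));
  policy = fit(data);
}

console.log("max offset, pure BC:", maxDev(rollout(bcPolicy, 100)).toFixed(2));
console.log("max offset, DAgger: ", maxDev(rollout(policy, 100)).toFixed(2));
```

The cloned policy is only slightly too weak near the center, but that small error walks it into a bin with no data, after which it never corrects. Labeling the learner's own states fills in exactly those bins.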

Layer 2 — offline RL on the fleet

Imitation only ever reaches human quality, and it copies moves rather than optimizing outcomes. To actually improve — smoother, safer, more efficient than the average logged driver — we want value-based or actor-critic learning. But online is forbidden. This is precisely the offline RL setting of lesson 19: a static dataset D = {(s,a,r,s′)} from a behavior policy π_β (the fleet of human drivers), and you must output a good policy using only D, never calling env.step().

And we already know how that breaks. Run plain Q-learning on the log and the max_a′Q(s′,a′) in the Bellman target will query the value of actions the data never contained — out-of-distribution (OOD) actions — where the network's estimate is pure, unchecked extrapolation. With no online step to ever feel the consequence, those phantom values feed back through the bootstrap and inflate without bound. The car's "value function" comes to believe an impossible maneuver is brilliant.

Fix family	Idea	For driving
BCQ: policy constraint (L20)	only ever consider actions the data supports: a generative model of π_β proposes candidates, maximize Q over those + a small perturbation.	"only choose maneuvers human drivers actually performed in states like this."
CQL: value pessimism (L21)	push Q down on the policy's own (possibly OOD) actions, up on dataset actions → a conservative lower bound.	"refuse to believe a maneuver is good until the logs prove it."

Either way the principle is the same and it is exactly what a car needs: stay in-distribution. The constraint that made offline RL correct in lessons 19–21 is, in driving, the constraint that makes it safe: it forbids the policy from confidently doing something no human ever did.
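Both fixes can be seen on a single Bellman target. The sketch below is a tabular illustration with invented numbers: the Q-values, the log counts, and the threshold `tau` are assumptions, the constrained target follows the spirit of the discrete-action variant of BCQ, and the regularizer is the logsumexp form associated with CQL. It is not either paper's full algorithm.

```javascript
// Hypothetical Q-estimates at one next state s' for four maneuvers.
// "swerve hard" never appears in the log, so its Q is pure extrapolation.
const actions = ["keep lane", "brake gently", "merge left", "swerve hard"];
const qNext = [1.0, 0.8, 1.2, 5.0];
const logCounts = [120, 40, 15, 0];   // how often the fleet took each action in states like s'

// Plain Q-learning target: max over ALL actions, including unsupported ones.
function naiveTarget(r, gamma, q) {
  return r + gamma * Math.max(...q);
}

// BCQ-style target: max only over actions whose empirical behavior probability
// is at least tau times that of the most common action (assumes some count > 0).
function constrainedTarget(r, gamma, q, counts, tau = 0.1) {
  const top = Math.max(...counts);
  const allowed = q.filter((_, i) => counts[i] / top >= tau);
  return r + gamma * Math.max(...allowed);
}

// CQL-style regularizer at one state: logsumexp(Q) - Q(data action).
// Its gradient with respect to Q is softmax(Q) - onehot(data action).
function cqlGradient(q, dataAction) {
  const m = Math.max(...q);
  const exps = q.map((v) => Math.exp(v - m));
  const z = exps.reduce((s, e) => s + e, 0);
  return exps.map((e, i) => e / z - (i === dataAction ? 1 : 0));
}

// One gradient-descent step on the regularizer alone (step size is arbitrary).
function cqlStep(q, dataAction, lr = 1.0) {
  const g = cqlGradient(q, dataAction);
  return q.map((v, i) => v - lr * g[i]);
}

console.log("naive target:      ", naiveTarget(0, 0.99, qNext));
console.log("constrained target:", constrainedTarget(0, 0.99, qNext, logCounts));
console.log("Q after one CQL step:", cqlStep(qNext, 0).map((v) => v.toFixed(2)));
```

The naive target bootstraps from the phantom value of the unsupported action; the constrained target ignores it, and the CQL step pushes that value down while raising the logged action's value.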

Layer 3 — inverse RL for a human-like reward

What reward should the offline learner even maximize? "Don't crash" is too sparse to shape smooth, courteous, human-like driving; hand-engineering a reward that trades off progress, comfort, lane-keeping, gap-acceptance and politeness is brittle and endless. So we borrow the other half of lesson 17: inverse RL. Instead of writing the reward, recover the one good human drivers appear to optimize from their demonstrations:

forward RL: R → π*
inverse RL: π^human (via logs) → R

MaxEnt IRL picks the least-committal model consistent with the logs: the maximum-entropy distribution over trajectories whose expected features match the demonstrations, which in turn pins down a reward R. The recovered R then drives the offline learner. Because we recover intent rather than copy moves, the resulting policy can generalize to states the log never showed, which is the whole reason IRL beat plain BC in lesson 17. (This is the same move RLHF's reward model makes, from preferences instead of demonstrations.)
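As a rough illustration of the MaxEnt idea, the sketch below fits a linear reward over a tiny, fully enumerated set of candidate trajectories. The three trajectories, their two features, and the demonstration counts are all invented for the example, and a real driving problem would not allow enumerating trajectories like this. The update is the standard feature-matching gradient: expert feature expectation minus model feature expectation.

```javascript
// Three hypothetical ways through a merge, summarized by [progress, discomfort].
const features = [
  [1.0, 0.9],   // aggressive
  [0.8, 0.2],   // smooth
  [0.2, 0.1],   // timid
];
const demoCounts = [1, 8, 1];   // how often human drivers chose each one (made up)

const dot = (u, v) => u.reduce((s, ui, i) => s + ui * v[i], 0);

// MaxEnt model: P(trajectory) is proportional to exp(reward), reward = weights . features.
function trajectoryProbs(weights) {
  const r = features.map((f) => dot(weights, f));
  const m = Math.max(...r);
  const e = r.map((ri) => Math.exp(ri - m));
  const z = e.reduce((s, x) => s + x, 0);
  return e.map((x) => x / z);
}

// Expected feature vector under a distribution over the trajectories.
function expectedFeatures(p) {
  return features[0].map((_, k) => p.reduce((s, pi, i) => s + pi * features[i][k], 0));
}

// Gradient ascent on the demonstrations' log-likelihood.
function fitReward(steps = 5000, lr = 1.0) {
  const total = demoCounts.reduce((s, c) => s + c, 0);
  const expertF = expectedFeatures(demoCounts.map((c) => c / total));
  let weights = [0, 0];
  for (let t = 0; t < steps; t++) {
    const modelF = expectedFeatures(trajectoryProbs(weights));
    weights = weights.map((wk, k) => wk + lr * (expertF[k] - modelF[k]));
  }
  return weights;
}

const rewardWeights = fitReward();
console.log("reward weights [progress, discomfort]:", rewardWeights.map((x) => x.toFixed(2)));
console.log("trajectory probabilities:", trajectoryProbs(rewardWeights).map((x) => x.toFixed(2)));
```

The recovered weights reward progress and penalize discomfort, which is the sense in which the reward captures what the drivers appear to optimize.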

The thing simulation can't fix: other cars react to you

Lesson 57 leaned on the simulator to be sample-efficient and to bridge sim-to-real. Driving breaks an assumption the simulator can't paper over: the other cars are agents, not scenery. When you nose into a merge, the driver in the target lane reacts — speeds up to close the gap, or lifts off to let you in. Your action changes their policy, which changes your next state's distribution. The environment is non-stationary from your point of view: this is the multi-agent setting, and it quietly violates the fixed-P(s′|s,a) MDP that lessons 01–21 assumed.

Why this compounds the offline problem

Multi-agent interaction makes the distribution shift worse, not better. A policy that drives differently from the logged humans will elicit different reactions from surrounding traffic — pushing the whole interaction off the data distribution, exactly where the offline value estimates are unreliable. The in-distribution constraint (BCQ/CQL) is doing double duty: keeping your actions safe, and keeping the interaction close to scenarios the data actually covers.

The safety layer — don't trust the learner with the brakes

No amount of conservatism makes a learned policy verifiable enough to be the last word. Production stacks wrap the learned controller in a separate, simple, checkable safety layer: a non-learned shield (a reachability or responsibility model such as Responsibility-Sensitive Safety (RSS), or a control-barrier function) that takes the policy's proposed action and clips it whenever it would violate a hard guarantee: minimum following distance, a feasible braking profile, staying inside the drivable surface. The learner proposes; the shield disposes.

  learned policy ──► proposed action a
                          │
                  ┌───────▼────────┐
                  │  SAFETY LAYER  │   is a still safe? (min gap, braking feasible?)
                  └───────┬────────┘
              safe │              │ unsafe
                   ▼              ▼
            execute a       clip → safe fallback (brake / hold lane)

This is the same instinct as PPO's clip and offline RL's in-distribution constraint — bound the action you'll actually commit to — but enforced by a hand-verified rule rather than learned values, because here the cost of being wrong is unbounded. The widget below is exactly this: a learned policy that occasionally drifts (pure BC, the bug from lesson 17) versus the same policy behind a shield that clips unsafe accelerations.
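For a sense of what such a hand-verified rule looks like, here is a sketch of a longitudinal shield built on the RSS minimum safe following distance. The formula is the published RSS one; the parameter values, the function names, and the simple "brake at least this hard" response are assumptions for illustration and are not calibrated to any vehicle.

```javascript
// Assumed parameter values, for illustration only.
const params = {
  rho: 0.5,        // response time [s]
  aAccelMax: 2.0,  // worst-case ego acceleration during the response time [m/s^2]
  aBrakeMin: 4.0,  // braking the ego car is guaranteed to achieve [m/s^2]
  aBrakeMax: 8.0,  // hardest braking the lead car might apply [m/s^2]
};

// RSS minimum safe longitudinal gap [m] for a rear car at vRear behind a front car at vFront.
function rssMinGap(vRear, vFront, p = params) {
  const vAfterResponse = vRear + p.rho * p.aAccelMax;
  const gap =
    vRear * p.rho +
    0.5 * p.aAccelMax * p.rho * p.rho +
    (vAfterResponse * vAfterResponse) / (2 * p.aBrakeMin) -
    (vFront * vFront) / (2 * p.aBrakeMax);
  return Math.max(0, gap);
}

// Shield: pass the learner's acceleration through unless the gap is unsafe,
// in which case command at least the guaranteed braking.
function shield(aProposed, gap, vRear, vFront, p = params) {
  const unsafe = gap < rssMinGap(vRear, vFront, p);
  return unsafe ? Math.min(aProposed, -p.aBrakeMin) : aProposed;
}

console.log(rssMinGap(20, 20));        // 40.375
console.log(shield(1.0, 30, 20, 20));  // -4
console.log(shield(1.0, 60, 20, 20));  // 1
```

Nothing in the shield is learned: given the four parameters, its behavior can be checked by hand, which is the property the learned policy lacks.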

Interactive · lane-merge: behavioral cloning vs RL + safety layer

Your car (orange) follows a lead car (blue) that brakes and accelerates unpredictably — a one-dimensional car-following / merge gap. The state is the gap to the lead and the relative speed; the action is your acceleration. Two controllers drive it:

Pure BC: a cloned controller fit to logged "good gaps." As in lesson 17, it only saw near-ideal states, so when noise pushes the gap into a region it never trained on, it commands the wrong acceleration, the error compounds, and it occasionally rear-ends the lead. The bug is the lesson.
RL + safety layer: the same imperfect controller, but every proposed acceleration passes through a shield that overrides it whenever the gap predicted for the next step, given the closing speed, would fall below the safety margin (a simple stand-in for a feasible-braking check). Drift still happens; collisions essentially don't.

Run many episodes and watch the collision-rate KPI diverge. Turn the noise up to stress both; turn the safety margin down to watch the shield's protection erode toward BC.

Car-following: pure BC vs RL + safety layer

Blue = lead car (random braking). Orange = your car. Pure BC drifts off-distribution and sometimes collides (red flash); the safety layer clips unsafe accelerations so collisions go to ~0. Each "▶ Run" simulates a batch of episodes and updates the collision rate.

Widget controls: controller selector, noise σ (0.30 shown), safety margin (0.80 shown). Readouts: mode (pure BC shown), episodes run, collisions, collision rate.

Core JS (≈26 lines, excerpt):

// State: gap g to the lead car, closing speed dv = v_ego - v_lead
// (dv > 0 means the gap is shrinking). Action: our acceleration a.
// KP, KD, GSTAR, DT, MAXBRAKE, T, leadAccel() and randn() are not shown in this excerpt.

// BC controller: fit to "good gaps" near g* (GSTAR); brittle far from g*.
function bc(g, dv){
  const a = KP*(g - GSTAR) - KD*dv;        // proportional-derivative law on the gap
  // Off-distribution bug: BC was never trained on small gaps, so its
  // response there is far too weak (15% of what the PD law asks for).
  return (g < GSTAR*0.5) ? a*0.15 : a;
}

// Safety layer: one-step lookahead. If the gap predicted for the next step
// falls below the margin, override the proposal with at least a full brake.
function shield(a, g, dv, margin){
  const gapAfter = g - dv*DT;              // a positive closing speed shrinks the gap
  if (gapAfter < margin) return Math.min(a, -MAXBRAKE); // force hard brake
  return a;
}

// Returns true if the episode ends in a collision.
function episode(mode, sigma, margin){
  let g = GSTAR, dv = 0;
  for (let t=0; t<T; t++){
    let a = bc(g, dv);                      // learner proposes
    if (mode === 'safe') a = shield(a, g, dv, margin);  // shield disposes
    const lead = leadAccel(t) + sigma*randn();          // noisy lead acceleration
    dv += (a - lead)*DT;                    // closing speed grows when we out-accelerate the lead
    g  -= dv*DT;                            // gap shrinks at the closing speed
    if (g <= 0) return true;                // COLLISION
  }
  return false;
}

Why does BC collide while the shield doesn't, given they share the same controller? BC's gain was fit on logged near-ideal gaps; in the small-gap region it never saw, its response is too weak (the off-distribution failure from lesson 17), so under noise it occasionally fails to brake in time and closes the gap to zero. The safety layer never relies on the learned response being right in that region: it overrides the proposal with a hard brake, so the rare drift is caught before it becomes a collision. That is the entire architectural argument for driving: learn the nominal behavior, but bound it with something you can verify.
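The argument can be replayed outside the widget with a small self-contained script. The constants and the steady-braking lead scenario below are assumptions chosen for illustration, since the widget's real values are not shown. The sketch takes dv to be the closing speed (ego speed minus lead speed), so a positive dv shrinks the gap.

```javascript
// Illustrative constants: assumptions, not the widget's real values.
const KP = 0.5, KD = 1.0;      // PD gains of the cloned controller
const GSTAR = 10;              // target gap [m]
const DT = 0.1, T = 100;       // time step [s], horizon [steps]
const MAXBRAKE = 8;            // hardest brake the shield can command [m/s^2]
const leadAccel = () => -3;    // stress scenario: the lead brakes steadily [m/s^2]

// Standard-normal sample via Box-Muller.
function randn() {
  return Math.sqrt(-2 * Math.log(1 - Math.random())) * Math.cos(2 * Math.PI * Math.random());
}

// Cloned controller: a PD law that is far too weak at small gaps it never saw.
function bc(g, dv) {
  const a = KP * (g - GSTAR) - KD * dv;
  return g < GSTAR * 0.5 ? a * 0.15 : a;
}

// One-step-lookahead shield: force a hard brake if the next gap would be below the margin.
function shield(a, g, dv, margin) {
  const gapAfter = g - dv * DT;
  return gapAfter < margin ? Math.min(a, -MAXBRAKE) : a;
}

// Returns true if the episode ends in a collision.
function episode(mode, sigma, margin) {
  let g = GSTAR, dv = 0;
  for (let t = 0; t < T; t++) {
    let a = bc(g, dv);
    if (mode === "safe") a = shield(a, g, dv, margin);
    const lead = leadAccel(t) + sigma * randn();
    dv += (a - lead) * DT;     // closing speed
    g -= dv * DT;              // gap shrinks at the closing speed
    if (g <= 0) return true;
  }
  return false;
}

console.log("pure BC collides:    ", episode("bc", 0, 3));
console.log("with shield collides:", episode("safe", 0, 3));
```

With the noise set to zero the run is deterministic, which makes it easy to step through by hand; raising `sigma` adds noise to the lead car's acceleration, matching the `sigma*randn()` term in the widget code.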

Mapping back to the spine

Hold the value / policy / model map from orientation in your head and place each layer:

Driving layer	Core method	Place on the map
Clone human steering from logs	Behavioral cloning (L17)	policy, learned by supervised MLE (no value, no model)
Fix the drift	DAgger / on-policy correction (L17)	label the policy's own off-distribution states
Improve past human level	Offline RL: BCQ / CQL (L19–21)	value learning, constrained to in-distribution actions
Where the reward comes from	Inverse RL (L17)	recover R from demonstrations, then forward RL
Other cars react	Multi-agent / non-stationary P	breaks the fixed-model MDP assumption
Don't crash, ever	Safety layer (clip)	bound the committed action — the PPO-clip / offline-constraint instinct, hard-verified

Read top to bottom, the AD stack is the answer to one question: how do you do RL when you are never allowed to explore? Imitation gives you a starting policy from data you already have; offline RL improves it without ever touching the road, by refusing to leave the data distribution; inverse RL supplies the reward; and a verified shield catches the residual mistakes. This is the safety-critical reason offline and imitation methods exist at all: driving is the domain where "you cannot afford to be wrong even once" is literally true, and the entire frontier of lessons 17 and 19–21 is the response.

Takeaway

Autonomous driving forbids the one thing online RL needs, free trial-and-error, because a single bad exploratory action can be fatal. So the stack is built from the methods that learn without exploring: behavioral cloning from human logs (with DAgger-style on-policy correction to beat its ε·T² compounding error), offline RL with BCQ/CQL to improve past human level while staying in-distribution, and inverse RL to recover a human-like reward, all wrapped in a non-learned safety layer that clips any action that would lead to an unsafe state. The widget shows the payoff bluntly: the same imperfect controller goes from occasional collisions (pure BC drifting off-distribution) to near-zero collisions once a verified shield bounds its actions. Driving is why the offline and imitation frontiers are not optional.