Autonomous-driving path planning

A driving policy that turns a steering wheel and a pedal is a continuous-control MDP, but one where physics, the law, and a safety case all sit inside the loop. The binding difficulties of this domain are different from a game's: the action space is dimensionally inconsistent and physically constrained (a steering angle and a throttle are not the same unit, and the tires can only deliver so much grip); the objective must respect hard safety constraints that may never be violated, not just maximize reward; the demonstrations from human drivers must be fused with exploration without being forgotten; the observation is a high-dimensional, partially observable sensor stream; and almost all training happens in a simulation that does not match the real car. Each difficulty names a mechanism.

The method — five steps, every lesson
Same loop as lesson 67. (1) Formulate the MDP — state, action, reward, transition, horizon. (2) Diagnose the one property that makes this MDP hard. (3) Engineer the mechanism that removes exactly that difficulty. (4) Guard it in production — detect when it breaks and fall back. (5) Iterate. For a driving policy the difficulties arrive in a fixed order — physics first, safety second, demonstrations third, perception fourth, the sim-to-real gap last — and the lesson runs the loop once per layer.

1 · Formulate — the MDP behind a path-planning policy

Intuition. The car sees the road through cameras, lidar and radar, decides how to steer and how hard to accelerate or brake, and is rewarded for staying on its path without crashing or breaking the law. That is an MDP — but unlike a game, the transition is real physics and the reward has a region (a crash, a revoked license) that is not "low reward," it is forbidden. The four pieces below each have an awkward part that the rest of the lesson removes.

MDP = (S, A, R, P, γ)  ·  maximize  E[ Σₜ γᵗ rₜ ]   subject to   E[ Σₜ cₜ ] ≤ d
Piece | For a path-planning policy | The awkward part
State S | multi-frame BEV raster (≈256×256×8), lidar point cloud, ego speed / yaw-rate, map | a single frame is not Markov: occlusion and motion need history
Action A | steering-wheel angle δsw and longitudinal accel a (or torque) | different units, and the tires impose a nonlinear friction limit
Reward R | path-tracking error, comfort, progress, plus cost signals cₜ for violations | some costs (collision, license loss) are hard constraints, not penalties
Transition P | vehicle dynamics + CAN-bus actuation + other road users | learned in simulation; the real car has latency, friction and sensor gaps

Same pattern as lesson 67: the rightmost column is the lesson. But notice the new term in the objective — a constraint E[Σ cₜ] ≤ d. Driving is the first domain in this track where reward maximization alone is not enough; safety is a separate budget that the optimizer must respect.

2 · The action space — unify the units, then respect the friction circle

Intuition. A network that outputs "steering angle in degrees" and "acceleration in m/s²" is asking the optimizer to compare apples to oranges: the two outputs have different units and wildly different scales, so gradients are imbalanced and the policy is hard to train. Worse, the two are not independent — a tire can only produce so much total force, and force spent turning is force you cannot spend braking. Steer hard and brake hard and the tire saturates: the car slides. The fix is to stop letting the network speak in wheel angles and pedals, and let it speak in tire forces, which share one unit (newtons) and have one clean physical limit (a circle).

Fy = Cα·δf − Cα·(lf·r)/v   |   Fx = m·a   ⇒   action = (Fx, Fy) in newtons

Engineering detail — four layers from input to actuator. The steering-wheel angle δsw maps to a front-wheel angle δf through a calibratable, speed-dependent ratio i(v); a single-track (bicycle) model then turns δf into a lateral force Fy, while longitudinal force is simply Fx=m·a. Now the action space is (Fx, Fy), both in newtons — the unit mismatch is gone. The hard part is the grip limit, handled with a differentiable friction-circle normalization:

ρ = √(Fx² + Fy²) / (μ·m·g)   |   ρ ≤ 1 ⇒ pass through;   ρ > 1 ⇒ project onto the circle, keep direction, set magnitude 1

where μ is the live road-friction estimate (identified online from the ESC wheel-speed variance). The projection uses a smooth approximation of √(Fx² + Fy²) so the whole map is differentiable and trains under autograd. After the network emits a normalized force vector (magnitude at most 1), an actuator de-normalization multiplies back by μ·m·g and inverts the maps: a = Fx/m, δf = (Fy + Cα·lf·r/v)/Cα (the exact inverse of the linear Fy model above), then the inverse ratio gives δsw. Because μ, v and r are real-time signals, de-normalization is naturally speed-gain-scheduled, with no lookup table needed. A final differentiable rate-limit barrier caps steering slew at 540°/s (written in tanh form so gradients flow), and the network learns to "turn the wheel without throwing it." A small training trick closes the loop: penalize the region ρ > 0.95 with −10 reward and inject ±5% noise into the μ estimate so the policy is robust to a slippery surprise.
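The normalization and its inverse can be sketched in a few lines of plain Python. This is a minimal illustration, not the lesson's autograd layer: the function names, the `eps` smoothing term and the numeric values in the usage note are assumptions, and the steering inverse is the algebraic inverse of the linear Fy model given earlier.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def friction_ratio(fx, fy, mu, mass, eps=1e-6):
    """rho = sqrt(Fx^2 + Fy^2) / (mu*m*g); eps keeps the sqrt smooth at zero."""
    return math.sqrt(fx * fx + fy * fy + eps) / (mu * mass * G)


def project_to_friction_circle(fx, fy, mu, mass):
    """Pass through if rho <= 1; otherwise keep direction and shrink to rho = 1."""
    rho = friction_ratio(fx, fy, mu, mass)
    scale = 1.0 if rho <= 1.0 else 1.0 / rho
    return fx * scale, fy * scale


def denormalize(fx, fy, mass, c_alpha, l_f, yaw_rate, speed):
    """Invert Fx = m*a and Fy = C_alpha*delta_f - C_alpha*l_f*r/v."""
    accel = fx / mass
    delta_f = (fy + c_alpha * l_f * yaw_rate / speed) / c_alpha
    return accel, delta_f
```

For example, with hypothetical values mu = 0.8 and mass = 1500 kg the grip limit is about 11.8 kN, so a commanded (10 kN, 10 kN) is scaled back onto the circle while (1 kN, 2 kN) passes through unchanged. The hard branch on rho is only piecewise smooth; a trainable version would replace it with a smooth blend.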

Why this is the right shape
In the normalized space every vector inside the unit circle corresponds one-to-one with a physically reachable tire force, so the policy is dynamically consistent by construction: it can never command a force the tires cannot deliver. A closed-loop double-lane-change at 120 km/h runs with no extra calibration. The optimizer never wastes capacity learning physics it can be given, which is exactly the lesson-01 discipline of not reaching for a tool until a row of the MDP table demands it.

3 · The hard constraint — safety is a budget, not a penalty

Intuition. You can shape a game's reward freely because the worst case is losing a match. A car that runs a red light can lose its license, and "−1000 reward" does not capture "you are not allowed to do this." The right frame is constrained RL: maximize the driving reward subject to the expected cost of violations staying under a budget. In China a driver has 12 license points per year; that whole-year budget has to be converted into a per-decision-step allowance the optimizer can enforce.

Engineering detail — sizing the budget (CPO). Assign cost weights per violation (speeding by 10%: c = 3; crossing a solid line: c = 1). A car driving ~4 h/day logs ≈1.5k h/year ≈ 150k transitions. Spread 12 points across the year and the per-step constraint is

di = 12 / 150k ≈ 8×10⁻⁵ points/step   |   rt = rpath − λ·ct,   E[ct] ≤ di

This E[ct] ≤ di is written straight into the optimizer as the Constrained Policy Optimization (CPO) constraint. If the business cares about the probability of losing the license rather than the expected score, swap in a chance constraint P(Σ ct > 12) ≤ 1% and solve with Chance-Constrained CPO for a tighter di.
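The budget arithmetic can be checked with a short script. It is a sketch under stated assumptions: the cost weights and the 150k transitions come from the text, while the per-step violation probabilities passed in are hypothetical inputs chosen only to exercise the check.

```python
POINTS_PER_YEAR = 12
TRANSITIONS_PER_YEAR = 150_000  # the lesson's figure for ~1.5k driving hours

# Cost weights quoted in the lesson.
COST = {"speeding_10pct": 3, "solid_line": 1}


def per_step_budget(points=POINTS_PER_YEAR, transitions=TRANSITIONS_PER_YEAR):
    """d_i: yearly point budget spread evenly over all decision steps."""
    return points / transitions


def expected_step_cost(violation_prob):
    """E[c_t] given a per-step probability for each violation type."""
    return sum(COST[name] * p for name, p in violation_prob.items())


def within_budget(violation_prob):
    """True when the expected per-step cost respects E[c_t] <= d_i."""
    return expected_step_cost(violation_prob) <= per_step_budget()
```

With a hypothetical policy that speeds on 1 step in 100,000 and crosses a solid line on 2 steps in 100,000, the expected cost is 5×10⁻⁵ points/step, inside the 8×10⁻⁵ allowance.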

Engineering detail — the Lagrange multiplier is a two-timescale system. The dual variable λ trades reward against safety, and its update frequency matters more than its value. The dual must move much slower than the policy (α ≪ β, where α is the dual step size and β the policy step size); otherwise the policy gradient is yanked around every step, variance explodes and the policy chatters. But if the dual moves too slowly, constraint cost accumulates and forces a hard second-order correction. The sweet spot is an intermediate band: update λ once per large batch or episode, in practice at 1/20–1/50 of the policy-update rate, backed by a real-time safety filter for emergencies.
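A minimal sketch of the slow dual update, assuming plain projected dual ascent on λ. The function names, the `dual_lr` value and the cost trace in the usage note are illustrative assumptions; the policy-gradient step itself is left as a comment.

```python
def dual_update(lam, avg_cost, budget, dual_lr):
    """Projected dual ascent: lambda rises when cost exceeds the budget, never below 0."""
    return max(0.0, lam + dual_lr * (avg_cost - budget))


def run_two_timescale(step_costs, budget, dual_lr, dual_period=20, lam0=0.0):
    """Return lambda after each policy update.

    step_costs[i] is the mean constraint cost seen during policy update i.
    lambda moves only once every `dual_period` policy updates, using the
    mean cost over that window, so it changes on the slow timescale.
    """
    lam, window, history = lam0, [], []
    for cost in step_costs:
        # (a policy-gradient step on r - lam * c would run here, every iteration)
        window.append(cost)
        if len(window) == dual_period:
            lam = dual_update(lam, sum(window) / len(window), budget, dual_lr)
            window.clear()
        history.append(lam)
    return history
```

With `dual_period=20`, λ stays fixed for 19 policy updates and steps on the 20th, which matches the upper end of the 1/20 to 1/50 band described above. A policy whose cost stays under the budget leaves λ pinned at zero.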

Engineering detail — hard-coding safety with a Control Barrier Function. Lagrangian methods make safety likely; a CBF makes it certain. Define a safety function h(x) (positive inside the safe set) and require its derivative to keep the set forward-invariant. Wrap the policy output anet in a tiny quadratic program that finds the nearest safe action:

a_safe = argmin_a ‖a − a_net‖²   s.t.   ḣ(x,a) ≥ −α·h(x),   a ∈ [a_min, a_max]

Solve with OSQP/qpth, and use the KKT conditions to get ∂asafe/∂anet so the QP layer is a differentiable module registered with autograd — the policy trains end-to-end through it. Because unsafe actions are corrected before they ever execute, the reward function needs no safety penalty at all, which speeds training ~30%. In production the QP layer compiles to a TensorRT plugin (INT8, ≈0.8 ms); an offline lookup table stands by so that if the solver ever fails, the car falls back to maximum-conservative braking — meeting the single-point-fault metric of ASIL-D.
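For a one-dimensional action the safety QP has a closed form, which makes the mechanism easy to inspect. The sketch below is a stand-in for the OSQP/qpth layer, not a reproduction of it: the time-headway safety function, the zero-lead-acceleration assumption and all default values are hypothetical choices made for illustration.

```python
def cbf_filter(a_net, gap, v_ego, v_lead, headway=1.5, alpha=1.0,
               a_min=-8.0, a_max=3.0):
    """Nearest acceleration to a_net that satisfies h_dot >= -alpha * h.

    Assumed safety function: h(x) = gap - headway * v_ego (positive when safe).
    Its derivative, with lead acceleration taken as zero:
        h_dot = (v_lead - v_ego) - headway * a
    so the CBF condition becomes an upper bound on a.
    Returns (a_safe, feasible).
    """
    h = gap - headway * v_ego
    a_upper = ((v_lead - v_ego) + alpha * h) / headway
    if a_upper < a_min:
        # No admissible action satisfies the barrier: fall back to full braking.
        return a_min, False
    # Projection onto the interval [a_min, min(a_max, a_upper)].
    a_safe = min(max(a_net, a_min), min(a_max, a_upper))
    return a_safe, True
```

When the gap is comfortable the policy output passes through untouched; as the gap closes the filter overrides it with the largest admissible acceleration, and when even full braking cannot satisfy the barrier it returns the braking limit with `feasible=False`, mirroring the maximum-conservative fallback described above.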

The two ways constrained RL fails
(1) Wrong timescale. If λ updates as fast as the policy, the advantage estimate fights a moving penalty: variance blows up and the policy oscillates instead of converging. Keep the dual step size far below the policy step size (α ≪ β). (2) Non-convex or non-stationary constraints. A CBF's QP is only sound when h(x) is convex; for non-convex safe sets you must split into convex sub-regions or use a robust CBF. When the law itself changes (a violation drops from 6 points to 3) the budget di shifts, and Meta-CPO lets the policy re-adapt without retraining from scratch. Treat both as design decisions, not defaults.

4 · Imitation + RL — learn from humans without forgetting them

Intuition. Pure RL on a car is too dangerous and too slow to explore from scratch, so you start from human demonstrations (behavior cloning) and fine-tune with RL. But two things go wrong: the demonstrations cover only the states a competent human visits (so the policy is lost the moment it drifts off-distribution), and once RL takes over it can wander so far that it forgets the human prior entirely. The mechanism is a KL leash to the demonstration policy that loosens as the car proves itself over real mileage.

βt = max( β0 − k·t , βmin )   |   10⁴ km ≈ 2×10⁶ steps,   k ≈ 5×10⁻⁵ ⇒ ≈10% decay per 10⁴ km,   βmin = 1e-4

Engineering detail. The behavior-cloning KL coefficient β starts high, so the policy hugs the human prior, and decays with cumulative mileage so exploration widens only as on-road evidence accrues. β never reaches zero (βmin = 1e-4), so a residual leash always holds, which is what keeps the safety-driver takeover rate under the company red line of 0.1 per 1000 km. The schedule is data-aware: with >95% highway coverage you can decay faster (k up to 1×10⁻⁴) to explore sooner; with <70% urban coverage, switch to a conservative exponential βt = β0·0.9999ᵗ. If PPO's clip already limits update size, start β an order of magnitude lower so the two constraints don't stack and kill exploration.
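Both schedules fit in two small functions. This is a sketch only: the unit of `t` must match whatever unit `k` and the decay base were calibrated in, `beta0` is left as a caller-supplied value because the lesson does not state it, and applying the βmin floor to the exponential variant is an assumption.

```python
def beta_linear(t, beta0, k=5e-5, beta_min=1e-4):
    """beta_t = max(beta0 - k*t, beta_min): linear decay with a floor."""
    return max(beta0 - k * t, beta_min)


def beta_exponential(t, beta0, decay=0.9999, beta_min=1e-4):
    """beta_t = beta0 * decay**t, floored at beta_min (floor is an assumption)."""
    return max(beta0 * decay ** t, beta_min)


def pick_schedule(highway_coverage, urban_coverage):
    """Data-aware choice described in the text; returns (schedule name, k or decay base)."""
    if urban_coverage < 0.70:
        return "exponential", 0.9999
    if highway_coverage > 0.95:
        return "linear", 1e-4
    return "linear", 5e-5
```

The floor is what keeps the leash from ever vanishing: no matter how large `t` grows, both functions return at least `beta_min`.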

Engineering detail — fix the off-distribution states, don't just clamp them. Use DAgger: run the policy, find the frames where it diverges from the expert (high ensemble variance over a 300 ms window, confirmed across 3 consecutive frames), and instead of discarding those failures, trigger a conservative fallback (gentle stop), send the clip to a human-review queue, and re-inject the corrected action as a pseudo-label — weak-label self-correction. In a complex-intersection scenario four DAgger rounds cut the expert-failure rate from 8.3% to 1.1% and lifted closed-loop success 12.7%, at only +3.8% labeling cost. When you also train a GAIL discriminator to imitate human style, watch for mode collapse: a discriminator that locks onto one driving mode makes lane-keeping fail. Deploy a lightweight discriminator replica in the domain controller; if its confidence stays below 0.2 for 20 ms, request takeover and log the scene — driving the lane-keep success rate from 92.4% to 99.7% on a long highway-loop test.
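The divergence trigger can be sketched as a streak counter over per-frame ensemble variance. The variance threshold, the use of a scalar action per ensemble member, and the function name are illustrative assumptions; at 10 frames per second, three consecutive frames span the 300 ms window mentioned above.

```python
from statistics import pvariance


def divergence_flags(ensemble_actions, var_threshold, confirm_frames=3):
    """Flag frames where the ensemble has disagreed for `confirm_frames` in a row.

    ensemble_actions: one list per frame, holding one scalar action per
    ensemble member. A frame is flagged only once the variance has exceeded
    `var_threshold` on that frame and the preceding confirm_frames - 1 frames.
    """
    flags, streak = [], 0
    for frame in ensemble_actions:
        streak = streak + 1 if pvariance(frame) > var_threshold else 0
        flags.append(streak >= confirm_frames)
    return flags
```

A flagged frame would then trigger the conservative fallback and send the clip to the review queue; requiring a streak rather than a single spike keeps one noisy frame from stopping the car.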

5 · The perception state — one frame is not Markov

Intuition. A single bird's-eye-view frame loses information: you cannot tell a parked car from a slow one, and an occluded pedestrian simply isn't there. Acting on one frame violates the Markov property, and the policy makes jerky, unsafe decisions. The fix is to make the state a sufficient statistic by carrying recent history in a recurrent hidden state.

Engineering detail. A 256×256×8 multi-channel BEV (semantics, speed, occupancy…) is down-sampled by three 3×3 convolutions to 32×32×64, fed through a ConvGRU (hidden 32×32×128) so the state ht summarizes history, then an MLP policy head (512, 256, |A|). On an automotive SoC this runs end-to-end in ~12 ms, inside the 10 Hz control budget, and closed-loop lane-change success rose 18% — evidence the state is "Markov enough." Training uses truncated BPTT (k=8); inference keeps a hidden-state queue. When a task needs to remember something 5 s ago (a light that turned red while occluded), the GRU's gradients vanish — add a small memory-augmented head: keep N=20 historical BEV tokens and run light cross-frame attention, or cascade a short-horizon GRU with a long-horizon key-value memory. To prove the history actually helps, measure the conditional-entropy drop H(st+1|ot) vs H(st+1|o0:t), and ablate the GRU to confirm the closed-loop collision rate rises without it.
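The shape and timing figures above can be sanity-checked with simple arithmetic. The sketch assumes each 3×3 convolution uses stride 2 and padding 1 (the text gives only the kernel size and the end shapes) and that BEV frames arrive once per control tick.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Spatial size after one convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def downsample(size, layers=3):
    """Apply `layers` stride-2 3x3 convolutions in sequence."""
    for _ in range(layers):
        size = conv_out(size)
    return size


CONTROL_HZ = 10
budget_ms = 1000 / CONTROL_HZ    # time available per control tick
latency_ms = 12                  # end-to-end figure quoted in the lesson
bptt_frames = 8                  # truncated-BPTT window k
bptt_horizon_s = bptt_frames / CONTROL_HZ
frames_for_5s = 5 * CONTROL_HZ   # frames needed to reach an event 5 s back

print(downsample(256), budget_ms, bptt_horizon_s, frames_for_5s)
```

Under these assumptions 256 halves three times to 32, the 12 ms forward pass uses a small part of the 100 ms tick, and an event 5 s back lies 50 frames away, well beyond the 8-frame gradient window. That gap is why the long-horizon memory head is needed.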

Figure (the five-step loop for this lesson): 1 · FORMULATE (S, A, R, P + cost; tire-force action) → 2 · DIAGNOSE (friction limit, hard safety) → 3 · ENGINEER (friction norm, CPO, CBF, BC+DAgger) → 4 · GUARD (shadow mode, QP fallback brake) → 5 · ITERATE (sim2real). Closing the sim-to-real gap re-exposes a physics mismatch, so re-run the loop.

6 · Sim-to-real — close the gap, then bound the loss

Intuition. The policy is trained in a simulator whose dynamics, latency and rendered images are all slightly wrong. Domain randomization over those parameters makes the policy robust, but you still need to (a) find the right randomization distribution automatically and (b) prove, before going on a public road, that no plausible mismatch can tank performance. The first is bilevel optimization; the second is a Lipschitz-style return bound.

Engineering detail — model the CAN-bus delay as part of the MDP. A command issued now arrives late and attenuated: a_real(t) = a_cmd(t − τ) · (1 − ε) + ε·n(t), turning "late" into a combined amplitude + phase error the agent feels during training. Convert a 1 ms delay into a path error via 2·v·sin(Δφ) — about 3 mm of lateral error at 60 km/h — so the reward can carry it directly: rt = −(clat·elat² + cdelay·τ²). Randomize τ across 10–50 ms in training, then do one online calibration on the real car (read true CAN timestamps via a diagnostic tool, correct λ₂) for near-zero-shot transfer. If the bus upgrades to CAN-FD at 8 Mbit/s the worst-case delay drops from ~15 ms to ~2 ms, so the delay-penalty weight must drop an order of magnitude or the policy turns timidly over-conservative.
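The delay model and the delay-aware reward can be sketched with a small command buffer. Treat it as an illustration under assumptions: delay is quantized to whole control steps, the control period defaults to 10 ms, n(t) is Gaussian, and the function names and weights are hypothetical.

```python
import random
from collections import deque


def delayed_actuation(commands, delay_steps, eps, noise_std=1.0, seed=0):
    """a_real(t) = a_cmd(t - tau) * (1 - eps) + eps * n(t), with tau in steps."""
    rng = random.Random(seed)
    buffer = deque([0.0] * delay_steps)  # commands still in flight on the bus
    out = []
    for cmd in commands:
        buffer.append(cmd)
        delayed = buffer.popleft()
        out.append(delayed * (1.0 - eps) + eps * rng.gauss(0.0, noise_std))
    return out


def sample_delay_steps(rng, dt_ms=10.0, lo_ms=10.0, hi_ms=50.0):
    """Domain randomization: draw tau uniformly in [lo_ms, hi_ms], as whole steps."""
    return round(rng.uniform(lo_ms, hi_ms) / dt_ms)


def tracking_reward(e_lat, tau_s, c_lat, c_delay):
    """r_t = -(c_lat * e_lat^2 + c_delay * tau^2)."""
    return -(c_lat * e_lat ** 2 + c_delay * tau_s ** 2)
```

During training a fresh `delay_steps` would be drawn per episode, so the policy sees the whole 10 to 50 ms range rather than a single nominal latency.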

Engineering detail — bound the worst-case return before you drive. To answer "if the camera fogs up, is the policy still safe?" without retraining: map clean and degraded observations through a frozen ResNet-18 to compute an empirical Wasserstein-1 distance W between them, estimate the policy and Q-network Lipschitz constants by power iteration (Lπ, LQ), and plug into

Rmin ≥ R(π, D0) − LQ·Lπ·Wmax·γ/(1−γ)

With γ=0.99 the discount coefficient is γ/(1−γ)=99; a worked example gives Rmin ≥ 865 − 2.15·1.83·0.132·99 ≈ 813, a ≤6% drop, inside the "no more than 10% degradation" acceptance line. Re-estimating after a renderer or driver-version change needs only ~50k fresh frames and runs in ~3 minutes, with no retraining. The bound makes the trade-off tangible: the perception gap and the network's Lipschitz constants set the worst-case return, and spectral normalization (capping L below 1) buys back margin.
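The worked example can be reproduced directly from the bound. The function name and the 10% acceptance check are written here for illustration; the numbers are the ones quoted in the text.

```python
def worst_case_return(r_nominal, l_q, l_pi, w_max, gamma):
    """R_min >= R0 - L_Q * L_pi * W_max * gamma / (1 - gamma).

    Returns (lower bound on return, penalty term).
    """
    coeff = gamma / (1.0 - gamma)
    penalty = l_q * l_pi * w_max * coeff
    return r_nominal - penalty, penalty


r_min, penalty = worst_case_return(865.0, l_q=2.15, l_pi=1.83, w_max=0.132, gamma=0.99)
drop = penalty / 865.0
accepted = drop <= 0.10  # "no more than 10% degradation" acceptance line
print(f"R_min >= {r_min:.1f}, drop = {drop:.1%}, accepted = {accepted}")
```

Because the penalty is a plain product, halving either Lipschitz constant halves the worst-case loss, which is why spectral normalization is the cheapest lever on this bound.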

Sim-to-real: how the perception gap bounds worst-case return

A degraded observation is a distribution shift of size W; it can cost you at most LQ·Lπ·W·γ/(1−γ) in return. Shrinking the Lipschitz constants with spectral normalization is the cheapest lever — it multiplies the whole penalty down.

(Interactive widget readouts: discount coeff, return penalty, R_min / R₀, verdict.)

Diagnosing a stubborn gap — causal confusion
When the bound is exceeded on the real car, the cause is often a confounder the simulator got wrong (a sensor that drifts with temperature). Intervene on each candidate factor with a structural model, estimate its average causal effect on return via double machine learning, and write the high-impact factors into a causal-aware reward shaping and force them into the randomization distribution. In one diagnosis a single confounder accounted for −18.6% of return; correcting it cut the value gap from 0.42 to 0.09 and the whole loop — locate to closed-loop verify — took 3 weeks versus a 60%-longer brute-force grid search.

Further considerations

The through-line
Every section is one row of the MDP table turned into a mechanism: inconsistent + grip-limited action → tire-force normalization on a differentiable friction circle; hard safety → a CPO budget sized from the 12-point license, a slow Lagrange dual, and a CBF-QP safety layer; scarce, off-distribution demonstrations → mileage-decayed BC-KL plus DAgger self-correction; non-Markov perception → multi-frame BEV with a recurrent state; sim-to-real transfer → modeled CAN delay, automatic domain randomization, and a Lipschitz return bound with causal-confusion diagnosis. As in lesson 67, you never reached for a tool until a row of the table demanded it — and closing the last gap re-exposed the first, which is why the loop is a loop.