Autonomous-driving path planning

A driving policy that turns a steering wheel and a pedal is a continuous-control MDP, but one where physics, the law, and a safety case all sit inside the loop. The binding difficulties of this domain differ from a game's: the action space is dimensionally inconsistent and physically constrained (a steering angle and a throttle command do not share a unit, and the tires can only deliver so much grip); the objective must respect hard safety constraints that may never be violated, rather than simply maximizing reward; human demonstrations must be fused with exploration without the policy forgetting them; the observation is a high-dimensional, partially observable sensor stream; and almost all training happens in a simulator that does not match the real car. Each difficulty calls for its own mechanism.

The method — five steps, every lesson

Same loop as lesson 67. (1) Formulate the MDP — state, action, reward, transition, horizon. (2) Diagnose the one property that makes this MDP hard. (3) Engineer the mechanism that removes exactly that difficulty. (4) Guard it in production — detect when it breaks and fall back. (5) Iterate. For a driving policy the difficulties arrive in a fixed order — physics first, safety second, demonstrations third, perception fourth, the sim-to-real gap last — and the lesson runs the loop once per layer.

1 · Formulate — the MDP behind a path-planning policy

Intuition. The car sees the road through cameras, lidar and radar, decides how to steer and how hard to accelerate or brake, and is rewarded for staying on its path without crashing or breaking the law. That is an MDP — but unlike a game, the transition is real physics and the reward has a region (a crash, a revoked license) that is not "low reward," it is forbidden. The four pieces below each have an awkward part that the rest of the lesson removes.

MDP = (S, A, R, P, γ) · maximize E[ Σₜ γᵗ rₜ ] subject to E[ Σₜ cₜ ] ≤ d

Piece	For a path-planning policy	The awkward part
State S	multi-frame BEV raster (≈256×256×8), lidar point cloud, ego speed / yaw-rate, map	a single frame is not Markov — occlusion and motion need history
Action A	steering-wheel angle δ_sw and longitudinal accel a (or torque)	different units, and the tires impose a nonlinear friction limit
Reward R	path-tracking error, comfort, progress — plus cost signals cₜ for violations	some costs (collision, license loss) are hard constraints, not penalties
Transition P	vehicle dynamics + CAN-bus actuation + other road users	learned in simulation; the real car has latency, friction and sensor gaps

Same pattern as lesson 67: the rightmost column is the lesson. But notice the new term in the objective — a constraint E[Σ cₜ] ≤ d. Driving is the first domain in this track where reward maximization alone is not enough; safety is a separate budget that the optimizer must respect.

2 · The action space — unify the units, then respect the friction circle

Intuition. A network that outputs "steering angle in degrees" and "acceleration in m/s²" is asking the optimizer to compare apples to oranges: the two outputs have different units and wildly different scales, so gradients are imbalanced and the policy is hard to train. Worse, the two are not independent — a tire can only produce so much total force, and force spent turning is force you cannot spend braking. Steer hard and brake hard and the tire saturates: the car slides. The fix is to stop letting the network speak in wheel angles and pedals, and let it speak in tire forces, which share one unit (newtons) and have one clean physical limit (a circle).

F_y = C_α·δ_f − C_α·(l_f·r)/v | F_x = m·a ⇒ action = (F_x, F_y) in newtons

Engineering detail — four layers from input to actuator. The steering-wheel angle δ_sw maps to a front-wheel angle δ_f through a calibratable, speed-dependent ratio i(v); a single-track (bicycle) model then turns δ_f into a lateral force F_y, while longitudinal force is simply F_x=m·a. Now the action space is (F_x, F_y), both in newtons — the unit mismatch is gone. The hard part is the grip limit, handled with a differentiable friction-circle normalization:

ρ = √(F_x² + F_y²) / (μ·m·g) | ρ ≤ 1 ⇒ pass through; ρ > 1 ⇒ project onto the circle, keep direction, set magnitude 1

where μ is the live road-friction estimate (identified online from the ESC wheel-speed variance). The projection uses a smooth √(x²+y²) approximation so the whole map is differentiable and trains under autograd. After the network emits a normalized force vector (magnitude at most 1), an actuator de-normalization multiplies back by μ·m·g and inverts the maps: a = F_x/m, δ_f = atan((F_y + C_α·l_f·r/v)/C_α), then the inverse ratio gives δ_sw. Because μ, v and r are real-time signals, de-normalization is naturally gain-scheduled on speed, with no lookup table needed. A final differentiable rate-limit barrier caps steering slew at 540°/s (written in tanh form so gradients flow), and the network learns to "turn the wheel without throwing it." A small training trick closes the loop: penalize the region ρ > 0.95 with −10 reward and inject ±5% noise into the μ estimate so the policy is robust to a slippery surprise.
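The projection and the de-normalization described above can be sketched numerically. This is a minimal plain-Python illustration, not the production layer: the function names, the vehicle values (1500 kg, μ = 0.8, C_α = 80 000 N/rad, l_f = 1.2 m) and the `eps` smoothing constant are all assumptions made for the example, and a trainable version would express the same arithmetic in an autograd framework.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def project_to_friction_circle(fx, fy, mu, mass, eps=1e-9):
    """Scale (fx, fy) back onto the friction circle if it exceeds mu*m*g."""
    f_max = mu * mass * G
    # eps keeps the square root smooth at the origin (the smooth-sqrt idea).
    rho = math.sqrt(fx * fx + fy * fy + eps) / f_max
    scale = 1.0 if rho <= 1.0 else 1.0 / rho  # keep direction, cap magnitude
    return fx * scale, fy * scale, rho


def denormalize(u_x, u_y, mu, mass, c_alpha, l_f, yaw_rate, speed):
    """Map a normalized action (magnitude <= 1) to (accel, front-wheel angle)."""
    f_max = mu * mass * G
    fx, fy = u_x * f_max, u_y * f_max
    accel = fx / mass  # a = F_x / m
    delta_f = math.atan((fy + c_alpha * l_f * yaw_rate / speed) / c_alpha)
    return accel, delta_f


# Hypothetical request that exceeds the available grip on a mu = 0.8 surface.
fx, fy, rho = project_to_friction_circle(10_000.0, 10_000.0, mu=0.8, mass=1500.0)
```

Because `denormalize` reads μ, the yaw rate and the speed on every call, the same normalized action yields different wheel angles as conditions change, which is the gain-scheduling effect the paragraph describes.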

Why this is the right shape

In the normalized space every vector inside the unit disk corresponds one-to-one with a physically reachable tire force, so the policy is dynamically consistent by construction: it can never command a force the tires cannot deliver. A closed-loop double-lane-change at 120 km/h runs with no extra calibration. The optimizer never wastes capacity learning physics it can be given, which is exactly the lesson-67 discipline of not reaching for a tool until a row of the MDP table demands it.

3 · The hard constraint — safety is a budget, not a penalty

Intuition. You can shape a game's reward freely because the worst case is losing a match. A car that runs a red light can lose its license, and "−1000 reward" does not capture "you are not allowed to do this." The right frame is constrained RL: maximize the driving reward subject to the expected cost of violations staying under a budget. In China a driver has 12 license points per year; that whole-year budget has to be converted into a per-decision-step allowance the optimizer can enforce.

Engineering detail — sizing the budget (CPO). Assign cost weights per violation (speeding by 10%: c = 3; crossing a solid line: c = 1). A car driving ~4 h/day logs ≈1.5k h/year ≈ 150k transitions. Spread 12 points across the year and the per-step constraint is

d_i = 12 / 150k ≈ 8×10⁻⁵ points/step | r_t = r_path − λ·c_t, E[c_t] ≤ d_i

This E[c_t] ≤ d_i is written straight into the optimizer as the Constrained Policy Optimization (CPO) constraint. If the business cares about the probability of losing the license rather than the expected score, swap in a chance constraint P(Σ c_t > 12) ≤ 1% and solve with Chance-Constrained CPO for a tighter d_i.
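The budget arithmetic is small enough to check directly. The sketch below reproduces d_i = 12 / 150k and the shaped reward r_t = r_path − λ·c_t; the helper names and the batch-level check are illustrative assumptions, not part of CPO itself.

```python
ANNUAL_POINTS = 12              # license points per year (from the text)
TRANSITIONS_PER_YEAR = 150_000  # transitions per year (from the text)


def per_step_budget(points=ANNUAL_POINTS, transitions=TRANSITIONS_PER_YEAR):
    """d_i: license points the policy may spend per transition, on average."""
    return points / transitions


def shaped_reward(r_path, cost, lam):
    """r_t = r_path - lambda * c_t."""
    return r_path - lam * cost


def constraint_violated(costs, budget):
    """Empirical check of E[c_t] <= d_i over a batch of per-step costs."""
    return sum(costs) / len(costs) > budget


d_i = per_step_budget()  # about 8e-5 points per step
```

One 3-point speeding event in a batch of 10 000 steps averages 3×10⁻⁴ points per step, already above the budget, which shows how little slack the annual allowance leaves per decision.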

Engineering detail: the Lagrange multiplier is a two-timescale system. The dual variable λ trades reward against safety, and its update frequency matters more than its value. The dual must move much slower than the policy (dual learning rate α ≪ policy learning rate β); otherwise the policy gradient is yanked around every step, variance explodes and the policy chatters. If it moves too slowly, however, constraint cost accumulates and forces a hard second-order correction. The sweet spot is an intermediate band: update λ once per large batch or episode, in practice at 1/20–1/50 of the policy-update rate, backed by a real-time safety filter for emergencies.
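A two-timescale loop can be sketched as projected dual ascent that fires once every N policy updates. Everything here is an illustrative assumption: `policy_step` is a hypothetical callback that performs one policy update under the current λ and returns that batch's mean cost, `dual_lr` plays the role of the dual rate α, and `dual_every=20` sits at the fast end of the 1/20–1/50 band.

```python
def update_dual(lam, mean_cost, budget, dual_lr):
    """Projected dual ascent: raise lambda when cost exceeds budget, floor at 0."""
    return max(0.0, lam + dual_lr * (mean_cost - budget))


def train(policy_step, num_policy_updates, budget, dual_lr=0.05, dual_every=20):
    """Run fast policy updates with a slow, batched update of the multiplier."""
    lam, costs = 0.0, []
    for i in range(1, num_policy_updates + 1):
        costs.append(policy_step(lam))  # fast timescale: one policy update
        if i % dual_every == 0:         # slow timescale: one dual update
            lam = update_dual(lam, sum(costs) / len(costs), budget, dual_lr)
            costs.clear()
    return lam
```

Averaging the cost over the whole slow window before touching λ is what keeps the penalty from moving under the advantage estimate between consecutive policy updates.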

Engineering detail — hard-coding safety with a Control Barrier Function. Lagrangian methods make safety likely; a CBF makes it certain. Define a safety function h(x) (positive inside the safe set) and require its derivative to keep the set forward-invariant. Wrap the policy output a_net in a tiny quadratic program that finds the nearest safe action:

a_safe = argmin_a ‖a − a_net‖² s.t. ḣ(x,a) ≥ −α·h(x), a ∈ [a_min, a_max]

Solve with OSQP/qpth, and use the KKT conditions to get ∂a_safe/∂a_net so the QP layer is a differentiable module registered with autograd — the policy trains end-to-end through it. Because unsafe actions are corrected before they ever execute, the reward function needs no safety penalty at all, which speeds training ~30%. In production the QP layer compiles to a TensorRT plugin (INT8, ≈0.8 ms); an offline lookup table stands by so that if the solver ever fails, the car falls back to maximum-conservative braking — meeting the single-point-fault metric of ASIL-D.
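For a single scalar action with one affine barrier constraint, the QP above has a closed-form solution (a clip), which makes the mechanism easy to see without a solver. The sketch below assumes a car-following barrier h = gap − T·v_ego − d_min with a constant-speed lead vehicle, so that ḣ = (v_lead − v_ego) − T·a; the barrier, the headway T, d_min, the gain α and the acceleration limits are all hypothetical choices for illustration, not the document's safety function.

```python
def cbf_safe_accel(a_net, gap, v_ego, v_lead, headway=1.5, d_min=5.0,
                   alpha=1.0, a_min=-8.0, a_max=3.0):
    """Nearest acceleration to a_net satisfying h_dot >= -alpha * h.

    Returns (a_safe, feasible). With a scalar action and one affine
    constraint, argmin |a - a_net|^2 reduces to clipping onto an interval.
    """
    h = gap - headway * v_ego - d_min
    # h_dot = (v_lead - v_ego) - headway * a >= -alpha * h  =>  a <= a_ub
    a_ub = ((v_lead - v_ego) + alpha * h) / headway
    upper = min(a_max, a_ub)
    if upper < a_min:
        return a_min, False  # infeasible: fall back to maximum braking
    return min(max(a_net, a_min), upper), True
```

When the constraint is inactive the layer returns `a_net` unchanged, so the gradient of `a_safe` with respect to `a_net` is 1 there and 0 where the clip is active; a multi-dimensional QP recovers the analogous sensitivities from its KKT system.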

The two ways constrained RL fails

(1) Wrong timescale. If λ updates as fast as the policy, the advantage estimate fights a moving penalty: variance blows up and the policy oscillates instead of converging. Keep α ≪ β. (2) Non-convex or non-stationary constraints. A CBF's QP is only sound when h(x) is convex; for non-convex safe sets you must split into convex sub-regions or use a robust CBF, and when the law itself changes (a violation drops from 6 points to 3) the budget d_i shifts — Meta-CPO lets the policy re-adapt without retraining from scratch. Treat both as design decisions, not defaults.

4 · Imitation + RL — learn from humans without forgetting them

Intuition. Pure RL on a car is too dangerous and too slow to explore from scratch, so you start from human demonstrations (behavior cloning) and fine-tune with RL. But two things go wrong: the demonstrations cover only the states a competent human visits (so the policy is lost the moment it drifts off-distribution), and once RL takes over it can wander so far that it forgets the human prior entirely. The mechanism is a KL leash to the demonstration policy that loosens as the car proves itself over real mileage.

β_t = max( β₀ − k·t , β_min ) | 10⁴ km ≈ 2×10⁶ steps, k ≈ 5×10⁻⁵ ⇒ ≈10% decay per 10⁴ km, β_min = 1e-4

Engineering detail. The behavior-cloning KL coefficient β starts high — the policy hugs the human prior — and decays with cumulative mileage so exploration widens only as on-road evidence accrues. β never reaches zero (β_min = 1e-4) so a residual leash always holds, which is what keeps a safety-driver takeover rate under the company red line of <0.1 per 1000 km. The schedule is data-aware: with >95% highway coverage you can decay faster (k up to 1×10⁻⁴) to explore sooner; with <70% urban coverage switch to a conservative exponential β_t = β₀·0.9999ᵗ. If PPO's clip already limits update size, start β an order of magnitude lower so the two constraints don't stack and kill exploration.
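The two schedules can be written as small functions. This is a sketch under stated assumptions: t is taken to be a step count, β₀ is not given in the text and is left as a required argument, and applying the β_min floor to the exponential variant is an added assumption carried over from the linear form.

```python
def beta_linear(t, beta0, k=5e-5, beta_min=1e-4):
    """Linear decay with a floor: beta_t = max(beta0 - k*t, beta_min)."""
    return max(beta0 - k * t, beta_min)


def beta_exponential(t, beta0, decay=0.9999, beta_min=1e-4):
    """Exponential decay beta0 * decay**t; the floor is an added assumption."""
    return max(beta0 * decay ** t, beta_min)
```

With k = 5×10⁻⁵ and 2×10⁶ steps per 10⁴ km, the linear term removes 100 units of β per 10⁴ km, so the quoted 10% decay would correspond to β₀ ≈ 1000; that value is an inference from the numbers, not something the text states.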

Engineering detail — fix the off-distribution states, don't just clamp them. Use DAgger: run the policy, find the frames where it diverges from the expert (high ensemble variance over a 300 ms window, confirmed across 3 consecutive frames), and instead of discarding those failures, trigger a conservative fallback (gentle stop), send the clip to a human-review queue, and re-inject the corrected action as a pseudo-label — weak-label self-correction. In a complex-intersection scenario four DAgger rounds cut the expert-failure rate from 8.3% to 1.1% and lifted closed-loop success 12.7%, at only +3.8% labeling cost. When you also train a GAIL discriminator to imitate human style, watch for mode collapse: a discriminator that locks onto one driving mode makes lane-keeping fail. Deploy a lightweight discriminator replica in the domain controller; if its confidence stays below 0.2 for 20 ms, request takeover and log the scene — driving the lane-keep success rate from 92.4% to 99.7% on a long highway-loop test.
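The divergence trigger (high ensemble variance, confirmed across 3 consecutive frames) can be sketched as a small stateful check. The class name, the scalar-action simplification and the variance threshold are assumptions for illustration; at a 10 Hz frame rate, three consecutive frames span the 300 ms window mentioned above.

```python
from collections import deque
from statistics import pvariance


class DivergenceTrigger:
    """Fires when ensemble disagreement persists for N consecutive frames."""

    def __init__(self, var_threshold, consecutive=3):
        self.var_threshold = var_threshold
        self.hits = deque(maxlen=consecutive)  # rolling record of recent frames

    def update(self, ensemble_actions):
        """ensemble_actions: one scalar action per ensemble member, this frame."""
        self.hits.append(pvariance(ensemble_actions) > self.var_threshold)
        return len(self.hits) == self.hits.maxlen and all(self.hits)
```

A `True` return is the point at which the pipeline described above would trigger the conservative fallback and queue the clip for human review; a single agreeing frame resets the confirmation.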

5 · The perception state — one frame is not Markov

Intuition. A single bird's-eye-view frame loses information: you cannot tell a parked car from a slow one, and an occluded pedestrian simply isn't there. Acting on one frame violates the Markov property, and the policy makes jerky, unsafe decisions. The fix is to make the state a sufficient statistic by carrying recent history in a recurrent hidden state.

Engineering detail. A 256×256×8 multi-channel BEV (semantics, speed, occupancy…) is down-sampled by three 3×3 convolutions to 32×32×64, fed through a ConvGRU (hidden 32×32×128) so the state h_t summarizes history, then an MLP policy head (512, 256, |A|). On an automotive SoC this runs end-to-end in ~12 ms, inside the 10 Hz control budget, and closed-loop lane-change success rose 18%, evidence that the state is "Markov enough." Training uses truncated BPTT (k=8); inference keeps a hidden-state queue. When a task needs to remember something from 5 s ago (a light that turned red while occluded), the GRU's gradients vanish, so add a small memory-augmented head: keep N=20 historical BEV tokens and run light cross-frame attention, or cascade a short-horizon GRU with a long-horizon key-value memory. To prove the history actually helps, measure the conditional-entropy drop from H(s_{t+1} | o_t) to H(s_{t+1} | o_{0:t}), and ablate the GRU to confirm the closed-loop collision rate rises without it.
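The encoder's shape arithmetic can be checked with the standard convolution output-size formula. The stride (2) and padding (1) are assumptions inferred from the 8× spatial reduction, and the intermediate channel widths (16, 32) are hypothetical; only the 256×256×8 input and 32×32×64 output come from the text.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Spatial output size of one convolution layer."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_shapes(h=256, w=256, channels=(8, 16, 32, 64)):
    """(H, W, C) after each of the three assumed stride-2 3x3 convolutions."""
    shapes = [(h, w, channels[0])]
    for c in channels[1:]:
        h, w = conv_out(h), conv_out(w)
        shapes.append((h, w, c))
    return shapes


def frames_spanned(seconds, rate_hz):
    """Number of frames a memory of the given duration must cover."""
    return round(seconds * rate_hz)


shapes = encoder_shapes()
long_memory = frames_spanned(5.0, 10)  # 5 s of history at 10 Hz
```

The second helper makes the memory argument concrete: 5 s at 10 Hz is 50 frames, well beyond the k=8 truncation window, which is why gradients from that far back never reach the GRU.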

6 · Sim-to-real — close the gap, then bound the loss

Intuition. The policy is trained in a simulator whose dynamics, latency and rendered images are all slightly wrong. Domain randomization over those parameters makes the policy robust, but you still need to (a) find the right randomization distribution automatically and (b) prove, before going on a public road, that no plausible mismatch can tank performance. The first is bilevel optimization; the second is a Lipschitz-style return bound.

Engineering detail — model the CAN-bus delay as part of the MDP. A command issued now arrives late and attenuated: a_real(t) = a_cmd(t − τ) · (1 − ε) + ε·n(t), turning "late" into a combined amplitude + phase error the agent feels during training. Convert a 1 ms delay into a path error via 2·v·sin(Δφ) — about 3 mm of lateral error at 60 km/h — so the reward can carry it directly: r_t = −(c_lat·e_lat² + c_delay·τ²). Randomize τ across 10–50 ms in training, then do one online calibration on the real car (read true CAN timestamps via a diagnostic tool, correct λ₂) for near-zero-shot transfer. If the bus upgrades to CAN-FD at 8 Mbit/s the worst-case delay drops from ~15 ms to ~2 ms, so the delay-penalty weight must drop an order of magnitude or the policy turns timidly over-conservative.
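The delay model a_real(t) = a_cmd(t − τ)·(1 − ε) + ε·n(t) can be sketched with a fixed-length buffer. Several details are assumptions made for the example: the delay is quantized to whole simulation steps, n(t) is drawn as unit-variance Gaussian noise, the 10 ms step length is hypothetical, and commands issued before the episode starts are treated as zero.

```python
import random
from collections import deque


class DelayedActuator:
    """Applies a fixed step delay, attenuation (1 - eps) and additive noise."""

    def __init__(self, delay_steps, eps=0.05, seed=0):
        self.eps = eps
        self.rng = random.Random(seed)
        # Pre-filled with zeros: nothing has been commanded before t = 0.
        self.buffer = deque([0.0] * delay_steps, maxlen=delay_steps + 1)

    def step(self, a_cmd):
        self.buffer.append(a_cmd)
        delayed = self.buffer[0]  # the command issued delay_steps ago
        noise = self.rng.gauss(0.0, 1.0)
        return delayed * (1.0 - self.eps) + self.eps * noise


def sample_delay_steps(rng, dt_ms=10.0, lo_ms=10.0, hi_ms=50.0):
    """Draw tau uniformly from the 10-50 ms range, in whole simulation steps."""
    return round(rng.uniform(lo_ms, hi_ms) / dt_ms)
```

Sampling a fresh `delay_steps` at the start of each training episode is one simple way to realize the randomization over τ; a finer simulation step would resolve delays below 10 ms.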

Engineering detail — bound the worst-case return before you drive. To answer "if the camera fogs up, is the policy still safe?" without retraining: map clean and degraded observations through a frozen ResNet-18 to compute an empirical Wasserstein-1 distance W between them, estimate the policy and Q-network Lipschitz constants by power iteration (L_π, L_Q), and plug into

R_min ≥ R(π, D₀) − L_Q·L_π·W_max·γ/(1−γ)

With γ=0.99 the horizon multiplier γ/(1−γ) is 99; a worked example gives R_min ≥ 865 − 2.15·1.83·0.132·99 ≈ 813, a drop of just under 6% that sits inside the "no more than 10% degradation" acceptance line. Re-estimating after a renderer or driver-version change needs only ~50k fresh frames and runs in ~3 minutes, with no retraining. The bound makes the trade-off tangible: the perception gap W_max and the networks' Lipschitz constants together set the worst-case return, and spectral normalization (capping L at 1) buys back margin.
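The worked example can be reproduced in a few lines. The function name is illustrative; the inputs are the figures quoted in the text, and the last line is a hypothetical what-if with both Lipschitz constants capped at 1.

```python
def worst_case_return(r_nominal, l_q, l_pi, w_max, gamma):
    """Lower bound R_min = R - L_Q * L_pi * W_max * gamma / (1 - gamma)."""
    horizon = gamma / (1.0 - gamma)
    return r_nominal - l_q * l_pi * w_max * horizon


r_min = worst_case_return(865.0, l_q=2.15, l_pi=1.83, w_max=0.132, gamma=0.99)
drop = 1.0 - r_min / 865.0  # relative degradation implied by the bound

# What-if: spectral normalization holding both constants at 1.
r_min_sn = worst_case_return(865.0, l_q=1.0, l_pi=1.0, w_max=0.132, gamma=0.99)
```

With the quoted constants the bound lands near 813.6, a drop of about 5.9%; capping both constants at 1 lifts it to roughly 852, which is the margin the bound attributes to spectral normalization.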

Diagnosing a stubborn gap — causal confusion

When the bound is exceeded on the real car, the cause is often a confounder the simulator got wrong (a sensor that drifts with temperature). Intervene on each candidate factor with a structural model, estimate its average causal effect on return via double machine learning, then write the high-impact factors into a causal-aware reward-shaping term and force them into the randomization distribution. In one diagnosis a single confounder accounted for −18.6% of return; correcting it cut the value gap from 0.42 to 0.09, and the whole loop, from locating the factor to closed-loop verification, took 3 weeks, versus a brute-force grid search that takes 60% longer.

Further considerations

More actuators reshape the friction limit. Add rear-wheel steering and the action becomes (F_x, F_y, δ_r): the friction circle becomes a 3-D ellipsoid, normalized by embedding the covariance via a Cholesky factor (shrinks turning radius ~8%). With four independently driven wheels the action is 4-D: map to per-tire forces, then use a 4×2 pseudo-inverse to project onto the circle so an unsaturated wheel automatically compensates for a saturated one, which improves evasive handling at the limit.
The bus delay becomes a game. In a platoon, 4–5 cars share one bus and arbitration delay turns into a dynamic, adversarial variable. Feed bus load into the state so the policy learns to throttle its own broadcast rate near congestion — folding communication QoS into the policy and unifying sense-decide-communicate.
Robustness to other agents' violations. Other drivers break rules too, which can force you into a passive violation. Model the opponents' policy uncertainty as a robust CPO with worst-case d_i, so the annual point budget holds against any opponent behavior — and for multi-vehicle collision avoidance use shared barrier functions solved with distributed QP / ADMM as the agent count grows.
tanh-Gaussian vs. GMM action heads. A tanh-squashed Gaussian gives a strictly bounded, monotone, single-mode output that passes module-level FMEA easily — the right choice for a safety-certified chip. A Gaussian mixture explores multimodal maneuvers better but is harder to certify. A common two-stage pattern: explore with a GMM in training, then distill the best mode into a tanh policy for deployment (DAgger roll-outs, KL-minimization, a boundary penalty λ·max(0,|a|−1)² so the tanh net doesn't over-stretch σ chasing tail mass).
Functional-safety evidence. ASIL targets want a delay-failure probability measured to FIT level (10⁻⁷/h for ASIL-B); turn the τ distribution into a random-fault injector during training so the policy's delay-failure probability is estimated up front, cutting late safety-validation iterations.
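The tanh-Gaussian head from the list above can be sketched for a single scalar action, together with the boundary penalty λ·max(0,|a|−1)². The function names and the 1e-6 stabilizer are assumptions; the text does not spell out which quantity the penalty is applied to, so the sketch simply takes any scalar (for example a teacher action or a pre-squash value).

```python
import math
import random


def tanh_gaussian_sample(mean, log_std, rng):
    """Sample a = tanh(u), u ~ N(mean, std^2), and return (a, log_prob(a))."""
    std = math.exp(log_std)
    u = mean + std * rng.gauss(0.0, 1.0)
    a = math.tanh(u)
    log_prob_u = (-0.5 * ((u - mean) / std) ** 2
                  - log_std - 0.5 * math.log(2.0 * math.pi))
    # Change of variables: subtract log|da/du| = log(1 - tanh(u)^2).
    log_prob_a = log_prob_u - math.log(1.0 - a * a + 1e-6)
    return a, log_prob_a


def boundary_penalty(a, lam=1.0):
    """lam * max(0, |a| - 1)^2: zero inside [-1, 1], quadratic outside."""
    return lam * max(0.0, abs(a) - 1.0) ** 2
```

The squashed sample always lies in [−1, 1], which is the boundedness property that makes this head easier to argue about in a module-level safety analysis than a mixture head.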

The through-line

Every section is one row of the MDP table turned into a mechanism: inconsistent + grip-limited action → tire-force normalization on a differentiable friction circle; hard safety → a CPO budget sized from the 12-point license, a slow Lagrange dual, and a CBF-QP safety layer; scarce, off-distribution demonstrations → mileage-decayed BC-KL plus DAgger self-correction; non-Markov perception → multi-frame BEV with a recurrent state; sim-to-real transfer → modeled CAN delay, automatic domain randomization, and a Lipschitz return bound with causal-confusion diagnosis. As in lesson 67, you never reached for a tool until a row of the table demanded it — and closing the last gap re-exposed the first, which is why the loop is a loop.