Autonomous-driving path planning
A driving policy that turns a steering wheel and presses a pedal is a continuous-control MDP, but one where physics, the law, and a safety case all sit inside the loop. The binding difficulties of this domain differ from a game's: the action space is dimensionally inconsistent and physically constrained (a steering angle and a throttle are not in the same unit, and the tires can deliver only so much grip); the objective must respect hard safety constraints that may never be violated, not just maximize reward; demonstrations from human drivers must be fused with exploration without being forgotten; the observation is a high-dimensional, partially observable sensor stream; and almost all training happens in a simulator that does not match the real car. Each difficulty names a mechanism.
1 · Formulate — the MDP behind a path-planning policy
Intuition. The car sees the road through cameras, lidar and radar, decides how to steer and how hard to accelerate or brake, and is rewarded for staying on its path without crashing or breaking the law. That is an MDP — but unlike a game, the transition is real physics and the reward has a region (a crash, a revoked license) that is not "low reward," it is forbidden. The four pieces below each have an awkward part that the rest of the lesson removes.
| Piece | For a path-planning policy | The awkward part |
|---|---|---|
| State S | multi-frame BEV raster (≈256×256×8), lidar point cloud, ego speed / yaw-rate, map | a single frame is not Markov — occlusion and motion need history |
| Action A | steering-wheel angle δsw and longitudinal accel a (or torque) | different units, and the tires impose a nonlinear friction limit |
| Reward R | path-tracking error, comfort, progress — plus cost signals cₜ for violations | some costs (collision, license loss) are hard constraints, not penalties |
| Transition P | vehicle dynamics + CAN-bus actuation + other road users | learned in simulation; the real car has latency, friction and sensor gaps |
Same pattern as lesson 67: the rightmost column is the lesson. But notice the new term in the objective — a constraint E[Σ cₜ] ≤ d. Driving is the first domain in this track where reward maximization alone is not enough; safety is a separate budget that the optimizer must respect.
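Written out, the constrained objective reads roughly as follows. This is a sketch: the discounted reward sum with factor γ is an assumption (γ only appears later in the lesson), while the cost sum and the budget d are as stated above.

```latex
\max_{\pi} \; \mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t} r_t\Big]
\quad \text{subject to} \quad
\mathbb{E}_{\pi}\Big[\sum_{t} c_t\Big] \le d
```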
2 · The action space — unify the units, then respect the friction circle
Intuition. A network that outputs "steering angle in degrees" and "acceleration in m/s²" is asking the optimizer to compare apples to oranges: the two outputs have different units and wildly different scales, so gradients are imbalanced and the policy is hard to train. Worse, the two are not independent — a tire can only produce so much total force, and force spent turning is force you cannot spend braking. Steer hard and brake hard and the tire saturates: the car slides. The fix is to stop letting the network speak in wheel angles and pedals, and let it speak in tire forces, which share one unit (newtons) and have one clean physical limit (a circle).
Engineering detail: four layers from input to actuator. The steering-wheel angle δsw maps to a front-wheel angle δf through a calibratable, speed-dependent ratio i(v); a single-track (bicycle) model then turns δf into a lateral force Fy, while the longitudinal force is simply Fx = m·a. Now the action space is (Fx, Fy), both in newtons, and the unit mismatch is gone. The hard part is the grip limit, handled with a differentiable friction-circle normalization that keeps the grip utilization ρ = √(Fx² + Fy²) / (μ·m·g) at or below 1,
where μ is the live road-friction estimate (identified online from the ESC wheel-speed variance). The projection uses a smooth √(x²+y²) approximation so the whole map is differentiable and trains under autograd. After the network emits a normalized force vector inside the unit circle, an actuator de-normalization multiplies back by μ·m·g and inverts the maps: a = Fx/m, δf = atan((Fy + Cα·lf·r/v)/Cα), then the inverse ratio gives δsw. Here Cα is the front cornering stiffness, lf the distance from the center of gravity to the front axle, r the yaw rate and v the speed. Because μ, v and r are real-time signals, de-normalization is naturally speed-gain-scheduled, with no lookup table needed. A final differentiable rate-limit barrier caps steering slew at 540°/s (written in tanh form so gradients flow), and the network learns to "turn the wheel without throwing it." A small training trick closes the loop: penalize the region ρ > 0.95, where ρ is the fraction of the available grip μ·m·g in use, with −10 reward, and inject ±5% noise into the μ estimate so the policy is robust to a slippery surprise.
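The normalization, de-normalization and slew limit described above can be sketched in a few lines. This is a minimal illustration, not the production map: the softplus-smoothed divisor, the `sharpness` and `eps` values, and all function names are assumptions chosen for the example, and plain `math` stands in for an autograd framework.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def grip_utilization(fx, fy, mu, mass, eps=1e-6):
    """rho = |F| / (mu*m*g); eps keeps the square root smooth at zero force."""
    return math.sqrt(fx * fx + fy * fy + eps) / (mu * mass * G)


def project_to_circle(fx, fy, mu, mass, sharpness=20.0):
    """Shrink (fx, fy) smoothly so rho ends up at or below about 1."""
    rho = grip_utilization(fx, fy, mu, mass)
    # 1 + softplus(k*(rho-1))/k is a smooth upper bound on max(1, rho).
    divisor = 1.0 + softplus(sharpness * (rho - 1.0)) / sharpness
    return fx / divisor, fy / divisor


def denormalize(ux, uy, mu, mass, c_alpha, lf, yaw_rate, speed):
    """Map a normalized force (ux, uy) back to accel and front-wheel angle."""
    fx = ux * mu * mass * G
    fy = uy * mu * mass * G
    accel = fx / mass
    delta_f = math.atan((fy + c_alpha * lf * yaw_rate / speed) / c_alpha)
    return accel, delta_f


def rate_limited(prev_deg, target_deg, dt, max_rate=540.0):
    """tanh-shaped slew limit: the step never exceeds max_rate * dt degrees."""
    max_step = max_rate * dt
    return prev_deg + max_step * math.tanh((target_deg - prev_deg) / max_step)


if __name__ == "__main__":
    # A demand well outside the circle on a mu = 0.8 road, 1500 kg car.
    fx, fy = project_to_circle(20000.0, 15000.0, mu=0.8, mass=1500.0)
    print("rho after projection:", grip_utilization(fx, fy, 0.8, 1500.0))
    print("steer after 10 ms:", rate_limited(0.0, 90.0, dt=0.01))
```

Because the divisor is smooth everywhere, the same expression written in an autograd framework would pass gradients both inside and outside the circle, which is the property the text relies on.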
3 · The hard constraint — safety is a budget, not a penalty
Intuition. You can shape a game's reward freely because the worst case is losing a match. A car that runs a red light can lose its license, and "−1000 reward" does not capture "you are not allowed to do this." The right frame is constrained RL: maximize the driving reward subject to the expected cost of violations staying under a budget. In China a driver has 12 license points per year; that whole-year budget has to be converted into a per-decision-step allowance the optimizer can enforce.
Engineering detail: sizing the budget (CPO). Assign cost weights per violation (speeding by 10%: c = 3; crossing a solid line: c = 1). A car driving ~4 h/day logs ≈1.5k h/year ≈ 150k transitions. Spread 12 points across the year and the per-step constraint is E[ct] ≤ di, with di = 12 / 150,000 = 8×10⁻⁵ points per step.
This E[ct] ≤ di is written straight into the optimizer as the Constrained Policy Optimization (CPO) constraint. If the business cares about the probability of losing the license rather than the expected score, swap in a chance constraint P(Σ ct > 12) ≤ 1% and solve with Chance-Constrained CPO for a tighter di.
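The budget arithmetic can be checked with a short script. It is a sketch under the numbers quoted in the text (12 points, about 150k transitions, the two example cost weights); the function names and the per-step violation probabilities in the usage lines are made up for illustration.

```python
# Cost weights in license points, as quoted in the text.
COST_WEIGHTS = {"speeding_10pct": 3.0, "solid_line": 1.0}


def per_step_budget(annual_points=12.0, transitions_per_year=150_000):
    """Spread the yearly point budget evenly over the year's decision steps."""
    return annual_points / transitions_per_year


def expected_step_cost(violation_prob):
    """E[c_t] given per-step violation probabilities keyed like COST_WEIGHTS."""
    return sum(COST_WEIGHTS[name] * p for name, p in violation_prob.items())


def within_budget(violation_prob, budget=None):
    """True when the expected per-step cost respects the per-step budget."""
    budget = per_step_budget() if budget is None else budget
    return expected_step_cost(violation_prob) <= budget


if __name__ == "__main__":
    d_i = per_step_budget()
    print("per-step budget:", d_i)
    # Hypothetical policy statistics, for illustration only.
    probs = {"speeding_10pct": 1e-5, "solid_line": 2e-5}
    print("expected cost:", expected_step_cost(probs), "ok:", within_budget(probs))
```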
Engineering detail: the Lagrange multiplier is a two-timescale system. The dual variable λ trades reward against safety, and its update frequency matters more than its value. The dual must move much more slowly than the policy (dual step size α ≪ policy step size β; this β is unrelated to the KL coefficient in section 4); otherwise the policy gradient is yanked around every step, variance explodes and the policy chatters. But if λ moves too slowly, constraint cost accumulates, forcing a hard second-order correction. The sweet spot is an intermediate band: update λ once per large batch or episode, in practice at 1/20–1/50 of the policy-update rate, backed by a real-time safety filter for emergencies.
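The slow dual update can be sketched as projected gradient ascent on λ, applied once every `dual_every` policy steps. The step size, the averaging window and the function names are illustrative assumptions; the policy update itself is only marked by a comment.

```python
def dual_ascent_step(lam, mean_cost, budget, lr_dual):
    """Raise lambda when cost exceeds the budget, lower it otherwise, floor at 0."""
    return max(0.0, lam + lr_dual * (mean_cost - budget))


def run_dual_schedule(step_costs, budget, lr_dual=0.05, dual_every=20):
    """Return lambda after each policy step; lambda moves once per dual_every steps."""
    lam = 0.0
    window = []
    history = []
    for step, cost in enumerate(step_costs, start=1):
        # A policy-gradient step on (reward - lam * cost) would run here.
        window.append(cost)
        if step % dual_every == 0:
            mean_cost = sum(window) / len(window)
            lam = dual_ascent_step(lam, mean_cost, budget, lr_dual)
            window.clear()
        history.append(lam)
    return history


if __name__ == "__main__":
    # Constant over-budget cost: lambda climbs in steps, once per 20 updates.
    trace = run_dual_schedule([1.0] * 60, budget=0.5, lr_dual=0.1, dual_every=20)
    print(trace[18], trace[19], trace[39], trace[59])
```

Setting `dual_every` between 20 and 50 corresponds to the 1/20 to 1/50 band quoted above.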
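Engineering detail: hard-coding safety with a Control Barrier Function. Lagrangian methods make safety likely; a CBF enforces it at every step, as long as the dynamics model behind it holds. Define a safety function h(x) (positive inside the safe set) and require its derivative to satisfy ḣ(x, a) ≥ −κ·h(x) for a gain κ > 0, which keeps the set forward-invariant. Wrap the policy output anet in a tiny quadratic program that finds the nearest safe action: asafe = argmin over a of ‖a − anet‖², subject to ḣ(x, a) + κ·h(x) ≥ 0.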
Solve with OSQP/qpth, and use the KKT conditions to get ∂asafe/∂anet so the QP layer is a differentiable module registered with autograd — the policy trains end-to-end through it. Because unsafe actions are corrected before they ever execute, the reward function needs no safety penalty at all, which speeds training ~30%. In production the QP layer compiles to a TensorRT plugin (INT8, ≈0.8 ms); an offline lookup table stands by so that if the solver ever fails, the car falls back to maximum-conservative braking — meeting the single-point-fault metric of ASIL-D.
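When there is a single barrier constraint that is affine in the action, the nearest-safe-action QP has a closed-form solution (a projection onto a half-space), which makes the idea easy to see without a solver. The sketch below assumes exactly that case; the time-headway barrier, its parameters (`d_min`, `t_headway`, `kappa`) and the function names are illustrative choices, and it does not cover the differentiable-layer or TensorRT parts.

```python
def cbf_filter(a_net, g, b):
    """Nearest action to a_net (Euclidean) that satisfies g . a + b >= 0."""
    slack = sum(gi * ai for gi, ai in zip(g, a_net)) + b
    gg = sum(gi * gi for gi in g)
    if slack >= 0.0 or gg == 0.0:
        # Already safe, or the action cannot influence the constraint
        # (a real system would trigger its fallback in the latter case).
        return list(a_net)
    scale = -slack / gg
    return [ai + scale * gi for ai, gi in zip(a_net, g)]


def headway_constraint(gap, v_ego, v_lead, d_min=5.0, t_headway=1.5, kappa=1.0):
    """Affine CBF condition for h = gap - d_min - t_headway * v_ego.

    h_dot = (v_lead - v_ego) - t_headway * a, so the condition
    h_dot + kappa * h >= 0 becomes g . a + b >= 0 with the g, b below.
    """
    h = gap - d_min - t_headway * v_ego
    g = [-t_headway]
    b = (v_lead - v_ego) + kappa * h
    return g, b


if __name__ == "__main__":
    # The network asks for +2 m/s^2 while closing on a slower lead car.
    g, b = headway_constraint(gap=30.0, v_ego=15.0, v_lead=10.0)
    print("filtered accel:", cbf_filter([2.0], g, b))
```

With several simultaneous constraints the projection no longer has this one-line form, which is where a QP solver such as the ones named above comes in.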
4 · Imitation + RL — learn from humans without forgetting them
Intuition. Pure RL on a car is too dangerous and too slow to explore from scratch, so you start from human demonstrations (behavior cloning) and fine-tune with RL. But two things go wrong: the demonstrations cover only the states a competent human visits (so the policy is lost the moment it drifts off-distribution), and once RL takes over it can wander so far that it forgets the human prior entirely. The mechanism is a KL leash to the demonstration policy that loosens as the car proves itself over real mileage.
Engineering detail. The behavior-cloning KL coefficient β starts high, so the policy hugs the human prior, and decays with cumulative mileage so exploration widens only as on-road evidence accrues. β never reaches zero (βmin = 1e-4), so a residual leash always holds, which is what keeps the safety-driver takeover rate under the company red line of 0.1 per 1000 km. The schedule is data-aware: with >95% highway coverage you can decay faster (decay rate k up to 1×10⁻⁴) to explore sooner; with <70% urban coverage switch to a conservative exponential βt = β0·0.9999ᵗ. If PPO's clip already limits update size, start β an order of magnitude lower so the two constraints don't stack and kill exploration.
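A sketch of the two schedules follows. The conservative form and the floor βmin = 1e-4 come from the text; the exponential-in-kilometres form of the mileage schedule, the priority order in `choose_schedule`, and every name here are assumptions made for the example.

```python
import math

BETA_MIN = 1e-4  # residual leash quoted in the text


def beta_by_mileage(km, beta0, k):
    """Assumed form: exponential decay in cumulative kilometres, floored."""
    return max(BETA_MIN, beta0 * math.exp(-k * km))


def beta_conservative(step, beta0):
    """Conservative schedule from the text: beta0 * 0.9999**step, floored."""
    return max(BETA_MIN, beta0 * 0.9999 ** step)


def choose_schedule(highway_coverage, urban_coverage):
    """Pick a schedule from data coverage (thresholds from the text)."""
    if urban_coverage < 0.70:
        return "conservative"   # thin urban data: keep the leash tight
    if highway_coverage > 0.95:
        return "fast"           # well-covered highway: decay sooner
    return "default"            # hypothetical middle ground


if __name__ == "__main__":
    for km in (0, 10_000, 50_000, 200_000):
        print(km, beta_by_mileage(km, beta0=1.0, k=1e-4))
```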
Engineering detail: fix the off-distribution states, don't just clamp them. Use DAgger: run the policy, find the frames where it diverges from the expert (high ensemble variance over a 300 ms window, confirmed across 3 consecutive frames), and instead of discarding those failures, trigger a conservative fallback (gentle stop), send the clip to a human-review queue, and re-inject the corrected action as a pseudo-label (weak-label self-correction). In a complex-intersection scenario, four DAgger rounds cut the expert-failure rate from 8.3% to 1.1% and lifted closed-loop success by 12.7%, at only +3.8% labeling cost. When you also train a GAIL discriminator to imitate human style, watch for mode collapse: a discriminator that locks onto one driving mode makes lane-keeping fail. Deploy a lightweight discriminator replica in the domain controller; if its confidence stays below 0.2 for 20 ms, request takeover and log the scene. On a long highway-loop test this raised the lane-keep success rate from 92.4% to 99.7%.
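The divergence trigger and the re-injection step might look like the sketch below. It assumes scalar actions, a population-variance disagreement measure and a made-up threshold; `flag_divergence`, `aggregate` and the `expert_label` callback are hypothetical names standing in for the human-review queue.

```python
from statistics import pvariance


def flag_divergence(ensemble_actions, var_threshold, consecutive=3):
    """Return frame indices where disagreement has persisted long enough.

    ensemble_actions[i] holds one scalar action per ensemble member at frame i.
    """
    run = 0
    fired = []
    for i, actions in enumerate(ensemble_actions):
        run = run + 1 if pvariance(actions) > var_threshold else 0
        if run >= consecutive:
            fired.append(i)
    return fired


def aggregate(dataset, observations, fired, expert_label):
    """DAgger-style aggregation: add expert-corrected labels for flagged frames."""
    for i in fired:
        dataset.append((observations[i], expert_label(observations[i])))
    return dataset


if __name__ == "__main__":
    frames = [[0.0, 0.0, 0.0], [0.0, 1.0, 2.0], [0.0, 1.0, 2.0],
              [0.0, 1.0, 2.0], [0.0, 0.0, 0.0], [0.0, 1.0, 2.0]]
    fired = flag_divergence(frames, var_threshold=0.5)
    obs = ["frame%d" % i for i in range(len(frames))]
    # The lambda stands in for a human reviewer returning a corrected action.
    print(aggregate([], obs, fired, expert_label=lambda o: 0.0))
```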
5 · The perception state — one frame is not Markov
Intuition. A single bird's-eye-view frame loses information: you cannot tell a parked car from a slow one, and an occluded pedestrian simply isn't there. Acting on one frame violates the Markov property, and the policy makes jerky, unsafe decisions. The fix is to make the state a sufficient statistic by carrying recent history in a recurrent hidden state.
Engineering detail. A 256×256×8 multi-channel BEV (semantics, speed, occupancy…) is down-sampled by three 3×3 convolutions (an 8× spatial reduction, e.g. stride 2 each) to 32×32×64, fed through a ConvGRU (hidden 32×32×128) so the state ht summarizes history, then an MLP policy head (512, 256, |A|). On an automotive SoC this runs end-to-end in ~12 ms, well inside the 100 ms period of a 10 Hz control loop, and closed-loop lane-change success rose 18%, evidence that the state is "Markov enough." Training uses truncated BPTT (k=8); inference keeps a hidden-state queue. When a task needs to remember something from 5 s ago (a light that turned red while occluded), the GRU's gradients vanish, so add a small memory-augmented head: keep N=20 historical BEV tokens and run light cross-frame attention, or cascade a short-horizon GRU with a long-horizon key-value memory. To check that the history actually helps, measure the conditional-entropy drop H(st+1|ot) vs H(st+1|o0:t), and ablate the GRU to confirm the closed-loop collision rate rises without it.
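The conditional-entropy comparison can be illustrated on a toy discrete example. The plug-in estimator below is a sketch that only works for small discrete alphabets; on real BEV features you would need a learned or binned estimator, and the toy "parked vs approaching car" data is invented for the demonstration.

```python
import math
from collections import Counter


def conditional_entropy(pairs):
    """Plug-in estimate of H(Y | X) in bits from (x, y) samples."""
    joint = Counter(pairs)
    marginal = Counter(x for x, _ in pairs)
    n = len(pairs)
    h = 0.0
    for (x, _y), count in joint.items():
        h -= (count / n) * math.log2(count / marginal[x])
    return h


if __name__ == "__main__":
    # Current frame always shows a car at 10 m. Only the previous frame
    # reveals whether it is parked (was at 10 m) or approaching (was at 12 m).
    episodes = [("12m", "10m", "8m"), ("10m", "10m", "10m")] * 50
    single = [(now, nxt) for _prev, now, nxt in episodes]
    history = [((prev, now), nxt) for prev, now, nxt in episodes]
    print("H(s'|o_t)    =", conditional_entropy(single))
    print("H(s'|o_0:t)  =", conditional_entropy(history))
```

In this toy case the single frame leaves one full bit of uncertainty about the next state and the two-frame history removes it, which is the kind of drop the measurement is meant to show.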
6 · Sim-to-real — close the gap, then bound the loss
Intuition. The policy is trained in a simulator whose dynamics, latency and rendered images are all slightly wrong. Domain randomization over those parameters makes the policy robust, but you still need to (a) find the right randomization distribution automatically and (b) prove, before going on a public road, that no plausible mismatch can tank performance. The first is bilevel optimization; the second is a Lipschitz-style return bound.
Engineering detail — model the CAN-bus delay as part of the MDP. A command issued now arrives late and attenuated: a_real(t) = a_cmd(t − τ) · (1 − ε) + ε·n(t), turning "late" into a combined amplitude + phase error the agent feels during training. Convert a 1 ms delay into a path error via 2·v·sin(Δφ) — about 3 mm of lateral error at 60 km/h — so the reward can carry it directly: rt = −(clat·elat² + cdelay·τ²). Randomize τ across 10–50 ms in training, then do one online calibration on the real car (read true CAN timestamps via a diagnostic tool, correct λ₂) for near-zero-shot transfer. If the bus upgrades to CAN-FD at 8 Mbit/s the worst-case delay drops from ~15 ms to ~2 ms, so the delay-penalty weight must drop an order of magnitude or the policy turns timidly over-conservative.
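The delay model a_real(t) = a_cmd(t − τ)·(1 − ε) + ε·n(t) and the delay-aware reward can be sketched as a small simulator wrapper. The class and function names, the Gaussian choice for n(t), and the conversion of τ to whole control steps are assumptions for illustration; the 10–50 ms randomization range is the one quoted above.

```python
import random
from collections import deque


class DelayedActuator:
    """Delays commands by a fixed number of steps, then attenuates and adds noise."""

    def __init__(self, delay_steps, eps=0.02, noise_std=1.0, seed=0):
        self.eps = eps
        self.noise_std = noise_std
        self.rng = random.Random(seed)
        # Pre-fill with zeros: nothing has reached the actuator yet.
        self.buffer = deque([0.0] * delay_steps)

    def step(self, a_cmd):
        self.buffer.append(a_cmd)
        delayed = self.buffer.popleft()          # a_cmd(t - tau)
        noise = self.rng.gauss(0.0, self.noise_std)
        return delayed * (1.0 - self.eps) + self.eps * noise


def sample_delay(rng, low=0.010, high=0.050):
    """Domain randomization: draw tau uniformly from 10 to 50 ms."""
    return rng.uniform(low, high)


def tracking_reward(e_lat, tau, c_lat, c_delay):
    """r_t = -(c_lat * e_lat^2 + c_delay * tau^2)."""
    return -(c_lat * e_lat ** 2 + c_delay * tau ** 2)


if __name__ == "__main__":
    rng = random.Random(1)
    dt = 0.01                                    # assumed 100 Hz actuation loop
    tau = sample_delay(rng)
    act = DelayedActuator(delay_steps=round(tau / dt))
    print([round(act.step(a), 3) for a in (1.0, 1.0, 1.0, 1.0, 1.0, 1.0)])
    print(tracking_reward(e_lat=0.1, tau=tau, c_lat=1.0, c_delay=10.0))
```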
Engineering detail: bound the worst-case return before you drive. To answer "if the camera fogs up, is the policy still safe?" without retraining: map clean and degraded observations through a frozen ResNet-18 to compute an empirical Wasserstein-1 distance W between them, estimate the policy and Q-network Lipschitz constants by power iteration (Lπ, LQ), and plug into Rmin ≥ Rnom − Lπ·LQ·W·γ/(1−γ), where Rnom is the return on clean observations.
With γ = 0.99 the horizon multiplier γ/(1−γ) is 99; a worked example gives Rmin ≥ 865 − 2.15·1.83·0.132·99 ≈ 813, a ≤6% drop, inside the "no more than 10% degradation" acceptance line. Re-estimating after a renderer or driver-version change needs only ~50k fresh frames and runs in ~3 minutes, with no retraining. The bound also shows where margin comes from: the perception gap W and the two Lipschitz constants set the worst-case return, and spectral normalization (capping L at 1 or below) buys margin back.
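The worked numbers can be reproduced with a few lines, reading them as nominal return minus the product Lπ·LQ·W·γ/(1−γ). This is a sketch: the text does not say which of 2.15 and 1.83 is Lπ and which is LQ (the product is the same either way), and the power-iteration helper is a generic per-layer spectral-norm estimate, not the author's tooling.

```python
import math


def worst_case_return(nominal, lip_a, lip_b, w1, gamma):
    """Lower bound: nominal - lip_a * lip_b * W1 * gamma / (1 - gamma)."""
    return nominal - lip_a * lip_b * w1 * gamma / (1.0 - gamma)


def spectral_norm(weight, iters=100):
    """Largest singular value of a nonzero matrix by power iteration on W^T W."""
    rows, cols = len(weight), len(weight[0])
    v = [1.0] * cols
    for _ in range(iters):
        u = [sum(weight[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        v = [sum(weight[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(weight[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in u))


if __name__ == "__main__":
    bound = worst_case_return(865.0, 2.15, 1.83, 0.132, gamma=0.99)
    print("worst-case return:", bound)
    print("relative drop:", (865.0 - bound) / 865.0)
    # One layer's gain; with 1-Lipschitz activations, the product of the
    # per-layer norms upper-bounds the network's Lipschitz constant.
    print("layer norm:", spectral_norm([[3.0, 0.0], [0.0, 1.0]]))
```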
Further considerations
- More actuators reshape the friction limit. Add rear-wheel steering and the action becomes (Fx, Fy, δr): the friction circle becomes a 3-D ellipsoid, normalized by embedding the covariance via a Cholesky factor (shrinks turning radius ~8%). With four independently driven wheels the action is 4-D: map to per-tire forces, then use a 4×2 pseudo-inverse to project onto the circle so an unsaturated wheel automatically compensates for a saturated one, which improves avoidance maneuvers at the handling limit.
- The bus delay becomes a game. In a platoon, 4–5 cars share one bus and arbitration delay turns into a dynamic, adversarial variable. Feed bus load into the state so the policy learns to throttle its own broadcast rate near congestion — folding communication QoS into the policy and unifying sense-decide-communicate.
- Robustness to other agents' violations. Other drivers break rules too, which can force you into a passive violation. Model the opponents' policy uncertainty as a robust CPO with worst-case di, so the annual point budget holds against any opponent behavior — and for multi-vehicle collision avoidance use shared barrier functions solved with distributed QP / ADMM as the agent count grows.
- tanh-Gaussian vs. GMM action heads. A tanh-squashed Gaussian gives a strictly bounded, monotone, single-mode output that passes module-level FMEA easily — the right choice for a safety-certified chip. A Gaussian mixture explores multimodal maneuvers better but is harder to certify. A common two-stage pattern: explore with a GMM in training, then distill the best mode into a tanh policy for deployment (DAgger roll-outs, KL-minimization, a boundary penalty λ·max(0,|a|−1)² so the tanh net doesn't over-stretch σ chasing tail mass).
- Functional-safety evidence. ASIL targets want a delay-failure probability measured to FIT level (10⁻⁷/h for ASIL-B); turn the τ distribution into a random-fault injector during training so the policy's delay-failure probability is estimated up front, cutting late safety-validation iterations.
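For the tanh-Gaussian head and the boundary penalty mentioned in the action-head item above, a minimal sketch could look like this. Applying the penalty λ·max(0,|a|−1)² to the pre-squash value is an assumption (the text does not say which quantity `a` is), and the function names are illustrative.

```python
import math
import random


def tanh_gaussian_sample(mean, log_std, rng):
    """Draw a Gaussian sample and squash it with tanh into [-1, 1]."""
    raw = rng.gauss(mean, math.exp(log_std))
    return math.tanh(raw), raw


def boundary_penalty(a, lam=1.0):
    """lam * max(0, |a| - 1)^2: zero inside [-1, 1], quadratic outside."""
    return lam * max(0.0, abs(a) - 1.0) ** 2


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        action, raw = tanh_gaussian_sample(mean=0.8, log_std=0.0, rng=rng)
        # Penalizing the pre-squash value discourages pushing mass far
        # past the saturation region of tanh.
        print(round(action, 3), round(boundary_penalty(raw), 3))
```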