Intelligent traffic-signal control
A controller watches an intersection through noisy, sparse sensors and decides which movement gets the green and for how long. Casting that as an MDP is easy; making it converge and pass a traffic-authority acceptance test is the hard part. The binding difficulties here are: a partially observed, heterogeneous, high-dimensional state; an action space riddled with hard safety constraints (minimum green, phase conflicts); a multi-objective reward that hides perverse optima (suppressing side streets to flatter the arterial); non-stationary demand (peak/off-peak, holidays, storms); and multi-intersection coordination under communication delay. Each difficulty calls for its own tool.
1 · Formulate — the MDP behind a signal controller
Intuition. A signal controller observes how many cars are waiting on each approach, picks a green phase, holds it for a while, and is rewarded by how little everyone waited. That is already a Markov Decision Process: the detector readings are the state, the choice of phase and duration is the action, the reduction in total delay is the reward, and traffic physics is the transition. Everything below is one of these four pieces turning out to be awkward at a real intersection.
| Piece | For a signalised intersection | The awkward part |
|---|---|---|
| State S | per-lane queue length, occupancy, saturation, turning flows, next-phase countdown | fused from heterogeneous sensors of unequal trust → partially observable, high-dimensional |
| Action A | which phase to serve next × how long to hold the green | most transitions are illegal: minimum-green, pedestrian clearance, conflicting movements |
| Reward R | −(queue + delay) shaped with throughput, emissions, fairness | multi-objective; the literal optimum starves side streets and parks vehicles |
| Transition P | car-following + arrivals from upstream intersections + driver turns | non-stationary (peak shifts, weather) and coupled across intersections |
The MDP table writes itself, and the rightmost column — the awkward part — is the lesson. Each row's difficulty has a named mechanism that answers it, and we reach for none of them before its row demands it.
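To make the four pieces concrete, the table can be written down as plain data types. This is only a sketch: the class names, field names and the weights `w_queue` / `w_delay` are illustrative assumptions, and the reward is the unshaped −(queue + delay) from the table, before any of the later shaping terms.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LaneReading:
    queue_veh: float    # vehicles currently queued on the lane
    delay_s: float      # accumulated waiting time on the lane, in seconds
    saturation: float   # demand / capacity, dimensionless


@dataclass(frozen=True)
class SignalState:
    lanes: Tuple[LaneReading, ...]
    current_phase: int
    next_phase_countdown_s: float


@dataclass(frozen=True)
class SignalAction:
    next_phase: int     # which phase to serve next
    hold_s: float       # how long to hold its green


def reward(state: SignalState, w_queue: float = 1.0, w_delay: float = 1.0) -> float:
    """Unshaped reward from the table: -(queue + delay), summed over lanes."""
    queue = sum(lane.queue_veh for lane in state.lanes)
    delay = sum(lane.delay_s for lane in state.lanes)
    return -(w_queue * queue + w_delay * delay)
```

The transition P is deliberately absent: in practice it lives in the simulator or on the street, not in the controller's code.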
2 · Diagnose — a fused, partially observed, high-dimensional state
Intuition. You never see the true traffic on a link; you see proxies. ETC gantries on the highway mainline are accurate (over 95% detection) but only cover a few links. Floating-car (GPS-equipped taxi / ride-hail) data covers everywhere but at low penetration — inside Beijing's 5th Ring the taxi share is roughly 3%, so for most lanes you have almost no direct observation. Loop and video detectors fail in heavy rain. So the "state" is really a belief assembled from sources of wildly different trust, and a big intersection's raw state vector is huge.
Engineering detail: fuse as a POMDP. Treat the unobserved true link parameters as the hidden state s, the heterogeneous (fixed-detector, floating-car) pair as the observation o, and let the policy emit a per-source fusion weight α. Encode the observation-action history with a Bayesian LSTM, h_t = f(o_1, a_1, …, o_t), into a 256-dim belief vector; a final Concrete-Dropout layer emits an uncertainty interval that drives exploration. Two adaptations counter the sparsity that would otherwise stall learning:
- High-trust observations get more weight. ETC-gantry readings enter the reward and fusion with a larger coefficient because their precision exceeds 95%.
- Sparse links get imputed first. Where floating-car penetration is below ~3%, fill the gap with Voronoi partition + kriging spatial interpolation before feeding the policy network, so RL is not asked to learn from an almost-empty reward signal.
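The two adaptations above can be sketched in a few lines. Two loud assumptions: the fusion is taken to be a convex combination with the policy's weight α, and inverse-distance weighting is swapped in for Voronoi partition + kriging purely to keep the example short. It is a simpler interpolator, not the method named in the text. Function and parameter names are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]


def fuse_link_estimate(fixed: Optional[float], floating: Optional[float],
                       alpha: float) -> Optional[float]:
    """Blend a fixed-detector and a floating-car estimate with weight alpha.

    alpha is the weight on the fixed detector (assumed to be in [0, 1]).
    If one source is missing, fall back to the other; None if both are.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if fixed is None:
        return floating
    if floating is None:
        return fixed
    return alpha * fixed + (1.0 - alpha) * floating


def idw_impute(target: Point, neighbours: Sequence[Tuple[Point, float]],
               power: float = 2.0) -> float:
    """Inverse-distance-weighted fill-in for a link with no observation.

    A stand-in for kriging: nearer observed links get larger weights.
    """
    tx, ty = target
    num = den = 0.0
    for (x, y), value in neighbours:
        dist_sq = (x - tx) ** 2 + (y - ty) ** 2
        if dist_sq == 0.0:
            return value  # an observation sits exactly on the target
        weight = 1.0 / dist_sq ** (power / 2.0)
        num += weight * value
        den += weight
    if den == 0.0:
        raise ValueError("need at least one neighbouring observation")
    return num / den
```

A link would be imputed first, then passed through `fuse_link_estimate` like any other, so the policy network never sees a hole in its input.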
For the high-dimensional state of a large junction, compress it: a β-VAE encoder maps the raw per-lane tensor down to a ~24-dim latent, with a traffic-semantic constraint in the loss — beyond reconstruction error, a saturation-ordering consistency term (a KL divergence) prevents the compression from flipping phase priorities. A Successor-Feature linear projection then takes the latent to ~18 dims with a provable cumulative-return error bound (≤ 2.1%); freeze that projection as a feature extractor and only update the policy head, keeping policy-gradient variance controlled. Quantised to INT8 it infers in ~22 ms on an edge signal box — within the real-time budget.
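The shape of that loss can be sketched as follows. The text does not say which distributions the ordering KL compares, so this example assumes it is the KL between a softmax over the true per-phase saturations and a softmax over the reconstructed ones; that choice, the squared-error reconstruction term, and the weights `beta` and `lam` are all assumptions.

```python
import math
from typing import List, Sequence


def gaussian_kl(mu: Sequence[float], logvar: Sequence[float]) -> float:
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), the usual VAE regulariser."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))


def softmax(xs: Sequence[float]) -> List[float]:
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def ordering_kl(sat: Sequence[float], sat_hat: Sequence[float]) -> float:
    """Assumed form of the saturation-ordering term: KL(softmax(sat) || softmax(sat_hat)).

    It grows when the reconstruction reorders which phase is most saturated.
    """
    p, q = softmax(sat), softmax(sat_hat)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def beta_vae_loss(x, x_hat, mu, logvar, sat, sat_hat,
                  beta: float = 4.0, lam: float = 1.0) -> float:
    """Reconstruction error + beta * latent KL + lam * ordering consistency."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + beta * gaussian_kl(mu, logvar) + lam * ordering_kl(sat, sat_hat)
```

The ordering term is what distinguishes this from a stock β-VAE: a reconstruction can have low squared error and still swap the two busiest phases, and only the extra term penalises that.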
3 · Engineer the action space — safety constraints, not preferences
Intuition. Most phase transitions a network could emit are flatly illegal: you cannot cut a green below its minimum (a pedestrian phase is often a hard 15 s minimum, even when vehicle demand alone would want a shorter green), and you cannot serve two conflicting movements at once. If the policy can place probability on those actions it wastes capacity learning "never do that," and, far worse, a single illegal sample on the street is a safety incident. The fix is in the same family as masking in game AI: forbid the action before sampling, and never let the constraint leak into the gradient.
Engineering detail: mask at the source, keep the ratio honest. Build a fixed-length legality mask each step from the current phase timers and the conflict matrix; set illegal logits to −∞ (or a very large negative value) before the softmax so their probability is zero, or numerically indistinguishable from it. Three rules keep this from poisoning PPO:
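- Record the post-mask log-probabilities, the ones the action was actually sampled from, together with the mask when writing to the buffer, so the importance ratio π_new/π_old compares like with like and stays unbiased.
- Treat the mask as a constant: reapply the stored mask when recomputing log-probabilities in the update and let no gradient flow through it, which preserves the convergence behaviour of PPO / A2C.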
- Formally verify before deployment: enumerate every (minimum-green, phase) combination through a state machine to confirm zero violations — the only acceptable bar for a traffic authority.
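A minimal sketch of the mask and the exhaustive check, under simplifying assumptions: the action is just "which phase next", time is discretised to whole seconds, and a hypothetical boolean matrix `blocked[cur][nxt]` stands in for whatever the conflict matrix forbids. None of these names come from a real controller API.

```python
import math
from itertools import product
from typing import List, Sequence

NEG_INF = float("-inf")


def legality_mask(cur: int, green_elapsed_s: int, min_green_s: Sequence[int],
                  blocked: Sequence[Sequence[bool]]) -> List[bool]:
    """True where serving phase `nxt` next is legal.

    Staying in the current phase is always legal; switching needs the
    minimum green to have elapsed and the transition not to be blocked.
    """
    can_leave = green_elapsed_s >= min_green_s[cur]
    return [nxt == cur or (can_leave and not blocked[cur][nxt])
            for nxt in range(len(min_green_s))]


def masked_softmax(logits: Sequence[float], mask: Sequence[bool]) -> List[float]:
    """Softmax with illegal logits set to -inf, so they get probability 0.0."""
    masked = [l if ok else NEG_INF for l, ok in zip(logits, mask)]
    top = max(masked)  # finite: the current phase is always legal
    exps = [math.exp(l - top) for l in masked]
    total = sum(exps)
    return [e / total for e in exps]


def verify_zero_violations(min_green_s: Sequence[int],
                           blocked: Sequence[Sequence[bool]],
                           horizon_s: int = 60) -> bool:
    """Enumerate every (phase, elapsed-green) pair and check that no
    rule-breaking action can ever be sampled."""
    n = len(min_green_s)
    logits = [0.0] * n  # any finite logits would do; the mask decides
    for cur, t in product(range(n), range(horizon_s + 1)):
        probs = masked_softmax(logits, legality_mask(cur, t, min_green_s, blocked))
        for nxt, p in enumerate(probs):
            too_early = nxt != cur and t < min_green_s[cur]
            forbidden = nxt != cur and blocked[cur][nxt]
            if (too_early or forbidden) and p != 0.0:
                return False
    return True
```

Using −∞ rather than a large finite constant is what makes the zero exact, which in turn lets the verifier test `p != 0.0` instead of a tolerance.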
Higher-level actions: the options framework. Holding a green is naturally a temporally extended action. Define an "extend-green" option ω with an initiation set (enter only when the current phase saturation ρ_p is high and the next conflicting queue Q_next ≤ 5 vehicles), an internal policy that simply holds, and a termination condition β_ω that fires when the extension reaches 20 s, when saturation drops below 0.3, or when an emergency vehicle or red-light runner is detected. The semi-MDP return collected over the option updates the upper-level Q-learning:
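The update itself is not reproduced here. The standard SMDP Q-learning rule, which is presumably what is meant, is given below for an option ω that starts in state s_t and runs for k steps; the step size η and discount γ are notation introduced for this sketch.

$$
Q(s_t,\omega) \;\leftarrow\; Q(s_t,\omega) + \eta\Big[\sum_{i=0}^{k-1}\gamma^{i}\,r_{t+i+1} \;+\; \gamma^{k}\max_{\omega'}Q(s_{t+k},\omega') \;-\; Q(s_t,\omega)\Big]
$$

The only difference from one-step Q-learning is that the reward is accumulated over the option's whole duration and the bootstrap term is discounted by γ^k rather than γ, so a 20 s extension and a 5 s extension are compared on equal terms.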
4 · Reward — multi-objective, and quietly hackable
Intuition. "Minimise delay" alone produces ugly optima. Maximise throughput and the agent learns to suppress the side streets and feed only the arterial; add an emissions term naively and it can stall the fleet to game the average. The reward is genuinely multi-objective — queue, delay, emissions, fairness — and each term you add is a new thing the agent will optimise literally.
Engineering detail — monetise and normalise. Put every objective in the same unit. Emissions become a cost per gram from social-cost figures — CO₂ at ~5.5×10⁻⁵ ¥/g (from a ~55 ¥/tonne carbon price), NOx at ~1.5×10⁻² ¥/g, particle-number at ~6×10⁻¹⁵ ¥/#:
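The formula is missing at this point; a form consistent with the unit costs just listed would be the following, where m denotes emitted mass in grams and n_PN the emitted particle count (symbols introduced here, not taken from the original):

$$
r_{\text{env}} = -\big(c_{\mathrm{CO_2}}\,m_{\mathrm{CO_2}} + c_{\mathrm{NO_x}}\,m_{\mathrm{NO_x}} + c_{\mathrm{PN}}\,n_{\mathrm{PN}}\big),
\qquad
c_{\mathrm{CO_2}} = \frac{55\ \text{¥/t}}{10^{6}\ \text{g/t}} = 5.5\times10^{-5}\ \text{¥/g}
$$

As a purely illustrative calculation with made-up quantities: a step emitting 200 g of CO₂, 0.5 g of NOx and 10¹² particles costs 200 × 5.5×10⁻⁵ + 0.5 × 1.5×10⁻² + 10¹² × 6×10⁻¹⁵ = 0.011 + 0.0075 + 0.006 = 0.0245 ¥, so r_env = −0.0245. The three terms land within a factor of two of each other, which is the point of monetising them.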
Smooth raw emission readings with an EMA (decay α = 0.05, ≈ 1 s window) to reject sensor noise without losing transient peaks, and ramp the penalty in with an adaptive coefficient β(t) so early exploration is not over-suppressed: r_t = r_throughput + r_legal + β(t)·r_env,ema. With this structure a controller cut NOx by ~18% and particle number by ~22% at a ~1.2% fuel cost, and training remained stable.
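A sketch of the smoothing and ramp, with two assumptions stated up front: α is read as the weight on the newest sample, and β(t) is modelled as a plain linear warm-up because the text does not specify how the coefficient adapts. `EmaSmoother`, `beta_ramp` and `warmup_steps` are hypothetical names.

```python
from typing import Optional


class EmaSmoother:
    """Exponential moving average; alpha is the weight on the newest sample."""

    def __init__(self, alpha: float = 0.05) -> None:
        self.alpha = alpha
        self.value: Optional[float] = None

    def update(self, x: float) -> float:
        if self.value is None:
            self.value = x  # seed with the first reading
        else:
            self.value = (1.0 - self.alpha) * self.value + self.alpha * x
        return self.value


def beta_ramp(step: int, warmup_steps: int, beta_max: float = 1.0) -> float:
    """Assumed schedule: rise linearly from 0 to beta_max over warmup_steps."""
    if warmup_steps <= 0:
        return beta_max
    return beta_max * min(1.0, step / warmup_steps)


def shaped_reward(r_throughput: float, r_legal: float, r_env_ema: float,
                  step: int, warmup_steps: int) -> float:
    """r_t = r_throughput + r_legal + beta(t) * r_env_ema."""
    return r_throughput + r_legal + beta_ramp(step, warmup_steps) * r_env_ema
```

With α = 0.05 the average has an effective memory of roughly 1/α = 20 samples, so the quoted 1 s window corresponds to readings arriving at about 20 Hz.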
Fairness — write it into the kernel, not a post-hoc patch. Cast the problem as a CMDP: base reward r₀ = instantaneous vehicles served, and a fairness cost C = an EMA (300 s window) of the Gini coefficient of per-road service, with constraint E[C] ≤ 0.3 (the authority's "side-street delay ≤ 1.5× the arterial" rule). The shaping reward adds two clipped hinge penalties whose Lagrange multipliers are updated online by dual gradient ascent:
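The expression is not reproduced in the text. One reconstruction that matches the description (two hinges, each clipped, each with its own multiplier) is given below; the per-lane throughput q_ℓ, the clip bound c and the dual step size η_λ are symbols assumed for this sketch:

$$
r_t = r_0 \;-\; \lambda_1\,\min\!\big(c,\ \max(0,\ \hat G_t - 0.3)\big) \;-\; \lambda_2\,\min\!\big(c,\ \max(0,\ \theta - \min_{\ell} q_{\ell,t})\big)
$$

$$
\lambda_1 \leftarrow \max\!\big(0,\ \lambda_1 + \eta_\lambda(\hat G_t - 0.3)\big),
\qquad
\lambda_2 \leftarrow \max\!\big(0,\ \lambda_2 + \eta_\lambda(\theta - \min_{\ell} q_{\ell,t})\big)
$$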
where Ĝ is a real-time Gini estimate from a centralised fairness critic and θ is a per-lane minimum throughput floor (from GB 14886, ~180 pcu/h). The fairness critic shares a convolutional backbone with the throughput critic but has its own head fit to true Gini with a Huber loss; the replay buffer prioritises high-unfairness samples (p = |δ| + 0.1·max(0, Ĝ−0.3)). A field trial cut arterial delay 12% and side-street delay 38%, dropping Gini from 0.41 to 0.27 — fairness as a constraint, not a contradiction.
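The quantities in this paragraph are simple to compute directly. The sketch below gives an exact Gini coefficient over per-road service counts, one projected dual-ascent step for a multiplier, and the replay priority quoted above; the step size `lr` and the function names are assumptions, and a learned critic would replace the exact `gini` call at run time.

```python
from typing import Sequence


def gini(service: Sequence[float]) -> float:
    """Gini coefficient of non-negative per-road service (0 = perfectly even)."""
    xs = sorted(service)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def dual_ascent_step(lam: float, cost: float, limit: float = 0.3,
                     lr: float = 0.01) -> float:
    """Raise the multiplier while the constraint E[C] <= limit is violated,
    lower it otherwise, and never let it go negative."""
    return max(0.0, lam + lr * (cost - limit))


def replay_priority(td_error: float, gini_hat: float, limit: float = 0.3) -> float:
    """p = |delta| + 0.1 * max(0, G_hat - limit), as in the text."""
    return abs(td_error) + 0.1 * max(0.0, gini_hat - limit)
```

For instance, `gini([0, 0, 0, 1])` is 0.75 (one road gets all the service) and `gini([1, 1, 1, 1])` is 0.0, so the 0.41 → 0.27 change reported in the trial is a move from clearly uneven service to just inside the 0.3 limit.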
5 · Non-stationary demand — the transition that drifts under you
Intuition. Morning peak, evening peak, midday lull, rainstorm, holiday — the arrival process is not one MDP but a family that switches, sometimes abruptly. A policy fit to the peak slowly forgets the off-peak, and a sudden storm can triple demand faster than any change-detector can react. You need to detect the regime, switch smoothly, and not forget.
Engineering detail. Detect a regime shift with a CUSUM statistic on demand, then soft-switch: blend π_new = α·π_peak + (1−α)·π_offpeak with α ramping 0→1 linearly over ~10 minutes, holding KL(π_old‖π_new) ≤ 0.05 as a hard constraint (a TRPO-style projection) so drivers never feel a jolt. During a sustained peak, add a non-stationarity penalty β·‖θ_t − θ_{t−k}‖² (β ≈ 1e−4) to resist over-fitting to short-term noise. To counter catastrophic forgetting of the off-peak pattern, keep a fixed-size reservoir buffer storing both high-value (return > μ+2σ) and high-entropy samples; mix ~30% replayed with ~70% live data and correct the distribution shift with importance weights. Anticipate calendar effects with a holiday model: MAML pre-training on past festival data lets a never-seen holiday adapt in ~3 epochs, and a progressive-network "year embedding" freezes the old net and trains only the increment, so performance on earlier years never regresses.
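The detect-then-blend part can be sketched as below. Assumptions: a one-sided CUSUM that only watches for upward demand shifts, policies represented as strictly positive action-probability lists, and simple step-halving used in place of a real TRPO-style projection to respect the KL limit. The `target`, `slack` and `threshold` parameters are hypothetical tuning knobs.

```python
import math
from typing import List, Sequence


class Cusum:
    """One-sided CUSUM: alarm when demand stays above target + slack."""

    def __init__(self, target: float, slack: float, threshold: float) -> None:
        self.target, self.slack, self.threshold = target, slack, threshold
        self.stat = 0.0

    def update(self, demand: float) -> bool:
        self.stat = max(0.0, self.stat + demand - self.target - self.slack)
        return self.stat > self.threshold


def ramp_alpha(elapsed_s: float, ramp_s: float = 600.0) -> float:
    """Linear 0 -> 1 schedule over ramp_s seconds (~10 minutes)."""
    return min(1.0, max(0.0, elapsed_s / ramp_s))


def blend(pi_peak: Sequence[float], pi_off: Sequence[float], alpha: float) -> List[float]:
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(pi_peak, pi_off)]


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q); assumes q is strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def constrained_alpha(alpha_now: float, alpha_target: float,
                      pi_peak: Sequence[float], pi_off: Sequence[float],
                      kl_max: float = 0.05, max_halvings: int = 20) -> float:
    """Move alpha toward its target, halving the step until
    KL(pi_old || pi_new) <= kl_max. Returns alpha_now if no step fits."""
    old = blend(pi_peak, pi_off, alpha_now)
    step = alpha_target - alpha_now
    for _ in range(max_halvings):
        candidate = alpha_now + step
        if kl(old, blend(pi_peak, pi_off, candidate)) <= kl_max:
            return candidate
        step /= 2.0
    return alpha_now
```

In a control loop, `ramp_alpha` supplies the target for each tick once `Cusum.update` has fired, and `constrained_alpha` decides how much of that target can be taken without exceeding the KL budget. When the two policies disagree sharply, the effective ramp is slower than the nominal 10 minutes.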
6 · Multi-intersection coordination — the transition is coupled
Intuition. One intersection's outflow is the next one's inflow, so the controllers are coupled — green-wave coordination is exactly exploiting this. That makes it a multi-agent problem, and two things bite: how to assign credit across agents, and how to stay consistent when the messages between them arrive late.
Engineering detail. Use QMIX: each intersection keeps a local Q_i; a monotonic mixing network combines them into Q_tot for centralised training and decentralised execution. At inference the edge boxes run only their local Q_i and sync parameter deltas (ΔW) every ~30 s; if the central coordinator fails, each box degrades to independent Q-learning to keep the corridor running. For communication delay, weight stale gradients down by their age and feed a time-embedding into the mixer so it learns to compensate; output a delay-confidence σ, and when σ is low, apply a conservative action mask so a late message cannot break a safety constraint. For a city-scale grid, abstract first: cluster tens of thousands of intersections into a few hundred super-agents (cutting per-agent state from ~10⁵ to ~10³ dims), generate samples from a large parallel SUMO simulation, and train an Ape-X-style distributed learner with gradient quantisation + Top-K sparsification (≈ 1/8 the bandwidth) and PopArt to keep multi-city reward scales aligned.
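The property that makes QMIX work is that Q_tot never decreases when any local Q_i increases, which is enforced by keeping the mixing weights non-negative. The sketch below shows that idea with a single linear mixing layer whose weights come from a toy linear hypernetwork; the real architecture uses two layers with a nonlinearity, so this is a deliberate simplification. The exponential age discount and its time constant `tau_s` are likewise assumed forms, since the text only says stale gradients are weighted down by age.

```python
import math
from typing import List, Sequence


def mixing_weights(global_state: Sequence[float],
                   hyper_w: Sequence[Sequence[float]]) -> List[float]:
    """One weight per agent, produced from the global state.

    abs() forces every weight to be non-negative, which is what makes
    Q_tot monotone in each local Q_i.
    """
    return [abs(sum(h * s for h, s in zip(row, global_state))) for row in hyper_w]


def q_tot(local_qs: Sequence[float], global_state: Sequence[float],
          hyper_w: Sequence[Sequence[float]], bias: float = 0.0) -> float:
    """Simplified monotonic mixer: Q_tot = sum_i |w_i(s)| * Q_i + b."""
    weights = mixing_weights(global_state, hyper_w)
    return sum(w * q for w, q in zip(weights, local_qs)) + bias


def staleness_weight(age_s: float, tau_s: float = 30.0) -> float:
    """Assumed age discount for a delayed gradient: exp(-age / tau)."""
    return math.exp(-max(0.0, age_s) / tau_s)
```

Monotonicity is what allows decentralised execution: each edge box can take the argmax of its own Q_i and the joint action is still the argmax of Q_tot, with no message exchange at decision time.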
Further considerations
- Sensor-retirement RL. Once city-wide C-V2X penetration passes ~20%, fixed detectors become redundant. Promote fusion to a "sensor back-off" problem: the policy also emits a discrete "switch this detector off" action, with a multi-objective reward trading estimation error against device O&M cost — a precision–cost Pareto frontier matching "do more with less" infrastructure policy.
- Ride-hail state as a richer option set. If floating-car data carries occupancy (empty / hired / parked), lift the model to a semi-Markov decision process whose options are those operating states, mining the implicit signal that fleet behaviour gives about true demand.
- Coupled minimum-green constraints. In a green wave the downstream gmin is constrained by the upstream platoon's arrival time — a coupled temporal constraint. A GNN with neighbouring gmin as node features can jointly emit the legality mask; gap-out (early termination when a detector sees no car) turns the constraint stochastic and calls for a chance-constrained CMDP solved by Lagrangian PPO.
- Concept-drift monitoring. Stream real trajectories back via Kafka and compute MAPE@30min on flow prediction; if it exceeds ~8% trigger an automatic retrain, and if the importance of a key feature (e.g. road-condition level) drops > 20%, widen the sample window and raise the learning rate.
- Federated fairness at city scale. Beyond ~10,000 intersections a centralised fairness critic is a scalability bottleneck; train regional critics locally and aggregate only encrypted Gini gradients with differential-privacy noise, holding global fairness without moving raw data.
- Auditability for acceptance. Log an option's activation-reason vector (ρ_p, Q_next, remaining green) and a termination reason code, and submit an option policy table with stated safety boundaries. Without an auditable, interpretable trail the system cannot pass a real-vehicle acceptance test.
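The concept-drift monitoring item in the list above reduces to two threshold checks, sketched here. The Kafka ingestion is left out entirely; the function names, the action labels and the use of a relative drop for feature importance are assumptions made for illustration.

```python
from typing import List, Sequence


def mape_percent(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, skipping intervals with zero actual flow."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("no non-zero actual values to score against")
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)


def drift_actions(actual: Sequence[float], predicted: Sequence[float],
                  importance_ref: float, importance_now: float,
                  mape_limit: float = 8.0, importance_drop: float = 0.20) -> List[str]:
    """Decide what the monitor should trigger for one 30-minute window.

    - MAPE above mape_limit (percent)          -> "retrain"
    - key-feature importance down by more than
      importance_drop (relative)               -> "widen_window_and_raise_lr"
    """
    actions: List[str] = []
    if mape_percent(actual, predicted) > mape_limit:
        actions.append("retrain")
    if importance_ref > 0:
        drop = (importance_ref - importance_now) / importance_ref
        if drop > importance_drop:
            actions.append("widen_window_and_raise_lr")
    return actions
```

Keeping the decision logic as a pure function of the window's data makes it easy to replay historical windows and log the reasons for each trigger, in the same spirit as the auditability requirement.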