Intelligent traffic-signal control

A controller watches an intersection through noisy, sparse sensors and decides which movement gets the green and for how long. Casting that as an MDP is easy; making it converge and pass a traffic-authority acceptance test is the hard part. The binding difficulties here are a partially observed, heterogeneous, high-dimensional state, an action space riddled with hard safety constraints (minimum green, phase conflicts), a multi-objective reward that hides perverse optima (suppressing side streets to flatter the arterial), non-stationary demand (peak/off-peak, holidays, storms), and multi-intersection coordination under communication delay. Each one names a tool.

The method — five steps, every lesson

Applied RL is not a grab-bag of tricks. Every domain in this track is the same loop: (1) Formulate the MDP — state, action, reward, transition, horizon. (2) Diagnose the one property that makes the MDP hard. (3) Engineer the mechanism that removes that difficulty. (4) Guard it in production — detect when it breaks and fall back. (5) Iterate. The art is only in which difficulty binds first. This lesson runs the loop on one signalised network.

1 · Formulate — the MDP behind a signal controller

Intuition. A signal controller observes how many cars are waiting on each approach, picks a green phase, holds it for a while, and is rewarded by how little everyone waited. That is already a Markov Decision Process: the detector readings are the state, the choice of phase and duration is the action, the change in total delay is the reward, and traffic physics is the transition. Everything below is one of these four pieces turning out to be awkward at a real intersection.

MDP = (S, A, R, P, γ) · maximize E[ Σₜ γᵗ rₜ ]

Piece	For a signalised intersection	The awkward part
State S	per-lane queue length, occupancy, saturation, turning flows, next-phase countdown	fused from heterogeneous sensors of unequal trust → partially observable, high-dimensional
Action A	which phase to serve next × how long to hold the green	most transitions are illegal: minimum-green, pedestrian clearance, conflicting movements
Reward R	−(queue + delay) shaped with throughput, emissions, fairness	multi-objective; the literal optimum starves side streets and parks vehicles
Transition P	car-following + arrivals from upstream intersections + driver turns	non-stationary (peak shifts, weather) and coupled across intersections

The MDP table writes itself, and the rightmost column — the awkward part — is the lesson. Each row's difficulty has a named mechanism that answers it, and we reach for none of them before its row demands it.
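
As a minimal sketch, the four pieces of the table can be written down as plain data types plus the discounted objective. The names `Observation`, `Action`, `step_reward` and `discounted_return` are hypothetical, the fields are a small illustrative subset of the state row, and the reward is only the unshaped −(queue + delay) term.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Observation:
    """State S: what the detectors report for one approach lane (illustrative subset)."""
    queue_len: float    # vehicles waiting
    occupancy: float    # fraction of time the detector is covered, 0..1
    saturation: float   # demand / capacity


@dataclass(frozen=True)
class Action:
    """Action A: which phase to serve next and how long to hold the green."""
    phase: int
    green_s: float


def step_reward(total_queue: float, total_delay_s: float) -> float:
    """Reward R: the unshaped -(queue + delay) term from the table."""
    return -(total_queue + total_delay_s)


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """The objective sum_t gamma^t * r_t for one finite episode."""
    g = 0.0
    for r in reversed(rewards):  # backward recursion G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g
```

For example, `discounted_return([1.0, 1.0, 1.0], 0.5)` evaluates 1 + 0.5 + 0.25.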

2 · Diagnose — a fused, partially observed, high-dimensional state

Intuition. You never see the true traffic on a link; you see proxies. ETC gantries on the highway mainline are accurate (over 95% detection) but only cover a few links. Floating-car (GPS-equipped taxi / ride-hail) data covers everywhere but at low penetration — inside Beijing's 5th Ring the taxi share is roughly 3%, so for most lanes you have almost no direct observation. Loop and video detectors fail in heavy rain. So the "state" is really a belief assembled from sources of wildly different trust, and a big intersection's raw state vector is huge.

Engineering detail — fuse as a POMDP. Treat the unobserved true link parameters as the hidden state s, the heterogeneous (fixed-detector, floating-car) pair as the observation o, and let the policy emit a per-source fusion weight α. Encode the observation history with a Bayesian LSTM, h_t = f(o₁,a₁,…,o_t), into a 256-dim belief vector; a final Concrete-Dropout layer emits an uncertainty interval that drives exploration. Two adaptations remove the sparsity that would otherwise stall learning:

b_t = BayesianLSTM(o_{1:t}) · T̂(s′|s,a) = Σ_{b_t} P(s|b_t)·P(s′|b_t,a)

High-trust observations get weight. ETC-gantry readings enter the reward and fusion with a larger coefficient because their precision exceeds 95%.
Sparse links get imputed first. Where floating-car penetration is below ~3%, fill the gap with Voronoi partition + kriging spatial interpolation before feeding the policy network, so RL is not asked to learn from an almost-empty reward signal.
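
The two adaptations can be illustrated with simpler stand-ins. In this sketch, inverse-variance weighting replaces the learned fusion weight α, and inverse-distance weighting replaces Voronoi partition + kriging; both are swapped-in textbook techniques, not the method described above. The function names and the (value, standard deviation) input format are assumptions.

```python
import math
from typing import Dict, List, Tuple


def fuse(estimates: List[Tuple[float, float]]) -> float:
    """Inverse-variance fusion of (value, std_dev) pairs from different sensors.

    A precise source (small std_dev) dominates a noisy one.
    """
    weights = [1.0 / (sd * sd) for _, sd in estimates]
    weighted_sum = sum(w * v for w, (v, _) in zip(weights, estimates))
    return weighted_sum / sum(weights)


def idw_impute(target: Tuple[float, float],
               observed: Dict[Tuple[float, float], float],
               power: float = 2.0) -> float:
    """Fill an unobserved link from observed neighbours by inverse-distance weighting.

    Stand-in for kriging: nearer links count more, but no variogram is fitted.
    """
    num = den = 0.0
    for (x, y), value in observed.items():
        d = math.hypot(x - target[0], y - target[1])
        if d == 0.0:
            return value  # the target itself is observed
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den
```

With a gantry reading of 10 (σ = 1) and a floating-car estimate of 20 (σ = 3), `fuse` returns a value much closer to the gantry reading, which is the behaviour the high-trust rule asks for.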

For the high-dimensional state of a large junction, compress it: a β-VAE encoder maps the raw per-lane tensor down to a ~24-dim latent, with a traffic-semantic constraint in the loss — beyond reconstruction error, a saturation-ordering consistency term (a KL divergence) prevents the compression from flipping phase priorities. A Successor-Feature linear projection then takes the latent to ~18 dims with a provable cumulative-return error bound (≤ 2.1%); freeze that projection as a feature extractor and only update the policy head, keeping policy-gradient variance controlled. Quantised to INT8 it infers in ~22 ms on an edge signal box — within the real-time budget.

3 · Engineer the action space — safety constraints, not preferences

Intuition. Most phase transitions a network could emit are flatly illegal: you cannot cut a green below its minimum (a pedestrian phase often carries a hard 15 s minimum, while a vehicle phase can usually be released sooner), and you cannot serve two conflicting movements at once. If the policy can place probability on those actions it wastes capacity learning "never do that," and, far worse, a single illegal sample on the street is a safety incident. The fix is the same family as masking in game AI: forbid the action before sampling, and never let the constraint leak into the gradient.

Engineering detail — mask at the source, keep the ratio honest. Build a fixed-length legality mask each step from the current phase timers and the conflict matrix; set illegal logits to a large negative before the softmax so their probability is exactly zero. Three rules keep this from poisoning PPO:

logits ← logits.masked_fill(mask == 0, −1e9) → softmax · store raw pre-mask logits in the buffer

Record the pre-mask logits when writing to the replay buffer, so the importance ratio π_new/π_old can be recomputed correctly and without bias.
Mask only during environment interaction; introduce no extra gradient on the backward pass, preserving convergence guarantees of PPO / A2C.
Formally verify before deployment: enumerate every (minimum-green, phase) combination through a state machine to confirm zero violations — the only acceptable bar for a traffic authority.
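
The masking rules can be sketched without a deep-learning framework. This version assumes the legality mask is stored next to the raw logits and reapplied when the ratio is recomputed, which is one reading of the rules above rather than something the text spells out. `masked_log_probs` and `ppo_ratio` are hypothetical names.

```python
import math
from typing import List, Sequence

NEG_FILL = -1e9  # same fill value as the masked_fill line above


def masked_log_probs(logits: Sequence[float], mask: Sequence[int]) -> List[float]:
    """Log-softmax over legal actions only; illegal actions are reported as -inf."""
    filled = [l if m else NEG_FILL for l, m in zip(logits, mask)]
    top = max(filled)  # subtract the max for numerical stability
    log_z = top + math.log(sum(math.exp(f - top) for f in filled))
    return [f - log_z if m else -math.inf for f, m in zip(filled, mask)]


def ppo_ratio(new_logits: Sequence[float], old_logits: Sequence[float],
              mask: Sequence[int], action: int) -> float:
    """pi_new(a|s) / pi_old(a|s), both evaluated under the stored mask."""
    if not mask[action]:
        # An illegal action is never sampled, so it never needs a ratio.
        raise ValueError("a masked action can never have been sampled")
    new_lp = masked_log_probs(new_logits, mask)[action]
    old_lp = masked_log_probs(old_logits, mask)[action]
    return math.exp(new_lp - old_lp)
```

Because only sampled (hence legal) actions ever reach `ppo_ratio`, the ratio is always a finite number.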

Two failure modes the mask must avoid

(1) The log(0) trap. If a phase is illegal under both the old and the new policy, both probabilities are 0, so the log-ratio evaluates log 0 − log 0 = (−∞) − (−∞), which is NaN. Treat illegal-action probability as identically 0 on both policies and zero its gradient.

(2) Over-masking strategy. The mask enforces rules (conflict, minimum green), never strategy. Pedestrian and vehicle phases need independent g_min dimensions in the action space; sharing one mask across "person" and "car" silently bans legal short vehicle greens. Keep g_min in an external config register that the controller reads at startup: police temporarily raise it for exams or large events, and you want zero-code adaptation, not a retrain.
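
A minimal sketch of a per-phase minimum-green mask and an exhaustive check follows. The four-phase layout, the names `PHASES`, `G_MIN_S`, `legality_mask` and `count_violations`, and the 5 s vehicle minimum are all hypothetical; only the 15 s pedestrian figure comes from the text. The conflict matrix and clearance intervals are left out because this toy serves one phase at a time.

```python
from itertools import product
from typing import Dict, List

# Hypothetical layout: two vehicle phases and two pedestrian phases.
PHASES = ["NS_veh", "EW_veh", "NS_ped", "EW_ped"]

# In deployment this table would be read from the external config register.
# 15 s is the pedestrian figure from the text; 5 s is a placeholder.
G_MIN_S: Dict[str, int] = {"NS_veh": 5, "EW_veh": 5, "NS_ped": 15, "EW_ped": 15}


def legality_mask(current: str, elapsed_s: int, g_min: Dict[str, int]) -> List[int]:
    """1 = legal next phase. Staying is always legal; switching needs g_min served."""
    can_switch = elapsed_s >= g_min[current]
    return [1 if (p == current or can_switch) else 0 for p in PHASES]


def count_violations(g_min: Dict[str, int], horizon_s: int = 60) -> int:
    """Enumerate every (phase, elapsed) pair and count early switches the mask allows.

    A real verification would check against an independently written specification;
    here the rule is restated inline to show the enumeration pattern only.
    """
    violations = 0
    for current, elapsed in product(PHASES, range(horizon_s + 1)):
        mask = legality_mask(current, elapsed, g_min)
        for nxt, legal in zip(PHASES, mask):
            if legal and nxt != current and elapsed < g_min[current]:
                violations += 1
    return violations
```

Replacing `G_MIN_S` with a single shared 15 s value makes `legality_mask("NS_veh", 5, ...)` forbid every switch, which is the over-masking failure described above.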

Higher-level actions — the options framework. Holding a green is naturally a temporally-extended action. Define an "extend-green" option ω with an initiation set (enter only when the current phase saturation is high and the next conflicting queue Q_next ≤ 5 vehicles), an internal policy that simply holds, and a termination β_ω that fires when the extension reaches 20 s, or saturation drops below 0.3, or an emergency vehicle / red-light-runner is detected. The semi-MDP return collected over the option updates the upper-level Q-learning:

U_t = Σ_{i=0}^{k−1} γ^i r_{t+i} + γ^k max_{ω′} Q(s_{t+k}, ω′)
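
The option and its semi-MDP target can be sketched as below. The thresholds 5 vehicles, 20 s and 0.3 come from the text; `sat_high = 0.8`, the tabular update and all function names are assumptions made for illustration.

```python
from typing import Dict, Sequence


def can_initiate(saturation: float, q_next: int, sat_high: float = 0.8) -> bool:
    """Initiation set: busy current phase and a short conflicting queue.

    sat_high is an assumed threshold; the text only says 'high'.
    """
    return saturation >= sat_high and q_next <= 5


def should_terminate(extended_s: float, saturation: float, emergency: bool) -> bool:
    """Termination: 20 s cap reached, saturation below 0.3, or an emergency event."""
    return extended_s >= 20.0 or saturation < 0.3 or emergency


def smdp_target(rewards: Sequence[float], gamma: float,
                next_q: Dict[str, float]) -> float:
    """U_t = sum_{i<k} gamma^i r_{t+i} + gamma^k max_w Q(s_{t+k}, w), with k = len(rewards)."""
    k = len(rewards)
    discounted = sum(gamma ** i * r for i, r in enumerate(rewards))
    return discounted + gamma ** k * max(next_q.values())


def q_update(q: float, target: float, lr: float = 0.1) -> float:
    """One tabular Q-learning step towards the SMDP target."""
    return q + lr * (target - q)
```

For instance, an option that ran two steps with reward 1 each, γ = 0.5 and next-state values {extend: 4, switch: 2} gives a target of 1 + 0.5 + 0.25·4.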

4 · Reward — multi-objective, and quietly hackable

Intuition. "Minimise delay" alone produces ugly optima. Maximise throughput and the agent learns to suppress the side streets and feed only the arterial; add an emissions term naively and it can stall the fleet to game the average. The reward is genuinely multi-objective — queue, delay, emissions, fairness — and each term you add is a new thing the agent will optimise literally.

Engineering detail — monetise and normalise. Put every objective in the same unit. Emissions become a cost per gram from social-cost figures — CO₂ at ~5.5×10⁻⁵ ¥/g (from a ~55 ¥/tonne carbon price), NOx at ~1.5×10⁻² ¥/g, particle-number at ~6×10⁻¹⁵ ¥/#:

r_env = −(5.5e−5·CO₂_g + 1.5e−2·NOx_g + 6e−15·PN_#)

Smooth raw emission readings with an EMA (decay α = 0.05, an effective window of roughly 1/α = 20 samples, which is ≈ 1 s only if readings arrive at about 20 Hz) to reject sensor noise without losing transient peaks, and ramp the penalty in with an adaptive coefficient β(t) so early exploration is not over-suppressed: r_t = r_throughput + r_legal + β(t)·r_{env,ema}. With this structure a controller cut NOx ~18% and particle number ~22% at a ~1.2% fuel cost, training stably.
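
The monetised term, the EMA and the ramp combine as in the sketch below. The cost coefficients and α = 0.05 come from the text; the linear shape of `beta_ramp`, its `warmup_steps` default and the class and function names are assumptions, since the text only says β(t) is adaptive.

```python
from typing import Optional

CO2_YUAN_PER_G = 5.5e-5         # ~55 yuan per tonne
NOX_YUAN_PER_G = 1.5e-2
PN_YUAN_PER_PARTICLE = 6e-15


def r_env(co2_g: float, nox_g: float, pn_count: float) -> float:
    """Monetised emissions cost in yuan, negated so it acts as a reward."""
    return -(CO2_YUAN_PER_G * co2_g + NOX_YUAN_PER_G * nox_g
             + PN_YUAN_PER_PARTICLE * pn_count)


class Ema:
    """Exponential moving average: y <- (1 - alpha) * y + alpha * x."""

    def __init__(self, alpha: float = 0.05) -> None:
        self.alpha = alpha
        self.value: Optional[float] = None

    def update(self, x: float) -> float:
        if self.value is None:
            self.value = x  # seed with the first reading
        else:
            self.value = (1 - self.alpha) * self.value + self.alpha * x
        return self.value


def beta_ramp(step: int, warmup_steps: int = 10_000, beta_max: float = 1.0) -> float:
    """Assumed linear ramp for beta(t); the text does not specify the schedule."""
    return beta_max * min(1.0, step / warmup_steps)


def total_reward(r_throughput: float, r_legal: float,
                 r_env_ema: float, step: int) -> float:
    """r_t = r_throughput + r_legal + beta(t) * r_env_ema."""
    return r_throughput + r_legal + beta_ramp(step) * r_env_ema
```

Early in training `beta_ramp` is near zero, so the emissions term barely affects exploration; it reaches full weight after `warmup_steps`.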

Fairness — write it into the kernel, not a post-hoc patch. Cast the problem as a CMDP: base reward r₀ = instantaneous vehicles served, and a fairness cost C = an EMA (300 s window) of the Gini coefficient of per-road service, with constraint E[C] ≤ 0.3 (the authority's "side-street delay ≤ 1.5× the arterial" rule). The shaping reward adds two clipped hinge penalties whose Lagrange multipliers are updated online by dual gradient ascent:

r = r₀ − λ·max(0, Ĝ − 0.3) − μ·max(0, θ − min_i T_i)

where Ĝ is a real-time Gini estimate from a centralised fairness critic and θ is a per-lane minimum throughput floor (from GB 14886, ~180 pcu/h). The fairness critic shares a convolutional backbone with the throughput critic but has its own head fit to true Gini with a Huber loss; the replay buffer prioritises high-unfairness samples (p = |δ| + 0.1·max(0, Ĝ−0.3)). A field trial cut arterial delay 12% and side-street delay 38%, dropping Gini from 0.41 to 0.27 — fairness as a constraint, not a contradiction.
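
The fairness pieces can be sketched with exact arithmetic in place of the learned critic. Here `gini` computes the coefficient directly from per-road service instead of estimating Ĝ with a network, and the step size in `dual_ascent` is an assumed value; the 0.3 limit and the 180 pcu/h floor come from the text.

```python
from typing import Sequence


def gini(values: Sequence[float]) -> float:
    """Gini coefficient of non-negative per-road service; 0 means perfectly equal."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def shaped_reward(r0: float, g_hat: float, min_throughput: float,
                  lam: float, mu: float,
                  g_limit: float = 0.3, theta: float = 180.0) -> float:
    """r = r0 - lam * max(0, G - 0.3) - mu * max(0, theta - min_i T_i)."""
    return (r0
            - lam * max(0.0, g_hat - g_limit)
            - mu * max(0.0, theta - min_throughput))


def dual_ascent(multiplier: float, violation: float, lr: float = 0.01) -> float:
    """Raise the multiplier while the constraint is violated, lower it while
    it is satisfied, and never let it go below zero."""
    return max(0.0, multiplier + lr * violation)
```

A typical online step would call `dual_ascent(lam, gini(service) - 0.3)` and `dual_ascent(mu, 180.0 - min(throughputs))` after each evaluation window.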

Detecting "vehicle suppression" before it ships

The perverse optimum of a throughput reward is to hold minor approaches red and let their queues blow up while the arterial sails. Catch it three ways. (1) Watch the worst-served approach's share of green time — if a 3% share creeps up to ~21% after retraining you fixed it; if a lane's share collapses, alert. (2) Validate a Jain fairness index every ~1000 steps and roll back the policy if it drops below 0.75, to stop policy collapse. (3) Keep a hard prior-rule backstop: GB 14886's "single-point delay ≤ 60 s" enters the reward as a hard constraint, and the reward function is logged in an algorithm-registration table with an interpretability ratio > 0.8 so it can pass a safety review rather than read as a black box.
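
The second check is easy to make concrete. This sketch computes the Jain index from per-approach service counts and flags a rollback at the validation cadence; the 1000-step interval and 0.75 floor come from the text, while the function names and the choice of service counts as the input are assumptions.

```python
from typing import Sequence


def jain_index(service: Sequence[float]) -> float:
    """Jain fairness index (sum x)^2 / (n * sum x^2); 1 is equal, 1/n is one-takes-all."""
    n = len(service)
    sum_sq = sum(x * x for x in service)
    if n == 0 or sum_sq == 0:
        return 1.0  # nothing served yet, so nothing is unfair yet
    return sum(service) ** 2 / (n * sum_sq)


def should_roll_back(service: Sequence[float], step: int,
                     every: int = 1000, floor: float = 0.75) -> bool:
    """True only on validation steps where fairness has dropped below the floor."""
    return step % every == 0 and jain_index(service) < floor
```

An intersection where one approach receives all the service scores 1/n, well below the floor, so the policy would be rolled back at the next validation step.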

5 · Non-stationary demand — the transition that drifts under you

Intuition. Morning peak, evening peak, midday lull, rainstorm, holiday — the arrival process is not one MDP but a family that switches, sometimes abruptly. A policy fit to the peak slowly forgets the off-peak, and a sudden storm can triple demand faster than any change-detector can react. You need to detect the regime, switch smoothly, and not forget.

Engineering detail. Detect a regime shift with a CUSUM statistic on demand, then soft-switch: blend π_new = α·π_peak + (1−α)·π_offpeak with α ramping 0→1 linearly over ~10 minutes, holding KL(π_old‖π_new) ≤ 0.05 as a hard constraint (a TRPO-style projection) so drivers never feel a jolt. During a sustained peak, add a non-stationarity penalty β·‖θ_t − θ_{t−k}‖² (β ≈ 1e−4) to resist over-fitting to short-term noise. To beat catastrophic forgetting of the off-peak pattern, keep a fixed-size reservoir buffer storing both high-value (return > μ+2σ) and high-entropy samples; mix ~30% replayed with ~70% live data and correct the distribution shift with importance weights. Anticipate calendar effects with a holiday model: MAML pre-training on past festival data so that a never-seen holiday adapts in ~3 epochs, plus a progressive-network "year embedding" that freezes the old net and trains only the increment, so performance on earlier years never regresses.
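
Detection and soft-switching can be sketched as follows. The one-sided CUSUM form, its `baseline`, `slack` and `threshold` parameters and the function names are assumptions; the 10-minute linear ramp comes from the text. The KL ≤ 0.05 projection is not implemented here.

```python
from typing import List, Sequence


class Cusum:
    """One-sided CUSUM for an upward shift in demand (veh/s)."""

    def __init__(self, baseline: float, slack: float, threshold: float) -> None:
        self.baseline = baseline    # expected demand in the current regime
        self.slack = slack          # tolerated drift before evidence accumulates
        self.threshold = threshold  # alarm level for the accumulated evidence
        self.s = 0.0

    def update(self, x: float) -> bool:
        """Accumulate positive deviations; return True when a shift is declared."""
        self.s = max(0.0, self.s + (x - self.baseline - self.slack))
        return self.s > self.threshold


def blend_alpha(seconds_since_switch: float, ramp_s: float = 600.0) -> float:
    """Linear ramp 0 -> 1 over about 10 minutes."""
    return min(1.0, max(0.0, seconds_since_switch / ramp_s))


def blend_policies(pi_peak: Sequence[float], pi_offpeak: Sequence[float],
                   alpha: float) -> List[float]:
    """pi_new = alpha * pi_peak + (1 - alpha) * pi_offpeak, action by action."""
    return [alpha * p + (1 - alpha) * q for p, q in zip(pi_peak, pi_offpeak)]
```

A convex blend of two probability vectors is itself a probability vector, so the mixed policy can be sampled directly at every point of the ramp.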

6 · Multi-intersection coordination — the transition is coupled

Intuition. One intersection's outflow is the next one's inflow, so the controllers are coupled — green-wave coordination is exactly exploiting this. That makes it a multi-agent problem, and two things bite: how to assign credit across agents, and how to stay consistent when the messages between them arrive late.

Engineering detail. Use QMIX: each intersection keeps a local Q_i; a monotonic mixing network combines them into Q_tot for centralised training, decentralised execution. At inference the edge boxes run only their local Q_i and sync parameter deltas (ΔW) every ~30 s; if the central coordinator fails, each box degrades to independent Q-learning to keep the corridor running. For communication delay, weight stale gradients down by their age and feed a time-embedding into the mixer so it learns to compensate; output a delay-confidence σ, and when σ is low, apply a conservative action mask so a late message cannot break a safety constraint. For a city-scale grid, abstract first: cluster tens of thousands of intersections into a few hundred super-agents (cutting per-agent state from ~10⁵ to ~10³ dims), generate samples from a large parallel SUMO simulation, and train an Ape-X-style distributed learner with gradient quantisation + Top-K sparsification (≈ 1/8 the bandwidth) and PopArt to keep multi-city reward scales aligned.
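
The property QMIX relies on can be shown with a deliberately reduced mixer. This sketch uses a single linear layer with fixed weights, whereas the real QMIX mixer is a small network whose weights are produced by a hypernetwork from the global state; the exponential staleness weighting and its 30 s half-life are also assumptions, since the text only says stale gradients are weighted down by age.

```python
from typing import List, Sequence


def mix(local_qs: Sequence[float], raw_weights: Sequence[float], bias: float) -> float:
    """Q_tot = sum_i |w_i| * Q_i + b. abs() keeps dQ_tot/dQ_i >= 0 (monotonic)."""
    return sum(abs(w) * q for w, q in zip(raw_weights, local_qs)) + bias


def greedy_joint_action(q_tables: Sequence[Sequence[float]]) -> List[int]:
    """Each intersection takes the argmax of its own Q_i.

    With a monotonic mixer this joint action also maximises Q_tot, which is
    what allows decentralised execution on the edge boxes.
    """
    return [max(range(len(q)), key=q.__getitem__) for q in q_tables]


def staleness_weight(age_s: float, half_life_s: float = 30.0) -> float:
    """Assumed exponential down-weighting of an update that is age_s seconds old."""
    return 0.5 ** (age_s / half_life_s)
```

Even with a negative raw weight, the absolute value keeps the mixer monotonic, so the locally greedy joint action matches a brute-force search over Q_tot.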

Green-extension dwell time → throughput vs. fairness

The central knob of adaptive control: how long to hold a saturated green before serving the waiting cross-street. Push the dwell-time cap up and you clear the busy approach but starve the side street (Gini rises); push it down and you waste capacity switching. This is the option's β_ω threshold made tangible.

[Interactive demo: sliders for max green dwell (35 s), arterial demand (0.9 veh/s) and side-street demand (0.4 veh/s); readouts for throughput, side-street wait, Gini (fairness) and verdict.]

The through-line

Every section is one row of the MDP table turned into a mechanism: fused partial state → POMDP belief encoder + kriging imputation + semantic-constrained compression; safety-constrained action → source-side masking, formal verification, the options framework for green extension; multi-objective reward → monetised normalisation + a fairness CMDP with a suppression detector; non-stationary demand → CUSUM soft-switching + reservoir replay against forgetting; coupled transition → QMIX with delay-aware syncing and an IQL fallback. You never reached for a tool until a row of the table demanded it. That discipline is the whole track.

Further considerations

Sensor-retirement RL. Once city-wide C-V2X penetration passes ~20%, fixed detectors become redundant. Promote fusion to a "sensor back-off" problem: the policy also emits a discrete "switch this detector off" action, with a multi-objective reward trading estimation error against device O&M cost — a precision–cost Pareto frontier matching "do more with less" infrastructure policy.
Ride-hail state as a richer option set. If floating-car data carries occupancy (empty / hired / parked), lift the model to a semi-Markov decision process whose options are those operating states, mining the implicit signal that fleet behaviour gives about true demand.
Coupled minimum-green constraints. In a green wave the downstream g_min is constrained by the upstream platoon's arrival time — a coupled temporal constraint. A GNN with neighbouring g_min as node features can jointly emit the legality mask; gap-out (early termination when a detector sees no car) turns the constraint stochastic and calls for a chance-constrained CMDP solved by Lagrangian PPO.
Concept-drift monitoring. Stream real trajectories back via Kafka and compute MAPE@30min on flow prediction; if it exceeds ~8% trigger an automatic retrain, and if the importance of a key feature (e.g. road-condition level) drops > 20%, widen the sample window and raise the learning rate.
Federated fairness at city scale. Beyond ~10,000 intersections a centralised fairness critic is a scalability bottleneck; train regional critics locally and aggregate only encrypted Gini gradients with differential-privacy noise, holding global fairness without moving raw data.
Auditability for acceptance. Log an option's activation-reason vector (ρ_p, Q_next, remaining green) and a termination reason-code, and submit an option policy table with stated safety boundaries — without an auditable, interpretable trail the system cannot pass a real-vehicle acceptance test.
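
For the concept-drift item above, the retrain trigger reduces to a few lines. The ~8% threshold comes from the text; the function names, the choice to skip zero-flow intervals and the omission of the Kafka stream and 30-minute windowing are simplifying assumptions.

```python
from typing import Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error in percent, skipping zero-flow intervals."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        return 0.0
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)


def needs_retrain(actual: Sequence[float], predicted: Sequence[float],
                  threshold_pct: float = 8.0) -> bool:
    """Trigger an automatic retrain when flow-prediction error exceeds the threshold."""
    return mape(actual, predicted) > threshold_pct
```

In use, `actual` and `predicted` would hold the flows for the most recent 30-minute window.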