
Intelligent traffic-signal control

A controller watches an intersection through noisy, sparse sensors and decides which movement gets the green and for how long. Casting that as an MDP is easy; making it converge and pass a traffic-authority acceptance test is the hard part. The binding difficulties here are a partially observed, heterogeneous, high-dimensional state, an action space riddled with hard safety constraints (minimum green, phase conflicts), a multi-objective reward that hides perverse optima (suppressing side streets to flatter the arterial), non-stationary demand (peak/off-peak, holidays, storms), and multi-intersection coordination under communication delay. Each one names a tool.

The method — five steps, every lesson
Applied RL is not a grab-bag of tricks. Every domain in this track is the same loop: (1) Formulate the MDP — state, action, reward, transition, horizon. (2) Diagnose the one property that makes the MDP hard. (3) Engineer the mechanism that removes that difficulty. (4) Guard it in production — detect when it breaks and fall back. (5) Iterate. The art is only in which difficulty binds first. This lesson runs the loop on one signalised network.

1 · Formulate — the MDP behind a signal controller

Intuition. A signal controller observes how many cars are waiting on each approach, picks a green phase, holds it for a while, and is rewarded by how little everyone waited. That is already a Markov Decision Process: the detector readings are the state, the choice of phase and duration is the action, the change in total delay is the reward, and traffic physics is the transition. Everything below is one of these four pieces turning out to be awkward at a real intersection.

MDP = (S, A, R, P, γ)  ·  maximize  E[ Σₜ γᵗ rₜ ]
Piece | For a signalised intersection | The awkward part
State S | per-lane queue length, occupancy, saturation, turning flows, next-phase countdown | fused from heterogeneous sensors of unequal trust → partially observable, high-dimensional
Action A | which phase to serve next × how long to hold the green | most transitions are illegal: minimum-green, pedestrian clearance, conflicting movements
Reward R | −(queue + delay) shaped with throughput, emissions, fairness | multi-objective; the literal optimum starves side streets and parks vehicles
Transition P | car-following + arrivals from upstream intersections + driver turns | non-stationary (peak shifts, weather) and coupled across intersections

The MDP table writes itself, and the rightmost column — the awkward part — is the lesson. Each row's difficulty has a named mechanism that answers it, and we reach for none of them before its row demands it.
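As a minimal sketch, the four pieces of the table can be written down as data types before any learning happens. The field names (`queue_len`, `hold_s`, and so on) are illustrative assumptions, not taken from any particular controller; `discounted_return` is the quantity the objective above maximises.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class State:
    """Fused detector readings for one junction (hypothetical field names)."""
    queue_len: List[float]    # vehicles waiting, one entry per lane
    occupancy: List[float]    # fraction of time each detector is covered
    phase: int                # index of the phase that is currently green
    phase_elapsed_s: float    # seconds the current phase has been green


@dataclass(frozen=True)
class Action:
    """Which phase to serve next and how long to hold it."""
    next_phase: int
    hold_s: float


def reward(total_queue: float, total_delay_s: float) -> float:
    """Base reward from the table: the negative of queue plus delay."""
    return -(total_queue + total_delay_s)


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum over t of gamma**t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```

The transition P has no type here on purpose: in this setting it is supplied by the street or by a simulator, not written by hand.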

2 · Diagnose — a fused, partially observed, high-dimensional state

Intuition. You never see the true traffic on a link; you see proxies. ETC gantries on the highway mainline are accurate (over 95% detection) but only cover a few links. Floating-car (GPS-equipped taxi / ride-hail) data covers everywhere but at low penetration — inside Beijing's 5th Ring the taxi share is roughly 3%, so for most lanes you have almost no direct observation. Loop and video detectors fail in heavy rain. So the "state" is really a belief assembled from sources of wildly different trust, and a big intersection's raw state vector is huge.

Engineering detail: fuse as a POMDP. Treat the unobserved true link parameters as the hidden state s, the heterogeneous (fixed-detector, floating-car) pair as the observation o, and let the policy emit a per-source fusion weight α. Encode the observation history with a Bayesian LSTM, h_t = f(o_1, a_1, …, o_t), into a 256-dim belief vector; a final Concrete-Dropout layer emits an uncertainty interval that drives exploration. Two adaptations remove the sparsity that would otherwise stall learning:

b_t = BayesianLSTM(o_{1:t})  ·  T̂(s′|s,a) = Σ_{b_t} P(s|b_t)·P(s′|b_t,a)
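The per-source fusion weight can be illustrated with a deliberately simple sketch. It is a simplification under stated assumptions: the weights are passed in by hand, whereas in the text the policy emits them, and the source names (`"fixed"`, `"floating"`) and the function name are hypothetical.

```python
from typing import Dict, Optional


def fuse_link_estimate(
    readings: Dict[str, Optional[float]],
    weights: Dict[str, float],
) -> Optional[float]:
    """Weighted average over the sources that actually reported.

    A source that dropped out (value None, for example a loop detector in
    heavy rain) is skipped and the remaining weights are renormalised.
    """
    live = {name: value for name, value in readings.items() if value is not None}
    total_w = sum(weights[name] for name in live)
    if not live or total_w <= 0.0:
        return None  # no usable observation for this link at this step
    return sum(weights[name] * value for name, value in live.items()) / total_w
```

Returning `None` when every source is silent keeps the gap visible to whatever fills it downstream, instead of quietly reporting a zero queue.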

For the high-dimensional state of a large junction, compress it: a β-VAE encoder maps the raw per-lane tensor down to a ~24-dim latent, with a traffic-semantic constraint in the loss — beyond reconstruction error, a saturation-ordering consistency term (a KL divergence) prevents the compression from flipping phase priorities. A Successor-Feature linear projection then takes the latent to ~18 dims with a provable cumulative-return error bound (≤ 2.1%); freeze that projection as a feature extractor and only update the policy head, keeping policy-gradient variance controlled. Quantised to INT8 it infers in ~22 ms on an edge signal box — within the real-time budget.

3 · Engineer the action space — safety constraints, not preferences

Intuition. Most phase transitions a network could emit are flatly illegal: you cannot cut a green below its minimum (a pedestrian phase is often a hard 15 s while a vehicle phase might want only 20 s), and you cannot serve two conflicting movements at once. If the policy can place probability on those actions it wastes capacity learning "never do that," and — far worse — a single illegal sample on the street is a safety incident. The fix is the same family as masking in game AI: forbid the action before sampling, and never let the constraint leak into the gradient.

Engineering detail — mask at the source, keep the ratio honest. Build a fixed-length legality mask each step from the current phase timers and the conflict matrix; set illegal logits to a large negative before the softmax so their probability is exactly zero. Three rules keep this from poisoning PPO:

logits ← logits.masked_fill(mask == 0, −1e9) → softmax  ·  store raw pre-mask logits in the buffer
Two failure modes the mask must avoid
(1) The log(0) trap. If a phase is illegal under both the old and the new policy, both probabilities are 0, and the PPO log-ratio evaluates log 0 − log 0 = −∞ − (−∞), which is NaN. Treat illegal-action probability as identically 0 on both policies and zero its gradient.
(2) Over-masking strategy. The mask enforces rules (conflict, minimum green), never strategy. Pedestrian and vehicle phases need independent g_min dimensions in the action space; sharing one mask across "person" and "car" silently bans legal short vehicle greens. Keep g_min in an external config register that the controller reads at startup: police temporarily raise it for exams or large events, and you want zero-code adaptation, not a retrain.
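The masking step and the guarded ratio can be sketched in plain Python. This is an illustration under assumptions, not the lesson's implementation: actions are phase indices, `g_min` holds one minimum green per phase, and `clearing` (a hypothetical name) lists phases still inside their clearance interval.

```python
import math
from typing import List, Sequence

NEG_FILL = -1e9  # large negative logit, as in the masked_fill line above


def legality_mask(current: int, elapsed_s: float, g_min: Sequence[float],
                  conflicts: Sequence[Sequence[bool]],
                  clearing: Sequence[int]) -> List[int]:
    """1 = legal, 0 = illegal, one entry per phase."""
    n = len(g_min)
    if elapsed_s < g_min[current]:
        # Minimum green not yet served: the only legal action is to stay.
        return [1 if p == current else 0 for p in range(n)]
    # Otherwise ban any phase that conflicts with one still clearing.
    return [0 if any(conflicts[p][q] for q in clearing) else 1 for p in range(n)]


def masked_softmax(logits: Sequence[float], mask: Sequence[int]) -> List[float]:
    """Softmax after filling illegal logits; assumes at least one legal action."""
    filled = [x if m else NEG_FILL for x, m in zip(logits, mask)]
    top = max(filled)
    exps = [math.exp(x - top) for x in filled]  # underflows to exactly 0.0
    z = sum(exps)
    return [e / z for e in exps]


def ppo_ratio(p_new: Sequence[float], p_old: Sequence[float],
              action: int, mask: Sequence[int]) -> float:
    """Probability ratio for one action; illegal actions contribute nothing."""
    if not mask[action]:
        return 0.0  # never evaluate log(0) - log(0)
    return math.exp(math.log(p_new[action]) - math.log(p_old[action]))
```

Because `g_min` is indexed per phase, a 15 s pedestrian minimum and a shorter vehicle minimum can coexist without one banning the other.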

Higher-level actions: the options framework. Holding a green is naturally a temporally-extended action. Define an "extend-green" option ω with an initiation set (enter only when the current phase saturation is high and the next conflicting queue Q_next ≤ 5 vehicles), an internal policy that simply holds, and a termination β_ω that fires when the extension reaches 20 s, or saturation drops below 0.3, or an emergency vehicle / red-light-runner is detected. The semi-MDP return collected over the option updates the upper-level Q-learning:

U_t = Σ_{i=0}^{k−1} γ^i r_{t+i} + γ^k max_{ω′} Q(s_{t+k}, ω′)
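A small sketch of the extend-green option and its semi-MDP target follows. The thresholds 5 vehicles, 20 s and 0.3 come from the text; the `high_saturation` default of 0.8 is a placeholder, since the text only says "high", and the function names are hypothetical.

```python
from typing import Dict, Sequence

MAX_NEXT_QUEUE = 5       # initiation: next conflicting queue, in vehicles
MAX_EXTENSION_S = 20.0   # termination: cap on the extension
MIN_SATURATION = 0.3     # termination: saturation floor


def can_initiate(saturation: float, next_queue: int,
                 high_saturation: float = 0.8) -> bool:
    """Initiation set; high_saturation is an assumed placeholder value."""
    return saturation >= high_saturation and next_queue <= MAX_NEXT_QUEUE


def should_terminate(extended_s: float, saturation: float,
                     emergency: bool) -> bool:
    """Termination condition of the option."""
    return (extended_s >= MAX_EXTENSION_S
            or saturation < MIN_SATURATION
            or emergency)


def smdp_target(rewards: Sequence[float], gamma: float,
                q_next: Dict[str, float]) -> float:
    """Discounted reward over the k option steps plus the bootstrapped tail."""
    k = len(rewards)
    collected = sum((gamma ** i) * r for i, r in enumerate(rewards))
    return collected + (gamma ** k) * max(q_next.values())


def q_update(q: float, target: float, lr: float) -> float:
    """One tabular Q-learning step toward the semi-MDP target."""
    return q + lr * (target - q)
```

Note that the bootstrap is discounted by γ^k, not γ: an option that ran for k steps pushes the next decision k steps into the future.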

4 · Reward — multi-objective, and quietly hackable

Intuition. "Minimise delay" alone produces ugly optima. Maximise throughput and the agent learns to suppress the side streets and feed only the arterial; add an emissions term naively and it can stall the fleet to game the average. The reward is genuinely multi-objective — queue, delay, emissions, fairness — and each term you add is a new thing the agent will optimise literally.

Engineering detail — monetise and normalise. Put every objective in the same unit. Emissions become a cost per gram from social-cost figures — CO₂ at ~5.5×10⁻⁵ ¥/g (from a ~55 ¥/tonne carbon price), NOx at ~1.5×10⁻² ¥/g, particle-number at ~6×10⁻¹⁵ ¥/#:

r_env = −(5.5e−5·CO₂[g] + 1.5e−2·NOx[g] + 6e−15·PN[#])

Smooth raw emission readings with an EMA (decay α = 0.05, ≈ 1 s window) to reject sensor noise without losing transient peaks, and ramp the penalty in with an adaptive coefficient β(t) so early exploration is not over-suppressed: r_t = r_throughput + r_legal + β(t)·r_env,ema. With this structure a controller cut NOx ~18% and particle number ~22% at a ~1.2% fuel cost, training stably.
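The monetised emissions term, its smoothing and the ramped coefficient fit in a few lines. The unit costs are the figures quoted above; the linear warm-up used for β(t) is an assumption, since the text only calls the coefficient "adaptive", and `warmup_steps` is a hypothetical parameter.

```python
COST_CO2 = 5.5e-5   # yuan per gram (about 55 yuan per tonne)
COST_NOX = 1.5e-2   # yuan per gram
COST_PN = 6e-15     # yuan per particle


def r_env(co2_g: float, nox_g: float, pn_count: float) -> float:
    """Negative monetised emissions cost for one step."""
    return -(COST_CO2 * co2_g + COST_NOX * nox_g + COST_PN * pn_count)


def ema_update(prev: float, x: float, alpha: float = 0.05) -> float:
    """One exponential-moving-average step with smoothing factor alpha."""
    return (1.0 - alpha) * prev + alpha * x


def beta(step: int, warmup_steps: int, beta_max: float = 1.0) -> float:
    """Assumed schedule: ramp linearly from 0 to beta_max, then hold."""
    return beta_max * min(1.0, step / warmup_steps)


def total_reward(r_throughput: float, r_legal: float, r_env_ema: float,
                 step: int, warmup_steps: int) -> float:
    """r_t = r_throughput + r_legal + beta(t) * r_env_ema."""
    return r_throughput + r_legal + beta(step, warmup_steps) * r_env_ema
```

With α = 0.05 the average remembers roughly 1/α = 20 samples, so the ≈ 1 s window quoted above would correspond to readings arriving at about 20 Hz; that sampling rate is an inference, not something the text states.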

Fairness — write it into the kernel, not a post-hoc patch. Cast the problem as a CMDP: base reward r₀ = instantaneous vehicles served, and a fairness cost C = an EMA (300 s window) of the Gini coefficient of per-road service, with constraint E[C] ≤ 0.3 (the authority's "side-street delay ≤ 1.5× the arterial" rule). The shaping reward adds two clipped hinge penalties whose Lagrange multipliers are updated online by dual gradient ascent:

r = r₀ − λ·max(0, Ĝ − 0.3) − μ·max(0, θ − min_i T_i)

where Ĝ is a real-time Gini estimate from a centralised fairness critic, T_i is the throughput of lane i, and θ is a per-lane minimum throughput floor (from GB 14886, ~180 pcu/h). The fairness critic shares a convolutional backbone with the throughput critic but has its own head fit to true Gini with a Huber loss; the replay buffer prioritises high-unfairness samples (p = |δ| + 0.1·max(0, Ĝ−0.3)). A field trial cut arterial delay 12% and side-street delay 38%, dropping Gini from 0.41 to 0.27: fairness as a constraint, not a contradiction.
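The shaped reward and the online multiplier update can be sketched as follows. Assumptions: the Gini coefficient is computed exactly from per-road service rather than estimated by a critic, the clipping of the hinge terms is omitted, and the dual learning rate `lr` is a hypothetical parameter.

```python
from typing import Sequence

G_MAX = 0.3      # fairness constraint from the text
THETA = 180.0    # per-lane throughput floor, pcu/h


def gini(values: Sequence[float]) -> float:
    """Gini coefficient via mean absolute difference; 0 means perfectly equal."""
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    diff_sum = sum(abs(a - b) for a in values for b in values)
    return diff_sum / (2.0 * n * total)


def shaped_reward(r0: float, g_hat: float, lane_throughput: Sequence[float],
                  lam: float, mu: float) -> float:
    """r = r0 - lam*max(0, G - 0.3) - mu*max(0, theta - min_i T_i)."""
    fairness_gap = max(0.0, g_hat - G_MAX)
    floor_gap = max(0.0, THETA - min(lane_throughput))
    return r0 - lam * fairness_gap - mu * floor_gap


def dual_ascent(multiplier: float, violation: float, lr: float) -> float:
    """Raise the multiplier while the constraint is violated, relax it when
    there is slack, and never let it go negative."""
    return max(0.0, multiplier + lr * violation)
```

A typical step would call `dual_ascent(lam, g_hat - G_MAX, lr)` once per update, so the penalty weight grows only for as long as the Gini constraint is actually breached.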

Detecting "vehicle suppression" before it ships
The perverse optimum of a throughput reward is to hold minor approaches red and let their queues blow up while the arterial sails. Catch it three ways. (1) Watch the worst-served approach's share of green time — if a 3% share creeps up to ~21% after retraining you fixed it; if a lane's share collapses, alert. (2) Validate a Jain fairness index every ~1000 steps and roll back the policy if it drops below 0.75, to stop policy collapse. (3) Keep a hard prior-rule backstop: GB 14886's "single-point delay ≤ 60 s" enters the reward as a hard constraint, and the reward function is logged in an algorithm-registration table with an interpretability ratio > 0.8 so it can pass a safety review rather than read as a black box.
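Checks (1) and (2) are cheap to compute. The sketch below assumes per-approach green seconds and per-approach service counts are already being logged; the 0.75 floor is from the text, while using 3% as the alert level for green share (`min_share`) is an assumption.

```python
from typing import List, Sequence

JAIN_FLOOR = 0.75  # rollback threshold from the text


def jain_index(service: Sequence[float]) -> float:
    """Jain fairness index: (sum x)^2 / (n * sum x^2), between 1/n and 1."""
    n = len(service)
    sum_sq = sum(v * v for v in service)
    if n == 0 or sum_sq == 0.0:
        return 0.0  # nothing served at all is treated as a failure
    total = sum(service)
    return (total * total) / (n * sum_sq)


def green_shares(green_s: Sequence[float]) -> List[float]:
    """Each approach's fraction of the total green time."""
    total = sum(green_s)
    if total <= 0.0:
        return [0.0] * len(green_s)
    return [g / total for g in green_s]


def starved_approaches(green_s: Sequence[float],
                       min_share: float = 0.03) -> List[int]:
    """Indices of approaches whose green share fell below min_share."""
    return [i for i, s in enumerate(green_shares(green_s)) if s < min_share]


def should_roll_back(service: Sequence[float]) -> bool:
    """True when the validation run fails the Jain floor."""
    return jain_index(service) < JAIN_FLOOR
```

Run on a validation window every ~1000 steps, `should_roll_back` gives the rollback signal and `starved_approaches` names the lanes to alert on.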
[Figure: the five-step loop. 1 · Formulate (S, A, R, P; fused belief state) → 2 · Diagnose (partial observability, constraints, drift) → 3 · Engineer (masking, CMDP, QMIX coordination) → 4 · Guard (CUSUM drift, rule fallback) → 5 · Iterate (re-diagnose). Removing one difficulty exposes the next; re-run the loop.]

5 · Non-stationary demand — the transition that drifts under you

Intuition. Morning peak, evening peak, midday lull, rainstorm, holiday — the arrival process is not one MDP but a family that switches, sometimes abruptly. A policy fit to the peak slowly forgets the off-peak, and a sudden storm can triple demand faster than any change-detector can react. You need to detect the regime, switch smoothly, and not forget.

Engineering detail. Detect a regime shift with a CUSUM statistic on demand, then soft-switch: blend π_new = α·π_peak + (1−α)·π_offpeak with α ramping 0→1 linearly over ~10 minutes, holding KL(π_old‖π_new) ≤ 0.05 as a hard constraint (a TRPO-style projection) so drivers never feel a jolt. During a sustained peak, add a non-stationarity penalty β·‖θ_t − θ_{t−k}‖² (β ≈ 1e−4) to resist over-fitting short-term noise. To beat catastrophic forgetting of the off-peak pattern, keep a fixed-size reservoir buffer storing both high-value (return > μ+2σ) and high-entropy samples; mix ~30% replayed with ~70% live data and correct the distribution shift with importance weights. Anticipate calendar effects with a holiday model: MAML pre-training on past festival data so a never-seen holiday adapts in ~3 epochs, plus a progressive-network "year embedding" that freezes the old net and trains only the increment, so the line never regresses.
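Detection and soft-switching can be sketched with a two-sided CUSUM and a blended action distribution. Two things here are assumptions: the CUSUM parameters (`target`, `slack`, `threshold`) are hypothetical tuning knobs, and the KL limit is enforced by halving the α step until it holds, which is a simple backtracking stand-in for the TRPO-style projection rather than that projection itself.

```python
import math
from typing import List, Sequence


class Cusum:
    """Two-sided CUSUM on a demand signal (for example vehicles per minute)."""

    def __init__(self, target: float, slack: float, threshold: float) -> None:
        self.target, self.slack, self.threshold = target, slack, threshold
        self.s_hi = 0.0  # accumulated evidence of an upward shift
        self.s_lo = 0.0  # accumulated evidence of a downward shift

    def update(self, x: float) -> bool:
        """Feed one sample; True once either side crosses the threshold."""
        self.s_hi = max(0.0, self.s_hi + (x - self.target - self.slack))
        self.s_lo = max(0.0, self.s_lo + (self.target - x - self.slack))
        return self.s_hi > self.threshold or self.s_lo > self.threshold


def ramp_alpha(elapsed_s: float, ramp_s: float = 600.0) -> float:
    """Linear 0 -> 1 ramp over about 10 minutes."""
    return min(1.0, max(0.0, elapsed_s / ramp_s))


def blend(pi_peak: Sequence[float], pi_offpeak: Sequence[float],
          alpha: float) -> List[float]:
    """pi_new = alpha * pi_peak + (1 - alpha) * pi_offpeak."""
    return [alpha * p + (1.0 - alpha) * q for p, q in zip(pi_peak, pi_offpeak)]


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions; infinite if q lacks support."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi <= 0.0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total


def limited_alpha(pi_peak: Sequence[float], pi_offpeak: Sequence[float],
                  alpha_old: float, alpha_target: float,
                  kl_max: float = 0.05) -> float:
    """Move toward alpha_target, halving the step until KL(old || new) <= kl_max."""
    old = blend(pi_peak, pi_offpeak, alpha_old)
    step = alpha_target - alpha_old
    for _ in range(20):
        candidate = alpha_old + step
        if kl(old, blend(pi_peak, pi_offpeak, candidate)) <= kl_max:
            return candidate
        step *= 0.5
    return alpha_old  # could not move without breaking the limit
```

In use, the detector's first `True` starts the ramp clock, and each control step passes `ramp_alpha(elapsed_s)` through `limited_alpha` before blending.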

6 · Multi-intersection coordination — the transition is coupled

Intuition. One intersection's outflow is the next one's inflow, so the controllers are coupled — green-wave coordination is exactly exploiting this. That makes it a multi-agent problem, and two things bite: how to assign credit across agents, and how to stay consistent when the messages between them arrive late.

Engineering detail. Use QMIX: each intersection keeps a local Q_i; a monotonic mixing network combines them into Q_tot for centralised training, decentralised execution. At inference the edge boxes run only their local Q_i and sync parameter deltas (ΔW) every ~30 s; if the central coordinator fails, each box degrades to independent Q-learning to keep the corridor running. For communication delay, weight stale gradients down by their age and feed a time-embedding into the mixer so it learns to compensate; output a delay-confidence σ, and when σ is low, apply a conservative action mask so a late message cannot break a safety constraint. For a city-scale grid, abstract first: cluster tens of thousands of intersections into a few hundred super-agents (cutting per-agent state from ~10⁵ to ~10³ dims), generate samples from a large parallel SUMO simulation, and train an Ape-X-style distributed learner with gradient quantisation + Top-K sparsification (≈ 1/8 the bandwidth) and PopArt to keep multi-city reward scales aligned.
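The reason monotonic mixing permits decentralised execution can be shown with a stripped-down mixer. This is not full QMIX: the real mixer produces its non-negative weights from hypernetworks conditioned on the global state, while the sketch uses fixed weights passed through `abs`. The exponential staleness weight and its 30 s half-life are likewise assumptions; the text says only that stale gradients are weighted down by age.

```python
from typing import List, Sequence


def mix_monotonic(local_qs: Sequence[float], weights: Sequence[float],
                  bias: float) -> float:
    """Q_tot = sum_i |w_i| * Q_i + bias.

    Taking abs() keeps every partial derivative dQ_tot/dQ_i non-negative,
    which is the monotonicity constraint.
    """
    return sum(abs(w) * q for w, q in zip(weights, local_qs)) + bias


def greedy_joint_action(q_tables: Sequence[Sequence[float]]) -> List[int]:
    """Each intersection picks the argmax of its own Q_i.

    Under a monotonic mixer this joint action also maximises Q_tot, so no
    coordinator is needed at inference time.
    """
    return [max(range(len(q)), key=q.__getitem__) for q in q_tables]


def staleness_weight(age_s: float, half_life_s: float = 30.0) -> float:
    """Assumed form: halve a message's weight every half_life_s seconds."""
    return 0.5 ** (age_s / half_life_s)
```

The same property is what makes the fallback safe: if the coordinator disappears, each box is already acting on its own Q_i.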

Green-extension dwell time → throughput vs. fairness

The central knob of adaptive control: how long to hold a saturated green before serving the waiting cross-street. Push the dwell-time cap up and you clear the busy approach but starve the side street (Gini rises); push it down and you waste capacity switching. This is the option's β_ω threshold made tangible.

[Interactive readout omitted: throughput, side-street wait, Gini (fairness), verdict.]

The through-line
Every section is one row of the MDP table turned into a mechanism: fused partial state → POMDP belief encoder + kriging imputation + semantic-constrained compression; safety-constrained action → source-side masking, formal verification, the options framework for green extension; multi-objective reward → monetised normalisation + a fairness CMDP with a suppression detector; non-stationary demand → CUSUM soft-switching + reservoir replay against forgetting; coupled transition → QMIX with delay-aware syncing and an IQL fallback. You never reached for a tool until a row of the table demanded it. That discipline is the whole track.

Further considerations