Epidemic intervention strategy
The last domain — and the one where a wrong action is measured in lives and in GDP at the same time. A public-health agency must decide, week after week, how hard to test, how hard to lock down, and where to ship a finite stockpile of vaccine, while the truth it is reacting to — the true number of infections — is never directly observed, the virus it is fighting mutates underneath the policy, and every action sits under hard legal and inventory ceilings. For an epidemic controller the binding difficulties are partial observability (you see reported cases, not infections), a multi-objective reward (health vs. economy, neither dominant), hard safety constraints (stockpile, cold-chain, statute), and a non-stationary transition (the pathogen evolves). Each one names a tool.
1 · Formulate — the MDP behind an epidemic controller
Intuition. A health agency observes a noisy dashboard (reported cases, test positivity, hospital load), chooses a bundle of interventions (test more, distance more, allocate vaccine), and watches the curve bend weeks later. That is a Markov Decision Process — but a leaky one. The thing you would like to be your state, the true infection count, is exactly the thing you cannot measure. Everything below follows from one of the four pieces being awkward.
| Piece | For a regional epidemic controller | The awkward part |
|---|---|---|
| State S | true S/E/I/R compartments, variant mix, stockpile by lot, cold-chain status — ~tens of dimensions | true infections are never observed; you see only lagged, biased reports → partially observable |
| Action A | test intensity (3 levels) × distancing (2 levels) × antigen push (on/off); plus continuous lockdown β and per-lot vaccine allocation | hybrid, and most allocations are physically or legally infeasible |
| Reward R | −(deaths + estimation error) and −(GDP loss + deficit/CPI breaches) | two incommensurable objectives; a scalar weight hides an ethical choice |
| Transition P | epidemic dynamics + human behaviour + the pathogen itself | the variant mutates → non-stationary, and over-intervention warps the data you learn from |
The constraints flagged in the Action row are not decoration: an epidemic MDP is a Constrained MDP. Maximizing reward is never enough; the policy must provably respect a stockpile and a statute. Notice the pattern we reuse for all 20 domains: the rightmost column is the lesson. Each row's difficulty has a named mechanism that answers it.
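One way to write the constrained problem down, as a sketch rather than the definitive formulation: the cost signals c_k (for example expected deaths), their budgets d_k, and the state-dependent feasible set 𝒞(s_t) (allocations and lockdown levels that the stockpile and the statute allow) are notation assumed here for illustration.

```latex
\max_{\pi}\; \mathbb{E}_{\pi}\Big[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t)\Big]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}\Big[\sum_{t=0}^{T} \gamma^{t}\, c_k(s_t, a_t)\Big] \le d_k \;\; (k = 1,\dots,K),
\qquad a_t \in \mathcal{C}(s_t) \;\; \forall t .
```

The expectation constraints are the "soft" ones a Lagrange multiplier can price; the membership constraint a_t ∈ 𝒞(s_t) is the "hard" one that has to hold on every single step.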
2 · Diagnose — what actually makes this MDP hard
Intuition. Four properties bind, roughly in order of how early they bite. First, you are flying on instruments: the state is hidden behind a lagged, undercounting reporting pipeline, so the agent must infer a belief before it can act. Second, the reward has two axes (lives and livelihoods) and no exchange rate falls out of the math; you have to put one in, and then defend it. Third, both the stockpile and the law impose hard ceilings that a reward penalty alone will eventually violate. Fourth, the pathogen keeps moving the dynamics underneath you.
3 · Partial observability — infer a belief before you act
Intuition. You can never see how many people are infected today; you can only see how many tested positive, days late, after a screening policy that itself changes the count. If the agent conditions on the raw dashboard it will chase a delayed, biased shadow of the real state. The fix is the one every POMDP lesson reaches for: stop pretending the observation is the state, and instead carry a belief — a compact statistic of the whole history that is sufficient to predict what comes next.
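To make "carry a belief" concrete, here is a minimal sketch of an exact discrete Bayes filter over the hidden infection count. It is not the learned belief a production system would use; it assumes the dynamics are known and that each true infection is reported independently with probability `report_prob`. The names `belief_update` and `toy_transition`, and every number below, are hypothetical.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of k successes in n trials with success probability p."""
    if k < 0 or k > n:
        return 0.0
    return math.comb(n, k) * (p ** k) * ((1.0 - p) ** (n - k))

def belief_update(belief, transition, reported, report_prob):
    """One predict-then-correct step of a discrete Bayes filter.

    belief:      dict {infection_count: probability} over the hidden state
    transition:  function(count) -> dict {next_count: probability}
    reported:    number of cases seen on the dashboard this step
    report_prob: assumed chance that a true infection is reported
    """
    # Predict: push the belief through the (assumed known) dynamics.
    predicted = {}
    for count, p in belief.items():
        for nxt, q in transition(count).items():
            predicted[nxt] = predicted.get(nxt, 0.0) + p * q
    # Correct: reweight by how likely the report is under each hidden count.
    posterior = {n: p * binomial_pmf(reported, n, report_prob)
                 for n, p in predicted.items()}
    total = sum(posterior.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the belief")
    return {n: p / total for n, p in posterior.items() if p > 0.0}

def toy_transition(count):
    """Hypothetical dynamics: infections stay flat or grow by 50%."""
    grown = int(count * 1.5)
    if grown == count:
        return {count: 1.0}
    return {count: 0.5, grown: 0.5}

# Start unsure between 100 and 200 infections, then see 60 reported cases.
belief = {100: 0.5, 200: 0.5}
belief = belief_update(belief, toy_transition, reported=60, report_prob=0.3)
mean_infections = sum(n * p for n, p in belief.items())
most_likely = max(belief, key=belief.get)
```

With a 30% reporting rate, 60 reported cases are best explained by roughly 200 true infections, so the posterior mass moves there even though the dashboard never showed that number.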
Engineering detail. A four-step recipe takes this to production. (1) Abstract the gap: produce an explicit "missing-observation list" separating what is physically invisible (true infections) from what is masked by law or cost. (2) Approximate the belief: a recurrent network, here a GRU, maps the observation sequence to a low-dimensional approximate sufficient statistic; the GRU hidden state serves as the belief state. Validate it offline with a belief-prediction KPI such as next-observation MSE ≤ 0.05. (3) Adapt the algorithm: for the discrete intervention head, train PPO-GRU end-to-end so the policy reads the belief, not the raw counts; add an information-gain term to the reward so the agent is paid to test where its belief entropy is highest. (4) Gray-launch under audit: distil a "fully-observed teacher" (trained in simulation, where the true compartments are known) into the partially-observed student, then ship the student on 5% of jurisdictions, watching that the cumulative-return gap to the teacher stays < 1%. Evaluation uses two metrics, MAPE on the hidden-infection estimate and peak-reduction rate, cross-checked with a double-machine-learning counterfactual simulation. The payoff is "estimation-and-decision in one": the same network that infers hidden infection also emits the intervention.
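The two numeric gates in this recipe, and the entropy that drives the information-gain bonus, can be sketched in a few lines. The function names are hypothetical, the return gap is taken relative to the teacher's return (an assumption, since the text does not say how the gap is normalized), and using belief entropy directly is a simple proxy for information gain, not the only possible definition.

```python
import math

def passes_offline_gate(predicted_obs, actual_obs, mse_limit=0.05):
    """Belief-prediction KPI: next-observation MSE must not exceed the limit."""
    if len(predicted_obs) != len(actual_obs) or not actual_obs:
        raise ValueError("need two non-empty sequences of equal length")
    mse = sum((p - a) ** 2 for p, a in zip(predicted_obs, actual_obs)) / len(actual_obs)
    return mse <= mse_limit

def passes_gray_launch_gate(student_return, teacher_return, max_gap=0.01):
    """Student may stay live only while its return gap to the teacher is < 1%."""
    if teacher_return == 0:
        raise ValueError("relative gap is undefined for a zero teacher return")
    gap = (teacher_return - student_return) / abs(teacher_return)
    return gap < max_gap

def belief_entropy(probabilities):
    """Shannon entropy (in nats) of a discrete belief; zero-mass bins are skipped."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)
```

For example, `passes_gray_launch_gate(99.5, 100.0)` holds (a 0.5% gap) while `passes_gray_launch_gate(98.0, 100.0)` does not, and a 50/50 belief has entropy ln 2, the maximum for two bins.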
4 · The action space is constrained — project onto what is feasible
Intuition. Two of the agent's actions are not free. Vaccine allocation cannot exceed the stockpile, cannot draw from a lot that is expired or has broken cold chain, and must prefer near-expiry doses. Lockdown intensity β cannot exceed the legal ceiling. A reward penalty for violations is not enough: early in training the agent will propose infeasible bundles, and a single shipped over-allocation is a real-world failure, not a bad gradient. The fix is to make infeasible actions unreachable, by projecting every proposed action onto the feasible set before the environment ever executes it.
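A minimal sketch of such a projection for the vaccine allocation, under stated assumptions: the feasible set is taken to be per-lot caps, a mask for expired or cold-chain-broken lots, and a total stockpile ceiling. It computes the Euclidean projection by bisection on a single shift, which is a numerical stand-in chosen for brevity. The preference for near-expiry doses is not modelled here, and `project_allocation` and its arguments are hypothetical names.

```python
def project_allocation(raw, lot_caps, usable, stockpile, iters=60):
    """Project a raw per-lot allocation onto
    {x : 0 <= x_i <= cap_i, x_i = 0 for unusable lots, sum(x) <= stockpile}.

    Assumes stockpile >= 0. The projection has the form
    x_i = clip(raw_i - shift, 0, cap_i) for a single shift >= 0.
    """
    # Masked lots get a cap of zero, so they can never receive doses.
    caps = [cap if ok else 0.0 for cap, ok in zip(lot_caps, usable)]

    def clipped(shift):
        return [min(max(r - shift, 0.0), c) for r, c in zip(raw, caps)]

    x = clipped(0.0)
    if sum(x) <= stockpile:
        return x  # box clipping alone is already feasible
    # Otherwise bisect for the smallest shift that meets the stockpile ceiling.
    lo, hi = 0.0, max(raw)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(clipped(mid)) > stockpile:
            lo = mid
        else:
            hi = mid
    return clipped(hi)  # hi always satisfies the ceiling

# Three lots; the third has a broken cold chain and is masked out.
allocation = project_allocation(
    raw=[50.0, 80.0, 40.0],
    lot_caps=[60.0, 60.0, 60.0],
    usable=[True, True, False],
    stockpile=100.0,
)
```

Here the actor asked for 170 doses against a stockpile of 100: the masked lot drops to zero, the second lot is capped at 60, and the first is trimmed to about 40 so the total lands on the ceiling.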
Engineering detail: Constrained PPO with a safe layer. The actor outputs a raw allocation; a safe layer projects it onto the feasible set 𝒞 (allocations that stay within the stockpile and draw only on unexpired lots with an intact cold chain). The projection is analytic, so it adds negligible latency, and over-temperature lots are masked out of the action dimension entirely, so a physically impossible release can never be sampled. The reward encodes regulatory priority as an order-of-magnitude penalty ladder, so the agent learns the hierarchy of "never" rather than averaging it away.
Training upweights cold-chain-anomaly transitions ~10× via importance-sampled replay so the policy becomes sensitive to rare violations, and the Lagrange multiplier λ on the soft constraints is clamped at an upper bound (e.g. 10³) so the dual variable cannot run away into an over-conservatism that tanks coverage. Lockdown β is handled symmetrically: the actor emits (μ, log σ), samples β, and the safe layer projects it into [0, β_legal]; SAC with automatic temperature α trains the continuous policy with an entropy bonus, and a sigmoid scaling maps β = 0.5 to "moderate" while keeping β = 1 within statute. In a real stockpile back-test this structure held for 90 consecutive days with zero inventory violations, cut near-expiry wastage from 5.7% to 0.3%, and raised coverage by 4.1%: feasibility and performance at once, not a trade.
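The two bounding mechanisms in this paragraph can be sketched as follows. This is an illustration under assumptions, not the training code: the dual update is plain projected gradient ascent with a hypothetical step size, and the lockdown squashing is one possible sigmoid scaling that keeps any sample strictly inside the legal range.

```python
import math

def update_multiplier(lam, constraint_cost, limit, step=0.05, lam_max=1e3):
    """Projected dual ascent on a soft constraint E[cost] <= limit.

    lambda rises while the constraint is violated, falls when there is slack,
    and is clamped into [0, lam_max] so it cannot run away.
    """
    lam = lam + step * (constraint_cost - limit)
    return min(max(lam, 0.0), lam_max)

def stable_sigmoid(z):
    """Logistic function written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def squash_lockdown(raw_sample, beta_legal):
    """Map an unbounded actor sample to a lockdown intensity in [0, beta_legal]."""
    return beta_legal * stable_sigmoid(raw_sample)
```

With a legal ceiling of 0.8, a raw sample of 0 maps to 0.4 and arbitrarily large samples approach 0.8 without crossing it, so no gradient step can push the executed lockdown past the statute.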
5 · The reward is two objectives — make the exchange rate explicit
Intuition. Every action saves lives and costs output; there is no scalar reward until you decide how many dollars a life is worth. Hiding that in a hand-tuned weight is both fragile and indefensible to a regulator. The honest move is to name it: a value of statistical life (VSL) term converts averted deaths into the same units as GDP loss, and the death side is additionally written as a hard constraint, not just a price.
Engineering detail. The economic side is concrete: with a weight α ≈ 10 on the output gap, a 1% gap maps to −0.1 reward; a second weight β ≈ 0.5 suppresses high-variance policies; and Penalty_t fires (−10) only when a macro red line is crossed (deficit > 3% or CPI > 3.5%), else 0. These reward weights α and β are distinct from the SAC temperature α and the lockdown intensity β used earlier. The state ingests ~10 high-frequency proxies (night-light index, transit ridership, blast-furnace utilization) so the agent closes the loop at monthly frequency. The health side is enforced with CPO: the per-episode expected-death constraint E_π[cost] ≤ ε is enforced inside each trust-region update, and a Monte-Carlo back-test checks that the realized death rate is below the threshold at 95% confidence across a simulation → closed-environment → live ladder. Offline, a doubly-robust estimator on historical windows showed the RL policy cut cumulative GDP loss by ~1.2 percentage points without tripping any macro red line. To prevent extreme corner solutions on the Pareto front (all-economy or all-health), the policy is constrained away from the front's steep inflection, analogous to capping maximum drawdown, which in deployment kept worst-case outcomes bounded while sacrificing only ~0.8% of headline performance.
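A minimal sketch of the economic reward term with the numbers quoted above. The text does not give the exact functional form, so two things are assumptions: the output gap enters linearly, and the high-variance suppression is a linear penalty on a volatility measure supplied by the caller. `economic_reward` and its parameter names are hypothetical.

```python
def economic_reward(output_gap, volatility, deficit, cpi,
                    alpha=10.0, beta=0.5,
                    deficit_limit=0.03, cpi_limit=0.035,
                    breach_penalty=-10.0):
    """Economic side of the reward for one decision period.

    output_gap, deficit and cpi are fractions (0.01 means 1%).
    volatility is an assumed non-negative measure of policy variance.
    """
    # The penalty fires only when a macro red line is crossed, else 0.
    breached = deficit > deficit_limit or cpi > cpi_limit
    penalty = breach_penalty if breached else 0.0
    return -alpha * abs(output_gap) - beta * volatility + penalty

# A 1% output gap with no volatility and no breach costs 0.1 reward.
calm = economic_reward(output_gap=0.01, volatility=0.0, deficit=0.02, cpi=0.02)
# The same gap with a 4% deficit also pays the red-line penalty.
breach = economic_reward(output_gap=0.01, volatility=0.0, deficit=0.04, cpi=0.02)
```

The three orders of magnitude between the gap term and the red-line penalty are what keep the agent from trading a breach against a small gain in output.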
6 · Non-stationarity — detect the variant, then adapt fast and safely
Intuition. The hardest property is that the world does not hold still. A new variant changes transmissibility and severity; the transition function P drifts under the policy. A controller trained on last season's dynamics will quietly degrade, and the danger is mistaking "the policy is diverging" for "the environment changed" — opposite fixes. The guard is a three-layer defence: notice the drop, locate the change-point, and adapt conservatively without forgetting what still works.
Engineering detail: three layers. (1) Real-time metrics. Maintain a 1000-step sliding window of mean episode return and its 95% lower confidence bound; if three consecutive windows fall below 90% of the historical best, raise a level-1 alert. Disambiguate with policy KL: KL > 0.02 with a >5% return drop means the policy is diverging; KL ≈ 0 with a return drop points at the environment or reward. (2) Distribution test. Feed the last ~2000 reward samples to BOCPD (Bayesian online change-point detection); a posterior change probability > 0.7 pins the mutation moment, and a χ² test on the state-visitation distribution before vs. after (p < 0.01) confirms a genuine dynamics drift. (3) Safe adjustment. On confirmation, cut the learning rate to 10% of its current value, shrink PPO clip ε from 0.2 to 0.05, and freeze the shared feature layers, fine-tuning only the policy/value heads to avoid catastrophic forgetting; insert the 500 transitions on either side of the change-point into replay at top priority, with importance weights truncated at 2. If the system was pre-seeded with a Reptile / Meta-PPO initialization, a few inner-loop steps on the freshly collected post-mutation trajectories recover most of the original return in seconds rather than hours. If return is still below 80% of best 24 hours later, fall back automatically to the filed rule-based policy and emit a structured incident report (mutation type, timestamp, impact estimate) for audit.
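The decision logic of layers (1) and (3) can be sketched without any learning code. This is a simplified illustration: it compares window means only (the 95% lower confidence bound is omitted), assumes returns are positive so that "90% of best" is meaningful, and uses a hypothetical cut-off `kl_quiet` for what counts as KL ≈ 0. All function names are made up for the example.

```python
def level1_alert(window_means, best_return, ratio=0.9, consecutive=3):
    """True if the last `consecutive` window means all sit below ratio * best."""
    if len(window_means) < consecutive:
        return False
    threshold = ratio * best_return
    return all(m < threshold for m in window_means[-consecutive:])

def diagnose(return_drop, policy_kl, kl_limit=0.02, drop_limit=0.05, kl_quiet=1e-3):
    """Separate 'the policy is diverging' from 'the environment changed'.

    return_drop is the fractional drop in return (0.10 means 10%).
    """
    if return_drop <= drop_limit:
        return "ok"
    if policy_kl > kl_limit:
        return "policy_diverging"
    if policy_kl <= kl_quiet:
        return "environment_or_reward_shift"
    return "ambiguous"

def safe_adjustment(learning_rate):
    """Conservative fine-tuning settings applied once a change-point is confirmed."""
    return {
        "learning_rate": 0.1 * learning_rate,  # cut to 10% of current value
        "ppo_clip": 0.05,                      # shrunk from 0.2
        "freeze_shared_layers": True,          # fine-tune heads only
        "max_importance_weight": 2.0,          # truncate replay weights
    }

def should_fall_back(hours_since_change, current_return, best_return):
    """After 24 hours, revert to the rule-based policy if return is below 80% of best."""
    return hours_since_change >= 24 and current_return < 0.8 * best_return
```

A 10% return drop with KL of 0.03 is classified as policy divergence, while the same drop with KL near zero points at the environment, which is the case that should trigger the change-point test in layer (2).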
Further considerations
- Catastrophic forgetting under data shift. When a new data source's distribution jumps, lock the important parameters of the old policy with EWC or MAS regularization and bound the new policy's divergence with a KL constraint, so adaptation does not erase hard-won safe behaviour.
- Black-box source quality. If a feed (mobility, genomics) cannot self-report a confidence, train a meta-confidence network that treats "trust this source?" as a discrete action optimized by policy gradient — folding source reliability into the policy rather than hand-weighting it, with automatic weight decay when a source's variance spikes.
- Multi-agent coordination across tiers. Province / city / district control agencies have different observation granularities; model them as a hierarchical multi-agent POMDP where the upper tier sets a testing budget and the lower tier names specific streets, trained with MADDPG plus a communication protocol so resources are not wasted by each tier acting alone. For coupled stockpiles across warehouse tiers, use multi-agent CPO with a global Lagrangian-multiplier consensus for distributed safety.
- Human behaviour feedback. Over-intervention causes public fatigue, which distorts the very observations you learn from (people stop self-testing). Add a behaviour-fatigue penalty to the reward, and use inverse RL on real mobility data to recover the implicit cost function, making the policy more humane and the data less self-corrupting.
- Robust demand under surges. For a sudden booster campaign, wrap the safe layer in a chance constraint P(Σ_b x_b ≤ S_t) ≥ 1−ε via a CVaR approximation, turning the probabilistic constraint into a differentiable loss that holds under demand volatility.
- Auditability. Persist the belief-vector version and the KL-change curve to an append-only store, enable counterfactual replay, and on continuous learning update only a shadow policy — promote it to production atomically only after it clears a desensitized validation run, archiving the prior policy's fingerprint so rollbacks remain provable.
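For the surge bullet above, the CVaR step can be written out as follows. This is the standard conservative approximation, sketched under the assumption that x_b is the allocation to batch b and S_t the (random) stock available at time t; the auxiliary variable τ is introduced here and is not named in the text.

```latex
Z = \sum_b x_b - S_t, \qquad
\mathrm{CVaR}_{1-\varepsilon}(Z) = \min_{\tau \in \mathbb{R}} \Big\{ \tau + \tfrac{1}{\varepsilon}\, \mathbb{E}\big[(Z - \tau)_+\big] \Big\},
\qquad
\mathrm{CVaR}_{1-\varepsilon}(Z) \le 0 \;\Longrightarrow\; P\Big(\sum_b x_b \le S_t\Big) \ge 1 - \varepsilon .
```

Replacing the expectation with a sample average over demand scenarios gives a loss that is differentiable almost everywhere in both x and τ, at the price of being stricter than the original chance constraint.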