
Epidemic intervention strategy

The last domain — and the one where a wrong action is measured in lives and in GDP at the same time. A public-health agency must decide, week after week, how hard to test, how hard to lock down, and where to ship a finite stockpile of vaccine, while the truth it is reacting to — the true number of infections — is never directly observed, the virus it is fighting mutates underneath the policy, and every action sits under hard legal and inventory ceilings. For an epidemic controller the binding difficulties are partial observability (you see reported cases, not infections), a multi-objective reward (health vs. economy, neither dominant), hard safety constraints (stockpile, cold-chain, statute), and a non-stationary transition (the pathogen evolves). Each one names a tool.

The method — five steps, every lesson
Applied RL is not a grab-bag of tricks. Every domain in this track is the same loop: (1) Formulate the MDP — state, action, reward, transition, horizon. (2) Diagnose the one property that makes the MDP hard. (3) Engineer the mechanism that removes that difficulty. (4) Guard it in production — detect when it breaks and fall back. (5) Iterate. The art is only in which difficulty binds first. This is lesson 86 — the same loop you have run nineteen times, now on a problem where the cost function has units of both deaths and dollars.

1 · Formulate — the MDP behind an epidemic controller

Intuition. A health agency observes a noisy dashboard (reported cases, test positivity, hospital load), chooses a bundle of interventions (test more, distance more, allocate vaccine), and watches the curve bend weeks later. That is a Markov Decision Process — but a leaky one. The thing you would like to be your state, the true infection count, is exactly the thing you cannot measure. Everything below follows from one of the four pieces being awkward.

MDP = (S, A, R, P, γ)  ·  maximize  E[ Σₜ γᵗ rₜ ]   subject to   E[ Σₜ cₜ ] ≤ d
Piece | For a regional epidemic controller | The awkward part
State S | true S/E/I/R compartments, variant mix, stockpile by lot, cold-chain status (~tens of dimensions) | true infections are never observed; you see only lagged, biased reports → partially observable
Action A | test intensity (3 levels) × distancing (2 levels) × antigen push (on/off); plus continuous lockdown β and per-lot vaccine allocation | hybrid, and most allocations are physically or legally infeasible
Reward R | −(deaths + estimation error) and −(GDP loss + deficit/CPI breaches) | two incommensurable objectives; a scalar weight hides an ethical choice
Transition P | epidemic dynamics + human behaviour + the pathogen itself | the variant mutates → non-stationary, and over-intervention warps the data you learn from

The note in the constraint above the table is not decoration: an epidemic MDP is a Constrained MDP. Maximizing reward is never enough — the policy must provably respect a stockpile and a statute. Notice the pattern we reuse for all 20 domains: the rightmost column is the lesson. Each row's difficulty has a named mechanism that answers it.
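
As a minimal sketch of the hybrid action space in the table, the discrete head can be enumerated as the product of the three categorical levers, with the continuous lockdown intensity and the per-lot allocation carried alongside. The level names and the `Action` container are illustrative assumptions, not part of the lesson's specification.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical level names; the lesson only fixes the counts (3 x 2 x 2).
TEST_LEVELS = ("low", "medium", "high")
DISTANCING_LEVELS = ("off", "on")
ANTIGEN_PUSH = (False, True)

# Discrete head: every combination of the three categorical levers.
DISCRETE_ACTIONS = list(product(TEST_LEVELS, DISTANCING_LEVELS, ANTIGEN_PUSH))


@dataclass(frozen=True)
class Action:
    discrete_index: int      # index into DISCRETE_ACTIONS
    lockdown_beta: float     # continuous lockdown intensity
    allocation: tuple        # doses shipped from each vaccine lot


def describe(action: Action) -> str:
    """Render one hybrid action as a readable string."""
    test, distancing, antigen = DISCRETE_ACTIONS[action.discrete_index]
    return (f"test={test}, distancing={distancing}, antigen_push={antigen}, "
            f"beta={action.lockdown_beta:.2f}, doses={sum(action.allocation)}")


print(len(DISCRETE_ACTIONS))                 # number of discrete bundles
print(describe(Action(0, 0.5, (10, 20))))
```

The discrete part stays small enough to enumerate; the difficulty sits in the continuous part, where most raw allocations are infeasible.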

2 · Diagnose — what actually makes this MDP hard

Intuition. Three properties bind, roughly in order of how early they bite. First, you are flying on instruments: the state is hidden behind a lagged, undercounting reporting pipeline, so the agent must infer a belief before it can act. Second, the reward has two axes, lives and livelihoods, and no exchange rate falls out of the math; you have to put one in and then defend it, while the stockpile and the law impose hard ceilings that a reward penalty alone will eventually violate. Third, the pathogen keeps moving the dynamics underneath you.

The three binding difficulties — and the tools they name
(1) Partial observability: hidden infections. Reported cases lag and undercount true infections; the Markov state is unobserved. → belief-state RL (recurrent networks / particle filters), information-gain bonus.
(2) Multi-objective reward under hard feasibility constraints. Health and economy trade off with no natural unit, while stockpile and statute are absolute. → constrained RL (CPO / Lagrangian / barrier functions) and an explicit value-of-statistical-life term.
(3) Non-stationary transition. Variants change the dynamics; aggressive policy distorts behaviour and thus the observations. → change-point detection + meta-RL fast adaptation, behaviour-fatigue penalties.

3 · Partial observability — infer a belief before you act

Intuition. You can never see how many people are infected today; you can only see how many tested positive, days late, after a screening policy that itself changes the count. If the agent conditions on the raw dashboard it will chase a delayed, biased shadow of the real state. The fix is the one every POMDP lesson reaches for: stop pretending the observation is the state, and instead carry a belief — a compact statistic of the whole history that is sufficient to predict what comes next.

b_t = f_θ(o_{1:t}, a_{1:t−1})  ≈  P(s_t | o_{1:t}, a_{1:t−1})  ,  a_t ~ π(a_t | b_t)

Engineering detail. A four-step recipe takes this to production.
(1) Abstract the gap: produce an explicit "missing-observation list" separating what is physically invisible (true infections) from what is masked by law or cost.
(2) Approximate the belief: a recurrent network, here a GRU, maps the observation sequence to a low-dimensional statistic; the GRU hidden state is the belief state. Validate it offline with a belief-prediction KPI such as next-observation MSE ≤ 0.05.
(3) Adapt the algorithm: for the discrete intervention head, train PPO-GRU end-to-end so the policy reads the belief, not the raw counts; add an information-gain term to the reward so the agent is paid to test where its belief entropy is highest.
(4) Gray-launch under audit: distil a "fully-observed teacher" (trained in simulation, where the true compartments are known) into the partially-observed student, then ship the student to 5% of jurisdictions, checking that the cumulative-return gap to the teacher stays < 1%.
Evaluation uses a double metric, MAPE on the hidden-infection estimate and peak-reduction rate, cross-checked with a double-machine-learning counterfactual simulation. The payoff is "estimation-and-decision in one": the same network that infers hidden infection also chooses the intervention.
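
To make the belief idea concrete without a deep-learning dependency, the sketch below swaps the GRU for a bootstrap particle filter, the other belief approximator named in the diagnosis. It is a toy under loud assumptions: weekly growth of 1.1, a 30% reporting rate, Gaussian report noise, and no reporting lag are all invented for illustration, as are the function names.

```python
import math
import random


def particle_filter_step(particles, reported_cases, rng,
                         growth=1.1, report_rate=0.3, obs_sd=30.0):
    """One predict / weight / resample step of a bootstrap particle filter.

    particles: hypothesised true-infection counts (the belief b_t).
    reported_cases: the noisy, undercounted observation o_t.
    """
    # Predict: push each particle through the assumed noisy growth model.
    predicted = [p * growth * math.exp(rng.gauss(0.0, 0.1)) for p in particles]
    # Weight: Gaussian likelihood of the report given each hypothesis.
    weights = [math.exp(-0.5 * ((reported_cases - report_rate * p) / obs_sd) ** 2)
               for p in predicted]
    if sum(weights) == 0.0:
        return predicted  # report is far from every particle: keep the prior
    # Resample: draw particles in proportion to their weights.
    return rng.choices(predicted, weights=weights, k=len(predicted))


def belief_mean(particles):
    """Point estimate of hidden infections under the current belief."""
    return sum(particles) / len(particles)


def run_demo(seed=0, weeks=10, n_particles=500):
    """Simulate a hidden epidemic and track it from reported cases only."""
    rng = random.Random(seed)
    true_infections = 1000.0
    particles = [rng.uniform(200.0, 3000.0) for _ in range(n_particles)]
    reported = 0.0
    for _ in range(weeks):
        true_infections *= 1.1
        reported = 0.3 * true_infections + rng.gauss(0.0, 30.0)
        particles = particle_filter_step(particles, reported, rng)
    return true_infections, belief_mean(particles), reported


true_i, estimate, last_report = run_demo()
print(f"true={true_i:.0f} belief={estimate:.0f} raw dashboard={last_report:.0f}")
```

The raw dashboard undercounts by construction, while the belief mean tracks the hidden count; a policy would condition on the particle set (or a summary of it) rather than on `last_report`.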

Why a belief, not a bigger window of raw observations
Feeding the last k dashboards into a feed-forward policy still leaves the policy reacting to a lagged shadow, and the optimal lag is itself non-stationary. A recurrent belief integrates the entire history into a fixed-width statistic, so reporting delay, weekend dips, and screening-policy changes are absorbed rather than chased. When the belief dimension explodes after multimodal fusion (mobility + wastewater + genomics), compress it with a sparse belief representation rather than widening the policy input.

4 · The action space is constrained — project onto what is feasible

Intuition. Two of the agent's actions are not free. Vaccine allocation cannot exceed the stockpile, cannot draw from a lot that is expired or has broken cold chain, and must prefer near-expiry doses. Lockdown intensity β cannot exceed the legal ceiling. A reward penalty for violations is not enough: early in training the agent will propose infeasible bundles, and a single shipped over-allocation is a real-world failure, not a bad gradient. The fix is to make infeasible actions unreachable, by projecting every proposed action onto the feasible set before the environment ever executes it.

ã = proj_𝒞(a)  ,  𝒞 = { x : Σ_b x_b ≤ S_t,   0 ≤ x_b ≤ s_b,   x_b = 0 if lot b is over-temp }
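
One way such a projection might be written is sketched below: clip to the per-lot box (with over-temperature lots capped at zero), and if the stockpile budget is still exceeded, shift all entries down by a common amount found by bisection. This is the Euclidean projection onto the set above; the function name, argument names, and the use of bisection rather than a closed form are assumptions for illustration.

```python
def project_allocation(raw, lot_stock, total_budget, over_temp, iters=60):
    """Euclidean projection of a raw allocation onto the feasible set C.

    raw: proposed doses per lot; lot_stock: per-lot caps s_b;
    total_budget: stockpile ceiling S_t (assumed >= 0);
    over_temp: True for lots whose cold chain is broken.
    """
    # Masked (over-temperature) lots get an upper bound of zero.
    upper = [0.0 if hot else s for s, hot in zip(lot_stock, over_temp)]

    def shifted(tau):
        # Shift every entry down by tau, then clip into the per-lot box.
        return [min(max(a - tau, 0.0), u) for a, u in zip(raw, upper)]

    x = shifted(0.0)
    if sum(x) <= total_budget:
        return x  # box clipping alone is already feasible

    lo, hi = 0.0, max(raw)  # shifting by max(raw) zeroes every entry
    for _ in range(iters):  # bisection on the budget multiplier tau
        mid = 0.5 * (lo + hi)
        if sum(shifted(mid)) > total_budget:
            lo = mid
        else:
            hi = mid
    return shifted(hi)  # hi always stays on the feasible side


safe = project_allocation(raw=[50.0, 80.0, 40.0],
                          lot_stock=[60.0, 60.0, 60.0],
                          total_budget=90.0,
                          over_temp=[False, False, True])
print([round(v, 3) for v in safe])
```

In this example the third lot is masked to zero, the second is held at its cap, and the first is cut back so the total meets the budget.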

Engineering detail: Constrained PPO with a safe layer. The actor outputs a raw allocation; a safe layer analytically projects it onto the feasible set 𝒞 above. The projection is a cheap deterministic computation, so it adds negligible latency, and over-temperature lots are masked out of the action dimension entirely, so a physically impossible release can never be sampled. The reward encodes regulatory priority as an order-of-magnitude penalty ladder so the agent learns the hierarchy of "never" rather than averaging it away:

r_t = (coverage rate) − 10⁴·𝟙[Σ_b x_b > S_t] − 10⁵·𝟙[near-expiry not first] − 10⁶·𝟙[over-temp release]
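
The ladder reads directly as a function. The sketch below assumes the three violation flags are computed upstream; the function and argument names are hypothetical.

```python
def allocation_reward(coverage_rate, allocated, stock_total,
                      near_expiry_first, over_temp_release):
    """Coverage reward minus the order-of-magnitude penalty ladder."""
    reward = coverage_rate
    if sum(allocated) > stock_total:   # shipped more than the stockpile holds
        reward -= 1e4
    if not near_expiry_first:          # skipped doses that expire soonest
        reward -= 1e5
    if over_temp_release:              # released a lot with broken cold chain
        reward -= 1e6
    return reward


print(allocation_reward(0.8, [10, 20], 50, True, False))   # clean step
print(allocation_reward(0.8, [40, 20], 50, False, False))  # two lower rungs hit
print(allocation_reward(0.8, [10, 20], 50, True, True))    # top rung hit
```

Because each rung is ten times the one below, the two lower penalties together (1.1 × 10⁵) still cost less than a single over-temperature release, so the ordering of "never" survives any combination of violations.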

Training upweights cold-chain-anomaly transitions ~10× via importance-sampled replay so the policy becomes sensitive to rare violations, and the Lagrange multiplier λ on the soft constraints is clamped at an upper bound (e.g. 10³) so the dual variable cannot run away into over-conservatism that tanks coverage. Lockdown β is handled symmetrically: the actor emits (μ, log σ), samples β, and the safe layer projects it into [0, β_legal]; SAC with automatic temperature α trains the continuous policy with an entropy bonus, and a sigmoid scaling fixes β = 0.5 to "moderate" while β = 1 stays within statute. In a real stockpile back-test this structure held for 90 consecutive days with zero inventory violations, cut near-expiry wastage from 5.7% to 0.3%, and raised coverage by 4.1%: feasibility and performance at once, not a trade.
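
The clamped dual variable can be sketched as a projected dual-ascent step. The step size and the plain gradient form are assumptions; only the upper bound of 10³ comes from the text.

```python
def update_multiplier(lam, constraint_cost, limit, step_size=0.05, lam_max=1e3):
    """Projected dual ascent on the Lagrange multiplier.

    lam rises while the measured constraint cost exceeds its limit and
    falls otherwise; the result is clamped into [0, lam_max] so it can
    neither go negative nor grow without bound.
    """
    lam = lam + step_size * (constraint_cost - limit)
    return min(max(lam, 0.0), lam_max)


lam = 0.0
for _ in range(1000):
    # Hypothetical run in which the constraint stays violated (cost 50 > limit 10).
    lam = update_multiplier(lam, constraint_cost=50.0, limit=10.0)
print(lam)
```

Without the clamp, a persistently violated constraint would keep inflating λ until the penalty term swamps the coverage reward; with it, the multiplier saturates and the hard safe layer remains the actual guarantee.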

Penalty vs. projection vs. proof
Three strengths of constraint enforcement, increasing in guarantee and in cost. A reward penalty is soft: it can be violated freely during exploration. A safe-layer projection makes each executed action feasible, a hard per-step guarantee and the workhorse here. A control-barrier function goes further: it certifies that the closed-loop trajectory never leaves the safe set. Construct a barrier h(s) ≥ 0 on the safe set and require ∇h·f(s, π(s)) ≥ −κ·h(s); the comparison lemma then gives h(s(t)) ≥ h(s(0))·e^{−κt} ≥ 0, so a trajectory starting safe stays safe. Barrier functions give an infinite-horizon hard guarantee but need an accurate model, whereas CPO's trust-region update only enforces a linearized constraint on expected cost at each policy update. Choosing among them is an engineering decision, not a default.
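
For readers who want the barrier argument spelled out, here is a short derivation, assuming closed-loop dynamics ṡ = f(s, π(s)), a differentiable h, and κ > 0. Multiplying by the integrating factor e^{κt} shows that the barrier value cannot cross zero from above:

```latex
\begin{aligned}
\dot h(s(t)) &= \nabla h(s)\cdot f\big(s,\pi(s)\big) \;\ge\; -\kappa\, h(s(t)) \\
\frac{d}{dt}\Big[e^{\kappa t}\, h(s(t))\Big] &= e^{\kappa t}\big(\dot h + \kappa h\big) \;\ge\; 0 \\
e^{\kappa t}\, h(s(t)) &\ge h(s(0)) \\
h(s(t)) &\ge h(s(0))\, e^{-\kappa t} \;\ge\; 0 \quad \text{whenever } h(s(0)) \ge 0 .
\end{aligned}
```

The bound is a floor, not a decay rate: h may approach zero exponentially but never becomes negative, which is exactly the statement that the trajectory stays in the safe set.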

5 · The reward is two objectives — make the exchange rate explicit

Intuition. Every action saves lives and costs output; there is no scalar reward until you decide how many dollars a life is worth. Hiding that in a hand-tuned weight is both fragile and indefensible to a regulator. The honest move is to name it: a value of statistical life (VSL) term converts averted deaths into the same units as GDP loss, and the death side is additionally written as a hard constraint, not just a price.

r_t = −α·(GDP loss_t) − β·Var(loss_t) − γ·Penalty_t  ,  subject to   E_π[ deaths per episode ] ≤ ε

Engineering detail. In this reward, α, β and γ are weights, not the SAC temperature, lockdown intensity, and discount factor that share those letters elsewhere in the lesson. The economic side is concrete: with α ≈ 10 a 1% output gap maps to −0.1 reward, β ≈ 0.5 suppresses high-variance policies, and the penalty term contributes −10 only when a macro red line is crossed (deficit > 3% or CPI > 3.5%), else 0. The state ingests ~10 high-frequency proxies (night-light index, transit ridership, blast-furnace utilization) so the agent closes the loop at monthly frequency. The health side is enforced with CPO: the per-episode expected-death constraint E_π[cost] ≤ ε is enforced within each trust-region update, and a Monte-Carlo back-test checks that the realized death rate is below the threshold at 95% confidence across a simulation → closed-environment → live ladder. Offline, a doubly-robust estimator on historical windows showed the RL policy cut cumulative GDP loss by ~1.2 percentage points without tripping any macro red line. To prevent extreme corner solutions on the Pareto front (all-economy or all-health), the policy is constrained away from the front's steep inflection, analogous to capping maximum drawdown, which in deployment kept worst-case outcomes bounded while sacrificing only ~0.8% of headline performance.
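
The arithmetic above can be checked with a small sketch. It assumes losses and red-line quantities are expressed as fractions (0.01 means 1%), folds the weight on the penalty into a single `red_line_penalty` of 10, and uses a one-sided normal approximation for the 95% check; all names and those modelling choices are illustrative assumptions.

```python
import math
import statistics


def economic_reward(gdp_loss, loss_variance, deficit, cpi,
                    alpha=10.0, beta=0.5, red_line_penalty=10.0):
    """Scalar economic reward; gdp_loss, deficit and cpi are fractions."""
    breached = deficit > 0.03 or cpi > 0.035       # macro red lines
    penalty = red_line_penalty if breached else 0.0
    return -alpha * gdp_loss - beta * loss_variance - penalty


def death_constraint_ok(episode_deaths, epsilon, z=1.645):
    """True if the one-sided 95% upper bound on mean deaths is below epsilon."""
    mean = statistics.mean(episode_deaths)
    std_err = statistics.stdev(episode_deaths) / math.sqrt(len(episode_deaths))
    return mean + z * std_err < epsilon


print(economic_reward(0.01, 0.0, deficit=0.02, cpi=0.02))   # 1% gap, no breach
print(economic_reward(0.01, 0.0, deficit=0.04, cpi=0.02))   # deficit red line crossed
print(death_constraint_ok([8, 9, 10, 9, 8, 10, 9, 9], epsilon=12))
```

Requiring the upper confidence bound, not the sample mean, to sit below ε is what makes the back-test conservative: a policy that is only feasible on average fails the check.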

Figure: the five-step loop. 1 · Formulate (S, A, R, P; constrained MDP) → 2 · Diagnose (hidden infections, two objectives) → 3 · Engineer (belief GRU, safe layer, CPO + VSL) → 4 · Guard (BOCPD detection, KL circuit-break) → 5 · Iterate (re-diagnose: a new variant moves the dynamics, so re-run the loop).

6 · Non-stationarity — detect the variant, then adapt fast and safely

Intuition. The hardest property is that the world does not hold still. A new variant changes transmissibility and severity; the transition function P drifts under the policy. A controller trained on last season's dynamics will quietly degrade, and the danger is mistaking "the policy is diverging" for "the environment changed" — opposite fixes. The guard is a three-layer defence: notice the drop, locate the change-point, and adapt conservatively without forgetting what still works.

Engineering detail — three layers. (1) Real-time metrics. Maintain a 1000-step sliding window of mean episode return and its 95% lower confidence bound; if three consecutive windows fall below 90% of the historical best, raise a level-1 alert. Disambiguate with policy KL: KL > 0.02 with a >5% return drop means the policy is diverging; KL ≈ 0 with a return drop points at the environment or reward. (2) Distribution test. Feed the last ~2000 reward samples to BOCPD (Bayesian online change-point detection); a posterior change probability > 0.7 pins the mutation moment, and a χ² test on the state-visitation distribution before vs. after (p < 0.01) confirms a genuine dynamics drift. (3) Safe adjustment. On confirmation, cut the learning rate to 10%, shrink PPO clip ε from 0.2 to 0.05, freeze the shared feature layers and fine-tune only the policy/value heads to avoid catastrophic forgetting; insert the ±500 transitions around the change-point into replay at top priority with truncated (≤2) importance weights. If the system was pre-seeded with a Reptile / Meta-PPO initialization, a few inner-loop steps on the freshly-collected post-mutation trajectories recover most of the original return in seconds rather than hours. If 24 hours later return is still below 80% of best, auto-fall-back to the filed rule-based policy and emit a structured incident report (mutation type, timestamp, impact estimate) for audit.
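
The first and third layers reduce to a few threshold checks, sketched below with the numbers from the text. The function names are hypothetical, the alert uses window means only (the confidence bound is omitted), and the triage treats any KL at or below 0.02 as "not the policy", which is a simplification of the KL ≈ 0 condition.

```python
def level1_alert(window_means, historical_best, ratio=0.9, consecutive=3):
    """True if the last `consecutive` window means all fall below ratio * best."""
    recent = window_means[-consecutive:]
    return (len(recent) == consecutive
            and all(m < ratio * historical_best for m in recent))


def triage(policy_kl, return_drop, kl_threshold=0.02, drop_threshold=0.05):
    """Separate a diverging policy from a shifted environment or reward."""
    if return_drop <= drop_threshold:
        return "healthy"
    if policy_kl > kl_threshold:
        return "policy_diverging"
    return "environment_or_reward_shift"


def safe_adjustment(learning_rate):
    """Conservative fine-tuning settings after a confirmed change-point."""
    return {
        "learning_rate": learning_rate * 0.1,   # cut to 10%
        "clip_eps": 0.05,                        # down from 0.2
        "freeze_shared_layers": True,            # fine-tune heads only
        "max_importance_weight": 2.0,            # truncate replay weights
    }


if level1_alert([100.0, 85.0, 88.0, 80.0], historical_best=100.0):
    diagnosis = triage(policy_kl=0.001, return_drop=0.12)
    print(diagnosis)
    if diagnosis == "environment_or_reward_shift":
        print(safe_adjustment(3e-4))
```

Keeping the triage separate from the adjustment matters because the two diagnoses call for opposite fixes: a diverging policy needs rolling back, while a shifted environment needs careful new learning.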

The through-line
Every section above is one row of the MDP table turned into a mechanism: hidden infections → a belief GRU with an information-gain bonus; infeasible actions → a safe-layer projection plus action masking, escalating to a barrier-function proof when a hard guarantee is required; two-objective reward → an explicit VSL exchange rate with a CPO death constraint and an anti-extreme Pareto cap; mutating transition → BOCPD detection feeding meta-RL fast adaptation with a rule-based fallback. You never reached for a tool until a row of the table demanded it. That discipline — write the MDP, name the difficulty that binds first, remove exactly that — is the whole track, and it is the same on a game frame as it is on an epidemic curve.
Health–economy frontier — where on the lockdown curve do you sit?

Lockdown intensity β trades deaths against output. The reward weights both into one number, but a hard death constraint ε and a legal ceiling β_legal bound the choice.

Interactive figure: sliders for β and the VSL exchange rate; readouts for deaths per 100k, GDP loss (%), scalarized reward, and whether the operating point is feasible.
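
The same exploration can be done offline with a toy sweep. Everything numeric here is invented for illustration: the two response curves, the VSL weight of 0.5 reward units per death per 100k, the death ceiling of 10, and the legal ceiling of 0.8 are assumptions, not values from the lesson.

```python
def deaths_per_100k(beta):
    """Toy curve: stricter lockdown, fewer deaths."""
    return 2.0 + 40.0 * (1.0 - beta) ** 2


def gdp_loss_pct(beta):
    """Toy curve: stricter lockdown, more output lost."""
    return 8.0 * beta ** 2


def evaluate(beta, vsl_weight, epsilon, beta_legal):
    """Scalarized reward and feasibility of one operating point."""
    deaths, loss = deaths_per_100k(beta), gdp_loss_pct(beta)
    return {
        "beta": beta,
        "deaths": deaths,
        "gdp_loss": loss,
        "reward": -(vsl_weight * deaths + loss),
        "feasible": deaths <= epsilon and beta <= beta_legal,
    }


def best_feasible(vsl_weight=0.5, epsilon=10.0, beta_legal=0.8, steps=10):
    """Sweep beta over a grid and return the best feasible point, if any."""
    points = [evaluate(i / steps, vsl_weight, epsilon, beta_legal)
              for i in range(steps + 1)]
    feasible = [p for p in points if p["feasible"]]
    return max(feasible, key=lambda p: p["reward"]) if feasible else None


for i in range(11):
    p = evaluate(i / 10, 0.5, 10.0, 0.8)
    print(f"beta={p['beta']:.1f} deaths={p['deaths']:.1f} "
          f"gdp_loss={p['gdp_loss']:.2f} reward={p['reward']:.2f} feasible={p['feasible']}")
print(best_feasible())
```

Raising the VSL weight pushes the best point toward stricter lockdown until the legal ceiling binds; tightening ε can empty the feasible set entirely, in which case the function returns `None` rather than a violating action.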

Further considerations