Personalized medical dosing
A clinician adjusts a drug dose every few hours from a patient's labs and vitals, trying to reach a therapeutic effect without crossing into toxicity. That is a sequential-decision problem — a closed-loop controller over a living body — and it is the cleanest place to learn the parts of applied RL that no game teaches you: the state is a heterogeneous, missing, high-dimensional clinical record; the action is a continuous dose under a hard safety cap; the reward is an efficacy-minus-toxicity signal that arrives days late; and you can never explore on a real patient — so everything is offline, conservative, and audited. Each of those names a mechanism.
1 · Formulate — the MDP behind a dosing controller
Intuition. Picture a sepsis patient in the ICU. Every few hours a decision moment arrives: look at the chart, choose a dose of a vasopressor or fluid, watch what the body does, repeat. The chart is the state, the dose is the action, the change in the patient's status is the reward, and the patient's physiology is the transition we cannot write down. Every difficulty in this lesson is one of those four pieces being awkward in a way no simulator is.
| Piece | For a personalized dosing agent | The awkward part |
|---|---|---|
| State S | labs, vitals, meds, notes & imaging over sliding 1/6/12/24 h windows — thousands of raw fields | heterogeneous, high-dimensional, and partly missing at almost every step |
| Action A | a continuous dose (e.g. µg/kg/min), one per decision moment | continuous and safety-critical — an overdose is irreversible |
| Reward R | efficacy (tumor shrinkage, restored MAP) minus toxicity penalty | delayed (side effects can surface days, i.e. many decision steps, after the dose) and hackable |
| Transition P(s′ \| s, a, η) | the patient's body, with an unobserved individual response η | per-patient distribution shift; you may never explore on it → offline only |
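The rightmost column is the lesson. Notice the constraint hanging off the objective: unlike a game, dosing carries a hard cost budget E[Σγᵗcₜ] ≤ ε, where cₜ is the per-step toxicity cost, that turns the plain MDP into a constrained MDP (CMDP). Sections 5 and 6 are where that ε is enforced: first through the cost critic and its Lagrange multiplier, then through the barrier shield.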
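To make the budget concrete, here is a minimal Python sketch that estimates the discounted cost E[Σγᵗcₜ] from logged per-step costs and compares it with ε. The function names and the toy numbers are illustrative assumptions, not something the source specifies.

```python
from typing import List, Sequence, Tuple


def discounted_sum(values: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma**t * values[t] for one trajectory."""
    total = 0.0
    for t, v in enumerate(values):
        total += (gamma ** t) * v
    return total


def check_cost_budget(
    cost_trajectories: List[Sequence[float]], gamma: float, epsilon: float
) -> Tuple[bool, float]:
    """Monte Carlo estimate of E[sum_t gamma^t c_t] and whether it is <= epsilon."""
    if not cost_trajectories:
        raise ValueError("need at least one trajectory")
    mean_cost = sum(discounted_sum(c, gamma) for c in cost_trajectories) / len(
        cost_trajectories
    )
    return mean_cost <= epsilon, mean_cost


# Toy log (hypothetical): two patients, per-step toxicity cost c_t.
ok, estimate = check_cost_budget([[0.0, 1.0], [0.0, 0.0]], gamma=0.5, epsilon=0.3)
print(ok, estimate)
```

The same discounted sum applied to rewards gives the objective; the CMDP maximizes one while keeping the other under ε.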
2 · Diagnose — three properties, in the order they bite
Intuition. You cannot fix everything at once, so order the difficulties by which one stops you first. For dosing the order is unusually clear. (a) The state is unusable raw: thousands of heterogeneous, irregularly sampled, partly missing fields. Until you have a clean low-dimensional state, nothing downstream trains. (b) The action can kill: it is a continuous dose with a hard ceiling, where a soft penalty is not good enough. (c) The reward and the truth both arrive late: toxicity surfaces days after the dose, and because you can never run the policy on a real patient to confirm, you must learn and evaluate entirely from historical logs. These map onto the source's four chapters: (a) is state engineering, (b) is safe continuous actions, and (c) splits into delayed-credit reward and offline RL under regulation.
3 · Engineer the state — compress thousands of dirty fields into a stable few
Intuition. The raw record is a pile of different things measured at different times with holes everywhere: a sodium from an hour ago, a heart rate every five minutes, a free-text progress note, an imaging report from yesterday. A policy network cannot consume that. The job is to turn it into a fixed-length vector that (i) aligns everything to the decision moment, (ii) is small enough to learn from a few hundred patient-days, and (iii) tells the network what it does not know.
Engineering detail: align, encode, compress. Anchor time at the decision moment as zero and take sliding windows (e.g. the past 24/12/6/1 h). Resample everything onto a uniform 5-minute grid with linear interpolation and, crucially, add GRU-D-style decay inputs (an observation mask and the time since each value was last measured) so the network knows a value measured 20 minutes ago is more trustworthy than one from 20 hours ago. Encode the modalities separately, with structured fields going through an ontology embedding (~256-dim) and clinical text through a domain language model mapped onto a standard concept vocabulary (~128-dim), and concatenate them into a ~384-dim primary state. Then compress with a β-VAE (β≈0.5) down to ~32 latent dimensions, each loosely aligned to one physiological system, with an MMD term added to the KL term so the latent space transfers across hospitals.
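The alignment step can be sketched for a single lab value as follows. This is a simplified illustration, not the source's pipeline: it carries the last observation forward and decays it toward a population mean instead of interpolating linearly, and the `decay_rate` constant, the function name, and the sodium numbers are all assumptions made for the example.

```python
import math
from typing import List, Tuple


def align_with_decay(
    samples: List[Tuple[float, float]],  # (minutes relative to decision, value); times <= 0
    window_min: float = 60.0,
    step_min: float = 5.0,
    population_mean: float = 0.0,
    decay_rate: float = 0.01,  # hypothetical per-minute decay constant
) -> List[Tuple[float, int, float]]:
    """Return one (value, mask, minutes_since_last_obs) triple per grid point.

    The grid runs from -window_min up to 0, the decision moment. mask is 1 when
    a measurement falls inside the current grid step, otherwise 0. Stale values
    are pulled toward population_mean with weight exp(-decay_rate * delta).
    """
    obs = sorted(samples)
    n_steps = int(round(window_min / step_min))
    out = []
    for k in range(n_steps + 1):
        g = -window_min + k * step_min
        past = [(t, v) for t, v in obs if t <= g]
        if not past:
            # Nothing measured yet: fall back to the mean, flag as missing.
            out.append((population_mean, 0, window_min))
            continue
        t_last, v_last = past[-1]
        delta = g - t_last
        mask = 1 if delta < step_min else 0
        trust = math.exp(-decay_rate * delta)
        value = v_last if mask else trust * v_last + (1.0 - trust) * population_mean
        out.append((value, mask, delta))
    return out


# Hypothetical sodium readings 60 and 20 minutes before the decision.
grid = align_with_decay([(-60.0, 140.0), (-20.0, 138.0)], population_mean=139.0)
print(len(grid), grid[0], grid[-1])
```

Feeding the mask and the elapsed time alongside the value is what lets the policy discount stale measurements rather than treating every grid cell as equally reliable.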
In a real sepsis fluid-resuscitation task this reduction took the raw heterogeneous fields from ~1,372 dimensions down to a 32-dim unified state, after which a PPO policy converged on only ~200 ICU-patient-days of data and cut the prognostic-score prediction error by ~18%. The point is not the VAE specifically. It is that policy-centered compression (a dimensionality reduction designed to preserve the Q-values it feeds) buys sample efficiency that is hard to get any other way when patient-days are scarce.
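The compression step in this section can be summarized as a single training objective. The form below is a sketch under assumptions: the Gaussian prior, the weight λ_MMD, and the reading of the MMD term as a distance between the latent distributions of two hospitals A and B are not stated in the source.

```latex
\mathcal{L}(\theta,\phi) =
  \mathbb{E}_{q_\phi(z \mid x)}\big[-\log p_\theta(x \mid z)\big]
  + \beta \,\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, \mathcal{N}(0, I)\big)
  + \lambda_{\mathrm{MMD}} \,\mathrm{MMD}^2\big(q_\phi^{A}(z),\, q_\phi^{B}(z)\big),
\qquad \beta \approx 0.5,\quad z \in \mathbb{R}^{32}
```

Here x is the ~384-dim primary state and z the 32-dim latent that the policy consumes.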
4 · Engineer the action — make the safety cap structural, not a penalty
Intuition. The dose is a continuous number with a hard ceiling: never exceed the clinically-allowed maximum, which itself depends on the patient (weight, renal and hepatic function). The naïve approach adds a penalty term to the reward for exceeding the cap — but a penalty only discourages overdose on average, and "on average safe" is not safe. The right move is to make it impossible to output an unsafe dose by construction, so the cap never appears in the objective at all.
Engineering detail: squash into the feasible set. Let the network emit an unbounded latent z, then map it through a fixed, differentiable squashing function into the per-patient interval [a_min(s), a_max(s)], for example a = a_min(s) + (a_max(s) − a_min(s))·σ(z) with σ the logistic sigmoid. Because the map's range is the safe set, the dose is always feasible. There is no dose-penalty hyperparameter to tune, and inference simply takes the deterministic action with exploration noise switched off, so the cap is satisfied 100% of the time.
When the safe interval moves with the patient, also feed a_min(s) and a_max(s) to the network as inputs; the squash itself is unchanged. For more complex couplings (e.g. a total-metabolic-load constraint across several co-administered drugs) replace the scalar squash with a small differentiable convex-optimization layer that projects the raw action onto the feasible polytope, keeping the whole thing end-to-end trainable. A useful proof point for a reviewer: projection onto a convex set is a non-expansive operator, so it preserves the policy's Lipschitz continuity and the monotonic-improvement guarantees of CPO still hold.
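The scalar squash described in this section can be sketched in a few lines. The logistic sigmoid is one possible choice of squashing function, and `patient_interval` with its numbers is a hypothetical stand-in for whatever clinical rule sets the per-patient bounds; none of the values are dosing guidance.

```python
import math
from typing import Tuple


def squash_dose(z: float, a_min: float, a_max: float) -> float:
    """Map an unbounded network output z into [a_min, a_max] via a sigmoid."""
    if a_max < a_min:
        raise ValueError("a_max must be >= a_min")
    # Numerically stable logistic sigmoid (never exponentiates a large positive).
    if z >= 0:
        s = 1.0 / (1.0 + math.exp(-z))
    else:
        e = math.exp(z)
        s = e / (1.0 + e)
    return a_min + (a_max - a_min) * s


def patient_interval(
    weight_kg: float, renal_factor: float, max_per_kg: float = 0.5
) -> Tuple[float, float]:
    """Hypothetical rule: cap scales with weight and with renal_factor in (0, 1]."""
    return 0.0, max_per_kg * weight_kg * renal_factor


lo, hi = patient_interval(weight_kg=70.0, renal_factor=0.6)
for z in (-1000.0, -2.0, 0.0, 2.0, 1000.0):
    dose = squash_dose(z, lo, hi)
    assert lo <= dose <= hi  # holds for every z, by construction
    print(z, dose)
```

Because the bound is a property of the map rather than of the loss, no value of z, however extreme, can produce a dose outside the interval.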
5 · Engineer the reward — efficacy minus toxicity, with late credit assigned correctly
Intuition. The reward must say "this dose moved the patient toward cure without poisoning them." That is two opposing signals — benefit and harm — and the harm often shows up much later than the dose that caused it. If you only reward the immediate step, the agent learns to push the dose because today's tumor-shrinkage reward is real and the toxicity bill is days away.
Engineering detail: a decomposed, auto-computable reward. Build the per-step reward from quantities the hospital systems already record, so no extra manual entry is needed and the whole thing is reproducible for audit:
r_t = w_Q·ΔQALY_t + w_T·Δtumor_t − (w_3·n3_t + w_4·n4_t) − 0.2·1[death at t]
where the w's are per-term weights, ΔQALY_t converts a utility gain over Δt days into QALYs (e.g. a 0.12 utility lift over 21 days is 0.12 × 21/365 ≈ 0.0069 QALY), Δtumor_t is the normalized lesion-diameter change (a 30% partial response → 0.3; disease progression goes negative), n3_t and n4_t count same-day grade-3 and grade-4 toxicity events, and the last term is a one-time −0.2 terminal penalty on death. Normalize the reward to [−1, 1] and use γ=0.99, chosen to keep the same monotonic direction as the standard health-economic discount rate.
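A minimal sketch of such a reward, written so that every input is a recorded quantity. Only the −0.2 death penalty and the QALY conversion come from the text; the four weights are hypothetical placeholders, and clipping is just one simple way to keep the result in [−1, 1].

```python
from dataclasses import dataclass


@dataclass
class StepOutcome:
    utility_gain: float       # health-utility lift over the interval
    days: float               # length of the interval in days
    tumor_shrink_frac: float  # 0.3 for a 30% response, negative on progression
    grade3_events: int        # same-day grade-3 toxicity events
    grade4_events: int        # same-day grade-4 toxicity events
    died: bool                # terminal step ending in death


def qaly_gain(utility_gain: float, days: float) -> float:
    """Convert a utility lift held for `days` into quality-adjusted life years."""
    return utility_gain * days / 365.0


def step_reward(
    o: StepOutcome,
    w_qaly: float = 10.0,   # hypothetical weight
    w_tumor: float = 1.0,   # hypothetical weight
    w_g3: float = 0.1,      # hypothetical weight
    w_g4: float = 0.3,      # hypothetical weight
    death_penalty: float = 0.2,
) -> float:
    """Efficacy minus toxicity, clipped to [-1, 1]."""
    r = w_qaly * qaly_gain(o.utility_gain, o.days) + w_tumor * o.tumor_shrink_frac
    r -= w_g3 * o.grade3_events + w_g4 * o.grade4_events
    if o.died:
        r -= death_penalty
    return max(-1.0, min(1.0, r))


print(round(qaly_gain(0.12, 21), 4))  # 0.0069, matching the worked example
print(step_reward(StepOutcome(0.12, 21, 0.3, 0, 1, False)))
```

Keeping the terms separate in code also makes the audit trail straightforward: each component can be logged next to the total.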
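Delayed credit: a cost channel plus counterfactuals. Because toxicity is delayed, run a second critic Q_c(s,a) that estimates long-run accumulated cost, and compute its advantage A_c with GAE(λ) so credit propagates across very long delays. The policy gradient then subtracts the cost advantage, weighted by a dynamic Lagrange multiplier μ that is kept on dual ascent:
∇_θ J ≈ E[∇_θ log π_θ(a_t|s_t)·(A_r(s_t,a_t) − μ·A_c(s_t,a_t))],  μ ← max(0, μ + ρ·(Ĵ_c − ε))
where A_r is the reward advantage, Ĵ_c is the current estimate of the discounted cost E[Σγᵗcₜ], ε is the cost budget, and ρ is the dual step size.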
When the side effect is extremely sparse, add a hindsight counterfactual network that takes (s_t, a_t, s_{t+k}) and predicts the counterfactual probability that this action caused the later harm, trained with binary cross-entropy. That gives you single-step credit even when the consequence is a thousand steps away.
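The cost channel in this section can be sketched without any deep-learning library. The code below computes GAE advantages for one trajectory, combines reward and cost advantages with a multiplier, and takes one projected dual-ascent step. The names `mu` and `step` and the toy trajectory are assumptions for illustration; the critics that would supply `values` are not shown.

```python
from typing import List, Sequence


def gae_advantages(
    signal: Sequence[float],   # per-step reward or per-step cost
    values: Sequence[float],   # critic values, len(signal) + 1 (bootstrap last)
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Generalized advantage estimation, computed backwards over one trajectory."""
    if len(values) != len(signal) + 1:
        raise ValueError("values must have one more entry than signal")
    adv = [0.0] * len(signal)
    running = 0.0
    for t in reversed(range(len(signal))):
        delta = signal[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


def lagrangian_advantage(
    adv_reward: Sequence[float], adv_cost: Sequence[float], mu: float
) -> List[float]:
    """Advantage used in the policy gradient: reward minus mu times cost."""
    return [ar - mu * ac for ar, ac in zip(adv_reward, adv_cost)]


def dual_ascent(mu: float, mean_cost: float, epsilon: float, step: float = 0.05) -> float:
    """Raise mu when the cost estimate exceeds the budget; never go below zero."""
    return max(0.0, mu + step * (mean_cost - epsilon))


# Toy trajectory: a toxicity cost that appears only at the last step.
costs = [0.0, 0.0, 1.0]
adv_c = gae_advantages(costs, [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
adv_r = gae_advantages([0.2, 0.2, 0.2], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
print(adv_c)
print(lagrangian_advantage(adv_r, adv_c, mu=0.5))
```

With λ = 1 the late cost is credited to every earlier step, while with λ = 0 only the final step sees it; that is the dial the text refers to when it says GAE(λ) propagates credit across long delays.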
6 · Guard — barrier functions, offline evaluation, and conservative extrapolation
Intuition. The squash from §4 keeps a single dose inside its cap, but safety in a sequential system is about the trajectory: you want the patient's physiological state to never leave a safe region across all future steps. And because you can never test the policy live, you must prove it is at least as good as the clinicians before you are allowed to deploy it.
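Engineering detail: a control barrier function as a per-step shield. Define a barrier B(x) whose safe set is the zero-superlevel set {x : B(x) ≥ 0}, and require, at every step, the discrete-time CBF condition B(x_{t+1}) ≥ (1 − α)·B(x_t) with 0 < α ≤ 1. Combined with maximizing the Q-value this is a tiny quadratic program solved before each action. It projects the RL policy's raw dose u_RL onto the set of doses that keep the safe set forward-invariant:
u* = argmin_u ‖u − u_RL‖²  subject to  B(f(x_t, u)) ≥ (1 − α)·B(x_t)
where x_{t+1} = f(x_t, u) is the patient model; the program is a QP when the constraint is linear in u.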
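For a robust certificate that survives the sim-to-real gap, tighten the right-hand side by the model-error margin ε_m‖∇B(x)‖, where ε_m bounds the model error (it is not the cost budget ε); invariance still holds. When the system is too high-dimensional to write B(x) by hand, parameterize it as a network B_θ and co-train a barrier loss ReLU(−(L_f B_θ + L_g B_θ·u + α(B_θ))) with the policy loss. This hinge is zero exactly when the CBF condition L_f B_θ + L_g B_θ·u + α(B_θ) ≥ 0 holds, so it penalizes only violations.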
Engineering detail: offline policy evaluation before any patient sees the policy. Turn historical trials into a batch log D={⟨s,a,r,s′⟩}, using inverse-probability-of-censoring weights so that right-censored outcomes still give an unbiased reward, and discretizing continuous doses into 20–50 clinically interpretable clusters for clinician review. Then estimate the new policy's value off-policy: run weighted importance sampling (WIS) and fitted-Q evaluation (FQE), and if the two estimates disagree by more than 10%, treat the distribution shift as too large to trust. Wrap the estimate in a doubly-robust (DR) estimator with a 1,000-resample bootstrap 95% confidence interval, and reject the AI policy if the interval's lower bound falls below the 5th percentile of the clinicians' policy value. This DR + bootstrap-CI pattern is the technical core of regulator-grade offline evaluation.
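Part of this evaluation loop can be sketched in plain Python: a trajectory-wise WIS estimate, a percentile bootstrap interval around it, and the 10% disagreement check. The doubly-robust estimator and FQE are not implemented here, and the data layout (one `(pi_e_prob, pi_b_prob, reward)` tuple per step) is an assumption chosen for the example.

```python
import random
from typing import Callable, List, Sequence, Tuple

Step = Tuple[float, float, float]  # (evaluation-policy prob, behavior-policy prob, reward)
Trajectory = Sequence[Step]


def wis_value(trajs: Sequence[Trajectory], gamma: float = 0.99) -> float:
    """Weighted (self-normalized) importance sampling over whole trajectories."""
    weights, returns = [], []
    for traj in trajs:
        w, g = 1.0, 0.0
        for t, (pe, pb, r) in enumerate(traj):
            w *= pe / pb
            g += (gamma ** t) * r
        weights.append(w)
        returns.append(g)
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no overlap between evaluation and behavior policy")
    return sum(w * g for w, g in zip(weights, returns)) / total


def bootstrap_ci(
    trajs: Sequence[Trajectory],
    estimator: Callable[[Sequence[Trajectory]], float],
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval: resample trajectories with replacement."""
    rng = random.Random(seed)
    n = len(trajs)
    stats: List[float] = []
    for _ in range(n_boot):
        sample = [trajs[rng.randrange(n)] for _ in range(n)]
        stats.append(estimator(sample))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


def estimators_disagree(v_a: float, v_b: float, tol: float = 0.10) -> bool:
    """True when two value estimates differ by more than tol, relatively."""
    return abs(v_a - v_b) > tol * max(abs(v_a), abs(v_b), 1e-12)


# Toy log (hypothetical): 20 one-step trajectories, policies agree on probabilities.
log = [[(0.5, 0.5, 1.0)], [(0.5, 0.5, 0.0)]] * 10
print(wis_value(log, gamma=1.0), bootstrap_ci(log, lambda s: wis_value(s, gamma=1.0)))
```

Resampling whole trajectories rather than individual steps keeps each patient's steps together, which is what the interval needs to reflect patient-to-patient variation.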
7 · The dose–toxicity trade-off — the central knob
Intuition. Strip everything away and one tension remains: push the dose up and you gain efficacy but raise the probability of crossing into toxicity, while the constraint budget ε caps how much expected toxicity you will tolerate. The napkin model of that trade-off has two curves over dose, one for efficacy and one for toxicity. A patient's individual response (the unobserved η) shifts the dose–efficacy curve left or right; the cost channel and the CBF shield are what keep the expected toxicity under ε as you turn the dose up.
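A napkin version of this trade-off fits in a few lines. The two logistic curves, their midpoints and slopes, and the grid search are all assumptions chosen for illustration, not a pharmacological model.

```python
import math
from typing import Optional


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def efficacy(dose: float, eta: float = 0.0, ed50: float = 5.0, slope: float = 1.0) -> float:
    """Hypothetical dose-efficacy curve; eta shifts its midpoint per patient."""
    return _sigmoid(slope * (dose - (ed50 + eta)))


def toxicity_prob(dose: float, td50: float = 9.0, slope: float = 1.2) -> float:
    """Hypothetical probability of a toxic event at this dose."""
    return _sigmoid(slope * (dose - td50))


def best_feasible_dose(
    epsilon: float, eta: float = 0.0, dose_cap: float = 12.0, step: float = 0.1
) -> Optional[float]:
    """Grid search for the most effective dose whose toxicity stays <= epsilon."""
    best, best_eff = None, -1.0
    for i in range(int(round(dose_cap / step)) + 1):
        dose = i * step
        if toxicity_prob(dose) <= epsilon:
            eff = efficacy(dose, eta)
            if eff > best_eff:
                best, best_eff = dose, eff
    return best


for eps in (0.01, 0.05, 0.20):
    d = best_feasible_dose(eps)
    print(eps, d, efficacy(d, eta=-2.0), efficacy(d, eta=2.0))
```

In this toy the budget ε alone fixes the highest allowed dose, while η decides how much efficacy that dose buys: a sensitive patient (negative η) is well treated under a budget that leaves a resistant patient (positive η) under-dosed.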
8 · Iterate — the through-line
Further considerations
- Cross-hospital domain shift. Reference ranges and device versions differ between sites. Add a reference-range z-score plus a hospital-ID embedding so the network learns the domain-offset coefficient, and apply unsupervised domain alignment (e.g. CORAL) so a policy trained at a tertiary center can be deployed zero-shot at a smaller hospital with a small AUC drop. For evaluation, transfer-OPE with domain-adaptive importance sampling carries source-center knowledge to the target center with a finite-sample error bound.
- Survival-curve rewards. When the outcome is a survival curve rather than a scalar, generalize OPE to continuous-time survival: fit a neural Cox model λ(t|s,a) and integrate over time to get expected survival benefit, instead of collapsing the endpoint to one number.
- Multi-objective conflict. Jointly optimizing efficacy and toxicity is a Pareto problem; gradient-surgery methods (e.g. PCGrad) stop the toxicity signal from being drowned by the efficacy signal, and a Pareto-stable region selects the Lagrange/weight vector without hand-tuning.
- Explainability for review. Ship a white-box backup policy (an ε-optimal decision tree whose value differs from the deep policy by <1% under the same state distribution), SHAP top-5 feature attributions, and a conformal-prediction interval per recommended dose; if the interval exceeds the clinical tolerance (e.g. ±0.5 U/h for insulin), force a downgrade to the white-box policy and log the uncertainty event.
- Federated learning. To keep raw data inside each hospital, aggregate random-effects priors via federated averaging and fine-tune locally per patient — expanding the prior sample over time without any record leaving the institution. Exchange only constraint gradients under differential privacy and secure aggregation when collaborating across sites.
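The fallback rule in the "Explainability for review" item above can be sketched with split conformal prediction. The sketch assumes a held-out calibration set of residuals between the deep policy's dose and a reference dose, which the source does not specify; the function names and the 0.5 tolerance default (taken from the insulin example) are illustrative.

```python
import math
from typing import Sequence, Tuple


def conformal_halfwidth(calib_residuals: Sequence[float], alpha: float = 0.1) -> float:
    """Split-conformal half-width: the ceil((n+1)(1-alpha))-th smallest |residual|.

    Returns infinity when the calibration set is too small for the requested
    coverage level.
    """
    n = len(calib_residuals)
    # The small tolerance guards against floating-point noise just above an integer.
    k = math.ceil((n + 1) * (1 - alpha) - 1e-12)
    if k > n:
        return float("inf")
    return sorted(abs(r) for r in calib_residuals)[k - 1]


def choose_policy(
    deep_dose: float, tree_dose: float, halfwidth: float, tolerance: float = 0.5
) -> Tuple[float, str]:
    """Use the deep policy only when its interval fits the clinical tolerance."""
    if halfwidth > tolerance:
        return tree_dose, "white_box_fallback"
    return deep_dose, "deep_policy"


# Hypothetical calibration residuals in U/h.
residuals = [0.05, -0.10, 0.20, -0.15, 0.30, 0.10, -0.25, 0.40, 0.35]
hw = conformal_halfwidth(residuals, alpha=0.2)
print(hw, choose_policy(deep_dose=2.4, tree_dose=2.0, halfwidth=hw))
```

The returned label is the natural place to hook the uncertainty-event log mentioned in that item.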