Personalized medical dosing

A clinician adjusts a drug dose every few hours from a patient's labs and vitals, trying to reach a therapeutic effect without crossing into toxicity. That is a sequential-decision problem — a closed-loop controller over a living body — and it is the cleanest place to learn the parts of applied RL that no game teaches you: the state is a heterogeneous, missing, high-dimensional clinical record; the action is a continuous dose under a hard safety cap; the reward is an efficacy-minus-toxicity signal that arrives days late; and you can never explore on a real patient — so everything is offline, conservative, and audited. Each of those names a mechanism.

The method — five steps, every lesson

Applied RL is one loop. (1) Formulate the MDP — state, action, reward, transition, horizon. (2) Diagnose the one or two properties that make this MDP hard. (3) Engineer the mechanism that removes that difficulty — and not before. (4) Guard it in production — detect when it breaks and fall back to something safe. (5) Iterate. For dosing the difficulties stack: the state is dirty, the action is dangerous, the reward is delayed, and you have no environment to query. We run the loop once per layer.

1 · Formulate — the MDP behind a dosing controller

Intuition. Picture a sepsis patient in the ICU. Every few hours a decision moment arrives: look at the chart, choose a dose of a vasopressor or fluid, watch what the body does, repeat. The chart is the state, the dose is the action, the change in the patient's status is the reward, and the patient's physiology is the transition we cannot write down. Every difficulty in this lesson is one of those four pieces being awkward in a way no simulator is.

MDP = (S, A, R, P, γ) · maximize E[ Σₜ γᵗ rₜ ] subject to E[ Σₜ γᵗ cₜ ] ≤ ε

Piece	For a personalized dosing agent	The awkward part
State S	labs, vitals, meds, notes & imaging over sliding 1/6/12/24 h windows — thousands of raw fields	heterogeneous, high-dimensional, and partly missing at almost every step
Action A	a continuous dose (e.g. µg/kg/min), one per decision moment	continuous and safety-critical — an overdose is irreversible
Reward R	efficacy (tumor shrinkage, restored MAP) minus toxicity penalty	delayed (side effects can surface many decision steps, often days, later) and hackable
Transition P(s′|s,a,η)	the patient's body, with an unobserved individual response η	per-patient distribution shift; you may never explore on it → offline only

The rightmost column is the lesson. Notice the constraint hanging off the objective: unlike a game, dosing carries a hard cost budget E[Σγᵗcₜ]≤ε that turns the plain MDP into a constrained MDP (CMDP). Sections 4 and 5 are entirely about that ε.
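
To make the constrained objective concrete, here is a minimal sketch that computes the discounted return and discounted cost of logged trajectories and checks the budget. The function names and the (rewards, costs) trajectory layout are illustrative assumptions, not part of the source.

```python
from typing import List, Sequence, Tuple

Trajectory = Tuple[Sequence[float], Sequence[float]]  # (rewards, costs), assumed layout


def discounted_sum(values: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma^t * values[t] for one trajectory."""
    total = 0.0
    for t, v in enumerate(values):
        total += (gamma ** t) * v
    return total


def cmdp_summary(trajectories: List[Trajectory], gamma: float, epsilon: float):
    """Average discounted return and cost, plus whether the cost budget holds.

    Returns (mean_return, mean_cost, feasible) where feasible means
    mean_cost <= epsilon, i.e. the empirical version of E[sum gamma^t c_t] <= eps.
    """
    n = len(trajectories)
    mean_return = sum(discounted_sum(r, gamma) for r, _ in trajectories) / n
    mean_cost = sum(discounted_sum(c, gamma) for _, c in trajectories) / n
    return mean_return, mean_cost, mean_cost <= epsilon
```

A policy that wins on mean_return but fails the feasibility flag is not a candidate at all; that is the practical difference between a CMDP and a plain MDP with a penalty.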

2 · Diagnose — three properties, in the order they bite

Intuition. You cannot fix everything at once, so order the difficulties by which one stops you first. For dosing the order is unusually clear. (a) The state is unusable raw: thousands of heterogeneous, irregularly sampled, partly missing fields. Until you have a clean low-dimensional state, nothing downstream trains. (b) The action can kill: it is a continuous dose with a hard ceiling, where a soft penalty is not good enough. (c) The reward and the truth both arrive late: toxicity surfaces days after the dose, and because you can never run the policy on a real patient to confirm, you must learn and evaluate entirely from historical logs. These map onto the source's four chapters, with (c) spanning the last two: state engineering, safe continuous actions, delayed-credit reward, and offline RL under regulation.

Why this domain is not a game

In a game a bad action costs a frame; here it can cost a life, and there is no reset button. Three consequences follow and shape every mechanism below: (1) exploration on the real environment is forbidden — the data is fixed, retrospective, and biased by whatever the clinicians already did; (2) safety must be a hard guarantee, not an expected-value penalty; (3) every choice — the reward weights, the dimensionality reduction, the policy version — must be explainable and auditable, because a regulator and an ethics board will read it.

3 · Engineer the state — compress thousands of dirty fields into a stable few

Intuition. The raw record is a pile of different things measured at different times with holes everywhere: a sodium from an hour ago, a heart rate every five minutes, a free-text progress note, an imaging report from yesterday. A policy network cannot consume that. The job is to turn it into a fixed-length vector that (i) aligns everything to the decision moment, (ii) is small enough to learn from a few hundred patient-days, and (iii) tells the network what it does not know.

Engineering detail — align, encode, compress. Anchor time at the decision moment as zero and take sliding windows (e.g. past 24/12/6/1 h). Resample everything onto a uniform 5-minute grid with linear interpolation, and crucially feed a GRU-D-style decay so the network knows a value measured 20 minutes ago is more trustworthy than one from 20 hours ago. Encode the modalities separately — structured fields through an ontology embedding (~256-dim), clinical text through a domain language model mapped onto a standard concept vocabulary (~128-dim) — and concatenate into a ~384-dim primary state. Then compress with a β-VAE (β≈0.5) down to ~32 latent dimensions, each loosely aligned to one physiological system, with an MMD term added to the KL so the latent space transfers across hospitals.

ℒ_VAE = ‖x − x̂‖² + β·KL( q(z|x) ‖ p(z) ) + λ·MMD( q(z), q′(z) )

In a real sepsis fluid-resuscitation task this reduction took the raw heterogeneous fields from ~1,372 dimensions down to a 32-dim unified state, after which a PPO policy converged on only ~200 ICU-patient-days of data and cut the prognostic-score prediction error by ~18%. The point is not the VAE specifically — it is that policy-centered compression (a dimensionality reduction that provably preserves the Q-values it feeds) buys you sample efficiency you cannot get any other way when patient-days are scarce.
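
The decay idea from the alignment step can be sketched in a few lines. This follows the GRU-D input-decay form, where a stale measurement is pulled toward a population mean as its age grows. GRU-D learns the decay rate per feature; the fixed `rate` argument and the function names here are simplifying assumptions for illustration.

```python
import math


def decay_weight(delta_minutes: float, rate: float) -> float:
    """Trust weight exp(-max(0, rate * delta)); 1.0 for a fresh measurement."""
    return math.exp(-max(0.0, rate * delta_minutes))


def decayed_value(last_value: float, delta_minutes: float,
                  population_mean: float, rate: float) -> float:
    """Blend the last observation with the population mean by its age.

    A fresh value is passed through almost unchanged; an old value
    drifts toward the population mean instead of being trusted as-is.
    """
    w = decay_weight(delta_minutes, rate)
    return w * last_value + (1.0 - w) * population_mean
```

With rate 0.01 per minute, a sodium of 150 measured 20 minutes ago stays close to 150, while the same value from 20 hours ago is treated as nearly uninformative and sits close to the population mean.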

Missingness is a state feature, not a defect to silently fill

Intuition. A blank lab is information — it tells you the clinician chose not to order it. Quietly imputing the mean hides that. Engineering detail. First classify the missingness mechanism (MCAR / MAR / MNAR) and log it. Then append a 32-dim missing-mask to the state so the policy explicitly perceives its own uncertainty. When missingness is high, treat the problem as a POMDP: maintain a belief over the true state (a particle filter or a GRU+attention history encoder) and feed the belief, not a point estimate, to the critic. During training, inject random missingness and adversarially mask inputs so the policy is robust to the holes it will see in production; aim to keep return degradation under ~5% at test time, and write the imputation confidence into the audit log.
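
A minimal sketch of the two mechanical pieces above: appending an explicit missing-mask to the state, and injecting random missingness during training. Representing a missing field as `None`, the zero fill value, and the function names are assumptions made for this example.

```python
import random
from typing import List, Optional, Sequence


def with_missing_mask(values: Sequence[Optional[float]],
                      fill: float = 0.0) -> List[float]:
    """Return [filled values..., mask...]; mask is 1.0 where the field was missing."""
    filled = [fill if v is None else float(v) for v in values]
    mask = [1.0 if v is None else 0.0 for v in values]
    return filled + mask


def inject_missingness(values: Sequence[Optional[float]], drop_prob: float,
                       rng: random.Random) -> List[Optional[float]]:
    """Randomly blank observed fields so the policy trains on holes.

    Fields that are already missing stay missing; each observed field is
    dropped independently with probability drop_prob.
    """
    return [None if (v is not None and rng.random() < drop_prob) else v
            for v in values]
```

Because the mask travels with the state, the policy can distinguish "sodium is at the fill value" from "sodium was not ordered", which silent mean imputation erases.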

4 · Engineer the action — make the safety cap structural, not a penalty

Intuition. The dose is a continuous number with a hard ceiling: never exceed the clinically-allowed maximum, which itself depends on the patient (weight, renal and hepatic function). The naïve approach adds a penalty term to the reward for exceeding the cap — but a penalty only discourages overdose on average, and "on average safe" is not safe. The right move is to make it impossible to output an unsafe dose by construction, so the cap never appears in the objective at all.

Engineering detail — squash into the feasible set. Let the network emit an unbounded latent z, then map it through a fixed, differentiable squashing function into the per-patient interval [a_min(s), a_max(s)]. Because the map's range is the safe set, the dose is always feasible — there is no dose-penalty hyperparameter to tune, and inference simply takes the deterministic action with exploration noise switched off, giving a 100%-satisfied cap.

a = a_min(s) + ( a_max(s) − a_min(s) ) · σ(z), z = f_θ(s) ∈ ℝ

When the safe interval moves with the patient, feed a_min(s), a_max(s) as additional network inputs — the embedding logic is identical. For more complex couplings (e.g. a total-metabolic-load constraint across several co-administered drugs) replace the scalar squash with a small differentiable convex-optimization layer that projects the raw action onto the feasible polytope, keeping the whole thing end-to-end trainable. A useful proof point for a reviewer: the projection is a non-expansive operator, so it preserves the policy's Lipschitz continuity and the monotonic-improvement guarantees of CPO still hold.
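
The scalar squash is small enough to write out. This sketch assumes the per-patient bounds a_min(s) and a_max(s) have already been computed upstream and are passed in as plain numbers; the function names are illustrative.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def squash_dose(z: float, a_min: float, a_max: float) -> float:
    """Map an unbounded latent z into the feasible dose interval [a_min, a_max]."""
    if a_max < a_min:
        raise ValueError("a_max must be >= a_min")
    dose = a_min + (a_max - a_min) * sigmoid(z)
    # Final clamp guards against floating-point rounding at the interval edges.
    return min(a_max, max(a_min, dose))
```

Whatever the network emits, including extreme values such as z = 1000, the returned dose stays inside the interval, which is why no dose-penalty coefficient is needed.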

Per-patient response shift — treat η as a belief, adapt online

Intuition. The same dose helps one patient and harms another, because each body has an unobserved sensitivity η. A policy trained on the population will be biased for any individual. Engineering detail. Model η with a hierarchical Bayesian nonlinear mixed-effects model, P(s′|s,a,η), and get its posterior by HMC. Then train a Bayes-adaptive policy that takes (s, b(η)) — state plus a belief over η — as input. Such a policy is adaptive by construction: at the bedside you update b(η) from the new patient's first 3–5 responses and the dose self-corrects. If unobserved confounders (e.g. a missing genotype) would make the belief over-confident, introduce a VAE latent z as a proxy and recover hidden reward structure from expert clinician trajectories via inverse RL.
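
To show how a belief over η self-corrects the dose, here is a deliberately simplified sketch. It swaps the hierarchical mixed-effects model and HMC for a discrete grid posterior, and it assumes a hypothetical linear response (observed effect = η · dose + Gaussian noise). Only the update-then-redose loop is the point.

```python
import math
from typing import List, Sequence


def update_belief(grid: Sequence[float], belief: Sequence[float], dose: float,
                  observed: float, noise_sd: float) -> List[float]:
    """Bayes update of b(eta) on a grid under the assumed model y = eta*dose + noise."""
    weights = [b * math.exp(-0.5 * ((observed - eta * dose) / noise_sd) ** 2)
               for eta, b in zip(grid, belief)]
    total = sum(weights)
    if total == 0.0:  # observation impossible under every grid point: keep prior
        return list(belief)
    return [w / total for w in weights]


def posterior_mean(grid: Sequence[float], belief: Sequence[float]) -> float:
    """Expected sensitivity under the current belief."""
    return sum(eta * b for eta, b in zip(grid, belief))


def dose_for_target(target: float, grid: Sequence[float], belief: Sequence[float],
                    a_min: float, a_max: float) -> float:
    """Dose that hits the target effect at the posterior-mean eta, kept in bounds."""
    eta_hat = posterior_mean(grid, belief)
    if eta_hat <= 0.0:
        return a_min
    return min(a_max, max(a_min, target / eta_hat))
```

Starting from a uniform prior, a few observed responses from a sensitive patient move the posterior mean upward and the recommended dose downward, which is the bedside behavior described above.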

5 · Engineer the reward — efficacy minus toxicity, with late credit assigned correctly

Intuition. The reward must say "this dose moved the patient toward cure without poisoning them." That is two opposing signals — benefit and harm — and the harm often shows up much later than the dose that caused it. If you only reward the immediate step, the agent learns to push the dose because today's tumor-shrinkage reward is real and the toxicity bill is days away.

Engineering detail — a decomposed, auto-computable reward. Build the per-step reward from quantities the hospital systems already record, so no extra manual entry is needed and the whole thing is reproducible for audit:

r_t = ΔQALY_t + Δtumor_t − penalty_t, penalty_t = 0.05·III_tox + 0.1·IV_tox + 0.2·Death_t

where ΔQALY_t converts a utility gain over Δt days (e.g. a 0.12 utility lift over 21 days ≈ 0.0069 QALY), Δtumor_t is the normalized lesion-diameter change (a 30% partial response → 0.3; disease progression goes negative), and the penalty counts same-day grade-3 and grade-4 toxicity events with a one-time −0.2 terminal penalty on death. Normalize the reward to [−1, 1] and use γ=0.99, chosen to keep the same monotonic direction as the standard health-economic discount rate.
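
The reward arithmetic above can be checked with a short sketch. The penalty weights are the ones in the formula; implementing the normalization to [−1, 1] as a simple clip is an assumption, as are the function and argument names.

```python
def qaly_gain(utility_lift: float, days: float) -> float:
    """Convert a utility lift sustained over `days` into QALYs (years = days / 365)."""
    return utility_lift * days / 365.0


def step_reward(delta_qaly: float, delta_tumor: float, grade3_events: int,
                grade4_events: int, died: bool) -> float:
    """r_t = dQALY + dTumor - penalty, clipped to [-1, 1].

    penalty = 0.05 * grade-3 events + 0.1 * grade-4 events + 0.2 on death.
    """
    penalty = 0.05 * grade3_events + 0.1 * grade4_events + (0.2 if died else 0.0)
    reward = delta_qaly + delta_tumor - penalty
    return max(-1.0, min(1.0, reward))
```

For the worked number in the text, `qaly_gain(0.12, 21)` is 0.12 × 21 / 365 ≈ 0.0069. A 30% partial response with one grade-3 event and no QALY change gives 0.3 − 0.05 = 0.25.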

Delayed credit — a cost channel plus counterfactuals. Because toxicity is delayed, run a second critic Q_c(s,a) that estimates long-run accumulated cost, and compute its advantage with GAE(λ) so credit propagates across very long delays. The policy gradient then subtracts the cost advantage with a dynamic Lagrange multiplier kept on its dual ascent:

∇J = E[ ( Â_r − β·Â_c ) ∇ log π_θ(a|s) ], β ≥ 0 updated by dual ascent so E[Σγᵗcₜ] ≤ ε
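
A minimal sketch of the two pieces in that update: GAE(λ) applied to either the reward or the cost channel, and a projected dual-ascent step for the multiplier β. The critic values are assumed to be given as a list with one extra bootstrap entry; the step size and function names are illustrative.

```python
from typing import List, Sequence


def gae_advantages(signals: Sequence[float], values: Sequence[float],
                   gamma: float, lam: float) -> List[float]:
    """GAE(lambda) for one trajectory.

    signals[t] is r_t (reward critic) or c_t (cost critic);
    values has length len(signals) + 1, the last entry being the bootstrap value.
    """
    horizon = len(signals)
    advantages = [0.0] * horizon
    running = 0.0
    for t in reversed(range(horizon)):
        delta = signals[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def combined_advantage(adv_r: Sequence[float], adv_c: Sequence[float],
                       beta: float) -> List[float]:
    """Per-step weight (A_r - beta * A_c) that multiplies grad log pi."""
    return [ar - beta * ac for ar, ac in zip(adv_r, adv_c)]


def update_multiplier(beta: float, mean_discounted_cost: float,
                      epsilon: float, step_size: float) -> float:
    """Dual ascent: raise beta when cost exceeds the budget, never below zero."""
    return max(0.0, beta + step_size * (mean_discounted_cost - epsilon))
```

When the measured cost sits above ε the multiplier grows and the cost advantage weighs more heavily in the gradient; when the policy is comfortably inside the budget, β decays toward zero and the reward signal dominates.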

When the side effect is extremely sparse, add a hindsight counterfactual network that takes (s_t, a_t, s_{t+k}) and predicts the counterfactual probability that this action caused the later harm, trained with binary cross-entropy, giving you single-step credit even when the consequence is a thousand steps away.

Reward hacking — fake efficacy is the failure to watch for

A shaped clinical reward can be gamed: the agent finds a dosing pattern that maximizes the proxy (a lab number, a surrogate endpoint) without the patient actually improving. Defend in depth. (1) Stream a Wasserstein drift detector on the recent reward distribution against baseline; if the W-distance exceeds ~0.15, flag hacking risk and downgrade to a rule-based safe policy. (2) Run a feature-level attributor (integrated gradients) on high-reward trajectories and route the contributions to a second-line review. (3) Gate every reward-function change behind versioned approval and keep the diff — a tagged, timestamped reward version aligned to the trial database is what makes the audit trail one-click exportable. Together these push the hacking miss-rate below ~0.5%.
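
The drift check in (1) can be sketched for scalar rewards with an empirical 1-D Wasserstein distance, computed as the area between the two empirical CDFs. The 0.15 threshold comes from the text; the window handling and function names are assumptions.

```python
from bisect import bisect_right
from typing import Sequence


def wasserstein_1d(a: Sequence[float], b: Sequence[float]) -> float:
    """Empirical 1-Wasserstein distance: integral of |F_a(x) - F_b(x)| dx."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    a_sorted, b_sorted = sorted(a), sorted(b)
    points = sorted(set(a_sorted) | set(b_sorted))
    distance = 0.0
    for left, right in zip(points, points[1:]):
        cdf_a = bisect_right(a_sorted, left) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, left) / len(b_sorted)
        distance += abs(cdf_a - cdf_b) * (right - left)
    return distance


def reward_drift_flag(baseline: Sequence[float], recent: Sequence[float],
                      threshold: float = 0.15) -> bool:
    """True when the recent reward distribution has drifted past the threshold."""
    return wasserstein_1d(baseline, recent) > threshold
```

A flag here is only a trigger: the downgrade to the rule-based policy and the second-line review are what actually contain a hacking episode.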

6 · Guard — barrier functions, offline evaluation, and conservative extrapolation

Intuition. The squash from §4 keeps a single dose inside its cap, but safety in a sequential system is about the trajectory: you want the patient's physiological state to never leave a safe region across all future steps. And because you can never test the policy live, you must prove it is at least as good as the clinicians before you are allowed to deploy it.

Engineering detail — a control barrier function as a per-step shield. Define a barrier B(x) whose safe set is the zero-superlevel set, and require, at every step, the discrete-time CBF condition. Combined with maximizing the Q-value this is a tiny quadratic program solved before each action — it projects the RL's raw dose onto the set that keeps the state forward-invariant:

B(x_{t+1}) ≥ (1 − γΔt)·B(x_t), 0 < γΔt ≤ 1 ⟹ B(x_t) ≥ 0 for all t ≥ 0 whenever B(x_0) ≥ 0 (γ here is the barrier rate, not the discount factor)

For a robust certificate that survives the sim-to-real gap, tighten the right-hand side by the model-error bound ε‖∇B(x)‖; invariance still holds. When the system is too high-dimensional to write B(x) by hand, parameterize it as a network B_θ and co-train a barrier loss ReLU(−(L_fB_θ + L_gB_θ·u + α(B_θ))) with the policy loss, which penalizes violations of the continuous-time condition L_fB_θ + L_gB_θ·u + α(B_θ) ≥ 0 for the zero-superlevel safe set.
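
For a one-dimensional toy system the shield reduces to a clip, which makes the mechanism easy to see. The sketch assumes a hypothetical linear model x_{t+1} = a·x_t + b·u with b > 0, a safe set x ≤ x_max so that B(x) = x_max − x, and the superlevel-set condition B(x_{t+1}) ≥ (1 − κ)·B(x_t) with κ = γΔt in (0, 1]. With a single scalar dose, the quadratic program "stay closest to the RL dose subject to the barrier condition" has this closed form; a real multi-drug problem would need an actual QP solver.

```python
def cbf_dose_limit(x: float, x_max: float, a: float, b: float, kappa: float) -> float:
    """Largest dose u satisfying B(x_next) >= (1 - kappa) * B(x) for B(x) = x_max - x.

    Derived from x_max - (a*x + b*u) >= (1 - kappa) * (x_max - x), with b > 0.
    """
    return (kappa * x_max + (1.0 - kappa - a) * x) / b


def shielded_dose(u_rl: float, x: float, x_max: float,
                  a: float, b: float, kappa: float) -> float:
    """Project the RL dose onto [0, cbf_dose_limit]; the 1-D QP solution.

    If the limit is negative the condition cannot be met with a non-negative
    dose, so the sketch falls back to the minimum dose of zero.
    """
    limit = cbf_dose_limit(x, x_max, a, b, kappa)
    return max(0.0, min(u_rl, limit))
```

Even when the raw policy asks for an absurd dose at every step, the shielded trajectory approaches x_max geometrically and does not cross it, because each step is allowed to consume at most a fraction κ of the remaining safety margin.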

Engineering detail — offline policy evaluation before any patient sees the policy. Turn historical trials into a batch log D={⟨s,a,r,s′⟩} (using inverse-probability-of-censoring weights so right-censored outcomes give an unbiased reward; discretize continuous doses into 20–50 clinically interpretable clusters for clinician review). Then estimate the new policy's value off-policy: run weighted importance sampling and fitted-Q evaluation, and if they disagree by >10% the distribution shift is too large to trust. Wrap the estimate in a doubly-robust estimator and a 1000-sample bootstrap 95% confidence interval — and if the lower bound falls below the clinicians' policy by more than the 5th percentile, reject the AI policy. This DR + bootstrap-CI pattern is the technical core of regulator-grade offline evaluation.
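
As a starting point for the evaluation pattern above, this sketch implements only the weighted-importance-sampling leg with a percentile bootstrap interval; it is not the doubly-robust estimator, which additionally needs a fitted Q-model. Per-trajectory importance weights (the product of policy ratios) are assumed to be precomputed and positive, and the function names are illustrative.

```python
import random
from typing import List, Sequence, Tuple


def wis_estimate(returns: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted importance sampling: sum(w_i * G_i) / sum(w_i)."""
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("importance weights must have positive total mass")
    return sum(w * g for w, g in zip(weights, returns)) / total


def bootstrap_ci(returns: Sequence[float], weights: Sequence[float],
                 n_boot: int = 1000, alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the WIS estimate (resampling trajectories)."""
    rng = random.Random(seed)
    n = len(returns)
    estimates: List[float] = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(wis_estimate([returns[i] for i in idx],
                                      [weights[i] for i in idx]))
    estimates.sort()
    lower = estimates[int((alpha / 2.0) * n_boot)]
    upper = estimates[min(n_boot - 1, int((1.0 - alpha / 2.0) * n_boot))]
    return lower, upper
```

The decision rule then compares the lower end of the interval, not the point estimate, against the value of the clinicians' logged policy.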

Conservatism — stay inside the data's support

Intuition. Offline RL fails when the policy queries Q-values for (s,a) pairs nobody ever tried — the network extrapolates confidently into nonsense. Engineering detail. Penalize the value of out-of-support actions (a lower-confidence bound, e.g. mean − 1.96·σ), and mask any action whose visitation count in the data is below a statistically-justified threshold. Constrain the policy with a KL penalty toward the data distribution that anneals — strong early (β≈1.0) for safety, relaxed later (β≈0.1) for performance — and use dual gradient descent to snap β back the moment an OOD-detection rate rises. At serving time keep a One-Class OOD discriminator running; if the current state scores out-of-distribution, switch to the minimum-risk rule-based fallback and log the event. A self-driving analogue of this exact stack cut the human-takeover rate from 1.2 to 0.15 per 100 km while losing only ~4% of cumulative return — the same conservatism trade-off you accept at the bedside.
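
A minimal sketch of the three conservative ingredients for a discretized action set: the lower-confidence bound, the visitation-count mask, and a KL coefficient that tightens when the OOD rate exceeds a budget. The per-action (mean, std, count) layout, the OOD budget, and all names are assumptions for illustration.

```python
from typing import Dict, Optional, Tuple


def lower_confidence_bound(mean: float, std: float, z: float = 1.96) -> float:
    """Pessimistic value estimate: mean - z * std."""
    return mean - z * std


def pick_conservative_action(stats: Dict[str, Tuple[float, float, int]],
                             min_count: int) -> Optional[str]:
    """Choose the action with the best LCB among those seen at least min_count times.

    stats maps action -> (q_mean, q_std, visitation_count).
    Returns None when no action has enough support (caller should fall back).
    """
    supported = {action: lower_confidence_bound(mean, std)
                 for action, (mean, std, count) in stats.items()
                 if count >= min_count}
    if not supported:
        return None
    return max(supported, key=supported.get)


def adjust_kl_coef(beta: float, ood_rate: float, ood_budget: float, step_size: float,
                   beta_min: float = 0.1, beta_max: float = 1.0) -> float:
    """Raise the KL coefficient when OOD detections exceed the budget, else relax it."""
    return min(beta_max, max(beta_min, beta + step_size * (ood_rate - ood_budget)))
```

Note how an action with a high mean but almost no data is excluded outright rather than merely discounted, and how an uncertain high-mean action loses to a well-supported modest one.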

7 · The dose–toxicity trade-off — the central knob

Intuition. Strip everything away and one tension remains: push the dose up and you gain efficacy but raise the probability of crossing into toxicity; the constraint budget ε caps how much expected toxicity you will tolerate. In the napkin model of that trade-off, a patient's response shifts the dose–efficacy curve left or right (the unobserved η); the cost channel and the CBF shield are what keep the expected toxicity under ε as you turn the dose up.

8 · Iterate — the through-line

Every section is one row of the MDP table turned into a mechanism, in the order the difficulties bite: dirty high-dimensional state → align + β-VAE compression + an explicit missing-mask; fatal continuous action → a structural squash into the safe interval (no penalty term); per-patient shift → a Bayes-adaptive belief over η; delayed, hackable reward → a cost critic with hindsight counterfactuals plus a drift detector; never-explore transition → CBF shielding, doubly-robust offline evaluation, and OOD-aware conservatism. You never reached for a tool until a row of the table demanded it — and because a regulator reads the result, each tool also carries its own explanation and audit log. That discipline is the whole track.

Further considerations

Cross-hospital domain shift. Reference ranges and device versions differ between sites. Add a reference-range z-score plus a hospital-ID embedding so the network learns the domain-offset coefficient, and apply unsupervised domain alignment (e.g. CORAL) so a policy trained at a tertiary center can be deployed zero-shot at a smaller hospital with a small AUC drop. For evaluation, transfer-OPE with domain-adaptive importance sampling carries source-center knowledge to the target center with a finite-sample error bound.
Survival-curve rewards. When the outcome is a survival curve rather than a scalar, generalize OPE to continuous-time survival: fit a neural Cox model λ(t|s,a) and integrate over time to get expected survival benefit, instead of collapsing the endpoint to one number.
Multi-objective conflict. Jointly optimizing efficacy and toxicity is a Pareto problem; gradient-surgery methods (e.g. PCGrad) stop the toxicity signal from being drowned by the efficacy signal, and a Pareto-stable region selects the Lagrange/weight vector without hand-tuning.
Explainability for review. Ship a white-box backup policy (an ε-optimal decision tree whose value differs from the deep policy by <1% under the same state distribution), SHAP top-5 feature attributions, and a conformal-prediction interval per recommended dose; if the interval exceeds the clinical tolerance (e.g. ±0.5 U/h for insulin), force a downgrade to the white-box policy and log the uncertainty event.
Federated learning. To keep raw data inside each hospital, aggregate random-effects priors via federated averaging and fine-tune locally per patient — expanding the prior sample over time without any record leaving the institution. Exchange only constraint gradients under differential privacy and secure aggregation when collaborating across sites.
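
The conformal downgrade rule from the explainability item can be sketched with split conformal prediction. It assumes a held-out calibration set of dose residuals (observed minus predicted) is available; the policy labels and function names are hypothetical.

```python
import math
from typing import Sequence


def conformal_half_width(residuals: Sequence[float], alpha: float) -> float:
    """Split-conformal half-width: the ceil((n+1)(1-alpha))-th smallest |residual|.

    Returns infinity when the calibration set is too small for the requested level.
    """
    n = len(residuals)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return math.inf
    return sorted(abs(r) for r in residuals)[rank - 1]


def choose_policy(half_width: float, tolerance: float) -> str:
    """Downgrade to the white-box policy when the interval exceeds the tolerance."""
    return "white_box" if half_width > tolerance else "deep"
```

With the insulin example above, a tolerance of 0.5 U/h means any recommendation whose interval is wider than ±0.5 is served by the white-box policy, and the event is logged.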