Industrial control & parameter tuning
A rolling mill, a battery sorter, a servo motor — physical plants where a bad action overheats a coil, scraps a part, or trips a safety relay. Game AI could afford a billion throwaway episodes; a plant cannot afford one unsafe one. So the binding difficulties of this domain are different: real samples are scarce and expensive, some constraints must hold on every single step, sensors arrive late, and the plant ages out from under the policy. Each one names a tool — and the recurring move is to start from the controller the plant already trusts (PID/MPC) and let RL correct it, never replace it cold.
1 · Formulate — the MDP behind a tuned plant
Intuition. A controller reads sensors, sets actuators, and the plant responds; "good control" means the output tracks a setpoint while staying safe and efficient. That is already an MDP: the sensor vector is the state, the actuator command is the action, "close to setpoint, low energy, no violation" is the reward, and the plant physics is the transition. Every difficulty below is one of these four pieces being awkward in a way a video game never was.
| Piece | For a heat-line / servo plant | The awkward part |
|---|---|---|
| State S | joint angles, currents, temperatures, residual = (physics − twin), plus the posterior mean/var of unknown dynamics params | sensors arrive late (e.g. 100 ms vision) → the observed state lags the true state |
| Action A | continuous actuator command, or a correction Δθ to the controller's gains, bounded by physical limits | limits are hard: mass > 0, temperature ≤ Tmax — illegal actions damage hardware, not just score |
| Reward R | −‖output − setpoint‖² with terms for energy and for parameter jitter | multi-objective (energy vs throughput) and entangled with a cost budget that may not be exceeded |
| Transition P | plant physics; in practice a digital-twin simulator + the real machine | non-stationary (the plant ages) and sample-scarce (real rollouts cost product and time) |
Same trick as lesson 67: the rightmost column is the lesson. The plant adds a fifth piece that game AI never had, a separate cost signal cₜ with a hard budget d, and that single addition reshapes every mechanism below.
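A minimal sketch of how the reward and the separate cost signal could be kept apart per step. The names `step_signal`, `w_energy`, `w_jitter` and the weight values are illustrative assumptions, not taken from the source; the cost here is a simple over-temperature indicator.

```python
from dataclasses import dataclass


@dataclass
class StepSignal:
    reward: float  # tracking, energy and jitter terms (to be maximized)
    cost: float    # c_t: 1.0 if this step violated the hard limit, else 0.0


def step_signal(output, setpoint, energy, gain_delta, temp, t_max,
                w_energy=0.01, w_jitter=0.1):
    """Return the reward and the separate cost for one control step."""
    tracking = -sum((y - r) ** 2 for y, r in zip(output, setpoint))
    jitter = sum(d ** 2 for d in gain_delta)
    reward = tracking - w_energy * energy - w_jitter * jitter
    cost = 1.0 if temp > t_max else 0.0
    return StepSignal(reward, cost)


def within_budget(costs, d):
    """Check the episode-level budget: the summed cost must not exceed d."""
    return sum(costs) <= d
```

Keeping `cost` out of `reward` is the point: the reward can be traded off, while the cost is checked against the budget d on its own.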
2 · Diagnose — three properties, all rooted in cost
Intuition. Strip away the jargon and the plant is hard for three linked reasons. First, you cannot explore freely: a random action can scrap a part or trip a relay, so pure from-scratch RL that needs hundreds of thousands of unsafe steps is a non-starter. Second, some limits must never be crossed even once — "constraint satisfied in expectation" is not good enough when the constraint is "the motor must not exceed 80 °C." Third, the plant drifts: bearings wear, batteries age, so a policy that was optimal at commissioning slowly becomes wrong.
Engineering detail. These map to three concrete framings the source builds the whole chapter on. (a) Sample scarcity → warm-start from a known controller and learn mostly offline from logged data, so the real machine sees as few novel actions as possible. (b) Hard safety → a Constrained MDP with a per-step cost cₜ and an inviolable budget, enforced not by a soft penalty alone but by a projection / shield that makes unsafe actions literally unreachable. (c) Non-stationarity → online change-point detection plus meta-learning over an explicit distribution of "aging tasks." Sections 3–5 engineer exactly these three and nothing more.
3 · Engineer (a) — warm-start from PID, learn the residual
Intuition. The plant already runs a hand-tuned PID loop that is decent, just not optimal. Throwing it away and letting RL flail for 800k unsafe steps is wasteful and dangerous. Instead, let the PID drive, and let RL learn only the small correction on top — starting from "do nothing extra" and easing the correction in as confidence grows. You inherit the PID's safety on day one and spend RL's sample budget only on the gap to optimal.
Engineering detail. Three layers. (1) Action prior: the policy output is added to uPID, and a blend gain k ramps from 0 to 1 so the handover is smooth — no step change the operators would feel. (2) Bounded exploration: for the first ~20% of episodes the RL term is clipped to ±Δ of the actuator range, with Δ on a linear schedule from 10% down to 0; afterwards exploration is handed to SAC's automatic temperature α, which tunes its own noise. (3) Deployment split: the PID runs hard real-time in C++ on a 500 µs cycle while the RL correction runs on a slower ~5 ms inference cycle, the two combined at the actuator. On a hot-strip-rolling dataset this cut convergence from ~800k steps (pure SAC) to ~120k steps for the same reward, and dropped early-training peak overshoot from ~18% to ~3% — inside the plant's safety spec.
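The first two layers can be sketched as follows. This is a scalar-actuator illustration under stated assumptions: the function names are hypothetical, the ramp for k is taken to be linear, and Δ is read as a fraction of the full actuator range.

```python
def blend_gain(step, ramp_steps):
    """Handover gain k: ramps linearly from 0 to 1 over ramp_steps, then stays at 1."""
    return min(1.0, max(0.0, step / ramp_steps))


def clip_fraction(episode, total_episodes, start=0.10, warm_frac=0.20):
    """Delta schedule: linear from `start` down to 0 over the first warm_frac of episodes.

    Returns None once the bounded phase is over (exploration is then left to SAC).
    """
    warm = warm_frac * total_episodes
    if episode >= warm:
        return None
    return start * (1.0 - episode / warm)


def residual_action(u_pid, u_rl, k, delta, u_min, u_max):
    """Combine the PID command with the (optionally clipped) RL correction."""
    if delta is not None:
        bound = delta * (u_max - u_min)
        u_rl = max(-bound, min(bound, u_rl))
    u = u_pid + k * u_rl
    # Never exceed the physical actuator limits.
    return max(u_min, min(u_max, u))
```

With k = 0 the plant sees the PID command unchanged, which is what makes the day-one behaviour identical to the incumbent controller.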
3b · When samples are still too few — go offline and conservative
Intuition. Often you may not run the real plant at all during training; all you have is logged data from the existing controller (and a Simulink twin). The danger of learning from a fixed dataset is delusional optimism: the value network assigns high value to actions never seen in the data, then the deployed policy chases them off a cliff. The fix is to be pessimistic about anything out-of-distribution.
Engineering detail. Use CQL-SAC: add a conservative regularizer to the Q-loss that pushes down the value of actions the policy would pick but the data never took, while pulling up the values actually logged. In the standard CQL form the loss is
$$\mathcal{L}(Q) = \alpha\Bigl(\mathbb{E}_{s\sim\mathcal{D}}\bigl[\log\textstyle\sum_{a}\exp Q(s,a)\bigr] - \mathbb{E}_{(s,a)\sim\mathcal{D}}\bigl[Q(s,a)\bigr]\Bigr) + \tfrac{1}{2}\,\mathbb{E}_{(s,a,s')\sim\mathcal{D}}\Bigl[\bigl(Q(s,a) - \hat{\mathcal{B}}^{\pi}\bar{Q}(s,a)\bigr)^{2}\Bigr]$$
with the regularizer weight α (not SAC's temperature) auto-tuned (start 1.0, early-stop on a KL threshold ~0.05). Practical guards from the source: a differentiable tanh+scale saturation layer on the action input that matches the simulator's clamp, killing out-of-distribution actions at the source; LayerNorm instead of BatchNorm to survive non-stationary covariate shift; and a gate review that fails the policy unless the out-of-distribution action share is <3% on a 5-fold time-series cross-validation. The shipped policy is even extracted to ~256 K-means cluster centers for table-lookup on the embedded controller.
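To make the conservative term concrete, here is a small numerical sketch of the push-down / pull-up penalty on a batch of Q-values. It assumes a finite set of candidate actions per state (for continuous actions these would be sampled), and the function names are illustrative.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def cql_penalty(q_candidates, q_logged):
    """Conservative term for one batch.

    q_candidates[i]: Q(s_i, a) for several candidate actions a at state s_i
    q_logged[i]:     Q(s_i, a_i) for the action actually logged at s_i
    """
    n = len(q_logged)
    push_down = sum(logsumexp(row) for row in q_candidates) / n
    pull_up = sum(q_logged) / n
    return push_down - pull_up


def cql_loss(td_errors, q_candidates, q_logged, alpha=1.0):
    """Bellman error plus alpha times the conservative penalty."""
    bellman = 0.5 * sum(e * e for e in td_errors) / len(td_errors)
    return bellman + alpha * cql_penalty(q_candidates, q_logged)
```

Raising the Q-value of an action that was never logged increases the penalty, which is exactly the pressure against delusional optimism described above.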
4 · Engineer (b) — the constraint must hold on every step
Intuition. Now the hard part. "Keep the motor under Tmax" is not a reward you can trade off — it is a wall. A Lagrangian penalty (subtract λ·cost from reward) only satisfies the constraint on average, which a safety auditor will reject: averages allow rare catastrophic violations. You want a mechanism that makes exceeding the limit geometrically impossible, then a second mechanism that catches model error at runtime.
Engineering detail — cast it as a CMDP, then project. Define a cost cₜ = 1 exactly on the (s,a) pairs an FMEA marks dangerous, with budget E[Σc] ≤ pmax·T. The training signal is the Lagrangian relaxation R′ = R − λ·C, with λ found by search so the constraint binds without degrading return. But for high safety-integrity loops (SIL≥3) the relaxation is not trusted alone — add a projection operator: after each update, any probability mass the policy puts on an unsafe action is redistributed onto safe actions in the same state, keeping Σπ = 1.
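For a discrete action set, the projection step could look like the sketch below. The source only says the unsafe mass is redistributed onto safe actions; proportional renormalization is one assumed choice, and `project_to_safe` is a hypothetical name.

```python
def project_to_safe(probs, safe_mask):
    """Move probability mass off unsafe actions and renormalize over safe ones."""
    n_safe = sum(1 for ok in safe_mask if ok)
    if n_safe == 0:
        raise ValueError("no safe action in this state; fall back to the baseline controller")
    safe_mass = sum(p for p, ok in zip(probs, safe_mask) if ok)
    if safe_mass == 0.0:
        # The policy put all its mass on unsafe actions: spread it uniformly over safe ones.
        return [1.0 / n_safe if ok else 0.0 for ok in safe_mask]
    return [p / safe_mass if ok else 0.0 for p, ok in zip(probs, safe_mask)]
```

After the projection every unsafe action has probability exactly zero, so it cannot be sampled, and the probabilities still sum to 1.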
When the constraint is a smooth physical model you can do better than redistribution and project analytically. If temperature is T(s,a) = T₀ + κ‖a‖², the constraint c(s,a) = κ‖a‖² + T₀ − Tmax ≤ 0 is convex in a, so a closed-form, differentiable scaling layer clamps the policy mean μθ(s) onto the safe ball before it is ever issued:
$$a_{\text{safe}} = \mu_\theta(s)\cdot\min\!\left(1,\ \frac{\sqrt{(T_{\max}-T_0)/\kappa}}{\lVert\mu_\theta(s)\rVert}\right)$$
This layer is analytically differentiable, so gradients still flow to μθ and the policy-gradient estimator stays unbiased (reparameterize ã = asafe + σ·ε), yet ‖asafe‖² ≤ (Tmax−T₀)/κ by construction — the temperature can never exceed the limit. In practice this hard-constraint version costs only ~3% of return versus the unconstrained policy, while its training curve nearly overlaps.
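A plain-Python sketch of the scaling layer, using the temperature model T(s,a) = T₀ + κ‖a‖² from the text. The function names are illustrative; in training this would be written in an autodiff framework so gradients can flow through it.

```python
import math


def project_to_safe_ball(a, t0, t_max, kappa):
    """Scale action vector a so that t0 + kappa * ||a||^2 <= t_max."""
    radius = math.sqrt(max(0.0, (t_max - t0) / kappa))
    norm = math.sqrt(sum(x * x for x in a))
    if norm <= radius:
        return list(a)          # already inside the safe ball: unchanged
    scale = radius / norm       # shrink onto the boundary, keeping direction
    return [scale * x for x in a]


def predicted_temperature(a, t0, kappa):
    """Temperature implied by the quadratic model for action a."""
    return t0 + kappa * sum(x * x for x in a)
```

For example, with T₀ = 20, Tmax = 80 and κ = 15 the safe radius is 2, so the action (3, 4) is scaled to (1.2, 1.6) and lands on the Tmax boundary.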
4b · Sensor delay — the state you act on is stale
Intuition. Vision arrives ~100 ms late; by the time the agent "sees" the part, the conveyor has moved. Acting on a stale observation is like steering a car from a photo taken a second ago. The remedy is to give the agent enough recent history to reconstruct the true present state.
Engineering detail. Measure the delay's 95th percentile and use it as a window of L steps, then feed an information state φₜ = (oₜ, a_{t−L}, …, a_{t−1}), the late observation plus the L actions taken since, of fixed dimension dim_o + L·dim_a. For short delays (τ≤3 steps) a Transformer encoder over φₜ produces a compensated state zₜ that drops straight into the SAC policy. For large delays or high safety levels, learn a recurrent state-space model (PlaNet-style) and fine-tune with MBPO to push accumulated error <1%. A custom DelayBuffer keeps the value backup aligned with the delay so off-policy methods stay correct. On a battery-sorting line this lifted grasp success from 89% to 97% at ~1.8 ms inference, inside the PLC tick. For out-of-order packets over 5G, add receive timestamps and lean on the Transformer's positional encoding so the network knows which frame is newest and never reverses causality.
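A minimal sketch of assembling the information state from a late observation and the recent actions. The class name `InfoState` and the zero-padding before the first actions are assumptions for illustration; this is not the source's DelayBuffer.

```python
from collections import deque


class InfoState:
    """Keeps the last `delay_steps` actions and builds phi_t as a flat vector."""

    def __init__(self, delay_steps, dim_a):
        # Start with zero actions so phi_t has a fixed length from the first step.
        self.actions = deque([[0.0] * dim_a for _ in range(delay_steps)],
                             maxlen=delay_steps)

    def record_action(self, action):
        """Store the action just issued; the oldest one falls out of the window."""
        self.actions.append(list(action))

    def build(self, delayed_obs):
        """Concatenate the late observation with the actions taken since."""
        phi = list(delayed_obs)
        for a in self.actions:
            phi.extend(a)
        return phi
```

Usage: call `record_action` after every actuator command and `build` whenever a (late) observation arrives; the result is what the encoder or policy consumes.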
5 · Engineer (c) — the plant ages, so the policy must adapt
Intuition. Bearings wear, batteries fade, friction creeps. A policy frozen at commissioning slowly drifts off-optimal. Two jobs: notice the change fast, and pre-train a policy that adapts to it in a handful of steps without forgetting what it already knew.
Engineering detail — detect, then adapt. Run an online change-point detector on the state distribution: compute a corrected Jensen–Shannon divergence Jₜ between recent and reference distributions, feed it to an adaptive CUSUM, Sₜ = max(0, Sₜ₋₁ + Jₜ − λσ̂J), and trigger when Sₜ exceeds a threshold h calibrated by Monte-Carlo to a target false-alarm rate (ARL₀ ≈ 10⁴ → false-positive <0.05%). Crucially, a trigger does not immediately rewrite the policy — it opens a 10-step confirmation window; only a sustained breach is "real drift," a transient one is logged and ignored. On a production line this gave ~6.8-step (~400 ms) detection latency, 0% miss over six months, 0.04% false alarms, and <18% edge-CPU.
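The detect half can be sketched as below. Assumptions to note: the divergence is the plain Jensen–Shannon divergence between two histograms (the source's correction is not specified), σ̂J is passed in as a fixed value rather than adapted online, and "sustained breach" is read as the statistic staying above h for the whole confirmation window.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (natural log)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


class DriftDetector:
    """CUSUM on J_t with a confirmation window before drift is declared."""

    def __init__(self, h, lam, sigma_j, confirm_steps=10):
        self.h, self.lam, self.sigma_j = h, lam, sigma_j
        self.confirm_steps = confirm_steps
        self.s = 0.0          # CUSUM statistic S_t
        self.breach_run = 0   # consecutive steps with S_t > h

    def update(self, j_t):
        """Feed one divergence value; return True only on confirmed drift."""
        self.s = max(0.0, self.s + j_t - self.lam * self.sigma_j)
        self.breach_run = self.breach_run + 1 if self.s > self.h else 0
        return self.breach_run >= self.confirm_steps
```

A single spike raises S above h once and is then forgotten, while a persistent shift keeps the run counter growing until the window is full.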
For the adapt half, pre-train with MAML over an aging-task distribution: discretize wear into bands (e.g. by a battery-life standard: light α≥0.8, medium 0.6≤α<0.8, heavy α<0.6), then sample within each band from a truncated Gaussian with ~10% overlap between adjacent bands so the policy never hard-switches at a boundary. Over-weight the heavy-aging band (~50% of the task batch) so the meta-policy does not bias toward healthy operation, and accelerate aging ~30× in simulation, then close the sim-to-real gap on ~48 h of field data via an MMD loss until the drift distributions differ by <0.05.
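The task distribution for the adapt half might be sampled like this. The band edges follow the text; the lower edge of the heavy band (0.3), the 25/25/50 batch shares for light/medium/heavy, the Gaussian width, and the way the ~10% overlap is applied (widening each band by 10% of its width on both sides) are all assumptions for illustration.

```python
import random

# band name -> (low, high, share of the task batch); health fraction alpha
BANDS = {
    "light": (0.8, 1.0, 0.25),
    "medium": (0.6, 0.8, 0.25),
    "heavy": (0.3, 0.6, 0.50),   # over-weighted so the meta-policy sees enough wear
}


def sample_health(band, rng, overlap=0.10):
    """Draw a health fraction from a truncated Gaussian over the widened band."""
    lo, hi, _ = BANDS[band]
    margin = overlap * (hi - lo)
    lo, hi = max(0.0, lo - margin), min(1.0, hi + margin)
    mu, sigma = (lo + hi) / 2, (hi - lo) / 4
    while True:  # rejection sampling implements the truncation
        x = rng.gauss(mu, sigma)
        if lo <= x <= hi:
            return x


def sample_task_batch(n, rng):
    """Sample n (band, health) tasks with the band shares defined in BANDS."""
    names = list(BANDS)
    weights = [BANDS[b][2] for b in names]
    picks = rng.choices(names, weights=weights, k=n)
    return [(b, sample_health(b, rng)) for b in picks]
```

Because adjacent bands overlap slightly, a plant sitting near α = 0.6 appears in both the medium and heavy task sets rather than on a hard boundary.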
6 · The central tension — energy vs throughput
Intuition. The economic reward is multi-objective: run the line hot and fast (high throughput, high energy) or cool and slow (low energy, low throughput). There is no single right answer — there is a Pareto frontier, and the business picks a point on it via a weight, possibly one that changes hour to hour with the electricity price. Engineering this means a single policy that can be re-aimed at a new trade-off without retraining.
Engineering detail. Cast it as a Constrained Multi-Objective MDP: reward vector r = (r₀, r₁, …, rₘ) with r₀ the main objective and the rest costs bounded by E[Σγᵗrᵢ] ≤ dᵢ. A shared encoder feeds an actor conditioned on the preference vector θ and a multiplier network λψ(θ); the actor uses PPO-clip on the main advantage combined with the λᵢ-weighted cost advantages (entering as penalties, as in R′ = R − λ·C), and λ updates on a slow timescale (learning rate ~1/10 of the actor's). A differentiable projection layer (or, for discrete actions, masking violating logits to −∞) keeps every sampled action feasible. At inference you swap θ live to move along the frontier with zero retraining and <5 ms latency.
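Three of the pieces named above (logit masking, the multiplier-weighted advantage, and the slow multiplier update) can be sketched in a few lines. The function names are hypothetical, the cost advantages are assumed to enter with a negative sign, and the multiplier update is a plain projected dual-ascent step rather than the source's multiplier network.

```python
import math


def masked_softmax(logits, feasible):
    """Softmax over logits with infeasible actions masked to -inf."""
    masked = [l if ok else -math.inf for l, ok in zip(logits, feasible)]
    m = max(masked)
    if m == -math.inf:
        raise ValueError("no feasible action")
    exps = [math.exp(l - m) for l in masked]
    z = sum(exps)
    return [e / z for e in exps]


def combined_advantage(adv_main, adv_costs, lambdas):
    """Main advantage minus the multiplier-weighted cost advantages."""
    return adv_main - sum(l * a for l, a in zip(lambdas, adv_costs))


def dual_step(lambdas, cost_returns, budgets, actor_lr, ratio=0.1):
    """Slow-timescale multiplier update: grow lambda_i while cost i exceeds its budget."""
    lr = ratio * actor_lr
    return [max(0.0, l + lr * (jc - d))
            for l, jc, d in zip(lambdas, cost_returns, budgets)]
```

A masked action gets probability exactly zero, so feasibility does not depend on the multipliers having converged.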
7 · Guard in production — detection, monitoring, fallback
Intuition. Everything above can still fail at runtime: the twin mismatches reality, a sensor jumps, the plant drifts past what the meta-policy saw. Production safety is a layered net that assumes the policy will sometimes be wrong and stays safe anyway.
Engineering detail. Four layers. (1) Zero-order shield: at the edge, if the measured temperature exceeds Tmax−δ (δ≈2 °C margin), truncate the action to the safe boundary immediately, with no gradient involved, and write an alarm log. Because the margin was baked into training, this rarely fires. (2) Pre-deployment HIL replay: before release, replay ~100 h of real logs on a hardware-in-the-loop bench and require at least a 5% gain in end-to-end task success and an inference-latency increase of <2 ms relative to the incumbent. (3) Stress test + distribution check: run ~24 h and pass only if the state-trajectory KL versus the offline data is <0.01 with zero safety violations, which is the bar for a functional-safety sign-off. (4) Drift monitor + fallback: the CUSUM of §5 watches live; on confirmed drift the policy degrades to MPC rolling optimization with action amplitude clamped (<30%) while it fine-tunes, then resumes.
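Layer (1) and the release thresholds of layers (2) and (3) are simple enough to sketch directly. This assumes a scalar action and a known safe amplitude `safe_limit` (a hypothetical parameter), and treats the 5% success gain as a fraction; the function names are illustrative.

```python
def shield(action, temp, t_max, safe_limit, margin=2.0, log=None):
    """Zero-order shield: clamp the action once temperature enters the margin."""
    if temp > t_max - margin:
        clamped = max(-safe_limit, min(safe_limit, action))
        if log is not None:
            log.append({"event": "shield", "temp": temp,
                        "requested": action, "issued": clamped})
        return clamped
    return action


def passes_release_gate(success_gain, latency_increase_ms, state_kl, violations):
    """Combine the HIL-replay and stress-test thresholds into one pass/fail check."""
    return (success_gain >= 0.05
            and latency_increase_ms < 2.0
            and state_kl < 0.01
            and violations == 0)
```

The shield involves no learning and no model, which is why it can sit at the edge and run inside the control tick.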
Further considerations
- Quantify model uncertainty, don't just point-estimate. Model the transition with a Gaussian process (ARD squared-exponential kernel); its predictive covariance Σ* gives a per-point std σ(s,a) you can feed three ways: as an exploration bonus α·σ̄, as an active-sampling target (max differential entropy), or as a 95% confidence ellipsoid for robust MPC (optimize max_π min_{f ∈ ellipsoid} J). One GP fit gives analytic uncertainty with O(n²) rank-one updates, versus ~20 forward passes for MC-Dropout, which can cost ~30 ms on an automotive chip while a sparse GP answers in ~3 ms. Use FITC/inducing-point sparsification to hit 50–1000 Hz control cycles.
- Auto-calibrate the digital twin. Treat parameter calibration as its own MDP: action Δθ bounded to physical limits via an invertible transform (so mass>0, damping≥0 hold automatically), reward −‖y_real − y_twin‖²_Q (the Q-weighted squared error) with jitter and Bayesian-prior regularizers. Pre-train in a shadow twin with domain randomization, then run Bayesian SAC with deep-kernel GP residuals on the real device. This drove a deployment mechanism's modal-frequency error from ~30% to <2% in ~120 s while cutting motor power ~18%.
- Compare fairly against the MPC baseline. RL only "wins" if its expected return beats MPC with non-overlapping confidence intervals; if they overlap, compare the worst-5% trajectory return (CVaR) and the constraint-violation rate. Under model mismatch (e.g. 20% parameter error), expect RL to trade some mean return for robustness — report which you value. Archive code, seeds, and hardware logs so a third party can reproduce the result; reproducibility is what makes "fair" credible.
- Select a deployable policy off the Pareto frontier robustly. Don't ship the nominal "best" point. Score each candidate by its worst case over the plausible weight interval minus an uncertainty penalty, Score(π) = min_{w ∈ [w_min, w_max]} w·μ − κσ (κ≈1.96), drop near-duplicate "twin" policies (cosine>0.95, low hypervolume contribution), and canary-release 5%→15%→50% with automatic rollback if the weighted KPI falls >2% below baseline or the drift monitor fires.
- Treat explainability and audit as a requirement, not a nicety. Keep the PID/MPC as a written "prior rule," and report how much the RL term corrects it (e.g. via a symbolic-regression layer that fits the optimal-parameter trajectory to an if-else+polynomial form) so a "why was this parameter changed?" query has a traceable answer that can be written into a process-change order and rolled back.
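The robust selection rule from the Pareto-frontier bullet above can be sketched for the two-objective case. Here w·μ is interpreted as w·μ₀ + (1−w)·μ₁ and σ as the standard deviation of the scalarized return; both readings, and the function names, are assumptions for illustration.

```python
def robust_score(mu, sigma, w_min, w_max, kappa=1.96):
    """Worst-case scalarized return over [w_min, w_max] minus an uncertainty penalty.

    mu = (mu_0, mu_1): mean returns on the two objectives.
    """
    def scalarized(w):
        return w * mu[0] + (1.0 - w) * mu[1]

    # The scalarization is linear in w, so the minimum sits at an endpoint.
    return min(scalarized(w_min), scalarized(w_max)) - kappa * sigma


def select_policy(candidates, w_min, w_max, kappa=1.96):
    """Pick the candidate name with the highest robust score.

    candidates: dict mapping name -> (mu, sigma).
    """
    return max(candidates,
               key=lambda n: robust_score(candidates[n][0], candidates[n][1],
                                          w_min, w_max, kappa))
```

A policy that is excellent on one objective but poor on the other can lose to a balanced, lower-variance policy once the weight interval and the penalty are taken into account.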