Industrial control & parameter tuning

A rolling mill, a battery sorter, a servo motor — physical plants where a bad action overheats a coil, scraps a part, or trips a safety relay. Game AI could afford a billion throwaway episodes; a plant cannot afford one unsafe one. So the binding difficulties of this domain are different: real samples are scarce and expensive, some constraints must hold on every single step, sensors arrive late, and the plant ages out from under the policy. Each one names a tool — and the recurring move is to start from the controller the plant already trusts (PID/MPC) and let RL correct it, never replace it cold.

The method — five steps, every lesson

Applied RL is the same loop in every domain. (1) Formulate the MDP — state, action, reward, transition, horizon. (2) Diagnose the property that makes this MDP hard. (3) Engineer the mechanism that removes that difficulty. (4) Guard it in production — detect when it breaks and fall back. (5) Iterate. In game AI the binding difficulty was a hybrid action space; here it is cost: every sample and every constraint violation is paid for in scrapped product, downtime, or a failed safety certification.

1 · Formulate — the MDP behind a tuned plant

Intuition. A controller reads sensors, sets actuators, and the plant responds; "good control" means the output tracks a setpoint while staying safe and efficient. That is already an MDP: the sensor vector is the state, the actuator command is the action, "close to setpoint, low energy, no violation" is the reward, and the plant physics is the transition. Every difficulty below is one of these four pieces being awkward in a way a video game never was.

MDP = (S, A, R, P, γ) · maximize E[ Σₜ γᵗ rₜ ] subject to E[ Σₜ γᵗ cₜ ] ≤ d

Piece	For a heat-line / servo plant	The awkward part
State S	joint angles, currents, temperatures, residual = (physics − twin), plus the posterior mean/var of unknown dynamics params	sensors arrive late (e.g. 100 ms vision) → the observed state lags the true state
Action A	continuous actuator command, or a correction Δθ to the controller's gains, bounded by physical limits	limits are hard: mass > 0, temperature ≤ T_max — illegal actions damage hardware, not just score
Reward R	−‖output − setpoint‖² with terms for energy and for parameter jitter	multi-objective (energy vs throughput) and entangled with a cost budget that may not be exceeded
Transition P	plant physics; in practice a digital-twin simulator + the real machine	non-stationary (the plant ages) and sample-scarce (real rollouts cost product and time)

Same trick as lesson 67: the rightmost column is the lesson. The plant also adds a fifth row that game AI never had, a separate cost signal cₜ with a hard budget d, and that single addition reshapes every mechanism below.
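To make the constrained objective concrete, here is a minimal sketch that evaluates the discounted return and discounted cost of one logged episode and checks the cost against the budget d. The function names, the toy reward and cost sequences, and the budget values are illustrative assumptions. The objective is stated in expectation, so a single episode is only one sample of it.

```python
def discounted_sum(values, gamma):
    """Return sum over t of gamma**t * values[t]."""
    total = 0.0
    for t, v in enumerate(values):
        total += (gamma ** t) * v
    return total


def within_budget(costs, gamma, d):
    """True if the discounted cost of this episode does not exceed d."""
    return discounted_sum(costs, gamma) <= d


# Toy episode (assumed numbers): tracking error shrinks, one flagged step.
rewards = [-1.0, -0.5, -0.25, -0.1]
costs = [0, 0, 1, 0]
gamma = 0.9

episode_return = discounted_sum(rewards, gamma)
episode_cost = discounted_sum(costs, gamma)
print(episode_return, episode_cost, within_budget(costs, gamma, d=1.0))
```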

2 · Diagnose — three properties, all rooted in cost

Intuition. Strip away the jargon and the plant is hard for three linked reasons. First, you cannot explore freely: a random action can scrap a part or trip a relay, so pure from-scratch RL that needs hundreds of thousands of unsafe steps is a non-starter. Second, some limits must never be crossed even once — "constraint satisfied in expectation" is not good enough when the constraint is "the motor must not exceed 80 °C." Third, the plant drifts: bearings wear, batteries age, so a policy that was optimal at commissioning slowly becomes wrong.

Engineering detail. These map to three concrete framings the source builds the whole chapter on. (a) Sample scarcity → warm-start from a known controller and learn mostly offline from logged data, so the real machine sees as few novel actions as possible. (b) Hard safety → a Constrained MDP with a per-step cost cₜ and an inviolable budget, enforced not by a soft penalty alone but by a projection / shield that makes unsafe actions literally unreachable. (c) Non-stationarity → online change-point detection plus meta-learning over an explicit distribution of "aging tasks." Sections 3–5 engineer exactly these three and nothing more.

3 · Engineer (a) — warm-start from PID, learn the residual

Intuition. The plant already runs a hand-tuned PID loop that is decent, just not optimal. Throwing it away and letting RL flail for 800k unsafe steps is wasteful and dangerous. Instead, let the PID drive, and let RL learn only the small correction on top — starting from "do nothing extra" and easing the correction in as confidence grows. You inherit the PID's safety on day one and spend RL's sample budget only on the gap to optimal.

u = u_PID + k·u_RL , k: 0 → 1 over training | explore: u_RL clipped to ±Δ, Δ: 10% → 0% of range

Engineering detail. Three layers. (1) Action prior: the policy output is added to u_PID, and a blend gain k ramps from 0 to 1 so the handover is smooth — no step change the operators would feel. (2) Bounded exploration: for the first ~20% of episodes the RL term is clipped to ±Δ of the actuator range, with Δ on a linear schedule from 10% down to 0; afterwards exploration is handed to SAC's automatic temperature α, which tunes its own noise. (3) Deployment split: the PID runs hard real-time in C++ on a 500 µs cycle while the RL correction runs on a slower ~5 ms inference cycle, the two combined at the actuator. On a hot-strip-rolling dataset this cut convergence from ~800k steps (pure SAC) to ~120k steps for the same reward, and dropped early-training peak overshoot from ~18% to ~3% — inside the plant's safety spec.
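The first two layers can be sketched in a few lines. This is a minimal illustration, not the plant's implementation: the function names are hypothetical, and it assumes the ±Δ bound applies to the exploration perturbation added to the policy's mean correction. After the bounded phase the schedule returns 0 and exploration would come from SAC's own entropy term, which is not modeled here.

```python
def clip(x, lo, hi):
    return max(lo, min(hi, x))


def blend_gain(step, ramp_steps):
    """Blend gain k: linear ramp from 0 to 1 over ramp_steps, then held at 1."""
    return clip(step / ramp_steps, 0.0, 1.0)


def explore_halfwidth(episode, total_episodes, act_range,
                      start_frac=0.10, phase_frac=0.20):
    """Delta: linear from start_frac * act_range down to 0 over the first
    phase_frac of training; 0 afterwards (bounded phase is over)."""
    phase_end = phase_frac * total_episodes
    if episode >= phase_end:
        return 0.0
    return start_frac * act_range * (1.0 - episode / phase_end)


def combined_command(u_pid, u_rl_mean, noise, k, delta, u_min, u_max):
    """u = u_PID + k * u_RL, with the exploration noise clipped to +/- delta
    and the final command clipped to the actuator limits."""
    u_rl = u_rl_mean + clip(noise, -delta, delta)
    return clip(u_pid + k * u_rl, u_min, u_max)


# At the very start (k = 0) the plant sees the PID command unchanged.
print(combined_command(0.5, 0.1, 5.0, blend_gain(0, 100), 0.2, -1.0, 1.0))
```

With k = 0 the combined command equals the PID output regardless of what the policy proposes, which is the "inherit the PID's safety on day one" property.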

3b · When samples are still too few — go offline and conservative

Intuition. Often you may not run the real plant at all during training; all you have is logged data from the existing controller (and a Simulink twin). The danger of learning from a fixed dataset is delusional optimism: the value network assigns high value to actions never seen in the data, then the deployed policy chases them off a cliff. The fix is to be pessimistic about anything out-of-distribution.

Engineering detail. Use CQL-SAC: add a conservative regularizer to the Q-loss that pushes down the value of actions the policy would pick but the data never took, while pulling up the values actually logged —

L_Q = L_Bellman + α·( E_s∼D[ log Σ_a exp Q(s,a) ] − E_(s,a)∼D[ Q(s,a) ] )

with α auto-tuned (start 1.0, early-stop on a KL threshold ~0.05). Practical guards from the source: a differentiable tanh+scale saturation layer on the action input that matches the simulator's clamp, killing out-of-distribution actions at the source; LayerNorm instead of BatchNorm to survive non-stationary covariate shift; and a gate review that fails the policy unless the out-of-distribution action share is <3% on a 5-fold time-series cross-validation. The shipped policy is even extracted to ~256 K-means cluster centers for table-lookup on the embedded controller.
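A minimal sketch of the conservative term, assuming a small discrete action grid so the log-sum-exp can be computed exactly (with continuous SAC actions it would be estimated by sampling). The function names and the toy Q-values are illustrative. The penalty pushes down the soft maximum over all actions and pulls up the Q-value of the action that was actually logged.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def cql_penalty(q_rows, logged_actions):
    """Batch mean of logsumexp_a Q(s,a) - Q(s, a_logged).

    q_rows[i] holds Q(s_i, a) for every action on the grid;
    logged_actions[i] is the index of the action taken in the dataset."""
    total = 0.0
    for q, a in zip(q_rows, logged_actions):
        total += logsumexp(q) - q[a]
    return total / len(q_rows)


def cql_loss(bellman_loss, q_rows, logged_actions, alpha=1.0):
    return bellman_loss + alpha * cql_penalty(q_rows, logged_actions)


# The second critic overvalues an action the data never took (index 2).
honest = [[1.0, 0.5, 0.0]]
optimistic = [[1.0, 0.5, 4.0]]
print(cql_penalty(honest, [0]), cql_penalty(optimistic, [0]))
```

The penalty is never negative, because the log-sum-exp is at least as large as any single Q-value, and it grows as the critic inflates unlogged actions.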

4 · Engineer (b) — the constraint must hold on every step

Intuition. Now the hard part. "Keep the motor under T_max" is not a reward you can trade off — it is a wall. A Lagrangian penalty (subtract λ·cost from reward) only satisfies the constraint on average, which a safety auditor will reject: averages allow rare catastrophic violations. You want a mechanism that makes exceeding the limit geometrically impossible, then a second mechanism that catches model error at runtime.

Engineering detail — cast it as a CMDP, then project. Define a cost cₜ = 1 exactly on the (s,a) pairs an FMEA marks dangerous, with budget E[Σc] ≤ p_max·T. The training signal is the Lagrangian relaxation R′ = R − λ·C, with λ found by search so the constraint binds without degrading return. But for high safety-integrity loops (SIL≥3) the relaxation is not trusted alone — add a projection operator: after each update, any probability mass the policy puts on an unsafe action is redistributed onto safe actions in the same state, keeping Σπ = 1.
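For a discrete action set, the projection step can be sketched as below. This assumes one particular redistribution rule, proportional renormalization over the safe actions. The source only requires that unsafe mass move onto safe actions with Σπ = 1, so the rule and the function name are assumptions.

```python
def project_to_safe(probs, safe_mask):
    """Zero the probability of unsafe actions and renormalize over safe ones.

    probs:     action probabilities in one state (sums to 1)
    safe_mask: True where the FMEA-derived cost is 0 for that action"""
    n_safe = sum(1 for ok in safe_mask if ok)
    if n_safe == 0:
        raise ValueError("no safe action available in this state")
    safe_mass = sum(p for p, ok in zip(probs, safe_mask) if ok)
    if safe_mass > 0.0:
        return [p / safe_mass if ok else 0.0 for p, ok in zip(probs, safe_mask)]
    # Policy put all mass on unsafe actions: fall back to uniform over safe ones.
    return [1.0 / n_safe if ok else 0.0 for ok in safe_mask]


projected = project_to_safe([0.5, 0.3, 0.2], [True, False, True])
print(projected)
```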

When the constraint is a smooth physical model you can do better than redistribution — project analytically. If temperature is T(s,a) = T₀ + κ‖a‖², the constraint c(s,a) = κ‖a‖² + T₀ − T_max ≤ 0 is convex in a, so a closed-form, differentiable scaling layer clamps the action onto the safe ball before it is ever issued:

a_safe = μ_θ(s) · min( 1 , √[(T_max−T₀)/κ] / ‖μ_θ(s)‖ )

This layer is analytically differentiable, so gradients still flow to μ_θ and the policy-gradient estimator stays unbiased (reparameterize ã = a_safe + σ·ε), yet ‖a_safe‖² ≤ (T_max−T₀)/κ by construction, so under this temperature model a_safe can never exceed the limit. One caveat: noise added after the projection can push ã outside the safe ball, so either add the noise before projecting or re-project ã before it reaches the actuator. In practice this hard-constraint version costs only ~3% of return versus the unconstrained policy, while its training curve nearly overlaps.
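The scaling layer is short enough to check numerically. The sketch below implements the formula for a_safe in plain Python and evaluates the temperature model T₀ + κ‖a‖² on the result. The function names and the numbers (T₀ = 20, T_max = 80, κ = 15) are made up for illustration, and the code assumes T_max > T₀.

```python
import math


def safe_radius(t_max, t0, kappa):
    """Largest action norm allowed by T0 + kappa * ||a||^2 <= T_max."""
    return math.sqrt((t_max - t0) / kappa)


def project_action(mu, t_max, t0, kappa):
    """Scale mu onto the safe ball; actions already inside are unchanged."""
    norm = math.sqrt(sum(x * x for x in mu))
    if norm == 0.0:
        return list(mu)
    scale = min(1.0, safe_radius(t_max, t0, kappa) / norm)
    return [scale * x for x in mu]


def temperature(a, t0, kappa):
    """The assumed smooth temperature model from the text."""
    return t0 + kappa * sum(x * x for x in a)


a_safe = project_action([3.0, 4.0], t_max=80.0, t0=20.0, kappa=15.0)
print(a_safe, temperature(a_safe, 20.0, 15.0))
```

Because the projection only rescales, the direction of the proposed action is preserved and only its magnitude is limited.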

Why "satisfied in expectation" is the wrong target

A soft Lagrangian or RCPO penalty drives E[cost] under budget, but the realized cost is a distribution — its tail can still breach the wall. Safety engineers will not certify an expectation. So lead with the hard mechanism (analytic projection, or a verifiable guard network), and only then discuss soft trade-offs. If the temperature model is non-convex (current coupled to high-order terms), analytic projection breaks down — fall back to a log-barrier added to the policy objective with a monotonic-improvement guarantee (line-search CPO), or a differentiable convex-optimization layer (OptNet) — but only after a worst-case solve-time test (e.g. <0.5 ms) proves it fits inside the PLC cycle.

4b · Sensor delay — the state you act on is stale

Intuition. Vision arrives ~100 ms late; by the time the agent "sees" the part, the conveyor has moved. Acting on a stale observation is like steering a car from a photo taken a second ago. The remedy is to give the agent enough recent history to reconstruct the true present state.

Engineering detail. Measure the delay's 95th percentile as a window L, then feed an information state φₜ = (oₜ, a_{t−L}, …, a_{t−1}), the late observation plus the L actions taken since, of fixed dimension dim_o + L·dim_a. For short delays (τ≤3 steps) a Transformer encoder over φₜ produces a compensated state zₜ that drops straight into the SAC policy. For large delays or high safety levels, learn a recurrent state-space model (PlaNet-style) and fine-tune with MBPO to push accumulated error <1%. A custom DelayBuffer keeps the value backup aligned with the delay so off-policy methods stay correct. On a battery-sorting line this lifted grasp success from 89% to 97% at ~1.8 ms inference, inside the PLC tick. For out-of-order packets over 5G, add receive timestamps and lean on the Transformer's positional encoding so the network knows which frame is newest and never reverses causality.
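A minimal sketch of the bookkeeping behind the information state: choose the window from measured delays, keep the most recent actions in a fixed-length buffer, and concatenate them with the late observation. The class and function names are hypothetical, the percentile uses a simple nearest-rank rule, and the encoder that would consume the vector is out of scope.

```python
from collections import deque


def delay_window(delays_ms, step_ms, q_percent=95):
    """Window L in control steps: nearest-rank percentile of the measured
    delays, rounded up to whole steps."""
    ordered = sorted(delays_ms)
    rank = max(1, -(-q_percent * len(ordered) // 100))  # integer ceiling
    return -(-ordered[rank - 1] // step_ms)             # integer ceiling


class InfoState:
    """Holds the last L actions and builds phi = (o, a_{t-L}, ..., a_{t-1})."""

    def __init__(self, window, dim_a):
        # Pre-filled with zeros so the vector has a fixed length from step 0.
        self.actions = deque([[0.0] * dim_a for _ in range(window)],
                             maxlen=window)

    def record_action(self, action):
        self.actions.append(list(action))  # oldest action drops out

    def build(self, delayed_obs):
        phi = list(delayed_obs)            # one late observation ...
        for a in self.actions:             # ... followed by L actions
            phi.extend(a)
        return phi


L = delay_window([100] * 19 + [250], step_ms=10)
state = InfoState(window=3, dim_a=1)
state.record_action([0.1])
state.record_action([0.2])
print(L, state.build([1.0, 2.0]))
```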

5 · Engineer (c) — the plant ages, so the policy must adapt

Intuition. Bearings wear, batteries fade, friction creeps. A policy frozen at commissioning slowly drifts off-optimal. Two jobs: notice the change fast, and pre-train a policy that adapts to it in a handful of steps without forgetting what it already knew.

Engineering detail — detect, then adapt. Run an online change-point detector on the state distribution: compute a corrected Jensen–Shannon divergence Jₜ between recent and reference distributions, feed it to an adaptive CUSUM, Sₜ = max(0, Sₜ₋₁ + Jₜ − λσ̂_J), and trigger when Sₜ exceeds a threshold h calibrated by Monte-Carlo to a target false-alarm rate (ARL₀ ≈ 10⁴ → false-positive <0.05%). Crucially, a trigger does not immediately rewrite the policy — it opens a 10-step confirmation window; only a sustained breach is "real drift," a transient one is logged and ignored. On a production line this gave ~6.8-step (~400 ms) detection latency, 0% miss over six months, 0.04% false alarms, and <18% edge-CPU.
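The detect half reduces to a few lines once the divergence Jₜ is available. The sketch below is a simplified illustration: it takes Jₜ as given, uses a fixed σ̂_J rather than an adaptive estimate, and interprets "sustained breach" as Sₜ staying above h for the whole confirmation window. The class name, the threshold, and the test streams are assumptions.

```python
class DriftDetector:
    """CUSUM on a divergence stream with a confirmation window."""

    def __init__(self, h, lam, sigma_j, confirm_steps=10):
        self.h = h
        self.lam = lam
        self.sigma_j = sigma_j
        self.confirm_steps = confirm_steps
        self.s = 0.0          # CUSUM statistic S_t
        self.breach_run = 0   # consecutive steps with S_t > h
        self.transients = 0   # breaches that ended before confirmation

    def update(self, j):
        """Feed one divergence value; return True once drift is confirmed."""
        self.s = max(0.0, self.s + j - self.lam * self.sigma_j)
        if self.s > self.h:
            self.breach_run += 1
        else:
            if self.breach_run > 0:
                self.transients += 1  # logged and ignored
            self.breach_run = 0
        return self.breach_run >= self.confirm_steps


def first_alarm(detector, stream):
    """1-based step at which drift is first confirmed, or None."""
    for step, j in enumerate(stream, start=1):
        if detector.update(j):
            return step
    return None


print(first_alarm(DriftDetector(1.0, 1.0, 0.1), [0.5] * 30))
```

A short spike raises Sₜ above h for a few steps and is counted as a transient, while a persistent shift keeps the run going until the window closes.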

For the adapt half, pre-train with MAML over an aging-task distribution: discretize wear into bands (e.g. by a battery-life standard: light α≥0.8, medium 0.6≤α<0.8, heavy α<0.6), then sample within each band from a truncated Gaussian with ~10% overlap between adjacent bands so the policy never hard-switches at a boundary. Over-weight the heavy-aging band (~50% of the task batch) so the meta-policy does not bias toward healthy operation, and accelerate aging ~30× in simulation, then close the sim-to-real gap on ~48 h of field data via an MMD loss until the drift distributions differ by <0.05.
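A sketch of how such a task distribution might be sampled. The band edges for light and medium follow the text; the lower edge of the heavy band (0.4), the Gaussian width, the way the 10% overlap is applied (each band widened by 10% of its width on both sides), and the 25/25/50 split are all assumptions made for illustration. Rejection sampling stands in for a proper truncated-normal sampler.

```python
import random

# Wear bands on the health factor; the 0.4 floor for "heavy" is assumed.
BANDS = {"light": (0.8, 1.0), "medium": (0.6, 0.8), "heavy": (0.4, 0.6)}
WEIGHTS = {"light": 0.25, "medium": 0.25, "heavy": 0.50}  # over-weight heavy
OVERLAP = 0.10  # fraction of band width added on each side


def widened(band):
    """Band edges after adding the overlap, kept inside [0, 1]."""
    lo, hi = BANDS[band]
    pad = OVERLAP * (hi - lo)
    return max(0.0, lo - pad), min(1.0, hi + pad)


def sample_task(rng):
    """Pick a band by weight, then draw a health factor from a Gaussian
    truncated to the widened band (rejection sampling)."""
    band = rng.choices(list(WEIGHTS), weights=list(WEIGHTS.values()))[0]
    lo, hi = widened(band)
    mean, std = (lo + hi) / 2.0, (hi - lo) / 4.0
    while True:
        health = rng.gauss(mean, std)
        if lo <= health <= hi:
            return band, health


rng = random.Random(0)
tasks = [sample_task(rng) for _ in range(4000)]
heavy_share = sum(1 for band, _ in tasks if band == "heavy") / len(tasks)
print(round(heavy_share, 2))
```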

Adapting without forgetting

Online fine-tuning risks catastrophic forgetting — adapting to today's wear erases yesterday's competence. Constrain the update: on drift, freeze the main network and fine-tune only the actor on the last ~1000 steps at a low LR (1e-4) for ~3 epochs under a KL constraint (δ≈0.01), and keep an EWC-style importance penalty so weights critical to past regimes barely move. Where rewards are sparse and delayed (intervals >10 min), relabel failed trajectories with hindsight to manufacture learnable signal and cut the forgetting rate further. The principle mirrors lesson 67's KL-rollback: bound how far each online update can move.
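The importance penalty can be sketched with the usual quadratic EWC form, in which each weight's movement away from its pre-drift value is scaled by an importance estimate (typically the diagonal Fisher information). The source does not spell out the exact form it uses, so the formula, the function names, and the strength value here are assumptions.

```python
def ewc_penalty(theta, theta_anchor, fisher, strength):
    """0.5 * strength * sum_i F_i * (theta_i - theta_anchor_i)**2."""
    return 0.5 * strength * sum(
        f * (w - w0) ** 2 for w, w0, f in zip(theta, theta_anchor, fisher)
    )


def finetune_loss(task_loss, theta, theta_anchor, fisher, strength=100.0):
    """Loss for the constrained online update: new-regime loss plus penalty."""
    return task_loss + ewc_penalty(theta, theta_anchor, fisher, strength)


anchor = [1.0, 1.0]
fisher = [10.0, 0.1]  # first weight matters far more to past regimes

move_important = ewc_penalty([1.1, 1.0], anchor, fisher, strength=1.0)
move_unimportant = ewc_penalty([1.0, 1.1], anchor, fisher, strength=1.0)
print(move_important, move_unimportant)
```

The same step size costs far more on the important weight, which is how the penalty steers adaptation toward weights that past regimes did not rely on.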

6 · The central tension — energy vs throughput

Intuition. The economic reward is multi-objective: run the line hot and fast (high throughput, high energy) or cool and slow (low energy, low throughput). There is no single right answer — there is a Pareto frontier, and the business picks a point on it via a weight, possibly one that changes hour to hour with the electricity price. Engineering this means a single policy that can be re-aimed at a new trade-off without retraining.

Engineering detail. Cast it as a Constrained Multi-Objective MDP: reward vector r = (r₀, r₁, …, r_m) with r₀ the main objective and the rest costs bounded by E[Σγᵗr_i] ≤ d_i. A shared encoder feeds an actor conditioned on the preference vector θ and a multiplier network λ_ψ(θ); the actor applies PPO-clip to a Lagrangian advantage, the main advantage minus Σ_i λ_i times the i-th cost advantage, and λ updates on a slow timescale (LR ~1/10 of the actor's). A differentiable projection layer (or, for discrete actions, masking violating logits to −∞) keeps every sampled action feasible. At inference you swap θ live to move along the frontier with zero retraining and <5 ms latency. The worked example below is the napkin math of this trade-off: it shows how the electricity price slides the optimal operating point.

Energy-vs-throughput: where does the optimal operating point sit?

Run the line at intensity x ∈ [0,1]. Throughput rises with x (with diminishing returns); energy cost rises faster (≈ quadratically). Net value = price·throughput − tariff·energy. As the electricity tariff climbs, the value-maximizing intensity slides down — and a fixed-gain controller, pinned at one x, leaves money on the table.

[Interactive widget. Inputs and defaults: electricity tariff 0.8 ¥/kWh, product price 4 ¥/unit, fixed-gain intensity x₀ = 0.85. Outputs: optimal x*, value at x*, value at fixed x₀, money left on table.]
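The same napkin math can be reproduced offline. The sketch below uses the default tariff, price, and x₀ listed above, but the curve shapes are assumptions: throughput is taken as a square root of intensity (diminishing returns) and energy as a quadratic, with made-up scale constants of 1 unit and 5 kWh at full intensity. The original curves are not given, so the numbers are illustrative only.

```python
def throughput(x, capacity=1.0):
    """Units produced; concave in x (assumed square-root form)."""
    return capacity * x ** 0.5


def energy(x, e_full=5.0):
    """kWh consumed; assumed quadratic in x."""
    return e_full * x * x


def net_value(x, price, tariff):
    """Net value = price * throughput - tariff * energy."""
    return price * throughput(x) - tariff * energy(x)


def optimal_intensity(price, tariff, steps=1000):
    """Grid search for the value-maximizing intensity on [0, 1]."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda x: net_value(x, price, tariff))


price, tariff, x0 = 4.0, 0.8, 0.85
x_star = optimal_intensity(price, tariff)
left_on_table = net_value(x_star, price, tariff) - net_value(x0, price, tariff)

for t in (0.4, 0.8, 1.6):
    print(t, optimal_intensity(price, t))
```

Under these assumed curves the optimum has the closed form x* = (price·capacity / (4·tariff·e_full))^(2/3), clipped to [0, 1], so doubling the tariff shrinks x* by a factor of 2^(2/3) whenever the optimum is interior.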

7 · Guard in production — detection, monitoring, fallback

Intuition. Everything above can still fail at runtime: the twin mismatches reality, a sensor jumps, the plant drifts past what the meta-policy saw. Production safety is a layered net that assumes the policy will sometimes be wrong and stays safe anyway.

Engineering detail. Four layers. (1) Zero-order shield: at the edge, if the measured temperature exceeds T_max−δ (δ≈2 °C margin), truncate the action to the safe boundary immediately, with no gradient step, just an alarm log. Because the margin was baked into training, this rarely fires. (2) Pre-deployment HIL replay: before release, replay ~100 h of real logs on a hardware-in-the-loop bench and require end-to-end task success at least 5% above the incumbent and an inference-latency increase of <2 ms over it. (3) Stress test + distribution check: run ~24 h and pass only if the state-trajectory KL versus the offline data is <0.01 with zero safety violations, which is the bar for a functional-safety sign-off. (4) Drift monitor + fallback: the CUSUM of §5 watches live; on confirmed drift the policy degrades to MPC rolling optimization with action amplitude clamped (<30%) while it fine-tunes, then resumes.

The through-line

Every section is one row of the MDP table turned into a mechanism, all anchored on cost: scarce samples → PID warm-start + conservative offline CQL; hard limits → CMDP with analytic projection / shield; sensor delay → information-state history + recurrent compensation; aging plant → drift CUSUM + MAML over aging tasks + EWC; energy vs throughput → preference-conditioned multi-objective actor. You never replaced the trusted controller — you let RL correct its residual, and you guarded every actuator command with a mechanism that fails safe. That discipline is the whole track.

Further considerations

Quantify model uncertainty, don't just point-estimate. Model the transition with a Gaussian process (ARD squared-exponential kernel); its predictive covariance Σ* gives a per-point std σ(s,a) you can feed three ways — as an exploration bonus α·σ̄, as an active-sampling target (max differential entropy), or as a 95% confidence ellipsoid for robust MPC (optimize max_π min_{f∈ellipsoid} J). One GP fit gives analytic uncertainty with O(n²) rank-one updates — versus ~20 forward passes for MC-Dropout, which can cost ~30 ms on an automotive chip while a sparse GP answers in ~3 ms. Use FITC/inducing-point sparsification to hit 50–1000 Hz control cycles.
Auto-calibrate the digital twin. Treat parameter calibration as its own MDP: action Δθ bounded to physical limits via an invertible transform (so mass>0, damping≥0 hold automatically), reward −‖y_real−y_twin‖²_Q with jitter and Bayesian-prior regularizers. Pre-train in a shadow twin with domain randomization, then run Bayesian SAC with deep-kernel GP residuals on the real device. This drove a deployment mechanism's modal-frequency error from ~30% to <2% in ~120 s while cutting motor power ~18%.
Compare fairly against the MPC baseline. RL only "wins" if its expected return beats MPC with non-overlapping confidence intervals; if they overlap, compare the worst-5% trajectory return (CVaR) and the constraint-violation rate. Under model mismatch (e.g. 20% parameter error), expect RL to trade some mean return for robustness — report which you value. Archive code, seeds, and hardware logs so a third party can reproduce the result; reproducibility is what makes "fair" credible.
Select a deployable policy off the Pareto frontier robustly. Don't ship the nominal "best" point — score each candidate by its worst case over the plausible weight interval minus an uncertainty penalty, Score(π) = min_{w∈[w_min,w_max]} w·μ − κσ (κ≈1.96), drop near-duplicate "twin" policies (cosine>0.95, low hypervolume contribution), and canary-release 5%→15%→50% with automatic rollback if the weighted KPI falls >2% below baseline or the drift monitor fires.
Treat explainability and audit as a requirement, not a nicety. Keep the PID/MPC as a written "prior rule," and report how much the RL term corrects it (e.g. via a symbolic-regression layer that fits the optimal-parameter trajectory to an if-else+polynomial form) so a "why was this parameter changed?" query has a traceable answer that can be written into a process-change order and rolled back.
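Two of the considerations above come down to small calculations: the worst-5% (CVaR) return used in the MPC comparison, and the robust selection score. The sketch below is illustrative only. It assumes two objectives with a scalar weight w on the first and 1−w on the second, so the weighted mean is linear in w and its minimum over the interval sits at an endpoint, and it takes σ as a single precomputed uncertainty per policy. The function names and numbers are hypothetical.

```python
def cvar_lower(returns, tail=0.05):
    """Mean of the worst `tail` fraction of trajectory returns."""
    k = max(1, round(tail * len(returns)))
    worst = sorted(returns)[:k]
    return sum(worst) / k


def robust_score(mu, sigma, w_min, w_max, kappa=1.96):
    """min over w in [w_min, w_max] of (w*mu[0] + (1-w)*mu[1]) - kappa*sigma.

    Linear in w, so only the two endpoints need to be evaluated."""
    def weighted(w):
        return w * mu[0] + (1.0 - w) * mu[1]
    return min(weighted(w_min), weighted(w_max)) - kappa * sigma


# A specialist with high uncertainty versus a balanced, well-characterized policy.
specialist = robust_score(mu=(10.0, 2.0), sigma=1.0, w_min=0.3, w_max=0.7)
balanced = robust_score(mu=(6.0, 6.0), sigma=0.5, w_min=0.3, w_max=0.7)
print(specialist, balanced, cvar_lower(list(range(1, 101))))
```

Here the balanced policy scores higher even though the specialist has the better nominal value at w = 0.7, which is the point of scoring by worst case rather than by the nominal optimum.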