Data-center cooling
A hall full of racks, a fleet of fans, pumps and chilled-water valves, and one number the operator is judged on: PUE — total facility power divided by IT power. RL can shave it, but only against a wall of hard physics: you never see the whole temperature field, the cheapest cooling setting is always one step from a thermal runaway, and the IT load that drives everything lurches without warning. This lesson casts cooling as a constrained, partially-observable MDP and works the four binding difficulties: a hidden 3-D thermal state, a hybrid discrete-continuous actuator space with hard safety intervals, a multi-objective PUE-vs-temperature reward that is trivially hackable, and a non-stationary load.
1 · Formulate — the MDP behind a cooling controller
Intuition. The controller reads thousands of temperature sensors and a handful of power meters, then sets fan speeds, pump frequencies and valve positions. Minutes later the racks are a little hotter or cooler and the facility drew a little more or less power. That is an MDP: the sensor field is the state, the actuator setpoints are the action, the energy-and-temperature outcome is the reward, and the room's thermodynamics (plus the IT load you do not control) is the transition. Every difficulty below is just one of these four pieces being awkward in a real hall.
| Piece | For a cooling controller | The awkward part |
|---|---|---|
| State S | thermistor array + inlet/outlet temps + IT load + outdoor weather + electricity price | sensors are sparse points in a 3-D field — the hottest voxel is usually unmeasured → partially observable |
| Action A | fan stage (discrete) × pump frequency (continuous) × valve opening (continuous) | hybrid, and the actuators have hard physical floors (chiller can't run below ~30 Hz) |
| Reward R | −energy, with bonuses for low PUE and penalties for hot spots and SLA breach | multi-objective and hackable: the literal optimum is "turn cooling off" |
| Transition P | room CFD + thermal mass + the IT workload arriving from outside | load jumps (a batch job lands) → non-stationary, and thermal inertia means actions pay off with delay |
The rightmost column is the lesson. Each awkward part names a mechanism: hidden field → field reconstruction + belief state; hybrid + hard floors → structured heads + a safety projection; hackable reward → constrained shaping with an absorbing penalty; jumping load → online change-point detection and fast adaptation. We reach for each tool only when its row demands it.
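To make the table concrete, here is a minimal sketch of the state and action as typed records. Everything in it is illustrative: the field names, the fan-stage menu, and the upper frequency bound are assumptions. Only the ~30 Hz floor and the definition of PUE come from the text, and applying that floor to the pump setpoint is itself an assumption.

```python
from dataclasses import dataclass
from typing import Tuple

# The ~30 Hz floor is from the text; the other bounds are hypothetical.
FREQ_FLOOR_HZ = 30.0
FREQ_CEIL_HZ = 50.0                       # assumed upper limit
FAN_STAGES: Tuple[int, ...] = (0, 1, 2, 3)  # assumed discrete menu


@dataclass(frozen=True)
class Observation:
    sensor_temps_c: Tuple[float, ...]  # sparse point readings, not the full field
    it_load: float                     # assumed normalized to [0, 1]
    outdoor_temp_c: float
    price_per_kwh: float


@dataclass(frozen=True)
class Action:
    fan_stage: int      # discrete choice
    pump_hz: float      # continuous, bounded below by the hardware floor
    valve_open: float   # continuous fraction of valve travel in [0, 1]

    def is_valid(self) -> bool:
        """Check the action against the hard actuator intervals."""
        return (
            self.fan_stage in FAN_STAGES
            and FREQ_FLOOR_HZ <= self.pump_hz <= FREQ_CEIL_HZ
            and 0.0 <= self.valve_open <= 1.0
        )


def pue(total_facility_kw: float, it_kw: float) -> float:
    """PUE = total facility power divided by IT power."""
    if it_kw <= 0:
        raise ValueError("IT power must be positive")
    return total_facility_kw / it_kw
```

Note that `Observation` holds only what is measured; the hidden field and the load process live in the transition, which is why the later sections need a belief state.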
2 · Diagnose — the hidden thermal field (19.1)
Intuition. You have, say, a few hundred thermistors for a hall that is a continuous 3-D temperature field with millions of degrees of freedom. The dangerous quantity — the single hottest point behind some dense GPU rack — is almost never sitting under a sensor. A controller that optimizes only the temperatures it can see will happily run the room hot exactly where it is blind. This is the partial-observability row, and it is the difficulty that binds first because everything downstream (the safety margin, the reward) is defined on temperatures you must infer, not read.
Engineering detail — fuse the field, keep the uncertainty. Treat the sparse sensor vector as noisy observations of a latent field and reconstruct it. The source's recipe: an encoder that fuses multi-modal sensing (infrared cameras, fibre-optic strings, thermistors) into a voxel feature volume, decoded by a 3-D ResUNet to a dense field plus a per-voxel uncertainty. Critically, you do not collapse to a point estimate — you carry the uncertainty forward, because a hot voxel that is also high-variance is exactly where the policy must be cautious. Offline you can supervise this network against historical CFD ground truth, then warm-start an uncertainty-weighted filter so the online estimate never explodes when a sensor drops out.
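The idea of carrying uncertainty forward can be illustrated without the full network. The sketch below is a deliberately simple stand-in, not the encoder plus 3-D ResUNet described above: it fuses several (mean, variance) estimates of one voxel by inverse-variance weighting, assuming independent Gaussian errors, and then exposes a conservative upper bound for downstream safety checks. The function names and the `z` multiplier are hypothetical.

```python
import math
from typing import Iterable, Tuple


def fuse(estimates: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    """Fuse (mean, variance) estimates of one voxel by inverse-variance weighting.

    Assumes independent Gaussian errors across modalities.
    """
    precision = 0.0
    weighted_sum = 0.0
    for mean, var in estimates:
        if var <= 0:
            raise ValueError("variance must be positive")
        precision += 1.0 / var
        weighted_sum += mean / var
    if precision == 0.0:
        raise ValueError("need at least one estimate")
    return weighted_sum / precision, 1.0 / precision


def upper_bound(mean: float, var: float, z: float = 2.0) -> float:
    """Conservative temperature: the mean plus z standard deviations."""
    return mean + z * math.sqrt(var)


# Two modalities agree on precision but not on value; a third drops out.
both = fuse([(30.0, 1.0), (34.0, 1.0)])
one = fuse([(30.0, 1.0)])
```

With both sensors present the fused variance halves; when one drops out the estimate degrades to the surviving sensor's variance instead of failing, and `upper_bound` widens accordingly. That widening is what tells the policy to be cautious at that voxel.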
State compression — don't feed a million voxels to the policy. The raw field is far too high-dimensional to be a policy input. Compress it: an unsupervised pre-train (a β-VAE on the field, a low-dim summary of the sensor time-series) reduces the observation to a few-dozen-dimensional vector. The source's discipline is the part most people skip — verify sufficiency: compute the mutual information between the policy and the field before vs. after compression and require a high retention (it reports >96%). A compression that loses task-relevant information is silently fatal, because the policy can no longer tell a safe room from a marginal one.
3 · Engineer — hybrid actuators inside hard safety intervals (19.2)
Intuition. A cooling action is not one knob. It is "pick a fan stage" (a discrete menu), "set pump frequency" and "set valve opening" (continuous, but bounded). Flatten it into one giant discrete grid and you explode combinatorially; treat it as one continuous vector and you lose the menu structure of the fan stages. Keep the structure: a discrete head for the stage, continuous heads conditioned on it for the analogue setpoints.
Engineering detail. Sample the discrete stage with Gumbel-Softmax so the choice stays differentiable and the setpoints fed to a real PLC are discrete-valid; reparameterize the continuous heads (ε ~ 𝒩(0,I)) so gradients flow end-to-end. Train with a clipped surrogate (PPO-Clip) for the actor and a twin critic with clipped double-Q for the mixed-action Bellman error; maximize Q + α·H, where H = H(a_fan) + H(a_pump, a_valve | a_fan) is the joint entropy and α is an auto-tuned temperature, so exploration is calibrated, not hand-set.
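The sampling path of such a structured head can be sketched in plain Python. This shows only the forward pass: a Gumbel-Softmax relaxation over fan stages, a hard argmax for the value actually sent to hardware, and reparameterized Gaussian heads conditioned on the chosen stage. The parameter layout (`heads[stage]`), the sigmoid squashing, and the 30 to 50 Hz range are assumptions for illustration. In an autodiff framework the gradient would flow through `probs` and through `mu + std * eps`; here there is no autograd.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Head = Tuple[float, float]  # (mean, log_std) of an unbounded Gaussian


def gumbel_softmax(logits: Sequence[float], tau: float,
                   rng: random.Random) -> List[float]:
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep the logs finite
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def squash(x: float, lo: float, hi: float) -> float:
    """Map an unbounded sample into [lo, hi] (sigmoid written via tanh)."""
    return lo + (hi - lo) * 0.5 * (1.0 + math.tanh(0.5 * x))


def sample_action(fan_logits: Sequence[float],
                  heads: Dict[int, Tuple[Head, Head]],
                  tau: float, rng: random.Random):
    """Sample (stage, pump_hz, valve_open, relaxed_probs)."""
    probs = gumbel_softmax(fan_logits, tau, rng)
    stage = max(range(len(probs)), key=probs.__getitem__)  # discrete-valid
    (mu_p, ls_p), (mu_v, ls_v) = heads[stage]  # heads conditioned on the stage
    pump_raw = mu_p + math.exp(ls_p) * rng.gauss(0.0, 1.0)   # reparameterized
    valve_raw = mu_v + math.exp(ls_v) * rng.gauss(0.0, 1.0)  # reparameterized
    return stage, squash(pump_raw, 30.0, 50.0), squash(valve_raw, 0.0, 1.0), probs
```

Squashing keeps the continuous setpoints inside their actuator ranges by construction, but it says nothing about temperature; that constraint still needs the safety layer.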
The safety interval is the whole game. The chiller has a minimum frequency (~30 Hz); valves have travel limits; and above all, no action may drive any voxel above Tmax. A naive answer is "give an over-temperature action a −10⁶ penalty." The source is emphatic that this is wrong: with a large penalty the policy still explores near the edge and random exploration will occasionally cross it — and crossing it once means cooked hardware. The right answer is to make the unsafe action impossible, not merely expensive.
Engineering detail — a differentiable safety layer. After the policy emits a raw action, project it onto the feasible set with a small QP solver: predict next-step temperature T̂ₜ₊₁ = f(Tₜ, aₜ, load) and find the nearest action keeping T̂ₜ₊₁ ≤ Tmax (nearest, so control stays smooth). Backpropagate through the projected action, so the network actually learns to steer away from the boundary rather than relying on the projector to clean up after it. Where you can afford it, certify the layer with a Control-Barrier-Function check (a barrier B that is positive inside the safe set and falls to zero at the boundary, e.g. B = Tmax − T̂; require dB/dt + κB ≥ 0) and a model-predictive variant (MPSL) that projects over an N-step horizon, because single-step projection can be too late when thermal coupling is fast.
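For a single temperature constraint and a linear one-step model, the projection has a closed form, which makes the idea easy to see. The sketch below assumes a hypothetical model T̂ = t_base − gain·a, where `t_base` is the predicted temperature with zero cooling effort and `gain` is the per-actuator cooling effect; neither comes from the source. It omits box limits and multiple voxels, which is where a real QP solver becomes necessary.

```python
from typing import List, Sequence


def project_to_safe(a_raw: Sequence[float], gain: Sequence[float],
                    t_base: float, t_max: float,
                    margin_c: float = 0.0) -> List[float]:
    """Nearest action (Euclidean) with t_base - gain.a + margin_c <= t_max.

    This is the exact projection onto one half-space. margin_c can carry
    the field-estimate uncertainty, e.g. z times the voxel's std deviation.
    """
    need = t_base + margin_c - t_max                 # cooling effect required
    have = sum(g * a for g, a in zip(gain, a_raw))   # cooling effect proposed
    if have >= need:
        return list(a_raw)                           # already safe: unchanged
    gg = sum(g * g for g in gain)
    if gg == 0.0:
        raise ValueError("no actuator authority over this constraint")
    step = (need - have) / gg
    # Move along the constraint normal by the smallest amount that suffices.
    return [a + step * g for a, g in zip(a_raw, gain)]


safe = project_to_safe([0.0, 0.0], gain=[1.0, 2.0], t_base=30.0, t_max=27.0)
```

Because the correction is proportional to `gain`, the actuator with the most cooling authority moves the most, and a safe action passes through untouched, so the layer only intervenes at the boundary.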
4 · Engineer — the PUE-vs-SLA reward (19.3)
Intuition. The thing you want is low PUE. But PUE alone has a degenerate optimum — switch the cooling off and PUE momentarily looks great while the room climbs toward failure. So the reward is genuinely multi-objective: spend less energy, and keep every voxel cool, and honour the IT-side latency SLA, and ideally shift heavy jobs to cheap electricity. Each extra term is a new thing the agent will optimize literally, so each must be shaped with care.
Engineering detail. The source's structure: a shaping term that rewards driving PUE toward a load-aware baseline (at low IT load PUE is naturally inflated, so the baseline is PUE_ref(load) = 1.05 + 0.25·exp(−5·load) and the agent isn't punished for physics it can't beat); a hard absorbing penalty (PUE > 1.5 → episode terminates with a large negative reward, so the policy never learns to chase a "good" PUE by starving cooling); a latency term (weight α ≈ 0.1 ms⁻¹) tying cooling to the actual IT SLA; and an electricity-price term (β ≈ 0.01) that, where peak/off-peak spread is large (≥0.6 per kWh), pushes heavy workloads into the cheap overnight window, cutting both PUE and the bill. Normalize the total to [−1,1] and feed it to PPO with GAE(λ=0.95). The source reports a hall at ~10k-rack scale converging in ~3 days, PUE falling from 1.42 to 1.27.
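One way these terms could be combined is sketched below. The baseline formula, the 1.5 limit, and the two weights are taken from the text; how the terms are summed, the use of clipping as the normalization, the meaning of `latency_excess_ms` (latency above the SLA target), and treating `load` as a fraction in [0, 1] are all assumptions.

```python
import math
from typing import Tuple

PUE_LIMIT = 1.5      # absorbing threshold from the text
ALPHA_PER_MS = 0.1   # latency weight from the text
BETA = 0.01          # electricity-price weight from the text


def pue_ref(load: float) -> float:
    """Load-aware PUE baseline; load assumed to be a fraction in [0, 1]."""
    return 1.05 + 0.25 * math.exp(-5.0 * load)


def step_reward(pue: float, load: float, latency_excess_ms: float,
                energy_cost: float) -> Tuple[float, bool]:
    """Return (reward, done). The additive form is an assumed combination."""
    if pue > PUE_LIMIT:
        return -1.0, True  # absorbing: worst reward and the episode ends
    raw = (
        (pue_ref(load) - pue)                # shaping toward the baseline
        - ALPHA_PER_MS * latency_excess_ms   # SLA term
        - BETA * energy_cost                 # price-weighted energy term
    )
    return max(-1.0, min(1.0, raw)), False   # clip into [-1, 1]
```

Shaping against `pue_ref(load)` rather than a fixed target means the same measured PUE earns more at low load, where overheads are spread over less IT power, than at high load.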
5 · Guard — surviving a non-stationary load in production (19.4)
Intuition. The IT load is exogenous and bursty — a batch job lands, a region fails over, traffic triples on a sale day — and the room's thermal behaviour shifts under the policy's feet. A controller tuned for yesterday's load can be wrong for today's. You need to detect the shift fast and adapt without retraining from scratch, all while a hardware backstop guarantees safety regardless.
Engineering detail — detect. Maintain an EWMA mean and EW-variance on each key signal (μₜ = α·xₜ + (1−α)·μₜ₋₁; flag a local anomaly when |xₜ − μₜ| > k·σₜ, k ≈ 3.5). For a genuine regime change, run online Bayesian change-point detection: a Normal-Gamma conjugate posterior updated each mini-batch, declaring a global change-point when the log Bayes factor exceeds a threshold for two consecutive batches. On a confirmed change-point: freeze the old replay buffer and open a fresh one (so distributions don't mix), warm-start a meta-learned policy (MAML/PEARL) for fast adaptation, and signal the orchestrator to scale capacity if CPU is also saturating.
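The local-anomaly half of the detector fits in a few lines. This sketch uses the EWMA recursion and k ≈ 3.5 from the text; the incremental variance update, the warm-up length, and the class name are assumptions. One deliberate choice: each sample is tested against the statistics from before it was absorbed, so a spike cannot inflate its own threshold. The Bayesian change-point stage is not shown.

```python
import math


class EwmaDetector:
    """EWMA mean and variance with a k-sigma anomaly flag."""

    def __init__(self, alpha: float = 0.1, k: float = 3.5, warmup: int = 10):
        self.alpha = alpha
        self.k = k
        self.warmup = warmup  # samples to observe before flagging (assumed)
        self.mean = 0.0
        self.var = 0.0
        self.n = 0

    def update(self, x: float) -> bool:
        """Absorb x and return True if it was anomalous."""
        if self.n == 0:
            self.mean, self.n = x, 1
            return False
        dev = x - self.mean          # deviation from the previous mean
        sigma = math.sqrt(self.var)  # previous standard deviation
        flagged = (
            self.n >= self.warmup
            and sigma > 0.0
            and abs(dev) > self.k * sigma
        )
        # mean_t = alpha * x_t + (1 - alpha) * mean_{t-1}
        self.mean += self.alpha * dev
        # Matching exponentially weighted variance recursion.
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * dev * dev)
        self.n += 1
        return flagged


det = EwmaDetector()
quiet = [det.update(20.0 + 0.1 * ((i % 2) * 2 - 1)) for i in range(50)]
spike = det.update(25.0)
```

In this run the alternating ±0.1 °C signal never trips the flag, while the jump to 25 °C does. A flag here is only a local anomaly; promoting it to a regime change is the job of the change-point test.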
Engineering detail — adapt online, safely. Incremental PPO on a sliding window of recent interaction, mixed with ~20% reservoir-sampled history to prevent forgetting; a KL penalty with dual-gradient-adjusted β holding each step's KL < 0.01 so adaptation can't lurch; recompute advantages with GAE to keep variance down. Then a shadow / canary gate: the new policy serves a small fraction of capacity, and if a core metric drops more than ~1σ within minutes, roll back to the previous version in seconds. Where you have a workload forecast, close the loop with model-based rollouts (MBPO-style short horizons weighted by predictive uncertainty, w = 1/(σ²+ε)) so the controller pre-positions cooling before the heat arrives instead of chasing it.
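Two of the guards above reduce to one-line rules. The sketch below shows a dual-ascent step on the KL coefficient β toward the 0.01 target, and the canary rollback test at roughly 1σ. The step size `lr`, the function names, and the convention that a higher metric is better are assumptions.

```python
def update_kl_coeff(beta: float, observed_kl: float,
                    target_kl: float = 0.01, lr: float = 10.0) -> float:
    """One dual-ascent step: raise beta when KL overshoots, lower it otherwise.

    lr is a hypothetical step size; beta is kept non-negative.
    """
    return max(0.0, beta + lr * (observed_kl - target_kl))


def canary_should_rollback(baseline_mean: float, baseline_std: float,
                           canary_metric: float, n_sigma: float = 1.0) -> bool:
    """True if the canary's core metric fell more than n_sigma below baseline.

    Assumes a higher metric is better.
    """
    return canary_metric < baseline_mean - n_sigma * baseline_std


beta = 1.0
for kl in (0.02, 0.02, 0.005):   # two overshoots, then an undershoot
    beta = update_kl_coeff(beta, kl)
```

The two rules act on different timescales: β shapes every gradient step, while the canary gate is a coarse last check on the deployed policy's behaviour.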
6 · The central trade-off — PUE vs thermal safety margin
Intuition. Everything above collapses to one dial. Run the cooling hard and you keep a fat safety margin below Tmax but pay a high PUE. Throttle it and PUE drops — until your margin is so thin that the uncertainty in the hidden-field estimate, plus the next load spike, tips a voxel over the limit. The whole engineering effort — field reconstruction, the safety projection, the constrained reward — exists to let you run the room closer to the limit safely, i.e. to convert reduced uncertainty into lower PUE without buying a breach.
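The napkin math behind this trade-off can be written down directly under a simple assumed model: the hottest voxel's estimate has Gaussian error with standard deviation `sigma_c`, a load spike adds a fixed `spike_c` degrees, and a breach occurs when spike plus error exceeds the margin. The model and the numbers in the usage lines are illustrative, not from the source.

```python
import math


def breach_probability(margin_c: float, spike_c: float, sigma_c: float) -> float:
    """P(spike + estimation error > margin) with error ~ N(0, sigma_c^2)."""
    if sigma_c <= 0:
        raise ValueError("sigma_c must be positive")
    z = (margin_c - spike_c) / sigma_c
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # Gaussian upper-tail probability


# Same 2-degree spike, two levels of field uncertainty (illustrative numbers).
wide = breach_probability(margin_c=5.0, spike_c=2.0, sigma_c=1.0)
tight = breach_probability(margin_c=3.5, spike_c=2.0, sigma_c=0.5)
```

Both calls sit three standard deviations from the limit, so they carry the same breach risk, yet the second runs 1.5 °C closer to Tmax. Under this model, halving the field uncertainty buys back exactly that much margin, which is the cooling energy the controller gets to stop spending.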
7 · Iterate — the through-line
Each section is one row of the MDP table turned into a mechanism. Hidden 3-D field → uncertainty-aware reconstruction + a compressed belief state. Hybrid actuators with hard floors → structured Gumbel-Softmax + reparameterized heads, then a differentiable QP safety projection. Multi-objective, hackable reward → load-aware shaping, an absorbing limit penalty, and a Lagrangian constraint instead of a guessed weight. Non-stationary load → change-point detection, incremental KL-bounded adaptation, a canary gate, and an independent PLC backstop. Remove the field uncertainty and the binding difficulty becomes how tight a margin you dare run; tighten the margin and oscillation re-appears — so you re-run the loop. The discipline is the same one from lesson 67: never reach for a tool until a row of the table demands it.
Further considerations
- Multi-agent coordination. A campus has many coupled thermal subsystems (multiple halls, shared chiller plant). Promote each to an agent and let them share boundary temperatures via a graph-attention layer, co-optimizing facility-wide long-term return instead of each hall greedily chasing its own local optimum.
- Meta-learning for fast transfer. Seasonal swings and hardware refreshes shift the thermal dynamics. MAML or PEARL lets the policy adapt from a handful of samples in the new regime, avoiding the cost of re-collecting CFD ground truth every time the room changes.
- Verifiable compression. For a safety-critical controller, prove the compressed state preserves the MDP: a bisimulation-equivalence argument bounding the optimal-value error of the compressed MDP (ε < 0.02) so the reduction is certified, not just empirically "good enough."
- Barrier-function guarantees with sample bounds. Casting the limit as a CMDP and using a differentiable log-barrier objective yields a monotonic-improvement guarantee and an explicit step-size and barrier-coefficient choice — and a Bernstein bound gives the number of trajectories N needed to certify constraint violation ≤ ζ with probability 1−δ, with no dimension dependence. Constraint tuning stops being guesswork.
- Carbon as a third objective. Fold a carbon-cost term into the reward and search the temperature-safe / energy-minimal / carbon-minimal Pareto front (e.g. Pareto-TRPO); where carbon is a tradable allowance, the agent can even arbitrate between saving energy and selling allowances.
- Explainability and audit. Operators must be able to ask "why did you trust the IR camera over the fibre?" Attach a per-action explanation (attention mask + SHAP-style attribution over the fused sensors), log every decision with a hashed parameter fingerprint and data version, and keep a tested "red-button" path that switches to a rule-based baseline within ~200 ms — so a compliance audit reduces to checking the log hashes match and the fallback actually fires.