Data-center cooling

A hall full of racks, a fleet of fans, pumps and chilled-water valves, and one number the operator is judged on: PUE — total facility power divided by IT power. RL can shave it, but only against a wall of hard physics: you never see the whole temperature field, the cheapest cooling setting is always one step from a thermal runaway, and the IT load that drives everything lurches without warning. This lesson casts cooling as a constrained, partially-observable MDP and works the four binding difficulties: a hidden 3-D thermal state, a hybrid discrete-continuous actuator space with hard safety intervals, a multi-objective PUE-vs-temperature reward that is trivially hackable, and a non-stationary load.

The method — five steps, every lesson

Applied RL is not a grab-bag of tricks. Every domain in this track is the same loop: (1) Formulate the MDP — state, action, reward, transition, horizon. (2) Diagnose the one property that makes the MDP hard. (3) Engineer the mechanism that removes that difficulty. (4) Guard it in production — detect when it breaks and fall back. (5) Iterate. The art is only in which difficulty binds first. Here it binds on safety: in a data center an unsafe action does not lose a game, it cooks a rack.

1 · Formulate — the MDP behind a cooling controller

Intuition. The controller reads thousands of temperature sensors and a handful of power meters, then sets fan speeds, pump frequencies and valve positions. Minutes later the racks are a little hotter or cooler and the facility drew a little more or less power. That is an MDP: the sensor field is the state, the actuator setpoints are the action, the energy-and-temperature outcome is the reward, and the room's thermodynamics (plus the IT load you do not control) is the transition. Every difficulty below is just one of these four pieces being awkward in a real hall.

MDP = (S, A, R, P, γ) · maximize E[ Σₜ γᵗ rₜ ] subject to T(s) ≤ T_max

Piece	For a cooling controller	The awkward part
State S	thermistor array + inlet/outlet temps + IT load + outdoor weather + electricity price	sensors are sparse points in a 3-D field — the hottest voxel is usually unmeasured → partially observable
Action A	fan stage (discrete) × pump frequency (continuous) × valve opening (continuous)	hybrid, and the actuators have hard physical floors (chiller can't run below ~30 Hz)
Reward R	−energy, with bonuses for low PUE and penalties for hot spots and SLA breach	multi-objective and hackable: the literal optimum is "turn cooling off"
Transition P	room CFD + thermal mass + the IT workload arriving from outside	load jumps (a batch job lands) → non-stationary, and thermal inertia means actions pay off with delay

The rightmost column is the lesson. Each awkward part names a mechanism: hidden field → field reconstruction + belief state; hybrid + hard floors → structured heads + a safety projection; hackable reward → constrained shaping with an absorbing penalty; jumping load → online change-point detection and fast adaptation. We reach for each tool only when its row demands it.

2 · Diagnose — the hidden thermal field (19.1)

Intuition. You have, say, a few hundred thermistors for a hall that is a continuous 3-D temperature field with millions of degrees of freedom. The dangerous quantity — the single hottest point behind some dense GPU rack — is almost never sitting under a sensor. A controller that optimizes only the temperatures it can see will happily run the room hot exactly where it is blind. This is the partial-observability row, and it is the difficulty that binds first because everything downstream (the safety margin, the reward) is defined on temperatures you must infer, not read.

Engineering detail — fuse the field, keep the uncertainty. Treat the sparse sensor vector as noisy observations of a latent field and reconstruct it. The source's recipe: an encoder that fuses multi-modal sensing (infrared cameras, fibre-optic strings, thermistors) into a voxel feature volume, decoded by a 3-D ResUNet to a dense field plus a per-voxel uncertainty. Critically, you do not collapse to a point estimate — you carry the uncertainty forward, because a hot voxel that is also high-variance is exactly where the policy must be cautious. Offline you can supervise this network against historical CFD ground truth, then warm-start an uncertainty-weighted filter so the online estimate never explodes when a sensor drops out.

b(field) = Encoder( IR, fibre, thermistors ) → ( T̂(x,y,z), σ²(x,y,z) ) · T_risk = max_voxel ( T̂ + k·σ )
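The risk temperature above can be sketched in a few lines. This is a minimal illustration, assuming the reconstructed field has already been flattened into two lists; the voxel values and `k = 2` are made up for the example and are not from the source.

```python
def risk_temperature(t_hat, sigma2, k=2.0):
    """Return (T_risk, voxel_index), where T_risk = max over voxels of T_hat + k*sigma."""
    scores = [t + k * (s2 ** 0.5) for t, s2 in zip(t_hat, sigma2)]
    idx = max(range(len(scores)), key=scores.__getitem__)
    return scores[idx], idx

# Hypothetical flattened field: voxel 1 has the highest mean,
# but voxel 2 is far less certain.
t_hat = [24.0, 27.5, 26.0, 25.0]   # reconstructed mean temperature, deg C
sigma2 = [0.25, 0.25, 4.0, 1.0]    # per-voxel variance, deg C^2

t_risk, risk_idx = risk_temperature(t_hat, sigma2, k=2.0)
hottest_mean_idx = max(range(len(t_hat)), key=t_hat.__getitem__)
print(t_risk, risk_idx, hottest_mean_idx)
```

The voxel that drives the safety margin (index 2) is not the voxel with the highest point estimate (index 1), which is the reason for carrying σ forward rather than collapsing to T̂.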

State compression: don't feed a million voxels to the policy. The raw field is far too high-dimensional to be a policy input. Compress it: an unsupervised pre-training stage (a β-VAE on the field, plus a low-dimensional summary of the sensor time series) reduces the observation to a vector of a few dozen dimensions. The source's discipline is the part most people skip, which is to verify sufficiency: compute the mutual information between the policy's input and the field before and after compression, and require high retention (it reports >96%). A compression that loses task-relevant information is silently fatal, because the policy can no longer tell a safe room from a marginal one.
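A toy version of the sufficiency check, assuming everything has been discretized so that a plug-in (count-based) mutual-information estimate is valid. The four "thermal regimes", the safe/marginal label and both compressions are hypothetical; a real field would need a continuous MI estimator, which this sketch does not attempt.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in nats from paired discrete samples."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

def retention(raw, compressed, target):
    """Fraction of the task-relevant information that survives compression."""
    return mutual_information(compressed, target) / mutual_information(raw, target)

raw = [0, 1, 2, 3] * 25                # four hypothetical thermal regimes
target = [int(x >= 2) for x in raw]    # 1 = marginal room, 0 = safe room
good = [x // 2 for x in raw]           # keeps the safe/marginal split
bad = [x % 2 for x in raw]             # discards it

good_ret = retention(raw, good, target)
bad_ret = retention(raw, bad, target)
print(good_ret, bad_ret)
```

Both compressions halve the state, yet one retains all of the task-relevant information and the other none, which is why the check is done against a task signal rather than by reconstruction error alone.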

Belief state, not raw observation

Because the room is partially observable, the policy should condition on a belief, not a single frame. Feed the compressed field through a recurrent core (GRU + attention over the sensor history) so the agent integrates information over time — a thermistor that briefly disagrees with its neighbours is noise; one that has drifted for an hour is a real hot spot forming. The hidden state is the controller's running estimate of where the heat actually is.

3 · Engineer — hybrid actuators inside hard safety intervals (19.2)

Intuition. A cooling action is not one knob. It is "pick a fan stage" (a discrete menu), "set pump frequency" and "set valve opening" (continuous, but bounded). Flatten it into one giant discrete grid and you explode combinatorially; treat it as one continuous vector and you lose the menu structure of the fan stages. Keep the structure: a discrete head for the stage, continuous heads conditioned on it for the analogue setpoints.

a = (a_fan, a_pump, a_valve) · a_fan ~ Categorical(logits) · (a_pump, a_valve) = μ(a_fan) + σ(a_fan) ⊙ ε · ε ~ 𝒩(0, I)

Engineering detail. Sample the discrete stage with Gumbel-Softmax so the choice stays differentiable while the setpoint sent to a real PLC is still a valid discrete stage; reparameterize the continuous heads (ε ~ 𝒩(0,I)) so gradients flow end-to-end. Train the actor with a clipped surrogate (PPO-Clip) and a twin critic with clipped double-Q for the mixed-action Bellman error; maximize Q + α·H, where H = H(a_fan) + H(a_pump, a_valve | a_fan) is the joint entropy and α is an auto-tuned temperature, so exploration is calibrated, not hand-set.
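The sampling path of such a structured head can be sketched without a deep-learning framework. This is an illustration only: it uses the Gumbel-max trick (the hard-sample limit of Gumbel-Softmax) for the fan stage, plain Python has no autograd so no gradients actually flow here, and the logits, means, standard deviations, bounds and the final clipping are all assumptions made for the example.

```python
import math
import random

def sample_hybrid_action(logits, mu, log_std, bounds, rng):
    """Sample (fan_stage, pump_hz, valve_frac) from a structured hybrid policy."""
    # Gumbel-max: argmax(logits + Gumbel noise) is an exact categorical sample.
    gumbel = []
    for _ in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
        gumbel.append(-math.log(-math.log(u)))
    fan = max(range(len(logits)), key=lambda i: logits[i] + gumbel[i])

    # Reparameterized continuous heads, conditioned on the sampled stage:
    # setpoint = mu(fan) + sigma(fan) * eps, with eps ~ N(0, 1).
    setpoints = []
    for m, ls, (lo, hi) in zip(mu[fan], log_std[fan], bounds):
        eps = rng.gauss(0.0, 1.0)
        raw = m + math.exp(ls) * eps
        setpoints.append(min(hi, max(lo, raw)))  # clip to actuator travel (assumption)
    return fan, setpoints[0], setpoints[1]

rng = random.Random(0)
logits = [0.1, 1.2, 0.3]                         # three hypothetical fan stages
mu = [[32.0, 0.3], [40.0, 0.5], [48.0, 0.8]]     # per-stage (pump Hz, valve fraction)
log_std = [[0.0, -2.0]] * 3
bounds = [(30.0, 50.0), (0.0, 1.0)]              # 30 Hz floor, valve in [0, 1]

fan, pump_hz, valve = sample_hybrid_action(logits, mu, log_std, bounds, rng)
print(fan, pump_hz, valve)
```

Indexing `mu` and `log_std` by the sampled stage is what "continuous heads conditioned on it" means in practice: each fan stage carries its own distribution over the analogue setpoints.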

The safety interval is the whole game. The chiller has a minimum frequency (~30 Hz); valves have travel limits; and above all, no action may drive any voxel above T_max. A naive answer is "give an over-temperature action a −10⁶ penalty." The source is emphatic that this is wrong: with a large penalty the policy still explores near the edge and random exploration will occasionally cross it — and crossing it once means cooked hardware. The right answer is to make the unsafe action impossible, not merely expensive.

a_safe = argmin_{a'} ‖a' − a_raw‖² s.t. g(s, a') ≤ 0 · T̂ₜ₊₁(s, a_safe) ≤ T_max

Engineering detail: a differentiable safety layer. After the policy emits a raw action, project it onto the feasible set with a small QP solver: predict the next-step temperature T̂ₜ₊₁ = f(Tₜ, aₜ, load) and find the nearest action that keeps T̂ₜ₊₁ ≤ T_max (nearest, so control stays smooth). Backpropagate through the projection, so the network learns to steer away from the boundary rather than relying on the projector to clean up after it. Where you can afford it, certify the layer with a Control-Barrier-Function check (with B = T_max − T̂ ≥ 0 shrinking to zero at the boundary, require dB/dt + κB ≥ 0) and a model-predictive variant (MPSL) that projects over an N-step horizon, because single-step projection can come too late when thermal coupling is fast.
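When the one-step model is linear in the action and the temperature limit is the only constraint, the projection has a closed form and no QP solver is needed. The sketch below assumes exactly that: a hypothetical model T_next = T_now + load_coeff·load − Σ gain_i·a_i with made-up coefficients. Actuator box limits are deliberately left out; handling them together with the temperature constraint is what the QP solver is for.

```python
def predict_next_temp(action, temp_now, load, gain, load_coeff):
    """Assumed linear one-step thermal model (illustrative, not from the source)."""
    return temp_now + load_coeff * load - sum(g * a for g, a in zip(gain, action))

def project_to_safe(a_raw, temp_now, load, t_max, gain, load_coeff):
    """Nearest action (Euclidean) whose predicted next temperature is <= t_max."""
    cooling_needed = temp_now + load_coeff * load - t_max  # need gain . a >= this
    cooling_given = sum(g * a for g, a in zip(gain, a_raw))
    shortfall = cooling_needed - cooling_given
    if shortfall <= 0:
        return list(a_raw)  # already feasible: leave the action untouched
    # Projection onto a half-space: move along the constraint normal.
    scale = shortfall / sum(g * g for g in gain)
    return [a + scale * g for a, g in zip(a_raw, gain)]

gain = [3.0, 4.0]   # deg C of cooling per unit (pump, valve), hypothetical
a_raw = [0.2, 0.3]  # raw policy output, normalized setpoints
a_safe = project_to_safe(a_raw, temp_now=30.0, load=0.8, t_max=32.0,
                         gain=gain, load_coeff=5.0)
t_pred = predict_next_temp(a_safe, 30.0, 0.8, gain, 5.0)
a_kept = project_to_safe([0.9, 0.9], 30.0, 0.8, 32.0, gain, 5.0)
print(a_safe, t_pred, a_kept)
```

The raw action would have left the room at 32.2 °C; the projection adds just enough cooling to land on the limit, and an action that is already safe passes through unchanged.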

Oscillation — the second-order failure of a smooth-looking controller

A controller that respects every limit can still chatter: fans hunting up and down, valves cycling, because Q-value noise and a non-stationary load keep flipping the argmax. Chatter wears actuators and confuses operators. The source's fixes, in order of leverage: (1) TV-regularize the action sequence and add a separable smoothing layer so consecutive setpoints can't jump; (2) anneal exploration — decay the entropy coefficient on a schedule (e.g. 0.2 → 0.01) so late training stops injecting randomness; (3) average out environment noise with many parallel samplers and synchronous gradient averaging — more workers, lower update variance, fewer "back-and-forth" reversals; (4) normalize reward scale with PopArt so a swing in the raw kWh term doesn't blow up Q-values. Monitor gradient norm, Q-std and policy KL; if KL > 0.02 or the gradient norm spikes, auto-roll back the last model.
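Two of the fixes above are easy to make concrete: the total-variation term and the entropy-coefficient schedule. In this sketch the linear shape of the decay is an assumption (the source only gives the endpoints 0.2 and 0.01), and the two setpoint sequences are invented to show what the TV term penalizes.

```python
def total_variation(actions):
    """Sum of absolute step-to-step changes in a setpoint sequence."""
    return sum(abs(b - a) for a, b in zip(actions, actions[1:]))

def entropy_coef(step, total_steps, start=0.2, end=0.01):
    """Linearly anneal the entropy coefficient from start to end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)

chatter = [40.0, 46.0, 40.0, 46.0, 40.0]  # fan hunting between two speeds
ramp = [40.0, 41.5, 43.0, 44.5, 46.0]     # same range, no reversals

tv_chatter = total_variation(chatter)
tv_ramp = total_variation(ramp)
coef_start = entropy_coef(0, 100_000)
coef_end = entropy_coef(100_000, 100_000)
print(tv_chatter, tv_ramp, coef_start, coef_end)
```

Adding a weighted `total_variation` term to the loss makes the chattering sequence four times as expensive as the ramp, even though both stay inside every limit.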

4 · Engineer — the PUE-vs-SLA reward (19.3)

Intuition. The thing you want is low PUE. But PUE alone has a degenerate optimum — switch the cooling off and PUE momentarily looks great while the room climbs toward failure. So the reward is genuinely multi-objective: spend less energy, and keep every voxel cool, and honour the IT-side latency SLA, and ideally shift heavy jobs to cheap electricity. Each extra term is a new thing the agent will optimize literally, so each must be shaped with care.

r = r_shape(PUE) + r_pen − α·d_latency − β·π_price·(P_total/1000)

Engineering detail. The source's structure: a shaping term that rewards driving PUE toward a load-aware baseline (at low IT load PUE is naturally inflated, so the baseline is pue_ref(load) = 1.05 + 0.25·e^(−5·load) and the agent isn't punished for physics it can't beat); a hard absorbing penalty (PUE > 1.5 → episode terminates with a large negative reward, so the policy never learns to chase a "good" PUE by starving cooling); a latency term (weight α ≈ 0.1 ms⁻¹) tying cooling to the actual IT SLA; and an electricity-price term (β ≈ 0.01) that, where the peak/off-peak spread is large (≥0.6 per kWh), pushes heavy workloads into the cheap overnight window, cutting both PUE and the bill. Normalize the total to [−1,1] and feed it to PPO with GAE(λ=0.95). The source reports a hall at ~10k-rack scale converging in ~3 days, PUE falling from 1.42 to 1.27.
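A minimal sketch of the per-step reward under stated assumptions. The baseline formula, the 1.5 cut-off and the weights α and β come from the text; the form of the shaping term (negative gap to the baseline), the clipping used for normalization, the terminal value of −1 and the reading of d_latency as milliseconds are assumptions made for illustration.

```python
import math

def pue_ref(load):
    """Load-aware PUE baseline from the text; load is a fraction in [0, 1]."""
    return 1.05 + 0.25 * math.exp(-5.0 * load)

def step_reward(pue, load, latency_ms, price, p_total,
                alpha=0.1, beta=0.01, pue_limit=1.5, terminal_penalty=-1.0):
    """Return (reward, done). done=True is the absorbing penalty state."""
    if pue > pue_limit:
        return terminal_penalty, True
    r_shape = -(pue - pue_ref(load))          # assumed shaping: gap to baseline
    r = r_shape - alpha * latency_ms - beta * price * (p_total / 1000.0)
    return max(-1.0, min(1.0, r)), False      # assumed normalization: clip

ref_idle = pue_ref(0.0)
ref_full = pue_ref(1.0)
r_ok, done_ok = step_reward(pue=1.27, load=0.6, latency_ms=0.5,
                            price=0.8, p_total=1200.0)
r_bad, done_bad = step_reward(pue=1.6, load=0.6, latency_ms=0.5,
                              price=0.8, p_total=1200.0)
print(ref_idle, ref_full, r_ok, done_ok, r_bad, done_bad)
```

The baseline falls from 1.30 at idle to about 1.05 at full load, so the same measured PUE is rewarded differently depending on how much IT work the hall is doing.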

Reward hacking — and why a penalty is not enough

Two layers. (1) Constrained, not weighted. The temperature limit is a constraint, handled with a Lagrange multiplier (or CPO / PID-Lagrangian), not a hand-tuned penalty weight: a weight that is too small lets the agent buy a hot rack with energy savings, and one that is too large makes it freeze the room. (2) Watch the behaviour, not the score. If the shaped reward is high while a behavioural ratio (cooling energy per unit IT load, say) drifts more than 3σ below the human-operator baseline, the agent is gaming the shaping; alert and roll back. Against the worst case, a sensor feeding fake low temperatures so that the policy shuts cooling off, treat the sensor attack as an adversary and solve for a worst-case policy with robust CPO.
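The "constrained, not weighted" idea reduces to a dual-ascent update on the multiplier. The sketch below shows only that update, not a full CPO or PID-Lagrangian implementation; the cost definition (average over-temperature per episode), the limit and the learning rate are hypothetical.

```python
def lagrangian_reward(reward, cost, lam):
    """Reward seen by the policy once the constraint is priced in."""
    return reward - lam * cost

def update_multiplier(lam, avg_cost, cost_limit, lr=0.05):
    """Dual ascent: raise lambda while the constraint is violated, else relax it."""
    return max(0.0, lam + lr * (avg_cost - cost_limit))

lam = 0.0
for _ in range(3):   # three batches that run too hot
    lam = update_multiplier(lam, avg_cost=1.0, cost_limit=0.2)
lam_after_violation = lam

for _ in range(20):  # twenty batches comfortably inside the limit
    lam = update_multiplier(lam, avg_cost=0.0, cost_limit=0.2)
lam_after_recovery = lam

print(lam_after_violation, lam_after_recovery)
```

The multiplier climbs only for as long as the limit is breached and decays back to zero afterwards, so the effective penalty weight is set by the data instead of being guessed once.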

5 · Guard — surviving a non-stationary load in production (19.4)

Intuition. The IT load is exogenous and bursty — a batch job lands, a region fails over, traffic triples on a sale day — and the room's thermal behaviour shifts under the policy's feet. A controller tuned for yesterday's load can be wrong for today's. You need to detect the shift fast and adapt without retraining from scratch, all while a hardware backstop guarantees safety regardless.

Engineering detail: detect. Maintain an EWMA mean and EW-variance on each key signal (μₜ = α·xₜ + (1−α)·μₜ₋₁; flag a local anomaly when |xₜ − μₜ| > k·σₜ, k ≈ 3.5). For a genuine regime change, run online Bayesian change-point detection: a Normal-Gamma conjugate posterior updated each mini-batch, declaring a global change-point when the log Bayes factor exceeds a threshold for two consecutive batches. On a confirmed change-point: freeze the old replay buffer and open a fresh one (so distributions don't mix), warm-start a meta-learned policy (MAML/PEARL) for fast adaptation, and signal the orchestrator to scale capacity if CPU is also saturating.
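The local-anomaly half of the detector fits in a small class. This sketch covers only the EWMA test, not the Bayesian change-point stage. Two details are assumptions: each sample is tested against the statistics from before it is folded in (a common variant of the rule above), and a short warm-up suppresses flags until the variance estimate has settled. The signal is synthetic.

```python
class EwmaDetector:
    """Exponentially weighted mean/variance with a k-sigma anomaly flag."""

    def __init__(self, alpha=0.1, k=3.5, warmup=10):
        self.alpha, self.k, self.warmup = alpha, k, warmup
        self.mean = None
        self.var = 0.0
        self.n = 0

    def update(self, x):
        """Return True if x is a local anomaly, then fold x into the statistics."""
        if self.mean is None:
            self.mean, self.n = x, 1
            return False
        resid = x - self.mean
        flagged = self.n >= self.warmup and abs(resid) > self.k * self.var ** 0.5
        # Standard incremental recursion for EW mean and EW variance.
        self.mean += self.alpha * resid
        self.var = (1 - self.alpha) * (self.var + self.alpha * resid * resid)
        self.n += 1
        return flagged

# Synthetic inlet temperature: small ripple around 22 deg C, then a 4 deg C jump.
signal = [22.0 + 0.2 * (-1) ** i for i in range(30)] + [26.0]
detector = EwmaDetector()
flags = [detector.update(x) for x in signal]
print(flags.index(True))
```

Only the final sample is flagged; the steady ripple stays well inside the 3.5σ band.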

Engineering detail — adapt online, safely. Incremental PPO on a sliding window of recent interaction, mixed with ~20% reservoir-sampled history to prevent forgetting; a KL penalty with dual-gradient-adjusted β holding each step's KL < 0.01 so adaptation can't lurch; recompute advantages with GAE to keep variance down. Then a shadow / canary gate: the new policy serves a small fraction of capacity, and if a core metric drops more than ~1σ within minutes, roll back to the previous version in seconds. Where you have a workload forecast, close the loop with model-based rollouts (MBPO-style short horizons weighted by predictive uncertainty, w = 1/(σ²+ε)) so the controller pre-positions cooling before the heat arrives instead of chasing it.
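The "sliding window plus ~20% reservoir-sampled history" batch can be sketched as follows. The reservoir uses the classic Algorithm R; the buffer sizes, batch size and the string-tagged transitions are placeholders for illustration.

```python
import random
from collections import deque

class Reservoir:
    """Uniform random sample of everything seen so far (Algorithm R)."""

    def __init__(self, capacity, rng):
        self.capacity, self.rng = capacity, rng
        self.items, self.seen = [], 0

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item

def mixed_batch(recent, reservoir, batch_size, history_frac, rng):
    """Draw a batch that is mostly recent data with a fixed share of history."""
    n_hist = min(int(round(batch_size * history_frac)), len(reservoir.items))
    n_recent = batch_size - n_hist
    return rng.sample(reservoir.items, n_hist) + rng.choices(list(recent), k=n_recent)

rng = random.Random(0)
history = Reservoir(capacity=100, rng=rng)
for i in range(5000):
    history.add(("old", i))                 # transitions from earlier regimes

recent = deque((("new", i) for i in range(300)), maxlen=200)  # sliding window
batch = mixed_batch(recent, history, batch_size=50, history_frac=0.2, rng=rng)
n_old = sum(1 for tag, _ in batch if tag == "old")
print(len(batch), n_old)
```

Because the reservoir stays a uniform sample of all past transitions at a fixed memory cost, the 20% share keeps reminding the policy of older regimes without letting them dominate the update.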

The hardware backstop is not optional

RL is the brain; it must not be the only line of safety. The deployed pattern is two-channel: the RL policy emits setpoints / reference trajectories, and a separately-certified low-level controller (and an independent safety PLC) closes the loop in hard real time — the PLC monitors raw temperatures and can physically cut the actuator drive within milliseconds, completely independent of the RL process. So even if the policy, the field estimator and the safety projection all fail at once, the room still cannot exceed its limit. Validate the whole stack against millions of Monte-Carlo boundary cases in a digital twin and require zero violations before it ever touches the floor.

6 · The central trade-off — PUE vs thermal safety margin

Intuition. Everything above collapses to one dial. Run the cooling hard and you keep a fat safety margin below T_max but pay a high PUE. Throttle it and PUE drops — until your margin is so thin that the uncertainty in the hidden-field estimate, plus the next load spike, tips a voxel over the limit. The whole engineering effort — field reconstruction, the safety projection, the constrained reward — exists to let you run the room closer to the limit safely, i.e. to convert reduced uncertainty into lower PUE without buying a breach. The widget below is that napkin math: pick a setpoint margin and a load-spike size, and watch energy savings trade against breach risk.

Cooling setpoint margin → energy saving vs. over-temperature risk

Tighter margin (cooling backed off) saves energy but eats into the buffer that absorbs hidden-field uncertainty and load spikes. Better sensing (lower σ) and a load forecast both let you tighten the margin without raising breach risk — that is the entire value of the RL stack.

[Interactive widget. Inputs (defaults): setpoint margin below T_max 6 °C · field uncertainty σ 2 °C · load-spike ΔT 3 °C. Outputs: PUE · energy saving · over-temp risk · verdict.]
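The over-temperature side of that napkin math can be approximated with a one-line Gaussian tail. This is an assumed model, not the widget's actual formula: the error in the hidden-field estimate is taken as N(0, σ²), a load spike of ΔT eats directly into the margin, and a breach happens when the remaining margin is smaller than the estimation error. No PUE or energy-saving model is attempted, since the source gives none.

```python
import math

def breach_probability(margin, sigma, spike):
    """P(hottest voxel exceeds T_max) under a Gaussian field-estimation error."""
    if sigma <= 0:
        return 1.0 if spike > margin else 0.0
    z = (margin - spike) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of the standard normal

p_default = breach_probability(margin=6.0, sigma=2.0, spike=3.0)       # widget defaults
p_better_sensing = breach_probability(margin=6.0, sigma=1.0, spike=3.0)
p_tighter = breach_probability(margin=4.5, sigma=1.0, spike=3.0)       # margin cut by 1.5 deg C
print(p_default, p_better_sensing, p_tighter)
```

Under these assumptions, halving σ either cuts the breach risk from roughly 6.7% to about 0.13% at the same margin, or lets the margin shrink by 1.5 °C at unchanged risk, which is the trade described above.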

7 · Iterate — the through-line

Each section is one row of the MDP table turned into a mechanism. Hidden 3-D field → uncertainty-aware reconstruction + a compressed belief state. Hybrid actuators with hard floors → structured Gumbel-Softmax + reparameterized heads, then a differentiable QP safety projection. Multi-objective, hackable reward → load-aware shaping, an absorbing limit penalty, and a Lagrangian constraint instead of a guessed weight. Non-stationary load → change-point detection, incremental KL-bounded adaptation, a canary gate, and an independent PLC backstop. Remove the field uncertainty and the binding difficulty becomes how tight a margin you dare run; tighten the margin and oscillation re-appears — so you re-run the loop. The discipline is the same one from lesson 67: never reach for a tool until a row of the table demands it.

Further considerations

Multi-agent coordination. A campus has many coupled thermal subsystems (multiple halls, shared chiller plant). Promote each to an agent and let them share boundary temperatures via a graph-attention layer, co-optimizing facility-wide long-term return instead of each hall greedily chasing its own local optimum.
Meta-learning for fast transfer. Seasonal swings and hardware refreshes shift the thermal dynamics. MAML or PEARL lets the policy adapt from a handful of samples in the new regime, avoiding the cost of re-collecting CFD ground truth every time the room changes.
Verifiable compression. For a safety-critical controller, prove the compressed state preserves the MDP: a bisimulation-equivalence argument bounding the optimal-value error of the compressed MDP (ε < 0.02) so the reduction is certified, not just empirically "good enough."
Barrier-function guarantees with sample bounds. Casting the limit as a CMDP and using a differentiable log-barrier objective yields a monotonic-improvement guarantee and an explicit step-size and barrier-coefficient choice — and a Bernstein bound gives the number of trajectories N needed to certify constraint violation ≤ ζ with probability 1−δ, with no dimension dependence. Constraint tuning stops being guesswork.
Carbon as a third objective. Fold a carbon-cost term into the reward and search the temperature-safe / energy-minimal / carbon-minimal Pareto front (e.g. Pareto-TRPO); where carbon is a tradable allowance, the agent can even arbitrate between saving energy and selling allowances.
Explainability and audit. Operators must be able to ask "why did you trust the IR camera over the fibre?" Attach a per-action explanation (attention mask + SHAP-style attribution over the fused sensors), log every decision with a hashed parameter fingerprint and data version, and keep a tested "red-button" path that switches to a rule-based baseline within ~200 ms — so a compliance audit reduces to checking the log hashes match and the fallback actually fires.