
Data-center cooling

A hall full of racks, a fleet of fans, pumps and chilled-water valves, and one number the operator is judged on: PUE — total facility power divided by IT power. RL can shave it, but only against a wall of hard physics: you never see the whole temperature field, the cheapest cooling setting is always one step from a thermal runaway, and the IT load that drives everything lurches without warning. This lesson casts cooling as a constrained, partially-observable MDP and works the four binding difficulties: a hidden 3-D thermal state, a hybrid discrete-continuous actuator space with hard safety intervals, a multi-objective PUE-vs-temperature reward that is trivially hackable, and a non-stationary load.

The method — five steps, every lesson
Applied RL is not a grab-bag of tricks. Every domain in this track is the same loop: (1) Formulate the MDP — state, action, reward, transition, horizon. (2) Diagnose the one property that makes the MDP hard. (3) Engineer the mechanism that removes that difficulty. (4) Guard it in production — detect when it breaks and fall back. (5) Iterate. The art is only in which difficulty binds first. Here it binds on safety: in a data center an unsafe action does not lose a game, it cooks a rack.

1 · Formulate — the MDP behind a cooling controller

Intuition. The controller reads thousands of temperature sensors and a handful of power meters, then sets fan speeds, pump frequencies and valve positions. Minutes later the racks are a little hotter or cooler and the facility drew a little more or less power. That is an MDP: the sensor field is the state, the actuator setpoints are the action, the energy-and-temperature outcome is the reward, and the room's thermodynamics (plus the IT load you do not control) is the transition. Every difficulty below is just one of these four pieces being awkward in a real hall.

MDP = (S, A, R, P, γ)  ·  maximize  E[ Σₜ γᵗ rₜ ]   subject to   T(s) ≤ Tmax
Piece | For a cooling controller | The awkward part
State S | thermistor array + inlet/outlet temps + IT load + outdoor weather + electricity price | sensors are sparse points in a 3-D field; the hottest voxel is usually unmeasured (partially observable)
Action A | fan stage (discrete) × pump frequency (continuous) × valve opening (continuous) | hybrid, and the actuators have hard physical floors (chiller can't run below ~30 Hz)
Reward R | −energy, with bonuses for low PUE and penalties for hot spots and SLA breach | multi-objective and hackable: the literal optimum is "turn cooling off"
Transition P | room CFD + thermal mass + the IT workload arriving from outside | load jumps (a batch job lands) → non-stationary, and thermal inertia means actions pay off with delay

The rightmost column is the lesson. Each awkward part names a mechanism: hidden field → field reconstruction + belief state; hybrid + hard floors → structured heads + a safety projection; hackable reward → constrained shaping with an absorbing penalty; jumping load → online change-point detection and fast adaptation. We reach for each tool only when its row demands it.

2 · Diagnose — the hidden thermal field (19.1)

Intuition. You have, say, a few hundred thermistors for a hall that is a continuous 3-D temperature field with millions of degrees of freedom. The dangerous quantity — the single hottest point behind some dense GPU rack — is almost never sitting under a sensor. A controller that optimizes only the temperatures it can see will happily run the room hot exactly where it is blind. This is the partial-observability row, and it is the difficulty that binds first because everything downstream (the safety margin, the reward) is defined on temperatures you must infer, not read.

Engineering detail — fuse the field, keep the uncertainty. Treat the sparse sensor vector as noisy observations of a latent field and reconstruct it. The source's recipe: an encoder that fuses multi-modal sensing (infrared cameras, fibre-optic strings, thermistors) into a voxel feature volume, decoded by a 3-D ResUNet to a dense field plus a per-voxel uncertainty. Critically, you do not collapse to a point estimate — you carry the uncertainty forward, because a hot voxel that is also high-variance is exactly where the policy must be cautious. Offline you can supervise this network against historical CFD ground truth, then warm-start an uncertainty-weighted filter so the online estimate never explodes when a sensor drops out.

b(field) = Encoder( IR, fibre, thermistors )  →  ( T̂(x,y,z), σ²(x,y,z) )  ·  T_risk = max_voxel ( T̂ + k·σ )
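The risk temperature in the formula above can be sketched in a few lines. This is a minimal illustration, assuming the reconstructed field has already been flattened into two equal-length lists; the function name `risk_temperature`, the value k = 2 and the three-voxel data are hypothetical, not taken from the source.

```python
from typing import Sequence, Tuple


def risk_temperature(t_hat: Sequence[float], sigma: Sequence[float],
                     k: float = 2.0) -> Tuple[float, int]:
    """Return (T_risk, index of the riskiest voxel) over flattened voxel lists."""
    if not t_hat or len(t_hat) != len(sigma):
        raise ValueError("t_hat and sigma must be non-empty and the same length")
    # Upper confidence bound per voxel: estimate plus k standard deviations.
    scores = [t + k * s for t, s in zip(t_hat, sigma)]
    worst = max(range(len(scores)), key=scores.__getitem__)
    return scores[worst], worst


# Hypothetical three-voxel hall, temperatures in deg C.
t_hat = [27.0, 31.0, 29.5]   # reconstructed means
sigma = [0.3, 0.5, 2.0]      # per-voxel standard deviations
t_risk, voxel = risk_temperature(t_hat, sigma, k=2.0)
```

In this toy data the hottest point estimate is voxel 1, but the riskiest voxel is the poorly observed voxel 2, which is the behaviour the text asks for: uncertainty is carried into the safety quantity rather than discarded.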

State compression — don't feed a million voxels to the policy. The raw field is far too high-dimensional to be a policy input. Compress it: an unsupervised pre-train (a β-VAE on the field, a low-dim summary of the sensor time-series) reduces the observation to a few-dozen-dimensional vector. The source's discipline is the part most people skip: verify sufficiency. Compute the mutual information between the policy's input and the field before and after compression, and require that a high fraction is retained (it reports >96%). A compression that loses task-relevant information is silently fatal, because the policy can no longer tell a safe room from a marginal one.
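One way to make the sufficiency check concrete is a plug-in mutual-information estimate on discretized data. The sketch below is an assumption-laden toy: it measures information about a hypothetical task label (safe / marginal / hot) rather than the full field, and the names `mutual_information`, `retention`, `good_code` and `lossy_code` are illustrative only.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Plug-in estimate of I(X;Y) in nats from paired discrete samples."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    # Sum of p(x,y) * log( p(x,y) / (p(x) * p(y)) ) over observed pairs.
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def retention(raw, compressed, target) -> float:
    """Fraction of the raw input's information about `target` kept after compression."""
    base = mutual_information(raw, target)
    return mutual_information(compressed, target) / base if base > 0 else 1.0


# Hypothetical data: raw peak-temperature bin 0..5, ten samples per bin.
raw = [b for b in range(6) for _ in range(10)]
target = ["safe" if b <= 2 else "marginal" if b <= 4 else "hot" for b in raw]
good_code = [0 if b <= 2 else 1 if b <= 4 else 2 for b in raw]   # keeps all three classes
lossy_code = [0 if b <= 2 else 1 for b in raw]                   # merges marginal and hot

good_retention = retention(raw, good_code, target)
lossy_retention = retention(raw, lossy_code, target)
```

The first code keeps all of the label information; the second merges "marginal" with "hot" and keeps only about two thirds of it, so it would fail a 96% retention gate. Real fields are continuous, so a production check would need a proper estimator rather than this counting version.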

Belief state, not raw observation
Because the room is partially observable, the policy should condition on a belief, not a single frame. Feed the compressed field through a recurrent core (GRU + attention over the sensor history) so the agent integrates information over time — a thermistor that briefly disagrees with its neighbours is noise; one that has drifted for an hour is a real hot spot forming. The hidden state is the controller's running estimate of where the heat actually is.

3 · Engineer — hybrid actuators inside hard safety intervals (19.2)

Intuition. A cooling action is not one knob. It is "pick a fan stage" (a discrete menu), "set pump frequency" and "set valve opening" (continuous, but bounded). Flatten it into one giant discrete grid and you explode combinatorially; treat it as one continuous vector and you lose the menu structure of the fan stages. Keep the structure: a discrete head for the stage, continuous heads conditioned on it for the analogue setpoints.

a = (a_fan, a_pump, a_valve)  |  a_fan ~ Categorical(logits)   (a_pump, a_valve) = μ(a_fan) + σ(a_fan)⊙ε

Engineering detail. Sample the discrete stage with Gumbel-Softmax so the choice stays differentiable and the setpoints fed to a real PLC are discrete-valid; reparameterize the continuous heads (ε ~ 𝒩(0,I)) so gradients flow end-to-end. Train with a clipped surrogate (PPO-Clip) for the actor and a twin critic with clipped double-Q for the mixed-action Bellman error; maximize Q + α·H, where H = H(a_fan) + H(a_pump, a_valve | a_fan) is the joint entropy and the temperature α is auto-tuned, so exploration is calibrated, not hand-set.
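The sampling path of such a hybrid head can be sketched without any deep-learning library. This is only the forward pass: it uses the Gumbel-max trick (the hard sample that a straight-through Gumbel-Softmax emits), not the relaxed, differentiable version, and the per-stage table `HEADS`, the 50 Hz pump ceiling and all numbers are hypothetical stand-ins for network outputs.

```python
import math
import random
from typing import Dict, Sequence, Tuple

# Hypothetical per-stage Gaussian parameters: (mu_pump_hz, sd_pump_hz, mu_valve, sd_valve).
HEADS: Dict[int, Tuple[float, float, float, float]] = {
    0: (32.0, 1.0, 0.20, 0.05),
    1: (38.0, 1.5, 0.45, 0.05),
    2: (46.0, 2.0, 0.80, 0.05),
}
PUMP_RANGE = (30.0, 50.0)   # Hz; the ~30 Hz floor is from the text, the ceiling is assumed
VALVE_RANGE = (0.0, 1.0)    # valve opening as a fraction


def clamp(x: float, bounds: Tuple[float, float]) -> float:
    return min(bounds[1], max(bounds[0], x))


def sample_fan_stage(logits: Sequence[float], rng: random.Random) -> int:
    """Gumbel-max: argmax(logit + Gumbel noise) is a draw from softmax(logits)."""
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)   # keep both logs finite
        noisy.append(logit - math.log(-math.log(u)))
    return max(range(len(noisy)), key=noisy.__getitem__)


def sample_action(logits: Sequence[float], rng: random.Random) -> Tuple[int, float, float]:
    """Discrete stage first, then continuous setpoints conditioned on that stage."""
    stage = sample_fan_stage(logits, rng)
    mu_p, sd_p, mu_v, sd_v = HEADS[stage]
    pump = mu_p + sd_p * rng.gauss(0.0, 1.0)     # reparameterized: mu + sigma * eps
    valve = mu_v + sd_v * rng.gauss(0.0, 1.0)
    return stage, clamp(pump, PUMP_RANGE), clamp(valve, VALVE_RANGE)


rng = random.Random(0)
stage, pump_hz, valve = sample_action([0.0, 0.0, math.log(2.0)], rng)
```

The clamping here only enforces actuator ranges; it is not the safety projection, which also has to account for the predicted temperature.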

The safety interval is the whole game. The chiller has a minimum frequency (~30 Hz); valves have travel limits; and above all, no action may drive any voxel above Tmax. A naive answer is "give an over-temperature action a −10⁶ penalty." The source is emphatic that this is wrong: with a large penalty the policy still explores near the edge and random exploration will occasionally cross it — and crossing it once means cooked hardware. The right answer is to make the unsafe action impossible, not merely expensive.

a_safe = argmin_{a'} ‖a' − a_raw‖²  s.t.  g(s, a') ≤ 0  ·  T̂_{t+1}(s, a_safe) ≤ Tmax

Engineering detail — a differentiable safety layer. After the policy emits a raw action, project it onto the feasible set with a small QP solver: predict the next-step temperature T̂_{t+1} = f(T_t, a_t, load) and find the nearest action that keeps T̂_{t+1} ≤ Tmax (nearest, so control stays smooth). Backpropagate through the projection, so the network actually learns to steer away from the boundary rather than relying on the projector to clean up after it. Where you can afford it, certify the layer with a Control-Barrier-Function check (define a barrier h that is positive inside the safe set and falls to zero at the boundary, e.g. h = Tmax − T̂; require dh/dt + κ·h ≥ 0) and a model-predictive variant (MPSL) that projects over an N-step horizon, because single-step projection can be too late when thermal coupling is fast.
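For intuition, the projection can be worked exactly in a simplified case. The sketch below assumes a linear one-step model, T_next = T_now + heat_in − Σ gain[i]·a[i], with positive gains and actions normalized to [0, 1] (so a 0.6 floor stands in for a 30 Hz minimum on an assumed 50 Hz pump). Under those assumptions the QP has one linear constraint plus box bounds, and its solution is clip(a_raw + λ·gain) for the smallest λ ≥ 0 that meets the temperature limit, found here by bisection. The function name and all numbers are hypothetical, and this plain-Python version is not differentiable.

```python
from typing import List, Sequence, Tuple


def project_to_safe(a_raw: Sequence[float], lo: Sequence[float], hi: Sequence[float],
                    gain: Sequence[float], t_now: float, heat_in: float,
                    t_max: float, iters: int = 60) -> Tuple[List[float], bool]:
    """Nearest action with lo <= a <= hi and predicted T_next <= t_max.

    Returns (action, feasible). Assumes every gain is positive.
    """
    need = t_now + heat_in - t_max   # cooling effect the action must deliver

    def shifted(lam: float) -> List[float]:
        # KKT form of the solution: move along the gain direction, then clip to the box.
        return [min(h, max(l, a + lam * g)) for a, l, h, g in zip(a_raw, lo, hi, gain)]

    def cooling(a: Sequence[float]) -> float:
        return sum(g * x for g, x in zip(gain, a))

    a0 = shifted(0.0)                # raw action clipped to actuator limits
    if cooling(a0) >= need:
        return a0, True              # already safe: leave it alone
    if cooling(hi) < need:
        return list(hi), False       # even full cooling is not enough: flag it
    lam_lo, lam_hi = 0.0, 1.0
    while cooling(shifted(lam_hi)) < need:
        lam_hi *= 2.0                # grow the bracket until it is feasible
    for _ in range(iters):           # bisect for the smallest feasible multiplier
        mid = 0.5 * (lam_lo + lam_hi)
        if cooling(shifted(mid)) >= need:
            lam_hi = mid
        else:
            lam_lo = mid
    return shifted(lam_hi), True


# Hypothetical numbers: actions are (pump, valve), normalized.
a_safe, feasible = project_to_safe(
    a_raw=[0.6, 0.1], lo=[0.6, 0.0], hi=[1.0, 1.0], gain=[4.0, 2.0],
    t_now=30.0, heat_in=5.0, t_max=32.0)
```

With these numbers the raw action would let the room reach 32.4 °C; the projection nudges it to roughly (0.68, 0.14), which lands on the 32 °C limit. When the function returns `feasible=False`, the controller should escalate rather than trust the action.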

Oscillation — the second-order failure of a smooth-looking controller
A controller that respects every limit can still chatter: fans hunting up and down, valves cycling, because Q-value noise and a non-stationary load keep flipping the argmax. Chatter wears actuators and confuses operators. The source's fixes, in order of leverage: (1) TV-regularize the action sequence and add a separable smoothing layer so consecutive setpoints can't jump; (2) anneal exploration by decaying the entropy coefficient on a schedule (e.g. 0.2 → 0.01) so late training stops injecting randomness; (3) average out environment noise with many parallel samplers and synchronous gradient averaging (more workers, lower update variance, fewer "back-and-forth" reversals); (4) normalize reward scale with PopArt so a swing in the raw kWh term doesn't blow up Q-values. Monitor gradient norm, Q-std and policy KL; if KL > 0.02 or the gradient norm spikes, automatically roll back to the previous model.
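Fixes (1) and (2) are easy to picture with a few helper functions. In this sketch the total variation is the quantity a TV regularizer would penalize, a simple rate limiter stands in for the smoothing layer (the source does not specify its form), and the entropy schedule is assumed to be linear; the 2 Hz step limit and the sample sequence are hypothetical.

```python
from typing import List, Sequence


def total_variation(seq: Sequence[float]) -> float:
    """Sum of absolute step-to-step changes; a TV penalty would add this to the loss."""
    return sum(abs(b - a) for a, b in zip(seq, seq[1:]))


def rate_limit(seq: Sequence[float], max_step: float) -> List[float]:
    """Clamp each change to +/- max_step relative to the previous emitted setpoint."""
    out = [seq[0]]
    for x in seq[1:]:
        prev = out[-1]
        out.append(prev + max(-max_step, min(max_step, x - prev)))
    return out


def entropy_coef(step: int, total_steps: int,
                 start: float = 0.2, end: float = 0.01) -> float:
    """Linear anneal of the entropy coefficient from `start` to `end` (assumed shape)."""
    frac = min(1.0, max(0.0, step / total_steps))
    return start + frac * (end - start)


# Hypothetical chattering pump command in Hz.
raw = [40.0, 46.0, 40.0, 46.0, 40.0]
smooth = rate_limit(raw, max_step=2.0)
tv_raw, tv_smooth = total_variation(raw), total_variation(smooth)
```

A rate limiter reduces wear but also delays a legitimate large move, so the step limit has to be chosen together with the safety projection rather than independently of it.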

[Figure: the five-step loop. 1 · Formulate: S, A, R, P as a constrained POMDP. 2 · Diagnose: hidden field, hard safety limits. 3 · Engineer: field VAE, hybrid heads, safety projection. 4 · Guard: CBF certificate, PLC hard cutoff. 5 · Iterate: re-diagnose; removing one difficulty exposes the next, so re-run the loop.]

4 · Engineer — the PUE-vs-SLA reward (19.3)

Intuition. The thing you want is low PUE. But PUE alone has a degenerate optimum — switch the cooling off and PUE momentarily looks great while the room climbs toward failure. So the reward is genuinely multi-objective: spend less energy, and keep every voxel cool, and honour the IT-side latency SLA, and ideally shift heavy jobs to cheap electricity. Each extra term is a new thing the agent will optimize literally, so each must be shaped with care.

r = r_shape(PUE) + r_pen − α·d_latency − β·π_price·(P_total/1000)

Engineering detail. The source's structure: a shaping term that rewards driving PUE toward a load-aware baseline (at low IT load PUE is naturally inflated, so the baseline is PUE_ref(load) = 1.05 + 0.25·e^(−5·load) and the agent isn't punished for physics it can't beat); a hard absorbing penalty (PUE > 1.5 → episode terminates with a large negative reward, so the policy never learns to chase a "good" PUE by starving cooling); a latency term (weight α ≈ 0.1 ms⁻¹) tying cooling to the actual IT SLA; and an electricity-price term (β ≈ 0.01) that, where peak/off-peak spread is large (≥0.6 per kWh), pushes heavy workloads into the cheap overnight window, cutting both PUE and the bill. Normalize the total to [−1,1] and feed it to PPO with GAE(λ=0.95). The source reports a hall at ~10k-rack scale converging in ~3 days, PUE falling from 1.42 to 1.27.
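The reward terms above fit in a short function. The baseline formula, the 1.5 threshold and the weights 0.1 and 0.01 come from the text; everything else is an assumption made for this sketch: the shaping term is taken to be PUE_ref(load) − PUE, load is a fraction in [0, 1], the latency input is in ms, normalization is a plain clip to [−1, 1], and the terminal reward is −1.

```python
import math
from typing import Tuple

ALPHA_LATENCY = 0.1   # per ms, from the text
BETA_PRICE = 0.01     # from the text
PUE_LIMIT = 1.5       # absorbing threshold, from the text


def pue_ref(load: float) -> float:
    """Load-aware PUE baseline; `load` is assumed to be a fraction in [0, 1]."""
    return 1.05 + 0.25 * math.exp(-5.0 * load)


def step_reward(pue: float, load: float, latency_ms: float,
                price: float, p_total: float) -> Tuple[float, bool]:
    """Return (reward, done). Crossing the PUE limit ends the episode."""
    if pue > PUE_LIMIT:
        return -1.0, True                      # absorbing penalty (assumed value)
    shape = pue_ref(load) - pue                # assumed shaping form: beat the baseline
    raw = shape - ALPHA_LATENCY * latency_ms - BETA_PRICE * price * (p_total / 1000.0)
    return max(-1.0, min(1.0, raw)), False     # assumed normalization: clip


# Hypothetical operating points at 80% load.
r_good, done_good = step_reward(pue=1.27, load=0.8, latency_ms=0.5, price=0.6, p_total=2000.0)
r_poor, done_poor = step_reward(pue=1.42, load=0.8, latency_ms=0.5, price=0.6, p_total=2000.0)
r_term, done_term = step_reward(pue=1.60, load=0.8, latency_ms=0.5, price=0.6, p_total=2000.0)
```

Note that nothing in this function looks at temperature; in the source's design the thermal limit is enforced separately, which is why the reward alone cannot be trusted to keep the room safe.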

Reward hacking — and why a penalty is not enough
Two layers. (1) Constrained, not weighted. The temperature limit is a constraint, handled with a Lagrange multiplier (or CPO / PID-Lagrangian), not a hand-tuned penalty weight, because a weight that is too small lets the agent buy a hot rack with energy savings, and one that is too large makes it freeze the room. (2) Watch the behaviour, not the score. If shaped reward is high while a behavioural ratio (cooling energy per unit IT load, say) drifts more than 3σ below the human-operator baseline, that is the agent gaming the shaping; alert and roll back. And against the worst case, a sensor feeding fake low temperatures so the policy shuts cooling off, treat the sensor attack as an adversary and solve for a robust-CPO worst-case policy.
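Both layers reduce to small update rules. The sketch below shows plain dual ascent on the multiplier (the simplest Lagrangian variant, not CPO or PID-Lagrangian) and a 3σ behavioural alarm; the cost definition, the learning rate and the baseline statistics are hypothetical.

```python
def update_multiplier(lam: float, avg_cost: float, limit: float, lr: float = 0.05) -> float:
    """Dual ascent: raise lambda while the constraint is violated, never below zero."""
    return max(0.0, lam + lr * (avg_cost - limit))


def penalized_reward(reward: float, cost: float, lam: float) -> float:
    """Reward the policy actually optimizes under the current multiplier."""
    return reward - lam * cost


def behaviour_alarm(ratio: float, human_mean: float, human_std: float, k: float = 3.0) -> bool:
    """True when cooling energy per unit IT load falls more than k sigma below baseline."""
    return ratio < human_mean - k * human_std


# Hypothetical run: five violating batches (cost 0.4 vs budget 0.1), then ten clean ones.
lam = 0.0
for _ in range(5):
    lam = update_multiplier(lam, avg_cost=0.4, limit=0.1)
lam_after_violation = lam
for _ in range(10):
    lam = update_multiplier(lam, avg_cost=0.0, limit=0.1)
lam_after_recovery = lam
```

The multiplier settles wherever the constraint is just satisfied, which is the point of the method: the exchange rate between energy and temperature is learned from violations instead of being guessed up front.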

5 · Guard — surviving a non-stationary load in production (19.4)

Intuition. The IT load is exogenous and bursty — a batch job lands, a region fails over, traffic triples on a sale day — and the room's thermal behaviour shifts under the policy's feet. A controller tuned for yesterday's load can be wrong for today's. You need to detect the shift fast and adapt without retraining from scratch, all while a hardware backstop guarantees safety regardless.

Engineering detail — detect. Maintain an EWMA mean and EW-variance on each key signal (μ_t = α·x_t + (1−α)·μ_{t−1}; flag a local anomaly when |x_t − μ_t| > k·σ_t, k ≈ 3.5). For a genuine regime change, run online Bayesian change-point detection: a Normal-Gamma conjugate posterior updated each mini-batch, declaring a global change-point when the log Bayes factor exceeds a threshold for two consecutive batches. On a confirmed change-point: freeze the old replay buffer and open a fresh one (so distributions don't mix), warm-start a meta-learned policy (MAML/PEARL) for fast adaptation, and signal the orchestrator to scale capacity if CPU is also saturating.
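The local-anomaly half of the detector is small enough to sketch in full (the Bayesian change-point stage is not shown). Two choices here are assumptions rather than the source's recipe: each sample is tested against the mean and variance from before it is absorbed, so a spike cannot mask itself, and a short warm-up suppresses flags while the variance estimate is still settling. The class name and the load trace are hypothetical.

```python
import math
from typing import Optional


class EwmaDetector:
    """Exponentially weighted mean/variance with a k-sigma anomaly flag."""

    def __init__(self, alpha: float = 0.1, k: float = 3.5, warmup: int = 10) -> None:
        self.alpha, self.k, self.warmup = alpha, k, warmup
        self.mean: Optional[float] = None
        self.var = 0.0
        self.n = 0

    def update(self, x: float) -> bool:
        """Absorb one sample; return True if it was anomalous against prior statistics."""
        if self.mean is None:
            self.mean, self.n = x, 1
            return False
        dev = x - self.mean
        anomalous = self.n >= self.warmup and abs(dev) > self.k * math.sqrt(self.var)
        # mean_t = alpha * x + (1 - alpha) * mean_{t-1}, written incrementally.
        self.mean += self.alpha * dev
        # Matching exponentially weighted variance recurrence.
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * dev * dev)
        self.n += 1
        return anomalous


# Hypothetical normalized IT load: steady wobble around 0.50, then a batch job lands.
series = [0.50] + [0.52 if i % 2 == 0 else 0.48 for i in range(30)] + [0.90]
detector = EwmaDetector()
flags = [detector.update(x) for x in series]
```

Only the final jump is flagged. A single flag like this is a local anomaly; deciding that the regime has actually changed is the job of the change-point stage.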

Engineering detail — adapt online, safely. Incremental PPO on a sliding window of recent interaction, mixed with ~20% reservoir-sampled history to prevent forgetting; a KL penalty with dual-gradient-adjusted β holding each step's KL < 0.01 so adaptation can't lurch; recompute advantages with GAE to keep variance down. Then a shadow / canary gate: the new policy serves a small fraction of capacity, and if a core metric drops more than ~1σ within minutes, roll back to the previous version in seconds. Where you have a workload forecast, close the loop with model-based rollouts (MBPO-style short horizons weighted by predictive uncertainty, w = 1/(σ²+ε)) so the controller pre-positions cooling before the heat arrives instead of chasing it.
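The history mix can be sketched with classic reservoir sampling (Algorithm R), which keeps a uniform sample of everything seen so far in fixed memory. The 20% fraction is from the text; the class and function names, the buffer sizes and the tagged tuples standing in for transitions are hypothetical.

```python
import random
from typing import Any, List, Sequence


class Reservoir:
    """Fixed-size uniform sample over an unbounded stream (Algorithm R)."""

    def __init__(self, capacity: int, rng: random.Random) -> None:
        self.capacity, self.rng = capacity, rng
        self.items: List[Any] = []
        self.seen = 0

    def add(self, item: Any) -> None:
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            j = self.rng.randrange(self.seen)   # uniform over all items seen so far
            if j < self.capacity:
                self.items[j] = item            # replace with probability capacity/seen


def mixed_batch(recent: Sequence[Any], history: Reservoir, batch_size: int,
                history_frac: float, rng: random.Random) -> List[Any]:
    """Draw a batch that is ~history_frac old experience and the rest recent."""
    n_hist = min(int(round(batch_size * history_frac)), len(history.items))
    n_recent = batch_size - n_hist
    return rng.choices(recent, k=n_recent) + rng.sample(history.items, n_hist)


rng = random.Random(1)
history = Reservoir(capacity=50, rng=rng)
for i in range(1000):
    history.add(("old", i))                     # pre-change-point transitions
recent_window = [("new", i) for i in range(200)]
batch = mixed_batch(recent_window, history, batch_size=100, history_frac=0.2, rng=rng)
n_old = sum(1 for tag, _ in batch if tag == "old")
```

Because the reservoir is uniform over the whole past, old regimes stay represented in proportion to how long they lasted, which is what guards against forgetting a load pattern that may return.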

The hardware backstop is not optional
RL is the brain; it must not be the only line of safety. The deployed pattern is two-channel: the RL policy emits setpoints / reference trajectories, and a separately-certified low-level controller (and an independent safety PLC) closes the loop in hard real time — the PLC monitors raw temperatures and can physically cut the actuator drive within milliseconds, completely independent of the RL process. So even if the policy, the field estimator and the safety projection all fail at once, the room still cannot exceed its limit. Validate the whole stack against millions of Monte-Carlo boundary cases in a digital twin and require zero violations before it ever touches the floor.

6 · The central trade-off — PUE vs thermal safety margin

Intuition. Everything above collapses to one dial. Run the cooling hard and you keep a fat safety margin below Tmax but pay a high PUE. Throttle it and PUE drops — until your margin is so thin that the uncertainty in the hidden-field estimate, plus the next load spike, tips a voxel over the limit. The whole engineering effort — field reconstruction, the safety projection, the constrained reward — exists to let you run the room closer to the limit safely, i.e. to convert reduced uncertainty into lower PUE without buying a breach. The widget below is that napkin math: pick a setpoint margin and a load-spike size, and watch energy savings trade against breach risk.

Cooling setpoint margin → energy saving vs. over-temperature risk

Tighter margin (cooling backed off) saves energy but eats into the buffer that absorbs hidden-field uncertainty and load spikes. Better sensing (lower σ) and a load forecast both let you tighten the margin without raising breach risk — that is the entire value of the RL stack.

[Interactive widget outputs: PUE, energy saving, over-temp risk, verdict]
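The same napkin math can be written down directly. This is not the widget's actual model, which the page does not show. It assumes the hidden-field estimation error is Gaussian with standard deviation σ, that a load spike adds a fixed temperature rise, and that each degree of margin given up saves a fixed fraction of cooling energy; the 2% per degree figure and the 8 °C baseline are placeholders, not measured values.

```python
import math


def breach_probability(margin_c: float, sigma_c: float, spike_c: float) -> float:
    """P(estimation error + spike > margin), with error ~ Normal(0, sigma_c^2)."""
    z = (margin_c - spike_c) / sigma_c
    return 0.5 * math.erfc(z / math.sqrt(2.0))   # upper tail of the standard normal


def energy_saving(margin_c: float, baseline_margin_c: float = 8.0,
                  saving_per_deg: float = 0.02) -> float:
    """Fractional cooling-energy saving from running a thinner margin (placeholder model)."""
    return saving_per_deg * (baseline_margin_c - margin_c)


# Same 5 deg C margin and 2 deg C spike, with coarse vs. good sensing.
p_coarse = breach_probability(margin_c=5.0, sigma_c=1.5, spike_c=2.0)
p_sharp = breach_probability(margin_c=5.0, sigma_c=0.5, spike_c=2.0)
# Better sensing lets the margin tighten to 4 deg C while staying far safer than before.
p_tight = breach_probability(margin_c=4.0, sigma_c=0.5, spike_c=2.0)
extra_saving = energy_saving(4.0) - energy_saving(5.0)
```

Under these assumptions, cutting σ from 1.5 °C to 0.5 °C takes the breach probability at a 5 °C margin from about 2.3% to about one in a billion, and the margin can then be tightened by a degree while still staying well below the original risk.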

7 · Iterate — the through-line

Each section is one row of the MDP table turned into a mechanism. Hidden 3-D field → uncertainty-aware reconstruction + a compressed belief state. Hybrid actuators with hard floors → structured Gumbel-Softmax + reparameterized heads, then a differentiable QP safety projection. Multi-objective, hackable reward → load-aware shaping, an absorbing limit penalty, and a Lagrangian constraint instead of a guessed weight. Non-stationary load → change-point detection, incremental KL-bounded adaptation, a canary gate, and an independent PLC backstop. Remove the field uncertainty and the binding difficulty becomes how tight a margin you dare run; tighten the margin and oscillation re-appears — so you re-run the loop. The discipline is the same one from lesson 67: never reach for a tool until a row of the table demands it.

Further considerations