Energy-management microgrid

A microgrid's dispatch controller decides, every few seconds, how much to charge or discharge the battery, which loads to shed, and how much to draw from the grid, all to minimize the electricity bill without ever violating a physical limit. Four things make this MDP hard: the state is a forecast, not a fact (weather and price are uncertain); the action lives under hard safety constraints that a stochastic policy must never break; the reward spans multiple time-scales and includes a depreciation cost you cannot directly measure; and the environment is openly non-stationary (seasons) and distributed across sites that will not share raw load data. Each of those names a mechanism.

The method — five steps, every lesson
Applied RL is not a grab-bag of tricks. Every domain in this track is the same loop: (1) Formulate the MDP — state, action, reward, transition, horizon. (2) Diagnose the one property that makes the MDP hard. (3) Engineer the mechanism that removes that difficulty. (4) Guard it in production — detect when it breaks and fall back. (5) Iterate. The art is only in which difficulty binds first. For a microgrid, uncertainty and hard constraints bind before anything else.

1 · Formulate — the MDP behind a microgrid controller

Intuition. A microgrid agent looks at the weather forecast, the battery's state of charge, the current electricity price, and the building's load; it decides how to move power between battery, grid, and loads; it pays the electricity bill at the end of each interval. That is already a Markov Decision Process. Every difficulty below is one of these four pieces being awkward in a real grid.

MDP = (S, A, R, P, γ)  ·  maximize  E[ Σₜ γᵗ rₜ ]  ·  Δt ≈ 1–15 min, horizon = one day
Piece | For a battery+load microgrid | The awkward part
State S | SoC, load, PV/wind output, price, weather forecast | the forecast is a distribution, not a value, and links can drop → uncertain & partially observable
Action A | battery power P (continuous) × which loads / chillers to switch (integer) | hybrid, and most actions would violate a hard physical limit
Reward R | energy cost saved − battery wear − outage penalty | spans real-time + peak/valley + demand tiers, and wear is unmeasured
Transition P | weather, market price, battery electrochemistry, neighbours | seasonal non-stationarity and a multi-agent, privacy-bound grid

The same pattern as every lesson: the MDP table writes itself, and the rightmost column — the awkward part — is the lesson. Each row's difficulty has a named mechanism that answers it.

2 · State is a forecast — carry the distribution, not the point

Intuition. If you feed the policy a single predicted wind speed and treat it as truth, the agent learns to bet the farm on a number that is wrong half the time. A weather forecast is honestly a probability distribution; the controller should see the spread, not just the mean, so it can hold back power when the forecast is uncertain and lean in when it is confident.

Engineering detail. Offline, train a conditional variational auto-encoder (CVAE) on years of met-tower, satellite, and numerical-weather-ensemble data: the encoder ingests the current 500 hPa geopotential field, surface pressure and humidity; the decoder emits the joint distribution parameters — mean, variance, correlation — of local wind speed, direction and temperature. A latent dimension of just 32 keeps the model under the grid-edge GPU's 2 GB budget. Online, append those distribution parameters to the deterministic state to form an augmented state:

s_t^+ = [ s_t^det, μ_t, σ_t, ρ_t ]

On the forward pass, use the reparameterization trick to draw K = 5 weather realizations {s_{t,k}}, run all K through a shared-weight actor, and feed the statistics of the distribution (mean + 95% CVaR) into the critic, which outputs a risk-sensitive value V_φ(s_t^+). The training objective is standard PPO plus a risk term:

ℒ = ℒ_PPO + β · CVaR_{0.95}(−V_φ)

β is tuned on site so that under a ±5 m/s wind step the power ramp still respects the grid code's 1.5 MW/min limit. Against a baseline that treats the forecast as truth, this distributional state has been reported to lift annual energy yield ~1.8% and cut dispatch-penalty fines ~42%. When only a deterministic forecast is available, fit the historical residuals with deep kernel density estimation and add that residual distribution online — the "distributional state" requirement still holds.
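To make the sampling and risk term concrete, here is a minimal sketch in plain Python. It assumes a single scalar forecast variable and treats the critic's outputs as a plain list of numbers; the function names (`sample_realizations`, `cvar`, `risk_term`) and the default β are illustrative, not taken from any particular library or from the deployed system.

```python
import math
import random

def sample_realizations(mu, sigma, k=5, rng=None):
    """Reparameterized draws: s_k = mu + sigma * eps_k with eps_k ~ N(0, 1)."""
    rng = rng or random.Random(0)
    return [mu + sigma * rng.gauss(0.0, 1.0) for _ in range(k)]

def cvar(losses, alpha=0.95):
    """Mean of the worst (1 - alpha) fraction of losses, using at least one sample."""
    # round() guards against float noise such as 0.05 * 100 = 5.000000000000004
    n_tail = max(1, math.ceil(round((1.0 - alpha) * len(losses), 9)))
    worst = sorted(losses, reverse=True)[:n_tail]
    return sum(worst) / n_tail

def risk_term(values, beta=0.1, alpha=0.95):
    """beta * CVaR_alpha(-V): the extra loss term added on top of the PPO loss."""
    return beta * cvar([-v for v in values], alpha)
```

Note that with only K = 5 samples the 95% tail contains a single sample, so the empirical CVaR reduces to the worst of the five draws; a larger K gives a smoother tail estimate at higher inference cost.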

3 · Standardize state & action without breaking the constraints

Intuition. Neural networks want inputs near unit scale, but a microgrid mixes volts, amps, kWh and °C — wildly different magnitudes. Naive normalization is dangerous here: if the agent reasons in normalized space, a small numeric slip can map back to a current that trips a breaker. So normalization must be invertible, auditable, and constraint-aware.

Engineering detail — three moves. (1) Robust, frozen scaling. From ≥10⁶ logged transitions, compute a per-dimension robust scale that ignores spikes, then freeze it:

offset = median(x) ,   scale = median(|x − offset|) / 0.6745

Bake offset and scale into the model's constant nodes so inference never recomputes them — the DCS can audit them as static constants. (2) A boundary-distance feature. For any hard-constrained variable, don't just hand the network the normalized value; also hand it the log-distance to the legal limit, so the gradient is both stable and aware of which direction is dangerous:

s_norm = [ (x − offset)/scale ,   log(ε + c − x) ] ,   ε = 1e-3

where c is the regulatory ceiling. (3) De-normalize then project, in physical units. The policy emits mean and std; convert to raw units with the same offset/scale, then clip in physical space — so the clip can guarantee no violation while log_prob stays differentiable for SAC/PPO:

a_raw = mean · action_scale + action_offset ,   a_proj = clip(a_raw, a_min, a_max)
Why the clip must live in physical units
If you clip in normalized space you are clipping a number, not a current — rounding and scale drift can still hand the actuator an out-of-range value. Clipping in raw units after de-normalization makes the safety guarantee exact, independent of the network's numeric state. When a limit is itself dynamic (e.g. allowed current drops with temperature), feed c as an extra state input so the policy conditions on it — but keep c slow-varying (seconds) versus inference (milliseconds), or you reintroduce non-stationarity.
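The three moves fit in a few lines. The sketch below uses only the standard library; it assumes the offset is the median, falls back to a scale of 1.0 for a constant signal, and clamps the headroom at zero so the logarithm stays defined when a logged value already sits above the ceiling. Those three choices and all function names are assumptions added for illustration.

```python
import math
from statistics import median

def fit_robust_scaler(xs):
    """Median offset and MAD-based scale, computed once offline and then frozen."""
    offset = median(xs)
    mad = median([abs(x - offset) for x in xs])
    scale = mad / 0.6745 if mad > 0 else 1.0  # assumed fallback for constant signals
    return offset, scale

def normalize_with_boundary(x, offset, scale, ceiling, eps=1e-3):
    """Return [normalized value, log-distance to the hard ceiling c]."""
    headroom = max(ceiling - x, 0.0)  # assumed clamp: keeps log() defined above c
    return [(x - offset) / scale, math.log(eps + headroom)]

def to_physical(mean, action_scale, action_offset, a_min, a_max):
    """De-normalize first, then clip in physical units."""
    a_raw = mean * action_scale + action_offset
    return min(max(a_raw, a_min), a_max)
```

For example, `to_physical(2.0, 10.0, 0.0, -15.0, 15.0)` de-normalizes to 20 and the clip returns 15, regardless of what the network's normalized output looked like.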

4 · The action under hard constraints — a three-layer safety chain

Intuition. The battery has physical red lines: never over-discharge, never exceed peak power, keep efficiency above a floor. A learning policy is stochastic — given enough samples it will eventually pick a forbidden action unless something structurally forbids it. The principle is defense in depth: make illegal actions impossible to sample, project anything that slips through, and penalize residual risk during learning.

Engineering detail. First, reparameterize the action: instead of raw power P, output a normalized efficiency coefficient k∈[0,1] mapped through a differentiable lookup to physical power P = f⁻¹(k; SoC), so the policy gradient has real magnitude only in the high-efficiency band. Then the three layers:

1. mask → 2. project → 3. constrained-RL penalty

State estimation closes the loop: a domestic EKF library keeps SoC error to ~1.8%, satisfying the 5% margin the standard requires. One 50 kWh unit ran six months with no BMS protection trip and <1% capacity fade.
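A minimal sketch of the mask → project → penalty chain for battery power alone. The sign convention (positive power = discharge), the lossless SoC update, the 10% to 90% SoC window and the fixed Lagrange multiplier are all assumptions made for illustration; a real constrained-RL setup would learn the multiplier and model efficiency.

```python
def feasible_power_band(soc, capacity_kwh, dt_h, p_rated_kw,
                        soc_min=0.1, soc_max=0.9):
    """Power interval [p_min, p_max] that keeps next-step SoC inside its window.

    Assumes SoC_next = SoC - P * dt / capacity, with P > 0 meaning discharge.
    """
    p_max = min(p_rated_kw, (soc - soc_min) * capacity_kwh / dt_h)
    p_min = max(-p_rated_kw, -(soc_max - soc) * capacity_kwh / dt_h)
    return p_min, p_max

def mask_candidates(candidates, band):
    """Layer 1: flag which discrete power levels may be sampled at all."""
    p_min, p_max = band
    return [p_min <= p <= p_max for p in candidates]

def safe_action(p_requested, band, lagrange_multiplier=1.0):
    """Layer 2 projects into the band; layer 3 prices the size of the correction."""
    p_min, p_max = band
    p_safe = min(max(p_requested, p_min), p_max)
    penalty = lagrange_multiplier * abs(p_requested - p_safe)
    return p_safe, penalty
```

With a 50 kWh unit at 12% SoC and a 15-minute step, the band allows only about 4 kW of discharge, so a requested 20 kW is projected down to roughly 4 kW and the policy is charged for the 16 kW it tried to overshoot.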

Hybrid action: hierarchical nesting + mixed PPO
Integer load-switching and continuous valve/power setpoints are different animals, so split them. An upper Manager observes load forecast, price and wet-bulb temperature and emits an integer switch vector a_d ∈ {0,1}ⁿ via masked Gumbel-Softmax; a lower Worker, conditioned on a_d's one-hot, emits continuous setpoints a_c via a Tanh-Normal. Train both with one mixed-PPO ratio: straight-through Gumbel gradients for the discrete part, Gaussian log-prob for the continuous part, a shared value net, GAE(λ=0.95), plus a switch penalty −0.5·|a_{d,t} − a_{d,t−1}| and a rate penalty −0.01·‖a_{c,t} − a_{c,t−1}‖² to suppress chattering. A minimum on/off interval (e.g. ≥300 s between chiller starts) is handled by exposing a "remaining lock-out time" in the state, so the agent learns zero-shot compliance rather than relying on post-hoc rule patches. At hundreds of devices the 2ⁿ discrete space explodes; embed the device topology with a GNN and use a pointer network to select one device per step, taking complexity from exponential to linear.
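The two anti-chattering penalties and the lock-out feature are simple enough to write down directly. This sketch assumes the switch penalty sums absolute changes across devices (an L1 reading of |a_{d,t} − a_{d,t−1}|) and that time is tracked in seconds; the function names are illustrative.

```python
def chatter_penalty(ad_t, ad_prev, ac_t, ac_prev, w_switch=0.5, w_rate=0.01):
    """Switch penalty on the binary vector plus squared-rate penalty on setpoints."""
    switches = sum(abs(a - b) for a, b in zip(ad_t, ad_prev))
    rate_sq = sum((a - b) ** 2 for a, b in zip(ac_t, ac_prev))
    return -w_switch * switches - w_rate * rate_sq

def remaining_lockout(now_s, last_start_s, min_interval_s=300.0):
    """Seconds until the device may start again; exposed to the agent as state."""
    return max(0.0, min_interval_s - (now_s - last_start_s))
```

Flipping one device while moving a setpoint by 2 units costs 0.5·1 + 0.01·4 = 0.54 in reward. The same `remaining_lockout` value can also drive the action mask, so a start is unsampleable while it is positive.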

5 · Reward — bill, depreciation, and outage, on three clocks

Intuition. The obvious reward is the energy-cost saving, but cycling the battery hard to chase arbitrage silently ages it — a cost you pay months later and never see in the per-step bill. And shedding a critical load to save money can black out a hospital ward, a cost no price can capture. So the reward is three terms: revenue, a wear charge, and an outage penalty whose weight changes with how critical the load is right now.

r_t = r_revenue + r_depreciate − λ(t) · P_penalty

Engineering detail: making wear visible and bounded. Derive a stress factor stress_t from temperature, depth-of-discharge and SoC window, concatenate it into the SAC critic's input, and let r_depreciate charge for it, but keep |r_depreciate| ≲ 0.01 when r_revenue ∈ [−0.1, 0.1] so depreciation shades rather than overshadows the main signal. The policy then learns to back off power or shift its SoC window under high stress. Two refinements from the field: once capacity drops below 80% fade accelerates, so add a feedback term stress_cap = 1/(1 − 0.8·k_cap) to make the agent conservative early; and when a hard standard says "temperature > 55 °C or SoC < 5% → forced shutdown," set the corresponding stress to +∞ so the constraint is satisfied through the reward itself, with no engineering patch.

Engineering detail — the dynamic outage weight. The outage penalty weight is not constant; it scales with battery reserve, price pressure and weather alerts:

λ(t) = 1.0 · (1 + 0.5·(1 − SoC)) · (1 + 0.3·Price_ratio) · (1 + 0.2·Weather_idx)

where Price_ratio is the current price over the trailing-30-day max. Top-tier (T0) critical loads additionally get a −∞ mask on any shed action, so the policy physically cannot cut them. Before go-live, run a counterfactual simulation: hold λ constant, check the policy doesn't collapse into "shed everything" or "serve everything," then ramp λ back to dynamic and verify the loss-of-load probability (LOLP) stays under the utility's ≤0.1% KPI.
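The dynamic weight and the three-term reward translate directly into code. The sketch below assumes SoC and the weather index are both scaled to [0, 1] and that the depreciation term is passed in already signed (zero or negative); those conventions are assumptions, the coefficients are the ones in the formula.

```python
def outage_weight(soc, price_ratio, weather_idx, base=1.0):
    """lambda(t): grows as reserve falls, prices climb, or weather alerts rise."""
    return (base
            * (1 + 0.5 * (1 - soc))
            * (1 + 0.3 * price_ratio)
            * (1 + 0.2 * weather_idx))

def step_reward(r_revenue, r_depreciate, p_penalty, lam):
    """r_t = r_revenue + r_depreciate - lambda(t) * P_penalty."""
    return r_revenue + r_depreciate - lam * p_penalty
```

A full battery on a calm, cheap day gives λ = 1.0. At 50% SoC, with the price at its 30-day max and a maximum weather alert, λ = 1.25 · 1.3 · 1.2 = 1.95, so the same shed action costs almost twice as much.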

Unifying multi-time-scale tariffs into one step reward
Real bills mix a real-time price (RTP), a fixed peak/valley schedule (TOU), and a demand charge billed on the worst 15-minute draw. Blend RTP and TOU with a weight α (which can itself be a policy output, i.e. meta-RL price-weighting, letting the agent decide which to trust). Cap risk with a penalty −β·max(0, RTP_t − 1.0) so the policy doesn't gamble on the price ceiling, and add the demand-charge term MD_t = max_{k∈[t−3,t]} |P_{grid,k}| so the single-step reward faithfully reflects two-part pricing. Crucially, add a causal unit test in the environment wrapper: feeding next-step RTP/TOU must raise an error, guaranteeing no future-information leak.
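One way to make the causal test enforceable is to put the guard inside the tariff accessor itself. The sketch below is an assumed design, not a specific environment API: `CausalTariff` is a hypothetical wrapper that refuses any read beyond the current step, and `demand_term` computes MD_t over the last four steps.

```python
class CausalTariff:
    """Holds RTP/TOU series and raises on any read ahead of the current step."""

    def __init__(self, rtp, tou):
        self._rtp = list(rtp)
        self._tou = list(tou)
        self.t = 0  # current step index, advanced by the environment

    def step(self):
        self.t += 1

    def price(self, k, alpha=0.5):
        """Blended price alpha * RTP_k + (1 - alpha) * TOU_k for k <= t only."""
        if k > self.t:
            raise LookupError(f"future price requested: step {k} > current {self.t}")
        return alpha * self._rtp[k] + (1 - alpha) * self._tou[k]

def price_risk_penalty(rtp_t, beta, cap=1.0):
    """-beta * max(0, RTP_t - cap): discourages betting on the price ceiling."""
    return -beta * max(0.0, rtp_t - cap)

def demand_term(p_grid, t, window=3):
    """MD_t = max over k in [t - window, t] of |P_grid,k|."""
    lo = max(0, t - window)
    return max(abs(p) for p in p_grid[lo:t + 1])
```

A unit test then only needs to request `price(t + 1)` and assert that `LookupError` is raised.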

6 · The arbitrage–depreciation tension — the napkin math

Intuition. Every extra percentage point of daily depth-of-discharge earns more peak/valley arbitrage today but spends a slice of the battery's finite ~6000-cycle life. Push the SoC window too wide and the wear cost (amortized replacement plus accelerating fade) eats the arbitrage you just earned. There is an interior optimum, and the widget below is exactly that trade-off as one-line economics: pick a daily depth-of-discharge and watch net value peak and then fall.

Battery dispatch: arbitrage revenue vs. depreciation per day

Wider daily depth-of-discharge (DoD) captures more peak/valley spread, but deep cycling shortens cycle life super-linearly and accelerates fade past 80% capacity. Net daily value = arbitrage − amortized wear. Find the DoD where the gain stops paying for the wear.

[Interactive widget outputs: cycle life, arbitrage/day, wear/day, net/day]
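The same trade-off can be reproduced offline with a few lines of arithmetic. Every number below except the 50 kWh unit size and the ~6000-cycle figure is a made-up assumption for illustration: a 0.10 per-kWh price spread, 90% round-trip efficiency, a 15,000 replacement cost, one cycle per day, and a cycle-life curve that falls with the square of DoD, anchored at 6000 cycles for 80% DoD.

```python
def cycle_life(dod, ref_cycles=6000.0, ref_dod=0.8, exponent=2.0):
    """Assumed power-law curve: deeper cycles buy fewer total cycles."""
    return ref_cycles * (ref_dod / dod) ** exponent

def net_value_per_day(dod, capacity_kwh=50.0, spread=0.10,
                      efficiency=0.9, replacement_cost=15000.0):
    """Return (net, arbitrage, wear) for one full cycle per day at this DoD."""
    arbitrage = capacity_kwh * dod * spread * efficiency
    wear = replacement_cost / cycle_life(dod)  # replacement amortized per cycle
    return arbitrage - wear, arbitrage, wear

def best_dod():
    """Grid-search DoD from 1% to 100% for the highest net daily value."""
    grid = [i / 100 for i in range(1, 101)]
    return max(grid, key=lambda d: net_value_per_day(d)[0])
```

Because arbitrage grows linearly in DoD while wear grows quadratically under these assumptions, the net value is a downward parabola: here it peaks near 58% DoD and is less than half as large at 100% DoD.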

[Figure: the five-step loop. 1 · Formulate (SoC, price, forecast state) → 2 · Diagnose (uncertain state, hard limits) → 3 · Engineer (CVAE state, mask, project, shape) → 4 · Guard (drift detect, LOLP, fallback) → 5 · Iterate (re-diagnose). Removing one difficulty exposes the next; re-run the loop.]

7 · Guard — partial observability, non-stationarity, and the distributed grid

Intuition. In production three things break that simulation hides: communication links drop, the seasons turn, and the grid is many sites that won't share their raw load curves. Each needs a detector and a fallback, not a one-off fix.

Engineering detail: POMDP for link loss. Model dropped communication explicitly. Keep a ring buffer of the last k=10 observations; on any frame lost to a drop or >100 ms latency, mark a missing-mask m_t=0 and form a fixed-width observation x_t = [o_local; o_remote ⊙ m_t; m_t]. A lightweight 200-particle filter updates a belief b(s_t) on the edge GPU; when m_t=0 it runs the predict step and skips the update, so belief variance grows naturally and feeds the risk-sensitive value. The policy conditions on a belief embedding z_t = Encoder(b_t) (2-layer GRU + MLP), trained with PPO-Clip plus an information-bottleneck term λH[b_t] that nudges the agent toward a conservative local mode during blackout. Tie a belief-entropy threshold to a minimum-risk fallback so the system reaches a safe state within the regulated window.
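A minimal sketch of the masked observation and the predict-only belief step, for a single scalar state. It assumes a random-walk process model and a Gaussian observation likelihood; the noise levels and function names are placeholders, and a lost frame is represented by passing `None`.

```python
import math
import random

def build_observation(o_local, o_remote, remote_dim):
    """x_t = [o_local; o_remote * m_t; m_t], fixed width even when the frame is lost."""
    received = o_remote is not None
    m = 1.0 if received else 0.0
    remote = [m * v for v in o_remote] if received else [0.0] * remote_dim
    return list(o_local) + remote + [m]

def pf_step(particles, observation, rng, process_std=0.02, obs_std=0.05):
    """Predict always; reweight and resample only when an observation arrived."""
    particles = [p + rng.gauss(0.0, process_std) for p in particles]
    if observation is None:
        return particles  # link down: belief spreads, no correction
    weights = [math.exp(-0.5 * ((observation - p) / obs_std) ** 2) for p in particles]
    if sum(weights) == 0.0:
        return particles  # observation too far from every particle to be usable
    return rng.choices(particles, weights=weights, k=len(particles))

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)
```

Running `pf_step` repeatedly with `observation=None` lets the particle spread widen step after step, which is the growing belief variance the risk-sensitive critic is meant to see.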

Seasonal non-stationarity — detect, retrain, forget nothing
Distribution shift across seasons silently rots a frozen policy. Detect it: each night compute the state-distribution PSI and reward KS, plus an online action-distribution Wasserstein-1; treat two consecutive days over (PSI>0.2, KS p<0.01, W1>0.15) as drift, second-confirmed by a CEP engine to reject holiday spikes. Retrain warm-started from the current policy, LR annealing 3e-4→1e-5, early-stopping at validation KL<0.01, gated to gray-launch only when offline AUC rises >1% and reward variance falls >5%, then ramped 5%→20%→100% over full business cycles with a 3σ-drop auto-rollback. To avoid catastrophic forgetting of old seasons, keep a Seasonal vs. Recent replay buffer at 6:4, add an EWC penalty on old-season samples, and use A-GEM projection: if the new gradient's angle to the old-season mean gradient exceeds 90°, project it back into the feasible cone so old-season value never decreases.
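Two of the pieces above are compact enough to sketch: the PSI drift score over binned state counts, and the A-GEM projection. The binning, the small floor `eps` that protects the logarithm from empty bins, and the function names are assumptions; the KS test, Wasserstein distance and EWC penalty are left out.

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between a reference and a current histogram."""
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        pe = max(e / e_total, eps)  # floor avoids log(0) on empty bins
        pa = max(a / a_total, eps)
        score += (pa - pe) * math.log(pa / pe)
    return score

def agem_project(g_new, g_old):
    """If g_new points against g_old (angle > 90 degrees), remove that component."""
    dot = sum(a * b for a, b in zip(g_new, g_old))
    if dot >= 0.0:
        return list(g_new)
    norm_sq = sum(b * b for b in g_old)
    return [a - (dot / norm_sq) * b for a, b in zip(g_new, g_old)]
```

For instance, a two-bin histogram moving from 50/50 to 80/20 scores a PSI of about 0.42, well over the 0.2 threshold, and projecting the gradient [1, −1] against an old-season gradient [0, 1] yields [1, 0], which no longer opposes it.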
The through-line
Every section is one row of the MDP table turned into a mechanism: uncertain state → CVAE distributional state + CVaR critic; partial observability → masked observation + particle-filter belief; hard action limits → mask → project → Lagrangian PPO; multi-clock, unmeasured reward → stress-aware depreciation + dynamic outage weight + causal test; non-stationary transition → drift detection + EWC/A-GEM; distributed grid → difference rewards under DP-protected gradients. You never reached for a tool until a row of the table demanded it.

8 · Distributed coordination & privacy — the multi-agent layer

Intuition. A region is many microgrids sharing transformers and a common social objective (minimize total cost + curtailment + line loss). Each must act on local observations in milliseconds, yet none will upload its raw load curve. So: align local incentives with the global objective, and share gradients, not data — privately.

Engineering detail — difference rewards. Give each agent a local reward that is provably aligned with the global one: its own cost, plus a difference reward measuring its marginal contribution, plus a zero-sum potential correction that fairly spreads line-overload penalties:

r_i^local = −C_i(P_i) + D_i + F_i ,   D_i = G(s,a) − G(s, a_{−i}, a_i^default) ,   Σ_i F_i = 0

Train a centralized critic Q̂(s,a) in the dispatch cloud, sync its parameters to edge gateways every ~5 min, and execute fully locally. Potential-game theory shows that if G is an exact potential, the joint action that maximizes G (the social optimum) is a Nash equilibrium, although other, worse equilibria can also exist; in practice tuning λ keeps the approximation error <1%. When comms break for >500 ms the agent falls back to a purely local reward (own cost + overload penalty) and back-fills the difference term with a sliding window when the link returns, so training doesn't diverge.
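The difference term D_i is just two evaluations of the global objective. The sketch below uses a toy G (quadratic site costs plus a shared line-overload penalty) that is purely an assumption for illustration, with a default action of zero power standing in for a_i^default.

```python
def global_objective(powers, line_limit=100.0, overload_price=2.0):
    """Toy G(s, a): negative of total quadratic cost plus line-overload penalty."""
    cost = sum(0.01 * p * p for p in powers)
    overload = max(0.0, sum(powers) - line_limit)
    return -(cost + overload_price * overload)

def difference_reward(powers, i, default_power=0.0, g=global_objective):
    """D_i = G(s, a) - G(s, a_-i, a_i_default): agent i's marginal contribution."""
    counterfactual = list(powers)
    counterfactual[i] = default_power
    return g(powers) - g(counterfactual)
```

With two sites drawing 60 and 50 against a 100-unit line limit, G = −81; removing site 0 gives G = −25, so D_0 = −56, while D_1 = −45. The larger consumer carries the larger share of the blame for the overload.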

Engineering detail: privacy-preserving gradients. Three escalating levels, trading privacy strength against communication cost. Level 1, local differential privacy: clip the policy gradient to sensitivity C = median(‖g_i‖₂), add Gaussian noise calibrated by a moments accountant (per-round ε=0.1, cumulative ε≤10 over 1000 rounds), and FedAvg the noisy gradients; load data never leaves the site, and convergence slows ~18% but LR warm-up + momentum recovers most of it. Level 2: Top-K=1% sparsification + random rotation + SecAgg, so per-site noise is amortized across the n participants as σ₁/√n, giving the same ε with two orders of magnitude less traffic and <200 ms aggregation. Level 3: distill a 2-layer student policy (5% of the parameters) whose loss contains neither reward nor state, and upload only its KL-gradient under CKKS homomorphic encryption with Laplace output noise; membership-inference success drops below 1%. Log every ε spent to a privacy-budget ledger for audit.
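Level 1 reduces to three operations: clip, add noise, average. The sketch below shows only that mechanism. The `noise_multiplier` is a placeholder that a moments accountant would have to set to meet a target ε (no accounting is done here), and the function names are illustrative.

```python
import math
import random

def l2_norm(g):
    return math.sqrt(sum(x * x for x in g))

def clip_gradient(g, c):
    """Scale g down so its L2 norm is at most c (the sensitivity bound)."""
    norm = l2_norm(g)
    factor = min(1.0, c / norm) if norm > 0 else 1.0
    return [x * factor for x in g]

def privatize(g, c, noise_multiplier, rng):
    """Run on each site: clip locally, then add Gaussian noise with std noise_multiplier * c."""
    clipped = clip_gradient(g, c)
    return [x + rng.gauss(0.0, noise_multiplier * c) for x in clipped]

def fed_avg(gradients):
    """Run on the server: coordinate-wise mean of the already-noisy gradients."""
    n = len(gradients)
    return [sum(col) / n for col in zip(*gradients)]
```

Each site calls `privatize` before anything leaves the building; the server only ever sees the output of that call and averages it with `fed_avg`.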

Communication delay degrades consensus convergence — quantify it
Stale parameters across the consensus ring inject variance just like a stale opponent does in self-play. Mitigate explicitly: delay-weighted Polyak averaging θ_sync = Σ_i w_i·θ_i·e^{−λτ_i}; a delayed importance-sampling ratio ρ_τ = min(ρ, 1 + ε·e^{−cτ}) in the PPO clip; and a delay-aware LR schedule α_t = α₀/(1 + 0.1·τ_avg). Under compliance rules that keep sensitive data in-region, cross-domain latency is often ≥120 ms, so trade the DP noise σ_DP against delay noise σ_τ: the convergence bound scales like O(√(σ_DP² + σ_τ²)), a Pareto choice between privacy budget ε≤1 and delay τ≤150 ms.
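The three delay mitigations can be sketched as small pure functions. The decay constants, the delay units, and the renormalization of the weights (added so the result stays a weighted average of the site parameters) are assumptions for illustration, not values from a deployed system.

```python
import math

def delay_weighted_average(thetas, weights, delays, decay=0.01):
    """Average parameter vectors, discounting each site by exp(-decay * delay)."""
    eff = [w * math.exp(-decay * tau) for w, tau in zip(weights, delays)]
    total = sum(eff)  # assumed renormalization so the weights sum to one
    dim = len(thetas[0])
    return [sum(e * th[j] for e, th in zip(eff, thetas)) / total
            for j in range(dim)]

def delayed_ratio(rho, delay, eps=0.2, c=0.01):
    """rho_tau = min(rho, 1 + eps * exp(-c * delay)): staler data, tighter upper clip."""
    return min(rho, 1.0 + eps * math.exp(-c * delay))

def delay_aware_lr(alpha0, tau_avg):
    """alpha_t = alpha0 / (1 + 0.1 * tau_avg)."""
    return alpha0 / (1.0 + 0.1 * tau_avg)
```

With zero delay everywhere these collapse to a plain weighted average, the usual PPO upper clip of 1 + ε, and the base learning rate, which makes the fresh-data case a convenient regression test.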

Further considerations