Energy-management microgrid

A microgrid energy-management agent is a dispatch controller that decides, at every control interval, how much to charge or discharge the battery, which loads to shed, and how much to draw from the grid, so as to minimize the electricity bill without ever violating a physical limit. Four things make this MDP hard: the state is a forecast, not a fact (weather and price are uncertain); the action lives under hard safety constraints that a stochastic policy must never break; the reward spans multiple time-scales and includes a depreciation cost you cannot directly measure; and the environment is openly non-stationary (seasons) and distributed across sites that will not share raw load data. Each of those difficulties has a named mechanism that answers it.

The method — five steps, every lesson

Applied RL is not a grab-bag of tricks. Every domain in this track is the same loop: (1) Formulate the MDP — state, action, reward, transition, horizon. (2) Diagnose the one property that makes the MDP hard. (3) Engineer the mechanism that removes that difficulty. (4) Guard it in production — detect when it breaks and fall back. (5) Iterate. The art is only in which difficulty binds first. For a microgrid, uncertainty and hard constraints bind before anything else.

1 · Formulate — the MDP behind a microgrid controller

Intuition. A microgrid agent looks at the weather forecast, the battery's state of charge, the current electricity price, and the building's load; it decides how to move power between battery, grid, and loads; it pays the electricity bill at the end of each interval. That is already a Markov Decision Process. Every difficulty below is one of these four pieces being awkward in a real grid.

MDP = (S, A, R, P, γ) · maximize E[ Σₜ γᵗ rₜ ] · Δt ≈ 1–15 min, horizon = one day

Piece	For a battery+load microgrid	The awkward part
State S	SoC, load, PV/wind output, price, weather forecast	the forecast is a distribution, not a value, and links can drop → uncertain & partially observable
Action A	battery power P (continuous) × which loads / chillers to switch (integer)	hybrid, and most actions would violate a hard physical limit
Reward R	energy cost saved − battery wear − outage penalty	spans real-time + peak/valley + demand tiers, and wear is unmeasured
Transition P	weather, market price, battery electrochemistry, neighbours	seasonal non-stationarity and a multi-agent, privacy-bound grid

The same pattern as every lesson: the MDP table writes itself, and the rightmost column — the awkward part — is the lesson. Each row's difficulty has a named mechanism that answers it.

2 · State is a forecast — carry the distribution, not the point

Intuition. If you feed the policy a single predicted wind speed and treat it as truth, the agent learns to bet the farm on a number that is wrong half the time. A weather forecast is honestly a probability distribution; the controller should see the spread, not just the mean, so it can hold back power when the forecast is uncertain and lean in when it is confident.

Engineering detail. Offline, train a conditional variational auto-encoder (CVAE) on years of met-tower, satellite, and numerical-weather-ensemble data: the encoder ingests the current 500 hPa geopotential field, surface pressure and humidity; the decoder emits the joint distribution parameters — mean, variance, correlation — of local wind speed, direction and temperature. A latent dimension of just 32 keeps the model under the grid-edge GPU's 2 GB budget. Online, append those distribution parameters to the deterministic state to form an augmented state:

s_t⁺ = [ s_t^det, μ_t, σ_t, ρ_t ]

On the forward pass, use the reparameterization trick to draw K = 5 weather realizations {s_t,k}, run all K through a shared-weight actor, and feed the statistics of the distribution (mean + 95% CVaR) into the critic, which outputs a risk-sensitive value V_φ(s_t⁺). The training objective is standard PPO plus a risk term:

ℒ = ℒ_PPO + β · CVaR_0.95(−V_φ)

β is tuned on site so that under a ±5 m/s wind step the power ramp still respects the grid code's 1.5 MW/min limit. Against a baseline that treats the forecast as truth, this distributional state has been reported to lift annual energy yield ~1.8% and cut dispatch-penalty fines ~42%. When only a deterministic forecast is available, fit the historical residuals with deep kernel density estimation and add that residual distribution online — the "distributional state" requirement still holds.
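The sampling and risk statistics can be sketched in a few lines. This is a minimal illustration, assuming a scalar wind-speed forecast (mean mu, standard deviation sigma); `toy_value` is a hypothetical stand-in for a per-realization value, not a trained critic, and all function names are illustrative. With K = 5 samples, the 95% CVaR of the loss reduces to the single worst realization.

```python
import math
import random

def sample_realizations(mu, sigma, k=5, rng=None):
    """Reparameterization trick: x = mu + sigma * eps with eps ~ N(0, 1)."""
    rng = rng or random.Random(0)
    return [mu + sigma * rng.gauss(0.0, 1.0) for _ in range(k)]

def cvar(losses, alpha=0.95):
    """Mean of the worst (1 - alpha) share of losses, using at least one sample."""
    ordered = sorted(losses, reverse=True)  # largest loss first
    n_tail = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    return sum(ordered[:n_tail]) / n_tail

def toy_value(wind_speed):
    """Hypothetical value of one realization (capped cubic power curve)."""
    clipped = max(0.0, min(wind_speed, 12.0))
    return clipped ** 3 / 1000.0

def risk_summary(mu, sigma, k=5, alpha=0.95, rng=None):
    """Return (mean value, CVaR of the loss -V) over K sampled realizations."""
    values = [toy_value(w) for w in sample_realizations(mu, sigma, k, rng)]
    mean_value = sum(values) / len(values)
    tail_loss = cvar([-v for v in values], alpha)
    return mean_value, tail_loss

def augmented_state(s_det, mu, sigma, rho):
    """s_plus = [s_det, mu, sigma, rho], matching the augmented-state formula."""
    return list(s_det) + [mu, sigma, rho]
```

In a real pipeline the two summary numbers would feed the critic alongside the augmented state; here they are returned directly so the arithmetic is easy to inspect.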

3 · Standardize state & action without breaking the constraints

Intuition. Neural networks want inputs near unit scale, but a microgrid mixes volts, amps, kWh and °C — wildly different magnitudes. Naive normalization is dangerous here: if the agent reasons in normalized space, a small numeric slip can map back to a current that trips a breaker. So normalization must be invertible, auditable, and constraint-aware.

Engineering detail — three moves. (1) Robust, frozen scaling. From ≥10⁶ logged transitions, compute a per-dimension robust scale that ignores spikes, then freeze it:

scale = median(|x − median(x)|) / 0.6745

Bake offset and scale into the model's constant nodes so inference never recomputes them — the DCS can audit them as static constants. (2) A boundary-distance feature. For any hard-constrained variable, don't just hand the network the normalized value; also hand it the log-distance to the legal limit, so the gradient is both stable and aware of which direction is dangerous:

s_norm = [ (x − offset)/scale , log(ε + c − x) ] , ε = 1e-3

where c is the regulatory ceiling. (3) De-normalize then project, in physical units. The policy emits mean and std; convert to raw units with the same offset/scale, then clip in physical space — so the clip can guarantee no violation while log_prob stays differentiable for SAC/PPO:

a_raw = mean · action_scale + action_offset , a_proj = clip(a_raw, a_min, a_max)
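The three moves can be sketched as plain functions. This is an illustrative sketch, not a production scaler: the function names are hypothetical, the offset is assumed to be the median, and the margin is clamped at zero (an added assumption) so the logarithm stays defined if a measurement ever sits above the ceiling.

```python
import math
import statistics

def robust_scale(samples):
    """Return (offset, scale) with scale = MAD / 0.6745, which ignores spikes."""
    offset = statistics.median(samples)
    mad = statistics.median(abs(x - offset) for x in samples)
    return offset, mad / 0.6745

def normalize_with_margin(x, offset, scale, ceiling, eps=1e-3):
    """Return [(x - offset) / scale, log(eps + ceiling - x)]."""
    margin = max(ceiling - x, 0.0)  # assumption: clamp so the log is defined
    return [(x - offset) / scale, math.log(eps + margin)]

def denormalize_and_clip(mean, action_scale, action_offset, a_min, a_max):
    """De-normalize first, then clip in physical units."""
    a_raw = mean * action_scale + action_offset
    return min(max(a_raw, a_min), a_max)
```

Computing `robust_scale` once on logged data and storing the two numbers as constants mirrors the "freeze it" step; the clip operates on the physical value, so its bounds can be read directly against the equipment datasheet.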

Why the clip must live in physical units

If you clip in normalized space you are clipping a number, not a current — rounding and scale drift can still hand the actuator an out-of-range value. Clipping in raw units after de-normalization makes the safety guarantee exact, independent of the network's numeric state. When a limit is itself dynamic (e.g. allowed current drops with temperature), feed c as an extra state input so the policy conditions on it — but keep c slow-varying (seconds) versus inference (milliseconds), or you reintroduce non-stationarity.

4 · The action under hard constraints — a three-layer safety chain

Intuition. The battery has physical red lines: never over-discharge, never exceed peak power, keep efficiency above a floor. A learning policy is stochastic — given enough samples it will eventually pick a forbidden action unless something structurally forbids it. The principle is defense in depth: make illegal actions impossible to sample, project anything that slips through, and penalize residual risk during learning.

Engineering detail. First, reparameterize the action: instead of raw power P, output a normalized efficiency coefficient k∈[0,1] mapped through a differentiable lookup to physical power P = f⁻¹(k; SoC), so the policy gradient has real magnitude only in the high-efficiency band. Then the three layers:

1. mask → 2. project → 3. constrained-RL penalty

Layer 1: action masking. Before the softmax (discrete loads) or before sampling (continuous power), set the probability of any SoC-illegal action to zero. For integer load switching, use a masked Gumbel-Softmax so an infeasible combination (e.g. starting a 4th chiller when N+1 redundancy allows 3) can never be drawn. The EMS pushes the feasibility mask down live and illegal logits are set to −∞, so discrete limit violations are zero by construction.
Layer 2: safety projection. If model error defeats the mask, a convex projection layer maps the action into the feasible set using an analytic KKT / single-step QP solution: ~3 µs per step, <2% CPU at a 1 kHz control rate, O(1) forward cost and a continuous gradient.
Layer 3: Lagrangian PPO. Add a cost term outside the reward with budget δ; with the cost counted as over-discharge events per operating hour, setting δ = 0.001 means "≤1 over-discharge per 1000 operating hours." After ~5M steps the Lagrange multiplier converges and empirically the policy hits zero violations.
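The three layers can be sketched in miniature. This is a simplified illustration under stated assumptions: a lossless battery, positive power meaning discharge, an SoC window of [0.1, 0.9], and a one-dimensional action so the QP projection collapses to an interval clip. The function names and the dual learning rate are hypothetical.

```python
import math

def masked_softmax(logits, feasible):
    """Layer 1: infeasible logits become -inf, so their probability is exactly 0."""
    masked = [l if ok else -math.inf for l, ok in zip(logits, feasible)]
    top = max(masked)
    if top == -math.inf:
        raise ValueError("no feasible action")
    exps = [math.exp(l - top) for l in masked]
    total = sum(exps)
    return [e / total for e in exps]

def project_power(p_kw, soc, capacity_kwh, dt_h, p_max_kw, soc_min=0.1, soc_max=0.9):
    """Layer 2: closest power that keeps the next SoC inside [soc_min, soc_max]."""
    hi = min(p_max_kw, (soc - soc_min) * capacity_kwh / dt_h)    # discharge limit
    lo = max(-p_max_kw, -(soc_max - soc) * capacity_kwh / dt_h)  # charge limit
    return min(max(p_kw, lo), hi)

def update_multiplier(lam, mean_cost, budget=0.001, lr=0.05):
    """Layer 3: dual ascent on the Lagrange multiplier, kept non-negative."""
    return max(0.0, lam + lr * (mean_cost - budget))
```

The multiplier grows only while the measured cost exceeds the budget, which is what lets the penalty fade once the policy stops violating.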

State estimation closes the loop: a domestically developed extended Kalman filter (EKF) library keeps the SoC estimation error to ~1.8%, inside the 5% margin the standard requires. In one reported deployment, a 50 kWh unit ran six months with no BMS protection trip and <1% capacity fade.

Hybrid action: hierarchical nesting + mixed PPO

Integer load-switching and continuous valve/power setpoints are different animals, so split them. An upper Manager observes load forecast, price and wet-bulb temperature and emits an integer switch vector a_d∈{0,1}ⁿ via masked Gumbel-Softmax; a lower Worker, conditioned on the binary vector a_d, emits continuous setpoints a_c via a Tanh-Normal. Train both with one mixed-PPO ratio: straight-through Gumbel gradients for the discrete part, Gaussian log-prob for the continuous part, a shared value net, and GAE(λ=0.95). To suppress chattering, add a switch penalty −0.5·|a_d,t−a_d,t−1| and a rate penalty −0.01·‖a_c,t−a_c,t−1‖². A minimum on/off interval (e.g. ≥300 s between chiller starts) is handled by exposing a "remaining lock-out time" in the state, so the agent learns zero-shot compliance rather than relying on post-hoc rule patches. At hundreds of devices the 2ⁿ discrete space explodes; embed the device topology with a GNN and use a pointer network to select one device per step, which takes the per-step choice from exponential to linear in n.
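The anti-chattering terms and the lock-out feature are simple enough to write out. This sketch uses the two penalty weights quoted above; the function names, and the choice to reset the timer to the full interval on every start, are illustrative assumptions.

```python
def chatter_penalty(a_d, a_d_prev, a_c, a_c_prev, w_switch=0.5, w_rate=0.01):
    """Switch penalty on the binary vector plus rate penalty on the setpoints."""
    switches = sum(abs(x - y) for x, y in zip(a_d, a_d_prev))
    rate_sq = sum((x - y) ** 2 for x, y in zip(a_c, a_c_prev))
    return -w_switch * switches - w_rate * rate_sq

def update_lockout(remaining_s, started, dt_s, min_interval_s=300.0):
    """Per-device 'remaining lock-out time' feature that is exposed in the state."""
    updated = []
    for remaining, did_start in zip(remaining_s, started):
        if did_start:
            updated.append(min_interval_s)            # timer restarts on a start
        else:
            updated.append(max(0.0, remaining - dt_s))
    return updated
```

Because the timer is part of the observation, the policy can see how long a device remains locked rather than discovering the rule through rejected actions.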

5 · Reward — bill, depreciation, and outage, on three clocks

Intuition. The obvious reward is the energy-cost saving, but cycling the battery hard to chase arbitrage silently ages it — a cost you pay months later and never see in the per-step bill. And shedding a critical load to save money can black out a hospital ward, a cost no price can capture. So the reward is three terms: revenue, a wear charge, and an outage penalty whose weight changes with how critical the load is right now.

r_t = r_revenue + r_depreciate − λ(t) · P_penalty , r_depreciate ≤ 0

Engineering detail — making wear visible and bounded. Derive a stress factor stress_t from temperature, depth-of-discharge and SoC window, concatenate it into the SAC critic's input, and let r_depreciate charge for it — but keep |r_depreciate| ≲ 0.01 when r_revenue∈[−0.1, 0.1] so depreciation shades rather than overshadows the main signal. The policy then learns to back off power or shift its SoC window under high stress. Two refinements from the field: once capacity drops below 80% fade accelerates, so add a feedback term stress_cap=1/(1−0.8·k_cap) to make the agent conservative early; and when a hard standard says "temperature > 55 °C or SoC < 5% → forced shutdown," set the corresponding stress to +∞ so the constraint is satisfied through the reward itself, with no engineering patch.

Engineering detail — the dynamic outage weight. The outage penalty weight is not constant; it scales with battery reserve, price pressure and weather alerts:

λ(t) = 1.0 · (1 + 0.5·(1−SoC)) · (1 + 0.3·Price_ratio) · (1 + 0.2·Weather_idx)

where Price_ratio is the current price over the trailing-30-day max. Top-tier (T0) critical loads additionally get a −∞ mask on any shed action, so the policy physically cannot cut them. Before go-live, run a counterfactual simulation: hold λ constant, check the policy doesn't collapse into "shed everything" or "serve everything," then ramp λ back to dynamic and verify the loss-of-load probability (LOLP) stays under the utility's ≤0.1% KPI.
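The dynamic weight and the three-term reward translate directly into code. A minimal sketch: the coefficients come from the formula above, while the function names and the tier labels ("T0", "T2") are illustrative, and `r_depreciate` is assumed to be passed in as a non-positive wear charge.

```python
import math

def outage_weight(soc, price_ratio, weather_idx, base=1.0):
    """lambda(t) = base * (1 + 0.5*(1 - SoC)) * (1 + 0.3*Price_ratio) * (1 + 0.2*Weather_idx)."""
    return base * (1 + 0.5 * (1 - soc)) * (1 + 0.3 * price_ratio) * (1 + 0.2 * weather_idx)

def step_reward(r_revenue, r_depreciate, p_penalty, soc, price_ratio, weather_idx):
    """r_t = r_revenue + r_depreciate - lambda(t) * P_penalty."""
    lam = outage_weight(soc, price_ratio, weather_idx)
    return r_revenue + r_depreciate - lam * p_penalty

def mask_critical_sheds(shed_logits, tiers):
    """Set the shed logit of every T0 load to -inf so it can never be selected."""
    return [-math.inf if tier == "T0" else logit
            for logit, tier in zip(shed_logits, tiers)]
```

With a full battery, a low price and calm weather the weight is 1.0; at empty SoC, a price at the 30-day maximum and a maximal weather index it rises to about 2.34.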

Unifying multi-time-scale tariffs into one step reward

Real bills mix a real-time price (RTP), a fixed peak/valley schedule (TOU), and a demand charge billed on the worst 15-minute draw. Blend RTP and TOU with a weight α (which can itself be a policy output — meta-RL price-weighting — letting the agent decide which to trust). Cap risk with a penalty −β·max(0, RTP_t−1.0) so the policy doesn't gamble on the price ceiling, and add the demand-charge term MD_t=max_{k∈[t−3,t]}|P_grid,k| so the single-step reward faithfully reflects two-part pricing. Crucially, add a causal unit test in the environment wrapper: feeding next-step RTP/TOU must raise an error, guaranteeing no future-information leak.
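A compact sketch of the blended tariff reward and the causal guard follows. The class and function names are hypothetical, and `demand_rate` (how strongly the rolling maximum demand is charged per step) is an assumed coefficient that the text does not specify; the lookback of 3 steps matches the window [t−3, t].

```python
class CausalPriceFeed:
    """Serves prices up to the current step only; reading ahead raises."""

    def __init__(self, prices):
        self._prices = list(prices)
        self.t = 0

    def advance(self):
        self.t += 1

    def price(self, k):
        if k > self.t:
            raise LookupError(f"future price requested: step {k} > current step {self.t}")
        return self._prices[k]

def blended_price(rtp, tou, alpha):
    """Weight alpha on the real-time price, 1 - alpha on the TOU schedule."""
    return alpha * rtp + (1.0 - alpha) * tou

def demand_term(p_grid_history, t, lookback=3):
    """MD_t = max over k in [t - lookback, t] of |P_grid,k|."""
    lo = max(0, t - lookback)
    return max(abs(p) for p in p_grid_history[lo:t + 1])

def tariff_step_reward(p_grid_kw, rtp, tou, md_kw, alpha=0.5, beta=1.0,
                       dt_h=0.25, demand_rate=0.0):
    """Negative energy cost, minus price-spike penalty, minus demand charge."""
    energy_cost = blended_price(rtp, tou, alpha) * p_grid_kw * dt_h
    spike_penalty = beta * max(0.0, rtp - 1.0)
    return -energy_cost - spike_penalty - demand_rate * md_kw
```

A unit test in the environment wrapper can then simply request step t+1 from the feed and assert that the lookup fails.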

6 · The arbitrage–depreciation tension — the napkin math

Intuition. Every extra increment of daily depth-of-discharge earns more peak/valley arbitrage today but spends a slice of the battery's finite ~6000-cycle life. Push the SoC window too wide and the wear cost (amortized replacement plus accelerating fade) eats the arbitrage you just earned. There is an interior optimum, and the trade-off is one-line economics: as the daily depth-of-discharge grows, net value rises, peaks, and then falls.
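The napkin math can be written as a toy model. Everything numeric here except the ~6000-cycle life and the 50 kWh pack size is an assumption chosen for illustration: a 0.10 per-kWh price spread, 90% round-trip efficiency, a 15000 replacement cost, one cycle per day, and a wear cost that grows with the square of depth-of-discharge relative to a 0.8 reference.

```python
def daily_net_value(dod, capacity_kwh=50.0, spread=0.10, efficiency=0.9,
                    replacement_cost=15000.0, ref_cycles=6000.0,
                    ref_dod=0.8, wear_exponent=2.0):
    """Daily arbitrage revenue minus an assumed convex wear cost."""
    arbitrage = dod * capacity_kwh * spread * efficiency
    # Assumed wear model: one cycle at ref_dod costs replacement_cost / ref_cycles,
    # and deeper cycles cost disproportionately more.
    wear = (replacement_cost / ref_cycles) * (dod / ref_dod) ** wear_exponent
    return arbitrage - wear

def best_dod(step=0.01):
    """Grid-search the depth-of-discharge in (0, 1] that maximizes net value."""
    grid = [round(i * step, 4) for i in range(1, int(round(1.0 / step)) + 1)]
    return max(grid, key=daily_net_value)
```

Under these assumptions revenue is linear in depth-of-discharge while wear is quadratic, so the optimum sits strictly inside the window, a little below 0.6, and a full 100% cycle earns less than half of the peak net value.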

7 · Guard — partial observability, non-stationarity, and the distributed grid

Intuition. In production three things break that simulation hides: communication links drop, the seasons turn, and the grid is many sites that won't share their raw load curves. Each needs a detector and a fallback, not a one-off fix.

Engineering detail — POMDP for link loss. Model dropped communication explicitly. Keep a ring buffer of the last k=10 observations; on any frame lost to a drop or >100 ms latency, mark a missing-mask m_t=0 and form a fixed-width observation x_t=[o_local; o_remote⊙m_t; m_t]. A lightweight 200-particle filter updates a belief b(s_t) on the edge GPU; when m_t=0 it runs the predict step and skips the update, so belief variance grows naturally and feeds the risk-sensitive value. The policy conditions on a belief embedding z_t=Encoder(b_t) (2-layer GRU + MLP), trained with PPO-Clip plus an information-bottleneck term λH[b_t] that nudges the agent toward a conservative local mode during blackout. Tie a belief-entropy threshold to a minimum-risk fallback so the system reaches a safe state within the regulated window.
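The masked observation and the predict-only belief update can be illustrated with a toy one-dimensional filter. This is a sketch under assumptions: a scalar random-walk state, Gaussian measurement noise, one mask bit per frame, and multinomial resampling; the function names and noise levels are hypothetical.

```python
import math
import random
from collections import deque

def make_ring_buffer(k=10):
    """Ring buffer holding the last k observations."""
    return deque(maxlen=k)

def masked_observation(o_local, o_remote, received):
    """x_t = [o_local; o_remote * m_t; m_t] with m_t = 0 when the frame was lost."""
    m = 1.0 if received else 0.0
    return list(o_local) + [m * x for x in o_remote] + [m]

def pf_step(particles, measurement, rng, process_std=0.5, meas_std=1.0):
    """One particle-filter step; measurement=None means the link is down."""
    predicted = [p + rng.gauss(0.0, process_std) for p in particles]  # predict
    if measurement is None:
        return predicted  # skip the update, so the belief spread keeps growing
    weights = [math.exp(-0.5 * ((measurement - p) / meas_std) ** 2) for p in predicted]
    if sum(weights) == 0.0:
        return predicted
    return rng.choices(predicted, weights=weights, k=len(predicted))  # resample
```

Running 200 particles for a few steps with `measurement=None` shows the variance widening, which is the signal the risk-sensitive value and the entropy-threshold fallback are meant to react to.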

Seasonal non-stationarity — detect, retrain, forget nothing

Distribution shift across seasons silently rots a frozen policy. Detect it: each night compute the state-distribution PSI and reward KS, plus an online action-distribution Wasserstein-1; treat two consecutive days over (PSI>0.2, KS p<0.01, W1>0.15) as drift, second-confirmed by a CEP engine to reject holiday spikes. Retrain warm-started from the current policy, LR annealing 3e-4→1e-5, early-stopping at validation KL<0.01, gated to gray-launch only when offline AUC rises >1% and reward variance falls >5%, then ramped 5%→20%→100% over full business cycles with a 3σ-drop auto-rollback. To avoid catastrophic forgetting of old seasons, keep a Seasonal vs. Recent replay buffer at 6:4, add an EWC penalty on old-season samples, and use A-GEM projection: if the new gradient's angle to the old-season mean gradient exceeds 90°, project it back into the feasible cone so old-season value never decreases.
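Two pieces of this pipeline are small enough to sketch: the PSI drift score with the two-consecutive-day rule, and the A-GEM gradient projection. The sketch assumes pre-binned histogram counts, reads the threshold triple as "all three must trip" (the text does not say whether it is AND or OR), and uses hypothetical function names.

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two histograms over the same bins."""
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        p = max(e / e_total, eps)  # floor avoids log(0) on empty bins
        q = max(a / a_total, eps)
        score += (q - p) * math.log(q / p)
    return score

def is_drift_day(psi_value, ks_p, w1):
    """Assumption: a day is flagged only when all three thresholds are exceeded."""
    return psi_value > 0.2 and ks_p < 0.01 and w1 > 0.15

def drift_confirmed(daily_flags, days=2):
    """Drift needs the most recent `days` daily flags to all be set."""
    return len(daily_flags) >= days and all(daily_flags[-days:])

def agem_project(g_new, g_old):
    """A-GEM: if g_new opposes g_old (angle > 90 degrees), remove that component."""
    dot = sum(a * b for a, b in zip(g_new, g_old))
    if dot >= 0.0:
        return list(g_new)
    old_sq = sum(b * b for b in g_old)
    return [a - (dot / old_sq) * b for a, b in zip(g_new, g_old)]
```

After projection the inner product with the old-season gradient is zero, so to first order the update no longer increases the old-season loss.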

The through-line

Every section is one row of the MDP table turned into a mechanism: uncertain state → CVAE distributional state + CVaR critic; partial observability → masked observation + particle-filter belief; hard action limits → mask → project → Lagrangian PPO; multi-clock, unmeasured reward → stress-aware depreciation + dynamic outage weight + causal test; non-stationary transition → drift detection + EWC/A-GEM; distributed grid → difference rewards under DP-protected gradients. You never reached for a tool until a row of the table demanded it.

8 · Distributed coordination & privacy — the multi-agent layer

Intuition. A region is many microgrids sharing transformers and a common social objective (minimize total cost + curtailment + line loss). Each must act on local observations in milliseconds, yet none will upload its raw load curve. So: align local incentives with the global objective, and share gradients, not data — privately.

Engineering detail — difference rewards. Give each agent a local reward that is provably aligned with the global one: its own cost, plus a difference reward measuring its marginal contribution, plus a zero-sum potential correction that fairly spreads line-overload penalties:

r_i^local = −C_i(P_i) + D_i + F_i , D_i = G(s,a) − G(s, a_−i, a_i^default) , Σ_i F_i = 0

Train a centralized critic Q̂(s,a) in the dispatch cloud, sync its parameters to edge gateways every ~5 min, and execute fully locally. Potential-game theory shows that if G is an exact potential, every maximizer of G is a Nash equilibrium, so the social optimum is itself an equilibrium that no agent gains by leaving (other, worse equilibria can still exist); in practice tuning λ keeps the approximation error <1%. When comms break for >500 ms the agent falls back to a purely local reward (own cost + overload penalty) and back-fills the difference term with a sliding window when the link returns, so training doesn't diverge.
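The difference-reward formula can be made concrete with a toy global objective. In this sketch G is a hypothetical stand-in (negative quadratic generation cost minus a shared line-overload penalty), the default action is "inject nothing", and the zero-sum correction simply recenters raw shares so they sum to zero; none of these choices come from the text.

```python
def global_objective(powers, line_limit_kw, overload_price=1.0):
    """Hypothetical G: negative total cost minus a shared overload penalty."""
    cost = sum(0.01 * p * p for p in powers)
    overload = max(0.0, sum(powers) - line_limit_kw)
    return -cost - overload_price * overload

def difference_reward(powers, i, default_power, line_limit_kw):
    """D_i = G(s, a) - G(s, a_-i, a_i^default)."""
    counterfactual = list(powers)
    counterfactual[i] = default_power
    return (global_objective(powers, line_limit_kw)
            - global_objective(counterfactual, line_limit_kw))

def zero_sum_correction(raw_shares):
    """Recenter raw shares so that the corrections F_i sum to zero."""
    mean = sum(raw_shares) / len(raw_shares)
    return [r - mean for r in raw_shares]

def local_reward(cost_i, d_i, f_i):
    """r_i^local = -C_i(P_i) + D_i + F_i."""
    return -cost_i + d_i + f_i
```

For three sites injecting 10, 20 and 30 kW into a 50 kW line, replacing the third site's action with the default removes both its own cost and the whole overload, so its difference reward is strongly negative.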

Engineering detail: privacy-preserving gradients. Three escalating levels, trading privacy strength against communication cost.
Level 1: local differential privacy. Clip the policy gradient to sensitivity C=median(‖g_i‖₂), add Gaussian noise calibrated by a moments accountant (per-round ε=0.1, cumulative ε≤10 over 1000 rounds), and FedAvg the noisy gradients. Load data never leaves the site; convergence slows ~18%, but LR warm-up + momentum recovers most of it.
Level 2: Top-K=1% sparsification + random rotation + SecAgg, so per-site noise is amortized across the n participants as σ₁/√n: same ε, two orders of magnitude less traffic, <200 ms aggregation.
Level 3: distill a 2-layer student policy (5% of the parameters) whose loss contains neither reward nor state, and upload only its KL-gradient under CKKS homomorphic encryption with Laplace output noise; membership-inference success drops below 1%.
Log every ε spent to a privacy-budget ledger for audit.
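Level 1 can be sketched as clip, add noise, average. This is an illustration only: `noise_multiplier` is an assumed hyperparameter, and translating it into a per-round ε requires a privacy accountant that is not shown here. Gradients are plain Python lists and the function names are hypothetical.

```python
import math
import random
import statistics

def l2_norm(g):
    return math.sqrt(sum(x * x for x in g))

def median_clip_norm(grads):
    """C = median of the per-site gradient L2 norms."""
    return statistics.median(l2_norm(g) for g in grads)

def clip_gradient(g, clip_norm):
    """Scale g down so its L2 norm is at most clip_norm."""
    norm = l2_norm(g)
    factor = min(1.0, clip_norm / norm) if norm > 0.0 else 1.0
    return [x * factor for x in g]

def privatize(g, clip_norm, noise_multiplier, rng):
    """Clip, then add Gaussian noise with std = noise_multiplier * clip_norm."""
    sigma = noise_multiplier * clip_norm
    return [x + rng.gauss(0.0, sigma) for x in clip_gradient(g, clip_norm)]

def fed_avg(grads):
    """Coordinate-wise mean of the (noisy) site gradients."""
    n = len(grads)
    return [sum(column) / n for column in zip(*grads)]
```

Each site would call `privatize` locally and upload only the result; the aggregator sees noisy, norm-bounded vectors and never the load data behind them.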

Communication delay degrades consensus convergence — quantify it

Stale parameters across the consensus ring inject variance just like a stale opponent does in self-play. Mitigate explicitly: delay-weighted Polyak averaging θ_sync=Σ_i w_i·θ_i·e^(−λτ_i); a delayed importance-sampling ratio ρ_τ=min(ρ, 1+ε·e^(−cτ)) in the PPO clip; and a delay-aware LR schedule α_t=α₀/(1+0.1·τ_avg). Under compliance rules that keep sensitive data in-region, cross-domain latency is often ≥120 ms, so trade the DP noise σ_DP against delay noise σ_τ: the convergence bound scales like O(√(σ_DP²+σ_τ²)), a Pareto choice between privacy budget ε≤1 and delay τ≤150 ms.
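The three delay mitigations are each a one-line formula. In this sketch the averaging is normalized by the sum of the decayed weights, an added assumption so the result stays on the scale of the parameters; the units of τ and the decay constants are likewise illustrative, and the function names are hypothetical.

```python
import math

def delay_weighted_average(params, weights, delays, decay=1.0):
    """Average parameter vectors with weights w_i * exp(-decay * tau_i), normalized."""
    coeffs = [w * math.exp(-decay * tau) for w, tau in zip(weights, delays)]
    total = sum(coeffs)
    dim = len(params[0])
    return [sum(c * p[j] for c, p in zip(coeffs, params)) / total
            for j in range(dim)]

def delayed_ratio(rho, eps, c, tau):
    """rho_tau = min(rho, 1 + eps * exp(-c * tau)): staler data gets a tighter cap."""
    return min(rho, 1.0 + eps * math.exp(-c * tau))

def delay_aware_lr(alpha0, tau_avg):
    """alpha_t = alpha0 / (1 + 0.1 * tau_avg)."""
    return alpha0 / (1.0 + 0.1 * tau_avg)
```

With zero delay all three reduce to their standard forms: a plain weighted mean, the usual upper PPO clip at 1 + ε, and the base learning rate.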

Further considerations

Multimodal, heavy-tailed extremes. During typhoons or dust storms the forecast distribution becomes multimodal with fat tails; swap the CVAE for a normalizing flow for better tail accuracy, and add a hard-constraint safety layer on machine limit loads to "ride through" with zero incidents.
Storage in the rollout. When the action space includes continuous storage power, inject the forecast uncertainty into a model-predictive-control rollout (a DRL-MPC hybrid) to cut spot-market deviation penalties in intraday dispatch.
Thermal–electric coupling and fast adaptation. Write the heat–power dynamics as a Neural ODE environment and meta-train (MAML) so the policy adapts quickly across −20 °C to 55 °C; online Kalman estimation of internal resistance R_p feeds the projection layer for life-long adaptive constraints.
Four-season meta-tasks. Model the seasons as a contextual-MDP family {M_z} with z an 8-D continuous context (timestamp + temperature + holiday + policy-window) rather than a hard label; train PPO with a Transformer context encoder, hold out an unseen year, and require recovery to 92% of optimal within ~1500 steps and an old-season forgetting rate <5%.
Scaling difference rewards. Beyond ~100 microgrids the O(N²) difference-reward computation becomes infeasible; compress G with an SVD low-rank approximation to ≤30 dimensions for linear complexity, and consider a distributionally robust penalty −κ·Var(G) when renewable output is non-stationary (Weibull shifts).
Multi-region season skew. When the northeast is in winter while the south is still in summer, treat region ID as a hierarchical context with a sparse mixture-of-experts output layer over region×season — one shared backbone serving "one country, four seasons" as a federated meta-policy.