Energy-management microgrid
A microgrid energy-management system is a dispatch controller that decides, every few seconds, how much to charge or discharge the battery, which loads to shed, and how much to draw from the grid, so as to minimize the electricity bill without ever violating a physical limit. Four things make this MDP hard: the state is a forecast, not a fact (weather and price are uncertain); the action lives under hard safety constraints that a stochastic policy must never break; the reward spans multiple time-scales and includes a depreciation cost you cannot directly measure; and the environment is openly non-stationary (seasons) and distributed across sites that will not share raw load data. Each difficulty is answered below by a named mechanism.
1 · Formulate — the MDP behind a microgrid controller
Intuition. A microgrid agent looks at the weather forecast, the battery's state of charge, the current electricity price, and the building's load; it decides how to move power between battery, grid, and loads; it pays the electricity bill at the end of each interval. That is already a Markov Decision Process. Every difficulty below is one of these four pieces being awkward in a real grid.
| Piece | For a battery+load microgrid | The awkward part |
|---|---|---|
| State S | SoC, load, PV/wind output, price, weather forecast | the forecast is a distribution, not a value, and links can drop → uncertain & partially observable |
| Action A | battery power P (continuous) × which loads / chillers to switch (integer) | hybrid, and most actions would violate a hard physical limit |
| Reward R | energy cost saved − battery wear − outage penalty | spans real-time + peak/valley + demand tiers, and wear is unmeasured |
| Transition P | weather, market price, battery electrochemistry, neighbours | seasonal non-stationarity and a multi-agent, privacy-bound grid |
The same pattern as every lesson: the MDP table writes itself, and the rightmost column — the awkward part — is the lesson. Each row's difficulty has a named mechanism that answers it.
2 · State is a forecast — carry the distribution, not the point
Intuition. If you feed the policy a single predicted wind speed and treat it as truth, the agent learns to bet the farm on a number that is wrong half the time. A weather forecast is honestly a probability distribution; the controller should see the spread, not just the mean, so it can hold back power when the forecast is uncertain and lean in when it is confident.
Engineering detail. Offline, train a conditional variational auto-encoder (CVAE) on years of met-tower, satellite, and numerical-weather-ensemble data: the encoder ingests the current 500 hPa geopotential field, surface pressure and humidity; the decoder emits the joint distribution parameters — mean, variance, correlation — of local wind speed, direction and temperature. A latent dimension of just 32 keeps the model under the grid-edge GPU's 2 GB budget. Online, append those distribution parameters to the deterministic state to form an augmented state:
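The augmented-state formula did not survive in this copy. A form consistent with the description above is sketched here; the symbol names are assumptions, not the original notation. The deterministic part (SoC, load, price) is written s_t^det, and μ_t, σ_t², ρ_t are the mean, variance and correlation parameters emitted by the CVAE decoder:

$$
s_t^{+} = \big[\, s_t^{\mathrm{det}};\ \mu_t;\ \sigma_t^{2};\ \rho_t \,\big]
$$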
On the forward pass, use the reparameterization trick to draw K = 5 weather realizations {s_{t,k}}, run all K through a shared-weight actor, and feed the statistics of the resulting distribution (mean and 95% CVaR) into the critic, which outputs a risk-sensitive value V_φ(s_t⁺). The training objective is standard PPO plus a risk term:
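The objective itself is missing from this copy. One plausible shape, given only as an assumption, adds a β-weighted CVaR penalty over the K sampled realizations to the usual PPO loss. Here ℓ_{t,k} stands for a per-realization loss (for example a negative return or a ramp-limit excess); both the symbol and the choice of loss are hypothetical:

$$
\mathcal{L}(\theta,\phi) = \mathcal{L}^{\mathrm{PPO}}(\theta,\phi) + \beta \,\mathrm{CVaR}_{0.95}\!\big(\{\ell_{t,k}\}_{k=1}^{K}\big)
$$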
β is tuned on site so that under a ±5 m/s wind step the power ramp still respects the grid code's 1.5 MW/min limit. Against a baseline that treats the forecast as truth, this distributional state has been reported to lift annual energy yield ~1.8% and cut dispatch-penalty fines ~42%. When only a deterministic forecast is available, fit the historical residuals with deep kernel density estimation and add that residual distribution online — the "distributional state" requirement still holds.
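As a minimal sketch of the sampling step in this section, the snippet below draws K reparameterized forecast realizations and summarizes a per-realization loss by its mean and CVaR. The function names, the scalar Gaussian forecast, and the shortfall loss are illustrative assumptions; the source's actor, critic and CVAE are not modeled.

```python
import math
import random


def sample_realizations(mu, sigma, k=5, rng=random):
    """Reparameterization trick: w = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [mu + sigma * rng.gauss(0.0, 1.0) for _ in range(k)]


def cvar(losses, alpha=0.95):
    """Average of the worst (1 - alpha) share of losses (at least one sample)."""
    ordered = sorted(losses, reverse=True)
    # round() guards against float noise such as 0.05 * 20 = 1.0000000000000009
    n_tail = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    return sum(ordered[:n_tail]) / n_tail


def risk_statistics(mu, sigma, loss_fn, k=5, alpha=0.95, rng=random):
    """Mean and CVaR of a per-realization loss over K sampled forecasts."""
    losses = [loss_fn(w) for w in sample_realizations(mu, sigma, k, rng)]
    return sum(losses) / len(losses), cvar(losses, alpha)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical loss: shortfall (m/s) against a plan that assumed 10 m/s wind.
    shortfall = lambda wind: max(0.0, 10.0 - wind)
    print(risk_statistics(mu=10.0, sigma=2.0, loss_fn=shortfall, rng=rng))
```

Note that with K = 5 and a 95% level the tail holds a single sample, so the CVaR reduces to the worst of the five realizations; a larger K gives a smoother tail estimate at higher inference cost.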
3 · Standardize state & action without breaking the constraints
Intuition. Neural networks want inputs near unit scale, but a microgrid mixes volts, amps, kWh and °C — wildly different magnitudes. Naive normalization is dangerous here: if the agent reasons in normalized space, a small numeric slip can map back to a current that trips a breaker. So normalization must be invertible, auditable, and constraint-aware.
Engineering detail — three moves. (1) Robust, frozen scaling. From ≥10⁶ logged transitions, compute a per-dimension robust scale that ignores spikes, then freeze it:
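The scaling formula is missing from this copy. A common robust choice, assumed here rather than confirmed by the source, is the median as offset and the interquartile range as scale. The sketch below fits both once and stores them in an immutable object; `FrozenScaler` and `fit_robust_scaler` are hypothetical names.

```python
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class FrozenScaler:
    """Immutable offset/scale pair, so deployed constants cannot drift."""
    offset: float
    scale: float

    def normalize(self, x):
        return (x - self.offset) / self.scale

    def denormalize(self, z):
        return z * self.scale + self.offset


def fit_robust_scaler(column):
    """Median as offset, interquartile range as scale (assumed statistics)."""
    q1, median, q3 = statistics.quantiles(column, n=4)
    iqr = q3 - q1
    # Fall back to unit scale for a constant column to avoid division by zero.
    return FrozenScaler(offset=median, scale=iqr if iqr > 0 else 1.0)


if __name__ == "__main__":
    currents = [1, 2, 3, 4, 5, 6, 7, 8, 1000]  # one sensor spike
    scaler = fit_robust_scaler(currents)
    print(scaler, scaler.normalize(10.0))
```

Because quartiles ignore the single spike at 1000, the scale stays representative of normal operation, which is the property the frozen constants need.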
Bake offset and scale into the model's constant nodes so inference never recomputes them — the DCS can audit them as static constants. (2) A boundary-distance feature. For any hard-constrained variable, don't just hand the network the normalized value; also hand it the log-distance to the legal limit, so the gradient is both stable and aware of which direction is dangerous:
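The feature's formula is missing from this copy. One form that matches the description is the log of the remaining headroom, floored at a small ε > 0 so the logarithm stays finite at the limit; x_t (the physical value of the constrained variable) and ε are assumed symbols:

$$
d_t = \log\!\big(\max(c - x_t,\ \varepsilon)\big)
$$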
where c is the regulatory ceiling. (3) De-normalize then project, in physical units. The policy emits mean and std; convert to raw units with the same offset/scale, then clip in physical space — so the clip can guarantee no violation while log_prob stays differentiable for SAC/PPO:
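The exact expression is not preserved here, so the following is a minimal sketch of move (3) under stated assumptions: a scalar Gaussian policy, a single power setpoint in kW, and hypothetical function names. The log-probability is taken on the unclipped normalized sample, and the clip is applied afterwards in physical units.

```python
import math
import random


def gaussian_log_prob(z, mean, std):
    """Log-density of the unclipped Gaussian sample, used by the policy loss."""
    return (-0.5 * ((z - mean) / std) ** 2
            - math.log(std)
            - 0.5 * math.log(2.0 * math.pi))


def sample_safe_action(mean_norm, std_norm, offset, scale, p_min, p_max, rng=random):
    """Sample in normalized space, de-normalize, then clip in physical units."""
    z = mean_norm + std_norm * rng.gauss(0.0, 1.0)  # reparameterized sample
    raw = z * scale + offset                         # same frozen offset/scale as the inputs
    applied = min(max(raw, p_min), p_max)            # hard limit enforced in kW
    log_prob = gaussian_log_prob(z, mean_norm, std_norm)
    return applied, raw, log_prob


if __name__ == "__main__":
    rng = random.Random(1)
    print(sample_safe_action(0.5, 1.0, offset=0.0, scale=100.0,
                             p_min=-250.0, p_max=250.0, rng=rng))
```

Keeping `raw` alongside `applied` lets an auditor see how often the clip was active, which is a useful signal that the policy is leaning on the limit.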
4 · The action under hard constraints — a three-layer safety chain
Intuition. The battery has physical red lines: never over-discharge, never exceed peak power, keep efficiency above a floor. A learning policy is stochastic — given enough samples it will eventually pick a forbidden action unless something structurally forbids it. The principle is defense in depth: make illegal actions impossible to sample, project anything that slips through, and penalize residual risk during learning.
Engineering detail. First, reparameterize the action: instead of raw power P, output a normalized efficiency coefficient k∈[0,1] mapped through a differentiable lookup to physical power P = f⁻¹(k; SoC), so the policy gradient has real magnitude only in the high-efficiency band. Then the three layers:
- Layer 1: action masking. Before the softmax (discrete loads) or before sampling (continuous power), set the probability of any SoC-illegal action to zero. For integer load switching, use a masked Gumbel-Softmax so an infeasible combination (e.g. starting a 4th chiller when N+1 redundancy allows 3) can never be drawn. The EMS pushes the feasibility mask down live and illegal logits are set to −∞, so sampled actions respect the masked limits by construction.
- Layer 2: safety projection. If model error defeats the mask, a convex projection layer maps the action into the feasible set using an analytic KKT / single-step QP solution: ~3 µs per step, <2% CPU at a 1 kHz control rate, O(1) forward cost and a continuous gradient.
- Layer 3: Lagrangian PPO. Add a cost term outside the reward with budget δ. With the cost counted per operating hour, δ = 0.001 means "≤1 over-discharge per 1000 operating hours." After ~5M steps the Lagrange multiplier converges and, empirically, the policy reaches zero violations.
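The three layers above can be sketched in a few lines each. This is an illustration under simplifying assumptions, not the deployed chain: the feasible set is a one-dimensional power interval (so the KKT projection reduces to a clip), conversion losses are ignored, and all function and parameter names are hypothetical.

```python
import math


def masked_softmax(logits, feasible):
    """Layer 1: infeasible logits become -inf, so their probability is exactly 0."""
    if not any(feasible):
        raise ValueError("at least one action must be feasible")
    masked = [l if ok else -math.inf for l, ok in zip(logits, feasible)]
    top = max(masked)
    exps = [math.exp(m - top) for m in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def project_power(p_kw, soc, soc_min, soc_max, capacity_kwh, dt_h, p_max_kw):
    """Layer 2: nearest feasible power (kW, positive = discharge) in a 1-D box."""
    max_discharge = min(p_max_kw, (soc - soc_min) * capacity_kwh / dt_h)
    max_charge = min(p_max_kw, (soc_max - soc) * capacity_kwh / dt_h)
    lo, hi = -max(max_charge, 0.0), max(max_discharge, 0.0)
    return min(max(p_kw, lo), hi)


def update_multiplier(lam, avg_cost, budget, lr=0.05):
    """Layer 3: dual ascent; lam grows while measured cost exceeds the budget."""
    return max(0.0, lam + lr * (avg_cost - budget))


if __name__ == "__main__":
    print(masked_softmax([1.0, 2.0, 3.0], [True, True, False]))
    print(project_power(40.0, soc=0.12, soc_min=0.10, soc_max=0.90,
                        capacity_kwh=50.0, dt_h=0.25, p_max_kw=25.0))
    print(update_multiplier(0.0, avg_cost=0.01, budget=0.001, lr=1.0))
```

In the second call the battery sits just above its SoC floor, so the requested 40 kW discharge is cut to roughly 4 kW, well below the 25 kW power rating: the energy limit binds before the power limit does.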
State estimation closes the loop: a domestic EKF library keeps SoC error to ~1.8%, satisfying the 5% margin the standard requires. One 50 kWh unit ran six months with no BMS protection trip and <1% capacity fade.
5 · Reward — bill, depreciation, and outage, on three clocks
Intuition. The obvious reward is the energy-cost saving, but cycling the battery hard to chase arbitrage silently ages it — a cost you pay months later and never see in the per-step bill. And shedding a critical load to save money can black out a hospital ward, a cost no price can capture. So the reward is three terms: revenue, a wear charge, and an outage penalty whose weight changes with how critical the load is right now.
Engineering detail: making wear visible and bounded. Derive a stress factor stress_t from temperature, depth-of-discharge and SoC window, concatenate it into the SAC critic's input, and let r_depreciate charge for it, but keep |r_depreciate| ≲ 0.01 when r_revenue ∈ [−0.1, 0.1] so depreciation shades rather than overshadows the main signal. The policy then learns to back off power or shift its SoC window under high stress. Two refinements from the field: once capacity drops below 80% fade accelerates, so add a feedback term stress_cap = 1/(1 − 0.8·k_cap) to make the agent conservative early; and when a hard standard says "temperature > 55 °C or SoC < 5% → forced shutdown," set the corresponding stress to +∞ so the constraint is satisfied through the reward itself, with no engineering patch.
Engineering detail — the dynamic outage weight. The outage penalty weight is not constant; it scales with battery reserve, price pressure and weather alerts:
where Price_ratio is the current price divided by the trailing-30-day maximum. Top-tier (T0) critical loads additionally get a −∞ mask on any shed action, so the policy structurally cannot cut them. Before go-live, run a counterfactual simulation: hold λ constant, check the policy doesn't collapse into "shed everything" or "serve everything," then ramp λ back to dynamic and verify the loss-of-load probability (LOLP) stays under the utility's ≤0.1% KPI.
6 · The arbitrage–depreciation tension — the napkin math
Intuition. Every extra point of daily depth-of-discharge earns more peak/valley arbitrage today but spends a slice of the battery's finite ~6000-cycle life. Push the SoC window too wide and the wear cost (amortized replacement plus accelerating fade) eats the arbitrage you just earned. There is an interior optimum, and the trade-off reduces to one-line economics: net value = arbitrage revenue − wear cost, which rises with daily depth-of-discharge, peaks, and then falls.
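A minimal sketch of that napkin math follows. Only the ~6000-cycle life and the 50 kWh unit size come from this lesson; the price spread, round-trip efficiency, replacement cost, reference depth-of-discharge and the power-law cycle-life model are illustrative assumptions chosen to make the interior optimum visible.

```python
def net_daily_value(dod, capacity_kwh=50.0, spread_per_kwh=0.10, efficiency=0.90,
                    replacement_cost=15000.0, ref_cycles=6000.0, ref_dod=0.80,
                    exponent=2.0):
    """Daily arbitrage minus amortized wear for one cycle at depth-of-discharge `dod`."""
    arbitrage = capacity_kwh * dod * spread_per_kwh * efficiency
    # Assumed power law: cycle life shrinks as cycles get deeper.
    cycle_life = ref_cycles * (ref_dod / dod) ** exponent
    wear = replacement_cost / cycle_life
    return arbitrage - wear


def best_dod(steps=20):
    """Grid search over depth-of-discharge in (0, 1]."""
    grid = [i / steps for i in range(1, steps + 1)]
    return max(grid, key=net_daily_value)


if __name__ == "__main__":
    for dod in (0.2, 0.4, 0.6, 0.8, 1.0):
        print(f"DoD {dod:.1f}: net {net_daily_value(dod):.3f} per day")
    print("best DoD on the grid:", best_dod())
```

With these numbers revenue grows linearly in depth-of-discharge while wear grows quadratically, so net value peaks near 60% and falls toward full-depth cycling. Changing the exponent or the spread moves the peak, which is the point of the exercise.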
7 · Guard — partial observability, non-stationarity, and the distributed grid
Intuition. In production three things break that simulation hides: communication links drop, the seasons turn, and the grid is many sites that won't share their raw load curves. Each needs a detector and a fallback, not a one-off fix.
Engineering detail: POMDP for link loss. Model dropped communication explicitly. Keep a ring buffer of the last k = 10 observations; on any frame lost to a drop or to >100 ms latency, set the missing-mask m_t = 0 and form a fixed-width observation x_t = [o_local; o_remote ⊙ m_t; m_t]. A lightweight 200-particle filter updates a belief b(s_t) on the edge GPU; when m_t = 0 it runs the predict step and skips the update, so belief variance grows naturally and feeds the risk-sensitive value. The policy conditions on a belief embedding z_t = Encoder(b_t) (2-layer GRU + MLP), trained with PPO-Clip plus an information-bottleneck term λH[b_t] that nudges the agent toward a conservative local mode during blackout. Tie a belief-entropy threshold to a minimum-risk fallback so the system reaches a safe state within the regulated window.
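The masking and the predict-only behaviour can be sketched as follows. This is a toy under stated assumptions: a scalar state, a random-walk process model and a Gaussian measurement model, none of which are specified in the source, and hypothetical function names.

```python
import math
import random
import statistics


def build_observation(o_local, o_remote, remote_dim):
    """Fixed-width x = [o_local; o_remote * m; m]; a lost frame is passed as None."""
    received = o_remote is not None
    m = 1.0 if received else 0.0
    remote = list(o_remote) if received else [0.0] * remote_dim
    return list(o_local) + remote + [m]


def pf_step(particles, measurement, process_std, meas_std, rng):
    """One particle-filter step; with no measurement only the predict step runs."""
    predicted = [p + rng.gauss(0.0, process_std) for p in particles]
    if measurement is None:
        return predicted  # update skipped, so the belief spreads out
    weights = [math.exp(-0.5 * ((measurement - p) / meas_std) ** 2) for p in predicted]
    if sum(weights) == 0.0:
        return predicted  # measurement too far from every particle to reweight
    return rng.choices(predicted, weights=weights, k=len(predicted))


if __name__ == "__main__":
    rng = random.Random(0)
    belief = [0.0] * 200
    for frame in (None, None, None, 0.3):  # three dropped frames, then one received
        belief = pf_step(belief, frame, process_std=1.0, meas_std=0.5, rng=rng)
        print(frame, round(statistics.pvariance(belief), 3))
```

Running the demo shows the variance rising over the dropped frames and collapsing once a measurement arrives, which is the signal the risk-sensitive value and the entropy-threshold fallback consume.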
8 · Distributed coordination & privacy — the multi-agent layer
Intuition. A region is many microgrids sharing transformers and a common social objective (minimize total cost + curtailment + line loss). Each must act on local observations in milliseconds, yet none will upload its raw load curve. So: align local incentives with the global objective, and share gradients, not data — privately.
Engineering detail — difference rewards. Give each agent a local reward that is provably aligned with the global one: its own cost, plus a difference reward measuring its marginal contribution, plus a zero-sum potential correction that fairly spreads line-overload penalties:
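The reward formula is missing from this copy. A form consistent with the three named terms is sketched below; the notation and the placement of the weight λ are assumptions. Here c_i is agent i's own cost, G is the global objective, a_i^0 is a default (counterfactual) action for agent i, and φ_i is the potential correction:

$$
r_i = -\,c_i \;+\; \lambda\,\big[\,G(s, a) - G\big(s, (a_{-i}, a_i^{0})\big)\big] \;+\; \varphi_i,
\qquad \sum_{i=1}^{N} \varphi_i = 0
$$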
Train a centralized critic Q̂(s,a) in the dispatch cloud, sync its parameters to edge gateways every ~5 min, and execute fully locally. Potential-game theory shows that if G is an exact potential for the agents' rewards, every maximizer of G, including the social optimum, is a Nash equilibrium, so no agent gains by deviating from it unilaterally; other, local equilibria can still exist. In practice tuning λ is reported to keep the approximation error <1%. When comms break for >500 ms the agent falls back to a purely local reward (own cost + overload penalty) and back-fills the difference term with a sliding window when the link returns, so training doesn't diverge.
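To make the difference-reward term concrete, here is a toy computation. The global objective (energy cost plus a quadratic penalty on shared-line overload), the zero-draw default action and all names are assumptions for illustration, not the lesson's actual G.

```python
def global_objective(draws_kw, prices, limit_kw, penalty=1.0):
    """Toy G: negative of total energy cost plus quadratic line-overload penalty."""
    energy_cost = sum(p * d for p, d in zip(prices, draws_kw))
    overload = max(0.0, sum(draws_kw) - limit_kw)
    return -(energy_cost + penalty * overload ** 2)


def difference_rewards(draws_kw, prices, limit_kw, default_kw=0.0, penalty=1.0):
    """D_i = G(a) - G(a with agent i replaced by a default action)."""
    g_actual = global_objective(draws_kw, prices, limit_kw, penalty)
    rewards = []
    for i in range(len(draws_kw)):
        counterfactual = list(draws_kw)
        counterfactual[i] = default_kw  # what the grid would look like without agent i
        rewards.append(g_actual - global_objective(counterfactual, prices, limit_kw, penalty))
    return rewards


if __name__ == "__main__":
    # Three sites on a 50 kW shared line; together they overload it by 10 kW.
    print(difference_rewards([10.0, 20.0, 30.0], [1.0, 1.0, 1.0], limit_kw=50.0))
```

Each agent needs one extra evaluation of G, and each evaluation touches all N agents, which is where the quadratic cost at large N comes from.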
Engineering detail: privacy-preserving gradients. Three escalating levels trade privacy strength against communication cost.
- Level 1: local differential privacy. Clip the policy gradient to sensitivity C = median(‖g_i‖₂), add Gaussian noise calibrated by a moments accountant (per-round ε = 0.1, cumulative ε ≤ 10 over 1000 rounds), and FedAvg the noisy gradients. Load data never leaves the site; convergence slows ~18%, but LR warm-up plus momentum recovers most of it.
- Level 2: Top-K = 1% sparsification + random rotation + SecAgg, so per-site noise is amortized across the n participants as σ₁/√n. Same ε, two orders of magnitude less traffic, <200 ms aggregation.
- Level 3: distill a 2-layer student policy (5% of the parameters) whose loss contains neither reward nor state, and upload only its KL-gradient under CKKS homomorphic encryption with Laplace output noise. Membership-inference success drops below 1%.
Log every ε spent to a privacy-budget ledger for audit.
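Level 1 can be sketched as clip, add noise, average. This is a simplified illustration: the noise multiplier is passed in directly rather than derived from a moments accountant, the median norm is computed centrally for brevity, and the function names are hypothetical.

```python
import math
import random
import statistics


def l2_norm(grad):
    return math.sqrt(sum(x * x for x in grad))


def clip_to_norm(grad, c):
    """Scale the gradient down so its L2 norm is at most c."""
    norm = l2_norm(grad)
    factor = min(1.0, c / norm) if norm > 0 else 1.0
    return [x * factor for x in grad]


def private_average(site_grads, noise_multiplier, rng):
    """Clip each site's gradient to the median norm, add Gaussian noise, then average."""
    c = statistics.median([l2_norm(g) for g in site_grads])
    sigma = noise_multiplier * c  # assumed calibration; a real accountant sets this
    noisy = [[x + sigma * rng.gauss(0.0, 1.0) for x in clip_to_norm(g, c)]
             for g in site_grads]
    dim = len(site_grads[0])
    averaged = [sum(g[j] for g in noisy) / len(noisy) for j in range(dim)]
    return averaged, c


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.0, 1.0], [6.0, 8.0]]
    print(private_average(grads, noise_multiplier=1.0, rng=rng))
```

In a real deployment each site would clip and add noise locally before anything leaves the premises, and the clipping bound itself would need to be agreed without revealing per-site norms.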
Further considerations
- Multimodal, heavy-tailed extremes. Under typhoon or dust storms the forecast distribution becomes multimodal with fat tails; swap the CVAE for a normalizing flow for better tail accuracy, and add a hard-constraint safety layer on machine limit loads to "ride through" with zero incidents.
- Storage in the rollout. When the action space includes continuous storage power, inject the forecast uncertainty into a model-predictive-control rollout (a DRL-MPC hybrid) to cut spot-market deviation penalties in intraday dispatch.
- Thermal–electric coupling and fast adaptation. Write the heat–power dynamics as a Neural ODE environment and meta-train (MAML) so the policy adapts quickly across −20 °C to 55 °C; online Kalman estimation of internal resistance Rp feeds the projection layer for life-long adaptive constraints.
- Four-season meta-tasks. Model the seasons as a contextual-MDP family {Mz} with z an 8-D continuous context (timestamp + temperature + holiday + policy-window) rather than a hard label; train PPO with a Transformer context encoder, hold out an unseen year, and require recovery to 92% of optimal within ~1500 steps and an old-season forgetting rate <5%.
- Scaling difference rewards. Beyond ~100 microgrids the O(N²) difference-reward computation becomes infeasible; compress G with an SVD low-rank approximation to ≤30 dimensions for linear complexity, and consider a distributionally robust penalty −κ·Var(G) when renewable output is non-stationary (Weibull shifts).
- Multi-region season skew. When the northeast is in winter while the south is still in summer, treat region ID as a hierarchical context with a sparse mixture-of-experts output layer over region×season — one shared backbone serving "one country, four seasons" as a federated meta-policy.