Satellite-constellation power control
A low-Earth-orbit (LEO) broadband satellite must set its downlink transmit power every few milliseconds — enough to close a link through a channel that swings violently as geometry and the ionosphere change, but never enough to saturate the power amplifier, blow the harvested-energy budget, or burn battery cycle-life. The binding difficulties are sharp: the channel is only partially observable and non-stationary, the amplifier imposes a hard safety constraint, the true reward trades throughput against hardware lifetime, the orbit injects a periodic eclipse shock, and inference runs on a radiation-hardened FPGA with a millisecond deadline. Each one names a tool.
1 · Formulate — the MDP behind on-orbit power control
Intuition. A satellite radio measures a channel, picks a transmit power and a predicted channel coefficient, sends a packet, and learns later whether the link held and at what energy cost. That is already a Markov Decision Process: the relative geometry and link state are the state, the power / prediction-coefficient update is the action, the negative mean-squared prediction error under a power constraint is the reward, and orbital mechanics plus the radio hardware are the transition. Every section below is one of these four pieces being awkward in space.
| Piece | For a LEO downlink power controller | The awkward part |
|---|---|---|
| State S | relative velocity, inter-satellite range, geomagnetic Kp index, queue length — plus the latent true channel h | you never measure h directly, only delayed noisy proxies → partially observable |
| Action A | transmit power u, and the step that updates the channel-prediction coefficient | bounded by amplifier saturation umax — a hard constraint, not a soft cost |
| Reward R | negative prediction MSE; throughput minus energy cost minus battery wear | capacity now versus cycle-life later — multi-objective, delayed |
| Transition P | orbital geometry + ionosphere + solar power harvest | periodic eclipse flips the power budget → non-stationary |
Same pattern as every other domain: the MDP table writes itself, and the rightmost column — the awkward part — is the lesson. We reach for memory, a safety layer, a constrained reward, and meta-adaptation only because a row of the table demands it.
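The table can be written down as plain types, which makes the partial observability visible: the latent channel h has no field in the observation. This is a minimal sketch, not the authors' code. The field names, the normalized `U_MAX`, and the weighted-sum form of the reward are all assumptions made for illustration; the table only says which terms appear, not how they are combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """What the controller can measure: the State row without the latent h."""
    rel_velocity_km_s: float
    range_km: float
    kp_index: float
    queue_len: int


@dataclass(frozen=True)
class Action:
    tx_power: float    # transmit power u
    coeff_step: float  # update step for the channel-prediction coefficient


U_MAX = 1.0  # hypothetical normalized amplifier saturation level


def is_feasible(action: Action, u_max: float = U_MAX) -> bool:
    """Hard constraint from the Action row: power must stay inside [0, u_max]."""
    return 0.0 <= action.tx_power <= u_max


def reward(pred_mse, throughput, energy_cost, battery_wear,
           weights=(1.0, 1.0, 1.0, 1.0)):
    """Assumed weighted sum of the four terms named in the Reward row."""
    w_mse, w_tp, w_energy, w_wear = weights
    return (-w_mse * pred_mse + w_tp * throughput
            - w_energy * energy_cost - w_wear * battery_wear)
```

With unit weights, `reward(0.1, 2.0, 0.5, 0.25)` evaluates to 1.15; in practice the weights would be the main tuning knob between capacity and lifetime.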
2 · Partial observability — predict the channel you cannot see
Intuition. The instantaneous channel coefficient h that decides how much power closes the link is never measured cleanly: feedback is delayed, quantized, and corrupted by Doppler from a satellite crossing the sky at ~7.5 km/s. So the agent must act on a belief about h, not on h itself. That makes this a POMDP: inter-satellite channel prediction is cast as a partially observable MDP whose observed state is the four-tuple (relative velocity, inter-satellite range, Kp index, queue length) and whose action is the update step of the prediction coefficients.
Engineering detail. The belief encoder splits into a deterministic and a stochastic path: a LayerNorm-GRU carries the deterministic hidden state ht for long-horizon stability, a 3-layer MLP emits μt, σt for a reparameterized stochastic latent zt, and two cross-attention layers fuse them so the encoder attends selectively to the raw observation while suppressing noise. It is trained as a sequence model with an ELBO plus an information-bottleneck term, with coefficients β=1e-3 and γ=1e-4 to fight overfitting and gradient explosion.
A small actor-critic then runs on the belief: a PPO-Clip actor (continuous outputs via a Gaussian mixture for multi-modal expressiveness), a critic estimating V(bt) with GAE(λ=0.95), and the front three layers shared with the belief encoder to cut parameters by ~30%. The whole policy network is two 64-unit fully-connected layers with Tanh activations (~87 k parameters), so it fits the radiation and memory budget of Section 6.
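The critic's GAE(λ=0.95) estimate can be sketched in a few lines. This is the standard generalized-advantage recursion, written over belief-state values V(bt). The function name and the `discount` default are illustrative assumptions (the text gives λ but no discount factor), and the parameter is called `discount` to avoid a clash with the loss coefficient γ above.

```python
def gae_advantages(rewards, values, discount=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory.

    `values` must hold len(rewards) + 1 entries: V(b_0) .. V(b_T),
    where the last entry is the bootstrap value after the final step.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values needs one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + discount * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + discount * lam * running
        advantages[t] = running
    return advantages
```

For example, `gae_advantages([1.0, 1.0], [0.0, 0.0, 0.0], discount=1.0, lam=0.5)` gives `[1.5, 1.0]`: the first step sees its own TD error plus half of the next one.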
3 · The amplifier wall — a hard constraint, not a penalty
Intuition. A power amplifier (PA) saturates: push past umax and the signal clips, spectral regrowth interferes with neighbours, and the transistor ages faster. Unlike a soft cost you can trade away, this is a wall — the agent must learn to ride close to it for throughput yet never cross it. A naive "clip the output afterwards" approach distorts the policy's own gradient and still allows transient overshoot. The honest framing is a constrained MDP (CMDP): maximize throughput subject to the saturation cost staying under budget.
Engineering detail — defence in depth. The source builds the wall in layers, and no single layer is trusted alone:
- Reward saturation penalty: rsat = −10·max(0, |u| − umax)², so the network learns early to be "reluctant to hug the edge."
- Observe the duty cycle: feed the PA duty cycle dt into the state so the agent senses "I am near the ceiling" and pulls back on its own.
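- Barrier in the loss: a log-barrier term −β·log(1 − |u|/umax) added to the PPO-Clip loss (the quantity being minimized, so the term grows without bound as |u| approaches umax), with β treated as a Lagrange multiplier (distinct from the β of Section 2) and tuned so the KKT conditions hold approximately and the cost return Jc stays below its budget Pmax without the reward collapsing.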
- Slew-rate limiter at deploy: the RL target value is ramped over a 1 ms slope, so no step change can spike the PA into instantaneous saturation.
- Robustness training: hardware-in-the-loop ageing tests randomly drop umax to 70–90% of nominal, so the learned policy carries a built-in safety margin.
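Three of the layers above reduce to small scalar functions, sketched here under stated assumptions. The penalty follows the rsat formula as written. The barrier is treated as a loss term to be minimized. The slew limiter interprets "ramped over a 1 ms slope" as "a full-scale swing takes at least 1 ms", which is one plausible reading rather than a confirmed one. All function and parameter names are hypothetical.

```python
import math


def saturation_penalty(u, u_max):
    """r_sat = -10 * max(0, |u| - u_max)^2: zero inside the wall, quadratic beyond."""
    over = max(0.0, abs(u) - u_max)
    return -10.0 * over ** 2


def log_barrier(u, u_max, beta):
    """Loss term -beta * log(1 - |u|/u_max); infinite at or beyond the wall."""
    ratio = abs(u) / u_max
    if ratio >= 1.0:
        return math.inf
    return -beta * math.log(1.0 - ratio)


def slew_limit(prev_u, target_u, dt_s, u_max, ramp_s=1e-3):
    """Move toward target_u, but no faster than u_max per ramp_s seconds."""
    max_step = u_max * dt_s / ramp_s
    step = max(-max_step, min(max_step, target_u - prev_u))
    # Final clamp so the deployed command can never exceed the wall.
    return max(-u_max, min(u_max, prev_u + step))
```

With a 0.25 ms control tick, a commanded jump from 0 to full power is spread over four ticks, so the amplifier never sees a step input even if the policy asks for one.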
4 · Reward — capacity now versus lifetime later
Intuition. Maximizing throughput alone teaches the agent to blast power, draining the battery hard and aging the cells — a policy that looks great for a month and bricks the satellite years early. The real objective is multi-objective and the costly term is delayed: a deep discharge today shows up as lost cycle-life far in the future. So the reward must price wear explicitly, the way a battery-storage operator prices degradation against energy revenue.
Engineering detail. Folding the instantaneous power budget through a time-varying price λ(t) forces the agent to ingest the energy signal at the state-representation stage — objective and state close the loop, instead of the agent discovering the price only at reward time. The wear term is learned, not guessed: a degradation model predicts capacity-loss-rate from the cell's current/temperature/state-of-charge trace, and that predicted loss enters the reward so the policy learns to back off the rate and shrink the depth-of-discharge swing in stressful conditions — the same move that, in a battery-storage analogue, lifts usable cycle life substantially. To handle the delay, effects that surface tens of seconds later are credited with a differenced, off-policy-corrected return (V-trace style) rather than a naive instantaneous reward, so a deep-discharge action is still blamed for the wear it causes after the fact.
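A rough sketch of the two ideas in this paragraph: a reward that prices energy and predicted wear, and a V-trace-style target that carries a late wear cost back to the action that caused it. The reward form, its argument names, and `wear_weight` are assumptions. The target follows the standard V-trace recursion with truncated importance ratios, which is taken here as the meaning of "V-trace style"; the text does not spell out the exact estimator.

```python
def wear_priced_reward(throughput, energy_used, price,
                       predicted_capacity_loss, wear_weight):
    """Throughput minus priced energy minus priced (model-predicted) wear."""
    return (throughput - price * energy_used
            - wear_weight * predicted_capacity_loss)


def vtrace_targets(rewards, values, ratios, discount=0.99,
                   rho_bar=1.0, c_bar=1.0):
    """V-trace value targets for one trajectory.

    `values` has len(rewards) + 1 entries (last one is the bootstrap value);
    `ratios` are importance ratios pi(a|b) / mu(a|b), one per step.
    """
    n = len(rewards)
    targets = [0.0] * n
    acc = 0.0  # holds v_{t+1} - V(b_{t+1}) while walking backwards
    for t in reversed(range(n)):
        rho = min(rho_bar, ratios[t])  # truncated ratio for the TD error
        c = min(c_bar, ratios[t])      # truncated ratio for the trace
        delta = rho * (rewards[t] + discount * values[t + 1] - values[t])
        acc = delta + discount * c * acc
        targets[t] = values[t] + acc
    return targets
```

In an on-policy toy case with rewards `[0, 0, -1]`, zero values, and no discounting, every target is -1: the wear cost that lands at the last step is charged all the way back to the first action.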
5 · Eclipse non-stationarity — detect the regime, then adapt
Intuition. Every orbit the satellite passes into Earth's shadow. Solar input vanishes, the power budget collapses onto the battery, and the optimal policy flips: the controller that was right in sunlight is wrong in eclipse. This is a recurring, predictable distribution shift — a non-stationary MDP — and the agent must notice the regime change and react without forgetting how to behave in the other regime.
Engineering detail — graded response. Detection watches an ensemble-critic prediction variance: a sudden variance spike means the value model no longer fits the environment, which raises a tiered alarm. The response is graded by severity: level 1 keeps the network weights and merely turns up exploration noise (larger σ) for quick self-recovery; level 2 freezes the trunk and trains only a lightweight 2-layer adapter (<5% of parameters, lr 1e-4, EWC regularization to protect the weights that matter for the other regime); level 3 warm-starts from a MAML-pretrained meta-policy that restores ~90% of performance in a handful of gradient steps. Because eclipse is periodic, a meta-learned initialization adapts to a fresh orbit from only minutes of real data — decisive when a constellation is being filled out by rapid batch launches and every new shell is a slightly new task. Any adaptation runs in a shadow/gray environment until its cumulative advantage is non-negative and the old-regime regression suite passes 100% before it goes full-rate.
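The detection-and-promotion logic can be sketched as two small functions. The alarm compares the ensemble critics' variance against a calm-regime baseline; the promotion gate encodes the two conditions stated above. The threshold ratios `(3, 10, 30)` and all names are hypothetical placeholders, since the text does not give the tier boundaries.

```python
from statistics import pvariance

# Response per alarm level, paraphrasing the tiers described in the text.
RESPONSES = {
    0: "no change",
    1: "keep weights, raise exploration sigma",
    2: "freeze trunk, train adapter with EWC",
    3: "warm-start from meta-learned policy",
}


def alarm_level(critic_values, baseline_var, thresholds=(3.0, 10.0, 30.0)):
    """Map ensemble-critic variance (relative to baseline) to a level 0..3."""
    ratio = pvariance(critic_values) / baseline_var
    level = 0
    for i, threshold in enumerate(thresholds, start=1):
        if ratio >= threshold:
            level = i
    return level


def may_promote(cumulative_advantage, regression_pass_rate):
    """Shadow-to-full-rate gate: non-negative advantage and a 100% suite pass."""
    return cumulative_advantage >= 0.0 and regression_pass_rate >= 1.0
```

A usage pattern would be to call `alarm_level` once per control window and act on `RESPONSES[level]`, keeping any adapted policy in shadow mode until `may_promote` returns true.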
6 · The radiation-hardened millisecond — inference is part of the MDP
Intuition. The policy does not run in a datacenter; it runs on a radiation-tolerant FPGA aboard the spacecraft, on a link-layer time-slot budget. A model that is slightly more accurate but misses the slot, or that cannot survive a single-event upset (SEU) from a cosmic ray, is worse than useless. Latency, memory, and fault-tolerance are hard constraints that shaped the 87 k-parameter network of Section 2 — not afterthoughts.
Engineering detail. The compact two-layer Tanh policy runs in ~5 ms per iteration on a radiation-tolerant FPGA (e.g. a Zynq-class part), meeting the CCSDS 141.0-B-1 link-layer slot timing. At inference only the policy network is kept resident, holding RAM under 256 kB so the satellite can survive an SEU and restart under triple-modular redundancy (TMR). Sample efficiency comes from hybrid experience replay: 80% of transitions from a digital-twin orbit simulator (carrying a reference ionosphere model) and 20% from in-orbit telemetry. The mismatch between simulator and reality is absorbed by importance-sampling weights: the same ratio clipping that PPO already uses suppresses the distribution drift the high-dynamics channel induces, so a single satellite converges in ~1,200 orbit·steps, well under a typical ~3,000 orbit·step budget. Reported on-orbit, this kind of design cuts channel-prediction MSE by ~42% while keeping on-board CPU utilization under 8%.
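The 80/20 replay mix and the clipping it relies on can be sketched as follows. The buffer layout, function names, and the clip range `eps=0.2` are assumptions (0.2 is a common PPO default, not a value given in the text), and sampling with replacement is chosen only for simplicity.

```python
import random


def sample_hybrid_batch(sim_buffer, orbit_buffer, batch_size,
                        sim_fraction=0.8, rng=random):
    """Draw a batch with ~sim_fraction simulator and the rest telemetry samples."""
    n_sim = round(batch_size * sim_fraction)
    n_orbit = batch_size - n_sim
    batch = (rng.choices(sim_buffer, k=n_sim)
             + rng.choices(orbit_buffer, k=n_orbit))
    rng.shuffle(batch)  # avoid a fixed sim-then-orbit ordering in the batch
    return batch


def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-Clip per-sample objective: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

The clipping is what bounds the influence of a stale simulator transition: with a positive advantage, a sample whose ratio has drifted to 2.0 contributes as if the ratio were 1.2.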
Further considerations
- Multi-satellite cooperation. Model neighbours in one orbital plane as multi-agent PPO (MAPPO) — shared channel observations, independent policies — exchanging hidden-layer features over a low-latency (~2 ms) inter-satellite laser backbone, which can push prediction error toward ~65% of the single-satellite figure. The cost is encrypting the inter-satellite data link (e.g. an SM4-GCM hardware core, <0.2 W extra per link) to satisfy spectrum-coordination rules.
- Fuse a physics prior. Add a physical-residual reward term Rphy = −η·|ĥLSTM − hparabolic| so the policy gracefully degrades to the classical parabolic channel model during quiet ionosphere — buying interpretability for review and cutting overfitting risk.
- On-orbit online update under a tiny message budget. With only a short-message uplink (~1,750 bit per message), compress gradients to 1-bit signSGD; ~3 messages per day suffice to hot-update the whole constellation's policy, enabling day-scale algorithm evolution instead of waiting for a half-yearly ground-station contact window.
- Constellation-scale action explosion. Scaling from a few hundred to thousands of satellites blows up the joint action space; move to MARL with centralized-training / distributed-execution (CTDE) and a graph-attention network over the inter-satellite topology, while explicitly handling the non-stationarity and communication delay it introduces.
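For the short-message update path in the list above, a back-of-the-envelope sketch shows how 1-bit signSGD interacts with a ~1,750-bit message. The functions are illustrative, and the count ignores headers, framing, and error-correction overhead, none of which the text specifies.

```python
import math


def sign_compress(grad):
    """signSGD compression: one bit per coordinate (True means non-negative)."""
    return [g >= 0.0 for g in grad]


def apply_sign_update(params, bits, lr):
    """Descent step using only gradient signs: p <- p - lr * sign(g)."""
    return [p - lr * (1.0 if b else -1.0) for p, b in zip(params, bits)]


def messages_needed(n_params, bits_per_message=1750):
    """Messages required to carry one sign bit per parameter, payload only."""
    return math.ceil(n_params / bits_per_message)
```

Under these assumptions, `messages_needed(87_000)` is 50 for the full ~87 k-parameter policy, while `messages_needed(4350)` is 3 for an adapter of 5% of that size. One possible reading is that the ~3-messages-per-day figure applies to an adapter-sized update of the kind described in Section 5, though the text does not say so.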