Satellite-constellation power control

A low-Earth-orbit (LEO) broadband satellite must set its downlink transmit power every few milliseconds — enough to close a link through a channel that swings violently as geometry and the ionosphere change, but never enough to saturate the power amplifier, blow the harvested-energy budget, or burn battery cycle-life. The binding difficulties are sharp: the channel is only partially observable and non-stationary, the amplifier imposes a hard safety constraint, the true reward trades throughput against hardware lifetime, the orbit injects a periodic eclipse shock, and inference runs on a radiation-hardened FPGA with a millisecond deadline. Each one names a tool.

The method — five steps, every lesson

Applied RL is the same loop in every domain: (1) Formulate the MDP — state, action, reward, transition, horizon. (2) Diagnose the property that makes this MDP hard. (3) Engineer the mechanism that removes that difficulty. (4) Guard it in production — detect when it breaks and fall back. (5) Iterate. The art is only in which difficulty binds first. Here it binds on a satellite that cannot be touched again once it launches.

1 · Formulate — the MDP behind on-orbit power control

Intuition. A satellite radio measures a channel, picks a transmit power and a predicted channel coefficient, sends a packet, and learns later whether the link held and at what energy cost. That is already a Markov Decision Process: the relative geometry and link state are the state, the power / prediction-coefficient update is the action, the negative mean-squared prediction error under a power constraint is the reward, and orbital mechanics plus the radio hardware are the transition. Every section below is one of these four pieces being awkward in space.

MDP = (S, A, R, P, γ) · maximize E[ Σₜ γᵗ rₜ ] s.t. E[ Σₜ γᵗ c_power(sₜ,aₜ) ] ≤ P_max

Piece	For a LEO downlink power controller	The awkward part
State S	relative velocity, inter-satellite range, geomagnetic Kp index, queue length — plus the latent true channel h	you never measure h directly, only delayed noisy proxies → partially observable
Action A	transmit power u, and the step that updates the channel-prediction coefficient	bounded by amplifier saturation u_max — a hard constraint, not a soft cost
Reward R	negative prediction MSE; throughput minus energy cost minus battery wear	capacity now versus cycle-life later — multi-objective, delayed
Transition P	orbital geometry + ionosphere + solar power harvest	periodic eclipse flips the power budget → non-stationary

Same pattern as every other domain: the MDP table writes itself, and the rightmost column — the awkward part — is the lesson. We reach for memory, a safety layer, a constrained reward, and meta-adaptation only because a row of the table demands it.
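The constrained objective in the MDP line above is two discounted sums, one for reward and one for power cost, with the second held under P_max. A minimal sketch of that bookkeeping follows. The function names and the numbers in the example are illustrative assumptions, not values from the lesson.

```python
from typing import Sequence


def discounted_sum(values: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma**t * values[t], accumulated backwards."""
    total = 0.0
    for v in reversed(values):
        total = v + gamma * total
    return total


def satisfies_power_budget(power_costs: Sequence[float], gamma: float, p_max: float) -> bool:
    """True when the discounted power cost stays within the budget P_max."""
    return discounted_sum(power_costs, gamma) <= p_max


# Hypothetical three-step episode.
rewards = [1.0, 1.0, 1.0]
power_costs = [0.5, 0.5, 0.5]
gamma = 0.9

reward_return = discounted_sum(rewards, gamma)    # 1 + 0.9 + 0.81
cost_return = discounted_sum(power_costs, gamma)  # half of the above
```

A policy is feasible only if `satisfies_power_budget` holds in expectation over episodes; a single episode, as here, is just one sample of that expectation.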

2 · Partial observability — predict the channel you cannot see

Intuition. The instantaneous channel coefficient h that decides how much power closes the link is never measured cleanly: feedback is delayed, quantized, and corrupted by Doppler from a satellite crossing the sky at ~7.5 km/s. So the agent must act on a belief about h, not on h itself. That makes this a POMDP: inter-satellite channel prediction is cast as a partially observable MDP whose observation is the four-tuple (relative velocity, inter-satellite range, Kp index, queue length), whose hidden state additionally contains the true channel h, and whose action is the update step of the prediction coefficients.

b_t = f_θ(o_≤t, a_<t) | ĥ_t = g(b_t) | r_t = −‖ĥ_t − h_t‖² (with power penalty)

Engineering detail. The belief encoder splits into a deterministic and a stochastic path: a LayerNorm-GRU carries the deterministic hidden state h_t for long-horizon stability (inside the encoder, h_t denotes the GRU hidden state, not the channel coefficient h of the reward above), a 3-layer MLP emits μ_t, σ_t for a reparameterized stochastic latent z_t, and two cross-attention layers fuse them to focus selectively on the raw observation while suppressing noise. It is trained as a sequence model with an ELBO plus an information-bottleneck term:

L = −E[ log p(o_t,r_t | h_t,z_t,a_t) ] + β·KL( q(z_t|h_t,o_t) ‖ p(z_t|h_t) ) + γ·‖h_t‖₂   (minimized: the negative ELBO plus a hidden-state penalty)

with β=1e-3 and γ=1e-4 to fight overfitting and gradient explosion (γ here is the weight of the hidden-state penalty, not the MDP discount factor). A small actor-critic then runs on the belief: a PPO-Clip actor (continuous outputs via a Gaussian mixture for multi-modal expressiveness), a critic estimating V(b_t) with GAE(λ=0.95), and the front three layers shared with the belief encoder to cut parameters by ~30%. The whole policy network is two 64-unit fully-connected layers with Tanh activations (~87 k parameters), so it survives the radiation budget of Section 6.
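The critic's GAE(λ=0.95) advantages can be sketched as below. This is a minimal illustration rather than the lesson's implementation: the discount `gamma=0.99` is an assumed value (the text does not state one), and `values` is assumed to hold one more entry than `rewards` so the last entry bootstraps the tail.

```python
from typing import List, Sequence


def gae_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float = 0.99,  # assumed discount, not given in the text
    lam: float = 0.95,    # GAE lambda quoted in the text
) -> List[float]:
    """Generalized Advantage Estimation over one trajectory.

    values[t] is the critic's V(b_t); values[len(rewards)] is the bootstrap value.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have len(rewards) + 1 entries")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error on the belief-state value.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With `lam=0` the estimate reduces to the one-step TD error; with `lam=1` it becomes the full discounted return minus the value baseline.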

Two-stage launch gate for the belief model

Before this ships you run two checks. (1) Offline counterfactual replay: replay historical orbits and require the belief's predicted next-step observation to land within KL < 0.05 nats of the realized one. (2) Online shadow A/B: compare against a full-information oracle policy and accept only if cumulative return drops < 3%. A belief that cannot predict the next observation offline will not control power well online — gate it there, not in orbit.
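The two checks can be written as a single gate. The sketch below assumes the predicted and realized observations are each summarized as a diagonal Gaussian, which the text does not specify, and it measures the return drop relative to the oracle's return. All function and parameter names are hypothetical.

```python
import math
from typing import Sequence


def gaussian_kl(mu_p: Sequence[float], sigma_p: Sequence[float],
                mu_q: Sequence[float], sigma_q: Sequence[float]) -> float:
    """KL( N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2) ) in nats, diagonal case."""
    kl = 0.0
    for mp, sp, mq, sq in zip(mu_p, sigma_p, mu_q, sigma_q):
        kl += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
    return kl


def passes_launch_gate(kl_nats: float, oracle_return: float, belief_return: float,
                       kl_max: float = 0.05, max_drop: float = 0.03) -> bool:
    """Stage 1: offline KL under 0.05 nats. Stage 2: return drop under 3% of the oracle.

    Assumes oracle_return is non-zero so the relative drop is defined.
    """
    relative_drop = (oracle_return - belief_return) / abs(oracle_return)
    return kl_nats < kl_max and relative_drop < max_drop


# Hypothetical replay result: the prediction is off by 0.1 in the mean.
kl = gaussian_kl([0.1], [1.0], [0.0], [1.0])
ok = passes_launch_gate(kl, oracle_return=100.0, belief_return=98.0)
```

Both conditions are conjunctive on purpose: a belief model that passes the online comparison but fails the offline replay is still rejected.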

3 · The amplifier wall — a hard constraint, not a penalty

Intuition. A power amplifier (PA) saturates: push past u_max and the signal clips, spectral regrowth interferes with neighbours, and the transistor ages faster. Unlike a soft cost you can trade away, this is a wall — the agent must learn to ride close to it for throughput yet never cross it. A naive "clip the output afterwards" approach distorts the policy's own gradient and still allows transient overshoot. The honest framing is a constrained MDP (CMDP): maximize throughput subject to the saturation cost staying under budget.

Engineering detail — defence in depth. The source builds the wall in layers, and no single layer is trusted alone:

Reward saturation penalty: r_sat = −10·max(0, |u| − u_max)², so the network learns early to be "reluctant to hug the edge."
Observe the duty cycle: feed the PA duty cycle d_t into the state so the agent senses "I am near the ceiling" and pulls back on its own.
Barrier in the loss: a log-barrier term −β·log(1 − |u|/u_max) is added to the minimized PPO-Clip loss (equivalently, +β·log(1 − |u|/u_max) in the maximized objective), with β adjusted like a Lagrange multiplier so the KKT conditions hold approximately and the cost-return J_c stays pinned below P_max without the reward collapsing.
Slew-rate limiter at deploy: the RL target value is ramped over a 1 ms slope, so no step change can spike the PA into instantaneous saturation.
Robustness training: hardware-in-the-loop ageing tests randomly drop u_max to 70–90% of nominal, so the learned policy carries a built-in safety margin.
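Two of the layers above are simple enough to sketch directly: the quadratic saturation penalty and the deploy-time ramp. The penalty follows the formula in the list. The ramp assumes a fixed control tick (`tick_ms`, a hypothetical value) and spreads a setpoint change linearly over the 1 ms slope.

```python
from typing import List


def saturation_penalty(u: float, u_max: float) -> float:
    """r_sat = -10 * max(0, |u| - u_max)^2; zero inside the wall."""
    excess = max(0.0, abs(u) - u_max)
    return -10.0 * excess ** 2


def ramp_setpoints(current: float, target: float,
                   ramp_ms: float = 1.0, tick_ms: float = 0.25) -> List[float]:
    """Linear ramp from current to target over ramp_ms, one setpoint per tick.

    tick_ms is an assumed control period, not a value from the text.
    """
    n_ticks = max(1, round(ramp_ms / tick_ms))
    return [current + (target - current) * (k / n_ticks) for k in range(1, n_ticks + 1)]


penalty_inside = saturation_penalty(0.5, u_max=1.0)    # no penalty
penalty_outside = saturation_penalty(1.5, u_max=1.0)   # -10 * 0.5^2
setpoints = ramp_setpoints(0.0, 1.0)                   # four equal steps
```

The penalty shapes learning; the ramp is a physical backstop that applies regardless of what the policy outputs.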

J(π) = E[ Σ γᵗ r_t ] + β·log(1 − |u|/u_max) , β ← Lagrangian( J_c − P_max )
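A minimal sketch of the barrier and the multiplier update follows. It treats the barrier as a term in a maximized objective, β·log(1 − |u|/u_max), which is the same thing as adding −β·log(1 − |u|/u_max) to a minimized loss. The multiplier rule shown is plain projected dual ascent with a hypothetical step size. The text only says β is tuned as a Lagrange multiplier, so this particular rule is an assumption.

```python
import math


def barrier_term(u: float, u_max: float, beta: float) -> float:
    """beta * log(1 - |u|/u_max): zero at u = 0, falling to -inf at the wall."""
    ratio = abs(u) / u_max
    if ratio >= 1.0:
        return float("-inf")
    return beta * math.log(1.0 - ratio)


def dual_ascent_step(beta: float, cost_return: float, p_max: float,
                     step_size: float = 0.01) -> float:
    """Raise beta when J_c exceeds P_max, lower it otherwise, never below zero.

    step_size is an assumed hyperparameter.
    """
    return max(0.0, beta + step_size * (cost_return - p_max))


# The barrier bites harder the closer the action sits to u_max.
mid_power = barrier_term(0.5, u_max=1.0, beta=0.1)
near_wall = barrier_term(0.9, u_max=1.0, beta=0.1)

# Constraint violated (J_c > P_max): the multiplier grows.
beta_next = dual_ascent_step(0.1, cost_return=1.2, p_max=1.0)
```

Because the barrier diverges at the wall, the policy's gradient steers away from u_max before the constraint is hit, while the multiplier sets how much throughput is given up for that margin.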

Why post-hoc clipping is the wrong default

Clipping the action after the policy samples it breaks the importance ratio π_new/π_old — the executed action is no longer the sampled one, so PPO's surrogate is computed against a fiction. The constraint must live inside the objective (barrier + Lagrangian) and inside the state (duty cycle), with the slew limiter as a final physical backstop. Pushed into reward and observation rather than bolted on afterward, the trained policy holds the cost-return below P_max and, relative to a post-clip baseline, raises throughput while driving PA-protection trips toward zero.
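The mismatch is easy to see numerically. The sketch below uses a single Gaussian policy with made-up means and standard deviation (the lesson's actor is a Gaussian mixture, so this is a simplification) and compares the PPO ratio evaluated at the sampled action with the ratio evaluated at the action the amplifier actually received after clipping.

```python
import math


def gaussian_logpdf(x: float, mean: float, std: float) -> float:
    """Log-density of a univariate normal distribution."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def ppo_ratio(action: float, new_mean: float, old_mean: float, std: float) -> float:
    """pi_new(action) / pi_old(action) for two Gaussians sharing one std."""
    return math.exp(gaussian_logpdf(action, new_mean, std)
                    - gaussian_logpdf(action, old_mean, std))


u_max = 1.0
sampled = 1.4                                   # what the policy drew
executed = min(max(sampled, -u_max), u_max)     # what post-hoc clipping sent to the PA

# Hypothetical policy parameters before and after one update.
ratio_on_sampled = ppo_ratio(sampled, new_mean=0.9, old_mean=0.8, std=0.3)
ratio_on_executed = ppo_ratio(executed, new_mean=0.9, old_mean=0.8, std=0.3)
```

The reward was earned by `executed`, but the surrogate is weighted by `ratio_on_sampled`; the two ratios differ, so the update credits an action that never reached the amplifier.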

4 · Reward — capacity now versus lifetime later

Intuition. Maximizing throughput alone teaches the agent to blast power, draining the battery hard and aging the cells — a policy that looks great for a month and bricks the satellite years early. The real objective is multi-objective and the costly term is delayed: a deep discharge today shows up as lost cycle-life far in the future. So the reward must price wear explicitly, the way a battery-storage operator prices degradation against energy revenue.

r = w₁·throughput − w₂·energy_cost − w₃·wear_cost , energy_cost = power · λ(t)

Engineering detail. Folding the instantaneous power budget through a time-varying price λ(t) forces the agent to ingest the energy signal at the state-representation stage — objective and state close the loop, instead of the agent discovering the price only at reward time. The wear term is learned, not guessed: a degradation model predicts capacity-loss-rate from the cell's current/temperature/state-of-charge trace, and that predicted loss enters the reward so the policy learns to back off the rate and shrink the depth-of-discharge swing in stressful conditions — the same move that, in a battery-storage analogue, lifts usable cycle life substantially. To handle the delay, effects that surface tens of seconds later are credited with a differenced, off-policy-corrected return (V-trace style) rather than a naive instantaneous reward, so a deep-discharge action is still blamed for the wear it causes after the fact.
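The per-step reward above can be sketched as a plain function. The weights and prices in the example are hypothetical, and `wear_cost` stands in for the output of the learned degradation model, which is not reproduced here.

```python
def step_reward(throughput: float, power: float, price: float, wear_cost: float,
                w1: float = 1.0, w2: float = 0.5, w3: float = 2.0) -> float:
    """r = w1*throughput - w2*energy_cost - w3*wear_cost, energy_cost = power * price.

    price is lambda(t); wear_cost is the degradation model's predicted capacity loss.
    The default weights are illustrative assumptions.
    """
    energy_cost = power * price
    return w1 * throughput - w2 * energy_cost - w3 * wear_cost


# Same action, two hypothetical prices: sunlit (cheap) versus eclipse (expensive).
r_sunlit = step_reward(throughput=10.0, power=2.0, price=1.0, wear_cost=0.5)
r_eclipse = step_reward(throughput=10.0, power=2.0, price=3.0, wear_cost=0.5)
```

Because λ(t) multiplies power inside the reward, the same transmit power is worth less when energy is scarce, which is what pushes the policy to read the energy signal from its state.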

5 · Eclipse non-stationarity — detect the regime, then adapt

Intuition. Every orbit the satellite passes into Earth's shadow. Solar input vanishes, the power budget collapses onto the battery, and the optimal policy flips: the controller that was right in sunlight is wrong in eclipse. This is a recurring, predictable distribution shift — a non-stationary MDP — and the agent must notice the regime change and react without forgetting how to behave in the other regime.

Engineering detail: graded response. Detection watches an ensemble-critic prediction variance: a sudden variance spike means the value model no longer fits the environment, which raises a tiered alarm. The response is graded by severity: level 1 keeps the network weights and merely turns up exploration noise (larger σ) for quick self-recovery; level 2 freezes the trunk and trains only a lightweight 2-layer adapter (<5% of parameters, lr 1e-4, EWC regularization to protect the weights that matter for the other regime); level 3 warm-starts from a MAML-pretrained meta-policy that restores ~90% of performance in a handful of gradient steps. Because eclipse is periodic, a meta-learned initialization adapts to a fresh orbit from only minutes of real data, which is decisive when a constellation is being filled out by rapid batch launches and every new shell is a slightly new task. Any adaptation runs in a shadow or gray-release (canary) environment and goes full-rate only once its cumulative advantage is non-negative and the old-regime regression suite passes 100%.
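The detection-and-response ladder can be sketched as a small dispatcher. The variance thresholds are hypothetical placeholders (the text gives none), and the response strings only name the action at each level rather than implementing it.

```python
from statistics import pvariance
from typing import Sequence, Tuple

RESPONSES = {
    0: "no change",
    1: "keep weights, raise exploration sigma",
    2: "freeze trunk, train 2-layer adapter with EWC",
    3: "warm-start from MAML meta-policy",
}


def ensemble_variance(critic_values: Sequence[float]) -> float:
    """Population variance of the ensemble critics' value predictions for one state."""
    return pvariance(critic_values)


def alarm_level(variance: float,
                thresholds: Tuple[float, float, float] = (0.5, 2.0, 8.0)) -> int:
    """Map a variance reading to level 0..3; thresholds are assumed values."""
    level = 0
    for index, threshold in enumerate(thresholds, start=1):
        if variance >= threshold:
            level = index
    return level


def may_promote(cumulative_advantage: float, regression_pass_rate: float) -> bool:
    """Full-rate rollout needs non-negative advantage and a 100% regression pass."""
    return cumulative_advantage >= 0.0 and regression_pass_rate == 1.0


# Hypothetical reading at eclipse entry: the critics disagree sharply.
level = alarm_level(ensemble_variance([1.0, 5.0, 9.0, 1.0]))
action = RESPONSES[level]
```

In practice the thresholds would be calibrated on variance statistics from known-good sunlit and eclipse passes, so that a routine eclipse entry maps to a low level and only an unmodelled shift reaches level 3.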

6 · The radiation-hardened millisecond — inference is part of the MDP

Intuition. The policy does not run in a datacenter; it runs on a radiation-tolerant FPGA aboard the spacecraft, on a link-layer time-slot budget. A model that is slightly more accurate but misses the slot, or that cannot survive a single-event upset (SEU) from a cosmic ray, is worse than useless. Latency, memory, and fault-tolerance are hard constraints that shaped the 87 k-parameter network of Section 2 — not afterthoughts.

Engineering detail. The compact two-layer Tanh policy runs in ~5 ms per iteration on the radiation-tolerant FPGA (e.g. a Zynq-class part), meeting the CCSDS 141.0-B-1 link-layer slot timing. At inference only the policy network is kept resident, holding RAM under 256 kB so the satellite can survive an SEU and restart under triple-modular redundancy (TMR). Sample efficiency comes from hybrid experience replay: 80% of transitions from a digital-twin orbit simulator (carrying a reference ionosphere model) and 20% from in-orbit telemetry. The mismatch between simulator and reality is handled by importance-sampling weights: the same ratio clipping that PPO already uses limits the influence of samples whose distribution has drifted under the high-dynamics channel, so a single satellite converges in ~1,200 orbit·steps, well under a typical ~3,000 orbit·step budget. Reported on-orbit, this kind of design cuts channel-prediction MSE by ~42% while keeping on-board CPU utilization under 8%.
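Two pieces of this budget reduce to small calculations: whether ~87 k weights fit in 256 kB at a given precision, and how a TMR voter masks a single flipped bit. The sketch reads "256 kB" as 256,000 bytes and counts weights only (no activations or code), both of which are assumptions. The voter is the standard bitwise two-of-three majority, not a description of any particular flight implementation.

```python
RAM_BUDGET_BYTES = 256 * 1000  # "256 kB" read as decimal kilobytes (assumption)
N_PARAMS = 87_000              # approximate parameter count from the text


def weight_bytes(n_params: int, bits_per_weight: int) -> int:
    """Storage for the weights alone at the given precision."""
    return n_params * bits_per_weight // 8


def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant copies of a word."""
    return (a & b) | (a & c) | (b & c)


fits = {bits: weight_bytes(N_PARAMS, bits) < RAM_BUDGET_BYTES for bits in (32, 16, 8)}
# 32-bit weights need 348,000 bytes and do not fit; 16-bit and 8-bit do.

# One copy suffers a single-event upset in its top bit; the vote restores it.
clean = 0b1011
upset = 0b0011
recovered = tmr_vote(clean, clean, upset)
```

Under these assumptions the weights fit only below 32-bit precision, which is one reason a reduced-precision policy is attractive on this class of hardware.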

The through-line

Every section is one row of the MDP table turned into a mechanism: unseen channel → belief GRU + information bottleneck; amplifier wall → CMDP barrier + duty-cycle observation + slew limiter; capacity-vs-lifetime reward → priced energy + learned wear term + delayed-credit V-trace; eclipse non-stationarity → ensemble-variance detection + graded MAML adaptation; the radiation-hardened millisecond → an 87 k-parameter quantized policy with TMR restart and 80/20 sim-real replay. You never reached for a tool until a row of the table demanded it. That discipline is the whole track.

Further considerations

Multi-satellite cooperation. Model neighbours in one orbital plane as multi-agent PPO (MAPPO) — shared channel observations, independent policies — exchanging hidden-layer features over a low-latency (~2 ms) inter-satellite laser backbone, which can push prediction error toward ~65% of the single-satellite figure. The cost is encrypting the inter-satellite data link (e.g. an SM4-GCM hardware core, <0.2 W extra per link) to satisfy spectrum-coordination rules.
Fuse a physics prior. Add a physical-residual reward term R_phy = −η·|ĥ_t − h_parabolic|, where ĥ_t is the learned channel prediction and h_parabolic is the prediction of the classical parabolic channel model, so the policy degrades gracefully to the classical model during quiet ionosphere, which buys interpretability for review and cuts overfitting risk.
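On-orbit online update under a tiny uplink. With only a short-message uplink (~1,750 bit per message), compress gradients to 1-bit signSGD; ~3 messages per day suffice to hot-update the whole constellation's policy, enabling day-scale algorithm evolution instead of waiting for a half-yearly ground-station contact window. Note the scale: three messages carry about 5,250 sign bits, roughly the size of the <5% adapter of Section 5 rather than the full ~87 k-parameter network, which would need about 50 messages.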
Constellation-scale action explosion. Scaling from a few hundred to thousands of satellites blows up the joint action space; move to MARL with centralized-training / distributed-execution (CTDE) and a graph-attention network over the inter-satellite topology, while explicitly handling the non-stationarity and communication delay it introduces.
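For the on-orbit update item above, the 1-bit signSGD path can be sketched end to end: take the sign of each gradient entry, split the bits into 1,750-bit messages, and apply a fixed-size step on board. The learning rate and the convention that a zero gradient maps to a positive sign are assumptions of this sketch. The message counts use the ~87 k parameter figure and the <5% adapter size quoted in the lesson.

```python
import math
from typing import List, Sequence

BITS_PER_MESSAGE = 1750  # short-message payload quoted in the text


def sign_bits(gradient: Sequence[float]) -> List[int]:
    """1 for a non-negative gradient entry, 0 for a negative one."""
    return [1 if g >= 0.0 else 0 for g in gradient]


def pack_messages(bits: Sequence[int], size: int = BITS_PER_MESSAGE) -> List[List[int]]:
    """Split the bit stream into uplink messages of at most `size` bits."""
    return [list(bits[i:i + size]) for i in range(0, len(bits), size)]


def apply_sign_update(weights: Sequence[float], bits: Sequence[int], lr: float) -> List[float]:
    """signSGD step: move each weight by lr against the sign of its gradient."""
    return [w - lr * (1.0 if b else -1.0) for w, b in zip(weights, bits)]


def messages_needed(n_params: int, size: int = BITS_PER_MESSAGE) -> int:
    """Number of messages for one sign bit per parameter."""
    return math.ceil(n_params / size)


full_network = messages_needed(87_000)   # every parameter
adapter_only = messages_needed(4_350)    # 5% of 87,000 parameters
updated = apply_sign_update([0.0, 0.0], sign_bits([0.3, -0.2]), lr=0.1)
```

The count makes the trade-off concrete: an adapter-sized update fits in three messages, while a full-network update needs about fifty.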