Satellite-constellation power control

A low-Earth-orbit (LEO) broadband satellite must set its downlink transmit power every few milliseconds — enough to close a link through a channel that swings violently as geometry and the ionosphere change, but never enough to saturate the power amplifier, blow the harvested-energy budget, or burn battery cycle-life. The binding difficulties are sharp: the channel is only partially observable and non-stationary, the amplifier imposes a hard safety constraint, the true reward trades throughput against hardware lifetime, the orbit injects a periodic eclipse shock, and inference runs on a radiation-hardened FPGA with a millisecond deadline. Each one names a tool.

The method — five steps, every lesson
Applied RL is the same loop in every domain: (1) Formulate the MDP — state, action, reward, transition, horizon. (2) Diagnose the property that makes this MDP hard. (3) Engineer the mechanism that removes that difficulty. (4) Guard it in production — detect when it breaks and fall back. (5) Iterate. The art is only in which difficulty binds first. Here it binds on a satellite that cannot be touched again once it launches.

1 · Formulate — the MDP behind on-orbit power control

Intuition. A satellite radio measures a channel, picks a transmit power and a predicted channel coefficient, sends a packet, and learns later whether the link held and at what energy cost. That is already a Markov Decision Process: the relative geometry and link state are the state, the power / prediction-coefficient update is the action, the negative mean-squared prediction error under a power constraint is the reward, and orbital mechanics plus the radio hardware are the transition. Every section below is one of these four pieces being awkward in space.

MDP = (S, A, R, P, γ)  ·  maximize  E[ Σ_t γᵗ r_t ]  s.t.  E[ Σ_t γᵗ c_power(s_t, a_t) ] ≤ P_max
| Piece | For a LEO downlink power controller | The awkward part |
| --- | --- | --- |
| State S | relative velocity, inter-satellite range, geomagnetic Kp index, queue length, plus the latent true channel h | you never measure h directly, only delayed noisy proxies → partially observable |
| Action A | transmit power u, and the step that updates the channel-prediction coefficient | bounded by amplifier saturation u_max: a hard constraint, not a soft cost |
| Reward R | negative prediction MSE; throughput minus energy cost minus battery wear | capacity now versus cycle-life later: multi-objective, delayed |
| Transition P | orbital geometry + ionosphere + solar power harvest | periodic eclipse flips the power budget → non-stationary |

Same pattern as every other domain: the MDP table writes itself, and the rightmost column — the awkward part — is the lesson. We reach for memory, a safety layer, a constrained reward, and meta-adaptation only because a row of the table demands it.
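To make the constrained objective concrete, here is a minimal sketch of how one logged episode could be scored against it. The names `Observation`, `Step`, `discounted_sum`, and `evaluate_episode` are illustrative assumptions, not part of any flight software; the only things taken from the lesson are the observed fields, the discounted sums, and the budget P_max.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Observation:
    """What the controller sees each slot. The true channel h is deliberately absent."""
    relative_velocity_mps: float
    range_km: float
    kp_index: float
    queue_length: int


@dataclass(frozen=True)
class Step:
    """One logged transition: reward r_t and power cost c_power(s_t, a_t)."""
    reward: float
    power_cost: float


def discounted_sum(values: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma^t * values[t], accumulated from the last step backwards."""
    total = 0.0
    for v in reversed(values):
        total = v + gamma * total
    return total


def evaluate_episode(steps: List[Step], gamma: float, p_max: float) -> Tuple[float, float, bool]:
    """Return (discounted return, discounted power cost, whether cost <= P_max)."""
    ret = discounted_sum([s.reward for s in steps], gamma)
    cost = discounted_sum([s.power_cost for s in steps], gamma)
    return ret, cost, cost <= p_max
```

A policy that wins on return but fails the third element of the tuple is infeasible under the CMDP, however good its throughput looks.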

2 · Partial observability — predict the channel you cannot see

Intuition. The instantaneous channel coefficient h that decides how much power closes the link is never measured cleanly: feedback is delayed, quantized, and corrupted by Doppler from a satellite crossing the sky at ~7.5 km/s. So the agent must act on a belief about h, not on h itself. That makes this a POMDP: inter-satellite channel prediction is cast as a partially observable MDP whose observed state is the four-tuple (relative velocity, inter-satellite range, Kp index, queue length), whose hidden state also contains h, and whose action is the update step of the prediction coefficients.

b_t = f_θ(o_{≤t}, a_{<t})  |  ĥ_t = g(b_t)  |  r_t = −‖ĥ_t − h_t‖²  (with power penalty)
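The recursion in the line above can be sketched in a few lines. This is a toy stand-in under stated assumptions: a plain tanh recurrent cell plays the role of f_θ and a linear readout plays the role of g; the weight matrices `w_b`, `w_o`, `w_a` and the `readout` vector are hypothetical parameters, and the lesson's actual encoder is richer than this.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Sequence[float]) -> Vector:
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def belief_step(b_prev: Vector, obs: Vector, prev_action: Vector,
                w_b: Matrix, w_o: Matrix, w_a: Matrix) -> Vector:
    """One recurrent update: b_t = tanh(W_b b_{t-1} + W_o o_t + W_a a_{t-1})."""
    terms = zip(matvec(w_b, b_prev), matvec(w_o, obs), matvec(w_a, prev_action))
    return [math.tanh(x + y + z) for x, y, z in terms]


def roll_belief(observations: Sequence[Vector], prev_actions: Sequence[Vector],
                b0: Vector, w_b: Matrix, w_o: Matrix, w_a: Matrix) -> Vector:
    """Fold the history o_{<=t}, a_{<t} into b_t one step at a time."""
    belief = list(b0)
    for obs, prev_action in zip(observations, prev_actions):
        belief = belief_step(belief, obs, prev_action, w_b, w_o, w_a)
    return belief


def predict_channel(belief: Vector, readout: Vector) -> float:
    """h_hat_t = g(b_t), here a linear readout of the belief."""
    return sum(w * b for w, b in zip(readout, belief))


def prediction_reward(h_hat: float, h_true: float) -> float:
    """r_t = -(h_hat_t - h_t)^2, before any power penalty is applied."""
    return -(h_hat - h_true) ** 2
```

The point of the sketch is the data flow: the policy never touches h, only the belief that summarizes everything observed so far, and h enters solely through the training reward.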

Engineering detail. The belief encoder splits into a deterministic and a stochastic path: a LayerNorm-GRU carries the deterministic hidden state h_t (the recurrent state, not the channel coefficient h) for long-horizon stability, a 3-layer MLP emits μ_t, σ_t for a reparameterized stochastic latent z_t, and two cross-attention layers fuse them, attending selectively to the raw observation while suppressing noise. It is trained as a sequence model with an ELBO plus an information-bottleneck term:

L = E[ log p(o_t, r_t | h_t, z_t, a_t) ] − β·KL( q(z_t | h_t, o_t) ‖ p(z_t | h_t) ) − γ·‖h_t‖₂   (maximized; the norm term is a penalty)

with β = 1e-3 and γ = 1e-4 (γ here is the norm-penalty weight, not the discount factor) to fight overfitting and gradient explosion. A small actor-critic then runs on the belief: a PPO-Clip actor (continuous outputs via a Gaussian mixture for multi-modal expressiveness), a critic estimating V(b_t) with GAE(λ=0.95), and the front three layers shared with the belief encoder to cut parameters by ~30%. The whole policy network is two 64-unit fully-connected layers with Tanh activations, ~87k parameters, so it survives the radiation budget of Section 6.
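For reference, the GAE(λ=0.95) estimate the critic feeds to PPO can be sketched as below. The discount `gamma=0.99` is an assumed default (the lesson fixes only λ), and `gae_advantages` is an illustrative name.

```python
from typing import List, Sequence


def gae_advantages(rewards: Sequence[float], values: Sequence[float],
                   gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Generalized Advantage Estimation over one trajectory.

    values must have len(rewards) + 1 entries; the last one is the
    bootstrap value V(b_T) of the belief after the final step.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values needs one more entry than rewards (bootstrap value)")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With `lam=0` this collapses to the one-step TD error; with `lam=1` it becomes the full Monte Carlo return minus the value baseline. A λ of 0.95 sits close to the low-bias end while still trimming variance.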

Two-stage launch gate for the belief model
Before this ships you run two checks. (1) Offline counterfactual replay: replay historical orbits and require the belief's predicted next-step observation to land within KL < 0.05 nats of the realized one. (2) Online shadow A/B: compare against a full-information oracle policy and accept only if cumulative return drops by less than 3%. A belief that cannot predict the next observation offline will not control power well online, so gate it there, not in orbit.
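The two checks can be sketched as a single gate function. This is a sketch under assumptions: the predicted and realized next-step observations are both summarized as diagonal Gaussians, the KL direction (predicted relative to realized) is a choice made here, and the return drop is measured relative to the oracle's return. The function names are hypothetical.

```python
import math
from typing import Sequence


def gaussian_kl_nats(mu_p: Sequence[float], sigma_p: Sequence[float],
                     mu_q: Sequence[float], sigma_q: Sequence[float]) -> float:
    """KL( N(mu_p, diag sigma_p^2) || N(mu_q, diag sigma_q^2) ) in nats."""
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, sigma_p, mu_q, sigma_q):
        total += math.log(sq / sp) + (sp * sp + (mp - mq) ** 2) / (2.0 * sq * sq) - 0.5
    return total


def passes_launch_gate(kl_nats: float, oracle_return: float, policy_return: float,
                       kl_limit: float = 0.05, max_return_drop: float = 0.03) -> bool:
    """Stage 1: offline KL below the limit. Stage 2: relative return drop below 3%."""
    if oracle_return == 0.0:
        raise ValueError("relative drop is undefined for a zero oracle return")
    offline_ok = kl_nats < kl_limit
    # abs() keeps the ratio meaningful when returns are negative (e.g. -MSE rewards).
    drop = (oracle_return - policy_return) / abs(oracle_return)
    online_ok = drop < max_return_drop
    return offline_ok and online_ok
```

Both stages must pass: a good shadow A/B result does not excuse a belief model that fails the offline replay.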

3 · The amplifier wall — a hard constraint, not a penalty

Intuition. A power amplifier (PA) saturates: push past umax and the signal clips, spectral regrowth interferes with neighbours, and the transistor ages faster. Unlike a soft cost you can trade away, this is a wall — the agent must learn to ride close to it for throughput yet never cross it. A naive "clip the output afterwards" approach distorts the policy's own gradient and still allows transient overshoot. The honest framing is a constrained MDP (CMDP): maximize throughput subject to the saturation cost staying under budget.

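Engineering detail: defence in depth. The wall is built in layers, and no single layer is trusted alone: a log-barrier with a Lagrangian-tuned coefficient inside the objective, the PA duty cycle exposed in the state, and a slew limiter as the final physical backstop. The objective-level layer reads: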

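J(π) = E[ Σ_t γᵗ ( r_t + β·log(1 − |u_t|/u_max) ) ]  ,  β ← Lagrangian( J_c − P_max )
(The barrier term is ≤ 0 and tends to −∞ as |u_t| → u_max, so under maximization it penalizes approaching the wall; β rises while the cost return J_c exceeds P_max.)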
Why post-hoc clipping is the wrong default
Clipping the action after the policy samples it breaks the importance ratio π_new/π_old: the executed action is no longer the sampled one, so PPO's surrogate is computed against a fiction. The constraint must live inside the objective (barrier + Lagrangian) and inside the state (duty cycle), with the slew limiter as a final physical backstop. Pushed into reward and observation rather than bolted on afterward, the trained policy holds the cost-return below P_max and, relative to a post-clip baseline, raises throughput while driving PA-protection trips toward zero.
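A minimal sketch of the barrier and its Lagrangian coefficient follows. Two assumptions are made explicit: the barrier is written as β·log(1 − |u|/u_max), which is non-positive, so that adding it to a maximized objective penalizes approaching the wall; and the coefficient is adjusted by a simple projected dual-ascent step with a hypothetical `step_size` and floor `beta_min`.

```python
import math


def barrier_term(u: float, u_max: float, beta: float) -> float:
    """beta * log(1 - |u|/u_max): zero at u = 0, tends to -inf as |u| -> u_max."""
    ratio = abs(u) / u_max
    if ratio >= 1.0:
        return -math.inf  # at or beyond saturation: infinitely bad
    return beta * math.log(1.0 - ratio)


def shaped_reward(reward: float, u: float, u_max: float, beta: float) -> float:
    """Task reward plus the (non-positive) barrier, so the wall lives in the objective."""
    return reward + barrier_term(u, u_max, beta)


def dual_update(beta: float, cost_return: float, p_max: float,
                step_size: float, beta_min: float = 1e-6) -> float:
    """Raise beta while the cost return J_c exceeds P_max, lower it otherwise.

    The floor keeps beta positive so the barrier never switches off entirely.
    """
    return max(beta_min, beta + step_size * (cost_return - p_max))
```

Because the penalty is part of the reward the policy is trained on, the sampled action and the executed action stay identical, which is what keeps the PPO importance ratio honest.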

4 · Reward — capacity now versus lifetime later

Intuition. Maximizing throughput alone teaches the agent to blast power, draining the battery hard and aging the cells — a policy that looks great for a month and bricks the satellite years early. The real objective is multi-objective and the costly term is delayed: a deep discharge today shows up as lost cycle-life far in the future. So the reward must price wear explicitly, the way a battery-storage operator prices degradation against energy revenue.

r = w₁·throughput − w₂·energy_cost − w₃·wear_cost  ,  energy_cost = power · λ(t)

Engineering detail. Folding the instantaneous power budget through a time-varying price λ(t) forces the agent to ingest the energy signal at the state-representation stage — objective and state close the loop, instead of the agent discovering the price only at reward time. The wear term is learned, not guessed: a degradation model predicts capacity-loss-rate from the cell's current/temperature/state-of-charge trace, and that predicted loss enters the reward so the policy learns to back off the rate and shrink the depth-of-discharge swing in stressful conditions — the same move that, in a battery-storage analogue, lifts usable cycle life substantially. To handle the delay, effects that surface tens of seconds later are credited with a differenced, off-policy-corrected return (V-trace style) rather than a naive instantaneous reward, so a deep-discharge action is still blamed for the wear it causes after the fact.
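The reward line above translates almost directly into code. In this sketch the weights, the units, and the function names are illustrative assumptions; `predicted_capacity_loss` stands for whatever the learned degradation model outputs, and `price` is the value of λ(t) for the current slot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    throughput: float  # w1
    energy: float      # w2
    wear: float        # w3


def energy_cost(power_w: float, price: float) -> float:
    """energy_cost = power * lambda(t), with lambda(t) passed in as `price`."""
    return power_w * price


def step_reward(throughput: float, power_w: float, price: float,
                predicted_capacity_loss: float, w: RewardWeights) -> float:
    """r = w1*throughput - w2*energy_cost - w3*wear_cost for a single slot."""
    return (w.throughput * throughput
            - w.energy * energy_cost(power_w, price)
            - w.wear * predicted_capacity_loss)
```

Raising `price` in eclipse makes the same transmit power strictly less rewarding, which is how a time-varying λ(t) steers the policy without changing the reward's form. The delayed-credit correction is a separate step and is not shown here.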

Figure (interactive): Power policy → throughput vs. battery cycle-life. Plotted series: throughput, cycle-life, mission value, PA guard.

Pushing transmit power up wins instantaneous link margin but discharges deeper and runs the PA hotter, which burns battery cycles. The figure's controls vary the aggressiveness and the wear weight to show where total mission value peaks and where the amplifier-saturation guard fires.

5 · Eclipse non-stationarity — detect the regime, then adapt

Intuition. Every orbit the satellite passes into Earth's shadow. Solar input vanishes, the power budget collapses onto the battery, and the optimal policy flips: the controller that was right in sunlight is wrong in eclipse. This is a recurring, predictable distribution shift — a non-stationary MDP — and the agent must notice the regime change and react without forgetting how to behave in the other regime.

Engineering detail — graded response. Detection watches an ensemble-critic prediction variance: a sudden variance spike means the value model no longer fits the environment, which raises a tiered alarm. The response is graded by severity: level 1 keeps the network weights and merely turns up exploration noise (larger σ) for quick self-recovery; level 2 freezes the trunk and trains only a lightweight 2-layer adapter (<5% of parameters, lr 1e-4, EWC regularization to protect the weights that matter for the other regime); level 3 warm-starts from a MAML-pretrained meta-policy that restores ~90% of performance in a handful of gradient steps. Because eclipse is periodic, a meta-learned initialization adapts to a fresh orbit from only minutes of real data — decisive when a constellation is being filled out by rapid batch launches and every new shell is a slightly new task. Any adaptation runs in a shadow/gray environment until its cumulative advantage is non-negative and the old-regime regression suite passes 100% before it goes full-rate.
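The tiered alarm and the promotion gate can be sketched as follows. The variance-ratio thresholds `(2.0, 5.0, 10.0)` are hypothetical placeholders, since the lesson gives no numeric trigger levels; the baseline is assumed to be the ensemble variance measured in the nominal regime.

```python
from enum import IntEnum
from statistics import pvariance
from typing import Sequence, Tuple


class AlarmLevel(IntEnum):
    NONE = 0
    RAISE_EXPLORATION = 1   # keep weights, larger sigma
    TRAIN_ADAPTER = 2       # freeze trunk, train the small adapter
    META_WARM_START = 3     # restart from the meta-learned initialization


def ensemble_variance(critic_values: Sequence[float]) -> float:
    """Disagreement among ensemble critics for the same belief."""
    return pvariance(critic_values)


def alarm_level(variance: float, baseline: float,
                thresholds: Tuple[float, float, float] = (2.0, 5.0, 10.0)) -> AlarmLevel:
    """Map the variance-to-baseline ratio onto the highest tier it reaches."""
    if baseline <= 0.0:
        raise ValueError("baseline variance must be positive")
    ratio = variance / baseline
    level = AlarmLevel.NONE
    for tier, threshold in zip((1, 2, 3), thresholds):
        if ratio >= threshold:
            level = AlarmLevel(tier)
    return level


def may_promote(cumulative_advantage: float, regression_pass_rate: float) -> bool:
    """Shadow-to-full-rate gate: non-negative advantage and a 100% regression pass."""
    return cumulative_advantage >= 0.0 and regression_pass_rate == 1.0
```

Keeping the tier decision separate from the adaptation code means the cheap response (more exploration noise) is always tried before anything that touches weights.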

Figure: the five-step loop. 1 · Formulate: S, A, R, P (CMDP + POMDP). 2 · Diagnose: unseen channel, PA wall, eclipse. 3 · Engineer: belief GRU, barrier, wear reward, MAML. 4 · Guard: variance alarm, TMR, arbiter. 5 · Iterate: re-diagnose. Removing one difficulty exposes the next, so re-run the loop.

6 · The radiation-hardened millisecond — inference is part of the MDP

Intuition. The policy does not run in a datacenter; it runs on a radiation-tolerant FPGA aboard the spacecraft, on a link-layer time-slot budget. A model that is slightly more accurate but misses the slot, or that cannot survive a single-event upset (SEU) from a cosmic ray, is worse than useless. Latency, memory, and fault-tolerance are hard constraints that shaped the 87 k-parameter network of Section 2 — not afterthoughts.
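One common way to survive a single-event upset is triple-modular redundancy: keep three copies of each stored word and take a bitwise majority. The sketch below shows only the voting logic in software, as an illustration of the idea; on real hardware the voter lives in the FPGA fabric, and the function names here are hypothetical.

```python
from typing import List, Sequence


def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three copies: each output bit matches at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def scrub(copy_a: Sequence[int], copy_b: Sequence[int], copy_c: Sequence[int]) -> List[int]:
    """Vote word by word across three replicas and return the repaired image."""
    if not (len(copy_a) == len(copy_b) == len(copy_c)):
        raise ValueError("replicas must have the same length")
    return [tmr_vote(a, b, c) for a, b, c in zip(copy_a, copy_b, copy_c)]
```

A single flipped bit in any one replica is outvoted by the other two. The scheme fails only when the same bit position is corrupted in two replicas before the next scrub, which is why a small resident footprint that can be scrubbed often matters.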

Engineering detail. The compact two-layer Tanh policy runs in ~5 ms per iteration on the radiation-tolerant FPGA (e.g. a Zynq-class part), meeting the CCSDS 141.0-B-1 link-layer slot timing. At inference only the policy network is kept resident, holding RAM under 256 kB so the satellite can survive an SEU and restart under triple-modular redundancy (TMR). Sample efficiency comes from hybrid experience replay: 80% of transitions from a digital-twin orbit simulator (carrying a reference ionosphere model) and 20% from in-orbit telemetry. The mismatch between simulator and reality is handled by importance-sampling weights: the same ratio clipping that PPO uses also damps the distribution drift that the high-dynamics channel induces, so a single satellite converges in ~1,200 orbit·steps, well under a typical ~3,000 orbit·step budget. The reported on-orbit result for this kind of design is a ~42% cut in channel-prediction MSE with on-board CPU utilization kept under 8%.
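The 80/20 replay mix and the ratio clipping it leans on can be sketched as below. Sampling with replacement, the buffer representation, and the clip range `eps=0.2` are assumptions made for illustration; the lesson specifies only the 80/20 split and that PPO's clipping is reused.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_hybrid_batch(sim_buffer: Sequence[T], orbit_buffer: Sequence[T],
                        batch_size: int, sim_fraction: float = 0.8,
                        rng: Optional[random.Random] = None) -> List[T]:
    """Draw a batch with ~sim_fraction simulator transitions and the rest telemetry."""
    if not sim_buffer or not orbit_buffer:
        raise ValueError("both buffers must be non-empty")
    rng = rng or random.Random()
    n_sim = round(batch_size * sim_fraction)
    n_orbit = batch_size - n_sim
    batch = [rng.choice(sim_buffer) for _ in range(n_sim)]
    batch += [rng.choice(orbit_buffer) for _ in range(n_orbit)]
    rng.shuffle(batch)  # avoid ordering the batch by source
    return batch


def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO-Clip term: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)
```

The clipping bounds how far any single transition, simulated or real, can pull the update when its importance ratio drifts away from 1. That is the sense in which it damps simulator mismatch; it does not remove the mismatch.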

The through-line
Every section is one row of the MDP table turned into a mechanism: unseen channel → belief GRU + information bottleneck; amplifier wall → CMDP barrier + duty-cycle observation + slew limiter; capacity-vs-lifetime reward → priced energy + learned wear term + delayed-credit V-trace; eclipse non-stationarity → ensemble-variance detection + graded MAML adaptation; the radiation-hardened millisecond → an 87 k-parameter quantized policy with TMR restart and 80/20 sim-real replay. You never reached for a tool until a row of the table demanded it. That discipline is the whole track.

Further considerations