
Wireless-network power allocation

A base station has a power budget and a fistful of users on shared spectrum. Every watt you give one user is interference to the others, and the channel that decides the payoff changes faster than you can measure it. Cast as RL, the binding difficulties are sharp and physical: the state (channel-state information) is enormous and stale, the action (transmit power) is continuous and hard-constrained by a power amplifier that distorts when saturated, the reward is a non-convex sum-rate-versus-energy trade-off, the transition is partially observable because coherence time is shorter than a decision slot, and the whole thing must run inside a 3 ms TTI across many base stations at once. Each one names a tool.

The method — five steps, every lesson
Applied RL is not a grab-bag of tricks. Every domain in this track is the same loop: (1) Formulate the MDP — state, action, reward, transition, horizon. (2) Diagnose the one or two properties that make this MDP hard. (3) Engineer the mechanism that removes that difficulty. (4) Guard it in production — detect when it breaks and fall back. (5) Iterate. The art is only in which difficulty binds first. For a radio it is the size and freshness of the channel state.

1 · Formulate — the MDP behind a power-control loop

Intuition. The base station measures the channel, decides how much power to put on each user's beam, transmits, and is told later (by ACK/NACK and throughput) how that went. That is a Markov Decision Process: the measured channel is the state, the per-user power vector is the action, the achieved rate minus the energy spent is the reward, and the radio environment — fading, mobility, the other users — is the transition. Every difficulty below is one of these four pieces being physically awkward.

MDP = (S, A, R, P, γ)  ·  maximize  E[ Σₜ γᵗ ( Σₖ log(1 + SINRₖ) − λ·Pₜ ) ]
| Piece | For a Massive-MIMO power controller | The awkward part |
| --- | --- | --- |
| State S | complex channel matrix (CSI) H ∈ ℂ^(64×8), plus interference temperature, user speed, battery state-of-charge | huge and complex-valued (2·512 real numbers per slot) and already stale by the time you act |
| Action A | continuous transmit power per user, Pₖ > 0, with Σₖ Pₖ ≤ Pmax | continuous, strictly positive, sum-constrained, and capped by a power amplifier that distorts when driven near saturation |
| Reward R | weighted sum-rate minus an energy penalty λ·P | non-convex in power (interference coupling) and the energy weight λ must track the battery |
| Transition P | fading channel + user mobility + the other cells' power decisions | coherence time < decision slot → partially observable and non-stationary |
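
The per-slot reward in the objective above can be sketched as follows. This is a minimal illustration, not the lesson's implementation: the K×K matrix `gains` (where `gains[k][j]` is the power user k receives from the beam meant for user j), the scalar `noise`, and all numeric values are assumptions introduced here, and the rate is in nats because the natural log is used.

```python
import math

def sinr(powers, gains, noise):
    """SINR per user: own signal over noise plus leakage from other beams."""
    out = []
    for k, p_k in enumerate(powers):
        signal = gains[k][k] * p_k
        interference = sum(gains[k][j] * p_j
                           for j, p_j in enumerate(powers) if j != k)
        out.append(signal / (noise + interference))
    return out

def slot_reward(powers, gains, noise, lam):
    """Sum-rate (in nats) minus the energy penalty lam * total power."""
    sum_rate = sum(math.log(1.0 + s) for s in sinr(powers, gains, noise))
    return sum_rate - lam * sum(powers)

# Two users with symmetric cross-gain 0.1 (illustrative numbers only).
gains = [[1.0, 0.1],
         [0.1, 1.0]]
r = slot_reward([1.0, 1.0], gains, noise=0.1, lam=0.2)
```

Raising one user's power raises its own SINR but lowers the other's, which is the interference coupling that makes the reward non-convex in power.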

Notice the pattern we reuse for all 20 domains: the MDP table writes itself, and the rightmost column — the awkward part — is the lesson. Each row's difficulty has a named mechanism that answers it.

2 · Diagnose — what actually makes this MDP hard

Intuition. Three properties bind, and they all trace back to the radio being a physical, fast, shared medium. (a) The state is too big and too stale. Full CSI for a 64×8 array is hundreds of complex numbers refreshed every slot; feed it raw and the policy is slow to train and slower to run, and by the time it decides, the channel has already moved. (b) The action lives on a hard, curved feasible set. Power is positive, sums to a budget, and the amplifier clips near its ceiling — a naive Gaussian policy spends most of its probability mass outside the legal region. (c) The environment is partially observable and non-stationary. When coherence time drops below the decision period the agent is acting on a channel that no longer exists, and the channel statistics themselves drift as the user moves through different scattering.

Engineering detail. These are not three separate lessons — they compound. A stale 512-dimensional state makes the value function noisy; a noisy value function makes a constrained continuous action hard to optimize; a non-stationary channel invalidates whatever the value function learned. The next three sections remove them in the order they bind: compress the state, then constrain the action, then handle the missing time.

3 · Engineer the state — Lipschitz-normalized CSI compression

Intuition. The policy does not need every complex coefficient of the channel; it needs the few directions that decide how to split power. So learn a compressor that keeps the rate-relevant structure and throws away the rest, and normalize it so that two slightly different channels never produce wildly different states — otherwise the policy's input is jittery noise.

Engineering detail. Flatten the complex channel z into its real and imaginary parts to get a 2d real vector, then Lipschitz-normalize it by dividing by the largest singular value so the encoder cannot amplify small channel perturbations into large state swings. On a 64×8 Massive-MIMO link, a latent of d = 32 holds capacity loss below 0.5% while cutting the input dimension by an order of magnitude — which speeds PPO training 4.2× and lets the whole forward pass fit inside the 3 ms slot (TTI).

s = z / σmax(z)  ∈ ℝ^(2d),   d = 32  ·  capacity loss < 0.5%,   train speed-up ≈ 4.2×
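
The normalization step can be sketched in a few lines. This assumes z arrives as a plain list of complex numbers (the learned encoder that produces it is not shown), and it uses the fact that the largest singular value of a single vector is its Euclidean norm. The function name and the `eps` guard are hypothetical.

```python
import math

def lipschitz_normalize(z, eps=1e-12):
    """Stack Re/Im parts of a complex vector and scale to unit 2-norm."""
    flat = [c.real for c in z] + [c.imag for c in z]   # length 2d
    sigma_max = math.sqrt(sum(x * x for x in flat))    # 2-norm of the vector
    return [x / max(sigma_max, eps) for x in flat]     # eps guards a zero input

z = [complex(3, 4), complex(0, 0)]   # toy latent with d = 2
s = lipschitz_normalize(z)           # -> [0.6, 0.0, 0.8, 0.0]
```

Because the output always has unit norm, the overall channel gain is discarded; a design along these lines would need to pass that scalar separately if the policy depends on it.
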
Interference temperature is a hidden state — and its error caps your optimality
The SINR each user sees depends on the interference temperature — the aggregate interference floor — which you do not observe directly but estimate online. That estimate feeds the entropy-temperature parameter α in a SAC-style controller. If the estimate lags, the policy optimizes against a wrong floor and return can drop by 12% or more. The fix is a two-timescale update: refresh α on a slower clock (every 3rd policy update) and accumulate it in a FP32 accumulator so quantization noise does not bias it. That keeps Δα under 1% and holds the policy-optimality loss below 1%.

4 · Engineer the action — keep power positive, feasible, and below saturation

Intuition. Transmit power must be strictly positive, and the amplifier behaves linearly only below a ceiling — push past it and the signal distorts, splattering into neighbours' bands and burning hardware. A policy that samples raw real numbers will routinely propose negative or over-budget power. The trick is to make the parameterization itself guarantee feasibility, so the policy can never emit an illegal action and never has to learn "don't do that."

Engineering detail — reparameterize in log-power. Let the network output a Gaussian over log a and exponentiate: a = exp(log a) > 0 is positive by construction. With the reparameterization trick log a = μθ + σθ·ε, the policy gradient is low-variance and unbiased:

∇θJ = Eε[ ∂Q/∂a · a · (∇θμθ + ε ⊙ ∇θσθ) ]

The factor a in the gradient is exactly the chain-rule term ∂a/∂(log a). Log-power beats a tanh-Gaussian in the ultra-low-SNR regime: as a → 0, log a → −∞ but the gradient stays non-zero, whereas tanh saturates at its boundary and the gradient vanishes — so the agent can still learn to back off power gracefully when a user is deep in a fade.
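
A single-sample version of that gradient can be checked numerically. The sketch below treats μ and σ as bare scalars instead of network outputs, and the toy critic `q_value` (one user's rate minus an energy cost, with made-up `gain` and `lam`) is a stand-in introduced only for illustration.

```python
import math
import random

def sample_power(mu, sigma, eps):
    """a = exp(mu + sigma * eps): strictly positive for any real inputs."""
    return math.exp(mu + sigma * eps)

def q_value(a, gain=2.0, lam=0.3):
    """Toy critic (hypothetical): single-user rate minus energy cost."""
    return math.log(1.0 + gain * a) - lam * a

def dq_da(a, gain=2.0, lam=0.3):
    """Analytic derivative of the toy critic with respect to the action."""
    return gain / (1.0 + gain * a) - lam

def pathwise_grad(mu, sigma, eps):
    """Single-sample gradient of Q w.r.t. (mu, sigma) via the chain rule."""
    a = sample_power(mu, sigma, eps)
    common = dq_da(a) * a            # dQ/da times da/d(log a) = a
    return common, common * eps      # (dQ/dmu, dQ/dsigma)

rng = random.Random(0)
eps = rng.gauss(0.0, 1.0)
g_mu, g_sigma = pathwise_grad(mu=-1.0, sigma=0.5, eps=eps)
```

Averaging `pathwise_grad` over many draws of `eps` gives the expectation in the formula above; comparing one draw against a finite difference is a quick sanity check that the factor a has not been dropped.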

Engineering detail: five layers against amplifier saturation. Positivity is not the same as staying below the linear ceiling Umax, so saturation gets its own defense in depth: (1) a learnable soft compressor asafe = Umax·tanh(κ·a/Umax) with κ initialized at 1.0, letting the network decide how early to start compressing while keeping the gradient alive; (2) a shaping penalty on the raw, pre-compression action, rsat = −λ·max(|a|−Umax, 0), with λ set 5–10× the peak task reward so overshoot genuinely hurts (the penalty has to act on a, because asafe can never exceed Umax and a penalty on it would always be zero); (3) a low-level anti-windup PI stage that freezes its integrator once |u| > 0.95·Umax, invisible to the RL agent but eating residual overshoot as a second layer of insurance; (4) a PPO-Lagrangian outer loop that treats saturation events as a cost and constrains E[c] below δ ≈ 1 event per episode, so the multiplier tunes the penalty automatically instead of by hand; (5) domain randomization of Umax by ±5% each episode so the policy is robust to hardware aging and thermal drift. The target at delivery: saturation rate below 0.01% over a 24-hour, 1000-cycle hardware-in-the-loop test.
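
The first three layers are small enough to sketch directly. Everything named here (`soft_compress`, `saturation_penalty`, `AntiWindupPI`, the gains `kp` and `ki`) is hypothetical, κ is a fixed float here where the lesson makes it learnable, and the penalty is computed on the raw, pre-compression action, since the compressed value cannot exceed Umax by construction.

```python
import math

def soft_compress(a, u_max, kappa=1.0):
    """a_safe = u_max * tanh(kappa * a / u_max); never exceeds u_max."""
    return u_max * math.tanh(kappa * a / u_max)

def saturation_penalty(a_raw, u_max, lam_sat):
    """Shaping term: zero below the ceiling, linear in the overshoot above it."""
    return -lam_sat * max(abs(a_raw) - u_max, 0.0)

class AntiWindupPI:
    """PI stage whose integrator freezes once |u| exceeds 0.95 * u_max."""

    def __init__(self, kp, ki, u_max):
        self.kp, self.ki, self.u_max = kp, ki, u_max
        self.integral = 0.0

    def step(self, error, dt):
        u = self.kp * error + self.ki * self.integral
        if abs(u) <= 0.95 * self.u_max:      # integrate only while unsaturated
            self.integral += error * dt
        return max(-self.u_max, min(self.u_max, u))   # hard clip as last resort
```

For small actions the compressor is close to the identity (tanh(x) ≈ x), so it only reshapes the policy's output near the ceiling.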

5 · Engineer the reward — a concave sum-rate-minus-energy trade-off

Intuition. The reward is a tug-of-war: more power means more rate but more energy, and on a battery-powered or thermally limited node the right balance shifts moment to moment. If you write it carelessly the optimization is non-convex and the agent can wander into local optima — or worse, learn to starve some users to inflate aggregate rate.

Engineering detail — make the utility concave. Diminishing returns on rate are real (a user who already has 100 Mb/s barely benefits from 10 more), so use a concave utility and subtract a linear energy cost; the difference stays concave and the optimization is well-behaved. If the business utility is S-shaped (revenue saturates), apply a concave-hull transform first:

Uconcave(a) = min{ U(a), U′(x₀)·(a − x₀) + U(x₀) }  ·  r = Uconcave(rate) − λ·power
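
A small numerical sketch of the transform, under two assumptions the lesson does not state: the S-shaped utility is taken to be a logistic curve, and x₀ is placed at its inflection point. With that choice the min is the tangent line below x₀ and U itself above it, which is concave.

```python
import math

def s_utility(x, k=1.0, mid=5.0):
    """Hypothetical S-shaped utility (logistic); inflection at x = mid."""
    return 1.0 / (1.0 + math.exp(-k * (x - mid)))

def s_utility_slope(x, k=1.0, mid=5.0):
    """Derivative of the logistic utility."""
    u = s_utility(x, k, mid)
    return k * u * (1.0 - u)

def concave_utility(x, x0=5.0):
    """min{ U(x), U'(x0) * (x - x0) + U(x0) }, with x0 at the inflection."""
    tangent = s_utility_slope(x0) * (x - x0) + s_utility(x0)
    return min(s_utility(x), tangent)

def reward(rate, power, lam):
    """r = U_concave(rate) - lam * power."""
    return concave_utility(rate) - lam * power
```

One side effect of this particular choice is that the transformed utility is negative at very low rates (the tangent line crosses zero), which may or may not be acceptable for a given reward scale.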

Engineering detail — let λ track the battery. The energy weight λ should not be a constant. Make it a function of state-of-charge: at full charge λ is small (spend freely for throughput); as SoC falls λ rises so the agent voluntarily trades rate for endurance. In a deployed controller λ was made a meta-learned parameter, updated MAML-style every ~10 minutes, and as SoC fell from 100% toward 20% it climbed from ≈0.1 to ≈4.7, pulling the mean energy weight from 0.8 down to 0.25 — at the cost of only ~6% extra latency and no low-battery outages. A hard constraint layer clamps the weight (and triggers a graceful degraded mode) below a 5% SoC floor, and that degraded action is not written to the replay buffer so it cannot poison training.
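
To make the shape of a battery-tracking λ concrete, here is a hand-written schedule. It is not the meta-learned update described above: a linear ramp between the two endpoints quoted in the text (≈0.1 at 100% SoC, ≈4.7 at 20%) is swapped in purely for illustration, and the function names and the replay-buffer tuple layout are assumptions.

```python
def energy_weight(soc, lam_full=0.1, lam_low=4.7, soc_low=0.20):
    """Linear ramp from lam_full at SoC = 1.0 up to lam_low at SoC = soc_low."""
    soc = min(max(soc, 0.0), 1.0)
    if soc <= soc_low:
        return lam_low                      # clamp once the ramp tops out
    frac = (1.0 - soc) / (1.0 - soc_low)    # 0 at full charge, 1 at soc_low
    return lam_full + frac * (lam_low - lam_full)

def control_step(soc, policy_action, degraded_action, replay, soc_floor=0.05):
    """Return the action to apply; degraded actions are never stored."""
    if soc < soc_floor:
        return degraded_action              # graceful degraded mode
    replay.append((soc, policy_action))     # only normal actions feed training
    return policy_action
```

Keeping the degraded branch out of `replay` is the part that matters for training: the fallback action was not chosen by the policy, so storing it would mislabel the data.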

Anti-starvation — the same pattern as a recommender's "user starvation"
Maximizing a raw sum can let the agent abandon the weakest user to feed the strongest. Three guards carry over directly from the recommender setting: a per-user fairness floor as a hard constraint (e.g. minimum served rate), a long-horizon value head so the critic prices the cost of an angry, churning user rather than only this slot's bits, and a Lagrangian constraint "fraction of starved users ≤ ε" whose multiplier rises the instant the floor is breached. The concavity of the utility already discourages starvation; these make it a guarantee.
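
The third guard, the Lagrangian constraint, can be sketched as plain dual ascent on the multiplier. The step size, the default ε, and the function names are assumptions; a real controller would fold this into the policy update.

```python
def starved_fraction(rates, floor):
    """Share of users whose served rate is below the fairness floor."""
    return sum(1 for r in rates if r < floor) / len(rates)

def update_multiplier(mult, rates, floor, eps=0.05, step=0.5):
    """Dual ascent on the constraint starved_fraction <= eps."""
    violation = starved_fraction(rates, floor) - eps
    return max(0.0, mult + step * violation)   # multiplier stays non-negative

def constrained_reward(task_reward, mult, rates, floor, eps=0.05):
    """Task reward minus the priced constraint violation."""
    return task_reward - mult * (starved_fraction(rates, floor) - eps)
```

The multiplier grows while the starved fraction exceeds ε and decays back toward zero once the constraint is satisfied, so the penalty is only as strong as it needs to be.
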
The watt-for-a-bit trade-off — sum-rate vs energy

A base station shares Pmax over users on a noise-limited link. Spending more power buys rate with diminishing returns (the log), while energy cost is linear (λ·P). The optimal operating point is where the marginal bit is no longer worth its marginal watt. As the energy weight λ rises, the optimum moves toward lower power, and the weakest user is the first to lose its allocation.
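
For the noise-limited case the optimum has a closed form, which makes the trade-off easy to reproduce. The sketch below assumes each user k has a gain-to-noise ratio g_k and rate log(1 + g_k·p_k) with no interference between users; maximizing Σ log(1 + g_k·p_k) − λ·Σ p_k under Σ p_k ≤ Pmax then gives water-filling with a price floor of λ. The function name and the example numbers are hypothetical.

```python
def allocate(gains, p_max, lam, iters=60):
    """Water-filling with an energy price lam > 0: p_k = max(0, 1/t - 1/g_k)."""
    def powers(price):
        return [max(0.0, 1.0 / price - 1.0 / g) for g in gains]

    p = powers(lam)
    if sum(p) <= p_max:               # budget slack: the energy price alone binds
        return p
    lo, hi = lam, max(gains)          # at price max(gains) every p_k is 0
    for _ in range(iters):            # bisect the price until the budget is met
        mid = 0.5 * (lo + hi)
        if sum(powers(mid)) > p_max:
            lo = mid
        else:
            hi = mid
    return powers(hi)

gains = [4.0, 1.0, 0.25]                       # strong, medium, weak user
cheap = allocate(gains, p_max=10.0, lam=0.2)   # about [4.75, 4.0, 1.0]
pricey = allocate(gains, p_max=10.0, lam=0.5)  # about [1.75, 1.0, 0.0]
```

In this toy setting, raising λ from 0.2 to 0.5 cuts total power from 9.75 to 2.75 and drops the weakest user to zero, which is the starvation effect the fairness guards are there to prevent.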

[Interactive chart: energy weight λ slider with readouts for power used, sum-rate, net utility, and weakest user]

[Figure: the five-step loop. 1 · Formulate: CSI, power, SINR, 3 ms slot. 2 · Diagnose: huge stale state, POMDP channel. 3 · Engineer: CSI compression, log-power, belief net. 4 · Guard: KL drift detection, safety fallback. 5 · Iterate: meta-adapt; a new channel statistic exposes the next difficulty, so re-run the loop.]

6 · Engineer the transition — acting under a channel you can't see in time

Intuition. When the channel changes faster than you can decide (coherence time < slot), the measurement you act on is already wrong. The honest model is a partially observable MDP: you never see the true channel, only delayed, noisy observations, so you must maintain a belief about where the channel is and decide from that.

Engineering detail: belief state, then memory policy. Build a belief over the channel. With a floating-point budget, a particle filter (100–500 particles, reweighted each slot) tracks it well; on a Cortex-M-class MCU, fall back to an exponentially weighted moving average, b_t = α·o_t + (1−α)·b_(t−1), with α set inversely to coherence time. Feed a length-L observation–action–reward sequence into a stacked LSTM or light GRU whose hidden state h splits into an actor head and a critic head, train with PPO-clip, and put a throughput-minus-BLER term in the reward so the policy actively steers away from high-uncertainty channels. In field tests this beats a greedy scheduler by 12–18% average throughput while dropping block-error rate 0.8–1.2 points.
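
The low-cost fallback is a one-line recursion. In this sketch the rule that maps coherence time to α (slot length divided by coherence time, capped at 1) is an assumption; the text only says α is set inversely to coherence time.

```python
def ewma_alpha(slot_ms, coherence_ms):
    """Hypothetical rule: weight new observations more when coherence is short."""
    return min(1.0, slot_ms / coherence_ms)

def update_belief(belief, obs, alpha):
    """b_t = alpha * o_t + (1 - alpha) * b_{t-1}, applied element-wise."""
    return [alpha * o + (1.0 - alpha) * b for b, o in zip(belief, obs)]

belief = [0.0, 0.0]
alpha = ewma_alpha(slot_ms=3.0, coherence_ms=6.0)   # 0.5
for obs in ([1.0, 2.0], [1.0, 2.0], [1.0, 2.0]):
    belief = update_belief(belief, obs, alpha)
# belief is now [0.875, 1.75]
```

With α = 1 the belief collapses to the latest observation, which is the right limit when the channel decorrelates within a single slot.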

Engineering detail — meta-RL for drifting statistics. Mobility doesn't just move the channel, it changes its statistics (a user enters a tunnel, a different scattering regime). Pre-train a context meta-RL policy: an encoder qφ(z|τ) maps the last H ≈ 32 slots of (CSI, ACK/NACK, MCS) into an 8-dimensional latent z, and the policy πθ(a|s,z) conditions on it. Online, run two clocks: a slow thread refreshing the prior over θ, and a fast thread doing a single-step Bayesian correction z ← z + η·∇z log β (β the ACK/NACK likelihood) with no backprop, latency < 0.2 ms. Only if KL(q(z)‖p(z)) > ε does it fire a lightweight MAML fine-tune of the last two LoRA layers — 5 steps, < 2 kB of signalling. The payoff at a 3.5 GHz / 100 MHz, 120 km/h field trial: first-packet latency −27%, throughput +18%, handover failure 0.9% → 0.12%.

7 · Guard in production — detect drift, fall back fast

Intuition. A radio policy ships to thousands of cells and must never make a link worse than the classical baseline it replaced. So you watch for the channel becoming something the policy never trained on, and you keep a safe fallback one switch away.

Engineering detail. Three guards. (1) Non-stationarity detector. Run a sliding-window KL divergence on the channel statistics; the moment DKL > ε, trigger a local retrain — recovery needs only ~200 samples to restore capacity, well inside the O-RAN 7.2x interface's latency budget. (2) Robust safety margin. A robust-MPC outer guard solves a min-max over a model-error ball and keeps the state trajectory inside a robust-invariant set; it publishes a safety-margin index SMI = 1 − d∂Ω(x)/dmax, and when SMI < 0.05 it switches to a backup policy within 200 ms. (3) Quantization shadow. The deployed encoder runs INT8 on the inference accelerator; a periodic check compares its logits against an FP32 reference and recalibrates if the gap widens, because INT8 quantization noise can otherwise silently invalidate the robust bound.
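
The first guard can be sketched with a scalar channel statistic and a Gaussian fit per window. Both are simplifying assumptions: the lesson does not say which statistic is monitored or how the divergence is estimated, and the window length and threshold below are placeholders.

```python
import math
from collections import deque

def gaussian_kl(m1, v1, m0, v0):
    """KL( N(m1, v1) || N(m0, v0) ) for scalar Gaussians."""
    return 0.5 * (math.log(v0 / v1) + (v1 + (m1 - m0) ** 2) / v0 - 1.0)

class DriftDetector:
    """Sliding-window KL test of recent samples against a reference Gaussian."""

    def __init__(self, ref_mean, ref_var, window=64, eps=0.1):
        self.ref_mean, self.ref_var = ref_mean, ref_var
        self.buf = deque(maxlen=window)
        self.eps = eps

    def update(self, x):
        """Push one sample; return True when the window has drifted."""
        self.buf.append(x)
        if len(self.buf) < self.buf.maxlen:
            return False                      # wait for a full window
        n = len(self.buf)
        mean = sum(self.buf) / n
        var = max(sum((v - mean) ** 2 for v in self.buf) / n, 1e-12)
        return gaussian_kl(mean, var, self.ref_mean, self.ref_var) > self.eps
```

A `True` from `update` would be the trigger for the local retrain; the retrain itself and the fallback switch are outside this sketch.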

The through-line
Every section is one row of the MDP table turned into a mechanism: huge stale CSI → Lipschitz-normalized compression to d=32; positive sum-constrained action → log-power reparameterization; amplifier ceiling → five-layer anti-saturation; non-convex rate-vs-energy → concave utility with a battery-tracking λ and anti-starvation constraints; POMDP channel → belief state + memory policy + context meta-RL; multi-cell real-time → INT8 inference and drift-triggered local retrain. You never reached for a tool until a row of the table demanded it. That discipline is the whole track.

8 · Iterate — distributed execution across cells

Intuition. One base station is an agent; a network is a team. Removing the single-cell difficulties exposes the next one — cells interfere, so their power decisions are coupled, and you cannot afford a central brain in the loop at 3 ms. So you re-run the loop at the multi-agent scale.

Engineering detail. Use centralized-training, decentralized-execution with a value-decomposition critic (a QPLEX-light: per-agent Q from a shared backbone, a hypernetwork mixing them into Qtot under the IGM/monotonicity guarantee, cross-terms kept only for k=3-hop neighbours so parameters stay O(n) not O(n²)). Each cell keeps a lightweight actor; the global critic sees a mean-pooled, self-attention summary of all observations. When over-eager policy averaging causes consensus oscillation, damp it with a momentum EMA on the parameter server (β=0.8 decaying to 0.5) and a circuit-breaker that rolls back to the last good model and locks averaging for a cooldown if online metrics dip — taking oscillation events from dozens per day to a couple. Where there is no central node at all, fall back to gossip averaging with importance weights clipped to [0.9, 1.1], which matches a parameter server's convergence within 3% at a third of the bandwidth.
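
The two damping mechanisms reduce to short averaging rules. The sketch treats a model as a flat list of floats; the linear decay of β over `decay_rounds`, the unit weight on a cell's own model, and all function names are assumptions, and the circuit-breaker is not shown.

```python
def ema_merge(server, incoming, beta):
    """Momentum EMA on the parameter server: keep beta of the old weights."""
    return [beta * s + (1.0 - beta) * w for s, w in zip(server, incoming)]

def beta_schedule(round_idx, beta_start=0.8, beta_end=0.5, decay_rounds=100):
    """Hypothetical linear decay of beta from 0.8 to 0.5."""
    frac = min(round_idx / decay_rounds, 1.0)
    return beta_start + frac * (beta_end - beta_start)

def gossip_average(own, neighbours, weights, lo=0.9, hi=1.1):
    """Average with neighbours, importance weights clipped to [lo, hi]."""
    clipped = [min(max(w, lo), hi) for w in weights]
    total = 1.0 + sum(clipped)                   # own model carries weight 1
    return [(o + sum(c * n[i] for c, n in zip(clipped, neighbours))) / total
            for i, o in enumerate(own)]
```

Clipping the importance weights to a narrow band means no single neighbour can dominate an averaging round, which is what keeps gossip from reintroducing the oscillation the EMA was added to damp.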

Further considerations