
5G network-slicing resource allocation

One physical radio carrier is sliced into virtual networks that must each meet a contract: a video slice wants bandwidth, an industrial-control slice wants 99.999% reliability at sub-millisecond latency, a best-effort slice wants whatever is left. A controller hands out radio blocks and transmit power every transmission-time interval. The binding difficulties are a high-dimensional, fast-drifting state (channel reports that are huge, delayed, and partially observed), a hybrid constrained action (discrete block selection × continuous power under a total-power cap), a multi-objective reward where a reliability breach is a regulatory red line, non-stationary traffic that shifts faster than you can retrain, and a hard isolation requirement: no slice may starve its neighbour. Each names a tool.

The method — five steps, every lesson
Applied RL is not a grab-bag of tricks. Every domain in this track runs the same loop: (1) Formulate the MDP — state, action, reward, transition, horizon. (2) Diagnose the one property that makes the MDP hard. (3) Engineer the mechanism that removes that difficulty. (4) Guard it in production — detect when it breaks and fall back. (5) Iterate. The art is only in which difficulty binds first. Here we run it on a slice scheduler living inside a 5G base-station that must decide inside a 1 ms slot.

1 · Formulate — the MDP behind a slice scheduler

Intuition. Every slot the scheduler looks at who needs to send, how good each link is, and which contracts are at risk; it then assigns resource blocks and power; the air interface delivers (or drops) packets; and a contract-satisfaction score comes back. That is a Markov Decision Process — but every one of its four pieces is awkward in a way that names the rest of the lesson.

MDP = (S, A, R, P, γ)  ·  maximize  E[ Σₜ γᵗ rₜ ]   subject to   SLA-violation rate ≤ ε
Piece | For a 5G slice scheduler | The awkward part
State S | per-slice QoS demand + business-weight signal, channel-state reports (CSI), system load (CPU, queue length, RTT); raw ~80–120 dim, CSI alone up to 8192 dim | huge, heterogeneous, delayed, partially observed: reports arrive late and stale
Action A | which resource blocks (discrete) × transmit power per block (continuous), under a total-power cap | hybrid and constrained: Σ power ≤ Pmax, every slot
Reward R | business satisfaction + spectral efficiency − SLA violation − operator-cost terms | multi-objective; a reliability breach is a hard red line, not a soft cost
Transition P | fast-fading channel + arriving traffic + the other slices sharing the carrier | non-stationary (new services launch weekly) and coupled across slices

The rightmost column is the lesson. State is high-dimensional and delayed → embed, compress, and build an information state. Action is hybrid and capped → structured heads with a differentiable projection. Reward has a hard red line → constrained RL. Transition is non-stationary → change detection and meta-adaptation. And slices share one carrier → an isolation guarantee. We reach for each tool only when its row demands it.

2 · Diagnose — three properties that make this MDP hard

Intuition. Strip the domain to its load-bearing difficulties. First, the state is enormous, heterogeneous, and stale: you concatenate business weights, per-slice QoS vectors, system counters, and raw channel matrices, then must decide in under a millisecond on reports that arrived late. Second, the action is a capped hybrid: pick a subset of resource blocks and set continuous power on each, never exceeding total power, and the SLA constraint must hold in every single slot, not just on average. Third, the slices are coupled and the traffic moves: one slice grabbing power raises a neighbour's interference, and the traffic mix you trained on is gone by next week's promotion. The rest of the lesson removes exactly these.

Why a naive flatten-and-DQN scheduler fails here
Feed the raw 8192-dim channel vector and 100-dim system state straight into a fully-connected policy and three things break at once. (1) The network is too big to run inside the 1 ms slot on an edge accelerator. (2) A soft penalty for SLA violation lets the agent trade reliability for throughput — fine on average, fatal for the one industrial slice whose contract says 99.999%. (3) The policy silently assumes the channel report it sees is current; under report delay the state is not Markov, and the value estimates wander. You cannot patch these after the fact — they must be designed out.

3 · Engineer — state: embed, compress, and make it Markov

Intuition. The raw observation is a pile of different things measured in different units. You want a small, fixed-width vector that keeps what the policy needs and throws away the rest — and that is genuinely Markov despite delayed reports.

Engineering detail — heterogeneous embedding then attention compression. Embed the categorical service-type into a low-dimensional vector (16 dim is plenty), stack it with the three QoS signals (demand, latency target, dynamically-pushed business weight — e.g. "e-commerce weight ×1.8 at midnight during a sales peak") into a QoS vector, and concatenate with system counters to get the raw ~80–120 dim state. Then compress it with a small Transformer encoder (2 layers, 2 heads, 32 hidden units) down to a 32-dim state. Self-attention across same-domain QoS signals lets the model learn couplings automatically — "if video bandwidth goes up, tolerated latency can go up too" — and a shallow, narrow encoder runs inference in under 0.2 ms, inside the MEC edge node's 1 ms decision budget.

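$$ e_{\text{type}} = \mathrm{Embed}(\text{service}), \quad \text{qos} = [\,e_{\text{type}},\ \text{demand},\ \text{latency},\ \text{weight}\,], \quad s_{32} = \mathrm{TransformerEnc}(\mathrm{concat}(\text{qos}, \text{sys})) $$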

Engineering detail: compressing the channel report. The channel-state report (CSI) can be ~8192-dim and is the single largest input. A learned compressor (a "channel-state transform network") squeezes it to a 32-dim latent, a 256:1 reduction in the densest case, using a gated soft-thresholding block. Soft-thresholding, sign(x)·(|x| − τ)₊, relaxes an ℓ0 sparsity constraint into its ℓ1 surrogate and is differentiable almost everywhere, so gradients still flow through it. Crucially, do not substitute PCA for the learned compressor: PCA only preserves maximum variance, which is not the same as preserving what the policy needs. Instead, share the compressor's last two layers with the actor's front end and add a compression-distortion penalty (λ = 0.01) to the reward, so the latent is optimized end-to-end to maximize mutual information between the compressed state and long-term return. With quantization-aware training to INT4 plus weight pruning, this drops the state dimension from 8192 to 32, cuts policy parameters by ~80%, speeds convergence ~3.4×, and reduces inference power on the edge accelerator by ~62%.
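The soft-thresholding operator at the core of that block can be sketched in a few lines. This is only the elementwise shrinkage step; the gating, the layer sizes, and how τ is learned are not specified in the lesson, so the function names and the fixed `tau` here are illustrative assumptions.

```python
import math

def soft_threshold(x, tau):
    """Elementwise sign(v) * max(|v| - tau, 0): shrinks small entries to zero."""
    return [math.copysign(max(abs(v) - tau, 0.0), v) for v in x]

def sparsity(x):
    """Fraction of entries that are exactly zero after shrinkage."""
    return sum(1 for v in x if v == 0.0) / len(x)

# Hypothetical 4-dim slice of a channel latent, tau chosen for illustration.
latent = [3.0, -0.5, 0.2, -2.0]
shrunk = soft_threshold(latent, tau=1.0)
```

Entries with magnitude below `tau` become zero and the rest move toward zero by `tau`, which is what pushes the latent toward sparsity while leaving a usable gradient on the surviving entries.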

Engineering detail: building an information state under report delay. Channel and queue reports arrive late, so the latest observation is not the true state and the MDP is partially observed. Replay offline logs to fit the end-to-end delay distribution Δ ~ LogNormal(μ, σ²), then stack the last k = ⌈P99(Δ)/dt⌉ observations and run a 1-D CNN + GRU to produce a 64-dim candidate information state ẑ_t. Verify it is actually Markov with a conditional-independence test: train a next-observation predictor and accept ẑ_t only if its MSE on o_{t+1} is within 1.05× of a predictor fed the raw o_t. During training, randomly offset replayed samples by δ ~ U(0, Δ_max) to simulate real jitter, and add a delay-consistency regularizer so the critic gives stable values to nearby information states:

$$ L_{\text{reg}} = \lambda \,\bigl\| Q(\hat z_t, a_t) - Q(\hat z_{t+\delta}, a_t) \bigr\|^2, \quad \lambda = 0.1, \qquad L = L_{\text{SAC}} + L_{\text{reg}} $$
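As a rough illustration of the two quantities above, the sketch below computes the stacking depth k from fitted log-normal parameters and evaluates the regularizer for scalar Q-values. The parameter values (a 2 ms median delay, σ = 0.5, 1 ms slots) are made up for the example, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def stack_depth(mu, sigma, dt, quantile=0.99):
    """k = ceil(P_q(delay) / dt) for delay ~ LogNormal(mu, sigma^2)."""
    z = NormalDist().inv_cdf(quantile)      # standard-normal quantile
    delay_q = math.exp(mu + sigma * z)      # log-normal quantile
    return math.ceil(delay_q / dt)

def delay_consistency_penalty(q_now, q_shifted, lam=0.1):
    """L_reg = lam * (Q(z_t, a_t) - Q(z_{t+delta}, a_t))^2 for scalar Q-values."""
    return lam * (q_now - q_shifted) ** 2

# Assumed fit: median delay 2 ms (mu = ln 2, in log-milliseconds), sigma = 0.5.
k = stack_depth(mu=math.log(2.0), sigma=0.5, dt=1.0)
penalty = delay_consistency_penalty(q_now=1.0, q_shifted=0.5)
```

With these assumed numbers the 99th-percentile delay is about 6.4 ms, so the stack holds 7 one-millisecond observations.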

4 · Engineer — action: a capped hybrid you can backpropagate through

Intuition. The scheduler must choose which resource blocks to use (a discrete subset) and set a continuous power level on each, while the powers sum to no more than the carrier's total. Flattening this into one giant discrete menu explodes; ignoring the cap produces illegal allocations. Keep the structure and make the cap part of the math.

Engineering detail. Use a discrete head to select blocks and a continuous head that outputs a Gaussian mean and standard deviation whose dimension equals the number of selected blocks; sample power by reparameterization. To enforce Σ power ≤ Pmax, apply a differentiable projection: softmax-normalize the sampled vector, then multiply by Pmax. The softmax output sums to one, so the allocation spends exactly Pmax and the cap holds by construction; the backward path is preserved, so training stays stable. At inference, take the argmax of the discrete head and emit the Gaussian mean directly, with no sampling, which on a pooled baseband platform adds under 0.2 ms, satisfying the 1 ms TTI budget. Train with a Lagrangian-constrained PPO that turns each QoS breach into a cost and drives a dual multiplier μ by dual ascent; in a fast-fading simulator (3GPP TR 38.901 channel, <5 ms per simulated slot) it reaches ~98% of exhaustive-search optimal and the multiplier converges within ~20 steps.
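The power projection described above can be sketched as a plain forward pass. This is a minimal illustration in pure Python rather than an autodiff framework; `project_power` and the example values are hypothetical names and numbers, not part of the lesson.

```python
import math

def project_power(raw, p_max):
    """Softmax-normalize raw per-block samples, then scale by p_max.

    The result is strictly positive and sums to p_max, so the total-power
    cap is satisfied for any real-valued input.
    """
    peak = max(raw)                                   # subtract max for numerical stability
    exps = [math.exp(v - peak) for v in raw]
    total = sum(exps)
    return [p_max * e / total for e in exps]

# Three selected blocks, raw samples from the continuous head (made-up values).
power = project_power([1.0, 2.0, -3.0], p_max=10.0)
```

Because every step is a smooth function of the input, the same computation in an autodiff framework passes gradients back to the continuous head. Note that this form always spends the full budget; leaving headroom under the cap would need an extra output or a different projection.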

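$$ \max_\theta \; \mathbb{E}\Bigl[\sum_t \gamma^t r_t\Bigr] - \mu\,\mathbb{E}[\text{cost}_{\text{SLA}}], \qquad \mu \leftarrow \bigl[\,\mu + \eta\,(\mathbb{E}[\text{cost}_{\text{SLA}}] - \varepsilon)\,\bigr]_+ $$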
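The dual update in the formula above is a one-line projected ascent step. The sketch below shows it alongside the penalized objective the policy maximizes; the function names and the step size are illustrative assumptions.

```python
def dual_update(mu, mean_cost, eps, eta):
    """mu <- [mu + eta * (E[cost] - eps)]_+  (projected dual ascent)."""
    return max(0.0, mu + eta * (mean_cost - eps))

def penalized_objective(mean_return, mean_cost, mu):
    """The quantity the policy maximizes for a fixed multiplier mu."""
    return mean_return - mu * mean_cost

# Cost above budget raises the price; cost below budget lowers it, floored at zero.
mu_up = dual_update(mu=1.0, mean_cost=0.03, eps=0.01, eta=10.0)
mu_floor = dual_update(mu=0.0, mean_cost=0.0, eps=0.01, eta=1.0)
```

The multiplier rises while the measured cost exceeds ε and decays once the policy is back inside the budget, which is why it behaves like a price on SLA risk.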
URLLC reliability as a true constraint, not a penalty
For an ultra-reliable low-latency (URLLC) slice the contract is 99.999%, i.e. block error rate ≤ 10⁻⁵. Track a smoothed violation probability p_t = (1−α)·p_{t−1} + α·c_t, where c_t is the per-slot violation indicator, with α = 0.001 (≈1000-slot memory, a ~1 s statistical window at 1 ms slots), and write the constraint as cost = max(0, p_t − 10⁻⁵). A tiny GRU + 2-layer MLP policy (<200k params) infers in ~120 µs; trained on millions of virtual samples (TDL channel + Poisson arrivals) then fine-tuned for one hour on a live cell, it reaches BLER ≈ 1.2×10⁻⁵ (essentially at the 10⁻⁵ bound that 99.999% implies, though marginally above it), air-interface latency averaging 0.87 ms with P99 = 0.95 ms, and ~38% higher spectral efficiency than static block reservation, with zero recorded violations.
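The smoothed violation probability and its cost can be tracked with a small stateful helper. This is a sketch under the assumption that the per-slot signal is a 0/1 violation indicator; the class name and the starting value p = 0 are illustrative choices.

```python
class ViolationTracker:
    """Exponentially smoothed violation probability with a hinge cost."""

    def __init__(self, alpha=0.001, target=1e-5):
        self.alpha = alpha      # smoothing factor, roughly a 1000-slot memory
        self.target = target    # contractual bound on the violation probability
        self.p = 0.0            # smoothed estimate, assumed to start at zero

    def update(self, violated):
        """Fold in one slot and return cost = max(0, p_t - target)."""
        c = 1.0 if violated else 0.0
        self.p = (1.0 - self.alpha) * self.p + self.alpha * c
        return max(0.0, self.p - self.target)

tracker = ViolationTracker()
cost_after_error = tracker.update(True)    # one block error
```

With α = 0.001 a single block error lifts p to 10⁻³, and the cost then stays positive for roughly 4,600 clean slots before p decays back under 10⁻⁵, so the constraint reacts sharply to individual errors.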

5 · Engineer — reward: multi-objective with a hard red line

Intuition. Three things matter at once — keep customers satisfied, use spectrum efficiently, and never breach a contract — and they pull against each other. Linear weighting gets you started, but the reliability term cannot be a soft cost the agent is free to trade away.

Engineering detail. A workable scalarization is a linear combination plus a steep red-line penalty:

$$ R = 0.6\cdot\text{satisfaction} + 0.3\cdot\text{utilization} - 100\cdot\text{SLA\_violation}, \qquad R_{\text{pen}} = -\beta \log_{10}\!\bigl(1 + p_t/\varepsilon\bigr), \quad \beta = 10 $$

The reliability penalty is shaped in log scale so that it grows smoothly, rather than linearly, as p_t climbs across orders of magnitude (from 10⁻⁵ towards 10⁻³). Here ε = 10⁻⁷ only guards the log and is distinct from the violation budget ε of the constrained objective. Choose the outer weights in two stages: coarse grid search, then Bayesian fine-tuning. Two training tricks keep the rare-violation signal from being drowned out: clip R_pen to [−10, 0] so its variance does not swamp the critic, and use a dual buffer that over-samples violation episodes to roughly 1:10 against normal ones. Recalibrate β against live operations-and-maintenance data every ~10k steps so the simulator-to-field error stays under 5%.
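The dual-buffer trick can be sketched as two lists sampled at a fixed ratio. The sketch assumes that "~1:10" means about one violation sample for every ten normal ones in each batch, and that sampling is with replacement so a handful of rare violations can be reused; both are interpretations, and the class and method names are hypothetical.

```python
import random

class DualBuffer:
    """Two replay lists sampled at a fixed violation-to-normal ratio."""

    def __init__(self, violation_share=1 / 11, seed=0):
        self.normal = []
        self.violation = []
        self.share = violation_share          # 1:10 ratio means 1/11 of each batch
        self.rng = random.Random(seed)

    def add(self, transition, violated):
        (self.violation if violated else self.normal).append(transition)

    def sample(self, batch_size):
        if not self.normal:
            raise ValueError("normal buffer is empty")
        # No violations stored yet: fall back to an all-normal batch.
        n_viol = round(batch_size * self.share) if self.violation else 0
        batch = self.rng.choices(self.violation, k=n_viol) if n_viol else []
        batch += self.rng.choices(self.normal, k=batch_size - n_viol)
        return batch

buf = DualBuffer()
for i in range(1000):
    buf.add(("normal", i), violated=False)
for i in range(3):
    buf.add(("violation", i), violated=True)
batch = buf.sample(22)
```

Even though violations are only 0.3% of what was stored, every batch of 22 carries two of them, which keeps the red-line signal visible to the critic.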

Engineering detail — when reward is a monthly bill. If the objective is operator revenue, the true reward (the settled bill) only arrives at month-end while decisions are made every slot. Use a two-level eligibility trace: within the day, build a proxy reward from 15-minute usage records with trace λ=0.7; at the billing boundary, replace that month's proxy trajectories with the real bill via Monte-Carlo return, restoring an unbiased target. Feed the critic two extra state features — days remaining in the billing period and fraction of budget already spent — so the value function feels the month-end window closing and the policy stops over-spending subsidy late in the cycle.
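One way to picture the month-end correction is below. It assumes the settled bill is the only true reward and arrives at the last step of the period, so each step's Monte-Carlo target is just the discounted bill; it also builds the two extra critic features. The function names and the example numbers are hypothetical, and the within-day eligibility trace is not shown.

```python
def billing_features(day_of_period, period_days, spent, budget):
    """Extra critic inputs: days remaining and fraction of budget already spent."""
    return (period_days - day_of_period, spent / budget)

def monte_carlo_targets(n_steps, settled_bill, gamma=1.0):
    """Return-to-go per step when the only true reward is the bill at the final step.

    Assumption for illustration: proxy rewards are discarded once the bill
    arrives, so step t's target is gamma^(n_steps - 1 - t) * settled_bill.
    """
    return [gamma ** (n_steps - 1 - t) * settled_bill for t in range(n_steps)]

features = billing_features(day_of_period=25, period_days=30, spent=80.0, budget=100.0)
targets = monte_carlo_targets(n_steps=3, settled_bill=100.0, gamma=0.5)
```

The features let the value function see that five days remain with 80% of the budget gone, which is the signal that discourages late-cycle over-spending.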

6 · Engineer — isolation: no slice may starve its neighbour

Intuition. The slices share one physical carrier, so a greedy allocation to one slice raises interference and queue pressure on the others. The headline failure is one elastic broadband slice monopolizing resources. Isolation needs three layers, because no single one is trustworthy alone.

Engineering detail: algorithm, system, operations.

(Algorithm) Add an isolation penalty R = R_biz − λ·max(0, KPI_other − KPI_threshold), where the KPI is a higher-is-worse quantity such as latency or loss, so the moment a neighbour's KPI is crushed the policy eats a large negative reward and learns self-restraint. To stop user-level monopoly, model each broadband user as a selfish agent under a mean-field approximation and add a Jain-fairness marginal penalty that turns a user's marginal reward negative once its rate exceeds ~150% of the mean.

(System) Put an action-shielding module in the slice orchestrator: every RL action is checked against the isolation budget first, and if it would overflow a slice's CPU, block, or queue allotment it is replaced with a safe action, so zero illegal allocations reach the air. Run a two-timescale architecture: RL decides on a ~1 s cadence with a classical controller backstopping every ~10 ms, rolling back to the last safe configuration within ~200 ms on any KPI anomaly. And hard-partition the critical resources (cgroups, NUMA pinning, static block reservation) so the RL action can only shuffle within a soft partition; the physical ceiling is untouchable.

(Operations) Chaos-test in a digital-twin sandbox before launch, then gray-release at 5% traffic until 24 h of zero violations, and continuously report the isolation-violation rate to the network operations center; if it exceeds 0.1% for three consecutive windows, auto-trip the model and fall back to a conservative heuristic.
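The action shield in the system layer reduces to a budget check with a fallback. The sketch below assumes a per-slice allocation described by three quantities (blocks, CPU share, queue slots); the `Allocation` type, its fields, and the example budgets are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Allocation:
    blocks: int     # resource blocks requested for the slice
    cpu: float      # CPU share requested, 0..1
    queue: int      # queue slots requested

def shield(proposed, budget, safe):
    """Pass the RL action through only if it fits every isolation budget."""
    within = (
        proposed.blocks <= budget.blocks
        and proposed.cpu <= budget.cpu
        and proposed.queue <= budget.queue
    )
    return proposed if within else safe

budget = Allocation(blocks=50, cpu=0.4, queue=200)
safe = Allocation(blocks=10, cpu=0.1, queue=50)
ok = shield(Allocation(blocks=40, cpu=0.3, queue=100), budget, safe)
blocked = shield(Allocation(blocks=60, cpu=0.3, queue=100), budget, safe)
```

The check runs before anything reaches the air interface, so an over-budget proposal is never partially applied: it is swapped whole for the safe action.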

A control-barrier-function isolation proof
Beyond shielding you can prove isolation. Define a barrier function on a safety margin h(x) (e.g. a neighbour's headroom), with the safe set being where h(x) ≥ 0. The control-barrier condition L_f h(x) + L_g h(x)·u ≥ −α(h(x)), with α a class-K function, guarantees that h stays non-negative whenever it starts non-negative, so the boundary is never crossed. Enforce it online by projecting the actor's raw action onto the safe set, a_proj = argmin_{u ∈ U_CBF} ‖u − a_raw‖², and enforce it offline with a differentiable penalty J_cbf = E[max(0, −(L_f h + L_g h·a + α(h)))] added to the loss. As J_cbf → 0 the policy satisfies the barrier almost everywhere, which is isolation in the probabilistic sense, without abandoning return maximization.
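When the barrier condition is the only constraint on u, the safe set is a half-space and the projection has a closed form. The sketch below assumes exactly that, plus a linear α(h) = k·h; real deployments with extra input bounds would need a QP solver instead. All names and numbers are illustrative.

```python
def cbf_project(a_raw, lf_h, lg_h, h, k=1.0):
    """Project a_raw onto {u : lf_h + lg_h . u + k*h >= 0} (a half-space)."""
    bound = -(lf_h + k * h)                          # need lg_h . u >= bound
    dot = sum(g * a for g, a in zip(lg_h, a_raw))
    if dot >= bound:
        return list(a_raw)                           # already safe, leave untouched
    norm_sq = sum(g * g for g in lg_h)
    if norm_sq == 0.0:
        raise ValueError("no control authority: barrier condition cannot be met")
    step = (bound - dot) / norm_sq                   # smallest move along lg_h
    return [a + step * g for a, g in zip(a_raw, lg_h)]

def cbf_penalty(a, lf_h, lg_h, h, k=1.0):
    """Offline penalty term: max(0, -(lf_h + lg_h . a + k*h))."""
    dot = sum(g * x for g, x in zip(lg_h, a))
    return max(0.0, -(lf_h + dot + k * h))

raw = [0.0, 2.0]
safe_action = cbf_project(raw, lf_h=-1.0, lg_h=[1.0, 0.0], h=0.5)
```

The projection moves the action only along the direction that affects the margin and leaves the other component alone, so it is the smallest correction that restores the barrier condition.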

7 · Guard — non-stationary traffic and the production loop

Intuition. Traffic shifts faster than retraining: new services launch weekly, and a promotion can rewrite the load profile overnight. A frozen policy decays silently. So detect the drift, adapt fast, and keep a safe fallback.

Engineering detail. Watch the return distribution, not raw inputs: compute the Wasserstein-1 distance between recent and reference returns; if W₁ > 0.08 with p < 0.005, pull the new policy into a small A/B bucket capped at 5% traffic. Adapt quickly with meta-RL: on new traffic, take ~100 logged samples and do a one-step MAML update (learning rate ×10) to get θ′, run it, and re-tune every 30 minutes from a second replay buffer. Gate every adaptation with a bootstrapped meta-critic confidence lower bound: if the lower bound drops more than 5% below the incumbent, auto-roll back, keeping the launch incident-free. For adaptation that must avoid catastrophic forgetting, compute the conflict between new- and old-task gradients on incremental MAML updates and project out the conflicting component when it exceeds a threshold.
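The drift gate on returns can be sketched with an empirical Wasserstein-1 distance and a significance check. The lesson does not say how the p-value is obtained; a permutation test is assumed here, along with equal-size samples, and all function names are hypothetical.

```python
import random

def wasserstein1(a, b):
    """W1 between two equal-size 1-D samples: mean absolute gap of sorted values."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Share of random relabelings whose W1 is at least the observed one."""
    rng = random.Random(seed)
    observed = wasserstein1(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if wasserstein1(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def drift_detected(reference, recent, w_thresh=0.08, p_thresh=0.005):
    """Both conditions from the text: W1 above threshold and p below threshold."""
    if wasserstein1(reference, recent) <= w_thresh:
        return False
    return permutation_p_value(reference, recent) < p_thresh

reference = [i / 100 for i in range(100)]       # made-up normalized returns
recent = [r + 0.5 for r in reference]           # same shape, shifted upward
```

Checking the distance first avoids running the permutation loop on windows that are obviously unchanged, which matters when the gate is evaluated frequently.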

Constrained PPO: the throughput ↔ reliability dual price

The Lagrangian multiplier μ is the price the scheduler pays for risking the SLA. Push the throughput weight up and the policy wants to over-pack the spectrum; the dual update raises μ until the measured violation rate falls back to the budget ε. Achieved throughput and violation rate trade off against each other as the throughput weight and the budget change.

[Interactive panel readouts: dual price μ, achieved throughput, violation rate, verdict]

[Loop diagram] 1 · Formulate: S, A, R, P; capped hybrid action → 2 · Diagnose: huge stale state, red-line SLA → 3 · Engineer: embed + compress, constrained PPO → 4 · Guard: action shield, drift detection → 5 · Iterate: re-diagnose. Removing one difficulty exposes the next, so re-run the loop.
The through-line
Every section is one row of the MDP table turned into a mechanism: huge, stale state → service embedding + attention compression + an information state verified Markov under delay; capped hybrid action → structured discrete/continuous heads with a differentiable power projection; red-line reward → constrained PPO with a Lagrangian dual price and over-sampled violation episodes; coupled slices → algorithm penalty + action shielding + control-barrier proof + hard partitions; non-stationary traffic → return-distribution change detection, MAML fast-adaptation, and confidence-gated rollback. You never reached for a tool until a row demanded it.

Further considerations