Network congestion control

A congestion controller decides how fast to send into a link it cannot see, using signals — round-trip time, loss, ECN marks — that are noisy, delayed, and not Markov. Cast as RL, the binding difficulties of this domain are: a partially-observable, non-stationary state built from jittery measurements; an action space that must change rate without oscillating; a throughput-vs-latency reward with a hard tail-latency constraint; a multi-agent fairness requirement (don't starve the TCP flows sharing your link); and a µs-scale inference budget that lives in the kernel datapath. Each names a mechanism.

The method — five steps, every lesson

Same loop as always. (1) Formulate the MDP — state, action, reward, transition, horizon. (2) Diagnose the properties that make this MDP hard. (3) Engineer the mechanism that removes exactly those difficulties. (4) Guard it in production — detect drift, monitor, fall back. (5) Iterate. For congestion control the first difficulty is that your "state" is an estimate of an estimate: the network is invisible, and everything you measure arrived one RTT too late.

1 · Formulate — the MDP behind a learned congestion controller

Intuition. A sender wants to push bytes as fast as the path allows without overflowing the bottleneck queue. It never sees the queue directly; it only sees what comes back — acknowledgements, their timing (RTT), losses, and ECN congestion-experienced (CE) marks. So the agent observes symptoms, picks a sending rate or window, and is rewarded for moving a lot of data while keeping delay low. That is an MDP, but a thin and lagged one.

MDP = (S, A, R, P, γ) · maximize E[ Σₜ γᵗ ( throughputₜ − λ·delayₜ ) ]

Piece	For a learned congestion controller	The awkward part
State S	smoothed RTT, ΔRTT, ECN mark ratio, loss flags, inflight, last action	built from noisy, one-RTT-delayed estimates → effectively partially observable
Action A	sending rate / cwnd multiplier, or a discrete throttle level	fine control oscillates; flipping levels every step destabilizes the link
Reward R	throughput minus a latency penalty, under a hard tail-latency bound	two objectives that trade off, plus log-utility that vanishes at zero
Transition P	the link, the queue, and every other flow sharing the bottleneck	co-existing flows are other agents → non-stationary & multi-agent

The rightmost column is the lesson. Unlike a game, where the engine gives you a clean frame, here even the state is engineering — half this lesson is just turning raw measurements into something Markov enough to learn on.

2 · Diagnose — the state is an estimate of an estimate

Intuition. RTT samples bounce around: queueing, scheduling, Wi-Fi retransmits, a 4G→5G handover that adds 200 ms in one step. If you feed raw RTT into the policy, the value function learns to chase noise. If you over-smooth, you lag real congestion and react too late. And the deepest problem: a single instantaneous reading is not Markov — packet-level sequences are non-stationary and grow with bandwidth, violating the MDP assumption outright. You must construct a state that is smooth, current, and Markov, all at once.

Engineering detail — adaptive smoothing with a confidence gate. Smooth RTT with an exponentially-weighted moving statistic whose gain adapts to the measured variance, so it tracks fast when the signal is real and damps hard when it is noise:

αₜ = 1 / ( 1 + exp( −(vₜ − v_target)/κ ) ) · RTT_smoothₜ = αₜ·RTTₜ + (1−αₜ)·RTT_smoothₜ₋₁

where vₜ is an online estimate of RTT variance and v_target is set by service class — roughly 0.1 ms for finance, 1 ms for cloud gaming. Large variance shrinks αₜ (trust history, suppress noise); small variance grows it (track the true change-point). The filter lag stays under its theoretical bound of 0.8 frames (<16 ms at 60 FPS), which is what keeps a cloud-gaming controller inside its decision deadline.

Engineering detail — the state vector itself. Don't feed one number; feed a first-order difference plus a confidence mask so the network can tell "the link changed" from "the meter is shaking":

sₜ = [ RTT_smooth, ΔRTT, maskₜ ] · maskₜ = 𝟙{ σₜ < σ_max }

When mask = 0 (high-noise mode) the critic lowers its bootstrap weight so a wrong TD target can't poison the value function, while the actor keeps its output scale stable via PopArt normalization. The whole pipeline runs in TensorRT at <0.05 ms GPU latency, so it costs the trainer nothing.

Engineering detail — fusing loss and ECN into the state. Loss events are sparse, so a single loss bit carries little signal. Add a saturating consecutive-loss counter and a learned joint-confidence channel so the agent reads a 4-dim summary instead of a raw packet trace:

loss_cntₜ = min(loss_cntₜ₋₁ + 1, 7) · confₜ = σ( w₁·ecn_smoothₜ − w₂·loss_cntₜ ) · sₜ = [ ecn_smooth, loss_bin, loss_cnt, conf ]

The weights w₁, w₂ are found by offline Bayesian optimization over 10k real traces, with the objective being the policy's mean reward over a 10-second sliding window. The 4-dim state runs at <1% CPU at 40 Gbps line rate, and in A/B tests against Cubic it raised throughput by 22% in incast scenarios while cutting queueing delay 35%.

Why statistics, not raw packets — the Markov argument

If an interviewer asks "why not feed the raw packet-level sequence?", the answer is exactly the MDP assumption: a packet sequence is non-stationary and its dimension grows linearly with bandwidth, so it is not a fixed-shape Markov state. Hand-built statistics (smoothed RTT, ECN ratio, loss count) restore Markov-ness at fixed dimension, and an attention layer can re-weight them dynamically — "stay invariant by summarizing." The same logic warns against feeding a black-box BBR estimate as state: BBR's output depends on its own internal history, so KL( P(s′|s,a) ‖ P(s′|o,a) ) > 0 — it is not Markov in your observation, and you should measure that information loss before trusting it.

3 · The action space — change rate without oscillating

Intuition. If the agent can jump from 10% to 100% throttle in one step, the link rings like a bell: overshoot, mass loss, collapse, overshoot again. The art of the action space is to make large moves possible but abrupt moves rare — and to give the agent an explicit "hold" so steady state is cheap to maintain.

Engineering detail — log-spaced discretization with action inertia. Bucket the continuous rate on a log scale, not a linear one, so small rates get fine resolution and large rates get coarse resolution: e.g. throttle 0–100% as {0, 1, 3, 7, 15, 31, 63, 100}. This holds the bucket count to ≤16 while bounding the maximum single-step jump Δ_max. Add the previous action aₜ₋₁ to the state so the network carries an "inertia prior" that suppresses adjacent-step bucket flips.

Engineering detail — smoothing at three stages.

Training: use an IQN quantile network with Huber quantile loss instead of MSE, so neighbouring buckets' value distributions overlap and probability mass can't teleport between buckets on a gradient step. Add target-policy smoothing: perturb the chosen action's one-hot with ε-uniform noise (ε=0.1) so the value net sees "neighbour buckets" and doesn't overfit a spike.
Inference: make the softmax temperature T an online-tunable knob (default T=0.5). If monitoring sees bucket-switch frequency exceed a threshold, auto-raise T→1.2 to spread the output, then low-pass it with an EMA: aₜ = 0.7·aₜ₋₁ + 0.3·a_raw — two lines of C, zero added latency on an edge NPU.
Hold: add an explicit KEEP action so "do nothing" is a first-class choice; reusing aₜ₋₁ when KEEP fires removes a whole class of pointless oscillation.

Engineering detail — integrating with the kernel state machine. The learned action must drive a real TCP stack. The pattern: an eBPF program (BPF_PROG_TYPE_STRUCT_OPS) hooks the congestion-control ops, computes per-step reward into a per-CPU reward map, and a user-space RL process pulls batches every 20 ms via the BPF skeleton for off-policy updates — completing one iteration in <50 ms. This requires no hardware change and coexists with existing kernels; reported online gains were +8.7% single-flow throughput, −11% RTT, with zero-loss restart under 8 ms.

4 · Reward — the throughput–latency knife-edge

Intuition. More sending rate fills the queue, which raises delay; less rate empties the queue but wastes the link. The reward is literally throughput − λ·delay, and λ is the policy. Too small and you blow the tail-latency budget; too large and throughput collapses. You want to sit on the "knee" of the trade-off curve and stay there as conditions drift.

Engineering detail — pick the knee, then auto-tune λ slowly. Sweep λ offline and choose the knee — where throughput drops <2% while 99th-percentile delay stays ≤98 ms — giving an online starting value of λ=0.28. Then adapt λ on an hourly cadence, with a bounded step so it can't oscillate:

λ = 0.28 · 1.1^Δt_up · 0.95^Δt_down · |Δλ| ≤ 20% per step · β = 0.01 fixed

If 99p delay exceeded 99 ms over the last 30 min, multiply λ by 1.1 (penalize delay harder); if it fell below 95 ms, multiply by 0.95 (let throughput breathe). Under a 100 ms hard constraint this delivered +12–15% throughput with seven consecutive days of zero violations.

Log-utility goes to −∞ at zero throughput

Using a log utility log(r̂) gives the right diminishing-returns economics, but a momentarily zero throughput makes the gradient explode (1/r → ∞), producing NaNs. Don't just "add an eps" — splice a robust hybrid that is C¹-continuous at the floor and degrades to linear below it, so the gradient is bounded at exactly 1/ε_safe:

L_r = log(r̂) if r ≥ ε_safe ; (r/ε_safe) − 1 + log(ε_safe) if r < ε_safe

Set ε_safe to the business's smallest meaningful unit; one deployment ran 20 days without a single NaN. If the business insists zero must stay zero (real reporting), switch to a bounded utility U(x)=1−e^−αx, defined and finite at x=0 — but warn them the risk-aversion has changed and α must be recalibrated.

5 · Cross-flow fairness — the link is full of other agents

Intuition. Your flow does not own the bottleneck. A purely selfish RL flow learns the optimal predator strategy: grab bandwidth until the legacy TCP flows starve. Operators forbid this — "TCP must not starve" is a hard requirement. So fairness has to be built into the reward, the action space, and the training distribution.

Engineering detail — three layers of "see TCP, yield."

Reward penalty. Add a term proportional to the estimated TCP rate r_TCP (read via eBPF from the co-existing flows' SRTT and loss), λ∈[0.8,1.2]; the gradient mask pushes "aggressive" behaviour toward zero on the backward pass.
ECN back-off gate in the action space. When the agent sees CE-mark ratio >5% and co-flow loss >1%, force cwnd ×0.7 and tag the event "fairness-trigger" — injected via BPF_PROG_TYPE_STRUCT_OPS, no hardware change, vSwitch-compatible.
Offline pre-train + online meta-learning. In a simulator, inject 30% TCP background flows and randomize BDP every episode (10–200 Mbps, RTT 20–200 ms) so the policy learns "see TCP, yield" up front; then a MAML inner-loop runs every 200 ms on real co-flow features, cutting starvation risk below 5% within 1 RTT.

Together these let the RL flow hand 50% of bandwidth to TCP within 100 ms at a self-throughput cost under 8%. Theoretically, when fairness is framed as a potential game, any global maximum of the potential Φ is a Nash equilibrium — so a centralized critic that does gradient ascent on Φ converges to a fair equilibrium, which is the foundation MAPPO/MAAC build on.

6 · The µs wall — inference lives in the datapath

Intuition. At line rate there is no time to call a GPU. The decision has to happen inside the kernel, per-packet, in microseconds. A 1%-better policy that adds latency to the datapath is a net loss.

Engineering detail. Distill the policy into a tiny lookup table — e.g. 16 actions × 256 states, stored as uint16 fixed-point (Q6.10), total ~8 kB so it fits in L1 and respects the eBPF 512-byte stack limit. Pin it read-only in a BPF_MAP_TYPE_ARRAY, sample at an XDP hook with a fully-unrolled 16-iteration loop (~38 instructions, ≈12–15 CPU cycles ≈ 4–5 µs). Measure latency with bpf_ktime_get_ns into a per-CPU histogram: observed P99 = 42 µs, P50 = 28 µs, inside a 50 µs SLA. Online learning stays in user space: eBPF only executes high-confidence actions and ringbufs (s,a,r,s′) batches up; user space does incremental SGD every 100 ms and atomically swaps the map via double-buffering (pointer flip <1 µs, zero-jitter hot update).

Throughput–latency knee: where to set λ

Pick the link, the queue-delay penalty λ, and your tail-latency budget. The napkin model: utilization rises with offered rate but so does queue delay (≈ ρ/(1−ρ) of the base RTT). The agent's per-step reward is throughput − λ·delay; we report the reward-maximizing operating point and whether it stays under your hard 99p bound.

link capacity (Mbps) 200 delay penalty λ 0.28 99p budget (ms) 100

best util ρ*

–

throughput

–

99p delay

–

verdict

–

The through-line

Every section is one row of the MDP table turned into a mechanism: noisy, lagged, non-Markov state → variance-adaptive smoothing + confidence-gated critic + fixed-dim statistics; oscillating action → log buckets, distributional value, EMA, an explicit KEEP; throughput–latency reward → knee selection, slow λ auto-tune, a robust log-floor; multi-agent fairness → reward penalty + ECN back-off + meta-learned "yield"; µs transition deadline → distilled eBPF table with double-buffered hot-swap. You never reached for a tool until a row demanded it.

7 · Guard & iterate — production safety

Intuition. Networks change regime without warning — a 4G→5G handover, a path flap, an adversary deliberately jittering RTT to fool you into being conservative. The controller must detect the regime change and recover, and any model swap must never drop a connection.

Engineering detail. Put online CUSUM change-point detection ahead of the smoothing layer; on a step-shift it force-resets the EWMT history moments so a stale 200 ms "pseudo-high-latency" state can't mislead the policy. For experiment reproducibility, pin the kernel and tc/NetEm versions, lock CPU governor to performance, isolate cores, and validate that the empirical RTT CDF matches the NetEm target by KS statistic (<0.01) — one such harness cut reward variance from 18.7% to 2.3%. For hot updates, use double model slots A/B with a KL gate: if KL(π_old‖π_new) > 0.01 or the importance ratio exceeds 1.2, skip the swap and roll back; release the old model only 30 s later so in-flight requests are never destructed — connection-drop rate stays at zero.

Further considerations

Multi-path QUIC blows up the state. Each subflow emits its own ECN/loss, so the state grows O(n²). Use hierarchical RL: a top level observes an aggregated "path-group" vector, a bottom level refines per-path decisions — bringing the state back to O(n).
Wireless loss vs. congestion loss. On 5G edge links, random radio loss coexists with true ECN-marked congestion and the confidence channel fails. Add RTT-gradient as a third modality and fuse all three with a Bayesian update — one report drove misclassification from 12% to 3%.
Adversarial jitter as a POMDP. An opponent who deliberately injects RTT jitter to make you over-conservative turns smoothing into belief estimation: treat the smoothed RTT as a partial observation and run a recursive Bayesian particle filter over the opponent's latency type, upgrading the problem to a POMDP belief update.
Lagged observations. When the upstream estimator (e.g. BBR) lags 1–2 RTT, predict 2 steps ahead with a time-lagged generative model and let the policy act on the imagined state, MBRL-style.
Datacenter scale & explainability. Training on ~1M concurrent flows needs three-layer decoupling (sampling / transport via GPUDirect-RDMA / a two-level all-reduce tree that cut reduction from 580→42 ms). And for auditability, log attention weights + SHAP attributions of which signal drove each rate change, so a black-box policy can still meet algorithmic-traceability rules.