all lessons / reinforcement learning / 73 · Network congestion control lesson 73 / 87

Network congestion control

A congestion controller decides how fast to send into a link it cannot see, using signals — round-trip time, loss, ECN marks — that are noisy, delayed, and not Markov. Cast as RL, the binding difficulties of this domain are: a partially-observable, non-stationary state built from jittery measurements; an action space that must change rate without oscillating; a throughput-vs-latency reward with a hard tail-latency constraint; a multi-agent fairness requirement (don't starve the TCP flows sharing your link); and a µs-scale inference budget that lives in the kernel datapath. Each names a mechanism.

The method — five steps, every lesson
Same loop as always. (1) Formulate the MDP — state, action, reward, transition, horizon. (2) Diagnose the properties that make this MDP hard. (3) Engineer the mechanism that removes exactly those difficulties. (4) Guard it in production — detect drift, monitor, fall back. (5) Iterate. For congestion control the first difficulty is that your "state" is an estimate of an estimate: the network is invisible, and everything you measure arrived one RTT too late.

1 · Formulate — the MDP behind a learned congestion controller

Intuition. A sender wants to push bytes as fast as the path allows without overflowing the bottleneck queue. It never sees the queue directly; it only sees what comes back — acknowledgements, their timing (RTT), losses, and ECN congestion-experienced (CE) marks. So the agent observes symptoms, picks a sending rate or window, and is rewarded for moving a lot of data while keeping delay low. That is an MDP, but a thin and lagged one.

MDP = (S, A, R, P, γ)  ·  maximize  E[ Σₜ γᵗ ( throughputₜ − λ·delayₜ ) ]
PieceFor a learned congestion controllerThe awkward part
State Ssmoothed RTT, ΔRTT, ECN mark ratio, loss flags, inflight, last actionbuilt from noisy, one-RTT-delayed estimates → effectively partially observable
Action Asending rate / cwnd multiplier, or a discrete throttle levelfine control oscillates; flipping levels every step destabilizes the link
Reward Rthroughput minus a latency penalty, under a hard tail-latency boundtwo objectives that trade off, plus log-utility that vanishes at zero
Transition Pthe link, the queue, and every other flow sharing the bottleneckco-existing flows are other agents → non-stationary & multi-agent

The rightmost column is the lesson. Unlike a game, where the engine gives you a clean frame, here even the state is engineering — half this lesson is just turning raw measurements into something Markov enough to learn on.

2 · Diagnose — the state is an estimate of an estimate

Intuition. RTT samples bounce around: queueing, scheduling, Wi-Fi retransmits, a 4G→5G handover that adds 200 ms in one step. If you feed raw RTT into the policy, the value function learns to chase noise. If you over-smooth, you lag real congestion and react too late. And the deepest problem: a single instantaneous reading is not Markov — packet-level sequences are non-stationary and grow with bandwidth, violating the MDP assumption outright. You must construct a state that is smooth, current, and Markov, all at once.

Engineering detail — adaptive smoothing with a confidence gate. Smooth RTT with an exponentially-weighted moving statistic whose gain adapts to the measured variance, so it tracks fast when the signal is real and damps hard when it is noise:

αₜ = 1 / ( 1 + exp( −(vₜ − v_target)/κ ) )  ·  RTT_smoothₜ = αₜ·RTTₜ + (1−αₜ)·RTT_smoothₜ₋₁

where vₜ is an online estimate of RTT variance and v_target is set by service class — roughly 0.1 ms for finance, 1 ms for cloud gaming. Large variance shrinks αₜ (trust history, suppress noise); small variance grows it (track the true change-point). The filter lag stays under its theoretical bound of 0.8 frames (<16 ms at 60 FPS), which is what keeps a cloud-gaming controller inside its decision deadline.

Engineering detail — the state vector itself. Don't feed one number; feed a first-order difference plus a confidence mask so the network can tell "the link changed" from "the meter is shaking":

sₜ = [ RTT_smooth, ΔRTT, maskₜ ]  ·  maskₜ = 𝟙{ σₜ < σ_max }

When mask = 0 (high-noise mode) the critic lowers its bootstrap weight so a wrong TD target can't poison the value function, while the actor keeps its output scale stable via PopArt normalization. The whole pipeline runs in TensorRT at <0.05 ms GPU latency, so it costs the trainer nothing.

Engineering detail — fusing loss and ECN into the state. Loss events are sparse, so a single loss bit carries little signal. Add a saturating consecutive-loss counter and a learned joint-confidence channel so the agent reads a 4-dim summary instead of a raw packet trace:

loss_cntₜ = min(loss_cntₜ₋₁ + 1, 7)  ·  confₜ = σ( w₁·ecn_smoothₜ − w₂·loss_cntₜ )  ·  sₜ = [ ecn_smooth, loss_bin, loss_cnt, conf ]

The weights w₁, w₂ are found by offline Bayesian optimization over 10k real traces, with the objective being the policy's mean reward over a 10-second sliding window. The 4-dim state runs at <1% CPU at 40 Gbps line rate, and in A/B tests against Cubic it raised throughput by 22% in incast scenarios while cutting queueing delay 35%.

Why statistics, not raw packets — the Markov argument
If an interviewer asks "why not feed the raw packet-level sequence?", the answer is exactly the MDP assumption: a packet sequence is non-stationary and its dimension grows linearly with bandwidth, so it is not a fixed-shape Markov state. Hand-built statistics (smoothed RTT, ECN ratio, loss count) restore Markov-ness at fixed dimension, and an attention layer can re-weight them dynamically — "stay invariant by summarizing." The same logic warns against feeding a black-box BBR estimate as state: BBR's output depends on its own internal history, so KL( P(s′|s,a) ‖ P(s′|o,a) ) > 0 — it is not Markov in your observation, and you should measure that information loss before trusting it.

3 · The action space — change rate without oscillating

Intuition. If the agent can jump from 10% to 100% throttle in one step, the link rings like a bell: overshoot, mass loss, collapse, overshoot again. The art of the action space is to make large moves possible but abrupt moves rare — and to give the agent an explicit "hold" so steady state is cheap to maintain.

Engineering detail — log-spaced discretization with action inertia. Bucket the continuous rate on a log scale, not a linear one, so small rates get fine resolution and large rates get coarse resolution: e.g. throttle 0–100% as {0, 1, 3, 7, 15, 31, 63, 100}. This holds the bucket count to ≤16 while bounding the maximum single-step jump Δmax. Add the previous action aₜ₋₁ to the state so the network carries an "inertia prior" that suppresses adjacent-step bucket flips.

Engineering detail — smoothing at three stages.

Engineering detail — integrating with the kernel state machine. The learned action must drive a real TCP stack. The pattern: an eBPF program (BPF_PROG_TYPE_STRUCT_OPS) hooks the congestion-control ops, computes per-step reward into a per-CPU reward map, and a user-space RL process pulls batches every 20 ms via the BPF skeleton for off-policy updates — completing one iteration in <50 ms. This requires no hardware change and coexists with existing kernels; reported online gains were +8.7% single-flow throughput, −11% RTT, with zero-loss restart under 8 ms.

1 · FORMULATE RTT, ECN, loss → rate / cwnd 2 · DIAGNOSE noisy, lagged, non-Markov state 3 · ENGINEER adaptive smoothing, log buckets, fairness 4 · GUARD CUSUM reset, KL hot-swap 5 · ITERATE re-diagnose removing one difficulty exposes the next — re-run the loop

4 · Reward — the throughput–latency knife-edge

Intuition. More sending rate fills the queue, which raises delay; less rate empties the queue but wastes the link. The reward is literally throughput − λ·delay, and λ is the policy. Too small and you blow the tail-latency budget; too large and throughput collapses. You want to sit on the "knee" of the trade-off curve and stay there as conditions drift.

Engineering detail — pick the knee, then auto-tune λ slowly. Sweep λ offline and choose the knee — where throughput drops <2% while 99th-percentile delay stays ≤98 ms — giving an online starting value of λ=0.28. Then adapt λ on an hourly cadence, with a bounded step so it can't oscillate:

λ = 0.28 · 1.1Δt_up · 0.95Δt_down  ·  |Δλ| ≤ 20% per step  ·  β = 0.01 fixed

If 99p delay exceeded 99 ms over the last 30 min, multiply λ by 1.1 (penalize delay harder); if it fell below 95 ms, multiply by 0.95 (let throughput breathe). Under a 100 ms hard constraint this delivered +12–15% throughput with seven consecutive days of zero violations.

Log-utility goes to −∞ at zero throughput
Using a log utility log(r̂) gives the right diminishing-returns economics, but a momentarily zero throughput makes the gradient explode (1/r → ∞), producing NaNs. Don't just "add an eps" — splice a robust hybrid that is C¹-continuous at the floor and degrades to linear below it, so the gradient is bounded at exactly 1/ε_safe:
L_r = log(r̂)  if  r ≥ ε_safe  ;  (r/ε_safe) − 1 + log(ε_safe)  if  r < ε_safe
Set ε_safe to the business's smallest meaningful unit; one deployment ran 20 days without a single NaN. If the business insists zero must stay zero (real reporting), switch to a bounded utility U(x)=1−e−αx, defined and finite at x=0 — but warn them the risk-aversion has changed and α must be recalibrated.

5 · Cross-flow fairness — the link is full of other agents

Intuition. Your flow does not own the bottleneck. A purely selfish RL flow learns the optimal predator strategy: grab bandwidth until the legacy TCP flows starve. Operators forbid this — "TCP must not starve" is a hard requirement. So fairness has to be built into the reward, the action space, and the training distribution.

Engineering detail — three layers of "see TCP, yield."

Together these let the RL flow hand 50% of bandwidth to TCP within 100 ms at a self-throughput cost under 8%. Theoretically, when fairness is framed as a potential game, any global maximum of the potential Φ is a Nash equilibrium — so a centralized critic that does gradient ascent on Φ converges to a fair equilibrium, which is the foundation MAPPO/MAAC build on.

6 · The µs wall — inference lives in the datapath

Intuition. At line rate there is no time to call a GPU. The decision has to happen inside the kernel, per-packet, in microseconds. A 1%-better policy that adds latency to the datapath is a net loss.

Engineering detail. Distill the policy into a tiny lookup table — e.g. 16 actions × 256 states, stored as uint16 fixed-point (Q6.10), total ~8 kB so it fits in L1 and respects the eBPF 512-byte stack limit. Pin it read-only in a BPF_MAP_TYPE_ARRAY, sample at an XDP hook with a fully-unrolled 16-iteration loop (~38 instructions, ≈12–15 CPU cycles ≈ 4–5 µs). Measure latency with bpf_ktime_get_ns into a per-CPU histogram: observed P99 = 42 µs, P50 = 28 µs, inside a 50 µs SLA. Online learning stays in user space: eBPF only executes high-confidence actions and ringbufs (s,a,r,s′) batches up; user space does incremental SGD every 100 ms and atomically swaps the map via double-buffering (pointer flip <1 µs, zero-jitter hot update).

Throughput–latency knee: where to set λ

Pick the link, the queue-delay penalty λ, and your tail-latency budget. The napkin model: utilization rises with offered rate but so does queue delay (≈ ρ/(1−ρ) of the base RTT). The agent's per-step reward is throughput − λ·delay; we report the reward-maximizing operating point and whether it stays under your hard 99p bound.

best util ρ*
throughput
99p delay
verdict

The through-line
Every section is one row of the MDP table turned into a mechanism: noisy, lagged, non-Markov state → variance-adaptive smoothing + confidence-gated critic + fixed-dim statistics; oscillating action → log buckets, distributional value, EMA, an explicit KEEP; throughput–latency reward → knee selection, slow λ auto-tune, a robust log-floor; multi-agent fairness → reward penalty + ECN back-off + meta-learned "yield"; µs transition deadline → distilled eBPF table with double-buffered hot-swap. You never reached for a tool until a row demanded it.

7 · Guard & iterate — production safety

Intuition. Networks change regime without warning — a 4G→5G handover, a path flap, an adversary deliberately jittering RTT to fool you into being conservative. The controller must detect the regime change and recover, and any model swap must never drop a connection.

Engineering detail. Put online CUSUM change-point detection ahead of the smoothing layer; on a step-shift it force-resets the EWMT history moments so a stale 200 ms "pseudo-high-latency" state can't mislead the policy. For experiment reproducibility, pin the kernel and tc/NetEm versions, lock CPU governor to performance, isolate cores, and validate that the empirical RTT CDF matches the NetEm target by KS statistic (<0.01) — one such harness cut reward variance from 18.7% to 2.3%. For hot updates, use double model slots A/B with a KL gate: if KL(π_old‖π_new) > 0.01 or the importance ratio exceeds 1.2, skip the swap and roll back; release the old model only 30 s later so in-flight requests are never destructed — connection-drop rate stays at zero.

Further considerations