Network congestion control
A congestion controller decides how fast to send into a link it cannot see, using signals — round-trip time, loss, ECN marks — that are noisy, delayed, and not Markov. Cast as RL, the binding difficulties of this domain are: a partially-observable, non-stationary state built from jittery measurements; an action space that must change rate without oscillating; a throughput-vs-latency reward with a hard tail-latency constraint; a multi-agent fairness requirement (don't starve the TCP flows sharing your link); and a µs-scale inference budget that lives in the kernel datapath. Each names a mechanism.
1 · Formulate — the MDP behind a learned congestion controller
Intuition. A sender wants to push bytes as fast as the path allows without overflowing the bottleneck queue. It never sees the queue directly; it only sees what comes back — acknowledgements, their timing (RTT), losses, and ECN congestion-experienced (CE) marks. So the agent observes symptoms, picks a sending rate or window, and is rewarded for moving a lot of data while keeping delay low. That is an MDP, but a thin and lagged one.
| Piece | For a learned congestion controller | The awkward part |
|---|---|---|
| State S | smoothed RTT, ΔRTT, ECN mark ratio, loss flags, inflight, last action | built from noisy, one-RTT-delayed estimates → effectively partially observable |
| Action A | sending rate / cwnd multiplier, or a discrete throttle level | fine control oscillates; flipping levels every step destabilizes the link |
| Reward R | throughput minus a latency penalty, under a hard tail-latency bound | two objectives that trade off, plus a log-utility term that diverges to −∞ as throughput approaches zero |
| Transition P | the link, the queue, and every other flow sharing the bottleneck | co-existing flows are other agents → non-stationary & multi-agent |
The rightmost column is the lesson. Unlike a game, where the engine gives you a clean frame, here even the state is engineering — half this lesson is just turning raw measurements into something Markov enough to learn on.
2 · Diagnose — the state is an estimate of an estimate
Intuition. RTT samples bounce around: queueing, scheduling, Wi-Fi retransmits, a 4G→5G handover that adds 200 ms in one step. If you feed raw RTT into the policy, the value function learns to chase noise. If you over-smooth, you lag real congestion and react too late. And the deepest problem: a single instantaneous reading is not Markov — packet-level sequences are non-stationary and grow with bandwidth, violating the MDP assumption outright. You must construct a state that is smooth, current, and Markov, all at once.
Engineering detail — adaptive smoothing with a confidence gate. Smooth RTT with an exponentially-weighted moving statistic whose gain αₜ adapts to the measured variance, so it tracks fast when the signal is real and damps hard when it is noise.
The gain is driven by two quantities: vₜ, an online estimate of RTT variance, and v_target, which is set by service class (roughly 0.1 ms for finance, 1 ms for cloud gaming). Large variance shrinks αₜ (trust history, suppress noise); small variance grows it (track the true change-point). The filter lag stays under its theoretical bound of 0.8 frames (<16 ms at 60 FPS), which is what keeps a cloud-gaming controller inside its decision deadline.
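The exact gain law is not given in the text, so the following is only a minimal sketch of one filter with the described behaviour. It assumes the form αₜ = v_target / (v_target + vₜ); the class name, the variance-estimator gain `beta`, and the clamps `alpha_min` / `alpha_max` are hypothetical choices, not values from the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdaptiveRttSmoother:
    """EWMA whose gain shrinks as the measured RTT variance grows."""
    v_target: float            # target variance for the service class (> 0)
    beta: float = 0.1          # gain of the variance estimator (assumed)
    alpha_min: float = 0.05    # clamp so the filter never freezes (assumed)
    alpha_max: float = 0.9     # clamp so one sample never dominates (assumed)
    mean: Optional[float] = None
    var: float = 0.0
    alpha: float = 0.0         # last gain used, kept for inspection

    def update(self, rtt_ms: float) -> float:
        if self.mean is None:          # first sample seeds the filter
            self.mean = rtt_ms
            return self.mean
        err = rtt_ms - self.mean
        # Online variance estimate v_t from the squared prediction error.
        self.var = (1.0 - self.beta) * self.var + self.beta * err * err
        # Assumed gain law: alpha_t = v_target / (v_target + v_t).
        raw = self.v_target / (self.v_target + self.var)
        self.alpha = min(self.alpha_max, max(self.alpha_min, raw))
        self.mean += self.alpha * err
        return self.mean
```

On a quiet trace the gain saturates at `alpha_max` and the filter follows the signal closely; on a jittery trace it falls to `alpha_min` and the filter leans on history.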
Engineering detail — the state vector itself. Don't feed one number; feed a first-order difference plus a confidence mask so the network can tell "the link changed" from "the meter is shaking".
When mask = 0 (high-noise mode) the critic lowers its bootstrap weight so a wrong TD target can't poison the value function, while the actor keeps its output scale stable via PopArt normalization. The whole pipeline runs in TensorRT at <0.05 ms GPU latency, so it costs the trainer nothing.
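As a rough illustration of the idea, the sketch below builds such a state and scales the bootstrap term of a one-step TD target when the mask is 0. The field order, the rule "mask = 1 when variance is at or below v_target", and the `low_conf_weight` of 0.25 are assumptions made for this example; the text does not specify them.

```python
def build_state(srtt_ms, prev_srtt_ms, rtt_var, v_target, prev_action):
    """Return [smoothed RTT, first-order difference, confidence mask, last action]."""
    delta = srtt_ms - prev_srtt_ms
    # Assumed rule: trust the measurement only while variance is within target.
    mask = 1.0 if rtt_var <= v_target else 0.0
    return [srtt_ms, delta, mask, float(prev_action)]


def gated_td_target(reward, next_value, mask, gamma=0.99, low_conf_weight=0.25):
    """One-step TD target whose bootstrap term is down-weighted when mask == 0."""
    weight = 1.0 if mask == 1.0 else low_conf_weight
    return reward + gamma * weight * next_value
```

With `mask = 1` this reduces to the usual `reward + gamma * next_value`; with `mask = 0` a noisy next-state value contributes only a quarter as much to the target.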
Engineering detail — fusing loss and ECN into the state. Loss events are sparse, so a single loss bit carries little signal. Add a saturating consecutive-loss counter and a learned joint-confidence channel so the agent reads a 4-dim summary instead of a raw packet trace.
The fusion weights w₁, w₂ are found by offline Bayesian optimization over 10k real traces, with the objective being the policy's mean reward over a 10-second sliding window. The 4-dim state runs at <1% CPU at 40 Gbps line rate, and in A/B tests against Cubic it raised throughput by 22% in incast scenarios while cutting queueing delay by 35%.
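The composition of the 4-dim summary is not spelled out, so this is a hedged sketch of one plausible layout: loss flag, saturating consecutive-loss level, ECN mark ratio, and a joint-confidence value. The logistic combination of w₁ and w₂, the saturation cap of 8, and the class name are all hypothetical.

```python
import math


class LossEcnFuser:
    """Turns per-interval loss/ECN counters into a small fixed-size summary."""

    def __init__(self, w1: float, w2: float, cap: int = 8):
        self.w1, self.w2, self.cap = w1, w2, cap   # cap is an assumed limit
        self.consecutive = 0

    def update(self, lost: bool, ce_marked: int, acked: int):
        # Saturating counter: grows on each lossy interval, resets on a clean one.
        self.consecutive = min(self.cap, self.consecutive + 1) if lost else 0
        loss_level = self.consecutive / self.cap
        ecn_ratio = ce_marked / acked if acked > 0 else 0.0
        # Assumed fusion: logistic of a weighted sum of the two congestion signals.
        z = self.w1 * ecn_ratio + self.w2 * loss_level
        confidence = 1.0 / (1.0 + math.exp(-z))
        return (float(lost), loss_level, ecn_ratio, confidence)
```

The saturating counter lets a run of losses register as a graded signal without letting a long outage push the feature outside [0, 1].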
3 · The action space — change rate without oscillating
Intuition. If the agent can jump from 10% to 100% throttle in one step, the link rings like a bell: overshoot, mass loss, collapse, overshoot again. The art of the action space is to make large moves possible but abrupt moves rare — and to give the agent an explicit "hold" so steady state is cheap to maintain.
Engineering detail — log-spaced discretization with action inertia. Bucket the continuous rate on a log scale, not a linear one, so small rates get fine resolution and large rates get coarse resolution: e.g. throttle 0–100% as {0, 1, 3, 7, 15, 31, 63, 100}. This holds the bucket count to ≤16 while bounding the maximum single-step jump Δmax. Add the previous action aₜ₋₁ to the state so the network carries an "inertia prior" that suppresses adjacent-step bucket flips.
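A small sketch of the bucket scheme, using the eight levels quoted above. Clamping the move to one bucket index per step is an assumed way to realise the bound Δmax; the function names and the default `max_step` are illustrative only.

```python
LEVELS = [0, 1, 3, 7, 15, 31, 63, 100]   # throttle levels in percent, from the text


def nearest_bucket(rate_pct: float) -> int:
    """Index of the level closest to a continuous rate (first wins on ties)."""
    return min(range(len(LEVELS)), key=lambda i: abs(LEVELS[i] - rate_pct))


def step_toward(prev_idx: int, target_idx: int, max_step: int = 1) -> int:
    """Move toward the target bucket, but by at most max_step indices."""
    move = max(-max_step, min(max_step, target_idx - prev_idx))
    return prev_idx + move


def state_with_inertia(features, prev_idx: int):
    """Append the previous action so the policy can see what it last chose."""
    return list(features) + [prev_idx / (len(LEVELS) - 1)]
```

Because the levels are log-spaced, a one-index step is a small absolute change at low rates (1% to 3%) and a large one at high rates (63% to 100%), so the relative change per step stays roughly constant.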
Engineering detail — smoothing at three stages.
- Training: use an IQN quantile network with Huber quantile loss instead of MSE, so neighbouring buckets' value distributions overlap and probability mass can't teleport between buckets on a gradient step. Add target-policy smoothing: perturb the chosen action's one-hot with ε-uniform noise (ε=0.1) so the value net sees "neighbour buckets" and doesn't overfit a spike.
- Inference: make the softmax temperature T an online-tunable knob (default T=0.5). If monitoring sees bucket-switch frequency exceed a threshold, auto-raise T→1.2 to spread the output, then low-pass it with an EMA: aₜ = 0.7·aₜ₋₁ + 0.3·a_raw. That is two lines of C, with zero added latency on an edge NPU.
- Hold: add an explicit KEEP action so "do nothing" is a first-class choice; reusing aₜ₋₁ when KEEP fires removes a whole class of pointless oscillation.
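The inference-time pieces of the list above can be sketched as follows. The temperatures (0.5, 1.2) and EMA coefficients (0.7, 0.3) come from the text; the function names and the switch-rate threshold argument are placeholders, and the sketch is written in Python for readability rather than the C the text has in mind.

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax; higher temperature flattens the distribution."""
    peak = max(logits)
    exps = [math.exp((x - peak) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_temperature(switch_rate, threshold, base=0.5, raised=1.2):
    """Raise T when the monitored bucket-switch frequency exceeds the threshold."""
    return raised if switch_rate > threshold else base


def smooth_action(prev_action, raw_action, keep=False):
    """EMA low-pass on the emitted action; KEEP simply reuses the previous one."""
    if keep:
        return prev_action
    return 0.7 * prev_action + 0.3 * raw_action
```

For example, with a previous action of 10 and a raw output of 20 the emitted action moves only to about 13, and a KEEP leaves it at 10.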
Engineering detail — integrating with the kernel state machine. The learned action must drive a real TCP stack. The pattern: an eBPF program (BPF_PROG_TYPE_STRUCT_OPS) hooks the congestion-control ops, computes per-step reward into a per-CPU reward map, and a user-space RL process pulls batches every 20 ms via the BPF skeleton for off-policy updates — completing one iteration in <50 ms. This requires no hardware change and coexists with existing kernels; reported online gains were +8.7% single-flow throughput, −11% RTT, with zero-loss restart under 8 ms.
4 · Reward — the throughput–latency knife-edge
Intuition. More sending rate fills the queue, which raises delay; less rate empties the queue but wastes the link. The reward is literally throughput − λ·delay, and λ is the policy. Too small and you blow the tail-latency budget; too large and throughput collapses. You want to sit on the "knee" of the trade-off curve and stay there as conditions drift.
Engineering detail — pick the knee, then auto-tune λ slowly. Sweep λ offline and choose the knee — where throughput drops <2% while 99th-percentile delay stays ≤98 ms — giving an online starting value of λ=0.28. Then adapt λ on an hourly cadence, with a bounded step so it can't oscillate:
If 99p delay exceeded 99 ms over the last 30 min, multiply λ by 1.1 (penalize delay harder); if it fell below 95 ms, multiply by 0.95 (let throughput breathe). Under a 100 ms hard constraint this delivered +12–15% throughput with seven consecutive days of zero violations.
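The update rule just described fits in a few lines. The thresholds (99 ms, 95 ms), the multipliers (1.1, 0.95) and the starting value 0.28 are from the text; the clamp range `lam_min` / `lam_max` is an added assumption to keep λ bounded, and the function names are illustrative.

```python
def reward(throughput, delay_ms, lam):
    """Scalar reward: throughput minus a lambda-weighted delay penalty."""
    return throughput - lam * delay_ms


def update_lambda(lam, p99_delay_ms, high_ms=99.0, low_ms=95.0,
                  lam_min=0.05, lam_max=2.0):
    """One slow-cadence adjustment of lambda from the recent P99 delay."""
    if p99_delay_ms > high_ms:
        lam *= 1.1      # over budget: penalize delay harder
    elif p99_delay_ms < low_ms:
        lam *= 0.95     # comfortable headroom: let throughput breathe
    # Between the two thresholds lambda is left alone (dead band).
    return max(lam_min, min(lam_max, lam))
```

The dead band between 95 ms and 99 ms is what keeps λ from hunting: only a sustained excursion outside it changes the penalty.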
5 · Cross-flow fairness — the link is full of other agents
Intuition. Your flow does not own the bottleneck. A purely selfish RL flow learns the optimal predator strategy: grab bandwidth until the legacy TCP flows starve. Operators forbid this — "TCP must not starve" is a hard requirement. So fairness has to be built into the reward, the action space, and the training distribution.
Engineering detail — three layers of "see TCP, yield."
- Reward penalty. Add a term proportional to the estimated TCP rate r_TCP (read via eBPF from the co-existing flows' SRTT and loss), with coefficient λ∈[0.8,1.2]; the gradient mask pushes "aggressive" behaviour toward zero on the backward pass.
- ECN back-off gate in the action space. When the agent sees CE-mark ratio >5% and co-flow loss >1%, force cwnd ×0.7 and tag the event "fairness-trigger". The gate is injected via BPF_PROG_TYPE_STRUCT_OPS, needs no hardware change, and is vSwitch-compatible.
- Offline pre-train + online meta-learning. In a simulator, inject 30% TCP background flows and randomize BDP every episode (10–200 Mbps, RTT 20–200 ms) so the policy learns "see TCP, yield" up front; then a MAML inner-loop runs every 200 ms on real co-flow features, cutting starvation risk below 5% within 1 RTT.
Together these let the RL flow hand 50% of bandwidth to TCP within 100 ms at a self-throughput cost under 8%. Theoretically, when fairness is framed as a potential game, any global maximum of the potential Φ is a Nash equilibrium. A centralized critic that does gradient ascent on Φ is therefore driven toward an equilibrium, and that equilibrium is fair to the extent that Φ encodes fairness; this is the foundation MAPPO/MAAC build on.
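The ECN back-off gate from the list above is the simplest of the three layers to show in isolation. This sketch uses the thresholds and the ×0.7 factor from the text; the minimum window of 2 segments and the function name are assumptions, and a real deployment would express the same logic inside the struct_ops program rather than in Python.

```python
MIN_CWND = 2   # assumed floor, in segments


def fairness_gate(cwnd, ce_ratio, coflow_loss):
    """Override the learned action when co-existing flows show distress.

    Returns (new_cwnd, triggered). When triggered is True the caller should
    tag the event as "fairness-trigger".
    """
    if ce_ratio > 0.05 and coflow_loss > 0.01:
        # Integer arithmetic for the 0.7 factor, as a kernel datapath would use.
        return max(MIN_CWND, cwnd * 7 // 10), True
    return cwnd, False
```

Both conditions must hold, so ECN marks on an otherwise healthy link do not trip the gate by themselves.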
6 · The µs wall — inference lives in the datapath
Intuition. At line rate there is no time to call a GPU. The decision has to happen inside the kernel, per-packet, in microseconds. A 1%-better policy that adds latency to the datapath is a net loss.
Engineering detail. Distill the policy into a tiny lookup table, e.g. 16 actions × 256 states stored as uint16 fixed-point (Q6.10), about 8 KiB in total, so it fits in L1. Keep it in a map rather than on the stack, which is how it stays clear of the eBPF 512-byte stack limit. Pin it read-only in a BPF_MAP_TYPE_ARRAY and sample at an XDP hook with a fully-unrolled 16-iteration loop (~38 instructions, ≈12–15 CPU cycles, which at a ~3 GHz clock is on the order of 4–5 ns, not microseconds). Measure latency with bpf_ktime_get_ns into a per-CPU histogram: observed P99 = 42 µs, P50 = 28 µs, inside a 50 µs SLA. Those figures are far larger than the table read alone, so they presumably cover more of the decision path than the lookup. Online learning stays in user space: eBPF executes only high-confidence actions and pushes (s,a,r,s′) batches up through a ring buffer; user space does incremental SGD every 100 ms and atomically swaps the map via double-buffering (pointer flip <1 µs, zero-jitter hot update).
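To make the table layout concrete, here is a user-space sketch of the distillation step: quantize per-state action values to unsigned Q6.10 and pick the best action with a 16-entry scan. It assumes the values have already been shifted into the representable range [0, 64); the function names are illustrative, and the kernel side would read the same flat layout from the array map.

```python
from array import array

N_STATES, N_ACTIONS, FRAC_BITS = 256, 16, 10


def to_q6_10(x: float) -> int:
    """Quantize to unsigned Q6.10 (6 integer bits, 10 fractional), saturating."""
    scaled = round(x * (1 << FRAC_BITS))
    return max(0, min(0xFFFF, scaled))


def build_table(q_values) -> array:
    """Flatten a [state][action] grid of values into one uint16 array."""
    flat = array("H")
    for row in q_values:
        flat.extend(to_q6_10(v) for v in row)
    return flat


def act(table: array, state: int) -> int:
    """Greedy action for a state: scan its 16 entries and return the argmax."""
    base = state * N_ACTIONS
    row = table[base:base + N_ACTIONS]
    return max(range(N_ACTIONS), key=row.__getitem__)
```

With 256 × 16 entries of 2 bytes each, the table is 8192 bytes, which matches the ~8 KiB figure above.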
7 · Guard & iterate — production safety
Intuition. Networks change regime without warning — a 4G→5G handover, a path flap, an adversary deliberately jittering RTT to fool you into being conservative. The controller must detect the regime change and recover, and any model swap must never drop a connection.
Engineering detail. Put online CUSUM change-point detection ahead of the smoothing layer; on a step-shift it force-resets the EWMA history moments so a stale 200 ms "pseudo-high-latency" state can't mislead the policy. For experiment reproducibility, pin the kernel and tc/NetEm versions, lock the CPU governor to performance, isolate cores, and validate that the empirical RTT CDF matches the NetEm target by KS statistic (<0.01); one such harness cut reward variance from 18.7% to 2.3%. For hot updates, use double model slots A/B with a KL gate: if KL(π_old‖π_new) > 0.01 or the importance ratio exceeds 1.2, skip the swap and roll back. Release the old model only 30 s later so in-flight requests are never torn down, which keeps the connection-drop rate at zero.
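A minimal sketch of the hot-update gate, using the two thresholds quoted above. It assumes both policies are evaluated on the same set of probe states and that the importance ratio means π_new(a|s) / π_old(a|s) checked over every action; the function names and the `eps` guard against division by zero are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def should_swap(old_policy, new_policy, kl_max=0.01, ratio_max=1.2, eps=1e-12):
    """Approve the swap only if every probe state passes both checks.

    old_policy / new_policy: lists of action distributions, one per probe state.
    """
    for p_old, p_new in zip(old_policy, new_policy):
        if kl_divergence(p_old, p_new) > kl_max:
            return False
        for po, pn in zip(p_old, p_new):
            if pn / max(po, eps) > ratio_max:
                return False
    return True
```

A candidate that shifts a 50/50 policy to 52/48 passes both checks, while one that shifts it to 90/10 is rejected and the old slot stays live.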
Further considerations
- Multi-path QUIC blows up the state. Each subflow emits its own ECN/loss, so the state grows O(n²). Use hierarchical RL: a top level observes an aggregated "path-group" vector, a bottom level refines per-path decisions — bringing the state back to O(n).
- Wireless loss vs. congestion loss. On 5G edge links, random radio loss coexists with true ECN-marked congestion and the confidence channel fails. Add RTT-gradient as a third modality and fuse all three with a Bayesian update — one report drove misclassification from 12% to 3%.
- Adversarial jitter as a POMDP. An opponent who deliberately injects RTT jitter to make you over-conservative turns smoothing into belief estimation: treat the smoothed RTT as a partial observation and run a recursive Bayesian particle filter over the opponent's latency type, upgrading the problem to a POMDP belief update.
- Lagged observations. When the upstream estimator (e.g. BBR) lags 1–2 RTT, predict 2 steps ahead with a time-lagged generative model and let the policy act on the imagined state, MBRL-style.
- Datacenter scale & explainability. Training on ~1M concurrent flows needs three-layer decoupling (sampling / transport via GPUDirect-RDMA / a two-level all-reduce tree that cut reduction from 580→42 ms). And for auditability, log attention weights + SHAP attributions of which signal drove each rate change, so a black-box policy can still meet algorithmic-traceability rules.