Cloud-resource elastic scaling
An autoscaler decides, every few seconds, how many container instances to run and how much CPU to grant each one, trading a latency SLA against a cloud bill. It looks like a textbook control problem, but the binding difficulties are RL-specific: a hybrid action space (how many instances × how much CPU each), a delayed transition because a new instance is not ready until it cold-starts, a two-objective reward (SLA vs. cost) that invites both reward-hacking and oscillation, non-stationary traffic that breaks the stationary-MDP assumption, and a hard budget that the policy may never violate. Each of these difficulties calls for its own tool, introduced in the sections below.
1 · Formulate — the MDP behind an autoscaler
Intuition. Every few seconds the scaler reads cluster telemetry (CPU, memory, queue depth, request latency), decides to add/remove instances and reset per-container CPU limits, and is rewarded if requests stayed fast and the bill stayed low. That is a Markov Decision Process — and every difficulty below is one of these four pieces being awkward in a real cluster.
| Piece | For an elastic autoscaler | The awkward part |
|---|---|---|
| State S | raw metrics from thousands of containers, fused to a ~16-dim bottleneck vector (CPU saturation, memory slope, queue depth, p99) + request-type tags | raw dim is enormous; naive tag one-hots cause a curse of dimensionality; predicted traffic risks leaking the future |
| Action A | a_d = instance count Δ (discrete) and a_c = per-container CPU limit (continuous) | hybrid discrete-continuous; most options are infeasible; raw output oscillates |
| Reward R | SLA attainment (p99 ≤ target) minus weighted cost (instances × price, amortized reserved-instance discount) | two objectives in tension; sparse violations; cost has lumpy discounts |
| Transition P | add an instance now → it serves traffic only after a cold-start delay τ; traffic itself drifts daily and on promo events | action effect is delayed; environment is non-stationary |
The rightmost column is the lesson. Each awkward part below gets a named mechanism — and we never reach for the mechanism before the row demands it.
2 · State — fuse it, embed it, and keep it causal
Intuition. You cannot feed thousands of raw container metrics into a policy network: the input is high-dimensional, noisy, and changes shape as containers come and go. You compress it into a small fixed-width vector that still carries the bottleneck signal: how saturated is CPU, is memory climbing, how deep is the queue, how bad is tail latency. A smaller, more stable state means faster, more reliable training.
Engineering detail — the 16-dim bottleneck vector. Per-rack metrics are aggregated (a GRU over the time series gives an 8-dim bottleneck embedding), then concatenated with container-spec and deployment-unit features into a final 16-dim state. This keeps the real-time bottleneck signal while satisfying a production rule that the PPO input dimension stay ≤ 32. In one deployment this cut training-convergence time by ~40%, and a Kolmogorov–Smirnov test showed distribution drift < 0.02 — under the SRE team's ≤ 0.05 go-live threshold.
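The drift gate mentioned above can be sketched as a two-sample Kolmogorov–Smirnov check. This is a minimal illustration, assuming the test is run on one state dimension at a time against a reference sample from training; the function names `ks_statistic` and `drift_ok` are hypothetical, and only the 0.05 threshold comes from the text.

```python
from bisect import bisect_right

GO_LIVE_THRESHOLD = 0.05  # SRE go-live threshold quoted in the text


def ks_statistic(reference, live):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(live)
    gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # is always attained at one of the sample points.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def drift_ok(reference, live, threshold=GO_LIVE_THRESHOLD):
    """True when the measured drift is within the go-live threshold."""
    return ks_statistic(reference, live) <= threshold
```

In use, `reference` would be the values of one state dimension in the training set and `live` the same dimension sampled from production telemetry.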
Engineering detail — request-type tags without the curse of dimensionality. Request types (API, batch, streaming, …) are categorical and can explode the one-hot width. Use a learned embedding table with hashing instead of one-hots, and adapt it to RL's distribution drift with a meta-finetune (MAML-style): each offline round samples "tasks" (different time windows), updates the embedding on a support set, computes the policy gradient on a query set, and back-propagates the second-order term only into the gate network — so the embedding hot-updates hourly without breaking global semantics. The whole table stays < 200 MB (fits one GPU), and a lookup costs ~0.8 ms at inference.
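The hashing half of this idea fits in a few lines. The sketch below is an assumption-laden illustration, not the deployed design: the table size, embedding width, and the names `bucket_of` and `embed` are hypothetical, the rows are random placeholders rather than learned weights, and the MAML-style meta-finetune is not shown.

```python
import hashlib
import random

NUM_BUCKETS = 1024  # hypothetical table size
EMBED_DIM = 4       # hypothetical embedding width

_rng = random.Random(0)
# In training these rows would be learned; here they are random placeholders.
embedding_table = [
    [_rng.gauss(0.0, 0.1) for _ in range(EMBED_DIM)]
    for _ in range(NUM_BUCKETS)
]


def bucket_of(request_type: str) -> int:
    """Map a tag to a bucket with a stable hash (unlike Python's salted hash())."""
    digest = hashlib.sha256(request_type.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def embed(request_type: str) -> list:
    """Fixed-width vector for any tag, including tags never seen in training."""
    return embedding_table[bucket_of(request_type)]
```

The state width stays at `EMBED_DIM` no matter how many request types appear, at the price of occasional hash collisions between rare tags.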
3 · The hybrid action space — keep the structure, mask the infeasible, damp the rest
Intuition. "Add one instance and set its CPU limit to 2.3 cores" is two decisions of different types: how many instances (a discrete count) and how much CPU each (a continuous scalar). Flattening both into one giant menu explodes; treating both as continuous loses the count's discreteness. Keep the structure: a discrete head for the count and a count-conditioned continuous head for the CPU limit.
Engineering detail — one encoder, two heads. A shared encoder emits the state feature h. The discrete head is a linear layer → logits → Categorical over count changes (e.g. "scale out 1"). The continuous head takes h concatenated with one-hot(a_d), passes it through two MLP layers, and outputs (μ, log σ) → Gaussian for the CPU limit. Sample a_d first, then a_c. Train with a combined objective: clipped PPO on the continuous part with the reparameterization trick, the PPO ratio on the discrete part with a Gumbel-Softmax gradient estimator (temperature 1.0), and an entropy bonus. Schematically, J(θ) = J_clip(a_d) + J_clip(a_c) + c_H·H[π_θ(· | s)], where c_H is the entropy weight.
Engineering detail — masking and hard penalties. Add an action mask after the logits: if free cores < 4, set the logit of "scale out 2" to −∞ before the softmax, so its probability is exactly 0 and it can never be sampled. Add a hard reward penalty: if CPU over-commit drives p99 > 200 ms, return −1000 immediately. Deployed on a managed-Kubernetes cluster, this lifted elastic-decision accuracy by ~18%, raised CPU utilization by ~12%, and held QoS-violation rate below 0.3%.
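A minimal sketch of masked, two-stage sampling follows. It assumes a hypothetical four-entry menu of count changes and 2 cores per new instance (so "scale out 2" needs 4 free cores, matching the rule above). A per-count lookup of (μ, log σ) stands in for the count-conditioned MLP head, and the 0.1-core floor is also an assumption.

```python
import math
import random

COUNT_DELTAS = [-1, 0, 1, 2]  # hypothetical menu of instance-count changes
CORES_PER_INSTANCE = 2        # hypothetical: each new instance needs 2 free cores


def feasibility_mask(free_cores):
    """True where the action fits; scale-in and no-op are always feasible."""
    return [d <= 0 or d * CORES_PER_INSTANCE <= free_cores for d in COUNT_DELTAS]


def masked_softmax(logits, mask):
    """Softmax after setting infeasible logits to -inf (their probability is 0)."""
    masked = [l if ok else -math.inf for l, ok in zip(logits, mask)]
    peak = max(masked)  # finite, because the no-op is never masked
    exps = [math.exp(l - peak) for l in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def sample_hybrid_action(logits, mu_per_count, log_sigma_per_count, free_cores, rng):
    """Sample the discrete count first, then the CPU limit conditioned on it."""
    probs = masked_softmax(logits, feasibility_mask(free_cores))
    idx = rng.choices(range(len(COUNT_DELTAS)), weights=probs, k=1)[0]
    # Stand-in for the conditioned continuous head: one (mu, log sigma) per count.
    cpu_limit = rng.gauss(mu_per_count[idx], math.exp(log_sigma_per_count[idx]))
    return COUNT_DELTAS[idx], max(cpu_limit, 0.1)  # keep the limit positive
```

Masking at the logit level means the infeasible action contributes no probability mass, so the log-probabilities used in the PPO ratio stay consistent with what can actually be sampled.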
4 · Reward — two objectives, kept causal and policy-invariant
Intuition. You want low latency and low cost, and the two fight: the cheapest cluster is empty and slow, the fastest is huge and expensive. You also must measure the reward the instant a request finishes (no waiting for batch stats) and you must avoid future information. The honest signal is "did we hit the SLA, at what cost," and the engineering trick is to shape it so the optimum you train toward is still the optimum you meant.
Engineering detail — a measurable, unbiased, invariant reward. The per-request latency lₜ is measurable the instant the request returns (immediacy) and depends only on past/present (no future leakage → MDP-causal). A variance term over a window m = 10 penalizes latency jitter (weight β ≈ 0.1) so the policy cannot lower the mean by blowing up the tail. The SLA, cost, and variance coefficients λ, α, β are grid-searched on the offline replay buffer with a Shapiro–Wilk normality check so the reward distribution stays roughly Gaussian — which keeps PPO's GAE estimates well-behaved.
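One way to read this reward is sketched below. The exact functional form is not given in the text, so this assumes r = λ·𝟙{l ≤ target} − α·cost − β·Var(recent latencies), with latencies normalized by the target so the three terms share a scale. The class name `RewardMeter` and the default λ and α are hypothetical; β = 0.1 and the window of 10 come from the text.

```python
from collections import deque
from statistics import pvariance


class RewardMeter:
    """Per-request reward that uses only the current and past latencies."""

    def __init__(self, p99_target_ms, lam=1.0, alpha=0.5, beta=0.1, window=10):
        self.p99_target_ms = p99_target_ms
        self.lam, self.alpha, self.beta = lam, alpha, beta
        # Rolling window of normalized latencies; old entries drop out automatically.
        self.recent = deque(maxlen=window)

    def step(self, latency_ms, cost):
        """Reward for one finished request; `cost` is the weighted spend for the step."""
        self.recent.append(latency_ms / self.p99_target_ms)
        sla_term = 1.0 if latency_ms <= self.p99_target_ms else 0.0
        jitter = pvariance(self.recent)  # 0.0 when the window holds one sample
        return self.lam * sla_term - self.alpha * cost - self.beta * jitter
```

Because the window only ever holds completed requests, the reward cannot leak future information, which is the causality property the paragraph asks for.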
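Engineering detail — potential-based shaping keeps the optimum. The key worry with shaping is "does my added term change the optimal policy?" Potential-based shaping provably does not. Use a shaping term of the form F(s, s′) = γ·Φ(s′) − Φ(s), where Φ is a potential that depends on the state alone,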
so the optimal policy is identical to the unshaped "maximize SLA attainment" objective, while the agent gets a dense gradient toward the SLA boundary. Roll out behind a config center (λ, α, β hot-updatable, no training restart): 7 days in a shadow cluster vs. a PPO baseline, then 5% → 30% → 100% canary, each stage signed off by SRE and compliance on three plots — reward curve, KL divergence, violation rate.
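The invariance claim can be checked numerically. The sketch below assumes the standard potential-based form F(s, s′) = γ·Φ(s′) − Φ(s) and shows that the discounted sum of shaping bonuses along a trajectory telescopes to γᵀ·Φ(s_T) − Φ(s₀), a quantity that does not depend on the actions taken in between. The potential `potential` and the p99 trace are made up for illustration.

```python
def potential(p99_ms, target_ms=200.0):
    """Hypothetical potential: 0 at or below the SLA target, more negative above it."""
    return -max(p99_ms - target_ms, 0.0) / target_ms


def shaping_term(phi_s, phi_next, gamma):
    """Potential-based shaping bonus F(s, s') = gamma * Phi(s') - Phi(s)."""
    return gamma * phi_next - phi_s


def discounted_sum(values, gamma):
    return sum((gamma ** t) * v for t, v in enumerate(values))


gamma = 0.99
p99_trace = [320.0, 280.0, 240.0, 210.0, 190.0]  # made-up p99 readings over time
phis = [potential(x) for x in p99_trace]
bonuses = [shaping_term(phis[t], phis[t + 1], gamma) for t in range(len(phis) - 1)]

# Total shaping collected along the trajectory...
shaped_extra = discounted_sum(bonuses, gamma)
# ...equals a boundary term that depends only on the first and last states.
telescoped = gamma ** (len(phis) - 1) * phis[-1] - phis[0]
```

Since every policy starting from the same state receives the same −Φ(s₀) offset (and the terminal term vanishes as the horizon grows), the ranking of policies is unchanged while each step still carries a dense signal.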
5 · The cold-start delay — put it in the transition, not the algorithm
Intuition. When the policy says "add an instance," nothing happens for a few seconds — the container must pull its image, boot, warm caches. If you pretend the capacity is available immediately, the agent learns a fantasy; it scales out, sees no relief, scales out again, and over-provisions. The honest fix is to model the delay in the transition function so the environment itself reflects "the action you took τ seconds ago is the one that lands now."
Engineering detail — a delay-aware transition wrapper. Augment the state with pending actions and a timer, and write the transition kernel with each action's effect multiplied by an indicator,
where the indicator gates whether action aₜ has actually taken effect given its remaining delay τₜ. Crucially, this lives in an environment wrapper, not the algorithm kernel, so it is compatible with DQN, PPO, and SAC alike. In one large e-commerce scheduler this modeling alone brought p99 from ~480 ms to ~210 ms. If the delay is continuous and arbitrarily distributed (e.g. Weibull, because different images pull at different speeds), upgrade to an SMDP with event-driven simulation: inverse-transform-sample the next decision epoch and write V(s) = E[ ∫₀ᴰ γᵗ r(s(t)) dt + γᴰ V(s′) ], where D is the sampled time to that epoch. If the delay couples to the action content, learn a parametric delay model d_θ(a, s) jointly with the policy and bound the worst case with a CVaR constraint on the SLA.
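The simplest case, a constant delay measured in decision steps, can be sketched as a small wrapper. This is a toy under stated assumptions: the class name `DelayedScaleEnv` is hypothetical, scale-in is assumed to take effect immediately, and traffic, reward, and the SMDP extension are left out.

```python
from collections import deque


class DelayedScaleEnv:
    """Toy environment: scale-out adds capacity only after `delay_steps` steps."""

    def __init__(self, delay_steps=3, initial_instances=2):
        self.delay_steps = delay_steps
        self.instances = initial_instances
        self.pending = deque()  # entries are [remaining_steps, delta]

    def observe(self):
        # Augmented state: live capacity plus what is warming up and when it lands.
        return {
            "instances": self.instances,
            "pending": [(remaining, delta) for remaining, delta in self.pending],
        }

    def step(self, delta):
        if delta > 0:
            self.pending.append([self.delay_steps, delta])   # starts its cold start
        elif delta < 0:
            self.instances = max(self.instances + delta, 0)  # assumed immediate
        for entry in self.pending:
            entry[0] -= 1
        # The indicator "remaining delay has run out" gates the capacity change.
        while self.pending and self.pending[0][0] <= 0:
            self.instances += self.pending.popleft()[1]
        return self.observe()
```

Because the pending queue is part of the observation, the agent can see that relief is already on its way and has no reason to scale out a second time for the same spike.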
6 · Non-stationary traffic — adapt the model, the forecast, and the discount
Intuition. Traffic is not stationary: it has daily peaks, weekly cycles, and sudden spikes (a livestream, a flash sale). A policy trained on yesterday's distribution silently degrades today. You need to detect the shift, refresh the model safely, and let the agent's time horizon itself react — be far-sighted when the world is calm, short-sighted when it just lurched.
Engineering detail — hot-update with a three-step gate. Retrain offline under a trust region (KL(π_new‖π_old) ≤ 0.02), then publish the artifact's MD5 digest to a Redis sentinel key. Inference nodes long-poll the key and, on a new version, run three checks: ① numeric check: on 100 random states, the μ/σ outputs must match the training node's within 1e-3; ② shadow experiment: route 5% of traffic to canary pods and pass only if p99 rises < 10% and cumulative return drops < 2% over 3 min; ③ atomic swap: replace the network behind an RCU pointer so the old model unloads once it has zero in-flight requests. If a canary metric goes bad, roll back within 5 s and lock the update gate for 30 min to prevent flapping. A validated deployment held end-to-end detect-to-update latency ≤ 90 s at peak, with rollback rate < 0.3%.
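The decision logic of the first two checks can be sketched as pure functions. The thresholds come from the text; the names `WindowStats`, `numeric_check`, `shadow_check`, and `gate` are hypothetical, the baseline return is assumed to be nonzero, and the Redis polling and RCU swap themselves are not shown.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Metrics collected over one observation window (3 min in the text)."""
    p99_ms: float
    cumulative_return: float


def numeric_check(train_outputs, serve_outputs, tol=1e-3):
    """Check 1: outputs on the same probe states must agree within tol."""
    return all(abs(a - b) < tol for a, b in zip(train_outputs, serve_outputs))


def shadow_check(baseline, canary, max_p99_rise=0.10, max_return_drop=0.02):
    """Check 2: the canary may not regress p99 or return beyond the limits."""
    p99_rise = (canary.p99_ms - baseline.p99_ms) / baseline.p99_ms
    return_drop = (baseline.cumulative_return - canary.cumulative_return) / abs(
        baseline.cumulative_return
    )
    return p99_rise < max_p99_rise and return_drop < max_return_drop


def gate(train_outputs, serve_outputs, baseline, canary):
    """Return 'reject', 'rollback', or 'swap' for a candidate model version."""
    if not numeric_check(train_outputs, serve_outputs):
        return "reject"    # never receives traffic
    if not shadow_check(baseline, canary):
        return "rollback"  # canary regressed; the caller also locks the gate
    return "swap"          # safe to proceed to the atomic swap
```

Keeping the gate free of I/O makes it easy to unit-test the thresholds separately from the deployment plumbing.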
Engineering detail — an RL corrector on top of the LSTM forecast. Keep an LSTM producing a raw forecast ŷₜ, then graft an RL corrector: a policy network takes the LSTM hidden state plus an error-window vector and outputs a Gaussian correction Δₜ; apply ẏₜ = ŷₜ + Δₜ. Train with PPO-clip offline (50 epochs, lr 3e-4, batch 2048, clip 0.2, entropy 0.01) on ~1.2×10⁷ five-minute windows; online, do incremental training every hour on only the last 4 h of data (to fight concept drift) with a 7:3 offline:online sampling mix so the distribution does not collapse. Monitor the correction rate ηₜ = |Δₜ|/ŷₜ; when ηₜ > 15%, trigger human review to block "black-swan" actions. On real CDN traffic this cut forecast RMSE ~18.7% and reduced alerts ~35%.
Engineering detail — change-point-driven discount. Run online CUSUM on the discounted-return series Gₜ = Σ γᵏ rₜ₊ₖ with drift threshold h = 3σ (σ via EMA). On a detected change it emits a confidence ρ and a direction d ∈ {−1, +1} (+1 = world turned optimistic, raise γ). Adjust the discount γₜ₊₁ = clip(γₜ + d·η·ρ, 0.5, 0.995) with η = 0.02, a one-adjustment-per-episode cooldown, and a ρ ≥ 0.8 trigger. After a change, freeze the behavior policy for one batch and recompute advantages by importance sampling so the gradient direction matches the new γ; temporarily widen PPO-clip 0.2 → 0.3 for 3 batches to speed migration, then restore.
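A simplified sketch of the two pieces follows. The discount update uses the rule and constants from the text (η = 0.02, trigger ρ ≥ 0.8, clip to [0.5, 0.995]). The detector is a bare two-sided CUSUM with EMA estimates of mean and variance; its initial variance, EMA rate, lack of a slack term, and the class name `TwoSidedCusum` are assumptions, and how the confidence ρ is derived is left to the caller because the text does not specify it.

```python
def update_discount(gamma, direction, confidence, eta=0.02, min_conf=0.8):
    """gamma <- clip(gamma + d * eta * rho, 0.5, 0.995), only when rho >= min_conf."""
    if confidence < min_conf:
        return gamma
    return max(0.5, min(gamma + direction * eta * confidence, 0.995))


class TwoSidedCusum:
    """Flags an upward (+1) or downward (-1) shift in a stream of returns."""

    def __init__(self, threshold_sigmas=3.0, ema_alpha=0.1):
        self.k = threshold_sigmas
        self.alpha = ema_alpha
        self.mean = None
        self.var = 1.0   # assumed starting variance
        self.pos = 0.0   # cumulative upward deviation
        self.neg = 0.0   # cumulative downward deviation

    def update(self, g):
        """Feed one return value; returns +1, -1, or 0 (no change detected)."""
        if self.mean is None:
            self.mean = g
            return 0
        dev = g - self.mean
        sigma = self.var ** 0.5
        self.pos = max(0.0, self.pos + dev)
        self.neg = max(0.0, self.neg - dev)
        direction = 1 if self.pos > self.k * sigma else (-1 if self.neg > self.k * sigma else 0)
        if direction:
            self.pos = self.neg = 0.0
            self.mean = g  # restart the baseline at the new level
        else:
            self.mean += self.alpha * dev
            self.var = (1 - self.alpha) * self.var + self.alpha * dev * dev
        return direction
```

A caller would feed each new return estimate to `update`, and on a nonzero direction pass it with a confidence value to `update_discount`, subject to the per-episode cooldown described above.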
7 · Guard the budget — a constraint the policy may never break
Intuition. Latency is a soft goal; the monthly budget is a hard wall. A reward penalty alone is not enough — a hungry policy will happily eat a penalty to chase latency and blow the budget. You enforce the budget at the sampling boundary so an infeasible action can never even be chosen, and you keep the dual machinery only to speed convergence, not to do the actual cutting.
Engineering detail — mask + safety-layer projection, on top of a CMDP. Carry remaining budget b in the state. Add a mask layer at the network's output: maskᵢ = 𝟙{cᵢ ≤ b}, where cᵢ is the immediate cost of action aᵢ. For discrete actions add −∞ to the logit of every masked-out action; for continuous actions clip the output to [0, b]; implement the operator in C++ for < 0.2 ms latency. If the cost is nonlinear in the continuous action, first emit a raw action â, then analytically project a* = min(â, b / cost_per_unit) so a single decision can never overdraw; the monthly cumulative cap closes automatically through the budget-in-state loop. Train as a CMDP with PPO: cost function c(s,a) = actual spend, threshold d = monthly budget, with a Lagrange multiplier λ used only to accelerate convergence. The mask and safety layer already guarantee feasibility at the sampling end, so you never rely on λ for a hard cutoff. One bidding-budget deployment drove the budget-overrun rate from 1.2% to 0.00% while CPM fell only ~0.8%.
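The mask and the projection are both one-liners, sketched here in Python for readability (the text calls for a C++ operator in production). The function names are hypothetical, and the projection assumes spend grows linearly with the action at rate `cost_per_unit`, which is what the formula a* = min(â, b / cost_per_unit) implies.

```python
import math


def mask_discrete_logits(logits, action_costs, remaining_budget):
    """Set the logit to -inf for every action whose immediate cost exceeds the budget."""
    return [
        logit if cost <= remaining_budget else -math.inf
        for logit, cost in zip(logits, action_costs)
    ]


def project_continuous(raw_action, remaining_budget, cost_per_unit):
    """Safety-layer projection a* = min(raw, b / cost_per_unit), floored at zero."""
    return max(0.0, min(raw_action, remaining_budget / cost_per_unit))
```

For example, with 6.0 budget units left and a price of 2.0 per core, a raw request for 8 cores is projected down to 3, while a request for 2 cores passes through unchanged.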
8 · Iterate — removing one difficulty exposes the next
Compress the state and the action space starts to flap; damp the oscillation and the cold-start delay surfaces as the real cause; model the delay and the non-stationary traffic dominates; adapt to drift and the budget wall becomes the binding constraint. Applied autoscaling is this loop, run until the cluster's SLA, cost, and stability all clear the SRE review in one pass.
Further considerations
- Multi-task state sharing. If the same state must feed scheduling and capacity forecasting, add a task-specific attention mask over the shared 8-dim bottleneck vector — a tiny 2-dim per-task weight (~256 extra params) lets scheduling attend to CPU saturation while forecasting attends to the memory slope.
- Two-level aggregation at extreme scale. For 100k+ container clusters, compress raw metrics to 8 dims at the rack edge via RPC, then fuse a second GRU layer centrally — dropping cross-region telemetry from ~90 MB/s to ~3 MB/s. Reserve one state dimension for a "confidence score" (forecast variance) so SREs can wire data-quality alerts and operate the RL system as a white box.
- Discrete-action explosion. When instance types number in the dozens across availability zones, the discrete count head can reach tens of thousands of entries; switch to autoregressive discretization or a pointer network that generates the action sequence rather than enumerating the cross-product. A safety-sensitive variant adds a projection layer plus Lagrange-multiplier feedback for safe RL.
- Multi-tenant reward conflict. Share one global environment model (SAC) for transitions and platform-level reward, give each tenant a lightweight 2-layer Actor distilled from it (< 5 ms inference), and tune Pareto weights online by EMA. Add differential-privacy noise on rewards, a circuit-breaker if inter-tenant policy cosine similarity > 0.85, and federated aggregation for finance tenants whose raw rewards may not leave their domain.
- Sparse-violation gradient vanishing. If p99 sits comfortably at ~180 ms and violations are < 0.1%, even potential-based shaping can starve the gradient; use a rarity-based reward rₜ·exp((lₜ − p99_target)/τ) (τ ≈ 5 ms) to exponentially amplify the violating tail, corrected by importance sampling for unbiased off-policy evaluation.
- Auditability. Regulated production wants replayable reward signals: write each step's lₜ, p99_target, and λ/α/β to a Kafka topic bound to a trace ID, and log the per-update multiplier, KL divergence, and cost histogram so a third party can replay and verify. This turns "no reward forgery" into an enforced audit hook in the training framework.
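The rarity-based reward from the sparse-violation item above can be sketched as a single weighting function. The 200 ms default target and the `max_exponent` cap are assumptions added here (the cap only prevents numeric overflow on extreme outliers); τ = 5 ms comes from the text, and the importance-sampling correction is not shown.

```python
import math


def rarity_weighted_reward(reward, latency_ms, p99_target_ms=200.0, tau_ms=5.0,
                           max_exponent=10.0):
    """Scale the reward by exp((latency - target) / tau) to amplify the violating tail."""
    exponent = min((latency_ms - p99_target_ms) / tau_ms, max_exponent)
    return reward * math.exp(exponent)
```

A request exactly at the target keeps its reward, one that is 5 ms over is weighted by e, and one that is 20 ms under is shrunk by a factor of exp(−4), so the rare violations dominate the gradient.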