Cloud-resource elastic scaling

An autoscaler decides, every few seconds, how many container instances to run and how much CPU to grant each one — trading a latency SLA against a cloud bill. It looks like a textbook control problem, but the binding difficulties are RL-specific: a hybrid action space (how many instances × how much CPU each), a delayed transition because a new instance is not ready until it cold-starts, a two-objective reward (SLA vs. cost) that invites both reward-hacking and oscillation, non-stationary traffic that breaks the stationary-MDP assumption, and a hard budget that the policy may never violate. Each of those names a tool.

The method — five steps, every lesson

Applied RL is the same loop in every domain: (1) Formulate the MDP — state, action, reward, transition, horizon. (2) Diagnose the one or two properties that make this MDP hard. (3) Engineer the mechanism that removes exactly that difficulty. (4) Guard it in production — detect when it breaks and fall back. (5) Iterate. For an autoscaler the table writes itself; the lesson is which row binds first and what mechanism that row forces.

1 · Formulate — the MDP behind an autoscaler

Intuition. Every few seconds the scaler reads cluster telemetry (CPU, memory, queue depth, request latency), decides to add/remove instances and reset per-container CPU limits, and is rewarded if requests stayed fast and the bill stayed low. That is a Markov Decision Process — and every difficulty below is one of these four pieces being awkward in a real cluster.

MDP = (S, A, R, P, γ) · maximize E[ Σₜ γᵗ rₜ ] · rₜ = SLA term − cost term

Piece	For an elastic autoscaler	The awkward part
State S	raw metrics from thousands of containers, fused to a ~16-dim bottleneck vector (CPU saturation, memory slope, queue depth, p99) + request-type tags	raw dim is enormous; naive tag one-hots cause a curse of dimensionality; predicted traffic risks leaking the future
Action A	a_d = instance count Δ (discrete) and a_c = per-container CPU limit (continuous)	hybrid discrete-continuous; most options are infeasible; raw output oscillates
Reward R	SLA attainment (p99 ≤ target) minus weighted cost (instances × price, amortized reserved-instance discount)	two objectives in tension; sparse violations; cost has lumpy discounts
Transition P	add an instance now → it serves traffic only after a cold-start delay τ; traffic itself drifts daily and on promo events	action effect is delayed; environment is non-stationary

The rightmost column is the lesson. Each awkward part below gets a named mechanism — and we never reach for the mechanism before the row demands it.

2 · State — fuse it, embed it, and keep it causal

Intuition. You cannot feed thousands of raw container metrics into a policy network — it is high-dimensional, noisy, and changes shape as containers come and go. You compress it into a small fixed-width vector that still carries the bottleneck signal: how saturated is CPU, is memory climbing, how deep is the queue, how bad is tail latency. Smaller and stabler state means faster, more reliable training.

Engineering detail — the 16-dim bottleneck vector. Per-rack metrics are aggregated (a GRU over the time series gives an 8-dim bottleneck embedding), then concatenated with container-spec and deployment-unit features into a final 16-dim state. This keeps the real-time bottleneck signal while satisfying a production rule that the PPO input dimension stay ≤ 32. In one deployment this cut training-convergence time by ~40%, and a Kolmogorov–Smirnov test showed distribution drift < 0.02 — under the SRE team's ≤ 0.05 go-live threshold.

Engineering detail — request-type tags without the curse of dimensionality. Request types (API, batch, streaming, …) are categorical and can explode the one-hot width. Use a learned embedding table with hashing instead of one-hots, and adapt it to RL's distribution drift with a meta-finetune (MAML-style): each offline round samples "tasks" (different time windows), updates the embedding on a support set, computes the policy gradient on a query set, and back-propagates the second-order term only into the gate network — so the embedding hot-updates hourly without breaking global semantics. The whole table stays < 200 MB (fits one GPU), and a lookup costs ~0.8 ms at inference.

Do not put predicted traffic in the state naively

Predicted future load is tempting — it cures cold-start lag — but if your forecaster used true future values during training, the state now contains information the agent could not have at decision time. That breaks the MDP causal chain and inflates offline metrics. The red-line test: ablate "leaky state vs. compliant state"; if leaky AUC beats compliant by > 3%, leakage is real. Fixes that keep the prior without leaking: (a) a rolling-window lagged truth — forecast t using real load at t−1 and earlier, and feed that as state; (b) model-based RL — the forecast feeds an environment model for planning, while the policy's state stays Markov-compliant ("plan with prediction, decide on present"). Best practice is a CI/CD gate: a tool that walks feature timestamps and DAG lineage, scans the replay buffer, and emits a leakage report — turning "we'll eyeball it" into an enforced merge check.

3 · The hybrid action space — keep the structure, mask the infeasible, damp the rest

Intuition. "Add one instance and set its CPU limit to 2.3 cores" is two decisions of different types: how many instances (a discrete count) and how much CPU each (a continuous scalar). Flattening both into one giant menu explodes; treating both as continuous loses the count's discreteness. Keep the structure: a discrete head for the count and a count-conditioned continuous head for the CPU limit.

a = (a_d, a_c) | a_d ~ Categorical(logits) a_c ~ 𝒩(μ, σ), input = [h ; one-hot(a_d)]

Engineering detail — one encoder, two heads. A shared encoder emits state feature h. The discrete head is a linear layer → logits → Categorical (e.g. "scale out 1"). The continuous head takes h concatenated with one-hot(a_d), passes two MLP layers, and outputs (μ, log σ) → Gaussian for the CPU limit. Sample a_d first, then a_c. Train with a combined objective: clipped-PPO on the continuous part with the reparameterization trick, PPO ratio on the discrete part with a Gumbel-Softmax (τ = 1.0) gradient estimator, and an entropy bonus:

L = L_PPO^discrete + L_PPO^continuous − β · H(π)

Engineering detail — masking and hard penalties. Add an action mask after the logits: if free cores < 4, force the probability of "scale out 2" to −∞ before softmax so it can never be sampled. Add a hard reward penalty: if CPU over-commit drives p99 > 200 ms, return −1000 immediately. Deployed on a managed-Kubernetes cluster, this lifted elastic-decision accuracy by ~18%, raised CPU utilization by ~12%, and held QoS-violation rate below 0.3%.

Stopping oscillation — train-side, deploy-side, eval-side

A raw policy flaps: scale out, scale in, scale out, churning instances and burning cold-starts. Damp it on three layers. Train. Add an action-difference penalty −λ‖aₜ − aₜ₋₁‖² (tune λ by grid search + Bayesian opt); tighten PPO-clip ε from 0.2 to 0.1 and use KL[π_old‖π_new] < 0.01 as an early-stop signal; for the continuous part use TD3-style delayed actor updates (actor trains once per 2 critic updates) plus target-policy smoothing, cutting target-Q variance by > 30%. Deploy. Low-pass the raw action aₜ = 0.7·aₜ₋₁ + 0.3·âₜ, then a safety layer solves min‖a − aₜ‖² s.t. |a − aₜ₋₁| ≤ Δ_max, daily_change ≤ 3; quantize the continuous knob to five steps {−8,−4,0,+4,+8} to lower the back-jump probability. Eval. Estimate oscillation rate offline by importance sampling (target < 3%); in canary, compare against a PID baseline and roll out fully only if p99 does not regress > 5% and loop-count drops ≥ 50%. One e-commerce scaler went from ~120 daily scaling loops to ~8, oscillation rate 0.7%.

4 · Reward — two objectives, kept causal and policy-invariant

Intuition. You want low latency and low cost, and the two fight: the cheapest cluster is empty and slow, the fastest is huge and expensive. You also must measure the reward the instant a request finishes (no waiting for batch stats) and you must avoid future information. The honest signal is "did we hit the SLA, at what cost," and the engineering trick is to shape it so the optimum you train toward is still the optimum you meant.

Engineering detail — a measurable, unbiased, invariant reward. The per-request latency lₜ is measurable the instant the request returns (immediacy) and depends only on past/present (no future leakage → MDP-causal). A variance term over a window m = 10 penalizes latency jitter (weight β ≈ 0.1) so the policy cannot lower the mean by blowing up the tail. The SLA, cost, and variance coefficients λ, α, β are grid-searched on the offline replay buffer with a Shapiro–Wilk normality check so the reward distribution stays roughly Gaussian — which keeps PPO's GAE estimates well-behaved.

Engineering detail — potential-based shaping keeps the optimum. The key worry with shaping is "does my added term change the optimal policy?" Potential-based shaping provably does not. Use

F(s, a, s′) = γ · Φ(s′) − Φ(s), Φ(s) = −λ · max(0, (p99_rolling(s) − p99_target) / p99_target)

so the optimal policy is identical to the unshaped "maximize SLA attainment" objective, while the agent gets a dense gradient toward the SLA boundary. Roll out behind a config center (λ, α, β hot-updatable, no training restart): 7 days in a shadow cluster vs. a PPO baseline, then 5% → 30% → 100% canary, each stage signed off by SRE and compliance on three plots — reward curve, KL divergence, violation rate.

Amortizing the reserved-instance discount into per-step cost

Reserved instances (RI) buy a year of capacity up front at a discount — a single lumpy payment that, naively, would slam one step's reward to a huge negative and starve every other step of signal. Amortize it: spread the one-time discounted cost smoothly across each step's cost term, so the per-step reward is stable and the policy converges, while the books still reconcile to the true annual spend. Extensions: make the RI amortization coefficient part of the action ("where to buy, how much") for multi-region fleets; treat the amortized RI cost as a price anchor and fall back from spot to RI when spot price exceeds it; carry remaining annual budget as a state feature so a large negative penalty fires as spend nears the budget red line.

5 · The cold-start delay — put it in the transition, not the algorithm

Intuition. When the policy says "add an instance," nothing happens for a few seconds — the container must pull its image, boot, warm caches. If you pretend the capacity is available immediately, the agent learns a fantasy; it scales out, sees no relief, scales out again, and over-provisions. The honest fix is to model the delay in the transition function so the environment itself reflects "the action you took τ seconds ago is the one that lands now."

Engineering detail — a delay-aware transition wrapper. Augment the state with pending actions and a timer, and write the transition kernel as

P(s̃_t+1 | s̃_t, aₜ) = Σ_s̃* P̂(s̃* | s̃_t) · 𝟙[ Apply(aₜ, τₜ) ]

where the indicator gates whether action aₜ has actually taken effect given its remaining delay τₜ. Crucially this lives in an environment wrapper, not the algorithm kernel — so it is compatible with DQN, PPO, SAC alike. In one large e-commerce scheduler this modeling alone brought p99 from ~480 ms to ~210 ms. If the delay is continuous and arbitrarily distributed (e.g. Weibull, because different images pull at different speeds), upgrade to an SMDP with event-driven simulation: inverse-transform-sample the next decision epoch and write V(s) = E[ ∫₀ᴰ γᵗ r(s(t)) dt + γᴰ V(s′) ]. If the delay couples to the action content, learn a parametric delay model d_θ(a, s) jointly with the policy and bound the worst case with a CVaR constraint on the SLA.

6 · Non-stationary traffic — adapt the model, the forecast, and the discount

Intuition. Traffic is not stationary: it has daily peaks, weekly cycles, and sudden spikes (a livestream, a flash sale). A policy trained on yesterday's distribution silently degrades today. You need to detect the shift, refresh the model safely, and let the agent's time horizon itself react — be far-sighted when the world is calm, short-sighted when it just lurched.

Engineering detail — hot-update with a three-step gate. Retrain offline under a trust region (KL(π_new‖π_old) ≤ 0.02), sign the artifact (MD5) into a Redis sentinel key. Inference nodes long-poll the key and on a new version run three checks: ① numeric check — 100 random states, output μ/σ error vs. the training node < 1e-3; ② shadow experiment — route 5% traffic to canary pods, pass only if p99 rises < 10% and cumulative return drops < 2% over 3 min; ③ atomic swap — replace the network behind an RCU pointer so the old model unloads at zero in-flight requests. If a canary metric goes bad, roll back within 5 s and lock the update gate for 30 min to prevent flapping. A validated deployment held end-to-end detect-to-update latency ≤ 90 s at peak, with rollback rate < 0.3%.

Engineering detail — an RL corrector on top of the LSTM forecast. Keep an LSTM producing a raw forecast ŷₜ, then graft an RL corrector: a policy network takes the LSTM hidden state plus an error-window vector and outputs a Gaussian correction Δₜ; apply ẏₜ = ŷₜ + Δₜ. Train with PPO-clip offline (50 epochs, lr 3e-4, batch 2048, clip 0.2, entropy 0.01) on ~1.2×10⁷ five-minute windows; online, do incremental training every hour on only the last 4 h of data (to fight concept drift) with a 7:3 offline:online sampling mix so the distribution does not collapse. Monitor the correction rate ηₜ = |Δₜ|/ŷₜ; when ηₜ > 15%, trigger human review to block "black-swan" actions. On real CDN traffic this cut forecast RMSE ~18.7% and reduced alerts ~35%.

Engineering detail — change-point-driven discount. Run online CUSUM on the discounted-return series Gₜ = Σ γᵏ rₜ₊ₖ with drift threshold h = 3σ (σ via EMA). On a detected change it emits a confidence ρ and a direction d ∈ {−1, +1} (+1 = world turned optimistic, raise γ). Adjust the discount γₜ₊₁ = clip(γₜ + d·η·ρ, 0.5, 0.995) with η = 0.02, a one-adjustment-per-episode cooldown, and a ρ ≥ 0.8 trigger. After a change, freeze the behavior policy for one batch and recompute advantages by importance sampling so the gradient direction matches the new γ; temporarily widen PPO-clip 0.2 → 0.3 for 3 batches to speed migration, then restore.

SLA penalty vs. cloud bill — the autoscaler's central knob

An autoscaler trades a latency SLA against a cloud bill, and the cold-start delay decides how much headroom you must pre-provision. Move the target utilization up to save money; move it down to protect p99. The cold-start delay shifts the whole frontier — slower starts force more idle headroom.

peak demand (cores) 200 target utilization % 70 cold-start delay (s) 30 SLA penalty weight λ 2

provisioned cores

–

SLA breach risk

–

relative bill

–

total reward

–

7 · Guard the budget — a constraint the policy may never break

Intuition. Latency is a soft goal; the monthly budget is a hard wall. A reward penalty alone is not enough — a hungry policy will happily eat a penalty to chase latency and blow the budget. You enforce the budget at the sampling boundary so an infeasible action can never even be chosen, and you keep the dual machinery only to speed convergence, not to do the actual cutting.

Engineering detail — mask + safety-layer projection, on top of a CMDP. Carry remaining budget b in the state. Add a mask layer at the network's output: mask_i = 𝟙{cᵢ ≤ b}, where cᵢ is the immediate cost of action aᵢ — for discrete actions add −∞ to the logit, for continuous actions clip the output to [0, b]; implement the operator in C++ for < 0.2 ms latency. If the cost is nonlinear in the continuous action, first emit a raw action â, then analytically project a* = min(â, b / cost_per_unit) so a single decision can never overdraw; the monthly cumulative cap closes automatically through the budget-in-state loop. Train as a CMDP with PPO: cost function c(s,a) = actual spend, threshold d = monthly budget, with a Lagrange multiplier λ used only to accelerate convergence — the mask and safety layer already guarantee feasibility at the sampling end, so you never rely on λ for hard cutoff. One bidding-budget deployment drove the budget-overrun rate from 1.2% to 0.00% while CPM fell only ~0.8%.

mask_i = 𝟙{ cᵢ ≤ b } · a* = min( â, b / cost_per_unit ) · CMDP: max E[Aᵣ] s.t. E[A_c] ≤ B

Lagrange multipliers blow up — three guards

Left alone, the dual variable λ can explode when constraint violations spike, swamping the reward gradient and crashing policy entropy. Tame it in three moves. (1) Low-pass the violation signal with momentum β ≈ 0.08 — effectively filtering the constraint-violation amount, which cut multiplier variance ~70% and halved entropy swings in one project. (2) Zeroth-order fallback: every 1k steps, if the violation δ < ε (ε set to the business tolerance, e.g. 0.01) for 3 consecutive checks, multiply λ by 0.5 to quickly release over-penalty and avoid "multiplier stickiness" dragging down return. (3) Scheduled step size: when the constraint is time-averaged, upgrade to a differential multiplier flow dλ/dt = η·(δₜ − ε) — big η early for fast convergence, exponentially-decayed η late to prevent overshoot, with a Lyapunov stability argument before discretizing. One deployment pulled the multiplier L2 norm from ~2×10⁴ down to ~4×10², with 30 consecutive alert-free days.

Proving feasibility is preserved across a policy update — and the through-line

Before promoting a new policy you owe the business a feasibility argument in three layers. Formalize: write the budget as a CMDP cost constraint J_c(π) ≤ B (use a terminal/absorbing-state cost if no trajectory may ever exceed). Constrained improvement: solve max_θ E[Aᵣ] s.t. KL(π_θ‖π_old) ≤ δ, E[A_c] ≤ B − J_c(π_old); because the constraint set is convex, a feasible start ⇒ feasible iterates, so a convex-feasibility theorem proves the budget is never breached. Certificate: offline, run N ≥ (z_1−α·σ_c/ε)² simulated trajectories and form the (1−α) upper bound μ̂_c + ε ≤ B — a statistical certificate to hand the business, backstopped online by a safety-margin monitor that auto-rolls-back if the sliding-mean cost exceeds B·(1−η), η ≈ 3–5%. That "convex feasibility + statistical certificate + live monitor" trio mirrors the through-line of the whole lesson: 16-dim state → fused/embedded/causal; hybrid action → structured heads + mask + oscillation damping; two-objective reward → potential-based shaping + amortized discount; delayed, drifting transition → delay wrapper + hot-update + adaptive γ; hard budget → mask + safety layer + tamed multiplier. Every tool answered exactly one row of the MDP table.

8 · Iterate — removing one difficulty exposes the next

Compress the state and the action space starts to flap; damp the oscillation and the cold-start delay surfaces as the real cause; model the delay and the non-stationary traffic dominates; adapt to drift and the budget wall becomes the binding constraint. Applied autoscaling is this loop, run until the cluster's SLA, cost, and stability all clear the SRE review in one pass.

Further considerations

Multi-task state sharing. If the same state must feed scheduling and capacity forecasting, add a task-specific attention mask over the shared 8-dim bottleneck vector — a tiny 2-dim per-task weight (~256 extra params) lets scheduling attend to CPU saturation while forecasting attends to the memory slope.
Two-level aggregation at extreme scale. For 100k+ container clusters, compress raw metrics to 8 dims at the rack edge via RPC, then fuse a second GRU layer centrally — dropping cross-region telemetry from ~90 MB/s to ~3 MB/s. Reserve one state dimension for a "confidence score" (forecast variance) so SREs can wire data-quality alerts and operate the RL system as a white box.
Discrete-action explosion. When instance types number in the dozens across availability zones, the discrete count head can reach tens of thousands of entries; switch to autoregressive discretization or a pointer network that generates the action sequence rather than enumerating the cross-product. A safety-sensitive variant adds a projection layer plus Lagrange-multiplier feedback for safe RL.
Multi-tenant reward conflict. Share one global environment model (SAC) for transitions and platform-level reward, give each tenant a lightweight 2-layer Actor distilled from it (< 5 ms inference), and tune Pareto weights online by EMA. Add differential-privacy noise on rewards, a circuit-breaker if inter-tenant policy cosine similarity > 0.85, and federated aggregation for finance tenants whose raw rewards may not leave their domain.
Sparse-violation gradient vanishing. If p99 sits comfortably at ~180 ms and violations are < 0.1%, even potential-based shaping can starve the gradient; use a rarity-based reward rₜ·exp((lₜ − p99_target)/τ) (τ ≈ 5 ms) to exponentially amplify the violating tail, corrected by importance sampling for unbiased off-policy evaluation.
Auditability. Regulated production wants replayable reward signals: write each step's lₜ, p99_target, λ/α/β to a Kafka topic bound to a trace ID, and log the per-update multiplier, KL divergence, and cost histogram so a third party can replay and verify — turning "no reward forgery" into an enforced audit hook in the training framework.