
Requirements — SLOs, workload, and Little's Law

Step 1 of the loop turns a vague ask into numbers. For ML serving those numbers are a pair of latency SLOs (because there are two phases), a throughput target, and a workload shape. Then one 1961 result, Little's Law, converts them into the concurrency you must sustain, and from there into a GPU count. This is the lesson that makes "how many GPUs?" a calculation instead of a guess.

Why latency is a pair, not a number

Lesson 01 established that a request is prefill then decode. The user experiences each differently, so you specify each separately:

| Metric | What it measures | Phase | User feels it as… |
| --- | --- | --- | --- |
| TTFT (time to first token) | submit → first token appears | prefill | "did it hear me?" responsiveness |
| TPOT / ITL (time per output token) | gap between successive tokens | decode | reading speed of the stream |
| E2E latency | TTFT + TPOT · output_len | both | total wait for a non-streamed answer |
| Throughput | tokens/s or requests/s served by the fleet | | (the operator feels it: it's the bill) |

A chat UI wants low TTFT (snappy start) and a TPOT that keeps the stream at or slightly ahead of reading speed (~6–10 tokens/s, roughly 100–170 ms per token, is plenty; faster is invisible). A batch summarization job doesn't care about either latency and wants only throughput. A coding agent making tool calls cares about E2E because nothing streams to a human. Same model, three different systems, because the SLOs differ. This is why requirements come first: they change the design more than the model does.

Goodput, the metric that actually pays the bills
Raw throughput counts every token; goodput counts only tokens from requests that met their SLO. A server cranking 50k tok/s where 40% of requests blew their TTFT budget has a goodput of roughly 30k tok/s (assuming tokens are spread evenly across requests). Optimizing raw throughput while goodput sags is the most common self-inflicted wound in serving: you "scaled" but the product got worse. Design against goodput.
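
The distinction can be sketched in a few lines. This is a minimal illustration, not a production metric: the `RequestRecord` fields and the function name are hypothetical, and the sample numbers are made up to reproduce the 50k / 30k example above.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    output_tokens: int
    ttft_s: float  # measured time to first token, seconds
    tpot_s: float  # measured mean time per output token, seconds


def throughput_and_goodput(records, window_s, ttft_slo_s, tpot_slo_s):
    """Return (raw tokens/s, goodput tokens/s) over a measurement window."""
    total = sum(r.output_tokens for r in records)
    # Goodput only counts tokens from requests that met BOTH latency SLOs.
    good = sum(
        r.output_tokens
        for r in records
        if r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s
    )
    return total / window_s, good / window_s


# Made-up sample: 10,000 requests of 300 tokens in a 60 s window;
# 40% of them miss a 0.5 s TTFT budget.
records = [RequestRecord(300, 0.3, 0.04)] * 6000 + [RequestRecord(300, 0.9, 0.04)] * 4000
raw, good = throughput_and_goodput(records, window_s=60.0, ttft_slo_s=0.5, tpot_slo_s=0.05)
print(raw, good)  # 50000.0 30000.0
```

Tracking both numbers side by side makes it visible when a throughput "win" is really SLO violations in disguise.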

Specify percentiles, never means

Latency distributions are heavy-tailed: a few requests with 8K-token prompts or unlucky batching sit far above the median. A mean hides them. State SLOs as percentiles — typically p50, p95, p99 — e.g. "p95 TTFT < 500 ms, p99 < 1 s."
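
A small sketch of why the mean misleads, using a nearest-rank percentile. The `percentile` helper and the TTFT sample are illustrative assumptions, not measurements from a real system.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # integer ceil(p * n / 100)
    return ordered[rank - 1]


# Made-up TTFT sample in milliseconds: mostly fast, with a slow tail.
ttft_ms = [200] * 90 + [1500] * 8 + [6000] * 2

mean_ms = sum(ttft_ms) / len(ttft_ms)
print("mean:", mean_ms)                  # 420.0
print("p50:", percentile(ttft_ms, 50))   # 200
print("p95:", percentile(ttft_ms, 95))   # 1500
print("p99:", percentile(ttft_ms, 99))   # 6000
```

The mean of 420 ms describes no actual request: most finish in 200 ms, while the slowest take 6 s. Only the percentiles expose that split.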

The tail matters more than it looks because of fan-out amplification: an agent that issues 10 sequential model calls inherits the p99 of each. If one call is p99-slow, the whole task is slow. With 10 calls, the chance the task hits at least one p99-latency call is 1 − 0.99¹⁰ ≈ 10% — your p99 service becomes a p90 product. Tail latency compounds; design for p99, validate at p99.
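
The compounding arithmetic can be checked directly. This sketch assumes each call's latency is independent of the others, which real systems only approximate; both function names are hypothetical.

```python
def p_task_hits_tail(n_calls, per_call_quantile=0.99):
    """Chance that at least one of n independent calls lands beyond the quantile."""
    return 1.0 - per_call_quantile ** n_calls


def per_call_quantile_needed(n_calls, task_quantile=0.99):
    """Per-call quantile required so the whole task stays inside task_quantile."""
    return task_quantile ** (1.0 / n_calls)


print(round(p_task_hits_tail(10), 4))          # 0.0956
print(round(per_call_quantile_needed(10), 4))  # 0.999
```

Read the second number as a requirement: to give a 10-call agent a p99 experience, each call has to hold its budget at roughly p99.9.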

The fundamental tension: latency vs throughput

Here is the trade-off that every serving design negotiates. Batching more requests together raises throughput (the bandwidth-bound weight read is amortized — lesson 01's free win) but raises each request's latency (it waits for the batch, and shares compute). You cannot maximize both.

[Figure: latency (TPOT) plotted against throughput (tokens/s). batch=1: low latency, GPU idle, $$$ per token. Knee: best $/token within SLO (operate here). Huge batch: cheap tokens, latency blows SLO. SLO ceiling: anything above is goodput = 0.]

The design target is the knee under the SLO ceiling: the largest batch (cheapest tokens) whose latency still clears the SLO. Continuous batching (lesson 04) is the mechanism that lets you ride near that knee dynamically as load shifts. The SLO line is set by requirements; your job is to push the curve down-and-right (better kernels, paging, disaggregation) so the knee sits at higher throughput.
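
Picking the operating point can be sketched as a filter-then-maximize over a batch-size sweep. The sweep values below are invented purely for illustration (real ones come from benchmarking your own model and hardware), and `pick_operating_point` is a hypothetical helper.

```python
def pick_operating_point(sweep, tpot_slo_ms):
    """Return the highest-throughput point whose TPOT still meets the SLO, or None."""
    feasible = [p for p in sweep if p["tpot_ms"] <= tpot_slo_ms]
    if not feasible:
        return None
    # Decode throughput: each of `batch` requests emits one token every tpot_ms.
    return max(feasible, key=lambda p: p["batch"] * 1000.0 / p["tpot_ms"])


# Made-up sweep: TPOT rises slowly at first, then sharply past the knee.
sweep = [
    {"batch": 1, "tpot_ms": 12.0},
    {"batch": 8, "tpot_ms": 14.0},
    {"batch": 32, "tpot_ms": 22.0},
    {"batch": 64, "tpot_ms": 38.0},
    {"batch": 128, "tpot_ms": 70.0},
]

print(pick_operating_point(sweep, tpot_slo_ms=50.0))  # {'batch': 64, 'tpot_ms': 38.0}
```

Batch 128 would yield more tokens per second, but it breaches the 50 ms ceiling, so it contributes no goodput and is excluded.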

Little's Law — from latency to GPU count

The bridge from "latency target + request rate" to "how much hardware" is one equation that holds for any stable queueing system, with no assumptions about the arrival or service-time distributions:

L = λ · W

where L = average number of requests in the system (concurrency), λ = arrival rate (req/s), W = average time in system (E2E latency, s). The logic is unavoidable: if 100 requests arrive per second and each lingers 2 seconds, then on average 100 · 2 = 200 are in flight at once. You must have capacity for 200 concurrent requests or the queue grows without bound and latency → ∞.
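
A quick way to convince yourself is to count request-seconds in a toy system. The sketch below assumes perfectly regular arrivals and a fixed time in system, which is the simplest possible case (the law itself does not need those assumptions); `average_in_flight` is a hypothetical helper.

```python
def average_in_flight(arrival_rate, time_in_system, horizon):
    """Time-averaged number of requests in the system over [0, horizon] seconds."""
    n = int(arrival_rate * horizon)
    arrivals = [i / arrival_rate for i in range(n)]
    # Request-seconds spent in the system, clipped at the end of the window.
    busy = sum(min(a + time_in_system, horizon) - a for a in arrivals)
    return busy / horizon


measured = average_in_flight(arrival_rate=100, time_in_system=2.0, horizon=1000)
print(measured)  # close to 200 = 100 * 2, minus a small edge effect at the window end
```

The tiny shortfall from 200 comes from requests still in flight when the window closes; it shrinks as the horizon grows.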

Worked: sizing a chat service
Target: 1,000 requests/s, average output 300 tokens, TTFT 0.3 s, TPOT 50 ms ⇒ E2E ≈ 0.3 + 300 · 0.05 = 15.3 s. Little's Law: L = 1000 · 15.3 ≈ 15,300 concurrent requests in flight. If lesson 04's memory math says one H100 holds a batch of ~50 such requests, you need 15,300 / 50 ≈ 306 GPUs of decode capacity, before redundancy or headroom. That is how a latency SLO becomes a purchase order.

Notice the leverage: halve TPOT (faster decode) and W nearly halves, because the decode term dominates it, so L and the GPU count drop almost in proportion as long as the per-GPU batch size holds. Every decode optimization in lesson 06 cashes out through this equation. Conversely, a long-output workload (reasoning models emitting 10k thinking tokens) inflates W enormously, which is why reasoning-model serving is a different capacity regime, revisited in the capstone.

Characterizing a workload you can't see yet

The arithmetic above needs four workload inputs. Before launch you estimate them; after launch you measure and revise (the loop never stops):

  1. Input length distribution. Sets prefill cost and TTFT. A RAG app (4K-token contexts) and a chat app (200-token) have wildly different prefill bills. Use a distribution, not a mean — the p95 prompt sizes your worst-case batch.
  2. Output length distribution. Sets decode cost, KV growth, and W. Often the dominant cost driver and the hardest to predict (reasoning models surprised everyone here).
  3. Request rate & burstiness. The λ in Little's Law, plus its peak-to-average ratio. A 10× diurnal swing means you either over-provision for peak or autoscale (lesson 05).
  4. Prefix-sharing structure. How much of the input repeats across requests (system prompts, few-shot examples, agent history). Decides whether prefix caching (lesson 06) is a rounding error or a 5× win — exactly the question the SGLang track opens on.
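
Once a request log exists, items 3 and 4 can be estimated with a few lines. This is a sketch under stated assumptions: the log schema (`arrival_s`, `input_tokens`, `shared_prefix_tokens`) is hypothetical, the sample rows are made up, and "shared prefix" is taken as already measured per request.

```python
from collections import Counter


def burstiness_and_prefix_share(requests, bucket_s=60):
    """Return (peak-to-average request rate, fraction of input tokens in shared prefixes)."""
    per_bucket = Counter(int(r["arrival_s"] // bucket_s) for r in requests)
    # Count empty buckets inside the observed span as part of the average.
    n_buckets = max(per_bucket) - min(per_bucket) + 1
    mean_per_bucket = len(requests) / n_buckets
    peak_to_avg = max(per_bucket.values()) / mean_per_bucket
    prefix_share = (
        sum(r["shared_prefix_tokens"] for r in requests)
        / sum(r["input_tokens"] for r in requests)
    )
    return peak_to_avg, prefix_share


# Made-up log: three requests in the first minute, one in the second.
log = [
    {"arrival_s": 0, "input_tokens": 200, "shared_prefix_tokens": 150},
    {"arrival_s": 10, "input_tokens": 200, "shared_prefix_tokens": 150},
    {"arrival_s": 20, "input_tokens": 200, "shared_prefix_tokens": 150},
    {"arrival_s": 70, "input_tokens": 4000, "shared_prefix_tokens": 150},
]
peak_to_avg, prefix_share = burstiness_and_prefix_share(log)
print(peak_to_avg, round(prefix_share, 3))  # 1.5 0.13
```

Note how a single 4,000-token prompt drags the prefix share down to 13% even though three of the four requests are mostly shared prefix. That is another reason to look at distributions rather than a single aggregate.
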
The number you'll get wrong
Output length. Teams size for the prompts they imagined and get crushed by outputs they didn't — chain-of-thought, agent loops, "explain step by step." Because W (and thus GPU count) scales with output length, an unbudgeted shift from 200- to 2,000-token average outputs is a 10× capacity event. Instrument output-length percentiles on day one and alert on drift.
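
A drift check can be as simple as comparing the current p95 output length with a baseline. The sketch below uses the standard library's `statistics.quantiles`; the function name, the 1.5× alert threshold, and the sample data are all illustrative assumptions.

```python
import statistics


def output_length_drift(baseline_lengths, recent_lengths, max_ratio=1.5):
    """Return (p95 ratio recent/baseline, whether it exceeds the alert threshold)."""
    # quantiles(..., n=100) returns 99 cut points; index 94 is the 95th percentile.
    base_p95 = statistics.quantiles(baseline_lengths, n=100)[94]
    recent_p95 = statistics.quantiles(recent_lengths, n=100)[94]
    ratio = recent_p95 / base_p95
    return ratio, ratio > max_ratio


# Made-up data: outputs shift from ~200 tokens to ~2,000 tokens.
baseline = [200] * 100
recent = [2000] * 100
print(output_length_drift(baseline, recent))  # (10.0, True)
```

In practice the baseline would be the distribution you sized the fleet for, so the ratio reads directly as a capacity multiplier.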

Interactive · requirements → fleet size

Turn the four requirements into a GPU count via Little's Law. Watch how a tighter TPOT SLO or a longer output distribution multiplies the fleet — and notice that the model isn't even an input here. Requirements, not the model, size the system.

Capacity calculator (Little's Law)

Per-GPU concurrent-request capacity comes from lesson 04's memory math; here it's a knob so you can see its leverage.
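
The calculator's arithmetic can be sketched as one function. This is a minimal version under stated assumptions: the function and parameter names are hypothetical, latencies are passed in milliseconds, W is modeled as TTFT + output_len · TPOT as in the metrics table, and the 30% headroom default is a tunable assumption.

```python
import math


def fleet_size(rate_rps, output_tokens, ttft_ms, tpot_ms, per_gpu_concurrency, headroom=0.30):
    """Size a decode fleet with Little's Law (L = lambda * W)."""
    w_s = (ttft_ms + output_tokens * tpot_ms) / 1000.0  # E2E latency W, seconds
    concurrency = rate_rps * w_s                        # requests in flight, L
    # round() guards against float noise pushing an exact quotient over the next integer.
    gpus = math.ceil(round(concurrency / per_gpu_concurrency, 9))
    gpus_with_headroom = math.ceil(round(gpus * (1.0 + headroom), 9))
    return {"W_s": w_s, "L": concurrency, "gpus": gpus, "gpus_with_headroom": gpus_with_headroom}


# The worked chat-service example: 1,000 req/s, 300 output tokens,
# TTFT 300 ms, TPOT 50 ms, ~50 concurrent requests per GPU.
chat = fleet_size(1000, 300, ttft_ms=300, tpot_ms=50, per_gpu_concurrency=50)
print(chat["gpus"], chat["gpus_with_headroom"])  # 306 398
```

Rerunning with `output_tokens=3000` shows the output-length leverage: the fleet grows roughly tenfold while every other input stays fixed.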


What carries forward