Requirements — SLOs, workload, and Little's Law
Step 1 of the loop turns a vague ask into numbers. For ML serving those numbers are a pair of latency SLOs (because there are two phases), a throughput target, and a workload shape. Then one 1961 result, Little's Law, converts them into the concurrency you must sustain, and from there into a GPU count. This is the lesson that makes "how many GPUs?" a calculation instead of a guess.
Why latency is a pair, not a number
Lesson 01 established that a request is prefill then decode. The user experiences each differently, so you specify each separately:
| Metric | What it measures | Phase | User feels it as… |
|---|---|---|---|
| TTFT — time to first token | submit → first token appears | prefill | "did it hear me?" responsiveness |
| TPOT / ITL — time per output token | gap between successive tokens | decode | reading speed of the stream |
| E2E latency | TTFT + TPOT · output_len | both | total wait for a non-streamed answer |
| Throughput | tokens/s or requests/s served by the fleet | — | (the operator feels it — it's the bill) |
A chat UI wants low TTFT (snappy start) and a TPOT short enough that tokens stream slightly faster than the user reads (~6–10 tokens/s, i.e. roughly 100–170 ms per token, is plenty; faster is invisible). A batch summarization job doesn't care about either latency and wants only throughput. A coding agent making tool calls cares about E2E because nothing streams to a human. Same model, three different systems, because the SLOs differ. This is why requirements come first: they change the design more than the model does.
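The E2E row of the table is simple enough to check by hand. The following is a minimal sketch; the function names and the example numbers (300 ms TTFT, 8 tokens/s, 400 output tokens) are illustrative assumptions, not measurements.

```python
def tpot_from_rate(tokens_per_s: float) -> float:
    """Seconds per output token for a given streaming rate."""
    return 1.0 / tokens_per_s


def e2e_latency(ttft_s: float, tpot_s: float, output_len: int) -> float:
    """E2E latency = TTFT + TPOT * output_len, as in the table above."""
    return ttft_s + tpot_s * output_len


tpot = tpot_from_rate(8.0)             # 8 tokens/s -> 0.125 s per token
total = e2e_latency(0.3, tpot, 400)    # 0.3 + 0.125 * 400
print(f"TPOT = {tpot * 1000:.0f} ms, E2E = {total:.1f} s")
# prints: TPOT = 125 ms, E2E = 50.3 s
```

A streamed chat answer hides most of those 50 seconds behind reading time; a non-streamed tool call pays all of it, which is why the agent case is specified on E2E.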
Specify percentiles, never means
Latency distributions are heavy-tailed: a few requests with 8K-token prompts or unlucky batching sit far above the median. A mean hides them. State SLOs as percentiles — typically p50, p95, p99 — e.g. "p95 TTFT < 500 ms, p99 < 1 s."
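To see how a mean hides the tail, here is a small sketch that checks the example SLO above against a synthetic sample. The data is made up for illustration, and the nearest-rank `percentile` helper is one of several common percentile definitions, chosen here as an assumption for simplicity.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic TTFT samples in seconds (illustrative): mostly fast, with a thin slow tail.
ttft_s = [0.2] * 90 + [0.4] * 5 + [0.9] * 3 + [4.0] * 2

mean = sum(ttft_s) / len(ttft_s)
p50, p95, p99 = (percentile(ttft_s, p) for p in (50, 95, 99))

print(f"mean={mean:.3f}s p50={p50}s p95={p95}s p99={p99}s")
# prints: mean=0.307s p50=0.2s p95=0.4s p99=4.0s

meets_slo = p95 < 0.5 and p99 < 1.0
print("meets 'p95 < 500 ms, p99 < 1 s':", meets_slo)
# prints: meets 'p95 < 500 ms, p99 < 1 s': False
```

The mean (about 0.31 s) looks healthy, yet the p99 clause fails by a factor of four.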
The tail matters more than it looks because of fan-out amplification: an agent that issues 10 sequential model calls inherits the p99 of each. If one call is p99-slow, the whole task is slow. With 10 calls, the chance the task hits at least one p99-latency call is 1 − 0.99¹⁰ ≈ 10% — your p99 service becomes a p90 product. Tail latency compounds; design for p99, validate at p99.
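The compounding above can be reproduced for any call count. This sketch assumes the calls are independent, which real traffic only approximates; the function name is hypothetical.

```python
def p_task_hits_tail(n_calls: int, percentile: float = 0.99) -> float:
    """Probability that at least one of n independent calls lands above the given percentile."""
    return 1.0 - percentile ** n_calls


for n in (1, 10, 50):
    print(f"{n:>2} sequential calls -> {p_task_hits_tail(n):.3f}")
# the 10-call row prints 0.096, i.e. roughly 10%
```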
The fundamental tension: latency vs throughput
Here is the trade-off that every serving design negotiates. Batching more requests together raises throughput (the bandwidth-bound weight read is amortized — lesson 01's free win) but raises each request's latency (it waits for the batch, and shares compute). You cannot maximize both.
The design target is the knee under the SLO ceiling: the largest batch (cheapest tokens) whose latency still clears the SLO. Continuous batching (lesson 04) is the mechanism that lets you ride near that knee dynamically as load shifts. The SLO line is set by requirements; your job is to push the curve down-and-right (better kernels, paging, disaggregation) so the knee sits at higher throughput.
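Finding the knee can be sketched as a search for the largest batch size that still meets the TPOT SLO. The latency model below (a fixed weight-read cost plus a small per-request cost) is a toy assumption with made-up constants; in practice the curve comes from benchmarking your own model and hardware.

```python
def tpot_ms(batch: int, base_ms: float = 20.0, per_req_ms: float = 0.5) -> float:
    """Toy model (assumption): fixed per-step cost plus a cost that grows with batch size."""
    return base_ms + per_req_ms * batch


def knee_batch(slo_tpot_ms: float, max_batch: int = 512) -> int:
    """Largest batch size whose modeled TPOT still meets the SLO (0 if none does)."""
    best = 0
    for b in range(1, max_batch + 1):
        if tpot_ms(b) <= slo_tpot_ms:
            best = b
    return best


def throughput_tok_s(batch: int) -> float:
    """Tokens per second across the whole batch: one token per request per step."""
    return batch / (tpot_ms(batch) / 1000.0)


b = knee_batch(slo_tpot_ms=100.0)
print(f"knee batch = {b}, TPOT = {tpot_ms(b):.0f} ms, throughput = {throughput_tok_s(b):.0f} tok/s")
print(f"batch 1 for comparison: {throughput_tok_s(1):.0f} tok/s")
```

Under these assumed constants the knee sits at batch 160, where TPOT just touches the 100 ms ceiling. Lowering `base_ms` or `per_req_ms` (the "better kernels" lever) moves the knee to a larger batch.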
Little's Law — from latency to GPU count
The bridge from "latency target + request rate" to "how much hardware" is one equation that holds for any stable queueing system, with no assumptions about the arrival or service-time distributions:

$$L = \lambda W$$

where L = average number of requests in the system (concurrency), λ = arrival rate (req/s), W = average time in system (E2E latency, s). The logic is unavoidable: if 100 requests arrive per second and each lingers 2 seconds, then on average 100 · 2 = 200 are in flight at once. You must have capacity for 200 concurrent requests or the queue grows without bound and latency → ∞.
Notice the leverage: halve TPOT (faster decode) and, when decode dominates W, W roughly halves, so L and the GPU count fall almost in proportion. Every decode optimization in lesson 06 cashes out through this equation. Conversely, a long-output workload (reasoning models emitting 10k thinking tokens) inflates W enormously, which is why reasoning-model serving is a different capacity regime, revisited in the capstone.
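The chain from SLOs to concurrency to GPUs can be sketched in a few lines. This is an illustration under stated assumptions: W is approximated as TTFT + TPOT · mean output length, and `concurrent_per_gpu` is a hypothetical input (how many in-flight requests one GPU sustains within the SLO) that you would have to measure for your own deployment.

```python
import math


def concurrency(arrival_rate_rps: float, ttft_s: float, tpot_s: float,
                mean_output_len: float) -> float:
    """Little's Law, L = lambda * W, with W approximated as TTFT + TPOT * output_len."""
    w = ttft_s + tpot_s * mean_output_len
    return arrival_rate_rps * w


def gpus_needed(in_flight: float, concurrent_per_gpu: int) -> int:
    """Round up: a fractional GPU still has to be provisioned as a whole one."""
    return math.ceil(in_flight / concurrent_per_gpu)


# Example numbers are illustrative: 100 req/s, 500 ms TTFT, 250 output tokens.
base = concurrency(100, ttft_s=0.5, tpot_s=0.04, mean_output_len=250)   # W = 10.5 s
fast = concurrency(100, ttft_s=0.5, tpot_s=0.02, mean_output_len=250)   # W = 5.5 s

print(f"TPOT 40 ms: L = {base:.0f}, GPUs = {gpus_needed(base, 64)}")
print(f"TPOT 20 ms: L = {fast:.0f}, GPUs = {gpus_needed(fast, 64)}")
```

With these assumed inputs, halving TPOT takes the fleet from 17 GPUs to 9: close to half, but not exactly, because TTFT is unchanged.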
Characterizing a workload you can't see yet
The arithmetic above needs four workload inputs. Before launch you estimate them; after launch you measure and revise (the loop never stops):
- Input length distribution. Sets prefill cost and TTFT. A RAG app (4K-token contexts) and a chat app (200-token) have wildly different prefill bills. Use a distribution, not a mean — the p95 prompt sizes your worst-case batch.
- Output length distribution. Sets decode cost, KV growth, and W. Often the dominant cost driver and the hardest to predict (reasoning models surprised everyone here).
- Request rate & burstiness. The λ in Little's Law, plus its peak-to-average ratio. A 10× diurnal swing means you either over-provision for peak or autoscale (lesson 05).
- Prefix-sharing structure. How much of the input repeats across requests (system prompts, few-shot examples, agent history). Decides whether prefix caching (lesson 06) is a rounding error or a 5× win — exactly the question the SGLang track opens on.
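Once traffic exists, the four inputs above can be pulled out of a request log. The sketch below assumes a hypothetical log schema (`Request` with token counts and a precomputed `shared_prefix_tokens` field) plus per-minute request counts; real logs will need their own parsing and a real prefix-matching step.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    input_tokens: int
    output_tokens: int
    shared_prefix_tokens: int  # prompt tokens already seen in earlier requests (assumed precomputed)


def percentile(values, p):
    """Nearest-rank percentile."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests, requests_per_minute):
    """Reduce a log to the four workload inputs: input, output, rate shape, prefix sharing."""
    inputs = [r.input_tokens for r in requests]
    outputs = [r.output_tokens for r in requests]
    avg_rate = sum(requests_per_minute) / len(requests_per_minute)
    return {
        "p95_input_tokens": percentile(inputs, 95),
        "p95_output_tokens": percentile(outputs, 95),
        "mean_output_tokens": sum(outputs) / len(outputs),
        "peak_to_avg_rate": max(requests_per_minute) / avg_rate,
        "prefix_share": sum(r.shared_prefix_tokens for r in requests) / sum(inputs),
    }


# Tiny made-up log: three chat-like requests and one RAG-like request.
log = [
    Request(200, 150, 100),
    Request(4000, 300, 3500),
    Request(250, 100, 100),
    Request(220, 2000, 100),
]
summary = summarize(log, requests_per_minute=[10, 12, 40, 18])
for name, value in summary.items():
    print(f"{name}: {value}")
```

Re-running a summary like this on a schedule and comparing it with the estimates used for sizing is one simple way to notice workload drift.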
Interactive · requirements → fleet size
Turn the four requirements into a GPU count via Little's Law. Watch how a tighter TPOT SLO or a longer output distribution multiplies the fleet — and notice that the model isn't even an input here. Requirements, not the model, size the system.
What carries forward
- Two latency SLOs (TTFT, TPOT), stated at percentiles, plus a throughput target and goodput (the share of that throughput that actually meets the SLOs). The pair exists because the workload has two phases.
- Latency and throughput trade off along a frontier; design to the knee under the SLO ceiling.
- Little's Law (L = λW) converts SLOs into concurrency into GPUs. It is the most reused equation in serving design.
- Workload shape — especially output length — drives cost more than model choice. Measure it; alert on drift.