Designing one serving replica
A serving fleet is just N copies of one thing — a replica: one model loaded on one or more GPUs, serving requests. Lesson 03 gave you a concurrency target (Little's Law) and a per-GPU capacity knob you took on faith. This lesson computes that knob from first principles. Get one replica right and lesson 05 is mostly multiplication; get it wrong and you buy the mistake N times.
1 · The request lifecycle on a replica
Lesson 01 split a request into prefill then decode. On a real replica those two phases pass through five stages:
Prefill processes the whole prompt in one forward pass — a big matmul over hundreds or thousands of tokens, so it sits right of the roofline ridge (~295 FLOP/byte on an H100) and is compute-bound. It ends when the first token is emitted: prefill latency is TTFT. Decode then runs one forward pass per output token, each reading the full weight set to produce a single token — far left of the ridge, bandwidth-bound. Each decode step's latency is TPOT. The whole serving game is "keep prefill busy and make decode read weights for many requests at once."
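The ridge argument above can be checked with a few lines of arithmetic. This is a minimal sketch: the peak FLOP/s and HBM bandwidth are assumed, approximate H100 SXM datasheet values, and the intensity model counts only weight traffic (attention and activation reads are ignored). The function names are illustrative, not from any library.

```python
# Roofline sketch: compare the arithmetic intensity of prefill and decode
# against the hardware ridge point. Hardware figures are assumed,
# approximate H100 SXM datasheet values, not measurements.
PEAK_FLOPS = 989e12      # assumed dense BF16 peak, FLOP/s
HBM_BANDWIDTH = 3.35e12  # assumed HBM bandwidth, bytes/s


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """FLOP/byte at which a kernel moves from bandwidth- to compute-bound."""
    return peak_flops / bandwidth


def weight_matmul_intensity(tokens_in_pass: int, dtype_bytes: int = 2) -> float:
    """Approximate FLOP per byte of weights read in one forward pass.

    Each parameter contributes 2 FLOPs (multiply + add) per token and is
    read once, so intensity ~= 2 * tokens / dtype_bytes.
    """
    return 2 * tokens_in_pass / dtype_bytes


def bound(tokens_in_pass: int) -> str:
    """Classify a forward pass as compute- or bandwidth-bound."""
    ridge = ridge_point(PEAK_FLOPS, HBM_BANDWIDTH)
    if weight_matmul_intensity(tokens_in_pass) >= ridge:
        return "compute"
    return "bandwidth"


print(f"ridge ~ {ridge_point(PEAK_FLOPS, HBM_BANDWIDTH):.0f} FLOP/byte")
print("prefill, 2000-token prompt:", bound(2000))
print("decode, batch of 1:", bound(1))
print("decode, batch of 32:", bound(32))
```

Under this simplification, intensity in fp16 is roughly the number of tokens in the pass, which is why the ridge is quoted as a number of tokens in flight.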
2 · Static vs continuous batching
Batching is how decode escapes the bandwidth roof: read each weight tensor once from HBM, apply it to B requests' tokens. But how you batch decides whether the GPU is busy or idle.
Static batching groups K requests, runs them together, and returns the batch only when the last one finishes. Because output lengths vary wildly (lesson 03), a batch of 32 where one request emits 2,000 tokens and the rest emit 50 keeps 31 slots stalled, decoding padding, for the entire tail. Utilization collapses toward 1/B in the worst case.
Continuous (iteration-level) batching makes the batch a living set: every decode step, finished requests are evicted and waiting requests are admitted into the freed slots. The GPU stays full regardless of length skew. This is the single most important thing a replica does — it is the difference between 20% and 80%+ decode utilization, and it is the precondition for riding the latency–throughput knee of lesson 03. It is table stakes: every serious engine implements it. The mechanism, including how prefill and decode steps interleave, is the subject of vLLM 04.
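To see the gap between the two policies, here is a toy simulation, not a model of any real scheduler. It assumes every running request emits exactly one token per decode step, ignores prefill and KV limits, and admits requests in arrival order. Both function names are hypothetical.

```python
from collections import deque


def static_utilization(output_lens: list[int], batch_size: int) -> float:
    """Fraction of slot-steps doing useful decode under static batching.

    Requests are grouped in arrival order; each batch runs until its
    longest member finishes.
    """
    useful = sum(output_lens)
    total = 0
    for i in range(0, len(output_lens), batch_size):
        batch = output_lens[i:i + batch_size]
        total += batch_size * max(batch)
    return useful / total


def continuous_utilization(output_lens: list[int], batch_size: int) -> float:
    """Same metric when freed slots are refilled before every decode step."""
    waiting = deque(output_lens)
    running: list[int] = []  # remaining output tokens per running request
    useful = 0
    total = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every running request emits one token.
        useful += len(running)
        total += batch_size
        # Evict requests that just emitted their last token.
        running = [r - 1 for r in running if r > 1]
    return useful / total


# The skewed workload from the text, repeated so the queue stays deep.
lens = ([2000] + [50] * 31) * 100
print(f"static:     {static_utilization(lens, 32):.1%}")
print(f"continuous: {continuous_utilization(lens, 32):.1%}")
```

Continuous batching only stays near full while the waiting queue is non-empty. Once the queue drains, the long requests left at the tail run in a partly empty batch, so a short simulation understates the steady-state benefit.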
3 · The KV memory budget — the core arithmetic
This is the lesson's spine. On a replica, the batch ceiling is almost never set by compute — it's set by how much KV cache fits in HBM after the weights move in. Start from the budget:
usable_KV_HBM = HBM − weights − overhead
max_concurrent_tokens = usable_KV_HBM / kv_bytes_per_token
max_concurrent_requests ≈ max_concurrent_tokens / avg_context_len
Weights are 2N bytes for an N-parameter model stored in fp16/bf16 (lesson 02). Overhead (the CUDA context, activation scratch, NCCL buffers, the framework's own bookkeeping) is a roughly fixed 2–4 GB you must reserve or you OOM under load. KV bytes/token is the lesson-02 formula 2 · n_layers · n_kv_heads · head_dim · dtype_bytes, where the leading 2 counts the K and V tensors.
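The 60 GB used below is the KV budget of an 8B replica, consistent with an 80 GB GPU: 80 − 16 (weights) − 4 (overhead) = 60 GB.
KV/token (32 layers, 8 KV heads, head_dim 128, fp16): 2 · 32 · 8 · 128 · 2 = 131,072 B = 128 KB/token (binary kilobytes).
At a 2,000-token average context: 2000 · 128 KB ≈ 256 MB per request (262,144,000 B exactly). So the replica fits 60 GB / 256 MB ≈ 234 concurrent requests; carrying exact bytes throughout gives about 229.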
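The budget arithmetic is short enough to write down as code. This is a minimal sketch with hypothetical function names. The 80 GB of HBM and 4 GB of overhead are assumptions chosen to reproduce the 60 GB KV budget. Because it carries exact bytes and floors the result, it lands a little below the rounded 234.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV cache bytes for one token; the leading 2 covers K and V."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def max_concurrent_requests(hbm_bytes: int, n_params: int, overhead_bytes: int,
                            kv_per_token: int, avg_context_len: int,
                            weight_dtype_bytes: int = 2) -> int:
    """Whole requests whose KV fits in HBM after weights and overhead."""
    usable_kv = hbm_bytes - weight_dtype_bytes * n_params - overhead_bytes
    if usable_kv <= 0:
        return 0  # the weights alone do not fit
    return usable_kv // (kv_per_token * avg_context_len)


GB = 10**9
kv_tok = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
ceiling = max_concurrent_requests(
    hbm_bytes=80 * GB,        # assumed 80 GB GPU
    n_params=8 * 10**9,       # 8B parameters, 2 bytes each
    overhead_bytes=4 * GB,    # assumed, top of the 2-4 GB range
    kv_per_token=kv_tok,
    avg_context_len=2000,
)
print(kv_tok)   # 131072
print(ceiling)  # 228
```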
The ceiling is memory, not compute. Compute would love a batch near the ridge (~295 tokens in flight to be compute-efficient), and here memory allows roughly 234, which gets close. The two constraints happen to be nearly compatible for an 8B model. That is a happy accident of model size, as the next box shows.
4 · Prefill vs decode contention
One replica runs both phases on the same GPU, and they fight. A decode step for a full batch is cheap and short (bandwidth-bound, typically a few milliseconds). A prefill of a 4,000-token prompt is a long compute-bound burst. If the scheduler runs that prefill as one monolithic step, every request currently decoding is frozen until it finishes: a TPOT spike for everyone, plus the prefilling request's own TTFT. Worse, a stream of long prompts can starve decode entirely. Chunked prefill (lesson 06) is the usual remedy: it splits a long prompt into pieces that interleave with decode steps.
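A rough timing model shows the scale of the stall. This is a sketch under stated assumptions: an 8B model, approximate H100 SXM peak figures, perfect hardware utilization, and weight-only FLOP and byte counts. The 512-token chunk size is purely illustrative. Real times will be longer, but the ratio is what matters here.

```python
PEAK_FLOPS = 989e12      # assumed dense BF16 peak, FLOP/s
HBM_BANDWIDTH = 3.35e12  # assumed HBM bandwidth, bytes/s
N_PARAMS = 8e9           # 8B-parameter model


def prefill_seconds(prompt_tokens: int) -> float:
    """Ideal compute-bound prefill time: about 2 FLOPs per parameter per token."""
    return 2 * N_PARAMS * prompt_tokens / PEAK_FLOPS


def decode_step_seconds(dtype_bytes: int = 2) -> float:
    """Ideal bandwidth-bound decode step: read every weight once."""
    return dtype_bytes * N_PARAMS / HBM_BANDWIDTH


def worst_decode_stall(prompt_tokens: int, chunk_tokens=None) -> float:
    """Longest uninterrupted prefill burst a decoding request must wait out.

    With chunk_tokens=None the prompt runs as one monolithic step.
    """
    if chunk_tokens is None:
        burst = prompt_tokens
    else:
        burst = min(chunk_tokens, prompt_tokens)
    return prefill_seconds(burst)


step = decode_step_seconds()
mono = worst_decode_stall(4000)
chunked = worst_decode_stall(4000, chunk_tokens=512)  # hypothetical chunk size
print(f"decode step:        {step * 1e3:.1f} ms")
print(f"monolithic prefill: {mono * 1e3:.1f} ms ({mono / step:.0f} decode steps)")
print(f"512-token chunk:    {chunked * 1e3:.1f} ms ({chunked / step:.1f} decode steps)")
```

Under these assumptions the monolithic prefill stalls decode for more than ten ideal decode steps, while a single chunk costs under two.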
5 · The two knobs that fall out of the math
Every serving engine exposes two ceilings, and §3 hands you both directly:
| Knob | What it caps | Set it from |
|---|---|---|
| max-num-seqs | concurrent requests in the running batch | max_concurrent_requests from §3 (the 234) |
| max-model-len | tokens per request (prompt + output) | the context your workload's p95 actually needs |
They trade against each other through one shared pool of KV bytes. Raising max-model-len reserves more KV per slot, so fewer slots fit:
max-num-seqs ≈ usable_KV_HBM / (max-model-len · kv_bytes_per_token)
For the 8B replica with 60 GB of KV: an 8K context allows 60e9 / (8192 · 131072) ≈ 56 seqs; drop to the 2,000-token context of §3 and you get the ~234 computed there (about 229 in exact bytes). Context length and batch size are the same budget spent two ways. Setting them is a workload decision (lesson 03's input/output distributions), not a default to leave alone: over-provisioning max-model-len for a rare long prompt silently quarters your batch ceiling and your throughput.
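The trade-off is easiest to see as a sweep. This is a small sketch, assuming the 60 GB KV pool and 131,072 B/token from §3 and a worst case in which every slot reserves a full max-model-len of KV. The function is a hypothetical helper, not an engine API, and it floors to whole sequences, so 8K shows 55 rather than the rounded 56.

```python
KV_BYTES_PER_TOKEN = 131_072   # from the section 3 example
USABLE_KV_BYTES = 60 * 10**9   # 60 GB KV pool from the section 3 example


def max_num_seqs_for(max_model_len: int,
                     usable_kv_bytes: int = USABLE_KV_BYTES,
                     kv_per_token: int = KV_BYTES_PER_TOKEN) -> int:
    """Whole sequences that fit if each reserves max_model_len tokens of KV."""
    return usable_kv_bytes // (max_model_len * kv_per_token)


print(f"{'max-model-len':>14} {'max-num-seqs':>13}")
for ctx in (2048, 4096, 8192, 16384, 32768):
    print(f"{ctx:>14} {max_num_seqs_for(ctx):>13}")
```

Each doubling of the context limit halves the slot count, which is the "quarters your batch ceiling" effect when going from 2K to 8K.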
6 · The single-replica frontier
Lesson 03's latency–throughput curve is now a property you can locate on a real replica. As you raise the running batch (up to the max-num-seqs ceiling), throughput climbs because each weight read serves more tokens — until decode steps get heavy enough that TPOT starts rising. The knee is the largest batch whose p99 TPOT still clears the SLO.
That operating-point batch is exactly the requests/GPU number lesson 03's Little's-Law calculator took as a knob. The loop closes: requirements set the SLO line, the KV budget sets the ceiling, and the knee under the line is the per-replica capacity you multiply out into a fleet in lesson 05.
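Locating the operating point can be sketched with a deliberately crude TPOT model. The assumptions are loud ones: decode is purely bandwidth-bound, each step reads the weights once plus every running request's KV, the bandwidth is an approximate H100 SXM figure, and the model yields a mean step time rather than a measured p99. In practice you would replace `tpot_seconds` with benchmark data. All names are hypothetical.

```python
HBM_BANDWIDTH = 3.35e12        # assumed HBM bandwidth, bytes/s
WEIGHT_BYTES = 16 * 10**9      # 8B parameters at 2 bytes each
KV_BYTES_PER_TOKEN = 131_072   # from the section 3 example


def tpot_seconds(batch: int, ctx_len: int) -> float:
    """Modeled decode step time: weights read once plus all KV in the batch."""
    bytes_read = WEIGHT_BYTES + batch * ctx_len * KV_BYTES_PER_TOKEN
    return bytes_read / HBM_BANDWIDTH


def operating_batch(slo_seconds: float, max_num_seqs: int, ctx_len: int = 2000) -> int:
    """Largest batch up to the memory ceiling whose modeled TPOT meets the SLO."""
    best = 0
    for batch in range(1, max_num_seqs + 1):
        if tpot_seconds(batch, ctx_len) > slo_seconds:
            break  # TPOT grows with batch, so no larger batch can pass
        best = batch
    return best


# Memory ceiling of 228 is the exact-bytes result for the 8B example.
print(operating_batch(slo_seconds=0.020, max_num_seqs=228))  # 194
print(operating_batch(slo_seconds=0.030, max_num_seqs=228))  # 228
```

With a 20 ms TPOT target the SLO binds before memory does; relax it to 30 ms and the KV ceiling becomes the limit again.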
Interactive · KV budget & batch ceiling calculator
Pick a GPU, a model, a context length, and a KV dtype. The widget runs the §3 arithmetic and tells you both whether the model even fits and where the binding constraint is — memory or the roofline ridge.
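For readers without the widget, the same arithmetic can be approximated offline. This sketch is not the widget's actual code. The `Replica` fields, the 4 GB overhead default, the 295-token ridge, and the 70B-style shape (80 layers, 8 KV heads, head_dim 128) are all illustrative assumptions.

```python
from dataclasses import dataclass

GB = 10**9


@dataclass
class Replica:
    hbm_bytes: int
    n_params: int
    n_layers: int
    n_kv_heads: int
    head_dim: int
    kv_dtype_bytes: int = 2
    weight_dtype_bytes: int = 2
    overhead_bytes: int = 4 * GB   # assumed fixed overhead


def analyze(r: Replica, ctx_len: int, ridge_tokens: int = 295) -> dict:
    """Report whether the model fits and which constraint binds the batch."""
    usable_kv = r.hbm_bytes - r.weight_dtype_bytes * r.n_params - r.overhead_bytes
    if usable_kv <= 0:
        return {"fits": False, "max_seqs": 0, "binding": "weights"}
    kv_per_token = 2 * r.n_layers * r.n_kv_heads * r.head_dim * r.kv_dtype_bytes
    max_seqs = usable_kv // (ctx_len * kv_per_token)
    # If memory caps the batch below the ridge, memory is the binding limit.
    binding = "memory" if max_seqs < ridge_tokens else "roofline ridge"
    return {"fits": True, "max_seqs": max_seqs, "binding": binding}


small = Replica(80 * GB, 8 * 10**9, n_layers=32, n_kv_heads=8, head_dim=128)
small_fp8_kv = Replica(80 * GB, 8 * 10**9, 32, 8, 128, kv_dtype_bytes=1)
large = Replica(80 * GB, 70 * 10**9, n_layers=80, n_kv_heads=8, head_dim=128)

print(analyze(small, ctx_len=2000))
print(analyze(small_fp8_kv, ctx_len=2000))
print(analyze(large, ctx_len=2000))
```

Halving the KV dtype doubles the slot count and moves the 8B example past the ridge, while the 70B-style model fails the fit check on a single 80 GB GPU before KV is even considered.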
What carries forward
- A replica is the atomic serving unit; design it (GPU count, knobs, operating point) before scaling out in lesson 05.
- Continuous batching is mandatory, and it relies on paged KV — that's why PagedAttention and continuous batching are the first mechanisms to learn.
- The batch ceiling is a memory calculation: (HBM − 2N − overhead) / (ctx · kv_bytes_per_token). Memory binds before compute for typical models; the weights term decides if a model fits one GPU at all.
- max-num-seqs and max-model-len spend one KV budget two ways — raise one, the other falls. Set both from the workload, not from defaults.
- Prefill and decode contend on a shared replica; chunked prefill (lesson 06) keeps a long prompt from spiking everyone's TPOT.
- The replica's knee under the SLO is the requests/GPU number lesson 03 plugged into Little's Law — the loop is now closed.