Designing one serving replica

A serving fleet is just N copies of one thing — a replica: one model loaded on one or more GPUs, serving requests. Lesson 03 gave you a concurrency target (Little's Law) and a per-GPU capacity knob you took on faith. This lesson computes that knob from first principles. Get one replica right and lesson 05 is mostly multiplication; get it wrong and you buy the mistake N times.

The atomic unit

A replica is the smallest thing that can independently answer a request end-to-end. Everything above it (load balancing, autoscaling, routing) is lesson 05; everything below it (kernels, paging) is the mechanism tracks. The replica is where the design loop's "topology → bottleneck" step lands for serving: you pick a GPU count, then ask what binds — memory or compute?

1 · The request lifecycle on a replica

Lesson 01 split a request into prefill then decode. On a real replica those two phases pass through five stages:

Prefill processes the whole prompt in one forward pass — a big matmul over hundreds or thousands of tokens, so it sits right of the roofline ridge (~295 FLOP/byte on an H100) and is compute-bound. It ends when the first token is emitted: prefill latency is TTFT. Decode then runs one forward pass per output token, each reading the full weight set to produce a single token — far left of the ridge, bandwidth-bound. Each decode step's latency is TPOT. The whole serving game is "keep prefill busy and make decode read weights for many requests at once."
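The ridge figure and the prefill/decode split can be checked with a few lines of arithmetic. This is a minimal sketch, not a profiler: the two H100 numbers are assumed SXM datasheet values (dense BF16 peak and HBM3 bandwidth), and `matmul_intensity` is a hypothetical helper that counts only weight traffic, ignoring activations and KV reads.

```python
# Assumed H100 SXM datasheet figures (not stated in the lesson).
PEAK_FLOPS = 989e12       # dense BF16 FLOP/s
HBM_BANDWIDTH = 3.35e12   # bytes/s


def ridge_point(peak_flops: float = PEAK_FLOPS, bandwidth: float = HBM_BANDWIDTH) -> float:
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute-bound."""
    return peak_flops / bandwidth


def matmul_intensity(tokens_per_pass: int, dtype_bytes: int = 2) -> float:
    """Rough intensity of one forward pass: 2 FLOPs per weight per token,
    dtype_bytes read per weight. Weight traffic only (a simplification)."""
    return 2 * tokens_per_pass / dtype_bytes


def bound(tokens_per_pass: int) -> str:
    """Which roof a pass over this many tokens sits under."""
    if matmul_intensity(tokens_per_pass) >= ridge_point():
        return "compute"
    return "bandwidth"


if __name__ == "__main__":
    print(f"ridge ~ {ridge_point():.0f} FLOP/byte")
    print("prefill, 2000-token prompt:", bound(2000))
    print("decode, one request:", bound(1))
```

Under this simplification the intensity of a decode step equals the number of tokens in flight, which is why batching is the lever for moving decode toward the ridge.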

2 · Static vs continuous batching

Batching is how decode escapes the bandwidth roof: read each weight tensor once from HBM, apply it to B requests' tokens. But how you batch decides whether the GPU is busy or idle.

Static batching groups K requests, runs them together, and returns the batch only when the last one finishes. Because output lengths vary wildly (lesson 03), a batch of 32 where one request emits 2,000 tokens and the rest emit 50 keeps 31 slots stalled, decoding padding, for the entire tail. Utilization collapses toward 1/B in the worst case.

static batch (B=4), 'x' = real token, '.' = wasted slot
  req A  xxxxx
  req B  xx...        ← done at step 2, slot idle 8 more steps
  req C  xxx..        ← done at step 3, slot idle
  req D  xxxxxxxxxx   ← the straggler holds the whole batch
         |---------|  batch returns only here
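The waste in a static batch is easy to quantify. The sketch below assumes every slot is occupied for as many steps as the longest request needs; `static_batch_utilization` is an illustrative name, not an engine API.

```python
def static_batch_utilization(output_lens: list[int]) -> float:
    """Fraction of slot-steps that decode a real token when the batch
    returns only after its longest request finishes."""
    steps = max(output_lens)                 # the straggler sets the batch duration
    slot_steps = len(output_lens) * steps    # every slot is held for all of them
    return sum(output_lens) / slot_steps


if __name__ == "__main__":
    diagram = [5, 2, 3, 10]          # requests A-D from the diagram
    skewed = [50] * 31 + [2000]      # the batch-of-32 example from the text
    print(f"diagram batch: {static_batch_utilization(diagram):.0%}")
    print(f"skewed batch:  {static_batch_utilization(skewed):.1%}")
```

For the skewed batch the result lands in the single digits, close to the 1/B floor of about 3% for B = 32.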

Continuous (iteration-level) batching makes the batch a living set: every decode step, finished requests are evicted and waiting requests are admitted into the freed slots. The GPU stays full regardless of length skew. This is the single most important thing a replica does — it is the difference between 20% and 80%+ decode utilization, and it is the precondition for riding the latency–throughput knee of lesson 03. It is table stakes: every serious engine implements it. The mechanism, including how prefill and decode steps interleave, is the subject of vLLM 04.
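To see the effect of iteration-level admission, here is a toy decode-only simulator. It is a sketch under loud assumptions: no prefill cost, one token per request per step, unlimited KV memory, and a first-come queue. The function name and its arguments are hypothetical, not any engine's scheduler.

```python
from collections import deque


def simulate(output_lens: list[int], slots: int, continuous: bool) -> tuple[int, float]:
    """Return (decode_steps, slot_utilization) for a toy scheduler."""
    waiting = deque(output_lens)
    running: list[int] = []      # tokens still to emit, one entry per occupied slot
    steps = useful = 0
    while waiting or running:
        # Static: admit only once the whole batch has drained.
        # Continuous: refill any free slot before every step.
        if continuous or not running:
            while waiting and len(running) < slots:
                running.append(waiting.popleft())
        steps += 1
        useful += sum(1 for r in running if r > 0)   # slots emitting a real token
        running = [r - 1 for r in running]
        if continuous:
            running = [r for r in running if r > 0]  # evict finished requests now
        elif all(r <= 0 for r in running):
            running = []                             # batch returns as a unit
    return steps, useful / (steps * slots)


if __name__ == "__main__":
    lens = [5, 2, 3, 10, 5, 2, 3, 10]
    for mode in (False, True):
        steps, util = simulate(lens, slots=4, continuous=mode)
        print("continuous" if mode else "static    ", steps, f"{util:.0%}")
```

With only eight requests the queue drains quickly, so the continuous run still has an idle tail; under sustained arrivals the freed slots would keep refilling.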

Why continuous batching needs paged memory

Admitting and evicting requests every step means KV blocks are constantly allocated and freed at token granularity. A contiguous-per-request KV layout fragments instantly. PagedAttention (vLLM 02) stores KV in fixed-size blocks like OS virtual memory, so eviction is just freeing pages — that's what makes the batch ceiling we compute next actually reachable instead of theoretical.
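The paging idea can be sketched as a free list of fixed-size blocks plus a per-request block table. This is a toy illustration only: `BlockPool`, its methods, and the 16-token block size are assumptions for the example and do not mirror vLLM's actual classes.

```python
class BlockPool:
    """Toy fixed-size KV block allocator (hypothetical, for illustration)."""

    def __init__(self, num_blocks: int, block_tokens: int = 16):
        self.block_tokens = block_tokens
        self.free = list(range(num_blocks))        # ids of unused physical blocks
        self.tables: dict[str, list[int]] = {}     # request id -> its block ids
        self.lengths: dict[str, int] = {}          # request id -> tokens stored

    def append_token(self, req: str) -> bool:
        """Reserve room for one more token. False means the pool is exhausted
        and the caller has to wait or preempt."""
        n = self.lengths.get(req, 0)
        if n % self.block_tokens == 0:             # first token, or last block is full
            if not self.free:
                return False
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lengths[req] = n + 1
        return True

    def release(self, req: str) -> None:
        """Evict a request: its blocks go straight back on the free list."""
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)


if __name__ == "__main__":
    pool = BlockPool(num_blocks=4)
    for _ in range(17):
        pool.append_token("a")                     # 17 tokens span two blocks
    print(pool.tables["a"], "free:", len(pool.free))
    pool.release("a")
    print("free after release:", len(pool.free))
```

Because blocks need not be contiguous, a freed block is immediately reusable by any other request, which is the property continuous batching leans on.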

3 · The KV memory budget — the core arithmetic

This is the lesson's spine. On a replica, the batch ceiling is almost never set by compute — it's set by how much KV cache fits in HBM after the weights move in. Start from the budget:

usable_KV_HBM = HBM − weights − overhead
max_concurrent_tokens = usable_KV_HBM / kv_bytes_per_token
max_concurrent_requests ≈ max_concurrent_tokens / avg_context_len

Weights are 2N bytes (lesson 02). Overhead — the CUDA context, activation scratch, NCCL buffers, the framework's own bookkeeping — is a roughly fixed 2–4 GB you must reserve or you OOM under load. KV bytes/token is the lesson-02 formula 2 · n_layers · n_kv_heads · head_dim · dtype_bytes.
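The three budget lines and the KV formula translate directly into code. A minimal sketch, assuming decimal gigabytes (1 GB = 1e9 bytes); the function names and the returned dictionary keys are illustrative choices.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """K and V (the leading 2) for every layer and every KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def kv_budget(hbm_bytes: float, weight_bytes: float, overhead_bytes: float,
              kv_per_token: int, avg_context_len: int) -> dict[str, float]:
    """Apply the three budget lines; everything is zero if the weights do not fit."""
    usable = hbm_bytes - weight_bytes - overhead_bytes
    if usable <= 0:
        return {"usable_kv_bytes": 0.0, "max_concurrent_tokens": 0.0,
                "max_concurrent_requests": 0.0}
    tokens = usable / kv_per_token
    return {"usable_kv_bytes": usable,
            "max_concurrent_tokens": tokens,
            "max_concurrent_requests": tokens / avg_context_len}


if __name__ == "__main__":
    # An 8B-class shape: 32 layers, 8 KV heads, head_dim 128, fp16 KV.
    kv = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
    budget = kv_budget(hbm_bytes=80e9, weight_bytes=2 * 8e9, overhead_bytes=4e9,
                       kv_per_token=kv, avg_context_len=2000)
    for name, value in budget.items():
        print(f"{name}: {value:,.0f}")
```

Keeping every quantity in bytes until the final division avoids mixing binary and decimal prefixes, which can shift the answer by several percent.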

Worked: Llama-3-8B fp16 on one H100 80GB

Weights: 2 · 8e9 = 16 GB. Overhead: ~4 GB. Usable for KV: 80 − 16 − 4 = 60 GB.
KV/token (32 layers, 8 KV heads, head_dim 128, fp16): 2 · 32 · 8 · 128 · 2 = 131,072 B = 128 KB/token.
At a 2,000-token average context: 2000 · 131,072 B ≈ 262 MB / request. So the replica fits 60 GB / 262 MB ≈ 229 concurrent requests.
The ceiling is memory, not compute. Compute would love a batch near the ridge (~295 tokens-in-flight to be compute-efficient), and here memory allows a batch of the same order: about 229 at this context length, and more at shorter ones. The two constraints happen to be roughly compatible for an 8B model. That is a happy accident of model size, as the next box shows.

The 70B squeeze — when memory wins the argument violently

Try Llama-3-70B fp16 on one H100: weights alone are 2 · 70e9 = 140 GB > 80 GB. It does not fit one GPU at all — there is zero room for KV because there isn't even room for the weights. The replica must span multiple GPUs with tensor parallelism, which splits both weights and KV across devices. That is lesson 05's first job. The point stands: model size, through the weight term, decides whether KV is roomy, tight, or impossible.
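A rough sweep over tensor-parallel degree shows how quickly the weight term stops dominating. Assumptions, stated loudly: weights and KV split evenly across GPUs, the 4 GB overhead is paid on every GPU, and the 70B shape (80 layers, 8 KV heads, head_dim 128) is taken from the published model configuration rather than from this lesson. `replica_kv_room` is a hypothetical helper.

```python
def replica_kv_room(tp: int, params: float = 70e9, hbm_bytes: float = 80e9,
                    overhead_bytes: float = 4e9, weight_bytes_per_param: int = 2) -> float:
    """Total KV bytes across a tp-way replica; 0.0 if the weight shard does not fit."""
    per_gpu = hbm_bytes - params * weight_bytes_per_param / tp - overhead_bytes
    return max(per_gpu, 0.0) * tp


# Assumed 70B shape: 2 (K and V) * layers * KV heads * head_dim * fp16 bytes.
KV_PER_TOKEN_70B = 2 * 80 * 8 * 128 * 2

if __name__ == "__main__":
    for tp in (1, 2, 4, 8):
        room = replica_kv_room(tp)
        requests = room / KV_PER_TOKEN_70B / 2000   # 2,000-token average context
        print(f"TP={tp}: KV room {room / 1e9:5.0f} GB, ~{requests:.0f} requests")
```

In this sketch two GPUs make the model fit but leave very little KV room, so the smallest degree that fits is not necessarily the one you want to run.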

4 · Prefill vs decode contention

One replica runs both phases on the same GPU, and they fight. A decode step for a full batch is cheap and short (bandwidth-bound, on the order of milliseconds: reading 16 GB of fp16 weights at roughly 3.35 TB/s takes about 5 ms). A prefill of a 4,000-token prompt is a long compute-bound burst. If the scheduler runs that prefill as one monolithic step, every request currently decoding is frozen until it finishes: a TPOT spike for everyone, plus the prefilling request's own TTFT. Worse, a stream of long prompts can starve decode entirely.

The fix, named here and detailed later

Chunked prefill breaks a long prompt into token chunks and interleaves them with decode steps, so a 4K prompt becomes (say) eight 512-token slices that each share a step with the ongoing decode batch. TTFT rises slightly; TPOT stops spiking; goodput improves. We size this trade in lesson 06 (mechanism in vLLM's KV/scheduling track). For now: know that prefill–decode interference is a real source of tail latency on a shared replica, and that it has a standard cure.

5 · The two knobs that fall out of the math

Every serving engine exposes two ceilings, and §3 hands you both directly:

Knob	What it caps	Set it from
max-num-seqs	concurrent requests in the running batch	max_concurrent_requests from §3 (the worked example's ceiling)
max-model-len	tokens per request (prompt + output)	the context your workload's p95 actually needs

They trade against each other through one shared pool of KV bytes. Raising max-model-len reserves more KV per slot, so fewer slots fit:

max-num-seqs ≈ usable_KV_HBM / (max-model-len · kv_bytes_per_token)

For the 8B replica with 60 GB of KV: an 8K context allows 60e9 / (8192 · 131072) ≈ 56 seqs; drop to a 2K context and 60e9 / (2048 · 131072) ≈ 224. Context length and batch size are the same budget spent two ways. Setting them is a workload decision (lesson 03's input/output distributions), not a default to leave alone: over-provisioning max-model-len for a rare long prompt silently quarters your batch ceiling and your throughput.
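The trade can be read in either direction: fix the context and solve for slots, or fix the slots and solve for the context you can afford. A short sketch using the 8B figures from the text; the function names are illustrative, and results are left as floats so the rounding choice stays visible.

```python
def max_num_seqs(usable_kv_bytes: float, max_model_len: int, kv_per_token: int) -> float:
    """Slots that fit if every slot reserves a full max_model_len of KV."""
    return usable_kv_bytes / (max_model_len * kv_per_token)


def affordable_model_len(usable_kv_bytes: float, num_seqs: int, kv_per_token: int) -> float:
    """The same budget solved the other way: context per slot for a target batch."""
    return usable_kv_bytes / (num_seqs * kv_per_token)


if __name__ == "__main__":
    usable, kv = 60e9, 131072            # the 8B replica from the text
    for ctx in (2048, 4096, 8192, 32768):
        print(f"max-model-len {ctx:>6}: ~{max_num_seqs(usable, ctx, kv):.0f} seqs")
    print(f"128 seqs leave ~{affordable_model_len(usable, 128, kv):.0f} tokens each")
```

Each doubling of max-model-len halves the slot count, so a long tail in the prompt distribution is often better handled by routing than by raising the limit for everyone.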

6 · The single-replica frontier

Lesson 03's latency–throughput curve is now a property you can locate on a real replica. As you raise the running batch (up to the max-num-seqs ceiling), throughput climbs because each weight read serves more tokens — until decode steps get heavy enough that TPOT starts rising. The knee is the largest batch whose p99 TPOT still clears the SLO.

That operating-point batch is exactly the requests/GPU number lesson 03's Little's-Law calculator took as a knob. The loop closes: requirements set the SLO line, the KV budget sets the ceiling, and the knee under the line is the per-replica capacity you multiply out into a fleet in lesson 05.
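Locating the knee needs a latency model. The one below is deliberately crude and entirely an assumption: it treats a decode step as purely bandwidth-bound, reading the weights once plus each request's KV, at an assumed 3.35 TB/s. It gives a deterministic step time rather than a p99, so treat the result as a starting point for measurement, not a substitute for it.

```python
def tpot_seconds(batch: int, weight_bytes: float, kv_bytes_per_request: float,
                 bandwidth: float = 3.35e12) -> float:
    """Toy bandwidth-only decode step time: one weight read plus every request's KV."""
    return (weight_bytes + batch * kv_bytes_per_request) / bandwidth


def knee_batch(slo_seconds: float, max_num_seqs: int, weight_bytes: float,
               kv_bytes_per_request: float) -> int:
    """Largest batch under the memory ceiling whose modeled TPOT meets the SLO."""
    best = 0
    for batch in range(1, max_num_seqs + 1):
        if tpot_seconds(batch, weight_bytes, kv_bytes_per_request) > slo_seconds:
            break                      # step time only grows with batch size
        best = batch
    return best


if __name__ == "__main__":
    weights = 2 * 8e9                  # 8B model, fp16
    kv_per_request = 2000 * 131072     # 2,000-token context
    for slo_ms in (10, 20, 50):
        b = knee_batch(slo_ms / 1e3, 229, weights, kv_per_request)
        print(f"TPOT SLO {slo_ms} ms -> operating batch {b}")
```

A loose SLO leaves the memory ceiling as the binding limit, while a tight one pulls the operating point below it; that operating batch is the number that feeds the fleet calculation.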

Interactive · KV budget & batch ceiling calculator

Pick a GPU, a model, a context length, and a KV dtype. The widget runs the §3 arithmetic and tells you both whether the model even fits and where the binding constraint is — memory or the roofline ridge.


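Assumption (kept simple, stated honestly): weights are always fp16 (2N); the KV dtype slider only changes KV bytes. We anchor KV at the 128 KB/token from §3 for an 8B model (32 layers, 8 KV heads, head_dim 128, fp16) and scale it linearly with params: bigger models have more layers, so KV/token grows roughly with N. Linear scaling is pessimistic for large grouped-query models, because layer count grows more slowly than N; Llama-3-70B (80 layers, 8 KV heads, head_dim 128) works out to 320 KB/token by the §3 formula, not the ~1,120 KB that linear scaling gives. Overhead is a fixed 4 GB. For models near the 8B anchor these are design-grade (±30%) estimates, not a profiler's truth.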
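For readers without the widget, the same arithmetic fits in one function. This is a sketch of the stated assumptions, not the widget's source: the dtype options, the decimal-GB convention, and the rule that memory "binds" when the ceiling falls below the ~295 ridge are all guesses made for illustration.

```python
KV_DTYPE_BYTES = {"fp16": 2, "fp8": 1}   # assumed slider options
ANCHOR_KV_BYTES = 131072                 # 128 KB/token for 8B params, fp16
RIDGE_BATCH = 295                        # tokens in flight near the roofline ridge


def calculator(hbm_gb: float = 80, params_b: float = 8,
               context_len: int = 2048, kv_dtype: str = "fp16") -> dict:
    """Weights at fp16 (2N), fixed 4 GB overhead, KV/token scaled linearly with params."""
    weights = 2 * params_b * 1e9
    usable = hbm_gb * 1e9 - weights - 4e9
    kv_per_token = ANCHOR_KV_BYTES * (params_b / 8) * (KV_DTYPE_BYTES[kv_dtype] / 2)
    fits = usable > 0
    max_reqs = int(usable // (context_len * kv_per_token)) if fits else 0
    return {
        "fits": fits,
        "weights_gb": weights / 1e9,
        "usable_kv_gb": max(usable, 0.0) / 1e9,
        "kv_bytes_per_token": kv_per_token,
        "max_concurrent_reqs": max_reqs,
        # Assumed rule: memory binds if it caps the batch below the ridge.
        "binding": "memory" if max_reqs < RIDGE_BATCH else "roofline ridge",
    }


if __name__ == "__main__":
    for kwargs in ({}, {"kv_dtype": "fp8"}, {"params_b": 70}):
        print(kwargs or "defaults", calculator(**kwargs))
```

Halving the KV dtype doubles the request ceiling without touching the weight term, which is why it is often the first lever to try when the budget is tight.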

Inputs (defaults shown): HBM 80 GB · params N 8 B · context length 2,048 tok · KV dtype fp16. Outputs: weights · usable KV · KV / token · max concurrent reqs.

What carries forward

A replica is the atomic serving unit; design it (GPU count, knobs, operating point) before scaling out in lesson 05.
Continuous batching is mandatory, and it relies on paged KV — that's why PagedAttention and continuous batching are the first mechanisms to learn.
The batch ceiling is a memory calculation: (HBM − 2N − overhead) / (ctx · kv_bytes_per_token). Memory binds before compute for typical models; the weights term decides if a model fits one GPU at all.
max-num-seqs and max-model-len spend one KV budget two ways — raise one, the other falls. Set both from the workload, not from defaults.
Prefill and decode contend on a shared replica; chunked prefill (lesson 06) keeps a long prompt from spiking everyone's TPOT.
The replica's knee under the SLO is the requests/GPU number lesson 03 plugged into Little's Law — the loop is now closed.