Serving architecture
From an HTTP POST to a CUDA kernel and back. The plumbing that lets one engine loop feed thousands of concurrent streams.
The request lifecycle, end to end
Lessons 01-05 lived inside the engine. This one zooms out. A single request is a long chain of handoffs between a Python web framework, an async runtime, an engine loop, and N GPU workers. The diagram below is what actually happens between curl hitting :8000 and the first token coming back over a server-sent event.
[client] ── HTTP POST /v1/chat/completions ─────────────────────────────────►
│
▼
[FastAPI / uvicorn worker] (one OS thread, many asyncio tasks)
parse OpenAI JSON
tokenize prompt on CPU (tokenizer pool)
build Request{req_id, sampling, stream=True}
│
▼
[AsyncLLMEngine.add_request(req)]
enqueue on input queue
return async iterator ────────────────────────────────► (awaitable)
│
▼
[engine loop, one iter ≈ one forward pass] (dedicated thread or own process)
sched_out = scheduler.schedule() ◄── admit / evict
model_out = model.execute_model(sched_out)
engine_out = process_outputs(model_out) ◄── sample + detok
deliver(engine_out) ──► output queues
│
▼
[TP workers — one per GPU, Ray actors]
embed → L × {attn, mlp, all-reduce} → unembed
PagedAttention reads KV through block tables (lesson 02)
│
▼
[sampler] temperature / top-k / top-p / repetition penalty
[detok] incremental BPE decode, handles partial UTF-8
│
▼
[async iterator yields token delta]
│
▼
[FastAPI streams SSE chunk] data: {"choices":[{"delta":{"content":"…"}}]}
│
▼
[client receives token] ── repeat until [DONE] ───────────────────────────►
The whole thing is asynchronous from the FastAPI side. The engine loop is what gives the request its actual GPU time. Everything above the engine loop is plumbing; everything below is execution. The loop is the contract.
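The last two hops of the diagram (async iterator to SSE chunk) can be sketched in a few lines. This is a minimal illustration, not vLLM's or FastAPI's code: `fake_engine_stream` is a hypothetical stand-in for the iterator returned by `add_request`, and `sse_chunks` is an assumed helper name. Only the frame shape (`data: {...}` followed by a blank line, then `[DONE]`) is taken from the diagram above.

```python
import asyncio
import json


async def fake_engine_stream(tokens):
    """Hypothetical stand-in for the async iterator returned by add_request."""
    for tok in tokens:
        await asyncio.sleep(0)  # hand control back to the event loop between tokens
        yield tok


async def sse_chunks(token_stream):
    """Wrap each token delta in an OpenAI-style SSE frame, then emit [DONE]."""
    async for delta in token_stream:
        payload = {"choices": [{"delta": {"content": delta}}]}
        yield f"data: {json.dumps(payload)}\n\n"
    yield "data: [DONE]\n\n"


async def collect(tokens):
    # A real server would write each frame to the socket instead of collecting them.
    return [frame async for frame in sse_chunks(fake_engine_stream(tokens))]


frames = asyncio.run(collect(["Hel", "lo"]))
```

In a real server each frame would be flushed to the client as soon as it is produced; collecting them into a list here just makes the sketch easy to inspect.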
The engine loop in three calls
Strip away wrapping classes and the real vLLM engine reduces to this:
while not stopping:
    sched_out = scheduler.schedule()            # (1) who runs this step
    model_out = model.execute_model(sched_out)  # (2) one forward pass
    engine_out = process_outputs(model_out)     # (3) sample + detokenize
    deliver(engine_out)                         # stream to clients
- scheduler.schedule(): PagedAttention (02), continuous batching (04), prefix caching (05), chunked prefill (10), preemption (11). All change what fits in the batch.
- model.execute_model(): FlashAttention (03), CUDA graphs (05), TP (06), GQA (09), LoRA grouped GEMM (12). All change how fast the forward runs.
- process_outputs(): speculative decoding (07), disagg (08). Both change how many useful tokens the forward produced.
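To make the three-call loop concrete, here is a toy version that actually runs. Everything in it is an assumption made for illustration: `ToyScheduler`, `Req`, and the stub `execute_model` / `process_outputs` are hypothetical stand-ins, with a "forward pass" that just counts steps. It shows the control flow only (admit, run one step for the whole batch, emit one token per request, retire finished requests), not how vLLM implements any of it.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Req:
    req_id: str
    max_tokens: int
    out: list = field(default_factory=list)


class ToyScheduler:
    def __init__(self, max_active):
        self.max_active = max_active
        self.waiting = deque()
        self.running = []

    def schedule(self):
        # (1) admit waiting requests while the batch has room
        while self.waiting and len(self.running) < self.max_active:
            self.running.append(self.waiting.popleft())
        return list(self.running)

    def retire_finished(self):
        self.running = [r for r in self.running if len(r.out) < r.max_tokens]


def execute_model(batch):
    # (2) stand-in for one forward pass over the whole batch
    return [(req, len(req.out)) for req in batch]


def process_outputs(model_out):
    # (3) stand-in for sampling + detokenization: one token per request per step
    results = []
    for req, step in model_out:
        token = f"t{step}"
        req.out.append(token)
        results.append((req.req_id, token))
    return results


def run(scheduler):
    delivered, steps = [], 0
    while scheduler.waiting or scheduler.running:
        sched_out = scheduler.schedule()
        model_out = execute_model(sched_out)
        engine_out = process_outputs(model_out)
        delivered.extend(engine_out)  # "deliver": a real engine pushes to output queues
        scheduler.retire_finished()
        steps += 1
    return delivered, steps


sched = ToyScheduler(max_active=2)
sched.waiting.extend([Req("a", 2), Req("b", 1), Req("c", 2)])
delivered, steps = run(sched)
```

With two slots and three requests, request `c` is admitted on the second step, as soon as `b` finishes. That is continuous batching in miniature: admission happens every iteration, not once per batch.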
Why asyncio, not threads
A request can spend most of its life in "waiting" state — waiting for prefill, streaming tokens out at the human-readable rate of ~50/sec, idling between user turns. A naive thread-per-request design hits a wall fast.
At roughly 2 MB of stack per thread, 1000 connections reserve about 2 GB, and that's just the stacks. Add scheduler overhead, context switches (~1-3 μs each), and Python's GIL, and you're not getting 1000 connections off the ground in one process. asyncio collapses this:
| | thread-per-request | asyncio task-per-request |
|---|---|---|
| memory / connection | ~2 MB (stack) | ~3 KB (coroutine frame) |
| context switch | syscall, ~1-3 μs | function call, ~100 ns |
| I/O wait | blocks an OS thread | frees the loop via await |
| concurrency unit | OS-scheduled | cooperatively scheduled |
1000 idle SSE streams cost ~3 MB total in asyncio. The loop ticks through them between engine steps with no syscall pressure.
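A quick way to see the "idle streams are nearly free" claim is to park 1000 coroutines on one thread and watch them all finish in about the time of a single wait. This is a minimal sketch: `idle_stream` is a hypothetical stand-in for a connection that is waiting on the engine, and the 50 ms delay is an arbitrary choice.

```python
import asyncio
import time


async def idle_stream(delay):
    # Stand-in for an SSE connection waiting on its next token.
    await asyncio.sleep(delay)  # suspends this task; the loop is free to run others
    return 1


async def main(n, delay):
    start = time.perf_counter()
    results = await asyncio.gather(*(idle_stream(delay) for _ in range(n)))
    return sum(results), time.perf_counter() - start


finished, elapsed = asyncio.run(main(1000, 0.05))
```

Run serially, the same waits would take 1000 × 50 ms = 50 s. Here the waits overlap on a single OS thread, so the elapsed time stays close to one delay plus scheduling overhead.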
Why the engine runs on its own thread (or process)
Counter-intuitive after the previous section: the engine itself does not benefit from asyncio. Inside the loop body, between CUDA calls, sits a pile of CPU-heavy Python: building block tables, packing input tensors, running the sampler, incremental detokenization. If you put that work directly on the asyncio loop, every decode step blocks the loop for many milliseconds and the SSE streams stutter.
So vLLM puts the engine loop on its own thread (single-process mode) or its own process (Ray mode). The split:
[asyncio loop, main thread]          [engine thread/process]
HTTP handlers                        scheduler.schedule()
SSE streaming                        model.execute_model()
token-level fan-out                  sampling + detokenization
handshake via queues  ◄──────────►   handshake via queues
The two communicate via thread-safe queues. The asyncio loop never blocks. The engine never waits on a socket. The GIL is released inside CUDA calls (which are non-Python), so the asyncio side gets cycles even while the engine is in a forward pass.
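The thread/queue split can be sketched with the standard library alone. This is an illustration under stated assumptions, not vLLM's implementation: `engine_thread` is a hypothetical blocking loop, the `(req_id, n_tokens)` request tuple and the `None` end-of-stream marker are invented for the example. The one real mechanism on display is `loop.call_soon_threadsafe`, which is how a foreign thread can safely hand items to an `asyncio.Queue`.

```python
import asyncio
import queue
import threading


def engine_thread(inbox, loop, outbox):
    """Blocking engine loop on its own thread; it never touches a socket."""
    while True:
        req = inbox.get()  # thread-safe blocking get
        if req is None:    # sentinel: shut down
            break
        req_id, n_tokens = req
        for i in range(n_tokens):
            token = f"{req_id}-tok{i}"  # stand-in for forward + sample + detok
            # asyncio.Queue is not thread-safe, so schedule the put on the loop.
            loop.call_soon_threadsafe(outbox.put_nowait, (req_id, token))
        loop.call_soon_threadsafe(outbox.put_nowait, (req_id, None))  # end of stream


async def main():
    loop = asyncio.get_running_loop()
    inbox = queue.Queue()     # asyncio side -> engine
    outbox = asyncio.Queue()  # engine -> asyncio side
    worker = threading.Thread(
        target=engine_thread, args=(inbox, loop, outbox), daemon=True
    )
    worker.start()

    inbox.put(("r1", 3))
    tokens = []
    while True:
        _, tok = await outbox.get()  # awaiting here never blocks the event loop
        if tok is None:
            break
        tokens.append(tok)

    inbox.put(None)
    worker.join()  # returns promptly: the thread exits on the sentinel
    return tokens


tokens = asyncio.run(main())
```

The asymmetry is deliberate: the engine side uses a blocking `queue.Queue` because it has nothing else to do while idle, while the asyncio side only ever awaits.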
Scaling beyond one GPU
One model copy on one GPU only takes you so far. Four orthogonal axes exist; production deployments combine them.
| axis | what's split | communication | where it lives |
|---|---|---|---|
| TP · tensor parallel | each weight matrix across N GPUs along the hidden dim | all-reduce after every split matmul; bytes ≈ batch · seq · d | intra-node, NVLink (~900 GB/s) |
| PP · pipeline parallel | layers across nodes — first 20 on node A, next 20 on node B | activations only, GPU→GPU; bytes ≈ batch · seq · d per stage boundary | cross-node — activations are much smaller than weights |
| DP · data parallel | full model replicated; requests load-balanced across replicas | none on the critical path — just LB routing | whole fleet |
| EP · expert parallel | MoE experts distributed across GPUs; tokens routed to their expert's GPU | all-to-all per MoE layer | intra-node for routing efficiency |
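The TP row is easier to trust once you see that sharding plus all-reduce reproduces the unsharded result. The sketch below is a pure-Python toy, assuming a row-parallel split (the input vector and the weight rows are cut along the hidden dimension) and modeling the all-reduce as an elementwise sum across shards. `matvec` and `tp_matvec` are hypothetical helper names; real implementations run this on GPUs with NCCL.

```python
def matvec(x, W):
    """Compute x @ W for a length-k vector x and a k x m matrix W (list of rows)."""
    m = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(m)]


def tp_matvec(x, W, n_shards):
    """Row-parallel matmul: each shard owns a slice of the hidden dimension."""
    k = len(x)
    assert k % n_shards == 0, "hidden dim must divide evenly across shards"
    step = k // n_shards
    partials = []
    for s in range(n_shards):
        lo, hi = s * step, (s + 1) * step
        # Each "GPU" multiplies its slice of x by its slice of W.
        partials.append(matvec(x[lo:hi], W[lo:hi]))
    # All-reduce (sum): every shard ends up with the full output vector.
    return [sum(column) for column in zip(*partials)]


x = [1, 2, 3, 4]
W = [[1, 0], [0, 1], [2, 2], [1, -1]]
full = matvec(x, W)
sharded = tp_matvec(x, W, n_shards=2)
```

Each shard's partial result is a full-size output vector, which is why the all-reduce payload scales with batch · seq · d rather than with the size of the weight shard.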
The rule of thumb is straightforward and follows from the bandwidth column. TP needs NVLink; once you cross to InfiniBand (~25-50 GB/s) the all-reduce stalls dominate the forward pass. PP is cheap across nodes because the network only sees one activation tensor per stage boundary, not a per-matmul all-reduce.
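A back-of-envelope calculation shows why the bandwidth column decides the layout. The sketch below is deliberately crude: it treats one all-reduce as moving batch · seq · d elements once at the quoted link bandwidth, ignoring latency and the extra traffic factor of a real ring all-reduce. The model shape (d_model 8192, 80 layers, fp16, two all-reduces per layer) is an assumption chosen to resemble a 70B-class dense model, not a figure from this lesson.

```python
def allreduce_seconds(batch, seq, d_model, bytes_per_elem, bandwidth_gb_s):
    """Bandwidth-only estimate for one all-reduce (no latency, no ring factor)."""
    payload_bytes = batch * seq * d_model * bytes_per_elem
    return payload_bytes / (bandwidth_gb_s * 1e9)


def forward_comm_seconds(n_layers, allreduces_per_layer, **kwargs):
    """Total all-reduce time for one forward pass under the same assumptions."""
    return n_layers * allreduces_per_layer * allreduce_seconds(**kwargs)


# Assumed decode-step shape: 32 sequences, 1 new token each, fp16 activations.
shape = dict(batch=32, seq=1, d_model=8192, bytes_per_elem=2)

nvlink = forward_comm_seconds(80, 2, bandwidth_gb_s=900, **shape)
infiniband = forward_comm_seconds(80, 2, bandwidth_gb_s=25, **shape)
slowdown = infiniband / nvlink
```

Under this model the slowdown is just the bandwidth ratio, 900 / 25 = 36x, paid on every layer of every forward pass. Real numbers will differ once latency and overlap are included, but the direction of the argument is the same.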
Typical 70B on one 8-GPU node: TP=8, PP=1, DP=many. 405B across two nodes: TP=8 within a node, PP=2 across, DP=many in front for throughput.
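The two layouts above can be written down as a mapping from a flat GPU rank to (DP, PP, TP) coordinates. This is a sketch under an assumed ordering, with TP as the fastest-varying index so that each TP group lands on consecutive GPUs of one node. `rank_to_coords` and `gpus_needed` are hypothetical helpers, and actual frameworks may order ranks differently.

```python
def gpus_needed(tp, pp, dp):
    """Total GPUs for a TP x PP x DP layout."""
    return tp * pp * dp


def rank_to_coords(rank, tp, pp, gpus_per_node=8):
    """Map a flat rank to its replica, pipeline stage, TP rank, and node."""
    return {
        "dp": rank // (tp * pp),      # which replica
        "pp": (rank // tp) % pp,      # which pipeline stage inside the replica
        "tp": rank % tp,              # which shard inside the stage
        "node": rank // gpus_per_node,
    }


# 405B-style layout from the text: TP=8 within a node, PP=2 across nodes.
one_replica = gpus_needed(tp=8, pp=2, dp=1)
stage0 = rank_to_coords(3, tp=8, pp=2)
stage1 = rank_to_coords(9, tp=8, pp=2)
next_replica = rank_to_coords(16, tp=8, pp=2)
```

With this ordering, ranks 0-7 form pipeline stage 0 on node 0 and ranks 8-15 form stage 1 on node 1, so the all-reduce stays on NVLink and only the stage boundary crosses the network.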
Multi-node with Ray
vLLM uses Ray to orchestrate multi-node clusters. The master process — the one running the API server and the engine — issues a remote call to each worker for every forward pass. Workers are Python actors with one GPU each; they own a slice of the model weights.
[engine, node 0]                              [worker, node 0, GPU 0..7]
schedule()                                    hold TP shard of weights
pack input ids     ────► (Ray RPC) ──►        run forward on shard
                                              all-reduce with peers (NVLink)
sample tokens      ◄──── (Ray RPC) ◄──        return logits
deliver to clients
The all-reduce is on the critical path of every forward. Network topology matters: NVLink for intra-node TP groups (an NVSwitch full mesh is ideal), InfiniBand for PP stage crossings. A single under-provisioned link can drag an H100 cluster down toward A100-class throughput.
Production topology
Putting it all together — what one team's deployment actually looks like:
                   ┌───────────────┐
                   │ load balancer │  (nginx / envoy / cloud LB)
                   └───────┬───────┘
      ┌────────────────────┼────────────────────┐
      ▼                    ▼                    ▼
┌───────────┐        ┌───────────┐        ┌───────────┐
│ DP repl A │        │ DP repl B │        │ DP repl C │
│           │        │           │        │           │
│ FastAPI   │        │ FastAPI   │        │ FastAPI   │
│     +     │        │     +     │        │     +     │
│ AsyncLLM  │        │ AsyncLLM  │        │ AsyncLLM  │
│ Engine    │        │ Engine    │        │ Engine    │
│     +     │        │     +     │        │     +     │
│ 8 × GPU   │        │ 8 × GPU   │        │ 8 × GPU   │
│ (TP=8 via │        │ (TP=8 via │        │ (TP=8 via │
│  Ray)     │        │  Ray)     │        │  Ray)     │
└───────────┘        └───────────┘        └───────────┘
each replica: one FastAPI process · one AsyncLLMEngine · N Ray workers
shared scheduler + block manager · OpenAI-compatible endpoint
Three DP replicas in front, each running TP=8 internally. The LB handles routing (least-connections or weighted round-robin); replicas don't coordinate. This is the canonical single-region deployment.
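Least-connections routing is simple enough to sketch directly. The class below is a hypothetical in-process model of what the load balancer does, not nginx or envoy configuration: it tracks open streams per replica and sends each new request to the least-loaded one, breaking ties by listing order.

```python
class LeastConnections:
    """Toy least-connections picker over named replicas."""

    def __init__(self, replicas):
        self.active = {name: 0 for name in replicas}

    def acquire(self):
        # Pick the replica with the fewest open streams (first listed wins ties).
        name = min(self.active, key=self.active.get)
        self.active[name] += 1
        return name

    def release(self, name):
        # Call when a stream finishes ([DONE] sent or client disconnected).
        self.active[name] -= 1


lb = LeastConnections(["A", "B", "C"])
first_three = [lb.acquire() for _ in range(3)]
lb.release("B")
fourth = lb.acquire()
```

For long-lived SSE streams this tends to balance better than plain round-robin, because a replica holding many slow streams stops receiving new ones until some finish.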
Interactive · feel the engine loop
Below: a discrete-event simulator of the async engine. Top lane is the arrival timeline. Middle lane is the engine's active set — at most max_active requests in flight at once, advancing one token per step. Bottom lane is the output stream per finished request.
Try this sequence and watch the KPIs:
- arrival rate 3/sec, max_active 8, step 15 ms — comfortable. GPU well utilized, TTFT low.
- arrival rate 8/sec, max_active 4 — the queue grows without bound. p99 TTFT explodes. Classic queueing behavior: when arrival rate λ exceeds service rate μ, the queue grows unbounded (the M/M/1 stability condition ρ < 1 is violated). Little's law (L = λW) holds in steady state — when the system is stable, queue length scales with arrival rate × mean wait.
- arrival rate 2/sec, max_active 1 — even with light load, single-slot queuing pushes p99 TTFT up. Concurrency isn't a luxury, it's a latency lever.
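For readers without the interactive widget, a stripped-down simulator reproduces the first two scenarios. It is a sketch with loud assumptions: arrivals are evenly spaced rather than random, every request generates exactly `tokens_per_req` tokens (50 by default, a value picked for the example), and TTFT is measured to the end of the step in which the request is admitted. None of these parameters come from the original simulator.

```python
from collections import deque


def simulate(arrival_rate, max_active, step_ms=15, tokens_per_req=50, duration_s=10):
    """Time-stepped model of the engine loop; returns TTFT statistics in ms."""
    interval_ms = 1000.0 / arrival_rate
    arrivals = deque(i * interval_ms for i in range(int(duration_s * arrival_rate)))
    waiting = deque()  # arrival timestamps of queued requests
    active = []        # tokens remaining for each in-flight request
    ttfts = []
    now = 0.0
    while arrivals or waiting or active:
        # Requests that have arrived by now join the waiting queue.
        while arrivals and arrivals[0] <= now:
            waiting.append(arrivals.popleft())
        # Admit into free slots; the first token lands at the end of this step.
        while waiting and len(active) < max_active:
            ttfts.append(now + step_ms - waiting.popleft())
            active.append(tokens_per_req)
        now += step_ms
        # Every active request advances one token; finished ones leave.
        active = [remaining - 1 for remaining in active if remaining > 1]
    ttfts.sort()
    return {
        "mean_ttft_ms": sum(ttfts) / len(ttfts),
        "p99_ttft_ms": ttfts[int(0.99 * (len(ttfts) - 1))],
        "max_ttft_ms": ttfts[-1],
        "requests": len(ttfts),
    }


comfortable = simulate(arrival_rate=3, max_active=8)
overloaded = simulate(arrival_rate=8, max_active=4)
```

Under these assumptions each request occupies a slot for 50 × 15 ms = 750 ms, so 8 slots can serve about 10.7 requests/sec and 4 slots about 5.3. At 3/sec the TTFT stays within two steps; at 8/sec with 4 slots the backlog grows for as long as arrivals continue.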
Takeaways
- The engine is three calls in a loop. Every optimization in lessons 02-12 makes one of those calls faster or thicker.
- Above the loop is asyncio — cheap concurrency for thousands of streaming HTTP connections.
- The loop itself runs on a dedicated thread/process so its CPU work doesn't stall the asyncio side.
- TP for intra-node, PP for cross-node, DP for throughput. Pick by where the bandwidth is.
- Multi-node Ray makes the all-reduce a network-critical-path operation. The fabric is the bottleneck, not the GPU.