Serving architecture

From an HTTP POST to a CUDA kernel and back. The plumbing that lets one engine loop feed thousands of concurrent streams.

The request lifecycle, end to end

Lessons 01-05 lived inside the engine. This one zooms out. A single request is a long chain of handoffs between a Python web framework, an async runtime, an engine loop, and N GPU workers. The diagram below is what actually happens between curl hitting :8000 and the first token coming back over a server-sent event.

[client] ── HTTP POST /v1/chat/completions ─────────────────────────────────►
    │
    ▼
[FastAPI / uvicorn worker]                                  (one OS thread,
    parse OpenAI JSON                                        many asyncio
    tokenize prompt on CPU (tokenizer pool)                  tasks)
    build Request{req_id, sampling, stream=True}
    │
    ▼
[AsyncLLMEngine.add_request(req)]
    enqueue on input queue
    return async iterator ────────────────────────────────► (awaitable)
    │
    ▼
[engine loop — one iter ≈ one forward pass]                 (dedicated thread
    sched_out  = scheduler.schedule()      ◄── admit / evict  or own process)
    model_out  = model.execute_model(sched_out)
    engine_out = process_outputs(model_out) ◄── sample + detok
    deliver(engine_out)                    ──► output queues
    │
    ▼
[TP workers — one per GPU, Ray actors]
    embed → L × {attn, mlp, all-reduce} → unembed
    PagedAttention reads KV through block tables  (lesson 02)
    │
    ▼
[sampler]   temperature / top-k / top-p / repetition penalty
[detok]     incremental BPE decode, handles partial UTF-8
    │
    ▼
[async iterator yields token delta]
    │
    ▼
[FastAPI streams SSE chunk]  data: {"choices":[{"delta":{"content":"…"}}]}
    │
    ▼
[client receives token]   ── repeat until [DONE] ───────────────────────────►

The whole thing is asynchronous from the FastAPI side. The engine loop is what gives the request its actual GPU time. Everything above the engine loop is plumbing; everything below is execution. The loop is the contract.
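The last two hops of the diagram (incremental detokenization, then SSE framing) can be sketched with the standard library alone. This is a minimal illustration, not vLLM's code: `sse_stream` is a hypothetical helper, and the token byte pieces are made up to show a multi-byte character split across two tokens.

```python
import codecs
import json
from typing import Iterable, Iterator


def sse_stream(token_bytes: Iterable[bytes]) -> Iterator[str]:
    """Turn per-token byte pieces into OpenAI-style SSE chunks (hypothetical helper)."""
    # The incremental decoder buffers an incomplete UTF-8 sequence instead of
    # raising, and returns text only once the character is complete.
    decoder = codecs.getincrementaldecoder("utf-8")()
    for piece in token_bytes:
        delta = decoder.decode(piece)
        if not delta:
            continue  # partial character: nothing safe to send yet
        payload = {"choices": [{"delta": {"content": delta}}]}
        body = json.dumps(payload, ensure_ascii=False, separators=(",", ":"))
        # An SSE event is a "data:" line terminated by a blank line.
        yield f"data: {body}\n\n"
    yield "data: [DONE]\n\n"


# The euro sign is three bytes (e2 82 ac); here it straddles two tokens.
chunks = list(sse_stream([b"Hi ", b"\xe2\x82", b"\xac"]))
for chunk in chunks:
    print(chunk, end="")
```

Three input pieces produce two content chunks plus the `[DONE]` marker: the middle piece is held back until the character can be decoded.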

The engine loop in three calls

Strip away wrapping classes and the real vLLM engine reduces to this:

while not stopping:
    sched_out  = scheduler.schedule()             # (1) who runs this step
    model_out  = model.execute_model(sched_out)   # (2) one forward pass
    engine_out = process_outputs(model_out)       # (3) sample + detokenize
    deliver(engine_out)                           #     stream to clients
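To make the shape of the loop concrete, here is a toy version that actually runs. Everything in it is a stand-in assumed for illustration: `ToyEngine`, `Req`, and the fake "forward pass" that just returns a counter. Only the three-call structure mirrors the loop above.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Req:
    req_id: str
    max_tokens: int
    out: list = field(default_factory=list)


class ToyEngine:
    """Hypothetical stand-in for the engine: same three calls, no GPU."""

    def __init__(self, max_batch: int):
        self.max_batch = max_batch
        self.waiting = deque()
        self.running = []
        self.delivered = []  # (req_id, token, finished) tuples

    def schedule(self):
        # (1) admit waiting requests while the batch has room
        while self.waiting and len(self.running) < self.max_batch:
            self.running.append(self.waiting.popleft())
        return list(self.running)

    def execute_model(self, batch):
        # (2) fake forward pass: the "token" is the number generated so far
        return [len(req.out) for req in batch]

    def process_outputs(self, batch, model_out):
        # (3) append the new token, mark finished requests, free their slot
        results = []
        for req, tok in zip(batch, model_out):
            req.out.append(tok)
            done = len(req.out) >= req.max_tokens
            results.append((req.req_id, tok, done))
            if done:
                self.running.remove(req)
        return results

    def step(self):
        batch = self.schedule()
        model_out = self.execute_model(batch)
        engine_out = self.process_outputs(batch, model_out)
        self.delivered.extend(engine_out)  # stands in for deliver()
        return engine_out


engine = ToyEngine(max_batch=2)
for i, n in enumerate([3, 1, 2]):
    engine.waiting.append(Req(f"r{i}", n))

steps = 0
while engine.waiting or engine.running:
    engine.step()
    steps += 1
print(steps, len(engine.delivered))
```

With a batch limit of 2, the third request is admitted as soon as the one-token request finishes, which is continuous batching in miniature.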

The unifying lens

Every optimization in lessons 02-12 makes exactly one of these three calls go faster, or admit more work per call:

scheduler.schedule(): PagedAttention (02), continuous batching (04), prefix caching (05), chunked prefill (10), preemption (11). All change what fits in the batch.
model.execute_model(): FlashAttention (03), CUDA graphs (05), TP (06), GQA (09), LoRA grouped GEMM (12). All change how fast the forward runs.
process_outputs(): speculative decoding (07), prefill/decode disaggregation (08). Both change how many useful tokens the forward produced.

Why `asyncio`, not threads

A request can spend most of its life in a "waiting" state: waiting for prefill, streaming tokens out at a human-readable rate of ~50 tokens/sec, idling between user turns. A naive thread-per-request design hits a wall fast.

1000 concurrent requests × ~2 MB of stack reserved per thread ≈ 2 GB of stack space, before a single token is produced

That's just the thread stacks. Add scheduler overhead, context switches (~1-3 μs each), and Python's GIL, and you're not getting 1000 connections off the ground on one process. asyncio collapses this:

	thread-per-request	asyncio task-per-request
memory / connection	~2 MB (stack)	~3 KB (coroutine frame)
context switch	syscall, ~1-3 μs	function call, ~100 ns
I/O wait	blocks an OS thread	frees the loop via `await`
concurrency unit	OS-scheduled	cooperatively scheduled

1000 idle SSE streams cost ~3 MB total in asyncio. The loop ticks through them between engine steps with no syscall pressure.
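A quick way to see the claim for yourself: spawn 1000 coroutines that mostly wait, and check that no extra OS threads appear. This is a minimal sketch; `idle_stream` and `run_streams` are hypothetical names, and the 1 ms sleep stands in for the gap between tokens.

```python
import asyncio
import threading


async def idle_stream(stream_id: int, ticks: int) -> int:
    """Pretend to be an SSE stream that mostly waits between tokens."""
    for _ in range(ticks):
        # await hands control back to the loop; no OS thread is blocked
        await asyncio.sleep(0.001)
    return stream_id


async def run_streams(n: int):
    threads_before = threading.active_count()
    tasks = [asyncio.create_task(idle_stream(i, ticks=3)) for i in range(n)]
    done = await asyncio.gather(*tasks)
    threads_after = threading.active_count()
    return len(done), threads_after - threads_before


finished, extra_threads = asyncio.run(run_streams(1000))
print(finished, extra_threads)
```

All 1000 streams complete on the single thread that runs the event loop.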

Why the engine runs on its own thread (or process)

Counter-intuitive after the previous section: the engine itself does not benefit from asyncio. Inside the loop body, between CUDA calls, sits a pile of CPU-heavy Python: building block tables, packing input tensors, running the sampler, incremental detokenization. If you put that work directly on the asyncio loop, every decode step blocks the loop for many milliseconds and the SSE streams stutter.

So vLLM puts the engine loop on its own thread (single-process mode) or its own process (Ray mode). The split:

[asyncio loop — main thread]              [engine thread/process]
    HTTP handlers                            scheduler.schedule()
    SSE streaming                            model.execute_model()
    token-level fan-out                      sampling + detokenization
    handshake via queues   ◄──────────►      handshake via queues

The two communicate via thread-safe queues. The asyncio loop never blocks. The engine never waits on a socket. The GIL is released inside CUDA calls (which are non-Python), so the asyncio side gets cycles even while the engine is in a forward pass.
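The handshake can be sketched with a `queue.Queue` going in and one `asyncio.Queue` per request coming back, bridged by `loop.call_soon_threadsafe`. This is an assumed pattern for illustration, not vLLM's actual wiring; `engine_thread`, `generate`, and the `tok0, tok1, ...` outputs are hypothetical.

```python
import asyncio
import queue
import threading

STOP = None  # sentinel used on both queues


def engine_thread(inbox: queue.Queue, loop, outboxes: dict):
    """Runs off the event loop; blocking here never stalls SSE streaming."""
    while True:
        item = inbox.get()  # blocks this thread only
        if item is STOP:
            break
        req_id, n_tokens = item
        out = outboxes[req_id]
        for t in range(n_tokens):
            # Stand-in for schedule + forward + sample. asyncio.Queue is not
            # thread-safe, so the put is marshalled onto the loop's thread.
            loop.call_soon_threadsafe(out.put_nowait, f"tok{t}")
        loop.call_soon_threadsafe(out.put_nowait, STOP)


async def generate(req_id, n_tokens, inbox, outboxes):
    """Async iterator handed back to the HTTP handler."""
    outboxes[req_id] = asyncio.Queue()
    inbox.put((req_id, n_tokens))
    while True:
        tok = await outboxes[req_id].get()
        if tok is STOP:
            return
        yield tok


async def main():
    loop = asyncio.get_running_loop()
    inbox, outboxes = queue.Queue(), {}
    worker = threading.Thread(
        target=engine_thread, args=(inbox, loop, outboxes), daemon=True
    )
    worker.start()

    async def collect(req_id, n):
        return [tok async for tok in generate(req_id, n, inbox, outboxes)]

    results = await asyncio.gather(collect("a", 3), collect("b", 2))
    inbox.put(STOP)
    worker.join()
    return results


results = asyncio.run(main())
print(results)
```

The design choice to notice: the engine thread only ever touches the thread-safe inbox directly, and every write toward the asyncio side goes through `call_soon_threadsafe`.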

Scaling beyond one GPU

One model copy on one GPU only takes you so far. Four orthogonal axes exist; production deployments combine them.

axis	what's split	communication	where it lives
TP · tensor parallel	each weight matrix across N GPUs along the hidden dim	all-reduce after every split matmul; bytes ≈ batch · seq · d	intra-node, NVLink (~900 GB/s)
PP · pipeline parallel	layers across nodes — first 20 on node A, next 20 on node B	activations only, GPU→GPU; bytes ≈ batch · seq · d per stage boundary	cross-node — activations are much smaller than weights
DP · data parallel	full model replicated; requests load-balanced across replicas	none on the critical path — just LB routing	whole fleet
EP · expert parallel	MoE experts distributed across GPUs; tokens routed to their expert's GPU	all-to-all per MoE layer	intra-node for routing efficiency

The rule of thumb is straightforward and follows from the bandwidth column. TP needs NVLink; once you cross to InfiniBand (~25-50 GB/s) the all-reduce stalls dominate the forward pass. PP is cheap across nodes because the network only sees one activation tensor per stage boundary, not a per-matmul all-reduce.
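The bandwidth argument can be put into numbers with a back-of-envelope calculation. This sketch uses the table's payload estimate (batch · seq · d) and the bandwidth figures quoted above; the batch size, hidden size of 8192, and 2-byte elements are illustrative assumptions, and the model ignores link latency, the ring all-reduce traffic factor, and any overlap with compute.

```python
def allreduce_payload_bytes(batch: int, seq: int, d_model: int,
                            bytes_per_elem: int = 2) -> int:
    """Activation bytes exchanged per all-reduce: batch * seq * d * dtype size."""
    return batch * seq * d_model * bytes_per_elem


def transfer_ms(payload_bytes: int, bandwidth_gb_s: float) -> float:
    """Pure bandwidth time in milliseconds (no latency term)."""
    return payload_bytes / (bandwidth_gb_s * 1e9) * 1e3


# One decode step: 32 sequences, 1 new token each, hidden size 8192 (assumed).
payload = allreduce_payload_bytes(batch=32, seq=1, d_model=8192)
nvlink_ms = transfer_ms(payload, 900)  # NVLink figure from the table
ib_ms = transfer_ms(payload, 50)       # upper InfiniBand figure from the text
slowdown = ib_ms / nvlink_ms

print(f"{payload} bytes, NVLink {nvlink_ms:.5f} ms, IB {ib_ms:.5f} ms, "
      f"x{slowdown:.0f} slower")
```

A single transfer is small either way; the cost comes from paying it after every split matmul in every layer, where the per-transfer gap multiplies.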

Typical 70B on one 8-GPU node: TP=8, PP=1, DP=many. 405B across two nodes: TP=8 within a node, PP=2 across, DP=many in front for throughput.

Multi-node with Ray

vLLM uses Ray to orchestrate multi-node clusters. The master process — the one running the API server and the engine — issues a remote call to each worker for every forward pass. Workers are Python actors with one GPU each; they own a slice of the model weights.

[engine, node 0]                  [worker, node 0, GPU 0..7]
    schedule()                         hold TP shard of weights
    pack input ids ────► (Ray RPC) ──► run forward on shard
                                       all-reduce with peers (NVLink)
    sample tokens ◄────  (Ray RPC) ◄── return logits
    deliver to clients

The all-reduce is on the critical path of every forward. Network topology matters: NVLink for intra-node TP groups (ideally a full NVSwitch mesh), InfiniBand for PP stage crossings. A single under-provisioned link can drag an H100 cluster down to A100-class throughput.
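What "hold TP shard of weights, run forward on shard, all-reduce with peers" means numerically can be shown with plain lists. This is a toy sketch of one way to shard (splitting the input dimension so each worker produces a partial sum); the function names are hypothetical, and it assumes the dimension divides evenly across workers.

```python
def matvec(x, w):
    """x: [d_in], w: d_in rows of d_out columns -> [d_out]."""
    d_out = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(d_out)]


def shard_rows(x, w, n_workers):
    """Give each worker a slice of the input and the matching weight rows."""
    step = len(x) // n_workers  # assumes len(x) is divisible by n_workers
    return [(x[k * step:(k + 1) * step], w[k * step:(k + 1) * step])
            for k in range(n_workers)]


def all_reduce_sum(partials):
    """Element-wise sum of every worker's partial output."""
    return [sum(vals) for vals in zip(*partials)]


x = [1, 2, 3, 4]
w = [[1, 0], [0, 1], [2, 2], [1, -1]]

# Each "GPU" computes a full-width but partial result from its shard.
partials = [matvec(xs, ws) for xs, ws in shard_rows(x, w, n_workers=2)]
y_tp = all_reduce_sum(partials)
y_ref = matvec(x, w)
print(partials, y_tp, y_ref)
```

No worker can produce the final output alone, which is why the all-reduce cannot be taken off the critical path.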

Production topology

Putting it all together — what one team's deployment actually looks like:

                            ┌──────────────┐
                            │ load balancer │   (nginx / envoy / cloud LB)
                            └──────┬───────┘
              ┌────────────────────┼────────────────────┐
              ▼                    ▼                    ▼
        ┌───────────┐        ┌───────────┐        ┌───────────┐
        │ DP repl A │        │ DP repl B │        │ DP repl C │
        │           │        │           │        │           │
        │  FastAPI  │        │  FastAPI  │        │  FastAPI  │
        │     +     │        │     +     │        │     +     │
        │ AsyncLLM  │        │ AsyncLLM  │        │ AsyncLLM  │
        │  Engine   │        │  Engine   │        │  Engine   │
        │     +     │        │     +     │        │     +     │
        │ 8 × GPU   │        │ 8 × GPU   │        │ 8 × GPU   │
        │ (TP=8 via │        │ (TP=8 via │        │ (TP=8 via │
        │  Ray)     │        │  Ray)     │        │  Ray)     │
        └───────────┘        └───────────┘        └───────────┘

   each replica:  one FastAPI process · one AsyncLLMEngine · N Ray workers
                  shared scheduler + block manager · OpenAI-compatible endpoint

Three DP replicas in front, each running TP=8 internally. The LB handles routing (least-connections or weighted round-robin); replicas don't coordinate. This is the canonical single-region deployment.
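Least-connections routing is simple enough to sketch in a few lines. This is an illustrative model of the policy, not the configuration of any particular load balancer; `LeastConnections` and the replica names are hypothetical.

```python
class LeastConnections:
    """Route each request to the replica with the fewest in-flight requests."""

    def __init__(self, replicas):
        self.in_flight = {name: 0 for name in replicas}

    def acquire(self) -> str:
        # Ties resolve to the first replica in insertion order.
        name = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[name] += 1
        return name

    def release(self, name: str) -> None:
        # Call when the stream finishes (or the client disconnects).
        self.in_flight[name] -= 1


lb = LeastConnections(["A", "B", "C"])
first_three = [lb.acquire() for _ in range(3)]
lb.release("B")           # B's stream ends first
next_pick = lb.acquire()  # so B is now the least loaded
print(first_three, next_pick)
```

For long-lived SSE streams this tends to balance better than plain round-robin, because request durations vary widely with output length.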

Interactive · feel the engine loop

Below: a discrete-event simulator of the async engine. Top lane is the arrival timeline. Middle lane is the engine's active set — at most max_active requests in flight at once, advancing one token per step. Bottom lane is the output stream per finished request.

Try this sequence and watch the KPIs:

arrival rate 3/sec, max_active 8, step 15 ms: comfortable. GPU well utilized, TTFT low.
arrival rate 8/sec, max_active 4: the queue grows and p99 TTFT explodes. Classic queueing behavior: with c slots that each complete μ requests/sec, the queue is stable only while ρ = λ/(c·μ) < 1 (for a single slot this is the M/M/1 condition λ < μ); once λ exceeds c·μ the queue grows without bound. Little's law (L = λW) applies only in that stable steady state: the mean number of requests in the system equals arrival rate × mean time in the system.
arrival rate 2/sec, max_active 1: even with light load, single-slot queuing pushes p99 TTFT up. Concurrency isn't a luxury, it's a latency lever.
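Before moving the sliders it helps to estimate how loaded the engine is. The sketch below assumes each request holds one slot for output length × step latency and that prefill costs nothing, which is a simplification of the simulator rather than its actual code. The example values are the simulator's default control settings; `utilization` and `max_arrival_rate` are hypothetical helpers.

```python
def utilization(arrival_rate: float, mean_output_tokens: float,
                step_s: float, max_active: int) -> float:
    """Offered load per slot: rho = lambda * service_time / slots."""
    service_time_s = mean_output_tokens * step_s
    return arrival_rate * service_time_s / max_active


def max_arrival_rate(mean_output_tokens: float, step_s: float,
                     max_active: int) -> float:
    """Arrival rate at which rho reaches 1 under the same assumptions."""
    return max_active / (mean_output_tokens * step_s)


# Defaults: 4 req/s, 30 tokens mean output, 6 slots, 15 ms per step.
rho = utilization(arrival_rate=4, mean_output_tokens=30,
                  step_s=0.015, max_active=6)
capacity = max_arrival_rate(mean_output_tokens=30, step_s=0.015, max_active=6)
print(f"rho = {rho:.2f}, saturates near {capacity:.1f} req/s")
```

A result at or above 1 means the queue cannot drain. Values just below 1 still produce long TTFT tails, because arrivals are bursty rather than evenly spaced.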

Async engine simulator

Top = arrivals. Middle = active slots (each row is one slot; coloured bands are requests in flight). Bottom = completed streams' first-and-last-token marks. Ten seconds of simulation at 60 fps.

Default controls: arrival rate 4 req/s · mean output 30 tok · max_active 6 · step latency 15 ms

KPIs reported: mean TTFT · p99 TTFT · throughput · GPU util · queue depth · finished

What each tick of the loop does (the same three calls):

// One simulation step (≈ one engine iteration):
// 1) admit waiting → running while running.length < max_active     ── scheduler.schedule()
// 2) advance every running request by one token                    ── model.execute_model()
// 3) finished: tokens_produced >= output_len → drop, record stats  ── process_outputs()
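The three steps above can be turned into a small offline simulator. This is a sketch under stated assumptions, not the page's simulator: arrivals are Poisson, output lengths are exponentially distributed around the mean, every step takes a fixed time regardless of batch size, and TTFT is measured at the end of a request's first step. The function name and returned keys are hypothetical.

```python
import random
from collections import deque


def simulate(arrival_rate, mean_output, max_active, step_ms,
             duration_s=10.0, seed=0):
    rng = random.Random(seed)
    step_s = step_ms / 1000.0

    # Pre-draw arrivals: exponential gaps, exponential output lengths (>= 1).
    arrivals, t = deque(), rng.expovariate(arrival_rate)
    while t < duration_s:
        out_len = max(1, round(rng.expovariate(1.0 / mean_output)))
        arrivals.append((t, out_len))
        t += rng.expovariate(arrival_rate)
    n_arrived = len(arrivals)

    waiting, running, ttfts, finished = deque(), [], [], 0
    now = 0.0
    while now < duration_s:
        # Move requests that have arrived by now into the waiting queue.
        while arrivals and arrivals[0][0] <= now:
            arrived, out_len = arrivals.popleft()
            waiting.append({"arrived": arrived, "left": out_len, "started": False})
        # (1) admit waiting -> running while there is a free slot
        while waiting and len(running) < max_active:
            running.append(waiting.popleft())
        now += step_s
        # (2) advance every running request by one token
        for req in running:
            if not req["started"]:
                req["started"] = True
                ttfts.append(now - req["arrived"])
            req["left"] -= 1
        # (3) drop finished requests and record stats
        finished += sum(1 for req in running if req["left"] <= 0)
        running = [req for req in running if req["left"] > 0]

    return {
        "arrived": n_arrived,
        "finished": finished,
        "running": len(running),
        "queue_depth": len(waiting),
        "not_yet_seen": len(arrivals),
        "mean_ttft_s": sum(ttfts) / len(ttfts) if ttfts else 0.0,
    }


stats = simulate(arrival_rate=4, mean_output=30, max_active=6, step_ms=15)
one_slot = simulate(arrival_rate=8, mean_output=30, max_active=1, step_ms=15)
eight_slots = simulate(arrival_rate=8, mean_output=30, max_active=8, step_ms=15)
print(stats)
```

Because the seed fixes the arrival sequence, changing only `max_active` isolates the effect of concurrency on TTFT and queue depth.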

Takeaways

What to keep

The engine is three calls in a loop. Every optimization in lessons 02-12 makes one of those calls faster or thicker.
Above the loop is asyncio — cheap concurrency for thousands of streaming HTTP connections.
The loop itself runs on a dedicated thread/process so its CPU work doesn't stall the asyncio side.
TP for intra-node, PP for cross-node, DP for throughput. Pick by where the bandwidth is.
Multi-node Ray makes the all-reduce a network-critical-path operation. The fabric is the bottleneck, not the GPU.
