vllm_lessons / 06 · serving architecture lesson 6 / 12

Serving architecture

From an HTTP POST to a CUDA kernel and back. The plumbing that lets one engine loop feed thousands of concurrent streams.

The request lifecycle, end to end

Lessons 01-05 lived inside the engine. This one zooms out. A single request is a long chain of handoffs between a Python web framework, an async runtime, an engine loop, and N GPU workers. The diagram below is what actually happens between curl hitting :8000 and the first token coming back over a server-sent event.

[client] ── HTTP POST /v1/chat/completions ─────────────────────────────────►
    │
    ▼
[FastAPI / uvicorn worker]                                  (one OS thread,
    parse OpenAI JSON                                        many asyncio
    tokenize prompt on CPU (tokenizer pool)                  tasks)
    build Request{req_id, sampling, stream=True}
    │
    ▼
[AsyncLLMEngine.add_request(req)]
    enqueue on input queue
    return async iterator ────────────────────────────────► (awaitable)
    │
    ▼
[engine loop — one iter ≈ one forward pass]                 (dedicated thread
    sched_out  = scheduler.schedule()      ◄── admit / evict  or own process)
    model_out  = model.execute_model(sched_out)
    engine_out = process_outputs(model_out) ◄── sample + detok
    deliver(engine_out)                    ──► output queues
    │
    ▼
[TP workers — one per GPU, Ray actors]
    embed → L × {attn, mlp, all-reduce} → unembed
    PagedAttention reads KV through block tables  (lesson 02)
    │
    ▼
[sampler]   temperature / top-k / top-p / repetition penalty
[detok]     incremental BPE decode, handles partial UTF-8
    │
    ▼
[async iterator yields token delta]
    │
    ▼
[FastAPI streams SSE chunk]  data: {"choices":[{"delta":{"content":"…"}}]}
    │
    ▼
[client receives token]   ── repeat until [DONE] ───────────────────────────►

The whole thing is asynchronous from the FastAPI side. The engine loop is what gives the request its actual GPU time. Everything above the engine loop is plumbing; everything below is execution. The loop is the contract.

The engine loop in three calls

Strip away wrapping classes and the real vLLM engine reduces to this:

while not stopping:
    sched_out  = scheduler.schedule()             # (1) who runs this step
    model_out  = model.execute_model(sched_out)   # (2) one forward pass
    engine_out = process_outputs(model_out)       # (3) sample + detokenize
    deliver(engine_out)                           #     stream to clients
The unifying lens
Every optimization in lessons 02-12 makes exactly one of these three calls go faster, or admit more work per call:

Why asyncio, not threads

A request can spend most of its life in "waiting" state — waiting for prefill, streaming tokens out at the human-readable rate of ~50/sec, idling between user turns. A naive thread-per-request design hits a wall fast.

1000 concurrent requests × 2 MB default kernel stack ≈ 2 GB of RSS, before a single token is produced

That's just the kernel stacks. Add scheduler overhead, context switch syscalls (~1-3 μs each), and Python's GIL — and you're not getting 1000 connections off the ground on one process. asyncio collapses this:

thread-per-requestasyncio task-per-request
memory / connection~2 MB (stack)~3 KB (coroutine frame)
context switchsyscall, ~1-3 μsfunction call, ~100 ns
I/O waitblocks an OS threadfrees the loop via await
concurrency unitOS-scheduledcooperatively scheduled

1000 idle SSE streams cost ~3 MB total in asyncio. The loop ticks through them between engine steps with no syscall pressure.

Why the engine runs on its own thread (or process)

Counter-intuitive after the previous section: the engine itself does not benefit from asyncio. Inside the loop body, between CUDA calls, sits a pile of CPU-heavy Python: building block tables, packing input tensors, running the sampler, incremental detokenization. If you put that work directly on the asyncio loop, every decode step blocks the loop for many milliseconds and the SSE streams stutter.

So vLLM puts the engine loop on its own thread (single-process mode) or its own process (Ray mode). The split:

[asyncio loop — main thread]              [engine thread/process]
    HTTP handlers                            scheduler.schedule()
    SSE streaming                            model.execute_model()
    token-level fan-out                      sampling + detokenization
    handshake via queues   ◄──────────►      handshake via queues

The two communicate via thread-safe queues. The asyncio loop never blocks. The engine never waits on a socket. The GIL is released inside CUDA calls (which are non-Python), so the asyncio side gets cycles even while the engine is in a forward pass.

Scaling beyond one GPU

One model copy on one GPU only takes you so far. Four orthogonal axes exist; production deployments combine them.

axiswhat's splitcommunicationwhere it lives
TP · tensor paralleleach weight matrix across N GPUs along the hidden dimall-reduce after every split matmul; bytes ≈ batch · seq · dintra-node, NVLink (~900 GB/s)
PP · pipeline parallellayers across nodes — first 20 on node A, next 20 on node Bactivations only, GPU→GPU; bytes ≈ batch · seq · d per stage boundarycross-node — activations are much smaller than weights
DP · data parallelfull model replicated; requests load-balanced across replicasnone on the critical path — just LB routingwhole fleet
EP · expert parallelMoE experts distributed across GPUs; tokens routed to their expert's GPUall-to-all per MoE layerintra-node for routing efficiency

The rule of thumb is straightforward and follows from the bandwidth column. TP needs NVLink; once you cross to InfiniBand (~25-50 GB/s) the all-reduce stalls dominate the forward pass. PP is cheap across nodes because the network only sees one activation tensor per stage boundary, not a per-matmul all-reduce.

Typical 70B on one 8-GPU node: TP=8, PP=1, DP=many. 405B across two nodes: TP=8 within a node, PP=2 across, DP=many in front for throughput.

Multi-node with Ray

vLLM uses Ray to orchestrate multi-node clusters. The master process — the one running the API server and the engine — issues a remote call to each worker for every forward pass. Workers are Python actors with one GPU each; they own a slice of the model weights.

[engine, node 0]                  [worker, node 0, GPU 0..7]
    schedule()                         hold TP shard of weights
    pack input ids ────► (Ray RPC) ──► run forward on shard
                                       all-reduce with peers (NVLink)
    sample tokens ◄────  (Ray RPC) ◄── return logits
    deliver to clients

The all-reduce is on the critical path of every forward. Network topology matters: NVLink for intra-node TP groups, NVSwitch full-mesh ideal, InfiniBand for PP stage crossings. A single under-provisioned link will bottleneck your H100 cluster down to A100 throughput.

Production topology

Putting it all together — what one team's deployment actually looks like:

                            ┌──────────────┐
                            │ load balancer │   (nginx / envoy / cloud LB)
                            └──────┬───────┘
              ┌────────────────────┼────────────────────┐
              ▼                    ▼                    ▼
        ┌───────────┐        ┌───────────┐        ┌───────────┐
        │ DP repl A │        │ DP repl B │        │ DP repl C │
        │           │        │           │        │           │
        │  FastAPI  │        │  FastAPI  │        │  FastAPI  │
        │     +     │        │     +     │        │     +     │
        │ AsyncLLM  │        │ AsyncLLM  │        │ AsyncLLM  │
        │  Engine   │        │  Engine   │        │  Engine   │
        │     +     │        │     +     │        │     +     │
        │ 8 × GPU   │        │ 8 × GPU   │        │ 8 × GPU   │
        │ (TP=8 via │        │ (TP=8 via │        │ (TP=8 via │
        │  Ray)     │        │  Ray)     │        │  Ray)     │
        └───────────┘        └───────────┘        └───────────┘

   each replica:  one FastAPI process · one AsyncLLMEngine · N Ray workers
                  shared scheduler + block manager · OpenAI-compatible endpoint

Three DP replicas in front, each running TP=8 internally. The LB handles routing (least-connections or weighted round-robin); replicas don't coordinate. This is the canonical single-region deployment.

Interactive · feel the engine loop

Below: a discrete-event simulator of the async engine. Top lane is the arrival timeline. Middle lane is the engine's active set — at most max_active requests in flight at once, advancing one token per step. Bottom lane is the output stream per finished request.

Try this sequence and watch the KPIs:

Async engine simulator
Top = arrivals. Middle = active slots (each row is one slot; coloured bands are requests in flight). Bottom = completed streams' first-and-last-token marks. Ten seconds of simulation at 60 fps.
mean TTFT
p99 TTFT
throughput
GPU util
queue depth
finished
show what each tick of the loop does (the same three calls)
// One simulation step (≈ one engine iteration):
// 1) admit waiting → running while running.length < max_active     ── scheduler.schedule()
// 2) advance every running request by one token                    ── model.execute_model()
// 3) finished: tokens_produced >= output_len → drop, record stats  ── process_outputs()

Takeaways

What to keep