Lesson 11 / 11

SGLang vs vLLM — picking a side for your workload

Two frameworks, two theses, broadly the same kernels at the bottom. SGLang bets that the program is the unit of work; vLLM bets that the call is. The two bets converge on similar architectures but split on what's optimized for free vs. what's bolted on.

The thesis statements, side by side

vLLM
"Make the single forward fast."
• Unit of work: one prompt → one completion
• Prefix sharing: hashmap APC (block-aligned)
• Scheduling: continuous batching, chunked prefill
• Kernels: PagedAttention, FlashAttention
• Constrained outputs: outlines integration, optional
• Surface: OpenAI-compat API; no DSL
Strength: predictable, well-tuned on plain workloads.

SGLang
"Make the program of calls fast."
• Unit of work: a program of many model calls
• Prefix sharing: radix tree (any prefix, any depth)
• Scheduling: cache-aware LPM-first + fairness
• Kernels: FlashInfer, Triton, CUDA graphs, MLA
• Constrained outputs: compressed FSM + xgrammar, default
• Surface: OpenAI-compat API and embedded Python DSL
Strength: multi-call, multi-tenant, structured workloads.

What each picks up for free

| Workload property | vLLM gets it for free? | SGLang gets it for free? |
|---|---|---|
| Continuous batching, paged KV | yes | yes |
| Block-aligned prefix sharing | yes (APC) | yes (subset of radix) |
| Mid-prompt prefix sharing | no | yes |
| Cache-aware request ordering | partial | yes (default) |
| Multi-turn agent KV pinning | requires APC tuning | natural through radix |
| Strict JSON / schema output | optional plugin | built-in, fast-forwarded |
| Forking / parallel sampling structure | n > 1 sampling | fork() primitive |
| MLA / DP attention (DeepSeek) | supported | supported (DP attention default) |
| EP for MoE | supported | supported (DeepEP integration) |
| Speculative decoding | supported | supported (incl. EAGLE) |
| LoRA serving | supported (S-LoRA-style) | supported |

From the first "supported" row down, both frameworks are competent. The real divergences are in the rows above it, from prefix sharing through forking. That is where the design theses bite.
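The gap between block-aligned sharing and "any prefix, any depth" sharing can be made concrete with a toy calculation. The sketch below is not either framework's implementation. It only counts how many leading tokens two requests share, and how many of those could be reused if sharing were limited to whole blocks. The function names and the block size of 16 are assumptions chosen for illustration.

```python
def common_prefix_len(a, b):
    """Number of leading tokens that two token sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def block_aligned_reuse(a, b, block_size=16):
    """Shared tokens reusable when only whole blocks can be shared.

    Assumed model: the shared length is rounded down to a block boundary.
    """
    return (common_prefix_len(a, b) // block_size) * block_size


# Two hypothetical requests that share a 40-token prefix, then diverge.
shared = list(range(40))
req_a = shared + [1001, 1002, 1003]
req_b = shared + [2001, 2002]

token_level = common_prefix_len(req_a, req_b)                    # 40
block_level = block_aligned_reuse(req_a, req_b, block_size=16)   # 32
```

In this toy case the token-granular match covers all 40 shared tokens, while the block-granular one covers 32. The remainder is at most one block per shared prefix, so it matters most when prompts are short or diverge often.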

The workload-decision tree

How shared are prefixes? (Measure: % of total prefill tokens.)
• Low (< 20%) → vLLM is enough. Single-call optimization dominates; SGLang's prefix machinery adds overhead with little benefit.
• Moderate / high (> 30%) → heavy JSON / structured outputs?
  • No → SGLang likely wins. Radix tree + LPM scheduling carry the day.
  • Yes → SGLang strongly wins. Prefix + constrained both compound.
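The decision tree can be written down as a small function, which makes its thresholds easy to apply to measured numbers. This is a minimal sketch: the function name and return strings are hypothetical, and the answer for the 20–30% band, which the tree does not cover, is an assumption rather than something the tree states.

```python
def recommend_framework(shared_prefix_pct, heavy_structured_output):
    """Apply the workload decision tree to two measured signals.

    shared_prefix_pct: shared prefill tokens as a percentage (0-100).
    heavy_structured_output: True if JSON / schema output dominates.
    """
    if not 0 <= shared_prefix_pct <= 100:
        raise ValueError("shared_prefix_pct must be between 0 and 100")
    if shared_prefix_pct < 20:
        return "vLLM is enough"
    if shared_prefix_pct > 30:
        if heavy_structured_output:
            return "SGLang strongly wins"
        return "SGLang likely wins"
    # Assumption: the tree says nothing about 20-30%, so defer to measurement.
    return "unspecified by the tree: benchmark both"


print(recommend_framework(10, False))
print(recommend_framework(60, True))
```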

Two more secondary considerations:

Common misreadings of benchmarks

Numbers to distrust

How to benchmark fairly

  1. Use your real prompts. Synthetic benchmarks favor whoever wrote them. Your prompt mix is the only honest workload.
  2. Run for at least 5 minutes after warm-up. Caches need time to populate. Spec-decoding's α stabilizes after a few hundred requests.
  3. Report both throughput and p99 latency. Throughput-only numbers hide queueing effects.
  4. Match GPU memory utilization. SGLang and vLLM default differently; one with 90% utilization vs. the other with 70% is not a fair fight.
  5. Match KV dtype. fp8 KV on one framework vs fp16 KV on the other is a 2× memory advantage, not an architectural advantage.
  6. Match attention backend. Both can use FlashInfer; both can use FlashAttention. Specify which.
  7. Don't compare on different model versions. Llama-3 70B Instruct vs Llama-3.1 70B Instruct is not the same workload.
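Several of these rules can be enforced mechanically before any numbers are compared. The sketch below assumes each benchmark run is described by a plain dict; the key names are hypothetical labels for the settings listed above, not real flags of either framework. It reports which settings differ between two runs and summarizes a run as throughput plus p99 latency (nearest-rank method).

```python
# Hypothetical labels for the settings that must match across both runs.
MUST_MATCH = ("model", "gpu_memory_utilization", "kv_dtype", "attention_backend")


def config_mismatches(run_a, run_b, keys=MUST_MATCH):
    """Return {setting: (value_a, value_b)} for every setting that differs."""
    return {
        k: (run_a.get(k), run_b.get(k))
        for k in keys
        if run_a.get(k) != run_b.get(k)
    }


def p99(latencies_s):
    """99th-percentile latency using the nearest-rank method."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    rank = -(-99 * len(ordered) // 100)  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def summarize(latencies_s, output_tokens, wall_clock_s):
    """Report throughput and tail latency together, never one alone."""
    return {
        "throughput_tok_per_s": output_tokens / wall_clock_s,
        "p99_latency_s": p99(latencies_s),
    }


run_a = {"model": "example-70b", "gpu_memory_utilization": 0.9,
         "kv_dtype": "fp8", "attention_backend": "flashinfer"}
run_b = {"model": "example-70b", "gpu_memory_utilization": 0.9,
         "kv_dtype": "fp16", "attention_backend": "flashinfer"}

mismatches = config_mismatches(run_a, run_b)
if mismatches:
    print("not a fair comparison:", mismatches)
```

A comparison is only worth reading once `config_mismatches` returns an empty dict for the two runs.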

The convergent future

Watch this space: both projects are evolving toward each other.

The architecture differences will narrow. What is likely to remain durable: the frontend DSL and the radix-tree-shaped cache. Those are SGLang's bets on what the workload is. vLLM's bet — that one call is the right unit — has driven different choices and will keep driving them. Pick the framework whose bet matches your workload, not whose benchmark slide you saw last.

The synthesis question — what did you actually learn

If you've gotten this far, you can answer five questions concretely:

  1. Why does prefix sharing matter? Because 50–95% of modern workloads' prefill is redundant — lessons 01, 03.
  2. Why a radix tree? Because prefixes form a tree, and a tree captures partial overlaps that a hashmap cannot — lesson 04.
  3. Why is the scheduler half the story? Because the tree is passive; only LPM-first ordering turns its capability into hit rate — lesson 05.
  4. Why is constrained decoding sample-time, not post-hoc? Because compressed-FSM + xgrammar make it free, and parse-then-retry is anything but free — lessons 06, 07.
  5. Why FlashInfer + DP attention + EP? Each layer of the kernel and parallelism stack picks up the slack of the previous one. Each is a measured response, not a stylistic choice — lessons 08, 09.

If any of those answers is fuzzy, that's the lesson to re-read. The series builds tightly, and removing any one of these mechanisms collapses the others into something a plain serving framework already provides.

Interactive · predicted ratio for your workload

Slide your workload's actual signals — shared-prefix fraction, JSON share, batch size, model class — and watch the predicted speedup ratio. The model is illustrative, not authoritative: it composes the cache, scheduling, and constrained-decoding lessons into one back-of-envelope estimate. Run real benchmarks before betting infrastructure on it.

SGLang / vLLM throughput predictor

Numbers are illustrative. The point is which knob moves the ratio, not the exact value.
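The predictor's own model is not reproduced here. As a stand-in, one back-of-envelope way to see which knob moves the ratio is an Amdahl-style bound on the prefix-cache term alone. If prefill takes a fraction p of end-to-end time and a fraction h of prefill tokens is served from cache, the remaining time is (1 - p) + p(1 - h). The sketch assumes cached prefill costs nothing and ignores scheduling and constrained decoding, so it is an upper bound on this one effect under stated assumptions, not a prediction. The function and parameter names are hypothetical.

```python
def prefix_cache_speedup(prefill_time_fraction, cache_hit_fraction):
    """Amdahl-style upper bound on speedup from prefix caching alone.

    prefill_time_fraction: share of end-to-end time spent in prefill (0-1).
    cache_hit_fraction: share of prefill tokens served from cache (0-1).
    Assumes cached tokens cost zero prefill time.
    """
    for value in (prefill_time_fraction, cache_hit_fraction):
        if not 0.0 <= value <= 1.0:
            raise ValueError("fractions must be between 0 and 1")
    remaining = (1.0 - prefill_time_fraction) + \
        prefill_time_fraction * (1.0 - cache_hit_fraction)
    if remaining == 0.0:
        return float("inf")
    return 1.0 / remaining


# Sweep the hit fraction for a workload assumed to spend half its time in prefill.
for h in (0.0, 0.3, 0.5, 0.9):
    print(f"hit fraction {h:.1f}: bound {prefix_cache_speedup(0.5, h):.2f}x")
```

The useful reading is the shape of the curve: with a small prefill share, even a high hit fraction barely moves the bound.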

What to do next
Stand up an SGLang server on your real workload for 24 hours. Compare it against vLLM on the same hardware, same model, same KV dtype, same memory utilization. If SGLang's prefix cache hit rate is above 30% (visible via the runtime's /get_server_info JSON or the Prometheus /metrics endpoint), it's already paying for itself. If you go past 50%, every other knob (kernels, parallelism, spec decoding) is gravy.
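One way to check the hit rate against those two thresholds is to read it from the Prometheus /metrics text. The sketch below is a hedged example: the metric name passed in at the bottom is an assumption to verify against your server's actual /metrics output, the value is assumed to be a fraction between 0 and 1, and the parser handles only simple sample lines (no `}` inside label values). The verdict for 30% or below is also an assumption, since the rule of thumb gives none.

```python
import urllib.request


def fetch_metrics_text(base_url, timeout_s=5.0):
    """GET <base_url>/metrics and return the body as text."""
    url = base_url.rstrip("/") + "/metrics"
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.read().decode("utf-8")


def read_gauge(metrics_text, metric_name):
    """Return the first sample value for metric_name, or None if absent."""
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if not line.startswith(metric_name):
            continue
        rest = line[len(metric_name):]
        if rest.startswith("{"):
            rest = rest[rest.rfind("}") + 1:]  # drop the label set
        elif not rest.startswith(" "):
            continue  # a longer metric name that merely shares this prefix
        fields = rest.split()
        if fields:
            return float(fields[0])
    return None


def classify_hit_rate(hit_rate):
    """Map a hit rate (0-1) onto the two rule-of-thumb thresholds."""
    if hit_rate > 0.50:
        return "past 50%: other knobs are gravy"
    if hit_rate > 0.30:
        return "above 30%: paying for itself"
    return "30% or below: no verdict from the rule of thumb"


if __name__ == "__main__":
    # Assumed address and metric name; check both against your deployment.
    text = fetch_metrics_text("http://localhost:30000")
    rate = read_gauge(text, "sglang:cache_hit_rate")
    print("metric not found" if rate is None else classify_hit_rate(rate))
```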