SGLang vs vLLM — picking a side for your workload
Two frameworks, two theses, broadly the same kernels at the bottom. SGLang bets that the program is the unit of work; vLLM bets that the call is. The two bets converge on similar architectures but split on what's optimized for free vs. what's bolted on.
The thesis statements, side by side
What each picks up for free
| Workload property | vLLM gets it for free? | SGLang gets it for free? |
|---|---|---|
| Continuous batching, paged KV | yes | yes |
| Block-aligned prefix sharing | yes (APC) | yes (subset of radix) |
| Mid-prompt prefix sharing | no | yes |
| Cache-aware request ordering | partial | yes (default) |
| Multi-turn agent KV pinning | requires APC tuning | natural through radix |
| Strict JSON / schema output | optional plugin | built-in, fast-forwarded |
| Forking / parallel sampling structure | n > 1 sampling | fork() primitive |
| MLA / DP attention (DeepSeek) | supported | supported (DP attention default) |
| EP for MoE | supported | supported (DeepEP integration) |
| Speculative decoding | supported | supported (incl. EAGLE) |
| LoRA serving | supported (S-LoRA-style) | supported |
Below the "supported" line both frameworks are competent. The real divergences are in the top six rows; that's where the design theses bite.
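The "built-in, fast-forwarded" entry for strict JSON output can be illustrated with a toy example. The sketch below is not SGLang's or xgrammar's code: the grammar table, the `constrained_decode` function, and the `pick` callback standing in for the model are all hypothetical. It shows only the idea that the grammar restricts the allowed tokens at each step, and that a step with exactly one allowed token can be emitted without consulting the model.

```python
# Toy grammar (hypothetical): state -> {allowed token: next state}.
# It accepts exactly {"ok": true} or {"ok": false}.
TRANSITIONS = {
    0: {'{"ok": ': 1},
    1: {"true": 2, "false": 2},
    2: {"}": 3},
}
START, ACCEPT = 0, 3


def constrained_decode(transitions, start, accept, pick):
    """Walk the grammar; call `pick` (the stand-in model) only on real choices."""
    state, pieces, model_calls = start, [], 0
    while state != accept:
        allowed = transitions[state]
        if len(allowed) == 1:
            # Forced transition: fast-forward without a model call.
            token = next(iter(allowed))
        else:
            # Genuine choice: the model picks, but only among allowed tokens.
            model_calls += 1
            token = pick(sorted(allowed))
        pieces.append(token)
        state = allowed[token]
    return "".join(pieces), model_calls


# Stand-in "model" that always takes the first allowed token.
text, calls = constrained_decode(TRANSITIONS, START, ACCEPT, lambda opts: opts[0])
print(text, calls)
```

In this toy, three tokens are emitted but the stand-in model is consulted only once. Real systems work on token-id logit masks rather than string tables, so treat the structure here as an illustration only.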
The workload-decision tree
Two more secondary considerations:
- Are you running DeepSeek-V3 / R1? Both frameworks support it. SGLang shipped DP-attention + DeepEP support earlier and is usually slightly ahead on H200-class deployments. Benchmark on your hardware before committing.
- Do you need stability for production today, or feature velocity? vLLM has a longer track record in production at scale. SGLang ships new features faster (xgrammar, EAGLE-3, DP attention) but breaks API contracts more often.
Common misreadings of benchmarks
- "SGLang is 5× faster than vLLM." Usually true on the SGLang authors' multi-call benchmarks. Often false on single-call streaming. Read the workload before quoting the number.
- "vLLM has lower latency at p99." Often true at small concurrency. Reverses at higher concurrency where SGLang's hit rate compounds.
- "They are identical with the same kernels." The kernel layer is convergent. The scheduler and prefix cache are not. On a workload with > 50% shared prefix the scheduler is half the win.
- "vLLM's APC equals SGLang's RadixAttention." They overlap, but they are not equivalent. vLLM's APC hashes block-aligned prefixes starting at token 0 of each sequence; it cannot share a middle segment inserted at a different offset. RadixAttention is token-granular at the index level (it splits where prompts diverge, even mid-block) but block-granular at the KV pool: partial-block tails are handled by allocating a fresh block, not by copying. When your prefixes are clean and short and arrive close in time, the two achieve the same hit rate. When prefixes are long, branchy, or arrive minutes apart, the radix tree pulls ahead.
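The difference between block-aligned hashing and token-level matching in the last point can be sketched in a few lines. This is a simplified model, not either project's implementation: the block size of 16, the chained SHA-256 hash, and both function names are assumptions chosen for illustration. It compares how many prompt tokens each lookup scheme recognizes as already cached when two prompts diverge in the middle of a block.

```python
import hashlib

BLOCK = 16  # assumed KV block size, in tokens


def block_hashes(tokens, block=BLOCK):
    """Chained hash per full block: each hash covers the whole prefix so far."""
    hashes, parent = [], b""
    full = len(tokens) - len(tokens) % block  # ignore the partial tail block
    for i in range(0, full, block):
        chunk = ",".join(map(str, tokens[i:i + block])).encode()
        parent = hashlib.sha256(parent + chunk).digest()
        hashes.append(parent)
    return hashes


def block_aligned_hit(cached, new, block=BLOCK):
    """Tokens matched when lookup is by block-aligned prefix hash."""
    hits = 0
    for a, b in zip(block_hashes(cached, block), block_hashes(new, block)):
        if a != b:
            break
        hits += 1
    return hits * block


def token_level_hit(cached, new):
    """Tokens matched when the index can split at the exact divergence point."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


cached = list(range(40))                # a previously served prompt
new = list(range(25)) + [999] * 15      # shares the first 25 tokens, then diverges

print(block_aligned_hit(cached, new))   # whole blocks only
print(token_level_hit(cached, new))     # exact shared length
```

Here the block-aligned lookup recognizes 16 shared tokens while the token-level index recognizes 25. As the text notes, the KV pool itself is still block-granular, so how much of that extra index-level match turns into reused memory depends on the implementation.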
How to benchmark fairly
- Use your real prompts. Synthetic benchmarks favor whoever wrote them. Your prompt mix is the only honest workload.
- Run for at least 5 minutes after warm-up. Caches need time to populate. The speculative-decoding acceptance rate α stabilizes after a few hundred requests.
- Report both throughput and p99 latency. Throughput-only numbers hide queueing effects.
- Match GPU memory utilization. SGLang and vLLM default differently; one with 90% utilization vs. the other with 70% is not a fair fight.
- Match KV dtype. fp8 KV on one framework vs fp16 KV on the other is a 2× memory advantage, not an architectural advantage.
- Match attention backend. Both can use FlashInfer; both can use FlashAttention. Specify which.
- Don't compare on different model versions. Llama-3 70B Instruct vs Llama-3.1 70B Instruct is not the same workload.
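The warm-up and "report both throughput and p99" points can be made concrete with a small summary helper. This is a minimal sketch, assuming you already log a start time, end time, and output-token count per request; the `RequestRecord` type and `summarize` function are hypothetical names, and p99 is computed with the nearest-rank method.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    start_s: float       # when the request was sent
    end_s: float         # when the last token arrived
    output_tokens: int   # tokens generated for this request


def summarize(records, warmup_s=0.0):
    """Throughput and p99 latency over requests that started after warm-up."""
    t0 = min(r.start_s for r in records)
    kept = [r for r in records if r.start_s >= t0 + warmup_s]
    if not kept:
        raise ValueError("no requests left after discarding warm-up")
    latencies = sorted(r.end_s - r.start_s for r in kept)
    rank = (99 * len(latencies) + 99) // 100   # ceil(0.99 * n), integer math
    window = max(r.end_s for r in kept) - min(r.start_s for r in kept)
    return {
        "requests": len(kept),
        "throughput_tok_s": sum(r.output_tokens for r in kept) / window,
        "p99_latency_s": latencies[rank - 1],
    }


# Synthetic data for illustration only: 98 fast requests and 2 slow ones.
records = [RequestRecord(float(i), i + 1.0, 10) for i in range(98)]
records += [RequestRecord(98.0, 103.0, 10), RequestRecord(99.0, 104.0, 10)]
summary = summarize(records)
print(summary)
```

With this synthetic data the throughput looks healthy while the p99 latency is five times the typical latency, which is the queueing effect a throughput-only report would hide. Run the same helper, with the same `warmup_s`, on logs from both frameworks.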
The convergent future
Watch this space: both projects are evolving toward each other.
- vLLM has added prefix caching, chunked prefill, and EAGLE speculative decoding, and is improving its scheduler.
- SGLang has added LoRA, broader OpenAI-compat coverage, and tightened its kernels.
- FlashInfer is used (or being adopted) by both.
- The xgrammar library is shared between them.
The architecture differences will narrow. What is likely to remain durable: the frontend DSL and the radix-tree-shaped cache. Those are SGLang's bets on what the workload is. vLLM's bet — that one call is the right unit — has driven different choices and will keep driving them. Pick the framework whose bet matches your workload, not whose benchmark slide you saw last.
The synthesis question — what did you actually learn
If you've gotten this far, you can answer five questions concretely:
- Why does prefix sharing matter? Because 50–95% of modern workloads' prefill is redundant — lessons 01, 03.
- Why a radix tree? Because prefixes form a tree, and a tree captures partial overlaps that a hashmap cannot — lesson 04.
- Why is the scheduler half the story? Because the tree is passive; only LPM-first ordering turns its capability into hit rate — lesson 05.
- Why is constrained decoding sample-time, not post-hoc? Because compressed-FSM + xgrammar make it free, and parse-then-retry is anything but free — lessons 06, 07.
- Why FlashInfer + DP attention + EP? Each layer of the kernel and parallelism stack picks up the slack of the previous one. Each is a measured response, not a stylistic choice — lessons 08, 09.
If any of those answers is fuzzy, that's the lesson to re-read. The series builds tightly, and removing any one of these mechanisms collapses the others into something a plain serving framework already provides.
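As a quick self-check on the scheduler answer, longest-prefix-match-first ordering can be sketched as follows. This is a simplified illustration, not SGLang's scheduler: it scans a flat list of cached prefixes instead of walking a radix tree, and the function names are hypothetical.

```python
def common_prefix_len(a, b):
    """Number of leading tokens that two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def lpm_order(waiting, cached_prefixes):
    """Order waiting requests by their longest cached prefix, longest first."""
    def best_match(request):
        return max(
            (common_prefix_len(request, c) for c in cached_prefixes),
            default=0,
        )
    # sorted() is stable, so ties keep their arrival order.
    return sorted(waiting, key=best_match, reverse=True)


cached_prefixes = [[1, 2, 3, 4]]
waiting = [[9, 9], [1, 2, 7], [1, 2, 3, 4, 5]]   # arrival order
print(lpm_order(waiting, cached_prefixes))
```

The request that reuses the most cached tokens runs first, so its prefix is used before it can be evicted. A cache with first-come-first-served ordering would hold the same entries but hit them less often.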
Interactive · predicted ratio for your workload
[Interactive widget: sliders for shared-prefix fraction, JSON share, batch size, and model class drive a predicted speedup ratio.] The model is illustrative, not authoritative: it composes the cache, scheduling, and constrained-decoding lessons into one back-of-envelope estimate. Run real benchmarks before betting infrastructure on it.
/get_server_info JSON or the Prometheus /metrics endpoint), it's already paying for itself. If you go past 50%, every other knob (kernels, parallelism, spec decoding) is gravy.