SGLang vs vLLM — picking a side for your workload

Two frameworks, two theses, broadly the same kernels at the bottom. SGLang bets that the program is the unit of work; vLLM bets that the call is. The two bets converge on similar architectures but split on what's optimized for free vs. what's bolted on.

The thesis statements, side by side

and embedded Python DSL. Strength: multi-call, multi-tenant, structured workloads.

What each picks up for free

Workload property	vLLM gets it for free?	SGLang gets it for free?
Continuous batching, paged KV	yes	yes
Block-aligned prefix sharing	yes (APC)	yes (subset of radix)
Mid-prompt prefix sharing	no	yes
Cache-aware request ordering	partial	yes (default)
Multi-turn agent KV pinning	requires APC tuning	natural through radix
Strict JSON / schema output	optional plugin	built-in, fast-forwarded
Forking / parallel sampling structure	n > 1 sampling	fork() primitive
MLA / DP attention (DeepSeek)	supported	supported (DP attention default)
EP for MoE	supported	supported (DeepEP integration)
Speculative decoding	supported	supported (incl. EAGLE)
LoRA serving	supported (S-LoRA-style)	supported

Below the "supported" line both frameworks are competent. The real divergences are in the top six rows; that is where the design theses bite.

The workload-decision tree

Two more secondary considerations:

Are you running DeepSeek-V3 / R1? Both frameworks support it. SGLang shipped DP-attention and DeepEP support earlier and is usually slightly ahead on H200-class deployments. Benchmark on your hardware before committing.
Do you need stability for production today, or feature velocity? vLLM has a longer track record in production at scale. SGLang ships new features faster (xgrammar, EAGLE-3, DP attention) but breaks API contracts more often.

Common misreadings of benchmarks

Numbers to distrust

"SGLang is 5× faster than vLLM." Usually true on the SGLang authors' multi-call benchmarks. Often false on single-call streaming. Read the workload before quoting the number.
"vLLM has lower latency at p99." Often true at small concurrency. Reverses at higher concurrency where SGLang's hit rate compounds.
"They are identical with the same kernels." The kernel layer is convergent. The scheduler and prefix cache are not. At a workload with > 50% shared prefix the scheduler is half the win.
"vLLM's APC equals SGLang's RadixAttention." They overlap. vLLM's APC hashes block-aligned prefixes starting at token 0 of each sequence; it cannot share a middle segment inserted at a different offset. RadixAttention is token-granular at the index level (splits where prompts diverge, even mid-block) but block-granular at the KV pool — partial-block tails are handled by allocating a fresh block, not by copying. When your prefixes are clean and short and arrive close in time, they match. When prefixes are long, branchy, or arrive minutes apart, the radix tree pulls ahead.

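The granularity difference in the last item can be made concrete with a toy model. The sketch below is not either project's code: block_aligned_reuse and token_level_match are hypothetical helpers, the block size of 16 is an assumption, and SHA-256 chaining merely stands in for whatever hash a real cache uses. It counts how many tokens of a new prompt each lookup scheme would recognize as already cached.

```python
from hashlib import sha256


def block_hashes(tokens, block_size):
    """Chain-hash every full block, so each hash depends on all tokens before it."""
    hashes, parent = [], b""
    full = len(tokens) - len(tokens) % block_size  # drop the partial tail block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        parent = sha256(parent + repr(block).encode()).digest()
        hashes.append(parent)
    return hashes


def block_aligned_reuse(cached, new, block_size=16):
    """Tokens reusable when lookup only matches whole blocks starting at token 0."""
    known = set(block_hashes(cached, block_size))
    reused = 0
    for h in block_hashes(new, block_size):
        if h not in known:
            break  # the chain is broken; later blocks cannot match either
        reused += block_size
    return reused


def token_level_match(cached, new):
    """Length of the longest common prefix, as a token-granular index would see it."""
    matched = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        matched += 1
    return matched


# Toy prompts: identical for 37 tokens, then they diverge mid-block.
cached_prompt = list(range(40))
new_prompt = list(range(37)) + [999, 1000, 1001]

print(block_aligned_reuse(cached_prompt, new_prompt))  # 32
print(token_level_match(cached_prompt, new_prompt))    # 37
```

The five-token gap is the index-level difference only. As noted above, the KV pool itself still works in blocks, so how much of that gap turns into saved compute depends on how the partial block is handled.
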
How to benchmark fairly

Use your real prompts. Synthetic benchmarks favor whoever wrote them. Your prompt mix is the only honest workload.
Run for at least 5 minutes after warm-up. Caches need time to populate. Spec-decoding's α stabilizes after a few hundred requests.
Report both throughput and p99 latency. Throughput-only numbers hide queueing effects.
Match GPU memory utilization. SGLang and vLLM default differently; one with 90% utilization vs. the other with 70% is not a fair fight.
Match KV dtype. fp8 KV on one framework vs fp16 KV on the other is a 2× memory advantage, not an architectural advantage.
Match attention backend. Both can use FlashInfer; both can use FlashAttention. Specify which.
Don't compare on different model versions. Llama-3 70B Instruct vs Llama-3.1 70B Instruct is not the same workload.
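
The reporting rules above (discard warm-up, report throughput together with p99) can be sketched as a small post-processing step. This assumes you have logged one (start_time, end_time, output_tokens) tuple per request for each framework; the summarize and percentile helpers and the record layout are hypothetical, and the percentile uses the simple nearest-rank method rather than any particular benchmark tool's definition.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records, warmup_s=0.0):
    """Summarize (start, end, output_tokens) records, ignoring requests that began during warm-up."""
    first_start = min(start for start, _, _ in records)
    kept = [r for r in records if r[0] - first_start >= warmup_s]
    if not kept:
        raise ValueError("no requests left after discarding warm-up")
    window = max(end for _, end, _ in kept) - min(start for start, _, _ in kept)
    if window <= 0:
        raise ValueError("measurement window must be positive")
    latencies = [end - start for start, end, _ in kept]
    return {
        "requests": len(kept),
        "tokens_per_s": sum(tokens for _, _, tokens in kept) / window,
        "p50_s": percentile(latencies, 50),
        "p99_s": percentile(latencies, 99),
    }


# Toy data in seconds: nine 1-second requests and one 3-second straggler.
records = [(float(i), float(i) + 1.0, 100) for i in range(9)] + [(9.0, 12.0, 100)]
summary = summarize(records, warmup_s=5.0)
print(summary)
```

In the toy data the single straggler leaves p50 at 1 second but pushes p99 to 3 seconds, which is the kind of queueing effect a throughput-only number hides. Run the same function over both frameworks' logs with the same warm-up cut-off.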

The convergent future

Watch this space: both projects are evolving toward each other.

vLLM has added prefix caching, chunked prefill, and EAGLE speculative decoding, and is improving its scheduler.
SGLang has added LoRA, broader OpenAI-compat coverage, and tightened its kernels.
FlashInfer is used (or being adopted) by both.
The xgrammar library is shared between them.

The architecture differences will narrow. What is likely to remain durable: the frontend DSL and the radix-tree-shaped cache. Those are SGLang's bets on what the workload is. vLLM's bet — that one call is the right unit — has driven different choices and will keep driving them. Pick the framework whose bet matches your workload, not whose benchmark slide you saw last.

The synthesis question — what did you actually learn

If you've gotten this far, you can answer five questions concretely:

Why does prefix sharing matter? Because 50–95% of modern workloads' prefill is redundant — lessons 01, 03.
Why a radix tree? Because prefixes form a tree, and a tree captures partial overlaps that a hashmap cannot — lesson 04.
Why is the scheduler half the story? Because the tree is passive; only LPM-first ordering turns its capability into hit rate — lesson 05.
Why is constrained decoding sample-time, not post-hoc? Because compressed-FSM + xgrammar make it free, and parse-then-retry is anything but free — lessons 06, 07.
Why FlashInfer + DP attention + EP? Each layer of the kernel and parallelism stack picks up the slack of the previous one. Each is a measured response, not a stylistic choice — lessons 08, 09.

If any of those answers is fuzzy, that's the lesson to re-read. The series builds tightly, and removing any one of these mechanisms collapses the others into something a plain serving framework already provides.

Interactive · predicted ratio for your workload

Slide your workload's actual signals — shared-prefix fraction, JSON share, batch size, model class — and watch the predicted speedup ratio. The model is illustrative, not authoritative: it composes the cache, scheduling, and constrained-decoding lessons into one back-of-envelope estimate. Run real benchmarks before betting infrastructure on it.
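
The cache term of such an estimate can be sketched with Amdahl-style arithmetic. This is not the formula behind the widget, which the text does not spell out: prefill_speedup and its three parameters are hypothetical, and the scheduling and constrained-decoding effects are left out entirely. It only answers how much faster a request gets if the cached part of its prefill is skipped.

```python
def prefill_speedup(shared_fraction, hit_rate, prefill_share):
    """Estimated end-to-end speedup from skipping cached prefill work.

    shared_fraction: fraction of prompt tokens that belong to a shared prefix
    hit_rate:        fraction of that shared prefix actually found in cache
    prefill_share:   fraction of request time spent in prefill without caching
    All three are assumptions you supply, each in [0, 1].
    """
    for name, value in (
        ("shared_fraction", shared_fraction),
        ("hit_rate", hit_rate),
        ("prefill_share", prefill_share),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    saved = prefill_share * shared_fraction * hit_rate  # fraction of total time removed
    if saved >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - saved)


# 80% shared prefix, half of it cached, prefill is half of request time.
print(prefill_speedup(0.8, 0.5, 0.5))  # about 1.25
```

Plugging in your own three numbers shows quickly whether a headline ratio is even plausible for your traffic: with no shared prefix the estimate is exactly 1.0, regardless of which framework's slide you are reading.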

What to do next

Stand up an SGLang server on your real workload for 24 hours. Compare it to vLLM on the same hardware, same model, same KV dtype, and same memory utilization. If SGLang's prefix cache hit rate is above 30% (visible via the runtime's /get_server_info JSON or the Prometheus /metrics endpoint), it's already paying for itself. If it goes past 50%, every other knob (kernels, parallelism, spec decoding) is gravy.
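
Checking the hit rate against those thresholds can be scripted. The sketch below parses Prometheus text-format output and applies the 30% and 50% cut-offs from the paragraph above. The metric name example:cache_hit_rate is a placeholder, not a confirmed SGLang metric name; look up the real name in your own server's /metrics output. The read_gauge, verdict, and fetch_metrics helpers are hypothetical, and base_url is whatever address your server listens on.

```python
from urllib.request import urlopen


def fetch_metrics(base_url):
    """Download the raw Prometheus text from <base_url>/metrics."""
    with urlopen(base_url.rstrip("/") + "/metrics", timeout=10) as response:
        return response.read().decode("utf-8")


def read_gauge(metrics_text, metric_name):
    """Return every sample value reported for metric_name, ignoring labels."""
    values = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        if "{" in line:
            name, _, rest = line.partition("{")
            rest = rest.rpartition("}")[2]  # text after the label set
        else:
            name, _, rest = line.partition(" ")
        if name.strip() != metric_name:
            continue
        fields = rest.split()
        if fields:
            values.append(float(fields[0]))
    return values


def verdict(hit_rate):
    """Apply the 30% / 50% thresholds from the text."""
    if hit_rate > 0.5:
        return "cache dominates; other knobs are gravy"
    if hit_rate > 0.3:
        return "cache is paying for itself"
    return "cache is marginal for this workload"


# Offline sample so the sketch runs without a server.
sample = """\
# HELP example:cache_hit_rate Placeholder gauge for illustration
# TYPE example:cache_hit_rate gauge
example:cache_hit_rate{model_name="demo"} 0.42
other_metric 7
"""
rates = read_gauge(sample, "example:cache_hit_rate")
print(rates, verdict(rates[0]))
```

Against a live server, replace the sample with fetch_metrics(base_url) and poll over the full 24 hours rather than reading a single value, since the hit rate climbs as the cache populates.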