Tensor, data, and expert parallelism

A 70B model doesn't fit on one GPU. The way you split it changes what scales and what doesn't. SGLang ships TP for dense linears, DP attention for MLA-style architectures, and EP for MoE experts — and the choice between them is determined by what becomes the bottleneck after the previous split.

The three parallelism axes

TP splits weight matrices; DP splits the batch; EP splits experts. They are orthogonal, and a real deployment uses all three.

Communication cost summary (per layer):

TP: ~2·B·S·d bytes via all-reduce (ring all-reduce traffic per rank is nearly independent of rank count).
DP: 0 inside attention; a small all-gather at the attention↔MLP boundary.
EP: B·S·top_k·d bytes via all-to-all, at every MoE layer (only where the MLP is replaced by experts).

Tensor parallelism — the default split

Tensor parallelism (Megatron-style) shards the linear layers across N GPUs. The two big linears in a transformer block (QKV projection and the MLP up-projection) are split column-wise: each rank holds hidden/N output channels. The reverse-direction linears (output projection, MLP down-projection) are split row-wise. After each block, an all-reduce sums the partial outputs.
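The column-then-row split can be checked on a toy example. The following is a minimal pure-Python sketch, not SGLang code: the helper names (`split_columns`, `split_rows`, `all_reduce_sum`, `tp_mlp`) are made up for illustration, and the activation function is omitted so the sharded and unsharded results can be compared exactly. An element-wise activation would not change the picture, because each rank only touches its own output channels of the up-projection.

```python
# Toy Megatron-style sharding of one MLP (up-projection, then down-projection).
# All helper names here are illustrative, not part of any library.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(w, n):
    """Column-wise split: each rank gets a contiguous block of output channels."""
    step = len(w[0]) // n
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(n)]

def split_rows(w, n):
    """Row-wise split: each rank gets a contiguous block of input channels."""
    step = len(w) // n
    return [w[r * step:(r + 1) * step] for r in range(n)]

def all_reduce_sum(partials):
    """Stand-in for the all-reduce: element-wise sum of every rank's partial output."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

def tp_mlp(x, w_up, w_down, n_ranks):
    ups = split_columns(w_up, n_ranks)     # rank r holds intermediate/N output channels
    downs = split_rows(w_down, n_ranks)    # rank r holds the matching input channels
    partials = [matmul(matmul(x, ups[r]), downs[r]) for r in range(n_ranks)]
    return all_reduce_sum(partials)        # one all-reduce after the block

x = [[1, 2], [3, 4]]                        # 2 tokens, hidden = 2
w_up = [[1, 0, 2, 1], [0, 1, 1, 3]]         # hidden 2 -> intermediate 4
w_down = [[1, 2], [0, 1], [2, 0], [1, 1]]   # intermediate 4 -> hidden 2

reference = matmul(matmul(x, w_up), w_down)
sharded = tp_mlp(x, w_up, w_down, n_ranks=2)
assert sharded == reference
```

No communication is needed between the two matmuls: the column split of the first linear lines up with the row split of the second, so the only collective is the final sum.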

The good: TP linearly increases the available HBM and bandwidth for weights and activations. A 70B fp16 model needs ~140 GB of weights → fits on 2× H100 80 GB at TP=2; on 4× at TP=4.
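The memory arithmetic is easy to reproduce. This sketch assumes every weight is sharded evenly across ranks (it ignores anything a real implementation might replicate) and counts 1 GB as 10^9 bytes; the function names are illustrative only.

```python
def weight_gb(params_billion, bytes_per_param):
    """Total weight footprint in GB (1e9 params * bytes each / 1e9 bytes per GB)."""
    return params_billion * bytes_per_param

def per_rank_weight_gb(params_billion, bytes_per_param, tp):
    """Weights held by each rank, assuming a perfectly even TP split."""
    return weight_gb(params_billion, bytes_per_param) / tp

HBM_GB = 80  # H100 80 GB, as in the text

for tp in (1, 2, 4, 8):
    per_rank = per_rank_weight_gb(70, 2, tp)   # 70B parameters at fp16 (2 bytes each)
    headroom = HBM_GB - per_rank               # what is left for KV cache and activations
    print(f"TP={tp}: {per_rank:.1f} GB of weights per rank, {headroom:.1f} GB headroom")
```

The headroom column is the part that matters in practice: whatever the weights leave over is what the KV pool and activations have to share.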

The bad: TP requires fast intra-node interconnect. In a ring all-reduce each rank sends about 2(N−1)/N times the payload, so traffic per rank is roughly 2 × batch × seq × hidden elements per layer per step, nearly regardless of rank count. For batch 8 × seq 1 (decode) × hidden 8192 × 2 B per element, the payload is 128 KB and the ring traffic is ≈ 256 KB per layer per step: tiny per layer, and only ≈ 20 MB per step across 80 layers. At those small message sizes the NVLink latency floor (~10 µs per all-reduce) is the binding cost, not bandwidth. Above TP=8 most servers lose NVLink and fall back to PCIe / IB; the per-all-reduce latency jumps by an order of magnitude, and throughput drops even though the larger model now fits.
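To see why latency rather than bandwidth dominates at decode, the numbers above can be worked through directly. This is a back-of-the-envelope sketch: the ~10 µs latency floor is the figure quoted in the text, while `ASSUMED_LINK_BYTES_PER_S` is a placeholder bandwidth chosen only for illustration, not a measured value.

```python
def allreduce_payload_bytes(batch, seq, hidden, bytes_per_elem=2):
    """Size of the activation tensor that one all-reduce has to sum."""
    return batch * seq * hidden * bytes_per_elem

def ring_traffic_bytes(payload, n_ranks):
    """Bytes each rank sends in a ring all-reduce: 2 * (N - 1) / N * payload."""
    return 2 * (n_ranks - 1) / n_ranks * payload

LAYERS = 80
LATENCY_FLOOR_S = 10e-6            # ~10 µs per all-reduce, from the text
ASSUMED_LINK_BYTES_PER_S = 400e9   # hypothetical per-rank bandwidth, illustration only

payload = allreduce_payload_bytes(batch=8, seq=1, hidden=8192)   # 131072 B = 128 KB
per_layer = 2 * payload                                          # large-N upper bound, 256 KB
per_step = LAYERS * per_layer                                    # about 20 MB

latency_cost_s = LAYERS * LATENCY_FLOOR_S                # fixed cost of 80 all-reduces
bandwidth_cost_s = per_step / ASSUMED_LINK_BYTES_PER_S   # time to actually move the bytes

print(f"payload per all-reduce: {payload / 1024:.0f} KB")
print(f"traffic per step:       {per_step / 1e6:.1f} MB")
print(f"latency floor per step: {latency_cost_s * 1e6:.0f} us")
print(f"bandwidth time per step (assumed link): {bandwidth_cost_s * 1e6:.0f} us")
```

Under these assumptions the fixed latency cost is more than ten times the transfer time, which is why a slower interconnect hurts decode mostly through per-message latency.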

Where TP saturates

Model	Sweet TP	Why
Llama-3-8B	1	Fits on one H100; TP only hurts.
Llama-3-70B fp16	4 or 8	Two H100s is tight; 4 is comfortable.
Llama-3-405B fp8	8	Fills one DGX node.
DeepSeek-V3 (671B, MoE)	8 (with EP)	Weights × experts force EP, not pure TP.

DP attention — the right answer for MLA

DeepSeek's Multi-head Latent Attention (MLA) compresses the KV cache to a single low-rank vector per token instead of per-head K and V. This gives huge memory savings (~10× smaller KV for the same model). But it also breaks the standard TP-on-attention layout: with TP=8 and only one logical KV per token, splitting along the head dimension shards a vector that has no head dimension to shard.

The real problem with TP on MLA: head-parallel TP would force every rank to keep a full copy of the (tiny) latent KV cache, duplicating the very thing MLA exists to compress. SGLang's response: DP attention. The attention path is data-parallel across DP ranks; each rank owns its own slice of the (small) latent KV pool, holding KV only for the sequences assigned to it. The MLP and other linears stay tensor-parallel as before.

What this buys you:

Each rank has its own full KV pool. With, say, 60 GB free per rank, 4 ranks hold 240 GB of distinct KV. With TP on MLA, every rank would hold the same latent KV, so the node would effectively share one logical 60 GB pool.
No all-reduce inside attention. Each rank's attention runs independently and writes to its own KV.
Throughput scales with rank count for attention. Each rank now has its own HBM bandwidth serving its own KV slice. Two ranks at batch B/2 deliver 2× the aggregate KV bandwidth of one rank at batch B — which is what matters for the bandwidth-bound decode step.

The cost: an all-gather at the boundary between DP-attention and TP-MLP, and KV is not shared between ranks (so RadixAttention is per-rank). The all-gather is small (one hidden dim per token), and most workloads have enough request-locality that per-rank caches still work well.
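The memory effect of DP attention can be illustrated with a toy assignment of sequences to ranks. This is a sketch under stated assumptions: the greedy least-loaded policy and the names `assign_to_dp_ranks` and `kv_tokens_per_rank_replicated` are hypothetical and say nothing about how SGLang's scheduler actually places requests.

```python
def assign_to_dp_ranks(seq_lens, dp_size):
    """Greedy least-loaded placement (hypothetical policy).

    seq_lens maps a sequence id to its token count. Returns the owning rank
    of each sequence and the number of KV tokens each rank ends up holding.
    """
    load = [0] * dp_size
    owner = {}
    # Place the longest sequences first so the greedy choice balances well.
    for seq_id, n_tokens in sorted(seq_lens.items(), key=lambda kv: -kv[1]):
        rank = min(range(dp_size), key=lambda r: load[r])
        owner[seq_id] = rank
        load[rank] += n_tokens
    return owner, load

def kv_tokens_per_rank_replicated(seq_lens, n_ranks):
    """Baseline where every rank keeps the latent KV of every sequence."""
    total = sum(seq_lens.values())
    return [total] * n_ranks

seq_lens = {"a": 4000, "b": 3000, "c": 2000, "d": 1000}

owner, dp_load = assign_to_dp_ranks(seq_lens, dp_size=2)
replicated_load = kv_tokens_per_rank_replicated(seq_lens, n_ranks=2)

print("owner per sequence:", owner)
print("KV tokens per rank with DP attention:", dp_load)
print("KV tokens per rank if replicated:    ", replicated_load)
```

Each rank stores and reads only the KV of its own sequences, so both the footprint and the bytes read per decode step are split across ranks instead of repeated on each one.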

When DP attention is the right call

Use DP attention when the model uses MLA or when KV cache pressure dominates the workload. For standard GQA models with sufficient TP headroom (Llama-3-70B at TP=4), pure TP is simpler and roughly equivalent.

Expert parallelism for MoE

A Mixture-of-Experts model replaces each MLP with a router that picks top-k of E experts. DeepSeek-V3 has 256 routed experts; at k=8, every token activates 8 of them. The experts' weights are gigantic in aggregate (~600 GB for V3 at fp8) but only ~16 GB are active per token.

Naive serving: replicate the experts across all ranks. Cost: 600 GB × N replicas. Unaffordable.

Expert parallelism: distribute the experts across ranks. Each rank holds E/EP experts. At each MoE layer, tokens routed to expert i are sent (all-to-all) to the rank that owns i, computed there, and returned.
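The dispatch half of that all-to-all amounts to bucketing (token, expert) pairs by the rank that owns the expert. The sketch below assumes experts are placed in contiguous blocks (experts 0..E/EP−1 on rank 0, and so on); that placement and the function names are assumptions for illustration, not a description of DeepEP or SGLang internals.

```python
def expert_owner(expert_id, num_experts, ep_size):
    """Rank that holds an expert, assuming contiguous blocks of E / EP experts."""
    experts_per_rank = num_experts // ep_size
    return expert_id // experts_per_rank

def dispatch(routing, num_experts, ep_size):
    """Group (token index, expert id) pairs by destination rank.

    routing[t] is the list of top-k expert ids the router chose for token t.
    The returned buckets are what the all-to-all would send to each rank.
    """
    buckets = [[] for _ in range(ep_size)]
    for token_idx, experts in enumerate(routing):
        for expert_id in experts:
            buckets[expert_owner(expert_id, num_experts, ep_size)].append((token_idx, expert_id))
    return buckets

# 256 routed experts over 8 ranks gives 32 experts per rank; top-2 here to keep it short.
routing = [[0, 33], [255, 31], [64, 65]]
buckets = dispatch(routing, num_experts=256, ep_size=8)

for rank, items in enumerate(buckets):
    print(f"rank {rank}: {items}")
```

The combine step is the mirror image: each rank returns its expert outputs to the rank that owns the token, where they are weighted by the router scores and summed.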

The all-to-all is the dominant communication primitive for MoE serving. SGLang integrates DeepEP — DeepSeek's optimized all-to-all library — for the dispatch and combine. Without an optimized all-to-all, EP overhead can exceed 30% of the layer time; DeepEP brings it down to single-digit percentages on H800/H100 with NVLink.

Why EP and TP have to coexist for V3-class models

TP=8 alone can't hold 671B at fp8 on 80 GB GPUs (~671 GB of weights > 8 × 80 GB = 640 GB, before any KV cache). EP alone can't shard the dense attention path. The actual layout is TP=8 for attention + dense linears, EP=8 across the same ranks for experts. Each rank owns 1/8 of every dense weight AND 1/8 of the routed experts. The two parallelisms compose because they target different layers.

Pipeline parallelism — present but rarely needed at inference

Pipeline parallelism splits the model by layer: ranks 0–3 hold layers 0–20, ranks 4–7 hold layers 21–40, etc. Activations flow rank-to-rank. It is the standard way to fit huge dense models at training time.

At inference time PP is rarely worth it. Reasons:

Decode is autoregressive — there's only one new token per step, so the pipeline "bubble" (time spent waiting for the previous stage) is harder to hide than at training.
For models that fit in one node with TP, PP adds latency without throughput.
PP shines when you need to span multiple nodes, which is when network bandwidth becomes the binding constraint anyway.
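The bubble argument can be made concrete with the usual idealized estimate for a synchronous pipeline: with p equal-cost stages and m micro-batches in flight, the idle fraction is (p − 1) / (m + p − 1). This is a textbook approximation rather than a model of SGLang's PP scheduler, and real servers can reduce the bubble by keeping several batches in flight.

```python
def bubble_fraction(stages, micro_batches):
    """Idle fraction of an idealized pipeline with equal-cost stages."""
    return (stages - 1) / (micro_batches + stages - 1)

# Training-style: many micro-batches per step amortize the fill and drain phases.
training_like = bubble_fraction(stages=4, micro_batches=16)

# Decode-style worst case: a single batch advances one token at a time,
# so only one stage is busy at any moment.
decode_like = bubble_fraction(stages=4, micro_batches=1)

print(f"4 stages, 16 micro-batches: {training_like:.0%} idle")
print(f"4 stages,  1 micro-batch:   {decode_like:.0%} idle")
```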

SGLang supports PP but defaults to TP/DP/EP for single-node deployments.

Reading a real deployment

An H200 node serving DeepSeek-R1 today looks like:

Knob	Value	Why
--tp 8	8	Full node; saturates NVLink; dense weights fit.
--dp 8	8	DP attention reuses the same 8 ranks (not 8 × 8 = 64). Each rank acts as TP for the MLP and DP for attention.
--enable-dp-attention	true	Activates the DP attention path for MLA.
--ep-size 8	8	256 experts across the same 8 ranks.
--mem-fraction-static	0.9	Pool size — lesson 03.
--enable-torch-compile	true	Slight extra speed on supported ops.

Reading this config, you can predict the bottlenecks: KV is bandwidth-bound (mitigated by MLA + DP attention); the MoE layer's all-to-all is bandwidth-bound (mitigated by DeepEP); the dense linears need NVLink (TP=8). Any one of these going wrong is visible in the profiler trace.
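The table can be turned into a launch command. The sketch below only assembles the argument list; it assumes the `python -m sglang.launch_server` entry point, the `--model-path` flag, and the `deepseek-ai/DeepSeek-R1` model id, none of which appear in the table, and flag spellings can differ between SGLang versions, so check them against the installed release. The rows marked "true" are treated as switches that take no value.

```python
import shlex

# Knobs copied from the table; True marks a switch passed without a value.
knobs = {
    "--tp": 8,
    "--dp": 8,
    "--enable-dp-attention": True,
    "--ep-size": 8,
    "--mem-fraction-static": 0.9,
    "--enable-torch-compile": True,
}

def build_argv(model_path, knobs):
    """Assemble an argv list for the (assumed) sglang.launch_server entry point."""
    argv = ["python", "-m", "sglang.launch_server", "--model-path", model_path]
    for flag, value in knobs.items():
        if value is True:
            argv.append(flag)                 # boolean switch
        elif value is not False:
            argv.extend([flag, str(value)])   # flag followed by its value
    return argv

argv = build_argv("deepseek-ai/DeepSeek-R1", knobs)   # model id is an assumption
command = shlex.join(argv)
print(command)
```

Keeping the knobs in one dictionary makes it easy to diff two deployments and see which parallelism axis actually changed.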

What lesson 10 builds

With the model fitting and the kernels fast, the remaining lever on decode throughput is how many useful tokens come out of each forward pass. Lesson 10 turns one big-model forward into multiple emitted tokens via speculative decoding — draft K tokens cheaply, verify them all in one target-model forward, accept what the target would have sampled anyway.