Tensor, data, and expert parallelism

A 70B model doesn't fit on one GPU. The way you split it changes what scales and what doesn't. SGLang ships TP for dense linears, DP attention for MLA-style architectures, and EP for MoE experts — and the choice between them is determined by what becomes the bottleneck after the previous split.

The three parallelism axes

[Figure: the three parallelism axes, each shown on two GPUs. Tensor-parallel (TP): GPU 0 holds W₁ columns 0..d/2 and GPU 1 holds columns d/2..d, with an all-reduce after each layer. Data-parallel attention (DP): replica 0 handles sequences 0..N/2 and replica 1 handles N/2..N, with no communication and a per-replica KV cache. Expert-parallel (EP): GPU 0 holds experts 0..E/2 and GPU 1 holds experts E/2..E, with an all-to-all on every MoE layer.]

TP splits weight matrices; DP splits the batch; EP splits experts. They are orthogonal, and a real deployment uses all three.

Communication cost summary (per layer):

TP: ~2·B·S·d bytes via all-reduce per layer (ring all-reduce is rank-count independent).
DP: 0 inside attention; a small all-gather at the attention↔MLP boundary.
EP: B·S·top_k·d bytes via all-to-all, on every MoE layer (only where the MLP is replaced by experts).

Tensor parallelism — the default split

Tensor parallelism (Megatron-style) shards the linear layers across N GPUs. The two big input-side linears in a transformer block (the QKV projection and the MLP up-projection) are split column-wise: each rank holds 1/N of the output channels. The linears that follow them (the attention output projection and the MLP down-projection) are split row-wise, so each rank produces only a partial sum of the full output. After each block, an all-reduce sums those partial outputs across ranks.
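The column-wise then row-wise split can be checked on a toy MLP. The following is a minimal sketch in plain Python, not SGLang code: the helper names (`matmul`, `relu`, `tp_mlp_forward`) are hypothetical, ReLU stands in for whatever activation the real model uses, and the final summation stands in for the all-reduce.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Elementwise activation (an assumption; real models often use SwiGLU)."""
    return [[max(v, 0) for v in row] for row in m]


def tp_mlp_forward(x, w_up, w_down, n_ranks):
    """Simulate a column-parallel up-projection and a row-parallel down-projection."""
    hidden = len(w_down)           # rows of w_down == columns of w_up
    shard = hidden // n_ranks      # assumes hidden divides evenly by n_ranks
    out = None
    for r in range(n_ranks):
        lo, hi = r * shard, (r + 1) * shard
        w_up_r = [row[lo:hi] for row in w_up]   # column-wise shard: 1/N of output channels
        w_down_r = w_down[lo:hi]                # row-wise shard: matching input channels
        partial = matmul(relu(matmul(x, w_up_r)), w_down_r)  # partial sum held by rank r
        if out is None:
            out = partial
        else:
            # Summing the partials is what the all-reduce does across ranks.
            out = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(out, partial)]
    return out


x = [[1, -2, 3], [0, 4, -1]]
w_up = [[(i * 4 + j) % 5 - 2 for j in range(4)] for i in range(3)]
w_down = [[(i + 2 * j) % 3 - 1 for j in range(3)] for i in range(4)]

reference = matmul(relu(matmul(x, w_up)), w_down)   # unsharded result
sharded = tp_mlp_forward(x, w_up, w_down, n_ranks=2)
print(sharded == reference)
```

The activation can run locally on each rank because a column-wise shard owns complete output channels; only the row-wise linear needs the cross-rank sum.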

The good: TP increases the HBM capacity and memory bandwidth available for weights and activations linearly with the number of ranks. A 70B fp16 model needs ~140 GB of weights, so it fits on 2× H100 80 GB at TP=2 (with little left over for KV cache) and comfortably on 4× at TP=4.
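The sizing arithmetic is easy to reproduce. This is a rough sketch under simplifying assumptions: weights shard perfectly evenly, 1 GB is 1e9 bytes, and nothing is reserved for KV cache, activations, or replicated layers. The function name is hypothetical.

```python
def weight_gb_per_rank(params_billion, bytes_per_param, tp):
    """Weight footprint per GPU in GB, assuming an even split across tp ranks."""
    return params_billion * bytes_per_param / tp


GPU_GB = 80  # H100 80 GB, as in the text

for tp in (1, 2, 4, 8):
    gb = weight_gb_per_rank(70, 2, tp)   # 70B parameters at fp16 (2 bytes each)
    print(f"TP={tp}: {gb:.1f} GB of weights per GPU, {GPU_GB - gb:.1f} GB left over")
```

A negative leftover at TP=1 means the weights do not fit at all; the small leftover at TP=2 is what has to hold the entire KV cache.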

The bad: TP requires a fast intra-node interconnect. Ring all-reduce moves roughly 2 × batch × seq × hidden × bytes-per-element per layer per step, nearly independent of rank count (each rank sends 2(N−1)/N times the payload). For decode at batch 8, seq 1, hidden 8192, fp16, that is 2 × (8 × 1 × 8192 × 2 B) ≈ 256 KB per layer per step: tiny per layer, and only ≈ 20 MB per step across 80 layers. At those small message sizes the NVLink latency floor (~10 µs per all-reduce) is the binding cost, not bandwidth. Above TP=8 most servers lose NVLink and fall back to PCIe / IB; the per-all-reduce latency jumps by an order of magnitude, and throughput drops even though the larger model now fits.
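The numbers above can be reproduced with a few lines. This is a back-of-the-envelope sketch: it assumes one all-reduce per layer and uses the ~10 µs latency floor quoted in the text; the function name is hypothetical.

```python
def ring_allreduce_bytes_per_rank(batch, seq, hidden, bytes_per_elem, n_ranks):
    """Bytes each rank sends in a ring all-reduce of one activation tensor."""
    payload = batch * seq * hidden * bytes_per_elem
    return 2 * (n_ranks - 1) / n_ranks * payload   # approaches 2 * payload as N grows


LAYERS = 80
LATENCY_FLOOR_US = 10   # per all-reduce, figure taken from the text

for n in (2, 4, 8):
    per_layer = ring_allreduce_bytes_per_rank(8, 1, 8192, 2, n)
    print(f"TP={n}: {per_layer / 1024:.0f} KiB per layer, "
          f"{per_layer * LAYERS / 2**20:.1f} MiB per step")

# Latency, not bandwidth, is the floor: one all-reduce per layer at ~10 us each.
print(f"latency floor per step: {LAYERS * LATENCY_FLOOR_US} us")
```

The per-rank volume barely changes between TP=2 and TP=8, while the fixed per-call latency is paid on every layer regardless of how small the message is.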

Where TP saturates

| Model | Sweet TP | Why |
|---|---|---|
| Llama-3-8B | 1 | Fits on one H100; TP only hurts. |
| Llama-3-70B fp16 | 4 or 8 | Two H100s is tight; 4 is comfortable. |
| Llama-3-405B fp8 | 8 | Fills one DGX node. |
| DeepSeek-V3 (671B, MoE) | 8 (with EP) | Weights × experts force EP, not pure TP. |

DP attention — the right answer for MLA

DeepSeek's Multi-head Latent Attention (MLA) compresses the KV cache to a single low-rank latent vector per token instead of per-head K and V. This gives large memory savings (~10× smaller KV for the same model), but it also breaks the standard TP-on-attention layout: TP shards attention along the head dimension, and a single latent vector per token has no head dimension to shard.

The real problem with TP on MLA: head-parallel TP would force every rank to keep a full copy of the (tiny) latent KV cache — duplicating the very thing MLA exists to compress. SGLang's response: DP attention. The attention path is data-parallel across DP ranks; each rank owns its own slice of the (small) latent KV pool, holding KV only for the sequences assigned to it. The MLP and other linears stay tensor-parallel as before. Concretely:

[Figure: DP attention data flow. The input batch (sequences 0..N) is split across DP ranks 0–3; each rank runs attention on its own sequences against its own KV cache. The outputs then pass through all-gather → TP MLP (shared layout) → all-reduce → next attention.]
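The ownership rule (each DP rank keeps KV only for its own sequences, then results are gathered for the tensor-parallel MLP) can be sketched in plain Python. Everything here is illustrative: the round-robin assignment, the function names, and the placeholder "attention output" (just the context length) are assumptions, not SGLang's scheduler or kernels.

```python
def assign_sequences(seq_ids, dp_size):
    """Round-robin sequences onto DP ranks (a hypothetical policy for illustration)."""
    shards = [[] for _ in range(dp_size)]
    for i, sid in enumerate(seq_ids):
        shards[i % dp_size].append(sid)
    return shards


def dp_attention_step(shards, kv_pools, new_latents):
    """One decode step: each rank appends latent KV for its own sequences only."""
    per_rank_out = []
    for rank, seqs in enumerate(shards):
        for sid in seqs:
            kv_pools[rank].setdefault(sid, []).append(new_latents[sid])
        # Placeholder for the attention output: (sequence id, context length).
        per_rank_out.append([(sid, len(kv_pools[rank][sid])) for sid in seqs])
    # Stand-in for the all-gather: the TP MLP sees every sequence's token.
    return [item for out in per_rank_out for item in out]


seq_ids = list(range(8))
shards = assign_sequences(seq_ids, dp_size=4)
kv_pools = [{} for _ in range(4)]          # one private KV pool per DP rank
gathered = dp_attention_step(shards, kv_pools, {s: f"latent-{s}" for s in seq_ids})

print(shards)
print([sorted(pool) for pool in kv_pools])  # no sequence appears in two pools
print(len(gathered))
```

Because no sequence id lands in more than one pool, the total KV stored across ranks equals the KV of the batch, rather than the batch's KV multiplied by the rank count.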

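What this buys you: each rank stores latent KV only for its own sequences, so the compressed cache is never duplicated across ranks, and the attention computation itself needs no inter-rank communication.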

The cost: an all-gather at the boundary between DP attention and the TP MLP, and KV that is not shared between ranks (so the RadixAttention prefix cache is per-rank). The all-gather is small (one hidden-size vector per token), and most workloads have enough request locality that per-rank caches still work well.

When DP attention is the right call
Use DP attention when the model uses MLA or when KV cache pressure dominates the workload. For standard GQA models with sufficient TP headroom (Llama-3-70B at TP=4), pure TP is simpler and roughly equivalent.

Expert parallelism for MoE

A Mixture-of-Experts model replaces each MLP with a router that picks top-k of E experts. DeepSeek-V3 has 256 routed experts; at k=8, every token activates 8 of them. The experts' weights are gigantic in aggregate (~600 GB for V3 at fp8) but only ~16 GB are active per token.

Naive serving: replicate the experts across all ranks. Cost: 600 GB × N replicas. Unaffordable.

Expert parallelism: distribute the experts across ranks. Each rank holds E/EP experts. At each MoE layer, tokens routed to expert i are sent (all-to-all) to the rank that owns i, computed there, and returned.

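[Figure: expert-parallel dispatch across four ranks. Each rank's tokens are routed to experts (for example, rank 0's tokens route to experts 0,1,2,3,4,5,6,7) and sent through an all-to-all dispatch keyed by routed expert id. Rank 0 holds experts 0..63, rank 1 holds 64..127, rank 2 holds 128..191, and rank 3 holds 192..255.]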
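The dispatch step amounts to mapping each routed expert id to its owning rank and grouping tokens by destination. A minimal sketch, assuming experts are assigned to ranks in contiguous blocks as in the figure; `owner_rank` and `dispatch` are hypothetical names, and the real transfer is done by an all-to-all collective rather than a Python dictionary.

```python
from collections import defaultdict


def owner_rank(expert_id, num_experts, ep_size):
    """Rank that holds an expert, with contiguous blocks of num_experts // ep_size."""
    return expert_id // (num_experts // ep_size)


def dispatch(token_routes, num_experts, ep_size):
    """Group (token, expert) pairs by destination rank.

    token_routes maps token id -> list of routed expert ids (the router's top-k).
    """
    per_rank = defaultdict(list)
    for tok, experts in token_routes.items():
        for e in experts:
            per_rank[owner_rank(e, num_experts, ep_size)].append((tok, e))
    return dict(per_rank)


# Two tokens, 256 experts spread over 4 ranks (64 experts per rank).
routes = {0: [0, 70, 200], 1: [63, 64]}
for rank, work in sorted(dispatch(routes, 256, 4).items()):
    print(f"rank {rank} receives {work}")
```

A token whose top-k experts live on several ranks is sent to each of them, which is why the all-to-all volume scales with top_k.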

The all-to-all is the dominant communication primitive for MoE serving. SGLang integrates DeepEP — DeepSeek's optimized all-to-all library — for the dispatch and combine. Without an optimized all-to-all, EP overhead can exceed 30% of the layer time; DeepEP brings it down to single-digit percentages on H800/H100 with NVLink.

Why EP and TP have to coexist for V3-class models
TP=8 alone can't hold 671B at fp8 on 80 GB GPUs (~671 GB of weights vs. 8 × 80 GB = 640 GB of HBM, before any KV cache). EP alone can't shard the dense attention path. The actual layout is TP=8 for attention + dense linears, EP=8 across the same ranks for experts. Each rank owns 1/8 of every dense weight AND 1/8 of the routed experts. The two parallelisms compose because they target different layers.

Pipeline parallelism — present but rarely needed at inference

Pipeline parallelism splits the model by layer: ranks 0–3 hold layers 0–20, ranks 4–7 hold layers 21–40, etc. Activations flow rank-to-rank. It is the standard way to fit huge dense models at training time.
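Splitting by layer is just a contiguous partition of the layer indices into stages. A small sketch, assuming stages should be as even as possible; `partition_layers` is a hypothetical helper, and it models only the stage boundaries, not the group of ranks that may share a stage.

```python
def partition_layers(num_layers, num_stages):
    """Split layer indices into contiguous, near-equal pipeline stages."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        size = base + (1 if s < extra else 0)   # earlier stages absorb the remainder
        stages.append(range(start, start + size))
        start += size
    return stages


for i, layers in enumerate(partition_layers(80, 4)):
    print(f"stage {i}: layers {layers.start}..{layers.stop - 1}")
```

Each stage boundary is a point where activations must be handed from one group of ranks to the next.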

At inference time PP is rarely worth it. Reasons:

SGLang supports PP but defaults to TP/DP/EP for single-node deployments.

Reading a real deployment

An H200 node serving DeepSeek-R1 today looks like:

| Knob | Value | Why |
|---|---|---|
| --tp | 8 | Full node; saturates NVLink; dense weights fit. |
| --dp | 8 | DP attention reuses the same 8 ranks (not 8 × 8 = 64). Each rank acts as TP for the MLP and DP for attention. |
| --enable-dp-attention | true | Activates the DP attention path for MLA. |
| --ep-size | 8 | 256 experts across the same 8 ranks. |
| --mem-fraction-static | 0.9 | Pool size (see lesson 03). |
| --enable-torch-compile | true | Slight extra speed on supported ops. |

Reading this config, you can predict the bottlenecks: KV is bandwidth-bound (mitigated by MLA + DP attention); the MoE layer's all-to-all is bandwidth-bound (mitigated by DeepEP); the dense linears need NVLink (TP=8). Any one of these going wrong is visible in the profiler trace.
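For reference, the knobs in the table can be assembled into a launch command. This sketch only builds and prints the command line; it does not start a server. The flag names and values come from the table, while the `python -m sglang.launch_server` entry point, the `--model-path` flag, and the model identifier are assumptions not stated in this lesson, so check them against the SGLang version you have installed. Boolean knobs are treated as bare switches, which is also an assumption.

```python
import shlex

# Values copied from the table; "--model-path" and its value are assumptions.
config = {
    "--model-path": "deepseek-ai/DeepSeek-R1",
    "--tp": 8,
    "--dp": 8,
    "--enable-dp-attention": True,
    "--ep-size": 8,
    "--mem-fraction-static": 0.9,
    "--enable-torch-compile": True,
}


def build_argv(config):
    """Turn a flag -> value mapping into an argv list."""
    argv = ["python", "-m", "sglang.launch_server"]   # assumed entry point
    for flag, value in config.items():
        if value is True:
            argv.append(flag)                # boolean switch: flag only, no value
        elif value is not False:
            argv += [flag, str(value)]       # valued flag
    return argv


print(shlex.join(build_argv(config)))
```

Keeping the configuration as data makes it easy to diff two deployments and see which parallelism knob changed.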

What lesson 10 builds

With the model fitting and the kernels fast, the remaining lever on decode throughput is how many useful tokens come out of each forward pass. Lesson 10 turns one big-model forward into multiple emitted tokens via speculative decoding — draft K tokens cheaply, verify them all in one target-model forward, accept what the target would have sampled anyway.