Why Triton?
PyTorch can't fuse the operation you want. CUDA takes a week to write and breaks when the shape changes. Triton is the lane in between. This lesson makes the case: why the lane exists, what it gives up, and the one-line decision rule for when it's the right tool.
The problem in one example
You profiled a transformer block and one chain of ops is eating 18% of the step:
z = x @ w # GEMM, calls cuBLAS
z = z + b # elementwise, separate kernel
z = gelu(z) # elementwise, separate kernel
y = z * mask # elementwise, separate kernel
Four kernels, four round-trips to HBM. The GEMM is fine: cuBLAS is hand-tuned. But the three trailing elementwise ops each read all of z from HBM, do a trivial amount of math per byte, and write all of z back. That's pure memory bandwidth waste.
If you fuse the four ops into one kernel that streams z through registers, you save 5/7 of the HBM traffic and 3/4 of the launch overhead. On an H100 with ~3 TB/s HBM, that's a 2–4× speedup on this region.
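The exact fraction depends on what is fused and what is counted. The tally below is a minimal sketch that counts only z-sized HBM passes; the GEMM inputs x and w, the bias b, and the mask read are ignored, which is a simplification of this sketch rather than something the text specifies. Under that assumption, fusing only the three elementwise ops saves 4/7 of the passes and folding them into the GEMM saves 6/7, which brackets the 5/7 quoted above. The 8192 x 8192 bf16 shape is hypothetical.

```python
from fractions import Fraction

# (kernel name, z-sized reads, z-sized writes).
# Assumption of this sketch: x, w, b and the mask read are not counted.
UNFUSED = [("gemm", 0, 1), ("bias", 1, 1), ("gelu", 1, 1), ("mask", 1, 1)]
# Assumption: cuBLAS keeps the GEMM; one fused kernel does bias + gelu + mask.
FUSED_ELEMENTWISE = [("gemm", 0, 1), ("bias_gelu_mask", 1, 1)]
# Assumption: the elementwise ops run as a GEMM epilogue, so only y is written.
FUSED_ALL = [("gemm_bias_gelu_mask", 0, 1)]


def hbm_passes(kernels):
    """Total number of z-sized reads and writes across all kernels."""
    return sum(reads + writes for _, reads, writes in kernels)


def fraction_saved(before, after):
    """Fraction of z-sized HBM passes removed by going from before to after."""
    return Fraction(hbm_passes(before) - hbm_passes(after), hbm_passes(before))


def launches_saved(before, after):
    """Fraction of kernel launches removed."""
    return Fraction(len(before) - len(after), len(before))


def pass_time_us(numel, bytes_per_element=2, hbm_bytes_per_s=3e12):
    """Microseconds to stream one tensor through HBM at the given bandwidth."""
    return numel * bytes_per_element / hbm_bytes_per_s * 1e6


print(hbm_passes(UNFUSED))                          # 7
print(fraction_saved(UNFUSED, FUSED_ELEMENTWISE))   # 4/7
print(fraction_saved(UNFUSED, FUSED_ALL))           # 6/7
print(launches_saved(UNFUSED, FUSED_ALL))           # 3/4
print(round(pass_time_us(8192 * 8192), 1))          # 44.7
```

At this hypothetical shape each pass costs about 45 µs at ~3 TB/s, so six avoided passes are roughly 270 µs of pure HBM time.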
Why this is hard without Triton
You have three options and they all hurt in different ways:
| Option | Pros | Cons |
|---|---|---|
| torch.compile | Zero kernel code. Often fuses elementwise chains automatically. | Opaque when it doesn't fuse. Black-box debugging. Some patterns it just refuses. |
| Hand-written CUDA | Maximum control. Can approach peak throughput when you know what you're doing. | ~1000 lines of C++ for a fused matmul. Shape-fragile. Build system. Two weeks. |
| Library call (CUTLASS, FlashInfer, …) | If a library covers it, this is best. | Libraries don't cover novel fusions. The custom op you needed last week isn't there. |
Triton is option 4: ~100 lines of Python that compile to a fused kernel, autotune across shapes, and need no build system.
The defining trade-off
Triton hides the warp. You never write threadIdx, never touch __syncthreads, never schedule async copies by hand. The compiler does that for you, and gets it right ~80–95% of the time on modern hardware.
What you give up
- Last-mile mma scheduling. CUTLASS hand-orders mma.sync instructions to keep the tensor cores fed every cycle. Triton's scheduler is good but not surgical. Result: cuBLAS / CUTLASS beat Triton GEMM by 5–15% on vanilla bf16 shapes.
- Warp specialisation. Hopper/Blackwell kernels split warps into producer (TMA loads) and consumer (mma compute) roles. Triton has some support, but the canonical FlashAttention v3 / FlashInfer kernels are still hand-written.
- Communication primitives. Triton has no NCCL, no NVLink intrinsics, no inter-SM atomic shortcuts. Cross-GPU work belongs in higher layers.
What you gain
- ~10× less code. A tiled bf16 matmul is ~80 lines in Triton vs ~800 in raw CUDA (the gap narrows if you build on CUTLASS templates, but then you inherit their abstractions).
- Autotune for free. One decorator and Triton sweeps tile sizes, num_warps, and num_stages, and caches the winner per shape key.
- Python everywhere. Your test harness, the kernel, and the wrapper are one file. No CMake, no .cu compilation, no PTXAS flags.
- Portability. The same Triton source runs on NVIDIA (PTX), AMD (ROCm), and emerging backends. CUDA does not.
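To make "sweeps configs and caches the winner per shape key" concrete, here is a toy sketch in plain Python. It is not Triton's implementation: the real decorator is triton.autotune, which takes a list of triton.Config entries and a key list. The names toy_autotune, wall_clock, fake_cost, launch, and the search space are all hypothetical, and unlike in Triton, num_warps is passed here as an ordinary argument. The made-up cost model exists only so the output is reproducible.

```python
import time


def wall_clock(fn, kwargs):
    """Default benchmark: run fn once and return the elapsed seconds."""
    start = time.perf_counter()
    fn(**kwargs)
    return time.perf_counter() - start


def toy_autotune(configs, key, bench=wall_clock):
    """Toy autotuner: benchmark every config once per shape key, cache the best."""
    def decorator(fn):
        best = {}  # shape key -> winning config

        def wrapper(**kwargs):
            shape_key = tuple(kwargs[name] for name in key)
            if shape_key not in best:
                # First call for this shape: try every config, keep the cheapest.
                timed = [(bench(fn, {**kwargs, **cfg}), i)
                         for i, cfg in enumerate(configs)]
                best[shape_key] = configs[min(timed)[1]]
            return fn(**kwargs, **best[shape_key])

        wrapper.best = best
        return wrapper
    return decorator


# Hypothetical search space: tile size x warps per program.
CONFIGS = [{"BLOCK": b, "num_warps": w} for b in (64, 128, 256) for w in (4, 8)]


def fake_cost(fn, kwargs):
    """Made-up deterministic cost model standing in for a real GPU benchmark."""
    return abs(kwargs["n"] / 32 - kwargs["BLOCK"]) + kwargs["num_warps"]


@toy_autotune(CONFIGS, key=["n"], bench=fake_cost)
def launch(n, BLOCK, num_warps):
    """Stand-in for a kernel launch; just reports the config it was given."""
    return BLOCK, num_warps


print(launch(n=4096))    # (128, 4)
print(launch(n=2048))    # (64, 4)
print(len(launch.best))  # 2
```

The second call with the same n skips the sweep entirely, which is why the tuning cost is paid once per shape key rather than once per launch.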
The one-line decision rule
Reach for Triton when all three of these are true:
- No library covers the operation you need.
- The operation is bandwidth-bound (fusion would save HBM bytes) or custom enough that no fused library kernel exists.
- You want autotuning across multiple shapes and don't have time to hand-tune CUDA per shape.
If those don't all hold, the answer is usually:
| Situation | Use |
|---|---|
| Vanilla bf16/fp16 matmul | torch.matmul → cuBLAS |
| Attention | FlashAttention v3 or FlashInfer |
| Elementwise chain in a model | Try torch.compile first; Triton if it doesn't fuse |
| Quantised GEMM (W4/W8) | Marlin / Machete via vLLM |
| The op is <1% of the step | Don't bother |
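The rule and the fallback table can be sketched as a small function. This is one possible encoding, not a definitive one: the function name and argument names are hypothetical, the 1% cutoff comes from the last table row, and the final fallback (try torch.compile first when the three conditions do not all hold) is an assumption of this sketch.

```python
def recommend(step_fraction, library_covers_op,
              fusion_saves_bytes_or_novel, many_shapes):
    """Return 'skip', 'library', 'triton', or 'torch.compile first'.

    step_fraction: share of the training/inference step spent in the region (0..1).
    library_covers_op: a library kernel (cuBLAS, FlashAttention, ...) already exists.
    fusion_saves_bytes_or_novel: bandwidth-bound, or custom enough that no
        fused library kernel exists.
    many_shapes: you want autotuning across several shapes.
    """
    if step_fraction < 0.01:
        return "skip"                 # table: the op is <1% of the step
    if library_covers_op:
        return "library"              # condition 1 fails: a library covers it
    if fusion_saves_bytes_or_novel and many_shapes:
        return "triton"               # all three conditions hold
    return "torch.compile first"      # assumption of this sketch


# The fused bias + gelu + mask chain from the opening example.
print(recommend(0.18, False, True, True))   # triton
print(recommend(0.18, True, True, True))    # library
print(recommend(0.005, False, True, True))  # skip
```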
Interactive · is Triton the right tool?
Set how much of the step the candidate region takes, how unique the op is, and how irregular the shapes are. The widget recommends Triton, a library, or "skip".
What's next
You're sold on the lane. The question now is: what does "tile-level code" actually feel like? Lesson 02 introduces the central abstraction — programs and tiles — with a live widget that shows what each warp lane is doing when you write tl.arange(0, BLOCK).