Orientation — the map before the kernels
Triton has one big idea: you write tile-level code, the compiler turns it into threads. Once that lands, the rest of the language is small. This page is the map: the one trade-off that defines Triton, the five primitives you'll lean on, and how the 14 lessons compose.
Two unrelated projects share the name Triton. One is the GPU kernel language covered here (@triton.jit, tl.load, tl.dot). The other, NVIDIA Triton Inference Server, is a model-serving runtime with backends for TensorRT, PyTorch, vLLM, and others. They share no code, only the name.
What Triton actually is
Triton is a Python-embedded DSL that compiles to GPU code (PTX on NVIDIA, AMDGCN on AMD via ROCm). It exists to fill a specific gap:
The one trade-off:
You give up some last-mile control (mma scheduling, warp specialisation, async copy choreography). You gain ~10× less code, per-shape autotuning, and one Python source file instead of a CUDA build system. For 95% of fused-kernel tasks that is the right trade.
The five primitives that recur in every lesson
By lesson 06 you'll have met the four tl.* primitives, and lesson 12 uses all of them in one kernel. @triton.autotune gets its own treatment in lesson 13.
| Primitive | What it is | First lesson |
|---|---|---|
| tl.program_id(axis) | "Which tile am I?" The grid coord this program owns. | 03 |
| tl.load(ptr, mask, other) | Coalesced read with predicate. Mask handles the boundary tile. | 04 |
| tl.dot(a, b) | Tile matmul. Hits tensor cores when shapes allow. | 05 |
| tl.sum / tl.max | Reduction across a tile axis. Lowers to warp shuffles + SMEM. | 06 |
| @triton.autotune | Compile several configs; pick the fastest per shape key. | 13 |
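To make the table concrete without a GPU, here is a minimal pure-Python emulation of a row-wise softmax. It is not Triton code: softmax_row_program, softmax, BLOCK, and the list-based load are illustrative stand-ins, and the comments only mark where tl.program_id, tl.load, tl.max, and tl.sum would sit in a real kernel. What it tries to show is why the other fill value matters once a reduction follows a masked load.

```python
import math

# Pure-Python emulation of the tile model. This is not Triton code, and
# every name below is an illustrative assumption.
BLOCK = 8  # tile width; this sketch assumes each row fits in one tile


def softmax_row_program(pid, x, out, n_cols):
    """One program owns row `pid` (the role tl.program_id(0) plays)."""
    cols = range(BLOCK)
    mask = [c < n_cols for c in cols]  # False for lanes past the row end
    # Masked load with other=-inf: padded lanes can never win the max,
    # and exp(-inf) == 0.0 so they add nothing to the sum.
    row = [x[pid][c] if m else -math.inf for c, m in zip(cols, mask)]
    row_max = max(row)                 # stands in for tl.max
    exps = [math.exp(v - row_max) for v in row]
    denom = sum(exps)                  # stands in for tl.sum
    # Masked store: padded lanes are skipped.
    for c, m, e in zip(cols, mask, exps):
        if m:
            out[pid][c] = e / denom


def softmax(x):
    n_rows, n_cols = len(x), len(x[0])
    assert n_cols <= BLOCK, "this sketch handles one tile per row only"
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for pid in range(n_rows):          # on a GPU these programs run in parallel
        softmax_row_program(pid, x, out, n_cols)
    return out


probs = softmax([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]])
```

Filling padded lanes with -inf is specific to this reduction. A plain sum over the raw values would want 0.0 instead, which is why the fill value is a parameter and not a fixed default in this sketch.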
The one mental model to take with you
If you remember nothing else: "program owns a tile, load → compute → store, repeat across the grid". The 14 lessons are variations on step 3.
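The mental model can be sketched as a pure-Python emulation of vector add. This is an illustration of the idea, not Triton code: BLOCK, cdiv, vector_add_program, and vector_add are hypothetical names, and the sequential loop over pid stands in for programs that a GPU would run in parallel.

```python
# Pure-Python emulation of "program owns a tile, load, compute, store".
# Not Triton code; all names are illustrative assumptions.

BLOCK = 4  # tile size: a power of two, fixed before the "kernel" runs


def cdiv(n, d):
    """Ceiling division: how many tiles of size d cover n elements."""
    return (n + d - 1) // d


def vector_add_program(pid, x, y, out, n):
    """What a single program does with the one tile it owns."""
    offsets = [pid * BLOCK + i for i in range(BLOCK)]  # indices in this tile
    mask = [off < n for off in offsets]                # False past the end
    # Load: out-of-bounds lanes read a filler value instead of memory.
    xs = [x[off] if m else 0.0 for off, m in zip(offsets, mask)]
    ys = [y[off] if m else 0.0 for off, m in zip(offsets, mask)]
    # Compute: one operation over the whole tile.
    zs = [a + b for a, b in zip(xs, ys)]
    # Store: out-of-bounds lanes are skipped.
    for off, m, z in zip(offsets, mask, zs):
        if m:
            out[off] = z


def vector_add(x, y):
    n = len(x)
    out = [0.0] * n
    grid = cdiv(n, BLOCK)         # one program per tile
    for pid in range(grid):       # "repeat across the grid"
        vector_add_program(pid, x, y, out, n)
    return out


result = vector_add([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [10.0] * 6)
```

With six elements and BLOCK = 4, the grid has two programs, and the second one owns a boundary tile where only two of four lanes are valid. That is the case the mask exists for.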
What each part of the curriculum teaches
| Part | Lessons | Question it answers |
|---|---|---|
| I · The model | 01–03 | Why does this language exist? What is a "program"? What does @triton.jit do under the hood? |
| II · The DSL | 04–06 | What primitives can I call? Which ones hit tensor cores? How does a tile-level reduction become warp shuffles? |
| III · Building real kernels | 07–11 | Vector add → fused activation → tiled matmul → softmax → RMSNorm. Each one a one-step elaboration on the previous. |
| IV · The flagship | 12 | Flash Attention as the synthesis of every primitive. Why blocking the attention matrix saves bandwidth, not flops. |
| V · Performance & production | 13–14 | How does the autotuner pick configs? What is num_stages actually overlapping? How do I add a backward pass? When is Triton the wrong tool? |
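As a preview of the tiled matmul step in Part III, the sketch below emulates it in pure Python under the same assumptions as the rest of this page: a 2-D grid of programs, each owning one output tile and looping over K. It is not Triton code. load_tile, dot, matmul_program, matmul, and the tiny block sizes are illustrative names and values chosen so the boundary tiles are easy to trace by hand.

```python
# Pure-Python emulation of a tiled matmul. Not Triton code; every name
# and block size here is an illustrative assumption.
BLOCK_M, BLOCK_N, BLOCK_K = 2, 2, 2


def cdiv(n, d):
    return (n + d - 1) // d


def load_tile(mat, row0, col0, rows, cols):
    """Masked 2-D load: positions outside the matrix read as 0.0."""
    n_rows, n_cols = len(mat), len(mat[0])
    return [[mat[r][c] if r < n_rows and c < n_cols else 0.0
             for c in range(col0, col0 + cols)]
            for r in range(row0, row0 + rows)]


def dot(a, b):
    """Small dense tile product, standing in for tl.dot."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matmul_program(pid_m, pid_n, a, b, c):
    """One program accumulates one BLOCK_M x BLOCK_N tile of c = a @ b."""
    m, k_total, n = len(a), len(b), len(b[0])
    acc = [[0.0] * BLOCK_N for _ in range(BLOCK_M)]
    for k0 in range(0, k_total, BLOCK_K):          # walk along K tile by tile
        a_tile = load_tile(a, pid_m * BLOCK_M, k0, BLOCK_M, BLOCK_K)
        b_tile = load_tile(b, k0, pid_n * BLOCK_N, BLOCK_K, BLOCK_N)
        part = dot(a_tile, b_tile)
        acc = [[acc[i][j] + part[i][j] for j in range(BLOCK_N)]
               for i in range(BLOCK_M)]
    for i in range(BLOCK_M):                       # masked store
        for j in range(BLOCK_N):
            r, col = pid_m * BLOCK_M + i, pid_n * BLOCK_N + j
            if r < m and col < n:
                c[r][col] = acc[i][j]


def matmul(a, b):
    m, n = len(a), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for pid_m in range(cdiv(m, BLOCK_M)):          # 2-D grid of programs
        for pid_n in range(cdiv(n, BLOCK_N)):
            matmul_program(pid_m, pid_n, a, b, c)
    return c


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
c = matmul(a, b)
```

Zero is a safe fill value here because padded lanes contribute nothing to a sum of products. The accumulator stays local to the program for the whole K loop and is written out once at the end.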
Two ways to read this
- Linear (~5 hours). Lesson 01 onward. Lesson 12 (Flash Attention) calls every primitive from 04–10 explicitly — skip the order and the synthesis won't land. The widgets are calibrated so the surprise of lesson n is visible only after lesson n−1.
- Targeted. Already wrote CUDA? Skim 01–03, dive at 04. Already wrote Triton vector add? Start at 08. Care only about attention? Read 02, 04, 05, 06, 10, then 12.
The minimal glossary (for the first three lessons)
- Program: one instance of your @jit kernel. Identified by tl.program_id(axis). Equivalent to a CUDA thread block, but you write tile-level code instead of per-thread code.
- Tile: a fixed-size chunk of indices (e.g. BLOCK_M=128, BLOCK_N=64) that one program handles end-to-end. Tile sizes are known at compile time (tl.constexpr) and in practice are powers of two: required for tl.dot and reductions, strongly recommended elsewhere.
- Mask: a boolean tile passed to tl.load / tl.store. Lanes where the mask is false read other= (or skip the store). The boundary tile is what makes this non-optional.
- Autotune: Triton compiles several configs (different tile sizes, num_warps, num_stages) and benchmarks them on the first call for each shape key, then caches the winner.
- num_warps: warps per program. More warps = more parallelism within one tile, fewer registers per thread. The autotuner sweeps it.
- num_stages: depth of the software pipeline Triton emits for overlapping HBM loads with compute. Lesson 13 demystifies it.
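The Autotune entry describes a benchmark-once-then-cache pattern, which can be sketched in plain Python. This is a toy stand-in and not how triton.autotune is implemented: AutotunedKernel, blocked_sum, and the config dictionaries are hypothetical, timing uses a single wall-clock run, and the shape key is a Python function here, whereas Triton takes a list of kernel argument names.

```python
import time

# Toy emulation of "benchmark each config on the first call per shape
# key, then cache the winner". Hypothetical names; not Triton's code.


class AutotunedKernel:
    def __init__(self, fn, configs, key):
        self.fn = fn
        self.configs = configs   # candidate keyword-argument dicts
        self.key = key           # maps call arguments to a shape key
        self.cache = {}          # shape key -> winning config

    def __call__(self, *args):
        shape_key = self.key(*args)
        if shape_key not in self.cache:
            timings = []
            for cfg in self.configs:
                start = time.perf_counter()
                self.fn(*args, **cfg)
                timings.append((time.perf_counter() - start, cfg))
            # Compare on elapsed time only, so ties never compare dicts.
            self.cache[shape_key] = min(timings, key=lambda t: t[0])[1]
        return self.fn(*args, **self.cache[shape_key])


def blocked_sum(xs, BLOCK):
    """A stand-in "kernel": sum a list one tile at a time."""
    total = 0.0
    for start in range(0, len(xs), BLOCK):
        total += sum(xs[start:start + BLOCK])
    return total


tuned_sum = AutotunedKernel(
    blocked_sum,
    configs=[{"BLOCK": 64}, {"BLOCK": 256}, {"BLOCK": 1024}],
    key=lambda xs: len(xs),
)

first = tuned_sum([1.0] * 1000)    # benchmarks all three configs, caches one
second = tuned_sum([2.0] * 1000)   # same shape key: reuses the cached config
```

The first call for a new shape key pays for every config, which is worth keeping in mind when timing a kernel: a warm-up call keeps the tuning cost out of the measurement.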
Everything else is defined as you meet it. Onward.