Orientation — the map before the kernels

Triton has one big idea: you write tile-level code, the compiler turns it into threads. Once that lands, the rest of the language is small. This page is the map: the one trade-off that defines Triton, the five primitives you'll lean on, and how the 14 lessons compose.

Wait — which Triton?

Two unrelated projects share the name. This series is OpenAI Triton, the Python-embedded DSL for writing GPU kernels (@triton.jit, tl.load, tl.dot). The other — NVIDIA Triton Inference Server — is a model-serving runtime with backends for TensorRT, PyTorch, vLLM, and others. They share zero code; only the name.

What Triton actually is

Triton is a Python-embedded DSL that compiles to GPU code (PTX on NVIDIA, AMDGCN on AMD through the ROCm toolchain). It exists to fill a specific gap:

The one trade-off:

The defining sentence

CUDA gives you 32 threads per warp and you tell each thread what to do. Triton gives you a tile (e.g. 128×64 elements) and you write what happens to the whole tile; the compiler decides how it maps onto warps and which loads overlap with which math.

You give up some last-mile control (MMA scheduling, warp specialisation, async copy choreography). In return you write roughly 10× less code, get automatic shape tuning, and keep one Python source file instead of a CUDA build system. For 95% of fused-kernel tasks that is the right trade.
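
To make "the compiler decides how it maps onto warps" concrete, here is a back-of-the-envelope sketch in plain Python (not Triton). It assumes the tile's elements are spread evenly over the program's threads, which is a simplification: the real layout is the compiler's choice. The helper name elements_per_thread is hypothetical.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def elements_per_thread(block_m, block_n, num_warps):
    """Average number of tile elements each thread holds (even-split assumption)."""
    threads = num_warps * WARP_SIZE
    return (block_m * block_n) / threads


# The 128x64 tile from the text, under three warp counts.
for num_warps in (2, 4, 8):
    print(num_warps, elements_per_thread(128, 64, num_warps))
# prints:
# 2 128.0
# 4 64.0
# 8 32.0
```

You write the same tile-level code in all three cases; only the launch parameter changes, and with it the amount of per-thread work the compiler has to schedule.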

The five primitives that recur in every lesson

By lesson 06 you'll have met the first four of these. By lesson 12 you'll have used all four in one kernel, and lesson 13 adds the autotuner.

Primitive	What it is	First lesson
`tl.program_id(axis)`	"Which tile am I?" The grid coord this program owns.	03
`tl.load(ptr, mask, other)`	Coalesced read with predicate. Mask handles the boundary tile.	04
`tl.dot(a, b)`	Tile matmul. Hits tensor cores when shapes allow.	05
`tl.sum / tl.max`	Reduction across a tile axis. Lowers to warp shuffles + SMEM.	06
`@triton.autotune`	Compile several configs; pick the fastest per shape key.	13
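
As a rough picture of how the first three primitives fit together, here is a CPU emulation of a tiled matmul in plain Python. It is not Triton code: load_tile stands in for a masked tl.load, dot for tl.dot, and the (pid_m, pid_n) pair for two tl.program_id axes. All function names and the tiny tile sizes are assumptions chosen for illustration.

```python
def cdiv(a, b):
    """Ceiling division: how many tiles of size b cover a elements."""
    return (a + b - 1) // b


def load_tile(mat, row0, col0, rows, cols, n_rows, n_cols, other=0.0):
    """Masked 2D load: out-of-bounds lanes read `other` instead of memory."""
    return [[mat[r][c] if (r < n_rows and c < n_cols) else other
             for c in range(col0, col0 + cols)]
            for r in range(row0, row0 + rows)]


def dot(a, b):
    """Tile matmul: (BM x BK) times (BK x BN) gives (BM x BN)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matmul_program(pid_m, pid_n, A, B, C, M, N, K, BM, BN, BK):
    """One program: owns output tile (pid_m, pid_n) and walks the K axis."""
    acc = [[0.0] * BN for _ in range(BM)]
    for k0 in range(0, K, BK):
        a = load_tile(A, pid_m * BM, k0, BM, BK, M, K)
        b = load_tile(B, k0, pid_n * BN, BK, BN, K, N)
        p = dot(a, b)
        acc = [[acc[i][j] + p[i][j] for j in range(BN)] for i in range(BM)]
    # Masked store: lanes outside the matrix are skipped.
    for i in range(BM):
        for j in range(BN):
            r, c = pid_m * BM + i, pid_n * BN + j
            if r < M and c < N:
                C[r][c] = acc[i][j]


def matmul(A, B, BM=2, BN=2, BK=2):
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    # A GPU would run these programs in parallel; here the grid is a loop.
    for pid_m in range(cdiv(M, BM)):
        for pid_n in range(cdiv(N, BN)):
            matmul_program(pid_m, pid_n, A, B, C, M, N, K, BM, BN, BK)
    return C


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(matmul(A, B))  # [[4.0, 5.0], [10.0, 11.0], [16.0, 17.0]]
```

The 3×3 input does not divide evenly into 2×2 tiles, so the boundary programs rely on the mask. Padding with 0.0 is safe here because zeros add nothing to a dot product.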

The one mental model to take with you

If you remember nothing else: "program owns a tile, load → compute → store, repeat across the grid". The 14 lessons are variations on step 3.
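
The mental model above can be sketched as a plain-Python emulation of a vector add. This is not Triton code and runs on the CPU; the names add_program and vector_add are hypothetical, and the grid is a simple loop where a GPU would run the programs in parallel.

```python
def cdiv(a, b):
    """Ceiling division: how many tiles of size b cover a elements."""
    return (a + b - 1) // b


def add_program(pid, x, y, out, n, BLOCK):
    """One program: owns tile number `pid` and handles it end to end."""
    offsets = [pid * BLOCK + i for i in range(BLOCK)]  # which indices are mine
    mask = [o < n for o in offsets]                    # False past the end of the data
    # load: masked-off lanes get a placeholder instead of reading out of bounds
    xs = [x[o] if m else 0.0 for o, m in zip(offsets, mask)]
    ys = [y[o] if m else 0.0 for o, m in zip(offsets, mask)]
    # compute: one expression over the whole tile
    zs = [a + b for a, b in zip(xs, ys)]
    # store: masked-off lanes are skipped
    for o, m, z in zip(offsets, mask, zs):
        if m:
            out[o] = z


def vector_add(x, y, BLOCK=4):
    n = len(x)
    out = [0.0] * n
    for pid in range(cdiv(n, BLOCK)):  # repeat across the grid
        add_program(pid, x, y, out, n, BLOCK)
    return out


result = vector_add([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [10.0] * 6)
print(result)  # [11.0, 12.0, 13.0, 14.0, 15.0, 16.0]
```

With 6 elements and BLOCK=4 the grid has two programs, and the second one owns offsets 4 to 7, of which only 4 and 5 are in bounds. That second program is the boundary tile the mask exists for.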

What each part of the curriculum teaches

Part	Lessons	Question it answers
I · The model	01–03	Why does this language exist? What is a "program"? What does `@triton.jit` do under the hood?
II · The DSL	04–06	What primitives can I call? Which ones hit tensor cores? How does a tile-level reduction become warp shuffles?
III · Building real kernels	07–11	Vector add → fused activation → tiled matmul → softmax → RMSNorm. Each one a one-step elaboration on the previous.
IV · The flagship	12	Flash Attention as the synthesis of every primitive. Why blocking the attention matrix saves bandwidth, not flops.
V · Performance & production	13–14	How does the autotuner pick configs? What is `num_stages` actually overlapping? How do I add a backward pass? When is Triton the wrong tool?

Two ways to read this

Linear (~5 hours). Lesson 01 onward. Lesson 12 (Flash Attention) calls every primitive from 04–10 explicitly — skip the order and the synthesis won't land. The widgets are calibrated so the surprise of lesson n is visible only after lesson n−1.
Targeted. Already wrote CUDA? Skim 01–03, dive at 04. Already wrote Triton vector add? Start at 08. Care only about attention? Read 02, 04, 05, 06, 10, then 12.

The minimal glossary (for the first three lessons)

Program — one instance of your @jit kernel. Identified by tl.program_id(axis). Equivalent to a CUDA thread block, but you write tile-level code instead of per-thread code.
Tile — a fixed-size chunk of indices (e.g. BLOCK_M=128, BLOCK_N=64) that one program handles end-to-end. Tile sizes are known at compile time (tl.constexpr) and must be powers of two: tl.arange, which builds a tile's offsets, only accepts a power-of-two range, so pad odd sizes up and mask the remainder.
Mask — a boolean tile passed to tl.load/tl.store. Lanes where the mask is false read other= (or skip the store). The boundary tile is what makes this non-optional.
Autotune — Triton compiles several configs (different tile sizes, num_warps, num_stages) and benchmarks them on the first call for each shape key, then caches the winner.
num_warps — warps per program. More warps = more parallelism within one tile, fewer registers per thread. The autotuner sweeps it.
num_stages — depth of the software pipeline Triton emits for overlapping HBM loads with compute. Lesson 13 demystifies it.
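
The Autotune entry describes a benchmark-once, cache-the-winner loop. A toy version of that idea in plain Python might look like the sketch below. It is not how @triton.autotune is implemented: the class TinyAutotuner, the function blocked_sum, and the choice of len(data) as the shape key are all assumptions made for illustration.

```python
import time


class TinyAutotuner:
    """Toy stand-in: benchmark every config once per shape key, cache the winner."""

    def __init__(self, fn, configs):
        self.fn = fn
        self.configs = configs
        self.best = {}           # shape key -> winning config
        self.benchmarks_run = 0  # counts timed runs, to show the cache working

    def _time(self, config, data):
        start = time.perf_counter()
        self.fn(data, **config)
        self.benchmarks_run += 1
        return time.perf_counter() - start

    def __call__(self, data):
        key = len(data)  # assumed shape key: the input length
        if key not in self.best:
            # First call for this key: time every config, keep the fastest.
            self.best[key] = min(self.configs, key=lambda cfg: self._time(cfg, data))
        return self.fn(data, **self.best[key])


def blocked_sum(data, BLOCK):
    """Sum a list in chunks of BLOCK; the result is the same for any BLOCK."""
    return sum(sum(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK))


tuned_sum = TinyAutotuner(blocked_sum, configs=[{"BLOCK": 4}, {"BLOCK": 16}, {"BLOCK": 64}])
print(tuned_sum(list(range(100))))  # 4950; first call for key 100 times all three configs
print(tuned_sum(list(range(100))))  # 4950; same key, cached winner, no new timing
print(tuned_sum.benchmarks_run)     # 3
```

Which config wins depends on the machine, so the sketch only checks that the answer is correct and that benchmarking happens once per key. That is also why the first call for a new shape key is noticeably slower than later ones.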

Everything else is defined as you meet it. Onward.