
Orientation — the map before the kernels

Triton has one big idea: you write tile-level code, the compiler turns it into threads. Once that lands, the rest of the language is small. This page is the map: the one trade-off that defines Triton, the five primitives you'll lean on, and how the 14 lessons compose.

Wait — which Triton?
Two unrelated projects share the name. This series covers OpenAI Triton, the Python-embedded DSL for writing GPU kernels (@triton.jit, tl.load, tl.dot). The other, NVIDIA Triton Inference Server, is a model-serving runtime with backends for TensorRT, PyTorch, vLLM, and others. The two share a name and no code.

What Triton actually is

Triton is a Python-embedded DSL that compiles to GPU code (PTX on NVIDIA, AMDGCN on AMD). It exists to fill a specific gap:

| Layer | What you write | Strength | Weakness |
|---|---|---|---|
| PyTorch ops | x.sum(), y = a @ b | trivially short | no fusion control |
| Triton | @triton.jit + tl.dot, ~100 lines / kernel | fusion + autotune | no warp control |
| CUDA / CUTLASS | __global__ void k(...), ~1000 lines / kernel | peak control | shape-fragile |

Triton sits in the middle. Most "I need a fused kernel today" jobs live in this lane.

The one trade-off:

The defining sentence
CUDA gives you 32 threads per warp and you tell each thread what to do. Triton gives you a tile (e.g. 128×64 elements) and you write what happens to the whole tile; the compiler decides how it maps onto warps and which loads overlap with which math.

You give up some last-mile control (mma scheduling, warp specialisation, async copy choreography). You gain ~10× less code, automatic shape tuning, and one Python source file instead of a CUDA build system. For 95% of fused-kernel tasks that trade is correct.

The five primitives that recur in every lesson

By lesson 06 you'll have met the first four. Lesson 12 uses all four in one kernel, and lesson 13 adds @triton.autotune.

| Primitive | What it is | First lesson |
|---|---|---|
| tl.program_id(axis) | "Which tile am I?" The grid coord this program owns. | 03 |
| tl.load(ptr, mask, other) | Coalesced read with predicate. Mask handles the boundary tile. | 04 |
| tl.dot(a, b) | Tile matmul. Hits tensor cores when shapes allow. | 05 |
| tl.sum / tl.max | Reduction across a tile axis. Lowers to warp shuffles + SMEM. | 06 |
| @triton.autotune | Compile several configs; pick the fastest per shape key. | 13 |
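The last row, "compile several configs; pick the fastest per shape key", is easier to picture as a cache. The sketch below is a plain-Python toy and not Triton's implementation: `TinyAutotuner` and `fake_cost` are hypothetical names, and the cost function is invented for illustration where a real autotuner would time actual kernel launches.

```python
class TinyAutotuner:
    """Toy model of 'try several configs, cache the winner per key'."""

    def __init__(self, configs, measure):
        self.configs = configs    # candidate configs, here just block sizes
        self.measure = measure    # hypothetical: (config, key) -> cost, lower is better
        self.cache = {}           # key -> best config found for that key
        self.measurements = 0     # how many (config, key) pairs were evaluated

    def best(self, key):
        # First time a key is seen: evaluate every candidate once.
        if key not in self.cache:
            costs = []
            for cfg in self.configs:
                costs.append((self.measure(cfg, key), cfg))
                self.measurements += 1
            self.cache[key] = min(costs)[1]
        # Later calls with the same key reuse the cached winner.
        return self.cache[key]


def fake_cost(block, n):
    """Made-up cost: number of tiles plus a penalty for wasted (masked) lanes."""
    tiles = (n + block - 1) // block
    wasted = tiles * block - n
    return tiles + 0.5 * wasted


tuner = TinyAutotuner(configs=[64, 128, 256], measure=fake_cost)
small = tuner.best(100)    # key = problem size N
large = tuner.best(4096)
again = tuner.best(4096)   # cache hit, no new measurements
```

The point of the toy is the shape of the mechanism: different keys can pick different configs, and the search cost is paid once per key rather than once per call.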

The one mental model to take with you

  1. My tile: offs = pid * BLOCK + tl.arange(...)
  2. Load: tl.load(p + offs, mask=offs < N)
  3. Compute: tl.dot, tl.sum, tl.exp, etc.
  4. Store: tl.store(out + offs, y, mask=...)

Every Triton kernel is this loop. The differences are what's in step 3.

If you remember nothing else: "program owns a tile, load → compute → store, repeat across the grid". The 14 lessons are variations on step 3.
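To make the four steps concrete without a GPU, here is a minimal plain-Python emulation. It is not Triton code: `program`, `launch`, and `cdiv` are hypothetical helper names, lists stand in for tiles, and the serial `for` loop stands in for a grid whose programs would run in parallel on real hardware. The compute step (y = 2x + 1) is an arbitrary placeholder.

```python
def cdiv(n, block):
    """Ceiling division: number of tiles needed to cover n elements."""
    return (n + block - 1) // block


def program(pid, x, out, n, block):
    """One 'program': owns tile `pid` and runs load -> compute -> store."""
    # 1. My tile: the element offsets this program owns.
    offs = [pid * block + i for i in range(block)]
    # The mask is False for lanes that fall past the end of the data.
    mask = [o < n for o in offs]
    # 2. Load: masked-off lanes get a fill value instead of reading memory.
    tile = [x[o] if m else 0.0 for o, m in zip(offs, mask)]
    # 3. Compute: any per-tile math; here an arbitrary y = 2x + 1.
    y = [2.0 * v + 1.0 for v in tile]
    # 4. Store: masked-off lanes write nothing.
    for o, m, v in zip(offs, mask, y):
        if m:
            out[o] = v


def launch(x, block=4):
    """Run one program per tile; a GPU would run these concurrently."""
    n = len(x)
    out = [None] * n
    for pid in range(cdiv(n, block)):
        program(pid, x, out, n, block)
    return out


x = [float(i) for i in range(10)]
result = launch(x, block=4)   # 3 programs; the last one has 2 masked lanes
```

With 10 elements and a block of 4, the third program owns offsets 8 to 11, and the mask is what keeps offsets 10 and 11 from being read or written. That is the "boundary tile" the mask argument exists for.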

What each part of the curriculum teaches

| Part | Lessons | Question it answers |
|---|---|---|
| I · The model | 01–03 | Why does this language exist? What is a "program"? What does @triton.jit do under the hood? |
| II · The DSL | 04–06 | What primitives can I call? Which ones hit tensor cores? How does a tile-level reduction become warp shuffles? |
| III · Building real kernels | 07–11 | Vector add → fused activation → tiled matmul → softmax → RMSNorm. Each one a one-step elaboration on the previous. |
| IV · The flagship | 12 | Flash Attention as the synthesis of every primitive. Why blocking the attention matrix saves bandwidth, not flops. |
| V · Performance & production | 13–14 | How does the autotuner pick configs? What is num_stages actually overlapping? How do I add a backward pass? When is Triton the wrong tool? |

Two ways to read this

  1. Linear (~5 hours). Lesson 01 onward. Lesson 12 (Flash Attention) calls every primitive from 04–10 explicitly — skip the order and the synthesis won't land. The widgets are calibrated so the surprise of lesson n is visible only after lesson n−1.
  2. Targeted. Already wrote CUDA? Skim 01–03, dive at 04. Already wrote Triton vector add? Start at 08. Care only about attention? Read 02, 04, 05, 06, 10, then 12.

The minimal glossary (for the first three lessons)

Everything else is defined as you meet it. Onward.