Vector add — your first kernel
The "hello world" of GPU programming. We use it to walk the full Triton workflow end to end: jit, grid, launch, autotune, verify against PyTorch, benchmark with do_bench. By the end you have a kernel you can paste into a Colab.
The complete kernel — 13 lines
import torch
import triton
import triton.language as tl
@triton.jit
def add_kernel(a_ptr, b_ptr, c_ptr, N, BLOCK: tl.constexpr):
pid = tl.program_id(0)
offs = pid * BLOCK + tl.arange(0, BLOCK)
m = offs < N
x = tl.load(a_ptr + offs, mask=m, other=0.0)
y = tl.load(b_ptr + offs, mask=m, other=0.0)
tl.store(c_ptr + offs, x + y, mask=m)
That's the whole kernel. Every primitive from lesson 04 is here: program_id, arange, mask, tl.load with other, tl.store with mask.
The PyTorch-facing wrapper
def vector_add(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
assert a.is_cuda and b.is_cuda and a.shape == b.shape
N = a.numel()
c = torch.empty_like(a)
BLOCK = 1024
grid = (triton.cdiv(N, BLOCK),)
add_kernel[grid](a, b, c, N, BLOCK=BLOCK)
return c
Two notable things:
- The launch syntax
add_kernel[grid](...)uses Python's__getitem__: the grid goes in[], the kernel args in(). Triton overloads this on the@jitobject. BLOCKis passed as a keyword. Because it's markedtl.constexprin the kernel, this triggers compilation.
Verify against PyTorch — always
a = torch.randn(1_000_000, device='cuda', dtype=torch.float32)
b = torch.randn(1_000_000, device='cuda', dtype=torch.float32)
c_triton = vector_add(a, b)
c_torch = a + b
torch.testing.assert_close(c_triton, c_torch) # exact for fp32 add
For bf16/fp16 you'd use assert_close(... atol=1e-3, rtol=1e-2) — same as you'd use for any GPU reduction comparison.
Benchmark with do_bench
Triton ships a microbenchmarker that handles warmup and the noise floor:
ms = triton.testing.do_bench(lambda: vector_add(a, b), warmup=25, rep=100)
N_bytes = 3 * a.numel() * a.element_size() # 2 loads + 1 store
bw_gbps = N_bytes / (ms * 1e-3) / 1e9
print(f"{ms:.3f} ms · {bw_gbps:.0f} GB/s")
For a 1M-element fp32 add on an H100 (peak HBM ~3 TB/s) you should see ~2700 GB/s — close to peak because the kernel is purely bandwidth-bound. If you see <50% of peak, the kernel is launch-bound (try a bigger N) or your BLOCK is too small.
Adding autotune — picking BLOCK for you
Different N's like different BLOCK sizes. Rather than hardcode 1024, let the autotuner choose:
@triton.autotune(
configs=[
triton.Config({'BLOCK': 256}, num_warps=2),
triton.Config({'BLOCK': 512}, num_warps=4),
triton.Config({'BLOCK': 1024}, num_warps=4),
triton.Config({'BLOCK': 2048}, num_warps=8),
],
key=['N'],
)
@triton.jit
def add_kernel(a_ptr, b_ptr, c_ptr, N, BLOCK: tl.constexpr):
# ... same body ...
def vector_add(a, b):
N = a.numel()
c = torch.empty_like(a)
grid = lambda meta: (triton.cdiv(N, meta['BLOCK']),)
add_kernel[grid](a, b, c, N) # no BLOCK arg — autotune fills it
return c
The first call for each N in the key triggers a benchmarking pass — Triton compiles all four configs, runs each ~5 times, and picks the fastest. Subsequent calls with the same N hit the cache.
What can go wrong on a vector add
| Symptom | Cause | Fix |
|---|---|---|
| Wrong output at N = 1_000_000 but right at N = 1024 | Forgot the mask | Add mask=offs < N, other=0. to every load/store. |
| Segfault at large N | Mask missing OR you passed a non-contiguous tensor | Mask, and a = a.contiguous() before the call. |
| Only ~10% of peak bandwidth | Launch-bound (kernel runs too fast) or BLOCK too small | Increase BLOCK (try 4096) or batch more work per kernel. |
| First call takes 200 ms | Autotune sweep on first hit | Run a warmup call before timing. |
RuntimeError: index out of bounds on launch | Grid uses an old BLOCK after autotune changed it | Use grid = lambda meta: (cdiv(N, meta['BLOCK']),) so grid reads BLOCK from the meta dict. |
Reading the roofline — is this kernel fast?
A vector add does 1 floating-point op per 3 memory accesses (2 loads, 1 store). That's an arithmetic intensity of 1/12 flop per byte for fp32 — far below the roofline knee. The kernel is bandwidth-bound, and the right ceiling is HBM peak, not flops peak.
Interactive · tune your add kernel
Pick BLOCK and num_warps; see the predicted bandwidth on a 4-million-fp32 add and where each setting hurts.
What's next
Vector add is bandwidth-bound and trivial to write. Lesson 08 picks a more typical workload — a fused linear + bias + GELU — that only wins through fusion. You'll see eager PyTorch's 3 launches collapse into 1 Triton kernel, and feel the bandwidth saving in your fingertips.