First kernel — vector add
The smallest useful CUDA program. Every line teaches something — the __global__ keyword, the launch syntax, the host/device split, the boundary check. Walk this and you can read any kernel in the rest of Part V.
The whole program
#include <cuda_runtime.h>
#include <stdio.h>
// device function — runs on the GPU
__global__ void add(const float* x, const float* y, float* z, int N) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) {
z[i] = x[i] + y[i];
}
}
int main() {
const int N = 1 << 20; // 1 M elements
const int B = N * sizeof(float);
// 1. allocate & populate on host
float* hx = (float*) malloc(B);
float* hy = (float*) malloc(B);
float* hz = (float*) malloc(B);
for (int i = 0; i < N; i++) { hx[i] = i; hy[i] = i * 2.0f; }
// 2. allocate on device
float *dx, *dy, *dz;
cudaMalloc(&dx, B);
cudaMalloc(&dy, B);
cudaMalloc(&dz, B);
// 3. H2D copy
cudaMemcpy(dx, hx, B, cudaMemcpyHostToDevice);
cudaMemcpy(dy, hy, B, cudaMemcpyHostToDevice);
// 4. launch
int threads = 256;
int blocks = (N + threads - 1) / threads;
add<<<blocks, threads>>>(dx, dy, dz, N);
// 5. D2H copy
cudaMemcpy(hz, dz, B, cudaMemcpyDeviceToHost);
// 6. cleanup
cudaFree(dx); cudaFree(dy); cudaFree(dz);
free(hx); free(hy); free(hz);
return 0;
}
Compile and run:
nvcc add.cu -o add && ./add
Animated · the host-device dance, timeline
Every CUDA program of this shape spends time on the same six things in the same order. The widget animates the host's view: allocation on host and device, H→D copy, kernel launch, D→H copy, free. Bars are sized by realistic durations for N=1M floats.
Animated · 1D thread grid stepping through index arithmetic
For a vector of N elements with blockDim=B, each thread (blockIdx, threadIdx) computes i = blockIdx · B + threadIdx. Step through to watch a chosen thread compute its index and access its element. Thread lighting up = computing this tick; arrow = which element it touches.
2D · boundary check, visualised
When N is not a multiple of blockDim, the last block has extra threads. Without the if (i < N) guard these would write past the array. The widget shows the danger zone: tick the guard off to see writes leak into "off-array" memory cells.
Reading the kernel, line by line
__global__ void add(const float* x, const float* y, float* z, int N) {
__global__ is the keyword that says: this function is a kernel. It runs on the device, is called from the host (with the triple-bracket syntax), takes only by-value or pointer-to-device-memory arguments, and returns void (results go through pointers).
Two cousins you'll see:
__device__— function callable only from device code (other__global__or__device__functions).__host__— host function (the default; you only need it for__host__ __device__dual annotations).
int i = blockIdx.x * blockDim.x + threadIdx.x;
This is "the index of this thread in the grid", computed from the built-ins covered in lesson 20. Conceptually: which output element am I responsible for?
blockIdx.x: 0..gridDim.x-1, the block we're in.blockDim.x: e.g. 256, threads per block (set by the launch).threadIdx.x: 0..blockDim.x-1, our thread within the block.
For block 3, thread 7 with blockDim 256: i = 3 * 256 + 7 = 775.
if (i < N) { z[i] = x[i] + y[i]; }
The boundary check. We probably launched ceil(N/256) * 256 threads, which is ≥ N. The threads with i ≥ N must do nothing. Without this guard they'd write past the end of the output buffer — typically a silent stomp on whatever's allocated next in memory.
Reading the host code, line by line
cudaMalloc(&dx, B) — allocate B bytes on the device. The function takes a pointer-to-pointer because it writes the address into dx. cudaMalloc is one of the slow calls from lesson 15 — only call it at setup, not per-op.
cudaMemcpy(dx, hx, B, cudaMemcpyHostToDevice) — copy B bytes from host pointer hx to device pointer dx. cudaMemcpyHostToDevice is the direction enum. Synchronous: it blocks the host until the copy finishes. Async variants exist (cudaMemcpyAsync) and are needed for compute/copy overlap.
add<<<blocks, threads>>>(dx, dy, dz, N);
The triple-bracket launch configuration:
- First argument: number of blocks (a uint or a
dim3). - Second: threads per block (a uint or
dim3). - Optional third: shared-memory bytes per block (lesson 24).
- Optional fourth: stream (lesson 15).
This launch is asynchronous. The host call returns immediately; the kernel is queued on the default stream and runs at some point. The next cudaMemcpy implicitly synchronises (it's a blocking copy), so by the time we use hz the kernel has completed.
The asynchronous-launch trap
This is the single most common beginner-CUDA bug:
add<<<blocks, threads>>>(dx, dy, dz, N);
printf("%f\n", dz[0]); // wrong! reads device memory from host
// also wrong: error from the kernel is invisible until the next cuda* call
The kernel hasn't necessarily finished. Worse, you can't dereference a device pointer from host code. And if the kernel returned an error (out-of-bounds write, illegal memory access), you won't see it until the next CUDA API call. Use cudaDeviceSynchronize() to wait, and check cudaGetLastError() after the launch.
#define CHECK(x) do { \
cudaError_t e = (x); \
if (e != cudaSuccess) { \
fprintf(stderr, "CUDA %s: %s\n", #x, cudaGetErrorString(e)); \
exit(1); \
} \
} while (0)
Wrap every cudaMalloc, cudaMemcpy, and the post-launch cudaGetLastError() in this. Catches 80% of bugs before they become mysterious.
Why this is a "good first kernel" — and a bad real one
Pedagogically, vector add is perfect: minimum arithmetic, exposes launch syntax, indexing, and host/device split. Performance-wise, it's awful — strictly memory-bound and unfusable on its own. For each element it reads 2 floats (8 bytes) from HBM and writes 1 float (4 bytes), and does one addition. The arithmetic intensity is 1 flop / 12 bytes ≈ 0.083 FLOP/byte — far below H100's ridge point (~20 FLOP/byte for fp32 against the CUDA-core peak of ~67 TFLOPS, or ~295 FLOP/byte if you target the bf16 tensor-core peak of ~990 TFLOPS). You'll achieve a few hundred GFLOPS at most on this kernel while the GPU's bf16 tensor-core peak is ~990 TFLOPS — three orders of magnitude unused, because the kernel never touches a tensor core and never reads a byte twice.
The cure: never run a vector add alone. Fuse it with surrounding ops (lesson 16). Anytime you find yourself launching a kernel for a single elementwise op, consider whether torch.compile would catch it for free.
Two extensions that don't change the structure
1. 2D grids and 2D thread indices
For a matrix op, you might want a 2D grid:
__global__ void matadd(const float* A, const float* B, float* C, int M, int N) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (row < M && col < N) C[row*N + col] = A[row*N + col] + B[row*N + col];
}
dim3 threads(16, 16);
dim3 blocks((N + 15)/16, (M + 15)/16);
matadd<<<blocks, threads>>>(dA, dB, dC, M, N);
The hardware doesn't care — under the hood threads are still 1D (warps of 32 contiguous threadIdx values). The 2D shape is purely a convenience for the indexing arithmetic, with threadIdx.x being the "fast" dimension.
2. Unified memory (cudaMallocManaged)
You can allocate memory that's accessible from both host and device — the driver migrates pages on demand:
float *x;
cudaMallocManaged(&x, B);
for (int i = 0; i < N; i++) x[i] = i; // host writes
add<<<blocks, threads>>>(x, y, z, N); // device reads — pages migrate
cudaDeviceSynchronize();
printf("%f\n", z[0]); // host reads — back-migrated
Convenient for tutorial code. Not for production: page migrations are expensive and unpredictable. Real frameworks (PyTorch, JAX) use explicit device buffers + explicit copies + the PyTorch caching allocator (lesson 15).
What every CUDA programmer learns to ignore (until they can't)
- Device selection.
cudaSetDevice(i)picks which GPU. PyTorch'sdevice="cuda:0"wraps this. - Streams and events. Lesson 15 has them; for the simple program, the default stream suffices.
- Synchronisation.
cudaDeviceSynchronize()blocks until everything queued is done. Useful for timing; production code avoids it (it serialises CPU and GPU). - Async copies.
cudaMemcpyAsync+cudaStreamSynchronizefor overlapping copies with compute. - Pinned host memory.
cudaMallocHostfor the source/destination of fast async H2D/D2H — lesson 15 again.
Interactive · trace a vector-add launch
Pick N and threads-per-block. The widget shows you which threads do real work and which fall into the "no-op" zone above N. Watch the boundary check matter — without it, the last partial warp would write into unallocated memory.