First kernel — vector add

The smallest useful CUDA program. Every line teaches something — the __global__ keyword, the launch syntax, the host/device split, the boundary check. Walk this and you can read any kernel in the rest of Part V.

The whole program

#include <cuda_runtime.h>
#include <stdio.h>

// device function — runs on the GPU
__global__ void add(const float* x, const float* y, float* z, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        z[i] = x[i] + y[i];
    }
}

int main() {
    const int N = 1 << 20;          // 1 M elements
    const int B = N * sizeof(float);

    // 1. allocate & populate on host
    float* hx = (float*) malloc(B);
    float* hy = (float*) malloc(B);
    float* hz = (float*) malloc(B);
    for (int i = 0; i < N; i++) { hx[i] = i; hy[i] = i * 2.0f; }

    // 2. allocate on device
    float *dx, *dy, *dz;
    cudaMalloc(&dx, B);
    cudaMalloc(&dy, B);
    cudaMalloc(&dz, B);

    // 3. H2D copy
    cudaMemcpy(dx, hx, B, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, B, cudaMemcpyHostToDevice);

    // 4. launch
    int threads = 256;
    int blocks  = (N + threads - 1) / threads;
    add<<<blocks, threads>>>(dx, dy, dz, N);

    // 5. D2H copy
    cudaMemcpy(hz, dz, B, cudaMemcpyDeviceToHost);

    // 6. cleanup
    cudaFree(dx); cudaFree(dy); cudaFree(dz);
    free(hx); free(hy); free(hz);
    return 0;
}

Compile and run:

nvcc add.cu -o add && ./add

Animated · the host-device dance, timeline

Every CUDA program of this shape spends time on the same six things in the same order. The widget animates the host's view: allocation on host and device, H→D copy, kernel launch, D→H copy, free. Bars are sized by realistic durations for N=1M floats.

Animated · 1D thread grid stepping through index arithmetic

For a vector of N elements with blockDim=B, each thread (blockIdx, threadIdx) computes i = blockIdx · B + threadIdx. Step through to watch a chosen thread compute its index and access its element. Thread lighting up = computing this tick; arrow = which element it touches.

2D · boundary check, visualised

When N is not a multiple of blockDim, the last block has extra threads. Without the if (i < N) guard these would write past the array. The widget shows the danger zone: tick the guard off to see writes leak into "off-array" memory cells.

Reading the kernel, line by line

__global__ void add(const float* x, const float* y, float* z, int N) {

__global__ is the keyword that says: this function is a kernel. It runs on the device, is called from the host (with the triple-bracket syntax), takes only by-value or pointer-to-device-memory arguments, and returns void (results go through pointers).

Two cousins you'll see:

__device__ — function callable only from device code (other __global__ or __device__ functions).
__host__ — host function (the default; you only need it for __host__ __device__ dual annotations).

int i = blockIdx.x * blockDim.x + threadIdx.x;

This is "the index of this thread in the grid", computed from the built-ins covered in lesson 20. Conceptually: which output element am I responsible for?

blockIdx.x: 0..gridDim.x-1, the block we're in.
blockDim.x: e.g. 256, threads per block (set by the launch).
threadIdx.x: 0..blockDim.x-1, our thread within the block.

For block 3, thread 7 with blockDim 256: i = 3 * 256 + 7 = 775.

if (i < N) { z[i] = x[i] + y[i]; }

The boundary check. We probably launched ceil(N/256) * 256 threads, which is ≥ N. The threads with i ≥ N must do nothing. Without this guard they'd write past the end of the output buffer — typically a silent stomp on whatever's allocated next in memory.

Reading the host code, line by line

cudaMalloc(&dx, B) — allocate B bytes on the device. The function takes a pointer-to-pointer because it writes the address into dx. cudaMalloc is one of the slow calls from lesson 15 — only call it at setup, not per-op.

cudaMemcpy(dx, hx, B, cudaMemcpyHostToDevice) — copy B bytes from host pointer hx to device pointer dx. cudaMemcpyHostToDevice is the direction enum. Synchronous: it blocks the host until the copy finishes. Async variants exist (cudaMemcpyAsync) and are needed for compute/copy overlap.

add<<<blocks, threads>>>(dx, dy, dz, N);

The triple-bracket launch configuration:

First argument: number of blocks (a uint or a dim3).
Second: threads per block (a uint or dim3).
Optional third: shared-memory bytes per block (lesson 24).
Optional fourth: stream (lesson 15).

This launch is asynchronous. The host call returns immediately; the kernel is queued on the default stream and runs at some point. The next cudaMemcpy implicitly synchronises (it's a blocking copy), so by the time we use hz the kernel has completed.

The asynchronous-launch trap

This is the single most common beginner-CUDA bug:

add<<<blocks, threads>>>(dx, dy, dz, N);
printf("%f\n", dz[0]);  // wrong! reads device memory from host
// also wrong: error from the kernel is invisible until the next cuda* call

The kernel hasn't necessarily finished. Worse, you can't dereference a device pointer from host code. And if the kernel returned an error (out-of-bounds write, illegal memory access), you won't see it until the next CUDA API call. Use cudaDeviceSynchronize() to wait, and check cudaGetLastError() after the launch.

The debugging boilerplate every CUDA programmer types

#define CHECK(x) do { \
    cudaError_t e = (x); \
    if (e != cudaSuccess) { \
        fprintf(stderr, "CUDA %s: %s\n", #x, cudaGetErrorString(e)); \
        exit(1); \
    } \
} while (0)

Wrap every cudaMalloc, cudaMemcpy, and the post-launch cudaGetLastError() in this. Catches 80% of bugs before they become mysterious.

Why this is a "good first kernel" — and a bad real one

Pedagogically, vector add is perfect: minimum arithmetic, exposes launch syntax, indexing, and host/device split. Performance-wise, it's awful — strictly memory-bound and unfusable on its own. For each element it reads 2 floats (8 bytes) from HBM and writes 1 float (4 bytes), and does one addition. The arithmetic intensity is 1 flop / 12 bytes ≈ 0.083 FLOP/byte — far below H100's ridge point (~20 FLOP/byte for fp32 against the CUDA-core peak of ~67 TFLOPS, or ~295 FLOP/byte if you target the bf16 tensor-core peak of ~990 TFLOPS). You'll achieve a few hundred GFLOPS at most on this kernel while the GPU's bf16 tensor-core peak is ~990 TFLOPS — three orders of magnitude unused, because the kernel never touches a tensor core and never reads a byte twice.

The cure: never run a vector add alone. Fuse it with surrounding ops (lesson 16). Anytime you find yourself launching a kernel for a single elementwise op, consider whether torch.compile would catch it for free.

Two extensions that don't change the structure

1. 2D grids and 2D thread indices

For a matrix op, you might want a 2D grid:

__global__ void matadd(const float* A, const float* B, float* C, int M, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) C[row*N + col] = A[row*N + col] + B[row*N + col];
}

dim3 threads(16, 16);
dim3 blocks((N + 15)/16, (M + 15)/16);
matadd<<<blocks, threads>>>(dA, dB, dC, M, N);

The hardware doesn't care — under the hood threads are still 1D (warps of 32 contiguous threadIdx values). The 2D shape is purely a convenience for the indexing arithmetic, with threadIdx.x being the "fast" dimension.

2. Unified memory (`cudaMallocManaged`)

You can allocate memory that's accessible from both host and device — the driver migrates pages on demand:

float *x;
cudaMallocManaged(&x, B);
for (int i = 0; i < N; i++) x[i] = i;  // host writes
add<<<blocks, threads>>>(x, y, z, N);  // device reads — pages migrate
cudaDeviceSynchronize();
printf("%f\n", z[0]);                    // host reads — back-migrated

Convenient for tutorial code. Not for production: page migrations are expensive and unpredictable. Real frameworks (PyTorch, JAX) use explicit device buffers + explicit copies + the PyTorch caching allocator (lesson 15).

What every CUDA programmer learns to ignore (until they can't)

Device selection. cudaSetDevice(i) picks which GPU. PyTorch's device="cuda:0" wraps this.
Streams and events. Lesson 15 has them; for the simple program, the default stream suffices.
Synchronisation. cudaDeviceSynchronize() blocks until everything queued is done. Useful for timing; production code avoids it (it serialises CPU and GPU).
Async copies. cudaMemcpyAsync + cudaStreamSynchronize for overlapping copies with compute.
Pinned host memory. cudaMallocHost for the source/destination of fast async H2D/D2H — lesson 15 again.

Interactive · trace a vector-add launch

Pick N and threads-per-block. The widget shows you which threads do real work and which fall into the "no-op" zone above N. Watch the boundary check matter — without it, the last partial warp would write into unallocated memory.

Takeaway

Every CUDA kernel has the same skeleton: compute a per-thread global index, bound-check it, do work for that index. Every launch has the same envelope: allocate device buffers, copy data in, launch with a (blocks, threads) shape, copy data out. Master this template and the rest of Part V is "what to do in the kernel body" — coalesce reads, use shared memory, manage divergence, exploit tensor cores.