Positional encodings — sinusoidal, learned, RoPE, ALiBi
Attention is permutation-equivariant. To give it a sense of "first", "second", "next to", you have to inject position. This lesson covers five strategies (sinusoidal, learned, RoPE, ALiBi, and T5-style relative bias); each makes a different bet about what "position" means. The choice largely determines whether your model extrapolates beyond its training length.
Why this is a problem at all
From lesson 04: Y = softmax(QK^⊤/√d_k) V. Permute the tokens: X' = P · X. Then Q' = PQ, K' = PK, V' = PV, and the output is just PY. The function ignores position. If your input is "the cat sat on the mat", the model can't tell it apart from "mat the on sat cat the".
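The equivariance claim can be checked numerically. The following is a minimal sketch in plain Python (stdlib only), assuming identity projections so that Q = K = V = X, with made-up toy embeddings. The helper names `softmax` and `self_attention` are illustrative, not taken from any library.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Single-head attention with Q = K = V = X (identity projections, an assumption)."""
    d_k = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in X]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, X)) for c in range(d_k)])
    return out

# Three toy token embeddings (made-up values).
X = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
perm = [2, 0, 1]  # new row r holds old row perm[r]

Y = self_attention(X)
Y_perm = self_attention([X[p] for p in perm])

# Permuting the input rows permutes the output rows the same way: Y' = P Y.
max_diff = max(
    abs(a - b)
    for row_p, p in zip(Y_perm, perm)
    for a, b in zip(row_p, Y[p])
)
print(max_diff)
```

Any difference is at floating-point noise level: the permuted input yields the same rows in permuted order, so nothing in the output identifies which token came first.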
To fix this, position information must enter no later than the attention scores, that is, before the softmax. Three places it can enter:
- Added to the input embeddings. Sinusoidal, learned. The token at position i sees x_i + p_i.
- Multiplied into Q, K only. RoPE. The dot product Q_i · K_j becomes a function of the position difference i - j.
- Added as a bias to attention scores. ALiBi, T5 relative bias. The score S_{ij} gets an extra term that depends on i - j.
Each is a different bet about (a) what "position" means and (b) what happens at sequences longer than you trained on.
Sinusoidal — the original, and the Fourier intuition
"Attention Is All You Need" used:
PE(pos, 2i) = sin(pos / 10000^{2i/d}), PE(pos, 2i+1) = cos(pos / 10000^{2i/d}), for i = 0, …, d/2 − 1.
For each position pos, this gives a d-dim vector with sin/cos pairs at different frequencies. The lowest-index dimensions oscillate fast (capture local position); the highest-index ones oscillate slowly (capture coarse position). It's a Fourier basis for the integers, evaluated at the token positions.
Why this design:
- It's a smooth function of position. Nearby positions get similar encodings. The model can learn "attend to nearby tokens" by picking dimensions where neighbouring positions agree.
- Linear combinations encode relative position. Because of the angle-sum identity, sin(pos + Δ) = sin(pos)cos(Δ) + cos(pos)sin(Δ). So PE(pos + Δ) is a linear function of PE(pos). The model can learn relative position from the absolute encoding by learning the right linear projection.
- It extrapolates "smoothly" (in principle). The formula works for any pos, not just integers up to N_train. Empirically, this extrapolation is weak — performance degrades past training length — but better than learned encodings, which have no defined value at unseen positions.
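The angle-sum argument above can be made concrete. This sketch assumes the usual interleaved layout [sin, cos, sin, cos, ...] and base 10000; `sinusoidal_pe` and `shift` are hypothetical helper names. It builds PE(pos + Δ) from PE(pos) using only a per-pair 2×2 rotation that depends on Δ, never on pos.

```python
import math

def sinusoidal_pe(pos, d, base=10000.0):
    """Return the d-dim encoding for one position: [sin, cos, sin, cos, ...]."""
    pe = []
    for i in range(d // 2):
        angle = pos / base ** (2 * i / d)
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def shift(pe, delta, base=10000.0):
    """Map PE(pos) to PE(pos + delta) with a linear map that depends only on delta."""
    d = len(pe)
    out = []
    for i in range(d // 2):
        w = delta / base ** (2 * i / d)
        s, c = pe[2 * i], pe[2 * i + 1]
        out.extend([
            s * math.cos(w) + c * math.sin(w),  # sin(a + w)
            c * math.cos(w) - s * math.sin(w),  # cos(a + w)
        ])
    return out

d = 8
direct = sinusoidal_pe(17, d)             # encoding computed at position 17
shifted = shift(sinusoidal_pe(12, d), 5)  # encoding at 12, shifted by 5
max_err = max(abs(a - b) for a, b in zip(direct, shifted))
print(max_err)
```

The two vectors agree up to rounding, which is the sense in which a learned linear projection can recover relative position from absolute encodings.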
Learned — the "let the model figure it out"
Drop the formula. Define P ∈ ℝ^{N_max × d} as a learnable parameter. Position i gets row i of P. Used by BERT, GPT-2, and GPT-3.
Pros:
- One fewer hand-designed thing in the architecture.
- For short, fixed-length sequences, learns the most useful encoding for the task.
Cons:
- Cannot extrapolate beyond N_max. Position 2049 doesn't exist in P if you trained with N_max = 2048.
- N_max · d parameters, partly wasted if you rarely use the full length. For d=8192 and N_max=8192, that's about 67M parameters (8192² = 67,108,864) just for positions.
- Subject to overfit at high positions. The last few rows of P see few training examples.
Verdict: learned absolute positions are rare in current LLMs. BERT, GPT-2, and GPT-3 used them; most of the post-2022 LLM stack has moved to RoPE or ALiBi.
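A learned table is just a lookup, which makes the extrapolation failure easy to see. The sketch below uses toy sizes and random values standing in for a trained matrix P; the function name `add_position` and the explicit guard are assumptions for illustration, not any framework's API.

```python
import random

N_MAX, D = 2048, 16  # toy sizes; D is kept small so the table is cheap to build

# Stand-in for a trained parameter matrix P: random here, learned in practice.
random.seed(0)
P = [[random.gauss(0.0, 0.02) for _ in range(D)] for _ in range(N_MAX)]

def add_position(x, i):
    """Return x_i + p_i, refusing positions the table has no row for."""
    if not 0 <= i < N_MAX:
        raise IndexError(f"position {i} is outside the learned table (N_max = {N_MAX})")
    return [a + b for a, b in zip(x, P[i])]

# Parameter cost of the table at d = 8192, N_max = 8192.
table_params = 8192 * 8192
print(table_params)
```

With zero-based indexing the valid rows are 0 to N_MAX − 1; anything beyond that has no defined encoding, so the only choices are to fail, truncate, or retrain.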
RoPE — rotation as relative position
Rotary Position Embedding (Su et al. 2021, RoFormer paper). The key idea: position is encoded by rotating Q and K in 2D subspaces. The angle of rotation is proportional to position.
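For a 2D pair of feature dimensions (x_{2k}, x_{2k+1}) at position m, rotate by angle mθ_k where θ_k = 10000^{-2k/d}:
x'_{2k} = x_{2k} cos(mθ_k) − x_{2k+1} sin(mθ_k), x'_{2k+1} = x_{2k} sin(mθ_k) + x_{2k+1} cos(mθ_k).
Apply to Q and K (not V) before computing the dot product. The reason this is interesting: writing R_m for the rotation at position m, rotations compose as R_m^⊤ R_n = R_{n-m}, so (R_m q)^⊤ (R_n k) = q^⊤ R_{n-m} k. The score depends only on the offset n − m, not on the absolute positions.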
Three consequences:
- No added embedding. RoPE doesn't add anything to x; it rotates Q, K. So you don't lose room in the residual stream for position info.
- Extrapolation works better than sinusoidal. The rotation is well-defined for any m, and the inner-product structure naturally gives relative position even at unseen positions. Still degrades at very long lengths, but better.
- Position-extension tricks compose with RoPE. NTK scaling, YaRN, Position Interpolation all rescale θ_k to extend context with light fine-tuning.
RoPE is the default in LLaMA, Mistral, DeepSeek, Qwen, Yi, and most other recent LLMs.
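The relative-position property of RoPE can be verified directly. This is a minimal sketch, assuming the interleaved pairing (x_{2k}, x_{2k+1}) and base 10000 described above; `rope` and `dot` are hypothetical helper names and the query/key values are made up.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each interleaved pair (x[2k], x[2k+1]) by angle pos * theta_k."""
    d = len(x)
    out = []
    for k in range(d // 2):
        theta = base ** (-2 * k / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        a, b = x[2 * k], x[2 * k + 1]
        out.extend([a * c - b * s, a * s + b * c])
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.3, -1.2, 0.7, 0.5]  # made-up query vector
k = [1.1, 0.4, -0.6, 0.9]  # made-up key vector

# Same offset n - m = 3 at two very different absolute positions.
score_near = dot(rope(q, 2), rope(k, 5))
score_far = dot(rope(q, 1000), rope(k, 1003))

# A different offset (7) gives a different score.
score_other = dot(rope(q, 2), rope(k, 9))
print(score_near, score_far, score_other)
```

The first two scores match up to rounding even though the absolute positions differ by about a thousand, while changing the offset changes the score.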
ALiBi — bias the attention scores directly
Attention with Linear Biases (Press et al. 2022). The simplest possible relative-position encoding: add a linear bias to attention scores based on distance. For head h with slope m_h, S_{ij} = q_i · k_j / √d_k + m_h · (j − i).
For autoregressive (causal) attention, j ≤ i, so j - i ≤ 0, and the bias is non-positive: distant tokens get downweighted. Each head has its own fixed slope. Standard recipe for H heads: m_h = 2^{-8h/H} for h = 1, …, H, i.e. 2^{-8/H}, 2^{-16/H}, …, 2^{-8}. This geometric series of decay rates makes some heads attend far and others attend close.
Why this is interesting:
- Zero parameters. No learned encoding, no embedding table.
- Extrapolates well. Press et al. showed ALiBi at train length 1024 evaluates well at length 2048+. The bias is just a function of position difference; no "out of distribution" inputs.
- No QK rotation; works in 1 line of attention code. Easier to add to legacy attention implementations than RoPE.
The trade-off: ALiBi is a hard-coded inductive bias. RoPE lets the model learn whatever position pattern it wants within the rotation budget. Empirically RoPE wins on most downstream tasks; ALiBi wins on length extrapolation, where its hard-coded "distant = less weight" bias holds robustly.
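The slope recipe and the bias matrix are short enough to write out. This sketch assumes a power-of-two head count (other head counts are handled differently in practice and are ignored here) and folds the causal mask into the bias as −∞; `alibi_slopes` and `alibi_bias` are hypothetical names.

```python
import math

def alibi_slopes(num_heads):
    """Geometric slopes 2^(-8/H), 2^(-16/H), ..., 2^(-8) for H heads."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """Causal bias matrix: slope * (j - i) for j <= i, -inf for future positions."""
    return [
        [slope * (j - i) if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]

slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])  # first head of eight, slope 1/2

# scores[i][j] + bias[i][j] is then fed to the softmax as usual.
for row in bias:
    print(row)
```

The diagonal is zero, each step further into the past subtracts one more slope, and everything above the diagonal is masked out.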
The extension tricks — NTK / YaRN / Position Interpolation
You trained a RoPE model at N_max = 4096. You want to serve it at 32k context. Options:
| Method | What it changes | Fine-tuning needed | Trade-off |
|---|---|---|---|
| Position Interpolation (Chen et al. 2023) | Divide position index by N_new / N_train before applying RoPE. Equivalent to compressing the position axis. | ~1B tokens of fine-tuning at new length. | Cheap. Loses some fine-grained position resolution. |
| NTK-aware scaling | Scale the base frequency 10000 in RoPE to 10000 · α^{d/(d-2)} where α = N_new / N_train. Keeps high-frequency dims unchanged, scales low-frequency ones. | None or minimal. | Good zero-shot; reasonable up to 2–4× context. |
| YaRN (Peng et al. 2024) | Per-dim scaling with a "ramp" function so that high-frequency dims aren't compressed and low-frequency dims are. | ~50M tokens. | Current SOTA for context extension. 4–8× extension with small fine-tune. |
| Long-context pretraining | Nothing in RoPE itself; train on long sequences from scratch. | Not applicable (full pretraining run). | Most expensive. Best quality if compute allows. |
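The first two rows of the table reduce to a few lines of arithmetic. The sketch below assumes a head dimension of 128 (not stated in the text) and the 4096 → 32768 scenario; the function names are hypothetical. It shows Position Interpolation compressing positions, and NTK-aware scaling changing the base so that the fastest pair is unchanged while the slowest is slowed by α.

```python
def rope_thetas(d, base=10000.0):
    """Per-pair rotation frequencies theta_k = base^(-2k/d)."""
    return [base ** (-2 * k / d) for k in range(d // 2)]

def interpolated_position(pos, n_train, n_new):
    """Position Interpolation: divide the position index by alpha = n_new / n_train."""
    return pos * n_train / n_new

def ntk_base(d, n_train, n_new, base=10000.0):
    """NTK-aware scaling: enlarge the base by alpha^(d / (d - 2))."""
    alpha = n_new / n_train
    return base * alpha ** (d / (d - 2))

d, n_train, n_new = 128, 4096, 32768  # head dim of 128 is an assumed value

old = rope_thetas(d)
new = rope_thetas(d, ntk_base(d, n_train, n_new))

fastest_ratio = new[0] / old[0]    # highest-frequency pair
slowest_ratio = new[-1] / old[-1]  # lowest-frequency pair
print(interpolated_position(32768, n_train, n_new), fastest_ratio, slowest_ratio)
```

Under Position Interpolation the last new position lands on the last trained position, so every angle stays in the trained range at the cost of finer spacing. Under NTK-aware scaling the fastest pair keeps its frequency and the slowest pair is stretched by the full factor α = 8, with the pairs in between scaled progressively.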
The whole table — side-by-side
| Method | Mechanism | Params | Extrapolation | Where applied |
|---|---|---|---|---|
| Sinusoidal | + fixed sin/cos to input | 0 | OK | Embedding (added to x) |
| Learned | + learnable P to input | N_max · d | None (fails past N_max) | Embedding (added to x) |
| RoPE | rotate (Q, K) by angle ∝ position | 0 | OK natively, great with NTK/YaRN | Q and K before dot product |
| ALiBi | + linear bias to attention scores | 0 (slopes fixed) | Best | Attention scores S |
| T5 relative | + learnable bias to S, bucketed by distance | ~32 × h | OK | Attention scores S |
The interview probes
- Why does attention need positional encoding but a CNN doesn't? CNN convolutions are translation-equivariant by construction, and each kernel weight is tied to a fixed offset inside its local window, so relative position is baked into the operation. Attention is permutation-equivariant, so you must inject position explicitly.
- Why is RoPE applied to Q, K but not V? The point of RoPE is to make the attention score Q^⊤ K depend on relative position. If you rotated V, the output would have a position-dependent rotation that downstream layers would have to undo. RoPE is precisely about position-into-similarity, not position-into-content.
- What's the inductive bias of ALiBi? "Distant tokens are less important than nearby ones, with a fixed per-head rate of decay." The slopes are set by a formula, not learned. The assumption is roughly true for natural language but wrong for tasks where distant context matters (e.g., needle-in-a-haystack). RoPE is more flexible: the model can learn arbitrary position patterns within its rotation budget.
- Can you mix positional encodings? Yes. Some models use both RoPE and a small absolute embedding for special tokens like <BOS> or system messages. The compositional rule is: position info flows through any of the three injection points, and downstream layers blend them as needed.
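- Why does extending context length hurt accuracy on short sequences? Every extension method changes the rotation angles the model was trained on. Position Interpolation compresses all positions, so the angles that distinguished positions 1, 2, 3 shrink by a factor of N_train / N_new. NTK-aware scaling and YaRN are designed to leave the high-frequency dims mostly intact, but the remaining dims still shift. The model has to either re-learn the new scale (fine-tuning) or accept some loss of position resolution.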
Where things go subtly wrong
| Bug | Symptom | Diagnosis |
|---|---|---|
| Forgetting to rotate K at the cached positions | KV cache + RoPE: model degrades quickly after a few tokens. | The K cache should store rotated K, so it's reused without re-rotation. New K's are rotated by the new position; older K's are already rotated by their original position. Don't re-rotate the whole cache. |
| RoPE applied to wrong head dim | Loss is fine for short sequences, gets worse for long. | RoPE expects pairs of dimensions, so d_h must be even. Implementations pair dimensions in one of two ways: interleaved, (x_{2k}, x_{2k+1}), as in the definition above, or half-split, (x_k, x_{k + d_h/2}). Both are common. Pick one and be consistent, including when loading weights trained under the other convention. |
| Sinusoidal added at the wrong layer | Model trains but lower quality than expected. | Sinusoidal must be added to the input embeddings, before the first attention. Adding it mid-stack or after some layers loses its effect because the model has already mixed position info. |
| ALiBi slope applied unmasked | Future tokens get the bias too, breaks causal mask. | For j > i the raw bias m_h · (j − i) is positive, so without masking it favours future tokens. Apply the causal mask (−∞ for j > i) after adding the bias. Implementations differ; the typical bug is to compute the bias matrix once and reuse or reslice it without re-masking. |
| Learned positions overflow at inference | Inference past N_max returns NaN or random outputs. | Learned positions have no value past N_max. Add a guard: truncate input, or pad with zeros (model will degrade but won't NaN), or switch to RoPE if you can retrain. |
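The KV-cache row of the table is the easiest of these to get wrong, so here is a minimal sketch of the correct and the buggy variant. It assumes interleaved RoPE pairs with base 10000 and made-up 2-dim keys and query; `rope` and `dot` are hypothetical helpers.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate interleaved pairs (x[2k], x[2k+1]) by pos * base^(-2k/d)."""
    d = len(x)
    out = []
    for k in range(d // 2):
        ang = pos * base ** (-2 * k / d)
        a, b = x[2 * k], x[2 * k + 1]
        out.extend([a * math.cos(ang) - b * math.sin(ang),
                    a * math.sin(ang) + b * math.cos(ang)])
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

keys = [[1.0, 0.5], [-0.3, 0.8], [0.6, -0.2]]  # made-up unrotated keys, positions 0..2
query = [0.9, 0.4]                             # made-up query at position 2

# Correct: rotate each key once, at the position where it was produced, then cache it.
cache = [rope(k, pos) for pos, k in enumerate(keys)]
good = [dot(rope(query, 2), k_rot) for k_rot in cache]

# Bug: re-rotating the whole cache on a later step applies each rotation twice.
double_rotated = [rope(k_rot, pos) for pos, k_rot in enumerate(cache)]
bad = [dot(rope(query, 2), k_rot) for k_rot in double_rotated]

# Reference: recompute everything from the unrotated keys.
reference = [dot(rope(query, 2), rope(k, pos)) for pos, k in enumerate(keys)]
print(good, bad, reference)
```

The cached scores match the full recomputation, while the double-rotated cache only agrees at position 0, where the rotation is the identity. That is why this bug can pass a single-token smoke test and then degrade as generation proceeds.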
Interview prompts you should be ready for
- "Why does RoPE give relative position 'for free'?" (Because rotations compose: R_m^⊤ R_n = R_{n-m}. So (R_m Q)^⊤ (R_n K) = Q^⊤ R_{n-m} K depends only on the difference of positions. Beautiful algebra.)
- "You're shipping a 7B model trained at 4k context. Customer needs 32k. Options?" ((1) Position Interpolation + 1B tokens of fine-tuning: simplest. (2) NTK-aware scaling: zero-shot, but only reasonable up to roughly 2–4× (8k–16k), so not enough on its own for 32k. (3) YaRN + 50M tokens fine-tune: current best. (4) From-scratch retrain at 32k: most expensive.)
- "ALiBi vs RoPE — when does each win?" (ALiBi: extreme length extrapolation (because the bias is just a function of distance, no parameters drift). RoPE: average task quality at training length, especially when the task has non-monotone position patterns. Modern LLMs use RoPE because the extrapolation gap is closed by NTK/YaRN.)
- "What's wrong with adding sinusoidal to the residual stream every layer instead of just at input?" (The model has to "subtract out" the position info to do non-positional computation. The information is already in the residual stream from layer 0; adding it every layer is at best redundant and at worst overwrites learned position-free representations.)
- "How would you adapt RoPE for cross-attention (different sequence lengths for Q and K)?" (Use the relevant position index for each: Q gets the decoder position, K gets the encoder position. The relative-position invariance still works because R_m^⊤ R_n = R_{n-m} holds for any m, n from any sequences.)
- "Why is RoPE applied before attention, not as a post-processing of the attention output?" (The whole point of RoPE is to make the inner product Q·K position-dependent. Applying it to the output would rotate the value, which has no semantic meaning — V is content, not position. Position must enter the score.)