Positional Encoding (AI Glossary)

How transformers know token order when attention is position-agnostic

Definition

Positional encoding is the technique transformers use to inject token-order information into a model that, by default, has no sense of sequence. The self-attention mechanism at the heart of a transformer is permutation-equivariant (often loosely described as permutation-invariant): if the input tokens are shuffled, each token receives exactly the same output vector as before, just in its new slot. Positional encoding fixes this by adding (or otherwise combining) a position-dependent signal to each token’s representation, so the model can distinguish “the cat sat on the mat” from a random shuffle of the same words.

Why attention needs it

Attention works by comparing every token to every other token through query-key dot products, then using the resulting weights to mix the value vectors. Those comparisons depend only on the content of the vectors, not their location in the sequence. Without an extra signal, each word in “dog bites man” would receive exactly the same representation as the same word in “man bites dog”, so the model could not tell who bit whom. Language is deeply order-dependent, so some mechanism must restore positional awareness, and that mechanism is positional encoding.
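
To make this concrete, here is a minimal sketch of single-head self-attention in plain Python. It assumes identity projections (queries, keys, and values are all just the token vectors) and uses made-up 2-dimensional embeddings; neither comes from a real model. It compares the output for “dog” in both word orders.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head attention with Q = K = V = tokens (an illustrative simplification)."""
    dim = len(tokens[0])
    outputs = []
    for q in tokens:
        # Scaled dot product of this query against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(dim)])
    return outputs

# Hypothetical embeddings, invented for this example.
emb = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}

out_a = self_attention([emb[w] for w in ["dog", "bites", "man"]])
out_b = self_attention([emb[w] for w in ["man", "bites", "dog"]])

# "dog" sits at index 0 in the first sentence and index 2 in the second,
# yet its output vector is the same up to floating-point rounding.
dog_matches = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(out_a[0], out_b[2]))
print(dog_matches)
```

Nothing in the computation refers to an index, which is why the two orderings cannot be told apart.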

Absolute encodings: sinusoidal and learned

The original 2017 transformer used sinusoidal positional encoding: a fixed pattern of sine and cosine waves at different frequencies, added to the token embeddings. Each position gets a unique vector, and because the encoding at position pos + k is a fixed linear function of the encoding at position pos for any offset k, the model can learn to attend by relative offset. Because it is computed rather than trained, it can in principle extend to positions never seen in training, although models trained this way often degrade in practice when pushed past their training length.
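
As a rough sketch of the sinusoidal scheme, the function below computes the encoding for one position using the usual formulation (sine on even dimensions, cosine on odd ones, base 10000). The function name and the example token embedding are illustrative assumptions, and an even d_model is assumed.

```python
import math

def sinusoidal_encoding(position, d_model, base=10000.0):
    """Encoding vector for one position; assumes d_model is even."""
    vec = []
    for i in range(0, d_model, 2):
        # Each sine/cosine pair uses a progressively lower frequency.
        angle = position / base ** (i / d_model)
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec

# Hypothetical 4-dimensional token embedding for the token at position 3.
token_embedding = [0.5, -0.1, 0.3, 0.9]
position_signal = sinusoidal_encoding(3, d_model=4)

# The encoding is added element-wise to the token embedding.
model_input = [t + p for t, p in zip(token_embedding, position_signal)]
```

Because the function is a closed-form formula, it can be evaluated at any position, including ones longer than any training sequence.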

The alternative is a learned positional embedding: a trainable lookup table with one vector per position, up to a fixed maximum length. Models like the original BERT and GPT-2 used this. It can fit the training distribution well but does not naturally extend beyond the maximum length it was trained on, because a position past the end of the table simply has no vector.
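
The length limit is easy to see in a toy version. The class below is a hypothetical sketch, not any library's API: the table is filled with random numbers that stand in for trained values, and the sizes are arbitrary.

```python
import random

class LearnedPositionalEmbedding:
    """Toy lookup table: one vector per position, up to max_len."""

    def __init__(self, max_len, d_model, seed=0):
        rng = random.Random(seed)
        # Random initialisation stands in for values that training would learn.
        self.table = [
            [rng.gauss(0.0, 0.02) for _ in range(d_model)]
            for _ in range(max_len)
        ]

    def __call__(self, position):
        if not 0 <= position < len(self.table):
            raise IndexError(f"no embedding for position {position}")
        return self.table[position]

pos_emb = LearnedPositionalEmbedding(max_len=512, d_model=8)
last_vector = pos_emb(511)  # the final row of the table

try:
    pos_emb(512)  # one past the end: there is nothing to look up
    out_of_range_ok = True
except IndexError:
    out_of_range_ok = False
```

Extending such a model to longer inputs means adding rows to the table and training them, which is the practical meaning of "does not naturally extend".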

Relative and rotary methods: RoPE and ALiBi

Modern long-context models often prefer methods that encode relative position. RoPE (Rotary Position Embedding) rotates pairs of dimensions in the query and key vectors by angles proportional to each token’s absolute position; when the dot product is taken, the rotations partly cancel and the result depends only on the difference between the two positions. This gives clean relative-distance behaviour and is used by Llama, Mistral, and many other recent models.
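
The cancellation can be checked numerically. The sketch below rotates consecutive pairs of dimensions, which is one common convention (implementations differ in which dimensions they pair), and uses arbitrary example vectors. It scores the same query and key at two different absolute positions that share the same offset.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of vec by position-dependent angles; len(vec) must be even."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        # Lower-indexed pairs rotate faster, higher-indexed pairs more slowly.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Arbitrary example vectors, not taken from any model.
q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.7, 0.9]

score_near = dot(rope(q, 5), rope(k, 2))       # positions 5 and 2, offset 3
score_far = dot(rope(q, 105), rope(k, 102))    # positions 105 and 102, offset 3
score_other = dot(rope(q, 5), rope(k, 4))      # positions 5 and 4, offset 1

same_offset_same_score = math.isclose(score_near, score_far, abs_tol=1e-9)
```

The first two scores agree up to floating-point rounding because only the offset of 3 matters, while changing the offset changes the score.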

ALiBi (Attention with Linear Biases) skips positional embeddings entirely and instead subtracts a penalty from each attention score that grows linearly with the distance between the query and key tokens, with a different slope for each attention head. Its main advantage is length extrapolation: it tends to handle sequences longer than those seen during training, making it attractive for very long context windows.
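
A minimal sketch of the bias follows. The slopes use the geometric sequence commonly given for a power-of-two number of heads, and the bias is folded together with a causal mask (future positions set to negative infinity); that combination is an assumption made here for compactness. Function names are illustrative.

```python
def alibi_slopes(num_heads):
    """Per-head slopes; assumes num_heads is a power of two."""
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """bias[i][j] = -slope * (i - j) for past tokens, -inf for future ones."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]

slopes = alibi_slopes(8)               # 1/2, 1/4, ..., 1/256
bias = alibi_bias(4, slope=slopes[0])

# The bias is added to the raw query-key scores before the softmax, so a
# token three positions back is penalised by 3 * slope.
penalty_three_back = bias[3][0]        # -1.5 for slope 0.5
```

Since the penalty is a formula in the distance rather than a stored vector, it is defined for any sequence length.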

Why it matters

Positional encoding choices directly affect how well a model handles long documents, how gracefully it extrapolates beyond its trained context length, and how it captures relative versus absolute structure. When you read that a model supports a “1M token context” or uses “RoPE scaling,” you are reading about positional-encoding engineering. It is one of the quietest but most consequential design decisions in a transformer.
