Definition
Self-attention is the core mechanism of the transformer, the architecture behind virtually every modern large language model. It lets each token in a sequence look at — or “attend to” — every other token in the same sequence and decide how much weight to give each one when building its updated representation. The “self” in the name signals that the queries, keys, and values all come from a single input sequence, so the sequence is attending to itself. This is what allows a model to resolve a pronoun, connect a verb to its subject, or pull a relevant fact from earlier in a paragraph.
Self-attention vs cross-attention
It helps to contrast self-attention with general (cross) attention. In cross-attention, the queries come from one sequence and the keys and values from another — for example, a translation decoder attending to the encoded source sentence. In self-attention, the queries, keys, and values are all projected from the same set of tokens, so each token relates to its neighbours within one sequence. Self-attention is what gives transformers their flexible, content-based sense of context.
Queries, keys, and values
Each token is linearly projected into three vectors:
- Query (Q) — what this token is looking for.
- Key (K) — what this token offers to others.
- Value (V) — the information this token will contribute if attended to.
The model compares every query with every key via a scaled dot product to produce attention scores, applies a softmax so the scores sum to one, and then takes a weighted sum of the values. The weights say “how much of each other token’s information should flow into this one.” A useful analogy: queries and keys are like a search — matching what you want against what is available — and values are the content you retrieve.
Multi-head attention
A single attention computation captures one kind of relationship. Transformers run several in parallel, called multi-head attention. Each head learns to focus on a different pattern — one might track syntax, another long-range dependencies, another local word order. Their outputs are concatenated and combined, giving the model a richer, multi-faceted view of how tokens relate.
The quadratic cost
Self-attention’s power comes from comparing every token to every other token, which means its compute and memory cost scales with the square of the sequence length. Doubling the context roughly quadruples the work. This quadratic bottleneck is the central reason long-context models are hard to build, and it motivates research into sparse attention, sliding windows, and other efficient attention variants.
Why it matters
Self-attention is the single idea that replaced recurrent networks and enabled the LLM era. Because it processes all tokens in parallel and models arbitrary long-range relationships directly, it trains efficiently on huge datasets and captures context that earlier architectures missed. Understanding self-attention is understanding the engine inside every transformer.