Definition
Attention is the mechanism that lets a neural network weigh how much each part of the input matters when processing a given element. Rather than treating every token in a sentence as equally relevant, an attention layer learns, for each token, a set of weights over all the other tokens — focusing computation on what is contextually important. It is the core operation inside the transformer architecture and the reason modern language models can keep track of meaning across long stretches of text.
The query-key-value formulation
Attention is usually described with three learned projections of each token. The query represents what the current token is “looking for.” Each token also exposes a key, representing what it can offer, and a value, the actual content to be retrieved. To process a token, the model compares its query against every key to produce a similarity score, normalises those scores into weights (via softmax), and then takes a weighted sum of the values. The result is a new representation of that token that has absorbed information from the tokens most relevant to it. Done for every token at once, this is a fast, parallel matrix operation.
Self-attention vs cross-attention
Self-attention applies this process within a single sequence: every token attends to every other token in the same input, building a context-aware understanding of the sentence — letting the word “it” pull meaning from the noun it refers to, for instance. Cross-attention instead connects two different sequences: a decoder generating a translation can attend to the encoded source sentence, linking the output it is producing to the input it is reading. Self-attention builds internal context; cross-attention bridges separate information streams.
Multi-head attention
A single attention pass captures one kind of relationship. Transformers run several in parallel — multi-head attention — where each “head” learns its own query, key, and value projections and can specialise in a different pattern: one head might track grammatical subjects, another long-range references, another local word order. Their outputs are combined, giving the model a richer, multi-faceted view of how tokens relate. This redundancy and specialisation is a large part of why transformers are so capable.
Why it mattered
Before attention, sequence models were mostly recurrent, processing tokens one at a time and passing a running state forward. That made them slow and poor at linking words far apart, since information had to survive many sequential steps. Attention compares all tokens directly and in parallel, capturing long-range dependencies in a single operation and mapping cleanly onto the parallel hardware that trains large models. The 2017 paper “Attention Is All You Need” showed that attention alone — no recurrence — could outperform prior approaches, and the transformers it introduced became the foundation of virtually every modern large language model.