Definition
A transformer is a neural network architecture, introduced in the 2017 paper “Attention Is All You Need”, that processes all the tokens of a sequence in parallel using a mechanism called self-attention. It has largely replaced recurrent and convolutional designs for language tasks and underpins nearly all modern large language models, including GPT, Claude, Gemini, and Llama.
Self-attention: the core idea
The defining feature of a transformer is self-attention. For each token, the model computes three vectors — a query, a key, and a value. It compares the query of one token against the keys of all tokens to produce attention weights, then uses those weights to blend the value vectors. The result is that every token’s representation is enriched with information from the tokens most relevant to it, no matter how far apart they sit in the sequence. Multi-head attention runs several of these attention operations in parallel, each able to focus on a different kind of relationship.
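The query, key, and value steps described above can be sketched in a few lines of plain Python. This is a minimal single-head illustration, not a production implementation. It assumes the scaled dot-product form from the original paper (dot products divided by the square root of the key dimension, then a softmax), which the text above does not spell out. The toy token vectors and identity projection matrices are hypothetical values chosen to keep the arithmetic easy to follow.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(vec, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [dot(vec, col) for col in zip(*matrix)]

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over token vectors."""
    queries = [project(t, w_q) for t in tokens]
    keys = [project(t, w_k) for t in tokens]
    values = [project(t, w_v) for t in tokens]
    scale = math.sqrt(len(keys[0]))

    outputs, all_weights = [], []
    for q in queries:
        # Compare this token's query against every token's key.
        weights = softmax([dot(q, k) / scale for k in keys])
        # Blend the value vectors using the attention weights.
        blended = [sum(w * v[i] for w, v in zip(weights, values))
                   for i in range(len(values[0]))]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy tokens with 2-dimensional embeddings (hypothetical values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projections
outputs, weights = self_attention(tokens, identity, identity, identity)
```

Each row of `weights` sums to 1 and says how much one token draws on every other token. A multi-head version would run this routine several times with different projection matrices and concatenate the results.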
The components of a transformer block
A transformer is a stack of identical blocks, each containing:
- Multi-head self-attention — the mechanism described above.
- A feed-forward network — a small position-wise MLP applied to each token independently, adding non-linear processing capacity.
- Residual connections — shortcuts that add a layer’s input to its output, helping gradients flow through very deep stacks.
- Layer normalisation — stabilises the scale of activations so training stays well-behaved.
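As a rough sketch of how the four components listed above fit together, the plain-Python example below wires them into one block. Several choices here are assumptions rather than anything fixed by the text: layer normalisation is applied after each residual addition (the arrangement in the original paper; many later models normalise before each sub-layer instead), the learned scale and shift of layer normalisation are omitted, the MLP uses a ReLU, and `mean_attention` is a hypothetical stand-in for real self-attention so that the example stays short.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Rescale one token vector to zero mean and (roughly) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def feed_forward(vec, w1, w2):
    """Position-wise MLP: linear, ReLU, linear, applied to a single token."""
    hidden = [max(0.0, sum(x * w for x, w in zip(vec, col))) for col in zip(*w1)]
    return [sum(h * w for h, w in zip(hidden, col)) for col in zip(*w2)]

def transformer_block(tokens, attention, w1, w2):
    """One block: attention, then MLP, each wrapped in residual + norm."""
    # Sub-layer 1: attention mixes information across tokens.
    attended = attention(tokens)
    tokens = [layer_norm(add(t, a)) for t, a in zip(tokens, attended)]
    # Sub-layer 2: the MLP processes each token independently.
    return [layer_norm(add(t, feed_forward(t, w1, w2))) for t in tokens]

def mean_attention(tokens):
    """Hypothetical stand-in: every token attends equally to all tokens."""
    dim = len(tokens[0])
    avg = [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]
    return [avg for _ in tokens]

# Two toy tokens with 3-dimensional embeddings; weights are made-up values.
tokens = [[1.0, 2.0, 3.0], [0.0, -1.0, 1.0]]
w1 = [[1.0, 0.0, 0.0, 1.0],
      [0.0, 1.0, 0.0, -1.0],
      [0.0, 0.0, 1.0, 0.0]]   # 3 inputs -> 4 hidden units
w2 = [[1.0, 0.0, 0.0],
      [0.0, 1.0, 0.0],
      [0.0, 0.0, 1.0],
      [0.5, 0.5, 0.5]]        # 4 hidden units -> 3 outputs
out = transformer_block(tokens, mean_attention, w1, w2)
```

Because the block returns vectors of the same shape it receives, blocks can be stacked by feeding the output of one into the next.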
Because self-attention on its own has no built-in notion of order, positional encodings are added to the input embeddings so the model can tell where each token sits in the sequence.
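One way to build such encodings is the fixed sinusoidal scheme from the original paper, sketched below in plain Python. Treat the choice as an assumption: the text does not say which scheme is meant, and learned or rotary position embeddings are common alternatives. The embedding values are hypothetical, and the function assumes an even embedding dimension.

```python
import math

def positional_encoding(position, dim):
    """Sinusoidal encoding for one position; dim is assumed to be even."""
    encoding = []
    for i in range(0, dim, 2):
        # Each sine/cosine pair uses a different wavelength.
        angle = position / (10000 ** (i / dim))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding

def add_positions(embeddings):
    """Add a position-dependent vector to each token embedding."""
    dim = len(embeddings[0])
    return [[e + p for e, p in zip(emb, positional_encoding(pos, dim))]
            for pos, emb in enumerate(embeddings)]

# The same (hypothetical) embedding placed at positions 0 and 1.
same_token = [0.1, 0.2, 0.3, 0.4]
encoded = add_positions([same_token, same_token])
```

After the addition, the two copies of the token no longer have identical vectors, which gives attention a way to tell their positions apart.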
Why it matters
During training, transformers process every token of a sequence at once rather than one after another, which makes them highly parallelisable and well suited to GPUs at massive scale. Their attention mechanism captures long-range dependencies that older recurrent networks struggled with. This combination of scalability and expressiveness was a key enabler of the era of large language models, and most state-of-the-art AI systems today are transformer variants.