Multi-Head Attention (AI Glossary)

Running attention in parallel across subspaces for richer representations

Ad placeholder (leaderboard)

Definition

Multi-head attention is the core building block of the transformer. Instead of computing a single attention pattern over a sequence, it runs several attention operations in parallel — the “heads” — each with its own learned set of projections. The heads’ outputs are concatenated and passed through a final linear layer, producing a representation that captures many kinds of relationship at once rather than collapsing everything into a single weighted average.

Queries, keys, and values

Each attention head works on three projected versions of the input: queries, keys, and values. For every token, the head compares its query against all the keys to produce attention scores, scales and softmaxes them into weights, and uses those weights to take a weighted sum of the values. This is scaled dot-product attention. Crucially, every head learns its own projection matrices for Q, K, and V, so each head looks at the sequence through a different lens.

Why multiple heads matter

A single head must squeeze all the dependencies in a sentence — subject-verb agreement, coreference, positional cues — into one set of attention weights. Multiple heads divide this labour. Empirically, different heads specialise: some attend to the previous token, some to syntactic dependents, some to semantically related words far away. Running them in parallel and then combining gives the model a far richer joint representation than one head could.

Concatenate and project

After each head computes its output, the heads are concatenated along the feature dimension and multiplied by a learned output projection matrix. This mixes the per-head results back into a single vector of the model’s hidden size, ready for the next layer. The total compute is comparable to one large attention operation, because each head works in a smaller subspace — hidden size divided by the number of heads.

Practical considerations

The number of heads is a hyperparameter, typically chosen so the hidden size divides evenly (for example 96 heads in a large model). More heads is not always better: each head receives a thinner slice of the dimension, and studies have shown many heads can be pruned with little loss. Variants such as grouped-query attention and multi-query attention share keys and values across heads to cut memory and speed up inference, a common optimisation in modern LLMs.

Ad placeholder (rectangle)