Question 1

What is multi-head attention?

Accepted Answer

Multi-head attention runs several scaled dot-product attention operations in parallel, each with its own learned projections of the queries, keys, and values. Their outputs are concatenated and passed through a final linear layer, letting the model attend to several types of relationship at once.
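
To make the steps concrete, here is a minimal NumPy sketch of self-attention with several heads. It assumes the common layout where the hidden size is split evenly across heads. The names (multi_head_attention, w_q, w_k, w_v, w_o) are illustrative rather than any library's API, and it leaves out masking, biases, and batching.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention over x of shape (seq_len, d_model).

    w_q, w_k, w_v, w_o are (d_model, d_model) matrices (illustrative names).
    Returns the output (seq_len, d_model) and the attention weights
    (num_heads, seq_len, seq_len).
    """
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    head_dim = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, head_dim)
        return t.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)

    # Each head sees its own slice of the projected queries, keys, and values.
    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Scaled dot-product attention, computed for all heads at once.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    weights = softmax(scores, axis=-1)
    per_head = weights @ v  # (num_heads, seq_len, head_dim)

    # Concatenate heads along the feature dimension, then project.
    concat = per_head.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights
```

Called with a (5, 16) input and num_heads=4, this returns an output of shape (5, 16) and weights of shape (4, 5, 5), one attention map per head.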

Question 2

Why use multiple heads instead of one?

Accepted Answer

A single attention head must blend all the relationships in a sequence into one weighted average per position. With multiple heads, each head attends in its own learned projection subspace, so heads can specialize: one might track syntax and another long-range dependencies. This gives the model a richer, more expressive view of the input.

Question 3

How are the heads combined?

Accepted Answer

Each head produces its own output vector for every position in the sequence. These vectors are concatenated along the feature dimension and multiplied by a learned output projection matrix, which mixes the heads back into a single representation of the model's hidden size.
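
One way to see what the output projection does: concatenating and then projecting gives the same result as multiplying each head by its own block of rows in the projection matrix and summing. The sketch below checks this with random stand-in values. The sizes and the names head_outputs and w_o are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_heads, seq_len, head_dim = 4, 3, 8
d_model = num_heads * head_dim

# Hypothetical per-head outputs: one (seq_len, head_dim) matrix per head.
head_outputs = [rng.normal(size=(seq_len, head_dim)) for _ in range(num_heads)]
# Stand-in for the learned output projection (random here).
w_o = rng.normal(size=(d_model, d_model))

# Concatenate along the feature dimension, then project.
concat = np.concatenate(head_outputs, axis=-1)  # (seq_len, d_model)
combined = concat @ w_o                          # (seq_len, d_model)

# Equivalent view: each head multiplies its own block of rows of w_o,
# and the per-head results are summed.
summed = sum(
    h @ w_o[i * head_dim:(i + 1) * head_dim, :]
    for i, h in enumerate(head_outputs)
)

print(combined.shape)                 # (3, 32)
print(np.allclose(combined, summed))  # True
```

So the projection is where information from different heads actually gets mixed. The concatenation only lays the heads side by side.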

Question 4

Does adding more heads always help?

Accepted Answer

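No, not indefinitely. For a fixed hidden size, each head typically gets an equal slice of the hidden dimension (hidden size divided by the number of heads), so every added head makes the others narrower. Beyond a point, heads become too narrow to be useful and some end up redundant. The number of heads is a hyperparameter tuned against model size and task.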
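
The trade-off is easy to see with a little arithmetic. This sketch assumes the common convention that the hidden size is divided evenly among heads. The hidden size of 512 and the helper name head_dim are illustrative choices.

```python
def head_dim(d_model: int, num_heads: int) -> int:
    """Per-head width, assuming head_dim = d_model / num_heads."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    return d_model // num_heads

d_model = 512  # illustrative hidden size
for h in (1, 8, 64, 256):
    print(f"{h:>3} heads -> {head_dim(d_model, h)} dimensions per head")
```

With 8 heads each head works in 64 dimensions. With 256 heads each is left with only 2, which is very little room to represent queries, keys, and values.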
