Question 1

What does attention do in a language model?

Accepted Answer

Attention lets the model decide, for each word it is processing, which other words in the text are most relevant and how heavily to weight each one. This is how it resolves what a pronoun refers to or connects words that sit far apart. Instead of passing information along one step at a time, as a recurrent network does, the model can draw directly on any position in its context and concentrate on the ones that matter.

Question 2

What are queries, keys, and values?

Accepted Answer

Each word's representation is projected, through learned weight matrices, into three vectors: a query, a key, and a value. The query represents what that word is looking for; each key represents what a word offers; each value carries the actual content. The model compares a query against all keys (a dot product) to get relevance scores, normalizes those scores with a softmax so they sum to 1, then uses them to take a weighted blend of the values. It is a learned, content-based lookup.
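The lookup described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular library implements it: the function names `softmax` and `attention` and the toy vectors are made up for the example, and the queries, keys, and values are hand-picked here, whereas a real model computes them with learned projections. Dividing the scores by the square root of the key dimension follows the scaled dot-product form used in 'Attention Is All You Need'.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors (hypothetical helper)."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Relevance of every key to this query, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Weighted blend of the value vectors.
        blended = [sum(w * v[j] for w, v in zip(weights, values))
                   for j in range(len(values[0]))]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Toy, hand-picked vectors (illustrative only, not learned).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
queries = [[4.0, 0.0]]

outputs, weights = attention(queries, keys, values)
```

The single query points along the first axis, so the first and third keys score equally and the second key gets almost no weight; the output is therefore dominated by the first and third values.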

Question 3

Why is it called multi-head attention?

Accepted Answer

Rather than computing attention once, the model runs several attention operations in parallel, called heads. Each head has its own learned query, key, and value projections, so it can learn to focus on a different kind of relationship, such as grammar, meaning, or position. The heads' outputs are concatenated and passed through a final learned projection. Multiple heads let the model capture many types of dependency at the same time.
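A rough sketch of the idea in plain Python follows. It rests on two loudly simplifying assumptions: each head simply takes its own slice of every input vector in place of learned per-head projections, and the final learned output projection is left out. The names `attend` and `multi_head_self_attention` are hypothetical, chosen for this example only.

```python
import math

def attend(queries, keys, values):
    """Single-head scaled dot-product attention (pure Python)."""
    d_k = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def multi_head_self_attention(x, num_heads):
    """Run attention once per head on a slice of each vector, then concatenate.

    Assumption: slicing stands in for the learned per-head projections,
    and the final learned output projection is omitted.
    """
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        start, stop = h * d_head, (h + 1) * d_head
        part = [row[start:stop] for row in x]  # this head's view of every word
        head_outputs.append(attend(part, part, part))
    # Concatenate the heads' results position by position.
    return [sum((head[i] for head in head_outputs), [])
            for i in range(len(x))]

# Three toy "word" vectors of width 4, split across two heads of width 2.
tokens = [[1.0, 0.0, 0.0, 2.0],
          [0.0, 1.0, 2.0, 0.0],
          [1.0, 1.0, 1.0, 1.0]]
mixed = multi_head_self_attention(tokens, num_heads=2)
```

Because each head only sees its own slice, the two heads compute different attention weights over the same three words, and the output keeps the original width of 4 after concatenation.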

Question 4

Why was 'Attention Is All You Need' such a breakthrough?

Accepted Answer

The 2017 paper introduced the Transformer and showed you could drop recurrence and convolutions entirely, building a powerful sequence model with attention as the only mechanism for relating one position to another. Because attention processes all positions in parallel rather than step by step, transformers train far faster on modern hardware, which made today's large language models practical to build.

The Attention Mechanism Explained (Plain English + Visuals)

The problem attention solves

Queries, keys, and values

Self-attention in practice

Multi-head attention

Why it changed everything