How Attention Mechanisms Work in Neural Networks

Why "attention" lets AI focus on what matters in a sentence

Ad placeholder (leaderboard)

Attention in one sentence

Attention is a mechanism that lets a model decide, for each word, which other words matter most when interpreting it. Instead of squeezing a whole sentence into one fixed summary, attention lets every token pull information directly from the tokens most relevant to it. This is the engine inside every Transformer, and the interactive below lets you pick a query word and watch the weights form.

Query, key, and value

Every token is projected into three vectors. The query asks a question (“what context do I need?”), each token’s key advertises what it can offer, and the value carries the actual information. To compute attention for a token, the model takes its query and compares it against the keys of all tokens using a dot product. A large dot product means high relevance. This QKV framing is the heart of the whole mechanism.

Scaled dot-product attention

The raw scores are divided by the square root of the key dimension — this is the scaling that prevents large dot products from saturating the next step. The scaled scores then pass through a softmax, which converts them into positive weights that sum to one. Finally, those weights are used to take a weighted average of all the value vectors. The output is a new representation of the token that blends in information from wherever it attended.

A worked example

Take “The cat sat because it was tired.” When the model processes “it,” its query scores highest against the key for “cat,” moderately against “tired,” and low against function words. After softmax, perhaps 0.7 of the weight lands on “cat.” The resulting vector for “it” is therefore mostly the value of “cat,” so the model has effectively resolved the pronoun. The explorer reproduces this kind of weighting on a sample sentence.

Multi-head attention

One set of QKV projections captures a single view of the relationships. Multi-head attention runs several of these in parallel, each with independent learned weights, so different heads can specialise — one on grammatical subject, one on long-range reference, one on adjacent words. Their outputs are concatenated and projected back down. Stacking many multi-head layers is what gives Transformers their depth of understanding, all built on this same simple weighting idea.

Ad placeholder (rectangle)