What are query, key, and value vectors?

Each token is projected into three vectors. The query represents what a token is looking for, the key represents what each token offers, and the value is the information that gets passed along. Comparing one token's query to all keys decides how much of each value to mix in.

Why is dot-product attention "scaled"?

Attention scores are dot products of queries and keys. For large vector dimensions these dots grow big and push the softmax into regions with tiny gradients. Dividing by the square root of the key dimension keeps the scores in a stable range, which is why it is called scaled dot-product attention.

What does softmax do in attention?

Softmax turns the raw similarity scores into a set of positive weights that sum to one. This makes them a clean probability-like distribution describing how much each token contributes, so the resulting blend of values is a proper weighted average.

What is the point of multiple heads?

A single attention head can only emphasise one type of relationship. Multi-head attention runs several heads in parallel, each with its own learned projections, so the model can simultaneously track syntax, coreference, and meaning, then concatenate the results.

How Attention Mechanisms Work in Neural Networks

Attention in one sentence

Attention is a mechanism that lets a model decide, for each word, which other words matter most when interpreting it. Instead of squeezing a whole sentence into one fixed summary, attention lets every token pull information directly from the tokens most relevant to it. This is the engine inside every Transformer, and the interactive below lets you pick a query word and watch the weights form.

Query, key, and value

Every token is projected into three vectors. The query asks a question (“what context do I need?”), each token’s key advertises what it can offer, and the value carries the actual information. To compute attention for a token, the model takes its query and compares it against the keys of all tokens using a dot product. A large dot product means high relevance. This QKV framing is the heart of the whole mechanism.

Scaled dot-product attention

The raw scores are divided by the square root of the key dimension — this is the scaling that prevents large dot products from saturating the next step. The scaled scores then pass through a softmax, which converts them into positive weights that sum to one. Finally, those weights are used to take a weighted average of all the value vectors. The output is a new representation of the token that blends in information from wherever it attended.

A worked example

Take “The cat sat because it was tired.” When the model processes “it,” its query scores highest against the key for “cat,” moderately against “tired,” and low against function words. After softmax, perhaps 0.7 of the weight lands on “cat.” The resulting vector for “it” is therefore mostly the value of “cat,” so the model has effectively resolved the pronoun. The explorer reproduces this kind of weighting on a sample sentence.

Multi-head attention

One set of QKV projections captures a single view of the relationships. Multi-head attention runs several of these in parallel, each with independent learned weights, so different heads can specialise — one on grammatical subject, one on long-range reference, one on adjacent words. Their outputs are concatenated and projected back down. Stacking many multi-head layers is what gives Transformers their depth of understanding, all built on this same simple weighting idea.