Question 1

What is attention in simple terms?

Accepted Answer

Attention lets a model decide which other words in the input are most relevant when processing a given word, and weight them accordingly. Instead of treating every word equally, it focuses on the ones that matter: for the word "it" in a sentence, attention can learn to look back at the noun it refers to.

Question 2

What are query, key, and value?

Accepted Answer

For each token the model computes three vectors: a query (what this token is looking for), a key (what this token offers to others), and a value (the content to retrieve from it). Each token's query is matched against every token's key to produce weights, and those weights blend the values into the token's new, context-aware representation.
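The matching-and-blending step can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular library implements it: the helper names (`softmax`, `dot`, `attend`) and the toy vectors are made up for the example. In a real model the queries, keys, and values come from learned projections of the token embeddings. Dividing the scores by the square root of the vector size and normalizing with softmax follow the common scaled dot-product formulation, which the answer above does not spell out.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Match one query against all keys, then blend the values by the resulting weights."""
    scale = math.sqrt(len(query))  # assumption: scaled dot-product scoring
    scores = [dot(query, k) / scale for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    blended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, blended

# Hypothetical hand-written vectors for three tokens (real ones are learned).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]  # the query of the token being processed

weights, output = attend(query, keys, values)
print(weights)  # tokens whose keys align with the query get larger weights
print(output)   # the weighted blend of the value vectors
```

Here the first and third keys score equally against the query and the second scores lowest, so the output leans toward the first and third value vectors.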

Question 3

What is the difference between self-attention and cross-attention?

Accepted Answer

Self-attention relates tokens within the same sequence to each other, which is how a model builds context for a sentence. Cross-attention relates one sequence to another: the queries come from one sequence and the keys and values from the other, as when a decoder attends to an encoder's output in translation. It connects two different streams of information.
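The distinction comes down to where the two inputs of the same operation come from. The sketch below assumes a hypothetical helper, `attention(query_seq, memory_seq)`, and as a simplification uses each vector directly as its own query, key, and value (real models apply learned projections first). The names `encoder_states` and `decoder_states` are illustrative only.

```python
import math

def attention(query_seq, memory_seq):
    """Let every vector in query_seq attend over every vector in memory_seq.

    Simplification: vectors act directly as queries, keys, and values.
    """
    outputs = []
    for q in query_seq:
        scale = math.sqrt(len(q))
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in memory_seq]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        dim = len(memory_seq[0])
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, memory_seq)) for i in range(dim)]
        )
    return outputs

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # source sentence, 3 tokens
decoder_states = [[0.5, 0.5], [1.0, 0.0]]              # target so far, 2 tokens

# Self-attention: queries and memory are the same sequence.
self_out = attention(encoder_states, encoder_states)

# Cross-attention: queries come from the decoder, memory from the encoder.
cross_out = attention(decoder_states, encoder_states)

print(len(self_out), len(cross_out))  # one output vector per query token
```

In both cases the output has one vector per query token, so cross-attention returns as many vectors as the decoder has tokens, each built from encoder content.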

Question 4

Why did attention replace recurrent networks?

Accepted Answer

Recurrent networks process tokens one after another, which makes them slow to train and poor at connecting distant words. Attention compares all tokens in parallel, so it captures long-range relationships directly and trains far more efficiently on modern hardware. That is the breakthrough that made large transformers practical.
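The difference in data dependencies can be shown with a toy sketch. Both functions are deliberately simplified and hypothetical: `recurrent_pass` is a scalar recurrence with made-up weights, and `pairwise_scores` stands in for attention's score matrix without projections or softmax. Plain Python runs both sequentially, so this shows which computations depend on each other rather than actual speed.

```python
import math

def recurrent_pass(inputs, w_h=0.5, w_x=1.0):
    """Toy scalar RNN: each state needs the previous state, so steps must run in order."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_h * h + w_x * x)  # depends on the h from the previous step
        states.append(h)
    return states

def pairwise_scores(inputs):
    """Toy attention scores: entry (i, j) depends only on inputs i and j.

    No entry waits on another, so hardware could compute them all at once.
    """
    return [[xi * xj for xj in inputs] for xi in inputs]

inputs = [0.1, -0.3, 0.7, 0.2]
states = recurrent_pass(inputs)
scores = pairwise_scores(inputs)

# In the recurrence, the first token reaches the last only through every
# intermediate state; in the score matrix they are compared directly.
print(states[-1])
print(scores[0][3])
```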

Attention (AI Glossary)

Definition

The query-key-value formulation

Self-attention vs cross-attention

Multi-head attention

Why it mattered