Question 1

What is self-attention?

Accepted Answer

Self-attention is a mechanism in which each token in a sequence computes how much weight to give every token in that same sequence, itself included. The queries, keys, and values are all derived from one input, so the sequence attends to itself, and every token can gather relevant context from the others.

Question 2

How is self-attention different from regular attention?

Accepted Answer

In general (cross) attention, the queries come from one sequence and the keys and values from another, for example a decoder attending to an encoder's output. In self-attention, all three come from the same sequence, so tokens relate to each other rather than to a separate source.
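To make the distinction concrete, here is a minimal sketch in plain Python. The function name `attend`, the toy vectors, and the variable names `encoder_states` and `decoder_states` are invented for illustration, and the learned query/key/value projections are left out so that the only thing that changes between the two calls is where the keys and values come from.

```python
import math

def attend(query_seq, context_seq):
    """Each query vector takes a softmax-weighted average of the context vectors.

    Simplification for this sketch: no learned projections, so the context
    vectors act as both keys and values.
    """
    dim = len(context_seq[0])
    scale = math.sqrt(dim)
    outputs = []
    for q in query_seq:
        # Similarity of this query to every context vector (scaled dot product).
        scores = [sum(a * b for a, b in zip(q, c)) / scale for c in context_seq]
        # Softmax, shifted by the max score for numerical stability.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the context vectors.
        outputs.append(
            [sum(w * c[d] for w, c in zip(weights, context_seq)) for d in range(dim)]
        )
    return outputs

# Hypothetical toy data: 4 source tokens and 2 target tokens, 2 dimensions each.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
decoder_states = [[1.0, 1.0], [0.0, 2.0]]

# Self-attention: queries, keys, and values all come from the same sequence.
self_out = attend(decoder_states, decoder_states)

# Cross-attention: queries from the decoder, keys and values from the encoder.
cross_out = attend(decoder_states, encoder_states)
```

In both calls the output has one vector per query token, so `self_out` and `cross_out` each contain two vectors here; what differs is which sequence those vectors are mixed from.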

Question 3

What are queries, keys, and values?

Accepted Answer

Each token is projected into three vectors. The query represents what a token is looking for, the key represents what a token offers, and the value is the information it passes on. Attention scores come from query-key similarity, typically a scaled dot product; a softmax turns the scores into weights, which are then used to take a weighted sum of the values.
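The steps in this answer can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a reference implementation: the helper names, the toy embeddings, and the identity projection matrices are made up for the example, and the score uses the common scaled dot-product form (dot product divided by the square root of the key dimension, followed by a softmax).

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(scores):
    """Turn a list of scores into non-negative weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over token embeddings x (one row per token)."""
    q = matmul(x, w_q)  # what each token is looking for
    k = matmul(x, w_k)  # what each token offers
    v = matmul(x, w_v)  # the information each token passes on
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)  # scores[i][j] = q_i . k_j
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    output = matmul(weights, v)  # weighted sum of values for each token
    return weights, output

# Hypothetical toy input: 3 tokens with 2-dimensional embeddings.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Identity projections keep the example easy to follow; real models learn these.
identity = [[1.0, 0.0], [0.0, 1.0]]

weights, output = self_attention(x, identity, identity, identity)
```

Each row of `weights` sums to 1 and says how strongly one token attends to every token in the sequence. With these toy values the first token weights itself and the third token equally, and the second token less, because its query `[1, 0]` has no overlap with the second token's key `[0, 1]`.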

Question 4

Why is self-attention computationally expensive?

Accepted Answer

Self-attention compares every token with every other token, so cost grows with the square of the sequence length. Doubling the context roughly quadruples the attention computation and memory, which is why long-context efficiency is an active area of research.
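A quick back-of-the-envelope calculation shows the scaling. The sketch below counts the query-key scores for a single attention head and estimates the memory needed if the full score matrix is stored; the function name and the 4 bytes per score (32-bit floats) are assumptions for illustration, and real implementations may avoid materializing the whole matrix.

```python
def score_matrix_cost(seq_len, bytes_per_score=4):
    """Return (number of query-key scores, bytes to store them) for one head.

    Assumes full attention where every token is scored against every token,
    and that the whole seq_len x seq_len matrix is held in memory.
    """
    pairs = seq_len * seq_len
    return pairs, pairs * bytes_per_score

for n in (1024, 2048, 4096):
    pairs, nbytes = score_matrix_cost(n)
    print(f"{n} tokens: {pairs:,} scores, {nbytes / 2**20:.0f} MiB")
```

Each doubling of the sequence length multiplies both numbers by four: under these assumptions the score matrix takes 4 MiB at 1,024 tokens, 16 MiB at 2,048, and 64 MiB at 4,096, per head and per layer.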

Self-Attention (AI Glossary)

Definition

Self-attention vs cross-attention

Queries, keys, and values

Multi-head attention

The quadratic cost

Why it matters