Question 1

What is 'Attention Is All You Need'?

Accepted Answer

It is a 2017 research paper by Ashish Vaswani and seven co-authors, most of them at Google, that introduced the transformer architecture. The title states its central claim: a powerful sequence model can be built from attention mechanisms alone, without the recurrent or convolutional layers that dominated sequence modelling at the time. It is one of the most influential AI papers ever published, and its architecture underpins virtually every modern large language model.

Question 2

Why was discarding recurrence such a big deal?

Accepted Answer

Earlier sequence models such as RNNs and LSTMs processed text one token at a time: each step depended on the hidden state left by the previous step, which made training slow and hard to parallelise on modern hardware. By relying only on attention, the transformer can process all tokens in a training sequence at once, with positional encodings added to the inputs so that word order is not lost. This made training dramatically faster on GPUs and allowed models to scale to the sizes we see today.
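To make the difference concrete, here is a minimal sketch in plain Python. It is not code from the paper: the function names `rnn_states` and `pairwise_scores`, the scalar weights, and the toy inputs are all assumptions chosen for illustration. The point is only the dependency structure: the recurrent loop cannot start step t until step t-1 has finished, while every attention-style score depends on the inputs alone.

```python
import math

def rnn_states(inputs, w_h=0.5, w_x=1.0):
    """Toy scalar RNN (hypothetical weights): each state needs the previous one."""
    h = 0.0
    states = []
    for x in inputs:
        # h on the right-hand side is the state from the previous step,
        # so these iterations must run strictly in order.
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

def pairwise_scores(inputs):
    """Toy attention-style scores: entry [i][j] depends only on inputs i and j."""
    # No entry reads any other entry, so all of them could be computed
    # at the same time (on a GPU this becomes one matrix multiplication).
    return [[a * b for b in inputs] for a in inputs]

inputs = [1.0, 2.0, 3.0]
print(rnn_states(inputs))
print(pairwise_scores(inputs))
```

In a real transformer the scores come from learned query and key projections rather than raw products, but the independence between entries is the same, and that independence is what lets the hardware work on the whole sequence at once.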

Question 3

What is self-attention in this paper?

Accepted Answer

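Self-attention lets each token in a sequence look at every token, itself included, and decide how much each one matters for its own representation. Each token is projected into query, key, and value vectors. The model scores a token's query against every key with a scaled dot product, turns those scores into weights with a softmax, and uses the weights to average the value vectors, doing this for the whole sequence at once. Multi-head attention runs several of these attention operations in parallel, each with its own learned projections, so the model can capture different kinds of relationships simultaneously.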
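The paper defines this operation as scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. The sketch below follows that formula for a single head using only the standard library. The helper names (`softmax`, `matvec`, `self_attention`), the identity projection matrices, and the three toy token vectors are assumptions made for readability; a real model learns its projections and runs the computation as batched matrix multiplications.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over token vectors."""
    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    d_k = len(keys[0])

    outputs, all_weights = [], []
    for q in queries:
        # Score this token's query against every key, its own included.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the weighted average of the value vectors.
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input: 3 tokens with 2 dimensions each; identity projections (an
# assumption for illustration) mean queries, keys and values equal the tokens.
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row holds one token's attention weights over the whole sequence. Multi-head attention would call `self_attention` several times with different projection matrices, concatenate the per-head outputs, and pass the result through one more learned linear layer.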

Question 4

Did the transformer immediately lead to ChatGPT?

Accepted Answer

Not directly. The 2017 paper introduced the architecture for machine translation, using an encoder and a decoder together. Researchers then scaled and adapted it: BERT used only the encoder for understanding tasks, and the GPT series used only the decoder for generation. It took roughly five more years of scaling, data, and alignment work before ChatGPT appeared in late 2022, but every chat assistant people use today traces back to this paper.

'Attention Is All You Need': The Paper That Changed AI
