What does "attention is all you need" mean?

It is the title of the 2017 Google paper that introduced the Transformer. The claim is that you can drop recurrence and convolutions entirely and rely on the attention mechanism alone to model relationships between words. That bet proved correct and now underpins nearly every large language model.

How is a Transformer different from an RNN?

An RNN reads text one token at a time, carrying a hidden state forward. That sequential dependency prevents parallel computation across positions, and information from distant tokens tends to fade as it passes through many steps. A Transformer processes the whole sequence in parallel and lets every token directly attend to every other token, so it trains far faster on modern hardware and captures long-range dependencies better.

What are positional encodings for?

Because attention looks at all tokens simultaneously, the raw model has no built-in sense of order. Positional encodings add a unique signal to each position so the model can tell "dog bites man" from "man bites dog." The original Transformer used fixed sine and cosine waves; modern ones often use learned or rotary positions.

Do GPT and BERT both use Transformers?

Yes, but different halves. BERT is an encoder-only Transformer built for understanding tasks like classification. GPT is a decoder-only Transformer built for generating text left to right. The original paper used the full encoder-decoder stack for translation.

What Is a Transformer? The Architecture Behind Modern AI

The architecture that changed everything

The Transformer is the neural network design behind virtually every modern large language model: GPT, Claude, Gemini, and BERT all descend from it. It was introduced in 2017 in a Google paper titled “Attention Is All You Need.” Before it, sequence models relied on recurrence (RNNs and LSTMs) that read text word by word. The Transformer threw that out and processed entire sequences at once using a mechanism called self-attention.

Self-attention: the core idea

Self-attention lets each word in a sentence look at every other word and decide how much each one matters for understanding it. When the model processes the word “it” in “the trophy didn’t fit in the suitcase because it was too big,” self-attention helps it link “it” to “trophy” rather than “suitcase.” Each token produces a query, a key, and a value vector; the query of one token is compared against the keys of all tokens to produce attention weights, which then blend the values together. This happens for every token in parallel.
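The query, key, and value steps described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a production implementation: the projection matrices w_q, w_k, and w_v are random stand-ins for weights a real model would learn, the dimensions are arbitrary, and the division by the square root of the key dimension is the scaling used in the original paper rather than something stated in the text above.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x:             (seq_len, d_model) token embeddings
    w_q, w_k, w_v: (d_model, d_k) projections (random here, learned in practice)
    """
    q = x @ w_q                          # one query vector per token
    k = x @ w_k                          # one key vector per token
    v = x @ w_v                          # one value vector per token
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # (seq_len, seq_len) query-key similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    output = weights @ v                 # weighted blend of value vectors
    return output, weights

# Toy inputs: 5 tokens, illustrative sizes only.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
```

Row i of weights shows how much token i attends to each token in the sequence, and all rows are computed in one matrix product, which is the parallelism the paragraph describes.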

Multi-head attention and depth

A single attention head produces one set of attention weights, so it can emphasise only one pattern of relationships at a time. Transformers use multi-head attention (several attention computations running side by side) so that different heads can specialise in different relationships, such as grammar, meaning, and reference. These attention layers are stacked many times, often dozens in large models, each followed by a small feed-forward network, with residual connections and layer normalisation keeping training stable. Depth is what lets the model build up increasingly abstract representations of the text.
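As a rough sketch of how these pieces fit together, the NumPy code below builds one block from multi-head attention, a feed-forward network, residual connections, and layer normalisation, then stacks three of them. Several things here are simplifying assumptions: the weights are random rather than trained, biases and the learned scale and shift of layer normalisation are omitted, the sizes are arbitrary, and normalisation is applied after each residual addition as in the original paper (many later models normalise before the sublayer instead).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalise each token vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_head_attention(x, p, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x @ p["w_q"])
    k = split_heads(x @ p["w_k"])
    v = split_heads(x @ p["w_v"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (heads, seq, d_head)
    # Concatenate the heads back into one vector per token, then mix them.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ p["w_o"]

def feed_forward(x, p):
    # Two linear maps with a ReLU in between, applied to each token separately.
    return np.maximum(0.0, x @ p["w_1"]) @ p["w_2"]

def transformer_block(x, p, num_heads):
    x = layer_norm(x + multi_head_attention(x, p, num_heads))  # residual + norm
    x = layer_norm(x + feed_forward(x, p))                     # residual + norm
    return x

def init_params(rng, d_model, d_ff):
    shapes = {
        "w_q": (d_model, d_model), "w_k": (d_model, d_model),
        "w_v": (d_model, d_model), "w_o": (d_model, d_model),
        "w_1": (d_model, d_ff), "w_2": (d_ff, d_model),
    }
    return {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}

rng = np.random.default_rng(0)
seq_len, d_model, d_ff, num_heads, num_layers = 5, 8, 16, 2, 3
x = rng.normal(size=(seq_len, d_model))
layers = [init_params(rng, d_model, d_ff) for _ in range(num_layers)]

out = x
for p in layers:  # stacking: each block consumes the previous block's output
    out = transformer_block(out, p, num_heads)
```

Because every block maps a (seq_len, d_model) array to another of the same shape, blocks can be stacked to any depth without changing the surrounding code.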

Why order needs encoding

Because attention treats the input as an unordered set, the Transformer needs a way to know word positions. Positional encodings inject this information by adding a distinct pattern to each token’s embedding based on where it sits in the sequence. The original paper used fixed sine and cosine waves; later models use learned embeddings or rotary positional encodings (RoPE). Without them, the model could not distinguish “dog bites man” from “man bites dog.”
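The fixed sine and cosine scheme can be written out directly. The sketch below follows the formula from the original paper, with even dimensions using sine and odd dimensions using cosine, and assumes an even d_model. The three-word vocabulary and the random embedding table are hypothetical, included only to show the effect on "dog bites man" versus "man bites dog".

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) array of fixed positional encodings.

    Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

# Hypothetical toy vocabulary and random embeddings, for illustration only.
d_model = 8
vocab = {"dog": 0, "bites": 1, "man": 2}
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(words, use_positions):
    x = embedding_table[[vocab[w] for w in words]]
    if use_positions:
        x = x + sinusoidal_positional_encoding(len(words), d_model)
    return x

a = embed(["dog", "bites", "man"], use_positions=False)
b = embed(["man", "bites", "dog"], use_positions=False)
# Without positions, "dog" has the same vector wherever it appears.
same_without_positions = np.allclose(a[0], b[2])

a_pos = embed(["dog", "bites", "man"], use_positions=True)
b_pos = embed(["man", "bites", "dog"], use_positions=True)
# With positions, "dog" at position 0 differs from "dog" at position 2.
same_with_positions = np.allclose(a_pos[0], b_pos[2])
```

Adding the encoding to the embedding, rather than concatenating it, keeps the vector size unchanged, so the attention layers need no modification.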

Encoder, decoder, or both

The original Transformer had two halves: an encoder that reads and understands input, and a decoder that generates output. Modern models specialise. Encoder-only models like BERT excel at understanding tasks such as classification and search. Decoder-only models like GPT and Claude generate text one token at a time and dominate today’s chat assistants. Many dedicated translation systems still use the full encoder-decoder pair. Understanding which half a model uses tells you what it is built to do.
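One concrete difference between the two halves is which positions each token is allowed to attend to. In the usual formulation, an encoder lets every token see the whole sequence, while a decoder applies a causal mask so that a token sees only itself and earlier tokens, which is what makes left-to-right generation possible. The sketch below illustrates that masking step on random attention scores; the scores and sizes are placeholders, not values from any real model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(scores, causal):
    """Turn raw (seq_len, seq_len) scores into attention weights.

    causal=False: encoder-style, every token attends to every token.
    causal=True:  decoder-style, token i attends only to tokens 0..i.
    """
    if causal:
        seq_len = scores.shape[-1]
        # True strictly above the diagonal marks "future" positions.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        # A score of -inf becomes a weight of exactly 0 after softmax.
        scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))  # placeholder query-key scores for 4 tokens

encoder_weights = attention_weights(scores, causal=False)
decoder_weights = attention_weights(scores, causal=True)
```

In decoder_weights, everything above the diagonal is zero, so the first token attends only to itself, while encoder_weights has non-zero entries everywhere.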