What Is a Transformer? The Architecture Behind Modern AI

"Attention is all you need"—decoded for curious non-researchers

Ad placeholder (leaderboard)

The architecture that changed everything

The Transformer is the neural network design behind virtually every modern large language model — GPT, Claude, Gemini, and BERT all descend from it. It was introduced in 2017 in a Google paper titled “Attention Is All You Need.” Before it, sequence models relied on recurrence (RNNs and LSTMs) that read text word by word. The Transformer threw that out and processed entire sequences at once using a mechanism called self-attention. The explorer below lets you step through the data flow one stage at a time.

Self-attention: the core idea

Self-attention lets each word in a sentence look at every other word and decide how much each one matters for understanding it. When the model processes the word “it” in “the trophy didn’t fit in the suitcase because it was too big,” self-attention helps it link “it” to “trophy” rather than “suitcase.” Each token produces a query, a key, and a value vector; the query of one token is compared against the keys of all tokens to produce attention weights, which then blend the values together. This happens for every token in parallel.

Multi-head attention and depth

A single attention pass captures one kind of relationship. Transformers use multi-head attention — several attention computations running side by side — so the model can simultaneously track grammar, meaning, and reference. These attention layers are stacked dozens of times, each followed by a small feed-forward network, with residual connections and layer normalisation keeping training stable. Depth is what lets the model build up increasingly abstract representations of the text.

Why order needs encoding

Because attention treats the input as an unordered set, the Transformer needs a way to know word positions. Positional encodings inject this information by adding a distinct pattern to each token’s embedding based on where it sits in the sequence. The original paper used fixed sine and cosine waves; later models use learned embeddings or rotary positional encodings (RoPE). Without them, the model could not distinguish “dog bites man” from “man bites dog.”

Encoder, decoder, or both

The original Transformer had two halves: an encoder that reads and understands input, and a decoder that generates output. Modern models specialise. Encoder-only models like BERT excel at understanding tasks such as classification and search. Decoder-only models like GPT and Claude generate text one token at a time and dominate today’s chat assistants. Translation systems still use the full encoder-decoder pair. Understanding which half a model uses tells you what it is built to do.

Ad placeholder (rectangle)