Why the transformer matters
Almost every large language model in use today (GPT, Claude, Gemini, Llama) is built on the transformer, an architecture introduced in the 2017 paper “Attention Is All You Need”. Before it, the dominant approach to processing language was the recurrent neural network (RNN) and variants such as the LSTM, which read text one token at a time and carried a hidden state forward. That sequential design made them slow to train, because each step has to wait for the previous one to finish, and poor at connecting words far apart in a sentence. The transformer’s breakthrough was to drop recurrence entirely and let the model look at the whole sequence at once through a mechanism called self-attention. That single change unlocked massively parallel training on GPUs and far better handling of long-range context, the two ingredients that made today’s huge models possible.
Self-attention: the core idea
Self-attention is how a transformer decides which words matter to which. For each token, the model creates three vectors by applying learned linear projections to the token’s embedding: a query, a key, and a value. To build a new representation of a given token, its query is compared against the key of every token in the sequence, typically with a dot product scaled by the square root of the key dimension, producing a set of attention scores. Those scores are normalised with a softmax into weights that are positive and sum to one, and the weights are used to take a weighted sum of all the value vectors. In plain terms: every word asks “which other words should I pay attention to?” and pulls in information from them accordingly. This is why a transformer can correctly link “it” to the right noun several words back, or work out from the surrounding context that “bank” means a riverbank rather than a financial institution.
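The query, key, and value steps described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation. The function names are hypothetical, and the toy query, key, and value vectors are hard-coded, whereas a real model would produce them from learned projections. The scaling by the square root of the key dimension follows the usual scaled dot-product formulation.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Return (outputs, weights): one blended value vector per token,
    plus the attention weights that produced it."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Compare this token's query against every key in the sequence.
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Blend the value vectors using the attention weights.
        blended = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input: three tokens, two dimensions each (values are made up).
toy_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
toy_v = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
toy_out, toy_weights = self_attention(toy_q, toy_k, toy_v)
for row in toy_weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one token attends to every token in the sequence. The first token's query lines up best with the first key, so that row puts its largest weight in the first column.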
Multi-head attention, positional encoding, and the feed-forward layer
A single attention operation captures one kind of relationship, so transformers use multi-head attention: several attention computations running in parallel, each with its own learned projections and free to focus on a different pattern. Their outputs are concatenated and projected back into a single representation. Because attention on its own is indifferent to the order of its inputs, the model needs a way to know word order; this comes from positional encoding, extra signals added to each token’s embedding that encode its position in the sequence. After attention, each token passes independently through a position-wise feed-forward network, and both sub-layers are wrapped in residual connections and layer normalisation that keep training stable. Stack dozens of these attention-plus-feed-forward blocks and you have the depth that gives large models their capability.
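Two of the pieces above are easy to sketch in plain Python: positional encoding, and the split-and-concatenate step of multi-head attention. The encoding shown is the sinusoidal scheme from the original paper; many later models use learned or rotary position signals instead. The multi-head function is deliberately simplified. It gives each head a slice of the embedding and omits the learned per-head projections and the final output projection, so it illustrates the data flow rather than a trained layer. All function names are hypothetical.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: even dimensions use sine, odd ones cosine."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each sine/cosine pair shares one frequency.
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings):
    """Add the position signal to each token embedding, element by element."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(tok, row)] for tok, row in zip(embeddings, pe)]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

def multi_head_attention(x, num_heads):
    """Simplified multi-head attention. Assumption: no learned projections,
    each head simply attends over its own slice of the embedding."""
    d_model = len(x[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    per_head = []
    for h in range(num_heads):
        part = [tok[h * d_head:(h + 1) * d_head] for tok in x]
        per_head.append(attention(part, part, part))
    # Concatenate the head outputs back into one vector per token.
    merged = []
    for t in range(len(x)):
        row = []
        for head in per_head:
            row.extend(head[t])
        merged.append(row)
    return merged

# Toy usage: three tokens with four-dimensional embeddings (made-up values).
tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
mixed = multi_head_attention(add_positions(tokens), num_heads=2)
print(len(mixed), len(mixed[0]))
```

The output has the same shape as the input, which is what allows these blocks to be stacked one after another.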
Encoder, decoder, and the GPT/BERT split
The original transformer had two halves: an encoder that reads and understands an input sequence, and a decoder that generates an output sequence while attending to the encoder’s output, a design built for translation. Modern models usually keep just one half. BERT uses only the encoder and is trained to predict masked-out tokens using context on both sides, which makes it strong for classification, search, and embeddings. GPT uses only the decoder with causal (masked) attention, so each token can only see itself and earlier tokens; it is trained to predict the next token, which is exactly what open-ended text generation needs. Both families share the same fundamental machinery of attention plus feed-forward layers, which is why understanding the transformer once explains the architecture behind nearly every LLM you will encounter.
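The difference between bidirectional and causal attention comes down to which scores each token is allowed to use. The sketch below shows this with a hypothetical helper, attention_weights, applied to a made-up score matrix. Real implementations usually get the same effect by setting the masked scores to negative infinity before the softmax; here the hidden positions are simply left out and given a weight of zero.

```python
import math

def attention_weights(scores, causal):
    """Softmax each row of a square score matrix.

    With causal=True, row i only uses columns 0..i (the token itself and
    earlier tokens); later columns receive a weight of exactly 0.
    """
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1] if causal else row
        top = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in visible]
        total = sum(exps)
        hidden = [0.0] * (len(row) - len(visible))
        weights.append([e / total for e in exps] + hidden)
    return weights

# Toy scores for a three-token sequence (all equal, purely illustrative).
toy_scores = [[0.0, 0.0, 0.0] for _ in range(3)]

encoder_style = attention_weights(toy_scores, causal=False)  # every token sees all
decoder_style = attention_weights(toy_scores, causal=True)   # no looking ahead

for row in decoder_style:
    print([round(w, 3) for w in row])
```

In the causal case the weights form a lower-triangular pattern: the first token can attend only to itself, the second to the first two, and so on. That is what lets a decoder-only model be trained on next-token prediction without seeing the answer.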