GPT Architecture Explained: How OpenAI's Models Are Built

Decoder-only transformers, causal masking, and the design choices behind GPT

The one-sentence answer

GPT is a decoder-only transformer trained to predict the next token of text. It takes the tokens seen so far, runs them through a deep stack of structurally identical transformer blocks, and outputs a probability distribution over the next token. Sampling from that distribution repeatedly, appending each chosen token to the input, is how it writes paragraphs, code, and answers. Everything else in the architecture exists to make that next-token prediction accurate and to keep a very deep network trainable at scale.
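The generation loop described above can be sketched in a few lines of Python. This is a minimal illustration, not how any production model is implemented: the five-word `VOCAB` and the `toy_logits` function are hypothetical stand-ins for a real tokenizer and a real transformer stack.

```python
import math
import random

# Hypothetical 5-token vocabulary, for illustration only.
VOCAB = ["the", "cat", "sat", "down", "."]

def toy_logits(context):
    """Stand-in for the transformer stack: one score per vocabulary entry.

    A real GPT computes these from the whole context; this toy just
    favours the token that follows the last one in VOCAB order.
    """
    favoured = (context[-1] + 1) % len(VOCAB)
    return [4.0 if i == favoured else 0.0 for i in range(len(VOCAB))]

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt, n_new, rng):
    """Sample one token at a time, feeding each choice back in as context."""
    tokens = list(prompt)
    for _ in range(n_new):
        probs = softmax(toy_logits(tokens))
        next_id = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
        tokens.append(next_id)
    return tokens

out = generate([0], n_new=4, rng=random.Random(0))
print(" ".join(VOCAB[i] for i in out))
```

The only part that resembles a real system is the shape of the loop: score every candidate token, convert the scores to probabilities, sample one, append it, and repeat.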

Decoder-only, and why

The original 2017 transformer had two stacks: an encoder to read the input and a decoder to produce the output, used together for tasks like translation. GPT drops the encoder, along with the cross-attention layers that connected the decoder to it, and keeps only the decoder's masked self-attention and feed-forward layers. For an open-ended generator that just continues text, a single left-to-right stack is all you need, and dropping the encoder makes the design simpler and easier to scale to hundreds of billions of parameters. This decoder-only choice is the defining architectural decision of the GPT family.

Causal attention masking

Inside each block, self-attention lets every token look at other tokens to gather context. GPT applies a causal mask so that a token can attend only to itself and to earlier positions, never to later ones. This matters because the model is trained to predict the next token: if it could peek at future tokens, it would simply copy the answer. The mask enforces honest left-to-right prediction while still letting every position in a sequence be trained in parallel in a single pass, which is what makes large-scale training feasible.
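A small sketch of the masking step may help. It assumes the attention scores have already been computed as a square table, where `scores[i][j]` says how strongly position i wants to attend to position j. The function name and the all-zero example scores are illustrative, not taken from any GPT codebase.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], with future positions j > i masked out.

    scores[i][j] is how strongly position i wants to attend to position j.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may see itself and earlier positions only.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        peak = max(masked)
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) == 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical raw scores for a 4-token sequence (all equal, for clarity).
scores = [[0.0] * 4 for _ in range(4)]
for row in causal_attention_weights(scores):
    print([round(w, 2) for w in row])
```

With equal scores, row i spreads its weight evenly over positions 0 to i and gives exactly zero to every later position. All four rows are computed from the same input in one pass, which is the parallelism the mask preserves during training.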

Stacking blocks: residuals and normalisation

A GPT model is dozens of structurally identical blocks stacked on top of each other, each with its own learned weights. Each block has a multi-head self-attention layer and a feed-forward network. Two ingredients keep this deep stack trainable. Residual connections add each layer’s input back to its output, so gradients have a direct path through many layers and are far less likely to vanish. Layer normalisation rescales each token’s activations to a consistent mean and variance, which keeps their statistics stable from layer to layer. Without these, deep transformers would be nearly impossible to train. They are the quiet machinery that lets GPT go so deep.
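The wiring of a single block can be sketched as follows. Two assumptions are worth flagging: the sketch uses the pre-norm arrangement (normalise, apply the sublayer, then add the input back), which is the ordering GPT-2 adopted, and the two sublayers are hypothetical toy functions acting on a single vector rather than real attention and feed-forward layers. Learned gain and bias in the layer norm are omitted.

```python
import math

def layer_norm(x, eps=1e-5):
    """Rescale a vector to zero mean and unit variance (learned gain/bias omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual(x, sublayer):
    """Pre-norm residual step: x + sublayer(layer_norm(x))."""
    update = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, update)]

def block(x, attention, feed_forward):
    """One transformer block: attention sublayer, then feed-forward sublayer."""
    x = residual(x, attention)
    x = residual(x, feed_forward)
    return x

# Hypothetical stand-ins for the two sublayers, acting on a single vector.
def toy_attention(h):
    return [0.1 * v for v in h]

def toy_feed_forward(h):
    return [max(0.0, v) for v in h]

x = [1.0, 2.0, 3.0, 4.0]
for _ in range(24):  # stack the same block structure 24 times
    x = block(x, toy_attention, toy_feed_forward)
print([round(v, 3) for v in x])
```

Because each sublayer only adds an update to `x`, a sublayer that outputs zeros leaves its input untouched. That identity path is what gives gradients a direct route through the whole stack.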

From architecture to capability

The same architecture, scaled up with more layers, wider hidden dimensions, more attention heads, and far more training data, is most of what separates GPT-2 from its much larger successors, although OpenAI has not published full architectural details for its newest models such as GPT-4o. The blueprint barely changes; the scale and training refinements do. After pre-training on next-token prediction, GPT models are aligned with techniques like RLHF (reinforcement learning from human feedback) so that they follow instructions and behave more safely. Understanding the decoder-only core makes it clear why GPT is so general and why simply making it bigger and better-trained has driven most of its progress.
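To make "scaled up" concrete, here is a rough back-of-the-envelope parameter count for a GPT-style decoder. It rests on several simplifying assumptions that do not come from the text above: a feed-forward layer four times wider than the hidden dimension, learned position embeddings, an output layer that shares weights with the token embedding, and no biases or layer-norm parameters. The example numbers are the publicly reported configuration of the smallest GPT-2 model.

```python
def approx_gpt_params(n_layers, d_model, vocab_size, context_len):
    """Rough parameter count for a GPT-style decoder, ignoring biases and layer norms."""
    attention = 4 * d_model * d_model      # query, key, value, and output projections
    feed_forward = 8 * d_model * d_model   # two matrices with a 4x-wide hidden layer
    per_block = attention + feed_forward   # 12 * d_model^2
    embeddings = (vocab_size + context_len) * d_model  # token + learned position tables
    return n_layers * per_block + embeddings

small = approx_gpt_params(n_layers=12, d_model=768, vocab_size=50257, context_len=1024)
print(f"{small:,}")
```

The estimate lands at roughly 124 million parameters, in line with the commonly quoted size of the smallest GPT-2. Since the per-block cost grows with the square of `d_model` and linearly with `n_layers`, widening and deepening the same blueprint is enough to reach much larger models.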
