Question 1

What does decoder-only mean in GPT?

Accepted Answer

The original transformer had two halves: an encoder that reads the input and a decoder that generates the output. GPT keeps only the decoder stack and drops the cross-attention sublayer that connected the decoder to the encoder. It processes text left to right, predicting each next token from everything before it, which is exactly what a generative chatbot needs. With no encoder, the model is a single stack of identical blocks trained on one objective, which is simpler to build and has scaled well to very large sizes.
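The left-to-right loop described above can be sketched in a few lines. This is a minimal illustration, not OpenAI's implementation: `generate`, `toy_next_token`, and `eos_id` are hypothetical names, and the toy function merely stands in for a real forward pass through the decoder stack.

```python
from typing import Callable, List


def generate(next_token: Callable[[List[int]], int],
             prompt: List[int],
             max_new_tokens: int,
             eos_id: int = -1) -> List[int]:
    """Autoregressive decoding: append one predicted token at a time."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        # The model only ever sees the tokens that already exist (the prefix).
        new_id = next_token(tokens)
        if new_id == eos_id:
            break  # stop early on the (assumed) end-of-sequence id
        tokens.append(new_id)
    return tokens


def toy_next_token(prefix: List[int]) -> int:
    """Hypothetical stand-in for a GPT forward pass: counts upward modulo 10."""
    return (prefix[-1] + 1) % 10


print(generate(toy_next_token, [7, 8], max_new_tokens=4))  # [7, 8, 9, 0, 1, 2]
```

Each new token is fed back in as part of the next prefix, which is why generation is sequential even though training is not.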

Question 2

What is causal attention masking?

Accepted Answer

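Causal masking stops each position from attending to tokens that come after it. In practice, the attention scores for future positions are set to negative infinity before the softmax, so their attention weights become zero. During training the model sees a whole sequence at once, so the mask forces every position to predict the next token using only the tokens up to and including its own position, never future ones. This lets all positions be trained in parallel while preserving the left-to-right generation behaviour used at inference.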
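A minimal sketch of the mask applied to a square matrix of attention scores, in plain Python with no libraries. The function name `causal_attention_weights` is illustrative, and real implementations work on batched tensors rather than nested lists.

```python
import math
from typing import List


def causal_attention_weights(scores: List[List[float]]) -> List[List[float]]:
    """Row-wise softmax over scores, with future positions masked out."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may attend to positions 0..i only; later ones get -inf.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        row_max = max(masked)  # finite, because j == i is always allowed
        exps = [math.exp(s - row_max) for s in masked]  # exp(-inf) == 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# With equal scores, each row spreads its weight evenly over the visible prefix.
scores = [[0.0] * 4 for _ in range(4)]
for row in causal_attention_weights(scores):
    print([round(w, 2) for w in row])
```

The resulting weight matrix is lower-triangular: row 0 puts all its weight on token 0, and the last row is the only one that sees the whole sequence.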

Question 3

Why does GPT use residual connections and layer normalisation?

Accepted Answer

Residual connections add a layer's input back to its output, giving gradients a direct path through dozens of layers so very deep networks can train stably. Layer normalisation rescales each token's activation vector to zero mean and unit variance, then applies a learned scale and shift, which keeps activation magnitudes steady across layers. Together they are what make stacking many transformer blocks, the core of scaling GPT, actually trainable.
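The two pieces can be sketched together as follows. Some assumptions are baked in: the normalisation is placed before the sublayer (the pre-norm arrangement used from GPT-2 onward, whereas the original transformer normalised after the residual add), the learned scale and shift are omitted for brevity, and `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward network.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalise one token's activations to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Pre-norm residual step: x + sublayer(layer_norm(x))."""
    update = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, update)]


def toy_sublayer(h: Vector) -> Vector:
    """Hypothetical stand-in for attention or the feed-forward network."""
    return [0.1 * v for v in h]


x = [1.0, 2.0, 3.0, 4.0]
for _ in range(48):  # 48 stacked blocks, an arbitrary illustrative depth
    x = residual_block(x, toy_sublayer)
print([round(v, 3) for v in x])
```

Because each block adds a bounded, normalised update to its input rather than replacing it, the signal passes through all 48 toy blocks without vanishing or blowing up.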

Question 4

Is GPT a transformer?

Accepted Answer

Yes. GPT stands for Generative Pre-trained Transformer, and it is a transformer that uses only the decoder portion of the original 2017 architecture. The published designs, from the original 2018 GPT through GPT-3, are stacks of decoder transformer blocks trained to predict the next token, differing mainly in size, data, and training refinements. OpenAI has not released full architectural details for GPT-4 and GPT-4o.

GPT Architecture Explained: How OpenAI's Models Are Built

The one-sentence answer

Decoder-only, and why

Causal attention masking

Stacking blocks: residuals and normalisation

From architecture to capability