Question 1

What problem did the transformer solve that earlier models could not?

Accepted Answer

Earlier sequence models like RNNs and LSTMs processed text one token at a time, which made them slow to train and bad at relating distant words. The transformer replaced recurrence with self-attention, letting every token look at every other token in parallel. That made training far more parallelisable on GPUs and dramatically improved how well models handle long-range relationships in text.

Question 2

What is self-attention in simple terms?

Accepted Answer

Self-attention lets each token decide how much to focus on every other token when building its own representation. Each token produces a query, a key, and a value vector; the query is compared against all keys to produce attention weights, and those weights blend the values together. The result is a representation of each word that is informed by the words most relevant to it in context.

Question 3

Why is it called multi-head attention?

Accepted Answer

Instead of computing attention once, a transformer runs several attention operations in parallel, each called a head. Different heads can learn to track different relationships — one might follow grammatical subjects, another might link pronouns to their referents. Their outputs are concatenated and combined, giving the model multiple complementary views of how tokens relate.

Question 4

How do GPT and BERT both come from the transformer?

Accepted Answer

The original 2017 transformer had an encoder and a decoder. BERT keeps only the encoder and is trained to understand text bidirectionally, which suits classification and search. GPT keeps only the decoder and is trained to predict the next token left-to-right, which suits open-ended generation. Both reuse the same core building block — attention plus feed-forward layers — just wired and trained differently.

Transformer Architecture Explained: The Model Behind All LLMs

Why the transformer matters

Self-attention: the core idea

Multi-head attention, positional encoding, and the feed-forward layer

Encoder, decoder, and the GPT/BERT split