Question 1

What problem did transformers solve that older models could not?

Accepted Answer

Earlier sequence models like RNNs and LSTMs processed words one at a time, which made them slow to train and bad at connecting distant words. Transformers process the whole sequence in parallel and let any word directly attend to any other, capturing long-range relationships and training far faster on modern hardware.

Question 2

What is self-attention in plain terms?

Accepted Answer

Self-attention lets each word look at every other word in the sentence and decide how much each one matters for understanding it. For the word "it" in a sentence, attention figures out which earlier noun it refers to by weighting that noun heavily. Every word builds a context-aware representation this way.

Question 3

Why do transformers need positional encoding?

Accepted Answer

Because attention looks at all words at once, the model has no built-in sense of order — "dog bites man" and "man bites dog" would look identical. Positional encoding adds a position signal to each token's representation so the model knows where each word sits in the sequence.

Question 4

What are the feed-forward layers for?

Accepted Answer

After attention mixes information between words, a feed-forward network processes each position independently, transforming the gathered context into richer features. Stacking attention and feed-forward blocks dozens of times lets the model build increasingly abstract representations, from grammar to meaning.

Question 5

Is the transformer in GPT the same as the original 2017 one?

Accepted Answer

The core idea is the same, but modern LLMs use the decoder-only half of the original encoder-decoder design, scaled to far more layers and parameters, with refinements like rotary positional embeddings and grouped-query attention. The 2017 "Attention Is All You Need" architecture is still recognisable underneath.

Introduction to Transformer Architecture (No PhD Required)

Tokens: breaking text into pieces

Self-attention: words looking at words

Positional encoding: restoring word order

Feed-forward layers and stacking