Question 1

What is an encoder-decoder architecture?

Accepted Answer

It is a neural network design with two parts: an encoder that reads an input sequence and turns it into an internal representation (in a transformer, one contextual vector per input token), and a decoder that uses that representation to generate an output sequence one token at a time. The original transformer, introduced in the 2017 paper 'Attention Is All You Need', used this design.

Question 2

What is the difference between encoder-only and decoder-only models?

Accepted Answer

Encoder-only models like BERT read the entire input bidirectionally and are best for understanding tasks such as classification. Decoder-only models like GPT generate text left to right and are best for open-ended generation. Encoder-decoder models combine both for sequence-to-sequence tasks.
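
One way to picture "bidirectional" versus "left to right" is as an attention mask: a grid saying which positions each token is allowed to look at. The sketch below is only an illustration. The function names are made up for this example and do not come from any library.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def show(mask):
    """Render a mask as rows of 1 (visible) and 0 (hidden)."""
    return "\n".join(" ".join("1" if ok else "0" for ok in row) for row in mask)


if __name__ == "__main__":
    print("bidirectional (encoder-style):")
    print(show(bidirectional_mask(4)))
    print("causal (decoder-style):")
    print(show(causal_mask(4)))
```

In the causal grid, row i has ones only up to column i, so a token never sees what comes after it. That is what lets a decoder be trained to predict the next token.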

Question 3

Why is GPT decoder-only but BERT encoder-only?

Accepted Answer

GPT generates new text, so it uses a causal decoder that predicts each token from previous ones. BERT only needs to build representations for understanding, so it uses a bidirectional encoder that can attend to the whole input at once but does not generate sequences.
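
The "predicts each token from previous ones" loop can be sketched in a few lines. This is a toy, not how GPT is implemented: TOY_TABLE and next_token are hypothetical stand-ins for a trained network, and the special tokens <bos> and <eos> are assumed names for the start and end markers.

```python
# Hypothetical stand-in for a trained decoder: maps the full prefix seen so
# far to the next token. A real model would compute a probability
# distribution over its vocabulary instead of using a lookup table.
TOY_TABLE = {
    ("<bos>",): "the",
    ("<bos>", "the"): "cat",
    ("<bos>", "the", "cat"): "sat",
}


def next_token(prefix):
    """Return the next token given everything generated so far."""
    return TOY_TABLE.get(tuple(prefix), "<eos>")


def generate(max_len=10):
    """Autoregressive decoding: append one token at a time until <eos>."""
    tokens = ["<bos>"]
    while len(tokens) < max_len:
        token = next_token(tokens)  # depends only on earlier tokens
        if token == "<eos>":
            break
        tokens.append(token)
    return tokens[1:]  # drop the start marker


if __name__ == "__main__":
    print(generate())
```

Each step sees only the prefix, which is why the decoder needs a causal mask. An encoder such as BERT has no loop like this: it processes the whole input in one pass and returns one representation per input token.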

Question 4

What is cross-attention in an encoder-decoder model?

Accepted Answer

Cross-attention is the mechanism the decoder uses to look back at the encoder's output while generating. The queries come from the decoder's own states, while the keys and values come from the encoder's output, so each decoder step can attend to the full encoded input. This keeps the output grounded in the source, which is essential for translation and summarisation.
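
A minimal sketch of that computation, using only the standard library, might look like the following. It assumes a single head and leaves out the learned query, key and value projections that real transformers apply. The function names are illustrative and not taken from any library.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(decoder_states, encoder_outputs):
    """Single-head scaled dot-product cross-attention.

    Queries are the decoder states; keys and values are both the encoder
    outputs (no learned projections in this simplified sketch).
    Returns (contexts, weights), one entry per decoder position.
    """
    d = len(encoder_outputs[0])
    scale = math.sqrt(d)
    contexts, all_weights = [], []
    for q in decoder_states:
        # Score this decoder position against every encoder position.
        scores = [dot(q, k) / scale for k in encoder_outputs]
        weights = softmax(scores)
        # Weighted average of the encoder outputs.
        context = [
            sum(w * v[i] for w, v in zip(weights, encoder_outputs))
            for i in range(d)
        ]
        contexts.append(context)
        all_weights.append(weights)
    return contexts, all_weights


if __name__ == "__main__":
    encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source tokens
    decoder_states = [[4.0, 0.0]]                           # 1 target step
    contexts, weights = cross_attention(decoder_states, encoder_outputs)
    print(weights[0])
    print(contexts[0])
```

Note that no causal mask appears here: the source sequence is fully known before decoding starts, so every decoder step may look at all of it.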

Encoder-Decoder Architecture (AI Glossary)
