The paper that started the modern era
In June 2017 a team of researchers at Google published a paper with a deliberately bold title: “Attention Is All You Need.” It introduced the transformer, the neural network architecture that now underpins essentially every large language model — GPT, Claude, Gemini, Llama, and the rest. It is one of the most cited and most consequential papers in the history of machine learning, and understanding its core ideas explains a great deal about how today’s AI works.
The authors (Vaswani et al.) were not trying to build a chatbot. Their immediate goal was better machine translation. But the architecture they proposed turned out to generalise far beyond that single task.
Throwing out recurrence
Before the transformer, the best sequence models were recurrent neural networks (RNNs) and variants such as LSTMs. These process text one word at a time, carrying a hidden state forward step by step. That sequential design was their main weakness: training was slow and hard to parallelise on GPUs, and the models struggled to connect words far apart in a long passage.
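To make the sequential bottleneck concrete, here is a minimal sketch of a recurrent update. It is a toy, not the design of any real RNN or LSTM: the one-unit network, the function name rnn_hidden_states and the weights w_in and w_rec are all assumptions chosen for illustration.

```python
import math

def rnn_hidden_states(inputs, w_in=0.5, w_rec=0.9):
    """Run a toy one-unit recurrent network over a list of numbers.

    w_in and w_rec are hypothetical fixed weights; a real model learns them.
    """
    h = 0.0  # hidden state carried from one step to the next
    states = []
    for x in inputs:
        # Each step depends on the previous hidden state,
        # so step t cannot start until step t-1 has finished.
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

# A single signal at the first position, followed by empty inputs.
states = rnn_hidden_states([1.0, 0.0, 0.0, 0.0])
print([round(s, 3) for s in states])
```

In this toy the trace of the first input shrinks at every later step, which gives a loose picture of why distant words were hard to connect. The loop also cannot be split across processors, because each iteration waits on the one before it.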
The paper’s radical move was to remove recurrence entirely. Instead of stepping through a sequence token by token, the transformer processes every position in the sequence in parallel during training. (When generating text it still emits one token at a time, but each new token can attend directly to everything before it.) This change is what made it possible to train enormous models efficiently on modern parallel hardware, and it was the practical key that unlocked scaling.
Self-attention and multiple heads
The mechanism doing the heavy lifting is self-attention. For each word, the model asks: “which other words in this sentence matter for understanding me, and how much?” It answers using three vectors per token (query, key, and value), each produced by a learned projection of that token’s embedding. Every query is compared with every key to give a set of attention weights, and each token’s output is the correspondingly weighted average of the values, computed for all pairs of tokens at once. This lets the model link “it” to the noun it refers to, or connect a verb to its distant subject, within a single layer.
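The query, key and value computation can be sketched in a few lines of plain Python. This follows the scaled dot-product form, softmax(QK^T / sqrt(d_k)) V, but everything concrete here is an assumption for illustration: the helper names, the tiny 2-dimensional embeddings, and the identity matrices standing in for learned projection weights. Real implementations do the same arithmetic as batched matrix multiplications rather than Python loops.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def self_attention(tokens, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a list of token vectors."""
    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    scale = math.sqrt(len(keys[0]))  # sqrt of the key dimension

    outputs, all_weights = [], []
    for q in queries:
        # Score this token's query against every token's key.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # The output is the weighted average of all value vectors.
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional embeddings for three tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projections
outputs, weights = self_attention(tokens, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one token attends to every token, and each row sums to 1. With the identity stand-ins, the first token attends equally to itself and to the third token, and less to the second, because its query overlaps with their keys.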
The paper went further with multi-head attention: running several attention computations in parallel, each with its own learned projections and therefore free to focus on a different kind of relationship. One head might, for example, track syntax while another tracks meaning, although in practice the roles are learned rather than assigned. Because attention alone has no sense of word order, the authors added positional encodings to the token embeddings to inject information about where each token sits in the sequence.
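The positional encodings used in the paper are fixed sine and cosine waves of different frequencies. The sketch below covers only that part, not multi-head attention. The function name positional_encoding and the small sizes are assumptions for illustration, and d_model is assumed to be even.

```python
import math

def positional_encoding(seq_len, d_model):
    """Build sinusoidal position vectors; d_model is assumed to be even."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d_model, 2):
            # Each pair of dimensions shares one wavelength.
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))  # even dimension
            row.append(math.cos(angle))  # odd dimension
        table.append(row)
    return table

pe = positional_encoding(seq_len=4, d_model=6)
for pos, row in enumerate(pe):
    print(pos, [round(x, 3) for x in row])

# Adding the position vector to a (hypothetical) token embedding:
embedding = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
with_position = [e + p for e, p in zip(embedding, pe[2])]
```

Every position gets a distinct vector, so two occurrences of the same word at different places in a sentence no longer look identical to the attention layers.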
Why it changed everything
The original transformer had an encoder (for reading input) and a decoder (for generating output). Later researchers split these apart: BERT used the encoder for understanding tasks like search and classification, while the GPT family used the decoder for text generation. Scaling those designs to billions of parameters, training them on vast text corpora, and aligning them with human feedback produced the assistants we use now.
So while “Attention Is All You Need” did not directly produce ChatGPT, it provided the blueprint. Nearly every modern LLM is, at heart, a scaled-up descendant of the architecture described in this one 2017 paper, which is why it is remembered as the work that changed the direction of AI.