What next-token prediction is
Next-token prediction (NTP) is the training objective at the heart of nearly every large language model. The model is fed a sequence of tokens and asked a single question: given everything so far, what is the most likely next token? It outputs a probability distribution over the whole vocabulary, that prediction is compared against the token that actually came next in the training text, and the model’s weights are nudged to make the correct token more probable. Run this over trillions of tokens and the model gradually becomes very good at continuing text.
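The comparison between the predicted distribution and the actual next token is usually scored with cross-entropy, the negative log of the probability assigned to the correct token. The sketch below illustrates that for a single position. The four-word vocabulary, the scores, and the function names are made up for illustration, and raising one score by hand only stands in for what a real gradient update on the weights would achieve.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_index):
    """Cross-entropy for one position: -log p(correct next token)."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Hypothetical vocabulary and made-up model scores for one position.
vocab = ["the", "cat", "sat", "mat"]
logits = [1.0, 0.5, 3.0, 0.2]
target = vocab.index("sat")  # the token that actually came next

loss_before = next_token_loss(logits, target)

# Raising the correct token's score mimics the effect of a weight update.
logits_after = list(logits)
logits_after[target] += 1.0
loss_after = next_token_loss(logits_after, target)

print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")
```

The loss falls as the correct token becomes more probable, which is the quantity training drives down across the whole corpus.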
Why it counts as self-supervised learning
NTP is a form of self-supervised learning, which is what makes it so powerful at scale. There is no need for humans to label data, because the label for each position is simply the next token that already exists in the text. Any document, web page, or code file is automatically a set of training examples: every position is a prediction task with a built-in answer. This is why LLMs can be trained on web-scale corpora. The supervision signal comes free with the text, so the practical limit is the amount of usable text rather than a labelling budget.
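To make the "built-in answer" concrete, the sketch below turns one sentence into training examples by pairing each prefix with the token that follows it. The function name is hypothetical, and splitting on whitespace is an assumption that stands in for a real subword tokenizer.

```python
def make_training_pairs(tokens):
    """Return (context, next_token) pairs, one for every position after the first."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Whitespace splitting is a simplification of real tokenization.
tokens = "the cat sat on the mat".split()
pairs = make_training_pairs(tokens)

for context, target in pairs:
    print(" ".join(context), "->", target)
```

A six-token sentence yields five examples with no human labelling involved. In practice a model typically scores all positions of a sequence in one pass rather than materialising each prefix separately.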
Why a word-guessing game produces world knowledge
It seems implausible that “guess the next word” could yield a system that explains physics or writes code. The reason is that to predict the next token accurately across arbitrary text, a model is forced to learn the structure behind that text. Completing “The capital of France is ___” requires geographic fact; completing a proof requires logical structure; completing code requires syntax and intent. Because the objective rewards getting these right, the model absorbs grammar, facts, relationships, and reasoning patterns as a side effect of minimising prediction error. The knowledge is not stored as a database; it is compressed into the weights as whatever makes the next token predictable.
How it shows up at inference time
The same mechanism runs when you chat with a model. Generation is autoregressive: the model predicts a distribution over the next token, picks one token from it (the single most probable one under greedy decoding, or a sampled one otherwise), appends it to the sequence, and repeats until it produces a stop token or reaches a length limit. Answering a question, summarising a document, or writing a function are all the same operation: continuing your prompt with tokens the model considers probable. This is also why prompting matters: the words you provide are the context the model conditions on, so a clearer prompt steers the continuation toward a better answer.
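The generation loop can be sketched with a toy stand-in for the model. Everything here is hypothetical: the lookup table `TOY_MODEL` conditions only on the last token (a real LLM conditions on the whole context), the `<s>` and `</s>` markers are illustrative start and stop tokens, and `max_new_tokens` is an assumed length cap of the kind most decoders apply.

```python
import random

# Hypothetical toy "model": next-token probabilities given only the last token.
TOY_MODEL = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"down": 0.7, "still": 0.3},
    "down": {"</s>": 1.0},
    "still": {"</s>": 1.0},
}

def generate(prompt, max_new_tokens=20, greedy=True, seed=0):
    """Predict a token, append it, and repeat until the stop token or the cap."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = TOY_MODEL[tokens[-1]]  # distribution over the next token
        if greedy:
            next_token = max(dist, key=dist.get)  # most probable token
        else:
            choices, weights = zip(*dist.items())
            next_token = rng.choices(choices, weights=weights)[0]  # sampled token
        tokens.append(next_token)
        if next_token == "</s>":  # stop token ends generation
            break
    return tokens

print(generate(["<s>", "the"]))
print(generate(["<s>", "the"], greedy=False, seed=1))
```

Changing the prompt changes which table rows are consulted, which is the toy version of a prompt steering the continuation.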
Limits baked into the objective
Because the model is optimised purely for plausible continuations, fluent text and true text can look the same to it — which is one root of hallucination. The base objective also has no notion of helpfulness or safety; those come later, from instruction tuning and techniques like RLHF layered on top of the next-token-prediction foundation.