Large language models can feel like magic, but the mechanism underneath is surprisingly simple to describe. This explainer gives developers and product people a solid mental model — enough to reason about why models behave as they do — without a single equation.
It is all next-token prediction
An LLM does exactly one thing: given a stretch of text, it predicts the next token. Internally it outputs a probability for every token it knows, one is selected, appended to the text, and the model runs again on the slightly longer text. Repeat a few hundred times and you get an essay. Everything you experience — answering questions, writing code, holding a conversation — is this loop dressed up. The model is not retrieving stored answers; it is generating the most plausible continuation, token by token.
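The loop described above can be sketched in a few lines of Python. This is a toy illustration, not a real model: `TOY_MODEL`, `next_token_probs`, and `generate` are hypothetical names, and the lookup table stands in for a neural network that would look at the whole text rather than only the last token.

```python
import random

# Hypothetical stand-in for a model: maps the last token to
# a probability for each possible next token.
TOY_MODEL = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 0.8, "sat": 0.2},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}


def next_token_probs(tokens):
    """Return a probability for every candidate next token."""
    return TOY_MODEL[tokens[-1]]


def generate(prompt_tokens, max_new_tokens=10, seed=0):
    """Predict, select, append, repeat: the whole generation loop."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        candidates = list(probs)
        weights = [probs[t] for t in candidates]
        chosen = rng.choices(candidates, weights=weights, k=1)[0]
        if chosen == "<end>":
            break
        tokens.append(chosen)  # the model now runs on the slightly longer text
    return tokens


print(generate(["the"]))
```

Swapping the lookup table for a trained network changes the quality of the predictions, but not the shape of the loop.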
How it learned: pre-training then alignment
That ability comes from pre-training. The model is shown trillions of words of text and repeatedly asked to predict the next token, with its internal weights nudged after each guess so that the correct token becomes a little more likely. Over enormous scale this forces it to absorb grammar, facts, reasoning patterns, and style, not because anyone programmed them, but because predicting text well requires them.
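To make "nudging the weights" concrete, here is a minimal sketch of the same idea at toy scale. It assumes a drastically simplified model that only looks at the previous word (a bigram table), and the names `train_bigram` and `predict_next` are hypothetical. Real pre-training uses a neural network and vastly more text, but the predict-then-adjust rhythm is the same.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_bigram(text, epochs=200, lr=0.1):
    """Learn next-word scores by repeatedly predicting and adjusting."""
    words = text.split()
    vocab = sorted(set(words))
    index = {w: i for i, w in enumerate(vocab)}
    # weights[i][j] is the score for word j following word i; all start at zero.
    weights = [[0.0] * len(vocab) for _ in vocab]
    for _ in range(epochs):
        for prev, nxt in zip(words, words[1:]):
            row, target = index[prev], index[nxt]
            probs = softmax(weights[row])
            for j in range(len(vocab)):
                # Raise the score of the word that actually came next,
                # lower the others in proportion to their predicted probability.
                error = probs[j] - (1.0 if j == target else 0.0)
                weights[row][j] -= lr * error
    return vocab, index, weights


def predict_next(prev, vocab, index, weights):
    """Return the word the trained table ranks highest after `prev`."""
    probs = softmax(weights[index[prev]])
    best = max(range(len(vocab)), key=probs.__getitem__)
    return vocab[best]


vocab, index, weights = train_bigram("the cat sat on the mat the cat ran")
print(predict_next("the", vocab, index, weights))
```

Nobody tells the table that "cat" tends to follow "the" in this text; that preference emerges purely from repeated prediction and correction.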
Raw pre-trained models are knowledgeable but unruly. A second, much smaller phase — instruction tuning and RLHF (reinforcement learning from human feedback) — shapes the model to follow instructions, answer helpfully, and refuse harmful requests. Human raters compare responses, and the model is tuned toward the preferred ones. This is why ChatGPT behaves like an assistant rather than an autocomplete engine, even though the underlying machinery is the same.
Context windows, temperature, and other dials
Two concepts explain most day-to-day behaviour. The context window is the maximum number of tokens the model can see at once — your prompt plus its output. Anything beyond it simply does not exist for the model, which is why long conversations eventually “forget” the start and why cost scales with how much you stuff into the window.
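One common way applications cope with a full window is to drop the oldest messages, which is where the "forgetting" comes from. The sketch below assumes that strategy and uses a whitespace word count as a crude stand-in for a real tokenizer; `fit_to_context` and `count_tokens` are hypothetical names, not part of any particular API.

```python
def count_tokens(text):
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def fit_to_context(messages, max_tokens):
    """Keep the most recent messages whose combined size fits the window."""
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break  # everything older than this is invisible to the model
        kept.append(message)
        used += cost
    return list(reversed(kept))


history = [
    "my name is Ada",
    "what is a token",
    "and what is temperature",
]
print(fit_to_context(history, max_tokens=8))
```

With an 8-token budget the first message no longer fits, so a model given the trimmed history would have no way of knowing the user's name.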
Temperature governs randomness. At low temperature the model almost always picks its top-ranked next token, producing focused, largely repeatable answers suited to factual questions and code. Raise it and the model samples less likely tokens more often, yielding text that is more creative and varied but less reliable. Choosing temperature is choosing where you want to sit on the precision-versus-creativity dial.
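A common way to implement temperature is to divide the model's raw scores by it before converting them to probabilities. The sketch below assumes that approach; `apply_temperature` and `sample_token` are hypothetical helper names, and the three scores are made up for illustration.

```python
import math
import random


def apply_temperature(scores, temperature):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(tokens, scores, temperature, rng=random):
    """Pick one token at random, weighted by its temperature-adjusted probability."""
    probs = apply_temperature(scores, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


tokens = ["Paris", "Lyon", "banana"]
scores = [2.0, 1.0, 0.0]  # made-up raw scores, highest first

for t in (0.1, 1.0, 2.0):
    rounded = [round(p, 3) for p in apply_temperature(scores, t)]
    print(t, dict(zip(tokens, rounded)))
```

At a temperature of 0.1 nearly all the probability lands on the top-ranked token, while at 2.0 the three options sit much closer together, so the unlikely ones get picked far more often.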
Why they hallucinate
The most important thing to internalise is why models make things up. Because the model generates a plausible next token rather than a verified one, and because it has no internal fact-checker, it can confidently invent a citation, a date, or an API that does not exist when its training data left a gap. Hallucination is not a bug bolted onto an otherwise truthful system; it is the very same fluency mechanism operating where the model lacks grounding. This is why retrieval (feeding real documents into the context) and verification matter so much in serious applications.
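Retrieval, in its simplest form, means finding relevant text and placing it in the prompt before the question. The sketch below is a deliberately naive version: it assumes word overlap as the relevance score, whereas production systems typically use embedding-based search. The names `retrieve` and `build_grounded_prompt` and the prompt wording are hypothetical.

```python
def retrieve(question, documents, top_k=1):
    """Rank documents by how many words they share with the question (toy scoring)."""
    question_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_grounded_prompt(question, documents, top_k=1):
    """Put the retrieved text into the context so the model can draw on it."""
    sources = retrieve(question, documents, top_k)
    context = "\n".join(f"- {doc}" for doc in sources)
    return (
        "Answer using only the sources below. "
        "If they do not contain the answer, say so.\n"
        f"Sources:\n{context}\n"
        f"Question: {question}"
    )


docs = [
    "The context window limits how many tokens the model sees.",
    "Temperature controls sampling randomness.",
]
print(build_grounded_prompt("What does temperature control?", docs))
```

Grounding of this kind narrows the gaps the model would otherwise fill with invention, but it does not remove the need to verify the output.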
With that mental model, the rest follows. To see the engine that makes next-token prediction possible, read about the transformer architecture; to understand the unit everything is measured in, see what a token is.