What pre-training actually is
Pre-training is the foundational stage of building a large language model. The model is exposed to an enormous corpus of text, often trillions of tokens drawn from books, websites, code, and articles, and learns a single deceptively simple skill: predicting the next token given everything before it. For every prediction, the gap between the probabilities the model assigned and the token that actually came next is used to nudge its billions of weights in a slightly better direction. Repeated across a vast dataset, these updates lead the model to internalise grammar, facts, reasoning patterns, and writing styles. The result is a foundation model: a broadly capable but not yet task-specialised network. Everything that comes later, including instruction tuning, alignment, and safety training, builds on this base.
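To make the "nudge the weights" step concrete, here is a minimal sketch of repeated gradient updates on a toy three-token vocabulary. It assumes the trainable parameters are just the three scores themselves, which is a deliberate simplification: a real model computes these scores from billions of weights and updates those instead. The names `sgd_step` and `learning_rate` are illustrative, not taken from any particular library.

```python
import math

def softmax(scores):
    # Convert raw scores into probabilities that sum to 1.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sgd_step(scores, target_index, learning_rate):
    # For cross-entropy loss, the gradient with respect to each score is
    # (predicted probability - 1) for the correct token and
    # (predicted probability - 0) for every other token.
    probs = softmax(scores)
    grads = [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]
    return [s - learning_rate * g for s, g in zip(scores, grads)]

scores = [0.0, 0.0, 0.0]  # toy "weights": one score per vocabulary entry
target = 0                # index of the token that actually came next
losses = []
for _ in range(5):
    losses.append(-math.log(softmax(scores)[target]))
    scores = sgd_step(scores, target, learning_rate=0.5)

print(losses)  # each update lowers the loss on this example
```

Each pass raises the score of the correct token and lowers the others, so the loss on this one example falls step by step. Pre-training applies the same kind of update across the whole corpus.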
Next-token prediction as the objective
The training objective is next-token prediction, and it is the heart of why these models work. Given the sequence “The capital of France is”, the model must assign a high probability to “Paris”. To get good at this across billions of varied examples, the model is forced to learn an astonishing amount: syntax, world facts, cause and effect, even rudimentary arithmetic and logic, because all of these help predict what comes next in real text. There is no separate “knowledge module” — the knowledge is an emergent byproduct of relentlessly minimising prediction error over a huge dataset.
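The prediction error being minimised is usually measured as cross-entropy: the negative log of the probability the model gave to the token that actually came next. The sketch below assumes a made-up three-word vocabulary and hand-picked scores (logits) purely for illustration. A real model produces one score per entry in a vocabulary of tens of thousands of tokens.

```python
import math

def softmax(logits):
    # Turn raw scores into a probability distribution over the vocabulary.
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_index):
    # Cross-entropy for one prediction: -log(probability of the true token).
    return -math.log(softmax(logits)[target_index])

# Hypothetical candidates for the token after "The capital of France is".
vocab = ["Paris", "London", "banana"]
target = vocab.index("Paris")

confident_logits = [4.0, 1.0, -2.0]  # strongly favours "Paris"
unsure_logits = [0.0, 0.0, 0.0]      # no preference among the three

loss_confident = next_token_loss(confident_logits, target)
loss_unsure = next_token_loss(unsure_logits, target)
print(loss_confident, loss_unsure)
```

The confident prediction earns a much lower loss than the uniform guess, whose loss equals the log of the vocabulary size. Training pushes the model towards the first situation wherever the text allows it.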
Why it is self-supervised
Pre-training is self-supervised, meaning it needs no human-written labels. The correct answer for every prediction is already sitting in the text: it is simply the next token. This is the crucial property that lets pre-training scale. Supervised learning requires expensive human annotation for every example, which caps dataset size. Self-supervision removes that bottleneck, because any raw text becomes training data, so labs can feed models the equivalent of a large slice of the public internet. Scale of data, scale of model parameters, and scale of compute together produce the capabilities we associate with modern LLMs.
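A small sketch shows how the labels fall out of the text itself. It assumes a whitespace split as a stand-in tokeniser. Real systems use subword tokenisers, and the helper name `next_token_examples` is hypothetical.

```python
def next_token_examples(tokens):
    # Every prefix of the sequence is an input; the token that follows
    # it is the label. No human annotation is involved.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Toy tokenisation by whitespace (an assumption for illustration only).
tokens = "The capital of France is Paris".split()
examples = next_token_examples(tokens)

for context, target in examples:
    print(" ".join(context), "->", target)
```

A six-token sentence yields five training examples, each with its answer supplied by the original text. In practice a model is trained on all positions of a sequence in a single pass rather than on separately materialised prefixes, but the pairing of context and target is the same.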
The compute cost of pre-training
Pre-training a frontier model is among the most expensive computations performed in industry. It requires thousands of high-end GPUs or TPUs running in parallel for weeks or months, drawing megawatts of power and costing anywhere from millions to hundreds of millions of dollars. Training compute scales roughly with the number of parameters multiplied by the number of training tokens, and scaling laws describe how predictably performance improves as you add more of each. These economics are why pre-training frontier models from scratch is concentrated among a small number of well-resourced labs, while most companies instead fine-tune or prompt an existing pre-trained base.
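The parameters-times-tokens relationship can be turned into a back-of-envelope estimate. The sketch below uses a widely quoted rule of thumb for dense transformer models, about 6 floating-point operations per parameter per training token. Every number in it is an illustrative assumption, not a figure for any particular model or chip: the model size, token count, accelerator count, peak throughput, and utilisation.

```python
SECONDS_PER_DAY = 86_400

def training_flops(num_parameters, num_tokens, flops_per_param_per_token=6.0):
    # Rule-of-thumb total compute: roughly 6 * parameters * tokens.
    return flops_per_param_per_token * num_parameters * num_tokens

def training_days(total_flops, num_accelerators, peak_flops_per_second, utilization):
    # Effective throughput is peak throughput scaled by the fraction
    # of it the training job actually achieves.
    effective = num_accelerators * peak_flops_per_second * utilization
    return total_flops / effective / SECONDS_PER_DAY

# Hypothetical run: 70 billion parameters trained on 1.4 trillion tokens.
flops = training_flops(num_parameters=70e9, num_tokens=1.4e12)

# Hypothetical cluster: 1024 accelerators, 4e14 FLOP/s peak each, 40% utilisation.
days = training_days(flops, num_accelerators=1024,
                     peak_flops_per_second=4e14, utilization=0.4)

print(f"{flops:.2e} FLOPs, about {days:.0f} days")
```

Under these assumptions the run needs roughly 5.9e23 FLOPs and about six weeks on the hypothetical cluster. Doubling either the parameters or the tokens doubles the bill, which is why frontier-scale runs are out of reach for most organisations.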
Where pre-training sits in the pipeline
Pre-training produces a powerful but raw model that predicts plausible text rather than following instructions or behaving safely. To turn it into a usable assistant, labs add later stages: supervised fine-tuning on curated instruction-response pairs, and alignment techniques such as RLHF that shape the model to be helpful, harmless, and honest. Understanding pre-training matters because it is where most of a model’s knowledge and core capabilities come from. Later stages mainly refine behaviour, and they add comparatively little knowledge that the base model did not acquire during this expensive, foundational phase.