Definition
Pre-training is the first and most computationally expensive phase of building a large language model. The model is trained on web-scale amounts of unlabelled text — books, code, articles, and crawled web pages — by learning to predict the next token given everything before it. Out of this single, simple objective the model acquires grammar, facts, reasoning patterns, and broad world knowledge. The result is a base model (or foundation model) that is general-purpose but not yet shaped into a helpful assistant.
Self-supervised learning
Pre-training is self-supervised: the data provides its own labels. For any span of text, the “correct answer” for each position is simply the token that actually comes next, so no human annotation is required. This is the key reason pre-training can scale: raw text supplies training signal without any labelling budget, allowing models to learn from trillions of tokens. Next-token prediction is the dominant objective for generative models, though encoder models like BERT instead use masked-token prediction.
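To make the "data provides its own labels" idea concrete, here is a minimal sketch in plain Python. The whitespace tokenizer, the toy vocabulary, and the uniform stand-in "model" are all assumptions made for illustration; real systems use subword tokenizers and a neural network, but the way targets are derived from the text itself is the same.

```python
import math

def next_token_pairs(token_ids):
    """Return (context, target) pairs: each prefix predicts the token that follows it."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

def cross_entropy(probs_per_position, targets):
    """Average negative log-probability assigned to the true next tokens."""
    losses = [-math.log(probs[t]) for probs, t in zip(probs_per_position, targets)]
    return sum(losses) / len(losses)

# Hypothetical toy setup: whitespace "tokenizer" and a five-word vocabulary.
text = "the cat sat on the mat"
vocab = {word: i for i, word in enumerate(sorted(set(text.split())))}
token_ids = [vocab[word] for word in text.split()]

# The labels come straight from the text: no human annotation involved.
pairs = next_token_pairs(token_ids)
targets = [target for _, target in pairs]

# Stand-in for a model: a uniform distribution over the vocabulary at every position.
uniform = [1.0 / len(vocab)] * len(vocab)
loss = cross_entropy([uniform] * len(targets), targets)

print(pairs)
print(f"loss of a uniform predictor: {loss:.4f} (equals log(vocab size))")
```

A uniform predictor scores a loss of log(vocabulary size); training lowers this number by shifting probability toward the tokens that actually follow each context.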
Training compute and FLOPs
The cost of pre-training is usually expressed in training FLOPs (floating-point operations). For dense transformer models, a common approximation is about 6 FLOPs per parameter per training token, so total compute scales with model size multiplied by the number of training tokens. Frontier models consume enormous amounts of compute, translating into thousands of accelerators running for weeks and budgets in the millions of dollars or more. Scaling laws (notably the Chinchilla results) describe how loss falls predictably as model size, data, and compute increase, and guide how a fixed compute budget should be split between parameters and tokens.
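The arithmetic behind these estimates can be sketched in a few lines. The code below rests on assumptions that are not stated in the text above: the widely quoted approximation C ≈ 6 × N × D for dense transformers, and a ratio of roughly 20 training tokens per parameter often attributed to the Chinchilla results. The accelerator count, peak throughput, and utilization are hypothetical values chosen only for illustration.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute for a dense transformer: C ~= 6 * N * D."""
    return 6 * n_params * n_tokens

def compute_optimal_split(flops_budget, tokens_per_param=20.0):
    """Split a compute budget into (params, tokens).

    Solves C = 6 * N * D under the assumed ratio D = tokens_per_param * N,
    which gives N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = (flops_budget / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

def training_days(flops, n_accelerators, peak_flops_per_sec, utilization):
    """Wall-clock days given a cluster size, per-device peak throughput, and utilization."""
    seconds = flops / (n_accelerators * peak_flops_per_sec * utilization)
    return seconds / 86_400

# A 70B-parameter model trained on 1.4T tokens (the published Chinchilla configuration).
budget = training_flops(70e9, 1.4e12)
print(f"training compute: {budget:.3g} FLOPs")

n_params, n_tokens = compute_optimal_split(budget)
print(f"split: {n_params:.3g} parameters, {n_tokens:.3g} tokens")

# Hypothetical cluster: 1024 accelerators, 3e14 peak FLOP/s each, 40% utilization.
days = training_days(budget, 1024, 3e14, 0.4)
print(f"estimated wall-clock time: {days:.1f} days")
```

Treat the outputs as order-of-magnitude estimates: the constant 6, the tokens-per-parameter ratio, and the achieved utilization all vary with architecture, hardware, and training setup.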
What pre-training produces
The output of pre-training is a base model: powerful at completing text and rich in knowledge, but not aligned to follow instructions or behave safely. Ask a raw base model a question and it may continue with more questions rather than answer, because it treats the prompt as text to be continued, not as a request to be fulfilled. Turning it into a usable assistant takes further stages, typically supervised fine-tuning on instruction data followed by reinforcement learning from human feedback (RLHF), that build on the foundation pre-training laid.
Pre-training vs. fine-tuning
The two phases play very different roles. Pre-training is generic, one-time, and astronomically expensive; it endows the model with broad ability. Fine-tuning is specific, repeatable, and cheap; it adapts that ability to a task, tone, or format on a much smaller dataset. Almost every deployed model today follows this pre-train-then-adapt recipe, because reusing one expensive pre-trained base across many cheap fine-tunes is far more efficient than training each task model from scratch.