Question 1

What does pre-training mean for a language model?

Accepted Answer

Pre-training is the first and largest stage of building a language model, where it learns to predict the next token across a huge body of text. The model is shown trillions of tokens and repeatedly guesses what comes next; after each guess its weights are adjusted so that the correct token becomes more likely. By the end it has absorbed grammar, facts, reasoning patterns, and writing styles: a broad base of world knowledge it can later be specialised on.
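
The guess-and-adjust loop described above can be sketched at toy scale. This is only an illustration, not how a production model is built: it assumes a bigram model (each prediction looks at just the previous token), a whitespace tokenizer, and plain full-batch gradient descent. The names `softmax` and `train_bigram` are hypothetical helpers made up for this example.

```python
import math

def softmax(logits):
    # Turn raw scores into probabilities that sum to 1.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_bigram(tokens, vocab_size, steps=200, lr=0.5):
    # logits[a][b] is the model's score for token b following token a.
    logits = [[0.0] * vocab_size for _ in range(vocab_size)]
    pairs = list(zip(tokens[:-1], tokens[1:]))  # (previous token, next token)
    losses = []
    for _ in range(steps):
        total_loss = 0.0
        grads = [[0.0] * vocab_size for _ in range(vocab_size)]
        for prev, nxt in pairs:
            probs = softmax(logits[prev])
            # Cross-entropy: penalise low probability on the true next token.
            total_loss += -math.log(probs[nxt])
            for j in range(vocab_size):
                grads[prev][j] += probs[j] - (1.0 if j == nxt else 0.0)
        # Gradient step: nudge the weights toward the observed next tokens.
        for a in range(vocab_size):
            for b in range(vocab_size):
                logits[a][b] -= lr * grads[a][b] / len(pairs)
        losses.append(total_loss / len(pairs))
    return logits, losses

words = "the cat sat on the mat".split()
vocab = sorted(set(words))
ids = [vocab.index(w) for w in words]
logits, losses = train_bigram(ids, len(vocab))
print("loss before:", losses[0], "loss after:", losses[-1])
```

A real language model replaces the lookup table with a deep network that conditions on a long context, but the objective (cross-entropy on the next token) has the same shape.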

Question 2

Why is pre-training called self-supervised?

Accepted Answer

Self-supervised means the training signal comes from the data itself, with no human labelling. The text already contains the answer to every prediction (the token that actually comes next), so the model can be trained on raw internet text without anyone hand-annotating it. This is what makes pre-training scalable to trillions of tokens: on the data side, the only requirement is more text.
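
To make the "labels come for free" point concrete, here is a minimal sketch of how raw text turns into training examples. It assumes a whitespace tokenizer and a small fixed context window purely for readability (real systems use subword tokenizers and much longer contexts), and `make_training_pairs` is a hypothetical helper name.

```python
def make_training_pairs(text, context_size=3):
    # Assumption: split on whitespace; real models use subword tokenizers.
    tokens = text.split()
    pairs = []
    for i in range(1, len(tokens)):
        # The input is the preceding tokens; the "label" is simply the
        # token that follows them in the original text.
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))
    return pairs

for context, target in make_training_pairs("the cat sat on the mat"):
    print(context, "->", target)
```

Every position in the text yields one example, so a larger corpus automatically means more supervised pairs without any annotation step.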

Question 3

How much does it cost to pre-train a frontier model?

Accepted Answer

Pre-training a frontier model can cost tens to hundreds of millions of dollars in compute, running thousands of GPUs for weeks or months. The expense comes from the sheer scale: trillions of tokens are pushed through a network with hundreds of billions of parameters, and every token requires a forward and backward pass over those parameters. This is why only a handful of well-funded labs train the largest foundation models from scratch.
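
A rough back-of-envelope version of this cost reasoning is sketched below. It relies on the commonly used approximation that training a dense transformer takes about 6 × parameters × tokens floating-point operations. Every number in the usage example (model size, token count, GPU throughput, utilisation, hourly price) is an illustrative assumption rather than a figure for any real model or vendor, and `estimate_pretraining_cost` is a hypothetical helper.

```python
def estimate_pretraining_cost(params, tokens, gpu_peak_flops,
                              utilization, num_gpus, dollars_per_gpu_hour):
    # Approximation (assumption): ~6 FLOPs per parameter per training token.
    total_flops = 6.0 * params * tokens
    # Real hardware reaches only a fraction of its peak throughput.
    effective_flops_per_gpu = gpu_peak_flops * utilization
    gpu_hours = total_flops / effective_flops_per_gpu / 3600.0
    return {
        "total_flops": total_flops,
        "gpu_hours": gpu_hours,
        "days": gpu_hours / num_gpus / 24.0,
        "dollars": gpu_hours * dollars_per_gpu_hour,
    }

# Illustrative inputs only: 400B parameters, 15T tokens, 4e14 peak FLOP/s
# per GPU at 40% utilisation, 16,384 GPUs, $2 per GPU-hour.
estimate = estimate_pretraining_cost(
    params=400e9, tokens=15e12, gpu_peak_flops=4e14,
    utilization=0.4, num_gpus=16384, dollars_per_gpu_hour=2.0,
)
for key, value in estimate.items():
    print(f"{key}: {value:,.0f}")
```

The estimate ignores failed runs, experiments, data preparation, and staff, so it is best read as a lower bound on the compute bill under the stated assumptions.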

Question 4

Is pre-training the same as fine-tuning?

Accepted Answer

No. Pre-training is the broad, general stage that teaches the model language and world knowledge from web-scale data. Fine-tuning is a much smaller follow-on stage that adapts that base model to a specific task or behaviour using far less data. A typical assistant is pre-trained once, then fine-tuned and aligned with techniques like RLHF on top.

What Is Pre-Training? How LLMs Learn From the Whole Internet

What pre-training actually is

Next-token prediction as the objective

Why it is self-supervised

The compute cost of pre-training

Where pre-training sits in the pipeline