Definition
Scaling laws are empirical relationships in machine learning showing that a language model’s prediction error (its loss) decreases predictably, as a power law, with three quantities: the number of model parameters, the size of the training dataset, and the total compute spent on training. Discovered through systematic experiments, they let researchers forecast how a model will perform at a given scale before committing to an expensive training run, turning model development from guesswork into something closer to engineering.
The Kaplan scaling laws
The first influential formulation came from Kaplan et al. (OpenAI, 2020). They found that test loss falls smoothly as a power law in model size, dataset size, and compute, across many orders of magnitude, provided the other two quantities are not the bottleneck. Crucially, the relationships were clean enough to extrapolate: you could train a series of small models, fit the curve, and predict the loss of a much larger one. Their analysis also suggested that, as compute budgets grow, most of the increase should go into model size rather than training data. These results helped justify the push toward ever-larger models like GPT-3, whose strong performance seemed to confirm that scale was the dominant lever.
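The fit-and-extrapolate workflow described above can be sketched in a few lines. This is a minimal illustration, assuming a pure power law of the form L(N) = (N_c / N)^alpha in parameter count N. The data points are synthetic, and the constants `TRUE_ALPHA` and `TRUE_NC` and the helper names are hypothetical values chosen for the example, not figures from the paper.

```python
import numpy as np

# Hypothetical constants used only to generate synthetic data for this sketch.
TRUE_ALPHA = 0.08
TRUE_NC = 1e13

def power_law_loss(n_params, n_c, alpha):
    """Loss predicted by a pure power law in parameter count."""
    return (n_c / n_params) ** alpha

def fit_power_law(n_params, losses):
    """Fit L = (n_c / N) ** alpha by linear regression in log-log space."""
    # log L = alpha * log n_c - alpha * log N, which is a straight line in log N.
    slope, intercept = np.polyfit(np.log(n_params), np.log(losses), deg=1)
    alpha = -slope
    n_c = np.exp(intercept / alpha)
    return n_c, alpha

# "Train" five small models (here: evaluate the assumed law) and fit the curve.
small_models = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
observed = power_law_loss(small_models, TRUE_NC, TRUE_ALPHA)
n_c_hat, alpha_hat = fit_power_law(small_models, observed)

# Extrapolate one order of magnitude beyond the largest fitted model.
predicted_1b = power_law_loss(1e9, n_c_hat, alpha_hat)
print(f"alpha = {alpha_hat:.3f}, predicted loss at 1e9 params = {predicted_1b:.3f}")
```

Because a power law is a straight line on log-log axes, an ordinary least-squares line fit is enough here. Real measurements are noisy and usually include an irreducible loss term, so a practical fit would need a nonlinear optimizer and error bars on the extrapolation.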
The Chinchilla correction
In 2022, DeepMind’s Hoffmann et al. (the Chinchilla paper) refined the picture significantly. They showed that for a fixed compute budget, the earlier generation of large models had been badly undertrained: they were too big for the amount of data they saw. The compute-optimal recipe, they found, is to scale model size and training data roughly in proportion, which works out to about 20 training tokens per parameter. Chinchilla, a 70-billion-parameter model trained on about 1.4 trillion tokens, outperformed much larger models trained on less data, including DeepMind’s own 280-billion-parameter Gopher, and redirected the field toward data-rich training.
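The 20-tokens-per-parameter rule can be turned into a rough sizing calculation. The sketch below additionally assumes the common approximation that training compute is about 6 FLOPs per parameter per token (C ≈ 6·N·D). That approximation is not stated in the text above, and the function and constant names are hypothetical.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6   # assumption: training compute C ~ 6 * N * D
TOKENS_PER_PARAM = 20       # rule of thumb quoted in the text

def compute_optimal(compute_flops):
    """Return (params, tokens) that spend compute_flops at 20 tokens per parameter."""
    # C = 6 * N * D and D = 20 * N  =>  C = 120 * N**2
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

# A budget chosen so that the answer lands on a 70-billion-parameter model.
n, d = compute_optimal(5.88e23)
print(f"{n:.2e} parameters, {d:.2e} tokens")  # 7.00e+10 parameters, 1.40e+12 tokens
```

Under these assumptions both model size and token count grow with the square root of compute, so a 100x larger budget calls for roughly a 10x larger model trained on 10x more tokens.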
Why bigger is usually better
Scaling laws explain why the field kept building larger models: within the regime they describe, adding parameters (matched with enough data and compute) reliably lowers loss and, empirically, unlocks new capabilities such as in-context learning and improved reasoning. The performance gains are smooth and predictable enough that labs can plan multi-year roadmaps around them. The essential caveat is balance: extra parameters are wasted unless paired with sufficient high-quality training tokens, which is exactly the lesson Chinchilla delivered.
Limits and open questions
Scaling laws are robust but not unbounded. They presuppose an abundant supply of high-quality data, yet fresh, high-quality text is finite, prompting interest in synthetic data and better data curation. Returns also diminish: because loss follows a power law, every further halving of the reducible loss requires multiplying the resource by the same large factor. Some abilities also appear to emerge suddenly at certain scales rather than improving smoothly, which a single loss curve cannot capture. Scaling laws remain the best available map of how AI capability grows with resources, but they describe a trend, not a guarantee.
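The diminishing returns mentioned above follow directly from the power-law form. As a rough illustration, assume loss scales as x^(-alpha) in some resource x, with no irreducible floor. Halving the loss then requires multiplying x by 2^(1/alpha). The exponents below are illustrative values, not fitted ones, and the function name is hypothetical.

```python
def scale_factor_to_halve(alpha):
    """Multiplier on a resource needed to halve loss when L ~ x ** -alpha."""
    # (k * x) ** -alpha = 0.5 * x ** -alpha  =>  k = 2 ** (1 / alpha)
    return 2 ** (1 / alpha)

# Illustrative exponents only; smaller alpha means a flatter curve.
for alpha in (0.5, 0.1, 0.05):
    print(f"alpha = {alpha}: {scale_factor_to_halve(alpha):,.0f}x more resource")
```

With an exponent of 0.1, each halving of loss costs about a thousandfold increase in the resource, which is why small exponents make continued scaling expensive even while the trend holds.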