Question 1

What are scaling laws in AI?

Accepted Answer

Scaling laws are empirical relationships showing that a language model's loss falls predictably, following a power law, as three factors grow: the number of parameters, the size of the training dataset, and the amount of compute used. They let researchers forecast how much better a larger model will be before training it.
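As a rough illustration of the forecasting idea, the sketch below fits a power law to a few small-model losses and extrapolates to a larger model. The functional form L = a * N^(-alpha), the function names, and the data points are assumptions made for illustration. The numbers are synthetic and do not come from any real training run.

```python
import math

def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-alpha) by least squares in log-log space.

    A power law is a straight line on log-log axes, so an ordinary
    linear fit of log(loss) against log(size) recovers a and alpha.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)

def predict_loss(a, alpha, size):
    """Evaluate the fitted power law at a new model size."""
    return a * size ** (-alpha)

# Synthetic (made-up) measurements generated from L = 10 * N**-0.1.
sizes = [1e6, 1e7, 1e8]
losses = [10.0 * s ** -0.1 for s in sizes]

a, alpha = fit_power_law(sizes, losses)
forecast = predict_loss(a, alpha, 1e9)  # extrapolate one order of magnitude
print(f"a={a:.3f}, alpha={alpha:.3f}, forecast loss at 1e9 params={forecast:.3f}")
```

Real fits are noisier than this and usually include more terms, so an extrapolation like this is only as trustworthy as the assumed functional form.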

Question 2

What did the Chinchilla paper change?

Accepted Answer

DeepMind's Chinchilla paper (Hoffmann et al., 2022) showed that earlier models like GPT-3 were badly undertrained: they were too large for the amount of data they saw. For a fixed compute budget, you get lower loss by training a smaller model on far more tokens, roughly 20 training tokens per parameter.
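The 20-tokens-per-parameter rule of thumb can be turned into a quick budget calculation. The sketch below assumes the common approximation that training compute is about 6 FLOPs per parameter per token (C ≈ 6·N·D), which is not stated in the answer above. The function name and constants are illustrative only.

```python
import math

TOKENS_PER_PARAM = 20       # rule of thumb quoted in the answer above
FLOPS_PER_PARAM_TOKEN = 6   # assumption: C ≈ 6 * N * D

def compute_optimal(compute_flops):
    """Split a compute budget into (parameters, tokens).

    Substituting D = 20 * N into C = 6 * N * D gives C = 120 * N**2,
    so N = sqrt(C / 120) and D = 20 * N.
    """
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

# A budget of 5.88e23 FLOPs works out to about 70 billion parameters
# and 1.4 trillion tokens, which matches Chinchilla's reported size.
n_params, n_tokens = compute_optimal(5.88e23)
print(f"params ~ {n_params:.2e}, tokens ~ {n_tokens:.2e}")
```

For comparison, GPT-3 had 175 billion parameters and was trained on roughly 300 billion tokens, which is under 2 tokens per parameter.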

Question 3

Why are bigger models usually better?

Accepted Answer

Scaling laws show that, given enough data and compute, increasing model size reliably lowers loss and unlocks new capabilities. Larger models capture more patterns and generalise better. The caveat is that size must be matched to sufficient training data, or the extra parameters go to waste.

Question 4

Do scaling laws ever break down?

Accepted Answer

Scaling laws are remarkably robust, but they do not hold without limit. They assume high-quality data is available, and the supply of fresh, high-quality text is finite. Returns also diminish: each further increase in scale buys a smaller drop in loss. And some capabilities emerge suddenly rather than smoothly, so loss curves do not capture everything that matters.
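Diminishing returns follow directly from the power-law shape. The sketch below assumes a pure power law L = a * N^(-alpha) with made-up constants and prints how much loss each doubling of model size removes. Every name and value here is hypothetical.

```python
def loss(n_params, a=10.0, alpha=0.1):
    """Hypothetical power-law loss; a and alpha are illustrative constants."""
    return a * n_params ** (-alpha)

def gains_per_doubling(start_params, doublings):
    """Return the absolute loss reduction from each successive doubling."""
    gains = []
    n = start_params
    for _ in range(doublings):
        gains.append(loss(n) - loss(2 * n))
        n *= 2
    return gains

gains = gains_per_doubling(1e9, 5)
for i, g in enumerate(gains, start=1):
    print(f"doubling {i}: loss falls by {g:.4f}")
```

Under this assumed form, every doubling costs twice as much as the one before but removes only 2^(-alpha) times as much loss as the previous doubling did.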

Scaling Laws in AI: Why Bigger Models Are (Usually) Better

Definition

The Kaplan scaling laws

The Chinchilla correction

Why bigger is usually better

Limits and open questions