Definition
Cross-entropy loss measures how far a model’s predicted probability distribution is from the true distribution over the correct answers. It is the standard objective for classification problems and for training language models. The loss is small when the model puts high probability on the right answer and grows sharply when the model is confidently wrong.
The intuition
A classifier outputs a probability for each possible class. When each example has exactly one correct class, cross-entropy looks only at the probability the model assigned to that class and takes its negative natural logarithm:
- Predict the right class with probability 0.99 → loss ≈ 0.01 (tiny).
- Predict it with probability 0.5 → loss ≈ 0.69 (moderate).
- Predict it with probability 0.01 → loss ≈ 4.6 (large).
The logarithm is what punishes confident mistakes so heavily, encouraging the model to be both accurate and well-calibrated in its certainty.
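The three figures above can be reproduced with a few lines of Python. This is a minimal sketch that assumes a single correct class and the natural logarithm; the function name cross_entropy_for_correct_class is made up for illustration.

```python
import math

def cross_entropy_for_correct_class(p_correct: float) -> float:
    """Negative natural log of the probability given to the correct class."""
    return -math.log(p_correct)

# The same three probabilities used in the list above.
for p in (0.99, 0.5, 0.01):
    loss = cross_entropy_for_correct_class(p)
    print(f"p = {p:<4} -> loss = {loss:.2f}")
```

Halving the probability on the correct class always adds the same amount (about 0.69) to the loss, so the penalty keeps climbing as the probability approaches zero.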
Pairing with softmax
In practice cross-entropy is almost always preceded by a softmax layer. Softmax turns the model’s raw output scores (logits) into a valid probability distribution that sums to one; cross-entropy then scores that distribution against the true label. The combination is mathematically convenient: the gradient of the combined loss with respect to the logits simplifies to “predicted probability minus true probability,” which is stable and cheap to compute.
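The gradient claim can be checked numerically. The sketch below is one possible illustration in plain Python, not a reference implementation: it assumes a single example with one correct class, and the logits and all function names are hypothetical. It compares the analytic gradient (probabilities minus the one-hot target) against central finite differences.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to one."""
    m = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Loss for one example whose correct class index is `target`."""
    return -math.log(softmax(logits)[target])

def grad_wrt_logits(logits, target):
    """Analytic gradient: predicted probability minus one-hot target."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

def numeric_grad(logits, target, eps=1e-6):
    """Central finite differences, used only to check the analytic form."""
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        diff = cross_entropy(up, target) - cross_entropy(down, target)
        grads.append(diff / (2 * eps))
    return grads

logits = [2.0, 0.5, -1.0]  # hypothetical scores for three classes
target = 0                 # assume class 0 is the correct one
print("analytic:", grad_wrt_logits(logits, target))
print("numeric: ", numeric_grad(logits, target))
```

The two printed lists should agree to several decimal places, and the gradient entries sum to zero because both the predicted and the true distributions sum to one.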
Cross-entropy in language models
Training a language model is framed as next-token prediction: given the preceding tokens, predict a probability distribution over the entire vocabulary for the next one. Because the vocabulary is just a large set of classes, this is a classification problem, and cross-entropy is the natural fit. At each position the loss compares the model’s predicted distribution to the actual next token, and these losses, averaged over a sequence, drive training.
The exponential of the average per-token cross-entropy (computed with the natural logarithm) is perplexity, a headline metric for language model quality, which is why the two are so tightly linked.
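As a small worked example of that relationship, the sketch below computes the average cross-entropy and the perplexity from the probabilities a model assigned to the actual next token at each position. The probabilities and function names are invented for illustration, and the natural logarithm is assumed.

```python
import math

def average_cross_entropy(token_probs):
    """Mean negative log-probability of the actual next tokens."""
    return sum(-math.log(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    """Exponential of the average per-token cross-entropy."""
    return math.exp(average_cross_entropy(token_probs))

# Hypothetical probabilities given to the true next token at four positions.
uniform_guess = [0.25, 0.25, 0.25, 0.25]
mixed = [0.9, 0.5, 0.1, 0.7]

print(perplexity(uniform_guess))
print(average_cross_entropy(mixed), perplexity(mixed))
```

A model that always gives the true token probability 0.25 has a perplexity of about 4, which matches the usual reading of perplexity as the effective number of equally likely choices the model is hesitating between.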
Why it matters
Cross-entropy is the quiet workhorse behind nearly every classifier and language model in use today. Understanding it explains why models are trained to output probabilities rather than hard guesses, why confident wrong answers are penalised so steeply, and how a single, well-behaved loss function scales from tiny classifiers to models trained on trillions of tokens.