Loss Function (AI Glossary)

The objective that measures how wrong a model's predictions are

Definition

A loss function (also called a cost or objective function) maps a model’s predictions and the true targets to a single scalar that measures how wrong the predictions are on a given example or batch. Training a neural network means repeatedly adjusting its weights to make this number as small as possible. The loss turns the vague goal of “be accurate” into a single, differentiable quantity that an optimisation algorithm such as gradient descent can systematically reduce.

Cross-entropy loss for language models

Large language models are trained with cross-entropy loss on the next-token prediction task. At each position the model outputs a probability distribution over the vocabulary, and the cross-entropy is the negative log-probability the model assigned to the token that actually came next. If the model was confident and correct, the loss is near zero; if it was confident and wrong, the loss is large. Averaged over a corpus, this loss gives perplexity directly: perplexity is the exponential of the average per-token cross-entropy, and it is the standard way of reporting how well a language model fits text.
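The relationship between per-token cross-entropy and perplexity can be sketched in a few lines of plain Python. This is a minimal illustration, not how a real training framework computes it: the four-token vocabulary, the probability values, and the function names cross_entropy and perplexity are all made up for the example, and the loss is measured in nats (natural logarithm).

```python
import math

def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the token that actually came next."""
    return -math.log(probs[target_index])

def perplexity(losses):
    """Exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(sum(losses) / len(losses))

# Hypothetical 4-token vocabulary; index 2 is the true next token in both cases.
confident_correct = [0.01, 0.01, 0.97, 0.01]
confident_wrong = [0.97, 0.01, 0.01, 0.01]

loss_good = cross_entropy(confident_correct, 2)  # about 0.03
loss_bad = cross_entropy(confident_wrong, 2)     # about 4.61

# Perplexity over these two positions, about 10.15.
ppl = perplexity([loss_good, loss_bad])
print(loss_good, loss_bad, ppl)
```

A perplexity near 10 here can be read as the model being, on average, about as uncertain as if it were choosing uniformly among ten tokens.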

Mean squared error for regression

For tasks that predict a continuous number — house prices, temperatures, sensor readings — the typical choice is mean squared error (MSE): the average of the squared differences between predicted and true values. Squaring makes large errors disproportionately costly and keeps the function smooth, so its gradient is easy to compute. A close relative, mean absolute error, is less sensitive to outliers because it does not square the differences.
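The difference in outlier sensitivity is easy to see on a toy example. The sketch below uses invented numbers and hypothetical helper names (mean_squared_error, mean_absolute_error); it compares predictions that are all slightly off with predictions that are exact except for one large miss.

```python
def mean_squared_error(predicted, actual):
    """Average of the squared differences."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

def mean_absolute_error(predicted, actual):
    """Average of the absolute differences."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

actual = [10.0, 12.0, 14.0, 16.0]
all_off_by_one = [11.0, 13.0, 15.0, 17.0]  # every prediction misses by 1
one_outlier = [10.0, 12.0, 14.0, 26.0]     # three exact, one misses by 10

mse_close = mean_squared_error(all_off_by_one, actual)   # 1.0
mae_close = mean_absolute_error(all_off_by_one, actual)  # 1.0
mse_outlier = mean_squared_error(one_outlier, actual)    # 25.0
mae_outlier = mean_absolute_error(one_outlier, actual)   # 2.5
print(mse_close, mae_close, mse_outlier, mae_outlier)
```

The single large miss raises MSE by a factor of 25 but MAE by only a factor of 2.5, which is the sense in which squaring makes large errors disproportionately costly.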

How the loss drives gradient descent

The loss function is the surface that gradient descent walks down. During back-propagation the network computes the gradient of the loss with respect to every weight. The gradient points in the direction in which the loss increases most quickly, so the optimiser nudges each weight a small step in the opposite direction, scaled by the learning rate. Because of this, the loss must be differentiable, or at least differentiable almost everywhere: it needs a usable slope wherever the optimiser is likely to look. Mean absolute error, for example, has a kink at zero error but still works in practice.
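As a rough sketch of that loop, the example below fits a one-weight model y = w * x with MSE, computing the gradient by hand rather than by back-propagation. The data, the learning rate of 0.05, the step count, and the names loss and grad are assumptions chosen for illustration. Note that the update subtracts the gradient, since the gradient itself points uphill.

```python
def loss(w, xs, ys):
    """Mean squared error of the one-weight model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad(w, xs, ys):
    """Derivative of the loss with respect to w: mean(2 * (w*x - y) * x)."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # toy data generated by y = 2x, so the best w is 2

w = 0.0               # arbitrary starting weight
learning_rate = 0.05  # illustrative value; too large a rate would diverge
history = []

for _ in range(100):
    history.append(loss(w, xs, ys))
    # Step against the gradient: the gradient points uphill on the loss.
    w -= learning_rate * grad(w, xs, ys)

print(w, history[0], history[-1])
```

With these settings the weight ends up very close to 2 and the recorded loss falls from about 18.7 at the start to essentially zero.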

Choosing the right loss

The loss encodes what you actually want the model to do, so picking it carefully matters. Classification tasks typically use cross-entropy; regression typically uses MSE or mean absolute error; ranking and retrieval often use contrastive or triplet losses; preference tuning of LLMs uses objectives like the DPO loss. Many real systems also add regularisation terms (such as weight decay) to the loss to discourage overfitting. Get the loss wrong and the model will faithfully optimise the wrong objective, which is a frequent and subtle source of failure.
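One common way to add a regularisation term is an L2 penalty on the weights, which behaves like weight decay under plain gradient descent. The sketch below assumes that form; the penalty strength of 0.01, the example weights, and the function names l2_penalty and regularised_loss are hypothetical.

```python
def l2_penalty(weights, strength):
    """Penalty proportional to the sum of squared weights."""
    return strength * sum(w * w for w in weights)

def regularised_loss(data_loss, weights, strength=0.01):
    """Task loss plus the L2 penalty; strength is an illustrative value."""
    return data_loss + l2_penalty(weights, strength)

# Two hypothetical models with the same task loss but different weight sizes.
small_weights = regularised_loss(0.5, [0.1, -0.2, 0.3])  # about 0.5014
large_weights = regularised_loss(0.5, [3.0, -4.0, 5.0])  # about 1.0
print(small_weights, large_weights)
```

Both models fit the data equally well, but the one with large weights is charged a higher total loss, so the optimiser is pushed towards the smaller-weight solution.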
