Perplexity (AI Glossary)

The standard way to measure how well an LLM predicts text — lower is better


What perplexity is

Perplexity is the standard intrinsic metric for how well a language model predicts text. Formally it is the exponential of the model’s average cross-entropy loss per token on a held-out test set. Intuitively, perplexity is the average number of equally likely options the model felt it was choosing among at each token. If a model has a perplexity of 10 on some text, it was, on average, about as uncertain as if it had to pick the right token from 10 equally probable candidates at every step. Lower is better: a perfect model that always assigned probability 1 to the correct token would have a perplexity of 1.
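The two cases above can be written out as a short worked equation. The notation is an assumption of this sketch, not something the glossary defines: N is the number of tokens, p(x_i | x_{<i}) is the probability the model assigns to the i-th actual token given the preceding ones, and logarithms are natural.

```latex
\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\ln p(x_i \mid x_{<i})\right)

% Uniform uncertainty over 10 candidates at every step:
p(x_i \mid x_{<i}) = \tfrac{1}{10}\ \text{for all } i
\;\Longrightarrow\;
\mathrm{PPL} = \exp\!\left(-\tfrac{1}{N}\cdot N \ln\tfrac{1}{10}\right) = \exp(\ln 10) = 10

% A perfect model:
p(x_i \mid x_{<i}) = 1\ \text{for all } i
\;\Longrightarrow\;
\mathrm{PPL} = \exp(0) = 1
```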

How it is computed

Perplexity is built directly on the same quantity used to train language models. During training the objective is to minimise cross-entropy loss, that is, the average negative log-probability the model assigns to the actual next token. Perplexity is simply that average loss exponentiated in the same base as the logarithm (e raised to the loss when it is measured in nats, 2 raised to the loss when it is measured in bits), which converts a log-scale number into an interpretable “effective branching factor.” You evaluate it on a held-out test set the model has not trained on, so it measures generalisation rather than memorisation. Because perplexity is an average per token, it depends on the tokeniser as well as the specific text used; two perplexity numbers are only comparable when both are measured the same way on the same data.
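A minimal Python sketch of this computation, assuming you already have the natural-log probability the model assigned to each actual token. The function name `perplexity`, the probability values, and the token counts are all hypothetical and chosen only for illustration.

```python
import math


def perplexity(token_logprobs):
    """Return exp(average negative log-probability) for one evaluation text.

    token_logprobs: natural-log probabilities the model assigned to the
    tokens that actually occurred (one value per token).
    """
    if not token_logprobs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)  # cross-entropy in nats
    return math.exp(avg_nll)


# Hypothetical probabilities for a five-token text (illustrative values only).
logprobs = [math.log(p) for p in (0.5, 0.25, 0.1, 0.8, 0.05)]
ppl = perplexity(logprobs)

# Tokeniser dependence: keep the total log-likelihood of the text fixed, but
# suppose a second (hypothetical) tokeniser splits the same text into 8 tokens
# instead of 5. The per-token average changes, and so does the perplexity.
total_logprob = sum(logprobs)
ppl_5_tokens = math.exp(-total_logprob / 5)
ppl_8_tokens = math.exp(-total_logprob / 8)

print(f"perplexity with 5 tokens: {ppl_5_tokens:.2f}")
print(f"perplexity with 8 tokens: {ppl_8_tokens:.2f}")
```

The second half illustrates why numbers from different tokenisers should not be compared directly: the model's total fit to the text is identical in both cases, yet the finer-grained tokenisation reports a lower per-token perplexity.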

Why it is useful

Perplexity is cheap, automatic, and requires no human judgement, which made it the workhorse metric for comparing language models during development. It gives a single number that tracks how well the model has captured the statistical structure of language: as a model trains, its perplexity falls; a better architecture or more data usually shows up as lower perplexity on the same benchmark. For pre-training research it is an excellent early signal of progress.

Why low perplexity is not the whole story

The crucial caveat is that perplexity measures predictive fit to text, not usefulness. It tells you the model is good at guessing the next token in held-out data; it says nothing directly about whether the model is helpful, truthful, safe, or able to reason through a task. A model can have excellent perplexity and still hallucinate facts, refuse reasonable requests, or write unhelpfully. It can also achieve low perplexity simply because its training data closely matches the style of the test set, without being any more capable. For these reasons, perplexity is used alongside task benchmarks (such as MMLU or HumanEval) and human evaluation, not as a standalone measure of quality.

Where it sits among metrics

Think of perplexity as the foundational, low-level health check derived straight from the next-token-prediction objective, while leaderboard benchmarks measure higher-level capabilities and human preference scores measure real usefulness. Each answers a different question, and a thorough evaluation of a model uses all three.
