Why benchmarks exist
Large language models are general-purpose, so a single accuracy number cannot capture how “good” one is. Benchmarks break the problem into measurable slices — knowledge, reasoning, coding, math, conversational quality — and give each a repeatable score. That lets researchers and buyers compare models on the same footing and track progress over time. The catch is that no benchmark is the whole picture; each measures one narrow thing under specific rules, and reading leaderboards well means knowing exactly what each number represents.
MMLU: breadth of knowledge
MMLU (Massive Multitask Language Understanding) is one of the most-cited general benchmarks. It poses roughly 16,000 four-option multiple-choice questions across 57 subjects, ranging from elementary mathematics to professional-level law and medicine. The score is simply the percentage answered correctly, so random guessing scores about 25%. MMLU rewards broad factual recall and reasoning, which is why it became the default headline figure. It has two weaknesses. First, multiple-choice scores are sensitive to prompt format and answer ordering, and can be inflated by eliminating implausible options rather than knowing the answer. Second, many top models now score so high that the benchmark is saturating, leaving little room to tell strong models apart.
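The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation harness: the function name `mmlu_accuracy`, the `(subject, predicted, gold)` record layout, and the sample records are all assumptions made for the example. It reports both the overall share correct and the average of per-subject accuracies, because published MMLU figures do not always say which of the two they use.

```python
from collections import defaultdict

# Four answer options per question, so blind guessing lands near 25%.
CHANCE_ACCURACY = 1 / 4


def mmlu_accuracy(records):
    """Return (micro, macro) accuracy for multiple-choice results.

    records: iterable of (subject, predicted_letter, gold_letter) tuples.
    This layout is hypothetical and chosen only for the illustration.
    """
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for subject, predicted, gold in records:
        per_subject[subject][0] += int(predicted == gold)
        per_subject[subject][1] += 1
    if not per_subject:
        raise ValueError("no records to score")

    total_correct = sum(correct for correct, _ in per_subject.values())
    total = sum(count for _, count in per_subject.values())
    micro = total_correct / total  # every question weighted equally
    macro = sum(c / n for c, n in per_subject.values()) / len(per_subject)
    return micro, macro  # macro weights every subject equally


# Made-up results for two subjects.
sample = [
    ("law", "A", "A"),
    ("law", "B", "C"),
    ("math", "D", "D"),
]
micro, macro = mmlu_accuracy(sample)
print(f"micro={micro:.3f} macro={macro:.3f} chance={CHANCE_ACCURACY:.2f}")
```

The two averages diverge when subjects have very different question counts, so it is worth checking which one a leaderboard reports before comparing numbers across sources.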
HumanEval and GSM8K: skills you can verify
Some benchmarks test skills with objectively checkable answers. HumanEval measures coding: the model is given 164 Python function specifications (a signature plus a docstring) and its generated code is executed against unit tests it is not shown. The common metric, pass@1, is the share of problems solved by a single generated attempt; it is a functional test, not a judgement of style. GSM8K measures grade-school math word problems, where the model must reason through several steps to reach a single numeric answer. Both are valuable because correctness is unambiguous, but each covers a narrow domain: a great HumanEval score says nothing about writing quality.
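Both checks are mechanical enough to sketch. The first function below is the commonly used unbiased pass@k estimator: given n sampled solutions for a problem, of which c pass the tests, it estimates the chance that at least one of k samples passes. With n = 1 and k = 1 it reduces to the plain "solved on the first try" reading of pass@1. The second part is a simplified GSM8K-style grader. It assumes the reference solution ends with a `#### <answer>` line, as in the public dataset, and it takes the last number in the model's output as its answer, which is a common heuristic rather than an official rule. The helper names are illustrative.

```python
import math
import re


def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k from n samples, c of which passed."""
    if n - c < k:
        return 1.0  # fewer than k failures: any k samples include a pass
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# Matches integers and decimals, allowing thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def last_number(text):
    """Return the last number in text as a float, or None if absent."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def gsm8k_correct(model_output, reference_solution):
    """Compare the model's final number with the reference answer."""
    gold = last_number(reference_solution.split("####")[-1])
    predicted = last_number(model_output)
    if gold is None or predicted is None:
        return False
    return math.isclose(predicted, gold)


# One problem, 10 sampled solutions, 3 of which passed the tests.
print(pass_at_k(n=10, c=3, k=1))

# A made-up word problem result, for illustration only.
print(gsm8k_correct("Half of 48 is 24, so 48 + 24 = 72.", "...\n#### 72"))
```

A benchmark-level pass@k is the mean of this per-problem estimate over all problems. The last-number heuristic can misfire when a model adds trailing commentary containing digits, so real graders usually also constrain the output format.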
MT-Bench and the LMSYS Chatbot Arena: human preference
Fixed test sets struggle to capture open-ended conversational quality, so the field also uses preference-based evaluation. MT-Bench uses 80 multi-turn questions scored by a strong model (GPT-4 in the original setup) acting as judge. The LMSYS Chatbot Arena goes further: real users submit their own prompts, see two anonymous model answers side by side, and vote for the better one. Votes feed an Elo-style rating that ranks models the way chess players are ranked. Because the prompts are live, diverse, and human-judged, the Arena is harder to game and often reflects real-world usefulness better than static tests. The trade-off is that preference is subjective and can reward confident style over factual accuracy.
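To make the Elo idea concrete, here is a minimal sketch of how pairwise votes can move ratings. It uses the textbook Elo update and is not a description of the Arena's actual pipeline. The step size `k=32`, the starting rating of 1000, and the model names are all illustrative assumptions.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(ratings, model_a, model_b, outcome, k=32.0, initial=1000.0):
    """Apply one vote in place.

    outcome: 1.0 if model_a was preferred, 0.0 if model_b, 0.5 for a tie.
    k and initial are illustrative defaults, not values from any leaderboard.
    """
    rating_a = ratings.get(model_a, initial)
    rating_b = ratings.get(model_b, initial)
    expected_a = expected_score(rating_a, rating_b)
    # The winner gains exactly what the loser gives up, scaled by surprise.
    delta = k * (outcome - expected_a)
    ratings[model_a] = rating_a + delta
    ratings[model_b] = rating_b - delta
    return ratings


# Hypothetical votes: (model shown as A, model shown as B, outcome for A).
votes = [
    ("model-x", "model-y", 1.0),
    ("model-y", "model-z", 0.5),
    ("model-x", "model-z", 1.0),
]
ratings = {}
for a, b, outcome in votes:
    update_elo(ratings, a, b, outcome)

for name, rating in sorted(ratings.items(), key=lambda item: -item[1]):
    print(f"{name}: {rating:.1f}")
```

An upset win over a higher-rated model moves ratings more than an expected win, which is why a handful of votes against strong opponents can shift a new model quickly. Sequential updates like this depend on the order of votes, so a production leaderboard may fit all votes jointly instead.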
The limitations every reader should know
Benchmarks are signals, not verdicts. The biggest pitfall is data contamination: if test questions appear in a model’s training data, its score reflects memorisation rather than ability. Saturation means once models approach the ceiling, small differences become noise. Narrowness means a leaderboard win on math says little about coding or safety. And benchmarks rarely test the things that matter most in production — latency, cost, reliability under your prompts, and refusal behaviour. The smart way to use them is as a first filter, then validate the shortlist on your own representative tasks before trusting any single number.
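The saturation point can be checked with basic statistics. The sketch below computes a Wilson score interval for a benchmark accuracy, a standard interval for a proportion that behaves better than the plain normal approximation near 0% or 100%. The two scores (148 and 151 correct out of 164, a HumanEval-sized test set) are hypothetical numbers chosen for illustration, and treating each question as an independent trial is a simplifying assumption.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Two hypothetical models on a 164-question benchmark.
low_a, high_a = wilson_interval(148, 164)
low_b, high_b = wilson_interval(151, 164)
print(f"model A: {148 / 164:.3f}  interval [{low_a:.3f}, {high_a:.3f}]")
print(f"model B: {151 / 164:.3f}  interval [{low_b:.3f}, {high_b:.3f}]")
print("intervals overlap:", low_b <= high_a and low_a <= high_b)
```

When the intervals overlap this heavily, a gap of a couple of points on a small test set is weak evidence that one model is better, which is one more reason to confirm a shortlist on your own tasks.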