An AI benchmarking glossary demystifies the alphabet soup of scores that accompany every model release. MMLU, GPQA, HumanEval, SWE-bench, GSM8K, TruthfulQA — each measures something specific, and a headline percentage means little without knowing what was tested and how. This searchable reference explains the major benchmarks so you can read a model card with a critical eye.
How it works
Type a benchmark name or a keyword like “coding” or “reasoning” and the glossary filters instantly; you can also narrow by category: reasoning, knowledge, coding, math, language, safety, or agentic. Every entry follows the same three-part structure: what it tests, how it’s scored, and why it matters. That structure is the point. Knowing that HumanEval checks Python functions by running unit tests (not by matching text), or that SWE-bench measures whether a model can resolve real GitHub issues by patching an existing codebase, tells you far more than the raw number.
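The filtering described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the glossary's actual implementation: the `Entry` fields, the `GLOSSARY` sample data, and the `search` function are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One glossary entry in the three-part structure (hypothetical schema)."""
    name: str
    category: str
    tests: str    # what it tests
    scoring: str  # how it's scored
    matters: str  # why it matters


# A tiny illustrative sample, not the full glossary.
GLOSSARY = [
    Entry("HumanEval", "coding",
          "Writing Python functions from docstrings",
          "Generated code is run against unit tests",
          "Checks functional correctness rather than text similarity"),
    Entry("SWE-bench", "coding",
          "Resolving real GitHub issues in an existing repository",
          "The patch is applied and the project's tests are run",
          "Closer to day-to-day software maintenance"),
    Entry("GSM8K", "math",
          "Grade-school math word problems",
          "Final answer is compared with the reference answer",
          "Probes multi-step arithmetic"),
    Entry("MMLU", "knowledge",
          "Multiple-choice questions across many subjects",
          "Accuracy over all questions",
          "Broad check of factual recall"),
]


def search(entries, query="", category=None):
    """Return entries whose text contains `query` (case-insensitive),
    optionally restricted to one category."""
    q = query.strip().lower()
    results = []
    for e in entries:
        if category is not None and e.category != category:
            continue
        haystack = " ".join(
            (e.name, e.category, e.tests, e.scoring, e.matters)
        ).lower()
        if q in haystack:  # an empty query matches every entry
            results.append(e)
    return results
```

For example, `search(GLOSSARY, "coding")` would return the HumanEval and SWE-bench entries, while `search(GLOSSARY, category="math")` would return only GSM8K.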
How to read benchmark scores
A single benchmark is a narrow lens. A model can top a knowledge test like MMLU and still write weak code, so always look across the categories that match your use case. Watch for saturation: when the best models cluster near the ceiling, a benchmark can no longer separate them, which is why MMLU-Pro followed MMLU and competition-level math followed grade-school GSM8K. In coding, prefer pass@1, the share of problems solved on a single attempt, over pass@k or best-of-many figures, which count a problem as solved if any one of several samples passes and so can never be lower than pass@1.
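To make the gap between pass@1 and best-of-many figures concrete, here is a minimal sketch of the pass@k estimator commonly reported for code benchmarks such as HumanEval: draw n samples for a problem, count the c that pass the tests, and estimate the chance that at least one of k randomly chosen samples passes as 1 - C(n-c, k) / C(n, k). The function name `pass_at_k` and the numbers in the usage note are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples generated
    c: samples that passed the unit tests
    k: number of attempts the metric allows
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Probability that all k chosen samples fail is C(n-c, k) / C(n, k).
    # math.comb returns 0 when k > n - c, so the result is then 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 10 samples of which 3 pass, `pass_at_k(10, 3, 1)` is 0.3 while `pass_at_k(10, 3, 5)` is about 0.92. The same model on the same problem looks roughly three times better under the more generous metric, which is why it matters which k a headline figure uses.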
Why benchmarks can mislead
Scores are evidence, not verdicts. Contamination — when test questions leak into training data — can inflate results, and a model tuned to a benchmark may not generalise. Human-preference evaluations like Chatbot Arena and LLM-judged tests like MT-Bench capture conversational quality that multiple-choice accuracy ignores. The most reliable signal is still your own: run a model on a handful of your real tasks and compare. Use this glossary to interpret the published numbers, then validate the ones that matter for what you actually do.
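Running a model on a handful of your own tasks can be as simple as the sketch below. Everything in it is a stand-in: `model_a` and `model_b` are hypothetical stubs where a real model call would go, and the two entries in `TASKS` are toy examples rather than realistic workloads. The only point is the shape of the loop: each task pairs a prompt with a check you trust, and each model gets a pass rate on the same set.

```python
from typing import Callable

# A task is a prompt plus a function that decides whether an output is acceptable.
Task = tuple[str, Callable[[str], bool]]


def score(model: Callable[[str], str], tasks: list[Task]) -> float:
    """Fraction of tasks whose check accepts the model's output."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = sum(1 for prompt, check in tasks if check(model(prompt)))
    return passed / len(tasks)


# Hypothetical stand-ins; replace with calls to the models you are comparing.
def model_a(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "unknown"


def model_b(prompt: str) -> str:
    answers = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "unknown")


# Toy tasks; in practice these would come from your real workload.
TASKS: list[Task] = [
    ("What is 2 + 2?", lambda out: out.strip() == "4"),
    ("Capital of France?", lambda out: "paris" in out.lower()),
]

results = {name: score(fn, TASKS)
           for name, fn in [("model_a", model_a), ("model_b", model_b)]}
```

Here `results` maps each stub to its pass rate (0.5 for `model_a`, 1.0 for `model_b`). Because the tasks are your own and unpublished, they are far less likely to have leaked into training data than a public benchmark, though a handful of tasks only supports a rough comparison.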