AI Benchmarking Glossary

Decode MMLU, HumanEval, HellaSwag, and every AI benchmark

Ad placeholder (leaderboard)

An AI benchmarking glossary demystifies the alphabet soup of scores that accompany every model release. MMLU, GPQA, HumanEval, SWE-bench, GSM8K, TruthfulQA — each measures something specific, and a headline percentage means little without knowing what was tested and how. This searchable reference explains the major benchmarks so you can read a model card with a critical eye.

How it works

Type a benchmark name or a keyword like “coding” or “reasoning” and the glossary filters instantly; you can also narrow by category — reasoning, knowledge, coding, math, language, safety, or agentic. Every entry follows the same three-part structure: what it tests, how it’s scored, and why it matters. That structure is the point — knowing that HumanEval checks Python functions by running unit tests (not matching text), or that SWE-bench measures patching real GitHub issues across a codebase, tells you far more than the raw number.

How to read benchmark scores

A single benchmark is a narrow lens. A model can top a knowledge test like MMLU and still write weak code, so always look across categories that match your use case. Watch for saturation — when the best models cluster near the ceiling, a benchmark stops being useful, which is why MMLU-Pro followed MMLU and competition-level math followed grade-school GSM8K. In coding, prefer pass@1, the strictest single-attempt measure, over best-of-many figures.

Why benchmarks can mislead

Scores are evidence, not verdicts. Contamination — when test questions leak into training data — can inflate results, and a model tuned to a benchmark may not generalise. Human-preference evaluations like Chatbot Arena and LLM-judged tests like MT-Bench capture conversational quality that multiple-choice accuracy ignores. The most reliable signal is still your own: run a model on a handful of your real tasks and compare. Use this glossary to interpret the published numbers, then validate the ones that matter for what you actually do.

Ad placeholder (rectangle)