Why are there so many AI benchmarks?

No single test captures everything a model can do, so the field uses many — knowledge tests like MMLU, coding tests like HumanEval and SWE-bench, math tests like GSM8K and AIME, and safety tests like TruthfulQA. Reading a model's full benchmark card gives a far truer picture than one headline number.

What does 'saturated' mean for a benchmark?

A benchmark is saturated when top models score so high that it no longer separates them — HellaSwag and GSM8K are largely there. Saturation is why harder successors keep appearing, such as MMLU-Pro replacing MMLU and AIME-level math replacing grade-school math.

What is pass@1 in coding benchmarks?

pass@k is the probability that at least one of k generated samples for a problem passes its unit tests, averaged over all problems in the benchmark. pass@1, solving on the first try, is the most cited and the strictest, since it reflects single-shot reliability rather than best-of-many sampling.
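As a rough illustration, pass@k is commonly estimated by generating n samples per problem, counting the c samples that pass, and computing 1 - C(n-c, k) / C(n, k), the estimator popularised alongside HumanEval. The function names and the per-problem counts below are made up for the example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random set of k
    # samples contains no passing sample. math.comb returns 0 when
    # k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k: int) -> float:
    """Average the per-problem estimates over the whole benchmark."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical results: (samples generated, samples that passed) per problem.
results = [(10, 3), (10, 0), (10, 10)]

print(f"pass@1 = {benchmark_pass_at_k(results, 1):.3f}")
print(f"pass@5 = {benchmark_pass_at_k(results, 5):.3f}")
```

With the same samples, pass@5 can never be lower than pass@1, which is why best-of-many figures look more flattering.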

Is Chatbot Arena a benchmark?

It is an evaluation, but a human-preference one. Real users vote on anonymous head-to-head responses, and the votes are aggregated into an Elo-style rating. It complements static benchmarks by capturing what people actually prefer, which accuracy tests can miss.
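To show how head-to-head votes can turn into a rating, here is a minimal sketch of the classic Elo update. The K-factor of 32, the starting rating of 1000, and the model names are assumptions for illustration; the Arena's actual aggregation method may differ.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update_elo(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return new ratings after one vote.

    score_a is 1.0 if A was preferred, 0.0 if B was, 0.5 for a tie.
    """
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


# Hypothetical models and votes: (model A, model B, score for A).
ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b", 1.0), ("model_a", "model_b", 0.5)]

for a, b, score_a in votes:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)

print(ratings)
```

Each update moves the two ratings by equal and opposite amounts, so an upset against a higher-rated model shifts the ranking more than an expected win does.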

Should I trust a model's benchmark scores?

Treat them as evidence, not proof. Scores can be inflated by training on test-like data (contamination), and a high knowledge score says nothing about coding or safety. Look across multiple benchmarks and, where possible, test the model on your own real tasks.

What is the AI Benchmarking Glossary?

A searchable reference glossary of the major AI evaluation benchmarks — MMLU, GPQA, HumanEval, SWE-bench, GSM8K, MATH, TruthfulQA, Chatbot Arena, and more — explaining what each tests, how it's scored, and why it matters. It runs free in your browser on Gera Tools, with nothing uploaded.

AI Benchmarking Glossary

Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

An AI benchmarking glossary demystifies the alphabet soup of scores that accompany every model release. MMLU, GPQA, HumanEval, SWE-bench, GSM8K, TruthfulQA — each measures something specific, and a headline percentage means little without knowing what was tested and how. This searchable reference explains the major benchmarks so you can read a model card with a critical eye.

The major benchmark categories

Knowledge and reasoning — tests like MMLU (Massive Multitask Language Understanding) cover knowledge across 57 academic subjects in a multiple-choice format. GPQA (Graduate-Level Google-Proof Q&A) specifically targets questions requiring expert-level reasoning that cannot be answered by simple web lookup. These benchmarks measure breadth and depth of factual knowledge but say little about practical task execution.

Coding — HumanEval uses Python function completion problems verified by running unit tests. SWE-bench is harder: it gives a model a real GitHub issue and measures whether the model can produce a patch that passes the repository’s existing test suite. Pass@1 (solving on the first try) is the standard figure to compare; best-of-five or best-of-ten results are more optimistic and reflect best-case sampling, not typical use.
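The idea of verifying code by running unit tests, rather than comparing text, can be sketched as follows. This is a simplified illustration, not the HumanEval harness: the candidate solutions and tests are invented, and a real harness would run untrusted model output in a sandbox with timeouts instead of calling exec directly.

```python
def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Return True if the candidate code runs and every assertion holds."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function
        exec(test_src, namespace)       # run the assertions against it
    except Exception:
        # Any error, including a failed assertion, counts as a failure.
        return False
    return True


# Hypothetical model outputs for the prompt "write add(a, b)".
candidate_ok = "def add(a, b):\n    return a + b\n"
candidate_alt = "def add(x, y):\n    total = x + y\n    return total\n"
candidate_bad = "def add(a, b):\n    return a - b\n"

# Hypothetical unit tests for the same problem.
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

for src in (candidate_ok, candidate_alt, candidate_bad):
    print(passes_tests(src, tests))
```

The second candidate is written differently from the first but still passes, which is the point of execution-based scoring.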

Mathematics — GSM8K covers grade-school word problems and is now largely saturated for frontier models. MATH (a competition-level benchmark) and AIME (American Invitational Mathematics Examination) problems are harder and distinguish between top models more reliably.

Safety and honesty — TruthfulQA measures whether models avoid asserting common misconceptions even when a confident wrong answer is plausible. Safety benchmarks test refusal of harmful requests and robustness to adversarial prompting.

Human preference — Chatbot Arena (LMSYS Arena) has real users judge anonymous model pairs in live conversation, producing an Elo-style ranking. This captures qualities that multiple-choice tests cannot: helpfulness, naturalness, tone, and whether people actually prefer the output.

How it works

Type a benchmark name or a keyword like “coding” or “reasoning” and the glossary filters instantly; you can also narrow by category — reasoning, knowledge, coding, math, language, safety, or agentic. Every entry follows the same three-part structure: what it tests, how it’s scored, and why it matters. That structure is the point — knowing that HumanEval checks Python functions by running unit tests (not matching text), or that SWE-bench measures patching real GitHub issues across a codebase, tells you far more than the raw number.

How to read benchmark scores

A single benchmark is a narrow lens. A model can top a knowledge test like MMLU and still write weak code, so always look across categories that match your use case. Watch for saturation — when the best models cluster near the ceiling, a benchmark stops being useful, which is why MMLU-Pro followed MMLU and competition-level math followed grade-school GSM8K. In coding, prefer pass@1, the strictest single-attempt measure, over best-of-many figures.
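One rough way to spot saturation is to check whether the top few scores all sit near the ceiling within a narrow band. The thresholds below (top three models, 90 points, a 3-point spread) are arbitrary assumptions, and the scores are invented.

```python
def looks_saturated(scores, top_n=3, near_ceiling=90.0, max_spread=3.0):
    """Heuristic: the top_n scores are all high and tightly clustered."""
    top = sorted(scores, reverse=True)[:top_n]
    if len(top) < top_n:
        return False  # too few models to judge
    return min(top) >= near_ceiling and (max(top) - min(top)) <= max_spread


# Hypothetical leaderboard scores, in percent.
old_benchmark = [96.0, 95.1, 94.8, 80.0]
new_benchmark = [72.0, 61.5, 55.0, 40.2]

print(looks_saturated(old_benchmark))
print(looks_saturated(new_benchmark))
```

A benchmark that trips this kind of check can still confirm basic competence, but it no longer helps rank the leading models.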

Why benchmarks can mislead

Scores are evidence, not verdicts. Contamination — when test questions leak into training data — can inflate results, and a model tuned to a benchmark may not generalise. Human-preference evaluations like Chatbot Arena and LLM-judged tests like MT-Bench capture conversational quality that multiple-choice accuracy ignores. The most reliable signal is still your own: run a model on a handful of your real tasks and compare. Use this glossary to interpret the published numbers, then validate the ones that matter for what you actually do.
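A simple contamination check looks for long word sequences from a test question that appear verbatim in training text. The sketch below assumes an 8-word window and uses made-up text; real checks vary in window size and normalisation, and a clean result here does not rule out paraphrased leakage.

```python
import re


def ngrams(text: str, n: int) -> set:
    """Set of lowercase word n-grams, ignoring punctuation."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(test_item: str, corpus_text: str, n: int = 8) -> float:
    """Share of the test item's n-grams that also occur in the corpus."""
    item = ngrams(test_item, n)
    if not item:
        return 0.0  # item is shorter than the window
    return len(item & ngrams(corpus_text, n)) / len(item)


# Invented test question and two invented training snippets.
question = ("If a train travels 60 miles in 2 hours, "
            "what is its average speed in miles per hour?")
leaked = "Forum post: " + question + " Answer: 30."
clean = "The weather was mild and the trains ran on time all week."

print(overlap_fraction(question, leaked))
print(overlap_fraction(question, clean))
```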