Question 1

What is an AI benchmark?

Accepted Answer

An AI benchmark is a fixed dataset of tasks with known correct answers used to score and compare models. Running every model on the same questions produces a single comparable number, such as accuracy or pass rate.
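
As a rough illustration of how that single number comes about, the sketch below scores two models against the same answer key. The answer key, the model outputs, and the `accuracy` helper are all made up for this example and are not taken from any real benchmark or evaluation library.

```python
def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("benchmark is empty")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Hypothetical answer key and model outputs, for illustration only.
answer_key = ["B", "D", "A", "C"]
model_a = ["B", "D", "A", "A"]
model_b = ["B", "C", "A", "A"]

# Every model is scored on the same questions, so the numbers are comparable.
scores = {
    "model_a": accuracy(model_a, answer_key),
    "model_b": accuracy(model_b, answer_key),
}
print(scores)
```

Real benchmarks differ mainly in how a prediction is judged correct (exact match, unit tests passing, a grader model), but the final step of dividing correct answers by total questions is usually this simple.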

Question 2

What does MMLU measure?

Accepted Answer

MMLU (Massive Multitask Language Understanding) tests knowledge and reasoning across 57 subjects, from history to law to maths, using four-option multiple-choice questions. It is reported as average accuracy across all subjects.
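
"Average accuracy across subjects" can be computed in more than one way, and the results differ when subjects have different numbers of questions. The sketch below shows both on toy data: averaging the per-subject accuracies (macro) and pooling every question together (micro). The function names and the sample results are assumptions made for this example, not part of any official MMLU tooling, so it is worth checking which convention a given report uses.

```python
from collections import defaultdict


def per_subject_accuracy(results):
    """results: list of (subject, is_correct) pairs, one per question."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for subject, is_correct in results:
        totals[subject] += 1
        hits[subject] += int(is_correct)
    return {subject: hits[subject] / totals[subject] for subject in totals}


def macro_average(results):
    """Mean of per-subject accuracies: every subject counts equally."""
    by_subject = per_subject_accuracy(results)
    return sum(by_subject.values()) / len(by_subject)


def micro_average(results):
    """Accuracy over all questions pooled: every question counts equally."""
    return sum(int(is_correct) for _, is_correct in results) / len(results)


# Toy results for two subjects (hypothetical, not real MMLU data).
toy_results = [
    ("history", True), ("history", False),
    ("law", True), ("law", True), ("law", True), ("law", False),
]

print(per_subject_accuracy(toy_results))
print(macro_average(toy_results), micro_average(toy_results))
```

Here the macro average weights history and law equally, while the micro average leans towards law because it has more questions.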

Question 3

Why don't benchmark scores always reflect real-world quality?

Accepted Answer

Benchmarks measure narrow, static tasks that can be gamed, memorised, or contaminated by appearing in training data. A model can top a leaderboard yet still hallucinate, refuse useful requests, or feel worse in everyday chat.

Question 4

What is benchmark contamination?

Accepted Answer

Contamination happens when benchmark questions and answers leak into a model's training data, so the model can recall them instead of reasoning its way to an answer. This inflates scores and is a major reason researchers prefer fresh, held-out, or human-preference evaluations.
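
One common heuristic for spotting leaks is to look for long word sequences (n-grams) that a benchmark item shares with training text. The sketch below is a minimal version of that idea. The function names, the sample texts, and the choice of `n` are assumptions for illustration, and a real check would need to scan a full training corpus and handle paraphrased or translated copies, which this does not.

```python
import re


def ngrams(text, n):
    """Set of lowercase word n-grams in text, ignoring punctuation."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item, training_doc, n=8):
    """Share of the item's n-grams that also appear in the training document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item is shorter than n words, so nothing to compare
    shared = item_grams & ngrams(training_doc, n)
    return len(shared) / len(item_grams)


# Hypothetical benchmark item and training documents.
item = "What is the capital of France? The capital of France is Paris."
leaked_doc = "Quiz dump: What is the capital of France? The capital of France is Paris. Next question."
clean_doc = "Bread rises because yeast produces carbon dioxide during fermentation."

leaked_score = overlap_ratio(item, leaked_doc, n=5)
clean_score = overlap_ratio(item, clean_doc, n=5)
print(leaked_score, clean_score)
```

A high ratio suggests the item appears nearly verbatim in the training text. A low ratio does not prove the item is clean, since reworded copies share few exact n-grams.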

AI Benchmarks (AI Glossary)

Definition

The major benchmarks

What benchmarks get right

Known limitations

How to read a leaderboard