AI Benchmarks (AI Glossary)

MMLU, HumanEval, HellaSwag: how we measure whether AI is getting better


Definition

An AI benchmark is a standardised set of tasks — usually with known correct answers — used to measure and compare the capabilities of AI models. By running every model through the same questions, researchers turn a fuzzy claim like “this model is smarter” into a concrete, reproducible number such as an accuracy percentage or a code pass rate. Benchmarks are how the field tracks progress and how vendors justify “state-of-the-art” claims.

The major benchmarks

Different benchmarks probe different skills:

  • MMLU: multiple-choice questions across 57 subjects, testing broad knowledge and reasoning; reported as average accuracy.
  • HumanEval: 164 Python programming problems; scored by whether generated code passes the unit tests that accompany each problem (the “pass@k” metric, most often reported as “pass@1”).
  • HellaSwag: commonsense sentence-completion that is easy for humans but was historically hard for models.
  • GSM8K: grade-school maths word problems that test multi-step arithmetic reasoning.
  • SWE-bench: real GitHub issues, measuring whether a model can produce a patch that resolves the issue and passes the repository's tests.
  • MT-Bench / Chatbot Arena: conversation quality. MT-Bench scores multi-turn answers using a strong LLM as the judge, while Arena collects pairwise human votes and uses them to compute an Elo-style ranking.
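
To make the Elo-style ranking concrete, here is a minimal sketch of how pairwise votes can be turned into ratings. It assumes the classic Elo update with a starting rating of 1000 and a K-factor of 32, both picked for illustration. The function names, model names, and votes are hypothetical, and the statistical method behind any real leaderboard may differ.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> Tuple[float, float]:
    """Return the new (rating_a, rating_b) after one pairwise vote."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def rank_from_votes(votes: Iterable[Tuple[str, str, bool]],
                    start: float = 1000.0, k: float = 32.0) -> Dict[str, float]:
    """Replay votes in order; each vote is (model_a, model_b, a_won)."""
    ratings: Dict[str, float] = {}
    for model_a, model_b, a_won in votes:
        ra = ratings.setdefault(model_a, start)
        rb = ratings.setdefault(model_b, start)
        ratings[model_a], ratings[model_b] = update_elo(ra, rb, a_won, k)
    return ratings


# Hypothetical votes, invented for this example.
votes = [
    ("model_x", "model_y", True),
    ("model_x", "model_z", True),
    ("model_y", "model_z", False),
]
ratings = rank_from_votes(votes)
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Because each update depends on the current ratings, the result of this simple version depends on the order in which votes arrive, so it is only an approximation when there are few votes.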

What benchmarks get right

Benchmarks are valuable because they are objective, repeatable, and comparable. They let independent parties verify vendor claims, expose regressions between model versions, and direct research effort toward weak spots. Coding benchmarks like HumanEval and SWE-bench are especially useful because correctness is checkable by running tests rather than by subjective judgement.
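
As an illustration of how test-based scoring becomes a single number, the sketch below computes pass@k from per-problem sample counts. It uses the combinatorial estimator commonly used for this metric: with n generated samples of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k). The function names and the sample counts are assumptions made up for this example.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples passes.

    n: samples generated for the problem
    c: samples that passed all unit tests
    k: sample budget being scored (k <= n)
    """
    if n - c < k:
        # Fewer than k failing samples exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts: 10 samples per problem, with 3, 0 and 10 passing.
results = [(10, 3), (10, 0), (10, 10)]
print(f"pass@1 = {benchmark_pass_at_k(results, k=1):.3f}")
print(f"pass@5 = {benchmark_pass_at_k(results, k=5):.3f}")
```

With k = 1 the estimator reduces to the fraction of samples that pass, which is why pass@1 reads like a plain success rate.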

Known limitations

Benchmark scores can mislead. Contamination, where benchmark questions leak into training data, lets models recall answers instead of reasoning, which inflates results. Many benchmarks are also saturated: top models cluster at 90% or above, so the small gaps between them often fall within the sampling noise of a finite test set. Worst of all, a high score on a narrow test set rarely captures real-world qualities like factual reliability, instruction-following, latency, refusal behaviour, or how consistently the model holds up across thousands of everyday conversations.
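
A rough way to see why small gaps on a saturated benchmark can be noise is to put a confidence interval around the accuracy. The sketch below uses the normal approximation to the binomial and treats the two models' scores as independent samples. That is a simplification, since both models answer the same questions, and the function names and scores are hypothetical.

```python
from math import sqrt


def accuracy_margin(accuracy: float, n_questions: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for an accuracy."""
    return z * sqrt(accuracy * (1.0 - accuracy) / n_questions)


def difference_is_distinguishable(acc_a: float, acc_b: float,
                                  n_questions: int, z: float = 1.96) -> bool:
    """True if the gap exceeds z standard errors of the difference."""
    se = sqrt(acc_a * (1.0 - acc_a) / n_questions
              + acc_b * (1.0 - acc_b) / n_questions)
    return abs(acc_a - acc_b) > z * se


# Hypothetical scores of 90% and 92% on a 164-problem benchmark.
margin = accuracy_margin(0.90, 164)
print(f"90% on 164 problems: +/- {margin * 100:.1f} points")
print("90% vs 92% distinguishable:",
      difference_is_distinguishable(0.90, 0.92, 164))
```

On a test set this small, a score of 90% carries a margin of about 4.6 percentage points either way, so a two-point lead does not separate the models; the same gap on a test set of many thousands of questions would.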

How to read a leaderboard

Treat benchmarks as one signal, not a verdict. Check whether the benchmark matches your use case (coding scores say little if you need long-form writing), look for human-preference evaluations like Chatbot Arena alongside static tests, and always validate a model on your own representative tasks before committing. The benchmark that matters most is the one built from your actual workload.
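
Validating on your own workload can start very small. The sketch below is one possible shape for such a harness, not a standard tool: `evaluate`, `stub_model`, and the two test cases are all hypothetical, and the stub stands in for whatever call your real model client exposes.

```python
from typing import Callable, Iterable, Tuple

# A case pairs a prompt with a function that decides if the output is acceptable.
Case = Tuple[str, Callable[[str], bool]]


def evaluate(model: Callable[[str], str], cases: Iterable[Case]) -> float:
    """Return the fraction of cases whose output passes its check."""
    results = [bool(check(model(prompt))) for prompt, check in cases]
    if not results:
        raise ValueError("no test cases supplied")
    return sum(results) / len(results)


def stub_model(prompt: str) -> str:
    """Placeholder model; replace with a call to the system under test."""
    return "4" if "2 + 2" in prompt else "I am not sure"


# Hypothetical cases; in practice, draw these from real tasks you care about.
cases = [
    ("What is 2 + 2?", lambda out: "4" in out),
    ("Name the capital of France.", lambda out: "paris" in out.lower()),
]
print(f"pass rate: {evaluate(stub_model, cases):.0%}")
```

Writing each check as a function keeps the harness open to exact matches, keyword checks, or running generated code against tests, depending on what the task needs.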
