Question 1

What does MMLU actually measure?

Accepted Answer

MMLU (Massive Multitask Language Understanding) is a set of about 16,000 four-option multiple-choice questions across 57 subjects, ranging from history and law to mathematics and medicine. It tests broad knowledge and reasoning rather than any single skill. A high MMLU score indicates wide factual coverage. However, because the questions are multiple-choice and widely published, top models now cluster near the ceiling, so small score differences between them are not very meaningful.
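To make the point about small differences concrete, here is a rough sketch that treats every question as an independent pass/fail trial and computes a normal-approximation 95% interval around an accuracy score. The function names and the two scores are hypothetical, and using 16,000 as the sample size is an assumption taken from the figure above (the scored split of the benchmark may be smaller, which would widen the intervals).

```python
import math


def accuracy_interval(accuracy: float, n_questions: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation confidence interval for a benchmark accuracy.

    Assumes each question is an independent trial, which is a simplification.
    """
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n_questions)
    margin = z * standard_error
    return max(0.0, accuracy - margin), min(1.0, accuracy + margin)


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when the two intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical scores, not reported results for any real model.
model_a = accuracy_interval(0.885, 16_000)
model_b = accuracy_interval(0.890, 16_000)

print(model_a, model_b, intervals_overlap(model_a, model_b))
```

Under these assumptions the two intervals overlap, so a half-point gap would not be enough on its own to say which model is better. Overlapping intervals are only a quick heuristic; a paired comparison on the same questions would be a more careful test.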

Question 2

What is the difference between HumanEval and SWE-bench for coding?

Accepted Answer

HumanEval tests whether a model can write a small, self-contained Python function, given its signature and docstring, that passes the benchmark's unit tests. That is useful but narrow. SWE-bench is far harder and more realistic: it asks a model to resolve real GitHub issues in large open-source codebases, which requires reading many files, understanding context, and making correct edits. SWE-bench scores are therefore generally a better signal of real software engineering ability than HumanEval scores.
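For intuition about what "passes the unit tests" means in a HumanEval-style check, here is a minimal sketch. It is loosely modeled on the idea of a test script that defines a check function and is not the benchmark's actual harness; the `passes_tests` helper and the toy `add` task are hypothetical.

```python
def passes_tests(candidate_source: str, test_source: str, entry_point: str) -> bool:
    """Return True if the generated function passes every assertion in the tests.

    WARNING: exec() runs arbitrary code. A real harness would run candidates in
    a sandboxed process with a timeout; this sketch skips that for brevity.
    """
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)          # define the candidate function
        exec(test_source, namespace)               # define check(candidate)
        namespace["check"](namespace[entry_point])  # run the assertions
    except Exception:
        # Any failed assertion, syntax error, or runtime error counts as a fail.
        return False
    return True


# Toy task (hypothetical): the tests expect a correct add(a, b).
tests = (
    "def check(candidate):\n"
    "    assert candidate(1, 2) == 3\n"
    "    assert candidate(-1, 1) == 0\n"
)
good = "def add(a, b):\n    return a + b\n"
buggy = "def add(a, b):\n    return a - b\n"

print(passes_tests(good, tests, "add"))
print(passes_tests(buggy, tests, "add"))
```

A check like this only confirms behaviour on the listed cases for one isolated function, which is why it says little about navigating and editing a large codebase.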

Question 3

Why do benchmark scores sometimes overstate a model's ability?

Accepted Answer

There are two main reasons. The first is data contamination: if benchmark questions and answers appeared in the model's training data, the model can recall them rather than reason, which inflates the score. The second is saturation: once the best models score above roughly ninety percent, the benchmark stops discriminating between them and the remaining points are mostly noise. Always check whether a benchmark is contaminated or saturated before trusting a headline number.
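One simple way to look for contamination is to measure how many word n-grams from a benchmark item also appear verbatim in a sample of training text. The sketch below is a simplified illustration of that idea, not any lab's actual procedure; the helper names, the whitespace tokenization, and the choice of 8-word n-grams are all assumptions.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Set of lowercase, whitespace-tokenized word n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur in the corpus text."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item is shorter than n tokens, so nothing to compare
    corpus_grams = ngrams(corpus_text, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Made-up example texts for illustration only.
question = "What is the capital of France and which river runs through it"
leaked_page = "forum post: what is the capital of france and which river runs through it answer paris"
clean_page = "paris is a city in europe known for museums and cafes"

print(overlap_fraction(question, leaked_page))
print(overlap_fraction(question, clean_page))
```

A high fraction suggests the item may have been seen during training. A low fraction does not prove the item is clean, because paraphrased or translated copies would not match verbatim.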

Question 4

Should I pick a model based on leaderboard rank alone?

Accepted Answer

No. Leaderboards are a useful starting filter, but they average over tasks that may not match yours. The model that tops MMLU may not be the one that writes the cleanest code for your stack or follows your formatting instructions best. Use benchmarks to build a shortlist, then run your own small evaluation on tasks that resemble your real workload before committing.
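A small evaluation of your own can be very simple. The sketch below scores each shortlisted model by exact match on a handful of prompt and expected-output pairs. Everything in it is hypothetical: the models are stand-in functions rather than real API clients, and exact match is only one possible scoring rule, so you would swap in your own calls and your own notion of a correct answer.

```python
from typing import Callable

# A "model" here is any function from prompt text to output text.
Model = Callable[[str], str]


def exact_match_rate(model: Model, cases: list[tuple[str, str]]) -> float:
    """Share of cases where the model output equals the expected text."""
    if not cases:
        raise ValueError("need at least one evaluation case")
    hits = sum(model(prompt).strip() == expected.strip() for prompt, expected in cases)
    return hits / len(cases)


def rank_models(models: dict[str, Model], cases: list[tuple[str, str]]) -> list[tuple[str, float]]:
    """Score every model on the same cases and sort from best to worst."""
    scores = {name: exact_match_rate(model, cases) for name, model in models.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Stand-in models for illustration; replace with calls to real systems.
def stub_a(prompt: str) -> str:
    return {"2+2": "4", "upper: abc": "ABC"}.get(prompt, "")


def stub_b(prompt: str) -> str:
    return "4"


cases = [("2+2", "4"), ("upper: abc", "ABC")]
ranking = rank_models({"stub_a": stub_a, "stub_b": stub_b}, cases)
print(ranking)
```

Even a few dozen cases drawn from your real workload can separate models that look identical on a leaderboard, though with so few cases the result should be read as a rough signal.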

AI Model Benchmarks Explained: MMLU, HumanEval, HELM, and More

Why benchmarks exist and what they hide

Knowledge and reasoning benchmarks

Coding, math, and conversational benchmarks

Reading scores without being fooled