Question 1

What is MMLU and what does it measure?

Accepted Answer

MMLU (Massive Multitask Language Understanding) is a multiple-choice benchmark with four answer options per question, covering 57 subjects, from history and law to college-level math and medicine. It measures breadth of knowledge and reasoning by reporting the percentage of questions answered correctly. It is one of the most widely cited single numbers for general LLM capability.
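
As a rough illustration of "percentage of questions answered correctly", the sketch below scores a list of multiple-choice predictions. The record layout and the function name `mmlu_accuracy` are assumptions made for this example, not part of any official harness. It returns both the per-question average and the per-subject average, because the two can differ when subjects have different numbers of questions, and it is worth checking which one a given report uses.

```python
from collections import defaultdict


def mmlu_accuracy(records):
    """Score multiple-choice predictions.

    records: iterable of (subject, predicted_letter, correct_letter).
    This layout is hypothetical, chosen only for the example.
    Returns (micro_accuracy, macro_accuracy, per_subject_accuracy).
    """
    counts = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for subject, predicted, answer in records:
        counts[subject][0] += int(predicted == answer)
        counts[subject][1] += 1

    per_subject = {s: c / n for s, (c, n) in counts.items()}
    total_correct = sum(c for c, _ in counts.values())
    total = sum(n for _, n in counts.values())

    micro = total_correct / total                        # average over questions
    macro = sum(per_subject.values()) / len(per_subject)  # average over subjects
    return micro, macro, per_subject


# Made-up predictions, for illustration only.
records = [
    ("law", "A", "A"),
    ("law", "B", "C"),
    ("law", "D", "D"),
    ("math", "C", "C"),
]
micro, macro, per_subject = mmlu_accuracy(records)
print(f"micro={micro:.3f} macro={macro:.3f}")
```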

Question 2

What does HumanEval test?

Accepted Answer

HumanEval measures coding ability. It gives the model 164 Python programming problems, each specified as a function signature and docstring, and runs the generated code against unit tests that the model does not see. The score, usually reported as pass@1, is the fraction of problems solved correctly on the first attempt. This makes it a test of functional correctness rather than code style.
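
To make the pass@1 number concrete, here is a minimal sketch of the pass@k estimator described in the HumanEval paper: with n generated samples for a problem, of which c pass the tests, pass@k is estimated as 1 - C(n-c, k) / C(n, k), then averaged over problems. With one sample per problem and k = 1 this reduces to the plain fraction of problems solved. The function names and the sample counts below are illustrative assumptions.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k for one problem.

    n: samples generated, c: samples that passed the tests, k: attempt budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that k samples drawn without replacement all fail,
    # subtracted from 1. comb() returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Made-up counts for three problems, 10 samples each.
results = [(10, 3), (10, 0), (10, 10)]
score = benchmark_pass_at_k(results, k=1)
print(f"pass@1 = {score:.3f}")
```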

Question 3

Why is the LMSYS Chatbot Arena considered more reliable than static benchmarks?

Accepted Answer

The Chatbot Arena collects real human votes on blind, head-to-head comparisons of model responses and ranks models with an Elo-style rating. Because it uses live, diverse prompts and human preference rather than fixed test questions, it is harder to game and reflects perceived quality, though it is subjective and can favour style over accuracy.
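
For intuition about how head-to-head votes turn into a ranking, the sketch below applies the classic online Elo update to a single vote. It is not the Arena's actual pipeline. The starting rating of 1000, the K-factor of 32, and the model names are assumptions chosen for the example.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update(ratings, model_a, model_b, outcome, k=32.0, initial=1000.0):
    """Apply one vote in place and return the ratings dict.

    outcome: 1.0 if model_a wins, 0.0 if model_b wins, 0.5 for a tie.
    k and initial are illustrative defaults, not the Arena's settings.
    """
    r_a = ratings.get(model_a, initial)
    r_b = ratings.get(model_b, initial)
    e_a = expected_score(r_a, r_b)
    ratings[model_a] = r_a + k * (outcome - e_a)
    ratings[model_b] = r_b + k * ((1.0 - outcome) - (1.0 - e_a))
    return ratings


ratings = {}
update(ratings, "model-x", "model-y", outcome=1.0)
print(ratings)  # {'model-x': 1016.0, 'model-y': 984.0}
```

An upset moves ratings more than an expected result: when the lower-rated model wins, `outcome - e_a` is larger in magnitude, so both ratings shift further.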

Question 4

Can benchmark scores be misleading?

Accepted Answer

Yes. Test questions can leak into training data (contamination), making scores look better than real ability. Benchmarks also saturate: once top models score near the ceiling, the benchmark no longer separates them. And a high score on one task says little about others. Scores are useful comparative signals, not proof that a model will perform well on your specific use case.
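
One simple way to look for contamination is to check whether long word sequences from a test question appear verbatim in training text. The sketch below does this with word n-gram overlap. The window size of 8 words, the function names, and the sample strings are assumptions for illustration. Real contamination checks vary, and an exact-match check like this one will miss paraphrased leaks.

```python
import re


def ngrams(text, n):
    """Set of lowercase word n-grams in text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(question, corpus_text, n=8):
    """Fraction of the question's n-grams that also occur in corpus_text."""
    q = ngrams(question, n)
    if not q:
        return 0.0  # question shorter than n words: nothing to compare
    return len(q & ngrams(corpus_text, n)) / len(q)


# Made-up strings, for illustration only.
question = "What is the capital of France and which river runs through it"
leaked = "quiz dump: what is the capital of France and which river runs through it? answer: Paris"
clean = "A travel guide describing museums, cafes and parks in several European cities"

print(overlap_fraction(question, leaked))
print(overlap_fraction(question, clean))
```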

LLM Benchmarks Explained: MMLU, HumanEval, and LMSYS Chatbot Arena

Why benchmarks exist

MMLU: breadth of knowledge

HumanEval and GSM8K: skills you can verify

MT-Bench and the LMSYS Chatbot Arena: human preference

The limitations every reader should know