Which AI Model Is Most Accurate? Factuality Benchmarks Compared

Ranked: the least-hallucinating AI models of 2024

What “accuracy” actually means for an LLM

“Accuracy” is not one number. A model can be excellent at reasoning yet routinely invent citations, or great at coding yet wrong about recent events. For factual reliability specifically, the relevant question is: when the model states something as fact in open-ended text, how often is it true, and does it admit uncertainty instead of guessing? That is factuality, and it is measured by different benchmarks than the knowledge and reasoning scores you usually see quoted.

The benchmarks that measure factuality

  • TruthfulQA tests whether a model resists common human misconceptions (“Does lightning never strike the same place twice?”). High scores mean the model avoids repeating popular falsehoods.
  • SimpleQA and similar fact-recall sets ask short, verifiable factual questions and report how often the model answers correctly, answers incorrectly, or declines to answer, which shows whether it guesses confidently when it should abstain.
  • FActScore decomposes a generated passage into individual atomic claims and checks each against a reliable source, giving a percentage of supported facts.
  • HalluLens / hallucination leaderboards track fabricated content rates in summarisation and Q&A tasks.
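
To make the scoring concrete, here is a minimal sketch of the arithmetic behind a FActScore-style "percentage of supported facts" and a SimpleQA-style breakdown. It assumes the hard part, judging each claim or answer, has already been done by a human or a grader model. The function names and the label strings ("correct", "incorrect", "not_attempted") are illustrative assumptions, not the benchmarks' official code.

```python
from collections import Counter


def supported_fraction(claim_is_supported):
    """FActScore-style score: share of atomic claims judged supported by a source."""
    if not claim_is_supported:
        raise ValueError("need at least one atomic claim")
    return sum(claim_is_supported) / len(claim_is_supported)


def short_answer_rates(grades):
    """Summarise per-question grades from a short-answer fact-recall set.

    Each grade is "correct", "incorrect", or "not_attempted" (hypothetical labels).
    """
    if not grades:
        raise ValueError("need at least one graded question")
    counts = Counter(grades)
    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]
    return {
        "correct": counts["correct"] / total,
        "incorrect": counts["incorrect"] / total,
        "not_attempted": counts["not_attempted"] / total,
        # Accuracy on the questions the model actually chose to answer.
        "correct_given_attempted": counts["correct"] / attempted if attempted else 0.0,
    }


# One generated passage split into four atomic claims, three of them supported.
passage_score = supported_fraction([True, True, False, True])

# Four short questions: two right, one wrong, one declined.
rates = short_answer_rates(["correct", "incorrect", "not_attempted", "correct"])

print(passage_score)
print(rates)
```

Reporting the incorrect rate separately from the not-attempted rate is what lets you tell a model that admits uncertainty apart from one that guesses.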

Crucially, MMLU and HumanEval are not factuality benchmarks — they measure multiple-choice knowledge and code correctness. A top MMLU score does not mean a model will not fabricate a court case or a research paper.

How the major models rank

On current public factuality tests, the frontier closed models (GPT-4o, Claude 3.5 Sonnet, Claude 3 Opus, and Gemini 1.5 Pro) sit at the top and trade places by a handful of points depending on the benchmark. Claude models are often praised for calibrated uncertainty (saying “I’m not certain”) and careful long-document grounding; GPT-4o and Gemini are strong all-rounders. Llama 3 70B is competitive on common knowledge but slips on obscure facts, while smaller models like Llama 3 8B and Mixtral hallucinate more on the long tail of rare entities and dates.

For research where citations matter, web-grounded tools such as Perplexity can beat a raw frontier model, because every claim is tied to a retrievable source you can verify yourself.

What reduces hallucination more than the model choice

Switching models gives a modest improvement; changing the setup gives a large one:

  • Ground the answer. Retrieval-augmented generation feeds the model real source text and instructs it to answer only from that context — the single biggest reliability win for factual tasks.
  • Demand citations. Asking for sources and then checking them surfaces fabrications quickly and discourages the model from inventing them.
  • Lower the temperature for factual work, and instruct the model to say “I don’t know” rather than guess.
  • Verify the long tail yourself. All models are least reliable on rare, recent, or niche facts — exactly where confident-sounding fabrications hide.
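
The first three steps can be sketched in a few lines. This is a minimal illustration, not a full retrieval pipeline: it assumes the passages have already been retrieved. The helper names `build_grounded_prompt` and `invalid_citations` and the prompt wording are hypothetical. The temperature setting is not shown because it belongs to whichever client library sends the prompt.

```python
import re


def build_grounded_prompt(question, passages):
    """Build a prompt that restricts the model to numbered source passages."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(passages, start=1))
    return (
        "Answer using only the numbered sources below. "
        "Cite the source number in square brackets after each claim. "
        "If the sources do not contain the answer, reply \"I don't know\".\n\n"
        f"Sources:\n{numbered}\n\n"
        f"Question: {question}\n"
    )


def invalid_citations(answer, passage_count):
    """Return cited source numbers that do not match any supplied passage."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return sorted(n for n in cited if not 1 <= n <= passage_count)


passages = [
    "The Eiffel Tower was completed in 1889.",
    "It was built for the 1889 Exposition Universelle in Paris.",
]
prompt = build_grounded_prompt("When was the Eiffel Tower completed?", passages)

# A made-up model answer that cites one real source and one that was never supplied.
bad = invalid_citations("It was completed in 1889 [1], as reported in [5].", len(passages))

print(prompt)
print(bad)
```

A citation that points at a source you never supplied is a cheap, automatic red flag. A citation that does exist still has to be read to confirm it supports the claim.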

Treat any “most accurate model” headline as a starting point, then test the candidates on your own representative questions before trusting one in production.
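
A do-it-yourself comparison can be as simple as the sketch below. It assumes each candidate model is wrapped in a plain function that takes a question and returns a short answer. The two stand-in functions are placeholders so the example runs without any API, and the exact-match grading is a deliberately crude assumption that real evaluations usually replace with human or model-based grading.

```python
def normalise(text):
    """Lower-case, trim, drop a trailing full stop, and collapse whitespace."""
    return " ".join(text.lower().strip().rstrip(".").split())


def evaluate(models, questions):
    """Score each model on (question, expected_answer) pairs.

    models: dict mapping a name to a callable(question) -> answer string.
    """
    if not questions:
        raise ValueError("need at least one question")
    total = len(questions)
    results = {}
    for name, ask in models.items():
        correct = abstained = 0
        for question, expected in questions:
            answer = normalise(ask(question))
            if answer == "i don't know":
                abstained += 1
            elif answer == normalise(expected):
                correct += 1
        results[name] = {
            "correct": correct / total,
            "abstained": abstained / total,
            "wrong": (total - correct - abstained) / total,
        }
    return results


# Stand-in "models"; replace these with calls to the real candidates.
def always_guesses(question):
    return "Paris"


def cautious(question):
    return "Paris" if "France" in question else "I don't know"


questions = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Australia?", "Canberra"),
]
report = evaluate({"always_guesses": always_guesses, "cautious": cautious}, questions)

for name, scores in report.items():
    print(name, scores)
```

Both stand-ins get the same share of questions right, but only one of them gives a wrong answer, which is exactly the difference a single accuracy number hides.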
