Hallucination Risk Scorer

Estimate hallucination risk for a given AI task and model type

Know how likely your AI task is to make things up

Hallucination — confident, fluent, fabricated output — is the dominant reliability failure of large language models. But the risk is not uniform: asking a model to summarise a document you supplied is far safer than asking it for the population of a small town in 1987 with a citation. This scorer takes the task type, model family, output domain and how verifiable the ground truth is, and returns a structured 0-100 risk score plus the specific factors and mitigations that apply.

How the score is built

The score combines four risk factors that research and practice consistently show drive hallucination up or down:

  • Task type — transformation tasks (summarise, translate, reformat) ground the model in your text and score low. Open factual recall and citation generation score high because the model answers from fallible parametric memory.
  • Model family — larger, newer, better-aligned models tend to hallucinate less and to be better calibrated, which lowers the score. Small or older models raise it.
  • Output domain — specialised, fast-moving or long-tail domains (law, medicine, recent events, niche code APIs) raise risk; everyday general knowledge lowers it.
  • Verifiability — if the answer can be checked against a source you supply, effective risk drops sharply because mitigations become easy and reliable.

The factors are weighted and combined into a band — low, moderate, high or severe — with a plain-English driver list so you can see why.
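
As a rough illustration of the weighting and banding described above, here is a minimal Python sketch. The weights, the per-option risk values, the band thresholds and every name in it (WEIGHTS, RISK, BANDS, risk_score) are assumptions made up for this example; they are not the scorer's actual parameters.

```python
# Illustrative sketch only: all numbers and names below are assumptions,
# not the values the real scorer uses.

# Relative weight of each factor (sums to 1.0).
WEIGHTS = {"task": 0.35, "model": 0.20, "domain": 0.20, "verifiability": 0.25}

# Risk contribution of each option, on a 0.0 (safe) to 1.0 (risky) scale.
RISK = {
    "task": {"transformation": 0.1, "open_recall": 0.8, "citation_generation": 1.0},
    "model": {"large_recent": 0.2, "small_or_older": 0.8},
    "domain": {"general": 0.2, "specialised": 0.8},
    "verifiability": {"source_supplied": 0.1, "unverifiable": 1.0},
}

# Upper bound of each band on the 0-100 scale.
BANDS = [(25, "low"), (50, "moderate"), (75, "high"), (100, "severe")]


def risk_score(task, model, domain, verifiability):
    """Return (score, band, drivers) for one combination of inputs."""
    chosen = {
        "task": task,
        "model": model,
        "domain": domain,
        "verifiability": verifiability,
    }
    # Look up the risk of each chosen option.
    risks = {factor: RISK[factor][option] for factor, option in chosen.items()}
    # Weighted sum, scaled to 0-100 and rounded to a whole number.
    score = round(100 * sum(WEIGHTS[f] * r for f, r in risks.items()))
    # First band whose upper bound is not exceeded.
    band = next(name for limit, name in BANDS if score <= limit)
    # Driver list: the factors that pushed the score up.
    drivers = [f"{f}: {chosen[f]}" for f, r in risks.items() if r >= 0.5]
    return score, band, drivers


low = risk_score("transformation", "large_recent", "general", "source_supplied")
high = risk_score("citation_generation", "small_or_older", "specialised", "unverifiable")
print(low)
print(high)
```

With these assumed numbers, summarising a supplied document on a large recent model lands in the low band with no drivers, while citation generation in a specialised domain on a small model with no checkable source lands in the severe band with all four factors listed as drivers.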

Notes and tips

  • The biggest single lever is grounding: supply the source documents and instruct the model to answer only from them (retrieval-augmented generation).
  • For high or severe scores, add a verification pass — a second call or a human review that checks every factual claim and every citation.
  • Ask the model to say “I don’t know” and to cite, then reject outputs whose citations you cannot resolve.
  • This is a heuristic estimate, not a measurement. For high-stakes deployments, back it with an evaluation set that actually counts hallucinations on your real prompts.
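
The citation check and the evaluation-set count from the tips above can be sketched in a few lines of Python. This assumes citations are plain source identifiers that can be compared against the set of documents you supplied, and that a human reviewer has already labelled each evaluation output as hallucinated or not. The function names and identifiers are hypothetical.

```python
# Illustrative sketch only: function names and the "doc-N" identifiers are
# hypothetical, and citations are assumed to be simple source IDs.

def unresolved_citations(citations, known_sources):
    """Return the citations that do not match any supplied source."""
    return [c for c in citations if c not in known_sources]


def accept_output(citations, known_sources):
    """Accept an output only if it cites something and every citation resolves."""
    if not citations:
        return False  # no citation at all: nothing to verify against
    return not unresolved_citations(citations, known_sources)


def hallucination_rate(labels):
    """Fraction of evaluation outputs a reviewer marked as hallucinated.

    `labels` is a list of booleans, True where the output contained
    at least one fabricated claim or citation.
    """
    if not labels:
        raise ValueError("evaluation set is empty")
    return sum(labels) / len(labels)


known = {"doc-1", "doc-2"}
print(accept_output(["doc-1"], known))           # every citation resolves
print(accept_output(["doc-1", "doc-9"], known))  # "doc-9" was never supplied
print(hallucination_rate([True, False, False, True]))
```

Rejecting outputs with no citations at all is a design choice in this sketch: it treats an uncited answer as unverifiable rather than as correct by default.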