What is an AI hallucination?

A hallucination is when a model produces confident, plausible-sounding output that is factually wrong or fabricated — invented citations, fake quotes, non-existent APIs. It stems from the model predicting likely text rather than retrieving verified facts.

Which tasks have the highest hallucination risk?

Open-ended factual questions about niche or recent topics, anything requiring precise figures or citations, and reasoning over long contexts. Risk is lowest for transformation tasks like summarising or rewriting text you supply.

Does a bigger model hallucinate less?

Generally yes: larger and more recent models hallucinate less and are better calibrated, but no model is immune. Capability reduces how often hallucination happens, not the underlying failure mode, so verification still matters for high-stakes outputs.

What is the single best mitigation?

Grounding the model in supplied source material — retrieval-augmented generation (RAG) — is usually the highest-leverage fix, because the model answers from documents you control rather than from parametric memory.
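As a rough illustration of grounding, the sketch below builds a prompt that restricts the model to supplied documents. The function name `build_grounded_prompt`, the document ids and the instruction wording are assumptions made for this example; the retrieval step that selects the documents is left out, and nothing here is tied to a particular model API.

```python
from typing import Dict


def build_grounded_prompt(question: str, documents: Dict[str, str]) -> str:
    """Build a prompt that tells the model to answer only from `documents`.

    `documents` maps a hypothetical source id to its text. In a full RAG
    setup these would come from a retrieval step, which is omitted here.
    """
    if not documents:
        raise ValueError("grounding needs at least one source document")

    # Label each source with its id so the model can cite it.
    sources = "\n\n".join(
        f"[{doc_id}]\n{text.strip()}" for doc_id, text in documents.items()
    )
    return (
        "Answer the question using only the sources below.\n"
        "Cite the source id in square brackets after each claim.\n"
        "If the sources do not contain the answer, reply exactly: I don't know.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question.strip()}\n"
    )


if __name__ == "__main__":
    docs = {"policy": "Refunds are accepted within 30 days of purchase."}
    print(build_grounded_prompt("What is the refund window?", docs))
```

The resulting string can be sent to whichever model you use; the point is that the answer is drawn from text you control rather than from parametric memory.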

Can I trust a model's confidence?

No. Models often state false claims with the same fluent confidence as true ones, and self-reported confidence is poorly calibrated. Treat verifiability of the claim, not the model's tone, as your risk signal.

What is the Hallucination Risk Scorer?

The Hallucination Risk Scorer is a free tool on Gera Tools that runs in your browser, with nothing uploaded. Enter the task type, model family, output domain and how verifiable the ground truth is to get a structured hallucination-risk score, the main risk factors, and recommended mitigations such as retrieval augmentation, grounding and verification steps.

Hallucination Risk Scorer

Name: Hallucination Risk Scorer
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Know how likely your AI task is to make things up

Hallucination — confident, fluent, fabricated output — is the dominant reliability failure of large language models. But the risk is not uniform: asking a model to summarise a document you supplied is far safer than asking it for the population of a small town in 1987 with a citation. This scorer takes the task type, model family, output domain and how verifiable the ground truth is, and returns a structured 0-100 risk score plus the specific factors and mitigations that apply.

How the score is built

The score combines independent risk factors that research and practice consistently show drive hallucination up or down:

Task type — transformation tasks (summarise, translate, reformat) ground the model in your text and score low. Open factual recall and citation generation score high because the model answers from fallible parametric memory.
Model family — larger, newer, better-aligned models hallucinate less and are better calibrated, lowering the score. Small or older models raise it.
Output domain — specialised, fast-moving or long-tail domains (law, medicine, recent events, niche code APIs) raise risk; everyday general knowledge lowers it.
Verifiability — if the answer can be checked against a source you supply, effective risk drops sharply because mitigations become easy and reliable.

The factors are weighted and combined into the 0-100 score, which is then placed in a band (low, moderate, high or severe) and shown with a plain-English driver list so you can see why.
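The weighting described above can be sketched in a few lines of Python. Every number here is an assumption chosen for illustration: the per-factor risk values, the weights, the band thresholds and the 0.6 cut-off for listing a driver are not the scorer's actual figures, and the category names are hypothetical labels for the inputs.

```python
from dataclasses import dataclass
from typing import List

# Illustrative per-factor risk values in [0, 1]; assumed, not the tool's own.
TASK_RISK = {"summarise": 0.10, "translate": 0.15, "reformat": 0.10,
             "open_factual_recall": 0.80, "citation_generation": 0.95}
MODEL_RISK = {"large_recent": 0.20, "mid": 0.50, "small_or_old": 0.85}
DOMAIN_RISK = {"general_knowledge": 0.20, "law": 0.80, "medicine": 0.80,
               "recent_events": 0.90, "niche_code_api": 0.75}
VERIFIABILITY_RISK = {"source_supplied": 0.10, "checkable_externally": 0.50,
                      "not_checkable": 0.90}

# Assumed weights; they sum to 1 so the score stays within 0-100.
WEIGHTS = {"task": 0.35, "model": 0.20, "domain": 0.20, "verifiability": 0.25}


@dataclass
class RiskResult:
    score: int          # 0-100
    band: str           # low, moderate, high or severe
    drivers: List[str]  # factors that pushed the score up, largest first


def band_for(score: int) -> str:
    # Assumed thresholds: four equal-width bands.
    if score < 25:
        return "low"
    if score < 50:
        return "moderate"
    if score < 75:
        return "high"
    return "severe"


def score_risk(task: str, model: str, domain: str, verifiability: str) -> RiskResult:
    factors = {
        "task": TASK_RISK[task],
        "model": MODEL_RISK[model],
        "domain": DOMAIN_RISK[domain],
        "verifiability": VERIFIABILITY_RISK[verifiability],
    }
    # Weighted sum of the factor risks, scaled to 0-100.
    score = round(100 * sum(WEIGHTS[name] * risk for name, risk in factors.items()))
    # A factor counts as a driver when its own risk is high (>= 0.6, assumed);
    # drivers are ordered by weighted contribution to the score.
    ranked = sorted(factors, key=lambda name: WEIGHTS[name] * factors[name], reverse=True)
    drivers = [name for name in ranked if factors[name] >= 0.6]
    return RiskResult(score=score, band=band_for(score), drivers=drivers)


if __name__ == "__main__":
    print(score_risk("summarise", "large_recent", "general_knowledge", "source_supplied"))
    print(score_risk("citation_generation", "small_or_old", "law", "not_checkable"))
```

A plain weighted sum keeps each factor's contribution easy to read off, which is what makes a driver list possible; a real scorer might instead let verifiability scale the other factors rather than simply add to them.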

Notes and tips

The biggest single lever is grounding: supply the source documents and instruct the model to answer only from them (retrieval-augmented generation).
For high or severe scores, add a verification pass — a second call or a human review that checks every factual claim and every citation.
Ask the model to say “I don’t know” when it is unsure and to cite its sources, then reject any output whose citations you cannot resolve.
This is a heuristic estimate, not a measurement. For high-stakes deployments, back it with an evaluation set that actually counts hallucinations on your real prompts.
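The citation check from the tips above could look something like the following sketch. It assumes answers cite sources as bracketed ids such as `[doc1]` and that you hold the supplied sources in a dictionary keyed by those ids; both conventions, and the name `check_answer`, are assumptions for this example. It only confirms that each citation points at a source you supplied, not that the source supports the claim, so a human or second-pass review is still needed for that.

```python
import re
from typing import Dict, List, Tuple

# Assumed citation format: a source id in square brackets, e.g. [doc1].
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")

# Accept both the straight and the curly apostrophe.
ABSTENTIONS = {"i don't know", "i don\u2019t know"}


def check_answer(answer: str, sources: Dict[str, str]) -> Tuple[bool, List[str]]:
    """Return (accepted, reasons). An answer is rejected if it cites nothing
    or cites an id that is not among the supplied sources."""
    # An explicit abstention is an acceptable outcome, not a failure.
    if answer.strip().lower().rstrip(".") in ABSTENTIONS:
        return True, []

    cited = CITATION.findall(answer)
    reasons: List[str] = []
    if not cited:
        reasons.append("no citations")
    for doc_id in sorted(set(cited) - set(sources)):
        reasons.append(f"unresolved citation: {doc_id}")
    return (not reasons), reasons


if __name__ == "__main__":
    sources = {"doc1": "The rate limit is 40 requests per minute."}
    print(check_answer("The rate limit is 40 requests per minute [doc1].", sources))
    print(check_answer("The rate limit is 60 requests per minute [doc9].", sources))
    print(check_answer("I don't know.", sources))
```

Counting how many outputs this check rejects across a set of your real prompts gives a crude first number for an evaluation set, though it will miss fabricated claims that happen to carry a valid citation.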