What is precision in retrieval?

Precision is the fraction of retrieved items that are actually relevant — true positives divided by total retrieved. High precision means few irrelevant chunks make it into your context. Low precision wastes context tokens and can distract the model with off-topic passages.

What is recall in retrieval?

Recall is the fraction of all relevant items that you successfully retrieved — true positives divided by total relevant. High recall means you are not missing important context. In RAG, low recall is dangerous because the model simply cannot answer correctly if the supporting passage was never retrieved.

What is the F1 score?

F1 is the harmonic mean of precision and recall, giving a single balanced number between 0 and 1. It rewards systems that are good at both rather than excelling at only one. Use it when you care about precision and recall roughly equally; weight them differently if one matters more for your use case.
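The F-beta score is one standard way to apply such a weighting; it reduces to F1 when beta is 1. The sketch below is illustrative only: the function name `f_beta` and the sample precision and recall values are assumptions, and nothing here is claimed to be part of the calculator itself.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta > 1 favours recall, beta < 1 favours precision, beta == 1 gives F1.
    """
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0.0:
        return 0.0  # nothing relevant was retrieved, so the score is 0
    return (1 + b2) * precision * recall / denominator


# Hypothetical scores for a single query.
p, r = 0.50, 0.60
print(round(f_beta(p, r), 3))       # 0.545 (plain F1)
print(round(f_beta(p, r, 2.0), 3))  # 0.577 (recall counts more)
print(round(f_beta(p, r, 0.5), 3))  # 0.517 (precision counts more)
```

Because recall is the higher of the two values here, weighting towards recall raises the score and weighting towards precision lowers it.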

What is mean reciprocal rank (MRR)?

MRR measures how high up the first relevant result appears. For a single query, the score is the reciprocal of the rank of the first relevant item (1 if it is first, 0.5 if second, and so on); MRR is the mean of that score across all queries in your evaluation set. It matters when only the top results feed the model, because a relevant chunk buried at rank 10 may never be used.
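Over an evaluation set with several queries, the per-query reciprocal ranks are averaged. A minimal sketch, assuming each query is a pair of (ranked retrieved IDs, relevant IDs) and that a query with no relevant hit contributes 0; the function names and sample IDs are hypothetical.

```python
from typing import Hashable, Iterable, Sequence, Tuple


def reciprocal_rank(retrieved: Sequence[Hashable], relevant: Iterable[Hashable]) -> float:
    """Return 1 / rank of the first relevant item, or 0.0 if none was retrieved."""
    relevant_set = set(relevant)
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant_set:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    queries: Sequence[Tuple[Sequence[Hashable], Iterable[Hashable]]]
) -> float:
    """Average the reciprocal rank over all queries (0.0 for an empty set)."""
    scores = [reciprocal_rank(retrieved, relevant) for retrieved, relevant in queries]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical evaluation set of three queries.
queries = [
    (["a", "b", "c"], {"a"}),       # first hit at rank 1 -> 1.0
    (["d", "e", "f"], {"e", "f"}),  # first hit at rank 2 -> 0.5
    (["g", "h", "i"], {"z"}),       # no hit              -> 0.0
]
print(mean_reciprocal_rank(queries))  # 0.5
```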

How many relevant IDs do I need to label?

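Even a small labelled set per query is useful, but a larger one is more representative. Label every chunk you know is genuinely relevant for the query. A relevant chunk you forget to label distorts the scores: if the retriever returns it, it is counted as irrelevant and unfairly lowers precision (and recall with it); if the retriever misses it, it silently drops out of the recall denominator and makes recall look better than it is. Be thorough.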

What is the Retrieval Precision & Recall Calculator?

Enter your retrieved chunk IDs and the ground-truth relevant IDs to compute precision, recall, F1 and MRR for RAG retrieval evaluation. The calculator is free on Gera Tools and runs entirely in your browser: there is no backend and nothing is uploaded.

Retrieval Precision & Recall Calculator

Name: Retrieval Precision & Recall Calculator
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

If you are building RAG, your answer quality is capped by your retrieval quality. This calculator turns two ID lists — what your retriever returned and what is actually relevant — into the four metrics that matter: precision, recall, F1, and mean reciprocal rank.

Why retrieval quality is the bottleneck in most RAG systems

A common pattern in RAG development is to spend significant effort on prompt engineering and model selection while treating the retrieval step as plumbing. The result is a system where the model behaves perfectly when given the right context, but the retriever frequently returns the wrong chunks — so answers are wrong not because of the model but because of what it was given to work with.

Precision and recall are the diagnostic lens for this problem. Low precision means the retrieved set is full of noise that wastes context tokens and can mislead the model. Low recall means critical information was not retrieved at all, so the model has no basis to give a correct answer regardless of prompt quality.

How it works

You paste two sets of IDs. The tool computes the overlap and derives:

Precision = relevant retrieved ÷ total retrieved (are the results clean?).
Recall = relevant retrieved ÷ total relevant (did you miss anything?).
F1 = harmonic mean of precision and recall (a single balanced score).
MRR = 1 ÷ rank of the first relevant result (does a good hit appear near the top?).

All computation is local — nothing is uploaded.
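The four formulas above can be sketched in a few lines. This is an illustrative version, not the calculator's actual source: the function name `retrieval_metrics`, the dictionary keys, the choice to ignore duplicate IDs, and the sample IDs are all assumptions.

```python
from typing import Dict, Hashable, Iterable, Sequence


def retrieval_metrics(
    retrieved: Sequence[Hashable], relevant: Iterable[Hashable]
) -> Dict[str, float]:
    """Compute precision, recall, F1 and reciprocal rank for one query.

    retrieved: chunk IDs in ranked order (best first).
    relevant:  ground-truth relevant chunk IDs, in any order.
    """
    relevant_set = set(relevant)
    # Assumption: a repeated ID counts once, at its first (best) rank.
    ranked = list(dict.fromkeys(retrieved))

    true_positives = sum(1 for chunk_id in ranked if chunk_id in relevant_set)
    precision = true_positives / len(ranked) if ranked else 0.0
    recall = true_positives / len(relevant_set) if relevant_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0

    # Reciprocal rank: 1 / position of the first relevant ID, else 0.
    rr = 0.0
    for rank, chunk_id in enumerate(ranked, start=1):
        if chunk_id in relevant_set:
            rr = 1.0 / rank
            break

    return {"precision": precision, "recall": recall, "f1": f1, "reciprocal_rank": rr}


# Hypothetical IDs: four retrieved chunks, three relevant ones.
scores = retrieval_metrics(["c7", "c2", "c9", "c4"], {"c2", "c4", "c5"})
print({name: round(value, 3) for name, value in scores.items()})
# {'precision': 0.5, 'recall': 0.667, 'f1': 0.571, 'reciprocal_rank': 0.5}
```

Only the retrieved list needs to be ordered, since rank matters for MRR alone; precision, recall and F1 depend purely on set overlap.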

Worked example

Say a query has five relevant chunks in the corpus (IDs: 12, 34, 56, 78, 90), and your retriever returns six chunks in ranked order: 34, 99, 12, 55, 56, 88.

Relevant retrieved: IDs 34, 12, 56 — so 3 true positives
Precision = 3 ÷ 6 = 0.50 (half the retrieved set was irrelevant)
Recall = 3 ÷ 5 = 0.60 (two relevant chunks, 78 and 90, were missed)
F1 = 2 × (0.50 × 0.60) ÷ (0.50 + 0.60) = 0.545
MRR: the first relevant result is at rank 1 (ID 34), so MRR = 1 ÷ 1 = 1.0

The MRR of 1.0 looks great, but the recall of 0.60 reveals that two relevant chunks were never retrieved. For a multi-fact question that depends on all five chunks, the answer would be incomplete.

Which metric to focus on for your use case

Scenario	Primary metric
Long-context model, cost is not a concern	Recall (retrieve everything relevant)
Tight context window, noise matters	Precision (keep retrieved set clean)
Only the top-k chunks feed the model	MRR (is the first relevant chunk near the top?)
Balanced, general evaluation	F1

Tips and interpretation

In RAG, low recall is usually the more dangerous failure: if the supporting passage was never retrieved, the model cannot answer correctly no matter how clever the prompt. Low precision wastes context and can distract the model, but at least the right answer is present. Watch MRR when only the top few chunks feed the model, because a relevant chunk buried at rank 10 may never be used. Label your ground-truth set thoroughly: a genuinely relevant chunk you forget to list is scored as irrelevant when the retriever returns it, which unfairly drags down precision and recall, and is left out of the recall denominator when the retriever misses it, which hides the miss.
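How much a missing label distorts the scores depends on whether the retriever returned the forgotten chunk. The sketch below reuses the IDs from the worked example and drops one label at a time; the helper `precision_recall` and the variable names are hypothetical.

```python
from typing import Hashable, Iterable, Sequence, Tuple


def precision_recall(
    retrieved: Sequence[Hashable], labelled_relevant: Iterable[Hashable]
) -> Tuple[float, float]:
    """Score a retrieved list against whatever labels are available."""
    labelled = set(labelled_relevant)
    true_positives = sum(1 for chunk_id in retrieved if chunk_id in labelled)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(labelled) if labelled else 0.0
    return round(precision, 3), round(recall, 3)


retrieved = [34, 99, 12, 55, 56, 88]
full_labels = {12, 34, 56, 78, 90}

# Complete labels: the scores from the worked example.
print(precision_recall(retrieved, full_labels))         # (0.5, 0.6)

# Forgot to label 56, which WAS retrieved: both scores drop.
print(precision_recall(retrieved, full_labels - {56}))  # (0.333, 0.5)

# Forgot to label 78, which was NOT retrieved: recall looks better than it is.
print(precision_recall(retrieved, full_labels - {78}))  # (0.5, 0.75)
```

Either way the reported numbers no longer describe the retriever, only the gaps in the label set.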