If you are building RAG, your answer quality is capped by your retrieval quality. This calculator turns two ID lists — what your retriever returned and what is actually relevant — into the four metrics that matter: precision, recall, F1, and mean reciprocal rank.
How it works
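You paste two lists of IDs: what the retriever returned and what is actually relevant. Keep the retrieved IDs in rank order, since MRR depends on position. The tool computes the overlap and derives: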
- Precision = relevant retrieved ÷ total retrieved (are the results clean?).
- Recall = relevant retrieved ÷ total relevant (did you miss anything?).
- F1 = harmonic mean of precision and recall (a single balanced score).
- MRR = 1 ÷ rank of the first relevant result (does a good hit appear near the top?). For a single query this value is the reciprocal rank; MRR is its mean across several queries.
All computation is local — nothing is uploaded.
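For readers who want to reproduce the numbers in their own scripts, the four formulas above can be sketched in Python as follows. This is a minimal illustration, not the calculator's actual code: the function name `retrieval_metrics`, the de-duplication of repeated IDs, and the choice to return 0.0 when a denominator is zero are all assumptions made for the example.

```python
def retrieval_metrics(retrieved, relevant):
    """Score one query. `retrieved` is in rank order; `relevant` is the labeled ground truth."""
    # Assumption: drop duplicate IDs, keeping the first (best) rank of each.
    ranked = list(dict.fromkeys(retrieved))
    relevant_set = set(relevant)

    # Relevant retrieved = overlap between the two lists.
    hits = [doc_id for doc_id in ranked if doc_id in relevant_set]

    # Assumption: an empty denominator yields 0.0 rather than an error.
    precision = len(hits) / len(ranked) if ranked else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

    # Reciprocal rank: 1 / (1-based position of the first relevant result), 0.0 if none.
    reciprocal_rank = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant_set:
            reciprocal_rank = 1.0 / rank
            break

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "reciprocal_rank": reciprocal_rank,
    }


# Hypothetical IDs: 2 of the 4 retrieved chunks are relevant, 1 relevant chunk is missed.
scores = retrieval_metrics(
    retrieved=["d7", "d2", "d9", "d4"],
    relevant=["d2", "d4", "d5"],
)
print(scores)
```

With these made-up IDs, precision is 2/4, recall is 2/3, and the first relevant hit sits at rank 2, so the reciprocal rank is 1/2. Averaging `reciprocal_rank` over many queries gives MRR.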
Tips and interpretation
In RAG, low recall is usually the more dangerous failure: if the supporting passage was never retrieved, the model cannot answer correctly no matter how clever the prompt. Low precision wastes context and can distract the model, but at least the right answer is present. Watch MRR when only the top few chunks feed the model: a relevant chunk buried at rank 10 may never be used. Label your ground-truth set thoroughly, because a genuinely relevant chunk you forget to list skews the scores either way. If the retriever returns it, it is counted as a false positive and unfairly drags down precision; if the retriever misses it, the miss goes uncounted and recall looks better than it really is.
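To make the "buried at rank 10" case concrete, here is a small Python sketch that compares recall over the full result list with recall over only the top k chunks. The helper names `recall_at_k` and `first_relevant_rank`, the chunk IDs, and the cutoff `k = 3` are hypothetical, chosen only for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the labeled relevant IDs that appear in the top k results."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0  # assumption: no labeled relevant IDs scores 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant_set) / len(relevant_set)


def first_relevant_rank(retrieved, relevant):
    """1-based rank of the first relevant result, or None if there is none."""
    relevant_set = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant_set:
            return rank
    return None


# Hypothetical query: the only relevant chunk sits at rank 10.
retrieved = [f"c{i}" for i in range(1, 11)]  # c1 ... c10, in rank order
relevant = ["c10"]

k = 3  # assumed number of chunks passed to the model
rank = first_relevant_rank(retrieved, relevant)

print("first relevant rank:", rank)
print("reciprocal rank:", 1 / rank)
print("recall over full list:", recall_at_k(retrieved, relevant, len(retrieved)))
print(f"recall over top {k}:", recall_at_k(retrieved, relevant, k))
```

Here recall over the full list is perfect, yet recall over the top 3 is zero, and the low reciprocal rank is the signal that the relevant chunk would not reach the model.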