What is recall@k versus precision@k?

Recall@k is the fraction of all relevant chunks that appear in the top k results; it answers "did we retrieve enough of the right context?" Precision@k is the fraction of the top k that are actually relevant; it answers "how much noise are we paying tokens for?" Raising k can only hold or raise recall, but it usually lowers precision, so the two metrics trade off against each other.
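The two definitions can be written out in a few lines. This is a minimal sketch, not the simulator's own code: the function names and the sample list are made up for illustration, and it assumes the results are already sorted best-first, with precision divided by the number of results actually returned when fewer than k exist.

```python
def recall_at_k(relevance, k):
    """Fraction of all relevant chunks that appear in the top k."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0  # assumption: report 0 when nothing is marked relevant
    return sum(relevance[:k]) / total_relevant


def precision_at_k(relevance, k):
    """Fraction of the top k results that are relevant."""
    top = relevance[:k]
    if not top:
        return 0.0
    return sum(top) / len(top)


# Hypothetical ranked results, best score first; True marks a relevant chunk.
ranked = [True, False, True, False, False, True]

for k in (1, 3, 5):
    print(k, round(recall_at_k(ranked, k), 2), round(precision_at_k(ranked, k), 2))
```

In this sample, going from k = 3 to k = 5 leaves recall unchanged while precision drops, which is the trade-off in miniature.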

What is MRR (Mean Reciprocal Rank)?

Reciprocal rank is 1 divided by the position of the first relevant result — so a relevant chunk at rank 1 scores 1.0, at rank 2 scores 0.5, at rank 3 scores 0.33, and so on. Averaging the reciprocal rank across all your test queries gives MRR. It rewards systems that put a relevant result near the very top, which matters when you only feed the LLM a few chunks.
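As a rough sketch of that calculation, with hypothetical function names and a made-up three-query test set, and assuming the common convention that a query with no relevant result scores 0:

```python
def reciprocal_rank(relevance):
    """1 / position of the first relevant result, or 0.0 if none is relevant."""
    for position, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / position
    return 0.0  # assumed convention for "no relevant result"


def mean_reciprocal_rank(queries):
    """Average the reciprocal rank over every query in the test set."""
    return sum(reciprocal_rank(r) for r in queries) / len(queries)


# Each inner list is one query's ranked results (True = relevant).
test_set = [
    [True, False, False],   # first relevant at rank 1, RR = 1.0
    [False, True, False],   # first relevant at rank 2, RR = 0.5
    [False, False, False],  # nothing relevant, RR = 0.0
]
mrr = mean_reciprocal_rank(test_set)  # (1.0 + 0.5 + 0.0) / 3
```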

How do I choose the right k for my RAG app?

Pick the smallest k where recall@k is high enough that the answer is almost always present, then stop. Every extra chunk costs context tokens and can dilute the model's context with irrelevant text. This simulator lets you see exactly where recall plateaus so you don't overpay for k.
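One way to turn that rule into a procedure is to average recall@k over a set of test queries and take the first k that clears a target. The sketch below is illustrative only: `smallest_k`, the 0.9 target, and the sample queries are assumptions, not values the simulator prescribes.

```python
def recall_at_k(relevance, k):
    """Fraction of all relevant chunks that appear in the top k."""
    total = sum(relevance)
    return sum(relevance[:k]) / total if total else 0.0


def smallest_k(queries, target=0.9, max_k=10):
    """Smallest k whose mean recall@k reaches the target, or None if none does."""
    for k in range(1, max_k + 1):
        mean_recall = sum(recall_at_k(r, k) for r in queries) / len(queries)
        if mean_recall >= target:
            return k
    return None


# Hypothetical ranked results for three test queries (True = relevant).
queries = [
    [True, True, False, False, False],
    [False, True, False, True, False],
    [True, False, True, False, False],
]
chosen_k = smallest_k(queries, target=0.9)
```

Here mean recall is about 0.83 at k = 3 and reaches 1.0 at k = 4, so a 0.9 target selects k = 4; a looser target would select a smaller k.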

Where do the similarity scores come from?

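You enter them manually, typically using the cosine similarity your vector database returns for each chunk against the query. The simulator only needs the relative ordering, so any consistent score works as long as a higher value means a closer match. If your database returns a distance, where lower is closer, convert it first (for example, cosine similarity = 1 − cosine distance). This lets you model real or hypothetical retrieval results without standing up a database.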
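If your database reports cosine distance rather than similarity, the ordering is reversed (lower is closer), so the values may need flipping before you enter them. The sketch below assumes the simulator ranks higher scores first, which the page implies but does not state outright; the chunk names and distances are invented.

```python
def to_similarity(cosine_distance):
    """Convert cosine distance to cosine similarity (similarity = 1 - distance)."""
    return 1.0 - cosine_distance


# Hypothetical chunks with cosine distances (lower means closer to the query).
distances = {"chunk_a": 0.35, "chunk_b": 0.10, "chunk_c": 0.62}

# After conversion, higher means closer, so a descending sort gives the ranking.
scores = {name: to_similarity(d) for name, d in distances.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```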

Is my data sent anywhere?

No. Ranking and all metric calculations run locally in your browser. You can paste real chunk text and scores safely.

What is the Top-K Retrieval Simulator?

Enter a query, candidate chunks, and a manual similarity score for each, then mark which chunks are relevant. The simulator ranks them, shows which are returned at each k, and computes recall@k, precision@k, and reciprocal rank. It is free on Gera Tools and runs entirely in your browser, with nothing uploaded.

Top-K Retrieval Simulator

Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Tuning a RAG retriever comes down to one question: how many chunks should I return? Too few and the answer isn’t in the context; too many and you waste tokens and confuse the model. This simulator lets you play out the trade-off with your own scores before touching the pipeline.

How it works

Enter the query, then each candidate chunk with the similarity score your vector search assigned and a checkbox marking whether it is genuinely relevant (your ground truth). The tool sorts the chunks by score — exactly as a real top-k retriever would — and then, for k = 1, 3, 5, and 10, computes:

Hit@k — was at least one relevant chunk returned?
Recall@k — what fraction of all relevant chunks made it into the top k?
Precision@k — what fraction of the top k were relevant?

It also reports the reciprocal rank for this query: 1 divided by the position of the first relevant chunk. Average that across your whole test set and you have MRR.
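The whole flow described above can be approximated in one short function. This is a sketch under stated assumptions, not the tool's actual implementation: `Chunk` and `simulate` are hypothetical names, scores are sorted in descending order, and when fewer than k chunks exist, precision is computed over the chunks actually returned.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float    # similarity score entered by hand (higher = closer)
    relevant: bool  # ground-truth label


def simulate(chunks, ks=(1, 3, 5, 10)):
    """Rank chunks by score, then compute Hit@k, Recall@k, Precision@k, and RR."""
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)
    labels = [c.relevant for c in ranked]
    total_relevant = sum(labels)

    rows = {}
    for k in ks:
        top = labels[:k]
        found = sum(top)
        rows[k] = {
            "hit": found > 0,
            "recall": found / total_relevant if total_relevant else 0.0,
            "precision": found / len(top) if top else 0.0,
        }

    # Position (1-based) of the first relevant chunk, or None if there is none.
    first = next((i for i, rel in enumerate(labels, start=1) if rel), None)
    return rows, (1.0 / first if first else 0.0)


# Made-up example data for a query about refunds.
chunks = [
    Chunk("refund policy overview", 0.82, False),
    Chunk("refunds are issued within 14 days", 0.79, True),
    Chunk("shipping times", 0.55, False),
    Chunk("how to request a refund", 0.74, True),
    Chunk("careers page", 0.20, False),
]
rows, rr = simulate(chunks)
```

With this data the top-scoring chunk is not relevant, so Hit@1 is false and the reciprocal rank is 0.5, while both relevant chunks are inside the top 3.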

The recall-precision trade-off in depth

These two metrics tend to move in opposite directions as k increases, and understanding the shape of that trade-off is the core value this simulator provides.

Recall@k answers: “Did the retriever find the answer?” It never decreases as k grows, and it plateaus once all relevant chunks are included. Watching where it plateaus tells you the minimum k that captures all the information you need.

Precision@k answers: “How much noise is the LLM reading?” It starts high (if the top result is relevant) and tends to fall as k increases, because you’re including more chunks that happen to score moderately well but don’t actually contain the answer.

The optimal k is where recall is high enough that the answer is almost always present, and precision hasn’t fallen so far that the model is flooded with irrelevant text. This simulator makes that sweet spot visible by showing both metrics across the four standard k values simultaneously.
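For a single query, the point where recall plateaus is simply the rank of the last relevant chunk, and every chunk past it only lowers precision. A small sketch, with a hypothetical helper name and made-up relevance labels:

```python
def plateau_k(relevance):
    """Smallest k at which recall@k reaches 1.0 (rank of the last relevant chunk)."""
    positions = [i for i, rel in enumerate(relevance, start=1) if rel]
    return positions[-1] if positions else None


# Hypothetical ranked results for one query (True = relevant).
ranked = [True, False, True, False, False, False, False, False]

k_star = plateau_k(ranked)
precision_at_plateau = sum(ranked[:k_star]) / k_star  # 2 relevant out of 3
precision_at_8 = sum(ranked[:8]) / 8                  # 2 relevant out of 8
```

In this example recall is already complete at k = 3; extending to k = 8 adds no relevant chunks and cuts precision from about 0.67 to 0.25.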

What MRR and reciprocal rank tell you

Reciprocal rank (RR) is 1 / position-of-first-relevant-chunk. If the first relevant chunk is at position 1 out of 10 results, RR = 1.0. If it’s at position 3, RR ≈ 0.33. If no relevant chunk is returned at all, RR is conventionally 0.

This matters because LLMs do not weight all chunks equally. In practice, models often attend more to context near the beginning and end of the input than to the middle. A relevant chunk buried mid-context, say at position 5 out of 10, tends to be less useful than one at position 1, even though both nominally appear in the context.

A low MRR (below roughly 0.4) combined with high recall@10 is the specific pattern that calls for a re-ranker: your embeddings are retrieving the right chunks, but ranking them poorly. A cross-encoder re-ranker can re-score the top-k candidates and move the most relevant one toward position 1, raising MRR without changing the underlying embedding model.
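The effect of re-ranking on reciprocal rank can be illustrated without any model at all. In the sketch below the re-ranker scores are invented numbers standing in for a cross-encoder's output; no real re-ranking library is called, and the chunk IDs are hypothetical.

```python
def reciprocal_rank(relevance):
    """1 / position of the first relevant result, or 0.0 if none is relevant."""
    for position, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / position
    return 0.0


# Candidates in the retriever's original order:
# (chunk_id, relevant, hypothetical re-ranker score)
candidates = [
    ("c1", False, 0.31),
    ("c2", False, 0.12),
    ("c3", False, 0.40),
    ("c4", True, 0.93),
    ("c5", False, 0.05),
]

rr_before = reciprocal_rank([rel for _, rel, _ in candidates])

# Re-rank the same candidate set by the second score, highest first.
reranked = sorted(candidates, key=lambda c: c[2], reverse=True)
rr_after = reciprocal_rank([rel for _, rel, _ in reranked])
```

The candidate set is unchanged, so recall at this k stays the same; only the order moves, and the reciprocal rank rises from 0.25 to 1.0.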

Diagnosing what the metrics reveal

Pattern	What it means	What to try
Low recall@k at all k	Relevant chunks aren’t being retrieved at all	Better chunking, different embedding model, or hybrid sparse+dense retrieval
High recall@5, low recall@1	Answer is there but buried	Add a re-ranker
Low precision at low k	Top-ranked chunks are mostly irrelevant	Improve embedding quality or add metadata filtering
Recall plateaus at k=3	You can safely reduce k to 3	Lower k to save context tokens and cost
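The table can be read as a set of simple checks over averaged metrics. The sketch below is one possible encoding, not part of the simulator: the function name, the 0.8 "high" and 0.5 "low" cut-offs, and the sample numbers are all assumptions you would tune for your own data.

```python
def diagnose(recall, precision, high=0.8, low=0.5):
    """Map averaged metrics ({k: value}) to the patterns in the table above."""
    findings = []
    ks = sorted(recall)

    if all(recall[k] < low for k in ks):
        findings.append("low recall at all k: revisit chunking, embeddings, or hybrid retrieval")
    if recall.get(5, 0.0) >= high and recall.get(1, 0.0) < low:
        findings.append("answer present but buried: add a re-ranker")
    if precision[ks[0]] < low:
        findings.append("low precision at low k: improve embeddings or add metadata filtering")

    # Plateau: the smallest k whose recall already equals recall at the largest k.
    plateau = next(k for k in ks if recall[k] == recall[ks[-1]])
    if plateau < ks[-1] and recall[plateau] >= high:
        findings.append(f"recall plateaus at k={plateau}: consider lowering k")
    return findings


# Hypothetical averaged metrics at the four k values the simulator reports.
recall = {1: 0.1, 3: 0.85, 5: 0.85, 10: 0.85}
precision = {1: 0.2, 3: 0.5, 5: 0.3, 10: 0.15}
findings = diagnose(recall, precision)
```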

Tips

Run several representative queries and average the metrics; a single query can be misleading.
A low MRR with high recall means relevant chunks are retrieved but ranked poorly — a re-ranker is the usual fix.
Pair this with the RAG Eval Dataset Builder to turn ad-hoc tuning into a repeatable regression test.