Top-K Retrieval Simulator

Simulate top-k retrieval and see recall, precision, and MRR for k = 1, 3, 5, 10.

Tuning a RAG retriever comes down to one question: how many chunks should I return? Too few and the answer isn’t in the context; too many and you waste tokens and confuse the model. This simulator lets you play out the trade-off with your own scores before touching the pipeline.

How it works

Enter the query, then each candidate chunk with the similarity score your vector search assigned and a checkbox marking whether it is genuinely relevant (your ground truth). The tool sorts the chunks by score — exactly as a real top-k retriever would — and then, for k = 1, 3, 5, and 10, computes:

  • Hit@k — was at least one relevant chunk returned?
  • Recall@k — what fraction of all relevant chunks made it into the top k?
  • Precision@k — what fraction of the top k were relevant?

It also reports the reciprocal rank for this query: 1 divided by the position of the first relevant chunk, so a first relevant chunk at position 3 scores 1/3. Average that across your whole test set and you have MRR (mean reciprocal rank).
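The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the simulator's actual code: the function name `retrieval_metrics`, the `(score, is_relevant)` tuple layout, and the example scores are all assumptions made for the sketch. It also assumes precision is divided by k even when fewer than k chunks were entered, and that the reciprocal rank is 0 when no chunk is relevant.

```python
from typing import Dict, List, Tuple


def retrieval_metrics(chunks: List[Tuple[float, bool]], ks=(1, 3, 5, 10)) -> Dict:
    """Compute Hit@k, Recall@k, Precision@k and reciprocal rank for one query.

    chunks: (similarity_score, is_relevant) pairs (hypothetical layout).
    """
    # Sort by score, highest first, as a top-k retriever would.
    ranked = sorted(chunks, key=lambda c: c[0], reverse=True)
    labels = [rel for _, rel in ranked]
    total_relevant = sum(labels)

    per_k = {}
    for k in ks:
        hits = sum(labels[:k])  # relevant chunks inside the top k
        per_k[k] = {
            "hit": hits > 0,
            "recall": hits / total_relevant if total_relevant else 0.0,
            # Assumption: divide by k even if fewer than k chunks exist.
            "precision": hits / k,
        }

    # Reciprocal rank: 1 / position of the first relevant chunk (1-based).
    # Assumption: 0.0 when no relevant chunk is present.
    rr = 0.0
    for pos, rel in enumerate(labels, start=1):
        if rel:
            rr = 1.0 / pos
            break

    return {"per_k": per_k, "reciprocal_rank": rr}


# Made-up scores and relevance labels for a single query.
example = [(0.91, False), (0.88, True), (0.75, False), (0.70, True), (0.52, False)]
result = retrieval_metrics(example)
for k, m in result["per_k"].items():
    print(k, m)
print("reciprocal rank:", result["reciprocal_rank"])
```

In this made-up example the first relevant chunk sits at position 2, so the reciprocal rank is 0.5, and recall only reaches 100% once k covers position 4.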

Choosing k

Watch where recall@k stops climbing. If recall hits 100% at k = 3, returning 10 chunks adds noise and token cost for no benefit. If recall is still low at k = 10, the problem is unlikely to be k; look at your embeddings or chunking first, because widening the window rarely fixes it. Precision@k tells the other side of the story: it usually falls as k grows, and a low value means the model is wading through irrelevant text to find the answer.
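One way to make "where recall stops climbing" concrete is to find the smallest k at which recall reaches a target. The sketch below is illustrative only: `smallest_k_for_recall`, its `target` parameter, and the sample data are hypothetical names and values, not part of the simulator.

```python
from typing import List, Optional, Tuple


def smallest_k_for_recall(
    chunks: List[Tuple[float, bool]], target: float = 1.0
) -> Optional[int]:
    """Return the smallest k whose recall@k is at least `target`.

    Returns None when the query has no relevant chunks at all.
    """
    # Rank by similarity score, highest first.
    ranked = sorted(chunks, key=lambda c: c[0], reverse=True)
    total_relevant = sum(rel for _, rel in ranked)
    if total_relevant == 0:
        return None

    found = 0
    for k, (_, rel) in enumerate(ranked, start=1):
        found += rel  # True counts as 1
        if found / total_relevant >= target:
            return k
    return None


# Made-up scores and relevance labels for a single query.
example = [(0.91, False), (0.88, True), (0.75, False), (0.70, True), (0.52, False)]
print(smallest_k_for_recall(example))       # k needed for full recall
print(smallest_k_for_recall(example, 0.5))  # k needed for 50% recall
```

If the value returned for full recall is well below the k you currently use, the extra chunks are adding cost without adding relevant context for that query.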

Tips

  • Run several representative queries and average the metrics; a single query can be misleading.
  • A low MRR with high recall means relevant chunks are retrieved but ranked poorly — a re-ranker is the usual fix.
  • Pair this with the RAG Eval Dataset Builder to turn ad-hoc tuning into a repeatable regression test.
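
Averaging across queries, as the first tip suggests, might look like the sketch below. It assumes each query has already been reduced to a list of relevance labels in ranked order; the query names, labels, and the helper names `reciprocal_rank`, `recall_at_k`, and `summarize` are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def reciprocal_rank(labels: List[bool]) -> float:
    """1 / position of the first relevant chunk; 0.0 if none (assumed convention)."""
    for pos, rel in enumerate(labels, start=1):
        if rel:
            return 1.0 / pos
    return 0.0


def recall_at_k(labels: List[bool], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top k."""
    total = sum(labels)
    return sum(labels[:k]) / total if total else 0.0


def summarize(queries: Dict[str, List[bool]], k: int = 5) -> Dict[str, float]:
    """Average reciprocal rank (MRR) and recall@k over a set of queries."""
    return {
        "mrr": mean(reciprocal_rank(labels) for labels in queries.values()),
        "mean_recall": mean(recall_at_k(labels, k) for labels in queries.values()),
    }


# Made-up relevance labels, already sorted by similarity score.
queries = {
    "q1": [False, False, True, True, False],
    "q2": [False, False, False, True, False],
    "q3": [False, True, False, False, True],
}
summary = summarize(queries, k=5)
print(summary)
```

In this invented test set every relevant chunk lands in the top 5, so mean recall is 1.0, yet the MRR is only about 0.36 because the first relevant chunk is never ranked first. That is the pattern the second tip describes, where a re-ranker is the usual fix.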