Tuning a RAG retriever often comes down to one question: how many chunks should I return? Too few and the answer isn’t in the context; too many and you waste tokens and risk confusing the model. This simulator lets you play out the trade-off with your own scores before touching the pipeline.
How it works
Enter the query, then each candidate chunk with the similarity score your vector search assigned and a checkbox marking whether it is genuinely relevant (your ground truth). The tool sorts the chunks by score — exactly as a real top-k retriever would — and then, for k = 1, 3, 5, and 10, computes:
- Hit@k: is at least one relevant chunk in the top k?
- Recall@k — what fraction of all relevant chunks made it into the top k?
- Precision@k — what fraction of the top k were relevant?
It also reports the reciprocal rank for this query: 1 divided by the position of the first relevant chunk in the ranked list, so a first relevant chunk at position 2 gives 0.5. Average that across your whole test set and you have the mean reciprocal rank (MRR).
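The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the simulator's actual code: the `Chunk` class, the function names, and the sample scores are all made up for the example. Two behaviours are assumptions: precision is divided by the number of chunks actually returned (which can be fewer than k), and the reciprocal rank is taken to be 0 when no chunk is relevant.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float     # similarity score assigned by the vector search
    relevant: bool   # ground-truth label (the checkbox)


def metrics_at_k(chunks, k):
    """Return (hit, recall, precision) for the k highest-scoring chunks."""
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)
    top = ranked[:k]
    total_relevant = sum(c.relevant for c in chunks)
    found = sum(c.relevant for c in top)
    hit = found > 0
    recall = found / total_relevant if total_relevant else 0.0
    # Assumption: divide by the chunks actually returned, which may be < k.
    precision = found / len(top) if top else 0.0
    return hit, recall, precision


def reciprocal_rank(chunks):
    """1 / position of the first relevant chunk in score order."""
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)
    for position, chunk in enumerate(ranked, start=1):
        if chunk.relevant:
            return 1.0 / position
    return 0.0  # Assumption: no relevant chunk at all counts as 0.


# Hypothetical candidates for a single query.
chunks = [
    Chunk("pricing page", 0.91, False),
    Chunk("refund policy", 0.88, True),
    Chunk("shipping faq", 0.80, False),
    Chunk("refund form", 0.75, True),
    Chunk("about us", 0.60, False),
]

for k in (1, 3, 5, 10):
    hit, recall, precision = metrics_at_k(chunks, k)
    print(f"k={k}: hit={hit} recall={recall:.2f} precision={precision:.2f}")
print(f"reciprocal rank: {reciprocal_rank(chunks):.2f}")
```

With these sample values the first relevant chunk sits at position 2, so the reciprocal rank is 0.5, and recall only reaches 100% once k covers position 4.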
Choosing k
Watch where Recall@k stops climbing. If recall reaches 100% at k = 3, returning 10 chunks only adds noise and token cost. If recall is still low at k = 10, k is probably not the problem: look at your embeddings or chunking first, because widening the window further is unlikely to fix it. Precision@k shows the other side of the trade-off: low precision means the model has to wade through irrelevant text to find the answer.
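One way to make "where recall stops climbing" concrete is to look for the smallest k that reaches a recall target. The sketch below assumes you already have the relevance labels in ranked order (highest score first). The function name `smallest_k_for_recall` and its `target` and `max_k` parameters are hypothetical and not part of the simulator.

```python
def smallest_k_for_recall(relevant_flags, target=1.0, max_k=10):
    """Return the smallest k <= max_k whose recall@k reaches `target`.

    relevant_flags: booleans in ranked order (highest score first).
    Returns None if the target is not reached within max_k, or if the
    list contains no relevant chunk at all.
    """
    total = sum(relevant_flags)
    if total == 0:
        return None
    found = 0
    for k, flag in enumerate(relevant_flags[:max_k], start=1):
        found += flag
        if found / total >= target:
            return k
    return None


# Both relevant chunks are inside the top 3, so a larger k adds only noise.
print(smallest_k_for_recall([False, True, True, False, False]))

# The only relevant chunk is ranked 11th: widening k up to 10 never helps.
print(smallest_k_for_recall([False] * 10 + [True]))
```

A result of `None` corresponds to the second case in the text: the relevant chunk exists but sits beyond the window, which points at the ranking rather than at k.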
Tips
- Run several representative queries and average the metrics; a single query can be misleading.
- A low MRR with high recall means relevant chunks are retrieved but ranked poorly — a re-ranker is the usual fix.
- Pair this with the RAG Eval Dataset Builder to turn ad-hoc tuning into a repeatable regression test.
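Averaging across queries, as the first tip suggests, might look like the sketch below. The test set is invented for illustration: each hypothetical query maps to its relevance labels in ranked order. Treating a query with no relevant chunk as a reciprocal rank of 0 is an assumption.

```python
from statistics import mean


def reciprocal_rank(relevant_flags):
    """1 / position of the first relevant chunk; 0.0 if none (assumption)."""
    for position, flag in enumerate(relevant_flags, start=1):
        if flag:
            return 1.0 / position
    return 0.0


def recall_at_k(relevant_flags, k):
    """Fraction of all relevant chunks that appear in the top k."""
    total = sum(relevant_flags)
    return sum(relevant_flags[:k]) / total if total else 0.0


# Hypothetical test set: query -> relevance labels, highest score first.
test_set = {
    "how do refunds work": [True, False, False],
    "reset my password": [False, False, True, True],
    "supported file types": [False, True, False],
}

mrr = mean([reciprocal_rank(flags) for flags in test_set.values()])
mean_recall_at_3 = mean([recall_at_k(flags, 3) for flags in test_set.values()])

print(f"MRR: {mrr:.3f}")
print(f"mean recall@3: {mean_recall_at_3:.3f}")
```

Here the per-query reciprocal ranks are 1, 1/3, and 1/2, which shows how a single well-ranked query can hide weaker ones until the values are averaged.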