RAG Eval Dataset Builder

Build question / context / answer triples for evaluating your RAG pipeline.

Ad placeholder (leaderboard)

You can’t improve a RAG pipeline you don’t measure. The first thing every serious RAG project needs is an evaluation set — a list of questions paired with the context that should be retrieved and the answer the system should produce. This builder gives you a structured form for those triples and exports them in the exact shape your eval harness expects.

How it works

Each row is one test case with three parts:

  1. Question — a real user query your system should handle.
  2. Ground-truth context — the chunk(s) that contain the answer, one per line. Use stable chunk IDs if you score retrieval by exact match, or the chunk text if you score by semantic overlap.
  3. Expected answer — the reference response a correct system would give.

Pick an export format and the tool assembles the JSON live. RAGAS output is columnar (parallel arrays) and drops straight into Dataset.from_dict. TruLens output is a list of records with query, expected_response, and expected_chunks. The custom format is a clean array of {question, contexts, answer} objects for your own scripts.

Why ground truth matters

Retrieval metrics like context precision and recall, and generation metrics like faithfulness and answer correctness, all need a reference to compare against. Without ground-truth context you can only measure whether the final answer looks right — you can’t tell whether a good answer came from good retrieval or from the model guessing. Capturing the expected chunks lets you separate retrieval failures from generation failures, which is usually where the real bug lives.

Tips

  • Seed the set with questions your system currently answers wrong — those are your highest-value regression tests.
  • Keep the set in version control next to your pipeline and re-run it on every prompt or index change.
  • Cover the full query distribution: easy lookups, multi-hop questions, and “no answer in the corpus” cases where the correct behaviour is to abstain.
Ad placeholder (rectangle)