Question 1

What exactly is an "eval"?

Accepted Answer

An eval is a repeatable test that scores an LLM's output for a known input against a defined expectation. It can be a deterministic check (does the JSON parse, does the answer contain the right entity), a comparison against a reference answer, or a graded judgement. Running a collection of evals over a dataset gives you a number you can track across prompt and model changes.
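As a rough illustration of the deterministic end of that range, the sketch below scores a small dataset with two checks: does the output parse as JSON, and does it mention the expected entity. The function names, the case fields (input, expected_entity) and the generate callable are all assumptions made for this example, not part of any particular library.

```python
import json
from typing import Callable


def json_parses(output: str) -> bool:
    """Deterministic check: is the output valid JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def contains_entity(output: str, entity: str) -> bool:
    """Deterministic check: case-insensitive substring match on the entity."""
    return entity.casefold() in output.casefold()


def pass_rate(cases: list[dict], generate: Callable[[str], str]) -> float:
    """Run every case through `generate` and return the fraction that pass.

    `generate` is a hypothetical stand-in for whatever calls your model.
    """
    if not cases:
        raise ValueError("need at least one eval case")
    passed = 0
    for case in cases:
        output = generate(case["input"])
        if json_parses(output) and contains_entity(output, case["expected_entity"]):
            passed += 1
    return passed / len(cases)
```

The single number returned by pass_rate is the kind of figure you would record for each prompt or model version and compare over time.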

Question 2

When should I use LLM-as-judge versus human scoring?

Accepted Answer

Use human scoring to establish ground truth and to validate your judge. Use LLM-as-judge to scale that judgement cheaply across hundreds of cases in CI. The standard pattern is to calibrate the judge against a human-labelled subset, confirm that the judge's scores agree with the human labels closely enough for your use case, then let the judge run at volume.
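One way to make the calibration step concrete is to measure agreement between the two sets of labels on the human-labelled subset. The sketch below computes the raw agreement rate and Cohen's kappa, which corrects for agreement expected by chance. Using kappa here is a choice made for this example, as are the function names; what counts as "high enough" agreement is something you would have to decide for your own task.

```python
from collections import Counter


def agreement_rate(human: list[str], judge: list[str]) -> float:
    """Fraction of cases where the judge's label matches the human label."""
    if not human or len(human) != len(judge):
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between two raters over the same cases."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    labels = set(human) | set(judge)
    # Probability that both raters pick the same label by chance.
    expected = sum(human_counts[l] * judge_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "pass", "fail", "pass"]
raw = agreement_rate(human_labels, judge_labels)   # 0.75
kappa = cohens_kappa(human_labels, judge_labels)   # 0.5
```

Raw agreement can look flattering when one label dominates, which is why a chance-corrected figure is often reported alongside it.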

Question 3

What is RAGAS for?

Accepted Answer

RAGAS is a framework for evaluating retrieval-augmented generation. It scores properties specific to RAG — faithfulness (is the answer grounded in retrieved context), answer relevance, and context precision and recall — so you can tell whether a failure came from retrieval or from generation.
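To show why separating retrieval metrics from generation metrics helps with diagnosis, here is a deliberately simplified sketch. It is not the RAGAS API and does not reproduce RAGAS's metric definitions. It assumes you have hand-labelled which document IDs are relevant for each question, computes set-based precision and recall over the retrieved IDs, and uses a hypothetical recall floor to guess which stage failed.

```python
def context_precision_recall(
    retrieved_ids: list[str], relevant_ids: list[str]
) -> tuple[float, float]:
    """Set-based precision and recall of retrieved documents.

    Assumes `relevant_ids` comes from manual labelling (an assumption of
    this sketch, not something RAGAS requires).
    """
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


def likely_failure_stage(
    recall: float, answer_correct: bool, recall_floor: float = 0.5
) -> str:
    """Rough triage: blame retrieval if the needed context was never fetched."""
    if answer_correct:
        return "none"
    return "retrieval" if recall < recall_floor else "generation"


precision, recall = context_precision_recall(
    ["d1", "d2", "d3", "d4"], ["d2", "d4", "d9"]
)
stage = likely_failure_stage(recall, answer_correct=False)
```

In this example two of the three relevant documents were retrieved, so a wrong answer is attributed to generation rather than retrieval under the assumed 0.5 floor.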

Question 4

How do I stop a prompt change from silently regressing quality?

Accepted Answer

Put your eval suite in CI and gate deploys on it. Define a passing threshold per metric, run the suite on every prompt or model change, and block the merge if any score drops below its threshold. This turns "the new prompt feels better" into "the new prompt scores 4% higher on 120 cases with no regressions."
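A minimal sketch of such a gate, assuming the suite has already produced a dictionary of metric scores between 0 and 1. The metric names and threshold values are placeholders invented for this example.

```python
import sys

# Hypothetical per-metric minimums; tune these for your own suite.
THRESHOLDS = {"json_valid": 0.98, "faithfulness": 0.85}


def failing_metrics(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return a human-readable line for every metric below its threshold.

    A metric that is missing from `scores` counts as a failure, so a
    silently skipped eval cannot pass the gate.
    """
    failures = []
    for metric, minimum in thresholds.items():
        score = scores.get(metric)
        if score is None:
            failures.append(f"{metric}: missing (required >= {minimum})")
        elif score < minimum:
            failures.append(f"{metric}: {score:.3f} < {minimum}")
    return failures


def gate(scores: dict[str, float]) -> int:
    """Return a process exit code: 0 to allow the merge, 1 to block it."""
    failures = failing_metrics(scores, THRESHOLDS)
    for line in failures:
        print(line, file=sys.stderr)
    return 1 if failures else 0
```

In a CI job you would end the script with sys.exit(gate(scores)); most CI systems treat a non-zero exit code as a failed step, which is what blocks the merge.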

Question 5

How big does my eval dataset need to be?

Accepted Answer

Start small — even 20 to 50 well-chosen cases that cover your real failure modes are far more useful than none. Grow the set by adding every production failure you find as a new test case, so the suite hardens against the exact mistakes your users hit.
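One lightweight way to grow the set is to keep cases in a JSON Lines file and append each production failure as it is found. The sketch below assumes that storage format and a hypothetical record shape (input, bad_output, expected, note); adapt both to whatever your harness reads.

```python
import json
from pathlib import Path


def load_cases(dataset_path: str) -> list[dict]:
    """Read all eval cases from a JSON Lines file (one JSON object per line)."""
    path = Path(dataset_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def add_failure_case(
    dataset_path: str, user_input: str, bad_output: str, expected: str, note: str = ""
) -> bool:
    """Append a production failure as a new case; skip exact duplicate inputs.

    Returns True if a case was added, False if the input was already covered.
    """
    if any(case["input"] == user_input for case in load_cases(dataset_path)):
        return False
    case = {
        "input": user_input,
        "bad_output": bad_output,  # what the model actually produced
        "expected": expected,      # what it should have produced
        "note": note,
    }
    with Path(dataset_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(case, ensure_ascii=False) + "\n")
    return True
```

Keeping the bad output alongside the expected one makes it easier to see later why the case was added.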

How to Evaluate LLM Outputs: A Developer's Guide to Evals

Why evals matter

The spectrum of eval methods

Building a CI eval harness

Common pitfalls