LLM Eval Rubric Builder

Design structured evaluation rubrics for LLM output quality.

Turn “is this good?” into a repeatable score

Evaluating LLM output by gut feel does not scale. A rubric makes quality measurable: it names the dimensions you care about, gives each a scale and a weight, and defines what counts as a pass. This builder lets you assemble one in minutes and export it as JSON for an LLM-as-judge or markdown for human reviewers.

How a good rubric is built

A useful rubric has a few moving parts:

  • Dimensions — the distinct qualities you score, such as accuracy, tone, format adherence, and completeness. Keep them independent so a single flaw does not tank every score.
  • Scales — the range for each dimension (for example 1–5). Each level should have a concrete description so two reviewers grade the same output the same way.
  • Weights — how much each dimension contributes. A factual-QA task might weight accuracy heavily; a creative task might weight tone.
  • Pass threshold — the weighted overall score that counts as acceptable.
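
To make the four parts concrete, here is a minimal Python sketch of a rubric as a data structure. The class names, field names, and level descriptions are illustrative assumptions, not the builder's actual export format.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Dimension:
    name: str
    weight: float            # relative contribution to the overall score
    levels: Dict[int, str]   # scale level -> concrete description

    def __post_init__(self):
        if self.weight <= 0:
            raise ValueError(f"{self.name}: weight must be positive")
        if not self.levels:
            raise ValueError(f"{self.name}: scale needs at least one level")

    @property
    def max_score(self) -> int:
        # The top of the scale is the highest described level.
        return max(self.levels)


@dataclass(frozen=True)
class Rubric:
    version: str
    pass_threshold: float    # minimum overall percentage that passes
    dimensions: Tuple[Dimension, ...]

    def __post_init__(self):
        names = [d.name for d in self.dimensions]
        if len(names) != len(set(names)):
            raise ValueError("dimension names must be unique")
        if not 0 <= self.pass_threshold <= 100:
            raise ValueError("pass_threshold must be a percentage")


# Hypothetical rubric for a factual-QA task: accuracy carries most of the weight.
rubric = Rubric(
    version="1.0.0",
    pass_threshold=75,
    dimensions=(
        Dimension("accuracy", 3, {
            1: "Central claim is wrong",
            2: "Several factual errors",
            3: "One factual error or unsupported claim",
            4: "Correct, with a minor imprecision",
            5: "Factually correct with no unsupported claims",
        }),
        Dimension("format", 1, {
            1: "Ignores the requested format",
            2: "Partly follows the requested format",
            3: "Follows the requested format exactly",
        }),
    ),
)
```

Dimensions may use different scales, which is why each one carries its own level descriptions rather than sharing a global range.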

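The overall score is a weighted average normalized to a percentage, where score_i is the score given on dimension i, max_i is the top of that dimension's scale, and weight_i is its weight: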

overall% = Σ(score_i × weight_i) / Σ(max_i × weight_i) × 100
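
As a worked example, the formula can be sketched in a few lines of Python. The function name, the dictionary layout, and the sample weights, scores, and threshold below are assumptions chosen for illustration.

```python
def overall_percent(scores, weights, max_scores):
    """Weighted overall score as a percentage.

    All three arguments map dimension name -> number.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"no score for: {sorted(missing)}")
    numerator = sum(scores[d] * weights[d] for d in weights)
    denominator = sum(max_scores[d] * weights[d] for d in weights)
    if denominator <= 0:
        raise ValueError("weights and maximum scores must be positive")
    return 100 * numerator / denominator


# Hypothetical factual-QA rubric: accuracy counts three times as much as the rest.
weights = {"accuracy": 3, "tone": 1, "format": 1}
max_scores = {"accuracy": 5, "tone": 5, "format": 5}
scores = {"accuracy": 4, "tone": 5, "format": 3}

percent = overall_percent(scores, weights, max_scores)
# (4*3 + 5*1 + 3*1) / (5*3 + 5*1 + 5*1) * 100 = 20 / 25 * 100
print(percent)            # 80.0

PASS_THRESHOLD = 75       # assumed threshold, in percent
print(percent >= PASS_THRESHOLD)   # True
```

Because the formula divides by the maximum rather than rescaling from the minimum, an output that scores 1 on every 1–5 dimension still gets 20%, not 0%. Keep that floor in mind when choosing a pass threshold.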

Tips for reliable evals

  • Define the top of the scale precisely. “5 = factually correct with no unsupported claims” beats “5 = good”.
  • Keep dimensions orthogonal so you can see why an output failed, not just that it did.
  • Use the JSON export for LLM-as-judge and have the judge return one score per dimension plus a one-line justification, then spot-check a sample of its scores against human grades.
  • Version your rubric. Scores produced under different rubric versions are not comparable, so record the version alongside every score.
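
The judge-and-spot-check tip can be sketched as follows. This is a minimal example that assumes the judge replies with JSON shaped like {"scores": {dimension: {"score": ..., "justification": ...}}}. That shape, and the function names, are assumptions for illustration and not the builder's export format.

```python
import json


def parse_judge_reply(raw, scales):
    """Return {dimension: score} from a judge's JSON reply, or raise ValueError.

    `scales` maps each dimension name to its (lowest, highest) allowed score.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    scores = data.get("scores") if isinstance(data, dict) else None
    if not isinstance(scores, dict):
        raise ValueError("reply has no 'scores' object")

    problems = []
    for name, (lowest, highest) in scales.items():
        entry = scores.get(name)
        if not isinstance(entry, dict):
            problems.append(f"{name}: missing")
            continue
        score = entry.get("score")
        # bool is a subclass of int in Python, so reject it explicitly.
        if (isinstance(score, bool) or not isinstance(score, int)
                or not lowest <= score <= highest):
            problems.append(f"{name}: score {score!r} not an integer in {lowest}-{highest}")
        text = entry.get("justification")
        if not isinstance(text, str) or not text.strip() or "\n" in text.strip():
            problems.append(f"{name}: justification must be one non-empty line")
    for name in sorted(set(scores) - set(scales)):
        problems.append(f"{name}: not a rubric dimension")
    if problems:
        raise ValueError("; ".join(problems))
    return {name: scores[name]["score"] for name in scales}


def agreement_rate(judge_scores, human_scores, tolerance=0):
    """Fraction of (output, dimension) pairs where judge and human agree."""
    pairs = [
        (judge[name], human[name])
        for judge, human in zip(judge_scores, human_scores)
        for name in judge
    ]
    if not pairs:
        raise ValueError("nothing to compare")
    return sum(abs(j - h) <= tolerance for j, h in pairs) / len(pairs)


scales = {"accuracy": (1, 5), "tone": (1, 5)}
reply = json.dumps({"scores": {
    "accuracy": {"score": 4, "justification": "One claim lacks a source."},
    "tone": {"score": 5, "justification": "Matches the requested voice."},
}})
print(parse_judge_reply(reply, scales))   # {'accuracy': 4, 'tone': 5}

judge = [{"accuracy": 4, "tone": 5}, {"accuracy": 2, "tone": 3}]
human = [{"accuracy": 4, "tone": 4}, {"accuracy": 2, "tone": 3}]
print(agreement_rate(judge, human))       # 0.75
```

Rejecting malformed replies up front keeps a bad judge response from silently turning into a score, and tracking agreement per rubric version shows whether a rubric change made the judge more or less consistent with human reviewers.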