Turn “is this good?” into a repeatable score
Evaluating LLM output by gut feel does not scale. A rubric makes quality measurable: it names the dimensions you care about, gives each a scale and a weight, and defines what counts as a pass. This builder lets you assemble one in minutes and export it as JSON for an LLM-as-judge or markdown for human reviewers.
How a good rubric is built
A useful rubric has a few moving parts:
- Dimensions — the distinct qualities you score, such as accuracy, tone, format adherence, and completeness. Keep them independent so a single flaw does not tank every score.
- Scales — the range for each dimension (for example 1–5). Each level should have a concrete description so two reviewers grade the same output the same way.
- Weights — how much each dimension contributes. A factual-QA task might weight accuracy heavily; a creative task might weight tone.
- Pass threshold — the weighted overall score that counts as acceptable.
The overall score is a weighted average normalized to a percentage:
overall% = Σ(score_i × weight_i) / Σ(max_i × weight_i) × 100
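As a minimal sketch of that formula, the Python below scores one output against a small rubric. The `Dimension` class, the example weights, and the 70% pass threshold are illustrative assumptions, not values taken from the builder or its export format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str
    max_score: int   # top of the scale, e.g. 5 on a 1-5 scale
    weight: float    # relative contribution to the overall score


def overall_percent(dimensions, scores):
    """Return sum(score_i * weight_i) / sum(max_i * weight_i) * 100."""
    earned = sum(scores[d.name] * d.weight for d in dimensions)
    possible = sum(d.max_score * d.weight for d in dimensions)
    return earned / possible * 100


# Hypothetical rubric for a factual-QA task: accuracy carries the most weight.
rubric = [
    Dimension("accuracy", max_score=5, weight=3.0),
    Dimension("completeness", max_score=5, weight=2.0),
    Dimension("tone", max_score=5, weight=1.0),
]
PASS_THRESHOLD = 70.0  # assumed pass mark, in percent

scores = {"accuracy": 4, "completeness": 3, "tone": 5}
overall = overall_percent(rubric, scores)
passed = overall >= PASS_THRESHOLD
print(f"{overall:.1f}% -> {'pass' if passed else 'fail'}")  # 76.7% -> pass
```

Here the weighted sum is 4×3 + 3×2 + 5×1 = 23 out of a possible 30, or about 76.7%. Because the formula divides by the scale maximum, a 1–5 scale can never produce less than 20%, which is worth keeping in mind when choosing a pass threshold.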
Tips for reliable evals
- Define the top of the scale precisely. “5 = factually correct with no unsupported claims” beats “5 = good”.
- Keep dimensions orthogonal so you can see why an output failed, not just that it did.
- Use the JSON export for LLM-as-judge and have the judge return one score per dimension plus a one-line justification. Then spot-check a sample of the judge's scores against human grades.
- Version your rubric. Scores produced under different rubric versions are not comparable, so record the version with every score and re-grade old outputs before comparing them with new ones.
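The last two tips can be sketched in code. The example below assumes a hypothetical judge reply shape (`rubric_version`, plus a `scores` object holding a `score` and `justification` per dimension); the builder's actual JSON export may look different. It rejects replies graded against another rubric version, checks that every dimension is scored within its scale, and computes a simple agreement rate for spot-checking against a human grader.

```python
import json

# Hypothetical rubric: dimension name -> (min, max) of its scale.
RUBRIC_VERSION = "1.0.0"
SCALES = {"accuracy": (1, 5), "tone": (1, 5)}


def parse_judge_reply(raw: str) -> dict:
    """Validate a judge reply and return {dimension: score}.

    Raises ValueError if the reply does not match the rubric.
    """
    reply = json.loads(raw)
    if reply.get("rubric_version") != RUBRIC_VERSION:
        raise ValueError("reply was graded against a different rubric version")
    results = reply.get("scores", {})
    if set(results) != set(SCALES):
        raise ValueError("reply must score every dimension exactly once")
    scores = {}
    for name, (low, high) in SCALES.items():
        entry = results[name]
        score = entry["score"]
        if not isinstance(score, int) or not low <= score <= high:
            raise ValueError(f"{name}: score {score!r} is outside {low}-{high}")
        if not entry.get("justification", "").strip():
            raise ValueError(f"{name}: missing justification")
        scores[name] = score
    return scores


def agreement_rate(judge: dict, human: dict, tolerance: int = 1) -> float:
    """Share of dimensions where judge and human differ by at most `tolerance`."""
    close = sum(abs(judge[d] - human[d]) <= tolerance for d in judge)
    return close / len(judge)


raw = json.dumps({
    "rubric_version": "1.0.0",
    "scores": {
        "accuracy": {"score": 4, "justification": "One unsupported claim."},
        "tone": {"score": 5, "justification": "Matches the brief."},
    },
})
judge_scores = parse_judge_reply(raw)
human_scores = {"accuracy": 2, "tone": 5}  # made-up human grades
print(agreement_rate(judge_scores, human_scores))  # 0.5
```

The one-point tolerance is an arbitrary choice for illustration. A low agreement rate on a dimension usually points to a level description that needs tightening rather than to a bad judge.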