What is an LLM evaluation rubric?

A rubric is a structured scorecard that defines the dimensions you judge model output on, the scale for each, and the bar that counts as a pass. It turns vague impressions of quality into consistent, comparable scores.

What is 'LLM-as-a-judge'?

It is the practice of using one language model to score another model's output against your rubric. Exporting the rubric as JSON lets you drop it straight into a judge prompt so the grader returns a structured score per dimension.

How is the overall score calculated?

Each dimension has a weight, and the overall score is the weighted average of the per-dimension scores, normalized to a percentage. Give the dimensions that matter most a higher weight so they dominate the final result.

Is my rubric stored anywhere?

No. The builder runs entirely in your browser. Your task description and dimensions are never uploaded, and the exports are generated locally.

What is the LLM Eval Rubric Builder?

The LLM Eval Rubric Builder is a free tool for building LLM evaluation rubrics. Define evaluation dimensions (accuracy, tone, format, and more), set scoring scales and weights, write pass/fail criteria, and export a clean JSON or Markdown rubric for human reviewers or an LLM-as-judge. It runs in your browser on Gera Tools, with nothing uploaded.

LLM Eval Rubric Builder

Name: LLM Eval Rubric Builder
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Turn “is this good?” into a repeatable score

Evaluating LLM output by gut feel does not scale. A rubric makes quality measurable: it names the dimensions you care about, gives each a scale and a weight, and defines what counts as a pass. This builder lets you assemble one in minutes and export it as JSON for an LLM-as-judge or as Markdown for human reviewers.

When to use a rubric

A rubric is the right tool when you need to:

Compare outputs from different models, prompts, or prompt versions on the same task.
Run regression tests to check that a model change doesn’t degrade the qualities you care about.
Hand off review to a team where different reviewers need to agree on what “good” means.
Set up automated LLM-as-judge evaluation at scale without re-explaining criteria in each prompt.

Ad-hoc vibe-checking works for one-off exploration. Once you are iterating on a prompt or comparing models, a rubric is essential.

How a good rubric is built

A useful rubric has a few moving parts:

Dimensions — the distinct qualities you score, such as accuracy, tone, format adherence, and completeness. Keep them independent so a single flaw does not tank every dimension.
Scales — the range for each dimension (for example 1–5). Each level should have a concrete description so two reviewers grade the same output the same way.
Weights — how much each dimension contributes to the final score. A factual-QA task might weight accuracy at 50%; a copywriting task might weight tone equally with clarity.
Pass threshold — the weighted overall percentage that counts as acceptable output.

The overall score is a weighted average normalized to a percentage:

overall% = Σ(score_i × weight_i) / Σ(max_i × weight_i) × 100
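The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the builder's own code: the function name `overall_percent` and the two-dimension example are assumptions made for this sketch.

```python
from typing import Sequence


def overall_percent(
    scores: Sequence[float],
    max_scores: Sequence[float],
    weights: Sequence[float],
) -> float:
    """Weighted average of per-dimension scores, normalized to a percentage."""
    if not (len(scores) == len(max_scores) == len(weights)):
        raise ValueError("scores, max_scores, and weights must have the same length")
    # Numerator and denominator of the formula: sum(score_i * weight_i)
    # over sum(max_i * weight_i).
    earned = sum(s * w for s, w in zip(scores, weights))
    possible = sum(m * w for m, w in zip(max_scores, weights))
    if possible <= 0:
        raise ValueError("at least one dimension needs a positive weight and max score")
    return 100 * earned / possible


# Two hypothetical dimensions on a 1-5 scale, weighted 3:1.
result = overall_percent(scores=[5, 3], max_scores=[5, 5], weights=[3, 1])
print(f"{result:.1f}%")  # (5*3 + 3*1) / (5*3 + 5*1) * 100 = 90.0%
```

Because the weights appear in both the numerator and the denominator, they do not need to sum to 100; only their ratios matter. Under this formula, the lowest possible overall score on a 1–5 scale is 20%, not 0%, which is worth keeping in mind when choosing a pass threshold.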

Example: a rubric for a support-email task

Suppose you are evaluating a model that drafts customer support replies. A minimal rubric might look like this:

Dimension	Scale	Weight	Top score description
Issue resolution	1–5	40%	Answer directly addresses the customer’s stated problem
Tone	1–5	25%	Professional, empathetic, and brand-appropriate
Accuracy	1–5	25%	All factual claims are correct; no invented product details
Format	1–5	10%	Correct greeting, sign-off, and paragraph structure

Pass threshold: 70%. An output scoring 4/5 on resolution and accuracy, 2/5 on tone (perhaps too blunt), and 5/5 on format scores (160 + 50 + 100 + 50) / 500 = 72%, passing overall. That is useful feedback: the facts are right, but the phrasing needs work.
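One way to check the arithmetic for this example is a short script. The weights and scale come from the table; the Format score of 5/5 is an assumption, and the `lost` breakdown is an illustrative addition rather than a feature of the builder.

```python
# Rubric rows from the table above: (dimension, weight, max score).
RUBRIC = [
    ("Issue resolution", 40, 5),
    ("Tone", 25, 5),
    ("Accuracy", 25, 5),
    ("Format", 10, 5),
]
PASS_THRESHOLD = 70.0

# Scores from the example; the Format score of 5 is an assumption.
scores = {"Issue resolution": 4, "Tone": 2, "Accuracy": 4, "Format": 5}

earned = sum(scores[name] * weight for name, weight, _ in RUBRIC)
possible = sum(max_score * weight for _, weight, max_score in RUBRIC)
overall = 100 * earned / possible

# Percentage points each dimension gives up relative to a perfect score.
lost = {
    name: 100 * (max_score - scores[name]) * weight / possible
    for name, weight, max_score in RUBRIC
}
weakest = max(lost, key=lost.get)

print(f"overall: {overall:.0f}%  pass: {overall >= PASS_THRESHOLD}")
print(f"largest shortfall: {weakest} ({lost[weakest]:.0f} points)")
```

Breaking the lost points down per dimension shows where the score went: here Tone gives up 15 percentage points, more than the other three dimensions combined.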

Tips for reliable evals

Define the top of the scale precisely. “5 = factually correct with no unsupported claims” beats “5 = good.”
Keep dimensions orthogonal so you can see why an output failed, not just that it did.
Use the JSON export for LLM-as-judge and instruct the judge to return one score per dimension plus a one-line justification. Then spot-check the judge against human reviewers periodically.
Version your rubric. Re-grading old outputs with a changed rubric makes trend comparisons meaningless; bump a version number whenever you edit any criterion.
Start small. Three to five well-defined dimensions beat ten vague ones.
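The LLM-as-judge and versioning tips above might be wired together roughly as follows. This is a sketch under stated assumptions: the rubric shape (`version`, `pass_threshold`, `dimensions`), the reply format, and the helper names `build_judge_prompt` and `parse_judge_reply` are all hypothetical, and the builder's actual JSON export may be structured differently. No model is called; the judge's reply is a hand-written stand-in.

```python
import json

# Hypothetical rubric shape; the builder's actual JSON export may differ.
rubric = {
    "version": "1.0",  # bump whenever any criterion changes
    "pass_threshold": 70,
    "dimensions": [
        {"name": "accuracy", "min": 1, "max": 5, "weight": 60,
         "top_score": "Factually correct with no unsupported claims"},
        {"name": "tone", "min": 1, "max": 5, "weight": 40,
         "top_score": "Professional and empathetic"},
    ],
}


def build_judge_prompt(rubric: dict, task: str, output: str) -> str:
    """Embed the rubric JSON in a prompt that asks for one score per dimension."""
    return (
        "Score the OUTPUT against the RUBRIC.\n"
        "Reply with JSON only, shaped like "
        '{"scores": {"<dimension>": {"score": <int>, "justification": "<one line>"}}}.\n\n'
        f"RUBRIC:\n{json.dumps(rubric, indent=2)}\n\n"
        f"TASK:\n{task}\n\nOUTPUT:\n{output}\n"
    )


def parse_judge_reply(rubric: dict, reply: str) -> dict:
    """Validate the judge's JSON reply and return {dimension: score}."""
    scores = json.loads(reply)["scores"]
    result = {}
    for dim in rubric["dimensions"]:
        entry = scores[dim["name"]]  # KeyError if the judge skipped a dimension
        score = entry["score"]
        if not dim["min"] <= score <= dim["max"]:
            raise ValueError(f"{dim['name']}: score {score} is off the scale")
        result[dim["name"]] = score
    return result


prompt = build_judge_prompt(rubric, "Answer a billing question.", "Hi, ...")
# A hand-written stand-in for a judge model's reply; no model is called here.
reply = (
    '{"scores": {"accuracy": {"score": 4, "justification": "One unsupported claim."},'
    ' "tone": {"score": 5, "justification": "Warm and clear."}}}'
)
print(parse_judge_reply(rubric, reply))  # {'accuracy': 4, 'tone': 5}
```

Validating the reply before computing an overall score catches the two most common judge failures, a skipped dimension and an off-scale score. Storing the rubric version alongside each graded output keeps later trend comparisons honest.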