Why build a benchmark at all?

Without a fixed test set you cannot tell whether a prompt change or model upgrade improved or regressed quality. A benchmark turns "it feels better" into a measurable score you can track over time.

How many test cases do I need?

Twenty is a reasonable floor for a single task; the tool nudges you toward it. Include normal cases, edge cases, and adversarial inputs so the benchmark exercises the prompt's weak spots, not just the happy path.

What scoring types are supported?

Exact match for deterministic outputs like labels, semantic or fuzzy match for free-text answers, and rubric scoring where you define point-based criteria for a grader (human or LLM) to apply.

What format is the export?

A clean JSON object with the task, scoring type, optional rubric, a version number, and a numbered array of input and expected-output cases. It loads directly into most eval frameworks.

Does the tool run the evaluation?

No. It builds the benchmark file. You replay it against your model in your own harness, which keeps the test data and the model under your control.

What is the Prompt Benchmark Builder?

The Prompt Benchmark Builder guides you through defining test cases with input, expected output, and scoring criteria, then exports a clean JSON benchmark file you can replay against any model or prompt version to catch regressions. It runs free in your browser on Gera Tools, with nothing uploaded.

Prompt Benchmark Builder

Name: Prompt Benchmark Builder
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Prompt benchmark builder

You cannot improve what you do not measure. Tweaking a prompt by feel works until a model upgrade silently breaks half your cases and you have no way to notice. A benchmark fixes a set of inputs with known-good outputs and a scoring rule, so every prompt or model change produces a comparable score. This builder helps you assemble that benchmark and export it as portable JSON.

How it works

You describe the task and pick a scoring type: exact match for labels and structured output, semantic match for free text, or a rubric where you write point-based criteria. Then you add input and expected-output pairs through the table, aiming for at least twenty that cover normal, edge, and adversarial cases. The tool renumbers them, wraps everything in a versioned JSON object with the task and scoring rule, and lets you copy it. Drop that file into your own eval harness to replay the benchmark against any model or prompt revision.
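The assembly step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the tool's own code: the function name `build_benchmark` is hypothetical, and the `rubric` key is an assumed name for the optional rubric field. The other field names follow the exported sample shown later on this page.

```python
import json

def build_benchmark(task, scoring_type, pairs, version=1, rubric=None):
    """Wrap (input, expected) pairs in a versioned benchmark object."""
    benchmark = {"task": task, "scoringType": scoring_type, "version": version}
    if rubric is not None:
        benchmark["rubric"] = rubric  # key name is an assumption
    # Renumber from 1 so ids stay contiguous after rows are added or removed.
    benchmark["cases"] = [
        {"id": i, "input": inp, "expected": exp}
        for i, (inp, exp) in enumerate(pairs, start=1)
    ]
    return benchmark

pairs = [
    ("Great product, love it!", "positive"),
    ("Arrived broken. Useless.", "negative"),
]
benchmark = build_benchmark(
    "Classify the sentiment of a product review", "exact_match", pairs
)
exported = json.dumps(benchmark, indent=2)
print(exported)
```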

Choosing the right scoring type

Exact match — the model’s output must equal the expected output exactly (case-insensitive, with optional whitespace normalisation). Use this for:

Classification tasks: sentiment labels (positive / negative / neutral), category tags.
Structured data extraction: dates, numbers, named entities where the format is fixed.
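Code generation where the expected output is a single fixed string, such as a short command. Code that is judged by compiling and passing tests needs a test runner in your harness rather than string comparison.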

Exact match is the most objective scoring method and the easiest to automate, but it is only appropriate when there is genuinely one correct answer.
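In a harness, an exact-match scorer along these lines might look like the sketch below. The function names and the `collapse_whitespace` flag are illustrative assumptions; the builder itself does not ship a scorer.

```python
def normalize(text, collapse_whitespace=True):
    """Lower-case the text and, optionally, collapse runs of whitespace."""
    text = text.casefold()
    if collapse_whitespace:
        # split() with no arguments also trims leading and trailing whitespace
        text = " ".join(text.split())
    return text

def exact_match(actual, expected, collapse_whitespace=True):
    """Return True when both strings are equal after normalisation."""
    return normalize(actual, collapse_whitespace) == normalize(
        expected, collapse_whitespace
    )

print(exact_match("  Positive\n", "positive"))
print(exact_match("Positive ", "positive", collapse_whitespace=False))
```

With whitespace normalisation switched off, a trailing space or newline in the model's output counts as a mismatch, so decide up front which behaviour your task needs.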

Semantic match — measures whether the model’s output conveys the same meaning as the expected output, rather than using identical words. Useful for:

Summarization tasks where multiple valid phrasings exist.
Free-text question answering where the exact wording does not matter.
Translation evaluation where several correct renderings exist.

Semantic match typically uses an embedding similarity score (cosine similarity between vector representations of the expected and actual outputs) or a secondary LLM as a judge. Your eval harness needs to implement this; the benchmark file records the expected output and the semantic match setting so your harness knows which method to apply.
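As a rough sketch of the embedding approach, the code below computes cosine similarity and compares it to a threshold. The `embed` argument stands for whatever embedding function your harness provides, the 0.8 threshold is an arbitrary assumption you would tune on your own data, and `toy_embed` is a word-count stand-in used only so the example runs without a model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)

def semantic_match(actual, expected, embed, threshold=0.8):
    """True when the two texts' embeddings are at least `threshold` similar."""
    return cosine_similarity(embed(actual), embed(expected)) >= threshold

# Toy stand-in for a real embedding model: counts of a few known words.
VOCAB = ["refund", "late", "broken", "great"]

def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

print(semantic_match("refund was late", "late refund", toy_embed))
print(semantic_match("great", "broken", toy_embed))
```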

Rubric scoring — a human or LLM judge assigns points based on defined criteria. Use this for:

Long-form generation tasks: blog posts, email drafts, product descriptions.
Multi-attribute evaluation: accuracy, tone, format, and completeness weighted separately.
Any task where “good output” is genuinely subjective and correctness cannot be reduced to a single answer.

A rubric defines the criteria and point values, typically as a short list: “1 point for correct tone, 2 points for covering all required facts, 1 point for staying within the word limit.” The benchmark exports the rubric alongside the test cases so the grader (human or LLM-as-judge) applies the same criteria consistently across every evaluation run.
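One possible way to represent that example rubric and total a grader's verdict is sketched below. The `Criterion` class and `rubric_score` function are hypothetical names, not part of the export format, and the grader (human or LLM) is assumed to hand back the names of the criteria it judged as met.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Criterion:
    name: str
    points: int

RUBRIC = [
    Criterion("correct tone", 1),
    Criterion("covers all required facts", 2),
    Criterion("within the word limit", 1),
]

def rubric_score(rubric, met):
    """Return (earned, possible) points given the criteria names judged as met."""
    known = {c.name for c in rubric}
    unknown = set(met) - known
    if unknown:
        # Catch typos in grader output instead of silently scoring zero.
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    earned = sum(c.points for c in rubric if c.name in met)
    possible = sum(c.points for c in rubric)
    return earned, possible

print(rubric_score(RUBRIC, {"correct tone", "within the word limit"}))  # (2, 4)
```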

Building a useful test set

A benchmark’s quality depends almost entirely on the quality of its test cases. Twenty cases that cover the full difficulty distribution are more useful than 200 easy cases.

Normal cases — inputs that represent the typical production usage of your prompt. These should form about half the test set. They tell you whether the prompt works at all.

Edge cases — inputs that are valid but unusual: very short inputs, very long inputs, ambiguous phrasing, multilingual text if your prompt is supposed to handle it, unusual character sets, missing optional fields in structured input. These tell you where the prompt’s assumptions break down.

Adversarial cases — inputs designed to expose failure modes: inputs that look like the normal case but have a subtle difference that should change the output (a negative example in a classification task), instructions embedded in the user content that could confuse an insufficiently constrained prompt (prompt injection), or edge values that might cause parsing failures in a structured output task.

A benchmark skewed heavily toward normal cases will report high scores that do not predict production failure rates. The hard cases are where your evaluation budget is best spent.

The exported JSON format

The benchmark exports a structured JSON object:

{
  "task": "Classify the sentiment of a product review",
  "scoringType": "exact_match",
  "version": 1,
  "cases": [
    { "id": 1, "input": "Great product, love it!", "expected": "positive" },
    { "id": 2, "input": "Arrived broken. Useless.", "expected": "negative" },
    { "id": 3, "input": "It works I guess.", "expected": "neutral" }
  ]
}

This format loads directly into most eval frameworks and is easy to process in a simple script that runs each input through the model, compares the result to expected, and aggregates a score. Bump the version number whenever you add or change cases, so every score can be tied to the exact set of cases it was measured on and you only compare scores from the same version.
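A simple replay script of that kind could look like the following sketch. Everything outside the JSON is an assumption for illustration: `stub_model` stands in for your real model call, and `labels_match` and `run_benchmark` are hypothetical helper names.

```python
import json

BENCHMARK_JSON = """
{
  "task": "Classify the sentiment of a product review",
  "scoringType": "exact_match",
  "version": 1,
  "cases": [
    { "id": 1, "input": "Great product, love it!", "expected": "positive" },
    { "id": 2, "input": "Arrived broken. Useless.", "expected": "negative" },
    { "id": 3, "input": "It works I guess.", "expected": "neutral" }
  ]
}
"""

def stub_model(text):
    # Hypothetical stand-in for a real model call; it never answers "neutral".
    return "positive" if "love" in text.lower() else "negative"

def labels_match(output, expected):
    # Case-insensitive comparison, ignoring surrounding whitespace.
    return output.strip().lower() == expected.strip().lower()

def run_benchmark(benchmark, model, scorer):
    """Run every case through the model and aggregate a pass rate."""
    failed_ids = []
    for case in benchmark["cases"]:
        if not scorer(model(case["input"]), case["expected"]):
            failed_ids.append(case["id"])
    total = len(benchmark["cases"])
    return {
        "version": benchmark["version"],
        "score": (total - len(failed_ids)) / total if total else 0.0,
        "failed_ids": failed_ids,
    }

report = run_benchmark(json.loads(BENCHMARK_JSON), stub_model, labels_match)
print(report)
```

Recording the failed case ids alongside the score makes it easier to see which inputs regressed, not just that the total moved.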

Tips for maintaining the benchmark over time

Re-run on every change. Replay the benchmark after each prompt edit and each model upgrade. That is the point of having it.
Add cases when you find a production failure. When a user reports a bad output, add that input to the benchmark so it can never regress undetected again.
Version your benchmark. The export carries a version field. Bump it when you change the cases, so you only ever compare scores that were measured on the same set of cases.
Keep the benchmark under version control alongside the prompt it tests, so the prompt and its evaluation live and evolve together.
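To make the versioning tip concrete, here is a small sketch of a regression check that refuses to compare scores from different benchmark versions. The run record shape (a dict with `version` and `score` keys) and the `tolerance` parameter are assumptions about how your harness might store results.

```python
def compare_runs(baseline, candidate, tolerance=0.0):
    """Compare two run records produced on the same benchmark version."""
    if baseline["version"] != candidate["version"]:
        raise ValueError(
            "benchmark versions differ; re-run the baseline on the new cases"
        )
    delta = candidate["score"] - baseline["score"]
    # A drop larger than the tolerance counts as a regression.
    return {"delta": delta, "regressed": delta < -tolerance}

before = {"version": 2, "score": 0.90}
after = {"version": 2, "score": 0.80}
print(compare_runs(before, after))
```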