What format does the export use?

Each example is an object with prompt, expected, and tags fields. JSONL puts one object per line (the format most eval harnesses expect for datasets); JSON wraps them in an array.

Where is my golden set stored?

Entirely in your browser's local storage. Nothing is uploaded to a server, so the set is private to this browser and persists across refreshes until you clear it.

What are tags for?

Tags let you group cases by feature, difficulty, or edge case, so you can filter or slice your eval results later. They are optional and stored alongside each example.

Can I import an existing set?

This tool focuses on building and exporting. To extend an existing set, paste cases in one at a time, or export and merge JSONL files with your harness's own tooling.
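As a rough illustration of the merge step, here is a minimal Python sketch that combines two exported sets. It assumes the prompt text identifies a case, so a later entry replaces an earlier one with the same prompt. That de-duplication rule and the names parse_jsonl and merge_sets are hypothetical choices for this example, not part of the tool.

```python
import json

def parse_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def merge_sets(*sets):
    """Merge golden sets in order.

    Assumption: the prompt identifies a case, so a later entry replaces
    an earlier one with the same prompt (it keeps the earlier position).
    """
    merged = {}
    for entries in sets:
        for entry in entries:
            merged[entry["prompt"]] = entry
    return list(merged.values())

# Two small hypothetical exports.
old = parse_jsonl(
    '{"prompt": "a", "expected": "1", "tags": []}\n'
    '{"prompt": "b", "expected": "2", "tags": []}\n'
)
new = parse_jsonl(
    '{"prompt": "b", "expected": "two", "tags": ["fixed"]}\n'
    '{"prompt": "c", "expected": "3", "tags": []}\n'
)
combined = merge_sets(old, new)
```

If two cases can legitimately share a prompt (for example, the same input checked against different assertions), a different key such as prompt plus expected would be a safer choice.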

What is the Golden Set Builder for LLM Evals?

A free golden set builder for LLM evaluations. Add, edit, and tag prompt and expected-output pairs, then export them as JSONL or JSON for Promptfoo, Braintrust, or a custom eval harness. It runs in your browser on Gera Tools and saves your set locally, with nothing uploaded.

Golden Set Builder for LLM Evals

Name: Golden Set Builder for LLM Evals
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Build a golden set for LLM regression testing

A golden set is the backbone of any serious LLM evaluation: a curated list of prompt / expected-output pairs you run on every prompt or model change to catch regressions. This tool gives you a fast form to add, edit, and tag those pairs, keeps them in your browser between sessions, and exports them as JSONL or JSON for Promptfoo, Braintrust, or a homegrown harness.

How it works

Each entry captures three things: the prompt your system would send, the expected output you want to assert against (an exact string, a substring, or a reference answer for an LLM judge), and optional tags to slice your results later. Entries are stored in your browser’s local storage, so refreshing the page won’t lose your work. When you’re ready, export the set: JSONL emits one object per line — the format most eval frameworks ingest directly as a dataset — while JSON wraps everything in a single array for tools that prefer it.
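To make the difference between the two export formats concrete, the following Python sketch serialises the same entries both ways. The sample entries and the helper names to_jsonl and to_json are made up for illustration. The tool itself runs in the browser, so this only mirrors the shape of its output rather than its actual implementation.

```python
import json

# Hypothetical entries with the three fields described above.
examples = [
    {"prompt": "Classify: 'I love it'", "expected": "POSITIVE", "tags": ["sentiment"]},
    {"prompt": "Classify: 'Broken on arrival'", "expected": "NEGATIVE",
     "tags": ["sentiment", "edge-case"]},
]

def to_jsonl(entries):
    """One JSON object per line, with no enclosing array."""
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in entries) + "\n"

def to_json(entries):
    """A single JSON array that holds every entry."""
    return json.dumps(entries, ensure_ascii=False, indent=2)

jsonl_text = to_jsonl(examples)
json_text = to_json(examples)
```

JSONL can be appended to and streamed line by line, which is one reason dataset tooling tends to favour it. The array form is easier to load in a single call.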

Why golden sets matter more than benchmarks

Public benchmarks (MMLU, HumanEval, HellaSwag) measure general model capability, but they do not tell you whether your specific prompt, system message, and use case still behave correctly after a model upgrade or a prompt edit. A golden set is a regression suite for your application layer: it answers the question “does my product still work?” rather than “is this model generally smart?”

Companies that ship reliable LLM features typically maintain at least two types of golden sets:

Unit-level golden cases — individual, targeted tests for known edge cases, previously failed prompts, and narrow behavioural assertions (for example, “the model should always return JSON with a status field”).
System-level golden cases — end-to-end examples that represent full user interactions, used to catch degradation in quality when switching models or providers.

What to put in the expected output field

The expected field is flexible — it does not have to be the exact model output. Use it for whatever your assertion strategy requires:

Exact match — Use when the output should be deterministic, for example a classification label (“POSITIVE”, “NEGATIVE”) or a structured response with fixed keys.

Substring assertion — Store just the fragment that must appear (“the total is”), and check with a contains assertion. Useful when the surrounding wording may vary.

Reference answer for LLM judge — Store a high-quality reference answer. At eval time, a grader model (often a stronger or separate model) compares the production output to the reference and assigns a similarity or quality score. Use this for open-ended generation tasks.

Regex pattern — Store a pattern that the output must match, for example ^\d{3}-\d{4}$ for a formatted code. Useful for structured generation tasks.
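The four strategies above can be sketched as a single checking function in Python. This is a minimal illustration under stated assumptions: the export has no field that records which strategy a case uses, so the mode argument here is hypothetical (it could be derived from a tag, for instance), and the judge is any callable you supply rather than a specific grader API.

```python
import re

def check(output, expected, mode, judge=None):
    """Return True if `output` satisfies `expected` under the given mode.

    `mode` is a hypothetical label, not a field in the exported file.
    """
    if mode == "exact":
        # Ignore leading/trailing whitespace, which models often add.
        return output.strip() == expected.strip()
    if mode == "contains":
        return expected in output
    if mode == "regex":
        return re.search(expected, output) is not None
    if mode == "judge":
        if judge is None:
            raise ValueError("judge mode needs a grader callable")
        # The callable receives (output, reference) and returns a bool.
        return bool(judge(output, expected))
    raise ValueError(f"unknown mode: {mode}")
```

For example, check("So the total is 42.", "the total is", "contains") passes, while check("5551234", r"^\d{3}-\d{4}$", "regex") fails because the hyphen is missing.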

Exporting and integrating with eval frameworks

The JSONL export produces one line per case:

{"prompt": "Summarise this article...", "expected": "...", "tags": ["summarization"]}
{"prompt": "Extract all dates from...", "expected": "[\"2023-01-15\"]", "tags": ["extraction"]}

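Most eval harnesses can ingest JSONL as a dataset, though field names differ between tools, so you may need to map prompt, expected, and tags onto the names your harness uses. In Promptfoo, reference the file as a tests dataset in your promptfooconfig.yaml. In Braintrust, upload the JSONL as a dataset. For a homegrown script, read the file line by line (readline in Node.js, open(...) in Python) and parse each line as JSON.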
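For the homegrown route, a loader in Python might look like the sketch below. It assumes the three-field shape shown above and treats tags as optional. The function names and the idea of failing fast on a missing field are choices made for this example.

```python
import json

REQUIRED_FIELDS = ("prompt", "expected")

def parse_cases(lines):
    """Parse an iterable of JSONL lines into a list of case dicts."""
    cases = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines, e.g. a trailing newline
        case = json.loads(line)
        missing = [key for key in REQUIRED_FIELDS if key not in case]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        case.setdefault("tags", [])  # tags are optional in the export
        cases.append(case)
    return cases

def load_golden_set(path):
    """Read a JSONL export from disk; the path is supplied by the caller."""
    with open(path, encoding="utf-8") as f:
        return parse_cases(f)
```

Calling load_golden_set with the path of your exported file returns a list you can loop over, sending each prompt to your system and comparing the result with expected.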

Tips for a strong golden set

Cover the boring happy path and the edge cases — empty inputs, adversarial prompts, and known past failures each deserve a case.
Tag by category (refunds, jailbreak, formatting) so a failing slice points you straight at the broken behaviour.
Keep expected outputs minimal and assertable. For open-ended tasks, store a reference answer and grade with an LLM judge rather than exact match.
Re-export after every editing session and commit the file to version control so your eval set evolves alongside your prompts.
Seed the golden set with real production logs. Real user prompts that previously produced bad outputs are the most valuable cases you have.
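To illustrate how tags turn a failing run into a pointer at the broken behaviour, here is a small Python sketch that computes a pass rate per tag. The results list is invented sample data, and the "untagged" bucket for cases without tags is an assumption of this example.

```python
from collections import defaultdict

def pass_rate_by_tag(results):
    """Compute the pass rate for each tag.

    `results` is an iterable of (tags, passed) pairs, one per case.
    Cases with no tags are counted under a hypothetical "untagged" bucket.
    """
    totals = defaultdict(lambda: [0, 0])  # tag -> [passed, total]
    for tags, passed in results:
        for tag in tags or ["untagged"]:
            totals[tag][1] += 1
            if passed:
                totals[tag][0] += 1
    return {tag: ok / total for tag, (ok, total) in totals.items()}

# Invented sample outcomes from one eval run.
results = [
    (["refunds"], True),
    (["refunds", "formatting"], False),
    (["jailbreak"], True),
    ([], True),
]
rates = pass_rate_by_tag(results)
```

A case with several tags counts towards each of them, so a single failure can lower more than one slice. That is usually what you want when the goal is to locate the behaviour that broke.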