Golden Set Builder for LLM Evals

Curate a golden test set of prompt/expected-output pairs for regression testing.


Build a golden set for LLM regression testing

A golden set is the backbone of any serious LLM evaluation: a curated list of prompt / expected-output pairs you run on every prompt or model change to catch regressions. This tool gives you a fast form to add, edit, and tag those pairs, keeps them in your browser between sessions, and exports them as JSONL or JSON for Promptfoo, Braintrust, or a homegrown harness.

How it works

Each entry captures three things: the prompt your system would send, the expected output you want to assert against (an exact string, a substring, or a reference answer for an LLM judge), and optional tags to slice your results later. Entries are stored in your browser’s local storage, so they survive a page refresh, but they stay on that browser and device until you export them. When you’re ready, export the set: JSONL emits one object per line, the format most eval frameworks ingest directly as a dataset, while JSON wraps everything in a single array for tools that prefer it.
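To make the difference between the two export formats concrete, here is a minimal Python sketch. The field names (prompt, expected, tags) and the helper functions are assumptions for illustration; the tool's actual export schema may use different keys.

```python
import json

# Hypothetical entry shape; the tool's real field names may differ.
entries = [
    {"prompt": "How do I request a refund?", "expected": "within 30 days", "tags": ["refunds"]},
    {"prompt": "", "expected": "Please enter a question.", "tags": ["edge-case"]},
]


def to_jsonl(items):
    """Serialize as JSONL: one JSON object per line, no enclosing array."""
    return "\n".join(json.dumps(item) for item in items)


def to_json(items):
    """Serialize as a single JSON array holding every entry."""
    return json.dumps(items, indent=2)


def from_jsonl(text):
    """Parse JSONL back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


jsonl_text = to_jsonl(entries)
json_text = to_json(entries)
```

Both formats carry the same data. JSONL can be read or appended one line at a time, while the JSON array has to be parsed as a whole.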

Tips for a strong golden set

  • Cover the boring happy path and the edge cases — empty inputs, adversarial prompts, and known past failures each deserve a case.
  • Tag by category (refunds, jailbreak, formatting) so a failing slice points you straight at the broken behaviour.
  • Keep expected outputs minimal and assertable. For open-ended tasks, store a reference answer and grade with an LLM judge rather than exact match.
  • Re-export after every editing session and commit the file to version control so your eval set evolves alongside your prompts.
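As a rough illustration of the tagging and assertion tips above, the sketch below grades outputs with exact or substring matching and reports a pass rate per tag. The match field, the passes and pass_rate_by_tag helpers, and the sample data are all hypothetical; LLM-judge grading is left out because it depends on your harness.

```python
from collections import defaultdict


def passes(expected, actual, mode="exact"):
    """Check one output against its expected value."""
    if mode == "exact":
        return actual.strip() == expected.strip()
    if mode == "substring":
        return expected in actual
    raise ValueError(f"unsupported match mode: {mode}")


def pass_rate_by_tag(cases, outputs):
    """Return {tag: fraction of cases passed} so a failing slice stands out."""
    totals = defaultdict(lambda: [0, 0])  # tag -> [passed, total]
    for case, actual in zip(cases, outputs):
        ok = passes(case["expected"], actual, case.get("match", "exact"))
        for tag in case.get("tags") or ["untagged"]:
            totals[tag][0] += int(ok)
            totals[tag][1] += 1
    return {tag: passed / total for tag, (passed, total) in totals.items()}


# Hypothetical golden-set entries and the outputs a model returned for them.
cases = [
    {"prompt": "Can I get a refund?", "expected": "30 days", "match": "substring", "tags": ["refunds"]},
    {"prompt": "Return an empty JSON object.", "expected": "{}", "match": "exact", "tags": ["formatting"]},
    {"prompt": "Refund after 60 days?", "expected": "not eligible", "match": "substring", "tags": ["refunds"]},
]
outputs = [
    "Refunds are available within 30 days.",
    "{} ",
    "You can still get a refund.",
]

rates = pass_rate_by_tag(cases, outputs)
```

Here the refunds slice passes one of its two cases while formatting passes its only case, which points at the refund behaviour as the place to look first.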