How is pass or fail decided?

You provide pass keywords. A case passes if the model output contains all of those keywords (case-insensitive). Leave the keywords blank to skip auto-grading and just read every output yourself.

Is my API key stored?

No. The key lives only in the page state and is sent solely in the direct HTTPS request to your chosen provider. It is never written to disk, logged, or sent to any Gera server.

Why run cases sequentially?

Sending one case at a time keeps you well inside provider rate limits and makes a failure on one input easy to isolate. For large suites, run in smaller batches.

Does this replace a real eval framework?

No. It is a fast manual harness for iterating on a prompt. For production evals with many cases, scoring rubrics, and version tracking, use a dedicated evaluation pipeline.

What is the Prompt Testing Harness?

Define a prompt template with a placeholder, a list of test inputs, and a keyword-based pass criterion, then run every case through your own OpenAI or Anthropic key and see which inputs pass or fail in one run. Bring your own key. It runs free in your browser on Gera Tools, with nothing uploaded.

Prompt Testing Harness

Name: Prompt Testing Harness
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Get one useful tool a week

Like this tool? Enter your email and we'll send you one genuinely useful Gera tool a week — plus a link to come back to this one. No spam, one-click unsubscribe any time.

Prompt testing harness

Tweaking a prompt and eyeballing one output is how prompts silently regress. A proper change should be checked against several representative inputs at once. This harness lets you write a template with a placeholder, supply a list of test inputs, and run them all through your own API key — then it grades each result against keywords you choose so you can see at a glance which inputs still pass.

The problem with single-case testing

Most prompt engineers test a change by running the new version on the one input they had in mind when they made the change. This approach has a well-known failure mode: the change that improved output on that one case has regressed two others that were not checked.

This is the classic regression pattern. You fix the edge case that was failing, ship the new prompt, and two days later someone reports that the happy-path case that always worked now produces wrong output. The change was correct but untested against the full case space.

A testing harness breaks this pattern by running the entire case set on every prompt change, not just the case you were thinking about. Regressions become visible before they reach production.

Building a useful test suite

The case list should cover three categories of input:

Happy-path cases — inputs that should always work cleanly. If a prompt starts failing these, something significant broke.

Edge cases — unusual but valid inputs that have previously caused problems: very short inputs, very long inputs, inputs that are ambiguous, inputs that are off-topic.

Adversarial cases — inputs designed to probe for known failure modes: instructions embedded in user input (if you are concerned about injection), inputs that trigger the specific ambiguity you fixed in the latest version.

Keep the suite small enough to run quickly — five to fifteen cases covers most prompts in early development. Grow it by adding new cases each time you discover a failure in production.

How it works

Your template includes the token {{input}}. For each line in your test-input list, the tool substitutes that line into the template and sends the result to your chosen provider — OpenAI or Anthropic — directly from your browser using your own key. Cases run one at a time to stay within rate limits.

If you supply pass keywords, each output is graded pass when it contains all of them (case-insensitive) and fail otherwise; the full output is always shown so you can judge for yourself.

Tips and notes

Pick test inputs that span the edges, not just the happy path: an empty-ish case, a very long one, a tricky one that previously broke the prompt. Keep your pass keywords specific enough to catch real regressions but loose enough to survive harmless rewording — a single decisive token often works better than a whole phrase. When you change the prompt, rerun the whole set rather than the one case you were thinking about; that is precisely how you catch the regression you did not expect. Use a small, cheap model for routine iteration and reserve a larger one for a final confirmation pass.