How is this different from the side-by-side model tester?

The side-by-side tool holds one prompt fixed and varies the model. This tool does the reverse: it fixes the model and varies the prompt, so you isolate prompt wording rather than model choice. It also keeps a win tally across runs.

Where do my API key and tally go?

The key is used only for the direct browser-to-provider request and is never stored. The win tally is kept in your browser's local storage so it survives refreshes, but it never leaves your device.

Why run the same pair multiple times?

LLM outputs vary, especially above zero temperature. Voting across several runs gives you a real signal about which prompt is more reliable, not just which one got lucky once.

Should I test at zero temperature?

For a pure wording comparison, low temperature reduces noise. But if you will ship at a higher temperature, test there too — a prompt that is robust at temperature 0 can still drift at temperature 0.8.

How do I reset the tally?

A reset button clears the stored win counts for the current prompt pair. It only affects your local browser data.

What is the A/B Prompt Tester (BYO Key)?

Run prompt A against prompt B on the same model with your own API key, read both outputs side by side, vote for a winner, and build a local tally of which prompt wins over repeated runs. Everything runs client-side and free in your browser on Gera Tools: prompts go directly from your browser to your provider, and nothing is stored on a server.

A/B Prompt Tester (BYO Key)

Name: A/B Prompt Tester (BYO Key)
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

A/B prompt tester

Two phrasings of the same instruction can produce very different model behaviour, and the only honest way to know which wins is to run them head to head. This tool sends prompt A and prompt B to the same model with your own API key, shows both outputs side by side, and lets you vote a winner. It keeps a running tally in your browser, so across several runs a genuine winner emerges instead of one lucky output deciding it.

How it works

You write two prompt variations and pick a single model with shared temperature and max-token settings. Both prompts run against that one model in parallel, so the prompt wording is the only thing that differs. You read the two responses with their latency and token counts, then click to record which one won. The tool stores a per-prompt win count locally (shape-guarded so a corrupted entry never breaks the page) and shows the tally so far. Your key is used only for the direct provider calls and is never stored; the tally lives only in your browser and can be reset at any time.
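The shape-guarded tally mentioned above could look roughly like the sketch below. It is an illustration only, not the tool's actual code: the storage key, the `Tally` shape, and the function names are all assumptions. The store is passed in as a small interface so that `window.localStorage` can be supplied in a browser.

```typescript
// Hypothetical storage key and tally shape; the tool's real names are not documented.
const TALLY_KEY = "ab-prompt-tally";

interface Tally {
  a: number;
  b: number;
}

// Minimal subset of the Web Storage API, so window.localStorage can be passed in directly.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// A valid count is a non-negative whole number (NaN and Infinity fail the % check).
const isCount = (v: unknown): v is number =>
  typeof v === "number" && v % 1 === 0 && v >= 0;

// Returns a valid tally, falling back to zeros if the stored value is missing or malformed.
function loadTally(store: KeyValueStore): Tally {
  const empty: Tally = { a: 0, b: 0 };
  const raw = store.getItem(TALLY_KEY);
  if (raw === null) return empty;
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed !== "object" || parsed === null) return empty;
    const { a, b } = parsed as Record<string, unknown>;
    return isCount(a) && isCount(b) ? { a, b } : empty;
  } catch {
    return empty; // stored text was not valid JSON
  }
}

// Adds one win for the chosen prompt and writes the tally back to the store.
function recordWin(store: KeyValueStore, winner: "a" | "b"): Tally {
  const tally = loadTally(store);
  tally[winner] += 1;
  store.setItem(TALLY_KEY, JSON.stringify(tally));
  return tally;
}
```

In a browser, `recordWin(window.localStorage, "a")` would record a vote for prompt A. Because `loadTally` validates every field before trusting it, a hand-edited or truncated entry resets the tally to zero instead of throwing.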

What to vary and what to keep fixed

The whole point of an A/B test is isolating one variable. The model, temperature, max tokens, and any system prompt should all stay identical between A and B. Only the user-facing prompt wording changes. If you vary two things — say the wording and the temperature — and A wins, you cannot tell which change caused it.

Good single-variable changes to test:

Role framing — “You are a senior copywriter” vs. “Write marketing copy for…”
Output format instruction — “Return a bulleted list” vs. “Return a numbered list”
Level of detail — “Summarise in one sentence” vs. “Summarise in two sentences”
Constraint placement — instructions placed at the start of the prompt vs. at the end
Persona or audience specification — “for a technical audience” vs. “for a general audience”
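The fixed-versus-varied split in this section can be sketched in code. This is a minimal illustration under stated assumptions: `SharedSettings`, `CallModel`, and `runPair` are hypothetical names, and the provider call is injected as a function so the sketch does not depend on any real API. Both runs receive the same settings object, so the prompt text is the only input that differs.

```typescript
// Settings shared by both runs. Field names are illustrative, not a real provider schema.
interface SharedSettings {
  model: string;
  temperature: number;
  maxTokens: number;
  systemPrompt?: string;
}

interface RunResult {
  text: string;
  latencyMs: number;
}

// Hypothetical provider call, supplied by the caller.
type CallModel = (settings: SharedSettings, prompt: string) => Promise<string>;

// Runs one prompt and measures how long the call took.
async function timedRun(
  call: CallModel,
  settings: SharedSettings,
  prompt: string
): Promise<RunResult> {
  const start = Date.now();
  const text = await call(settings, prompt);
  return { text, latencyMs: Date.now() - start };
}

// Starts both runs before awaiting either, so they execute in parallel
// with identical settings and different prompt wording.
async function runPair(
  call: CallModel,
  settings: SharedSettings,
  promptA: string,
  promptB: string
): Promise<{ a: RunResult; b: RunResult }> {
  const [a, b] = await Promise.all([
    timedRun(call, settings, promptA),
    timedRun(call, settings, promptB),
  ]);
  return { a, b };
}
```

Passing a single settings object to both calls, rather than building two, makes it hard to vary temperature or max tokens between A and B by accident.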

How many runs before you trust the result?

With LLM outputs, a single comparison is close to noise, especially at temperatures above 0. A prompt that “wins” once might simply have gotten a good sample. As a rough guide:

3 runs — useful for eliminating an obviously worse option
5 to 7 runs — enough to see a consistent lean
10+ runs — meaningful for prompts going into production systems

The running tally in this tool is designed for this workflow: run the pair several times across different sessions to build a genuine signal rather than anchoring on a single result.
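One way to sanity-check the run counts above is a two-sided sign test: if the two prompts were really equally good, how often would a tally at least this lopsided appear by chance? The sketch below is an optional check you could run on your own tally, not something the tool is documented to compute, and `signTestPValue` is a hypothetical name. It assumes every vote is independent and ignores ties.

```typescript
// Number of ways to choose k items from n, computed iteratively.
function binomialCoefficient(n: number, k: number): number {
  const m = Math.min(k, n - k);
  let c = 1;
  for (let i = 1; i <= m; i++) {
    c = (c * (n - m + i)) / i;
  }
  return c;
}

// Two-sided exact sign test: the probability of a split at least this
// uneven if each vote were a fair coin flip between A and B.
function signTestPValue(winsA: number, winsB: number): number {
  const n = winsA + winsB;
  if (n === 0) return 1;
  const larger = Math.max(winsA, winsB);
  let tail = 0;
  for (let i = larger; i <= n; i++) {
    tail += binomialCoefficient(n, i);
  }
  return Math.min(1, (2 * tail) / Math.pow(2, n));
}
```

Under these assumptions, a 3-0 tally gives 0.25, a 5-0 tally gives 0.0625, and a 9-1 tally gives about 0.021. That matches the rough guide: three runs can rule out an obviously worse prompt, but it takes closer to ten votes before a lead is unlikely to be luck.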

Tips and notes

Change one thing. If A and B differ in five ways you will not know which change mattered. Vary wording deliberately, one idea at a time.
Vote over several runs. One comparison is noise; five to ten votes start to form a signal. The tally is built for exactly this.
Match your shipping temperature. Test where you will run in production so the winner holds up after launch.
Watch tokens, not just quality. A slightly better prompt that doubles output length can lose on cost at scale.
Test both happy and edge-case inputs. A prompt that wins on a clean, typical input can fail on an ambiguous or adversarial one.
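To make the "watch tokens" tip concrete, the arithmetic can be sketched as below. The prices and token counts are made-up placeholders, not any provider's real rates, and `costPerCall` is a hypothetical helper. Substitute your provider's published pricing and the token counts the tool reports for each run.

```typescript
interface RunUsage {
  inputTokens: number;
  outputTokens: number;
}

// Prices per million tokens, in whatever currency your provider bills.
interface Pricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

// Cost of a single call given its token usage.
function costPerCall(usage: RunUsage, pricing: Pricing): number {
  return (
    (usage.inputTokens * pricing.inputPerMillion +
      usage.outputTokens * pricing.outputPerMillion) /
    1e6
  );
}

// Placeholder numbers for illustration only.
const examplePricing: Pricing = { inputPerMillion: 1, outputPerMillion: 4 };
const promptAUsage: RunUsage = { inputTokens: 200, outputTokens: 300 };
const promptBUsage: RunUsage = { inputTokens: 220, outputTokens: 600 };

const callsPerMonth = 100000;
const monthlyCostA = costPerCall(promptAUsage, examplePricing) * callsPerMonth;
const monthlyCostB = costPerCall(promptBUsage, examplePricing) * callsPerMonth;
console.log({ monthlyCostA, monthlyCostB });
```

With these placeholder figures, prompt B's doubled output length makes it nearly twice as expensive per call, so it would need to be clearly better, not marginally better, to justify shipping.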