A/B prompt tester
Two phrasings of the same instruction can produce very different model behaviour, and the only honest way to know which wins is to run them head to head. This tool sends prompt A and prompt B to the same model with your own API key, shows both outputs side by side, and lets you vote for a winner. It keeps a running tally in your browser, so across several runs a genuine winner emerges instead of one lucky output deciding it.
How it works
You write two prompt variations and pick a single model with shared temperature and max-token settings. Both prompts run against that one model in parallel, so the prompt wording is the only thing that differs. You read the two responses with their latency and token counts, then click to record which one won. The tool stores a per-prompt win count locally (shape-guarded so a corrupted entry never breaks the page) and shows the tally so far. Your key is used only for the direct provider calls and is never stored; the tally lives only in your browser and can be reset at any time.
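The shape guard on the stored tally can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the `Tally` type and the `parseTally` and `recordVote` names are assumptions made for the example. The idea is that anything read back from browser storage is treated as untrusted, and a malformed value falls back to an empty tally instead of throwing.

```typescript
// Hypothetical names throughout; the tool's real internals are not shown here.
type Tally = { a: number; b: number };

const EMPTY_TALLY: Tally = { a: 0, b: 0 };

// Accept a count only if it is a non-negative whole number.
function isCount(value: unknown): value is number {
  return typeof value === "number" && Number.isInteger(value) && value >= 0;
}

// Parse whatever was read from storage. Any malformed input
// (missing, invalid JSON, wrong shape) yields an empty tally.
function parseTally(raw: string | null): Tally {
  if (raw === null) return { ...EMPTY_TALLY };
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed === "object" && parsed !== null) {
      const candidate = parsed as Record<string, unknown>;
      const a = candidate["a"];
      const b = candidate["b"];
      if (isCount(a) && isCount(b)) {
        return { a, b };
      }
    }
  } catch {
    // Invalid JSON: fall through to the empty tally below.
  }
  return { ...EMPTY_TALLY };
}

// Return a new tally with one more win for the chosen prompt.
function recordVote(tally: Tally, winner: "a" | "b"): Tally {
  return winner === "a"
    ? { a: tally.a + 1, b: tally.b }
    : { a: tally.a, b: tally.b + 1 };
}
```

In a browser this would typically be fed from `localStorage.getItem(...)` under whatever key the page uses, and the result of `recordVote` written back with `JSON.stringify`. Keeping the parser a pure function of the raw string makes the corrupted-entry case easy to check without a browser.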
Tips and notes
- Change one thing. If A and B differ in five ways you will not know which change mattered. Vary wording deliberately, one idea at a time.
- Vote over several runs. One comparison is noise; five to ten votes is a signal. The tally is built for exactly this.
- Match your shipping temperature. Test at the temperature you will use in production so the winner holds up after launch.
- Watch tokens, not just quality. A slightly better prompt that doubles output length can lose on cost at scale.
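The last tip is easy to check with a little arithmetic. The sketch below is illustrative only: the `Pricing` and `Usage` types, the `costAtScale` helper, and especially the prices are made-up assumptions, so substitute your provider's real rates and the token counts the tool reports for each run.

```typescript
// Hypothetical prices in currency units per million tokens; not real provider rates.
type Pricing = { inputPerMillion: number; outputPerMillion: number };
type Usage = { inputTokens: number; outputTokens: number };

// Total cost of making `calls` requests that each use `usage` tokens.
function costAtScale(usage: Usage, pricing: Pricing, calls: number): number {
  const perCallTimesMillion =
    usage.inputTokens * pricing.inputPerMillion +
    usage.outputTokens * pricing.outputPerMillion;
  return (perCallTimesMillion * calls) / 1000000;
}

const pricing: Pricing = { inputPerMillion: 1, outputPerMillion: 4 };

// Same input length; prompt B produces twice the output of prompt A.
const promptA: Usage = { inputTokens: 200, outputTokens: 300 };
const promptB: Usage = { inputTokens: 200, outputTokens: 600 };

const calls = 1000000;
const costA = costAtScale(promptA, pricing, calls); // 1400
const costB = costAtScale(promptB, pricing, calls); // 2600
```

Under these assumed numbers, doubling the output length nearly doubles the bill, so prompt B would need to be clearly better, not marginally better, to justify shipping it.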