Prompt A/B Comparator (BYO-key)

Send two prompt variants to the same model and diff the outputs

A/B test two prompt wordings, side by side

Small wording changes can shift an LLM's output substantially, but eyeballing the difference one run at a time is slow. This comparator takes two prompt variants, sends both to the same model in parallel using your own API key, and lays the responses side by side with a word-level diff, so you can see exactly where they diverge and decide which wording to ship.

How it works

You provide prompt A and prompt B and choose a single provider and model. The tool sends both requests at once using your own OpenAI or Anthropic key, waits for both to return, and renders the responses in two columns. It then computes a word-level diff: words unique to A are marked one way, words unique to B another, and shared words are left plain. That gives an at-a-glance map of where the two outputs differ (phrasing, structure, inclusions and omissions) without you having to re-read both in full.
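The flow described above can be sketched in a few lines of Python. This is a minimal illustration and not the tool's actual implementation: `word_diff`, `compare`, and the `call_model` callable are hypothetical names, the diff uses the standard library's `difflib` (the tool may use a different algorithm), and `call_model` stands in for whatever function sends one prompt to your chosen provider and returns the response text.

```python
import difflib
from concurrent.futures import ThreadPoolExecutor


def word_diff(a: str, b: str):
    """Split both texts on whitespace and label runs of words as
    'same', 'only_a' (unique to A) or 'only_b' (unique to B)."""
    words_a, words_b = a.split(), b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    segments = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            segments.append(("same", words_a[i1:i2]))
        else:
            # 'replace', 'delete' and 'insert' all reduce to
            # "words only in A" and/or "words only in B".
            if i2 > i1:
                segments.append(("only_a", words_a[i1:i2]))
            if j2 > j1:
                segments.append(("only_b", words_b[j1:j2]))
    return segments


def compare(prompt_a: str, prompt_b: str, call_model):
    """Send both prompts concurrently, wait for both, then diff the outputs.

    call_model is a hypothetical callable (prompt -> response text) that
    you would implement with your own provider client and API key.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_a = pool.submit(call_model, prompt_a)
        future_b = pool.submit(call_model, prompt_b)
        out_a, out_b = future_a.result(), future_b.result()
    return out_a, out_b, word_diff(out_a, out_b)
```

Splitting on whitespace keeps punctuation attached to words, so "mat" and "mat." count as different words; a real renderer might tokenize more carefully before diffing.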

Tips and notes

Because LLM sampling is non-deterministic, a single comparison can mislead: some of the difference you see may be chance variation rather than an effect of the wording. Run the same pair two or three times, or pair this with a majority-voting tool, before concluding that one prompt is genuinely better. Keep the two variants identical except for the change you are testing, so the diff isolates the effect of that one change. The word-level diff is a visual aid, not a quality score, so read both outputs and judge them against your actual goal. Each comparison costs two calls on your key. The key is used only for the direct request to the provider and is never stored.
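One rough way to tell chance variation from a real effect is to compare how much repeated runs of the same prompt differ from each other with how much A's outputs differ from B's. The sketch below is an illustration under stated assumptions and not a feature of this tool: `similarity` and `noise_check` are hypothetical helpers, the word-overlap ratio from `difflib` is a crude proxy for similarity, and the outputs are assumed to have been collected already from repeated runs.

```python
import difflib
from itertools import combinations, product
from statistics import mean


def similarity(x: str, y: str) -> float:
    """Word-level similarity in [0, 1]; 1.0 means identical word sequences."""
    return difflib.SequenceMatcher(
        a=x.split(), b=y.split(), autojunk=False
    ).ratio()


def noise_check(outputs_a, outputs_b):
    """Return (mean similarity within the same prompt, mean similarity A vs B).

    Needs at least two outputs for at least one of the prompts; otherwise
    there is no same-prompt pair to measure run-to-run variation with.
    """
    within = [
        similarity(x, y)
        for group in (outputs_a, outputs_b)
        for x, y in combinations(group, 2)
    ]
    across = [similarity(x, y) for x, y in product(outputs_a, outputs_b)]
    return mean(within), mean(across)
```

If the A-versus-B similarity is about the same as the same-prompt similarity, the difference you saw in one comparison may be sampling noise. If it is clearly lower, the wording change is probably doing something, though whether that something is better still takes reading the outputs.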
