How is the comparison fair?

Both variants run against the same model with the same request settings, in parallel, so differences in output reflect the prompt wording rather than model or timing differences. Sampling is still non-deterministic, so re-run the pair a few times before drawing a conclusion.
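
As a rough illustration of "same settings, in parallel", here is a minimal Python sketch. It is not the tool's actual code: `RequestSettings`, `call_model` and `run_pair` are hypothetical names, and `call_model` is a stand-in that fakes a reply instead of contacting a provider.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestSettings:
    # Hypothetical settings bundle; the field names are illustrative only.
    model: str
    temperature: float
    max_tokens: int


async def call_model(prompt: str, settings: RequestSettings) -> str:
    """Placeholder for the real provider request (an assumption, not a real API)."""
    await asyncio.sleep(0.01)  # stands in for network latency
    return f"[{settings.model} t={settings.temperature}] reply to: {prompt}"


async def run_pair(prompt_a: str, prompt_b: str, settings: RequestSettings) -> tuple[str, str]:
    # The same settings object is passed to both calls, so only the prompt differs.
    # gather() starts both coroutines concurrently and returns results in order.
    out_a, out_b = await asyncio.gather(
        call_model(prompt_a, settings),
        call_model(prompt_b, settings),
    )
    return out_a, out_b


settings = RequestSettings(model="example-model", temperature=0.7, max_tokens=256)
out_a, out_b = asyncio.run(
    run_pair(
        "Summarise this article",
        "Write a three-sentence summary of this article",
        settings,
    )
)
```

Sharing one frozen settings object makes it hard to give the two variants different parameters by accident.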

What does the diff highlight show?

A word-level diff between the two outputs — words present only in A, only in B, and words common to both. It is a quick visual cue for where the responses diverge, not a semantic judgement of which is better.

Does running both cost more?

Yes. Each comparison makes two API calls billed against your own key, one per variant. The tool sends them in parallel so you get both results at once.

Your key stays in the browser tab and is sent only to the provider's official endpoint with each request. It is never logged, stored or transmitted to Gera.

What is the Prompt A/B Comparator (BYO-key)?

Enter two prompt variants and your own OpenAI or Anthropic key; the tool calls the model with both in parallel and shows the outputs side by side with a word-level diff highlight. It is a fast way to A/B test prompt wording, and it runs free in your browser on Gera Tools: your key is never stored and nothing is uploaded.

Prompt A/B Comparator (BYO-key)

Name: Prompt A/B Comparator (BYO-key)
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

A/B test two prompt wordings, side by side

Small wording changes can move an LLM’s output a lot — but eyeballing the difference one run at a time is slow. This comparator takes two prompt variants, calls the same model with both in parallel using your own API key, and lays the responses side by side with a word-level diff so you can see exactly where they diverge and decide which wording to ship.

How it works

You provide prompt A and prompt B and choose a single model and provider. The tool fires both requests at once against your own OpenAI or Anthropic key, waits for both to return, and renders them in two columns. It then computes a word-level diff: words unique to A are marked one way, words unique to B another, and shared words left plain. That gives an at-a-glance map of where the two outputs differ — phrasing, structure, inclusions and omissions — without you re-reading both in full.
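
The word-level diff described above can be sketched with Python's standard `difflib`. This is one plausible approach, not necessarily the algorithm the tool uses; `word_diff`, `render` and the bracket markers are illustrative choices.

```python
from difflib import SequenceMatcher


def word_diff(text_a: str, text_b: str) -> list[tuple[str, str]]:
    """Return (tag, word) pairs where tag is 'both', 'a_only' or 'b_only'."""
    words_a, words_b = text_a.split(), text_b.split()
    matcher = SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    result: list[tuple[str, str]] = []
    for op, a1, a2, b1, b2 in matcher.get_opcodes():
        if op == "equal":
            result.extend(("both", w) for w in words_a[a1:a2])
        else:
            # 'replace', 'delete' and 'insert' all reduce to:
            # words only in A, followed by words only in B.
            result.extend(("a_only", w) for w in words_a[a1:a2])
            result.extend(("b_only", w) for w in words_b[b1:b2])
    return result


def render(diff: list[tuple[str, str]]) -> str:
    """Plain-text rendering: [-word-] is unique to A, {+word+} is unique to B."""
    marks = {"both": "{}", "a_only": "[-{}-]", "b_only": "{{+{}+}}"}
    return " ".join(marks[tag].format(word) for tag, word in diff)


diff = word_diff("The cat sat on the mat", "The dog sat on a mat")
print(render(diff))  # The [-cat-] {+dog+} sat on [-the-] {+a+} mat
```

Splitting on whitespace keeps punctuation attached to words, so "mat" and "mat." count as different. A real implementation might normalise punctuation first.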

Designing a fair A/B test

The most common mistake in prompt A/B testing is changing too many things at once. When prompt A and prompt B differ in three places — the instruction verb, the output format request, and an example — and the outputs diverge, you cannot tell which change drove the difference. Good A/B prompt testing isolates one variable at a time.

What to change in a single comparison:

Instruction phrasing: “Summarise this article” versus “Write a three-sentence summary of this article” — tests specificity.
Role framing: no system role versus “You are a professional copywriter” — tests persona effect.
Output format: free text versus “Use exactly three bullet points” — tests format constraint.
Tone directive: adding “Be concise and direct” versus leaving it absent — tests behavioral steering.
Example inclusion: one-shot example versus zero-shot — tests few-shot effect.

Change one of these per comparison and you get a meaningful signal from each run.
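
If you assemble prompts from parts, a small check can catch variants that differ in more than one place. The sketch below is only an illustration: the `PromptSpec` fields mirror the list above but are hypothetical, not part of the tool.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass(frozen=True)
class PromptSpec:
    # Hypothetical decomposition of a prompt into the variables listed above.
    instruction: str
    role: Optional[str] = None
    output_format: Optional[str] = None
    tone: Optional[str] = None
    example: Optional[str] = None


def changed_fields(a: PromptSpec, b: PromptSpec) -> list[str]:
    """Names of the fields whose values differ between two specs."""
    return [f.name for f in fields(PromptSpec) if getattr(a, f.name) != getattr(b, f.name)]


def is_single_variable(a: PromptSpec, b: PromptSpec) -> bool:
    """True when exactly one field differs, i.e. the comparison isolates one variable."""
    return len(changed_fields(a, b)) == 1


base = PromptSpec(instruction="Summarise this article")

# Changes only the role: a clean single-variable comparison.
variant_role = replace(base, role="You are a professional copywriter")

# Changes instruction and format together: the cause of any difference is unclear.
variant_many = replace(
    base,
    instruction="Write a three-sentence summary of this article",
    output_format="Use exactly three bullet points",
)

print(changed_fields(base, variant_role))  # ['role']
print(changed_fields(base, variant_many))  # ['instruction', 'output_format']
```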

Reading the word-level diff

The word-level diff is a fast visual signal, not a quality score. Some patterns to look for:

Large divergence in the middle of the response — the two prompts may be producing structurally different outputs rather than just paraphrasing the same content. Neither is automatically better; decide against your goal.

Unique words only at the start or end — the model reached the same conclusion in both runs but via different openings or closings. This often indicates a difference in tone or framing without a significant change in content quality.

Nearly identical outputs — the change between A and B did not affect the model’s response. The variation you introduced was not legible to the model, or the instruction was so constrained that any wording produced the same result.

Contradictory content — words in A directly conflict with words in B (for example, “do include” vs. “do not include”). This is a prompt ambiguity signal: the model’s random sampling is splitting between two interpretations of your instruction, and you should rewrite to eliminate the ambiguity.
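
To make the positional patterns above concrete, here is a rough heuristic sketch. It reports an overall similarity ratio and counts where the words unique to A fall (first, middle or last third of A). The function name and the split into thirds are assumptions for illustration. It ignores words that appear only in B and cannot detect contradictions, which still need a human read.

```python
from difflib import SequenceMatcher


def divergence_profile(text_a: str, text_b: str) -> dict:
    """Similarity ratio plus counts of A-only words per third of output A."""
    words_a, words_b = text_a.split(), text_b.split()
    matcher = SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    thirds = [0, 0, 0]  # start, middle, end
    n = max(len(words_a), 1)  # avoid division by zero on empty output
    for op, a1, a2, _b1, _b2 in matcher.get_opcodes():
        if op in ("replace", "delete"):
            for i in range(a1, a2):
                thirds[min(3 * i // n, 2)] += 1
    return {
        "similarity": matcher.ratio(),  # 1.0 means identical word sequences
        "start": thirds[0],
        "middle": thirds[1],
        "end": thirds[2],
    }


profile = divergence_profile(
    "Hello there the report shows costs rose this year",
    "In short the report shows costs rose this year",
)
# Here only the opening differs: start == 2, middle == 0, end == 0.
```

A high similarity with differences only at the start or end matches the "different openings or closings" pattern. Many differences in the middle point to structurally different outputs.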

Sample size and non-determinism

LLM outputs are non-deterministic: the same prompt run twice will often produce noticeably different responses. A single A/B comparison can be misleading if one variant happened to draw a good sample and the other did not.

As a rough guide, run the same pair three to five times and look for consistent patterns across runs rather than reacting to a single result. If the outputs are highly variable run-to-run, that is also useful information: it suggests the task is under-specified, and you may need to add constraints to the prompt (format, length, examples) to reduce variance before you can meaningfully compare wordings.
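
One way to check whether a difference between A and B exceeds run-to-run noise is to compare similarity within each variant's runs against similarity between the variants. The sketch below uses word-level `difflib` similarity as a crude proxy. The function names and sample outputs are made up for illustration, and the tool itself does not compute this.

```python
from difflib import SequenceMatcher
from itertools import combinations, product
from statistics import mean


def similarity(x: str, y: str) -> float:
    """Word-level similarity in [0, 1]; 1.0 means identical word sequences."""
    return SequenceMatcher(a=x.split(), b=y.split(), autojunk=False).ratio()


def within_similarity(runs: list[str]) -> float:
    """Mean similarity over all pairs of runs of one variant (needs at least 2 runs)."""
    return mean(similarity(x, y) for x, y in combinations(runs, 2))


def between_similarity(runs_a: list[str], runs_b: list[str]) -> float:
    """Mean similarity over every A-run versus every B-run."""
    return mean(similarity(x, y) for x, y in product(runs_a, runs_b))


# Made-up outputs from three runs of each variant.
runs_a = [
    "costs rose and sales fell",
    "costs rose and sales fell",
    "costs rose while sales fell",
]
runs_b = [
    "costs up sales down",
    "costs up sales down",
    "costs up and sales down",
]

within_a = within_similarity(runs_a)
within_b = within_similarity(runs_b)
between = between_similarity(runs_a, runs_b)

# If 'between' is clearly lower than both 'within' values, the wording change
# shifts the output more than sampling noise does.
wording_matters = between < min(within_a, within_b)
```

If the within-variant scores are themselves low, that is the high-variance case described above, and tightening the prompt probably comes before comparing wordings.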

Pair this tool with the Prompt Benchmark Builder when you want to move from informal visual comparison to a structured evaluation against fixed test cases with scoring.

Each comparison costs two API calls on your own key. The key is used only for the direct request to the provider’s official endpoint and is never stored or logged.