Yes. The key lives only in your browser memory for the duration of the request and is sent directly to the provider over HTTPS. It is never logged, stored, or sent to any Gera server.

Why are the two replies different lengths or styles?

That difference is exactly what you're measuring. The system prompt steers tone, format, and behaviour, so comparing A and B on the same user message isolates the effect of the prompt change.

Which providers are supported?

OpenAI (chat completions) and Anthropic (messages). Pick the provider that matches your key, then select a model from the dropdown.

Does this cost money?

Yes — each run makes two real API calls billed to your own account at your provider's normal token rates. The token usage shown helps you estimate that cost.

What is the System Prompt A/B Tester (BYO-key)?

Send the same user message with two different system prompts to OpenAI or Anthropic and see both replies side by side, with latency and token usage for each. Bring your own API key — nothing is stored. It runs free in your browser on Gera Tools, with nothing uploaded.

System Prompt A/B Tester (BYO-key)

Name: System Prompt A/B Tester (BYO-key)
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Get one useful tool a week

Like this tool? Enter your email and we'll send you one genuinely useful Gera tool a week — plus a link to come back to this one. No spam, one-click unsubscribe any time.

A/B test two system prompts in one click

Prompt engineering is empirical: the only way to know whether a wording change helps is to run both versions on the same input. This tool sends an identical user message with system prompt A and system prompt B to your chosen model, then shows both replies side by side along with latency and token usage — so you can compare quality and cost directly.

Why side-by-side testing beats guessing

Prompt changes that seem obviously better often aren’t — and changes that seem trivially cosmetic sometimes dramatically affect output. The only reliable way to evaluate a change is to hold everything else constant and observe both outputs on the same input. Doing this mentally, by running one version and remembering what the other “probably would have said,” leads to confirmation bias. This tool makes the controlled comparison automatic.

How it works

You bring your own OpenAI or Anthropic API key. When you click Run, the tool fires two requests in parallel — same user message, different system prompt — and renders each response in its own column. For OpenAI it calls the chat-completions endpoint; for Anthropic it calls the messages API. Everything happens client-side; your key never leaves your machine except in the direct HTTPS call to the provider.

The latency column shows how long each call took wall-clock time, and the token-usage column shows input and output tokens. Both matter for production systems where prompt cost and response time affect user experience and infrastructure budget.

What to test and how to read the results

Change one thing at a time. If you rewrite the role section and the constraints section simultaneously, you cannot tell which change caused the difference in output. Treat each A/B run as a controlled experiment: isolate the variable.

Common high-value experiments:

Role precision: “You are an assistant” (A) vs. “You are a senior product manager specialising in B2B SaaS pricing” (B).
Output format: free-form prose (A) vs. explicit JSON structure (B).
Constraint placement: constraints at the end of the prompt (A) vs. immediately after the role (B).
Length instruction: no length guidance (A) vs. “respond in under 150 words” (B).

Interpreting the output:

Look for differences in format, length, specificity, tone, and whether any constraints were followed. If both outputs are identical, the change you made had no measurable effect at this temperature. If one is clearly better on your criteria, consider whether the difference is consistent across multiple user messages — a single run can be noisy.

On token usage: a longer system prompt that barely improves quality over a shorter one costs more on every API call. At high volume, a 200-token difference in the system prompt adds up quickly. The token counts here let you quantify that trade-off.

Run the same pair a few times, especially if the result is ambiguous. Lowering the temperature toward 0 makes outputs more deterministic, which is better for prompt-structure tests. Higher temperatures are more realistic for creative or conversational use cases.