A/B test two system prompts in one click
Prompt engineering is empirical: the only way to know whether a wording change helps is to run both versions on the same input. This tool sends an identical user message with system prompt A and system prompt B to your chosen model, then shows both replies side by side along with latency and token usage — so you can compare quality and cost directly.
How it works
You bring your own OpenAI or Anthropic API key. When you click Run, the tool fires two requests in parallel, each with the same user message but a different system prompt, and renders each response in its own column. For OpenAI it calls the chat-completions endpoint; for Anthropic it calls the messages endpoint with the anthropic-dangerous-direct-browser-access header, which enables direct calls from a browser. Everything happens client-side: your key is sent only in the direct HTTPS request to the provider and nowhere else.
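To make the request flow concrete, here is a minimal sketch of how the two calls could be built and sent in parallel. It is not the tool's actual source: the names buildRequest, timedCall, runAB and RequestSpec are hypothetical, and the max_tokens value is an assumed default. The endpoints and header names follow the providers' public APIs.

```typescript
// Illustrative sketch only; all function and type names here are hypothetical.
type Provider = "openai" | "anthropic";

interface RequestSpec {
  url: string;
  headers: Record<string, string>;
  body: Record<string, unknown>;
}

// Build one request: same user message, caller-chosen system prompt.
function buildRequest(
  provider: Provider,
  apiKey: string,
  model: string,
  systemPrompt: string,
  userMessage: string,
): RequestSpec {
  if (provider === "openai") {
    // OpenAI takes the system prompt as the first chat message.
    return {
      url: "https://api.openai.com/v1/chat/completions",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: {
        model,
        messages: [
          { role: "system", content: systemPrompt },
          { role: "user", content: userMessage },
        ],
      },
    };
  }
  // Anthropic takes the system prompt as a top-level "system" field.
  return {
    url: "https://api.anthropic.com/v1/messages",
    headers: {
      "Content-Type": "application/json",
      "x-api-key": apiKey,
      "anthropic-version": "2023-06-01",
      "anthropic-dangerous-direct-browser-access": "true",
    },
    body: {
      model,
      max_tokens: 1024, // assumed cap; the messages endpoint requires a value
      system: systemPrompt,
      messages: [{ role: "user", content: userMessage }],
    },
  };
}

// Send one request and measure wall-clock latency in milliseconds.
async function timedCall(
  spec: RequestSpec,
): Promise<{ latencyMs: number; json: unknown }> {
  const start = Date.now();
  const res = await fetch(spec.url, {
    method: "POST",
    headers: spec.headers,
    body: JSON.stringify(spec.body),
  });
  const json: unknown = await res.json();
  return { latencyMs: Date.now() - start, json };
}

// Fire A and B concurrently; the result order matches [A, B].
async function runAB(
  provider: Provider,
  apiKey: string,
  model: string,
  promptA: string,
  promptB: string,
  userMessage: string,
) {
  return Promise.all([
    timedCall(buildRequest(provider, apiKey, model, promptA, userMessage)),
    timedCall(buildRequest(provider, apiKey, model, promptB, userMessage)),
  ]);
}
```

Keeping request construction in a pure function, separate from the network call, makes it easy to check that A and B differ only in the system prompt.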
Tips
- Change one thing at a time between A and B so you can attribute any difference to that change.
- Run the same pair a few times — sampling means a single run can be noisy. Lower the temperature for more repeatable comparisons.
- Watch the token usage column: a longer system prompt that barely improves quality may not be worth the per-call cost at scale.
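To put the last two tips into numbers, the sketch below averages token usage over several runs of one prompt and projects the cost at a given call volume. The names RunUsage, Pricing, meanUsage and projectedCost are hypothetical, and the prices are placeholders rather than any provider's real rates. OpenAI reports usage as prompt_tokens and completion_tokens, Anthropic as input_tokens and output_tokens; map either pair onto the fields used here.

```typescript
// Illustrative sketch only; names and prices are hypothetical placeholders.
interface RunUsage {
  inputTokens: number;
  outputTokens: number;
}

interface Pricing {
  inputPerMillion: number; // USD per million input tokens (placeholder)
  outputPerMillion: number; // USD per million output tokens (placeholder)
}

// Average token usage across repeated runs of the same prompt.
function meanUsage(runs: RunUsage[]): RunUsage {
  if (runs.length === 0) throw new Error("need at least one run");
  let input = 0;
  let output = 0;
  for (const r of runs) {
    input += r.inputTokens;
    output += r.outputTokens;
  }
  return { inputTokens: input / runs.length, outputTokens: output / runs.length };
}

// Project total cost in USD for a number of calls at the given prices.
function projectedCost(usage: RunUsage, price: Pricing, calls: number): number {
  const tokenCost =
    usage.inputTokens * price.inputPerMillion +
    usage.outputTokens * price.outputPerMillion;
  return (tokenCost * calls) / 1e6;
}

// Example: prompt B has a much longer system prompt than prompt A.
const examplePrice: Pricing = { inputPerMillion: 1, outputPerMillion: 2 };
const usageA = meanUsage([
  { inputTokens: 120, outputTokens: 200 },
  { inputTokens: 120, outputTokens: 220 },
]);
const usageB = meanUsage([
  { inputTokens: 480, outputTokens: 210 },
  { inputTokens: 480, outputTokens: 200 },
]);
const extraCost =
  projectedCost(usageB, examplePrice, 1_000_000) -
  projectedCost(usageA, examplePrice, 1_000_000);
console.log(`Extra cost of B over A per million calls: $${extraCost.toFixed(2)}`);
```

If B's replies are not clearly better across several runs, the difference printed here is what the longer prompt costs for no gain.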