Counterfactual Output Evaluator

Check whether your LLM's output changes under minor variations in the input

The counterfactual output evaluator stress-tests a prompt by asking a simple question: if a user had phrased the request slightly differently, would they get the same answer? Robust prompts are insensitive to surface wording — the answer depends on the task, not the exact words. Fragile prompts flip their output on trivial paraphrases, which is a hidden source of inconsistent product behaviour. This tool generates close paraphrases, runs each with your own API key, and measures how far the outputs drift.

How it works

You provide a base prompt and your own OpenAI or Anthropic API key. The tool first runs the base prompt to establish a baseline output. It then asks the model to produce several meaning-preserving paraphrases of your input, runs each one as a fresh request, and compares every result against the baseline using a fast word-overlap similarity score. Variations that score below your chosen threshold are flagged. Everything runs in your browser: the key is used only for direct calls to the provider and is never stored.
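
The comparison step can be sketched as follows. The page does not say which word-overlap measure the tool uses, so this sketch assumes Jaccard similarity over lowercase word sets; the function names word_overlap and flag_drift are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

import re


def word_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word sets: |A & B| / |A | B|.

    Assumption: the real tool may tokenize or weight words differently.
    """
    words_a = set(re.findall(r"[a-z0-9']+", a.lower()))
    words_b = set(re.findall(r"[a-z0-9']+", b.lower()))
    if not words_a and not words_b:
        return 1.0  # two empty outputs are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


def flag_drift(baseline: str, outputs: list[str], threshold: float) -> list[tuple[int, float]]:
    """Return (index, score) for each output scoring below the threshold."""
    flagged = []
    for index, output in enumerate(outputs):
        score = word_overlap(baseline, output)
        if score < threshold:
            flagged.append((index, score))
    return flagged


if __name__ == "__main__":
    baseline = "The capital of France is Paris"
    paraphrase_outputs = ["The capital of France is Paris.", "Lyon"]
    # Only the second output shares no words with the baseline.
    print(flag_drift(baseline, paraphrase_outputs, threshold=0.5))
```

Because the score looks only at shared words, two answers that mean the same thing but use different vocabulary can still score low, which is why flagged variations are worth reading rather than trusting blindly.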

Tips and examples

Use this on prompts where consistency matters: classification, extraction, factual Q&A, anything whose output feeds downstream code. Set the threshold according to the task — factual answers should overlap heavily across paraphrases, while creative writing naturally varies, so do not over-interpret drift there. When a paraphrase flips the answer, look at what changed: often a single ambiguous instruction is doing too much work. Fix it with an explicit constraint, a worked example, or a fixed output format, then re-run until the paraphrases converge. Treat the similarity score as a directional signal, not a precise metric — its job is to point you at the fragile prompts, which you then inspect by hand.
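
One reason a fixed output format helps is that it makes agreement easy to check exactly. The sketch below assumes, purely for illustration, that the prompt asks the model to answer with a line of the form "Label: <word>"; the format, the pattern, and the function names extract_label and label_agreement are hypothetical and not part of the tool.

```python
from __future__ import annotations

import re
from collections import Counter
from typing import Optional

# Assumed output convention: a line such as "Label: positive".
LABEL_PATTERN = re.compile(r"^\s*label:\s*(\w+)", re.IGNORECASE | re.MULTILINE)


def extract_label(output: str) -> Optional[str]:
    """Return the lowercase label, or None if the output ignores the format."""
    match = LABEL_PATTERN.search(output)
    return match.group(1).lower() if match else None


def label_agreement(outputs: list[str]) -> float:
    """Fraction of outputs that share the most common label.

    Outputs that do not follow the format count as their own category (None).
    """
    if not outputs:
        return 1.0
    labels = [extract_label(output) for output in outputs]
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


if __name__ == "__main__":
    runs = [
        "Label: positive",
        "label: Positive\nReason: the review praises the battery life.",
        "Label: negative",
    ]
    print(label_agreement(runs))  # two of the three runs agree
```

A check like this complements the word-overlap score: the overlap score points at prompts that may be fragile, and an exact comparison on a constrained format confirms whether the answer itself actually flipped.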
