What is counterfactual evaluation?

It asks "what if the input had been phrased slightly differently?" A robust prompt produces essentially the same answer across meaning-preserving paraphrases. If the answer changes a lot, the output depends on surface wording rather than the actual task — a fragility worth fixing.

How is similarity measured?

The tool compares each variation's output to the baseline output using a word-overlap (Jaccard-style) similarity on the response text. It is a fast, transparent local heuristic to spot large divergences, not a semantic embedding model.

Is my API key safe?

Yes. The key stays in your browser, is used only for direct requests to the provider you choose, and is never stored or sent to our servers. Reloading the page clears it.

What similarity threshold indicates a problem?

There is no universal number, but outputs that share fewer than roughly half their words with the baseline usually signal real divergence worth inspecting. Set the threshold to match your tolerance: factual answers should be very stable, while creative tasks naturally vary more.

How do I harden a fragile prompt?

Add explicit constraints, a worked example, a fixed output format, or a step-by-step instruction so the answer depends on the task rather than the phrasing. Re-run the evaluator until paraphrases converge on consistent outputs.

What is the Counterfactual Output Evaluator?

Bring your own API key and paste a prompt. The tool generates close paraphrases, runs each one, and measures how much the output drifts, flagging fragile prompts whose answers flip on trivial wording changes. It runs for free in your browser on Gera Tools, and nothing is uploaded to our servers.

Counterfactual Output Evaluator

Name: Counterfactual Output Evaluator
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

The counterfactual output evaluator stress-tests a prompt by asking one question: if a user had phrased the request slightly differently, would they get the same answer? Robust prompts are insensitive to surface wording — the answer depends on the task, not the exact phrasing. Fragile prompts flip their output on trivial paraphrases, which is a hidden source of inconsistent product behaviour that unit tests rarely catch. This tool generates close paraphrases, runs each with your own API key, and measures how far the outputs drift from your baseline.

Why counterfactual testing matters

When you write and test a prompt, you test it with your exact phrasing. Your users do not. They ask the same question differently every time — shorter, longer, with different emphasis, with or without context. A prompt that behaves perfectly on your phrasing but produces different answers on natural variations is fragile in production, even if it passes all your fixed test cases.

Counterfactual evaluation is not about adversarial inputs or jailbreaks. It is about detecting sensitivity to meaning-preserving rewordings — the kind of variation that should not change the answer but sometimes does.

How it works

You provide a base prompt and your own OpenAI or Anthropic key. The tool:

1. Runs the base prompt to establish a baseline output.
2. Asks the model to generate several meaning-preserving paraphrases of your input (same intent, different phrasing).
3. Runs each paraphrase as a fresh, independent request.
4. Compares each output against the baseline using a word-overlap similarity score (a fast, transparent Jaccard-style metric on the response text).
5. Flags any variation whose score falls below your chosen threshold.
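
The steps above can be sketched in a few lines of Python. This is an illustrative outline, not the tool's actual code: the names evaluate, word_overlap, and fake_model are hypothetical, the paraphrases are passed in by hand rather than generated by a model, and fake_model is a stub standing in for a real provider request.

```python
import re
from typing import Callable, Dict, List


def word_overlap(a: str, b: str) -> float:
    """Jaccard-style overlap between the lowercase word sets of two texts."""
    words_a = set(re.findall(r"[a-z0-9']+", a.lower()))
    words_b = set(re.findall(r"[a-z0-9']+", b.lower()))
    if not words_a and not words_b:
        return 1.0  # assumption: two empty outputs count as identical
    return len(words_a & words_b) / len(words_a | words_b)


def evaluate(
    base_prompt: str,
    paraphrases: List[str],
    run_model: Callable[[str], str],
    threshold: float = 0.5,
) -> List[Dict[str, object]]:
    """Score each paraphrase's output against the baseline output."""
    baseline = run_model(base_prompt)  # step 1: establish the baseline
    results = []
    for text in paraphrases:
        output = run_model(text)  # step 3: one independent request each
        score = word_overlap(baseline, output)  # step 4: compare
        results.append(
            {"prompt": text, "score": score, "fragile": score < threshold}  # step 5
        )
    return results


def fake_model(prompt: str) -> str:
    """Stub for a provider call (hypothetical, for illustration only)."""
    if "capital" in prompt.lower():
        return "The capital of France is Paris."
    return "I am not sure what you are asking."


report = evaluate(
    "What is the capital of France?",
    ["Which city is France's capital?", "Where does the French government sit?"],
    fake_model,
)
for row in report:
    print(row)
```

In this toy run the first paraphrase reproduces the baseline and the second is flagged, because the stub only recognises the word "capital". A real model's sensitivity is subtler, but the loop has the same shape.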

Everything runs from your browser. Your key is used only for direct calls to the provider you choose and is never stored or routed through our servers.

What the similarity score tells you

The word-overlap score measures how much the vocabulary of two outputs overlaps. A score near 1.0 means the paraphrase produced nearly the same set of words as the baseline. A score near 0 means the outputs share almost no words, a likely sign of a qualitatively different answer.
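
As a rough illustration of how such a score behaves, here is a minimal Jaccard-style sketch. The tokenisation rule (lowercase runs of letters, digits, and apostrophes) and the names tokenize and jaccard are assumptions made for this example; the tool's exact rule may differ.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Shared distinct words divided by all distinct words across both texts."""
    set_a, set_b = tokenize(a), tokenize(b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty texts count as identical
    return len(set_a & set_b) / len(union)


baseline = "Paris is the capital of France"
reordered = "The capital of France is Paris"
truncated = "Paris is the capital"
unrelated = "Lyon hosts many regional offices"

print(jaccard(baseline, reordered))  # 1.0: same vocabulary, different order
print(jaccard(baseline, truncated))  # 4 shared words out of 6 distinct
print(jaccard(baseline, unrelated))  # 0.0: no shared words
```

Because the metric works on word sets, it ignores word order and repetition. Two outputs can score 1.0 and still say different things, which is one reason to read the flagged outputs rather than rely on the number alone.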

Thresholds to consider:

Score	Typical interpretation
0.9+	Essentially the same answer, minor rewording
0.7–0.9	Same answer, different expression — usually acceptable
0.5–0.7	Meaningfully different; inspect whether the substance changed
Below 0.5	Very different output — strong signal the prompt is fragile

For factual Q&A and classification, you want most paraphrases above 0.7. For creative tasks, natural variation means lower scores are expected and less concerning.
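
The bands in the table can be encoded as a small lookup, sketched below. The function name interpret_score is hypothetical, and because the table's ranges share their endpoints, this sketch assumes a boundary value such as 0.7 belongs to the higher band.

```python
def interpret_score(score: float) -> str:
    """Map a word-overlap score in [0, 1] to a rough interpretation band."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= 0.9:
        return "essentially the same answer"
    if score >= 0.7:
        return "same answer, different expression"
    if score >= 0.5:
        return "meaningfully different; inspect the substance"
    return "very different; prompt is likely fragile"


for s in (0.95, 0.7, 0.62, 0.31):
    print(s, "->", interpret_score(s))
```

For creative tasks you would likely shift these cut-offs downward rather than reuse them as they are.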

How to harden a fragile prompt

When a paraphrase flips the answer, look at what changed in the input. Common causes:

Ambiguous instruction carrying most of the semantic weight — split it into two explicit sentences.
Missing output format constraint — without specifying a format, the model may structure answers differently depending on how the question is framed. Add a fixed format.
Implicit assumption the model handles differently based on trigger words — make the assumption explicit.
No worked example — adding one grounds the model’s behaviour against a concrete reference.

Fix, then re-run the evaluator until paraphrases converge. Treat the similarity score as a directional signal, not a precise metric — its job is to point you at fragile prompts, which you then inspect by reading the actual outputs.
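
One way to apply these fixes is to build the prompt from explicit parts, so the task, the output format, and a worked example are always present. The sketch below is only an illustration: harden_prompt and its section labels are hypothetical, and the wording that works best depends on your model and task.

```python
def harden_prompt(
    task: str, output_format: str, example_input: str, example_output: str
) -> str:
    """Assemble a prompt with an explicit task, a fixed format, and one worked example."""
    sections = [
        f"Task: {task}",
        f"Output format: {output_format}",
        "Example:",
        f"Input: {example_input}",
        f"Output: {example_output}",
        "Follow the output format exactly for the next input.",
    ]
    return "\n".join(sections)


prompt = harden_prompt(
    task="Classify the sentiment of the customer message.",
    output_format="One word: positive, negative, or neutral.",
    example_input="The delivery arrived two days late.",
    example_output="negative",
)
print(prompt)
```

After restructuring a prompt this way, re-run the evaluator on the same paraphrases to check whether the scores actually moved.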