How is sensitivity measured?

For each ablation the tool compares the new output against the baseline output using a character-level similarity ratio, then reports the shift as one minus that similarity. A larger shift means removing that phrase changed the output more, so the phrase carries more weight.
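As a rough illustration of the measurement, here is a minimal sketch in Python. It assumes the similarity ratio behaves like `difflib.SequenceMatcher.ratio()` from the standard library; the tool's actual algorithm is not specified here, and the function name `shift_score` is made up for this example.

```python
from difflib import SequenceMatcher


def shift_score(baseline: str, ablated: str) -> float:
    """Return 1 - similarity, where similarity is a character-level ratio in [0, 1]."""
    # Assumption: difflib's ratio stands in for the tool's similarity measure.
    # autojunk=False stops difflib from ignoring very frequent characters in long outputs.
    similarity = SequenceMatcher(None, baseline, ablated, autojunk=False).ratio()
    return 1.0 - similarity


print(shift_score("The sky is blue.", "The sky is blue."))   # identical outputs
print(shift_score("The sky is blue.", "The sky is green."))  # partly different outputs
```

Identical outputs give a shift of zero, and outputs that share no characters give a shift of one; everything else falls in between.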

Is this a rigorous statistical method?

It is a practical, lightweight ablation, not a formal experiment. Because LLMs are stochastic, scores vary from run to run, and a single pass is indicative rather than definitive. Run it a few times or lower the temperature for a steadier signal.

Does my API key get stored or sent anywhere?

No. The key lives only in the page's memory and is sent directly to OpenAI or Anthropic from your browser. It is never logged, stored, or routed through any Gera server.

How many requests will this make?

One baseline call plus one call per ablated phrase. With the default of four ablations that is five requests. Each uses tokens on your own account, so keep prompts modest while testing.
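The arithmetic can be written out as a small helper. This is only a sketch: `total_requests` and its `passes` parameter are hypothetical names, and it assumes that repeating the analysis also repeats the baseline call.

```python
def total_requests(ablations: int, passes: int = 1) -> int:
    """One baseline call plus one call per ablated phrase, repeated for each pass."""
    # Assumption: every pass re-sends the baseline as well as each ablated prompt.
    return passes * (1 + ablations)


print(total_requests(4))      # 5: the default of four ablations plus the baseline
print(total_requests(4, 3))   # 15: the same analysis repeated three times
```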

What is the Prompt Sensitivity Analyzer?

It is a tool that runs ablation-style variations of your prompt using your own API key. It removes one key phrase at a time, re-runs the prompt, and measures how much each removal shifts the output so you can find the load-bearing words. It runs free in your browser on Gera Tools, and nothing is uploaded to a Gera server.

Prompt Sensitivity Analyzer

Name: Prompt Sensitivity Analyzer
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Two prompts can look almost identical yet behave completely differently because one word is doing all the work. The classic way to find that word is ablation: remove one piece at a time and see how much the result changes. If deleting a phrase barely moves the output, it is decorative; if the output swings wildly, that phrase is load-bearing and worth protecting. This tool automates the ablation loop against your own model using your own API key.

How it works

First the tool sends your complete prompt and records the baseline output. Then it splits your prompt into candidate phrases — sentences and clauses — and, for each one, sends a version of the prompt with that phrase removed. It compares each ablated output to the baseline with a character-level similarity ratio and reports the shift (one minus similarity) as a percentage. Phrases are ranked so the most influential ones rise to the top. Everything runs client side: your key and prompts go straight to OpenAI or Anthropic, never through a server.
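The loop described above can be sketched as follows. This is not the tool's own code: the sentence splitter is a crude stand-in (it does not split clauses), `difflib` is assumed as the similarity measure, and `fake_model` is a toy replacement for a real API call so the example runs offline.

```python
import re
from difflib import SequenceMatcher
from typing import Callable, List, Tuple


def split_phrases(prompt: str) -> List[str]:
    """Split after sentence-ending punctuation; a simplification of the tool's phrase splitting."""
    parts = re.split(r"(?<=[.!?])\s+", prompt.strip())
    return [p for p in parts if p]


def shift(baseline: str, ablated: str) -> float:
    """Shift is one minus a character-level similarity ratio."""
    return 1.0 - SequenceMatcher(None, baseline, ablated, autojunk=False).ratio()


def run_ablation(prompt: str, call_model: Callable[[str], str]) -> List[Tuple[str, float]]:
    baseline = call_model(prompt)            # one baseline request
    phrases = split_phrases(prompt)
    results = []
    for i, phrase in enumerate(phrases):
        # Rebuild the prompt with phrase i left out.
        ablated_prompt = " ".join(phrases[:i] + phrases[i + 1:])
        output = call_model(ablated_prompt)  # one request per ablated phrase
        results.append((phrase, shift(baseline, output)))
    # Rank so the most influential phrases come first.
    return sorted(results, key=lambda r: r[1], reverse=True)


def fake_model(prompt: str) -> str:
    """Toy stand-in for an LLM call: the reply depends only on one keyword."""
    return "AHOY, MATEY!" if "pirate" in prompt else "Hello there."


prompt = "You are a pirate. Be brief. Greet the user."
for phrase, score in run_ablation(prompt, fake_model):
    print(f"{score:6.1%}  {phrase}")
```

Passing the model call in as a function keeps the loop independent of any particular provider; a real version would wrap an OpenAI or Anthropic request there.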

What ablation analysis tells you

Prompt ablation is borrowed from the machine learning research practice of systematically removing components to measure their individual contribution. Applied to prompt engineering, it answers a question that is otherwise very hard to answer: which part of my prompt actually matters?

Long, carefully constructed prompts often carry dead weight — instruction clauses added defensively, elaborations that felt important when writing but turn out to have no effect on the output, or constraints that the model already follows without being told. Ablation surfaces these so you can trim them.

Conversely, ablation also reveals fragile dependencies: a phrase that looks like optional context turns out to be the anchor that keeps the rest of the prompt coherent. Removing it collapses the output even though nothing about the phrase looks critical.

How to read the shift scores

The shift score for each phrase is one minus the similarity between the baseline output and the ablated output, expressed as a percentage. A shift near 0% means removing that phrase had virtually no effect, so the phrase is probably safe to cut. A shift above roughly 30–40% means the output changed substantially, so that phrase is doing significant work.
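A small helper can turn a score into a rough reading. The cutoffs below are illustrative assumptions: 0.30 follows the lower end of the range mentioned above, and 0.05 is an arbitrary choice for "near zero". The function name `describe_shift` is hypothetical.

```python
def describe_shift(shift: float, low: float = 0.05, high: float = 0.30) -> str:
    """Map a shift in [0, 1] to a rough reading. Both cutoffs are illustrative assumptions."""
    if shift <= low:
        return "little or no effect; candidate for trimming"
    if shift >= high:
        return "substantial change; phrase is doing significant work"
    return "moderate change; read the outputs before deciding"


for s in (0.01, 0.18, 0.42):
    print(f"{s:.0%}: {describe_shift(s)}")
```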

Shift scores are relative and contextual, not absolute thresholds. A 50% shift on a short factual task (where a small change flips the answer completely) means something different than a 50% shift on a long narrative task (where the model rewrote one paragraph). Always read the actual outputs alongside the scores.
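To see why output length matters, the sketch below compares a flipped one-word answer with a long answer in which only the last sentence changes. It assumes `difflib` as the similarity measure, and the example texts are invented.

```python
from difflib import SequenceMatcher


def shift(a: str, b: str) -> float:
    return 1.0 - SequenceMatcher(None, a, b, autojunk=False).ratio()


# Short factual task: the answer flips and the two strings share no characters.
short_shift = shift("Yes", "No")

# Long task: most of the text is unchanged and only the final sentence differs.
shared = "The order shipped on Monday and should arrive within five business days. "
long_base = shared * 3 + "Contact support if it is late."
long_ablated = shared * 3 + "You can track it from your account page."
long_shift = shift(long_base, long_ablated)

print(f"short answer flipped: {short_shift:.0%}")
print(f"one sentence rewritten in a long answer: {long_shift:.0%}")
```

The flipped short answer scores far higher than the long one, even though a rewritten sentence could matter just as much to the reader, which is why the outputs need to be read alongside the numbers.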

Example scenario

Suppose you have a system prompt for a customer support bot with these clauses:

“You are a helpful support agent for Acme Inc.”
“Always respond in English.”
“Do not discuss competitor products.”
“Reply in a friendly, professional tone.”
“If you do not know the answer, say so and offer to escalate.”

Running the ablation might show:

Clause 1: high shift (the role definition anchors everything)
Clause 2: low shift if all test inputs are in English already
Clause 3: high shift only when test inputs mention competitors
Clause 4: low shift if the baseline tone is already appropriate
Clause 5: moderate shift, depending on the test questions used

This tells you clauses 1 and 3 are load-bearing (clause 3 specifically when users mention competitors); clauses 2 and 4 may be safely cut if your users only write in English and the model’s default tone is already appropriate.
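The dependence on test inputs can be shown with a toy example. Everything here is hypothetical: `toy_model` is a rule-based stand-in that reacts only to the competitor clause, real models are far less predictable, and `difflib` is assumed as the similarity measure.

```python
from difflib import SequenceMatcher

CLAUSES = [
    "You are a helpful support agent for Acme Inc.",
    "Always respond in English.",
    "Do not discuss competitor products.",
    "Reply in a friendly, professional tone.",
    "If you do not know the answer, say so and offer to escalate.",
]


def toy_model(system_prompt: str, user_input: str) -> str:
    """Hypothetical rule-based stand-in for an LLM, used only to make the point runnable."""
    if "competitor" in user_input.lower():
        if "Do not discuss competitor products." in system_prompt:
            return "I can only help with Acme products."
        return "Compared with that competitor, here is how the two products differ."
    return "Happy to help with your Acme order."


def clause_shift(index: int, user_input: str) -> float:
    """Shift caused by removing CLAUSES[index] for one specific test input."""
    full = " ".join(CLAUSES)
    ablated = " ".join(c for i, c in enumerate(CLAUSES) if i != index)
    base = toy_model(full, user_input)
    out = toy_model(ablated, user_input)
    return 1.0 - SequenceMatcher(None, base, out, autojunk=False).ratio()


# Clause 3 is index 2 in the zero-based list.
print(clause_shift(2, "Where is my order?"))
print(clause_shift(2, "How do you compare to your competitor?"))
```

With the first input the clause looks like dead weight; with the second it clearly matters, so the test inputs should cover the cases each clause is meant to handle.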

Tips and notes

Lower the temperature first. LLMs are stochastic, so a high temperature adds noise that masks the real effect of an ablation. Steadier output makes the sensitivity scores more trustworthy.
Run it more than once. A single pass is indicative. If a phrase scores high across several runs, you can be confident it is genuinely load-bearing.
Watch your token spend. The tool fires one request per phrase plus a baseline, all on your account. Keep test prompts short while you learn the pattern.
Act on the findings. Tighten or pin the high-shift phrases, and consider trimming the near-zero ones — they are adding length and cost without changing behavior.
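If you record the scores from several passes, a short script can show which phrases stay high every time. This is a sketch only: `summarise_runs` is a hypothetical helper, the 0.30 threshold is an assumption, and the numbers are invented purely for illustration.

```python
from statistics import mean
from typing import Dict, List


def summarise_runs(runs: List[Dict[str, float]], threshold: float = 0.30) -> Dict[str, dict]:
    """Summarise each phrase across passes; 'stable' means every pass met the threshold."""
    summary = {}
    for phrase in runs[0]:
        scores = [run[phrase] for run in runs]
        summary[phrase] = {
            "mean": mean(scores),
            "min": min(scores),
            "max": max(scores),
            "stable": min(scores) >= threshold,
        }
    return summary


# Hypothetical shift scores from three passes, invented for illustration only.
runs = [
    {"role": 0.62, "tone": 0.35},
    {"role": 0.58, "tone": 0.08},
    {"role": 0.66, "tone": 0.12},
]
for phrase, stats in summarise_runs(runs).items():
    print(phrase, stats)
```

A phrase that clears the threshold in one pass but not the others, like "tone" here, is more likely noise than a load-bearing instruction.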