Prompt Sensitivity Analyzer

Identify which words in your prompt most affect the output

Two prompts can look almost identical yet behave completely differently because one word is doing all the work. The classic way to find that word is ablation: remove one piece at a time and see how much the result changes. If deleting a phrase barely moves the output, it is decorative; if the output swings wildly, that phrase is load-bearing and worth protecting. This tool automates the ablation loop against your own model using your own API key.

How it works

First, the tool sends your complete prompt and records the baseline output. Then it splits your prompt into candidate phrases (sentences and clauses) and, for each one, sends a version of the prompt with that phrase removed. It compares each ablated output to the baseline with a character-level similarity ratio and reports the shift (one minus similarity) as a percentage. Phrases are ranked by shift, so the most influential ones rise to the top. Everything runs client-side: your key and prompts go straight from your browser to OpenAI or Anthropic, never through an intermediate server.
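The loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's actual code: the punctuation-based splitter, the use of difflib's SequenceMatcher as the character-level similarity ratio, and the names split_phrases, shift, rank_phrases and toy_model are all hypothetical. The complete argument stands in for whatever function sends a prompt to your model and returns its text.

```python
import re
from difflib import SequenceMatcher
from typing import Callable, List, Tuple


def split_phrases(prompt: str) -> List[str]:
    """Split a prompt at sentence and clause punctuation (an assumed heuristic)."""
    parts = re.split(r"(?<=[.!?;,:])\s+", prompt.strip())
    return [p for p in parts if p]


def shift(baseline: str, ablated: str) -> float:
    """Shift = one minus character-level similarity, as a percentage."""
    ratio = SequenceMatcher(None, baseline, ablated, autojunk=False).ratio()
    return (1.0 - ratio) * 100.0


def rank_phrases(
    prompt: str, complete: Callable[[str], str]
) -> List[Tuple[str, float]]:
    """Ablate each phrase in turn and rank phrases by how far the output moves."""
    phrases = split_phrases(prompt)
    baseline = complete(prompt)  # one request for the untouched prompt
    results = []
    for i, phrase in enumerate(phrases):
        # Rebuild the prompt without phrase i and send it: one request per phrase.
        ablated_prompt = " ".join(phrases[:i] + phrases[i + 1:])
        results.append((phrase, shift(baseline, complete(ablated_prompt))))
    # Highest shift first; ties keep their original prompt order.
    return sorted(results, key=lambda item: item[1], reverse=True)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an API call: replies in French only when asked."""
    return "Bonjour le monde" if "French" in prompt else "Hello world"


if __name__ == "__main__":
    demo = "You are a helpful assistant. Reply in French. Be concise."
    for phrase, score in rank_phrases(demo, toy_model):
        print(f"{score:5.1f}%  {phrase}")
```

With the toy model, only the phrase that asks for French changes the output, so it ranks first and the other two phrases score zero. Swapping toy_model for a real API call is the only change needed to try this against a live model.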

Tips and notes

  • Lower the temperature first. LLMs are stochastic, so a high temperature adds noise that masks the real effect of an ablation. Steadier output makes the sensitivity scores more trustworthy.
  • Run it more than once. A single pass is only indicative, because sampling noise can inflate or deflate any one score. If a phrase scores high across several runs, you can be far more confident it is genuinely load-bearing.
  • Watch your token spend. The tool fires one request per phrase plus a baseline, so a prompt split into N phrases costs N + 1 requests per pass, all on your account. Keep test prompts short.
  • Act on the findings. Tighten or pin the high-shift phrases, and consider trimming the near-zero ones — they are adding length without changing behavior.
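
Two of the tips above, repeating runs and watching spend, come down to simple arithmetic. The sketch below is a hypothetical helper, not part of the tool: it assumes you have copied each run's shift percentages into a dict keyed by phrase, and the names request_count and summarize_runs are made up for illustration.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def request_count(num_phrases: int, runs: int = 1) -> int:
    """One baseline plus one ablated request per phrase, repeated for each run."""
    return runs * (num_phrases + 1)


def summarize_runs(runs: List[Dict[str, float]]) -> List[Tuple[str, float, float]]:
    """Return (phrase, mean shift, spread) rows, highest mean shift first.

    Each element of runs maps a phrase to its shift percentage in one pass.
    Spread is the sample standard deviation across passes; a large spread
    relative to the mean suggests the score is mostly sampling noise.
    """
    summary = []
    for phrase in runs[0]:
        scores = [run[phrase] for run in runs]
        spread = stdev(scores) if len(scores) > 1 else 0.0
        summary.append((phrase, mean(scores), spread))
    return sorted(summary, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # Made-up numbers, purely to show the shape of the input.
    passes = [
        {"Reply in French.": 62.0, "Be concise.": 4.0},
        {"Reply in French.": 58.0, "Be concise.": 11.0},
        {"Reply in French.": 65.0, "Be concise.": 2.0},
    ]
    print("requests:", request_count(num_phrases=2, runs=len(passes)))
    for phrase, avg, spread in summarize_runs(passes):
        print(f"{avg:5.1f}% +/- {spread:4.1f}  {phrase}")
```

A phrase whose mean shift stays high with a small spread is a good candidate for pinning. One whose mean is near zero in every pass is a candidate for trimming.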