Prompt Injection Detector

Scan user input for prompt injection patterns before you send it to your LLM.

Ad placeholder (leaderboard)

Large language models can’t tell the difference between your instructions and instructions buried in user input — that’s the root of prompt injection. This scanner gives you a fast, local first line of defence: paste any untrusted text and see whether it carries known attack signatures before it ever reaches your model.

How it works

The detector runs a set of weighted regular-expression rules against the input. Each rule targets a recognised injection technique:

  • Instruction override — “ignore previous instructions,” “disregard the above.”
  • Role / persona override — “you are now,” “act as,” “pretend to be.”
  • System-prompt exfiltration — “reveal your system prompt,” “repeat the instructions above.”
  • Jailbreak personas — DAN, “do anything now,” “developer mode.”
  • Delimiter smuggling — fake <system> tags, [INST] markers, stray code fences.
  • Safety suppression, credential fishing, and encoding evasion hints.

Matched rule weights are summed and capped at 100. The result is shown as a coloured score with every match highlighted, including the exact text that triggered it, so you can audit false positives and tune your own filter.

Why a heuristic is only step one

No keyword list can fully solve prompt injection — attackers paraphrase, translate, or encode their payloads. Use this as a cheap, instant filter, but pair it with structural defences: isolate untrusted content in a dedicated user turn, never concatenate it into the system prompt, constrain and validate any tool calls the model can make, and gate irreversible actions behind human review.

Tips

  • Run retrieved RAG chunks through this too — injected instructions hidden inside indexed documents are a common and overlooked attack vector.
  • A medium score on benign text usually means a false positive (e.g. a user genuinely asking the model to “act as a translator”); read the match before blocking.
  • Log scores over time. A sudden spike in high-risk inputs is a useful early signal of an attack.
Ad placeholder (rectangle)