Large language models can’t tell the difference between your instructions and instructions buried in user input — that’s the root of prompt injection. This scanner gives you a fast, local first line of defence: paste any untrusted text and see whether it carries known attack signatures before it ever reaches your model.
How it works
The detector runs a set of weighted regular-expression rules against the input. Each rule targets a recognised injection technique:
- Instruction override — “ignore previous instructions,” “disregard the above.”
- Role / persona override — “you are now,” “act as,” “pretend to be.”
- System-prompt exfiltration — “reveal your system prompt,” “repeat the instructions above.”
- Jailbreak personas — DAN, “do anything now,” “developer mode.”
- Delimiter smuggling — fake <system> tags, [INST] markers, stray code fences.
- Safety suppression, credential fishing, and encoding evasion hints.
Matched rule weights are summed and capped at 100. The result is shown as a coloured score with every match highlighted, including the exact text that triggered it, so you can audit false positives and tune your own filter.
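The scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the scanner's actual rule set: the rule names, patterns, and weights are hypothetical, and it assumes each rule contributes its weight once no matter how many times it matches.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    weight: int
    pattern: re.Pattern


# Hypothetical rules and weights, chosen for illustration only.
RULES = [
    Rule("instruction_override", 40, re.compile(
        r"\b(?:ignore|disregard)\s+(?:all\s+)?(?:the\s+)?"
        r"(?:previous|prior|above)(?:\s+instructions)?", re.I)),
    Rule("role_override", 20, re.compile(
        r"\b(?:you\s+are\s+now|act\s+as|pretend\s+to\s+be)\b", re.I)),
    Rule("prompt_exfiltration", 40, re.compile(
        r"\b(?:reveal|repeat)\s+"
        r"(?:your\s+system\s+prompt|the\s+instructions\s+above)", re.I)),
    # "DAN" is kept case-sensitive so the name "Dan" does not trigger it.
    Rule("jailbreak_persona", 50, re.compile(
        r"(?-i:\bDAN\b)|\bdo\s+anything\s+now\b|\bdeveloper\s+mode\b", re.I)),
    Rule("delimiter_smuggling", 30, re.compile(
        r"</?system>|\[/?INST\]", re.I)),
]


def scan(text: str) -> tuple[int, list[tuple[str, str, int, int]]]:
    """Return (score, matches); each match is (rule, matched text, start, end)."""
    total = 0
    matches = []
    for rule in RULES:
        hits = list(rule.pattern.finditer(text))
        if hits:
            # Assumption: a rule's weight counts once, however often it matches.
            total += rule.weight
            matches.extend(
                (rule.name, m.group(0), m.start(), m.end()) for m in hits
            )
    # Sum of matched rule weights, capped at 100.
    return min(total, 100), matches
```

Returning the matched text and its offsets alongside the score is what makes auditing possible: a reviewer can see that “act as” fired on a translation request and lower that rule's weight rather than guessing.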
Why a heuristic is only step one
No keyword list can fully solve prompt injection — attackers paraphrase, translate, or encode their payloads. Use this as a cheap, instant filter, but pair it with structural defences: isolate untrusted content in a dedicated user turn, never concatenate it into the system prompt, constrain and validate any tool calls the model can make, and gate irreversible actions behind human review.
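As a rough sketch of those structural defences, the Python below keeps untrusted text in its own user turn and checks every proposed tool call before it runs. The role/content message layout, the tool names, and the allow/review/reject outcomes are all assumptions made for illustration; adapt them to whatever model API and tools you actually use.

```python
# Hypothetical tool registry: tool name -> argument names it may receive.
ALLOWED_TOOLS = {
    "search_docs": {"query"},
}

# Hypothetical tools whose effects cannot be undone.
IRREVERSIBLE_TOOLS = {"send_email", "delete_record"}


def build_messages(system_prompt: str, untrusted_text: str) -> list[dict]:
    """Keep untrusted content in its own user turn.

    The system prompt is passed through unchanged; untrusted text is
    never concatenated into it.
    """
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": untrusted_text},
    ]


def check_tool_call(name: str, args: dict) -> str:
    """Classify a model-proposed tool call as 'allow', 'review', or 'reject'."""
    if name in IRREVERSIBLE_TOOLS:
        # Irreversible actions always wait for a human decision.
        return "review"
    allowed_args = ALLOWED_TOOLS.get(name)
    if allowed_args is None:
        # Unknown tool: never execute.
        return "reject"
    if not set(args) <= allowed_args:
        # Unexpected argument names.
        return "reject"
    if not all(isinstance(value, str) for value in args.values()):
        # This sketch only accepts string arguments.
        return "reject"
    return "allow"
```

The checks are deliberately default-deny: a tool call is rejected unless it is explicitly registered, so a paraphrased or encoded injection that slips past the keyword scanner still cannot invoke anything outside the registry.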
Tips
- Run retrieved RAG chunks through this too — injected instructions hidden inside indexed documents are a common and overlooked attack vector.
- A medium score is often a false positive caused by a single benign match (e.g. a user genuinely asking the model to “act as a translator”); read the match before blocking.
- Log scores over time. A sudden spike in high-risk inputs is a useful early signal of an attack.
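One simple way to act on logged scores is a rolling window that flags when the share of high-risk inputs crosses a threshold. The sketch below is an assumption-laden example: the class name, window size, high-score cutoff, and alert fraction are hypothetical defaults that you would tune against your own traffic.

```python
from collections import deque


class SpikeMonitor:
    """Track recent scanner scores and flag a spike in high-risk inputs."""

    def __init__(self, window: int = 100, high_score: int = 70,
                 alert_fraction: float = 0.2):
        # A deque with maxlen drops the oldest score automatically.
        self.scores = deque(maxlen=window)
        self.high_score = high_score
        self.alert_fraction = alert_fraction

    def record(self, score: int) -> bool:
        """Store a score; return True if the current window looks like a spike."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            # Not enough history yet to judge.
            return False
        high = sum(1 for s in self.scores if s >= self.high_score)
        return high / len(self.scores) >= self.alert_fraction
```

The same monitor can be fed scores from user messages and from retrieved RAG chunks alike; keeping a separate instance per source makes it easier to tell where a spike originates.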