Jailbreak prompt detector
Before a user prompt reaches your model, it is worth a fast screen for the well-worn jailbreak patterns: “DAN,” “Developer Mode,” “ignore all previous instructions,” roleplay-as-an-unrestricted-AI setups, and base64-encoded payloads meant to slip past keyword filters. This tool runs that screen in your browser and explains every match with a risk score.
How it works
The detector runs a set of pattern checks over your prompt. Each check targets a known jailbreak family — DAN framings, Developer Mode requests, instruction- override phrases, fictional-roleplay bypasses, and obfuscation such as long base64 strings. Every hit contributes to a weighted risk score and is listed with its category and a short reason. Because it is heuristic and local, it is fast and private, but it only catches patterns it knows about.
Tips and notes
- Use it as a first filter, not the whole defence. Layer it with model-side safety, output moderation, and rate limiting; static patterns are easy to paraphrase around.
- Decode what it flags. When it spots a base64-looking blob, decode and read it before deciding — that is exactly where hidden instructions hide.
- Tune your threshold. Block on high scores, queue medium scores for human review, and log everything to spot new attack trends.
- Expect false positives. Security researchers and educators legitimately quote jailbreaks; a match is a reason to look, not an automatic block.