Jailbreak Prompt Detector

Detect known jailbreak and DAN patterns in prompts before you process them.

Ad placeholder (leaderboard)

Jailbreak prompt detector

Before a user prompt reaches your model, it is worth a fast screen for the well-worn jailbreak patterns: “DAN,” “Developer Mode,” “ignore all previous instructions,” roleplay-as-an-unrestricted-AI setups, and base64-encoded payloads meant to slip past keyword filters. This tool runs that screen in your browser and explains every match with a risk score.

How it works

The detector runs a set of pattern checks over your prompt. Each check targets a known jailbreak family — DAN framings, Developer Mode requests, instruction- override phrases, fictional-roleplay bypasses, and obfuscation such as long base64 strings. Every hit contributes to a weighted risk score and is listed with its category and a short reason. Because it is heuristic and local, it is fast and private, but it only catches patterns it knows about.

Tips and notes

  • Use it as a first filter, not the whole defence. Layer it with model-side safety, output moderation, and rate limiting; static patterns are easy to paraphrase around.
  • Decode what it flags. When it spots a base64-looking blob, decode and read it before deciding — that is exactly where hidden instructions hide.
  • Tune your threshold. Block on high scores, queue medium scores for human review, and log everything to spot new attack trends.
  • Expect false positives. Security researchers and educators legitimately quote jailbreaks; a match is a reason to look, not an automatic block.
Ad placeholder (rectangle)