Robust AI products are tested against the ways people try to break them. This jailbreak phrase reference library is a defensive catalogue of documented bypass techniques, grouped by category and described at a conceptual level, so developers and authorised red-teamers can understand what their filters must withstand and test their own systems without reaching for live external services.
How it works
The library is a bundled, offline dataset. Each entry names a technique, assigns it to a category (role-play personas, instruction override, hypothetical or fictional framing, token and encoding manipulation, payload splitting, authority impersonation, and others), and explains both the mechanism that makes it effective and the guardrail it targets. Rather than shipping turnkey harmful payloads, entries provide neutral, fill-in-the-blank templates that you can adapt to your own test harness. Filter by category or search by keyword to find the entries relevant to the surface you are hardening.
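As a rough illustration of the entry structure and the two lookups described above, here is a minimal Python sketch. The field names, the `Entry` class, the sample records, and the helper functions are all hypothetical; the tool's actual schema is not documented here. Templates are shown as neutral placeholders only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    # All field names are assumptions made for this sketch.
    technique: str   # short name of the technique
    category: str    # grouping used for filtering
    mechanism: str   # why the technique tends to work
    targets: str     # the guardrail it is aimed at
    template: str    # neutral, fill-in-the-blank placeholder


# Illustrative records, not taken from the real dataset.
ENTRIES = [
    Entry("Persona adoption", "role-play personas",
          "Asks the model to act as a character that is exempt from its usual rules.",
          "system prompt adherence", "[PERSONA FRAMING] + [TEST REQUEST]"),
    Entry("Encoded request", "token and encoding manipulation",
          "Hides the request in an alternative encoding so keyword filters miss it.",
          "input moderation", "[ENCODING SCHEME]([TEST REQUEST])"),
    Entry("Multi-turn split", "payload splitting",
          "Spreads a request over several messages so no single one looks problematic.",
          "per-message moderation", "[PART 1] ... [PART N]"),
]


def filter_by_category(entries, category):
    """Return entries whose category matches exactly, ignoring case."""
    wanted = category.strip().lower()
    return [e for e in entries if e.category.lower() == wanted]


def search(entries, keyword):
    """Return entries whose descriptive fields contain the keyword, ignoring case."""
    needle = keyword.strip().lower()
    if not needle:
        return list(entries)
    return [
        e for e in entries
        if needle in " ".join((e.technique, e.category, e.mechanism, e.targets)).lower()
    ]
```

Keeping the dataset as plain in-memory records is one way to preserve the offline property: both lookups are simple list scans with no network or storage dependency.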
Everything runs in the browser with no network calls, which matters for security teams working in air-gapped or restricted environments. Nothing you search or copy is logged or transmitted.
Tips and examples
Use the categories, not just the phrasings. Specific wordings get patched and stop working, but the underlying technique endures: wrapping a disallowed request inside a fictional story, for example, or splitting it across turns so that no single message looks problematic. Map each category to a defensive control. Instruction override calls for a hardened system prompt that user input cannot rewrite; encoding tricks call for normalising input before moderation; multi-turn splitting calls for conversation-level review rather than per-message checks.
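Two of these controls, normalising input before moderation and reviewing the conversation as a whole, can be sketched in a few lines of Python. This is an assumption-laden illustration, not a production filter: `is_flagged` uses a plain substring check as a stand-in for a real moderation classifier, and the function names and the `blocked_terms` parameter are hypothetical.

```python
import unicodedata

# Zero-width characters sometimes inserted to break up keywords.
# Mapping each code point to None makes str.translate delete it.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def normalise(text):
    """Fold Unicode compatibility forms, drop zero-width characters, ignore case."""
    folded = unicodedata.normalize("NFKC", text)
    return folded.translate(ZERO_WIDTH).casefold()


def is_flagged(text, blocked_terms):
    """Stand-in for a moderation call: substring match on normalised text."""
    cleaned = normalise(text)
    return any(term in cleaned for term in blocked_terms)


def review_conversation(user_turns, blocked_terms):
    """Check all user turns joined together, so a term split across
    messages is still visible to the same check."""
    return is_flagged("".join(user_turns), blocked_terms)
```

With a harmless placeholder such as `"blocked-term"`, the turns `["please blocked-", "term now"]` each pass `is_flagged` on their own, while `review_conversation` catches the combined text. Joining turns without a separator is a deliberate simplification; a real deployment would more likely pass the full transcript to a classifier.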
Stay on the right side of the line. Only exercise systems you own or are explicitly authorised to assess, feed what you learn back into your prompts and moderation layers, and disclose responsibly if you uncover a weakness in someone else’s product.