Is this library for attacking AI systems?

No. It is a defensive reference for developers and authorised red-teamers to test the robustness of their own filters and guardrails. The entries describe techniques at a conceptual level so you can build defences, not bypass someone else's system.

Why are the entries templated rather than ready payloads?

The goal is education and self-testing, so entries describe the technique and provide a neutral, fill-in-the-blank template. This is enough to evaluate your own defences without shipping turnkey harmful payloads.

Does it call any external service?

No. The entire catalogue is bundled and runs in your browser with no network calls. Nothing you search or copy is logged or transmitted, which suits security teams working in restricted environments.

How should I use this responsibly?

Only test systems you own or have explicit written authorisation to assess. Use findings to strengthen system prompts, input filters, and output moderation, and follow responsible disclosure if you find a flaw in a third-party product.

Will these patterns keep working?

Model providers continuously patch known bypasses, so any specific phrasing degrades over time. The value here is understanding the categories of attack, which remain stable even as individual phrasings stop working.

What is the Jailbreak Phrase Reference Library?

A curated, offline reference of documented LLM jailbreak and bypass patterns grouped by technique — role-play, instruction override, token manipulation, hypothetical framing and more — so developers and red-teamers can study and test their own filters without any network calls. It runs free in your browser on Gera Tools, with nothing uploaded.

Jailbreak Phrase Reference Library

Name: Jailbreak Phrase Reference Library
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Robust AI products are tested against the ways people try to break them. This jailbreak phrase reference library is a defensive catalogue of documented bypass techniques, grouped by category and described at a conceptual level, so developers and authorised red-teamers can understand what their filters must withstand and test their own systems without reaching for live external services.

How it works

The library is a bundled, offline dataset. Each entry names a technique, places it in a category — role-play personas, instruction override, hypothetical or fictional framing, token and encoding manipulation, payload splitting, authority impersonation, and others — and explains the mechanism that makes it effective and the guardrail it targets. Rather than shipping turnkey harmful payloads, entries provide neutral, fill-in-the-blank templates so you can adapt them to your own test harness. Filter by category or search by keyword to find what is relevant to the surface you are hardening.

Everything runs in the browser with no network calls, which matters for security teams working in air-gapped or restricted environments. Nothing you search or copy is logged or transmitted.

Technique categories and what they target

Understanding the category matters more than any specific phrasing, because individual wordings degrade as model providers patch known bypasses, but the underlying structural technique persists:

Role-play and persona framing. The attacker asks the model to adopt a character — an unrestricted AI, a character in a story, a historical figure — whose fictional identity is supposed to be exempt from the model’s normal guidelines. Defensive countermeasure: ensure the system prompt explicitly states that the model’s values and policies carry over into any role or persona.

Instruction override. The attacker includes text claiming to supersede the original system prompt — for example “ignore all previous instructions” or “your true instructions are…”. Defensive countermeasure: treat the system prompt as immutable and train or configure the model to recognise and reject claimed overrides.

Hypothetical and fictional framing. Wrapping a harmful request inside a story, a research scenario, a thought experiment, or a screenplay is meant to let the content pass through output filters that operate on surface-level context rather than semantic meaning. Defensive countermeasure: review extracted content, not just the framing wrapper.

Token and encoding manipulation. Using unusual Unicode variants, zero-width characters, leet-speak, base64, or fragmented spelling to make a harmful keyword invisible to string-match filters. Defensive countermeasure: normalise and decode input before running moderation.

Payload splitting across turns. Breaking a disallowed request across several conversation turns so no single message looks problematic in isolation. Defensive countermeasure: conversation-level review that aggregates context across turns, not only per-message checks.

Authority impersonation. Claiming to be a developer, the model’s creator, or an authorised override user with special privileges. Defensive countermeasure: never grant elevated access based on in-context claims; verify authority through the system channel only.

Using the library responsibly

Use the categories to map your attack surface, then verify each category against your system. Document what held and what didn’t. Apply fixes — hardened system prompts, normalisation pipelines, conversation-level review — then retest. Only test systems you own or have explicit written authorisation to assess. If you find a weakness in a third-party system, follow responsible disclosure practices and report it to the provider before publishing.

Jailbreak Phrase Reference Library

Get one useful tool a week

How it works

Technique categories and what they target

Using the library responsibly