Question 1

What is AI jailbreaking?

Accepted Answer

AI jailbreaking is the practice of crafting prompts that trick a language model into ignoring its safety guidelines and producing content it was trained to refuse. The term borrows from device jailbreaking, but here the goal is to bypass behavioural restrictions rather than software locks.

Question 2

What was the DAN prompt?

Accepted Answer

DAN, short for Do Anything Now, was an early viral jailbreak that told ChatGPT to roleplay as an unrestricted alter ego with no rules. It worked by reframing harmful output as a fictional character speaking, which temporarily slipped past the model's refusals before being patched.

Question 3

Why is jailbreaking so hard to stop?

Accepted Answer

Safety alignment is applied at training time, but jailbreaks attack at inference time using infinite possible phrasings. Because language is open-ended, defenders cannot enumerate every attack in advance, so it becomes an ongoing arms race rather than a problem with a final fix.

Question 4

What is red-teaming in AI?

Accepted Answer

Red-teaming is the deliberate practice of attacking your own model to find jailbreaks and unsafe behaviours before release. Labs employ dedicated teams and automated tools to generate adversarial prompts, then use the findings to retrain and harden the model.

What Is AI Jailbreaking? How Users Bypass Safety Guardrails

What jailbreaking means

Common techniques

Why safety alignment is so hard

Red-teaming as the response

Why it matters