Indirect Prompt Injection Explainer

Interactive explainer on indirect prompt injection attacks

Ad placeholder (leaderboard)

Indirect prompt injection explainer

Direct prompt injection is when a user tries to jailbreak the model they are talking to. Indirect prompt injection is sneakier and more dangerous: the attacker hides instructions inside content that the AI agent will later retrieve — a web page, a PDF in a knowledge base, an incoming email. The user asks an innocent question, the agent fetches the poisoned content, and the planted instruction hijacks what the agent does next. This explainer lets you walk through real-shaped scenarios and watch the difference defences make.

How it works

Pick a scenario — a web summariser, an email assistant, or a document retrieval pipeline. You will see the user’s harmless request, the untrusted content the agent retrieved, and the malicious instruction an attacker planted inside it (often in an HTML comment, brackets, or a note buried in body text). Toggle the defences checkbox to compare outcomes: a vulnerable agent treats the planted text as a command and may exfiltrate data or take unauthorised actions, while a hardened agent frames the content as untrusted data, runs with least-privilege tools, and requires confirmation before anything high-impact.

Detection and defence

The core principle is that retrieved content is data, not instructions. Delimit it clearly, tell the model never to obey commands found inside it, and strip hidden text — HTML comments, white-on-white styling, zero-width characters, and metadata are all classic carriers. Then limit what the agent can do: send, pay, and delete actions should require explicit user confirmation, so even a successful injection cannot cause irreversible harm. Finally, log every tool call and alert on any action that does not trace back to the user’s original request. No single trick eliminates the risk, so depth across these layers is the realistic goal.

Ad placeholder (rectangle)