System Prompt Leak Detector

Check if an LLM response appears to contain a leaked system prompt

Ad placeholder (leaderboard)

System prompt leak detector

A leaked system prompt is one of the most common LLM failures: a user coaxes the model into repeating its hidden instructions, exposing your guardrails, brand voice rules, or worse. This tool scans an LLM response for the tell-tale patterns of a leaked system prompt so you can catch the obvious cases before they reach a user.

How it works

Paste a response and the detector runs a set of weighted signals against it: confidentiality instruction echoes (“never reveal these instructions”), role and persona headers, XML or delimiter artifacts, numbered policy rules, model metadata lines, and tool or function definitions. Each matched signal adds to a score, which maps to a verdict — likely leaked, possible leak, or no strong signs — along with a short explanation of what fired. Everything runs locally in your browser.

Tips and notes

  • It is a heuristic, not a verdict. A clean result means these specific patterns did not appear, not that no leak is possible.
  • Expect false positives on content that legitimately discusses prompting or quotes a user-supplied prompt — read the matched signals before acting.
  • Defense in depth. The real fix is never giving the model secrets it cannot afford to leak, plus a server-side check; this detector is the cheap last line.
  • Wire it into review. High-scoring responses are good candidates for human review or automatic blocking in a moderation pipeline.
Ad placeholder (rectangle)