Prompt Safety Classifier

Classify your prompt by safety risk category before sending

Ad placeholder (leaderboard)

Prompt safety classifier

Before a prompt reaches a model — especially one wired into tools, databases, or production output — it is worth a quick safety pass. This classifier runs entirely in your browser and checks your prompt against patterns drawn from the OWASP LLM Top 10, flagging things like injection vectors, output that will be used unsafely, sensitive data, and over-broad agency. It will not catch every attack, but it catches the obvious mistakes that cause most real incidents.

How it works

You paste a prompt and pick a sensitivity level. The tool scans the text with a set of heuristics, each tied to an OWASP LLM risk category. It looks for injection phrasing (“ignore previous instructions,” “disregard your rules”), signs that model output will be executed or rendered without validation, embedded secrets and credentials, requests that grant the model excessive autonomy, and sensitive personal data. Every match reports the category, the snippet that triggered it, and a concrete mitigation. Because it is pure pattern matching, nothing is sent anywhere — important, since the prompts you most want to check are often the ones carrying sensitive content.

Tips and notes

Heuristics cut both ways: they catch common problems fast but also produce false positives, so always read the matched snippet rather than reacting to the count. A prompt that discusses prompt injection as a subject will, correctly, match the injection pattern. The highest-value fix this tool surfaces is structural — keeping untrusted user input clearly delimited and labeled as data, never merged into your instruction block, which neutralizes the most common injection class. A clean result is reassurance, not a guarantee; pair it with server-side input validation and output encoding for anything that touches production.

Ad placeholder (rectangle)