Does this guarantee my prompt is safe?

No. It runs heuristic pattern matching against common OWASP LLM Top 10 risks, which catches many obvious problems but cannot understand intent or every attack. Treat a clean result as one signal, not a security certification.

Is my prompt sent anywhere?

No. All classification runs entirely in your browser with pattern matching. Your prompt never leaves the page, which is intentional given that prompts being checked may contain sensitive content.

What categories does it check?

It maps to OWASP LLM Top 10 items including prompt injection (LLM01), insecure output handling (LLM02), sensitive information disclosure (LLM06), and excessive agency / over-permissive instructions (LLM08), plus checks for embedded secrets.

Why did it flag a harmless prompt?

Heuristics produce false positives — a prompt that legitimately discusses "ignore previous instructions" as a topic will match the injection pattern. Read the matched snippet and use judgment; lower the sensitivity to reduce borderline flags.

What should I do with a flagged risk?

Each flag includes a mitigation. Common fixes are clearly separating untrusted user input from instructions, validating or encoding model output before using it, and removing hardcoded secrets in favor of server-side injection.

What is the Prompt Safety Classifier?

Runs client-side heuristic checks against OWASP LLM Top 10 risk categories — prompt injection, insecure output handling, sensitive data leakage, and more — and flags risky patterns in your prompt before you send it to a model. It runs free in your browser on Gera Tools, with nothing uploaded.

Prompt Safety Classifier

Name: Prompt Safety Classifier
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Prompt safety classifier

Before a prompt reaches a model — especially one wired into tools, databases, or production output — it is worth a quick safety pass. This classifier runs entirely in your browser and checks your prompt against patterns drawn from the OWASP LLM Top 10, flagging things like injection vectors, output that will be used unsafely, sensitive data, and over-broad agency. It will not catch every attack, but it catches the obvious mistakes that cause most real incidents.

How it works

You paste a prompt and pick a sensitivity level. The tool scans the text with a set of heuristics, each tied to an OWASP LLM risk category. It looks for injection phrasing (“ignore previous instructions,” “disregard your rules”), signs that model output will be executed or rendered without validation, embedded secrets and credentials, requests that grant the model excessive autonomy, and sensitive personal data. Every match reports the category, the snippet that triggered it, and a concrete mitigation. Because it is pure pattern matching, nothing is sent anywhere — important, since the prompts you most want to check are often the ones carrying sensitive content.

The OWASP LLM Top 10 categories covered

The OWASP LLM Top 10 is a published list of the most significant security risks in LLM applications, maintained by a community of AI security practitioners. The classifier maps its checks to these categories:

LLM01 — Prompt Injection: Instructions that attempt to override the system prompt, change the model’s role, or make it ignore its constraints. Both direct injection (user submitting malicious instructions) and indirect injection (malicious content embedded in data the model processes) are checked.

LLM02 — Insecure Output Handling: Signs that the model’s output will be used without validation — for example, prompts that instruct the model to generate SQL, shell commands, or HTML that will be executed directly. This is one of the highest-severity risks because the attack surface is the downstream system, not the model itself.

LLM06 — Sensitive Information Disclosure: The prompt contains or may elicit sensitive personal or proprietary data. This includes prompts that ask the model to reveal its system prompt, prompts that contain embedded credentials or PII, and prompts designed to extract training data.

LLM08 — Excessive Agency: The model is granted autonomy that exceeds what the task requires — for example, a prompt that allows the model to decide which tool to call, which file to write, or which actions to take without a human approval step. Least-privilege prompting is the mitigation.

Other categories checked include hardcoded secrets, over-broad output permissions, and patterns associated with model denial of service (excessively nested or recursive instructions).

Sensitivity levels and what they mean

The tool offers low, medium, and high sensitivity settings. At low sensitivity, only clear-signal patterns fire — things that are almost certainly a risk. At high sensitivity, borderline patterns are also flagged, including prompts that mention injection as a topic (which creates false positives but catches edge cases). For production audits, start at high sensitivity and then reason about each flag; for quick pre-send checks, low or medium is usually appropriate.

Tips and notes

Heuristics cut both ways: they catch common problems fast but also produce false positives, so always read the matched snippet rather than reacting to the count. A prompt that discusses prompt injection as a subject will, correctly, match the injection pattern. The highest-value fix this tool surfaces is structural — keeping untrusted user input clearly delimited and labeled as data, never merged into your instruction block, which neutralizes the most common injection class. A clean result is reassurance, not a guarantee; pair it with server-side input validation and output encoding for anything that touches production.