Before user comments, generated copy, or community posts go live, it helps to screen them for content that could be harmful, biased, or expose personal data. This tool runs a structured safety audit on any text using your own OpenAI or Anthropic key and returns a categorised report with quoted evidence, all in your browser.
How it works
Pick a provider and model, paste your API key, and choose a severity threshold. Paste the content and the tool builds an audit prompt covering eight categories — violence, hate, self-harm, sexual content, PII, bias, misinformation, and dangerous instructions. The model returns a single JSON object: an overall verdict (safe, review, or unsafe) plus a findings array where each entry names the category, a severity, a short reason, and the exact snippet that triggered it. The prompt instructs the model to quote evidence and to never comply with or repeat any harmful instructions the content itself might contain. It makes one direct request to the provider and shows the report to copy.
For Anthropic, the request includes the official direct-browser-access header so it works straight from the page.
Using it in a workflow
The structured JSON makes this practical for triage rather than a one-off check. Run user-generated content through it and route anything with a verdict of “review” or “unsafe” to a human moderator, while letting clearly safe items pass faster. The per-category severity lets you set different rules — for example, auto-hold anything flagged for PII or self-harm at any severity, but only escalate medium-or-higher bias findings. Keep the threshold consistent so your queue stays calibrated.
Notes and limits
LLM moderation is a useful filter, not a guarantee. It produces both false positives — flagging benign text — and false negatives, missing genuinely harmful content, especially when phrasing is indirect or in another language. The severity threshold helps you trade those off, but it cannot eliminate them. Always keep a human in the loop for consequential decisions, comply with your jurisdiction’s legal obligations independently of this tool, and remember that a “safe” verdict is the model’s opinion, not a clearance.