LLM output toxicity pattern scorer
Even well-aligned models can produce harmful text under the wrong prompt, and shipping that text to users is a real safety and reputational risk. The LLM output toxicity pattern scorer gives you a fast, private first-pass filter: paste the model’s output and get a score across several harm dimensions. Scoring runs entirely offline with no API call, so each check is instant, costs nothing, and never sends your text anywhere.
How it works
The scorer matches your text against curated pattern lists for each harm dimension — hate-speech indicators, self-harm language, violent-content markers, and misinformation cues. Matches are weighted and combined into a per-dimension score and an overall score, and the tool shows the exact phrases that triggered each dimension so the result is explainable rather than a black box. Because it runs on local pattern lists, scanning is immediate and works on sensitive content you would not want to send to a third-party moderation API.
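To make the matching and weighting concrete, here is a minimal Python sketch of a pattern-based scorer. It is an illustration under stated assumptions, not the tool's actual implementation: the pattern lists, the weights, the `score_text` function, the noisy-OR combination of weights within a dimension, and the use of the highest dimension score as the overall score are all hypothetical choices made for this example.

```python
import re
from dataclasses import dataclass, field

# Hypothetical pattern lists: (regex, weight) pairs per harm dimension.
# These are tiny placeholders; a real curated list would be far larger.
PATTERNS = {
    "hate_speech": [(r"\bare (?:all )?vermin\b", 0.8)],
    "self_harm": [(r"\bhurt myself\b", 0.8)],
    "violence": [(r"\bkill you\b", 0.9), (r"\bbeat (?:him|her|them) up\b", 0.6)],
    "misinformation": [(r"\bdoctors don'?t want you to know\b", 0.5)],
}

# Compile once so repeated scans stay fast.
COMPILED = {
    dim: [(re.compile(pattern, re.IGNORECASE), weight) for pattern, weight in pats]
    for dim, pats in PATTERNS.items()
}


@dataclass
class DimensionResult:
    score: float = 0.0
    triggers: list = field(default_factory=list)  # exact phrases that matched


def score_text(text):
    """Return (per-dimension results, overall score) for a piece of text."""
    results = {}
    for dim, pats in COMPILED.items():
        result = DimensionResult()
        miss = 1.0  # probability-style "nothing harmful" term for noisy-OR
        for regex, weight in pats:
            for match in regex.finditer(text):
                result.triggers.append(match.group(0))
                miss *= 1.0 - weight
        # Assumed combination rule: noisy-OR keeps the score in [0, 1].
        result.score = 1.0 - miss
        results[dim] = result
    # Assumed overall rule: the worst dimension drives the overall score.
    overall = max((r.score for r in results.values()), default=0.0)
    return results, overall


if __name__ == "__main__":
    per_dim, overall = score_text("Stop it or I will kill you.")
    for dim, result in per_dim.items():
        print(dim, round(result.score, 2), result.triggers)
    print("overall", round(overall, 2))
```

Keeping the matched phrases next to each score is what makes the result explainable: a reviewer can see which text raised a dimension instead of trusting a bare number.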
Tips and notes
- Use it as a first layer. Pattern matching catches the explicit cases fast; pair it with model-based or human review for borderline content.
- Tune thresholds to your context. A children’s product needs a far lower tolerance than an internal developer tool — set block and review thresholds accordingly.
- Read the triggered phrases. They reveal false positives (quoted or educational use) and help you justify a moderation decision.
- Pattern matching has limits. Obfuscation, sarcasm, and context-dependent harm evade it — do not rely on it alone for high-stakes safety.
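The advice on context-specific thresholds could be expressed as a small decision function. This is a sketch only: the profile names, the threshold values, and the `decide` function are hypothetical, and it assumes an overall score between 0 and 1.

```python
# Hypothetical threshold profiles; the values are illustrative, not recommendations.
THRESHOLDS = {
    "childrens_product": {"review": 0.2, "block": 0.5},
    "internal_dev_tool": {"review": 0.6, "block": 0.9},
}


def decide(overall_score, profile):
    """Map an overall score in [0, 1] to 'block', 'review', or 'allow'."""
    limits = THRESHOLDS[profile]
    if overall_score >= limits["block"]:
        return "block"
    if overall_score >= limits["review"]:
        return "review"  # route to model-based or human review
    return "allow"


if __name__ == "__main__":
    for profile in THRESHOLDS:
        print(profile, decide(0.55, profile))
```

With these example values, the same score of 0.55 is blocked for the children's product profile and allowed for the internal developer tool, which is the kind of difference in tolerance the tip describes.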