How does offline scoring work?

It matches the text against curated pattern lists for each harm dimension (hate-speech indicators, self-harm language, violent-content markers, and misinformation cues) and weights the matches into a per-dimension score. No model or API is used, so scoring is instant and private.

Is pattern matching as good as a moderation model?

No. It is a fast, transparent first-pass filter that catches explicit and common patterns. Sophisticated or obfuscated content, sarcasm, and context-dependent harm can slip through, so high-stakes moderation should add a model-based or human review layer.

Why screen LLM output at all?

Even aligned models occasionally produce harmful, biased, or unsafe text, especially under adversarial prompting. Screening output before it reaches users is a defence-in-depth control that catches failures the model's own safety training missed.

Does this send my text anywhere?

No. All scoring happens in your browser against local pattern lists. Nothing is uploaded, which makes it safe to screen sensitive content.

What do I do with the score?

Use thresholds suited to your risk tolerance — block above a high score, route mid-range output to human review, and allow low scores. The per-phrase breakdown helps you tune those thresholds and explain decisions.
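
As a minimal sketch of that block/review/allow split, assuming scores are normalised to the range 0 to 1. The function name and both cut-offs are illustrative, not values the tool prescribes:

```python
def route(score: float, block_at: float = 0.8, review_at: float = 0.4) -> str:
    """Map a 0-1 toxicity score to a moderation action (hypothetical cut-offs)."""
    if not 0.0 <= review_at <= block_at <= 1.0:
        raise ValueError("expected 0 <= review_at <= block_at <= 1")
    if score >= block_at:
        return "block"
    if score >= review_at:
        return "review"
    return "allow"


print(route(0.9))                               # block
print(route(0.5))                               # review
print(route(0.1))                               # allow
# A stricter deployment simply passes lower cut-offs.
print(route(0.5, block_at=0.5, review_at=0.2))  # block
```

Keeping the cut-offs as parameters rather than constants makes it easy to revisit them after reading the per-phrase breakdown for a batch of real outputs.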

What is the LLM Output Toxicity Pattern Scorer?

Paste LLM output and score it across four toxicity dimensions (hate-speech indicators, self-harm language, violent-content markers, and misinformation patterns) using offline pattern matching, with no API call required. It runs free and entirely in your browser on Gera Tools, with nothing uploaded.

Name: LLM Output Toxicity Pattern Scorer
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

LLM output toxicity pattern scorer

Even well-aligned models can produce harmful text under the wrong prompt, and shipping that to users is a real safety and reputational risk. The LLM output toxicity pattern scorer gives you a fast, private first-pass filter: paste the model’s output and get a score across several harm dimensions, computed entirely offline with no API call. It is instant, costs nothing per check, and never sends your text anywhere.

The four harm dimensions and why each matters

Hate-speech indicators. Slurs, dehumanizing comparisons, and group-targeting language. Even when the model is quoting or discussing harmful content rather than producing it, these patterns in user-facing output can cause real harm and violate platform policy.

Self-harm language. Specific method references, encouragement phrasing, and ideation language. Applications serving general audiences — and especially those handling mental health topics — need to catch these before display. This dimension is intentionally conservative: false positives (educational, clinical, or journalistic use) are preferable to misses.

Violent content markers. Graphic descriptions of injury, threats, and instructions paired with a target. This is distinct from discussing violence historically or journalistically: the pattern list targets graphic detail that appears together with a threat or instruction aimed at a target.

Misinformation cues. Absolute health claims, conspiracy-adjacent trigger phrases, and language patterns associated with known false narratives. Pattern matching alone cannot verify facts, so this dimension flags text for human review rather than automatic blocking.
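
The different handling of the four dimensions can be expressed as a per-dimension policy. The sketch below is an assumption about how one might encode it, not the tool's own configuration: the dimension keys, the `Policy` type, and every threshold are hypothetical. It gives self-harm lower cut-offs (conservative) and gives misinformation no block threshold at all, so that dimension can only route to review.

```python
from typing import NamedTuple, Optional


class Policy(NamedTuple):
    review_at: float
    block_at: Optional[float]  # None means this dimension never auto-blocks


# Hypothetical values, chosen only to illustrate the shape of the policy.
POLICIES = {
    "hate_speech": Policy(review_at=0.4, block_at=0.8),
    "self_harm": Policy(review_at=0.2, block_at=0.5),        # conservative
    "violence": Policy(review_at=0.4, block_at=0.8),
    "misinformation": Policy(review_at=0.3, block_at=None),  # review only
}


def decide(dimension: str, score: float) -> str:
    """Return 'block', 'review', or 'allow' for one dimension's 0-1 score."""
    policy = POLICIES[dimension]
    if policy.block_at is not None and score >= policy.block_at:
        return "block"
    if score >= policy.review_at:
        return "review"
    return "allow"


print(decide("misinformation", 0.95))  # review
print(decide("self_harm", 0.5))        # block
print(decide("violence", 0.5))         # review
```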

How it works

The scorer matches your text against curated pattern lists for each harm dimension — hate-speech indicators, self-harm language, violent-content markers, and misinformation cues. Matches are weighted and combined into a per-dimension score and an overall score, and the tool shows the exact phrases that triggered each dimension so the result is explainable rather than a black box. Because it runs on local pattern lists, scanning is immediate and works on sensitive content you would not want to send to a third-party moderation API.
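
A minimal sketch of this kind of weighted pattern scoring follows. It is not the tool's actual implementation: the pattern lists, the weights, the cap at 1.0, and the choice of the maximum as the overall score are all assumptions made for illustration, and only two dimensions are shown.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical pattern lists of (regex, weight) pairs; real lists would be far larger.
PATTERNS = {
    "violence": [
        (r"\bi will hurt you\b", 0.8),
    ],
    "misinformation": [
        (r"\bmiracle cure\b", 0.5),
        (r"\bdoctors don't want you to know\b", 0.6),
        (r"\b100% (safe|effective)\b", 0.4),
    ],
}


@dataclass
class DimensionResult:
    score: float = 0.0
    triggers: List[str] = field(default_factory=list)


def score_text(text: str) -> Dict[str, DimensionResult]:
    """Sum the weights of every match per dimension, capped at 1.0."""
    results = {}
    for dimension, patterns in PATTERNS.items():
        result = DimensionResult()
        for regex, weight in patterns:
            for match in re.finditer(regex, text, flags=re.IGNORECASE):
                result.score += weight
                result.triggers.append(match.group(0))  # keep the exact phrase
        result.score = min(result.score, 1.0)
        results[dimension] = result
    return results


def overall_score(results: Dict[str, DimensionResult]) -> float:
    """Assumed combination rule: the worst dimension decides the overall score."""
    return max(r.score for r in results.values())


results = score_text("This miracle cure is 100% safe.")
for name, res in results.items():
    print(name, round(res.score, 2), res.triggers)
print("overall", round(overall_score(results), 2))
```

Recording the matched phrase alongside each weight is what makes the result explainable: every point in a score can be traced back to specific text.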

Where offline pattern scoring fits in a moderation stack

No single layer handles all content risk. A practical moderation pipeline typically combines:

Pattern matching (this tool) — instant, offline, zero cost, catches explicit and common patterns.
Model-based classification — a purpose-built moderation model or provider safety API handles nuance, context, and obfuscation that patterns miss.
Human review queue — for borderline scores, policy-sensitive topics, and appeals.

The offline layer is most valuable as an early filter: it handles the clear cases cheaply, so you are not paying per-call costs for model-based review on every output.
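
One way to wire the layers together is a cascade in which the model is only called when the pattern layer has not already produced a clear block. The sketch below assumes both layers return a score from 0 to 1. The function names, thresholds, and the two stand-in scorers are hypothetical; a real deployment would plug in its own pattern scorer and moderation model.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str], float]


def moderate(text: str, pattern_score: Scorer, model_score: Scorer,
             block_at: float = 0.8, review_at: float = 0.4) -> Tuple[str, str]:
    """Return (action, deciding_layer) for one piece of output."""
    p = pattern_score(text)
    if p >= block_at:
        return "block", "pattern"  # clear case: no model call needed
    worst = max(p, model_score(text))
    if worst >= block_at:
        return "block", "model"
    if worst >= review_at:
        return "human_review", "model"
    return "allow", "model"


# Stand-in scorers for demonstration only.
model_calls: List[str] = []


def fake_pattern_score(text: str) -> float:
    return 0.9 if "explicit-marker" in text else 0.0


def fake_model_score(text: str) -> float:
    model_calls.append(text)  # count how often the paid layer is used
    return 0.5 if "borderline" in text else 0.05


print(moderate("contains explicit-marker", fake_pattern_score, fake_model_score))
print(moderate("a borderline reply", fake_pattern_score, fake_model_score))
print(moderate("hello there", fake_pattern_score, fake_model_score))
print(len(model_calls))  # 2: the pattern-blocked text never reached the model
```

Note that a low pattern score still goes to the model here, since patterns alone can miss obfuscated or context-dependent harm.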

Tips and notes

Use it as a first layer. Pattern matching catches the explicit cases fast; pair it with model-based or human review for borderline content.
Tune thresholds to your context. A children’s product needs a far lower tolerance than an internal developer tool — set block and review thresholds accordingly.
Read the triggered phrases. They reveal false positives (quoted or educational use) and help you justify a moderation decision.
Pattern matching has limits. Obfuscation, sarcasm, and context-dependent harm evade it — do not rely on it alone for high-stakes safety.
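
The last point is easy to demonstrate. In this small sketch, a single hypothetical pattern (standing in for one entry in a curated list) matches the plain wording but misses two trivially obfuscated variants:

```python
import re

# Hypothetical pattern, used only to illustrate the limitation.
PATTERN = re.compile(r"\bmiracle cure\b", re.IGNORECASE)

samples = [
    "Try this miracle cure today",  # plain wording
    "Try this m1racle cure today",  # character substitution
    "Try this m i r a c l e cure",  # inserted spaces
]

hits = [bool(PATTERN.search(s)) for s in samples]
print(hits)  # [True, False, False]
```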