Is this library safe to browse at work?

Yes. It describes categories and indicators in clinical, abstract terms for filter design — it does not contain explicit slurs, graphic descriptions, or harmful instructions. It is a taxonomy reference, not example toxic content.

Can I use this to build a content moderation classifier?

It is a useful starting taxonomy for defining your categories and writing test cases, but a production classifier needs labelled training data, human review, and policy sign-off. Use this to structure your thinking, not as a substitute for evaluation.

Why do edge cases matter so much?

Most moderation failures happen on edge cases: reclaimed slurs, quoted hate speech, clinical discussion of self-harm, fiction, and counter-speech. Knowing these in advance lets you design tests that catch over-blocking and under-blocking before users do.

Does this tool send my queries anywhere?

No. The entire library is bundled in the page and filtered in your browser. Nothing you search or view is uploaded or logged.

What is the AI Toxic Content Pattern Reference Library?

It is an offline, categorised reference library of toxic content patterns used in AI filter and moderation design: hate speech indicators, harassment patterns, self-harm language, and violent content markers, each with category definitions, abstract indicators, and edge cases. It runs free in your browser on Gera Tools, with nothing uploaded.

AI Toxic Content Pattern Reference Library

Name: AI Toxic Content Pattern Reference Library
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Building a content moderation filter or safety classifier starts with a clear taxonomy: what counts as hate speech versus harassment, where self-harm language differs from clinical discussion, and which edge cases trip up every system. This offline library organises the common categories used in trust-and-safety work into plain-English definitions, abstract indicators, and the edge cases that cause the most false positives and false negatives.

How it works

The library groups patterns into the major moderation categories: hate speech, harassment and bullying, self-harm and suicide, violence and threats, and adult or sexual content. Each entry gives a definition, a set of abstract indicators (the signals a classifier looks for, described clinically rather than with explicit examples), and the edge cases that commonly cause mistakes. Filter by category to focus on the patterns relevant to the system you are building, then use the indicators to seed positive test cases and the edge cases to seed the hard negatives that expose over-blocking.
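As a rough illustration of that workflow, the sketch below models an entry as a small record and turns it into test seeds. The field names, the `Entry` and `SeedCase` types, and the two placeholder entries are assumptions made for this example; they are not the library's actual schema or contents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One taxonomy entry (hypothetical shape, for illustration only)."""
    category: str
    definition: str
    indicators: tuple[str, ...]   # abstract signals, described clinically
    edge_cases: tuple[str, ...]   # situations that commonly cause mistakes


@dataclass(frozen=True)
class SeedCase:
    """A prompt for a human to write one test case."""
    category: str
    note: str          # abstract description of the case to write
    should_flag: bool  # True = positive case, False = hard negative


# Placeholder entries; real definitions and indicators come from the library.
LIBRARY = (
    Entry(
        category="hate speech",
        definition="Targeting people on the basis of a protected characteristic.",
        indicators=("dehumanising comparison aimed at a protected group",),
        edge_cases=("reclaimed language", "counter-speech"),
    ),
    Entry(
        category="self-harm / suicide",
        definition="Content indicating risk of harm to self.",
        indicators=("first-person statement of intent",),
        edge_cases=("clinical discussion", "fiction", "reporting"),
    ),
)


def filter_by_category(entries, category):
    """Keep only the entries in the requested category."""
    return [e for e in entries if e.category == category]


def seed_tests(entries):
    """Indicators become positive seeds; edge cases become hard negatives."""
    seeds = []
    for e in entries:
        seeds += [SeedCase(e.category, i, True) for i in e.indicators]
        seeds += [SeedCase(e.category, c, False) for c in e.edge_cases]
    return seeds


seeds = seed_tests(filter_by_category(LIBRARY, "self-harm / suicide"))
for s in seeds:
    print("flag " if s.should_flag else "allow", "|", s.note)
```

Each seed is only a prompt for a person to write and label a concrete test case. Treating every edge case as a hard negative is a simplification: some edge-case content should still be flagged, so the `should_flag` value needs review against your policy.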

Why edge cases break most moderation systems

Most content moderation failures fall into two types: false positives (blocking legitimate content) and false negatives (missing harmful content). The edge cases in this library are specifically the situations that cause each:

Reclaimed language: slurs and other terms that are harmful in some contexts are also used affirmatively by the communities they describe. A classifier trained only on surface patterns will over-block in-community speech, eroding trust with the very users it is supposed to protect.

Quotation and counter-speech: reporting on or arguing against hateful content necessarily repeats it. “Their post said X, which is wrong because Y” contains the same tokens as a plain statement of X, so a surface-level match cannot tell the two apart. Context signals (attribution verbs, framing, surrounding argument) must be part of the feature set.

Clinical and educational discussion: self-harm, suicide, and violence topics arise in health journalism, academic research, fiction, and support communities. A working self-harm filter must distinguish a first-person crisis statement from a third-person research citation.

Satire and fiction: exaggeration, irony, and fictional settings can look like literal threats or incitement. Tone detection and genre signals are necessary but imprecise at scale.

Severity gradations: a mild insult, a personal threat, and a credible incitement to violence can all be filed under “harassment,” but they need entirely different responses. Treating them identically produces either over-escalation or under-response.
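One way to avoid treating those cases identically is to carry severity as its own field next to the category and choose the response from severity. The sketch below assumes a three-level scale and made-up action names; the real levels and responses are a policy decision, not something this library defines.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical three-level scale; real policies may use more levels."""
    LOW = 1     # e.g. a mild insult
    MEDIUM = 2  # e.g. a personal threat
    HIGH = 3    # e.g. a credible incitement to violence


# Illustrative response policy keyed on severity alone (action names are made up).
ACTIONS = {
    Severity.LOW: "allow_with_warning",
    Severity.MEDIUM: "remove_and_review",
    Severity.HIGH: "remove_and_escalate",
}


def route(category: str, severity: Severity) -> dict:
    """Return a decision record that keeps category and severity separate."""
    return {
        "category": category,
        "severity": severity.name,
        "action": ACTIONS[severity],
    }


# Same category, different severity, different response.
for level in Severity:
    print(route("harassment", level))
```

Keeping the two axes separate means a classifier can be right about the category and wrong about the severity, or the reverse, and each error can be measured and corrected on its own.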

Category overview

Category	Core concern	Hardest edge cases
Hate speech	Protected-characteristic targeting	Reclaimed language, counter-speech
Harassment / bullying	Individual targeting, repeated behaviour	Satire, fictional scenarios
Self-harm / suicide	Risk of harm to self	Clinical discussion, fiction, reporting
Violence / threats	Risk of harm to others	Hyperbole, sports trash talk, fiction
Adult / sexual content	Age, consent, context	Age-ambiguous subjects, art, medical

Tips and notes

Design for the edge cases first. Reclaimed language, quotation, counter-speech, and fiction are where moderation systems lose user trust by over-blocking — write those tests before the obvious ones.
Separate severity from category. A threat and a mild insult are both harassment but need very different responses; keep severity as its own axis.
Context beats keywords. Keyword matching alone produces the famous “Scunthorpe problem”; the indicators here are meant to inform contextual classifiers, not naive blocklists.
Always keep a human in the loop for high-stakes or ambiguous content — this taxonomy supports human reviewers, it does not replace them.
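The “context beats keywords” tip can be seen in a few lines. The sketch below uses the harmless word “grape” as a stand-in for a blocked term; the blocklist, function names, and sample sentences are all invented for this example.

```python
import re

BLOCKLIST = ["grape"]  # harmless stand-in for a blocked term


def naive_match(text: str) -> bool:
    """Substring matching: the approach behind the Scunthorpe problem."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


def word_match(text: str) -> bool:
    """Whole-word matching: avoids embedded substrings but still ignores context."""
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in BLOCKLIST) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


samples = [
    "I bought a grapefruit.",                       # blocked term embedded in a longer word
    "I ate a grape.",                               # plain use of the term
    'They wrote "grape", which the rules forbid.',  # quotation of the term
]
for s in samples:
    print(naive_match(s), word_match(s), "|", s)
```

Substring matching flags all three sentences, including the innocent “grapefruit”. Whole-word matching clears that one but still flags the quoted use, which is the point of the tip: telling quotation or counter-speech apart from a plain statement needs contextual signals that no blocklist provides.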