AI toxic content pattern reference library
Building a content moderation filter or safety classifier starts with a clear taxonomy: what counts as hate speech versus harassment, where self-harm language differs from clinical discussion, and which edge cases trip up every system. This offline library organises the common categories used in trust-and-safety work into plain-English definitions, abstract indicators, and the edge cases that cause the most false positives and false negatives.
How it works
The library groups patterns into the major moderation categories — hate speech, harassment and bullying, self-harm and suicide, violence and threats, and adult or sexual content. Each entry gives a definition, a set of abstract indicators (the signals a classifier looks for, described clinically rather than with explicit examples), and the edge cases that commonly cause mistakes. You filter by category to focus on the patterns relevant to your filter, then use the indicators to seed positive test cases and the edge cases to seed the hard negatives that expose over-blocking.
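The workflow described above can be sketched in Python as a minimal illustration. The names `Category`, `PatternEntry`, `SeedCase`, `filter_by_category`, and `seed_tests` are hypothetical and not part of the library, and the sample entry text is invented for illustration, kept abstract in the same way the library keeps its indicators abstract.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    # The five moderation categories named in the text above.
    HATE_SPEECH = "hate speech"
    HARASSMENT = "harassment and bullying"
    SELF_HARM = "self-harm and suicide"
    VIOLENCE = "violence and threats"
    ADULT = "adult or sexual content"


@dataclass(frozen=True)
class PatternEntry:
    # Hypothetical shape of one library entry.
    category: Category
    definition: str
    indicators: tuple[str, ...]   # abstract signals, no explicit examples
    edge_cases: tuple[str, ...]   # situations that commonly cause mistakes


@dataclass(frozen=True)
class SeedCase:
    category: Category
    description: str
    should_flag: bool  # True = positive case, False = hard negative


def filter_by_category(entries, category):
    """Keep only the entries relevant to the filter being built."""
    return [e for e in entries if e.category is category]


def seed_tests(entries):
    """Indicators seed positive cases; edge cases seed hard negatives."""
    seeds = []
    for entry in entries:
        seeds += [SeedCase(entry.category, i, True) for i in entry.indicators]
        seeds += [SeedCase(entry.category, c, False) for c in entry.edge_cases]
    return seeds


# Illustrative entries only; the wording is an assumption, not library content.
LIBRARY = [
    PatternEntry(
        Category.HARASSMENT,
        "Targeted, unwanted hostile behaviour toward an individual.",
        ("direct insult aimed at a named person", "repeated unwanted contact"),
        ("quoting abuse in order to report it", "counter-speech"),
    ),
    PatternEntry(
        Category.VIOLENCE,
        "Statements of intent to cause physical harm.",
        ("stated intent to harm a specific target",),
        ("violence described in clearly marked fiction",),
    ),
]

harassment_seeds = seed_tests(filter_by_category(LIBRARY, Category.HARASSMENT))
```

Each seed is a description of a test to write, not a test input itself; a reviewer would still author the concrete text for each case.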
Tips and notes
- Design for the edge cases first. Reclaimed language, quotation, counter-speech, and fiction are where moderation systems lose user trust by over-blocking — write those tests before the obvious ones.
- Separate severity from category. A threat and a mild insult are both harassment but need very different responses; keep severity as its own axis.
- Context beats keywords. Keyword matching alone produces the famous “Scunthorpe problem”; the indicators here are meant to inform contextual classifiers, not naive blocklists.
- Always keep a human in the loop for high-stakes or ambiguous content. This taxonomy supports human reviewers; it does not replace them.
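Three of the tips above lend themselves to a short Python sketch: the Scunthorpe problem, severity as a separate axis, and routing to human review. Everything here is an assumption made for illustration: the stand-in blocklist term `bad`, the `Verdict` fields, the 0.8 confidence threshold, and the action names are hypothetical and not taken from the library.

```python
import re
from dataclasses import dataclass
from enum import IntEnum

# Harmless stand-in for a blocklisted term (assumption for illustration).
BLOCKLIST = ("bad",)


def naive_match(text: str) -> bool:
    """Substring matching: flags innocent words that merely contain a term."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


def word_boundary_match(text: str) -> bool:
    """Whole-word matching: avoids the substring false positive, nothing more."""
    return any(
        re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE)
        for term in BLOCKLIST
    )


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Verdict:
    category: str        # e.g. "harassment"
    severity: Severity   # its own axis, independent of category
    confidence: float    # classifier confidence in [0, 1]


def route(verdict: Verdict, review_below: float = 0.8) -> str:
    """Send high-severity or low-confidence verdicts to a human reviewer."""
    if verdict.severity is Severity.HIGH or verdict.confidence < review_below:
        return "human_review"
    if verdict.severity is Severity.MEDIUM:
        return "limit_visibility"
    return "allow_with_warning"


print(naive_match("We play badminton on Sundays"))          # substring hit
print(word_boundary_match("We play badminton on Sundays"))  # no whole-word hit
print(route(Verdict("harassment", Severity.HIGH, 0.99)))
print(route(Verdict("harassment", Severity.LOW, 0.95)))
```

Whole-word matching only removes the substring false positive. It still cannot tell quotation, counter-speech, or fiction from the real thing, which is why the indicators are meant to feed contextual classifiers and human review.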