A reliable LLM application is not made safe by one clever prompt; it is made safe by layers. This tool turns a short description of your application into a concrete, defense-in-depth safety architecture you can hand to your engineering team.
How it works
You describe three things: the type of application (chatbot, agent, content generator, internal tool), your user base (internal staff, authenticated customers, or the anonymous public), and the risk categories your model touches (self-harm, medical, legal, financial, harassment, code execution). You then set an overall risk tolerance.
The tool maps those choices onto five safety layers (input filtering, model-level controls, output moderation, rate limiting and abuse controls, and human-in-the-loop review) and emits a checklist of the specific controls each layer should contain. Higher-risk selections and stricter tolerances add more aggressive controls, e.g. blocking instead of flagging, mandatory human review, and hard rate caps.
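The mapping from choices to layers can be pictured with a small Python sketch. This is not the tool's actual rule set, which is not documented here. The names `AppProfile`, `build_checklist`, and `HIGH_RISK`, the tolerance values, and every control string are assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Assumption: categories treated as highest risk in this sketch.
HIGH_RISK = {"self-harm", "medical", "financial"}


@dataclass
class AppProfile:
    """Hypothetical container for the three inputs plus the tolerance."""
    app_type: str    # e.g. "chatbot", "agent", "content generator", "internal tool"
    user_base: str   # assumed values: "internal", "authenticated", "public"
    risk_categories: set = field(default_factory=set)
    tolerance: str = "moderate"  # assumed values: "strict", "moderate", "relaxed"


def build_checklist(profile):
    """Return {layer: [controls]} using illustrative rules only."""
    high_risk = bool(profile.risk_categories & HIGH_RISK)
    # Strict behavior applies on request, or for public access to high-risk topics.
    strict = profile.tolerance == "strict" or (
        profile.user_base == "public" and high_risk
    )
    action = "block" if strict else "flag"
    cats = ", ".join(sorted(profile.risk_categories)) or "none declared"

    checklist = {
        "input filtering": [f"{action} inputs touching: {cats}"],
        "model-level controls": [f"system prompt policy covering: {cats}"],
        "output moderation": [f"{action} outputs touching: {cats}"],
        "rate limiting and abuse controls": [
            "hard per-user rate cap" if strict else "soft rate limit with alerting"
        ],
        "human-in-the-loop review": [
            "mandatory human review of escalations" if strict
            else "sampled human review"
        ],
    }
    # Agents and code execution get an extra confirmation control.
    if profile.app_type == "agent" or "code execution" in profile.risk_categories:
        checklist["human-in-the-loop review"].append(
            "confirmation before irreversible actions"
        )
    return checklist


profile = AppProfile("chatbot", "public", {"self-harm", "medical"})
for layer, controls in build_checklist(profile).items():
    print(layer, "->", controls)
```

Keeping the rules in one function makes the escalation logic easy to review: a single `strict` flag decides whether each layer blocks or flags, so the layers cannot drift out of step with one another.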
Tips and notes
- Public + high-risk is the danger zone. Anonymous public access combined with self-harm, medical, or financial topics warrants every layer and the strictest blocking behavior.
- Fail closed for irreversible actions. Any agent that can spend money, delete data, or send external messages should require confirmation or human approval rather than a soft warning.
- Log everything you moderate. Moderation decisions, blocked inputs, and escalations should be recorded so you can tune thresholds and demonstrate accountability later.
- The checklist is a starting architecture, not a compliance certificate — pair it with the jurisdiction-specific disclaimers and audit-trail tools in this collection.
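The fail-closed and logging tips can be combined in one gate. Below is a minimal sketch, assuming a hypothetical allowlist of reversible actions and a hypothetical `approver` callback that stands in for a confirmation prompt or a human reviewer. The action names and the `gate_action` function are illustrative, not part of any tool in this collection.

```python
import logging

logger = logging.getLogger("moderation")

# Assumption: a hypothetical allowlist of actions known to be reversible.
# Anything not listed here is treated as irreversible.
REVERSIBLE = {"search", "summarize"}


def gate_action(action, approver=None):
    """Return True only if the action may proceed; every other path denies."""
    if action in REVERSIBLE:
        allowed, reason = True, "reversible action"
    elif approver is None:
        allowed, reason = False, "no approver configured"
    else:
        try:
            # Only an explicit True counts as approval.
            allowed = approver(action) is True
            reason = "approved" if allowed else "approval denied"
        except Exception as exc:
            # An approver failure must not let the action through.
            allowed, reason = False, f"approver error: {exc}"
    # Record every decision so thresholds can be tuned and audited later.
    logger.info("action=%s allowed=%s reason=%s", action, allowed, reason)
    return allowed


logging.basicConfig(level=logging.INFO)
gate_action("summarize")                    # allowed without approval
gate_action("delete_data")                  # denied: nobody approved it
gate_action("delete_data", lambda a: True)  # allowed: explicit approval
```

The allowlist is the design choice that makes this fail closed: an action the gate has never seen is denied by default, where a soft warning would have let it through.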