What is a defense-in-depth approach to AI safety?

Defense in depth means no single control is trusted to catch everything. You layer an input filter, the model's own system prompt, an output moderation pass, rate limiting, and human review so that a failure in one layer is caught by another. This tool maps those layers to your specific application.
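
The layering idea can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation: the names handle_request, Outcome, too_long, and leaks_marker are hypothetical, and call_model stands in for whatever model client you use. It shows an output check stopping a reply that the input check had no reason to block.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A check returns a reason string when it objects, or None when the text passes.
Check = Callable[[str], Optional[str]]

FALLBACK = "Sorry, I can't help with that request."


@dataclass
class Outcome:
    text: str                  # reply shown to the user, or the fallback message
    blocked_by: Optional[str]  # name of the layer that stopped the request, if any


def handle_request(
    user_text: str,
    call_model: Callable[[str], str],  # hypothetical stand-in for a model client
    input_checks: List[Tuple[str, Check]],
    output_checks: List[Tuple[str, Check]],
) -> Outcome:
    # Input filtering runs before the model sees anything.
    for name, check in input_checks:
        if check(user_text) is not None:
            return Outcome(FALLBACK, blocked_by=name)
    reply = call_model(user_text)
    # Output moderation runs even when the input looked harmless.
    for name, check in output_checks:
        if check(reply) is not None:
            return Outcome(FALLBACK, blocked_by=name)
    return Outcome(reply, blocked_by=None)


# Two toy checks, assumed for illustration only.
def too_long(text: str) -> Optional[str]:
    return "input too long" if len(text) > 2000 else None


def leaks_marker(text: str) -> Optional[str]:
    return "internal marker in output" if "INTERNAL-ONLY" in text else None
```

Rate limiting and human review would wrap around this function in the same way; they are left out here to keep the sketch short.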

Do I need every layer the tool recommends?

Not always. The recommendations scale with your risk tolerance and risk categories. A low-risk internal tool may safely skip human review, while a public-facing app touching health or financial topics should implement every layer suggested.

What is the difference between input filtering and output moderation?

Input filtering inspects what the user sends before it reaches the model — catching prompt injection, PII, or disallowed requests. Output moderation inspects what the model produces before it reaches the user — catching unsafe, off-policy, or hallucinated content. You generally want both.

When should a request be escalated to a human reviewer?

Common triggers include low model confidence, detection of a high-risk topic (self-harm, legal, medical), repeated policy violations from one user, or any action with irreversible consequences. The generated checklist lists triggers appropriate to your selected risk categories.

Does this tool send my inputs anywhere?

No. The entire design guide runs in your browser. Your application description and selections are never uploaded — the checklist is generated locally from your inputs.

What is the AI Safety Layer Design Guide?

It is a free, browser-based tool on Gera Tools. You enter your application type and risk tolerance and receive a recommended safety layer architecture (input filters, output moderation, rate limiting, human review triggers, and fallback behaviors) as an actionable design checklist. Nothing is uploaded.

AI Safety Layer Design Guide

Name: AI Safety Layer Design Guide
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

A reliable LLM application is not made safe by one clever prompt — it is made safe by layers. This guide turns a short description of your application into a concrete, defense-in-depth safety architecture you can hand to your engineering team.

How it works

You describe three things: the type of application (chatbot, agent, content generator, internal tool), your user base (internal staff, authenticated customers, or the anonymous public), and the risk categories your model touches (self-harm, medical, legal, financial, harassment, code execution). You then set an overall risk tolerance.

The tool maps those choices onto five safety layers (input filtering, model-level controls, output moderation, rate limiting and abuse controls, and human-in-the-loop review) and emits a checklist with the specific controls each layer should contain. Higher-risk categories and lower risk tolerances add more aggressive controls (e.g. blocking instead of flagging, mandatory human review, hard rate caps).
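
To make the mapping concrete, here is a hedged Python sketch of how choices could turn into a checklist. The tool's actual rules are not published here, so everything in this block is an assumption made for illustration: the function name build_checklist, the tolerance values ("low", "medium", "high"), the user-base labels, and the specific rules.

```python
from typing import Dict, List

# Assumed for illustration: which categories count as high risk.
HIGH_RISK = {"self-harm", "medical", "financial"}


def build_checklist(
    app_type: str,
    user_base: str,
    categories: List[str],
    tolerance: str,
) -> Dict[str, List[str]]:
    """Map application choices onto the five layers (hypothetical rules)."""
    # Strict mode: low risk tolerance, or anonymous public users plus a high-risk topic.
    strict = tolerance == "low" or (
        user_base == "public" and bool(HIGH_RISK & set(categories))
    )
    checklist: Dict[str, List[str]] = {
        "input filtering": [
            "enforce maximum input length",
            "screen for prompt injection and PII",
        ],
        "model-level controls": ["system prompt stating scope and refusals"],
        "output moderation": [
            ("block" if strict else "flag") + " unsafe or off-policy output"
        ],
        "rate limiting and abuse controls": ["per-user rate limit"],
        "human review": [],
    }
    if strict:
        checklist["rate limiting and abuse controls"].append("hard cost cap")
        checklist["human review"].append("mandatory review for high-risk topics")
    if app_type == "agent" or "code execution" in categories:
        checklist["human review"].append(
            "human confirmation before irreversible actions"
        )
    return checklist
```

For example, build_checklist("chatbot", "public", ["medical"], "medium") switches output moderation from flagging to blocking and adds a hard cost cap, while an internal tool with no risk categories and a high tolerance gets an empty human review list.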

Tips and notes

Public + high-risk is the danger zone. Anonymous public access combined with self-harm, medical, or financial topics warrants every layer and the strictest blocking behavior.
Fail closed for irreversible actions. Any agent that can spend money, delete data, or send external messages should require confirmation or human approval rather than a soft warning.
Log everything you moderate. Moderation decisions, blocked inputs, and escalations should be recorded so you can tune thresholds and demonstrate accountability later.
The checklist is a starting architecture, not a compliance certificate — pair it with the jurisdiction-specific disclaimers and audit-trail tools in this collection.
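
The "fail closed" and "log everything" tips above can be combined in one small gate. The sketch below is an assumption-laden illustration, not part of the tool: execute_action, SAFE_ACTIONS, and the action names are hypothetical. Any action not on the known-safe list is held until a named approver is supplied, and every decision is appended to an audit log.

```python
import json
import time
from typing import Callable, List, Optional

# Assumed allowlist: actions known to be reversible and low impact.
SAFE_ACTIONS = {"search", "draft_reply"}


def execute_action(
    action: str,
    run: Callable[[], str],
    approved_by: Optional[str],
    audit_log: List[str],
) -> str:
    """Run an agent action, failing closed when approval is missing."""
    # Anything not explicitly known to be safe needs a human approver.
    needs_approval = action not in SAFE_ACTIONS
    allowed = (not needs_approval) or (approved_by is not None)
    # Record the decision whether or not the action runs.
    audit_log.append(
        json.dumps(
            {
                "ts": time.time(),
                "action": action,
                "approved_by": approved_by,
                "allowed": allowed,
            }
        )
    )
    if not allowed:
        return "pending_approval"
    return run()
```

Using an allowlist rather than a blocklist is the design choice that makes this fail closed: a new, unclassified action is held for approval by default instead of slipping through.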

The five safety layers in detail

Understanding why each layer exists — and what it cannot do alone — is what makes defense in depth more than a buzzword.

Layer 1: Input filtering

Input filtering inspects what a user sends before it reaches the model. Its jobs are to catch direct prompt injection attempts, to detect PII that should not enter the model’s context, to classify the intent of the request (is this on-policy?), and to enforce hard limits (maximum length, disallowed content categories). Its limit is that it sees only the raw user text, so it cannot catch indirect prompt injection: malicious instructions embedded in documents or web pages the model retrieves.
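
A minimal Python sketch of an input filter follows. The threshold, the phrase list, and the name filter_input are assumptions chosen for illustration; a production filter would typically use a trained classifier and a dedicated PII detector, since a fixed phrase list is easy to evade.

```python
import re
from typing import List

MAX_CHARS = 4000  # assumed hard limit on input length

# Simple email pattern as one example of PII detection.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Illustrative phrases only; real injection attempts vary widely.
INJECTION_PHRASES = (
    "ignore previous instructions",
    "ignore all previous instructions",
    "disregard your system prompt",
)


def filter_input(text: str) -> List[str]:
    """Return a list of findings; an empty list means the input passed."""
    findings: List[str] = []
    if len(text) > MAX_CHARS:
        findings.append("too_long")
    if EMAIL_RE.search(text):
        findings.append("pii_email")
    lowered = text.lower()
    if any(phrase in lowered for phrase in INJECTION_PHRASES):
        findings.append("possible_prompt_injection")
    return findings
```

Returning findings rather than a yes/no answer lets the caller decide whether to block or merely flag, which is where the risk tolerance setting comes in.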

Layer 2: Model-level controls

The model itself is a safety layer. A well-crafted system prompt, reinforcement learning from human feedback tuned for safety, and a model with built-in refusal behaviors all reduce the baseline rate of problematic outputs. But model-level controls alone are not sufficient: they can be bypassed, and they do not protect against the model’s correct behavior being weaponized (e.g., summarizing a document that contains a prompt injection payload).

Layer 3: Output moderation

Output moderation inspects what the model produces before the user sees it. A separate classifier or rule set checks the output for unsafe content, PII, hallucinated data that should not surface, and off-policy material. This catches things the model should not have said even when the input was innocuous. It is distinct from input filtering and cannot be omitted on the assumption that a safe input will produce a safe output.
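
As a rough Python sketch, an output moderation pass might look like the following. The function name moderate_output, the blocked term, and the three actions are assumptions for illustration; the US Social Security number pattern is just one example of PII a rule set could redact.

```python
import re
from typing import Tuple

# Matches the common 3-2-4 digit SSN layout; one example of a PII rule.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Assumed off-policy marker; a real deployment would use a classifier as well.
BLOCKED_TERMS = ("internal use only",)

REFUSAL = "Sorry, I can't share that."


def moderate_output(reply: str) -> Tuple[str, str]:
    """Return (action, text) where action is 'block', 'redact', or 'allow'."""
    lowered = reply.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        return "block", REFUSAL
    redacted = SSN_RE.sub("[REDACTED]", reply)
    if redacted != reply:
        return "redact", redacted
    return "allow", reply
```

The check runs on the model's reply regardless of what the input filter decided, which is the point of treating the two as separate layers.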

Layer 4: Rate limiting and abuse controls

Rate limiting prevents a single user or an automated attacker from consuming excessive resources, mounting brute-force jailbreak attempts, or triggering denial-of-wallet attacks. User-level rate limits, IP-level limits, and cost caps all belong here. For authenticated users, anomaly detection (for example, an account suddenly sending far more messages than it ever has before) can catch automated abuse that would otherwise look legitimate.
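
One common way to implement a per-user limit is a sliding window over recent request timestamps. The Python sketch below assumes a single process and in-memory state; the class name SlidingWindowLimiter is hypothetical, and the clock is passed in explicitly so the behavior is easy to test.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowLimiter:
    """Allow at most max_requests per key within any window_seconds span."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str, now: float) -> bool:
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

In a running service you would pass time.monotonic() as now and use a user ID or IP address as the key. A cost cap can follow the same shape, summing estimated spend per key instead of counting requests.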

Layer 5: Human review and escalation

Some requests or outputs should never go directly to a user without a human in the loop. Hard triggers (a user expressing self-harm intent, a request to take an irreversible high-value action, repeated policy violations) should route to a human reviewer or, at minimum, trigger a cooling-off period and a welfare check. For agentic systems that take real-world actions, requiring confirmation from a human before any consequential action is not a UX cost; it is the difference between a recoverable error and an unrecoverable one.
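
The hard triggers named above can be expressed as a small routing function. This Python sketch is illustrative only: RequestContext, its fields, and the default threshold of three violations are assumptions, and the signals themselves (for example, self-harm detection) would come from upstream classifiers that are not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestContext:
    self_harm_detected: bool   # set by an upstream classifier (not shown)
    irreversible_action: bool  # the request would take an irreversible action
    violation_count: int       # prior policy violations by this user


def escalation_reasons(ctx: RequestContext, max_violations: int = 3) -> List[str]:
    """Collect every hard trigger that applies to this request."""
    reasons: List[str] = []
    if ctx.self_harm_detected:
        reasons.append("self_harm")
    if ctx.irreversible_action:
        reasons.append("irreversible_action")
    if ctx.violation_count >= max_violations:
        reasons.append("repeated_violations")
    return reasons


def route(ctx: RequestContext) -> str:
    """Send the request to a human if any hard trigger fired."""
    return "human_review" if escalation_reasons(ctx) else "auto_respond"
```

Collecting all reasons, rather than stopping at the first, gives the reviewer the full picture and makes the logged escalation easier to audit later.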

Calibrating to your actual risk

The checklist the guide generates scales to your inputs: a low-risk internal drafting tool with authenticated employees needs far fewer controls than a public-facing chatbot that touches health or financial topics. Proportionality matters — over-blocking on a low-risk tool trains users to route around the safety system, which is worse than appropriate permissiveness with good logging.