What is a guard model?

A guard model is a smaller, cheaper classifier that screens each request (and optionally each response) for policy violations such as toxicity, self-harm, jailbreaks, and PII before the expensive main model runs. Examples include OpenAI's moderation endpoint and Meta's Llama Guard; NVIDIA NeMo Guardrails is a toolkit for wiring such checks around an LLM. It is a safety layer, not the model that does your actual work.

How much does a guard model usually add?

Because guard models are small and process short inputs, the overhead is often a few percent of total cost. The exact figure depends on guard token volume versus your main-model spend — high-volume apps with cheap main models feel it most, which is exactly what this tool quantifies.

Should I guard the input, the output, or both?

Input guarding catches malicious or policy-violating prompts before you spend on generation. Output guarding catches unsafe model responses before they reach the user. Many production systems do both, which roughly doubles the guard token volume — model that by entering combined guard tokens per request.

Can the moderation API be free?

OpenAI's moderation endpoint is currently free for OpenAI API customers. If you use it, set the guard cost to zero; the calculator then confirms that the overhead is purely latency, not dollars. Self-hosted guards like Llama Guard cost compute instead of per-token fees, so estimate that as an effective per-token price.

Is anything uploaded?

No. All figures are computed locally in your browser. Nothing you enter is transmitted.

What is the LLM Guard / Moderation Model Cost Calculator?

Model the overhead of running a guard model (a moderation API, Llama Guard, or NeMo Guardrails) before each main LLM call. See the per-request and monthly guard cost, what percentage it adds to your bill, and the break-even blocked-request rate. It runs free in your browser on Gera Tools, with nothing uploaded.

LLM Guard / Moderation Model Cost Calculator

Name: LLM Guard / Moderation Model Cost Calculator
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

LLM guard model cost calculator

Putting a moderation or guard model in front of your main LLM is one of the cheapest safety wins available — but “cheap” is not “free.” This tool prices the guard layer precisely: per request, per month, and as a percentage of your total bill, so you can confirm the overhead is acceptable before you wire it into the hot path.

How it works

The guard cost per request is guard_tokens × (guard_price_per_1k / 1000). Multiply by daily requests and 30 days for the monthly guard bill. The tool then compares that to your main-model spend (main_cost_per_request × requests × 30) and reports the percentage overhead the guard adds to your total LLM cost.
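The formulas above can be sketched in a few lines of Python. This is an illustrative sketch, not the calculator's actual source: the function name, parameter names, and sample inputs are hypothetical, a 30-day month is assumed as in the text, and the overhead is expressed relative to main-model spend.

```python
def guard_overhead(guard_tokens, guard_price_per_1k, main_cost_per_request,
                   daily_requests, days=30):
    """Price the guard layer per request, per month, and as a % of main spend."""
    if main_cost_per_request <= 0 or daily_requests <= 0:
        raise ValueError("main cost and daily requests must be positive")

    # guard_tokens x (guard_price_per_1k / 1000), as in the text
    guard_cost_per_request = guard_tokens * (guard_price_per_1k / 1000)
    monthly_guard_cost = guard_cost_per_request * daily_requests * days
    monthly_main_cost = main_cost_per_request * daily_requests * days

    return {
        "guard_cost_per_request": guard_cost_per_request,
        "monthly_guard_cost": monthly_guard_cost,
        "monthly_main_cost": monthly_main_cost,
        # Guard spend as a percentage of main-model spend
        "overhead_pct": 100 * monthly_guard_cost / monthly_main_cost,
    }


# Hypothetical inputs: 500 guard tokens at $0.0002 per 1K tokens,
# a $0.01 main-model call, and 10,000 requests per day.
result = guard_overhead(500, 0.0002, 0.01, 10_000)
print(f"monthly guard cost: ${result['monthly_guard_cost']:.2f}")
print(f"overhead: {result['overhead_pct']:.2f}%")
```

Note that daily request volume cancels out of the percentage: the overhead depends only on the ratio of guard cost per request to main cost per request, while volume scales the dollar amounts.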

It also computes a break-even view: how many blocked-or-unsafe requests the guard must prevent (each saving a wasted main-model call) for the guard to pay for itself purely on saved generation spend — before you even count the value of avoiding a harmful output.
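A minimal sketch of that break-even view follows. It assumes the guard runs on the input, that every blocked request saves exactly one full main-model call, and that the month has 30 days; the function name and sample figures are hypothetical rather than taken from the tool.

```python
def break_even(monthly_guard_cost, main_cost_per_request, daily_requests, days=30):
    """Return (blocked requests needed per month, required block rate)."""
    if main_cost_per_request <= 0 or daily_requests <= 0:
        raise ValueError("main cost and daily requests must be positive")

    # Each blocked request saves one main-model call, so the guard pays for
    # itself once the saved calls add up to the monthly guard bill.
    blocked_needed = monthly_guard_cost / main_cost_per_request

    # The same threshold as a fraction of all monthly requests.
    block_rate = blocked_needed / (daily_requests * days)
    return blocked_needed, block_rate


# Hypothetical inputs: a $45 monthly guard bill, a $0.003 main-model call,
# and 20,000 requests per day.
blocked, rate = break_even(45.0, 0.003, 20_000)
print(f"blocked requests needed per month: {blocked:,.0f}")
print(f"break-even block rate: {rate:.2%}")
```

Under these assumptions the break-even block rate equals the guard cost per request divided by the main cost per request, so it is the same fraction as the overhead percentage.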

Illustrative example

Suppose you have 50,000 daily requests, your main model costs $0.005 per request, and your guard model processes 200 tokens at $0.001 per 1K tokens:

guard cost per request  = 200 × (0.001 / 1000) = $0.0002
monthly guard cost      = 0.0002 × 50,000 × 30 = $300
main model monthly cost = 0.005 × 50,000 × 30 = $7,500
guard overhead          = 300 / 7,500 = 4%

In this scenario the guard adds about 4% on top of main-model spend. Break-even: each blocked request saves one $0.005 main-model call, so the guard must block 300 / 0.005 = 60,000 requests per month (4% of the 1.5 million monthly requests) to pay for itself on cost alone. Below that block rate the guard is a net cost in dollars, justified by the value of safety rather than by saved generation spend.

At higher guard costs (for example self-hosted Llama Guard on GPU), the overhead percentage rises, and the break-even number of blocked requests increases. The tool makes that trade-off visible.

Guard architecture choices

Input-only guarding screens the user’s prompt before generation. This catches jailbreaks, policy violations, and clearly harmful requests at the lowest cost, since you never spend on a main-model call you would refuse. Most applications start here.

Output guarding screens the model’s response before delivering it to the user. This catches cases where a well-intentioned prompt elicits an unsafe response, which input guards miss. It adds a second round of guard token costs.

Guarding both input and output is the most robust approach. Enter the combined guard token count (tokens screened on the input pass plus tokens screened on the output pass) when modelling both.
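The combined figure can be worked out as in the sketch below. The function and its flags are hypothetical. The output_guard_sees_prompt option is an assumption that goes beyond the text: some output guards are given the prompt as context alongside the response, which adds the prompt tokens a second time.

```python
def combined_guard_tokens(prompt_tokens, response_tokens,
                          guard_input=True, guard_output=True,
                          output_guard_sees_prompt=False):
    """Total tokens the guard processes for one request."""
    total = 0
    if guard_input:
        total += prompt_tokens          # input pass screens the prompt
    if guard_output:
        total += response_tokens        # output pass screens the response
        if output_guard_sees_prompt:
            total += prompt_tokens      # assumption: prompt re-sent as context
    return total


# Hypothetical request: a 150-token prompt and a 400-token response.
print(combined_guard_tokens(150, 400, guard_output=False))  # input only
print(combined_guard_tokens(150, 400))                      # input and output
print(combined_guard_tokens(150, 400, output_guard_sees_prompt=True))
```

Whether guarding both sides really doubles the volume depends on how long responses are relative to prompts, so it is worth measuring both averages rather than assuming a factor of two.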

Self-hosted vs API guards: OpenAI’s moderation endpoint is currently provided at no token cost to OpenAI API customers, making the dollar overhead zero and leaving latency as the only cost. Self-hosted models like Llama Guard run on your own compute; estimate a cost per token based on your GPU or inference service pricing.
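One rough way to turn self-hosted compute into a per-token price is to divide the hourly hardware cost by the tokens actually processed in that hour. The sketch below assumes a single GPU billed by the hour and a measured guard throughput; the function name, the utilization parameter, and all sample figures are hypothetical.

```python
def effective_price_per_1k(gpu_cost_per_hour, tokens_per_second, utilization=1.0):
    """Estimate an effective guard price per 1K tokens for a self-hosted model."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")

    # Tokens processed per billed hour; idle time still costs money,
    # so lower utilization raises the effective price.
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_cost_per_hour / tokens_per_hour * 1000


# Hypothetical figures: a $1.80/hour GPU, 2,500 tokens/second at peak,
# busy 40% of the time.
price = effective_price_per_1k(1.80, 2500, utilization=0.4)
print(f"effective guard price: ${price:.4f} per 1K tokens")
```

The result can be entered as the guard price per 1K tokens. It ignores engineering time, storage, and networking, so treat it as a lower bound.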

Tips and notes

Guard models are small, so token-for-token they are far cheaper than your main model — the overhead is usually single-digit percent.
If your main model is expensive, every request the guard blocks saves a full generation, which can make the guard net-positive on cost alone.
Guarding both input and output roughly doubles guard token volume; enter the combined figure.
The free OpenAI moderation endpoint makes the dollar overhead zero — at that point the only cost is the added latency of one extra round trip, so keep the guard call fast and parallel where you can.
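As a sketch of the "parallel" tip, the guard check and the main-model call can be started together so that the guard's latency is hidden behind generation. Both coroutines below are stand-ins that only sleep; they are hypothetical placeholders, not real API clients, and the string check is a toy policy.

```python
import asyncio


async def guard_check(prompt: str) -> bool:
    """Stand-in for a guard call; returns True when the prompt is allowed."""
    await asyncio.sleep(0.05)           # simulated guard latency
    return "forbidden" not in prompt    # toy policy for illustration only


async def main_model(prompt: str) -> str:
    """Stand-in for the main LLM call."""
    await asyncio.sleep(0.2)            # simulated generation latency
    return f"answer to: {prompt}"


async def guarded_call(prompt: str):
    """Run guard and main model concurrently; drop the answer if blocked."""
    guard_task = asyncio.create_task(guard_check(prompt))
    main_task = asyncio.create_task(main_model(prompt))

    if not await guard_task:
        # Blocked: cancel generation and wait for the cancellation to settle.
        main_task.cancel()
        await asyncio.gather(main_task, return_exceptions=True)
        return None
    return await main_task


if __name__ == "__main__":
    print(asyncio.run(guarded_call("hello")))
    print(asyncio.run(guarded_call("a forbidden request")))
```

The trade-off is that a blocked request may already have incurred part of a main-model call before it is cancelled, so running in parallel gives up some of the saved generation spend in exchange for lower latency. Running the guard first keeps the full saving.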