What is response prefilling?

Prefilling means you start the assistant's reply yourself by providing the first tokens of its response. The model then continues from there. Anthropic's API supports this by ending the messages array with an assistant role; some other providers offer similar partial-completion features.

How does prefilling save money?

You are not billed for output you supply — only for tokens the model generates. If the model would otherwise spend 60 tokens producing a fixed preamble or JSON scaffold, prefilling that text removes those 60 generated output tokens from every request.

Does prefilling do more than save tokens?

Yes. It also constrains the output format. Prefilling an opening brace forces the model straight into JSON, and prefilling a header skips chatty preambles. That improves reliability of structured output as well as cutting cost.

Why enter two different token numbers?

The prefill field is what you supply; the unprefilled field is what the model would have generated to reproduce that prefix. They differ when your prefill is more compact than the model's natural verbose version. Savings are based on the generated tokens you avoid.

Is my data sent anywhere?

No. The calculator runs entirely in your browser. Nothing you enter is uploaded, stored or logged.

What is the Prompt Prefill Token Cost Calculator?

Free response prefill cost calculator. Prefilling the assistant turn (JSON braces, headers, a partial answer) means the model generates fewer output tokens. Enter your prefill size and request volume to see daily and monthly savings. It runs free in your browser on Gera Tools, with nothing uploaded.

Prompt Prefill Token Cost Calculator

Name: Prompt Prefill Token Cost Calculator
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Get one useful tool a week

Like this tool? Enter your email and we'll send you one genuinely useful Gera tool a week — plus a link to come back to this one. No spam, one-click unsubscribe any time.

Prompt prefill token cost calculator

Output tokens are the expensive half of an LLM bill — typically three to five times the price of input tokens. Prefilling the assistant’s response lets you supply the boilerplate yourself instead of paying the model to generate it. This tool quantifies exactly how much you save per request, per day and per month.

How it works

When you prefill, you provide the first part of the assistant’s reply and the model continues from there. You are billed only for what the model generates, not for the prefix you wrote. So the saving per request is the number of output tokens the model no longer has to produce:

saved_output_tokens = tokens the model would have generated for that prefix
saving = (saved_output_tokens / 1,000,000) × output_price

Multiply by your daily request volume to get daily savings, and by thirty for a monthly figure. The savings compound with scale — a tiny per-request gain becomes meaningful across millions of calls.

Practical prefill patterns

Forcing JSON output — instead of asking “Respond in JSON format”, prefill { and the model begins its output from the first key. This is the most common and reliable use case. The savings depend on how much preamble the model would otherwise write before starting the JSON:

// prefill: {"result":
// model continues: [1, 2, 3], "status": "ok"}

Skipping preambles — many models open responses with “Sure, here is the…”, “Certainly, I can help you with…” or similar phrases before the actual content. Prefilling the first word of the real content jumps straight to the answer. Even 20–30 tokens per request adds up to significant cost at millions of calls.

Structured reports — if every response should start with a fixed header (a timestamp, a document title, or a section label), supply it as prefill. The model inherits the structure without generating it.

Partial answers — for chain-of-thought or fill-in-the-middle tasks, a partial sentence can steer the model’s continuation. For example, prefilling “The error is caused by” before asking for a diagnosis can reduce meandering.

Prefill and safety

Some prefill patterns interact with model safety systems. Very directive prefills that push the model into a constrained or contradictory state can produce lower-quality or unexpected completions. Always test your prefill on a representative sample before deploying at scale. Providers may also differ in whether they allow prefills that start mid-word, end with whitespace, or contain only partial JSON structures.

Tips and notes

The biggest wins come from removing predictable boilerplate: opening JSON braces, fixed report headers, “Sure, here is…” preambles, or a known sentence stem. Prefilling does double duty — besides cutting cost it forces the output into the shape you want, which is the most reliable way to get clean JSON out of a model. Keep prefills short and unambiguous; an overly long prefill can box the model into a structure it cannot complete naturally. Always test that the model continues coherently from your prefix, and remember that some providers strip trailing whitespace from the prefill, which can affect formatting.