Why compare prompt and completion separately?

Most providers price output tokens several times higher than input tokens, so a token trimmed from the completion usually saves more than a token trimmed from the prompt. The analyzer varies each independently so you can see whether shaving the prompt or capping the output gives the bigger reduction for your workload.

When does output length dominate the bill?

When completions are long relative to the prompt, or when the model's output price is much higher than its input price. In those cases setting a tight max_tokens and asking for concise answers usually beats prompt trimming.

When does prompt length dominate?

For retrieval-augmented or long-context tasks where you stuff large documents in and ask for a short answer. There the input tokens drive cost, so compression, caching and chunking pay off most.

Is my data sent anywhere?

No. All calculations run in your browser. Nothing you enter is uploaded, stored or logged.

What is the Cost Sensitivity Analyzer?

The Cost Sensitivity Analyzer is a free tool for exploring what drives LLM request cost. Enter your base prompt and completion token counts and a model, then see how total cost changes as each varies by ±50%, revealing whether trimming the prompt or capping the output gives you the bigger saving. It runs in your browser on Gera Tools, with nothing uploaded.

Name: Cost Sensitivity Analyzer
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Cost sensitivity analyzer

Should you spend effort trimming the prompt or capping the completion? It depends on which one your bill is sensitive to. This analyzer takes your baseline tokens and model, then sweeps prompt and completion length independently across a range so you can see exactly which lever moves total cost more — and stop optimizing the one that barely matters.

How it works

The cost of one request is:

cost = (input_tokens / 1,000,000) × input_price
     + (output_tokens / 1,000,000) × output_price

The tool holds one value at its baseline and varies the other from −range to +range (e.g. −50% to +50%), then repeats with the roles swapped. Because output is priced higher, the completion curve is usually steeper — but the result depends on your specific ratio. The analyzer reports the cost at the low and high end of each sweep and the swing each lever produces, so the dominant factor is obvious.
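The sweep described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the function names, the example token counts, and the $3 / $15 per-million-token prices are all assumptions chosen for the example.

```python
def request_cost(input_tokens, output_tokens, input_price, output_price):
    """Dollar cost of one request; prices are per 1,000,000 tokens."""
    return ((input_tokens / 1_000_000) * input_price
            + (output_tokens / 1_000_000) * output_price)


def sweep(input_tokens, output_tokens, input_price, output_price, range_pct=50):
    """Vary one lever from -range_pct to +range_pct while the other stays at baseline."""
    low_factor = 1 - range_pct / 100
    high_factor = 1 + range_pct / 100

    def cost(prompt_scale, completion_scale):
        return request_cost(input_tokens * prompt_scale,
                            output_tokens * completion_scale,
                            input_price, output_price)

    report = {}
    for lever, low, high in (
        ("prompt", cost(low_factor, 1), cost(high_factor, 1)),
        ("completion", cost(1, low_factor), cost(1, high_factor)),
    ):
        # "swing" is the spread between the two ends of the sweep.
        report[lever] = {"low": low, "high": high, "swing": high - low}
    return report


# Hypothetical baseline: 1,000 prompt tokens, 500 completion tokens,
# at assumed prices of $3 (input) and $15 (output) per million tokens.
report = sweep(1_000, 500, input_price=3.00, output_price=15.00)
for lever, r in report.items():
    print(f"{lever:<10} low=${r['low']:.5f}  high=${r['high']:.5f}  swing=${r['swing']:.5f}")
```

With these assumed numbers the completion swing is about $0.0075 per request against about $0.003 for the prompt, so the completion is the dominant lever even though it has half as many tokens.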

When each lever dominates

The answer changes with your specific token profile. Two patterns illustrate the extremes:

Output-dominated requests happen when completions are long relative to the prompt. A creative writing endpoint, a code-generation tool, or a long-form analysis feature all tend to have this shape. Here the output token count has much more leverage on total cost than the prompt length. Tightening the prompt from 500 to 400 tokens saves relatively little when the completion is 2,000 tokens; capping the completion from 2,000 to 1,500 tokens saves far more.

Input-dominated requests happen in retrieval-augmented generation (RAG) and other long-context workflows where you inject large context documents into every call and ask for a short answer. If your prompt is 8,000 tokens and the reply is 200 tokens, almost all your cost is in the input. Here compression of retrieved context, smarter chunking, and prompt caching are the productive optimizations.

Most real products fall somewhere between these extremes, and the analyzer puts the actual numbers on the difference so you do not have to guess.
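To put rough numbers on the two extremes, the sketch below splits the cost of each example profile into its input and output share. The token counts come from the examples above; the prices ($3 input, $15 output per million tokens) are assumed for illustration and do not belong to any particular model.

```python
INPUT_PRICE = 3.00    # assumed dollars per 1,000,000 input tokens
OUTPUT_PRICE = 15.00  # assumed dollars per 1,000,000 output tokens


def cost_split(input_tokens, output_tokens):
    """Return (input_cost, output_cost) in dollars for one request."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE
    return input_cost, output_cost


profiles = {
    "output-dominated": (500, 2_000),   # short prompt, long completion
    "input-dominated": (8_000, 200),    # large context, short answer
}

shares = {}
for name, (prompt_tokens, completion_tokens) in profiles.items():
    input_cost, output_cost = cost_split(prompt_tokens, completion_tokens)
    total = input_cost + output_cost
    shares[name] = (input_cost / total, output_cost / total)
    print(f"{name}: input {shares[name][0]:.0%}, output {shares[name][1]:.0%} of ${total:.4f}")

# Savings from the two trims in the output-dominated example.
prompt_trim_saving = (500 - 400) / 1_000_000 * INPUT_PRICE
completion_cap_saving = (2_000 - 1_500) / 1_000_000 * OUTPUT_PRICE
print(f"prompt 500 -> 400 saves ${prompt_trim_saving:.4f}")
print(f"completion 2,000 -> 1,500 saves ${completion_cap_saving:.4f}")
```

Under these assumed prices, output is roughly 95% of the cost in the first profile and input is roughly 89% in the second, which is why the productive optimization differs between them.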

How model pricing shapes the analysis

Providers set separate prices for input and output tokens. For frontier models, output tokens are typically priced two to four times higher than input tokens. Some smaller models have a much narrower gap, or charge the same rate for both. When you change models (for example, routing cheaper queries to a smaller model), the dominant lever can flip entirely. Re-running the analysis after a model change is worthwhile because the cost structure may be completely different even if your token profile is the same.
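A small sketch shows how the same token profile can point at different levers under different price sheets. Both price sheets below are hypothetical, as are the model names and the 3,000 / 1,000 token profile; substitute your provider's published rates before drawing conclusions.

```python
# Hypothetical price sheets: (input, output) dollars per 1,000,000 tokens.
PRICE_SHEETS = {
    "model_a": (3.00, 15.00),  # assumed wide input/output gap
    "model_b": (0.50, 1.00),   # assumed narrow input/output gap
}


def dominant_lever(input_tokens, output_tokens, input_price, output_price):
    """Return the lever whose symmetric percentage sweep moves total cost more.

    For a symmetric sweep the swing of each lever is proportional to that
    lever's share of the baseline cost, so comparing the two cost components
    is enough.
    """
    input_cost = input_tokens / 1_000_000 * input_price
    output_cost = output_tokens / 1_000_000 * output_price
    return "prompt" if input_cost > output_cost else "completion"


# Same hypothetical token profile evaluated against both price sheets.
prompt_tokens, completion_tokens = 3_000, 1_000
results = {
    model: dominant_lever(prompt_tokens, completion_tokens, in_price, out_price)
    for model, (in_price, out_price) in PRICE_SHEETS.items()
}
for model, lever in results.items():
    print(f"{model}: dominant lever is the {lever}")
```

With these assumed prices the completion dominates on model_a and the prompt dominates on model_b, even though the token counts never change.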

Tips for acting on the result

If output dominates: set a tighter max_tokens, ask for concise answers, and avoid prompting for verbose explanations you don’t read.
If input dominates: compress the prompt, dedupe context, and use prompt caching so stable context bills at a fraction of the input rate.
Re-check after model changes. Switching to a model with different input/output pricing can flip which lever matters.
Optimize the dominant lever first. A 20% cut on the lever that drives 80% of cost saves 16% of the total, which beats a 50% cut on the one that drives 20%, which saves only 10%.
Use the sensitivity range to set realistic targets. If cutting prompt tokens by 30% only moves total cost by 5%, that gives you a realistic expectation before investing engineering time in prompt compression.
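The target-setting arithmetic behind these tips can be sketched as follows. The function name, the 1,000 / 1,000 token profile, and the $3 / $15 per-million-token prices are assumptions picked so that the numbers line up with the 30% and 5% figures in the last tip.

```python
def total_cost_reduction(input_tokens, output_tokens, input_price, output_price,
                         prompt_cut=0.0, completion_cut=0.0):
    """Fraction of total request cost saved when each lever is cut by a fraction.

    Prices are per 1,000,000 tokens; cuts are fractions such as 0.30 for 30%.
    """
    input_cost = input_tokens / 1_000_000 * input_price
    output_cost = output_tokens / 1_000_000 * output_price
    saved = input_cost * prompt_cut + output_cost * completion_cut
    return saved / (input_cost + output_cost)


# Hypothetical profile: 1,000 prompt and 1,000 completion tokens
# at assumed prices of $3 (input) and $15 (output) per million tokens.
profile = dict(input_tokens=1_000, output_tokens=1_000,
               input_price=3.00, output_price=15.00)

prompt_effect = total_cost_reduction(**profile, prompt_cut=0.30)
completion_effect = total_cost_reduction(**profile, completion_cut=0.30)

print(f"30% fewer prompt tokens     -> {prompt_effect:.1%} lower total cost")
print(f"30% fewer completion tokens -> {completion_effect:.1%} lower total cost")
```

In this assumed profile the prompt accounts for one sixth of the cost, so a 30% prompt cut lowers the total by about 5%, while the same percentage cut on the completion lowers it by about 25%.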