Why does the input/output split matter?

Output tokens usually cost 3 to 5 times more than input tokens. A workload that looks input-heavy by token count can still be dominated by output cost, which changes what you should optimize.

What is a typical prompt-to-completion ratio?

It varies widely — summarization is input-heavy while content generation is output-heavy. There is no universal number; the point is to measure yours and act on it.

How does this help me cut costs?

If output dominates your bill, capping max_tokens and tightening instructions pays off most. If input dominates, prompt trimming and caching matter more. The split tells you where to spend effort.

Is my data sent anywhere?

No. The analyzer runs entirely in your browser. Nothing you enter is uploaded, stored or logged.

What is the Prompt-to-Completion Cost Ratio Analyzer?

The Prompt-to-Completion Cost Ratio Analyzer is a free tool that breaks down your workload's prompt vs completion token split and shows where your spend actually goes across models. Because input and output tokens are priced differently, the token split and the cost split can differ. It runs in your browser on Gera Tools, with nothing uploaded.

Name: Prompt-to-Completion Cost Ratio Analyzer
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Prompt-to-completion cost ratio analyzer

Because providers price input and output tokens at different rates, the token split of your workload is not the same as its cost split. A request that is 80% input by token count can still be mostly output by cost. This analyzer breaks down both so you optimize the side that actually drives your bill.

How it works

The analyzer takes your average prompt and completion token counts and applies the model’s separate prices:

input_cost  = prompt_tokens     / 1,000,000 × input_price
output_cost = completion_tokens / 1,000,000 × output_price
input_share = input_cost / (input_cost + output_cost)

It then reports the token ratio (prompt vs completion) and the cost ratio side by side. The divergence between the two is the insight — it reveals when a small amount of output is quietly dominating your spend.
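
As a minimal sketch of the formulas above (not the analyzer's actual source), the calculation could look like this in Python. The `CostSplit` type and `cost_split` function are hypothetical names introduced for illustration, and the prices in the usage lines are illustrative rather than any provider's real rates.

```python
from dataclasses import dataclass


@dataclass
class CostSplit:
    input_cost: float         # USD spent on prompt tokens
    output_cost: float        # USD spent on completion tokens
    input_token_share: float  # prompt tokens / total tokens
    input_cost_share: float   # input cost / total cost


def cost_split(prompt_tokens: int, completion_tokens: int,
               input_price: float, output_price: float) -> CostSplit:
    """Prices are USD per 1,000,000 tokens, matching the formulas above."""
    input_cost = prompt_tokens / 1_000_000 * input_price
    output_cost = completion_tokens / 1_000_000 * output_price
    total_tokens = prompt_tokens + completion_tokens
    total_cost = input_cost + output_cost
    # Guard against division by zero for empty workloads or zero prices.
    input_token_share = prompt_tokens / total_tokens if total_tokens else 0.0
    input_cost_share = input_cost / total_cost if total_cost else 0.0
    return CostSplit(input_cost, output_cost, input_token_share, input_cost_share)


# Illustrative request: 800 prompt tokens, 200 completion tokens.
split = cost_split(800, 200, input_price=0.50, output_price=1.50)
print(f"input share by tokens: {split.input_token_share:.0%}")
print(f"input share by cost:   {split.input_cost_share:.0%}")
```

Printing the two shares next to each other mirrors the side-by-side report: the gap between them is the number to watch.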

Why the token ratio and the cost ratio diverge

The key fact is that output tokens typically cost several times more than input tokens, depending on the model. This pricing asymmetry means that even a small amount of output can represent a disproportionate share of cost.

For a concrete illustration: suppose a model charges $0.50 per million input tokens and $1.50 per million output tokens. A request with 500 input tokens and 500 output tokens then costs $0.00025 for input and $0.00075 for output, so it is 75% output by cost even though the token split is exactly 50/50.

This matters because the correct optimization depends on which side dominates your cost, and token count alone is a misleading signal for that decision.
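
One way to make the divergence concrete is to compute the break-even point: the share of tokens that must be output before output cost overtakes input cost. The sketch below derives it from the cost formulas; `breakeven_output_token_share` is a hypothetical helper name, and the prices are the illustrative ones from the example above.

```python
def breakeven_output_token_share(input_price: float, output_price: float) -> float:
    """Share of total tokens that must be output for the two costs to be equal.

    Setting completion_tokens * output_price == prompt_tokens * input_price
    and solving for completion_tokens / (prompt_tokens + completion_tokens)
    gives input_price / (input_price + output_price).
    """
    return input_price / (input_price + output_price)


# Illustrative prices: $0.50 per million input, $1.50 per million output.
threshold = breakeven_output_token_share(0.50, 1.50)
print(f"output dominates cost once it exceeds {threshold:.0%} of tokens")
```

With output priced at three times input, output becomes the larger cost as soon as it is more than a quarter of the tokens. At five times, the threshold falls to one sixth, which is how a request that is 80% input by tokens can still be mostly output by cost.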

Common workload types and their typical cost split

Different product use cases have very different prompt-to-completion profiles, which determines where optimization effort should go:

Document summarization: typically input-heavy. Long documents are processed (many input tokens), but summaries are short (few output tokens). Even with output priced higher, input often dominates because the token volume is so asymmetric. The most impactful optimization here is reducing unnecessary context: chunking, filtering, or preprocessing documents before sending.

Content generation: output-heavy by design. A short creative brief produces many paragraphs of copy, so output cost usually dominates even when the prompt is fairly long. The most impactful optimization is constraining output length (max_tokens, explicit word-count instructions) and evaluating whether a cheaper generation model is acceptable.

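Classification and extraction: generally input-leaning, because the output is a short label or a handful of fields, and few-shot examples in the prompt add further input tokens. Extraction that returns long structured output can be closer to balanced. Caching stable system prompts with provider-level prompt caching (where available) can reduce cost significantly.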

Conversational / multi-turn: input cost grows with conversation history because prior turns are re-sent on every request. Strategies like summarizing older context and replacing it with a condensed version keep input tokens bounded.
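
The multi-turn case is easy to underestimate, so here is a small sketch of how prompt size grows when every request re-sends the full history. It assumes a simple model in which each turn is a (user tokens, reply tokens) pair and nothing is trimmed or summarized. The function name and the token counts are hypothetical.

```python
def input_tokens_per_request(turns: list[tuple[int, int]],
                             system_tokens: int = 0) -> list[int]:
    """Prompt size of each request when all prior turns are re-sent.

    turns: one (user_tokens, reply_tokens) pair per conversation turn.
    """
    history = system_tokens
    per_request = []
    for user_tokens, reply_tokens in turns:
        history += user_tokens
        per_request.append(history)  # prompt sent for this turn
        history += reply_tokens      # the reply is re-sent on later turns
    return per_request


# Three identical turns: 100 user tokens, 200 reply tokens, 50-token system prompt.
sizes = input_tokens_per_request([(100, 200)] * 3, system_tokens=50)
print(sizes, "total billed input tokens:", sum(sizes))
```

Each request is larger than the previous one by a full turn, so total input tokens grow roughly with the square of the number of turns. That is why bounding the history matters more as conversations get longer.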

Acting on the results

Once you know which side dominates your bill:

If output dominates:

Add explicit length constraints to your prompt (“reply in under 150 words,” “use bullet points not prose”)
Set max_tokens to a value that captures 95th-percentile completions rather than the theoretical maximum
Consider whether a model with lower output pricing serves your quality needs
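
To pick a max_tokens value from real traffic rather than a guess, one option is a nearest-rank percentile over logged completion lengths. This is a sketch under the assumption that you already record completion token counts per request. `percentile_nearest_rank` is a hypothetical helper and the sample data is made up.

```python
def percentile_nearest_rank(values: list[int], percent: int) -> int:
    """Nearest-rank percentile; percent is an integer from 1 to 100."""
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    ordered = sorted(values)
    # Integer ceiling of percent/100 * n, avoiding float rounding surprises.
    rank = (percent * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Made-up completion lengths (tokens) from 20 logged requests.
completion_lengths = [10 * i for i in range(1, 21)]
suggested_cap = percentile_nearest_rank(completion_lengths, 95)
print("suggested max_tokens:", suggested_cap)
```

Completions longer than the cap will be truncated, so it is worth checking that the remaining tail is acceptable for your use case before lowering the limit.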

If input dominates:

Trim the fixed system prompt — every token removed reduces cost on every call
Investigate provider-level prompt caching for stable context blocks
Evaluate whether retrieved context can be pre-filtered before inclusion
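
Because a fixed system prompt is billed on every call, the saving from trimming it scales with call volume. The arithmetic can be sketched as follows. `trim_savings` is a hypothetical name, and the price and volume are illustrative assumptions.

```python
def trim_savings(tokens_removed: int, calls: int, input_price: float) -> float:
    """USD saved by removing tokens from a prompt sent on every call.

    input_price is USD per 1,000,000 input tokens.
    """
    return tokens_removed * calls / 1_000_000 * input_price


# Removing 200 tokens from the system prompt across one million calls
# at an illustrative $0.50 per million input tokens.
print(f"saved: ${trim_savings(200, 1_000_000, 0.50):.2f}")
```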

If both are significant:

Profile the longest calls separately — a small percentage of long-tail requests often drives a disproportionate share of cost
Look for calls that can be routed to a smaller, cheaper model for simpler sub-tasks
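
To check whether a long tail is driving your bill, you can measure what share of total cost comes from the most expensive few percent of requests. The sketch below assumes you have a per-request cost for each call. The function name and the sample costs are hypothetical.

```python
def top_share_of_cost(request_costs: list[float], top_percent: int = 5) -> float:
    """Fraction of total cost contributed by the costliest top_percent of requests."""
    if not request_costs:
        return 0.0
    ordered = sorted(request_costs, reverse=True)
    # Always include at least one request in the "top" group.
    k = max(1, len(ordered) * top_percent // 100)
    total = sum(ordered)
    return sum(ordered[:k]) / total if total else 0.0


# Made-up workload: 95 cheap requests and 5 expensive ones.
costs = [1.0] * 95 + [20.0] * 5
print(f"top 5% of requests account for {top_share_of_cost(costs):.0%} of cost")
```

If the top few percent account for a large share, those requests are the ones worth profiling separately or routing differently.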

Tips and notes

Optimize the dominant side. If output is most of the cost, cap max_tokens and ask for concise answers. If input dominates, trim prompts and cache stable context.
Compare models on your ratio. A model with cheap input but pricey output is great for input-heavy work and bad for generation — match the model to your split.
Watch the ratio drift. As features evolve, your mix changes. Re-check the split periodically so your cost strategy stays aimed at the right target.
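
Comparing models on your own ratio comes down to pricing the same request under each model's rates. The sketch below uses two hypothetical models with made-up prices (USD per million tokens). Neither the names nor the numbers refer to real offerings.

```python
def cost_per_request(prompt_tokens: int, completion_tokens: int,
                     input_price: float, output_price: float) -> float:
    """Cost in USD of one request; prices are USD per 1,000,000 tokens."""
    return (prompt_tokens * input_price + completion_tokens * output_price) / 1_000_000


def cheapest_model(prompt_tokens: int, completion_tokens: int,
                   models: dict[str, tuple[float, float]]) -> str:
    """Name of the model with the lowest cost for this token split."""
    return min(
        models,
        key=lambda name: cost_per_request(prompt_tokens, completion_tokens, *models[name]),
    )


# Hypothetical price list: (input_price, output_price).
models = {
    "model-a": (0.50, 1.50),  # cheap input, pricier output
    "model-b": (1.00, 1.00),  # flat pricing
}

print("input-heavy (9000 in / 100 out):", cheapest_model(9000, 100, models))
print("output-heavy (100 in / 2000 out):", cheapest_model(100, 2000, models))
```

The same price list produces a different winner depending on the split, which is the reason to compare models against your measured ratio rather than against headline prices.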