What does max_tokens actually do?

max_tokens sets a hard ceiling on the number of tokens the model will generate in one completion. If a response would exceed it, generation stops early. It protects you from rare runaway outputs that can be 10× longer than typical.

Why cap just above p95 and not p50?

Capping at the median would truncate nearly half of normal responses. Capping a little above p95 lets 95%+ of responses finish naturally while still clipping the expensive long tail that drives variance in your bill.
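
To see the difference on your own logs, you can count how many observed completions a given cap would cut off. The snippet below is a minimal sketch: the function name and the 20-value sample are made up for illustration, and 460 is simply the sample's p95 (400) times 1.15.

```python
def truncation_rate(completion_tokens, cap):
    """Fraction of observed completions longer than `cap` (these would be cut off)."""
    if not completion_tokens:
        raise ValueError("need at least one observation")
    truncated = sum(1 for n in completion_tokens if n > cap)
    return truncated / len(completion_tokens)


# Made-up sample of 20 completion lengths, for illustration only.
sample = [120, 140, 150, 160, 170, 180, 190, 195, 200, 200,
          205, 210, 220, 230, 250, 270, 300, 340, 400, 2000]

rate_at_median = truncation_rate(sample, cap=200)   # cap at p50
rate_above_p95 = truncation_rate(sample, cap=460)   # cap a little above p95

print(f"cap at p50:       {rate_at_median:.0%} of responses truncated")
print(f"cap just above p95: {rate_above_p95:.0%} of responses truncated")
```

On this sample the median cap cuts off half of the responses, while the cap above p95 only touches the single runaway output.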

Will a cap hurt quality?

Only if it truncates legitimate responses. The recommended cap includes headroom above p95, so the vast majority of outputs are unaffected. If your task genuinely needs long answers, raise the percentile target accordingly.

Is my data sent anywhere?

No. The calculation is done entirely in your browser. Nothing you enter is uploaded, stored or logged.

What is the Cost-Aware max_tokens Setter?

Enter your completion-length percentiles (p50, p95, p99) and daily volume to get a recommended max_tokens cap that captures the natural output distribution while preventing runaway long completions from inflating your bill. It runs free in your browser on Gera Tools, with nothing uploaded.

Cost-Aware max_tokens Setter

Name: Cost-Aware max_tokens Setter
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Set max_tokens with data, not guesswork

The max_tokens parameter is the cheapest insurance you can buy against a runaway LLM bill, but most teams either leave it at the default (often far too high) or set it by feel. This tool uses your real completion-length percentiles to recommend a cap that lets normal responses finish while clipping the rare, expensive long tail.

How it works

Output lengths for a given prompt tend to follow a long-tailed distribution: most completions cluster near the median (p50), but a small fraction run very long (p99 and beyond). Output tokens are typically priced higher per token than input tokens, so an uncapped tail quietly inflates your average cost. The tool recommends:

recommended_max_tokens ≈ p95 × 1.15   (rounded up to a clean number)

That leaves at least 95% of responses untouched and clips only the longest few percent, which is where the p99-and-beyond outliers live. The tool then estimates the cost difference between an uncapped average (pulled upward by the tail) and a capped average, scaled to your daily volume.
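
The formula can be sketched in a few lines. This is an illustration, not the tool's actual code: the function name is hypothetical, and rounding up to the next multiple of 50 is an assumed reading of "a clean number".

```python
import math


def recommend_max_tokens(p95, headroom=1.15, step=50):
    """Return p95 * headroom, rounded up to the next multiple of `step`.

    `step=50` is an assumption about what counts as a "clean number".
    """
    if p95 <= 0:
        raise ValueError("p95 must be positive")
    raw = p95 * headroom
    return int(math.ceil(raw / step) * step)


print(recommend_max_tokens(400))    # 400 * 1.15 = 460, rounded up to 500
print(recommend_max_tokens(1000))   # 1000 * 1.15 = 1150, already a multiple of 50
```

Rounding up rather than to the nearest multiple keeps the cap at or above p95 × 1.15, so the headroom is never reduced by the rounding step.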

Why the tail is where the money hides

LLM output length distributions for a given prompt type look roughly like this: most responses cluster within a factor of two of the median, but a few are dramatically longer. By definition only about 1% of requests land above p99, yet that small fraction can consume a disproportionate share of your token budget.

For example, a prompt that typically produces around 200 tokens (p50) might occasionally produce 2,000 tokens (p99). If you pay per output token, that outlier costs ten times as much as a normal response. Even a 1% tail at 10× cost raises your average output spend by about 9% (0.99 × 1 + 0.01 × 10 = 1.09), and at high daily volumes that adds up. The cap attacks this directly: it does not change the cost of normal responses at all and only clips the outliers.
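
The effect of a cap on that example can be worked through with a deliberately crude two-point model: 99% of responses at 200 tokens and 1% at 2,000. The cap of 400, the daily volume, and the price per million output tokens below are all hypothetical values chosen only to make the arithmetic concrete.

```python
# Two-point simplification of the example: real distributions are smoother.
TYPICAL_TOKENS = 200
OUTLIER_TOKENS = 2_000
OUTLIER_SHARE = 0.01

# Hypothetical inputs, not taken from any real price list or workload.
HYPOTHETICAL_CAP = 400
REQUESTS_PER_DAY = 100_000
USD_PER_MILLION_OUTPUT_TOKENS = 10.00


def mean_tokens(cap=None):
    """Average output tokens per request, optionally with a max_tokens cap."""
    def clip(n):
        return n if cap is None else min(n, cap)

    return (1 - OUTLIER_SHARE) * clip(TYPICAL_TOKENS) + OUTLIER_SHARE * clip(OUTLIER_TOKENS)


uncapped = mean_tokens()                  # about 218 tokens per request
capped = mean_tokens(HYPOTHETICAL_CAP)    # about 202 tokens per request

tokens_saved_per_day = (uncapped - capped) * REQUESTS_PER_DAY
daily_savings = tokens_saved_per_day * USD_PER_MILLION_OUTPUT_TOKENS / 1_000_000

print(f"uncapped mean: {uncapped:.0f} tokens, capped mean: {capped:.0f} tokens")
print(f"estimated savings: ${daily_savings:.2f} per day")
```

Under these assumptions the cap removes about 16 tokens per request on average, all of it from the outliers; the typical 200-token responses cost exactly what they did before.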

How to get your own percentiles

You cannot set this cap well without real data. Log usage.completion_tokens (or the equivalent field) from every API response. Once you have a representative sample, typically a few hundred to a few thousand calls, compute percentiles from that distribution. Keep in mind that p99 rests on only the top 1% of observations, so it is the least stable of the three on small samples. Most logging and observability tools (Datadog, Grafana, Honeycomb) compute percentiles directly from a metric. If you are doing it manually:

Sort all observed completion_token values ascending.
p50  = value at position 50% of the way through the list
p95  = value at position 95%
p99  = value at position 99%
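
The manual steps above can be sketched with the nearest-rank method, one of several common percentile conventions (observability tools often interpolate instead, so their numbers may differ slightly). The function name and the sample data are illustrative.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest observed value such that
    at least p% of observations are less than or equal to it."""
    if not values:
        raise ValueError("need at least one observation")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based position in the sorted list
    return ordered[rank - 1]


# Illustrative data: completion_tokens values 1..100 in arbitrary order.
observed = list(range(100, 0, -1))

p50 = percentile(observed, 50)
p95 = percentile(observed, 95)
p99 = percentile(observed, 99)
print(p50, p95, p99)
```

Nearest-rank always returns a value that was actually observed, which makes the result easy to sanity-check against the raw log.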

Separate percentiles by prompt type if your application has multiple very different tasks — a summarization endpoint will have a completely different distribution from a code-generation endpoint.
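
One way to keep the distributions separate is to tag each logged call with its task and compute the percentile per tag. This is a sketch assuming your logs can be reduced to (prompt_type, completion_tokens) pairs; the tag names and data are hypothetical, and the percentile uses the nearest-rank convention.

```python
import math
from collections import defaultdict


def p95_by_prompt_type(records):
    """Group (prompt_type, completion_tokens) pairs and return the
    nearest-rank p95 for each prompt type."""
    groups = defaultdict(list)
    for prompt_type, tokens in records:
        groups[prompt_type].append(tokens)

    result = {}
    for prompt_type, values in groups.items():
        ordered = sorted(values)
        rank = math.ceil(95 * len(ordered) / 100)  # 1-based nearest rank
        result[prompt_type] = ordered[rank - 1]
    return result


# Hypothetical log: short summaries and much longer code generations.
records = [("summarize", n) for n in range(1, 101)]
records += [("codegen", n * 10) for n in range(1, 101)]

per_type_p95 = p95_by_prompt_type(records)
print(per_type_p95)
```

A single pooled p95 over this data would be too generous for the summaries and too tight for the code generations, which is why each endpoint gets its own cap.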

Tips for choosing a cap

Pull real percentiles. Log usage.completion_tokens for a week and compute p50/p95/p99 — guessed numbers give guessed savings.
Tune the target per task. Summaries can cap tight; open-ended writing needs more headroom. Use a higher percentile target for creative tasks.
Combine with streaming. Streaming lets you stop generation in app logic too, giving a second layer of control beyond the hard cap.
Re-check after prompt changes. A new system prompt or format can shift the whole distribution; re-measure and re-cap.
Watch for truncation errors. If your downstream parsing depends on complete JSON or markdown, a cap that truncates mid-structure breaks the parse. Add enough headroom that the model can close its structured output.
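
For the last tip, it helps to check whether a response hit the cap before trying to parse it. The sketch below assumes your provider reports a stop reason whose value is "length" when the cap was reached, as OpenAI-style chat completions do with finish_reason; other providers use different field names and values, so treat the constant as something to adapt. The function name and status strings are hypothetical.

```python
import json

# Assumed stop-reason value meaning "stopped because max_tokens was reached".
# Adjust to whatever your provider actually returns.
CAP_REACHED = "length"


def parse_json_completion(text, finish_reason):
    """Return (parsed_value, status) where status is 'ok', 'truncated', or 'invalid'."""
    if finish_reason == CAP_REACHED:
        # The model was cut off, so the JSON is very likely incomplete.
        return None, "truncated"
    try:
        return json.loads(text), "ok"
    except json.JSONDecodeError:
        # Finished naturally but still not valid JSON: a prompt or model issue.
        return None, "invalid"


print(parse_json_completion('{"summary": "done"}', "stop"))
print(parse_json_completion('{"summary": "cut of', "length"))
```

Counting the "truncated" results over time gives a direct signal that the cap is too tight for a structured-output task, separate from ordinary malformed responses.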