Set max_tokens with data, not guesswork
The max_tokens parameter is the cheapest insurance you can buy against a
runaway LLM bill, but most teams either leave it at the default (often far too
high) or set it by feel. This tool uses your real completion-length
percentiles to recommend a cap that lets normal responses finish while clipping
the rare, expensive long tail.
How it works
Output lengths for a given prompt follow a long-tailed distribution: most completions cluster near the median (p50), but a small fraction run very long (p99 and beyond). Output tokens typically cost more per token than input tokens, so an uncapped tail quietly inflates your average cost per request. The tool recommends:
recommended_max_tokens ≈ p95 × 1.15 (rounded up to a clean number)
Because the cap sits just above p95, at least 95% of responses finish untouched, and only the longest few percent (the region around p99 and beyond) are cut off. The tool then estimates the savings as the difference between the uncapped average output length (skewed toward p99) and the capped average, scaled to your daily volume.
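As a rough illustration of the recommendation and the savings estimate, here is a minimal Python sketch. It is not the tool's actual implementation: the function names, the rounding step of 50 tokens, and the idea of computing the capped average directly from logged lengths are all assumptions made for the example.

```python
import math
from statistics import mean


def recommend_max_tokens(p95, headroom=1.15, step=50):
    """Return p95 * headroom, rounded up to the next multiple of `step`.

    `step` is an assumed definition of "a clean number".
    """
    return math.ceil(p95 * headroom / step) * step


def estimate_daily_savings(lengths, cap, requests_per_day, usd_per_million_output_tokens):
    """Estimate daily savings in USD from capping output at `cap` tokens.

    `lengths` is a sample of logged completion lengths (in tokens).
    Assumes a capped response is billed for exactly `cap` output tokens.
    """
    uncapped_avg = mean(lengths)
    capped_avg = mean([min(n, cap) for n in lengths])
    tokens_saved_per_day = (uncapped_avg - capped_avg) * requests_per_day
    return tokens_saved_per_day * usd_per_million_output_tokens / 1_000_000


# Hypothetical sample: four ordinary responses and one long-tail response.
sample_lengths = [100, 200, 300, 400, 2000]
cap = recommend_max_tokens(p95=420)
savings = estimate_daily_savings(sample_lengths, cap, 10_000, 10.0)
```

With this made-up sample, one 2,000-token outlier doubles the average output length, which is why clipping the tail moves the bill more than its frequency suggests.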
Tips for choosing a cap
- Pull real percentiles. Log usage.completion_tokens for a week and compute p50/p95/p99; guessed numbers give guessed savings.
- Tune the target per task. Summaries can cap tight; open-ended writing needs more headroom. Use a higher percentile target for creative tasks.
- Combine with streaming. Streaming lets you stop generation in app logic too, giving a second layer of control beyond the hard cap.
- Re-check after prompt changes. A new system prompt or format can shift the whole distribution; re-measure and re-cap.
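The first two tips can be sketched in a few lines of Python. This assumes you have already exported your logs as (task, completion_tokens) pairs; that record shape and the function names are hypothetical. It uses the nearest-rank percentile definition, which is one of several common conventions.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer `pct` between 1 and 100.

    Returns the smallest logged value such that at least pct% of the
    data is at or below it.
    """
    if not values:
        raise ValueError("need at least one logged completion length")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def summarize_by_task(records):
    """Group (task, completion_tokens) pairs and compute p50/p95/p99 per task."""
    by_task = {}
    for task, tokens in records:
        by_task.setdefault(task, []).append(tokens)
    return {
        task: {f"p{p}": percentile(lengths, p) for p in (50, 95, 99)}
        for task, lengths in by_task.items()
    }


# Hypothetical log: 100 summary requests with lengths 1..100 tokens.
records = [("summary", n) for n in range(1, 101)]
stats = summarize_by_task(records)
```

Re-running the same summary after a prompt change shows at a glance whether the distribution has shifted enough to justify a new cap.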