Temperature and retries: the hidden cost multiplier
Most LLM cost estimates assume one call equals one bill. In production that is rarely true. When you sample at a high temperature and then validate the output — JSON parsing, schema checks, guardrails — a fraction of calls fail and your code retries. Every retry is billed in full, so your effective cost is higher than the headline per-call price. This calculator shows the real multiplier.
How it works
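Each attempt succeeds with probability 1 − p, where p is your retry (failure)
rate, and attempts are assumed to fail independently. Retry number k only
happens if all k earlier attempts failed, which has probability p^k, so the
expected number of attempts with at most maxRetries retries is a finite
geometric series: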
expected_attempts = 1 + p + p² + … + p^maxRetries
effective_cost = base_cost × expected_attempts
A 20% retry rate with up to 3 retries means 1 + 0.2 + 0.04 + 0.008 = 1.248, or about 1.25 attempts per request on average: a 25% cost uplift. Push temperature up so the failure rate hits 50%, and the same cap gives 1.875 attempts, so you are paying nearly double. The temperature field here is a guide: it nudges a suggested failure rate so you can see the relationship, but you should measure your real retry rate from logs whenever possible.
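The formula above can be checked with a few lines of Python. This is a minimal sketch, not the calculator's actual code: the function names `expected_attempts` and `effective_cost` are illustrative, and it assumes every attempt fails independently with the same probability p.

```python
def expected_attempts(p: float, max_retries: int) -> float:
    """Expected attempts per request: 1 + p + p**2 + ... + p**max_retries."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    # Attempt k+1 happens only if the first k attempts all failed (probability p**k).
    return sum(p ** k for k in range(max_retries + 1))


def effective_cost(base_cost: float, p: float, max_retries: int) -> float:
    """Average billed cost per request once retries are included."""
    return base_cost * expected_attempts(p, max_retries)


if __name__ == "__main__":
    for p in (0.2, 0.5):
        print(f"p={p:.0%}: {expected_attempts(p, 3):.3f} attempts per request")
    # p=20%: 1.248 attempts per request
    # p=50%: 1.875 attempts per request
```

For p below 1 the same sum has the closed form (1 − p^(maxRetries + 1)) / (1 − p); the explicit loop is used here because it also handles p = 1 without a division by zero.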
Tips to cut retry cost
- Lower temperature for structured tasks. Extraction, classification and JSON output rarely benefit from high temperature; a setting of 0–0.3 usually lowers the failure rate.
- Cap retries. An uncapped retry loop on a consistently failing prompt can silently multiply a request’s cost by 5–10×.
- Fix the prompt, not the loop. If retries are high, the prompt or schema is usually the problem — repair it once instead of paying to re-roll the dice.
- Use constrained decoding. Tool-calling or JSON mode removes most parse failures, driving the retry rate toward 0 and the multiplier toward 1.0.
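
As an illustration of the "cap retries" tip, here is a minimal sketch of a bounded retry loop that also reports how many attempts were billed. The helper name `call_with_retries` and the `call_model` callable are hypothetical stand-ins for your own client code, and validation is simplified here to "the output parses as a JSON object".

```python
import json
from typing import Callable, Tuple


def call_with_retries(call_model: Callable[[], str], max_retries: int = 3) -> Tuple[dict, int]:
    """Call the model until its output parses as a JSON object.

    `call_model` is a hypothetical zero-argument function returning raw text.
    Returns (parsed_output, attempts_used); raises RuntimeError if all attempts fail.
    """
    attempts = 0
    for _ in range(max_retries + 1):  # one initial attempt plus the retries
        attempts += 1  # every attempt is billed, whether or not it validates
        raw = call_model()
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # parse failure: retry if the cap allows
        if isinstance(parsed, dict):
            return parsed, attempts
    raise RuntimeError(f"no valid output after {attempts} attempts")


if __name__ == "__main__":
    # Fake model: the first reply is invalid, the second is valid JSON.
    replies = iter(["not json", '{"label": "spam"}'])
    result, attempts = call_with_retries(lambda: next(replies))
    print(result, attempts)
```

Logging `attempts_used` for each request and averaging it over real traffic gives a measured multiplier, which can be compared with the estimate from the formula.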