Self-hosted inference endpoint cost calculator
Running your own LLM endpoint on rented or owned GPUs can be cheaper than a hosted API, but only under the right conditions. This calculator converts your GPU hourly cost, throughput, and utilization into a cost per million tokens, then shows it next to an equivalent API price so you can see whether self-hosting actually wins.
How self-hosted inference cost works
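A GPU bills by the hour regardless of how busy it is, so the unit you care about is tokens produced per dollar of GPU time. With throughput_tps as sustained tokens per second and utilization as the fraction of billed time spent serving requests (0 to 1), the math is: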
tokens_per_hour = throughput_tps × 3600
effective_tph = tokens_per_hour × utilization
cost_per_million = (gpu_hourly_cost / effective_tph) × 1,000,000
The killer term is utilization. At 100% utilization a GPU at $2/hr doing 2,000 tokens/sec produces 7.2 million tokens per hour, which works out to about $0.28 per million tokens. Drop utilization to 20% and the same hardware costs about $1.39 per million, five times as much, because you pay for the idle 80% of the day too.
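As a minimal sketch, the three formulas above can be written as one small Python function. The function and parameter names are illustrative and not taken from the calculator itself; utilization is assumed to be a fraction between 0 and 1.

```python
def cost_per_million_tokens(gpu_hourly_cost: float,
                            throughput_tps: float,
                            utilization: float) -> float:
    """Cost in dollars per one million generated tokens.

    gpu_hourly_cost: what one GPU-hour costs, in dollars
    throughput_tps:  sustained tokens per second while busy
    utilization:     fraction of billed time spent serving (0 < u <= 1)
    """
    if throughput_tps <= 0:
        raise ValueError("throughput_tps must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")

    tokens_per_hour = throughput_tps * 3600
    effective_tph = tokens_per_hour * utilization
    return gpu_hourly_cost / effective_tph * 1_000_000


# The worked example from the text: $2/hr, 2,000 tokens/sec.
full = cost_per_million_tokens(2.0, 2000, 1.0)
low = cost_per_million_tokens(2.0, 2000, 0.2)
print(f"{full:.2f}")  # 0.28
print(f"{low:.2f}")   # 1.39 (exactly five times the unrounded 100% figure)
```

Because utilization sits in the denominator, cost scales with its inverse: halving utilization doubles the cost per million tokens.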
Tips for an honest comparison
- Measure real throughput. Use steady-state tokens/sec under your actual batch size, not the marketing peak.
- Be honest about utilization. Most internal endpoints sit far below 50%. Spiky traffic without autoscaling is where self-hosting quietly loses.
- Count everything in the hourly rate. For owned hardware, amortize the card over its life and add power and hosting; for cloud, use the on-demand or committed rate you will really pay.
- Remember the API floor. Hosted APIs bill per token with no idle cost, so for bursty or low-volume workloads they are almost always cheaper.
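The tips above can be sketched in code under some loudly stated assumptions. The helper names, the straight-line amortization over round-the-clock operation, and every dollar figure in the example (including the $1.00 per million API price) are hypothetical placeholders, not real quotes. The first helper folds purchase price, power, and hosting into one hourly rate; the second finds the utilization at which self-hosting matches a given API price.

```python
HOURS_PER_YEAR = 24 * 365


def owned_gpu_hourly_cost(purchase_price: float,
                          lifetime_years: float,
                          power_kw: float,
                          price_per_kwh: float,
                          hosting_per_hour: float = 0.0) -> float:
    """All-in hourly cost of an owned card.

    Assumes straight-line amortization with the card powered on 24/7
    for its whole service life (a simplification).
    """
    amortized = purchase_price / (lifetime_years * HOURS_PER_YEAR)
    return amortized + power_kw * price_per_kwh + hosting_per_hour


def break_even_utilization(gpu_hourly_cost: float,
                           throughput_tps: float,
                           api_price_per_million: float) -> float:
    """Utilization (as a fraction) at which self-hosting costs the same
    per million tokens as the API. A result above 1.0 means self-hosting
    cannot match the API price at this throughput."""
    cost_at_full_utilization = gpu_hourly_cost / (throughput_tps * 3600) * 1_000_000
    return cost_at_full_utilization / api_price_per_million


# Hypothetical inputs: $2/hr GPU, 2,000 tokens/sec, API at $1.00 per million.
u = break_even_utilization(2.0, 2000, 1.00)
print(f"{u:.1%}")  # 27.8%

# Hypothetical owned card: $30,000, 3-year life, 0.7 kW, $0.12/kWh, $0.10/hr hosting.
hourly = owned_gpu_hourly_cost(30_000, 3, 0.7, 0.12, hosting_per_hour=0.10)
print(f"${hourly:.2f}/hr")
```

With these placeholder numbers, an endpoint that stays below roughly 28% utilization would cost more per token than the API, which is why the utilization estimate matters more than any other input.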