Self-Hosted Inference Endpoint Cost Calculator

Calculate the true cost of running your own LLM inference endpoint vs. a hosted API.

Running your own LLM endpoint on rented or owned GPUs can be cheaper than a hosted API — but only under the right conditions. This calculator turns your GPU hourly cost, throughput and utilization into a real cost per million tokens, then puts it side by side with an equivalent API price so you can see whether self-hosting actually wins.

How self-hosted inference cost works

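A GPU bills by the hour regardless of how busy it is, so what matters is how many tokens each dollar of GPU time actually produces. Inverting that gives the cost per million tokens. With throughput in tokens per second and utilization as a fraction between 0 and 1, the math is: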

tokens_per_hour   = throughput_tps × 3600
effective_tph     = tokens_per_hour × utilization
cost_per_million  = (gpu_hourly_cost / effective_tph) × 1,000,000

The killer term is utilization. At 100% utilization, a GPU at $2/hr doing 2,000 tokens/sec costs about $0.28 per million tokens. Drop utilization to 20% and the same hardware costs about $1.39 per million, five times as much, because you pay for the idle 80% of the day too.
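
The formula can be written as a small Python function. This is a minimal sketch rather than the calculator's actual code: the function name and the input checks are illustrative assumptions, and utilization is assumed to be a fraction between 0 and 1.

```python
def cost_per_million_tokens(gpu_hourly_cost: float,
                            throughput_tps: float,
                            utilization: float) -> float:
    """Dollars per million tokens for a GPU billed by the hour.

    utilization is assumed to be a fraction in (0, 1], not a percentage.
    """
    if throughput_tps <= 0:
        raise ValueError("throughput_tps must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")

    tokens_per_hour = throughput_tps * 3600          # tokens at full load
    effective_tph = tokens_per_hour * utilization    # tokens actually served
    return gpu_hourly_cost / effective_tph * 1_000_000


# The two cases from the text: $2/hr, 2,000 tokens/sec.
print(f"{cost_per_million_tokens(2.0, 2000, 1.0):.2f}")  # 0.28
print(f"{cost_per_million_tokens(2.0, 2000, 0.2):.2f}")  # 1.39
```

Because utilization sits in the denominator, halving it doubles the cost per token.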

Tips for an honest comparison

  • Measure real throughput. Use steady-state tokens/sec under your actual batch size, not the marketing peak.
  • Be honest about utilization. Most internal endpoints sit far below 50%. Spiky traffic without autoscaling is where self-hosting quietly loses.
  • Count everything in the hourly rate. For owned hardware, amortize the card over its life and add power and hosting; for cloud, use the on-demand or committed rate you will really pay.
  • Remember the API floor. Hosted APIs bill per token with no idle cost, so for bursty or low-volume workloads they are almost always cheaper.
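
Two of these tips reduce to simple arithmetic, sketched below in Python. Both helpers are hypothetical illustrations, not the calculator's own code. `owned_hourly_rate` assumes straight-line amortization with no resale value or financing cost. `break_even_utilization` assumes a single blended API price per million tokens, and the $1.00 used in the example is a made-up figure.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; assumes the card is powered on all year


def owned_hourly_rate(purchase_price: float,
                      lifetime_years: float,
                      power_kw: float,
                      electricity_per_kwh: float,
                      hosting_per_hour: float = 0.0) -> float:
    """All-in hourly cost of an owned GPU (straight-line amortization)."""
    amortized = purchase_price / (lifetime_years * HOURS_PER_YEAR)
    power = power_kw * electricity_per_kwh  # dollars per hour of power draw
    return amortized + power + hosting_per_hour


def break_even_utilization(gpu_hourly_cost: float,
                           throughput_tps: float,
                           api_price_per_million: float) -> float:
    """Utilization (as a fraction) at which self-hosting matches the API price.

    A result above 1.0 means self-hosting cannot match the API price
    at this throughput, even when fully busy.
    """
    tokens_per_hour = throughput_tps * 3600
    # What the API would charge for one fully loaded GPU-hour of tokens.
    api_cost_per_full_hour = tokens_per_hour / 1_000_000 * api_price_per_million
    return gpu_hourly_cost / api_cost_per_full_hour


# $2/hr GPU at 2,000 tokens/sec against a hypothetical $1.00 per million API price.
print(f"{break_even_utilization(2.0, 2000, 1.00):.0%}")  # 28%
```

Under those example numbers, an endpoint that stays below roughly 28% utilization would cost more per token than the API.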