Why does utilization matter so much?

A GPU you rent by the hour costs the same whether it is busy or idle. If your endpoint only serves traffic 20% of the day, your effective cost per token is five times higher than the busy-hour rate. Self-hosting only beats an API at consistently high utilization.

What throughput number should I use?

Use the steady-state tokens per second you measure under realistic batch sizes, not the single-request peak. vLLM, TGI and TensorRT-LLM report this. Larger models and longer contexts lower throughput, which raises cost per token.

Does this include input and output tokens separately?

No. Self-hosted cost is driven by total tokens processed (prompt plus generated), because the GPU spends time on both prefill and decode. The tool uses a single tokens-per-second figure for combined throughput, which matches how GPU time is actually consumed.

Is my data sent anywhere?

No. The calculator runs entirely in your browser. Nothing you enter is uploaded, stored or logged.

Self-Hosted Inference Endpoint Cost Calculator

Self-hosted inference endpoint cost calculator

Running your own LLM endpoint on rented or owned GPUs can be cheaper than a hosted API — but only under the right conditions. This calculator turns your GPU hourly cost, throughput and utilization into a real cost per million tokens, then puts it side by side with an equivalent API price so you can see whether self-hosting actually wins.

How self-hosted inference cost works

A GPU bills by the hour regardless of how busy it is, so the unit you care about is tokens produced per dollar of GPU time. The math is:

tokens_per_hour   = throughput_tps × 3600
effective_tph     = tokens_per_hour × utilization
cost_per_million  = (gpu_hourly_cost / effective_tph) × 1,000,000

The killer term is utilization. At 100% utilization a GPU at $2/hr doing 2,000 tokens/sec costs about $0.28 per million tokens. Drop utilization to 20% and the same hardware costs $1.40 per million — five times more — because you pay for the idle 80% of the day too.

Tips for an honest comparison

Measure real throughput. Use steady-state tokens/sec under your actual batch size, not the marketing peak.
Be honest about utilization. Most internal endpoints sit far below 50%. Spiky traffic without autoscaling is where self-hosting quietly loses.
Count everything in the hourly rate. For owned hardware, amortize the card over its life and add power and hosting; for cloud, use the on-demand or committed rate you will really pay.
Remember the API floor. Hosted APIs bill per token with no idle cost, so for bursty or low-volume workloads they are almost always cheaper.