Open-source model hosting cost calculator
“It’s free and open source” only covers the weights — running them is not free. Self-hosting Llama 3, Mistral or Qwen means renting GPUs by the hour, paying for networking and storage, and staffing the operational work to keep it up. This calculator gives you the all-in monthly cost and an honest effective cost per request, so you can compare self-hosting against a managed API on real numbers.
How it works
The model first has to fit in GPU memory. A practical estimate is two bytes per parameter at fp16, plus about 20% for the KV cache and activations:
memory_needed ≈ params_billion × 2 GB × 1.2
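As a minimal sketch of the rule of thumb above, the function below applies the same arithmetic. The function name, parameter names and defaults are illustrative assumptions, not part of the calculator itself.

```python
def estimate_memory_gb(params_billion: float,
                       bytes_per_param: float = 2.0,
                       overhead_factor: float = 1.2) -> float:
    """Rough GPU memory estimate in GB.

    bytes_per_param=2.0 corresponds to fp16 weights; overhead_factor=1.2
    adds about 20% for the KV cache and activations, as in the formula above.
    One billion parameters at one byte each is roughly 1 GB.
    """
    if params_billion <= 0:
        raise ValueError("params_billion must be positive")
    return params_billion * bytes_per_param * overhead_factor


if __name__ == "__main__":
    # Model sizes chosen for illustration only.
    for size in (8, 70):
        print(f"{size}B at fp16: about {estimate_memory_gb(size):.1f} GB")
```

By this estimate a 70B model at fp16 needs about 168 GB, which is more than a single 80 GB GPU provides, so it would have to be split across several GPUs or quantized.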
The instance is billed for all 730 hours in an average month (8,760 hours a year divided by 12), whether busy or idle. The tool adds a networking and storage allowance plus an operational overhead percentage, then divides the total by your monthly request volume and scales the result to an effective cost per thousand requests. It also flags how much of the compute spend is wasted by running below full utilization.
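The arithmetic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the calculator's actual implementation: it assumes the overhead percentage is applied to compute plus the networking and storage allowance, and that idle waste is the unused share of the compute bill. All names and input values are hypothetical.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # average billing hours per month, as in the text


@dataclass
class MonthlyCost:
    compute: float               # GPU instance bill for the full month
    total: float                 # all-in monthly cost
    cost_per_1k_requests: float  # effective cost per thousand requests
    idle_waste: float            # compute spend attributable to idle time


def monthly_cost(hourly_rate: float,
                 network_storage: float,
                 ops_overhead_pct: float,
                 monthly_requests: int,
                 utilization: float) -> MonthlyCost:
    if monthly_requests <= 0:
        raise ValueError("monthly_requests must be positive")
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")

    # Billed for every hour of the month, busy or idle.
    compute = hourly_rate * HOURS_PER_MONTH
    # Assumption: overhead is a percentage on top of compute + allowance.
    total = (compute + network_storage) * (1 + ops_overhead_pct / 100)
    cost_per_1k = total / monthly_requests * 1000
    # Assumption: waste is the share of the compute bill spent while idle.
    idle_waste = compute * (1 - utilization)
    return MonthlyCost(compute, total, cost_per_1k, idle_waste)


if __name__ == "__main__":
    # Hypothetical inputs, not real prices.
    cost = monthly_cost(hourly_rate=2.0, network_storage=140.0,
                        ops_overhead_pct=20.0, monthly_requests=1_000_000,
                        utilization=0.4)
    print(f"total: ${cost.total:,.2f}")
    print(f"per 1k requests: ${cost.cost_per_1k_requests:.2f}")
    print(f"idle waste: ${cost.idle_waste:,.2f}")
```

With these made-up inputs the instance costs $1,460 in compute and about $1,920 all-in, or about $1.92 per thousand requests, and roughly $876 of the compute bill pays for idle time.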
Tips and notes
The single biggest cost lever is utilization. A GPU that is busy only 40% of the time is still billed for 100% of it, so the idle-time waste line is where self-hosting quietly loses money. Push utilization up with continuous batching (vLLM, TGI) and by consolidating workloads onto fewer GPUs. The second lever is quantization: relative to fp16, int8 weights roughly halve the weight memory and int4 weights roughly quarter it, letting a 70B model run on a smaller, cheaper instance. Self-hosting tends to win only at high, steady volume; for spiky or low traffic, a per-token API is usually cheaper and carries far less operational burden. Run this against the LLM API cost calculator with your real volume to find the break-even.
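To make the quantization and break-even points concrete, here is a small sketch. It makes two simplifying assumptions: the 20% overhead is scaled along with the weights (in practice the KV cache does not necessarily shrink when only weights are quantized), and the self-hosted monthly cost is treated as fixed regardless of volume. The function names and all prices are hypothetical.

```python
# Approximate bytes per parameter for each weight precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def memory_gb(params_billion: float, precision: str,
              overhead_factor: float = 1.2) -> float:
    """Rough memory estimate in GB at a given weight precision."""
    return params_billion * BYTES_PER_PARAM[precision] * overhead_factor


def break_even_requests(self_host_monthly_total: float,
                        api_cost_per_1k_requests: float) -> float:
    """Monthly request volume at which a fixed self-hosting bill
    equals what a per-request API would have charged."""
    if api_cost_per_1k_requests <= 0:
        raise ValueError("api_cost_per_1k_requests must be positive")
    return self_host_monthly_total / api_cost_per_1k_requests * 1000


if __name__ == "__main__":
    for precision in ("fp16", "int8", "int4"):
        print(f"70B at {precision}: about {memory_gb(70, precision):.0f} GB")

    # Hypothetical figures: $1,920/month self-hosted, $2.00 per 1k via API.
    volume = break_even_requests(1920.0, 2.0)
    print(f"break-even at about {volume:,.0f} requests per month")
```

Below the break-even volume the API is cheaper under these assumptions; above it, self-hosting starts to pay off, provided the same hardware can actually serve that volume.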