Why is self-hosting often more expensive than an API?

GPU instances bill by the hour whether or not you serve traffic. At low or spiky utilization you pay for idle hardware, which is why managed APIs that bill per token usually win until you reach high, steady volume. This tool shows the idle-time waste directly.

How much GPU memory does a model need?

A rough rule is two bytes per parameter at fp16, so a 70B model needs about 140 GB just for weights, or roughly 168 GB once you add about 20% for the KV cache and activations. Quantizing to int8 or int4 roughly halves or quarters the weight footprint, letting models fit on smaller GPUs.

What is operational overhead?

Self-hosting is not just GPU rental. You need on-call coverage, security patching, autoscaling, monitoring and incident response. The overhead percentage approximates that engineering cost as a fraction of the raw hardware spend.

Is my data sent anywhere?

No. The calculator runs entirely in your browser. Nothing you enter is uploaded, stored or logged.

What is the Open-Source Model Hosting Cost Calculator?

A free self-hosting cost calculator for open-source LLMs. Enter model size, cloud GPU instance, utilization and daily volume to see the all-in monthly cost (GPU rental, networking, storage and operational overhead) plus a GPU memory fit check. It runs in your browser on Gera Tools, with nothing uploaded.

Name: Open-Source Model Hosting Cost Calculator
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Open-source model hosting cost calculator

“It’s free and open source” only covers the weights — running them is not free. Self-hosting Llama 3, Mistral or Qwen means renting GPUs by the hour, paying for networking and storage, and staffing the operational work to keep it up. This calculator gives you the all-in monthly cost and an honest effective cost per request, so you can compare self-hosting against a managed API on real numbers.

GPU memory requirements: the first gate

Before you can think about cost, the model has to fit. A practical estimate is two bytes per parameter at fp16, plus about 20% headroom for the KV cache and activations:

memory_needed ≈ params_billion × 2 GB × 1.2

For common model sizes, this means:

Model size	fp16 estimate	Fits on	Quantized (int4) estimate
7B parameters	~17 GB	1× A100 40GB, 2× 3090	~6 GB — fits on a 3090
13B parameters	~31 GB	1× A100 80GB	~10 GB
34B parameters	~82 GB	2× A100 80GB	~24 GB — fits on 1× A100 80GB
70B parameters	~168 GB	4× A100 80GB	~48 GB, fits on 1× A100 80GB

Quantization (int8 or int4) roughly halves or quarters the memory requirement at some quality cost, letting you run larger models on smaller, cheaper instances.
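The fp16 rule of thumb above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual code: the function names, the `bytes_per_param` parameter and the `fits` helper are assumptions made for this example. Passing a smaller `bytes_per_param` for quantized weights gives only a lower bound, because the KV cache and activations usually stay at higher precision, so real quantized footprints run higher than a straight half or quarter.

```python
def estimate_memory_gb(params_billion: float,
                       bytes_per_param: float = 2.0,
                       headroom: float = 1.2) -> float:
    """Rough GPU memory estimate in GB.

    params_billion  -- model size in billions of parameters
    bytes_per_param -- 2.0 for fp16 (hypothetical knob: lower it for quantized weights)
    headroom        -- 1.2 adds about 20% for KV cache and activations
    """
    return params_billion * bytes_per_param * headroom


def fits(memory_gb: float, gpu_memory_gb: float, gpu_count: int = 1) -> bool:
    """True if the estimate fits in the combined memory of identical GPUs.

    Simplification: ignores any extra overhead from sharding across GPUs.
    """
    return memory_gb <= gpu_memory_gb * gpu_count


for size in (7, 13, 34, 70):
    need = estimate_memory_gb(size)
    print(f"{size}B at fp16: ~{need:.0f} GB, "
          f"fits on one 80 GB GPU: {fits(need, 80)}")
```

The fp16 figures it prints match the table above; the fit check is deliberately naive and treats several GPUs as one pool of memory.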

How it works

The instance is billed for every hour of the month, about 730 on average (8,760 hours per year divided by 12), whether busy or idle. The tool adds a networking and storage allowance plus an operational overhead percentage, then divides the total by your monthly request volume to get the effective cost per thousand requests:

monthly_gpu_cost   = hourly_rate × 730
networking_storage = estimated allowance
overhead           = (monthly_gpu_cost + networking_storage) × overhead_%
total_monthly      = monthly_gpu_cost + networking_storage + overhead
cost_per_1k_req    = (total_monthly / monthly_requests) × 1000

It also flags how much spend is wasted running below full utilization — because idle GPU time is the primary reason self-hosting loses to APIs at low volume.
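The formulas above translate directly into code. The sketch below follows them term by term; the function name and the input values (a 4.00 per hour instance, a 200 allowance, 25% overhead, 1.5 million requests a month) are hypothetical figures chosen for illustration, not real prices or the tool's defaults.

```python
HOURS_PER_MONTH = 730  # average month: 8,760 hours per year / 12


def monthly_cost(hourly_rate: float,
                 networking_storage: float,
                 overhead_pct: float,
                 monthly_requests: int) -> dict:
    """Apply the cost formulas; overhead_pct is a fraction (0.25 means 25%)."""
    if monthly_requests <= 0:
        raise ValueError("monthly_requests must be positive")
    monthly_gpu_cost = hourly_rate * HOURS_PER_MONTH
    overhead = (monthly_gpu_cost + networking_storage) * overhead_pct
    total_monthly = monthly_gpu_cost + networking_storage + overhead
    cost_per_1k_req = (total_monthly / monthly_requests) * 1000
    return {
        "monthly_gpu_cost": monthly_gpu_cost,
        "overhead": overhead,
        "total_monthly": total_monthly,
        "cost_per_1k_req": cost_per_1k_req,
    }


# Hypothetical inputs for illustration only.
result = monthly_cost(hourly_rate=4.00, networking_storage=200.0,
                      overhead_pct=0.25, monthly_requests=1_500_000)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

With these inputs the GPU alone costs 2,920 a month, overhead adds 780, and the all-in total of 3,900 works out to about 2.60 per thousand requests.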

The utilization problem

The single biggest cost lever is utilization. A GPU running at 40% capacity still costs 100% of the hourly rate. At 40% utilization, 60% of your GPU spend buys nothing. Managed APIs only bill when you send a request, so they are structurally more cost-efficient at low or spiky traffic volumes.
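To put a number on that waste, one simple model treats idle spend as the GPU bill multiplied by the unused fraction. The sketch below assumes exactly that; the function names and the 2,920 monthly bill (a hypothetical 4.00 per hour instance over 730 hours) are illustrative, and the calculator may compute its own figure differently.

```python
def idle_spend(monthly_gpu_cost: float, utilization: float) -> float:
    """Share of the GPU bill that pays for idle time (utilization in 0..1)."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return monthly_gpu_cost * (1.0 - utilization)


def effective_hourly_rate(hourly_rate: float, utilization: float) -> float:
    """What each hour of useful work really costs at a given utilization."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_rate / utilization


# Hypothetical example: 2,920 per month at 40% utilization.
print(f"idle spend: {idle_spend(2920.0, 0.40):.2f}")
print(f"effective rate per busy hour: {effective_hourly_rate(4.00, 0.40):.2f}")
```

At 40% utilization, 1,752 of the 2,920 goes to idle time, and each hour of useful work effectively costs 10.00 rather than 4.00.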

Strategies to raise utilization:

Continuous batching (vLLM, TGI, SGLang): Serves multiple requests concurrently on the same GPU, which raises throughput per dollar.
Consolidating workloads: If you run five different models for five use cases, a single larger model that covers all five may have higher utilization than five separate instances each running at 20%.
Quantization: A quantized model fits on a smaller GPU or on fewer GPUs, so there is less idle capacity to pay for.

When does self-hosting actually pay off?

Self-hosting tends to pay off when:

Volume is high and steady — the GPU is busy most of the day
Latency requirements are strict — no cold starts, no provider rate limits
Data privacy is non-negotiable — requests cannot leave your infrastructure
You need fine-tuned model weights not available from API providers

For most early-stage or moderate-volume applications, a managed API is cheaper and requires no GPU expertise, on-call infrastructure, or operational overhead. Run this calculator against your actual daily request volume, then compare the effective cost per thousand requests against the managed API price for the same capability. If the self-hosted figure is higher, you are below break-even; the volume at which the two figures match is your break-even point.
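The break-even comparison can be sketched as follows, assuming the self-hosted monthly total stays fixed regardless of volume (as in the cost model above) and that you know roughly how many requests the instance can serve at full load. The function names, the API price of 2.00 per thousand requests and the capacity of 6 million requests a month are all hypothetical values for illustration.

```python
def break_even_requests(total_monthly: float, api_cost_per_1k: float) -> float:
    """Monthly requests at which a fixed self-hosting bill equals the API bill."""
    if api_cost_per_1k <= 0:
        raise ValueError("api_cost_per_1k must be positive")
    return total_monthly / api_cost_per_1k * 1000


def break_even_utilization(total_monthly: float,
                           api_cost_per_1k: float,
                           max_monthly_requests: float) -> float:
    """Break-even volume as a fraction of full-load capacity (assumed known)."""
    if max_monthly_requests <= 0:
        raise ValueError("max_monthly_requests must be positive")
    return break_even_requests(total_monthly, api_cost_per_1k) / max_monthly_requests


# Hypothetical figures: 3,900 all-in per month, API at 2.00 per 1k requests,
# and an instance that could serve 6 million requests a month at full load.
volume = break_even_requests(3900.0, 2.00)
share = break_even_utilization(3900.0, 2.00, 6_000_000)
print(f"break-even volume: {volume:,.0f} requests per month")
print(f"break-even utilization: {share:.1%}")
```

Under these assumptions self-hosting only wins above about 1.95 million requests a month, or roughly a third of the instance's capacity; a value above 100% would mean the instance cannot reach break-even at all.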