How is the self-hosting cost calculated?

It assumes you rent a GPU instance that runs continuously for the month, so cost equals the hourly rate multiplied by roughly 730 hours. The calculator ignores idle time because a dedicated instance is billed whether or not it is busy.

What does the break-even mean?

It is the monthly request volume at which the API token bill equals the fixed monthly GPU cost. Below it, the API is cheaper because you pay only per use; above it, the fixed GPU cost is spread over enough requests that self-hosting is cheaper.

Does self-hosting have hidden costs?

Yes. Beyond the GPU you pay for setup, monitoring, scaling, and engineering time, and you carry reliability risk. The calculator captures the raw compute comparison; weigh the operational burden separately.

Is anything sent to a server?

No. The comparison runs entirely in your browser. You enter only counts and rates, and nothing is uploaded, stored, or logged.

What is the Open-Source vs API Cost Comparison?

Find the break-even point between paying for LLM API access and self-hosting an open-source model on a cloud GPU. Factor in GPU hourly rate, utilisation, and request volume to see which is cheaper at your scale. It runs free in your browser on Gera Tools, with nothing uploaded.

Open-Source vs API Cost Comparison

Name: Open-Source vs API Cost Comparison
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Open-source vs API cost comparison

Renting an LLM API is pay-as-you-go: cheap at low volume, expensive at scale. Self-hosting an open-source model like Llama 3 flips that — a fixed monthly GPU bill that gets cheaper per request the more you use it. This tool finds the break-even volume where self-hosting starts to win.

How it works

The API side is simple: total tokens (volume × tokens per request, split into input and output) priced at the model’s per-million rates gives a monthly bill that scales linearly with usage.

The self-hosted side is a fixed cost: a dedicated GPU instance billed by the hour, running continuously (~730 hours/month), regardless of how busy it is. Setting the two equal and solving for volume gives the break-even — the monthly request count above which the fixed GPU cost is spread thinly enough to beat per-request API pricing.

API monthly cost      = (requests × avg_tokens) / 1,000,000 × price_per_million
Self-hosted monthly   = gpu_hourly_rate × 730
Break-even requests   = Self-hosted monthly / (avg_tokens × price_per_million / 1,000,000)
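
The formulas above can be sketched in Python as follows. This is a minimal illustration, not the tool's actual source: the function and parameter names are assumptions, and it prices input and output tokens separately (as the paragraph above describes) instead of using a single avg_tokens and price_per_million.

```python
HOURS_PER_MONTH = 730  # always-on month, as assumed in the text


def api_cost_per_request(input_tokens, output_tokens,
                         input_price_per_million, output_price_per_million):
    """Dollar cost of one API request, pricing input and output separately."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


def api_monthly_cost(requests, input_tokens, output_tokens,
                     input_price_per_million, output_price_per_million):
    """Monthly API bill; scales linearly with request volume."""
    return requests * api_cost_per_request(
        input_tokens, output_tokens,
        input_price_per_million, output_price_per_million)


def self_hosted_monthly_cost(gpu_hourly_rate, hours=HOURS_PER_MONTH):
    """Fixed monthly GPU bill, independent of request volume."""
    return gpu_hourly_rate * hours


def break_even_requests(gpu_hourly_rate, input_tokens, output_tokens,
                        input_price_per_million, output_price_per_million):
    """Monthly request count at which the API bill equals the GPU bill."""
    per_request = api_cost_per_request(
        input_tokens, output_tokens,
        input_price_per_million, output_price_per_million)
    if per_request <= 0:
        raise ValueError("API cost per request must be positive")
    return self_hosted_monthly_cost(gpu_hourly_rate) / per_request


# Hypothetical inputs: 800 input + 200 output tokens per request,
# $0.50/M input and $1.50/M output, on a $0.90/hr GPU.
print(round(break_even_requests(0.90, 800, 200, 0.50, 1.50)))
```

With these made-up inputs each request costs $0.0007 through the API, so the $657 monthly GPU bill breaks even at roughly 939,000 requests per month.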

Illustrative break-even scenarios

These are illustrative examples using representative numbers — verify current pricing with the specific provider before making infrastructure decisions.

API model tier	Approx. input price	GPU (A10G, ~$0.90/hr) monthly	Break-even at 1k tokens/request (all tokens at the input price)
Small/fast model	~$0.15/M tokens	~$657/month	~4.4M requests/month (high volume needed)
Mid-tier model	~$0.50/M tokens	~$657/month	~1.3M requests/month (moderate volume)
Large/capable model	~$2.50/M tokens	~$657/month	~263k requests/month (lower volume)

At higher API prices per token, the break-even volume is lower — meaning self-hosting wins at less traffic. At very low API prices (cheap small models), the fixed GPU cost has to be spread over an enormous request count to pay off.
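
To put rough numbers on the three tiers above, the short sketch below applies the break-even formula to each one. It assumes every one of the 1,000 tokens per request is billed at the listed input price, which understates the API cost when output tokens are priced higher; the names are illustrative.

```python
GPU_HOURLY_RATE = 0.90       # A10G figure from the table (illustrative)
HOURS_PER_MONTH = 730
TOKENS_PER_REQUEST = 1_000

# Approximate input prices per million tokens, taken from the table.
TIERS = {
    "Small/fast model": 0.15,
    "Mid-tier model": 0.50,
    "Large/capable model": 2.50,
}


def break_even(price_per_million, tokens_per_request=TOKENS_PER_REQUEST):
    """Requests per month at which the API bill matches the GPU bill."""
    gpu_monthly = GPU_HOURLY_RATE * HOURS_PER_MONTH
    cost_per_request = tokens_per_request * price_per_million / 1_000_000
    return gpu_monthly / cost_per_request


for tier, price in TIERS.items():
    print(f"{tier}: ~{break_even(price):,.0f} requests/month")
```

Under these assumptions the break-even falls from about 4.38 million requests per month for the cheapest tier to about 263 thousand for the most expensive one.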

The hidden costs of self-hosting

The GPU hourly rate is only the visible line item. Real self-hosting involves:

Setup and integration time — configuring inference servers (vLLM, TGI, Ollama), load balancers, and authentication takes real engineering hours
Monitoring and reliability — you own the uptime; a GPU crash at 3am is your on-call’s problem, not the API vendor’s
Scaling complexity — bursty traffic means GPUs sit idle most of the time or you need to auto-scale (adding complexity and cost)
Model updates — when a new version of the open-source model releases, you re-download, re-quantize, re-test, and re-deploy it yourself
Security — the inference server is now inside your infrastructure and carries its own attack surface

A common rule of thumb: if the engineering time needed to run inference infrastructure would cost more than your API bill and you cannot dedicate someone to maintaining it, the API wins on total cost even above the theoretical break-even volume.
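
The rule of thumb can be made concrete by adding an engineering overhead to the GPU bill. The sketch below is an assumption-laden illustration: engineering_hours and engineer_hourly_cost are hypothetical inputs that the calculator itself does not take, and the dollar figures are invented.

```python
HOURS_PER_MONTH = 730


def self_hosted_total_monthly(gpu_hourly_rate, engineering_hours,
                              engineer_hourly_cost):
    """GPU bill plus a hypothetical monthly engineering overhead."""
    gpu_monthly = gpu_hourly_rate * HOURS_PER_MONTH
    return gpu_monthly + engineering_hours * engineer_hourly_cost


def cheaper_option(api_monthly_bill, gpu_hourly_rate, engineering_hours,
                   engineer_hourly_cost):
    """Return which side is cheaper once engineering time is counted."""
    total = self_hosted_total_monthly(
        gpu_hourly_rate, engineering_hours, engineer_hourly_cost)
    return "api" if api_monthly_bill <= total else "self-hosted"


# Invented example: a $1,500/month API bill is well above the $657 GPU bill,
# but 20 hours/month of upkeep at $100/hour raises self-hosting to $2,657.
print(cheaper_option(1500, 0.90, 20, 100))  # engineering time counted
print(cheaper_option(1500, 0.90, 0, 100))   # raw compute only
```

In this example the raw compute comparison favours self-hosting, while counting the upkeep hours flips the answer back to the API.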

When self-hosting genuinely wins

Self-hosting makes financial sense when:

Your traffic is high and consistent (the GPU runs above ~60% utilisation)
You have a dedicated ML platform team who can maintain the stack
You need data privacy that prevents sending inputs to a third-party API
A model small enough to fit on a single GPU meets your quality bar (for example, a fine-tuned 7B or 13B model for a narrow task)

Tips and notes

Self-hosting is a fixed cost, not free. A 24/7 GPU costs the same whether you serve ten requests or ten million — utilisation is everything.
Add the operational tax. Setup, autoscaling, monitoring, and on-call carry real engineering cost and risk that this raw compute comparison omits.
Batch and autoscale to improve economics. Bursty traffic wastes a dedicated GPU; serverless GPU or batching can lower the effective self-hosted cost below the always-on assumption used here.
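
To illustrate the first tip, the sketch below divides the fixed monthly bill of an always-on GPU by the number of requests served. It assumes the $0.90/hr illustrative rate used earlier and ignores throughput limits; the function name is hypothetical.

```python
HOURS_PER_MONTH = 730


def self_hosted_cost_per_request(gpu_hourly_rate, requests_per_month):
    """Fixed monthly GPU bill spread evenly over the requests served."""
    if requests_per_month <= 0:
        raise ValueError("requests_per_month must be positive")
    return gpu_hourly_rate * HOURS_PER_MONTH / requests_per_month


for volume in (10, 10_000, 10_000_000):
    cost = self_hosted_cost_per_request(0.90, volume)
    print(f"{volume:>10,} requests/month -> ${cost:.6f} per request")
```

The same $657 bill works out to about $65.70 per request at ten requests a month and to well under a hundredth of a cent per request at ten million.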