Open-source vs API cost comparison
Renting an LLM API is pay-as-you-go: cheap at low volume, expensive at scale. Self-hosting an open-source model like Llama 3 flips that — a fixed monthly GPU bill that gets cheaper per request the more you use it. This tool finds the break-even volume where self-hosting starts to win.
How it works
The API side is simple: monthly volume × tokens per request gives total tokens, split into input and output. Each side is priced at the model’s per-million-token rate, so the monthly bill scales linearly with usage.
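As a minimal sketch of the API-side arithmetic, the function below multiplies volume by tokens per request and applies per-million-token rates. The function name, parameter names, and the prices in the usage note are illustrative assumptions, not values taken from this tool or from any provider.

```python
def api_monthly_cost(
    requests_per_month: float,
    input_tokens_per_request: float,
    output_tokens_per_request: float,
    input_price_per_million: float,   # hypothetical rate, USD per 1M input tokens
    output_price_per_million: float,  # hypothetical rate, USD per 1M output tokens
) -> float:
    """Monthly API bill in USD; linear in requests_per_month."""
    input_tokens = requests_per_month * input_tokens_per_request
    output_tokens = requests_per_month * output_tokens_per_request
    return (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000
```

With made-up rates of $0.50 and $1.50 per million tokens, 100,000 requests of 1,000 input and 500 output tokens each come to $125 a month, and doubling the volume doubles the bill.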
The self-hosted side is a fixed cost: a dedicated GPU instance billed by the hour, running continuously (~730 hours/month), regardless of how busy it is. Setting the two equal and solving for volume gives the break-even — the monthly request count above which the fixed GPU cost is spread thinly enough to beat per-request API pricing.
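The break-even calculation can be sketched as follows, under the assumptions described above: one always-on GPU instance at about 730 hours per month, and an API cost that is linear in requests. All names and the example prices are hypothetical, and the sketch does not check whether a single instance can actually serve the break-even volume.

```python
HOURS_PER_MONTH = 730  # approximate hours in a month, as assumed in the text


def self_hosted_monthly_cost(gpu_hourly_rate: float) -> float:
    """Fixed monthly cost of one GPU instance running continuously."""
    return gpu_hourly_rate * HOURS_PER_MONTH


def api_cost_per_request(
    input_tokens: float,
    output_tokens: float,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """API cost in USD for a single request."""
    return (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000


def break_even_requests(
    gpu_hourly_rate: float,
    input_tokens: float,
    output_tokens: float,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Monthly request count at which the API bill equals the fixed GPU bill."""
    per_request = api_cost_per_request(
        input_tokens, output_tokens, input_price_per_million, output_price_per_million
    )
    if per_request <= 0:
        raise ValueError("API cost per request must be positive")
    # Solve: volume * per_request == fixed monthly GPU cost
    return self_hosted_monthly_cost(gpu_hourly_rate) / per_request
```

For example, a hypothetical $2.00/hour GPU costs $1,460 a month. At 1,000 input and 500 output tokens per request, priced at $0.50 and $1.50 per million tokens, each API request costs $0.00125, so the break-even is roughly 1.17 million requests per month.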
Tips and notes
- Self-hosting is a fixed cost, not free. A 24/7 GPU costs the same whether you serve ten requests or ten million — utilisation is everything.
- Add the operational tax. Setup, autoscaling, monitoring, and on-call carry real engineering cost and risk that this raw compute comparison omits.
- Batch and autoscale to improve economics. Bursty traffic wastes a dedicated GPU; serverless GPU or batching can lower the effective self-hosted cost below the always-on assumption used here.
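To illustrate the utilisation point from the first tip, here is a small sketch of the effective self-hosted cost per request at different volumes. The function name and the $1,460 monthly figure (a hypothetical $2.00/hour GPU over 730 hours) are assumptions for illustration, and the sketch ignores the throughput limit of a single GPU.

```python
def self_hosted_cost_per_request(
    monthly_gpu_cost: float, requests_per_month: float
) -> float:
    """Spread a fixed monthly GPU bill over the requests actually served."""
    if requests_per_month <= 0:
        raise ValueError("requests_per_month must be positive")
    return monthly_gpu_cost / requests_per_month


if __name__ == "__main__":
    monthly_gpu_cost = 1460.0  # hypothetical: $2.00/hour * 730 hours
    for volume in (10, 10_000, 10_000_000):
        cost = self_hosted_cost_per_request(monthly_gpu_cost, volume)
        print(f"{volume:>12,} requests/month: ${cost:.6f} per request")
```

The monthly bill is identical in every row; only the number of requests it is divided by changes.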