How is monthly cost calculated per model?

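For each model, the tool computes the per-request cost as (input_tokens ÷ 1M × input_price) + (output_tokens ÷ 1M × output_price), where token counts are per request and prices are quoted per million tokens. It then multiplies by requests per day and by 30 to get a monthly estimate.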

Are these prices current?

They are published list-price estimates and are clearly labelled as such. Providers change pricing frequently, so confirm the live rate in your provider's dashboard before budgeting.

Why is the cheapest model not always the best choice?

Price is only one axis. A cheaper model may need more retries or longer prompts, or may reason less reliably in ways that cost you elsewhere. Use the matrix to shortlist, then test quality on your own task.

Is my data sent anywhere?

No. The entire matrix is computed in your browser. Nothing you enter is uploaded, stored or logged.

What is the Token Cost by Language Model Matrix?

Enter your daily token profile once and see costs for 25+ major LLMs simultaneously in a sortable matrix, with the cheapest option highlighted so you can compare GPT, Claude, Gemini and more at a glance. It runs free in your browser on Gera Tools, with nothing uploaded.

Token Cost by Language Model Matrix

Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Token cost by language-model matrix

Enter your token profile once and instantly compare the cost across every major model — GPT-4o, the o-series, Claude Opus/Sonnet/Haiku, Gemini, and popular open-weight hosts. No more opening five pricing pages: the whole field lands in a single sortable matrix with the cheapest option highlighted.

Why per-profile comparison matters

Model pricing pages show cost per million tokens, but what matters for your budget is the blended cost at your specific input-to-output ratio. A model that charges the same amount for input and output will rank differently from one that charges much more for output, depending on whether your workload is prompt-heavy or generation-heavy.

For example, a document-summarization pipeline might send 4,000 input tokens and receive 300 output tokens per call. A chatbot might send 800 input tokens and receive 1,200 output tokens. The cheapest model for one profile may not be the cheapest for the other. The matrix runs the calculation for all models simultaneously so the ranking reflects your workload, not a generic benchmark.
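The ranking flip described above can be sketched in a few lines of Python. The two model names and their prices are hypothetical placeholders chosen only to illustrate the effect; they are not real provider rates. Only the token counts (4,000/300 and 800/1,200) come from the example above.

```python
# Hypothetical prices in USD per million tokens: (input_price, output_price).
# These are illustrative placeholders, NOT real provider rates.
PRICES = {
    "flat-rate-model": (2.00, 2.00),     # same price for input and output
    "output-heavy-model": (0.50, 6.00),  # cheap input, expensive output
}


def per_request_cost(input_tokens, output_tokens, input_price, output_price):
    """Cost of one request, with prices quoted per million tokens."""
    return (input_tokens / 1e6) * input_price + (output_tokens / 1e6) * output_price


def cheapest(input_tokens, output_tokens):
    """Name of the model with the lowest per-request cost for this profile."""
    return min(
        PRICES,
        key=lambda name: per_request_cost(input_tokens, output_tokens, *PRICES[name]),
    )


summarizer_pick = cheapest(4000, 300)   # prompt-heavy profile
chatbot_pick = cheapest(800, 1200)      # generation-heavy profile
print(summarizer_pick, chatbot_pick)
```

With these placeholder prices, the prompt-heavy summarizer is cheaper on the model with low input pricing, while the generation-heavy chatbot is cheaper on the flat-rate model.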

How it works

Every model bills input and output tokens separately, quoted per million tokens. For each model the tool computes:

per_request = (input / 1e6) × input_price + (output / 1e6) × output_price
monthly     = per_request × requests_per_day × 30

It runs that for all models at once and sorts by monthly cost, so the cheapest model for your specific input/output ratio rises to the top.
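As a minimal sketch of that calculation (not the tool's actual source), the formula and the sort could look like this in Python. The ModelPrice class, the function names, and the three model entries with their prices are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    name: str
    input_price: float   # USD per 1M input tokens
    output_price: float  # USD per 1M output tokens


def monthly_cost(model, input_tokens, output_tokens, requests_per_day, days=30):
    """per_request x requests_per_day x 30, as in the formula above."""
    per_request = (
        (input_tokens / 1e6) * model.input_price
        + (output_tokens / 1e6) * model.output_price
    )
    return per_request * requests_per_day * days


def cost_matrix(models, input_tokens, output_tokens, requests_per_day):
    """(name, monthly cost) rows for every model, cheapest first."""
    rows = [
        (m.name, monthly_cost(m, input_tokens, output_tokens, requests_per_day))
        for m in models
    ]
    return sorted(rows, key=lambda row: row[1])


# Hypothetical models and prices, NOT real provider rates.
MODELS = [
    ModelPrice("model-a", 5.00, 15.00),
    ModelPrice("model-b", 0.50, 1.50),
    ModelPrice("model-c", 1.00, 1.00),
]

# Profile: 1,000 input and 500 output tokens per request, 10,000 requests per day.
matrix = cost_matrix(MODELS, 1000, 500, 10_000)
for name, cost in matrix:
    print(f"{name}: ${cost:,.2f} per month")
```

Swapping in the live rates from your provider's pricing page is the only change needed to reuse the sketch for a real estimate.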

The model tiers at a glance

The matrix covers models across a spectrum of capability and price:

Frontier reasoning models — the most capable at complex, multi-step tasks; highest per-token prices; best for hard problems that genuinely need deep reasoning
Flagship chat models (GPT-4o-class, Claude Sonnet-class, Gemini Pro-class) — strong general capability; moderate prices; the default choice for most production workloads
Mini / flash / haiku-class models — fast and low-cost; well-suited for classification, extraction, reformatting, and simple Q&A where quality is sufficient
Open-weight hosted models (Llama, Mixtral via Together AI, Groq, and others) — lowest prices; best when quality meets your bar after evaluation

How to use the ranking

Enter your real token profile from your provider’s usage dashboard
Sort the matrix by monthly cost
Identify the cheapest two or three models that are plausibly strong enough for your task
Run those candidates against a sample of your real prompts, scoring for quality
Pick the cheapest model that meets your quality threshold
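The last step above can be sketched as a small selection function. This assumes you have already scored each candidate on your own prompts. The candidate names, monthly costs, quality scores, and the 0.85 threshold are all hypothetical values for illustration.

```python
def pick_model(candidates, quality_threshold):
    """Return the cheapest candidate whose quality meets the threshold.

    candidates: list of (name, monthly_cost_usd, quality_score) tuples,
    where quality_score comes from your own evaluation.
    Returns None if no candidate passes.
    """
    passing = [c for c in candidates if c[2] >= quality_threshold]
    if not passing:
        return None
    return min(passing, key=lambda c: c[1])[0]


# Hypothetical shortlist: costs and scores are made up for illustration.
shortlist = [
    ("mini-model", 375.0, 0.78),
    ("flagship-model", 1200.0, 0.91),
    ("frontier-model", 3750.0, 0.95),
]

choice = pick_model(shortlist, quality_threshold=0.85)
print(choice)
```

Here the mini model is cheapest but falls below the threshold, so the function returns the next cheapest model that passes.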

Routing to save money without sacrificing quality

For many production applications, a single model is not the optimal choice — different requests have different difficulty levels. A common pattern is to route:

Easy requests (simple lookups, classification, extraction, formatting) to a mini or flash-class model at very low cost
Medium requests (conversational replies, summarization, straightforward Q&A) to a flagship chat model
Hard requests (multi-step reasoning, coding, long-form synthesis, ambiguous intent) to a frontier model

As a rough rule of thumb, 60–70% of requests in many applications are straightforward enough for a mini-class model, though the share varies by workload and is worth measuring on your own traffic. Routing those requests away from a flagship model saves the price gap between the two tiers on every routed request, which can cut total inference cost substantially with little change in perceived quality. The matrix shows the price spread between tiers so you can quantify the savings of a routing strategy before implementing it.
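To estimate the effect of routing before building it, a back-of-the-envelope calculation like the following may help. It assumes, as a simplification, that every tier sees the same token profile, so each tier's cost scales with its share of requests. The tier costs and the 65/30/5 split are hypothetical; only the 60–70% range for easy requests comes from the text above.

```python
def routed_monthly_cost(tier_costs, shares):
    """Blended monthly cost when traffic is split across tiers.

    tier_costs: monthly cost (USD) if ALL traffic went to that tier.
    shares: fraction of requests routed to each tier; must sum to 1.
    """
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(tier_costs[tier] * share for tier, share in shares.items())


# Hypothetical monthly costs per tier at 100% of traffic (not real prices).
tier_costs = {"mini": 400.0, "flagship": 4000.0, "frontier": 12000.0}

# Hypothetical routing split: 65% easy, 30% medium, 5% hard.
shares = {"mini": 0.65, "flagship": 0.30, "frontier": 0.05}

baseline = tier_costs["flagship"]            # everything on the flagship model
routed = routed_monthly_cost(tier_costs, shares)
savings = 1 - routed / baseline              # fraction saved versus the baseline
print(f"baseline ${baseline:,.0f}, routed ${routed:,.0f}, saving {savings:.1%}")
```

The estimate ignores the cost of the router itself and any retries when an easy-tier answer falls short, so treat the result as an upper bound on savings.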

Checking prices periodically is worthwhile — providers adjust pricing regularly, and the ranking can change.