What does the Pareto frontier mean here?

A model is on the frontier if no other model is at least as cheap and at least as good while being strictly better on one of the two. Frontier models represent the best available trade-offs; a model off the frontier is "dominated" because another model offers equal or better quality for the same or less money.

Which quality metric should I use?

Match it to your task. MMLU reflects broad knowledge and reasoning, HumanEval measures coding ability, and MT-Bench captures conversational quality. A model that tops one metric may sit mid-pack on another, so pick the one that mirrors your workload.

Should I always pick a frontier model?

Usually yes — off-frontier models are strictly worse value. Among frontier models, pick the cheapest one that clears your minimum quality bar; paying for more quality than you need is wasted budget.

Are the benchmark scores exact?

They are approximate published benchmark figures used as editable presets. Benchmarks vary by prompt and version and can be gamed, so treat them as a guide, not gospel, and validate on your own task.

Is anything sent to a server?

No. The plot and frontier are computed in your browser from a built-in table. Nothing is uploaded.

What is the Inference Cost vs Quality Frontier Explorer?

An interactive plot of major LLMs with cost per 1M tokens on one axis and a benchmark quality score on the other, with Pareto-optimal models highlighted so the best value-for-quality choices are visually obvious. It runs free in your browser on Gera Tools, with nothing uploaded.

Inference Cost vs Quality Frontier Explorer

Name: Inference Cost vs Quality Frontier Explorer
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

See the cost-quality trade-off at a glance

Choosing an LLM is a trade-off between price and capability. This explorer plots major models with cost per 1M tokens against a benchmark quality score, then highlights the Pareto frontier — the models that give the most quality for their cost. Everything below the frontier is a worse deal.

How the frontier works

A model is Pareto-optimal if no other model dominates it, that is, if nothing else is at least as cheap and at least as good while being strictly better on one of the two axes:

dominated(A) = exists B such that cost(B) <= cost(A)
                                  and quality(B) >= quality(A)
                                  and (cost(B) < cost(A) or quality(B) > quality(A))
frontier = models that are not dominated

The frontier traces the efficient trade-off curve. Picking off the frontier means you are leaving quality or money on the table — there is a strictly better model available.
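The dominance rule can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the `Model` type, the model names, and every cost and quality figure are hypothetical. It counts a model as dominated only when another model is no worse on both axes and strictly better on at least one, so two models with identical cost and quality do not eliminate each other.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    cost: float     # USD per 1M tokens (hypothetical values)
    quality: float  # benchmark score (hypothetical values)


def dominates(b: Model, a: Model) -> bool:
    """True if b is no worse than a on both axes and better on at least one."""
    no_worse = b.cost <= a.cost and b.quality >= a.quality
    strictly_better = b.cost < a.cost or b.quality > a.quality
    return no_worse and strictly_better


def pareto_frontier(models: list[Model]) -> list[Model]:
    """Return the non-dominated models, sorted by ascending cost."""
    frontier = [a for a in models if not any(dominates(b, a) for b in models)]
    return sorted(frontier, key=lambda m: m.cost)


# Made-up data for illustration only.
models = [
    Model("small", 0.2, 60.0),
    Model("medium", 1.0, 75.0),
    Model("pricey-medium", 3.0, 74.0),  # dominated by "medium"
    Model("large", 10.0, 88.0),
]

frontier_names = [m.name for m in pareto_frontier(models)]
print(frontier_names)  # ['small', 'medium', 'large']
```

The pairwise check is quadratic in the number of models, which is fine for a table of a few dozen entries; a sort-and-sweep approach would scale better for much larger sets.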

Why benchmark choice reshapes the frontier

The frontier is not fixed: it shifts depending on which quality metric you choose. A model that dominates on MMLU (broad world knowledge and reasoning) may fall mid-pack on HumanEval (coding correctness) or MT-Bench (multi-turn conversation quality). This matters in practice: if your application writes code, a model with slightly lower MMLU but much higher HumanEval might sit on the frontier for your use case even though it appears dominated when the plot is ranked by MMLU.

Three common metrics this tool supports:

MMLU — 57-subject multiple-choice covering STEM, law, history, and more. Good for general-purpose assistants.
HumanEval — Python function completion tasks with functional tests. Better for coding and programming agents.
MT-Bench — judge-scored multi-turn conversations. Most relevant for chatbots and dialogue systems.
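To see how switching the metric moves the frontier, here is a small Python sketch. The three models and all of their costs and scores are invented for illustration and are not taken from the tool's table; `frontier_for` is a hypothetical helper, not part of the tool.

```python
# Hypothetical models; costs are USD per 1M tokens. None of these are real figures.
MODELS = {
    "generalist": {"cost": 5.0, "mmlu": 86.0, "humaneval": 70.0},
    "coder": {"cost": 6.0, "mmlu": 80.0, "humaneval": 88.0},
    "budget": {"cost": 0.5, "mmlu": 68.0, "humaneval": 55.0},
}


def frontier_for(metric: str) -> list[str]:
    """Names of models not dominated on (cost, metric), cheapest first."""
    def dominated(a: dict) -> bool:
        # Dominated: some model is no worse on both axes and better on one.
        return any(
            b["cost"] <= a["cost"]
            and b[metric] >= a[metric]
            and (b["cost"] < a["cost"] or b[metric] > a[metric])
            for b in MODELS.values()
        )

    names = [name for name, m in MODELS.items() if not dominated(m)]
    return sorted(names, key=lambda name: MODELS[name]["cost"])


for metric in ("mmlu", "humaneval"):
    print(metric, frontier_for(metric))
# mmlu ['budget', 'generalist']
# humaneval ['budget', 'generalist', 'coder']
```

In this made-up data the "coder" model is dominated when ranked by MMLU, because "generalist" is both cheaper and higher scoring, yet it joins the frontier once the quality axis is HumanEval.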

Reading the plot effectively

Models cluster in a few regions on the cost-quality chart. Budget models (often open-weight or distilled) sit at the low-cost, lower-quality corner. Flagship proprietary models sit at the high-quality, higher-cost end. The interesting models are those that sit on the Pareto curve: each step up the curve costs more but buys a higher score. With cost on the x-axis and quality on the y-axis, models below and to the right of the curve are dominated by at least one frontier model.

A model can score high on the y-axis yet sit off the frontier because another model matches its quality at lower cost. Conversely, the cheapest model is always on the frontier, since no other model can undercut it on price (if several tie for cheapest, the highest-scoring of them is on the frontier), but it only makes sense to pick if your quality bar is very low.

Tips for using the plot

Set your quality floor first. Decide what minimum benchmark score your task requires, then pick the cheapest frontier model that clears it. Paying for extra quality above your actual need is wasted budget.
Switch metrics to match the job. A coding workload should rank on HumanEval, not MMLU — the frontier reshapes per metric. Re-check which models stay on the frontier after switching.
Validate on your own task. Public benchmarks guide the shortlist; your actual prompts and latency requirements decide the winner. Run a small sample of real inputs on the two or three finalist models before committing.
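The "set your quality floor first" tip can be sketched as a short Python helper. The function name `cheapest_above_floor`, the candidate list, and all figures are hypothetical and only illustrate the selection rule.

```python
from typing import Optional

# (name, cost in USD per 1M tokens, quality score) - hypothetical values.
CANDIDATES = [
    ("small", 0.2, 60.0),
    ("medium", 1.0, 75.0),
    ("pricey-medium", 3.0, 74.0),
    ("large", 10.0, 88.0),
]


def cheapest_above_floor(models, quality_floor: float) -> Optional[str]:
    """Name of the cheapest model scoring at least quality_floor, or None."""
    eligible = [m for m in models if m[2] >= quality_floor]
    if not eligible:
        return None
    # Lowest cost wins; ties on cost go to the higher-quality model.
    best = min(eligible, key=lambda m: (m[1], -m[2]))
    return best[0]


print(cheapest_above_floor(CANDIDATES, 70.0))  # medium
print(cheapest_above_floor(CANDIDATES, 95.0))  # None
```

Filtering to the frontier first is not needed here: any model that dominated the winner would also clear the floor and would be either cheaper or equally cheap with a higher score, so the winner is already a frontier model.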