How is the efficiency score calculated?

Each model's speed (tokens/sec) and cost (blended USD per 1M tokens) are normalized to 0-1 across the set. The score is: weight × normalized_speed + (1 − weight) × (1 − normalized_cost), so higher is always better and the cost-sensitivity weight controls the trade-off.

Why does throughput matter for cost?

For batch and streaming workloads, slow models tie up connections, raise infrastructure cost, and hurt user experience. At the same per-token price, a model that is twice as fast often comes out ahead, because it holds those resources for roughly half as long.

Are the speed numbers exact?

No. Tokens-per-second varies with prompt length, region, load, and whether you stream. The presets are representative, editable estimates. Benchmark on your own traffic before committing.

What is first-token latency vs throughput?

First-token latency is how long until the model starts responding (matters for interactive UX). Throughput (tokens/sec) is how fast it produces the rest. This tool focuses on throughput-per-dollar but flags latency against your threshold.

What is the Tokens-per-Second Speed vs Cost Calculator?

It is a calculator that ranks LLMs by a single throughput-efficiency score combining tokens per second and cost per million tokens. Set a cost-sensitivity weight and an acceptable latency to find the best model for speed-critical workloads. It runs free in your browser on Gera Tools, with nothing uploaded.

Tokens-per-Second Speed vs Cost Calculator

Name: Tokens-per-Second Speed vs Cost Calculator
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Pick the right model when both speed and cost matter

For streaming chat, real-time agents, and large batch jobs, the cheapest model is not always the best choice — a model that is twice as fast for a similar price clears your queue sooner and frees infrastructure. This tool normalizes tokens-per-second and cost into one efficiency score you can tune with a single cost-sensitivity slider.

How the leaderboard is built

Across the model set, each model’s throughput and cost are scaled to 0-1. The blended score rewards speed and penalizes cost:

score = weight × norm(speed)
      + (1 − weight) × (1 − norm(cost))

At weight = 1 the ranking is pure speed; at weight = 0 it is pure cost; in between you get a balanced view. Note the direction: a higher weight favors speed and a lower weight favors cost. A latency threshold separately flags models that start responding too slowly for interactive use.
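The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: it assumes min-max scaling for the 0-1 normalization (the page does not say which method is used), and the Model fields, the leaderboard function, and all figures are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    tokens_per_sec: float   # sustained generation throughput
    cost_per_mtok: float    # blended USD per 1M tokens
    first_token_ms: float   # time until the first token arrives


def min_max(values):
    """Scale values to 0-1 (assumed method); an all-equal set maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def leaderboard(models, weight, max_first_token_ms):
    """Return (name, score, too_slow) tuples, best score first."""
    speeds = min_max([m.tokens_per_sec for m in models])
    costs = min_max([m.cost_per_mtok for m in models])
    rows = []
    for m, s, c in zip(models, speeds, costs):
        # score = weight * norm(speed) + (1 - weight) * (1 - norm(cost))
        score = weight * s + (1 - weight) * (1 - c)
        too_slow = m.first_token_ms > max_first_token_ms
        rows.append((m.name, round(score, 3), too_slow))
    return sorted(rows, key=lambda r: r[1], reverse=True)


# Made-up figures for illustration only.
models = [
    Model("fast-pricey", 200.0, 10.0, 300.0),
    Model("mid", 120.0, 3.0, 500.0),
    Model("slow-cheap", 50.0, 1.0, 900.0),
]
for row in leaderboard(models, weight=0.5, max_first_token_ms=800.0):
    print(row)
```

With these invented numbers, "mid" ranks first at weight 0.5, "fast-pricey" at weight 1, and "slow-cheap" at weight 0, while "slow-cheap" is the only model flagged against the 800 ms threshold. The flag is kept separate from the score, matching the description above.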

Why throughput matters beyond user experience

Speed affects more than how fast the text appears on screen. For production systems, throughput shapes infrastructure cost in ways that the per-token price alone does not capture:

Queue clearance. A batch of 10,000 documents processed at 50 tokens/second takes twice as long as one processed at 100 tokens/second. Longer processing means your compute and connection resources are held longer, which raises real infrastructure cost independent of the API bill.

Connection concurrency. Slower models require more simultaneous open connections to maintain the same effective throughput, which can push you into higher service tiers or require more complex pooling logic.

Time-to-first-token vs sustained throughput. First-token latency and generation throughput are different properties. A model with fast first-token but slow generation feels fast to start but delivers the full response slowly — frustrating for long outputs. Conversely, a model with slow first-token but fast generation is fine for non-interactive batch use. The latency threshold in this tool flags models whose first-token behavior exceeds your interactive tolerance, while the throughput score covers sustained generation speed.
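The three effects above come down to simple arithmetic. The sketch below is a rough model under stated assumptions: each connection sustains the quoted tokens/second, batch time ignores first-token latency and retries, and the function names and the 500-tokens-per-document figure are hypothetical.

```python
import math


def response_seconds(first_token_s, output_tokens, tokens_per_sec):
    """Total time for one response: wait for the first token, then generate."""
    return first_token_s + output_tokens / tokens_per_sec


def batch_hours(documents, tokens_per_doc, tokens_per_sec, connections=1):
    """Wall-clock hours for a batch, ignoring first-token latency."""
    total_tokens = documents * tokens_per_doc
    return total_tokens / (tokens_per_sec * connections) / 3600


def connections_needed(target_tokens_per_sec, tokens_per_sec):
    """Parallel connections required to reach an aggregate throughput target."""
    return math.ceil(target_tokens_per_sec / tokens_per_sec)


# 10,000 documents, assuming 500 output tokens each, on one connection.
slow = batch_hours(10_000, 500, tokens_per_sec=50)
fast = batch_hours(10_000, 500, tokens_per_sec=100)
print(f"{slow:.1f} h at 50 tok/s vs {fast:.1f} h at 100 tok/s")

# Connections needed to sustain 1,000 tok/s in aggregate.
print(connections_needed(1_000, 50), connections_needed(1_000, 100))
```

Halving the throughput doubles both the batch duration and the number of connections needed for the same aggregate rate. That is the infrastructure cost the per-token price does not show.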

How to set the cost-sensitivity weight

The default balanced setting (weight near 0.5) treats speed and cost equally. Adjust it based on your workload:

Interactive chat or voice assistants: Push the weight higher (toward 1.0). Users notice latency acutely, so a small cost premium for significantly faster generation is often worth more in retention than it costs in API fees.
Bulk document processing or nightly jobs: Push toward 0.0 (pure cost). No user is waiting, so throughput above a minimum threshold just means the job finishes at 2am instead of 3am — not worth paying extra for.
Real-time agents with tool calls: Keep close to balanced, leaning slightly toward speed. Agents make many sequential API calls, so per-call latency compounds across a task. Because individual calls are often short, the cost per call is low and a modest speed premium is usually affordable.
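One way to make the choice of weight concrete is to find the weight at which a faster, pricier model ties a slower, cheaper one. Setting the two scores equal in the formula above gives weight = cost_gap / (speed_gap + cost_gap), where both gaps are differences in normalized values. The helper below is a hypothetical illustration of that algebra, not part of the tool.

```python
def break_even_weight(speed_gap, cost_gap):
    """Weight at which a faster, pricier model ties a slower, cheaper one.

    speed_gap: normalized speed of the fast model minus that of the slow one
    cost_gap:  normalized cost of the fast model minus that of the slow one
    Both gaps must be positive; above the returned weight the fast model wins.
    """
    if speed_gap <= 0 or cost_gap <= 0:
        raise ValueError("expects the first model to be faster and more expensive")
    return cost_gap / (speed_gap + cost_gap)


# The fast model is 0.25 ahead on normalized speed but 0.75 behind on cost.
print(break_even_weight(0.25, 0.75))  # 0.75
```

If the break-even weight is well above the setting that suits your workload, the faster model's premium is not paying for itself under this score.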

How to use the result

If you are building an interactive assistant, keep the cost-sensitivity weight high and watch the latency flag. For overnight batch jobs where users never wait, push the weight toward cost. Always validate the chosen model’s real throughput on your own prompts and region before standardizing on it, because provider load at peak hours can make published benchmarks misleading.
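A minimal sketch of such a validation follows. It assumes you can iterate over a streamed response one token (or chunk) at a time. The fake_stream generator is a hypothetical stand-in for your provider's streaming client, and if your client yields multi-token chunks the result is chunks per second rather than tokens per second.

```python
import time


def measure_stream(token_stream):
    """Consume a token iterator; return (first_token_seconds, tokens_per_second).

    Throughput is measured after the first token arrives, so the initial
    wait does not skew it.
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in token_stream:
        now = time.perf_counter()
        if first is None:
            first = now
        count += 1
        last = now
    if first is None:
        raise ValueError("stream produced no tokens")
    elapsed = last - first
    tps = (count - 1) / elapsed if count > 1 and elapsed > 0 else float("nan")
    return first - start, tps


def fake_stream(n_tokens, first_token_delay, per_token_delay):
    """Hypothetical stand-in for a provider's streaming response."""
    time.sleep(first_token_delay)
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_delay)
        yield "tok"


ttft, tps = measure_stream(fake_stream(20, 0.05, 0.005))
print(f"first token after {ttft * 1000:.0f} ms, then {tps:.0f} tokens/sec")
```

Run it against real prompts at several times of day, then enter the measured figures into the calculator in place of the presets.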