What is the difference between throughput and time-to-first-token?

Throughput (tokens/sec) is how fast the model streams the full response once it starts. Time-to-first-token (TTFT) is how long you wait before the first word appears — the metric users feel most in chat UIs.

Reasoning models like o1 spend time thinking internally before they answer, which inflates time-to-first-token dramatically. The trade-off buys harder problem-solving, not faster responses.

How accurate are these numbers?

They are representative median values from US regions and are deliberately rounded. Real speed varies with load, region, prompt length and whether you stream — always benchmark your own traffic.

Does this tool measure live speed?

No. It is a static, browser-rendered reference. It does not call any API, so there is no network request and nothing is logged.

What is the AI Speed Benchmark Reference?

A curated reference table of median output throughput (tokens per second) and time-to-first-token latency for leading hosted LLM APIs, filterable by provider and sortable by speed, with notes on region and load variability. It runs free in your browser on Gera Tools, with nothing uploaded.

AI Speed Benchmark Reference

Name: AI Speed Benchmark Reference
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Speed is the metric pricing tables forget

A model can be cheap and capable yet feel sluggish in a chat UI. This reference collects the two numbers that actually drive perceived speed — throughput (tokens per second) and time-to-first-token (TTFT) — for the leading hosted LLM APIs, so you can size latency before you build.

Two numbers, two different UX problems

Throughput (tokens per second) determines how fast a response streams. A model generating 50 tokens/second takes roughly 20 seconds to produce a 1,000-token response. At 150 tokens/second, the same response arrives in about 7 seconds. For long outputs — detailed code, long-form writing, complex analysis — throughput is the number that determines whether the experience feels fluid or frustrating.

Time-to-first-token (TTFT) is how long the user waits before anything appears. Even if a model eventually streams fast, a 4-second TTFT makes the UI feel frozen. Users form their first judgment in the first second; TTFT is what they measure, consciously or not. For short responses that finish in a few seconds, TTFT often dominates the total perceived wait.
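The arithmetic above can be sketched in a few lines. This is a rough planning model only: it assumes a constant streaming rate, and the function name and the 100-token short reply are illustrative assumptions, not figures from the reference table.

```python
def estimated_response_seconds(output_tokens: int,
                               tokens_per_second: float,
                               ttft_seconds: float = 0.0) -> float:
    """Rough wall-clock estimate: wait for the first token, then stream the rest."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return ttft_seconds + output_tokens / tokens_per_second


# The two streaming examples from the text, with TTFT ignored.
slow = estimated_response_seconds(1000, 50)    # 20.0 seconds
fast = estimated_response_seconds(1000, 150)   # about 6.7 seconds

# A short reply where TTFT dominates. The 4-second TTFT comes from the text;
# the 100-token length and 150 tokens/second rate are assumed for illustration.
short = estimated_response_seconds(100, 150, ttft_seconds=4.0)
ttft_share = 4.0 / short  # fraction of the total wait spent before the first token
```

In the short-reply case most of the wait happens before anything is shown, which is why the two metrics need to be sized separately.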

The two metrics respond differently to how a system is tuned. Systems that batch many requests together often achieve higher aggregate throughput, but at the cost of higher TTFT. Streaming-optimised systems minimise TTFT at some throughput cost. Reasoning models accept a much higher TTFT in exchange for deeper problem-solving.

How it works

Each row lists a model’s median output throughput and TTFT alongside a short note explaining its position. Filter to a single provider, then sort by throughput when you care about how fast a long answer streams, or by TTFT when interactive responsiveness is the priority. Specialised inference providers like Groq sit at the top of the throughput ranking thanks to custom hardware, while reasoning models like o1 have the slowest TTFT because they reason before they answer.
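The filter-and-sort behaviour described here might look roughly like the sketch below. It is not the tool's actual code: the `Row` type, the `view` function, and every provider name, model name and number in `ROWS` are invented placeholders, not measurements.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Row:
    provider: str
    model: str
    tokens_per_second: float  # median output throughput
    ttft_seconds: float       # median time-to-first-token


# Placeholder rows: hypothetical names and made-up numbers, NOT benchmark data.
ROWS = [
    Row("provider-a", "model-a1", 80.0, 0.6),
    Row("provider-a", "model-a2", 40.0, 12.0),
    Row("provider-b", "model-b1", 300.0, 0.3),
]


def view(rows: List[Row], provider: Optional[str] = None,
         sort_by: str = "throughput") -> List[Row]:
    """Optionally filter to one provider, then sort best-first by the chosen metric."""
    selected = [r for r in rows if provider is None or r.provider == provider]
    if sort_by == "throughput":
        # Higher throughput is better, so sort descending.
        return sorted(selected, key=lambda r: r.tokens_per_second, reverse=True)
    if sort_by == "ttft":
        # Lower TTFT is better, so sort ascending.
        return sorted(selected, key=lambda r: r.ttft_seconds)
    raise ValueError("sort_by must be 'throughput' or 'ttft'")
```

Calling `view(ROWS, sort_by="ttft")` puts the most responsive row first, while `view(ROWS, provider="provider-a")` narrows the table before sorting by throughput.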

Why the range is so wide across providers

Speed differences between models and providers reflect real infrastructure differences:

Specialised inference hardware. Providers like Groq use custom LPU (Language Processing Unit) chips specifically designed for the matrix multiplications that dominate transformer inference. This is why their throughput can be dramatically higher than a general-purpose GPU cluster running an equivalently capable model.

Model size and architecture. A 7B parameter model runs faster than a 70B parameter model, all else equal. Mixture-of-Experts (MoE) architectures like Mixtral activate only a fraction of their parameters for each token, giving speed close to a smaller dense model while maintaining capacity closer to a larger one.

Reasoning vs. standard models. Reasoning models (o1, o3, and similar) generate extensive internal chain-of-thought before producing a visible response. This internal reasoning is where their extra TTFT comes from: the model may do seconds or minutes of internal work before outputting a single visible token.

Server load and region. The same model on the same provider can vary significantly in speed depending on current demand and which datacenter region you’re hitting. The reference shows representative median figures; real deployments should benchmark on representative traffic from the target region.
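A benchmark of your own traffic can be as small as the sketch below. It assumes your client exposes a streamed response as an iterable of tokens; `start_stream` is a placeholder for whatever call opens that stream, and `fake_stream` only simulates one with sleeps so the example runs without any API. The throughput figure here covers the streaming phase only (tokens after the first, divided by the time after the first), which is one reasonable convention among several.

```python
import statistics
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_stream(start_stream: Callable[[], Iterable[str]]) -> Tuple[float, float]:
    """Return (ttft_seconds, tokens_per_second) for one streamed response."""
    start = time.perf_counter()          # timer starts before the request is made
    first_at = None
    count = 0
    for _ in start_stream():
        if first_at is None:
            first_at = time.perf_counter()
        count += 1
    end = time.perf_counter()
    if first_at is None:
        raise ValueError("stream produced no tokens")
    ttft = first_at - start
    streaming_time = end - first_at
    # Throughput over the streaming phase, excluding the wait for the first token.
    tps = (count - 1) / streaming_time if count > 1 and streaming_time > 0 else 0.0
    return ttft, tps


def median_of_runs(start_stream: Callable[[], Iterable[str]],
                   runs: int = 5) -> Tuple[float, float]:
    """Repeat the measurement and report the median of each metric."""
    samples = [measure_stream(start_stream) for _ in range(runs)]
    return (statistics.median(s[0] for s in samples),
            statistics.median(s[1] for s in samples))


def fake_stream(ttft: float = 0.05, tokens: int = 20, gap: float = 0.005) -> Iterator[str]:
    """Stand-in for a real API stream (hypothetical): sleeps instead of calling a network."""
    time.sleep(ttft)
    for i in range(tokens):
        if i:
            time.sleep(gap)
        yield "tok"
```

To try it, call `median_of_runs(fake_stream)`, then swap `fake_stream` for a function that opens a real stream from your target region, with prompts that resemble production traffic.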

How to read it

Chat UIs: prioritise low TTFT — users judge responsiveness by how fast the first token appears, not total speed.
Batch / pipeline jobs: prioritise raw throughput; TTFT is irrelevant when no human is waiting.
Reasoning tasks: accept the latency. o1’s slow TTFT is the cost of its step-by-step problem solving.
High-frequency streaming (voice interfaces, copilots): very low TTFT is critical; even 500 ms feels like lag in a voice context.

These figures are planning estimates from US regions. Real-world speed shifts with server load, your region, prompt length and streaming, so always run a quick benchmark against your own traffic before relying on a number. Pair this with the AI Model Comparison Table to weigh speed against cost and capability together.