Speed is the metric pricing tables forget
A model can be cheap and capable yet feel sluggish in a chat UI. This reference collects the two numbers that actually drive perceived speed, throughput (output tokens per second) and time-to-first-token (TTFT), for the leading hosted LLM APIs, so you can estimate latency before you build.
How it works
Each row lists a model’s median output throughput and TTFT alongside a short note explaining its position. Filter to a single provider, then sort by throughput when you care about how fast a long answer streams, or by TTFT when interactive responsiveness is the priority. Specialised inference providers like Groq top the throughput ranking thanks to custom hardware, while reasoning models like o1 have the slowest TTFT because they think before they speak.
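The filter-and-sort behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not the table's actual implementation: the `Row` type, the `view` function, the provider and model names, and every number are hypothetical placeholders rather than measurements.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Row:
    provider: str
    model: str
    tokens_per_s: float  # median output throughput
    ttft_s: float        # median time-to-first-token, in seconds


# Placeholder rows for illustration only; these are NOT real measurements.
ROWS = [
    Row("provider-a", "model-fast", 400.0, 0.40),
    Row("provider-b", "model-chat", 90.0, 0.25),
    Row("provider-b", "model-reasoning", 60.0, 12.0),
]


def view(rows: List[Row], provider: Optional[str] = None,
         sort_by: str = "throughput") -> List[Row]:
    """Filter to one provider (optional), then sort best-first."""
    if sort_by not in ("throughput", "ttft"):
        raise ValueError("sort_by must be 'throughput' or 'ttft'")
    selected = [r for r in rows if provider is None or r.provider == provider]
    if sort_by == "throughput":
        # Higher throughput is better, so sort descending.
        return sorted(selected, key=lambda r: r.tokens_per_s, reverse=True)
    # Lower TTFT is better, so sort ascending.
    return sorted(selected, key=lambda r: r.ttft_s)


if __name__ == "__main__":
    for row in view(ROWS, sort_by="ttft"):
        print(row.model, row.ttft_s)
```

Note that the two sort orders point in opposite directions: throughput is best when highest, TTFT when lowest.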
How to read it
- Chat UIs: prioritise low TTFT. Users judge responsiveness by how fast the first token appears, not by how quickly the full answer finishes.
- Batch / pipeline jobs: prioritise raw throughput; TTFT is irrelevant when no human is waiting.
- Reasoning tasks: accept the latency. o1’s slow TTFT is the cost of its step-by-step problem solving.
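To see why the priorities above differ by workload, it can help to combine the two numbers into a rough end-to-end estimate. The sketch below assumes a simplified model that is not stated in the source: the first token arrives after TTFT and the remaining output streams at a constant rate. The function name and the example inputs are hypothetical.

```python
def estimate_total_seconds(ttft_s: float, tokens_per_s: float,
                           output_tokens: int) -> float:
    """Rough response time: wait for the first token, then stream the rest.

    Assumes constant throughput after the first token, which real
    services only approximate.
    """
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    return ttft_s + output_tokens / tokens_per_s


if __name__ == "__main__":
    # Short chat reply: TTFT is a large share of the wait.
    print(estimate_total_seconds(0.5, 100.0, 50))    # 1.0
    # Long batch output: throughput dominates.
    print(estimate_total_seconds(0.5, 100.0, 1000))  # 10.5
```

With these made-up inputs, TTFT is half of the wait for the 50-token reply but under 5% of it for the 1000-token output, which is the intuition behind the chat and batch guidance.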
These figures are planning estimates from US regions. Real-world speed shifts with server load, your region, prompt length, and whether streaming is enabled, so always run a quick benchmark against your own traffic before relying on a number. Pair this with the AI Model Comparison Table to weigh speed against cost and capability together.
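A quick benchmark of the kind suggested above might look like the following sketch. It is provider-agnostic by design: `measure_stream` and `StreamTiming` are hypothetical names, `start_stream` stands in for whatever call opens a streaming response in your client, and `fake_stream` is a stand-in so the example runs without network access. It also assumes each streamed chunk is roughly one token, which real APIs do not guarantee.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class StreamTiming:
    ttft_s: float                  # seconds from request start to first chunk
    tokens_per_s: Optional[float]  # chunks per second after the first chunk
    chunks: int


def measure_stream(start_stream: Callable[[], Iterable[str]],
                   clock: Callable[[], float] = time.perf_counter) -> StreamTiming:
    """Time one streamed response, treating each chunk as one token."""
    start = clock()
    first = last = None
    count = 0
    for _chunk in start_stream():
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no chunks")
    elapsed = last - first
    # Throughput is measured over the chunks that follow the first one.
    rate = (count - 1) / elapsed if count > 1 and elapsed > 0 else None
    return StreamTiming(first - start, rate, count)


def fake_stream() -> Iterable[str]:
    """Stand-in for a real streaming call; sleeps to simulate latency."""
    for word in ["one", "two", "three", "four", "five"]:
        time.sleep(0.01)
        yield word


if __name__ == "__main__":
    print(measure_stream(fake_stream))
```

In practice you would repeat the measurement many times with prompts drawn from your own traffic and compare medians, since a single run is dominated by noise.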