What “inference speed” actually means
Inference speed is how quickly a trained model turns your prompt into a response, and it is best understood as two separate numbers, not one. Time-to-first-token (TTFT) is the delay before the first word appears — this is what makes a chat feel snappy or sluggish. Throughput, measured in tokens per second, is how fast the rest of the answer streams once it starts. A model with a 1.5-second TTFT but 80 tokens/second can feel slower to start yet finish a long answer sooner than one that begins instantly but crawls. Comparing models honestly means looking at both, plus the total time to a complete answer for the length of output you actually need.
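The trade-off between the two numbers can be made concrete with a little arithmetic. The sketch below assumes the simplest possible model, total time = TTFT + output tokens / throughput, and compares the 1.5-second, 80 tokens/second model from the paragraph above with a hypothetical "instant but slow" model. The 0.2-second TTFT and 20 tokens/second figures for that second model are illustrative assumptions, not measurements, and the function names are made up for this example.

```python
from typing import Optional


def total_response_time(ttft_s: float, tokens_per_s: float, output_tokens: int) -> float:
    """Seconds until the last token arrives: start-up delay plus streaming time."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return ttft_s + output_tokens / tokens_per_s


def break_even_tokens(ttft_a: float, tps_a: float,
                      ttft_b: float, tps_b: float) -> Optional[float]:
    """Output length at which models A and B finish at the same moment.

    Returns None when both stream at the same rate, because the model with
    the lower TTFT then wins at every output length.
    """
    if tps_a == tps_b:
        return None
    # Solve ttft_a + n / tps_a == ttft_b + n / tps_b for n.
    return (ttft_a - ttft_b) / (1 / tps_b - 1 / tps_a)


if __name__ == "__main__":
    # Model A uses the figures from the text; model B's figures are assumed.
    slow_start = total_response_time(1.5, 80, 500)
    fast_start = total_response_time(0.2, 20, 500)
    print(f"500 tokens: A takes {slow_start:.2f} s, B takes {fast_start:.2f} s")
    print(f"break-even at about {break_even_tokens(1.5, 80, 0.2, 20):.0f} tokens")
```

Under these assumed figures, the slow-starting model overtakes the fast-starting one after roughly 35 output tokens, so the answer to "which is faster" depends entirely on how long the responses you need are.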
What drives the differences
Several factors move these numbers, often more than raw model size does. Hardware matters most: specialised accelerators such as Groq’s LPUs are purpose-built for sequential token generation and can reach hundreds of tokens/second for the open models they host, far above typical GPU serving. Model size and architecture set a floor on per-token cost: each generated token costs computation, so larger dense models are slower, while mixture-of-experts designs activate only part of the network per token and can be faster than their total parameter count suggests. Serving optimisations (quantisation, batching, speculative decoding, KV-cache reuse) can change throughput severalfold for the same model. Finally, input length drives TTFT: the model must read every input token in a “prefill” pass before emitting the first output token, so a 50,000-token document noticeably delays the start.
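To get a feel for how input length feeds into TTFT, here is a rough back-of-the-envelope sketch. It assumes prefill proceeds at a constant rate, which is a simplification: real prefill cost can grow faster than linearly on long inputs, and cached prefixes can shorten it. The 5,000 tokens/second prefill rate and the 0.2-second fixed overhead are placeholder assumptions chosen only to make the arithmetic visible; measure your own provider to get real values.

```python
def estimate_ttft(input_tokens: int,
                  prefill_tokens_per_s: float,
                  fixed_overhead_s: float = 0.0) -> float:
    """Very rough TTFT estimate: fixed overhead plus time to read the prompt.

    Assumes a constant prefill rate, which is a simplification; treat the
    result as an order-of-magnitude guide, not a prediction.
    """
    if prefill_tokens_per_s <= 0:
        raise ValueError("prefill_tokens_per_s must be positive")
    return fixed_overhead_s + input_tokens / prefill_tokens_per_s


if __name__ == "__main__":
    ASSUMED_PREFILL_RATE = 5_000  # tokens/second, illustrative only
    ASSUMED_OVERHEAD_S = 0.2      # network and queueing, illustrative only
    for prompt_tokens in (500, 5_000, 50_000):
        ttft = estimate_ttft(prompt_tokens, ASSUMED_PREFILL_RATE, ASSUMED_OVERHEAD_S)
        print(f"{prompt_tokens:>6} input tokens -> ~{ttft:.1f} s before the first token")
```

With these placeholder numbers a short prompt starts almost immediately while the 50,000-token document waits on the order of ten seconds, which is why long-context features often need a different latency budget from short chat turns.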
Rough landscape across major models
Exact figures shift with provider tuning and load, so treat these as rough orders of magnitude rather than fixed figures. Hosted frontier models like GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro typically stream at tens of tokens per second with sub-second to low-second TTFT: fast enough for fluent chat but not the fastest available. Smaller or “mini/flash” variants (GPT-4o mini, Claude Haiku, Gemini Flash) trade some quality for markedly higher speed and are the usual choice for latency-sensitive features. Groq-hosted open models (Llama, Mixtral) lead on raw throughput thanks to Groq’s custom hardware. Self-hosted Llama on your own GPUs lands wherever your hardware, quantisation, and batching put it: flexible, but you own the optimisation work.
Choosing for real-time use cases
Match the metric to the experience. For interactive chat and autocomplete, prioritise low TTFT and stream the response so the user sees words immediately; perceived speed beats total time. For voice assistants, the whole pipeline must stay under a conversational threshold, so a fast small model often wins over a slower, smarter one. For batch jobs, such as summarising thousands of documents overnight, TTFT is irrelevant and you should optimise for pure throughput and cost per token, batching aggressively. And always benchmark on your own prompt lengths and output sizes under realistic load, because published numbers are often measured under ideal conditions that rarely match production.
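A minimal way to run that kind of benchmark is to time a streamed response yourself. The sketch below is provider-agnostic: `measure_stream` takes any zero-argument callable that starts a request and returns an iterable of streamed chunks, so the timer starts before the request is sent. The names `measure_stream`, `StreamStats`, and `fake_stream` are made up for this example, and `fake_stream` is a stand-in that simulates delays with `time.sleep`; you would replace it with a wrapper around your own client's streaming call. Note that it counts chunks, which may not map one-to-one onto tokens for every provider.

```python
import time
from typing import Callable, Iterable, Iterator, NamedTuple


class StreamStats(NamedTuple):
    ttft_s: float               # delay until the first chunk arrived
    chunks: int                 # number of streamed chunks received
    decode_chunks_per_s: float  # streaming rate after the first chunk
    total_s: float              # time until the last chunk arrived


def measure_stream(start_stream: Callable[[], Iterable[str]],
                   clock: Callable[[], float] = time.perf_counter) -> StreamStats:
    """Time one streamed response, starting the clock before the request."""
    start = clock()
    first_at = None
    last_at = start
    chunks = 0
    for _ in start_stream():
        last_at = clock()
        if first_at is None:
            first_at = last_at
        chunks += 1
    if first_at is None:
        raise ValueError("stream produced no output")
    decode_s = last_at - first_at
    # The rate excludes the first chunk, whose delay is already counted as TTFT.
    rate = (chunks - 1) / decode_s if decode_s > 0 else 0.0
    return StreamStats(first_at - start, chunks, rate, last_at - start)


def fake_stream(n_chunks: int = 20, startup_s: float = 0.05,
                gap_s: float = 0.005) -> Iterator[str]:
    """Hypothetical stand-in for a real streaming API call."""
    time.sleep(startup_s)          # simulated queueing and prefill
    for i in range(n_chunks):
        if i:
            time.sleep(gap_s)      # simulated gap between chunks
        yield f"tok{i} "


if __name__ == "__main__":
    stats = measure_stream(fake_stream)
    print(f"TTFT {stats.ttft_s:.3f} s, "
          f"{stats.decode_chunks_per_s:.0f} chunks/s, "
          f"total {stats.total_s:.3f} s over {stats.chunks} chunks")
```

Running this repeatedly with your real prompt lengths, and at the concurrency you expect in production, gives a distribution rather than a single number; the slower tail of that distribution is usually what users notice.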