Question 1

What is the difference between time-to-first-token and tokens per second?

Accepted Answer

Time-to-first-token (TTFT) is how long you wait before any output appears, and it determines how responsive a chat feels. Tokens per second (the generation rate) is how fast the rest of the answer streams out once it starts. A model can have a slow TTFT but high throughput, or vice versa, so both matter.
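
As a rough illustration of how the two numbers are separated in practice, the sketch below derives both from the arrival times of streamed tokens. The function name `stream_metrics` and the timestamps are hypothetical; it assumes you have recorded when the request was sent and when each token arrived, and it is not tied to any particular provider's API.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamMetrics:
    ttft_s: float        # seconds from sending the request to the first token
    tokens_per_s: float  # generation rate measured after the first token


def stream_metrics(request_time: float, token_times: Sequence[float]) -> StreamMetrics:
    """Split a streamed response into TTFT and generation rate.

    request_time and token_times are timestamps in seconds on the same clock
    (hypothetical inputs, e.g. collected with time.perf_counter()).
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")

    # TTFT: the wait before anything appears.
    ttft = token_times[0] - request_time

    # Generation rate: tokens after the first one, divided by the time
    # between the first and the last token.
    decode_time = token_times[-1] - token_times[0]
    decoded = len(token_times) - 1
    rate = decoded / decode_time if decode_time > 0 else float("nan")

    return StreamMetrics(ttft_s=ttft, tokens_per_s=rate)


# Made-up timestamps: first token after 0.5 s, then one token every 0.25 s.
metrics = stream_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(metrics)
```

Measuring the rate only after the first token keeps the prefill wait out of the tokens-per-second figure, which is why the two metrics can move independently.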

Question 2

Why is Groq so much faster than typical GPU inference?

Accepted Answer

Groq runs models on its custom LPU (Language Processing Unit) hardware, which is designed specifically for sequential token generation, rather than on general-purpose GPUs. For supported open models this can push throughput into the hundreds of tokens per second. The tradeoff is that you are limited to the models Groq has deployed.

Question 3

Does a bigger model always mean slower inference?

Accepted Answer

Not always. Larger models generally generate tokens more slowly because each token requires more computation, but model size is not the only factor. Hardware, batching, quantisation, context length, and serving optimisations like speculative decoding all shift the result, so a well-served large model can outpace a poorly served small one.

Question 4

How much does context length affect speed?

Accepted Answer

Long inputs slow down the initial 'prefill' stage and therefore push out time-to-first-token, because the model must process every input token before producing the first output token. Per-token generation speed afterwards is affected less, but very long contexts also raise memory pressure, which can reduce throughput.
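
A back-of-the-envelope model can make the prefill effect concrete. The sketch below assumes, as a simplification, that prefill and generation each run at a constant rate; the function `estimate_latency` and all the rates used are hypothetical illustration values, not measurements of any model or provider.

```python
def estimate_latency(
    input_tokens: int,
    output_tokens: int,
    prefill_tokens_per_s: float,
    decode_tokens_per_s: float,
    overhead_s: float = 0.0,
) -> tuple[float, float]:
    """Return (ttft_s, total_s) under a simple two-phase model.

    Assumption: prefill processes the whole input at a constant rate and
    yields the first output token; each remaining output token then takes
    1 / decode_tokens_per_s seconds.
    """
    # Prefill: every input token is processed before the first output token.
    ttft = overhead_s + input_tokens / prefill_tokens_per_s

    # Generation: the first token came out of prefill, so count the rest.
    remaining = max(output_tokens - 1, 0)
    total = ttft + remaining / decode_tokens_per_s
    return ttft, total


# Same answer length, short versus long prompt (made-up rates).
short_prompt = estimate_latency(1_000, 101, prefill_tokens_per_s=2_000, decode_tokens_per_s=50)
long_prompt = estimate_latency(8_000, 101, prefill_tokens_per_s=2_000, decode_tokens_per_s=50)
print(short_prompt, long_prompt)
```

In this model the longer prompt changes only the TTFT term. It deliberately ignores the memory-pressure effect on generation speed described above, so treat it as a lower bound on the slowdown from very long contexts.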

AI Inference Speed Compared: Tokens Per Second for Major Models

What “inference speed” actually means

What drives the differences

Rough landscape across major models

Choosing for real-time use cases