What is time-to-first-token?

Time-to-first-token (TTFT) is the delay between sending your request and receiving the first token of the response. It is dominated by the prefill phase, where the model processes your entire prompt, so longer prompts increase TTFT. With streaming, TTFT is what the user perceives as responsiveness.

Why compare streaming and non-streaming?

Non-streaming makes the user wait for the full response before seeing anything, so perceived latency equals TTFT plus the full generation time. Streaming shows tokens as they arrive, so the user sees output after just the TTFT. For long outputs the difference in perceived responsiveness is dramatic.

How accurate are the estimates?

They use representative published throughput benchmarks and are meant for planning and comparison, not SLAs. Real latency varies with provider load, region, batching, prompt caching, and network conditions. Treat the numbers as ballpark figures and measure your own production traffic.

Does prompt length affect latency?

Yes. The prefill phase processes every input token before generation begins, so a long prompt raises time-to-first-token. The estimator adds a per-input-token prefill cost on top of each model's base TTFT to reflect this.

What is the LLM Latency Estimator?

The LLM Latency Estimator estimates time-to-first-token and total response latency from published throughput benchmarks. Pick a model, enter the input and expected output token counts, and compare the perceived latency of streaming and non-streaming responses. It runs free in your browser on Gera Tools, with nothing uploaded.

LLM Latency Estimator

Name: LLM Latency Estimator
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Estimate before you commit to a model

Cost is not the only axis when choosing an LLM — latency shapes the whole user experience. A model that is cheaper per token but twice as slow can make a chat feel sluggish. This estimator turns a model choice plus token counts into time-to-first-token and total latency, and shows how much streaming changes what the user actually feels.

How it works

Each model carries two benchmark figures: a base time-to-first-token and an output throughput in tokens per second. The estimator adds a small per-input-token prefill cost (longer prompts take longer to process before the first token) to the base TTFT, then divides expected output tokens by throughput to get generation time. It reports both, and contrasts the perceived latency of streaming (user waits only for TTFT) against non-streaming (user waits for the whole response).
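
The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual source: the class and function names are hypothetical, and the benchmark figures for the example model are made-up placeholders rather than published numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelBenchmark:
    """Per-model figures (all values used below are illustrative assumptions)."""
    base_ttft_s: float                # base time-to-first-token, in seconds
    output_tokens_per_s: float        # output throughput
    prefill_s_per_input_token: float  # assumed extra TTFT per input token


@dataclass(frozen=True)
class LatencyEstimate:
    ttft_s: float
    generation_s: float

    @property
    def total_s(self) -> float:
        return self.ttft_s + self.generation_s

    @property
    def perceived_streaming_s(self) -> float:
        # With streaming, the user waits only for the first token.
        return self.ttft_s

    @property
    def perceived_non_streaming_s(self) -> float:
        # Without streaming, the user waits for the whole response.
        return self.total_s


def estimate_latency(model: ModelBenchmark, input_tokens: int,
                     output_tokens: int) -> LatencyEstimate:
    # TTFT = base TTFT + per-input-token prefill cost.
    ttft = model.base_ttft_s + input_tokens * model.prefill_s_per_input_token
    # Generation time = expected output tokens / throughput.
    generation = output_tokens / model.output_tokens_per_s
    return LatencyEstimate(ttft_s=ttft, generation_s=generation)


# Hypothetical model: 0.5 s base TTFT, 50 tokens/s, 0.5 ms prefill per input token.
example_model = ModelBenchmark(base_ttft_s=0.5, output_tokens_per_s=50.0,
                               prefill_s_per_input_token=0.0005)
estimate = estimate_latency(example_model, input_tokens=1000, output_tokens=200)
print(f"TTFT: {estimate.ttft_s:.2f} s, total: {estimate.total_s:.2f} s")
```

With these placeholder figures, the streaming user waits about 1 second for the first token, while the non-streaming user waits about 5 seconds for the full response.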

Tips for using the numbers

Stream anything over a sentence or two. The perceived-latency gap grows linearly with output length — streaming keeps long answers feeling fast.
Shorten prompts to cut TTFT. Prefill scales with input tokens; trimming a bloated system prompt directly lowers the time before the first token.
Pick the faster model for interactive UX and the cheaper one for batch jobs, where total throughput matters more than responsiveness.
These are planning estimates — always measure real production latency, which varies with provider load and region.
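
One way to measure real latency is to time a streamed response yourself. The sketch below assumes only that your client exposes the response as an iterable of text chunks; `measure_stream` and `fake_stream` are hypothetical names, and `fake_stream` stands in for a real provider call so the example runs on its own.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(chunks: Iterable[str]) -> Tuple[float, float, int]:
    """Return (ttft_s, total_s, chunk_count) for any iterable of chunks."""
    start = time.perf_counter()
    ttft_s = None
    count = 0
    for _chunk in chunks:
        if ttft_s is None:
            # The first chunk has arrived: this is the measured TTFT.
            ttft_s = time.perf_counter() - start
        count += 1
    total_s = time.perf_counter() - start
    if ttft_s is None:
        # Empty stream: nothing ever arrived, so report the total wait.
        ttft_s = total_s
    return ttft_s, total_s, count


def fake_stream(first_delay_s: float, per_chunk_delay_s: float,
                n_chunks: int) -> Iterator[str]:
    """Stand-in for a real streaming API call (assumption for illustration)."""
    time.sleep(first_delay_s)  # simulated prefill
    for i in range(n_chunks):
        if i > 0:
            time.sleep(per_chunk_delay_s)  # simulated per-chunk generation
        yield f"tok{i} "


ttft_s, total_s, count = measure_stream(fake_stream(0.05, 0.01, 5))
print(f"TTFT {ttft_s:.3f} s, total {total_s:.3f} s, {count} chunks")
```

In practice you would replace `fake_stream(...)` with the chunk iterator from your own client and collect these timings over many requests, since a single measurement says little about typical load.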

The streaming vs. non-streaming difference in practice

For short responses (one or two sentences), the difference between streaming and non-streaming is barely noticeable. Generating 50 tokens at 80 tokens/second takes about 0.6 seconds, so whether you show the text word by word or all at once makes little practical difference.

For longer responses, the gap becomes significant. Consider a model with a 1-second TTFT generating a 600-token response at 60 tokens/second:

Streaming: user sees the first word after 1 second, then reads as text flows in over the next 10 seconds.
Non-streaming: user stares at a spinner for 11 seconds, then the full response appears at once.

The total time is the same, but with streaming the wait before anything appears is 11 times shorter (1 second versus 11 seconds), and the user can start reading while the rest of the response is still being generated.
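
The arithmetic behind this example, and the short-response case before it, can be checked directly. The variable names are only for illustration; the figures are the ones used in the text.

```python
# Long response: 1-second TTFT, 600 tokens at 60 tokens/second.
ttft_s = 1.0
generation_s = 600 / 60.0                        # 10.0 seconds of generation

streaming_wait_s = ttft_s                        # first text appears here
non_streaming_wait_s = ttft_s + generation_s     # full response appears here
wait_ratio = non_streaming_wait_s / streaming_wait_s

# Short response: 50 tokens at 80 tokens/second.
short_generation_s = 50 / 80.0                   # 0.625 seconds

print(f"streaming wait: {streaming_wait_s} s")
print(f"non-streaming wait: {non_streaming_wait_s} s")
print(f"ratio: {wait_ratio}")
print(f"short response generation: {short_generation_s} s")
```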

When non-streaming makes sense

Despite the UX advantage of streaming, non-streaming has legitimate uses:

Tool-call pipelines where the model’s output must be parsed and used programmatically before anything is shown to a user. You cannot reliably parse a half-complete JSON tool call.
Batch processing where responses are written to a database or file rather than shown to a human. There is no perceived latency to optimize.
Short classification tasks where the output is one or two tokens and the TTFT dominates the whole response time anyway.
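
The tool-call case is easy to demonstrate. In the sketch below, the JSON shape (`name` and `arguments`) and the helper names are hypothetical and not any specific provider's format; the point is only that a truncated JSON document fails to parse, so the pipeline has to buffer the full output first.

```python
import json
from typing import Iterable, Optional

# Hypothetical tool call; real providers define their own schemas.
full_tool_call = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
# What a consumer might hold partway through a stream.
partial_tool_call = full_tool_call[:30]


def try_parse(text: str) -> Optional[dict]:
    """Return the parsed object, or None if the text is not complete JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


def collect_then_parse(chunks: Iterable[str]) -> dict:
    """Buffer every chunk, then parse once the output is complete."""
    return json.loads("".join(chunks))


print(try_parse(partial_tool_call))  # the half-complete call does not parse
chunks = [full_tool_call[:10], full_tool_call[10:30], full_tool_call[30:]]
print(collect_then_parse(chunks)["arguments"]["city"])
```

Buffering like this is effectively non-streaming from the pipeline's point of view, even if the underlying API call is streamed.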

How input length affects TTFT

The prefill phase — where the model processes your entire prompt before generating the first output token — scales with input length. A 500-token system prompt adds more to TTFT than a 50-token one. For latency-sensitive applications, shortening or caching the system prompt is one of the most effective ways to improve TTFT without switching models. Many providers offer prompt caching specifically to mitigate this cost for repeated, identical prompt prefixes.
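
To get a feel for the effect, here is a rough sketch comparing a 50-token prompt, a 500-token prompt, and a 500-token prompt with most of its prefix cached. Everything numeric here is an assumption: the base TTFT, the per-token prefill cost, and especially `cached_cost_fraction`, a hypothetical knob for how much of the prefill cost a cached token still incurs. Actual caching savings depend on the provider.

```python
def estimate_ttft(base_ttft_s: float, prefill_s_per_token: float,
                  input_tokens: int, cached_tokens: int = 0,
                  cached_cost_fraction: float = 0.0) -> float:
    """Estimate TTFT with an optional cached prompt prefix (illustrative model)."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached_tokens = input_tokens - cached_tokens
    # Cached tokens are charged at a fraction of the normal prefill cost.
    effective_tokens = uncached_tokens + cached_cost_fraction * cached_tokens
    return base_ttft_s + prefill_s_per_token * effective_tokens


BASE_TTFT_S = 0.4            # assumed placeholder
PREFILL_S_PER_TOKEN = 0.001  # assumed placeholder: 1 ms per input token

short_prompt = estimate_ttft(BASE_TTFT_S, PREFILL_S_PER_TOKEN, input_tokens=50)
long_prompt = estimate_ttft(BASE_TTFT_S, PREFILL_S_PER_TOKEN, input_tokens=500)
long_cached = estimate_ttft(BASE_TTFT_S, PREFILL_S_PER_TOKEN, input_tokens=500,
                            cached_tokens=450)

print(f"50 tokens: {short_prompt:.2f} s")
print(f"500 tokens: {long_prompt:.2f} s")
print(f"500 tokens, 450 cached: {long_cached:.2f} s")
```

Under these assumptions, caching a 450-token prefix brings the long prompt's estimated TTFT back to roughly that of the short prompt.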