What is time-to-first-token (TTFT)?

TTFT is the delay between sending a request and the first streamed token arriving. In streaming chat UIs it is what users perceive as responsiveness, far more than total generation time, so it is the key latency metric to optimise.

Where do the TTFT figures come from?

They are representative estimates, which you can edit, based on published benchmarks and typical observed latencies. Real TTFT depends on region, load, prompt length and provider routing, so confirm the figures against your own measurements.

Why does a faster model sometimes cost more?

Smaller, distilled models usually stream first tokens quickly and cost less, while large frontier models tend to be slower and pricier. That pattern does not always hold, so the tool shows the actual trade-off for each model rather than letting you assume that cheap means slow or that fast means cheap.

Does output length affect TTFT?

TTFT itself is about the first token, so output length mostly affects total time and cost, not TTFT. The tool uses output length only to compute per-call cost, which it checks against your budget alongside the TTFT limit.

What is the First-Token Latency vs Cost Optimizer?

Rank LLMs by time-to-first-token (TTFT) against cost per call to find the right model for latency-sensitive streaming apps within your budget. See which models meet your TTFT requirement and which one is cheapest among them. It runs free in your browser on Gera Tools, with nothing uploaded.

Name: First-Token Latency vs Cost Optimizer
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

First-token latency vs cost optimizer

In a streaming chat UI, users judge speed by how fast the first token appears — time-to-first-token (TTFT) — not by total generation time. The fastest-streaming model isn’t always the cheapest, and the cheapest isn’t always fast enough. This optimizer ranks models by TTFT against cost per call so you can pick the cheapest model that still meets your latency bar.

How it works

You set a maximum acceptable TTFT and a per-call cost budget. The tool checks each model’s estimated TTFT and computes its cost for your input and output token counts, then filters to the models that satisfy both constraints:

qualifies = (model_ttft ≤ ttft_limit) AND (call_cost ≤ budget)
call_cost = (input_tokens × input_rate) + (output_tokens × output_rate), using the model's rates

Among the qualifying models it highlights the cheapest, so you get the best price without breaking the responsiveness your UX needs.
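
The filter-then-pick-cheapest logic above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the `Model` class, the model names, the TTFT figures and the per-million-token rates are all made-up placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Model:
    name: str            # hypothetical model name
    ttft_ms: float       # estimated time-to-first-token, in milliseconds
    input_rate: float    # assumed price in USD per 1M input tokens
    output_rate: float   # assumed price in USD per 1M output tokens


def call_cost(m: Model, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call: each token type priced at its own rate."""
    return (input_tokens * m.input_rate + output_tokens * m.output_rate) / 1_000_000


def cheapest_qualifying(models: List[Model], ttft_limit_ms: float, budget: float,
                        input_tokens: int, output_tokens: int) -> Optional[Model]:
    """Return the cheapest model meeting both constraints, or None if none do."""
    qualifying = [
        m for m in models
        if m.ttft_ms <= ttft_limit_ms
        and call_cost(m, input_tokens, output_tokens) <= budget
    ]
    return min(qualifying,
               key=lambda m: call_cost(m, input_tokens, output_tokens),
               default=None)


# Placeholder data for illustration only; not real benchmark or pricing figures.
MODELS = [
    Model("small-fast", ttft_ms=250, input_rate=0.20, output_rate=0.80),
    Model("mid-tier", ttft_ms=600, input_rate=1.00, output_rate=4.00),
    Model("large-slow", ttft_ms=1500, input_rate=5.00, output_rate=15.00),
]

best = cheapest_qualifying(MODELS, ttft_limit_ms=800, budget=0.01,
                           input_tokens=1000, output_tokens=500)
```

With these placeholder numbers, `large-slow` fails the TTFT limit and `small-fast` is the cheaper of the two remaining models, so `best` is `small-fast`. Returning `None` when nothing qualifies makes the "no model fits" case explicit instead of silently picking a model that breaks a constraint.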

What counts as TTFT and why it matters more than total latency

Time-to-first-token is the gap between sending your API request and receiving the first streamed token of the response. In a chat interface, this is the pause the user experiences before seeing anything. It determines whether the UI feels “alive” or stuck.

Once the first token arrives, streaming fills in the rest progressively. Users perceive progressive rendering as fast even if total generation takes several seconds, because they see movement rather than a blank screen. This is why TTFT is the critical metric for streaming UIs, while total generation time matters more for batch or background tasks.
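
To make the distinction between TTFT and total generation time concrete, here is a small sketch that times both for any iterable of streamed chunks. The `fake_stream` generator is a stand-in for a real streaming client, and its delays are arbitrary assumptions; with a real API you would start the clock immediately before sending the request.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttft(stream: Iterable[str]) -> Tuple[float, float]:
    """Return (ttft_seconds, total_seconds) for an iterable of streamed chunks."""
    start = time.perf_counter()   # with a real client, start before the request is sent
    ttft = None
    for chunk in stream:
        if ttft is None and chunk:            # first non-empty chunk marks TTFT
            ttft = time.perf_counter() - start
    total = time.perf_counter() - start
    if ttft is None:
        raise ValueError("stream produced no tokens")
    return ttft, total


def fake_stream(first_delay: float = 0.05, gap: float = 0.01, n: int = 5) -> Iterator[str]:
    """Hypothetical stand-in for a streaming API: waits, then yields n chunks."""
    time.sleep(first_delay)                   # simulated queueing and prompt processing
    for i in range(n):
        yield f"tok{i} "
        time.sleep(gap)                       # simulated per-token generation time


ttft, total = measure_ttft(fake_stream())
```

Here `ttft` captures only the initial wait, while `total` also includes the time spent generating the remaining chunks. Running the measurement many times and looking at a high percentile, rather than a single sample, gives a more realistic picture under load.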

Choosing a target TTFT for your use case

As a rough guide based on common UX thresholds:

Under 300 ms: feels nearly instant. Appropriate for real-time voice interfaces or inline autocomplete, where any delay feels disruptive.
300–800 ms: acceptable for chat interfaces; users tolerate this range without perceiving it as lag.
800 ms–2 s: noticeable but workable for longer-context queries, where users expect a moment of processing.
Over 2 s: typically perceived as slow in interactive chat; users may abandon the session or assume something is broken.

Set your TTFT budget based on the interaction type, not on an absolute standard. A research assistant, where the user has just written a long question and expects a considered answer, can tolerate more latency than an inline autocomplete box.
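
The rough bands above can be turned into a small helper for labelling measured TTFT values, for example in a dashboard. The band names and the handling of exact boundary values (300, 800 and 2000 ms) are assumptions made for this sketch; the guide itself does not specify them.

```python
def ttft_band(ttft_ms: float) -> str:
    """Map a TTFT in milliseconds to one of the rough UX bands."""
    if ttft_ms < 0:
        raise ValueError("TTFT cannot be negative")
    if ttft_ms < 300:
        return "near-instant"
    if ttft_ms <= 800:          # boundary placement is an assumption
        return "acceptable for chat"
    if ttft_ms <= 2000:
        return "noticeable but workable"
    return "slow"
```

A helper like this is only a starting point: the right thresholds depend on the interaction type, so the cut-offs should be treated as adjustable parameters rather than fixed rules.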

When to route differently for latency

One powerful pattern is tiered routing: use a smaller, low-latency model for immediate acknowledgement tokens (“I’m looking that up…”), then stream results from a higher-quality model in parallel or serially. The user sees an instant response; the quality work happens behind it.
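
One way to sketch the parallel variant of tiered routing is with `asyncio`: start the larger model right away, emit the small model's acknowledgement as soon as it is ready, then relay the larger model's chunks. Both `fast_ack` and `slow_answer` are hypothetical stand-ins for real model calls, and the sleeps simulate their latencies.

```python
import asyncio
from typing import AsyncIterator, List


async def fast_ack(prompt: str) -> str:
    """Hypothetical low-latency model: returns a short acknowledgement."""
    await asyncio.sleep(0.01)
    return "I'm looking that up... "


async def slow_answer(prompt: str) -> AsyncIterator[str]:
    """Hypothetical higher-quality model: slower first token, then a stream."""
    await asyncio.sleep(0.05)
    for word in ["Here", "is", "the", "answer."]:
        yield word + " "


async def tiered_stream(prompt: str) -> AsyncIterator[str]:
    queue: asyncio.Queue = asyncio.Queue()

    async def pump() -> None:
        # Buffer the large model's chunks; None marks the end of the stream.
        try:
            async for chunk in slow_answer(prompt):
                await queue.put(chunk)
        finally:
            await queue.put(None)

    # Start the large model first so its latency overlaps with the acknowledgement.
    task = asyncio.create_task(pump())
    yield await fast_ack(prompt)
    while (chunk := await queue.get()) is not None:
        yield chunk
    await task  # surfaces any exception raised by the large model


async def collect(prompt: str) -> List[str]:
    return [chunk async for chunk in tiered_stream(prompt)]


chunks = asyncio.run(collect("What is TTFT?"))
```

The user-visible TTFT is that of the small model, while the larger model's startup cost is hidden behind the acknowledgement. A production version would also need cancellation and error messaging if the larger model fails after the acknowledgement has already been shown.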

Another approach: cache common inputs. If a subset of prompts repeats with high frequency (common questions, system-prompt-only calls), the prompt caching offered by some providers can cut TTFT substantially for those calls.
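
Provider-side prompt caching is configured through each provider's own API, so it is not shown here. As a related but different technique, the sketch below is an application-side exact-match cache that skips the model call entirely for repeated prompts. The `ResponseCache` class, the normalisation step and the `generate` callback are all assumptions made for illustration.

```python
import hashlib
from typing import Callable, Dict


def cache_key(system_prompt: str, user_prompt: str) -> str:
    """Stable key for a (system prompt, user prompt) pair."""
    h = hashlib.sha256()
    h.update(system_prompt.encode("utf-8"))
    h.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") differ
    # Light normalisation (an assumption): ignore case and surrounding whitespace.
    h.update(user_prompt.strip().lower().encode("utf-8"))
    return h.hexdigest()


class ResponseCache:
    """Exact-match response cache wrapped around a hypothetical generate function."""

    def __init__(self, generate: Callable[[str, str], str]) -> None:
        self._generate = generate
        self._store: Dict[str, str] = {}
        self.hits = 0

    def get(self, system_prompt: str, user_prompt: str) -> str:
        key = cache_key(system_prompt, user_prompt)
        if key in self._store:
            self.hits += 1                 # served locally, no model call
            return self._store[key]
        result = self._generate(system_prompt, user_prompt)
        self._store[key] = result
        return result
```

A cache like this only helps when identical answers are acceptable for identical questions. It has no expiry or size limit, both of which a real deployment would need.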

Tips and notes

Optimise TTFT, not total latency, for chat — streaming hides total generation time as long as the first token lands fast.
A small “router” model can stream an instant acknowledgement while a larger model works in the background, decoupling perceived latency from quality.
TTFT varies with region and load; pin your provider region close to users and re-measure under real traffic before locking a model in.
The TTFT estimates in this tool are representative figures — measure your actual production TTFT at your traffic levels and preferred region before making infrastructure decisions.