Chat template token overhead analyzer
Every chat API silently wraps your messages in template tokens — role headers, turn delimiters, sequence markers — and bills you for all of them. For long answers this is rounding error; for a high-volume classifier sending five-token messages, the wrapper can cost more than the content. This tool estimates that hidden overhead across four model families so you can see it in numbers.
How it works
The analyzer counts the tokens in your visible message content, then adds each model family’s per-turn template tokens:
- ChatML (OpenAI): a turn-start marker, a role header, and a turn-end marker per message.
- Llama 3: a begin-of-text token once, plus start/end header-id tokens and an end-of-turn token for each message.
- Mistral: instruction-wrapper tokens around user turns plus sequence BOS/EOS.
- Gemma / Gemini: start-of-turn and end-of-turn markers per message.
For each family it reports total template tokens, total prompt tokens, and the overhead percentage = template / (content + template). Content tokens are estimated from character count at roughly four characters per token.
Tips and notes
- Overhead is dominated by turn count, not message length — fewer, larger turns are cheaper per token than many tiny ones.
- A repeated system prompt pays its content and template cost on every call unless your provider caches it; prompt caching is the single biggest lever for chat-heavy workloads.
- For extreme-volume short calls (moderation, classification), consider a non-chat completion endpoint to skip per-turn wrappers entirely.
- The percentages here are estimates; trust the API’s returned
usagefor billing, and use this tool to decide whether the overhead is worth optimizing.