Serving the same content in another language can cost noticeably more with most LLMs — not because of translation fees, but because tokenizers encode some scripts far less efficiently than English. This tool estimates that hidden tax across 20 languages from a single English word count.
How it works
LLM tokenizers are trained on a corpus dominated by English and other Latin-script text. They learn long, frequent English chunks as single tokens, so English encodes efficiently, at roughly 1.3 tokens per word. Writing systems that were under-represented in training, especially logographic ones (Chinese, Japanese) and abugida or abjad scripts (the Devanagari used for Hindi, Thai, Arabic), fall back to much smaller units: often one token per character, or several tokens when a character is split into its individual bytes.
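To make the byte-level fallback concrete, the sketch below counts UTF-8 bytes per character for a few sample strings. It is only an upper bound on the fallback cost, not a token count: real tokenizers merge common byte sequences, so actual numbers are lower. The sample strings are arbitrary illustrations.

```python
# Worst case for a byte-level tokenizer: every UTF-8 byte becomes its own token.
# This measures bytes per character, NOT real token counts.
samples = {
    "English": "hello",
    "Chinese": "你好",
    "Hindi": "नमस्ते",
}


def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes used per Unicode code point."""
    return len(text.encode("utf-8")) / len(text)


for language, text in samples.items():
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    print(f"{language:<8} {n_chars} chars, {n_bytes} bytes, "
          f"{utf8_bytes_per_char(text):.1f} bytes/char")
```

ASCII letters take one byte each, while CJK and Devanagari code points take three, so a tokenizer with no learned merges for those scripts starts from a threefold handicap per character.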
The calculator starts from your English token estimate (words × 1.3) and multiplies it by a per-language factor drawn from published tokenizer studies. It then prices the inflated token count at your chosen per-million-token rate, so you can see both the multiplier and the real cost delta side by side.
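The calculation described above can be sketched in a few lines. This is a minimal illustration, not the tool's actual source: the names `Estimate` and `estimate` are hypothetical, and the language factor is passed in by the caller rather than looked up from any real dataset.

```python
from typing import NamedTuple

TOKENS_PER_ENGLISH_WORD = 1.3  # rule of thumb quoted in the text


class Estimate(NamedTuple):
    tokens: float          # estimated tokens in the target language
    cost_usd: float        # price of those tokens
    extra_cost_usd: float  # cost delta versus the English baseline


def estimate(english_words: int,
             language_factor: float,
             usd_per_million_tokens: float) -> Estimate:
    """Hypothetical helper: words -> English tokens -> inflated tokens -> cost."""
    english_tokens = english_words * TOKENS_PER_ENGLISH_WORD
    tokens = english_tokens * language_factor
    rate = usd_per_million_tokens / 1_000_000  # USD per single token
    return Estimate(tokens, tokens * rate, (tokens - english_tokens) * rate)
```

For instance, `estimate(1000, 1.6, 5.0)` yields about 2,080 tokens at about $0.0104, of which roughly $0.0039 is the increase over English.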
Worked example
For 1,000 English words at $5 per million tokens:
- English: ~1,300 tokens → $0.0065
- Spanish (×1.1): ~1,430 tokens → $0.0072
- Chinese (×1.6): ~2,080 tokens → $0.0104
- Hindi (×2.7): ~3,510 tokens → $0.0176
The same message costs roughly 2.7× as much in Hindi as in English under these multipliers. At scale, across millions of multilingual requests, that gap can dominate a budget.
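To see what the gap means at volume, the sketch below scales the worked example to a hypothetical one million requests of 1,000 English-equivalent words each. The request count is an assumption chosen for illustration; the factors and the $5 rate are the ones from the example.

```python
TOKENS_PER_ENGLISH_WORD = 1.3
USD_PER_MILLION_TOKENS = 5.0
WORDS_PER_REQUEST = 1_000
REQUESTS = 1_000_000  # hypothetical volume, not a figure from the tool

# Per-language factors taken from the worked example.
FACTORS = {"English": 1.0, "Spanish": 1.1, "Chinese": 1.6, "Hindi": 2.7}


def spend_usd(factor: float, requests: int = REQUESTS) -> float:
    """Total spend for `requests` messages in a language with the given factor."""
    tokens_per_request = WORDS_PER_REQUEST * TOKENS_PER_ENGLISH_WORD * factor
    return tokens_per_request * requests * USD_PER_MILLION_TOKENS / 1_000_000


baseline = spend_usd(FACTORS["English"])
for language, factor in FACTORS.items():
    total = spend_usd(factor)
    print(f"{language:<8} ${total:>10,.2f}  (+${total - baseline:,.2f} vs English)")
```

Under these assumptions the Hindi traffic comes to about $17,550 against about $6,500 for English, a difference of roughly $11,050 for identical content.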
Tips
- Newer o200k-family tokenizers narrow these gaps versus older cl100k ones; if you serve many non-English users, the tokenizer version is a real cost lever.
- Multipliers apply to context-window consumption too — size your prompts in the worst-case language, not English.
- Pair this with the LLM API Cost Calculator to model full monthly spend once you know your effective multiplier.
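The context-window tip can be sketched the same way: divide the available token budget by the worst-case factor to find how many English-equivalent words safely fit. The function name and the 8,000-token budget are hypothetical, and the 2.7 factor is the Hindi value from the worked example.

```python
import math

TOKENS_PER_ENGLISH_WORD = 1.3


def max_english_words(context_tokens: int, worst_case_factor: float) -> int:
    """Largest English word count whose worst-case translation fits the budget."""
    tokens_per_word = TOKENS_PER_ENGLISH_WORD * worst_case_factor
    return math.floor(context_tokens / tokens_per_word)


budget = 8_000  # hypothetical prompt budget in tokens
print("English only:     ", max_english_words(budget, 1.0), "words")
print("Worst case (x2.7):", max_english_words(budget, 2.7), "words")
```

With these numbers the safe prompt size drops from about 6,150 words to about 2,280, which is why sizing against English alone can overflow the window for other languages.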