Translation token multiplier
The same paragraph costs wildly different amounts to process depending on its language. Because LLM tokenizers are trained on English-heavy data, translating into Arabic, Hindi, or Thai can double or triple your token bill even though the meaning is identical. This tool estimates that multiplier for 20+ languages so you can budget multilingual features honestly.
How it works
The tool counts the characters in your English source, then applies a per-language tokens-per-character ratio derived from empirical multilingual tokenization data. English is the baseline at roughly 0.25 tokens per character (about 4 characters per token); other languages have higher or lower ratios depending on script and how well the tokenizer’s vocabulary covers them.
Each language’s estimated token count is target_chars × tokens_per_char, where target_chars
also accounts for typical text-length expansion or contraction during translation (Spanish tends to
run longer than English; Chinese much shorter in character count). The multiplier column is the
estimated target token count divided by the English token count.
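The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the function name estimate, the LANGUAGE_PROFILES table, and every Spanish and Chinese figure in it are hypothetical placeholders chosen only to make the arithmetic visible. Only the English baseline of 0.25 tokens per character comes from the text.

```python
# Baseline from the text: English is roughly 0.25 tokens per character.
ENGLISH_TOKENS_PER_CHAR = 0.25

# HYPOTHETICAL placeholder values for illustration only, not measured data.
# expansion: target characters produced per English source character
# tokens_per_char: tokens per character in the target language
LANGUAGE_PROFILES = {
    "English": {"expansion": 1.0, "tokens_per_char": 0.25},
    "Spanish": {"expansion": 1.2, "tokens_per_char": 0.30},
    "Chinese": {"expansion": 0.4, "tokens_per_char": 0.90},
}


def estimate(english_text, language):
    """Return (estimated target tokens, multiplier versus English)."""
    profile = LANGUAGE_PROFILES[language]
    source_chars = len(english_text)
    english_tokens = source_chars * ENGLISH_TOKENS_PER_CHAR
    # target_chars accounts for text-length expansion or contraction.
    target_chars = source_chars * profile["expansion"]
    target_tokens = target_chars * profile["tokens_per_char"]
    # Guard against empty input so the division is always defined.
    multiplier = target_tokens / english_tokens if english_tokens else 0.0
    return target_tokens, multiplier


if __name__ == "__main__":
    sample = "The quick brown fox jumps over the lazy dog."
    for name in LANGUAGE_PROFILES:
        tokens, mult = estimate(sample, name)
        print(f"{name}: ~{tokens:.1f} tokens, {mult:.2f}x English")
```

Note that a language can shrink in character count and still cost more tokens: with the placeholder Chinese profile, the text is 60% shorter but each character costs far more tokens, so the multiplier ends up above 1.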
Tips and notes
- Switching from cl100k_base to o200k_base (GPT-4o and o-series) meaningfully reduces non-English token counts; the newer tokenizer is far friendlier to CJK and other scripts.
- The biggest cost surprises are usually Arabic, Hindi, and other Brahmic/abjad scripts, where the multiplier can exceed 2x.
- Output tokens are billed at a higher rate than input on most models, so a multilingual chatbot that responds in a high-multiplier language costs more than one that only reads it.
- Use these figures to set per-market pricing and rate limits — a flat per-message price that works for English can be unprofitable for high-multiplier languages.
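
To make the last two tips concrete, here is a rough sketch of how a multiplier feeds into per-message cost. The function cost_per_message and all prices and token counts below are hypothetical placeholders, not any provider's real rates; substitute your own model's input and output prices.

```python
def cost_per_message(en_input_tokens, en_output_tokens,
                     read_multiplier, reply_multiplier,
                     input_price_per_mtok, output_price_per_mtok):
    """Estimate the cost of one message exchange.

    Token counts are the English-equivalent baseline; the multipliers
    scale them for the language the model reads and the language it
    replies in. Prices are per million tokens.
    """
    input_cost = en_input_tokens * read_multiplier * input_price_per_mtok / 1_000_000
    output_cost = en_output_tokens * reply_multiplier * output_price_per_mtok / 1_000_000
    return input_cost + output_cost


if __name__ == "__main__":
    # HYPOTHETICAL prices: 1.00 per million input tokens, 4.00 per million output.
    IN_PRICE, OUT_PRICE = 1.0, 4.0
    english = cost_per_message(1000, 1000, 1.0, 1.0, IN_PRICE, OUT_PRICE)
    reads_2x = cost_per_message(1000, 1000, 2.0, 1.0, IN_PRICE, OUT_PRICE)
    replies_2x = cost_per_message(1000, 1000, 1.0, 2.0, IN_PRICE, OUT_PRICE)
    print(f"English only:         {english:.4f}")
    print(f"Reads 2x language:    {reads_2x:.4f}")
    print(f"Replies in 2x language: {replies_2x:.4f}")
```

Under these placeholder prices, replying in a 2x language raises the cost more than reading one, because the multiplier lands on the more expensive output tokens. Comparing the result against a flat per-message price shows which markets would run at a loss.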