Token Density by Content Type Calculator

Compare tokens per word across prose, code, JSON, HTML and more


Token density by content type

“One token is about ¾ of a word” is a useful rule for English prose, but it is badly wrong for code, JSON or HTML. Structured, symbol-heavy content produces far more tokens per word because brackets, quotes, indentation and camelCase identifiers tend to break into many small tokens. This calculator shows tokens-per-word ratios for ten common content types and lets you measure your own sample.

How it works

The reference table reflects the typical behaviour of modern BPE tokenizers, such as those used by GPT and Claude models. For a custom sample, the tool counts characters and words, then estimates tokens with the common rule of thumb of roughly 4 characters per token:

est_tokens      ≈ characters / 4
tokens_per_word = est_tokens / word_count

Plain English lands near 1.3 tokens/word; minified JSON, deeply nested data and dense code can reach 2 to 3 tokens/word or more, which directly inflates your input cost.
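The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual code: the function name `estimate_density`, splitting words on whitespace, and counting every character (spaces included) are all assumptions made for the example.

```python
def estimate_density(text: str, chars_per_token: float = 4.0) -> dict:
    """Estimate tokens and tokens-per-word from raw character and word counts.

    Assumptions (not taken from the calculator itself): words are
    whitespace-delimited, and every character, including spaces, is counted.
    """
    characters = len(text)
    word_count = len(text.split())
    est_tokens = characters / chars_per_token
    # Avoid dividing by zero on empty or whitespace-only input.
    tokens_per_word = est_tokens / word_count if word_count else 0.0
    return {
        "characters": characters,
        "word_count": word_count,
        "est_tokens": est_tokens,
        "tokens_per_word": tokens_per_word,
    }


sample = "The quick brown fox jumps over the lazy dog."
result = estimate_density(sample)
print(result)
```

For this 44-character, 9-word sentence the estimate is 11 tokens, or about 1.2 tokens per word, close to the plain-English figure quoted above. Because the estimate is driven purely by character count, it is a planning heuristic and will not match a real tokenizer exactly.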

Tips and notes

  • When budgeting a mixed app (a chat prompt plus a JSON payload plus code), estimate each part with its own ratio rather than averaging blindly.
  • Stripping insignificant whitespace from JSON and HTML before sending it can meaningfully cut token count for high-volume pipelines.
  • For exact billing, always confirm with a model-specific tokenizer; these ratios are meant for quick, approximate planning.
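As a rough illustration of the whitespace tip, the sketch below minifies a pretty-printed JSON payload with Python's standard `json` module and compares the estimated token counts before and after. The helper names `minify_json` and `est_tokens` and the sample payload are hypothetical, and the estimate reuses the ~4 characters per token rule rather than a real tokenizer.

```python
import json


def minify_json(text: str) -> str:
    """Re-serialize JSON without insignificant whitespace."""
    # separators=(",", ":") drops the spaces json.dumps adds by default.
    return json.dumps(json.loads(text), separators=(",", ":"), ensure_ascii=False)


def est_tokens(text: str, chars_per_token: float = 4.0) -> float:
    """Rule-of-thumb token estimate; not a real tokenizer."""
    return len(text) / chars_per_token


# Hypothetical payload, pretty-printed the way many APIs and editors emit it.
pretty = json.dumps({"user": {"id": 42, "tags": ["a", "b"]}}, indent=4)
compact = minify_json(pretty)

print(compact)
print(f"pretty:  {len(pretty)} chars, ~{est_tokens(pretty):.1f} tokens")
print(f"compact: {len(compact)} chars, ~{est_tokens(compact):.1f} tokens")
```

The character-based estimate probably overstates the saving, since BPE tokenizers often merge runs of spaces into a single token. Measure with the target model's tokenizer before relying on the difference.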