Why does code cost more tokens than prose?

Tokenizers split on punctuation, symbols, indentation and camelCase boundaries. Code and structured formats like JSON are dense with these, so a single line of code can use far more tokens than a sentence of equivalent length.

How is the custom estimate calculated?

The tool counts the characters in your sample, applies a heuristic of roughly 4 characters per token, then divides the estimated token count by the word count to show tokens per word. The result tracks GPT and Claude tokenizers closely for typical text, but it is an estimate, not an exact count.

Which ratio should I use for budgeting?

Pick the row that matches your dominant content. For mixed apps, blend the ratios by the share of each content type, or paste a representative sample to get one combined figure.

Is my pasted text sent anywhere?

No. The measurement runs entirely in your browser. Nothing you paste is uploaded, stored or logged.

What is the Token Density by Content Type Calculator?

See how token density varies across ten common content types — English prose, code, JSON, HTML, CSV, URLs and more. Paste your own sample to measure its tokens-per-word ratio and estimate costs for mixed-content LLM applications accurately. It runs free in your browser on Gera Tools, with nothing uploaded.

Token Density by Content Type Calculator

Name: Token Density by Content Type Calculator
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Token density by content type

“One token is about ¾ of a word” is a useful rule for English prose — and badly wrong for code, JSON or HTML. Structured and symbol-heavy content packs far more tokens per word because tokenizers split on every bracket, quote, indent and camelCase boundary. This calculator shows tokens-per-word ratios for ten common content types and lets you measure your own sample.

Why content type changes your token budget dramatically

Modern LLM tokenizers use Byte Pair Encoding (BPE), which learns frequent character sequences from training data and merges them into single tokens. Common English words become single tokens; rare or symbol-heavy sequences are split into several tokens. Code and structured formats are full of short symbolic sequences that do not appear frequently enough to be merged, so they tokenize at a much higher rate than natural language.
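The merging behaviour described above can be illustrated with a toy BPE learner. This is only a sketch: the function names, the tiny corpus and the character-level starting point are assumptions made for illustration, whereas production tokenizers work on bytes and learn very large vocabularies.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Fuse every adjacent occurrence of `pair` into a single symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_merges(corpus, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair in the corpus."""
    words = [list(word) for word in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in words:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = [apply_merge(symbols, best) for symbols in words]
    return merges


def tokenize(text, merges):
    """Split text into characters, then replay the learned merges in order."""
    symbols = list(text)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# Hypothetical training data: one very common English word.
merges = learn_merges(["the"] * 20, num_merges=2)

word_tokens = tokenize("the", merges)       # frequent word collapses
json_tokens = tokenize('{"a":1}', merges)   # unseen symbols stay split
print(len(word_tokens), len(json_tokens))
```

The frequent word ends up as one token, while the JSON fragment, whose symbol pairs never appeared in the training data, stays at one token per character.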

The practical consequence is that a mixed prompt containing English instructions plus a JSON payload plus a code snippet will cost substantially more than the instruction text alone would suggest. Getting the ratio right for each part of your content prevents systematic underestimates.

Reference ratios by content type

Content type	Approx. tokens per word	Notes
English prose	1.3	Most words are single tokens; common words especially
Markdown	1.4–1.5	Heading markers, asterisks, and backticks add tokens
Python	1.6–2.0	Colons, indentation, underscores, operators
JavaScript / TypeScript	1.8–2.4	Brackets, braces, semicolons, camelCase splits
Java / C#	2.0–2.8	Verbose; type annotations, generics, brackets
Formatted JSON	1.8–2.5	Quotes, braces, colons on every key-value pair
Minified JSON	2.5–4.0	No whitespace; dense symbol sequences
HTML	2.0–3.5	Tags, attributes, angle brackets all split
URLs	3.0–5.0	Slashes, hyphens, dots each tokenize separately
CSV	1.5–2.0	Commas and quotes add overhead

These are approximate ranges based on typical BPE tokenizer behavior. Your specific content will vary — use the custom sample input to measure your own.

How the custom measurement works

The reference table is based on the typical behaviour of modern BPE tokenizers (GPT, Claude). For a custom sample, the tool counts characters and words and estimates tokens at the standard ~4 characters per token ratio:

est_tokens      ≈ characters / 4
tokens_per_word = est_tokens / word_count

Plain English lands near 1.3 tokens/word; minified JSON, deeply nested data and dense code can climb well above 2–3 tokens/word, which directly inflates your input cost.
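The two formulas above can be sketched in a few lines of Python. The function name and the choice to count words as whitespace-separated runs are assumptions; the tool itself may count words differently.

```python
def estimate_density(sample, chars_per_token=4.0):
    """Estimate tokens and tokens-per-word using the ~4 chars/token heuristic."""
    characters = len(sample)
    # Assumption: a "word" is any run of non-whitespace characters.
    word_count = len(sample.split())
    est_tokens = characters / chars_per_token
    tokens_per_word = est_tokens / word_count if word_count else 0.0
    return {
        "characters": characters,
        "words": word_count,
        "est_tokens": est_tokens,
        "tokens_per_word": tokens_per_word,
    }


prose = estimate_density("The quick brown fox jumps over the lazy dog.")
minified = estimate_density('{"user":{"id":42,"tags":["a","b"]},"active":true}')

print(prose)
print(minified)
```

Because the minified JSON sample contains no whitespace, it counts as a single word, which is why its tokens-per-word figure comes out far higher than the prose sample's.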

Practical optimization strategies

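Pretty-printed JSON is expensive; minified JSON is cheaper. Removing insignificant whitespace from large JSON payloads before sending them can reduce token count by 15–30%. Minified JSON shows a higher tokens-per-word ratio in the table only because it contains fewer whitespace-separated words, not because it uses more tokens overall. For very high-volume pipelines this is a worthwhile one-time optimization.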
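A minimal sketch of the whitespace removal, using Python's standard `json` module. The payload is made up for illustration, and the character saving it prints is only a proxy: the real token saving depends on the tokenizer and on the shape of your data.

```python
import json

# Hypothetical payload, used only to compare the two serializations.
payload = {
    "user": {"id": 42, "name": "Ada", "roles": ["admin", "editor"]},
    "active": True,
}

# Indented output, as typically produced for human readers.
pretty = json.dumps(payload, indent=2)

# Compact separators drop the spaces after "," and ":" as well as all newlines.
compact = json.dumps(payload, separators=(",", ":"))

saving = 1 - len(compact) / len(pretty)
print(f"pretty: {len(pretty)} chars, compact: {len(compact)} chars")
print(f"character saving: {saving:.0%}")
```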

HTML sent as-is is very token-dense. If you are extracting information from web pages, converting HTML to Markdown or plain text before passing it to the model saves a substantial number of tokens. Markdown preserves most semantically useful structure at a much lower token cost.
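One way to do the plain-text conversion is with the standard library's `html.parser`. The sketch below is an assumption about how you might approach it rather than a recommended pipeline: it keeps visible text only and discards all structure, so a dedicated HTML-to-Markdown converter would be the better fit when headings and lists matter.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical page fragment.
html = (
    '<div class="card"><h1>Pricing</h1>'
    "<p>Plans start at <b>$5</b> per month.</p>"
    "<script>track()</script></div>"
)
text = html_to_text(html)
print(len(html), "chars of HTML ->", len(text), "chars of text")
print(text)
```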

URLs are surprisingly costly. A deeply nested URL path or query string can cost 4–6 tokens for what looks like a single “word.” If you are injecting many URLs into prompts, consider shortening or omitting them when the model does not need them.
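As one possible way to shorten URLs, the sketch below drops the query string and fragment with `urllib.parse`. It assumes the model only needs the page location, so check that tracking parameters and anchors really are irrelevant to your task before applying it. The helper name and example URL are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return the URL with its query string and fragment removed."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


long_url = (
    "https://example.com/docs/v2/guide"
    "?utm_source=newsletter&utm_campaign=spring#install"
)
short_url = strip_query(long_url)
print(short_url)
```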

For budgeting mixed content, estimate each component separately using its own ratio, then sum. Applying a single average ratio to the whole prompt underestimates structured sections and overestimates prose.
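A per-component budget can be sketched as follows. The ratios are taken as midpoints of the reference ranges in the table, which is an assumption, and the word counts are invented for illustration; substitute figures measured from your own samples.

```python
# Assumed ratios: midpoints of the reference ranges (tokens per word).
RATIOS = {
    "prose": 1.3,
    "python": 1.8,           # midpoint of 1.6-2.0
    "formatted_json": 2.15,  # midpoint of 1.8-2.5
}


def estimate_prompt_tokens(word_counts):
    """Sum words * ratio for each content type in the prompt."""
    return sum(words * RATIOS[kind] for kind, words in word_counts.items())


# Hypothetical prompt: instructions, a JSON payload and a code snippet.
word_counts = {"prose": 400, "formatted_json": 250, "python": 150}

per_component = estimate_prompt_tokens(word_counts)

# Naive estimate that treats every word as English prose.
prose_only = sum(word_counts.values()) * RATIOS["prose"]

print(f"per-component estimate: {per_component:.0f} tokens")
print(f"prose-only estimate:    {prose_only:.0f} tokens")
```

With these made-up numbers the prose-only figure comes out noticeably lower than the per-component sum, which is the systematic underestimate the text warns about.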