Is this the exact GPT-4o tokenizer?

No. It is a close approximation of cl100k_base's byte-pair behaviour, not the exact vocabulary, and cl100k_base is the GPT-4 and GPT-3.5 Turbo encoding; GPT-4o itself uses the newer o200k_base. The visualizer reproduces the visible rules (leading-space attachment, splitting on case and digits, and subword breaks), so counts are usually within a few percent of the real tokenizer for English.

Why does a leading space belong to the next word?

GPT tokenizers attach a leading space to the token that follows it, so ' token' is one token, not a space plus a word. The visualizer shows this by including the space inside the colored chip.

Why do numbers split into several tokens?

cl100k_base splits digit runs into chunks of at most three digits, so 2026 becomes at least two tokens. The visualizer mirrors this so you can see why numeric-heavy text costs more tokens than you would expect.

Does it handle other languages?

It handles them, but non-English and CJK text often uses many more tokens per character because the byte-pair vocabulary is English-biased. The colored chips make that overhead obvious.

Is my text uploaded anywhere?

No. Tokenization runs entirely in your browser. Nothing you paste leaves the page.

What is the Token Visualizer?

It renders each token as a distinct color-coded span so you can see word boundaries, subword splits, and whitespace handling. It approximates the cl100k_base tokenizer (the GPT-4 encoding) client-side and runs free in your browser on Gera Tools, with nothing uploaded.

Token Visualizer — Gera Tools

Name: Token Visualizer
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Token visualizer

Models do not read characters or words — they read tokens. Seeing exactly where your text breaks into tokens explains why billing, context limits, and even model behaviour sometimes surprise you. This visualizer colors each token as its own chip so you can spot subword splits, whitespace handling, and the hidden cost of numbers and rare words.

How it works

The tool applies a byte-pair-style splitter modeled on cl100k_base, the encoding used by GPT-4 (GPT-4o uses the newer o200k_base). It first segments on the same boundaries the real regex uses (contractions, runs of letters, runs of digits, punctuation, and leading spaces), then merges short common fragments the way BPE would. Crucially, a leading space stays attached to the word after it, which is why “ token” counts as one token rather than two. Long digit runs and uncommon words break into multiple subword tokens, which the colored chips make immediately visible.
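The segmentation step described above can be sketched in a few lines of Python. This is a simplified stand-in, not the tool's source and not the exact cl100k_base pattern: Python's built-in re module lacks Unicode property classes, so letters and digits are approximated, and the BPE merge step is left out. The names PRETOKEN_RE and pretokenize are hypothetical.

```python
import re

# Simplified stand-in for the cl100k_base pre-tokenization pattern.
# Assumption: letters are approximated with [^\W\d_] and digits with \d,
# because the standard-library `re` module has no \p{L} / \p{N}.
PRETOKEN_RE = re.compile(
    r"'(?:s|t|re|ve|m|ll|d)"            # English contractions: 's, 't, 're, ...
    r"|(?:_|[^\w\r\n])?[^\W\d_]+"       # letters, with one optional leading char (usually a space)
    r"|\d{1,3}"                         # digit runs, at most three digits per piece
    r"| ?(?:_|[^\s\w])+[\r\n]*"         # punctuation, optionally led by a space
    r"|\s*[\r\n]+"                      # line breaks
    r"|\s+(?!\S)"                       # trailing whitespace
    r"|\s+",                            # any other whitespace
    re.IGNORECASE,
)


def pretokenize(text: str) -> list[str]:
    """Split text into the pieces that BPE merging would then operate on."""
    return PRETOKEN_RE.findall(text)


print(pretokenize("Hello world 2026!"))
# ['Hello', ' world', ' ', '202', '6', '!']
```

Note how the space stays attached to “ world” but not to the digits, and how the four-digit run is cut after three digits. A real tokenizer would then split or keep each piece according to its vocabulary, so the piece count is a floor on the token count rather than the count itself.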

What the chips reveal

Common English words are almost always a single token. “the”, “is”, “you”, “have” each occupy one chip. That is the core efficiency of BPE tokenization for everyday prose.

Uncommon words and technical jargon often split. “Tokenization” itself may become two chips: “Token” and “ization”. A medical term or a product-specific proper noun can split into three or four fragments, costing more than you might expect.

Numbers split more than you might expect. cl100k_base's pre-tokenization pattern caps a digit group at three digits, so even a four-digit year like “2024” becomes two tokens (“202” and “4”), however often it appears in training data. A 10-digit ID breaks into four chunks, and a long price like “1,234,567.89” also spends a separate token on each comma and on the decimal point. If you embed transaction IDs or invoice numbers in prompts at scale, the token cost adds up.
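To put a rough number on that cost, the sketch below counts the pieces a numeric string breaks into and multiplies by the number of calls. It assumes the three-digit cap from cl100k_base's pre-tokenization pattern and treats each piece as one token, which is a lower bound rather than an exact count. The function names and the 10,000-call figure are hypothetical.

```python
import re


def numeric_pieces(value: str) -> list[str]:
    """Split a numeric string into groups of up to three digits,
    keeping separator runs (commas, dots, dashes) as their own pieces."""
    return re.findall(r"\d{1,3}|[^\d\s]+", value)


def pieces_at_scale(value: str, calls: int) -> int:
    """Lower-bound token cost of sending `value` verbatim in `calls` prompts."""
    return len(numeric_pieces(value)) * calls


print(numeric_pieces("1,234,567.89"))
# ['1', ',', '234', ',', '567', '.', '89']
print(pieces_at_scale("4155550123", 10_000))
# 40000
```

A 10-digit ID costs at least four tokens per prompt under these assumptions, so replacing it with a short alias or dropping it where the model does not need it can be worth checking.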

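Code is generally efficient for keywords but expensive for long variable names, hashes, and base64 strings. A 64-character hex SHA-256 hash typically costs 30 or more tokens on its own, because every switch between letters and digits starts a new token and digit runs are capped at three digits each.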

Non-Latin scripts are the most dramatic case. A common Chinese or Japanese character may be a single token, but many characters take two or three, because each one is a multi-byte UTF-8 sequence that the English-biased vocabulary often fails to merge. The same meaning therefore tends to cost more tokens than it would in English prose.
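One way to see where that overhead comes from: byte-level BPE vocabularies such as cl100k_base start from UTF-8 bytes, so a character's byte length is its worst-case token count when no learned merge covers it. The helper below is an illustrative sketch with a hypothetical name; it reports bytes per character, not actual token counts.

```python
def utf8_bytes_per_char(text: str) -> list[tuple[str, int]]:
    """Pair each character with its UTF-8 byte length, which is the
    worst-case token count for that character under byte-level BPE."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


print(utf8_bytes_per_char("aé中"))
# [('a', 1), ('é', 2), ('中', 3)]
```

ASCII letters start at one byte, accented Latin letters at two, and CJK characters at three, so scripts outside ASCII need more merges in the vocabulary just to reach one token per character.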

Reading the results for prompt optimization

Paste your system prompt and look for chips that surprise you. Clusters of single-character chips inside what should be a common phrase often mean a typo, an unusual encoding, or a Unicode lookalike character that broke the vocabulary match. Long chains of short chips inside a number or an identifier are a signal to consider whether you need that value verbatim.
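When a cluster of unexpected chips suggests a lookalike or invisible character, a quick scan can confirm it. The sketch below is a minimal check that assumes the prompt is meant to be plain ASCII English; it will also flag legitimate non-English text, so treat the output as a list of places to look rather than a list of errors. The function name is hypothetical.

```python
import unicodedata


def find_non_ascii(text: str) -> list[tuple[int, str, str]]:
    """Return (index, code point, Unicode name) for every non-ASCII character,
    which covers non-breaking spaces, zero-width characters, and lookalikes."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


# A Cyrillic "а" hiding in "pay", followed by a non-breaking space.
print(find_non_ascii("p\u0430y\u00a0now"))
# [(1, 'U+0430', 'CYRILLIC SMALL LETTER A'), (3, 'U+00A0', 'NO-BREAK SPACE')]
```

Replacing the flagged characters with their plain equivalents and pasting the prompt again should show whether the extra chips disappear.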

Tips and notes

If a chip splits a common word into pieces, that word is rare in the vocabulary — rephrasing with more common words can shave tokens off long prompts.
Numbers, IDs, and hashes are token-hungry; consider whether you really need them verbatim in a prompt you send thousands of times.
Watch out for invisible whitespace characters (non-breaking spaces, zero-width joiners) that create unexpected token boundaries and inflate counts silently.
For an exact production count, call the provider’s tokenizer or count_tokens endpoint — use this visualizer for intuition and quick estimates.