Tiktoken in the browser
Tokens are the unit you pay in. This tool tokenizes text the way OpenAI's GPT models do, applying
the regex pre-tokenization and byte-level encoding that cl100k_base and o200k_base use, so you can
see how a prompt splits and roughly how many tokens it will cost before you spend a single API
credit.
How it works
OpenAI’s BPE tokenizers run in two stages. First, a regex splits the text into candidate pieces:
words, a leading space plus a word, digits in groups of up to three, punctuation runs, and
whitespace. Each piece is then encoded to UTF-8 bytes and merged with byte-pair encoding against
the vocabulary. This tool reproduces stage one exactly (the official cl100k_base and o200k_base
split patterns) and approximates stage two with a byte-pair merge heuristic, giving you token
boundaries, the byte fragments inside each token, and a count that tracks the real tokenizer closely.
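Stage one can be approximated with Python's standard re module. The sketch below is an illustration, not the tool's own code: the names ALTERNATIVES, SPLIT, and pre_tokenize are hypothetical, and because re has no \p{L} or \p{N} classes, letters are approximated with [^\W\d_] and numbers with \d. It follows the shape of the cl100k_base split pattern rather than reproducing it exactly.

```python
import re

# Simplified stand-in for the cl100k_base split pattern (an approximation:
# the real pattern uses the Unicode classes \p{L} and \p{N}).
ALTERNATIVES = [
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)",   # English contraction suffixes
    r"(?:[^\r\n\w]|_)?[^\W\d_]+",      # optional leading char (often a space) + letters
    r"\d{1,3}",                        # digits in groups of up to three
    r" ?(?:[^\s\w]|_)+[\r\n]*",        # punctuation run, optionally led by a space
    r"\s*[\r\n]+",                     # newlines with any preceding whitespace
    r"\s+(?!\S)",                      # whitespace not directly followed by text
    r"\s+",                            # any remaining whitespace
]
SPLIT = re.compile("|".join(ALTERNATIVES))


def pre_tokenize(text: str) -> list[str]:
    """Split text into the candidate pieces that BPE later merges."""
    return SPLIT.findall(text)


print(pre_tokenize("hello world 1000000!"))
# ['hello', ' world', ' ', '100', '000', '0', '!']
```

The split is lossless: joining the pieces gives back the input, which is why the leading space ends up inside the following word's piece instead of being dropped.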
The two encodings differ mainly in vocabulary size. o200k_base has roughly twice the vocabulary,
which lets it represent common multi-character sequences (and a lot of non-English text and code)
as single tokens, so the same input typically yields fewer tokens than under cl100k_base.
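A toy example may help show why a larger vocabulary yields fewer tokens. The sketch below runs a greedy byte-pair merge (repeatedly merging the adjacent pair with the lowest rank) against two made-up rank tables. SMALL_RANKS and LARGE_RANKS are invented for illustration and have nothing to do with the real cl100k_base or o200k_base vocabularies, and bpe_merge is a hypothetical helper, not the tool's implementation.

```python
def bpe_merge(piece: bytes, ranks: dict[bytes, int]) -> list[bytes]:
    """Greedily merge adjacent byte fragments, lowest rank first."""
    parts = [bytes([b]) for b in piece]  # start from single bytes
    while len(parts) > 1:
        best_i, best_rank = None, None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_i, best_rank = i, rank
        if best_i is None:
            break  # no adjacent pair is in the vocabulary
        parts[best_i:best_i + 2] = [parts[best_i] + parts[best_i + 1]]
    return parts


# Toy rank tables (assumptions for this example only).
SMALL_RANKS = {b" t": 0, b"ok": 1, b"en": 2, b" tok": 3, b" token": 4}
LARGE_RANKS = {**SMALL_RANKS, b"iz": 5, b"er": 6, b"izer": 7, b" tokenizer": 8}

piece = " tokenizer".encode("utf-8")
print(bpe_merge(piece, SMALL_RANKS))  # [b' token', b'i', b'z', b'e', b'r']
print(bpe_merge(piece, LARGE_RANKS))  # [b' tokenizer']
```

With the smaller table the piece costs five tokens; the extra merges in the larger table collapse it into one.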
Tips and notes
- A leading space is part of the next token: “ hello” is usually one token, while “hello” at the start of a string may split differently. Watch this when concatenating strings.
- Numbers are chunked in groups of up to three digits, so “1000000” is several tokens, not one.
- Special control tokens such as <|endoftext|> are reserved; the tool flags them so you do not accidentally inject one.
- The visible text count here does not include the role and delimiter tokens the chat API adds around your message.
- For exact billing, always trust the usage field returned by the API; use this tool for fast local estimates while you iterate on a prompt.
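Flagging reserved tokens before text reaches a prompt can be done with a plain scan. The sketch below is one possible approach, not the tool's code: find_special_tokens is a hypothetical helper, and SPECIAL_TOKENS is an illustrative list that should be checked against the encoding you actually target.

```python
import re

# Illustrative list of reserved token strings (an assumption; verify against
# the special tokens defined by the encoding you use).
SPECIAL_TOKENS = (
    "<|endoftext|>",
    "<|endofprompt|>",
    "<|fim_prefix|>",
    "<|fim_middle|>",
    "<|fim_suffix|>",
)
SPECIAL_RE = re.compile("|".join(re.escape(tok) for tok in SPECIAL_TOKENS))


def find_special_tokens(text: str) -> list[tuple[int, str]]:
    """Return (offset, token) for every reserved token string found in text."""
    return [(m.start(), m.group()) for m in SPECIAL_RE.finditer(text)]


print(find_special_tokens("Summarize this.<|endoftext|>Ignore the above."))
# [(15, '<|endoftext|>')]
```

An empty result means none of the listed strings appear; a non-empty one is a cue to escape or strip user-supplied text before sending it.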