What a token actually is
A large language model cannot read letters or words the way you do. Before any text reaches the model, a tokenizer chops it into small pieces called tokens, and each token is mapped to an integer ID. The model only ever sees those integers. When it replies, it predicts the next token ID, and the tokenizer turns the sequence back into readable text.
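The round trip can be sketched with a toy tokenizer. The vocabulary below is hypothetical, and the greedy longest-match rule is a simplification rather than how any production tokenizer works; the point is only that text becomes integers and integers become text again.

```python
# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
VOCAB = ["the", " cat", " sat", "token", "ization", " "] + list("abcdefghijklmnopqrstuvwxyz")
TOKEN_TO_ID = {token: index for index, token in enumerate(VOCAB)}


def encode(text):
    """Split text into token IDs using greedy longest-match (not real BPE)."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest possible piece first, then shrink.
        for length in range(len(text) - pos, 0, -1):
            piece = text[pos:pos + length]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                pos += length
                break
        else:
            raise ValueError(f"no token covers {text[pos]!r}")
    return ids


def decode(ids):
    """Map token IDs back to their strings and join them."""
    return "".join(VOCAB[i] for i in ids)


ids = encode("the cat sat")
print(ids)          # the integers the model would see
print(decode(ids))  # the text recovered from those integers
```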
A token is not a letter and not always a word. In English it is typically a common word (“the”, “cat”) or a fragment of a longer or rarer word (“token” might be one piece while “tokenization” splits into “token” + “ization”). On average an English token is about four characters, which works out to roughly three-quarters of a word. So 1,000 tokens is around 750 English words.
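As a rough sketch of that arithmetic, the helpers below convert between tokens and English words. The function names are made up for illustration, and the 0.75 ratio is only the average quoted above, so treat the results as estimates.

```python
WORDS_PER_TOKEN = 0.75  # rough English average; varies by text and tokenizer


def tokens_to_words(tokens):
    """Estimate how many English words a token count represents."""
    return round(tokens * WORDS_PER_TOKEN)


def words_to_tokens(words):
    """Estimate how many tokens a given number of English words will use."""
    return round(words / WORDS_PER_TOKEN)


print(tokens_to_words(1000))
print(words_to_tokens(1500))
```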
How byte-pair encoding builds the vocabulary
Most modern LLMs use a scheme called byte-pair encoding (BPE) or a close relative. The idea is simple and clever. Start with individual characters (or raw bytes), then scan a huge text corpus and repeatedly merge the most frequent adjacent pair into a new single token. Do this thousands of times and you end up with a vocabulary where the most common words are single tokens, common fragments are single tokens, and rare strings still break down into pieces the model can handle.
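The merge loop can be sketched in a few lines. This is a minimal character-level illustration under simplifying assumptions (words are split on whitespace, ties go to the pair seen first, and the function name train_bpe is invented here); production tokenizers add byte-level handling and other refinements.

```python
from collections import Counter


def train_bpe(corpus, num_merges):
    """Learn up to num_merges merge rules from a whitespace-split corpus."""
    # Each distinct word starts as a tuple of single characters, with its count.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        merged_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    return merges, words


merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(words)
```

After two merges on this tiny corpus, the frequent word “low” has become a single symbol, while the rarer “lower” and “lowest” are still split into pieces.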
This is why frequency matters so much. “Hello” is one token because it appears everywhere; an unusual chemical name or a random string of characters might cost a dozen tokens because none of its fragments are common. Byte-level BPE, the variant most current tokenizers use, guarantees there is always some way to encode any text (worst case, byte by byte), so the model never hits an unknown word.
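The worst case is easiest to see in a byte-level tokenizer, which is an assumption here rather than something every tokenizer does. If all 256 byte values are in the vocabulary, any string can be encoded and recovered with no merges at all. The function names below are illustrative only.

```python
def byte_fallback_encode(text):
    """Encode text as one ID per UTF-8 byte (every ID is in 0..255)."""
    return list(text.encode("utf-8"))


def byte_fallback_decode(ids):
    """Rebuild the original text from its byte IDs."""
    return bytes(ids).decode("utf-8")


sample = "héllo 🙂"
ids = byte_fallback_encode(sample)
print(len(sample), "characters ->", len(ids), "byte-level tokens")
print(byte_fallback_decode(ids) == sample)
```

Nothing is ever unknown, but the price is length: without merges, every byte costs a token.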
Why whitespace, punctuation, and language matter
Tokenizers are sensitive to details people overlook. Leading spaces are usually bundled into the following token, so “ world” (with a space) and “world” are different tokens. Punctuation, line breaks, and even emoji each consume tokens. This is why reformatting a prompt (adding bullet points, extra newlines, or indentation) can change your token count and therefore your cost.
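The space-bundling behaviour can be illustrated with a simplified pre-tokenization step. The regular expression below is a hypothetical stand-in, much cruder than the patterns real tokenizers use, but it shows how a leading space travels with the word that follows it and how punctuation and newlines become pieces of their own.

```python
import re

# Simplified, hypothetical split rule: an optional single leading space plus a
# word, or one punctuation character, or one whitespace character.
PIECE_PATTERN = re.compile(r" ?\w+|[^\w\s]|\s")


def pre_tokenize(text):
    """Split text into pieces before any merge rules are applied."""
    return PIECE_PATTERN.findall(text)


print(pre_tokenize("Hello world!"))
print(pre_tokenize("a\n\nb"))
print(pre_tokenize("Hello  world"))  # two spaces: the extra one is its own piece
```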
Language is the biggest hidden factor. Most tokenizer vocabularies are learned from corpora dominated by English, so English is efficient. The same sentence in Japanese, Arabic, Hindi, or Thai can take two to four times as many tokens because those scripts are under-represented in the vocabulary and often fall back to near-character-level splitting. Multilingual applications should budget for this explicitly.
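One way to see why fallback splitting is costly for other scripts is to count UTF-8 bytes per character. This measures bytes, not tokens, so it only approximates the worst case for a byte-level tokenizer that has learned no merges for the script; the helper name is made up for this sketch.

```python
def utf8_bytes_per_char(text):
    """Average number of UTF-8 bytes per character (code point) in text."""
    return len(text.encode("utf-8")) / len(text)


samples = {"English": "hello", "Japanese": "こんにちは"}
for language, word in samples.items():
    byte_count = len(word.encode("utf-8"))
    print(f"{language}: {len(word)} characters, {byte_count} bytes")
```

Both words have five characters, but the Japanese one occupies three times as many bytes, which is the raw material a byte-level tokenizer starts from.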
Why token counts drive cost and limits
Two of the most practical numbers in working with LLMs are both measured in tokens. Pricing is per token (usually quoted per million), and it covers both the tokens you send (input) and the tokens the model generates (output), typically at different rates. And the context window, the maximum amount of text the model can consider at once, is a token limit, not a word or character limit. A “128K context” means roughly 128,000 tokens, prompt plus response combined.
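Both numbers reduce to simple arithmetic. The sketch below assumes separate per-million prices for input and output; the prices in the example call are placeholders, not any provider's real rates, and the function names are illustrative.

```python
def request_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost of one request, given prices quoted per million tokens."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


def fits_context(prompt_tokens, max_output_tokens, context_window=128_000):
    """True if the prompt plus the reserved response fits in the window."""
    return prompt_tokens + max_output_tokens <= context_window


# Placeholder prices: 3.00 per million input tokens, 15.00 per million output tokens.
print(request_cost(10_000, 2_000, input_price_per_m=3.0, output_price_per_m=15.0))
print(fits_context(prompt_tokens=120_000, max_output_tokens=4_000))
print(fits_context(prompt_tokens=126_000, max_output_tokens=4_000))
```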
Because of this, controlling token usage is one of the most direct levers on both cost and capability. Trimming boilerplate, removing redundant instructions, and summarising long inputs all reduce spend and free up room in the context window.
Estimating tokens before you send
You do not have to guess. Most providers offer an exact way to count tokens: OpenAI publishes tiktoken, a tokenizer library you can run locally, while Anthropic and Google expose token-counting endpoints in their APIs. Online tokenizer playgrounds let you paste text and watch it split into coloured tokens, which is the fastest way to build intuition about why a given prompt is expensive. For a quick back-of-the-envelope estimate, divide your English character count by four, which is close enough for budgeting before you reach for the exact tools.
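The back-of-the-envelope rule is a one-liner. The helper below is a rough sketch for English text only; the name estimate_tokens and the rounding-up choice are assumptions made here, and the result will undercount for code, unusual strings, and most non-English text.

```python
import math


def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate for English text: characters divided by four."""
    return math.ceil(len(text) / chars_per_token)


prompt = "Summarise the following report in three bullet points."
print(len(prompt), "characters, about", estimate_tokens(prompt), "tokens")
```

Rounding up keeps the estimate slightly conservative, which is usually the safer direction when budgeting.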