Definition
A tokenizer is the pre-processing component that converts raw text into the integer tokens a language model actually consumes. Models do not read characters or whole words directly; they operate on sequences of token IDs drawn from a fixed vocabulary. The tokenizer both encodes input text into these IDs and decodes the model’s output IDs back into readable text. It is a small but consequential piece of plumbing: it determines how many tokens a given input costs and how cleanly the model can represent rare words.
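The encode/decode round trip can be sketched with the simplest possible fixed vocabulary: the 256 byte values of UTF-8. This is only an illustration of the interface, not how any particular model's tokenizer is implemented; the function names `encode` and `decode` are chosen here for the example.

```python
def encode(text: str) -> list[int]:
    """Map text to token IDs; here each UTF-8 byte is its own token (IDs 0-255)."""
    return list(text.encode("utf-8"))


def decode(ids: list[int]) -> str:
    """Map token IDs back to text by reassembling the UTF-8 bytes."""
    return bytes(ids).decode("utf-8")


ids = encode("héllo")
print(ids)          # [104, 195, 169, 108, 108, 111]
print(decode(ids))  # héllo
```

Note that the five-character input costs six tokens, because "é" occupies two bytes in UTF-8. Real tokenizers layer a learned vocabulary on top of a base like this so that common strings cost far fewer tokens.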
Subword tokenization
Modern tokenizers work at the subword level, a middle ground between characters and whole words. Common words become a single token, while rare or novel words are broken into familiar pieces (for example, “tokenization” might split into “token” + “ization”). This keeps the vocabulary at a manageable size while still allowing any string to be represented, including typos, code, and words never seen in training. A pure word-level vocabulary would be enormous and brittle, since any unseen word would fall outside it; a pure character-level one would make sequences far too long.
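The splitting behaviour described above can be sketched with a greedy longest-match rule over a toy vocabulary. The vocabulary contents and the function name `split_subwords` are hypothetical, and greedy matching is only one possible segmentation strategy; real tokenizers apply the rules learned by their own training algorithm.

```python
import string

# Hypothetical toy vocabulary: a few multi-character pieces plus every
# lowercase letter as a fallback, so any lowercase word can be represented.
PIECES = {"token", "ization", "ize", "the", "un", "able"}
VOCAB = PIECES | set(string.ascii_lowercase)


def split_subwords(word: str) -> list[str]:
    """Split a word by repeatedly taking the longest prefix found in VOCAB."""
    pieces = []
    pos = 0
    while pos < len(word):
        # Try the longest candidate first and shrink until one is in VOCAB.
        for end in range(len(word), pos, -1):
            if word[pos:end] in VOCAB:
                pieces.append(word[pos:end])
                pos = end
                break
        else:
            raise ValueError(f"cannot represent character {word[pos]!r}")
    return pieces


print(split_subwords("the"))           # ['the']
print(split_subwords("tokenization"))  # ['token', 'ization']
print(split_subwords("tokenizer"))     # ['token', 'ize', 'r']
```

The common word stays whole, while the rarer ones fall back to smaller pieces and, in the last case, to a single character.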
BPE, WordPiece, and Unigram
Three algorithms dominate:
- Byte Pair Encoding (BPE): starts from individual characters (or raw bytes, in byte-level variants) and repeatedly merges the most frequent adjacent pair into a new token. Used by the GPT family.
- WordPiece: similar to BPE, but merges the pair that most increases the likelihood of the training data rather than the pair with the highest raw frequency. Used by BERT.
- Unigram: starts from a large candidate vocabulary and prunes tokens to maximise corpus likelihood, choosing the most probable segmentation at run time. Used via SentencePiece by models like T5 and many multilingual LLMs.
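As an illustration of the first of these, the BPE merge loop can be sketched in a few lines. This is a minimal teaching version under simplifying assumptions: the corpus is given as a word-to-frequency table, merges never cross word boundaries, and ties are broken by first occurrence. The names `merge_pair` and `train_bpe` and the sample word counts are invented for the example.

```python
from collections import Counter


def merge_pair(symbols: tuple[str, ...], pair: tuple[str, str]) -> tuple[str, ...]:
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(word_counts: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a word-frequency table."""
    # Each word starts as a sequence of single characters.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        corpus = {merge_pair(symbols, best): c for symbols, c in corpus.items()}
    return merges


counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
print(train_bpe(counts, 3))  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```

Each learned merge adds one token to the vocabulary, so the number of merges directly controls the final vocabulary size. WordPiece would keep the same loop but change how `best` is scored.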
Vocabulary-size trade-offs
Vocabulary size is a key design choice. A larger vocabulary means fewer tokens per sentence (shorter and therefore cheaper sequences) but a bigger embedding table and more parameters in the output layer, and rarer tokens get less training signal. A smaller vocabulary is leaner but chops text into more pieces, lengthening sequences and raising compute. Typical LLM vocabularies range from tens of thousands to a few hundred thousand tokens, often built on a byte-level base so that any input is representable.
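The parameter side of this trade-off is simple arithmetic: the embedding table holds one vector of the model's hidden width per token, and the output layer holds another matrix of the same shape unless the two are shared. The sketch below assumes that standard layout; the function name and the example sizes (a hidden width of 4096, vocabularies of 32,000 and 128,000) are illustrative and not taken from any specific model.

```python
def vocab_parameter_count(vocab_size: int, d_model: int, tied: bool = False) -> int:
    """Parameters spent on the vocabulary: input embeddings plus output projection.

    If `tied` is True, the output layer reuses the embedding matrix,
    so only one vocab_size x d_model matrix is counted.
    """
    embedding = vocab_size * d_model
    output = 0 if tied else vocab_size * d_model
    return embedding + output


for vocab_size in (32_000, 128_000):
    params = vocab_parameter_count(vocab_size, d_model=4096)
    print(f"{vocab_size:>7} tokens -> {params:,} parameters")
# 32000 tokens -> 262,144,000 parameters
# 128000 tokens -> 1,048,576,000 parameters
```

Quadrupling the vocabulary quadruples this cost, which is why the gain in shorter sequences has to be weighed against the extra parameters.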
Why tokenizers differ across models
Every model ships with its own tokenizer, fit on its own corpus with its own algorithm and vocabulary. As a result, the same sentence can cost a different number of tokens on different models, which is why API pricing and context-window limits are quoted in that model’s own tokens. It also explains quirks such as text in non-Latin scripts often consuming more tokens than comparable English text, and why counting tokens accurately requires using the exact tokenizer the target model uses.
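Two toy tokenizers are enough to show both effects. The sketch below compares a character-level count with a byte-level count (UTF-8 bytes before any merges are applied); both functions are hypothetical stand-ins, not any model's actual tokenizer. Byte-level tokenizers start from the byte sequence, so scripts that need more bytes per character start out longer and depend on learned merges to close the gap.

```python
def char_level_count(text: str) -> int:
    """Token count if every Unicode character were one token."""
    return len(text)


def byte_level_count(text: str) -> int:
    """Token count if every UTF-8 byte were one token (no merges applied)."""
    return len(text.encode("utf-8"))


samples = {"English": "hello", "Russian": "привет", "Japanese": "こんにちは"}
for name, text in samples.items():
    print(name, char_level_count(text), byte_level_count(text))
# English 5 5
# Russian 6 12
# Japanese 5 15
```

The same string yields different counts under the two schemes, and the gap widens for Cyrillic and Japanese text. For real budgeting, the count should come from the target model's own tokenizer rather than from an approximation like this.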