Question 1

What is a tokenizer?

Accepted Answer

A tokenizer is the pre-processing component that splits raw text into tokens and maps each token to an integer ID before the model sees it. The model never processes characters or words directly; it operates on these token IDs, and the same tokenizer converts the output IDs back to text.
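As a rough illustration of that round trip, here is a minimal sketch. The vocabulary, the `<unk>` fallback, and the `encode`/`decode` helper names are all hypothetical; production tokenizers use learned subword vocabularies rather than a hand-written word list.

```python
import re

# Hypothetical toy vocabulary, for illustration only.
VOCAB = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, " ": 4}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}

def encode(text):
    """Split text into words and single spaces, then map each piece to an ID."""
    pieces = re.findall(r" |[^ ]+", text)
    # Pieces missing from the vocabulary fall back to the <unk> ID.
    return [VOCAB.get(piece, VOCAB["<unk>"]) for piece in pieces]

def decode(ids):
    """Map IDs back to their token strings and concatenate them."""
    return "".join(ID_TO_TOKEN[token_id] for token_id in ids)

ids = encode("the cat sat")
print(ids)          # the integer IDs the model would actually receive
print(decode(ids))  # the same tokenizer turns IDs back into text
```

The model in this picture only ever sees the list of integers; the strings exist on the tokenizer's side of the boundary.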

Question 2

What is byte pair encoding (BPE)?

Accepted Answer

BPE is a subword tokenization algorithm that starts from individual characters (or bytes, in byte-level variants) and repeatedly merges the most frequent adjacent pair into a new token until the vocabulary reaches a target size. The result is a vocabulary of common subwords. It is used by GPT models and balances vocabulary size against the number of tokens per sentence.
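The merge loop described above can be sketched in a few lines. This is a simplified character-level version under stated assumptions: the word-frequency table is made up, the function names are hypothetical, and ties between equally frequent pairs are broken alphabetically just to keep the result deterministic.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges, starting from single characters."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # Most frequent pair wins; ties are broken alphabetically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

# Made-up corpus statistics: word -> number of occurrences.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = train_bpe(corpus, 3)
print(merges)
print(segmented)
```

Raising `num_merges` grows the vocabulary and shortens the token sequences, which is the trade-off the answer mentions.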

Question 3

How do WordPiece and Unigram differ from BPE?

Accepted Answer

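WordPiece (used by BERT) also builds its vocabulary by merging pairs, but it picks the pair that most increases the likelihood of the training data rather than the most frequent pair. Unigram (implemented in SentencePiece and used by T5) works in the opposite direction: it starts with a large candidate vocabulary and repeatedly prunes the tokens whose removal reduces the likelihood least. All three produce subword vocabularies, but by different criteria.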
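To make the difference in merge criteria concrete, the sketch below picks the next merge two ways on the same toy data. The WordPiece-style score used here, count(ab) / (count(a) × count(b)), is a commonly cited approximation of the likelihood criterion, not a confirmed reproduction of BERT's training code. The function names and corpus counts are hypothetical, and the `##` continuation prefix that WordPiece uses is omitted for brevity.

```python
from collections import Counter

def pair_and_symbol_counts(words):
    """Return frequency-weighted counts of adjacent pairs and of single symbols."""
    pairs, symbols = Counter(), Counter()
    for syms, freq in words.items():
        for s in syms:
            symbols[s] += freq
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += freq
    return pairs, symbols

def best_by_frequency(words):
    """BPE-style choice: the most frequent adjacent pair."""
    pairs, _ = pair_and_symbol_counts(words)
    return max(pairs, key=lambda p: (pairs[p], p))

def best_by_likelihood_gain(words):
    """WordPiece-style choice (approximation): pair count normalised by part counts."""
    pairs, symbols = pair_and_symbol_counts(words)

    def score(p):
        return pairs[p] / (symbols[p[0]] * symbols[p[1]])

    return max(pairs, key=lambda p: (score(p), p))

# Made-up corpus statistics: segmented word -> number of occurrences.
words = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}
print(best_by_frequency(words))
print(best_by_likelihood_gain(words))
```

On this data the frequency rule prefers a pair built from very common symbols, while the normalised score prefers a rarer pair whose parts almost always appear together. Unigram is not shown; it would start from a large candidate set and remove tokens instead of merging them.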

Question 4

Why do different models tokenize the same text differently?

Accepted Answer

Each model is trained with its own tokenizer, fit on its own corpus with its own algorithm and vocabulary size. As a result, the same sentence can map to a different number of tokens on GPT-4, Claude, and Llama, which is why token counts and costs are model-specific.
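A small sketch shows how two vocabularies split the same string into different numbers of tokens. Both vocabularies are invented for illustration and do not correspond to any real model. Greedy longest-match segmentation is used as a simplification; real tokenizers apply their own learned rules.

```python
def greedy_tokenize(text, vocab):
    """Longest-match-first segmentation; unknown characters become their own token."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry matched: emit a single character.
            tokens.append(text[i])
            i += 1
    return tokens

# Two hypothetical vocabularies standing in for two different models.
VOCAB_A = {"token", "ization", " ", "is", "fun"}
VOCAB_B = {"tok", "en", "iz", "ation", " ", "is", "fun"}

text = "tokenization is fun"
tokens_a = greedy_tokenize(text, VOCAB_A)
tokens_b = greedy_tokenize(text, VOCAB_B)
print(len(tokens_a), tokens_a)
print(len(tokens_b), tokens_b)
```

Because the counts differ, a limit or price expressed in tokens means a different amount of text under each vocabulary.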

Tokenizer (AI Glossary)

Definition

Subword tokenization

BPE, WordPiece, and Unigram

Vocabulary-size trade-offs

Why tokenizers differ across models