Question 1

What exactly is a token?

Accepted Answer

A token is the unit of text an LLM actually processes: usually a common word, a word fragment, or a single character. The model never sees letters or whole sentences directly; it sees a sequence of integer token IDs. For English, a token averages about four characters, or roughly three-quarters of a word, so 100 tokens is about 75 words.
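To make the "text becomes integer IDs" step concrete, here is a minimal sketch. The vocabulary, the IDs, and the greedy longest-match rule are all made up for illustration; real tokenizers learn tens of thousands of entries and typically apply learned merge rules rather than this simple lookup.

```python
# Hypothetical toy vocabulary: one word, two fragments, two single characters.
TOY_VOCAB = {"token": 0, "ization": 1, "izer": 2, "s": 3, "!": 4}
UNKNOWN_ID = -1  # stand-in ID for characters the toy vocabulary cannot cover


def encode(text, vocab=TOY_VOCAB):
    """Greedy longest-match encoding: an illustration, not a real tokenizer."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            # No vocabulary entry starts at this position; skip one character.
            ids.append(UNKNOWN_ID)
            i += 1
    return ids


print(encode("tokenization"))  # [0, 1]
print(encode("tokens!"))       # [0, 3, 4]
```

The model only ever receives lists like `[0, 1]`; the strings exist only on the tokenizer's side.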

Question 2

Why does the same text use more tokens in some languages?

Accepted Answer

Most tokenizer vocabularies were built from English-heavy text, so English words map to few tokens. Languages with different scripts or less representation in that data, like Japanese, Arabic, or Hindi, often split into many more tokens per word, sometimes one token per character or more. That can make the same meaning roughly two to four times more expensive to process.
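One way to see part of the reason without any tokenizer installed is to count UTF-8 bytes. Many tokenizers work at the byte level, where a string never needs more than one token per byte, so bytes per character gives a rough worst case when the vocabulary has few merged entries for a script. The sketch below uses only the standard library; the sample greetings are arbitrary, and the figures are byte counts, not real token counts.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point in `text`."""
    if not text:
        return 0.0
    return len(text.encode("utf-8")) / len(text)


samples = {
    "English": "hello",
    "Japanese": "こんにちは",
    "Arabic": "مرحبا",
    "Hindi": "नमस्ते",
}

for language, word in samples.items():
    n_bytes = len(word.encode("utf-8"))
    print(f"{language}: {len(word)} code points, {n_bytes} bytes, "
          f"{utf8_bytes_per_char(word):.1f} bytes per code point")
```

ASCII letters take one byte each, while Japanese kana take three, so a script with little vocabulary coverage starts from a much larger raw sequence.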

Question 3

Do spaces and punctuation count as tokens?

Accepted Answer

Yes. Whitespace is usually attached to the word that follows it, so " cat" with a leading space is a different token from "cat". Punctuation and newlines also consume tokens. This is why minor formatting changes can shift your token count, and why trailing spaces or repeated newlines quietly add cost.
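The leading-space behavior can be sketched with a regular expression that splits text into pieces before any vocabulary lookup. The pattern below is a simplified stand-in written for this answer, not any provider's actual pattern; production tokenizers use more elaborate rules.

```python
import re

# Simplified pre-tokenization pattern (an assumption for illustration):
#   " ?\w+"       a word, with an optional single leading space attached
#   " ?[^\w\s]+"  a run of punctuation, with an optional leading space
#   "\s+"         any remaining whitespace, such as trailing spaces or newlines
PIECE_PATTERN = re.compile(r" ?\w+| ?[^\w\s]+|\s+")


def split_pieces(text: str) -> list[str]:
    """Split text into the pieces that would then be mapped to tokens."""
    return PIECE_PATTERN.findall(text)


print(split_pieces("The cat sat."))  # ['The', ' cat', ' sat', '.']
print(split_pieces("cat"))           # ['cat']
print(split_pieces(" cat"))          # [' cat']
print(split_pieces("cat "))          # ['cat', ' ']
```

Note that `"cat"` and `" cat"` come out as different pieces, and that the trailing space in `"cat "` becomes an extra piece of its own.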

Question 4

How do I count tokens before sending a request?

Accepted Answer

Use the provider's official tooling: the tiktoken library for OpenAI models, or the token-counting endpoints that Anthropic and Google provide. Online tokenizer playgrounds let you paste text and see the split visually. As a rough rule of thumb for English, divide your character count by four.
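Both approaches can be sketched in a few lines. `estimate_tokens` applies the divide-by-four rule and needs nothing installed; `count_tokens_tiktoken` assumes the third-party tiktoken package is installed and that the `cl100k_base` encoding matches your model, which you should confirm for the model you actually call. The function names are made up for this example.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough English-only estimate: character count divided by four, rounded up."""
    return math.ceil(len(text) / chars_per_token)


def count_tokens_tiktoken(text: str, encoding_name: str = "cl100k_base") -> int:
    """Exact count for one OpenAI encoding; requires `pip install tiktoken`."""
    import tiktoken  # imported here so the estimate works without the package

    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))


prompt = "How do I count tokens before sending a request?"
print(estimate_tokens(prompt))
# print(count_tokens_tiktoken(prompt))  # uncomment if tiktoken is installed
```

Treat the estimate as a budgeting aid only; for billing or context-limit checks, rely on the exact count from the provider's own tokenizer.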

Tokenization in LLMs: How Text Becomes Numbers

What a token actually is

How byte-pair encoding builds the vocabulary

Why whitespace, punctuation, and language matter

Why token counts drive cost and limits

Estimating tokens before you send