Does this run a real tokenizer?

No. It generates test scaffolding that asserts your own countTokens function returns the expected values. You wire it to your real tokenizer (tiktoken, the Anthropic SDK, etc.). The tool only writes the test boilerplate.

Why test token counts at all?

Token counts drive your billing estimates, context-window truncation, and rate-limit budgeting. A library upgrade or model swap can silently change counts, breaking cost forecasts and truncating prompts — a test catches that in CI.

How do I get the expected counts to put in?

Run your current tokenizer once against each sample and record the count as the expected value. The test then pins that behaviour so future changes are caught.
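
Capturing that baseline might look like the following minimal sketch. The helper name recordBaseline and the whitespace-splitting countTokens are hypothetical stand-ins for illustration; in practice you would pass in your real tokenizer wrapper.

```javascript
// Stand-in tokenizer for illustration only: counts whitespace-separated words.
// Assumption: replace this with your real countTokens wrapper.
function countTokens(text) {
  const trimmed = text.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

// Hypothetical helper (not part of the generator): run the tokenizer once per
// sample and pair each text with the count it currently produces.
function recordBaseline(samples, tokenizer) {
  return samples.map((text) => [text, tokenizer(text)]);
}

const samples = ['Hello, world!', 'The quick brown fox', ''];
const baseline = recordBaseline(samples, countTokens);

// Print the pairs so they can be pasted in as expected values.
console.log(JSON.stringify(baseline, null, 2));
```

The printed pairs are the values to paste into the generator. They describe what your tokenizer does today, which is exactly what the test is meant to pin.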

Is my sample data sent anywhere?

No. The generator runs entirely in your browser. Your sample texts and counts are never uploaded, stored, or logged.

Can I test multiple models?

Yes. Generate one test block per tokenizer and rename the function reference, or parameterize the test with a model column — the generated structure is easy to extend.
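
As a rough, framework-free sketch of the model-column idea: the model names model-a and model-b, both stand-in tokenizers, and the findMismatches helper are hypothetical placeholders, not output of the generator. In Jest or Vitest the same table could be handed to test.each.

```javascript
// Hypothetical stand-in tokenizers keyed by model name. Assumption: in a real
// suite each entry would call the tokenizer for that model family.
const tokenizers = {
  'model-a': (text) => (text.trim() === '' ? 0 : text.trim().split(/\s+/).length),
  'model-b': (text) => Array.from(text).length, // one "token" per code point
};

// Each row carries a model column alongside the text and the pinned count.
const cases = [
  ['model-a', 'Hello, world!', 2],
  ['model-b', 'Hello, world!', 13],
  ['model-a', '', 0],
  ['model-b', '', 0],
];

// Return a description of every row whose actual count differs from the pin.
function findMismatches(rows, byModel) {
  return rows
    .map(([model, text, expected]) => ({ model, text, expected, actual: byModel[model](text) }))
    .filter((row) => row.actual !== row.expected);
}

const mismatches = findMismatches(cases, tokenizers);
console.log(mismatches.length === 0 ? 'all counts match' : mismatches);
```

Keeping the model name in each row means a failure report names the tokenizer that changed, not just the text.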

What is the Token Counting Unit Test Generator?

Paste sample texts and their expected token counts to instantly generate Jest, Vitest, or pytest test code that asserts your tokenizer returns the right counts, so tokenization regressions in your LLM integration layer fail CI, not production. It runs free in your browser on Gera Tools, with nothing uploaded.

Token Counting Unit Test Generator

Name: Token Counting Unit Test Generator
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Pin your token counts with generated unit tests

Token counts are load-bearing: they set your cost estimates, decide when you truncate context, and gate your rate limits. When a tokenizer library upgrade or a model swap silently shifts those counts, your budgets drift and prompts get cut. This tool generates ready-to-run unit tests that assert your countTokens function returns the exact values you expect.

How it works

You provide sample texts and the token count each one should produce. The generator emits one assertion per case in your chosen framework:

expect(countTokens("Hello, world")).toBe(3);

The test calls your tokenizer wrapper, so it pins real behaviour. Wire countTokens to tiktoken, the Anthropic SDK’s counting endpoint, or whatever your integration uses, and the suite fails loudly the moment counts change.
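
The emitting step can be pictured roughly as follows. This is an illustrative sketch under assumptions, not the tool's actual source; emitAssertions is a hypothetical name.

```javascript
// Hypothetical sketch of turning (text, expected) pairs into Jest-style
// assertion lines. Not the generator's real implementation.
function emitAssertions(cases, fnName = 'countTokens') {
  return cases.map(
    // JSON.stringify yields a double-quoted literal with quotes and
    // backslashes escaped, so the emitted line stays valid JavaScript.
    ({ text, expected }) => `expect(${fnName}(${JSON.stringify(text)})).toBe(${expected});`
  );
}

const lines = emitAssertions([
  { text: 'Hello, world', expected: 3 },
  { text: 'say "hi"', expected: 4 },
]);
console.log(lines.join('\n'));
```

Escaping the sample text matters because real prompts routinely contain quotes, newlines, and backslashes that would otherwise break the generated file.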

Why token-count tests catch real bugs

Token count regressions are surprisingly common and disproportionately damaging:

Library upgrades. The tiktoken library and the Anthropic SDK each have their own release schedule. A minor version bump that changes the encoding vocabulary can silently shift token counts by 1–5% across your corpus. Without a pinned test, this becomes a billing surprise or a truncation problem that surfaces only in production.

Model swaps. When you switch from GPT-4 to GPT-4o, or from one Claude version to another, the tokenizer may differ even if the model names look similar; GPT-4 uses the cl100k_base encoding, for example, while GPT-4o uses o200k_base. A test suite run against the new model tells you immediately whether your context budgets still hold.

Input pre-processing changes. If your pipeline trims whitespace, normalises Unicode, or strips HTML before sending to the model, a change in that preprocessing step changes token counts. A test pinned against the preprocessed text detects this before it reaches production.

Context truncation at runtime. If your code truncates prompts to fit a context window based on a token count, a regression means you are sending either too much (causing API errors) or too little (losing important context). Catching this with a unit test is far cheaper than discovering it through a production incident.
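
To make the truncation case concrete, here is a minimal sketch of budget-based truncation with the counter injected. truncateToBudget and both counters are hypothetical, and the counters are deliberately simplistic stand-ins rather than real tokenizers.

```javascript
// Hypothetical helper: drop trailing words until the text fits the budget
// according to whatever counter is injected.
function truncateToBudget(text, maxTokens, countTokens) {
  const words = text.split(' ');
  while (words.length > 0 && countTokens(words.join(' ')) > maxTokens) {
    words.pop();
  }
  return words.join(' ');
}

// Two stand-in counters (assumptions, not real tokenizers): the second one
// simulates an upgrade that counts every word as two tokens.
const before = (t) => (t === '' ? 0 : t.split(' ').length);
const after = (t) => before(t) * 2;

const prompt = 'one two three four five six';
console.log(truncateToBudget(prompt, 6, before)); // the whole prompt fits
console.log(truncateToBudget(prompt, 6, after)); // only the first three words fit
```

The same prompt and the same budget yield different payloads once the counter changes, which is why pinning the counts also protects the truncation logic built on top of them.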

Example generated test (Jest)

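// Replace this path with the module that exports your real tokenizer wrapper.
import { countTokens } from '../src/tokenizer';

describe('countTokens', () => {
  // Each row is [sample text, expected token count]. The counts shown are
  // illustrative; record the real ones from your own tokenizer.
  test.each([
    ['Hello, world!', 4],
    ['The quick brown fox jumps over the lazy dog', 10],
    ['', 0],
    ['emoji test: \u{1F600}', 6],
  ])('counts "%s" as %i tokens', (text, expected) => {
    expect(countTokens(text)).toBe(expected);
  });
});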

The generator produces this structure for your sample inputs, with your actual expected values filled in. You replace the countTokens import with your real implementation path and the suite is ready to run.

Tips for a useful test suite

Cover edge cases. Include empty strings, emoji, code blocks, and non-Latin text — these are where tokenizers diverge most between versions.
Record counts from your current tokenizer. Run it once, capture the numbers, and pin them. The test guards against change, not against a theoretical “correct” answer.
Run it in CI. A token-count test is cheap and catches dependency drift before it reaches your billing dashboard or truncates a user’s prompt.
Keep one test file per tokenizer family. If you use both GPT and Claude tokenizers, maintain a separate test file for each so that a failure immediately identifies which family changed.