How is the token count estimated?

The GitHub tree API returns each file's byte size. The tool converts bytes to tokens using a code-aware ratio of roughly 3.5 characters per token. This is an estimate, typically within about 10–15% of a true tokenizer count for clean source files, not a tokenizer-exact figure.

Does it read the file contents?

No. It only reads the file tree and per-file byte sizes, which keeps requests fast and stays well under GitHub's rate limits. Token estimates are derived from sizes, not contents.

Why does it only work on public repos?

It calls the unauthenticated GitHub REST API from your browser, which can only see public repositories. Private repos would require a personal token, which this tool deliberately does not ask for.

I hit a rate limit — what now?

Unauthenticated GitHub requests are capped at 60 per hour per IP address. Wait for the limit to reset (a few minutes is often enough, an hour in the worst case) and retry, or test a smaller repo. The tool surfaces the API error so you know it was a limit and not a bad URL.
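As an illustration of how a rate-limit error can be told apart from a bad URL, the sketch below inspects the status code and GitHub's `x-ratelimit-remaining` and `x-ratelimit-reset` response headers. The function and type names are hypothetical and are not taken from the tool's source; header names are assumed to be lower-cased before the call.

```typescript
// Result of inspecting one GitHub API response (hypothetical shape).
interface RateLimitInfo {
  limited: boolean;
  retryInSeconds: number; // 0 when the response was not rate-limited
}

// Classify a response as rate-limited and compute how long to wait.
// `headers` is assumed to hold lower-cased header names.
function checkRateLimit(
  status: number,
  headers: Record<string, string | undefined>,
  nowMs: number,
): RateLimitInfo {
  // x-ratelimit-remaining: requests left in the current window.
  const remaining = Number(headers["x-ratelimit-remaining"] ?? "1");
  // x-ratelimit-reset: when the window resets, in UTC epoch seconds.
  const resetEpochSec = Number(headers["x-ratelimit-reset"] ?? "0");

  // GitHub answers an exhausted quota with 403 or 429.
  const limited = (status === 403 || status === 429) && remaining === 0;
  if (!limited) {
    return { limited: false, retryInSeconds: 0 };
  }
  const wait = Math.max(0, Math.ceil(resetEpochSec - nowMs / 1000));
  return { limited: true, retryInSeconds: wait };
}
```

A 404 with requests still remaining is reported as not limited, which is the "bad URL" case.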

Why filter by extension?

Repos often contain lockfiles, images, fonts, and vendored bundles that you would never send to an LLM. Filtering to source extensions gives a realistic estimate of what you would actually feed the model.

What is the GitHub Repo Token Estimator?

Enter a public GitHub repo URL to fetch its file tree and estimate the total token count by file type. The result helps you decide whether the codebase fits in a long-context model or needs chunking before you pay for inference. The tool runs free in your browser on Gera Tools, with nothing uploaded.

Name: GitHub Repo Token Estimator
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

GitHub repo token estimator

Before you pipe a whole codebase into a long-context model, it helps to know roughly how many tokens it is — both to decide whether it fits and to estimate cost. The GitHub repo token estimator reads a public repository’s file tree (not its contents), sums file sizes by type, and converts bytes to an estimated token count so you can plan chunking and budget up front.

Why you need to know this before you call the API

Long-context models have large windows — 128K, 200K, or even 1M tokens — but cost scales with what you send. A moderately sized codebase at 50,000 lines of TypeScript or Python can easily exceed 500,000 input tokens once everything is concatenated. At a model price of $5 per million input tokens, a single pass costs $2.50. If your agentic tool re-reads the codebase 10 times across a task, that is $25 for one session. Estimating before you start prevents the surprise.
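The arithmetic in this paragraph can be written as a small helper. This is a sketch: the function name is hypothetical, and the price is whatever your provider charges per million input tokens.

```typescript
// Estimated input cost in USD for sending the same context `passes` times.
function estimateSessionCost(
  inputTokens: number,
  usdPerMillionTokens: number,
  passes: number,
): number {
  return (inputTokens / 1e6) * usdPerMillionTokens * passes;
}

// The figures from the text: 500,000 tokens at $5 per million.
const singlePass = estimateSessionCost(500000, 5, 1); // 2.5
const tenPasses = estimateSessionCost(500000, 5, 10); // 25
```

Output tokens and any prompt-caching discounts are ignored here, so treat the result as a rough estimate of input spend only.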

Separately, context-window limits matter even for large-window models: if the repo overflows the window, you need to decide what to exclude or how to chunk it before writing the code that calls the API.
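A fit-or-chunk check of this kind might look like the sketch below. The names are hypothetical, and the optional `reserveForOutput` parameter (tokens held back for the prompt and the model's reply) is an assumption added for illustration, not something the tool is known to expose.

```typescript
interface FitVerdict {
  fits: boolean;
  chunksNeeded: number; // lower bound on the number of chunks
  utilization: number;  // estimated tokens divided by usable budget
}

// Compare an estimated token count against a context window.
function fitVerdict(
  estimatedTokens: number,
  contextWindow: number,
  reserveForOutput: number = 0,
): FitVerdict {
  const budget = contextWindow - reserveForOutput;
  if (budget <= 0) {
    throw new Error("reserveForOutput must be smaller than contextWindow");
  }
  return {
    fits: estimatedTokens <= budget,
    chunksNeeded: Math.max(1, Math.ceil(estimatedTokens / budget)),
    utilization: estimatedTokens / budget,
  };
}
```

For a 500,000-token estimate and a 200K window this reports three chunks. Real chunking splits on file or directory boundaries, so the actual number can be higher.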

How it works

You paste a repo URL. The tool calls GitHub’s public REST API for the repository’s default branch and requests the recursive git tree, which lists every file with its byte size. It groups files by extension, optionally filters to the extensions you care about, and converts total bytes to tokens using a code-aware ratio of about 3.5 characters per token. It then compares the total against a context window you select and gives a fit-or-chunk verdict. Everything runs in your browser against the public API — no file contents are downloaded.
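The steps above can be sketched as follows. The two endpoints are GitHub's repository and git-tree REST endpoints. Every function name is hypothetical, and the HTTP calls themselves are left to whatever client you use, so this shows the shape of the flow rather than the tool's actual code.

```typescript
// Fields this sketch reads from GitHub's "Get a tree" response; others are omitted.
interface TreeEntry {
  path: string;
  type: string;  // "blob" for files, "tree" for directories
  size?: number; // bytes, present on blobs only
}

// Hypothetical helper: pull owner and repo name out of a pasted URL.
function parseRepoUrl(url: string): { owner: string; repo: string } {
  const m = /^https?:\/\/github\.com\/([^\/\s]+)\/([^\/\s#?]+)/.exec(url.trim());
  if (m === null) throw new Error("Not a GitHub repo URL: " + url);
  const owner = m[1] ?? "";
  const repo = (m[2] ?? "").replace(/\.git$/, "");
  return { owner, repo };
}

// Request 1: repository metadata; the JSON body includes `default_branch`.
function repoInfoUrl(owner: string, repo: string): string {
  return "https://api.github.com/repos/" + owner + "/" + repo;
}

// Request 2: the recursive tree for that branch, one entry per file or directory.
function treeUrl(owner: string, repo: string, branch: string): string {
  return repoInfoUrl(owner, repo) + "/git/trees/" + branch + "?recursive=1";
}

// Lower-cased extension of the file name, or "(none)" for names like Makefile.
function extensionOf(path: string): string {
  const name = path.slice(path.lastIndexOf("/") + 1);
  const dot = name.lastIndexOf(".");
  return dot > 0 ? name.slice(dot).toLowerCase() : "(none)";
}

// Sum blob sizes per extension from the `tree` array of the response.
function bytesByExtension(entries: TreeEntry[]): Record<string, number> {
  const totals: Record<string, number> = Object.create(null);
  for (const e of entries) {
    if (e.type !== "blob") continue; // skip directories and submodules
    const ext = extensionOf(e.path);
    totals[ext] = (totals[ext] ?? 0) + (e.size ?? 0);
  }
  return totals;
}
```

The tree response also carries a `truncated` flag that GitHub sets when a repository has too many entries for one response. A robust client would check it and warn that the totals are incomplete.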

How byte size converts to tokens

The conversion uses a ratio of approximately 3.5 characters per token for source code, derived from the typical behavior of BPE tokenizers on mixed code. Code therefore yields slightly more tokens per character than English prose (about 4 characters per token), because it contains many short symbols, operators, and identifiers that are split into separate tokens. The estimate tends to run within about 10–15% of a true tokenizer count for clean source files.

Binary files (images, compiled objects, fonts) are counted in the raw byte total but do not correspond to meaningful input tokens, so filtering them out by extension is important for accuracy.
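The conversion itself is a single division. The sketch below uses hypothetical names and assumes one byte per character, which holds for ASCII source and slightly overstates the character count for files with multi-byte UTF-8 text.

```typescript
// Default ratio from the text: about 3.5 characters per token for code.
const CHARS_PER_TOKEN = 3.5;

// Convert a byte total into an estimated token count, rounded up.
function estimateTokens(
  totalBytes: number,
  charsPerToken: number = CHARS_PER_TOKEN,
): number {
  return Math.ceil(totalBytes / charsPerToken);
}

const codeTokens = estimateTokens(3500);     // 1000 at the code ratio
const proseTokens = estimateTokens(4000, 4); // 1000 at the prose ratio
```

Passing a different ratio makes it easy to compare the code-aware figure with the prose figure for documentation-heavy repos.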

Filtering by extension

Without filtering, a typical monorepo includes:

package-lock.json or yarn.lock — enormous files that tokenize very densely but are rarely useful context
dist/ or build/ compiled output — duplicate of the source but larger
Images, fonts, and other binary assets — not useful text tokens at all
Test fixture files and seed data — often large JSON or CSV

Filtering to just your source extensions (for example .ts,.tsx,.py,.go) often cuts the estimated token count by 50–80%.
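A filter of this kind could be implemented roughly as below. The names and the comma-separated input format mirror the example in the text but are otherwise hypothetical.

```typescript
interface SizedFile {
  path: string;
  size: number; // bytes
}

// Turn ".ts,.tsx,.py,.go" into a normalized list of dotted, lower-cased extensions.
function parseExtensionList(input: string): string[] {
  return input
    .split(",")
    .map((s) => s.trim().toLowerCase())
    .filter((s) => s.length > 0)
    .map((s) => (s.charAt(0) === "." ? s : "." + s));
}

// Keep only files whose extension is in the list.
function keepExtensions(files: SizedFile[], extensions: string[]): SizedFile[] {
  return files.filter((f) => {
    const name = f.path.slice(f.path.lastIndexOf("/") + 1);
    const dot = name.lastIndexOf(".");
    const ext = dot > 0 ? name.slice(dot).toLowerCase() : "";
    return extensions.indexOf(ext) !== -1;
  });
}

function totalBytes(files: SizedFile[]): number {
  return files.reduce((sum, f) => sum + f.size, 0);
}

// Whole-number percentage by which filtering shrank the byte total.
function reductionPercent(before: number, after: number): number {
  return before === 0 ? 0 : Math.round(((before - after) * 100) / before);
}
```

In a toy repo where a lockfile and an image account for 24,000 of 30,000 bytes, keeping only `.ts` and `.tsx` gives an 80% reduction.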

Tips and notes

Filter aggressively. Excluding lockfiles, dist/, and binary assets often cuts the estimate by half or more.
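Estimates trend low for minified code, which packs more tokens into each byte than the 3.5 ratio assumes, and high for comment-heavy files, which tokenize more like prose. Treat the number as a planning figure, not a bill.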
Mind the rate limit. Unauthenticated GitHub API calls are limited per hour per IP; space out large repos or retry after a short wait.
Chunk by directory. If the repo overflows your window, the per-extension breakdown shows where the bulk lives so you can split sensibly.
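To illustrate the last tip, the sketch below groups file sizes by top-level directory and converts each group to tokens, largest first. It is one possible way to find chunk boundaries. The names are hypothetical, and the tool itself is only described as giving a per-extension breakdown.

```typescript
interface RepoFile {
  path: string;
  size: number; // bytes
}

interface DirTokens {
  dir: string;
  tokens: number;
}

// Estimated tokens per top-level directory, sorted from largest to smallest.
function tokensByTopLevelDir(
  files: RepoFile[],
  charsPerToken: number = 3.5,
): DirTokens[] {
  // Prototype-free object so directory names like "constructor" are safe keys.
  const bytes: Record<string, number> = Object.create(null);
  for (const f of files) {
    const slash = f.path.indexOf("/");
    const dir = slash === -1 ? "(root)" : f.path.slice(0, slash);
    bytes[dir] = (bytes[dir] ?? 0) + f.size;
  }
  return Object.keys(bytes)
    .map((dir) => ({ dir, tokens: Math.ceil((bytes[dir] ?? 0) / charsPerToken) }))
    .sort((a, b) => b.tokens - a.tokens);
}
```

Directories whose estimate already exceeds the window can be split again by running the same grouping on the next path segment.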