How is the embedding cost calculated?

The planner estimates tokens per document as characters divided by about four (a common English heuristic) and multiplies by your document count to get total tokens. It then divides that total by one million and multiplies by the model's per-million-token price. Embeddings are priced on input tokens only, so there is no output cost to add.
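The arithmetic above can be sketched in a few lines. This is an illustration only, not the planner's actual source: the function names are made up for this example, and the price of $0.02 per million tokens is a placeholder, not a quoted rate for any model.

```python
def estimate_tokens(doc_count: int, avg_chars: float, chars_per_token: float = 4.0) -> float:
    """Total tokens using the characters / ~4 heuristic described above."""
    return doc_count * avg_chars / chars_per_token


def estimate_cost(total_tokens: float, price_per_million: float) -> float:
    """Input-only cost: tokens in millions times the per-million-token price."""
    return total_tokens / 1_000_000 * price_per_million


# Hypothetical inputs: 100,000 documents of 300 characters each,
# priced at a placeholder rate of $0.02 per million tokens.
tokens = estimate_tokens(100_000, 300)
cost = estimate_cost(tokens, price_per_million=0.02)
print(f"{tokens:,.0f} tokens, about ${cost:.2f}")
```

Swap in your model's current published price before relying on the dollar figure.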

What is the optimal batch size and why does it matter?

Most embedding APIs accept many inputs in a single request, up to a model-specific cap (often 2048). Batching slashes the number of HTTP round-trips and helps you stay under your requests-per-minute limit. The planner picks the largest batch that fits under both the model's input cap and the per-request token budget implied by your TPM/RPM ratio.
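A minimal sketch of that batch-size rule, under stated assumptions: the function names are hypothetical, the default cap of 2048 is the common value mentioned above rather than a guarantee for any specific model, and the floor of one input per request (for documents larger than the token budget) is an assumption of this sketch.

```python
import math


def optimal_batch_size(avg_tokens_per_doc: float, rpm: int, tpm: int,
                       max_inputs: int = 2048) -> int:
    """Largest batch that fits both the input cap and the per-request token budget."""
    token_budget = tpm / rpm                      # tokens available per request
    fits = int(token_budget // avg_tokens_per_doc)
    # Assumption: always send at least one input, even if it exceeds the budget.
    return max(1, min(max_inputs, fits))


def request_count(doc_count: int, batch_size: int) -> int:
    """Number of requests needed to cover every document."""
    return math.ceil(doc_count / batch_size)


# Hypothetical limits: 500 RPM and 1,000,000 TPM give a 2,000-token budget,
# so documents of about 50 tokens each pack 40 to a request.
batch = optimal_batch_size(avg_tokens_per_doc=50, rpm=500, tpm=1_000_000)
print(batch, request_count(100_000, batch))
```

Note that the TPM/RPM ratio is a budget in tokens, so it has to be divided by the average tokens per document before it can be compared with the input cap.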

Why does the estimated time depend on a bottleneck?

Throughput is limited by whichever ceiling you hit first — requests per minute or tokens per minute. With small documents you usually run out of requests; with large documents you run out of tokens. The planner computes the time implied by each limit and reports the larger one, labelling which limit binds so you know what to ask your provider to raise.
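As a rough illustration of that comparison, the sketch below computes the time implied by each limit and returns the larger one with a label. The function name and the tie-breaking choice (a tie is labelled TPM) are assumptions made for this example, not documented planner behavior.

```python
def run_time_minutes(request_count: int, total_tokens: float,
                     rpm: int, tpm: int) -> tuple[float, str]:
    """Return (estimated minutes, binding limit) for a run."""
    time_under_rpm = request_count / rpm   # minutes needed if only RPM applied
    time_under_tpm = total_tokens / tpm    # minutes needed if only TPM applied
    if time_under_rpm > time_under_tpm:
        return time_under_rpm, "RPM"
    return time_under_tpm, "TPM"           # assumption: ties are reported as TPM


# Hypothetical runs against limits of 3,000 RPM and 1,000,000 TPM.
# Many small requests: the request ceiling is hit first.
print(run_time_minutes(60_000, 6_000_000, rpm=3_000, tpm=1_000_000))
# Fewer, heavier requests: the token ceiling is hit first.
print(run_time_minutes(10_000, 20_000_000, rpm=3_000, tpm=1_000_000))
```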

How accurate is the token estimate?

The four-characters-per-token rule is a rough average for English; code, other languages, and unusual formatting can differ substantially. For a precise figure, run a sample through the model's real tokenizer. Treat the planner's output as a solid budgeting estimate, not an invoice.
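One way to calibrate is to measure characters per token on a sample and use that ratio in place of four. The sketch below is hedged accordingly: `count_tokens` is a hypothetical callable that you would back with your model's real tokenizer, and the whitespace splitter shown here is only a stand-in so the example runs on its own. It does not reflect how any real tokenizer behaves.

```python
from typing import Callable, Iterable


def chars_per_token(samples: Iterable[str],
                    count_tokens: Callable[[str], int]) -> float:
    """Measured characters-per-token ratio over a sample of documents."""
    total_chars = 0
    total_tokens = 0
    for text in samples:
        total_chars += len(text)
        total_tokens += count_tokens(text)
    if total_tokens == 0:
        raise ValueError("samples produced no tokens")
    return total_chars / total_tokens


def calibrated_tokens(doc_count: int, avg_chars: float, ratio: float) -> float:
    """Total-token estimate using a measured ratio instead of the ~4 heuristic."""
    return doc_count * avg_chars / ratio


def stand_in_tokenizer(text: str) -> int:
    """Placeholder only: replace with a call to the model's real tokenizer."""
    return len(text.split())


samples = ["embedding models turn text into vectors",
           "def add(a, b): return a + b"]
ratio = chars_per_token(samples, stand_in_tokenizer)
print(f"measured ratio: {ratio:.2f} characters per token")
print(f"calibrated estimate: {calibrated_tokens(100_000, 300, ratio):,.0f} tokens")
```

The larger and more representative the sample, the closer the calibrated estimate should track the billed token count.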

Does the tool account for batch-API discounts?

No — it prices at the standard synchronous rate. Many providers offer roughly a 50% discount on asynchronous batch endpoints. If you can tolerate the higher latency of a batch job, halve the displayed cost as a starting estimate and confirm against current pricing.

What is the Embedding Batch Cost Planner?

Enter document count, average size, embedding model, and your RPM/TPM limits to get total token cost, the optimal batch size, request count, and estimated run time, with the binding rate-limit bottleneck flagged. It runs free in your browser on Gera Tools, with nothing uploaded.

Embedding Batch Cost Planner

Name: Embedding Batch Cost Planner
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Before you embed a large corpus you want two numbers up front: how much will it cost, and how long will it take under your rate limits. This planner gives you both, plus the batch size that gets you there fastest — all computed locally in your browser.

How it works

Token volume is estimated as characters ÷ ~4 (a standard English heuristic) times your document count. Cost is that token total, expressed in millions, times the selected model’s per-million-token price. Embeddings bill on input tokens only, so there’s no output cost to add.

For scheduling, the tool finds the optimal batch size: the most inputs it can pack into one request without exceeding the model’s input cap (commonly 2048) or the per-request token budget implied by your tokens-per-minute ÷ requests-per-minute ratio. From there it derives the number of requests and the run time, computing the time under both your RPM and TPM ceilings and reporting whichever is larger. The binding bottleneck is labelled so you know exactly which limit to ask your provider to raise.

Reading the bottleneck

Request (RPM) bound — you have lots of small documents and run out of requests before tokens. Bigger batches help most here.
Token (TPM) bound — your documents are large and you saturate the token budget. Only a higher TPM tier (or fewer/shorter chunks) speeds things up.

Worked example: 100,000 documents at 300 characters each

Suppose you have 100,000 chunks of 300 characters average, using a model with a rate limit of 3,000 RPM and 1,000,000 TPM:

estimated tokens   = 100,000 × 300 / 4 = 7,500,000 tokens total (75 per document)
token budget       = 1,000,000 TPM / 3,000 RPM ≈ 333 tokens per request
optimal batch      = min(2048 inputs, floor(333 / 75)) = min(2048, 4) = 4 inputs per request
number of requests = ceil(100,000 / 4) = 25,000 requests
time under RPM     = 25,000 / 3,000 RPM ≈ 8.33 minutes  (RPM binds)
time under TPM     = 7,500,000 / 1,000,000 TPM = 7.5 minutes
estimated runtime  ≈ 8.33 minutes

In this case the binding bottleneck is RPM, but only narrowly. The 333-token budget holds 4.44 documents, and rounding down to 4 leaves each request at 300 tokens, so the request ceiling is reached slightly before the token ceiling. The token limit is close behind at 7.5 minutes, so the run cannot finish faster than that without a higher TPM tier or less text to embed.

Strategies for reducing embedding time

When the planner shows a long estimated runtime, these techniques speed things up:

Request a rate limit increase. Providers often grant higher limits on request, especially for verified business accounts. Identify whether RPM or TPM is the bottleneck first — asking for the wrong one has no effect.
Use the async batch API. Most providers offer an asynchronous batch endpoint (submitted as a job, polled for completion) that processes at higher throughput than synchronous calls and often at lower cost. The trade-off is latency — results arrive in minutes to hours rather than instantly.
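Shorter chunks shift the constraint toward RPM. Smaller chunks use fewer tokens per document, but re-chunking the same corpus leaves the total token count roughly unchanged, so TPM time only drops if you embed less text overall. Whether this is worthwhile depends on the retrieval quality trade-off.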
Parallelize across API keys. In some architectures, embedding across multiple accounts multiplies the effective rate limit. Check your provider’s terms of service before doing this.

Tips

Chunk size is a lever on both axes: smaller chunks cost fewer tokens each but produce more documents, shifting you toward the RPM bottleneck.
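If you can tolerate latency, the provider’s asynchronous batch endpoint, where one is offered, often costs roughly half the synchronous rate; confirm against current pricing.
Tokenize one representative document with the model’s real tokenizer, then re-run the planner with that count to calibrate the estimate.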