Why does batch size matter for embeddings?

Embedding APIs let you send many inputs in one request. Larger batches mean fewer HTTP round-trips and less per-request overhead, so a big job finishes faster, up to the point where the tokens-per-minute (TPM) rate limit caps how much you can push per window.

Does batching change the dollar cost?

No. You pay per token regardless of how the tokens are grouped into requests. Batching changes job duration and request count, not the token bill. This tool shows both so you can optimise speed without surprises on cost.

What batch size should I pick?

Pick the largest batch your provider allows that still keeps each request comfortably inside the TPM limit and within any per-request input cap. The tool highlights the recommended size based on your numbers.
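The rule above can be sketched in a few lines of Python. This is an illustration, not the tool's actual code: the function name, the two cap parameters and the 50% headroom default are assumptions, and the real caps depend on your provider.

```python
def largest_safe_batch(tokens_per_doc, tpm_limit, max_inputs_per_request,
                       max_tokens_per_request, tpm_headroom=0.5):
    """Return the largest batch size (in documents) that respects every cap.

    All caps are hypothetical inputs; look up the real values for your provider.
    tpm_headroom is an assumed safety factor: one request may use at most this
    fraction of the per-minute token budget.
    """
    if tokens_per_doc <= 0:
        raise ValueError("tokens_per_doc must be positive")
    # Documents that fit inside the allowed share of the TPM window.
    by_tpm = int(tpm_limit * tpm_headroom) // tokens_per_doc
    # Documents that fit inside the per-request token cap.
    by_request_cap = max_tokens_per_request // tokens_per_doc
    # The tightest of the three limits wins; 0 means not even one document fits.
    return max(0, min(max_inputs_per_request, by_tpm, by_request_cap))


# Example with made-up caps: 2,048 inputs and 300,000 tokens per request.
recommended = largest_safe_batch(250, 500_000, 2_048, 300_000)
print(recommended)
```

With these illustrative numbers the TPM headroom is the tightest constraint, so it determines the result rather than the input-count cap.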

Is my data sent anywhere?

No. All calculations run in your browser. Nothing you enter is uploaded, stored or logged.

What is the Batch Size Optimizer for Embedding Jobs?

The Batch Size Optimizer for Embedding Jobs is a calculator for throughput, total job time and cost efficiency across embedding batch sizes. Enter your total documents, tokens-per-minute rate limit, average tokens per document and request latency to find the batch size that finishes fastest without hitting limits. It runs free in your browser on Gera Tools, with nothing uploaded.

Batch Size Optimizer for Embedding Jobs

Name: Batch Size Optimizer for Embedding Jobs
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Optimize embedding batch size for speed and cost

Embedding a large corpus is throughput-bound: the token bill is fixed, but how fast the job finishes depends on batch size. Bigger batches cut per-request overhead and finish sooner — until you hit the model’s tokens-per-minute (TPM) rate limit. This tool sweeps batch sizes for your corpus and shows job duration, request count and cost so you can pick the fastest size that stays inside the limit.

How it works

For a corpus of N documents at t tokens each, a batch of b documents needs:

total_requests = ceil(N / b)
tokens_per_request = b × t

Job time is bounded by two things: request latency (total_requests × latency, reducible with parallelism) and the TPM ceiling (total_tokens / TPM, in minutes, where total_tokens = N × t), which no batching can beat. The tool reports the larger of the two as the realistic floor, and flags batch sizes whose per-request token count would exceed the TPM window or a typical input cap.
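As a rough illustration, the two bounds can be computed like this in Python. The function and field names are assumptions made for the sketch, and the way workers divide the request count is a simplification of real concurrency; it is not the tool's own implementation.

```python
import math


def job_time_bounds(num_docs, tokens_per_doc, batch_size, latency_s,
                    tpm_limit, workers=1):
    """Estimate the latency bound, the TPM bound and the resulting floor (seconds)."""
    total_requests = math.ceil(num_docs / batch_size)
    tokens_per_request = batch_size * tokens_per_doc
    # Each worker handles its share of requests one after another.
    latency_bound_s = math.ceil(total_requests / workers) * latency_s
    # Total tokens divided by the per-minute budget, converted to seconds.
    tpm_bound_s = num_docs * tokens_per_doc / tpm_limit * 60
    return {
        "total_requests": total_requests,
        "tokens_per_request": tokens_per_request,
        "latency_bound_s": latency_bound_s,
        "tpm_bound_s": tpm_bound_s,
        "floor_s": max(latency_bound_s, tpm_bound_s),
        # A single request larger than the whole TPM window can never be sent.
        "exceeds_tpm_window": tokens_per_request > tpm_limit,
    }


bounds = job_time_bounds(num_docs=10_000, tokens_per_doc=500, batch_size=100,
                         latency_s=0.3, tpm_limit=1_000_000)
print(bounds)
```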

Tips for fast embedding jobs

Max out the batch within limits. Most embedding endpoints accept dozens to thousands of inputs per call — use the largest the API allows.
Run batches in parallel up to TPM. Once the batch is large, concurrency fills the remaining TPM headroom; beyond that you only earn HTTP 429 (Too Many Requests) errors.
Pre-dedupe and chunk well. Embedding identical or near-identical chunks wastes tokens; clean the corpus before the job.
Use the batch API for non-urgent jobs. Asynchronous batch endpoints are often cheaper and are typically metered separately from real-time rate limits.
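The dedupe and batching tips can be combined in a small preprocessing step. The Python sketch below is an assumption about how you might do it: it removes only exact duplicates (after normalising whitespace and case), so near-duplicate detection would need a separate technique, and the function name is hypothetical.

```python
def dedupe_and_batch(texts, batch_size):
    """Drop exact duplicates (ignoring case and extra whitespace), then batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    seen = set()
    unique = []
    for text in texts:
        # Normalised key: collapse runs of whitespace and lowercase.
        key = " ".join(text.split()).lower()
        if key not in seen:
            seen.add(key)
            unique.append(text)  # keep the first occurrence, original form
    # Slice the unique texts into consecutive batches.
    return [unique[i:i + batch_size] for i in range(0, len(unique), batch_size)]


batches = dedupe_and_batch(["red shoe", "Red  shoe", "blue hat", "green bag"], 2)
print(batches)
```

Each inner list would then go out as one embedding request.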

Understanding the two ceilings

When embedding a large corpus you are racing against two independent limits simultaneously, and the slower one sets your floor:

Request-latency ceiling. In this model each API call takes a fixed time regardless of batch size. If you have 10,000 documents batched at 100 per call, that is 100 calls. At 0.3 seconds per call (single-threaded) that is 30 seconds minimum. Adding concurrency reduces this proportionally until you hit the second ceiling.

TPM ceiling. Every token you send consumes your tokens-per-minute budget. 10,000 documents averaging 500 tokens each is 5,000,000 tokens. At a 1,000,000 TPM limit that is at least 5 minutes of wall-clock time, and no degree of concurrency or choice of batch size can compress it further.

The tool reports both bounds so you can see which one is binding for your specific numbers. When the TPM ceiling dominates, adding more parallelism only earns 429 errors; when latency dominates, concurrency genuinely helps.
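To see which ceiling binds, and how much concurrency is worth adding, a comparison along these lines is enough. This Python sketch assumes ideal scaling (latency time divides evenly across workers); the function name and return shape are illustrative, not taken from the tool.

```python
import math


def binding_ceiling(total_requests, latency_s, total_tokens, tpm_limit):
    """Return (binding, useful_workers) for a single-threaded baseline.

    binding is "tpm" or "latency". useful_workers is the smallest worker count
    at which the latency bound drops to the TPM floor; more workers than that
    cannot shorten the job.
    """
    single_thread_s = total_requests * latency_s
    tpm_floor_s = total_tokens / tpm_limit * 60
    binding = "tpm" if tpm_floor_s >= single_thread_s else "latency"
    useful_workers = max(1, math.ceil(single_thread_s / tpm_floor_s))
    return binding, useful_workers


# Numbers from the text: 100 calls at 0.3 s, 5,000,000 tokens at 1,000,000 TPM.
print(binding_ceiling(100, 0.3, 5_000_000, 1_000_000))

# A slower, hypothetical endpoint: 1,000 calls at 2 s each.
print(binding_ceiling(1_000, 2.0, 5_000_000, 1_000_000))
```

In the first case the TPM ceiling already dominates with one worker; in the second, latency dominates until enough workers are added to reach the TPM floor.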

Worked example

Suppose you have 50,000 product descriptions averaging 250 tokens each, with a TPM limit of 500,000 and a per-call latency of 200 ms. Total tokens: 12,500,000, so the TPM floor alone is 25 minutes (1,500 seconds). A batch of 100 inputs means 500 calls; at 200 ms each that is 100 seconds single-threaded, and 10 parallel workers shrink this to 10 seconds, well under the 25-minute TPM floor. Increasing the batch to 500 inputs cuts the call count to 100 but does not shorten the job, because the TPM ceiling is already the bottleneck. The tool plots this across batch sizes so you can see the flat region where bigger batches stop helping.
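The same example can be swept across batch sizes in Python to make the flat region visible. The `sweep` helper and the chosen batch sizes are assumptions for illustration; the sketch uses a single worker so the latency-bound end of the curve shows up.

```python
import math


def sweep(num_docs, tokens_per_doc, latency_s, tpm_limit, batch_sizes, workers=1):
    """Return (batch_size, total_requests, floor_seconds) for each batch size."""
    # The TPM floor does not depend on batch size.
    tpm_floor_s = num_docs * tokens_per_doc / tpm_limit * 60
    rows = []
    for b in batch_sizes:
        requests = math.ceil(num_docs / b)
        latency_bound_s = math.ceil(requests / workers) * latency_s
        rows.append((b, requests, max(latency_bound_s, tpm_floor_s)))
    return rows


rows = sweep(num_docs=50_000, tokens_per_doc=250, latency_s=0.2,
             tpm_limit=500_000, batch_sizes=[1, 10, 100, 500])
for b, requests, floor_s in rows:
    print(f"batch={b:>3}  requests={requests:>6}  floor={floor_s / 60:.1f} min")
```

Only the smallest batch size is latency-bound here; from 10 documents per request upward the floor stays at the 25-minute TPM limit.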