Optimize embedding batch size for speed and cost
Embedding a large corpus is throughput-bound: the token bill is fixed, but how fast the job finishes depends on batch size. Bigger batches cut per-request overhead and finish sooner — until you hit the model’s tokens-per-minute (TPM) rate limit. This tool sweeps batch sizes for your corpus and shows job duration, request count and cost so you can pick the fastest size that stays inside the limit.
How it works
For a corpus of N documents at t tokens each, a batch of b documents needs:
total_requests = ceil(N / b)
tokens_per_request = b × t
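As a quick check on the two formulas above, here is a minimal Python sketch. The function name `batch_plan` and the sample numbers are illustrative assumptions, not part of the tool. Note that `tokens_per_request` describes a full batch; the final batch is usually smaller.

```python
import math

def batch_plan(n_docs: int, tokens_per_doc: int, batch_size: int) -> tuple[int, int]:
    """Return (total_requests, tokens_per_request) for a full batch."""
    total_requests = math.ceil(n_docs / batch_size)   # ceil(N / b)
    tokens_per_request = batch_size * tokens_per_doc  # b × t
    return total_requests, tokens_per_request

# Illustrative numbers only: 100,000 documents of 500 tokens, 256 per batch.
print(batch_plan(100_000, 500, 256))  # (391, 128000)
```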
Job time is bounded below by two things: request latency (total_requests × per-request latency, reducible with parallelism) and the TPM ceiling (total_tokens / TPM minutes, where total_tokens = N × t, which no batching can beat). The tool reports the larger of the two as the realistic floor, and flags batch sizes whose per-request token count would exceed the TPM limit or a typical per-request input cap.
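The floor calculation described above could be sketched roughly as follows. This is not the tool's actual code: the function name, the `concurrency` and `input_cap_tokens` parameters, and all sample values are assumptions. It also assumes latency per request stays constant as the batch grows and that parallelism divides the latency bound evenly, both of which are simplifications.

```python
import math

def job_floor_seconds(n_docs, tokens_per_doc, batch_size, latency_s, tpm,
                      concurrency=1, input_cap_tokens=None):
    """Return (floor_seconds, flagged) for one batch size."""
    total_requests = math.ceil(n_docs / batch_size)
    tokens_per_request = batch_size * tokens_per_doc
    total_tokens = n_docs * tokens_per_doc

    # Latency bound: total_requests × latency, spread over parallel workers.
    latency_bound = total_requests * latency_s / concurrency
    # TPM bound: total_tokens / TPM is in minutes, so convert to seconds.
    tpm_bound = total_tokens / tpm * 60

    # Flag batches too large for the TPM limit or the (assumed) input cap.
    flagged = tokens_per_request > tpm or (
        input_cap_tokens is not None and tokens_per_request > input_cap_tokens
    )
    return max(latency_bound, tpm_bound), flagged

# Hypothetical sweep: 100,000 docs × 500 tokens, 0.5 s latency, 5M TPM.
for b in (16, 64, 256, 1024):
    floor_s, flagged = job_floor_seconds(
        100_000, 500, b, latency_s=0.5, tpm=5_000_000,
        concurrency=4, input_cap_tokens=300_000,
    )
    print(b, round(floor_s, 1), "FLAGGED" if flagged else "ok")
```

With these made-up inputs the smallest batch is latency-bound, the larger ones sit on the TPM floor, and the largest is flagged for exceeding the assumed input cap.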
Tips for fast embedding jobs
- Max out the batch within limits. Most embedding endpoints accept dozens to thousands of inputs per call — use the largest the API allows.
- Run batches in parallel up to TPM. Once the batch is large, concurrency fills the remaining TPM headroom; beyond that, extra requests only come back as HTTP 429 (rate limit) errors.
- Pre-dedupe and chunk well. Embedding identical or near-identical chunks wastes tokens; clean the corpus before the job.
- Use the batch API for non-urgent jobs. Asynchronous batch endpoints are often cheaper and typically run under their own quotas instead of the real-time rate limits.
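Two of the tips above lend themselves to a short sketch: estimating how many parallel workers fit under the TPM limit, and dropping duplicate chunks before the job. Both helper names are hypothetical. The concurrency estimate assumes each worker sends requests back to back with a fixed latency, and the dedupe step catches only exact matches after whitespace and case normalization, not near-duplicates in general.

```python
import math

def max_useful_concurrency(batch_size, tokens_per_doc, latency_s, tpm):
    """Largest worker count whose combined token rate stays within TPM.

    A result of 0 means even one unthrottled worker would exceed the limit.
    """
    tokens_per_request = batch_size * tokens_per_doc
    # Tokens one worker sends per minute when issuing requests back to back.
    tokens_per_worker_minute = tokens_per_request * 60 / latency_s
    return math.floor(tpm / tokens_per_worker_minute)

def dedupe_exact(chunks):
    """Keep the first occurrence of each chunk, ignoring case and extra whitespace."""
    seen = set()
    unique = []
    for chunk in chunks:
        key = " ".join(chunk.split()).lower()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique

# Illustrative numbers only: 64 docs × 500 tokens per request, 1 s latency, 5M TPM.
print(max_useful_concurrency(64, 500, 1.0, 5_000_000))  # 2
print(dedupe_exact(["Hello  world", "hello world", "other"]))
```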