Parallel Requests Cost & Rate Limit Planner

Plan parallel LLM calls to maximize throughput within budget

Plan parallel LLM calls without hitting rate limits

When you call an LLM API at scale, two ceilings fight each other: the provider’s tokens-per-minute (TPM) limit and per-request latency. Fire requests serially and latency starves your throughput; fire too many in parallel and you slam into 429 rate-limit errors. This planner finds the number of parallel workers that hits your target throughput while staying inside both ceilings, and shows what it costs per minute.

How it works

The maximum sustainable throughput allowed by your rate limit is:

max_requests_per_min = TPM_budget / avg_tokens_per_request
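
As a rough illustration of the formula above, here is a minimal Python sketch. The function name and the sample figures (a 90,000 TPM budget, 1,500 tokens per request) are assumptions picked for the example, not values from any particular provider. It also assumes avg_tokens_per_request counts every token the provider meters against the limit.

```python
def max_requests_per_min(tpm_budget: float, avg_tokens_per_request: float) -> float:
    """Requests per minute that the TPM budget can sustain."""
    if avg_tokens_per_request <= 0:
        raise ValueError("avg_tokens_per_request must be positive")
    return tpm_budget / avg_tokens_per_request


# Hypothetical figures: 90,000 TPM budget, 1,500 tokens per request.
print(max_requests_per_min(90_000, 1_500))  # 60.0
```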

To actually sustain your target requests per minute, you need enough concurrent workers to cover request latency:

workers_needed = ceil(target_rpm × avg_latency_seconds / 60)
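
The worker formula is an application of Little's law (requests in flight = arrival rate × time per request). A small sketch, with a hypothetical function name and made-up sample numbers, might look like this:

```python
import math


def workers_needed(target_rpm: float, avg_latency_seconds: float) -> int:
    """Concurrent workers required to sustain target_rpm at the given latency."""
    # Requests per second times seconds per request gives requests in flight;
    # round up because a fraction of a worker cannot be provisioned.
    return math.ceil(target_rpm * avg_latency_seconds / 60)


# Hypothetical figures: 45 requests/min at 2.5 s average latency.
print(workers_needed(45, 2.5))  # 2  (45 * 2.5 / 60 = 1.875, rounded up)
```

Because the result is rounded up, the pool has slight headroom whenever the product is not a whole number.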

If your target exceeds max_requests_per_min, you are rate-limited: no amount of concurrency helps, and you must either raise your TPM tier or reduce the tokens each request consumes. Otherwise you are latency-limited, and the planner reports the worker count and queue depth needed to keep the pipeline full.
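
The decision described above can be sketched end to end as follows. This is an illustrative sketch, not the planner's actual implementation: the Plan type, the plan function, the choice to cap the target at the TPM ceiling, and the flat price_per_1k_tokens cost model are all assumptions made for the example. Queue depth is left out because its formula is not given here.

```python
import math
from dataclasses import dataclass


@dataclass
class Plan:
    regime: str            # "rate-limited" or "latency-limited"
    achievable_rpm: float  # target, capped at the TPM ceiling (assumed policy)
    workers: int           # concurrent workers needed for achievable_rpm
    cost_per_min: float    # assumes one flat price for all tokens


def plan(target_rpm, tpm_budget, avg_tokens_per_request,
         avg_latency_seconds, price_per_1k_tokens):
    max_rpm = tpm_budget / avg_tokens_per_request
    regime = "rate-limited" if target_rpm > max_rpm else "latency-limited"
    achievable = min(target_rpm, max_rpm)
    workers = math.ceil(achievable * avg_latency_seconds / 60)
    # Assumed cost model: tokens per minute times a flat per-1K-token price.
    cost = achievable * avg_tokens_per_request / 1000 * price_per_1k_tokens
    return Plan(regime, achievable, workers, cost)


# Hypothetical inputs: 90,000 TPM, 1,500 tokens/request, 2.5 s latency.
ok = plan(45, 90_000, 1_500, 2.5, price_per_1k_tokens=0.002)
print(ok.regime, ok.workers)        # latency-limited 2
capped = plan(120, 90_000, 1_500, 2.5, price_per_1k_tokens=0.002)
print(capped.regime, capped.workers)  # rate-limited 3
```

In the second call the target of 120 requests per minute exceeds the 60 the budget allows, so the sketch sizes the pool for 60 rather than for the unreachable target.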

Tips for high-throughput pipelines

  • Batch where the API supports it. A single batched call carries more tokens per HTTP round-trip, so each worker completes more work per unit of latency. The TPM ceiling itself is unchanged, so batching helps most when you are latency-limited.
  • Add jittered backoff. When you do hit 429s, exponential backoff with jitter prevents a thundering-herd retry storm that wastes TPM.
  • Right-size the worker pool. Over-provisioning workers does nothing once you are TPM-bound; the extra workers only add idle connections.
  • Watch the bill. Higher throughput means proportionally higher cost; the cost-per-minute figure here keeps the budget conversation honest.
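
The jittered-backoff tip can be sketched as below. This is one common variant ("full jitter", where each delay is drawn uniformly between zero and an exponentially growing cap), not the only way to do it. RateLimitError is a hypothetical stand-in for whatever exception your client library raises on HTTP 429, and the base, cap, and max_retries defaults are arbitrary assumptions.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a client library's HTTP 429 error."""


def backoff_delay(attempt, base=1.0, cap=30.0, rng=random):
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0, min(cap, base * 2 ** attempt))


def call_with_backoff(fn, max_retries=5, sleep=time.sleep, **delay_kwargs):
    """Call fn(), retrying on RateLimitError with jittered exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries; surface the error to the caller
            sleep(backoff_delay(attempt, **delay_kwargs))
```

Each worker would wrap its request in call_with_backoff; because the delays are randomized, workers that were throttled at the same moment do not all retry at the same moment. The sleep parameter is injectable so the retry logic can be exercised without real waiting.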