Plan parallel LLM calls without hitting rate limits
When you call an LLM API at scale, two ceilings fight each other: the provider’s tokens-per-minute (TPM) limit and per-request latency. Fire requests serially and latency starves your throughput; fire too many in parallel and you slam into 429 rate-limit errors. This planner finds the number of parallel workers that hits your target throughput while staying inside both ceilings, and shows what it costs per minute.
How it works
The maximum sustainable throughput allowed by your rate limit is:
max_requests_per_min = TPM_budget / avg_tokens_per_request
To actually sustain your target requests per minute, you need enough concurrent workers to cover request latency:
workers_needed = ceil(target_rpm × avg_latency_seconds / 60)
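As a rough illustration, both formulas can be evaluated directly. The sketch below assumes a 90,000 TPM budget, 1,500 tokens per request, a target of 40 requests per minute, and 12 seconds of average latency. These numbers and the function names are made up for the example and are not provider defaults.

```python
import math


def max_requests_per_min(tpm_budget: float, avg_tokens_per_request: float) -> float:
    """Request ceiling imposed by the tokens-per-minute budget."""
    return tpm_budget / avg_tokens_per_request


def workers_needed(target_rpm: float, avg_latency_seconds: float) -> int:
    """Concurrent workers needed to sustain target_rpm at the given latency."""
    return math.ceil(target_rpm * avg_latency_seconds / 60)


# Illustrative inputs (assumptions for this example only).
print(max_requests_per_min(90_000, 1_500))  # 60.0
print(workers_needed(40, 12))               # 8
```

With 12 seconds of latency, each worker completes 60 / 12 = 5 requests per minute, so 8 workers cover a target of 40.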
If your target exceeds max_requests_per_min, you are rate-limited: no amount of concurrency helps, and you must either raise your TPM tier or send fewer tokens per request. Otherwise you are latency-limited, and the planner reports the worker count and queue depth needed to keep the pipeline full.
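The decision above could be sketched as a small function. This is a minimal sketch, not the planner's actual implementation: the `Plan` type and `plan` function are hypothetical names, and sizing the worker pool for the TPM ceiling in the rate-limited case is an assumption of this example. Queue depth is left out because the text does not define how it is computed.

```python
import math
from dataclasses import dataclass


@dataclass
class Plan:
    regime: str          # "rate-limited" or "latency-limited"
    max_rpm: float       # request ceiling implied by the TPM budget
    achievable_rpm: float
    workers: int         # workers needed for achievable_rpm


def plan(tpm_budget: float, avg_tokens_per_request: float,
         target_rpm: float, avg_latency_seconds: float) -> Plan:
    max_rpm = tpm_budget / avg_tokens_per_request
    if target_rpm > max_rpm:
        # Rate-limited: the target is out of reach at this TPM tier.
        # Assumption: size the pool for the ceiling, not the target.
        regime, achievable = "rate-limited", max_rpm
    else:
        # Latency-limited: the target fits under the ceiling.
        regime, achievable = "latency-limited", target_rpm
    workers = math.ceil(achievable * avg_latency_seconds / 60)
    return Plan(regime, max_rpm, achievable, workers)


# Illustrative inputs (assumptions for this example only).
print(plan(90_000, 1_500, target_rpm=40, avg_latency_seconds=12).regime)
print(plan(90_000, 1_500, target_rpm=100, avg_latency_seconds=12).regime)
```

The first call fits under the 60 requests-per-minute ceiling and is latency-limited; the second asks for 100 and is rate-limited.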
Tips for high-throughput pipelines
- Batch where the API supports it. A single batched call carries more tokens per HTTP round-trip, so each worker gets more done per request when you are latency-limited. The TPM ceiling itself does not move.
- Add jittered backoff. When you do hit 429s, exponential backoff with jitter prevents a thundering-herd retry storm that wastes TPM.
- Right-size the worker pool. Once you are TPM-bound, extra workers add no throughput; they sit idle or trigger more 429s.
- Watch the bill. Higher throughput means proportionally higher cost; the cost-per-minute figure here keeps the budget conversation honest.
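The jittered-backoff tip can be sketched as follows. This is one common variant (often called "full jitter"), not the only option. `RateLimitError` is a hypothetical stand-in for whatever exception your client raises on HTTP 429, and the base delay, cap, and attempt count are placeholder values.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error your client raises on HTTP 429."""


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_backoff(send, max_attempts: int = 5, sleep=time.sleep):
    """Call send(), retrying on RateLimitError with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the error to the caller
            sleep(backoff_delay(attempt))
```

Because each worker draws its own random delay, workers that hit a 429 at the same moment retry at different times instead of all at once. If the provider returns a retry hint with the 429, honoring it is usually a better choice than a computed delay.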