Tokens per minute (TPM) is the maximum number of tokens a provider will process for your account in a 60-second window. Exceed it and you get HTTP 429 rate-limit errors, so throughput is capped by TPM, not just by how many requests you fire.

How do I find the right parallel worker count?

Parallel workers ≈ target requests per minute × average latency in minutes. If each request takes 6 seconds (0.1 min) and you want 600/min, you need roughly 60 concurrent workers — assuming TPM allows it.

What does 'rate-limited vs latency-limited' mean?

If your TPM ceiling is hit before you reach target throughput, you are rate-limited: adding workers will not help, and you must request a higher tier or use fewer tokens per request. If TPM is fine but each request is slow, you are latency-limited: throughput depends on how many requests you keep in flight, so you need more concurrent workers.

Is my data sent anywhere?

No. Everything is computed in your browser. Nothing you enter is uploaded, stored or logged.

What is the Parallel Requests Cost & Rate Limit Planner?

Given a tokens-per-minute rate limit and a target throughput, this planner calculates the optimal parallel request count, whether you are rate-limited or latency-limited, queue depth, and cost per minute for your chosen model. It runs free in your browser on Gera Tools, with nothing uploaded.

Parallel Requests Cost & Rate Limit Planner

Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Plan parallel LLM calls without hitting rate limits

When you call an LLM API at scale, two ceilings fight each other: the provider’s tokens-per-minute (TPM) limit and per-request latency. Fire requests serially and latency starves your throughput; fire too many in parallel and you slam into 429 rate-limit errors. This planner finds the number of parallel workers that hits your target throughput while staying inside both ceilings, and shows what it costs per minute.

The two constraints: TPM and latency

TPM (tokens per minute) is the hard ceiling set by your API provider. Every token you send as input and every token returned as output counts toward it. The rate limit is usually published per model tier — higher-tier accounts get higher limits. If you hit it, you receive HTTP 429 responses and must wait for the limit to reset (usually in a rolling 60-second window).
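To make the rolling window concrete, here is a minimal client-side sketch that tracks tokens spent over the trailing 60 seconds. The class name `RollingTokenWindow` and its methods are hypothetical, and providers may enforce their limits differently (for example with token buckets or fixed windows), so treat this as an estimate of remaining budget rather than a mirror of the provider's accounting.

```python
from collections import deque


class RollingTokenWindow:
    """Tracks tokens spent in the trailing window (60 s by default).

    Assumption: the provider counts input + output tokens over a rolling
    window. Real providers may account for usage differently.
    """

    def __init__(self, tpm_limit: int, window_seconds: float = 60.0):
        self.tpm_limit = tpm_limit
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, tokens) pairs, oldest first
        self._used = 0

    def _expire(self, now: float) -> None:
        # Drop entries that have aged out of the window.
        while self._events and now - self._events[0][0] >= self.window_seconds:
            _, tokens = self._events.popleft()
            self._used -= tokens

    def try_spend(self, tokens: int, now: float) -> bool:
        """Record the spend and return True if it fits; otherwise return False."""
        self._expire(now)
        if self._used + tokens > self.tpm_limit:
            return False
        self._events.append((now, tokens))
        self._used += tokens
        return True
```

In a real pipeline you would pass `time.monotonic()` as `now` and hold a request back whenever `try_spend` returns False, instead of sending it and collecting a 429.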

Per-request latency is how long a single call takes, typically dominated by the model’s generation speed (time to first token plus token generation rate). Even if your TPM budget is large, a slow model forces you to hold concurrent connections open longer to sustain a given requests-per-minute rate.
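A rough way to estimate per-request latency from those two components is sketched below. The function name and parameters are illustrative assumptions, and the model ignores network overhead and queueing on the provider side.

```python
def estimate_latency_seconds(
    ttft_seconds: float,
    output_tokens: int,
    tokens_per_second: float,
) -> float:
    """Approximate latency as time to first token plus generation time.

    Assumption: generation proceeds at a steady tokens_per_second rate.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return ttft_seconds + output_tokens / tokens_per_second
```

For instance, a 0.5-second time to first token and 250 output tokens at 100 tokens per second gives an estimate of 3 seconds. Measured latency from your own traffic is a better input to the planner whenever you have it.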

The relationship between the two is what this planner resolves.

How it works

The maximum sustainable throughput allowed by your rate limit is:

max_requests_per_min = TPM_budget / avg_tokens_per_request

To actually sustain your target requests per minute, you need enough concurrent workers to cover request latency:

workers_needed = ceil(target_rpm × avg_latency_seconds / 60)

If your target exceeds max_requests_per_min, you are rate-limited — no amount of concurrency helps and you must raise your TPM tier or reduce token consumption per request. Otherwise you are latency-limited, and the planner reports the worker count and queue depth needed to keep the pipeline full.
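The two formulas and the classification rule can be combined in a few lines. This is a minimal sketch and not the planner's actual source: the `Plan` dataclass and `plan` function are hypothetical names, and sizing the worker pool for the capped throughput when rate-limited is an assumption made here. Queue depth is left out because the text above does not define how it is computed.

```python
import math
from dataclasses import dataclass


@dataclass
class Plan:
    max_rpm: float         # ceiling imposed by the TPM budget
    achievable_rpm: float  # min(target, max_rpm)
    workers: int           # concurrent workers needed for achievable_rpm
    rate_limited: bool     # True when the target exceeds max_rpm


def plan(
    tpm_budget: float,
    avg_tokens_per_request: float,
    target_rpm: float,
    avg_latency_seconds: float,
) -> Plan:
    max_rpm = tpm_budget / avg_tokens_per_request
    rate_limited = target_rpm > max_rpm
    # Assumption: when rate-limited, size the pool for what the limit allows.
    achievable_rpm = min(target_rpm, max_rpm)
    workers = math.ceil(achievable_rpm * avg_latency_seconds / 60)
    return Plan(max_rpm, achievable_rpm, workers, rate_limited)
```

Calling `plan(100_000, 500, 400, 3)` reports `rate_limited=True` with 10 workers, because the TPM budget caps throughput at 200 requests per minute regardless of the 400 requested.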

Worked example

Suppose your TPM budget is 100,000 tokens/minute, each request uses an average of 500 tokens (input + output), and each call takes about 3 seconds:

max_rpm     = 100,000 / 500 = 200 requests/min
workers     = ceil(200 × 3 / 60) = ceil(10) = 10 workers

At 200 RPM and 500 tokens per request, the pipeline consumes the full 100,000 tokens per minute, so cost per minute is that token volume multiplied by your model's per-token price, and it scales linearly with volume. To push to 400 RPM you would need to double your TPM tier to 200,000 and run about 20 workers; adding more workers alone will only generate 429 errors.
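The cost side of the example can be sketched as follows. The function and its parameters are illustrative, the prices are placeholders and not real rates for any model, and splitting cost into separate input and output prices is an assumption based on how providers commonly bill.

```python
def cost_per_minute(
    rpm: float,
    avg_input_tokens: float,
    avg_output_tokens: float,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Cost per minute, given prices quoted per million tokens.

    Assumption: input and output tokens are billed at separate rates.
    """
    input_cost = rpm * avg_input_tokens * input_price_per_mtok / 1_000_000
    output_cost = rpm * avg_output_tokens * output_price_per_mtok / 1_000_000
    return input_cost + output_cost


# Placeholder prices (1.00 and 2.00 per million tokens), and a hypothetical
# 400 input / 100 output split of the 500 tokens per request.
per_minute = cost_per_minute(200, 400, 100, 1.00, 2.00)
per_hour = per_minute * 60
```

With those placeholder numbers the result is roughly 0.12 per minute. Swap in your provider's published prices and your measured token split before relying on the figure.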

Tips for high-throughput pipelines

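Batch where the API supports it. A single batched call carries more tokens per HTTP round-trip, so you need fewer requests and fewer workers to move the same token volume. Batching does not lift the TPM ceiling itself; every token still counts toward it.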
Add jittered exponential backoff. When you do hit 429s, exponential backoff with jitter prevents a thundering-herd retry storm that wastes your remaining TPM budget in the same 60-second window.
Right-size the worker pool. Over-provisioning workers does nothing once you are TPM-bound — it just adds idle connections and memory overhead on the client side.
Shorten prompts. Cutting the average token count per request is the most direct lever for raising effective RPM under a fixed TPM budget. Removing verbose system prompt text and compressing few-shot examples can yield savings on the order of 20–30%, though you should verify output quality on your own workload.
Watch the bill. Higher throughput means proportionally higher cost. The cost-per-minute and cost-per-hour figures here make the budget implication concrete before you commit to a tier upgrade.
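As an illustration of the backoff tip above, here is a minimal sketch of exponential backoff with full jitter. `RateLimitError` is a hypothetical stand-in for whatever exception your client library raises on HTTP 429, and the base delay, cap, and attempt count are arbitrary defaults and not recommendations from any provider.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error your client raises on HTTP 429."""


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    # Full jitter: a random delay between 0 and min(cap, base * 2**attempt).
    return rng() * min(cap, base * 2 ** attempt)


def call_with_backoff(send, max_attempts=6, sleep=time.sleep, rng=random.random):
    """Call send(), retrying with jittered backoff on RateLimitError."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            sleep(backoff_delay(attempt, rng=rng))
```

The `sleep` and `rng` parameters exist so the retry schedule can be tested without waiting. If your provider returns a Retry-After header, honoring it is usually a better choice than a computed delay.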