What is the difference between TPM and RPM?

TPM caps total tokens (prompt plus completion) processed per minute; RPM caps the number of requests per minute. Whichever you hit first is your binding limit, which this tool identifies for you.

How is safe concurrency calculated?

Safe concurrency is the safe requests-per-minute rate divided by 60 (to get requests per second), multiplied by the average latency in seconds. This is Little's Law: in-flight requests = arrival rate × time in system. The result is how many parallel workers you can keep busy without exceeding the rate.

Why leave headroom below the limit?

Bursts, retries, and token-count variance can push you over a hard ceiling and trigger 429s. The planner applies a safety margin so that a spike does not exceed the limit and set off cascading retries.

Does the completion token count really count against TPM?

Yes. Most providers count prompt plus completion tokens toward TPM, and many reserve the full max_tokens value up front. Setting a tight max_tokens lets you fit more requests inside the same TPM ceiling.

What if my limits change per tier?

Re-enter the new TPM and RPM from your dashboard. Limits scale with usage tier, so re-plan whenever you are promoted to a higher tier or request a raise.

What is the Rate Limit & Throughput Planner?

Given your TPM and RPM limits, the planner calculates the maximum sustainable throughput, the average response-time budget, and a recommended request queue depth for your workload. It runs free and fully client-side in your browser on Gera Tools, with nothing uploaded.

Rate Limit & Throughput Planner

Name: Rate Limit & Throughput Planner
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Rate limit and throughput planner

Every LLM API enforces two ceilings at once: tokens per minute (TPM) and requests per minute (RPM). Hit either and you get 429s, retries, and a stalled queue. This planner finds which limit binds first for your workload and tells you the maximum sustainable throughput and how many requests you can keep in flight at once.

How the math works

Each request consumes prompt + completion tokens, so your TPM budget allows TPM / tokens-per-request calls per minute. RPM caps requests directly. The smaller of the two is the binding limit. To translate a safe arrival rate into worker concurrency, the tool applies Little’s Law: the number of in-flight requests equals the arrival rate times the average time each request spends in the system (your average latency). The safety margin is subtracted from the binding limit before this step, so retries and token-count variance do not push you over the hard ceiling.
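The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the planner's actual source: the names plan_throughput and Plan are hypothetical, and rounding concurrency down (with a floor of one worker) is an assumption made here to stay on the conservative side.

```python
import math
from dataclasses import dataclass


@dataclass
class Plan:
    binding_limit: str   # "TPM" or "RPM"
    max_rpm: float       # requests/minute allowed by the binding limit
    safe_rpm: float      # max_rpm after the safety margin is subtracted
    concurrency: int     # in-flight requests, rounded down (assumption)


def plan_throughput(tpm_limit, rpm_limit, tokens_per_request,
                    avg_latency_s, safety_margin=0.20):
    """Hypothetical helper that mirrors the steps described in the text."""
    # TPM path: how many requests fit in the token budget each minute.
    rpm_from_tpm = tpm_limit / tokens_per_request

    # The smaller of the two paths is the binding limit.
    if rpm_from_tpm < rpm_limit:
        binding, max_rpm = "TPM", rpm_from_tpm
    else:
        binding, max_rpm = "RPM", float(rpm_limit)

    # Subtract the safety margin from the binding limit.
    safe_rpm = max_rpm * (1.0 - safety_margin)

    # Little's Law: in-flight = arrival rate (req/s) * time in system (s).
    in_flight = (safe_rpm / 60.0) * avg_latency_s

    # Round down, with a tiny epsilon so float noise cannot turn 3.0 into 2.
    concurrency = max(1, math.floor(in_flight + 1e-9))
    return Plan(binding, max_rpm, safe_rpm, concurrency)


if __name__ == "__main__":
    print(plan_throughput(tpm_limit=60_000, rpm_limit=600,
                          tokens_per_request=800, avg_latency_s=3.0))
```

Calling it with a different tokens_per_request or avg_latency_s is a quick way to see which input moves the worker count most for your workload.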

Worked example

Say your account allows 60,000 TPM and 600 RPM. Your average request uses 500 prompt tokens and 300 completion tokens, so 800 tokens per call. Your average latency is 3 seconds.

TPM path: 60,000 / 800 = 75 requests per minute.
RPM path: 600 requests per minute.
Binding limit: TPM, at 75 requests per minute (well below the 600 RPM cap).
Apply a safety margin (say 20%): safe throughput = 75 × 0.8 = 60 requests per minute.
Concurrency via Little’s Law: (60 / 60) × 3 = 3 parallel workers to keep the queue fed without bursting.

If you reduced average tokens to 400 per call, the TPM ceiling would rise to 150 requests per minute (120 after the 20% margin) and you could safely run 6 workers. That is why prompt compression is worth measuring before adding parallel workers.

Common pitfalls

Oversized max_tokens: many providers reserve the full max_tokens value against your TPM budget up front, before the model generates a single token. With max_tokens set to 4,096 on a response that typically uses 300 tokens, each request reserves 3,796 more tokens of your per-minute budget than it needs. Set max_tokens as tight as your use case allows.
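To see how much an oversized cap costs, the short Python sketch below compares requests per minute under the same TPM budget. It assumes the provider charges the full max_tokens against TPM up front, as described above; whether yours does is something to confirm in its documentation. The function name and the 60,000 TPM and 500-token prompt figures are illustrative.

```python
def requests_per_minute(tpm_limit, prompt_tokens, reserved_output_tokens):
    """Requests per minute that fit in the TPM budget.

    Assumption: the provider reserves `reserved_output_tokens` (your
    max_tokens) against TPM when the request is admitted.
    """
    return tpm_limit / (prompt_tokens + reserved_output_tokens)


# Same account and prompt size, two different max_tokens settings.
loose = requests_per_minute(60_000, 500, 4096)  # max_tokens = 4096
tight = requests_per_minute(60_000, 500, 300)   # max_tokens = 300

if __name__ == "__main__":
    print(f"max_tokens=4096: {loose:.1f} requests/min")
    print(f"max_tokens=300:  {tight:.1f} requests/min")
```

Under that assumption the loose cap admits roughly 13 requests per minute, against 75 with the tight one.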

Instant retry on 429 — a queue that retries immediately on a rate-limit error turns one overrun into a sustained flood that keeps hitting the ceiling. Add jittered exponential backoff: wait, then wait longer, with random jitter to prevent thundering-herd effects across parallel workers.
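One way to implement this is the "full jitter" variant of exponential backoff, sketched below in Python. RateLimitError is a hypothetical stand-in for whatever exception your client library raises on a 429, and the base delay, cap, and attempt count are placeholder values rather than recommendations.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error your client raises on HTTP 429."""


def backoff_delay(attempt, base_s=1.0, cap_s=60.0, rng=random):
    """Full jitter: a uniform delay in [0, min(cap_s, base_s * 2**attempt)]."""
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def call_with_backoff(send, max_attempts=5, sleep=time.sleep):
    """Call send(); on RateLimitError, wait a jittered delay and retry."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimitError:
            # Give up after the last attempt and let the caller handle it.
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))
```

Because each worker draws its own random delay, parallel workers that hit the limit together do not all retry at the same instant. The upper bound doubles with every attempt, so the expected wait grows even though any single delay may be short.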

Planning only once: providers raise limits as you move up usage tiers, either automatically or on request. A concurrency setting that was safe at Tier 1 will leave capacity unused at Tier 2. Re-run the planner after every tier change or limit raise.

Tips

Set a tight max_tokens — an oversized cap silently shrinks your effective throughput.
Add jittered exponential backoff on 429s.
Re-plan after every tier promotion or limit raise.