What is the difference between RPM and TPM?

RPM is the maximum number of requests per minute you may send; TPM is the maximum number of tokens per minute you may process. Providers enforce both at once, so your real throughput is limited by whichever ceiling you hit first. A high RPM is useless if your large prompts saturate the TPM budget.

Why does the calculator apply a safety margin?

Token counts vary per request and clocks drift, so running at exactly 100% of your limit all but guarantees occasional 429 errors. The default 90% safety margin targets 90% of the binding limit, leaving 10% headroom for variance and bursty traffic while keeping throughput high. Raise it (for example to 95%) if your workload is very predictable; lower it (for example to 80%) if request sizes vary widely.

How does request size change my effective rate?

Your effective rate is the lower of two numbers: the RPM ceiling, and the TPM ceiling divided by tokens per request. Small requests are RPM-bound; large requests are TPM-bound. The calculator shows which constraint binds so you know whether to shrink prompts (TPM-bound) or request a higher RPM (RPM-bound).

What concurrency level should I use?

Concurrency is roughly your safe RPM multiplied by the average request latency in minutes. The tool estimates a safe number of parallel in-flight requests from your sustainable rate so you saturate the limit without overshooting it. Pair it with exponential backoff for safety.

Does staying under the limit guarantee no 429s?

It makes them rare, not impossible. Token estimates can be off, other clients may share the quota, and providers sometimes apply burst limits. Always keep exponential backoff and Retry-After handling in your client even when your average rate is well under the ceiling.

What is the LLM Rate Limit Calculator?

Enter your requests-per-minute and tokens-per-minute limits plus your typical request size, and get safe concurrency, batch size and sleep intervals that keep you under both ceilings and make 429 errors rare. It runs free in your browser on Gera Tools, with nothing uploaded.

LLM Rate Limit Calculator

Name: LLM Rate Limit Calculator
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Hitting 429 errors usually means you are sending requests, or tokens, faster than your tier allows. This calculator turns your RPM and TPM limits plus your typical request size into concrete, safe settings: a sustainable request rate, a concurrency level, and a minimum delay between calls.

How it works

Providers enforce two ceilings simultaneously — requests per minute and tokens per minute — and your real throughput is bounded by whichever you hit first:

RPM-bound rate = your RPM limit.
TPM-bound rate = TPM limit ÷ tokens per request.

The calculator takes the smaller of the two, applies a safety margin (default 90%) for token-count variance and clock drift, and reports the binding constraint plus a safe concurrency estimate and inter-request delay. All math runs locally in your browser.

Illustrative example

Suppose your plan has limits of 500 RPM and 100,000 TPM, and your typical request uses 400 tokens total (prompt + completion):

RPM ceiling: 500 requests/minute
TPM ceiling: 100,000 ÷ 400 = 250 requests/minute

The TPM limit is the binding constraint — you are TPM-bound. With a 90% safety margin: 250 × 0.90 = 225 sustainable requests/minute.

This translates to a minimum delay of 60 ÷ 225 ≈ 0.27 seconds between requests when running sequentially. For concurrent requests, the safe concurrency count is roughly 225 × (average_latency_seconds ÷ 60). At 1 second average latency that is 225 × (1 ÷ 60) = 3.75, or about 3 to 4 concurrent in-flight requests.
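
The arithmetic in this example can be reproduced with a few lines of Python. This is only a sketch of the formulas described on this page, not the calculator's actual source: the function name plan_rate, its parameter names, and the choice to return a fractional concurrency rather than a rounded one are all assumptions made for illustration.

```python
def plan_rate(rpm_limit, tpm_limit, tokens_per_request,
              margin=0.90, avg_latency_seconds=1.0):
    """Return the binding limit plus a safe rate, delay and concurrency.

    Hypothetical helper: mirrors the formulas in the text, nothing more.
    """
    if min(rpm_limit, tpm_limit, tokens_per_request, avg_latency_seconds) <= 0:
        raise ValueError("limits, request size and latency must be positive")
    if not 0 < margin <= 1:
        raise ValueError("margin must be in (0, 1]")

    rpm_bound = float(rpm_limit)                # RPM-bound rate
    tpm_bound = tpm_limit / tokens_per_request  # TPM-bound rate
    safe_rpm = min(rpm_bound, tpm_bound) * margin

    return {
        "binding": "RPM" if rpm_bound <= tpm_bound else "TPM",
        "safe_rpm": safe_rpm,                        # requests per minute
        "delay_seconds": 60.0 / safe_rpm,            # gap between sequential calls
        "concurrency": safe_rpm * avg_latency_seconds / 60.0,  # fractional
    }


# The example above: 500 RPM, 100,000 TPM, 400 tokens per request.
plan = plan_rate(500, 100_000, 400)
print(plan)
```

With these inputs the sketch reports TPM as the binding limit, a safe rate of about 225 requests/minute, a delay of about 0.27 seconds, and a concurrency of about 3.75 at 1 second latency, matching the figures above.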

Reading the binding constraint

Scenario	Binding limit	Fix
Small prompts, many requests	RPM	Batch multiple items per request
Large prompts or long contexts	TPM	Trim prompts, use retrieval, compress history
Balanced load	Either	Apply safety margin and monitor both

The calculator highlights which ceiling binds so you know where optimisation effort will have the most effect.
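
One consequence of the two formulas is a break-even request size: setting the RPM limit equal to TPM limit ÷ tokens per request gives tokens per request = TPM limit ÷ RPM limit. Requests smaller than that are RPM-bound and larger ones are TPM-bound. The sketch below illustrates this; the function names are hypothetical and are not something the calculator is known to expose.

```python
def crossover_tokens(rpm_limit, tpm_limit):
    """Request size (in tokens) at which the RPM and TPM ceilings coincide."""
    return tpm_limit / rpm_limit


def binding_limit(rpm_limit, tpm_limit, tokens_per_request):
    """Classify a request size as RPM-bound, TPM-bound, or balanced."""
    threshold = crossover_tokens(rpm_limit, tpm_limit)
    if tokens_per_request < threshold:
        return "RPM"
    if tokens_per_request > threshold:
        return "TPM"
    return "either"  # both ceilings give the same rate


# With 500 RPM and 100,000 TPM the break-even size is 200 tokens.
print(crossover_tokens(500, 100_000))
print(binding_limit(500, 100_000, 400))
```

For the limits used in the illustrative example, a 400-token request sits above the 200-token break-even point, which is why TPM binds there.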

Safe concurrency and inter-request delay

Inter-request delay (the minimum sleep between sequential calls) is:

delay_seconds = 60 / safe_rpm

Safe concurrency (for async/parallel callers) is:

concurrency = safe_rpm × avg_latency_minutes

If your requests average 2 seconds and your safe rate is 180 RPM, safe concurrency is 180 × (2/60) = 6 concurrent requests. Run more than this in parallel and you start new requests faster than the safe rate, pushing you over the TPM or RPM cap.
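
For async callers, the two settings can be combined: a semaphore caps in-flight requests at the safe concurrency, and a small pacing delay keeps request starts at least delay_seconds apart. The following is a minimal sketch using Python's asyncio. The names safe_concurrency and run_paced are hypothetical, each job is assumed to be a zero-argument coroutine function that performs one API call, and rounding the concurrency down is a conservative choice made here rather than something the calculator is known to do.

```python
import asyncio
import math
import time


def safe_concurrency(safe_rpm, avg_latency_seconds):
    """concurrency = safe_rpm x latency in minutes, rounded down, at least 1."""
    return max(1, math.floor(safe_rpm * avg_latency_seconds / 60.0))


async def run_paced(jobs, safe_rpm, avg_latency_seconds):
    """Run jobs with capped concurrency and a minimum gap between starts."""
    delay = 60.0 / safe_rpm
    gate = asyncio.Semaphore(safe_concurrency(safe_rpm, avg_latency_seconds))

    async def guarded(job):
        try:
            return await job()
        finally:
            gate.release()  # free a slot once the request finishes

    tasks = []
    next_start = time.monotonic()
    for job in jobs:
        await gate.acquire()                  # wait for a free slot
        wait = next_start - time.monotonic()
        if wait > 0:
            await asyncio.sleep(wait)         # enforce the inter-request delay
        next_start = max(next_start, time.monotonic()) + delay
        tasks.append(asyncio.create_task(guarded(job)))
    return await asyncio.gather(*tasks)


# The example above: 180 safe RPM at 2 seconds average latency.
print(safe_concurrency(180, 2))
```

Either control alone would approximate the safe rate on average; using both guards against the case where latency turns out shorter or longer than the estimate.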

Why the 90% safety margin

Token counts per request vary — a user asking a longer question sends more input tokens than expected. System clocks drift slightly. Other processes sharing the same API key add unpredictable bursts. The 10% headroom absorbs this variance. If your workload is very uniform (batch processing fixed templates), you can safely raise the margin to 95%. If requests vary widely in prompt length, lower it to 80%.

Keep backoff even when under the limit

Your average rate can be within limits while individual bursts still trip the rate limiter. Always implement exponential backoff on 429 responses and honour the Retry-After header if the provider returns one — this is your last line of defence when estimates are off or quotas are shared across processes.
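
A retry loop along these lines might look like the sketch below. It is deliberately transport-agnostic: send is a hypothetical zero-argument callable that performs one request and returns (status, headers, body), with headers assumed to be a plain dict keyed by the exact name "Retry-After". The random jitter on the exponential delay is an addition made here to keep parallel clients from retrying in lockstep, and the defaults (5 attempts, 1 second base, 60 second cap) are illustrative values, not provider recommendations.

```python
import random
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value):
    """Parse Retry-After (seconds or HTTP-date); None if absent or invalid."""
    if not value:
        return None
    try:
        return max(0.0, float(value))
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())


def call_with_backoff(send, max_attempts=5, base_delay=1.0,
                      max_delay=60.0, sleep=time.sleep):
    """Call send() until it stops returning 429 or attempts run out."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == max_attempts - 1:
            break  # no point sleeping after the final attempt
        hinted = retry_after_seconds(headers.get("Retry-After"))
        if hinted is not None:
            delay = hinted  # the provider's hint takes priority
        else:
            # Exponential backoff with jitter: up to 1s, 2s, 4s, ...
            delay = random.uniform(0, base_delay * 2 ** attempt)
        sleep(min(delay, max_delay))
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```

The sleep parameter exists so the waiting behaviour can be replaced in tests or swapped for a different scheduler.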