Tokens per minute (TPM) is the throughput cap most LLM providers enforce. If the tokens you send within a rolling minute exceed it, further requests are rejected with HTTP 429 rate-limit errors and must be retried with backoff.

How is drain time calculated?

The planner compares the burst's token demand against your TPM capacity. If demand exceeds capacity, a backlog builds up during the burst and then drains at your spare per-second capacity (TPM / 60, less any other traffic still arriving). The tool reports how many seconds that takes.

Why does retrying cost more?

Failed requests that get retried still consume some overhead, and aggressive retrying can mean re-sending the same prompt tokens several times. The retry cost multiplier lets you model that extra spend against a clean queued approach.
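One way to reason about the multiplier is sketched below. This is not the tool's actual formula, which the page does not spell out. It assumes the multiplier applies only to the share of the burst that overflows capacity, and the function name `retry_overhead` and its parameters are hypothetical.

```python
def retry_overhead(clean_cost: float, overflow_fraction: float,
                   retry_multiplier: float) -> float:
    """Extra spend of retry-on-429 versus a clean queue (assumed model).

    clean_cost        -- cost of the burst if every request succeeds once
    overflow_fraction -- share of the burst that exceeds capacity (0..1)
    retry_multiplier  -- assumed cost factor for retried requests (>= 1)
    """
    if not 0.0 <= overflow_fraction <= 1.0:
        raise ValueError("overflow_fraction must be between 0 and 1")
    if retry_multiplier < 1.0:
        raise ValueError("retry_multiplier must be at least 1")
    # Only the overflowing share pays the multiplier; the rest costs the same
    # as it would in a clean queue.
    return clean_cost * overflow_fraction * (retry_multiplier - 1.0)
```

Under this model a multiplier of 1.0 means retries cost nothing extra, and the overhead grows linearly with both the overflow share and the multiplier.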

Does this replace a real load test?

No. It is a fast planning estimate to size queues and limits before you build. Always validate against your provider's actual behaviour and your real token distribution.

What is the Rate Limit Burst Capacity Planner?

Enter a traffic burst pattern and your tokens-per-minute (TPM) limit, and this planner calculates queue depth, drain time and the retry cost needed to absorb spikes without hitting rate-limit errors. Built for engineers sizing LLM throughput. It runs free in your browser on Gera Tools, with nothing uploaded.

Rate Limit Burst Capacity Planner

Name: Rate Limit Burst Capacity Planner
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Rate limit burst capacity planner

Traffic rarely arrives evenly. A scheduled job, a viral moment, or a batch import can dump thousands of requests into a few seconds — and if that exceeds your provider’s tokens-per-minute (TPM) limit, you get a wall of 429 errors. This planner models the burst, tells you whether it fits, and sizes the queue depth and drain time you need to absorb it gracefully.

How it works

The planner converts your burst into a token demand and compares it to your TPM capacity expressed per second:

burst_tokens   = requests × avg_tokens_per_request
capacity/sec   = TPM / 60
demand/sec     = burst_tokens / burst_duration

If demand per second exceeds capacity per second, a backlog forms. The peak queue depth is the overflow accumulated over the burst, (demand/sec − capacity/sec) × burst_duration, and it drains at the rate of spare capacity once the burst ends. The tool also applies your retry cost multiplier to estimate the extra spend a naive retry-on-429 strategy would incur versus smooth queuing.
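The three formulas above, plus the backlog and drain step, can be sketched in a few lines of Python. This is an illustrative model rather than the tool's source code: the names `BurstPlan` and `plan_burst` are made up here, and the sketch assumes no other traffic arrives after the burst, so the full capacity is available for draining.

```python
from dataclasses import dataclass


@dataclass
class BurstPlan:
    burst_tokens: float       # total tokens the burst asks for
    capacity_per_sec: float   # TPM limit expressed per second
    demand_per_sec: float     # tokens arriving per second during the burst
    peak_queue_tokens: float  # backlog at the moment the burst ends
    drain_seconds: float      # time to clear that backlog after the burst


def plan_burst(requests: int, avg_tokens_per_request: float,
               burst_duration: float, tpm: float) -> BurstPlan:
    """Estimate backlog and drain time for a single burst."""
    if burst_duration <= 0 or tpm <= 0:
        raise ValueError("burst_duration and tpm must be positive")
    burst_tokens = requests * avg_tokens_per_request
    capacity_per_sec = tpm / 60
    demand_per_sec = burst_tokens / burst_duration
    # A backlog accumulates only while demand outpaces capacity.
    overflow_per_sec = max(0.0, demand_per_sec - capacity_per_sec)
    peak_queue_tokens = overflow_per_sec * burst_duration
    # Assumption: nothing else arrives after the burst, so all capacity drains.
    drain_seconds = peak_queue_tokens / capacity_per_sec
    return BurstPlan(burst_tokens, capacity_per_sec, demand_per_sec,
                     peak_queue_tokens, drain_seconds)
```

For 200 requests of 800 tokens over 10 seconds at 60,000 TPM, `plan_burst(200, 800, 10, 60_000)` gives a peak backlog of 150,000 tokens and a drain time of 150 seconds after the burst ends, which is about 160 seconds counted from the first request.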

Worked example

Say you run a nightly summarization job that fires 200 requests in 10 seconds, each using an average of 800 tokens (600 prompt + 200 completion). That is a burst demand of 200 × 800 = 160,000 tokens in 10 seconds, or 16,000 tokens/sec. If your tier allows 60,000 TPM, you have 1,000 tokens/sec of capacity. The burst exceeds capacity by 15,000 tokens/sec, so a backlog forms almost immediately. Only 10,000 tokens are served during the burst itself, leaving a peak backlog of 150,000 tokens that takes 150,000 / 1,000 = 150 seconds to drain once the burst ends, or roughly 160,000 / 1,000 = 160 seconds from the first request to the last. Under a naive retry strategy, every request that had to be retried also adds cost overhead via the retry multiplier.

The fix here is not more retries — it is smoothing the job so it spreads 200 requests over 160+ seconds, or upgrading to a higher TPM tier, or splitting across two API keys.
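The smoothing option comes down to simple arithmetic, sketched here under the assumption that every request uses the average token count. The function names are illustrative and not part of the tool.

```python
def min_spread_seconds(requests: int, avg_tokens_per_request: float,
                       tpm: float) -> float:
    """Shortest window over which the job fits under the TPM limit."""
    capacity_per_sec = tpm / 60
    return requests * avg_tokens_per_request / capacity_per_sec


def request_interval_seconds(avg_tokens_per_request: float, tpm: float) -> float:
    """Gap between evenly paced requests that keeps token usage at the limit."""
    capacity_per_sec = tpm / 60
    return avg_tokens_per_request / capacity_per_sec
```

For the job above, `min_spread_seconds(200, 800, 60_000)` returns 160.0, which means roughly one request every 0.8 seconds. In practice you would pace a little slower than that to leave headroom for requests larger than the average.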

When to use this planner

Use it before writing queuing code, not after hitting 429 walls in production. It is especially useful when:

Moving from a small pilot to batch production workloads
Deciding whether to buy a higher tier or just add a queue
Comparing the cost of a smooth queue versus naive retry-on-429
Sizing a BullMQ, Celery, or similar worker concurrency setting

Tips and notes

Size your queue for the peak burst, not the average — averages hide the spikes that actually trip the limit.
Prefer a token-bucket queue with jittered exponential backoff over blind retries; it keeps you just under the limit and avoids retry storms.
If drain time is unacceptably long, the real fix is a higher TPM tier or splitting traffic across multiple keys or providers, not more retries.
Many LLM APIs count both prompt and completion tokens against TPM. Factor in the completion length when estimating average tokens per request, especially for verbose tasks like translation or summarization.
Some providers offer separate RPM (requests per minute) limits alongside TPM; whichever cap is hit first determines whether you queue on token volume or request count.
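As a rough illustration of the token-bucket tip, here is a minimal sketch of a bucket refilled at TPM / 60 tokens per second, together with a full-jitter exponential backoff delay. The class and function names are hypothetical, the bucket size (`capacity`) is a tuning choice you would set yourself, and real provider limits may be enforced differently from this model.

```python
import random
import time


class TokenBucket:
    """Token bucket refilled continuously at tpm / 60 tokens per second."""

    def __init__(self, tpm: float, capacity: float, clock=time.monotonic):
        self.rate = tpm / 60      # refill rate in tokens per second
        self.capacity = capacity  # largest burst the bucket lets through
        self.tokens = capacity    # start full
        self.clock = clock        # injectable so the bucket can be tested
        self.updated = clock()

    def _refill(self) -> None:
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def try_acquire(self, cost: float) -> bool:
        """Spend `cost` tokens if available; otherwise leave the bucket as is."""
        self._refill()
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False

    def seconds_until(self, cost: float) -> float:
        """How long to wait before `cost` tokens will be available."""
        self._refill()
        return max(0.0, (cost - self.tokens) / self.rate)


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))
```

A worker would call `try_acquire` with the estimated tokens for a request, sleep for `seconds_until` when the bucket is short, and fall back to `backoff_delay` only if the provider still returns a 429. A request whose cost exceeds `capacity` can never be acquired, so size the bucket at least as large as your biggest request.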