Rate limit and throughput planner
Every LLM API enforces two ceilings at once: tokens per minute (TPM) and requests per minute (RPM). Hit either and you get 429s, retries, and a stalled queue. This planner finds which limit binds first for your workload and tells you the maximum sustainable throughput and how many requests you can keep in flight at once.
How the math works
Each request consumes prompt + completion tokens, so your TPM budget allows
TPM / tokens-per-request calls per minute. RPM caps requests directly. The smaller of
the two is the binding limit. To translate a safe arrival rate into worker concurrency,
the tool applies Little’s Law: the number of in-flight requests equals the arrival
rate times the average time each request spends in the system. The tool takes your P95
latency as that time, which gives a higher in-flight figure than the mean would. A
safety margin is subtracted so retries and token-count variance do not push you over the
hard ceiling.
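The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the planner's actual code: the names plan_throughput and Plan, the 20% default margin, applying the margin to the request rate, and rounding concurrency down are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Plan:
    binding_limit: str   # "TPM" or "RPM", whichever allows fewer requests
    safe_rpm: float      # sustainable requests per minute after the margin
    max_in_flight: int   # concurrency derived from Little's Law


def plan_throughput(tpm_limit, rpm_limit, tokens_per_request,
                    latency_s, safety_margin=0.2):
    """Hypothetical planner: all parameter names are illustrative."""
    # Requests per minute that the token budget alone would allow.
    tpm_bound_rpm = tpm_limit / tokens_per_request

    # The smaller of the two ceilings is the binding limit.
    if tpm_bound_rpm < rpm_limit:
        binding, ceiling = "TPM", tpm_bound_rpm
    else:
        binding, ceiling = "RPM", float(rpm_limit)

    # Assumption: the margin is taken off the request rate.
    safe_rpm = ceiling * (1.0 - safety_margin)

    # Little's Law: in-flight = arrival rate (per second) * time in system.
    in_flight = math.floor(safe_rpm / 60.0 * latency_s)

    # Assumption: round down, but always allow at least one request.
    return Plan(binding, safe_rpm, max(1, in_flight))
```

For example, with a 90,000 TPM limit, a 500 RPM limit, 1,500 tokens per request, and 4 seconds of latency, the token budget allows 60 requests per minute, so TPM binds. A 20% margin leaves 48 requests per minute, or 0.8 per second, which over 4 seconds works out to 3 requests in flight.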
Tips and notes
- Set a tight max_tokens: some providers reserve the full max against your TPM up front, so an oversized cap silently shrinks your throughput.
- Add jittered exponential backoff on 429s; a queue that retries instantly turns one rate-limit hit into a self-inflicted storm.
- Re-plan after every tier promotion or limit raise — concurrency that was safe at one TPM ceiling will throttle or waste capacity at another.
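The backoff tip can be sketched as follows. This is one common variant ("full jitter", where each delay is drawn uniformly between zero and an exponentially growing cap), not a prescribed implementation. RateLimitError is a hypothetical stand-in for whatever exception your client raises on a 429, and send, base, cap, and max_retries are illustrative names and defaults.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a client's HTTP 429 error."""


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_backoff(send, max_retries=5, base=1.0, cap=60.0,
                      sleep=time.sleep):
    """Call send(); on RateLimitError wait a jittered delay and retry.

    Re-raises the error once max_retries retries have been used up.
    The sleep parameter is injectable so the loop can be tested
    without actually waiting.
    """
    for attempt in range(max_retries + 1):
        try:
            return send()
        except RateLimitError:
            if attempt == max_retries:
                raise
            sleep(backoff_delay(attempt, base, cap))
```

The random delay spreads out workers that were throttled at the same moment, so they do not all retry on the same tick.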