Rate Limit & Throughput Planner

Plan concurrency to stay within TPM and RPM limits

Most LLM APIs enforce at least two ceilings at once: tokens per minute (TPM) and requests per minute (RPM). Exceed either one and you get HTTP 429 responses, retries, and a stalled queue. This planner identifies which limit binds first for your workload, then reports the maximum sustainable throughput and the number of requests you can keep in flight at once.

How the math works

Each request consumes prompt plus completion tokens, so your TPM budget allows TPM / tokens-per-request calls per minute. RPM caps requests directly. The smaller of the two rates is the binding limit. To translate a safe arrival rate into worker concurrency, the tool applies Little’s Law: the average number of in-flight requests equals the arrival rate times the average time each request spends in the system, with both expressed in the same time unit. The tool uses your P95 latency in place of the mean, so the resulting concurrency is an upper estimate that covers slow requests as well as typical ones. A safety margin is subtracted so that retries and token-count variance do not push you over the hard ceiling.
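
The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not the planner's actual code: the function name `plan_throughput`, the `Plan` fields, and the choice to apply the safety margin as a fraction of the binding rate are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Plan:
    binding_limit: str   # "TPM" or "RPM", whichever caps throughput first
    safe_rpm: float      # sustainable requests per minute after the margin
    max_in_flight: int   # concurrency estimate from Little's Law


def plan_throughput(tpm_limit, rpm_limit, tokens_per_request,
                    latency_s, safety_margin=0.2):
    """Hypothetical helper: find the binding limit and a concurrency estimate.

    tokens_per_request is prompt + completion tokens; latency_s is the
    per-request latency in seconds (the page uses P95). safety_margin is
    assumed here to be a fraction of the binding rate held in reserve.
    """
    # Requests per minute the token budget alone would allow.
    tpm_bound_rpm = tpm_limit / tokens_per_request

    # The smaller of the two rates is the binding limit.
    binding = "TPM" if tpm_bound_rpm < rpm_limit else "RPM"
    raw_rpm = min(tpm_bound_rpm, rpm_limit)

    # Hold back the safety margin before sizing concurrency.
    safe_rpm = raw_rpm * (1 - safety_margin)

    # Little's Law: in-flight = arrival rate (per second) * time in system.
    in_flight = math.floor(safe_rpm / 60 * latency_s)
    return Plan(binding, safe_rpm, max(1, in_flight))


plan = plan_throughput(tpm_limit=90_000, rpm_limit=500,
                       tokens_per_request=1_500, latency_s=4.0)
print(plan)
```

With these example inputs the token budget allows 60 requests per minute, well under the 500 RPM cap, so TPM binds. After a 20% margin the safe rate is 48 requests per minute, and at 4 seconds per request that works out to about 3 requests in flight.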

Tips and notes

  • Set a tight max_tokens: many providers count the full max_tokens value against your TPM when the request is admitted, so an oversized cap silently shrinks your throughput.
  • Add jittered exponential backoff on 429s; a queue that retries instantly turns one rate-limit hit into a self-inflicted storm.
  • Re-plan after every tier promotion or limit raise — concurrency that was safe at one TPM ceiling will throttle or waste capacity at another.
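
The backoff tip can be sketched as follows. This is one common variant ("full jitter", where each delay is drawn uniformly between zero and an exponentially growing cap), offered as an illustration and not as the method any particular provider recommends. `RateLimitError`, `backoff_delay`, and `call_with_backoff` are hypothetical names; substitute whatever exception your client library raises on a 429.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error your client raises on HTTP 429."""


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_backoff(send, max_attempts=5, sleep=time.sleep):
    """Call send(); on RateLimitError wait a jittered delay and retry.

    Re-raises the error once max_attempts calls have all been rate limited.
    The sleep parameter is injectable so the helper can be tested without
    actually waiting.
    """
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))
```

Because each worker draws its own random delay, workers that were throttled at the same moment spread their retries out instead of hitting the API again in lockstep.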