AI API Rate Limits Explained: OpenAI, Anthropic, and Google

How token-per-minute and request-per-minute limits work


What a rate limit actually is

A rate limit is a cap on how much work you can ask an AI API to do in a given window of time. Providers impose them to protect shared infrastructure, ensure fair access, and prevent a single account from monopolising capacity. The key insight for developers is that AI APIs usually enforce several limits at once, and any one of them can throttle you. The two you will meet most often are requests per minute (RPM) and tokens per minute (TPM), but daily request caps, daily token caps, and concurrent-request limits also exist. Hitting any ceiling returns an HTTP 429 “Too Many Requests” response.

TPM vs RPM: the two limits that matter most

RPM counts how many separate API calls you make in a rolling minute. TPM counts the tokens processed in that minute. Some providers charge input and output tokens against one combined budget, while others, Anthropic among them, track input and output tokens as separate limits. RPM and TPM measure different things and you can hit either independently. A workload of many tiny prompts (say, classifying short strings) tends to exhaust RPM first, because each call is cheap in tokens but still counts as one request. A workload of a few enormous prompts (summarising long documents) tends to exhaust TPM first, because each call burns thousands of tokens. Because output tokens count too, a large max_tokens value can be charged against your token budget before the model has written anything on providers that estimate usage up front, which can throttle you unexpectedly.
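The arithmetic behind "which limit bites first" can be sketched in a few lines. The limits below are made-up numbers chosen for illustration, not any provider's real quotas, and the helper name is hypothetical. The sketch also assumes a single combined token budget.

```python
def binding_limit(rpm_limit: int, tpm_limit: int, tokens_per_request: int) -> tuple[int, str]:
    """Return the sustainable requests per minute and the limit that caps it."""
    # How many requests the token budget alone would allow in one minute.
    allowed_by_tokens = tpm_limit // tokens_per_request
    if allowed_by_tokens < rpm_limit:
        return allowed_by_tokens, "TPM"
    return rpm_limit, "RPM"


# Hypothetical limits, for illustration only.
RPM_LIMIT = 500
TPM_LIMIT = 200_000

# Tiny classification prompts: about 50 tokens each (input plus output).
print(binding_limit(RPM_LIMIT, TPM_LIMIT, 50))     # (500, 'RPM')

# Long-document summaries: about 8,000 tokens each.
print(binding_limit(RPM_LIMIT, TPM_LIMIT, 8_000))  # (25, 'TPM')
```

With the same account limits, the small-prompt workload is capped by request count while the long-document workload is capped at 25 calls a minute by tokens.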

How limits differ across providers and tiers

OpenAI, Anthropic, and Google all use usage tiers: new accounts start with low limits, and those limits rise as your account matures and your cumulative spend grows. OpenAI advances accounts through numbered tiers tied to how much you have paid; Anthropic and Google operate similar graduated systems. Limits are also per model — a flagship model usually has tighter limits than a smaller, faster one — and may differ by region or product (consumer vs enterprise). Two practical consequences follow: a brand-new key will throttle far sooner than you expect, and you should always check the specific model’s limits rather than assuming one number applies everywhere.

Handling 429s gracefully

A 429 is a signal to slow down, not a crash. The standard pattern is exponential backoff with jitter: on a 429, wait a short interval, then retry, doubling the wait on each subsequent failure and adding a small random offset so many clients do not retry in lockstep. Respect the Retry-After header when the provider sends one — it tells you exactly how long to wait. Cap the number of retries so a genuine outage does not loop forever, and never retry instantly, which only deepens the overload. Treating 429s as a normal, expected part of the control flow is what separates a resilient integration from a fragile one.
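A minimal sketch of that retry pattern follows. RateLimitError is a hypothetical stand-in for whatever exception your client library raises on a 429, and call_with_backoff, its parameters, and the 25% jitter range are illustrative choices rather than any provider's recommended values.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error your client raises on HTTP 429."""

    def __init__(self, retry_after=None):
        super().__init__("429 Too Many Requests")
        # Seconds taken from the Retry-After header, or None if it was absent.
        self.retry_after = retry_after


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError as err:
            if attempt == max_retries:
                raise  # retry budget spent: surface the error to the caller
            if err.retry_after is not None:
                # The server said how long to wait, so use that value.
                delay = err.retry_after
            else:
                # Double the wait on each failure, up to max_delay.
                delay = min(max_delay, base_delay * 2 ** attempt)
                # Add up to 25% random jitter so clients do not retry in lockstep.
                delay += random.uniform(0, 0.25 * delay)
            sleep(delay)
```

In use, you would wrap the actual API call in a function or lambda and pass it as fn, translating the client's own 429 exception into RateLimitError (or catching that exception type directly). The sleep parameter exists only so the waits can be observed in tests.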

Strategies to stay within your limits

The best way to handle rate limits is to rarely hit them. Client-side queuing is the most effective tool: track your per-minute token and request budget and release calls at a rate that stays just under the ceiling, smoothing bursts instead of firing everything at once. Read the rate-limit headers the API returns (remaining requests and tokens, reset time) to drive that throttle dynamically. Batch small tasks into fewer larger calls where the API supports it, which eases RPM pressure, and move high-volume work to smaller or cheaper models, which usually carry higher limits. For genuinely large or asynchronous jobs, prefer the provider’s batch endpoints, which offer far higher throughput at lower cost in exchange for delayed results. Combined, these let you scale comfortably without living at the edge of a 429.
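One way to sketch the client-side budget tracking described above is a sliding 60-second window over both requests and tokens. The class and method names here are hypothetical, and the token counts passed in are your own estimates. A production version would also reconcile against the remaining-quota headers the API returns, since providers may count usage differently from a simple sliding window.

```python
import time
from collections import deque


class MinuteBudget:
    """Tracks requests and tokens sent in the last `window` seconds."""

    def __init__(self, rpm_limit, tpm_limit, window=60.0, clock=time.monotonic):
        self.rpm_limit = rpm_limit
        self.tpm_limit = tpm_limit
        self.window = window
        self.clock = clock
        self.events = deque()  # (timestamp, tokens) for each recorded call
        self.tokens_in_window = 0

    def _expire(self, now):
        # Drop calls that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window:
            _, tokens = self.events.popleft()
            self.tokens_in_window -= tokens

    def wait_time(self, tokens):
        """Seconds until a call of `tokens` fits in both budgets (0.0 if now)."""
        if tokens > self.tpm_limit:
            raise ValueError("request is larger than the whole per-minute budget")
        now = self.clock()
        self._expire(now)
        requests, used, wait = len(self.events), self.tokens_in_window, 0.0
        # Walk the oldest calls, pretending each has expired, until the new
        # call would fit under both the request and the token ceiling.
        for timestamp, spent in self.events:
            if requests < self.rpm_limit and used + tokens <= self.tpm_limit:
                break
            requests -= 1
            used -= spent
            wait = timestamp + self.window - now
        return max(0.0, wait)

    def record(self, tokens):
        """Register a call that is being sent now."""
        now = self.clock()
        self._expire(now)
        self.events.append((now, tokens))
        self.tokens_in_window += tokens

    def acquire(self, tokens, sleep=time.sleep):
        """Block until the call fits, then record it."""
        wait = self.wait_time(tokens)
        while wait > 0:
            sleep(wait)
            wait = self.wait_time(tokens)
        self.record(tokens)
```

A worker would call budget.acquire(estimated_tokens) immediately before each API request. Setting rpm_limit and tpm_limit somewhat below the account's real limits leaves headroom for estimation error.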
