What a rate limit actually is
A rate limit is a cap on how much work you can ask an AI API to do in a given window of time. Providers impose them to protect shared infrastructure, ensure fair access, and prevent a single account from monopolising capacity. The key insight for developers is that AI APIs usually enforce several limits at once, and any one of them can throttle you. The two you will meet most often are requests per minute (RPM) and tokens per minute (TPM), but daily request caps, daily token caps, and concurrent-request limits also exist. Hitting any ceiling returns an HTTP 429 “Too Many Requests” response.
TPM vs RPM: the two limits that matter most
RPM counts how many separate API calls you make in a rolling minute. TPM
counts the tokens processed in that minute; some providers count input and
output together, while others track them as separate limits. These measure
different things and you can hit either independently. A workload of many tiny
prompts (say, classifying short strings) tends to exhaust RPM first, because
each call is cheap in tokens but still counts as one request. A workload of a
few enormous prompts (summarising long documents) tends to exhaust TPM first,
because each call burns thousands of tokens. Output tokens count too, and some
providers reserve the full max_tokens value against your TPM budget when the
request is admitted, before the model has written anything, so an oversized
max_tokens can throttle you unexpectedly.
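To make the RPM/TPM distinction concrete, here is a minimal sketch that works out which limit binds first for a given request size. The function name and the limit values (500 RPM, 200,000 TPM) are hypothetical, chosen only for illustration; substitute the numbers your provider publishes for your model and tier.

```python
def binding_limit(rpm_limit, tpm_limit, tokens_per_request):
    """Return (sustainable requests per minute, name of the limit that binds).

    tokens_per_request should cover input tokens plus the output you expect
    (or the max_tokens you reserve, if your provider counts it up front).
    """
    requests_allowed_by_tokens = tpm_limit // tokens_per_request
    if requests_allowed_by_tokens < rpm_limit:
        return requests_allowed_by_tokens, "TPM"
    return rpm_limit, "RPM"


# Hypothetical limits, not any provider's real numbers.
RPM, TPM = 500, 200_000

# Many tiny prompts: tokens would allow 4,000 calls, so RPM caps you at 500.
print(binding_limit(RPM, TPM, tokens_per_request=50))

# A few huge prompts: tokens allow only 25 calls, so TPM binds long before RPM.
print(binding_limit(RPM, TPM, tokens_per_request=8_000))
```

Running the same arithmetic on your own average request size is a quick way to see which ceiling to watch.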
How limits differ across providers and tiers
OpenAI, Anthropic, and Google all use usage tiers: new accounts start with low limits, and those limits rise as your account matures and your cumulative spend grows. OpenAI advances accounts through numbered tiers tied to how much you have paid; Anthropic and Google operate similar graduated systems. Limits are also set per model (a flagship model usually has tighter limits than a smaller, faster one) and may differ by region or product (consumer vs enterprise). They are generally enforced at the account or organisation level rather than per key, so creating extra keys does not buy extra capacity. Two practical consequences follow: a brand-new account will throttle far sooner than you expect, and you should always check the specific model’s limits rather than assuming one number applies everywhere.
Handling 429s gracefully
A 429 is a signal to slow down, not a crash. The standard pattern is exponential
backoff with jitter: on a 429, wait a short interval, then retry, doubling the
wait on each subsequent failure and adding a small random offset so many clients
do not retry in lockstep. Respect the Retry-After header when the provider sends
one: it states how long to wait, usually in seconds, and should take precedence
over your own computed delay. Cap the number of retries so a genuine
outage does not loop forever, and never retry instantly, which only deepens the
overload. Treating 429s as a normal, expected part of the control flow is what
separates a resilient integration from a fragile one.
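The retry pattern described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any SDK's built-in behaviour: RateLimitError is a hypothetical stand-in for whatever exception your client raises on a 429, and the base delay, cap, and 10% jitter are arbitrary starting values.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error your client raises on HTTP 429."""

    def __init__(self, retry_after=None):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after  # seconds from Retry-After, if sent


def backoff_delay(attempt, base=1.0, cap=60.0, retry_after=None):
    """Seconds to wait before the next try; attempt is 0 for the first retry."""
    if retry_after is not None:
        return float(retry_after)           # the server's hint wins
    delay = min(cap, base * (2 ** attempt))  # 1s, 2s, 4s, ... up to cap
    return delay + random.uniform(0, delay * 0.1)  # small random offset


def call_with_retries(fn, max_retries=5, sleep=time.sleep):
    """Call fn(), retrying on RateLimitError at most max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError as err:
            if attempt == max_retries:
                raise  # give up so a real outage does not loop forever
            sleep(backoff_delay(attempt, retry_after=err.retry_after))
```

In use, you would wrap the API call in a function that translates your client's 429 error into RateLimitError and pass it to call_with_retries. The sleep parameter exists only so the delay can be replaced in tests.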
Strategies to stay within your limits
The best way to handle rate limits is to rarely hit them. Client-side queuing is the most effective tool: track your per-minute token and request budget and release calls at a rate that stays just under the ceiling, smoothing bursts instead of firing everything at once. Read the rate-limit headers the API returns (remaining requests and tokens, reset time) to drive that throttle dynamically. Batch small tasks into fewer larger calls where the API supports it, which eases RPM pressure, and route high-volume work to smaller or cheaper models, which usually carry higher limits. For genuinely large or asynchronous jobs, prefer the provider’s batch endpoints, which typically offer higher throughput at lower cost in exchange for delayed results. Combined, these let you scale comfortably without living at the edge of a 429.
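A client-side throttle of the kind described above might look like the following sketch. It models each per-minute budget as a token bucket that refills continuously, which is one common way to approximate a provider's limits, not a description of how any particular provider enforces them. The class names are hypothetical, and the token estimate passed to acquire is assumed to come from your own counting.

```python
import time


class MinuteBudget:
    """One per-minute budget (requests or tokens) as a refilling bucket."""

    def __init__(self, per_minute, clock=time.monotonic):
        self.capacity = float(per_minute)
        self.available = float(per_minute)
        self.rate = per_minute / 60.0  # units refilled per second
        self.clock = clock
        self.last = clock()

    def wait_needed(self, amount):
        """Seconds until `amount` units are available (0.0 if they are now)."""
        if amount > self.capacity:
            raise ValueError("request is larger than the whole per-minute budget")
        now = self.clock()
        refill = (now - self.last) * self.rate
        self.available = min(self.capacity, self.available + refill)
        self.last = now
        return max(0.0, (amount - self.available) / self.rate)

    def spend(self, amount):
        self.available -= amount


class Throttle:
    """Blocks until a call fits under both the RPM and the TPM budget."""

    def __init__(self, rpm, tpm, clock=time.monotonic, sleep=time.sleep):
        self.requests = MinuteBudget(rpm, clock)
        self.tokens = MinuteBudget(tpm, clock)
        self.sleep = sleep

    def acquire(self, estimated_tokens):
        while True:
            wait = max(self.requests.wait_needed(1),
                       self.tokens.wait_needed(estimated_tokens))
            if wait <= 0:
                break
            self.sleep(wait)  # smooth the burst instead of risking a 429
        self.requests.spend(1)
        self.tokens.spend(estimated_tokens)
```

Call acquire with your token estimate immediately before each API request. Setting the limits slightly below the published ceiling leaves headroom for estimation error, and if the API returns remaining-request and remaining-token headers, those values could be used to correct the available fields after each response.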