Two ways to run a model
Once you have an AI model behind an API, you must decide how you call it. There are two fundamentally different modes. Real-time (synchronous) inference sends a single request and blocks until the model replies, usually within a second or two. Batch (asynchronous) inference bundles many requests into one job, submits it, and lets the provider process them over a longer window — often up to 24 hours — before you collect the results. The model itself is the same; what changes is the scheduling, the latency you accept, and the price you pay.
Real-time inference: optimised for latency
Real-time inference is what most people picture when they think of an AI app. A user types a message, your server makes one API call, and the answer streams back token by token. The whole point is responsiveness, so providers dedicate live capacity to your request and you pay a premium for it. Use this mode whenever something is waiting on the answer — a chatbot, code autocomplete, live search, a voice assistant, or an agent that must act on the result before continuing. The cost is per-request and the latency is low, but you cannot lean on it for huge volumes without your bill and your rate limits becoming a problem.
Batch inference: optimised for cost and throughput
Batch inference flips the priorities. You assemble thousands or millions of requests into a single file, submit it, and the provider runs them on spare capacity whenever it is available. Because you have relaxed the latency requirement, both OpenAI’s Batch API and Anthropic’s Message Batches charge roughly half the price of their synchronous endpoints, and they grant far higher throughput limits. The catch is turnaround: results may take minutes or the full multi-hour window. This is perfect for work no human is watching — embedding a document corpus for retrieval, summarising every record in a database, generating product descriptions overnight, classifying a backlog, or running an evaluation suite against a test set.
How to choose
Ask one question: is anything waiting on this answer? If yes — a person, a live UI, or a synchronous downstream step — use real-time. If no, and you can tolerate a delayed result, use batch and pocket the discount. Volume reinforces the decision: a handful of live requests is cheap either way, but a million records is dramatically cheaper and more reliable in batch, where you also dodge the rate-limit dance of firing requests one at a time.
Combine them in practice
Real systems rarely pick just one. The standard pattern is a real-time path for the interactive experience and a batch path for bulk back-office work. A support product answers tickets live but re-tags its entire history overnight in batch. A search product serves queries synchronously but embeds new documents in nightly batch jobs. By routing each workload to the mode that fits its latency tolerance, you keep the user experience fast while keeping the infrastructure bill low — the best of both modes without compromise.