How do I stream tokens from Flask?

Set stream=True on the OpenAI call and return a Flask Response wrapping a generator that yields each chunk as it arrives, with mimetype text/event-stream for Server-Sent Events. The client reads the stream and renders tokens incrementally. Because Flask's dev server is single-threaded, use gunicorn with multiple workers in production so streaming requests do not block each other.

Why do I need rate limiting on an AI endpoint?

Every request spends real money and quota, so an unthrottled endpoint is an open door to a surprise bill or a denial-of-wallet attack. Limit by API key or IP — for example a fixed number of requests per minute — and return 429 when exceeded. A library like Flask-Limiter wires this in with a decorator. Rate limiting protects both your budget and your provider quota from a single abusive caller.

Where does the API key live?

In an environment variable, loaded from a .env file in development and from the platform's secret manager in production. Never commit it to Git or hardcode it in source. The Flask process reads it via os.environ at startup, and Railway or Render injects it as an environment variable so the deployed app reads the same way without code changes.

Should I use gunicorn or the Flask dev server in production?

Always gunicorn (or another WSGI server) in production. The built-in Flask server is single-threaded and meant only for development, so it cannot handle concurrent streaming requests or real traffic. Run gunicorn with a few workers, point your platform's start command at it, and keep the dev server for local debugging only.

How much traffic can my plan handle?

That depends on your rate limit, your provider's tokens-per-minute quota, and your budget. A tight rate limit protects you but caps throughput; a loose one risks cost spikes. The planner below estimates your daily request ceiling and monthly cost from your per-key rate limit, average tokens per request, and model price so you can size both together.

How to Build an AI App with Python and Flask

What you are building

This tutorial builds a production-shaped AI backend in Python. You create a Flask app with a POST endpoint that takes a prompt, calls OpenAI, and returns the completion. You add a streaming response so clients see tokens as they arrive, rate limiting so one caller cannot drain your budget or quota, clean environment-based config for the API key, and a deploy to Railway or Render straight from Git. By the end you have a small, secure, deployable AI API — not a notebook demo.

How it works

A Flask route reads the prompt from the request JSON and calls the OpenAI client, which it constructs with a key pulled from an environment variable via os.environ. For streaming, you set stream=True and return a Flask Response that wraps a generator yielding each chunk with the text/event-stream mimetype, so the client renders output incrementally. A rate limiter — keyed by API key or IP — rejects callers who exceed a per-minute budget with a 429, protecting your spend and your provider quota. In production you run the app under gunicorn with several workers, because the built-in dev server is single-threaded and cannot serve concurrent streaming requests. Railway or Render builds from your repo, injects the key as a secret, and runs your gunicorn start command.

Tips and the planner below

Never hardcode the key — load it from the environment and let the platform’s secret manager supply it in production. Always deploy behind gunicorn, never the dev server, so concurrent streaming requests do not block. Rate limit by key, not just IP, so a shared-IP office and a per-customer key each get fair, enforceable limits. Return 429 cleanly so clients can back off. The planner below estimates your daily request ceiling and monthly cost from your per-key rate limit, average tokens per request, and model price, helping you size your rate limit and budget together rather than discovering the mismatch in a billing alert.