What you are building
This tutorial builds a production-shaped AI backend in Python. You create a Flask app with a POST endpoint that takes a prompt, calls OpenAI, and returns the completion. You then add a streaming response so clients see tokens as they arrive, rate limiting so no single caller can drain your budget or quota, environment-based config for the API key, and a deploy to Railway or Render straight from Git. By the end you have a small, secure, deployable AI API rather than a notebook demo.
How it works
A Flask route reads the prompt from the request JSON and calls the OpenAI client,
which is constructed with a key read from an environment variable via
os.environ. For streaming, you pass stream=True to the OpenAI call and return a
Flask Response that wraps a generator yielding each chunk; setting the response
mimetype to text/event-stream lets the client render output incrementally. A rate
limiter, keyed by API key or IP, rejects callers who exceed a per-minute budget
with a 429, protecting your spend and your provider quota. In production you run
the app under gunicorn with several workers, because Flask's built-in development
server is not designed for production load or security, and each streaming request
holds a worker (or worker thread) for as long as the response stays open. Railway
or Render builds from your repo, injects the key as a secret, and runs your
gunicorn start command.
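The framework-independent pieces of this flow can be sketched with the standard library alone. This is a minimal illustration, not the tutorial's final code: the variable name OPENAI_API_KEY, the helper names load_api_key, sse_frames, and SlidingWindowLimiter, the JSON payload shape, and the [DONE] sentinel are all assumptions made for the example. In a Flask route you would pass the generator to Response(..., mimetype="text/event-stream") and return a 429 whenever allow() is False.

```python
import json
import os
import time
from collections import defaultdict, deque
from typing import Iterable, Iterator


def load_api_key(env=os.environ) -> str:
    """Read the provider key from the environment and fail fast if it is missing."""
    key = env.get("OPENAI_API_KEY", "")  # variable name is an assumption
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return key


def sse_frames(chunks: Iterable[str]) -> Iterator[str]:
    """Wrap each text chunk in a Server-Sent Events frame.

    The text is JSON-encoded so that newlines inside a chunk cannot
    break the "data: ...\n\n" framing.
    """
    for text in chunks:
        if text:  # skip empty deltas
            yield f"data: {json.dumps({'text': text})}\n\n"
    yield "data: [DONE]\n\n"  # end-of-stream sentinel (a convention assumed here)


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each caller key."""

    def __init__(self, limit: int, window: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.hits = defaultdict(deque)  # caller key -> timestamps of recent requests

    def allow(self, caller: str) -> bool:
        now = self.clock()
        recent = self.hits[caller]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False  # the route should answer with HTTP 429
        recent.append(now)
        return True
```

Note that this limiter keeps its counters in process memory, so under gunicorn with several workers each worker enforces the limit separately. A shared store such as Redis is the usual way to get one limit across workers.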
Tips and the planner below
Never hardcode the key: load it from the environment and let the platform’s secret manager supply it in production. Always deploy behind gunicorn, never the dev server, so that concurrent streaming requests are served by a production-grade server with enough workers. Rate limit by key, not just by IP, so that users behind a shared office IP are not throttled together and each customer key gets its own enforceable limit. Return a clean 429, ideally with a Retry-After header, so clients can back off. The planner below estimates your daily request ceiling and monthly cost from your per-key rate limit, average tokens per request, and model price, so you can size your rate limit and budget together rather than discovering a mismatch in a billing alert.
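The kind of arithmetic behind such an estimate can be sketched as follows. This is an illustration under stated assumptions, not the planner's actual formula: it assumes every key runs at its full rate limit around the clock (a worst-case ceiling), a single blended price per million tokens, and a 30-day month. The names plan_budget, Plan, and active_keys are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    daily_request_ceiling: int   # most requests that can be accepted per day
    monthly_cost_ceiling: float  # worst-case spend, in the price's currency


def plan_budget(requests_per_minute: int,
                tokens_per_request: int,
                price_per_million_tokens: float,
                active_keys: int = 1,
                days_per_month: int = 30) -> Plan:
    """Estimate worst-case volume and cost if every key saturates its limit."""
    # Requests per day: per-minute limit x 60 minutes x 24 hours x number of keys.
    daily_requests = requests_per_minute * 60 * 24 * active_keys
    # Daily cost: total tokens per day, priced per million tokens.
    daily_cost = daily_requests * tokens_per_request * price_per_million_tokens / 1_000_000
    return Plan(daily_requests, daily_cost * days_per_month)


# Hypothetical inputs: 10 requests/minute, 500 tokens/request, 1.00 per million tokens.
plan = plan_budget(requests_per_minute=10, tokens_per_request=500,
                   price_per_million_tokens=1.00)
print(plan.daily_request_ceiling, round(plan.monthly_cost_ceiling, 2))
```

With these made-up inputs, one key allows 10 × 60 × 24 = 14,400 requests per day, which is 7.2 million tokens per day and a ceiling of 216 per month at that price. Real traffic rarely saturates the limit, so treat the result as an upper bound rather than a forecast.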