A thin, production-safe layer over the model
Most real AI features do not ship a new model; they are a small, well-behaved API in front of someone else’s model. FastAPI is a natural choice for this in Python: it is async-first, validates requests with Pydantic out of the box, generates OpenAPI docs automatically, and runs anywhere a container runs. The goal of this tutorial is a wrapper you would actually ship: it keeps your API key server-side, validates input, streams output, and deploys to a real host in about thirty minutes. Use the generator below to produce the starter code for your chosen provider and streaming preference, then read on to see how each part works.
How it works
The generator builds three things. First, a Pydantic ChatRequest model with a length-bounded prompt field, so malformed or abusive requests are rejected with a clean 422 before any token is spent. Second, an async def POST handler that calls the provider’s async client — either a single completion or, if you enable streaming, a StreamingResponse wrapping an async generator that yields chunks as the model produces them. Third, a /health route so your platform’s probes can tell the service is alive. The accompanying Dockerfile uses a slim Python base and runs the app with uvicorn, binding to the port your host injects. Everything reads the API key from the environment, never from code.
The streaming path is where async earns its keep. A synchronous handler would block a worker for the full duration of a multi-second generation; the async generator instead yields text as it arrives, letting one process serve many concurrent streams while users watch the answer appear.
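The interleaving described above can be seen without any web framework. The sketch below uses only the standard library; fake_model_stream is a hypothetical stand-in for a provider's streaming client, and consume plays the role of the response writer. In the real wrapper, the async generator would be handed to FastAPI's StreamingResponse rather than consumed by hand.

```python
import asyncio


async def fake_model_stream(chunks, delay=0.01):
    """Hypothetical provider stream: yields each chunk after a short pause."""
    for chunk in chunks:
        # Awaiting hands control back to the event loop, so other
        # streams can make progress while this one waits.
        await asyncio.sleep(delay)
        yield chunk


async def consume(stream_id, events):
    # Forward each chunk as soon as it arrives, tagging it with its stream.
    async for chunk in fake_model_stream(["Hel", "lo", "!"]):
        events.append((stream_id, chunk))


async def main():
    events = []
    # Three "clients" streaming concurrently inside a single process.
    await asyncio.gather(*(consume(i, events) for i in range(3)))
    return events


events = asyncio.run(main())
print(events[:3])
```

Every stream delivers its first chunk before any stream delivers its second, which is the behavior that lets one process keep many users' answers moving at once. A blocking call such as time.sleep in place of asyncio.sleep would serialize the streams instead.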
Tips for a real deployment
Treat the wrapper as a boundary, not a passthrough. Add authentication so only your clients can call it, and rate limiting so one user cannot drain your budget. Log every request with a request ID, the model used, and token counts, because you cannot control a cost you do not measure. Set sensible timeouts, and expect the provider to return errors and rate-limit responses, because it will; handle them explicitly and retry with backoff only where a retry is safe. Keep max_tokens and the prompt length bounded so a single call has a known worst-case cost. Finally, pin your dependencies and run the slim Docker image both locally and in CI so the artifact you test is the artifact you ship. With those guardrails in place, about a hundred lines of FastAPI are genuinely enough to run an LLM feature in production.
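One way to combine the timeout and retry advice is sketched below using only asyncio. The RetryableError class and the call_with_retry helper are hypothetical names introduced for this example; in practice you would map your provider client's rate-limit and transient-server exceptions onto the retryable case. Timeouts are deliberately not retried here, on the assumption that a timed-out generation may already have consumed tokens.

```python
import asyncio
import random


class RetryableError(Exception):
    """Hypothetical marker for failures that are safe to retry (e.g. a rate limit)."""


async def call_with_retry(make_call, *, attempts=3, base_delay=0.5, timeout=30.0):
    """Await make_call() with a per-attempt timeout, retrying only RetryableError."""
    for attempt in range(attempts):
        try:
            # A timeout raises asyncio.TimeoutError, which is not caught below
            # and therefore propagates to the caller without a retry.
            return await asyncio.wait_for(make_call(), timeout=timeout)
        except RetryableError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff (base_delay, 2x, 4x, ...) plus random jitter
            # so that many clients do not retry in lockstep.
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
```

A handler would use it as text = await call_with_retry(lambda: call_provider(prompt)), where call_provider is whatever coroutine function wraps your provider client. Capping attempts keeps the worst-case cost of a single request bounded.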