How to Deploy an AI App to Production

From localhost to live — Vercel, Railway, and secrets

Ad placeholder (leaderboard)

From localhost to live

Shipping an AI app is mostly the same as shipping any web app, with three extra concerns: secrets you cannot afford to leak, costs that scale with usage, and external model calls that can be slow or fail. This guide covers how to handle each so your launch is reliable, affordable, and secure. The patterns apply whether you deploy a frontend on Vercel and a backend on Railway, or run everything on one platform.

Secrets and environment configuration

Your model API key is the single most sensitive value in the app. Keep it in a server-side environment variable on your host — Vercel project settings, Railway variables, or your secrets manager — and never in client-side code, a public bundle, or a committed .env. The browser should call your own backend route, which holds the key and proxies the request to the provider. Separate keys per environment (development, preview, production) so a leaked dev key never touches production, and rotate keys on a schedule. A classic mistake is baking a localhost URL or a key into the production bundle through a misnamed .env file, so audit what actually ships to the client before launch.

Cost guardrails and rate limiting

AI calls cost money per token, so unbounded usage is a financial risk. Put rate limiting on every public endpoint — per IP and per authenticated user — to stop abuse and accidental loops. Cap max_tokens on every request so a single call cannot generate an enormous, expensive response. Set a hard monthly spend limit in the provider dashboard as a backstop. Log token usage per request and wire an alert when daily spend exceeds a threshold, so you find a runaway before the bill does. Prompt caching for large stable prefixes can cut both cost and latency dramatically on repeat calls.

Reliability, logging, and zero-downtime deploys

External model calls fail sometimes. Wrap them in retries with exponential backoff, set request timeouts, and degrade gracefully with a clear message rather than a hung spinner. For critical features, configure a fallback model or provider. Move any work longer than a second or two behind a background queue so web requests stay fast and failures can be retried. Emit structured logs with a request id, the model, token counts, and latency — never raw user content or keys — so you can debug and attribute cost. Finally, lean on your platform’s zero-downtime deploys: it builds the new version, health-checks it, and switches traffic atomically, keeping in-flight streamed responses alive. Roll forward with small, frequent deploys and keep the previous version one click away for instant rollback.

Ad placeholder (rectangle)