From localhost to live
Shipping an AI app is mostly the same as shipping any web app, with three extra concerns: secrets you cannot afford to leak, costs that scale with usage, and external model calls that can be slow or fail. This guide covers how to handle each so your launch is reliable, affordable, and secure. The patterns apply whether you deploy a frontend on Vercel and a backend on Railway, or run everything on one platform.
Secrets and environment configuration
Your model API key is the single most sensitive value in the app. Keep it in a
server-side environment variable on your host — Vercel project settings, Railway
variables, or your secrets manager — and never in client-side code, a public
bundle, or a committed .env. The browser should call your own backend route,
which holds the key and proxies the request to the provider. Use a separate key
for each environment (development, preview, production) so a leaked dev key
never touches production, and rotate keys on a schedule. A classic mistake is
baking a localhost URL or a key into the production bundle, for example through
a misnamed .env file or a variable given a client-exposed prefix such as
NEXT_PUBLIC_ or VITE_, so audit what actually ships to the client before launch.
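One way to make that pre-launch audit repeatable is to scan the built client output for strings that should never be there. The sketch below is a minimal example under stated assumptions: the `sk-` key pattern is hypothetical (use the format your provider actually issues), and `find_leaks` and the file-suffix list are illustrative names, not part of any platform tooling.

```python
import re
from pathlib import Path

# Hypothetical patterns: replace the key regex with your provider's real key format.
LEAK_PATTERNS = {
    "api key": re.compile(r"sk-[A-Za-z0-9_-]{20,}"),
    "localhost URL": re.compile(r"https?://(?:localhost|127\.0\.0\.1)(?::\d+)?"),
}

# Text file types that typically end up in a client bundle (an assumption).
TEXT_SUFFIXES = {".js", ".mjs", ".css", ".html", ".json", ".map"}


def find_leaks(build_dir):
    """Return (relative path, pattern name, matched text) for every hit."""
    root = Path(build_dir)
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in TEXT_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for name, pattern in LEAK_PATTERNS.items():
            for match in pattern.finditer(text):
                hits.append((str(path.relative_to(root)), name, match.group(0)))
    return hits
```

A CI step could call `find_leaks("dist")` (the directory name is an assumption about your build output) and fail the build when the list is non-empty. When printing hits, show only a short prefix of the match so the audit log does not leak the secret a second time.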
Cost guardrails and rate limiting
AI calls cost money per token, so unbounded usage is a financial risk. Put rate
limiting on every public endpoint — per IP and per authenticated user — to stop
abuse and accidental loops. Cap max_tokens on every request so a single call
cannot generate an enormous, expensive response. Set a hard monthly spend limit
in the provider dashboard as a backstop. Log token usage per request and wire an
alert when daily spend exceeds a threshold, so you catch runaway usage before it
shows up on the bill. Where your provider supports prompt caching, reusing a
large stable prefix can cut both cost and latency substantially on repeat calls.
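The guardrails above can be sketched in a few small pieces: a per-key rate limiter, a server-side clamp on max_tokens, and a spend tracker that signals when a threshold is crossed. This is a minimal in-memory sketch, not a production implementation. The class names, the 1024-token ceiling, the 20 USD threshold, and the per-1k-token prices passed to `SpendTracker` are all assumptions chosen for illustration.

```python
import time
from collections import defaultdict, deque

MAX_OUTPUT_TOKENS = 1024   # hypothetical per-request ceiling
DAILY_ALERT_USD = 20.0     # hypothetical alert threshold


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each key."""

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self._hits = defaultdict(deque)  # key -> timestamps of recent requests

    def allow(self, key):
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


def clamp_max_tokens(requested):
    """Never pass the provider more than the server-side ceiling."""
    if requested is None or requested <= 0:
        return MAX_OUTPUT_TOKENS
    return min(requested, MAX_OUTPUT_TOKENS)


class SpendTracker:
    """Accumulate estimated spend; signal once when the threshold is exceeded."""

    def __init__(self, usd_per_1k_input, usd_per_1k_output, alert_usd=DAILY_ALERT_USD):
        self.in_rate = usd_per_1k_input
        self.out_rate = usd_per_1k_output
        self.alert_usd = alert_usd
        self.total_usd = 0.0
        self.alerted = False

    def record(self, input_tokens, output_tokens):
        """Add one request's cost; return True only on the call that first crosses the threshold."""
        self.total_usd += input_tokens / 1000 * self.in_rate
        self.total_usd += output_tokens / 1000 * self.out_rate
        if self.total_usd > self.alert_usd and not self.alerted:
            self.alerted = True
            return True
        return False

    def reset(self):
        """Call at the start of each day to begin a fresh total."""
        self.total_usd = 0.0
        self.alerted = False
```

In a request handler you would typically check `allow("ip:" + client_ip)` and `allow("user:" + user_id)` before calling the model, and return HTTP 429 when either fails. Because the state lives in process memory, this only works as written for a single instance. With several instances the counters would need a shared store.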
Reliability, logging, and zero-downtime deploys
External model calls fail sometimes. Wrap them in retries with exponential backoff, set request timeouts, and degrade gracefully with a clear message rather than a hung spinner. For critical features, configure a fallback model or provider. Move any work longer than a second or two behind a background queue so web requests stay fast and failures can be retried. Emit structured logs with a request id, the model, token counts, and latency (never raw user content or keys) so you can debug and attribute cost. Finally, lean on your platform’s zero-downtime deploys: the platform builds the new version, health-checks it, and switches traffic atomically. Confirm that the old instance is given a drain period after the switch, because in-flight streamed responses survive only if it stays up long enough to finish them. Roll forward with small, frequent deploys and keep the previous version one click away for instant rollback.
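The retry, fallback, and logging advice can be combined in one small wrapper. The sketch below is illustrative only: `TransientModelError`, `call_with_retries`, and `generate` are hypothetical names, and each provider function is assumed to enforce its own request timeout and to return a dict with `text`, `input_tokens`, and `output_tokens`. Real provider SDKs will have different return shapes and error types.

```python
import json
import logging
import random
import time
import uuid

logger = logging.getLogger("model_calls")

FALLBACK_MESSAGE = "The assistant is temporarily unavailable. Please try again shortly."


class TransientModelError(Exception):
    """Hypothetical error type for failures worth retrying (timeouts, 429s, 5xx)."""


def call_with_retries(call, *, attempts=3, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Run call(); on a transient error wait base_delay * 2**n plus jitter, then retry."""
    for attempt in range(attempts):
        try:
            return call()
        except TransientModelError:
            if attempt == attempts - 1:
                raise  # out of attempts, let the caller decide what to do
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries from many clients hitting the same outage.
            sleep(delay + random.uniform(0, delay / 2))


def generate(prompt, providers, *, attempts=3, sleep=time.sleep):
    """Try each (model_name, call_fn) in order; log metadata only, never the prompt."""
    request_id = str(uuid.uuid4())
    for model_name, call_fn in providers:
        started = time.monotonic()
        try:
            result = call_with_retries(lambda: call_fn(prompt), attempts=attempts, sleep=sleep)
        except TransientModelError:
            logger.warning(json.dumps(
                {"request_id": request_id, "model": model_name, "event": "provider_failed"}))
            continue  # fall through to the next model or provider
        logger.info(json.dumps({
            "request_id": request_id,
            "model": model_name,
            "input_tokens": result["input_tokens"],
            "output_tokens": result["output_tokens"],
            "latency_ms": round((time.monotonic() - started) * 1000),
        }))
        return result["text"]
    # Every provider failed: degrade with a clear message instead of hanging.
    return FALLBACK_MESSAGE
```

Passing `providers` as an ordered list keeps the fallback policy in one place: the first entry is the primary model and later entries are tried only after its retries are exhausted. The `sleep` parameter exists so tests can run without real delays.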