Why AI apps need their own monitoring
A normal web service fails loudly: it throws an exception, returns a 500, or
times out, and your dashboards light up. An AI feature fails quietly. The model
returns a confident, well-formatted answer with a 200 OK, and that answer is
wrong, off-brand, or hallucinated. Standard application performance monitoring
(APM) tooling sees a healthy request.
Your users see a broken product. Monitoring AI in production therefore means
watching three layers at once: the system (latency, errors, cost), the
output (quality, format, safety), and the drift (how all of those move
over time as models and inputs change).
What to log on every call
Observability starts with structured logs. For each LLM call, capture the model and version, the full prompt and completion, input and output token counts, latency, the computed cost, the user or tenant ID, and a request ID that links to your wider distributed trace. Redact or hash PII before it lands in storage. These fields are the raw material for everything else: you debug a bad answer by reading its exact prompt, you attribute spend by summing cost per tenant, and you compute quality metrics by replaying logged completions through evaluators.
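As a rough illustration, the fields listed above could be assembled into one structured record per call. This is a minimal sketch, not a definitive schema: the model name, the price table, the `redact` helper, and the `LLMCallRecord` type are all hypothetical, and a single email regex stands in for whatever PII detection you actually use.

```python
import json
import re
import time
import uuid
from dataclasses import asdict, dataclass

# Hypothetical USD prices per 1,000 tokens; substitute your provider's real rates.
PRICE_PER_1K = {"example-model-v1": {"input": 0.50, "output": 1.50}}

# Deliberately simple email pattern; real PII redaction needs a broader detector.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    """Replace email addresses with a placeholder before the text is stored."""
    return EMAIL_RE.sub("[EMAIL]", text)


def compute_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Price a call from its token counts using the table above."""
    price = PRICE_PER_1K[model]
    cost = input_tokens / 1000 * price["input"] + output_tokens / 1000 * price["output"]
    return round(cost, 6)


@dataclass
class LLMCallRecord:
    request_id: str  # should match the ID used in your distributed trace
    tenant_id: str
    model: str
    prompt: str
    completion: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float
    timestamp: float


def build_record(model, tenant_id, prompt, completion,
                 input_tokens, output_tokens, latency_ms, request_id=None):
    """Assemble one structured log record for a single LLM call."""
    return LLMCallRecord(
        request_id=request_id or str(uuid.uuid4()),
        tenant_id=tenant_id,
        model=model,
        prompt=redact(prompt),
        completion=redact(completion),
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        latency_ms=latency_ms,
        cost_usd=compute_cost(model, input_tokens, output_tokens),
        timestamp=time.time(),
    )


record = build_record(
    model="example-model-v1",
    tenant_id="tenant-42",
    prompt="Summarize the ticket from bob@example.com",
    completion="The customer reports a login failure.",
    input_tokens=1000,
    output_tokens=2000,
    latency_ms=812.5,
    request_id="req-123",
)
# One JSON object per line suits most log pipelines.
print(json.dumps(asdict(record)))
```

Computing the cost at write time, rather than later from token counts, means per-tenant spend stays a simple sum even after prices change.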
Log metadata (everything except the prompt and completion bodies) for 100 percent of calls; it is cheap, and you need it for billing and alerting. On high-volume paths, sample the full prompt and completion bodies to control storage, but always capture them for errored or flagged requests so you can reconstruct any incident.
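One way to express that sampling rule in code is sketched below. The function names, the `status` and `flagged` fields, and the sample rate are assumptions made for illustration; the only point is that metadata always survives while bodies are kept conditionally.

```python
import random


def should_store_bodies(status: str, flagged: bool, sample_rate: float) -> bool:
    """Decide whether to keep the full prompt and completion for one call."""
    if status != "ok" or flagged:
        return True  # always keep bodies needed to reconstruct an incident
    return random.random() < sample_rate  # otherwise keep a random fraction


def to_log_entry(record: dict, status: str, flagged: bool, sample_rate: float) -> dict:
    """Return the entry to write: metadata always, bodies only when selected."""
    entry = dict(record, status=status, flagged=flagged)
    if not should_store_bodies(status, flagged, sample_rate):
        entry.pop("prompt", None)
        entry.pop("completion", None)
    return entry


call = {"request_id": "req-123", "latency_ms": 812.5,
        "prompt": "example prompt", "completion": "example completion"}

healthy = to_log_entry(call, status="ok", flagged=False, sample_rate=0.0)
errored = to_log_entry(call, status="error", flagged=False, sample_rate=0.0)
print(sorted(healthy))
print(sorted(errored))
```

With a sample rate of 0.0 the healthy call keeps only its metadata, while the errored call keeps both bodies regardless of the rate.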
Tracking quality and drift
Quality is the hard part because “good answer” is fuzzy. Make it measurable with proxies you can chart: JSON-parse failure rate for structured outputs, refusal rate, average response length, citation or tool-call rate, and a user feedback signal like thumbs up/down. For deeper signal, run an LLM-as-judge evaluator on a sample of production traffic that scores answers against a rubric, and maintain a small golden set of inputs you replay after every prompt or model change. Drift is simply these metrics moving, often after a silent provider model update, so baseline them at launch and watch the trend, not just the instantaneous value.
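Two of these proxies, and a simple drift check against a launch baseline, might look like the sketch below. The refusal markers, the window size, and the tolerance are hypothetical values chosen for illustration, and substring matching is only a crude stand-in for real refusal detection.

```python
import json
from statistics import mean

# Hypothetical phrases treated as refusals; tune these for your own model.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable")


def json_parse_failure_rate(completions: list[str]) -> float:
    """Fraction of completions that are not valid JSON."""
    if not completions:
        return 0.0
    failures = 0
    for text in completions:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            failures += 1
    return failures / len(completions)


def refusal_rate(completions: list[str]) -> float:
    """Fraction of completions containing a refusal marker."""
    if not completions:
        return 0.0
    hits = sum(any(m in text.lower() for m in REFUSAL_MARKERS) for text in completions)
    return hits / len(completions)


def drifted_metrics(history: dict, baseline: dict, window: int, tolerance: float) -> dict:
    """Compare the mean of the last `window` values with the launch baseline.

    Averaging over a window tracks the trend rather than one noisy reading.
    """
    drifted = {}
    for name, values in history.items():
        recent = mean(values[-window:])
        if abs(recent - baseline[name]) > tolerance:
            drifted[name] = round(recent - baseline[name], 4)
    return drifted


# Daily metric values, oldest first (made-up numbers).
history = {
    "json_parse_failure_rate": [0.02, 0.03, 0.09, 0.10, 0.11],
    "refusal_rate": [0.02, 0.02, 0.03, 0.02, 0.02],
}
baseline = {"json_parse_failure_rate": 0.02, "refusal_rate": 0.02}
print(drifted_metrics(history, baseline, window=3, tolerance=0.05))
```

Here only the JSON-parse failure rate is reported, because its recent average has moved well away from the baseline while the refusal rate has not.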
Alerting without the noise
Keep the page-worthy alert set small and trustworthy. Page on hard failures (provider 5xx, timeouts, empty completions), on cost anomalies (spend-per-hour over a budget threshold, which usually means a runaway loop or prompt-injection abuse), on p95 latency regressions, and on quality-proxy breaches such as a sudden spike in JSON-parse failures or refusals. Route everything softer, such as slow drift and minor feedback dips, to a daily digest that a human reviews. The fastest way to get a team to ignore AI monitoring is to flood them with alerts that do not map to real user harm, so tie every page to a concrete failure a customer would notice.
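The split between pages and the daily digest can be written down as an explicit routing function, as in the sketch below. Every threshold and field name here is an assumption for illustration; real values would come from your own baselines and budget.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Metrics for one evaluation window (field names are illustrative)."""
    hard_failure_rate: float  # provider 5xx, timeouts, empty completions
    spend_per_hour_usd: float
    latency_p95_ms: float
    json_parse_failure_rate: float
    thumbs_up_rate: float


# Hypothetical thresholds; derive real ones from your launch baseline.
MAX_HARD_FAILURE_RATE = 0.02
HOURLY_BUDGET_USD = 50.0
P95_REGRESSION_FACTOR = 1.5
JSON_FAIL_SPIKE_FACTOR = 3.0
JSON_FAIL_FLOOR = 0.02  # avoids paging on tiny rates when the baseline is near zero
FEEDBACK_DIP = 0.03


def route_alerts(now: Snapshot, baseline: Snapshot) -> dict:
    """Sort breaches into pages (wake someone up) and digest items (review daily)."""
    page, digest = [], []
    if now.hard_failure_rate > MAX_HARD_FAILURE_RATE:
        page.append("hard_failures")
    if now.spend_per_hour_usd > HOURLY_BUDGET_USD:
        page.append("cost")
    if now.latency_p95_ms > baseline.latency_p95_ms * P95_REGRESSION_FACTOR:
        page.append("latency_p95")
    json_limit = max(baseline.json_parse_failure_rate * JSON_FAIL_SPIKE_FACTOR,
                     JSON_FAIL_FLOOR)
    if now.json_parse_failure_rate > json_limit:
        page.append("json_parse_failures")
    if baseline.thumbs_up_rate - now.thumbs_up_rate > FEEDBACK_DIP:
        digest.append("feedback")  # a soft signal, so it never pages
    return {"page": page, "digest": digest}


baseline = Snapshot(0.001, 12.0, 1800.0, 0.01, 0.80)
current = Snapshot(0.002, 75.0, 1900.0, 0.012, 0.76)
print(route_alerts(current, baseline))
```

In this made-up window, spend exceeds the hourly budget and pages, while the small drop in positive feedback goes to the digest. Keeping the rules in one small function also makes it easy to audit that each page maps to a failure a customer would notice.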