How to Test AI Applications

Unit tests, integration tests, and evals for LLM-powered apps


Why AI apps need a different testing strategy

A traditional app is deterministic: given an input, you assert the exact output. An LLM-powered app is not. The same prompt can return different wording from one call to the next, so a naive exact-match assertion fails for reasons that have nothing to do with a bug. The fix is not to abandon testing but to split it into three layers, each suited to a different part of the system: ordinary unit tests for the deterministic code around the model, mocked integration tests for the API boundary, and a separate eval harness for the quality of the model’s actual output. Conflating these is what leads people to believe AI apps cannot be tested.

Unit tests for the deterministic parts

Most of an AI app is ordinary, deterministic code: prompt templating, token counting, parsing the model’s structured output, chunking documents, retry logic, and post-processing. Test all of it the normal way. A prompt builder that injects a user’s name and a schema should produce an exact, asserted string. A JSON parser should reject malformed output and accept valid output. A chunker should split a known input into the expected pieces. These tests are fast, free, and run on every commit. They also tend to catch a large share of real bugs, because the model is rarely the thing that breaks; the plumbing around it is.
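
As a rough illustration of the three cases above, here is a minimal sketch in Python. The helpers build_prompt, parse_model_output, and chunk_text are hypothetical names invented for this example, and the required "answer" field is an assumed output contract, not something any particular library defines.

```python
import json


def build_prompt(user_name: str, schema: dict) -> str:
    """Hypothetical prompt builder: injects a user's name and a JSON schema."""
    # sort_keys keeps the rendered schema stable so the string can be asserted exactly.
    schema_text = json.dumps(schema, sort_keys=True)
    return f"You are helping {user_name}. Reply with JSON matching this schema: {schema_text}"


def parse_model_output(raw: str) -> dict:
    """Hypothetical parser: accept only a JSON object with an 'answer' key (assumed contract)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("model output is not valid JSON") from exc
    if not isinstance(data, dict) or "answer" not in data:
        raise ValueError("model output is missing the 'answer' field")
    return data


def chunk_text(text: str, size: int) -> list[str]:
    """Hypothetical chunker: fixed-size character chunks, last one may be shorter."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


def test_build_prompt_is_exact():
    prompt = build_prompt("Ada", {"type": "object"})
    assert prompt == (
        'You are helping Ada. Reply with JSON matching this schema: {"type": "object"}'
    )


def test_parser_accepts_valid_and_rejects_malformed():
    assert parse_model_output('{"answer": "42"}') == {"answer": "42"}
    for bad in ('{"answer": ', "[1, 2]", '{"other": 1}'):
        try:
            parse_model_output(bad)
        except ValueError:
            continue
        raise AssertionError(f"expected ValueError for {bad!r}")


def test_chunker_splits_known_input():
    assert chunk_text("abcdefg", 3) == ["abc", "def", "g"]
```

The test functions follow the pytest naming convention, so a runner such as pytest should collect them without extra configuration. None of them touches a model or the network.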

Integration tests with mocked providers

Where your code calls OpenAI or Anthropic, mock the provider client so tests assert how you build the request and handle the response without spending money or touching the network. Capture a handful of real responses once and replay them as fixtures. The high-value cases here are the unhappy paths: a rate-limit error, a timeout, a refusal, malformed JSON, or a truncated response. Verify your code backs off, retries sensibly, surfaces a clean error, and never crashes the request. These tests prove your integration is robust against the messy reality of model APIs while staying fast enough to run in the main suite.
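
The unhappy paths above can be sketched with the standard library's unittest.mock. Everything provider-shaped here is an assumption made for illustration: the client's complete method, the RateLimitError class, and the ask_model wrapper are hypothetical stand-ins, not the OpenAI or Anthropic SDK. In a real project you would mock your actual client and raise the SDK's own exception types.

```python
import json
from unittest.mock import Mock


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider SDK's rate-limit exception."""


class ModelCallFailed(Exception):
    """Clean application-level error surfaced to callers."""


def ask_model(client, prompt, max_attempts=3, sleep=lambda seconds: None):
    """Call client.complete(prompt), backing off exponentially on rate limits and timeouts.

    `sleep` is injectable so tests can record delays instead of waiting;
    production code would pass time.sleep.
    """
    for attempt in range(max_attempts):
        try:
            raw = client.complete(prompt)
        except (RateLimitError, TimeoutError) as exc:
            if attempt == max_attempts - 1:
                raise ModelCallFailed("provider unavailable after retries") from exc
            sleep(2 ** attempt)  # 1s, 2s, 4s, ...
            continue
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            # Malformed or truncated output is not retried here; it is reported cleanly.
            raise ModelCallFailed("provider returned malformed JSON") from exc


def test_retries_after_rate_limit_then_succeeds():
    client = Mock()
    # First call raises, second call returns a replayed fixture.
    client.complete.side_effect = [RateLimitError(), '{"answer": "ok"}']
    delays = []
    assert ask_model(client, "hi", sleep=delays.append) == {"answer": "ok"}
    assert client.complete.call_count == 2
    assert delays == [1]


def test_gives_up_with_clean_error_after_repeated_timeouts():
    client = Mock()
    client.complete.side_effect = TimeoutError()
    try:
        ask_model(client, "hi", max_attempts=3)
    except ModelCallFailed:
        pass
    else:
        raise AssertionError("expected ModelCallFailed")
    assert client.complete.call_count == 3


def test_truncated_json_surfaces_clean_error():
    client = Mock()
    client.complete.return_value = '{"answer": "tru'
    try:
        ask_model(client, "hi")
    except ModelCallFailed as exc:
        assert "malformed JSON" in str(exc)
    else:
        raise AssertionError("expected ModelCallFailed")
```

Injecting the sleep function is a design choice rather than a requirement: it lets the test assert the backoff schedule without slowing the suite down.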

Evals and CI/CD

Output quality needs evals, not tests. An eval harness runs your prompt over a fixed dataset of inputs with known-good expectations and scores the results: exact accuracy where outputs are constrained, format compliance for structured tasks, or an LLM-as-judge rubric for open-ended ones. Because evals make real, paid, non-deterministic model calls, you run the fast deterministic suite on every commit and run evals on pull requests to main or on a nightly schedule, gating merges on the score not regressing. Set temperature to zero where you want stability, keeping in mind that this reduces run-to-run variation but does not guarantee identical output. Assert on properties rather than exact strings, and never retry-until-green, because that just hides regressions. Treated this way, your evals become a true regression suite for quality: when someone tweaks a prompt or swaps a model, the harness tells you in CI whether the change helped or hurt before it reaches users.
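
A very small eval harness along these lines might look like the sketch below. The names EvalCase, run_eval, and gate are hypothetical, the scoring is plain normalized exact match (the simplest of the options listed above), and the baseline value is assumed to come from a previously recorded run. The model is passed in as a plain function so the example runs without a provider; a real harness would wrap an actual API call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    expected: str  # known-good label for a constrained-output task


def run_eval(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases whose normalized output matches the expectation."""
    if not cases:
        raise ValueError("eval dataset is empty")
    passed = sum(
        1
        for case in cases
        # Compare a property (the normalized label), not the raw string.
        if model(case.prompt).strip().lower() == case.expected.strip().lower()
    )
    return passed / len(cases)


def gate(score: float, baseline: float, tolerance: float = 0.0) -> bool:
    """True when the score has not regressed below the baseline minus the tolerance."""
    return score >= baseline - tolerance


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    return " Positive\n"


cases = [
    EvalCase("Sentiment of 'great product'?", "positive"),
    EvalCase("Sentiment of 'arrived broken'?", "negative"),
]

score = run_eval(fake_model, cases)
print(f"accuracy: {score:.2f}")
print("merge allowed:", gate(score, baseline=0.75))
```

In CI, the script would exit non-zero when gate returns False so the merge is blocked. A small tolerance can absorb ordinary run-to-run noise without resorting to retries.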
