Question 1

Why can't I just unit-test an AI app the normal way?

Accepted Answer

Because the model is non-deterministic — the same prompt can yield different wording each time — so an exact-match assertion fails for the wrong reasons. You test the deterministic plumbing around the model with ordinary unit tests, mock the model in integration tests, and judge the model's actual output quality with a separate eval harness that tolerates variation. Mixing those three concerns into one assertion is what makes people think AI apps are untestable.

Question 2

What is an eval and how is it different from a test?

Accepted Answer

A test gives a pass or fail on deterministic behaviour; an eval scores the quality of a non-deterministic output against criteria. An eval harness runs your prompt over a fixed dataset of inputs with known-good expectations and measures things like accuracy, format compliance, or an LLM-as-judge score. Evals tell you whether a prompt or model change made outputs better or worse — they are your regression suite for quality.

Question 3

How do I test code that calls the OpenAI or Anthropic API?

Accepted Answer

Mock the provider client in integration tests so you assert on how your code builds the request and handles the response — including errors, timeouts, and rate limits — without spending money or depending on the network. Capture a few real responses once and replay them as fixtures. Reserve live API calls for a small, separately scheduled eval run, not your fast unit suite.

Question 4

Should evals run on every commit?

Accepted Answer

Run fast deterministic unit and mocked integration tests on every commit so the pipeline stays quick and cheap. Run the eval harness — which makes real model calls and costs money — on a smaller cadence, such as on pull requests to main or nightly, and gate merges on the score not regressing. This keeps CI fast while still catching quality regressions before they ship.

Question 5

How do I handle flaky tests caused by model randomness?

Accepted Answer

Set temperature to zero for tests where you want the most stable output, assert on structure and properties rather than exact strings, and use an eval score with a threshold rather than a single hard equality. For genuinely variable outputs, an LLM-as-judge or rubric check that allows acceptable variation beats brittle string matching. Never retry until green — that hides real regressions.

How to Test AI Applications

Why AI apps need a different testing strategy

Unit tests for the deterministic parts

Integration tests with mocked providers

Evals and CI/CD