Prompt testing harness
Tweaking a prompt and eyeballing one output is how prompts silently regress. A prompt change should be checked against several representative inputs at once. This harness lets you write a template with a placeholder, supply a list of test inputs, and run them all against a provider using your own API key. It then grades each result against keywords you choose, so you can see at a glance which inputs still pass.
How it works
Your template includes the token {{input}}. For each line in your test-input
list, the tool substitutes that line into the template and sends the result to
your chosen provider — OpenAI via api.openai.com/v1/chat/completions or
Anthropic via api.anthropic.com/v1/messages — directly from your browser using
your own key. Cases run one at a time to stay within rate limits.
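
The substitution and sequential run described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's actual source: every function and type name here is hypothetical, skipping blank input lines is an assumption, the model name is supplied by the caller, and the max_tokens value is arbitrary. The endpoints are the two named above; the header names follow each provider's public API documentation at the time of writing.

```typescript
// Hypothetical names throughout; a sketch, not the harness's real code.
type Provider = "openai" | "anthropic";

interface PreparedRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

interface CaseResult {
  input: string;
  output?: string;
  error?: string;
}

// Replace every {{input}} occurrence in the template with one test input.
function renderTemplate(template: string, input: string): string {
  return template.split("{{input}}").join(input);
}

// One test case per line. Assumption: lines that are blank are skipped.
function parseInputs(raw: string): string[] {
  return raw.split(/\r?\n/).filter((line) => line.trim().length > 0);
}

// Build the HTTP request for the chosen provider without sending it.
function buildRequest(
  provider: Provider,
  apiKey: string,
  model: string,
  prompt: string,
): PreparedRequest {
  const messages = [{ role: "user", content: prompt }];
  if (provider === "openai") {
    return {
      url: "https://api.openai.com/v1/chat/completions",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({ model, messages }),
    };
  }
  return {
    url: "https://api.anthropic.com/v1/messages",
    headers: {
      "Content-Type": "application/json",
      "x-api-key": apiKey,
      "anthropic-version": "2023-06-01",
      // Opt-in header Anthropic requires for direct browser (CORS) calls.
      "anthropic-dangerous-direct-browser-access": "true",
    },
    // max_tokens is required by this endpoint; 1024 is an arbitrary choice.
    body: JSON.stringify({ model, max_tokens: 1024, messages }),
  };
}

// Run cases one at a time: each request is awaited before the next starts.
// A failing case is recorded and does not abort the remaining cases.
async function runAll(
  template: string,
  inputs: string[],
  send: (prompt: string) => Promise<string>,
): Promise<CaseResult[]> {
  const results: CaseResult[] = [];
  for (const input of inputs) {
    try {
      const output = await send(renderTemplate(template, input));
      results.push({ input, output });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      results.push({ input, error: message });
    }
  }
  return results;
}
```

In a browser, the send callback would typically pass the result of buildRequest to fetch with method POST and then read the reply text from choices[0].message.content (OpenAI) or content[0].text (Anthropic). Keeping send as a parameter also makes the loop easy to exercise with a stub that never touches the network.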
If you supply pass keywords, each output is graded pass when it contains all of them (case-insensitive) and fail otherwise. The full output is always shown, so you can judge for yourself. Loading states and per-case errors are shown inline, and the only data that leaves your browser is the request to your chosen provider.
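
The grading rule is simple enough to sketch in a few lines. The function name and the Grade type are hypothetical, and returning null when no keywords are supplied is an assumption about how an ungraded case might be represented; the matching itself follows the rule stated above.

```typescript
// Hypothetical helper; a sketch of the all-keywords, case-insensitive rule.
type Grade = "pass" | "fail" | null;

function gradeOutput(output: string, keywords: string[]): Grade {
  // Normalize keywords and drop empty entries.
  const wanted = keywords
    .map((k) => k.trim().toLowerCase())
    .filter((k) => k.length > 0);

  // Assumption: with no keywords supplied, the case is left ungraded.
  if (wanted.length === 0) return null;

  // Pass only when every keyword appears somewhere in the output.
  const haystack = output.toLowerCase();
  return wanted.every((k) => haystack.includes(k)) ? "pass" : "fail";
}
```

Because this is plain substring matching, a keyword such as "no" would also match inside "note", which is one reason to prefer a distinctive keyword over a short common one.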
Tips and notes
Pick test inputs that span the edges, not just the happy path: an empty-ish case, a very long one, a tricky one that previously broke the prompt. Keep your pass keywords specific enough to catch real regressions but loose enough to survive harmless rewording — a single decisive token often works better than a whole phrase. When you change the prompt, rerun the whole set rather than the one case you were thinking about; that is precisely how you catch the regression you did not expect. Use a small, cheap model for routine iteration and reserve a larger one for a final confirmation pass.