How do paired test cases reveal bias?

Each set is identical except for one demographic attribute. If your AI produces systematically different outcomes — different scores, tone, or recommendations — that vary only with that attribute, it points to disparate treatment worth investigating.

Does passing these tests prove my AI is fair?

No. Passing means these specific cases did not surface a disparity. Real fairness evaluation also needs representative data, intersectional testing, statistical analysis across many cases, and domain expertise. Treat this as a fast first screen.

What is demographic parity?

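Demographic parity is a group fairness criterion: the rate of favorable outcomes should be the same across groups defined by a protected attribute when that attribute is irrelevant to the task. These counterfactual prompts give a quick check in that spirit by holding everything else constant, so any gap in outcomes between groups can be attributed to the attribute itself.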
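Measured as a gap in favorable-outcome rates between groups, the idea can be sketched as follows. This is a minimal illustration, not part of the tool: the record format (group label plus a True/False outcome), the function names, and the sample data are all assumptions.

```python
from collections import defaultdict

def positive_rates(records):
    """Return the share of favorable outcomes per group.

    `records` is an iterable of (group, outcome) pairs, where outcome is
    True for a favorable result (hypothetical format for illustration).
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, outcome in records:
        totals[group] += 1
        if outcome:
            positives[group] += 1
    return {g: positives[g] / totals[g] for g in totals}

def parity_gap(records):
    """Largest difference in favorable-outcome rate between any two groups."""
    rates = positive_rates(records)
    return max(rates.values()) - min(rates.values())

# Made-up outcomes for two groups, four cases each
records = [
    ("A", True), ("A", True), ("A", False), ("A", True),
    ("B", True), ("B", False), ("B", False), ("B", True),
]
rates = positive_rates(records)
gap = parity_gap(records)
```

A gap near zero is consistent with parity on these cases; how large a gap matters depends on sample size and on the application.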

Should I test intersections?

Yes, where it matters. Bias can appear at the intersection of attributes (for example age and gender together) even when each alone looks fine. Generate sets for combined attributes in addition to single ones.

Does my task description leave the browser?

No. The prompts are generated locally from templates and built-in attribute lists. Nothing you enter is uploaded or stored.

What is the Bias Test Case Generator?

The Bias Test Case Generator is a free tool on Gera Tools that runs in your browser. You enter a task description, and it generates paired test prompts with systematically varied demographic attributes (gender, race, age, nationality) so you can evaluate your AI system for biased or disparate output patterns. Nothing is uploaded.

Bias Test Case Generator

Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

If your AI screens applicants, drafts recommendations, or scores requests, you want to know whether it treats people differently based on attributes that should not matter. The bias test case generator produces paired prompts — identical except for one varied demographic attribute — so you can run them through your system and look for output that changes when only that attribute changes.

How it works

You describe the task and pick which protected attributes to vary: gender, race or ethnicity, age, or nationality. For each attribute the tool generates a matched set of prompts where the attribute cycles through representative values while everything else stays fixed. You run each set against your AI and compare results. A systematic difference that tracks only the varied attribute is a signal of disparate treatment to investigate. Generation is local; nothing is uploaded.
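The generation step described above can be sketched in a few lines. This is an illustrative approximation, not the tool's actual code: the attribute values, the `{gender}` slot convention, and the `matched_set` helper are all assumptions.

```python
# Hypothetical attribute values; the tool's built-in lists may differ.
ATTRIBUTE_VALUES = {
    "gender": ["a woman", "a man", "a nonbinary person"],
    "age": ["a 25-year-old", "a 45-year-old", "a 65-year-old"],
}

def matched_set(template, attribute, values=None):
    """Fill the template's {attribute} slot with each value in turn.

    Everything outside the slot stays fixed, so the prompts in the
    returned set differ only in the varied attribute.
    """
    values = values or ATTRIBUTE_VALUES[attribute]
    return [
        {
            "attribute": attribute,
            "value": value,
            "prompt": template.format(**{attribute: value}),
        }
        for value in values
    ]

template = "Assess this loan request from {gender} with a stable income of 50,000."
prompts = matched_set(template, "gender")
for case in prompts:
    print(case["prompt"])
```

Keeping the varied value alongside each prompt makes it easier to group the AI's outputs by attribute value afterwards.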

Tips and notes

Compare within a set. The signal is the difference across the matched prompts, not any single output.
Aggregate over many cases. One pair is anecdotal; run many and look for consistent patterns.
Test intersections. Bias can hide where two attributes combine even when each looks clean alone.
A clean screen is not proof. Pair this with statistical analysis, representative data, and domain expertise.
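For the intersection tip, one way to build combined sets is to take the Cartesian product of the attribute values. The sketch below assumes a template with one named slot per attribute; the helper name and the sample values are hypothetical.

```python
from itertools import product

def intersectional_set(template, attributes):
    """Build one prompt per combination of values across several attributes.

    `attributes` maps a slot name in the template to its list of values.
    """
    names = list(attributes)
    cases = []
    for combo in product(*(attributes[name] for name in names)):
        assignment = dict(zip(names, combo))
        cases.append({"values": assignment, "prompt": template.format(**assignment)})
    return cases

template = "Summarize the application from a {age} {gender} applying for the analyst role."
cases = intersectional_set(
    template,
    {"age": ["28-year-old", "58-year-old"], "gender": ["woman", "man"]},
)
for case in cases:
    print(case["values"], "->", case["prompt"])
```

The number of prompts grows multiplicatively with each added attribute, so it usually makes sense to limit intersections to the combinations most relevant to the task.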

What makes a good bias test case

The strength of counterfactual fairness testing depends on how well the prompt pairs isolate the attribute being tested. The demographic detail should be the only meaningful difference between prompts in a matched set. If the prompts also differ in formality, word length, or other incidental features, any output difference could reflect those differences rather than the demographic attribute.

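Concretely: if you are testing a candidate screening prompt, “Review this application from Maria Garcia” and “Review this application from James Smith” differ only in the name, so the wording is well controlled. The names signal both gender and ethnicity, though, so a difference between them cannot be attributed to either attribute alone; treat this pair as an intersectional test, and use names that differ on one attribute only (for example “Maria Garcia” and “Mario Garcia” for gender) when you want a single-attribute pair. And if one prompt uses formal English and the other uses casual phrasing, any output difference is confounded.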
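A simple confound check is to mask the varied value in each prompt and confirm that what remains is identical. The helper below is a hypothetical sketch; it only catches wording differences and cannot tell whether the varied value itself signals more than one attribute.

```python
def residual_difference(prompt_a, value_a, prompt_b, value_b, placeholder="<ATTR>"):
    """Mask the varied value in each prompt and compare the remainders.

    Returns None when the masked prompts match (no incidental difference),
    otherwise the two masked texts so the mismatch can be inspected.
    """
    masked_a = prompt_a.replace(value_a, placeholder)
    masked_b = prompt_b.replace(value_b, placeholder)
    if masked_a == masked_b:
        return None
    return masked_a, masked_b

clean = residual_difference(
    "Review this application from Maria Garcia.", "Maria Garcia",
    "Review this application from James Smith.", "James Smith",
)
confounded = residual_difference(
    "Please review this application from Maria Garcia.", "Maria Garcia",
    "Check out this application from James Smith.", "James Smith",
)
print(clean)       # None: only the name differs
print(confounded)  # the two masked prompts, which differ in phrasing
```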

Interpreting results

Not every output difference signals bias. Language models generate probabilistic text, so some variation between matched prompts is expected even without demographic bias. The signal to look for is systematic differences that track the attribute across many test cases. A model that consistently produces warmer language for one demographic group, or consistently rates one group’s applications higher, shows a pattern worth investigating — even if any individual pair could be explained by random variation.

Quantitative comparison helps: if you can extract a score, rating, word count, or tone classification from each output, you can compute summary statistics across the test set and look for a statistically meaningful gap rather than relying on qualitative impression.
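As one possible way to test for a gap, the sketch below runs a sign-flip permutation test on paired scores, which suits matched prompts because each test case yields one score per attribute value. The scores are made up for illustration, and the function name, resample count, and seed are assumptions; other tests may fit your data better.

```python
import random
from statistics import mean

def paired_permutation_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired score differences.

    Returns (mean_difference, p_value). Under the null hypothesis that the
    attribute has no effect, each paired difference is equally likely to
    carry either sign.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = mean(diffs)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        # Randomly flip the sign of each difference and recompute the mean
        flipped = mean([d if rng.random() < 0.5 else -d for d in diffs])
        if abs(flipped) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value above zero
    return observed, (extreme + 1) / (n_resamples + 1)

# Hypothetical scores extracted from outputs for eight matched test cases
scores_a = [7.5, 8.0, 7.0, 8.5, 7.5, 8.0, 7.0, 8.5]
scores_b = [7.0, 7.5, 6.5, 8.0, 7.0, 7.5, 6.5, 8.0]
observed, p_value = paired_permutation_test(scores_a, scores_b)
print(f"mean difference = {observed}, p = {p_value:.4f}")
```

A small p-value suggests the gap is unlikely to come from random variation alone; it does not by itself say whether the gap is large enough to matter in practice.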

Regulatory and compliance context

The EU AI Act sets requirements for high-risk AI systems, a category that includes AI used in employment, credit scoring, and education decisions. Among them are data governance obligations to examine data for possible biases and to take measures to detect, prevent, and mitigate them, along with technical documentation that supports conformity assessment. Counterfactual test cases of this type are one way to generate supporting evidence, though they are not the only one and must be combined with representative real-world data analysis for a complete assessment.

Even outside formal regulatory contexts, documenting bias testing — what was tested, what was found, and what mitigations were applied — is becoming standard practice in responsible AI development. This tool helps generate the test inputs; the analysis and documentation are the developer’s responsibility.
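One lightweight way to capture what was tested, what was found, and what mitigations were applied is a structured record saved alongside the test run. The fields and values below are hypothetical placeholders, not a required or standard schema.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class BiasTestRecord:
    """Hypothetical record of a single bias test run."""
    task: str
    attributes_tested: list
    n_cases: int
    findings: str
    mitigations: list = field(default_factory=list)

# Placeholder values for illustration only
record = BiasTestRecord(
    task="Resume screening prompt",
    attributes_tested=["gender", "age"],
    n_cases=200,
    findings="Describe any systematic gaps observed here.",
    mitigations=["Describe each mitigation applied here."],
)
report = json.dumps(asdict(record), indent=2)
print(report)
```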

Extending the basic approach

Beyond the attributes available in this tool, consider testing for bias in less obvious dimensions: socioeconomic signals embedded in word choice, regional accents described in text, disability status, religion, and political affiliation. Not all of these are legally protected characteristics in every jurisdiction, but they may still be relevant to the fairness of your specific application. The paired-prompt approach works for any attribute you can vary while holding everything else constant.