What is paired testing for AI discrimination?

Paired testing sends the AI two prompts that are identical in every way except one protected attribute, such as a name that signals a different ethnicity. If the outputs differ systematically, the attribute is influencing the decision, which is the signature of discrimination.

What is the four-fifths rule?

The four-fifths (or 80%) rule is a US EEOC threshold. If the selection rate for any group is less than 80% of the rate for the highest group, that is treated as evidence of adverse impact. It is a screening signal, not proof, but it triggers a duty to investigate.

Does this prove my AI is discriminatory?

No. It generates the test cases and tells you the analysis method, but a single run is noisy. You must run each prompt many times, aggregate outcome rates per group, and treat any disparity as a prompt to investigate the model, prompt, or training data.

Which laws make this testing relevant?

In the EU, the AI Act classifies hiring and credit AI as high-risk and requires bias testing. In the US, Title VII, the ECOA, and NYC Local Law 144 all create exposure for biased automated decisions. Documented testing is part of a defensible compliance posture.

Does my prompt data leave my browser?

No. The builder assembles every test prompt locally in your browser and outputs plain text. Nothing is uploaded — you run the generated harness against your own model.

What is the AI Discrimination Test Builder?

Describe your AI use case and select protected characteristics to generate paired test prompts — identical except for one demographic attribute — for detecting disparate impact in AI-driven hiring, lending, pricing, and moderation decisions. It runs free in your browser on Gera Tools, with nothing uploaded.

AI Discrimination Test Builder

Name: AI Discrimination Test Builder
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

AI discrimination test builder

If an AI system makes or informs a decision about a person — who gets hired, what price they pay, whether their content is removed — you have a legal and ethical duty to check that it is not treating protected groups differently. The cleanest way to detect that is paired testing: send the model two requests that are identical in every respect except a single protected attribute, then see whether the outputs diverge. This tool builds those matched prompt sets for you.

How it works

You paste your real decision prompt and pick the protected characteristics you want to probe — gender, race, age, disability, religion, pregnancy, or name-based proxy signals. For each characteristic the builder produces a set of otherwise-identical variant prompts that change only that one attribute and explicitly hold all other qualifications constant. It then assembles a runnable test harness with the recommended method: run each prompt at least twenty times, record the outcome, compute the positive-outcome rate per group, and apply the four-fifths rule to flag adverse impact.

How to apply the four-fifths rule

After running the test harness, you will have an outcome count per group. The four-fifths (or 80%) rule says: if the selection rate for any group is less than 80% of the rate for the highest-scoring group, that gap is treated as evidence of adverse impact and triggers a duty to investigate.

For example: if the AI shortlists 60% of applications from Group A and 45% from Group B, Group B’s rate is 75% of Group A’s (45 ÷ 60 = 0.75). This is below the 0.80 threshold and would flag in a US EEOC analysis. The four-fifths rule is a screening signal, not legal proof — but it tells you where to look and provides documentation that you looked.

Why name proxies matter most

The most common source of real-world AI bias in hiring and credit decisions is not explicit attributes (models are often trained to ignore explicit race or gender fields) but proxy signals: names, postcodes, school names, and gaps in employment history. A model that never sees the word “race” can still learn that certain name patterns predict outcomes, and that learning reproduces exactly the same discrimination by a different route.

The builder produces test pairs that vary name patterns associated with different demographic groups while holding all other qualifications fixed. This is where the most diagnostic signal tends to appear.

Intersectional testing

Protected characteristics often interact. A candidate who is both female and over 55 may face discrimination that neither single-axis test would reveal. After completing single-characteristic tests, consider running a small intersectional set — for example, combining gender with age — to check that the combination does not produce a disparate outcome that the individual tests missed.

Tips and notes

Test name proxies, not just explicit attributes. Models often infer ethnicity or gender from a name even when you never state it — that is where real-world bias hides.
Volume matters. A single generation is noise. Aggregate over many runs at a fixed temperature so you are measuring the model, not luck.
A passing test is not a clean bill of health. Paired testing catches first-order disparities; it cannot catch intersectional or context-dependent bias. Treat it as one layer of a broader fairness audit.