What Is AI Bias? Types, Causes, and How to Detect It

Understanding and measuring bias in AI and LLMs

What AI bias means

AI bias is a systematic and unfair distortion in a model’s behaviour: a consistent skew that advantages or disadvantages particular people, groups, or viewpoints. It differs from ordinary error, which is scattered more or less at random; bias shows up as a repeatable pattern that falls on the same people each time. Examples include a résumé screener that consistently downranks one demographic, a facial analysis system that performs worse on darker skin tones, or a language model that completes “the nurse said…” with stereotyped assumptions. Because models learn from data that reflects the real world, including its inequalities, some degree of bias is the expected outcome unless it is actively measured and corrected.

The main types of bias

Data bias is widely regarded as the most common source. If the training data over-represents some groups and under-represents others, the model tends to perform unevenly across them. Historical data also encodes past human decisions, so a model trained on it can learn and reproduce prior discrimination.

Algorithmic bias arises from modelling choices: the objective being optimised, how labels were assigned, feature selection, or thresholds that happen to affect groups differently. A technically “accurate” model can still be unfair if accuracy is unevenly distributed.
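To make the point about unevenly distributed accuracy concrete, here is a minimal sketch in Python. The data is made up for illustration, and the function name accuracy_by_group is hypothetical rather than taken from any library. It shows how a headline accuracy of 90% can sit alongside much weaker performance on a small subgroup.

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Return overall accuracy and a dict of accuracy per subgroup."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    overall = sum(correct.values()) / sum(total.values())
    per_group = {g: correct[g] / total[g] for g in total}
    return overall, per_group

# Made-up toy data: 8 examples from a majority group, 2 from a minority group.
groups = ["majority"] * 8 + ["minority"] * 2
y_true = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 1, 1]  # only the last prediction is wrong

overall, per_group = accuracy_by_group(y_true, y_pred, groups)
print(overall)    # 0.9
print(per_group)  # {'majority': 1.0, 'minority': 0.5}
```

A single error barely moves the global figure, yet it halves accuracy for the smaller group, which is why the aggregate number alone says little about fairness.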

Societal and deployment bias comes from how a system is used in the world. A tool that is fair on average can still cause harm if it is applied in a context it was never validated for, or if its outputs reinforce a feedback loop. Predictive systems, for instance, can direct more scrutiny toward populations that are already over-scrutinised, which generates more data about those populations and appears to justify the next round of scrutiny.

How bias is detected and measured

Detection blends statistics with human review. Quantitative fairness metrics compare outcomes or error rates across subgroups. Two common examples are demographic parity, which asks whether each group receives positive predictions at the same rate, and equalised odds, which asks whether true-positive and false-positive rates match across groups. Both require disaggregating results rather than reporting one global accuracy number. For language models, purpose-built benchmarks such as StereoSet and BBQ probe stereotypical associations directly. Beyond metrics, teams use red-teaming and audits, deliberately stress-testing the model with adversarial and edge-case inputs to surface harms that aggregate numbers miss.
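As a rough illustration of demographic parity and equalised odds, the sketch below computes both gaps for binary predictions using only the standard library. The helper names (group_rates, rate_gap) and the toy data are assumptions made for this example. Summarising each metric as the largest difference between any two groups is one common convention, not the only one.

```python
from collections import defaultdict

def group_rates(y_true, y_pred, groups):
    """Per-group selection rate, true-positive rate, and false-positive rate."""
    counts = defaultdict(lambda: {"n": 0, "pred_pos": 0, "pos": 0, "tp": 0, "neg": 0, "fp": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        c = counts[group]
        c["n"] += 1
        c["pred_pos"] += pred
        if truth == 1:
            c["pos"] += 1
            c["tp"] += pred
        else:
            c["neg"] += 1
            c["fp"] += pred
    return {
        g: {
            "selection_rate": c["pred_pos"] / c["n"],
            "tpr": c["tp"] / c["pos"] if c["pos"] else float("nan"),
            "fpr": c["fp"] / c["neg"] if c["neg"] else float("nan"),
        }
        for g, c in counts.items()
    }

def rate_gap(rates, key):
    """Largest difference in one rate between any two groups."""
    values = [r[key] for r in rates.values()]
    return max(values) - min(values)

# Made-up toy data: two groups, four examples each.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = group_rates(y_true, y_pred, groups)

# Demographic parity gap: difference in positive-prediction rates.
dp_gap = rate_gap(rates, "selection_rate")
# Equalised odds gap: the worse of the TPR gap and the FPR gap.
eo_gap = max(rate_gap(rates, "tpr"), rate_gap(rates, "fpr"))

print(rates["a"])  # {'selection_rate': 0.75, 'tpr': 1.0, 'fpr': 0.5}
print(rates["b"])  # {'selection_rate': 0.25, 'tpr': 0.5, 'fpr': 0.0}
print(dp_gap)      # 0.5
print(eo_gap)      # 0.5
```

In this toy data both groups have the same accuracy (three of four predictions correct), yet group a is selected three times as often and the error types differ sharply. That is the kind of disparity a single global accuracy number hides.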

How leading labs mitigate it

Mitigation happens at every stage. Before training, teams curate and rebalance data and document datasets so gaps are visible. During training, fairness constraints and careful objective design can reduce skew. After training, alignment techniques such as reinforcement learning from human feedback (RLHF) and constitution-based methods, which train a model against a written set of principles, steer models away from harmful or stereotyped outputs, and guardrails filter the worst cases at inference time. Crucially, mitigation involves trade-offs: improving fairness for one group can reduce accuracy elsewhere. The honest standard across the field is therefore continuous measurement, transparency about limitations, and monitoring in production rather than a one-time fix.
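One way to picture the rebalancing step is instance reweighting. The sketch below assumes a reweighing scheme of the kind described in the fairness literature, where each (group, label) combination gets the weight P(group) × P(label) / P(group, label). The function names and the toy data are hypothetical, and real pipelines typically combine this with other checks.

```python
from collections import Counter

def reweighing_weights(groups, labels):
    """Weight per (group, label) pair so that group and label look independent."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    # weight = P(group) * P(label) / P(group, label)
    return {
        (g, y): (group_counts[g] / n) * (label_counts[y] / n) / (joint / n)
        for (g, y), joint in joint_counts.items()
    }

def weighted_positive_rate(groups, labels, weights, group):
    """Share of positive labels within one group after applying the weights."""
    num = sum(weights[(g, y)] for g, y in zip(groups, labels) if g == group and y == 1)
    den = sum(weights[(g, y)] for g, y in zip(groups, labels) if g == group)
    return num / den

# Made-up toy data: group "a" is mostly labelled positive, group "b" mostly negative.
groups = ["a"] * 6 + ["b"] * 4
labels = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

weights = reweighing_weights(groups, labels)
for pair in sorted(weights):
    print(pair, round(weights[pair], 3))

rate_a = weighted_positive_rate(groups, labels, weights, "a")
rate_b = weighted_positive_rate(groups, labels, weights, "b")
print(round(rate_a, 3), round(rate_b, 3))  # both come out at 0.5
```

Under-represented combinations, such as positive labels in group b, receive weights above 1, and over-represented ones receive weights below 1. The weights can then be passed to any training routine that accepts per-sample weights. This addresses only the label imbalance visible in the data, so measurement after training is still needed.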
