Why does collection method matter?

Web-scraped data carries a higher risk of breaching the source's terms of use and of containing poisoned samples, user-generated data needs training-specific consent, and licensed data depends on the licence scope. The tool highlights the relevant cautions for what you select.

Is my progress saved?

No. The checklist state lives only in the page while it is open, and nothing is uploaded. Reloading resets it, so export or record your answers if you need a permanent audit record.

Does ticking every box make my dataset compliant?

No. The checklist drives a thorough review, but compliance depends on jurisdiction, the specific data, and how controls are actually implemented. Use it alongside a DPIA and legal review.

Why audit for data poisoning?

Untrusted training data can contain deliberately crafted samples that implant backdoors or degrade model behaviour. Tracking provenance and running integrity checks reduce that risk, especially for scraped data.

What is the AI Training Data Audit Checklist?

It is a structured training data audit covering consent for training use, demographic representation, data freshness, poisoning risk, PII minimization, opt-out compliance, and documentation. It runs free in your browser on Gera Tools: progress is tracked locally and nothing is uploaded.

AI Training Data Audit Checklist

Name: AI Training Data Audit Checklist
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

The dataset you train on determines almost everything about how a model behaves — and most of its legal and ethical risk is set the moment the data is collected. The AI Training Data Audit Checklist walks you through the controls that responsible AI teams apply to a dataset before it is ever fed into training.

Why training data audits need to happen before training

The most expensive data problems are the ones discovered after a model is already deployed. A biased training set produces biased outputs at scale; a dataset containing data from people who never consented creates ongoing legal exposure every time the model runs; a poisoned sample can embed a backdoor that survives fine-tuning. None of these can be reliably fixed by adjusting the model after the fact. They require going back to the data.

An audit before training is not just a compliance exercise. It is the earliest and cheapest point to catch problems that would otherwise cost orders of magnitude more to remediate. This checklist is structured to surface those problems before they are baked in.

How it works

Tell the tool the dataset type, how it was collected, and the intended use case. It then presents a structured checklist grouped into seven sections: consent and legal basis, representation and bias, quality and freshness, security and poisoning risk, PII and data minimization, opt-out and data-subject rights, and documentation and governance.

Each item is a concrete, verifiable control — “robots.txt opt-out signals were honoured during collection,” “demographic composition is measured and recorded,” “a tamper-evident hash of the final dataset is stored.” As you tick items off, a running coverage score shows how much of the audit is complete. The tool also surfaces context-specific warnings, for example flagging biometric and likeness concerns for image datasets or source-terms and poisoning risk for web-scraped data.
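The tool's exact scoring formula is not documented here. A minimal sketch, assuming the coverage score is simply the share of ticked items, might look like the following. `ChecklistItem`, `coverage`, and `coverage_by_section` are hypothetical names, and the last control in the sample list is invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    section: str        # one of the checklist sections
    control: str        # the verifiable statement being checked
    done: bool = False  # True once the item has been ticked


def coverage(items: list[ChecklistItem]) -> float:
    """Return the percentage of items ticked, or 0.0 for an empty list."""
    if not items:
        return 0.0
    return 100.0 * sum(item.done for item in items) / len(items)


def coverage_by_section(items: list[ChecklistItem]) -> dict[str, float]:
    """Group items by section and compute coverage for each group."""
    grouped: dict[str, list[ChecklistItem]] = {}
    for item in items:
        grouped.setdefault(item.section, []).append(item)
    return {section: coverage(group) for section, group in grouped.items()}


items = [
    ChecklistItem("Opt-out and data-subject rights",
                  "robots.txt opt-out signals were honoured during collection",
                  done=True),
    ChecklistItem("Representation and bias",
                  "demographic composition is measured and recorded"),
    ChecklistItem("Security and poisoning risk",
                  "a tamper-evident hash of the final dataset is stored",
                  done=True),
    # Assumed control, added only to show a partially complete section.
    ChecklistItem("Security and poisoning risk",
                  "provenance is recorded for every source"),
]

print(f"Overall coverage: {coverage(items):.0f}%")
for section, score in coverage_by_section(items).items():
    print(f"  {section}: {score:.0f}%")
```

A per-section breakdown is useful because an overall score can hide a section that has not been started at all.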

The sections that get skipped most often

Training-specific consent is the section most frequently missed. Consent given by users for a service, such as agreeing to terms of use, typically does not extend to training AI models on that data. Obtaining separate, specific consent for training use is the correct approach, and it is what regulators and courts are increasingly scrutinising.

Opt-out compliance is the second most commonly skipped. Using data collected from sources that signalled an opt-out through robots.txt, API terms, or licence restrictions such as Creative Commons conditions creates a legal risk that grows with the size and profile of your model.
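One way to check the robots.txt part of this is with the Python standard library's `urllib.robotparser`. The sketch below is illustrative only: the robots.txt content, the crawler names `ExampleTrainingBot` and `OtherBot`, and the helper `allowed_urls` are all assumptions, and a real audit would fetch each site's actual robots.txt as it stood at collection time. It does not cover API terms or licence conditions, which need a separate review.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is blocked everywhere,
# every other crawler is blocked only from /private/.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed_urls(robots_txt: str, user_agent: str, urls: list[str]) -> list[str]:
    """Return only the URLs that robots_txt permits user_agent to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


urls = [
    "https://example.com/articles/1",
    "https://example.com/private/notes",
]

print(allowed_urls(ROBOTS_TXT, "ExampleTrainingBot", urls))
print(allowed_urls(ROBOTS_TXT, "OtherBot", urls))
```

Running the same filter over the URL list of an already collected dataset shows which records came from paths that had signalled an opt-out.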

Data poisoning checks are often dismissed as theoretical, but untrusted web-scraped data is a realistic attack surface. Checking provenance and maintaining tamper-evident hashes are the minimum practical defence.
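A hash-based integrity check can be sketched with `hashlib` from the Python standard library. This is one possible approach under stated assumptions, not the tool's own method: the dataset is assumed to be a directory of files, SHA-256 is an assumed choice of algorithm, and `file_digest`, `dataset_manifest`, and `manifest_digest` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of one file, read in chunks so large files are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dataset_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def manifest_digest(manifest: dict[str, str]) -> str:
    """One digest over the sorted manifest; it changes if any file is
    added, removed, renamed, or modified."""
    digest = hashlib.sha256()
    for name in sorted(manifest):
        digest.update(f"{name}\0{manifest[name]}\n".encode("utf-8"))
    return digest.hexdigest()
```

Recomputing `manifest_digest(dataset_manifest(root))` before each training run and comparing it with the stored value reveals any change to the files. The stored digest is only tamper-evident if it is kept somewhere an attacker who can modify the dataset cannot also modify, for example a separate access-controlled system.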

The datasheet is a written record of what data is in the dataset, where it came from, why it was chosen, and what known limitations it has. It is the single most useful governance artefact, and the one that most often does not exist until an audit or regulatory request forces the question.
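A datasheet does not need special tooling. As a rough sketch, each source can be recorded in a small structure and written out as JSON alongside the dataset. The field names, the `SourceRecord` class, the `datasheet_json` helper, and the sample values below are all assumptions chosen to mirror the description above, not a standard schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class SourceRecord:
    name: str           # what the data is
    origin: str         # where it came from
    licence: str        # licence or terms it was obtained under
    collected_on: date  # collection date
    rationale: str      # why it was chosen
    known_limitations: list[str] = field(default_factory=list)


def datasheet_json(dataset_name: str, sources: list[SourceRecord]) -> str:
    """Serialise the datasheet to JSON, with dates in ISO 8601 form."""
    payload = {
        "dataset": dataset_name,
        "sources": [
            {**asdict(source), "collected_on": source.collected_on.isoformat()}
            for source in sources
        ],
    }
    return json.dumps(payload, indent=2, sort_keys=True)


# Hypothetical example entry.
sources = [
    SourceRecord(
        name="support-forum-posts",
        origin="export from an internal support forum",
        licence="CC BY 4.0",
        collected_on=date(2024, 1, 15),
        rationale="covers product questions in users' own words",
        known_limitations=["English only", "skews towards power users"],
    )
]

print(datasheet_json("demo-dataset", sources))
```

Keeping the file in version control next to the dataset manifest gives an audit trail of when each source was added or changed.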

Tips and notes

Treat every unchecked item as an open risk with a named owner, not a box to be quietly skipped. The most commonly neglected controls are training-specific consent (consent for a service rarely covers model training), opt-out handling after collection, and a maintained datasheet that records each source’s licence and collection date.

This is general guidance rather than legal advice, and obligations differ sharply by jurisdiction and data category. Pair the audit with a data protection impact assessment and a review by someone who knows your regulatory environment. Your selections stay in the browser and are never uploaded, so you can use the checklist while auditing real, sensitive datasets.