The dataset you train on determines almost everything about how a model behaves, and most of the legal and ethical risk attached to that data is set the moment it is collected. The AI Training Data Audit Checklist walks you through the controls that responsible AI teams apply to a dataset before it is ever fed into training.
How it works
Tell the tool the dataset type, how it was collected, and the intended use case. It then presents a structured checklist grouped into seven sections: consent and legal basis, representation and bias, quality and freshness, security and poisoning risk, PII and data minimization, opt-out and data-subject rights, and documentation and governance.
Each item is a concrete, verifiable control — “robots.txt opt-out signals were honoured during collection,” “demographic composition is measured and recorded,” “a tamper-evident hash of the final dataset is stored.” As you tick items off, a running coverage score shows how much of the audit is complete. The tool also surfaces context-specific warnings, for example flagging biometric and likeness concerns for image datasets or source-terms and poisoning risk for web-scraped data.
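To make the coverage score and the context-specific warnings concrete, here is a minimal Python sketch. It is not the tool's actual code: the `Item` class, the `coverage` and `warnings_for` functions, the input labels `"image"` and `"web-scrape"`, the rounding rule, and the assignment of each example item to a section are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One checklist control. The field names are hypothetical."""
    section: str
    text: str
    checked: bool = False


def coverage(items):
    """Share of ticked items as a whole-number percentage (assumed formula)."""
    if not items:
        return 0
    return round(100 * sum(item.checked for item in items) / len(items))


def warnings_for(dataset_type, collection_method):
    """Return warnings matching the two examples given in the description.

    The labels "image" and "web-scrape" are assumed input values.
    """
    notes = []
    if dataset_type == "image":
        notes.append("Review biometric and likeness concerns.")
    if collection_method == "web-scrape":
        notes.append("Review source terms and poisoning risk.")
    return notes


# Section assignments below are guesses made for this example.
items = [
    Item("consent and legal basis",
         "robots.txt opt-out signals were honoured during collection", True),
    Item("representation and bias",
         "demographic composition is measured and recorded", True),
    Item("security and poisoning risk",
         "a tamper-evident hash of the final dataset is stored", True),
    Item("documentation and governance",
         "a datasheet records each source's licence and collection date"),
]

print(coverage(items))                      # 75
print(warnings_for("image", "web-scrape"))  # both warnings
```

With three of four items ticked the sketch reports 75 percent. A real scorer might weight sections differently or exclude items that do not apply to the chosen dataset type.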
Tips and notes
Treat every unchecked item as an open risk with a named owner, not a box to be quietly skipped. The most commonly neglected controls are training-specific consent (consent for a service rarely covers model training), opt-out handling after collection, and a maintained datasheet that records each source’s licence and collection date.
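A maintained datasheet can be as simple as one record per source. The Python sketch below shows one possible shape, assuming hypothetical names (`SourceRecord`, `incomplete_sources`, `digest_chunks`) and made-up example sources. It also computes a SHA-256 digest of the kind the hash control refers to, using the standard `hashlib` module.

```python
import hashlib
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass
class SourceRecord:
    """One datasheet row. The field names are hypothetical."""
    name: str
    licence: Optional[str]
    collected_on: Optional[date]


def incomplete_sources(sources: List[SourceRecord]) -> List[str]:
    """Names of sources missing a licence or a collection date."""
    return [
        s.name for s in sources
        if not s.licence or s.collected_on is None
    ]


def digest_chunks(chunks: Iterable[bytes]) -> str:
    """SHA-256 hex digest over the dataset bytes, fed chunk by chunk."""
    h = hashlib.sha256()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()


# Example sources are invented for illustration only.
sources = [
    SourceRecord("forum-dump", "CC BY-SA 4.0", date(2024, 3, 1)),
    SourceRecord("vendor-feed", None, date(2024, 5, 12)),
    SourceRecord("legacy-crawl", "unknown terms", None),
]

print(incomplete_sources(sources))  # ['vendor-feed', 'legacy-crawl']
print(digest_chunks([b"row 1\n", b"row 2\n"]))
```

Each name returned by `incomplete_sources` is an open risk that needs an owner. A digest is only tamper-evident if it is stored somewhere the people who can modify the dataset cannot also change it.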
This is general guidance rather than legal advice, and obligations differ sharply by jurisdiction and data category. Pair the audit with a data protection impact assessment and a review by someone who knows your regulatory environment. Your selections stay in the browser and are never uploaded, so you can safely record answers about real, sensitive datasets.