AI training data license checker
“Open” does not mean “free to do anything with.” A dataset under CC BY-NC is widely treated as off-limits for training a model you sell; a CC BY-SA corpus may carry share-alike obligations into your model or its outputs; a CC BY source mainly requires proper credit. This checker lets you list every training source, pick its license, and see at once which sources are clear, which need care, and which block commercial use.
How it works
For each source you select a license from the common open data and content licenses — public domain dedications, permissive licenses, attribution-only, non-commercial, and share-alike variants. The tool combines that with your use case (commercial vs. research, redistributed vs. internal) and applies the standard obligations of each license. Non-commercial licenses flag as blocking for a monetised product; share-alike licenses flag for attention because their reach into model weights and outputs is contested; attribution licenses flag as clear-with-obligations. The result is a per-source rating plus a summary so you know your overall exposure at a glance.
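The rating rules described above can be sketched in a few lines of Python. This is an illustration only, not the tool's actual code: the names `Terms`, `LICENSES`, `rate`, and `summarize` are hypothetical, the license table is a small sample keyed by SPDX-style identifiers, and rating public-domain sources as "clear" and unknown licenses as "needs-attention" are assumptions. The redistributed vs. internal dimension is left out because the text does not say how it changes a rating.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Terms:
    """Obligations a license carries (simplified for illustration)."""
    attribution: bool = False
    non_commercial: bool = False
    share_alike: bool = False


# Hypothetical sample table; a real checker would cover many more licenses.
LICENSES = {
    "CC0-1.0": Terms(),
    "CC-BY-4.0": Terms(attribution=True),
    "CC-BY-NC-4.0": Terms(attribution=True, non_commercial=True),
    "CC-BY-SA-4.0": Terms(attribution=True, share_alike=True),
    "CC-BY-NC-SA-4.0": Terms(attribution=True, non_commercial=True, share_alike=True),
    "ODbL-1.0": Terms(attribution=True, share_alike=True),
}

# Ratings ordered from least to most severe, used to pick the overall exposure.
SEVERITY = ["clear", "clear-with-obligations", "needs-attention", "blocking"]


def rate(license_id: str, commercial: bool) -> str:
    """Rate one source for the given use case."""
    terms = LICENSES.get(license_id)
    if terms is None:
        # Assumption: an unrecognised license is flagged for manual review.
        return "needs-attention"
    if terms.non_commercial and commercial:
        return "blocking"
    if terms.share_alike:
        return "needs-attention"
    if terms.attribution:
        return "clear-with-obligations"
    return "clear"


def summarize(sources: dict[str, str], commercial: bool) -> dict:
    """Return per-source ratings plus the most severe rating overall."""
    ratings = {name: rate(lic, commercial) for name, lic in sources.items()}
    overall = max(ratings.values(), key=SEVERITY.index, default="clear")
    return {"ratings": ratings, "overall": overall}


report = summarize(
    {"wiki": "CC-BY-SA-4.0", "photos": "CC-BY-NC-4.0", "gov": "CC0-1.0"},
    commercial=True,
)
```

In this sketch the CC BY-NC source makes the overall result "blocking" for a commercial product; with `commercial=False` the same source drops to "clear-with-obligations" and the share-alike source becomes the most severe item.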
Tips and notes
- The license text sets the terms, not the “open” label. Read the specific terms behind each license rather than assuming “open” means “anything goes.”
- Treat NC as blocking for products. Training a monetised model on non-commercial data is widely considered a commercial use.
- Document required attributions now. Keep the list with your data records so credit obligations don’t get lost before launch.
- This is triage, not legal advice. AI-training license law is unsettled; get counsel on anything flagged before you ship.
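One way to keep the attribution tip above actionable is to store credit details next to each data record and generate the credit list from them. The sketch below assumes a hypothetical `SourceRecord` shape and a hand-maintained `REQUIRES_CREDIT` set; the title, author, source, license ordering follows the attribution practice Creative Commons recommends, and the example entries and URLs are placeholders.

```python
from dataclasses import dataclass

# Assumption: SPDX-style identifiers of licenses that require credit.
REQUIRES_CREDIT = {"CC-BY-4.0", "CC-BY-SA-4.0", "CC-BY-NC-4.0", "ODC-By-1.0"}


@dataclass(frozen=True)
class SourceRecord:
    """Hypothetical record kept alongside each training source."""
    title: str
    author: str
    url: str
    license_id: str


def attribution_lines(records):
    """Build one credit line per source whose license requires attribution."""
    return [
        f'"{r.title}" by {r.author}, {r.url}, licensed under {r.license_id}'
        for r in sorted(records, key=lambda r: r.title.lower())
        if r.license_id in REQUIRES_CREDIT
    ]


# Placeholder entries for illustration only.
records = [
    SourceRecord("Street Photos", "A. Example", "https://example.com/photos", "CC-BY-4.0"),
    SourceRecord("Census Tables", "Example Agency", "https://example.com/census", "CC0-1.0"),
]

for line in attribution_lines(records):
    print(line)
```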