Question 1

Is everything on the internet used to train AI models?

Accepted Answer

No. Most models train on a filtered subset of web crawls like Common Crawl, not the entire internet. Builders remove spam, adult content, duplicate pages, and low-quality text, then often weight high-quality sources like Wikipedia and books more heavily. The raw web is the starting material, not the final dataset.
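The filter-then-weight idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the quality heuristic, the `SOURCE_WEIGHTS` values, and the function names are hypothetical and are not taken from any real pipeline, which would rely on trained classifiers and far larger rule sets.

```python
import random

# Hypothetical sampling weights per source; real values are tuned by each builder.
SOURCE_WEIGHTS = {"wikipedia": 3.0, "books": 2.0, "web": 1.0}


def looks_low_quality(text: str) -> bool:
    """Toy heuristic: flag very short pages or pages dominated by one repeated word."""
    words = text.split()
    if len(words) < 5:
        return True
    most_common = max(words.count(w) for w in set(words))
    return most_common / len(words) > 0.5


def filter_corpus(docs: list[dict]) -> list[dict]:
    """Drop low-quality pages and exact duplicates (after case/whitespace normalisation)."""
    seen = set()
    kept = []
    for doc in docs:
        key = " ".join(doc["text"].lower().split())
        if looks_low_quality(doc["text"]) or key in seen:
            continue
        seen.add(key)
        kept.append(doc)
    return kept


def sample_batch(docs: list[dict], k: int, seed: int = 0) -> list[dict]:
    """Draw k documents with replacement, favouring higher-weighted sources."""
    rng = random.Random(seed)
    weights = [SOURCE_WEIGHTS.get(d["source"], 1.0) for d in docs]
    return rng.choices(docs, weights=weights, k=k)
```

Calling `filter_corpus` on the raw crawl and then `sample_batch` on the result mirrors the point of the answer: the crawl is the input, and the training set is whatever survives filtering and weighting.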

Question 2

Do AI models memorise their training data?

Accepted Answer

Mostly no, but partly yes. Models learn statistical patterns rather than storing text verbatim, which is why they can generalise. However, content that appears many times in the training data (famous quotes, popular code, repeated boilerplate) can be memorised and reproduced word for word. Deduplication during data preparation reduces this, but memorisation remains a known and actively studied risk.
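To make the deduplication step concrete, here is a minimal sketch of near-duplicate removal that compares overlapping word trigrams with Jaccard similarity. The function names and the 0.7 threshold are assumptions chosen for illustration. The pairwise comparison is quadratic in the number of texts, so large-scale pipelines typically use approximate methods such as MinHash instead.

```python
def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping n-word sequences in the lowercased text."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(texts: list[str], threshold: float = 0.7) -> list[str]:
    """Keep a text only if it is not too similar to any text already kept."""
    kept = []
    kept_shingles = []
    for text in texts:
        s = shingles(text)
        if any(jaccard(s, other) >= threshold for other in kept_shingles):
            continue
        kept.append(text)
        kept_shingles.append(s)
    return kept
```

The fewer copies of a passage survive this step, the fewer times the model sees it, which is the mechanism by which deduplication lowers the chance of verbatim reproduction.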

Question 3

Why is copyright such a contested issue for training data?

Accepted Answer

Because web crawls inevitably include copyrighted text, images, and code whose authors never explicitly consented to this use. Whether training on such material counts as fair use (or fits a comparable exception in other legal systems) is an unsettled legal question being tested in courts in several countries. Some builders now license data, offer opt-outs, or restrict training to permissively licensed sources to reduce legal exposure.

Question 4

What is synthetic data and why is it growing?

Accepted Answer

Synthetic data is text generated by an existing AI model (or by simple programs) rather than written by humans. It is increasingly used because high-quality human text is finite and partly exhausted, and because synthetic examples can target specific skills like reasoning or coding. The risk is model collapse: quality degrades when models train too heavily on their own outputs.
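One simple guard against leaning too heavily on model outputs is to cap the share of synthetic examples in the training mix. The sketch below assumes a hypothetical `build_mix` helper and an arbitrary default cap of 0.3; neither comes from a real pipeline, and a sensible cap would have to be found empirically.

```python
import random


def build_mix(human: list, synthetic: list,
              max_synthetic_fraction: float = 0.3, seed: int = 0) -> list:
    """Combine all human examples with a capped number of synthetic ones."""
    if not 0 <= max_synthetic_fraction < 1:
        raise ValueError("max_synthetic_fraction must be in [0, 1)")
    # Solve s / (h + s) <= f for s, which gives s <= f * h / (1 - f).
    limit = int(max_synthetic_fraction * len(human) / (1 - max_synthetic_fraction))
    rng = random.Random(seed)
    chosen = rng.sample(synthetic, min(limit, len(synthetic)))
    mix = list(human) + chosen
    rng.shuffle(mix)
    return mix
```

Because the cap is expressed as a fraction of the final mix, adding more human text automatically allows more synthetic text without changing the ratio.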

Where Do AI Models Get Their Training Data?

The raw material: web crawls

Curated high-quality sources

Code, math, and specialised corpora

Synthetic and licensed data

The tradeoffs that matter