Where Do AI Models Get Their Training Data?

What web data, books, and code go into training LLMs

The raw material: web crawls

The single largest ingredient in most large language models is the open web, captured by crawlers. The best-known source is Common Crawl, a free, public archive that has collected billions of web pages over more than a decade. But no serious model trains on raw Common Crawl directly: it is full of spam, navigation menus, duplicate pages, and machine-generated junk. Builders run extensive filtering and deduplication first, often discarding the large majority of the raw bytes and keeping only the cleaner, more informative text. The web provides scale and breadth; the cleaning provides quality.
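To make the filtering and deduplication step concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular lab does it: the function names, the word-count threshold, and the letter-ratio threshold are all hypothetical, and real pipelines typically add near-duplicate detection and learned quality classifiers on top of simple rules like these.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def looks_informative(text: str, min_words: int = 20,
                      min_alpha_ratio: float = 0.7) -> bool:
    """Hypothetical quality heuristic: enough words, mostly letters and spaces.

    Both thresholds are illustrative assumptions, not published values.
    """
    words = text.split()
    if len(words) < min_words:
        return False  # too short: likely a menu, button label, or stub
    alpha = sum(ch.isalpha() or ch.isspace() for ch in text)
    return alpha / len(text) >= min_alpha_ratio


def clean_corpus(pages):
    """Keep pages that pass the heuristic, dropping exact duplicates."""
    seen = set()
    kept = []
    for page in pages:
        norm = normalize(page)
        if not looks_informative(norm):
            continue
        # Hash the normalized text so copies that differ only in case or
        # whitespace are treated as the same page.
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(page)
    return kept
```

Run on a toy list containing an article, a copy of it with different casing, a navigation menu, and a string of symbols, this keeps only the first article. The point of the sketch is the shape of the pipeline (normalize, filter, hash, deduplicate), not the specific thresholds.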

Curated high-quality sources

On top of the filtered web, builders layer sources known to be dense with reliable information and well-formed language. These typically include Wikipedia, large collections of books, news and reference material, and curated question-answer or instruction datasets. Such sources are often “upweighted” (shown to the model more often than their share of the corpus would suggest) because a paragraph of an encyclopedia tends to teach more reliable knowledge than a paragraph of forum chatter. This curation is a major reason two models trained on similar amounts of data can differ sharply in factual accuracy and tone.
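Upweighting is easiest to see as arithmetic on sampling probabilities. The sketch below is a hypothetical illustration: the source names, token counts, and the 3x multiplier are made-up numbers chosen for round results, not figures from any real training run.

```python
def mixture_probabilities(token_counts, multipliers):
    """Return the probability of sampling from each source.

    token_counts: tokens available per source (illustrative numbers).
    multipliers: how many times more often a source is shown than its
                 raw size implies; sources not listed default to 1.0.
    """
    weighted = {
        name: count * multipliers.get(name, 1.0)
        for name, count in token_counts.items()
    }
    total = sum(weighted.values())
    return {name: weight / total for name, weight in weighted.items()}


# Made-up corpus: the encyclopedia is 10% of the raw tokens.
counts = {"filtered_web": 900, "encyclopedia": 100}
probs = mixture_probabilities(counts, {"encyclopedia": 3.0})
print(probs)  # {'filtered_web': 0.75, 'encyclopedia': 0.25}
```

With a 3x multiplier, the encyclopedia goes from 10% of the raw corpus to 25% of what the model sees during training, even though no new text was added.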

Code, math, and specialised corpora

Modern models are deliberately trained on large amounts of source code, scraped from public repositories like those on GitHub, along with mathematics, scientific papers, and structured data. This is not only to make the model good at programming. Training on code and math appears to improve general reasoning and the ability to follow precise, multi-step instructions, because those domains demand exact, logically structured output. A model’s coding and reasoning strength often tracks how much high-quality code and math it saw during pretraining.

Synthetic and licensed data

Two newer ingredients are reshaping the mix. The first is synthetic data — text generated by other models to teach specific skills such as reasoning, instruction-following, or safe refusals — which is growing because high-quality human text is a finite resource. The second is licensed data, where builders pay publishers, forums, or media companies for the right to train on their content, partly to improve quality and partly to reduce copyright risk. Both trends signal a shift from “scrape everything” toward more deliberate, accountable data sourcing.

The tradeoffs that matter

Training data choices ripple through everything the model does. Skewed or low-quality data produces bias, factual errors, and gaps in coverage — a model knows little about topics poorly represented in its corpus. Copyright is an open legal question, prompting opt-outs, licensing deals, and lawsuits. And the looming constraint is exhaustion: the supply of fresh, high-quality human text is limited, which is why synthetic data and careful curation now matter as much as sheer scale. Understanding where a model’s knowledge comes from is the first step to understanding both its strengths and its blind spots.
