Question 1

What exactly is model collapse?

Accepted Answer

Model collapse is the degradation that occurs when generative models are trained repeatedly on data produced by earlier generations of models rather than on original human data. With each generation the model loses more of the rare, low-probability tails of the distribution and converges toward bland, less diverse, and sometimes nonsensical output. It is a gradual loss of fidelity caused by learning from copies of copies.

Question 2

Why does training on AI output cause degradation?

Accepted Answer

Each model is an imperfect approximation of the data it learned from, so its outputs slightly under-represent rare cases and amplify common ones. Train the next model on those outputs and the errors compound: the tails of the distribution shrink, variance collapses, and after several rounds the model represents an increasingly narrow, distorted version of reality.
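
The compounding can be illustrated with a toy simulation. The sketch below is only an illustration under strong assumptions, not how any real generative model is trained: the "model" is a one-dimensional Gaussian, "training" is a maximum-likelihood fit of mean and standard deviation, and each generation sees only a small sample drawn from the previous generation's fit. The function name simulate_collapse and its parameters are hypothetical, chosen for this example.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Refit a Gaussian to its own samples, generation after generation.

    Returns the fitted standard deviation per generation; index 0 is the
    original "human" distribution N(0, 1). Toy model, assumed for illustration.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the original data distribution
    history = [sigma]
    for _ in range(generations):
        # "Generate" a training set from the current model only.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # "Train" the next model: maximum-likelihood mean and spread.
        # The finite sample tends to under-represent the tails, so on
        # average the fitted spread shrinks a little every round.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    for g in (0, 10, 50, 100, 200):
        print(f"generation {g:3d}: fitted std = {history[g]:.6f}")
```

Any single round changes the spread only slightly and can even increase it by chance, but the estimation error is never corrected against the original data, so over many rounds the fitted standard deviation drifts toward zero. Larger samples per generation slow the drift but do not remove it in this toy setup.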

Question 3

Is model collapse already happening to today's AI?

Accepted Answer

Model collapse has been demonstrated in controlled experiments, but there is no proof that current frontier models have collapsed. The concern is real because a growing share of web content is AI-generated and can contaminate future training data. Labs mitigate this with data curation, provenance filtering, and mixing in fresh human data, so it is a managed risk rather than an inevitability.

Question 4

How can model collapse be prevented?

Accepted Answer

The main defences are keeping a strong supply of genuine human data, filtering or detecting synthetic data before training, tracking data provenance, and using synthetic data deliberately and in moderation rather than indiscriminately scraping an increasingly AI-polluted web. As a safeguard, high-quality human data becomes more valuable, not less.
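
A toy simulation can illustrate why a steady supply of human data acts as a safeguard. This is a sketch under simplifying assumptions, not a description of any lab's pipeline: the "model" is a one-dimensional Gaussian refit by maximum likelihood each generation, the "human" data are fresh draws from the original N(0, 1), and real_fraction is a hypothetical parameter for the share of each training set that comes from that original source.

```python
import random
import statistics


def final_spread(real_fraction, generations=400, sample_size=20, seed=0):
    """Fitted std after many generations of training on mixed data.

    Each generation's training set combines fresh draws from the original
    N(0, 1) ("human" data) with samples from the previous model
    ("synthetic" data). Toy model, assumed for illustration only.
    """
    rng = random.Random(seed)
    n_real = round(real_fraction * sample_size)
    mu, sigma = 0.0, 1.0  # start from the original distribution
    for _ in range(generations):
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        synthetic = [rng.gauss(mu, sigma) for _ in range(sample_size - n_real)]
        data = real + synthetic
        # Refit the Gaussian on the mixed training set.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
    return sigma


if __name__ == "__main__":
    for fraction in (0.0, 0.1, 0.5):
        spread = final_spread(fraction)
        print(f"real_fraction = {fraction:.1f}: final fitted std = {spread:.4f}")
```

With real_fraction set to 0.0 the fitted spread decays toward zero, while a fixed share of fresh original data in every generation keeps re-anchoring the fit, so the spread stays close to that of the original distribution instead of collapsing. The exact numbers depend on the seed and sample size.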

What Is Model Collapse? Why AI Training on AI-Generated Data Fails

What model collapse is

Why it happens

The research evidence

Why it matters and how it is mitigated