AI Models Collapse When Trained on Recursively Generated Data

Based on: Nature (2024), Volume 631, Pages 755–759

Original paper: https://www.nature.com/articles/s41586-024-07566-y

The Core Finding

A 2024 study published in Nature identifies a fundamental limitation of modern AI systems:
when language models are repeatedly trained on data generated by other AI models, their performance degrades from one generation to the next.

Instead of improving over time, each successive model loses diversity, nuance, and rare statistical patterns.
What emerges is a kind of “feedback loop of imitation,” where models increasingly learn from distorted reflections of reality.

What “Model Collapse” Means

The process can be understood as a recursive training loop:

1. A model is trained on real-world data
2. It generates synthetic text
3. That synthetic text is reused to train future models
4. Each iteration reduces informational diversity

Over time, rare and complex information disappears first, leaving behind increasingly generic and homogenized outputs.
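
The loop described above can be illustrated with a toy simulation. This is a minimal sketch, not the paper's experimental setup: it assumes a hypothetical "model" that does nothing more than memorize token frequencies, and the token names, corpus size, and number of generations are invented for illustration.

```python
import random
from collections import Counter


def train(corpus):
    """'Train' a toy model: estimate token probabilities by counting."""
    counts = Counter(corpus)
    total = len(corpus)
    return {token: n / total for token, n in counts.items()}


def generate(model, n_samples, rng):
    """Sample a synthetic corpus from the model's estimated distribution."""
    tokens = list(model)
    weights = [model[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=n_samples)


def simulate(n_generations=50, corpus_size=200, seed=0):
    """Return the number of distinct tokens each generation's model knows."""
    rng = random.Random(seed)
    # Hypothetical "real" data: a few common tokens plus a long tail of rare ones.
    true_dist = {f"common_{i}": 0.18 for i in range(5)}
    true_dist.update({f"rare_{i}": 0.005 for i in range(20)})

    corpus = generate(true_dist, corpus_size, rng)  # step 1: real-world data
    support_sizes = []
    for _ in range(n_generations):
        model = train(corpus)                        # fit on the current corpus
        support_sizes.append(len(model))
        corpus = generate(model, corpus_size, rng)   # steps 2-3: synthetic data feeds the next model
    return support_sizes


if __name__ == "__main__":
    print(simulate())
```

In this sketch, a token that fails to appear in one generation's corpus is assigned probability zero and can never return, so the count of distinct tokens can only shrink or stay flat. Low-probability tokens are the most likely to be missed in any finite sample, which mirrors the claim that rare information disappears first.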

Why This Matters

This issue becomes especially relevant as a growing share of online content is generated or rewritten by AI systems.
If future training datasets contain a large proportion of synthetic content, models may begin to learn primarily from themselves.

In that scenario, data quality—not just data quantity—becomes the limiting factor for progress in machine learning.

Interpretation: The Library of Babel

The phenomenon resembles Jorge Luis Borges’ short story “The Library of Babel.”

In Borges’ story, a seemingly infinite library contains every possible combination of letters, so true, false, and meaningless texts all exist side by side.

Similarly, recursive AI training risks creating a data environment where signal and noise blur together,
and the model’s internal representation of reality gradually loses grounding in the original source of truth.