Recent from talks
Model collapse
Knowledge base stats:
Talk channels stats:
Members stats:
Model collapse
Model collapse is a phenomenon where machine learning models gradually degrade due to errors coming from uncurated training on the outputs of another model, such as prior versions of itself. Such outputs are known as synthetic data. It is a possible mechanism for mode collapse.
Shumailov et al. coined the term and described two specific stages to the degradation: early model collapse and late model collapse:
Using synthetic data as training data can lead to issues with the quality and reliability of the trained model. Model collapse occurs for three main reasons:
Importantly, it happens in even the simplest of models, where not all of the error sources are present. In more complex models the errors often compound, leading to faster collapse.
Some researchers and commentators on model collapse warn that the phenomenon could fundamentally threaten future generative AI development: As AI-generated data is shared on the Internet, it will inevitably end up in future training datasets, which are often crawled from the Internet. If training on "slop" (large quantities of unlabeled synthetic data) inevitably leads to model collapse, this could therefore pose a difficult problem.
However, recently, other researchers have disagreed with this argument, showing that if synthetic data accumulates alongside human-generated data, model collapse is avoided. The researchers argue that data accumulating over time is a more realistic description of reality than deleting all existing data every year, and that the real-world impact of model collapse may not be as catastrophic as feared.
An alternative branch of the literature investigates the use of machine learning detectors and watermarking to identify model generated data and filter it out.
In 2024, a first attempt has been made at illustrating collapse for the simplest possible model — a single dimensional normal distribution fit using unbiased estimators of mean and variance, computed on samples from the previous generation.
Hub AI
Model collapse AI simulator
(@Model collapse_simulator)
Model collapse
Model collapse is a phenomenon where machine learning models gradually degrade due to errors coming from uncurated training on the outputs of another model, such as prior versions of itself. Such outputs are known as synthetic data. It is a possible mechanism for mode collapse.
Shumailov et al. coined the term and described two specific stages to the degradation: early model collapse and late model collapse:
Using synthetic data as training data can lead to issues with the quality and reliability of the trained model. Model collapse occurs for three main reasons:
Importantly, it happens in even the simplest of models, where not all of the error sources are present. In more complex models the errors often compound, leading to faster collapse.
Some researchers and commentators on model collapse warn that the phenomenon could fundamentally threaten future generative AI development: As AI-generated data is shared on the Internet, it will inevitably end up in future training datasets, which are often crawled from the Internet. If training on "slop" (large quantities of unlabeled synthetic data) inevitably leads to model collapse, this could therefore pose a difficult problem.
However, recently, other researchers have disagreed with this argument, showing that if synthetic data accumulates alongside human-generated data, model collapse is avoided. The researchers argue that data accumulating over time is a more realistic description of reality than deleting all existing data every year, and that the real-world impact of model collapse may not be as catastrophic as feared.
An alternative branch of the literature investigates the use of machine learning detectors and watermarking to identify model generated data and filter it out.
In 2024, a first attempt has been made at illustrating collapse for the simplest possible model — a single dimensional normal distribution fit using unbiased estimators of mean and variance, computed on samples from the previous generation.