Large language models (LLMs) like those produced by OpenAI and Google need extensive training data to function. Because fresh human-written data may run short for training future iterations, some have proposed training new AI systems on the outputs of older ones. Research shows, however, that this can lead to “model collapse”: AI models trained on their own outputs gradually degrade until they produce meaningless gibberish.
Researchers tested this by building their own language model and training nine successive generations, each on the output of the previous one. The result was surrealist-sounding gibberish that bore no relation to the original text, a process in which the model becomes “poisoned with its own projection of reality.” Model collapse unfolds in two stages. Early collapse occurs when models forget outliers in the training data, reducing the diversity of their outputs. Late collapse occurs when models trained on other models forget key aspects of the original training data and generate complete gibberish.
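To make the generational setup concrete, here is a minimal toy sketch, not the researchers’ actual pipeline (which used full neural language models): a simple bigram model is repeatedly retrained on its own samples, and its vocabulary, a crude proxy for output diversity, tends to shrink generation by generation. The corpus path `original_corpus.txt` is a hypothetical placeholder for any large body of human-written text.

```python
import random
from collections import defaultdict, Counter

def train_bigram(text):
    """Fit a toy bigram model: for each word, count which words follow it."""
    words = text.split()
    model = defaultdict(Counter)
    for a, b in zip(words, words[1:]):
        model[a][b] += 1
    return model

def sample(model, length=200, seed=None):
    """Generate text by repeatedly sampling the next word from the bigram counts."""
    rng = random.Random(seed)
    word = rng.choice(list(model.keys()))
    out = [word]
    for _ in range(length - 1):
        followers = model.get(word)
        if not followers:
            word = rng.choice(list(model.keys()))  # dead end: restart from a random word
        else:
            choices, weights = zip(*followers.items())
            word = rng.choices(choices, weights=weights)[0]
        out.append(word)
    return " ".join(out)

# Hypothetical placeholder: any large human-written corpus.
human_text = open("original_corpus.txt").read()

model = train_bigram(human_text)
for generation in range(1, 10):                 # nine generations, as in the summary
    synthetic = sample(model, length=5000, seed=generation)
    model = train_bigram(synthetic)             # train only on the previous model's output
    print(f"generation {generation}: vocabulary size = {len(model)}")
```

In this toy setting, rare words (the outliers) are the least likely to be sampled and so are the first to disappear from the next generation’s training data, which mirrors the early-collapse behavior described above.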
The researchers argue that this cascading degradation and eventual model collapse are inevitable for large models trained on their own data. One suggested safeguard is to preserve original human text and watermark AI-generated content so its provenance is clear, though watermarking is harder for text than for images. Another is to vet material thoroughly for signs of AI generation and rely on high-quality human training data, so that AI-generated content flooding the internet does not end up back in training sets. Technology companies are also working on a “content credential” badge indicating whether content was produced by a machine, so that it can be tagged and skipped during AI training.
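As a rough illustration of how such a badge could be used during data curation, here is a minimal sketch assuming each training document carries provenance metadata. The `Document` record and its `ai_generated` flag are hypothetical stand-ins for a content credential or watermark check, not any specific company’s schema.

```python
from dataclasses import dataclass

@dataclass
class Document:
    text: str
    ai_generated: bool  # hypothetical flag, e.g. set from a content-credential badge

# Hypothetical corpus mixing human and machine-produced material.
corpus = [
    Document("An essay written by a person.", ai_generated=False),
    Document("Text emitted by a chatbot.", ai_generated=True),
    Document("A human-edited news report.", ai_generated=False),
]

# Keep only human-authored material for the next round of training.
training_set = [doc.text for doc in corpus if not doc.ai_generated]
print(f"kept {len(training_set)} of {len(corpus)} documents for training")
```

The filtering step itself is trivial; the hard part, as the article notes, is reliably setting that flag in the first place, especially for text.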
The whytry.ai article you just read is a brief synopsis; the original article can be found here: Read the Full Article…