Advancements in artificial intelligence (AI) have hit a roadblock as AI companies struggle to gather high-quality training data, according to recent reports from The Wall Street Journal and The New York Times. The methods AI companies have used to address this shortage have raised questions about how enforceable copyright and privacy law are when applied to AI. OpenAI, for example, reportedly transcribed over a million hours of YouTube videos to train its advanced language model, GPT-4, despite potential legal concerns. Furthermore, The New York Times is said to have heard recordings of Meta employees discussing their unauthorized use of copyrighted materials in a desperate attempt to compete with OpenAI. Having exhausted most other public internet sources, Meta is now reportedly considering paying for book licenses or even purchasing major publishers outright.
Google’s response to OpenAI’s scraping activities and Meta’s problems in securing training data have only added to the debate over the legitimacy of AI training data. A few solutions to the shrinking supply of training data have been proposed, including training models on synthetic data and implementing curriculum learning, although neither approach has been validated. The use of unauthorized data by companies has led to multiple lawsuits, underscoring the legal risk of alleged copyright violations. As a result, some AI companies are beginning to enter into agreements to use or purchase copyrighted materials from content owners, either for cash or in exchange for AI technology or services.