Can AI take advantage of content that is publicly available on the internet? What if that content is protected by copyright? The answer to the first question is clear: it already does. The second is more delicate, but everything indicates that AI companies are also using copyrighted content to train their models. What remains to be seen is whether this will have consequences.
Newspapers that accuse ChatGPT of stealing their content. Raw Story and Alternet are two online publications that, as reported by Reuters, sued OpenAI last February. According to the lawsuit, the company used thousands of their articles without permission to train its popular chatbot, ChatGPT. Not only that: they accuse OpenAI of reproducing their copyrighted content when users requested it from its AI model.
OpenAI wins a battle. For now, OpenAI can breathe easy. In the United States, a federal judge in New York, Colleen McMahon, stated in her ruling that these news outlets have not shown enough harm to support their lawsuit. She gave them the opportunity to appeal, but made it clear that she was “skeptical” that the outlets could “claim demonstrable harm.”
But not the war. The case is the latest in a string of copyright infringement lawsuits that organizations in the publishing, literary, musical and artistic industries, above all, are filing against artificial intelligence companies.
Lawsuits everywhere. In recent months we have seen lawsuits such as the one Getty filed against Stable Diffusion and the one affecting GitHub Copilot, as well as other legal threats such as those from The Authors Guild. David Holz, founder of Midjourney, admitted the following about training his model: “There’s not really a way to take a hundred million images and know where they come from. It would be nice if the images had embedded metadata about the copyright owner or something. But that doesn’t exist; there’s no record.”
The New York Times is on the prowl. These two publications join earlier legal challenges from especially powerful media and publishing groups. In February 2023, The Wall Street Journal, among others, had already expressed concern about the use of its content in AI models.
Even more notable was the lawsuit from The New York Times, which accused Microsoft and OpenAI of copyright infringement for this same type of activity. According to the lawsuit, millions of articles published by the NYT were used to train AI models. In April 2024, eight other newspapers sued those same companies for exactly the same reasons.
Zero transparency. Secrecy about the data sets used for training is total, both at OpenAI and at its competitors. They give hardly any details about this content, but their recent statements make it clear that they take advantage of everything they can.
But they need that material, OpenAI argues. Google explained that it can “collect publicly available information online” to train its AI models, Meta has long been using everything its users publish on Facebook and Instagram, and OpenAI went so far as to tell the British Parliament that “it would be impossible to train today’s leading AI models without using copyrighted materials.”
If you want to use my content, pay me. AI companies are beginning to realize the enormous risk they are exposing themselves to, and some are beginning to cover their backs with a simple method: financial agreements. Google licensed content from Reddit, and OpenAI has also reached some economic agreements with publishing groups such as Prisa (El País) and Le Monde.
Perplexity and ChatGPT Search have a bigger problem. We are seeing the latest cases of this dangerous situation in AI-powered search engines. Perplexity and ChatGPT Search are capable of browsing the Internet, taking a handful of sources and answering our questions by summarizing the information from those sources. That is very good for users, who get a clear answer to their question, but these “search engines” make it unnecessary to click through to the original link most of the time. Content creators therefore lose traffic that those AI models gain, which further aggravates the situation.
Image | Hümâ H. Yardım | Marco Lenti