OpenAI used millions of YouTube videos to train its AIs: Google is very angry

Last week Google accused OpenAI of using YouTube videos to train Sora. Now an investigation by The New York Times claims that OpenAI also used more than a million hours of YouTube videos to train Whisper, its AI that converts audio to text.

As expected, this has not gone down well with Google: not only is the data its own, but OpenAI is also its most direct rival in the field of artificial intelligence.

It remains to be seen whether this case goes to court or the two companies reach an agreement that benefits both.

OpenAI used YouTube videos to train its AIs

Artificial intelligence needs real-world data to improve, and the more advanced an AI is, the more data it needs.

As The New York Times reports, via The Verge, the major AI companies have already consumed all the public data available for training AI, as well as the private collections whose owners they have reached agreements with.

According to the investigation, OpenAI ran out of data in 2021. Its executives then discussed the possibility of using YouTube videos, podcasts and audiobooks, even knowing they were in “a gray area” of the law.

In the end, they decided to use more than a million hours of YouTube videos, extracting the audio to train Whisper, its AI that converts speech to text. In their view, this would fall under “fair use,” since it involved only a fraction of the hundreds of billions of hours of video on YouTube.

Allegedly, OpenAI's own president, Greg Brockman, was involved in obtaining those videos.

Google spokesperson Matt Bryant told The Verge that the company has “seen unconfirmed reports” of OpenAI's activity, and stressed that “both our robots.txt files and terms of service prohibit scraping or unauthorized downloading of YouTube content.”

The New York Times investigation also claims that Meta ran out of data long ago and considered licensing books, and even buying a large publisher.

According to some experts, by 2028 AI companies will need more data than can be generated.

One solution is to create synthetic data, that is, data generated artificially for use with AI, or to use other training techniques that do not require as much data. But so far neither approach has worked.

AI companies are competing in a frantic race to dominate a market that will make a lot of money, and they do not hesitate to bypass copyright in order to train their AIs faster than their rivals. It is a suicidal race that casts doubt on the supposed safety of this AI when it comes to preventing it from annihilating us or turning us into its slaves…
