Nvidia trains AI with extensive YouTube video collection

Tech Read Team

NVIDIA’s Controversial AI Training Methods

  • Nvidia gathered videos from YouTube and other sources to train AI models.
  • The company defended its practice as compliant with copyright law.
  • Internal discussions revealed employees’ concerns about legal issues related to the use of datasets.
  • The project, known as Cosmos, aimed to create an advanced video model for various Nvidia products.
  • Employees used yt-dlp and virtual machines to avoid being blocked by YouTube.
  • Nvidia used 20-30 virtual machines to download 80 years' worth of video per day.
  • The company planned to use videos from Netflix and other sources despite legal risks.
  • Nvidia claimed that their data usage was protected under “fair use.”
  • Google and Netflix opposed Nvidia’s unauthorized data collection.
  • Internal discussions revealed that Nvidia had no plans to publish research results to avoid negative attention.

According to leaked internal communications obtained by 404 Media, Nvidia scraped 80 years' worth of video per day.

NVIDIA has recently come under scrutiny for allegedly scraping videos from platforms like YouTube and Netflix without permission. This practice is said to be part of their efforts to compile training data for AI projects. The company, valued at around $2.4 trillion, has been accused of instructing employees to download a significant amount of copyrighted material to enhance their AI capabilities.

Videos were sourced from various platforms including Netflix, but mainly from YouTube. Netflix stated that they have no agreement with Nvidia for content collection, and their terms of service do not allow scraping either.

Nvidia ran the YouTube downloader yt-dlp on 20 to 30 virtual machines, which refreshed their IP addresses to avoid being blocked.
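To put the reported figures in perspective, the short calculation below converts 80 years of video per day into hours and splits that total across 20 and 30 virtual machines. It is only a rough sketch: the function names are illustrative, a year is assumed to be 365 days, and the workload is assumed to be spread evenly across the machines, which the leaked communications do not confirm.

```python
# Back-of-the-envelope arithmetic for the reported download rate.
# Assumptions (not from the source): a 365-day year and an even
# split of the workload across all virtual machines.

HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def video_hours_per_day(years_per_day: float) -> float:
    """Convert 'years of video per day' into hours of video per day."""
    return years_per_day * HOURS_PER_YEAR


def hours_per_vm(total_hours: float, vm_count: int) -> float:
    """Hours of video each VM would handle per day under an even split."""
    return total_hours / vm_count


total = video_hours_per_day(80)
print(f"Total: {total:,.0f} hours of video per day")
for vms in (20, 30):
    print(f"{vms} VMs: {hours_per_vm(total, vms):,.0f} hours of video per VM per day")
```

Under these assumptions, the total comes to 700,800 hours of video per day, or roughly 23,000 to 35,000 hours per virtual machine per day.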

Details of the Allegations

The alleged scope of the operation is broad. NVIDIA, along with other tech giants such as Apple and Anthropic, is said to have used a dataset of more than 173,000 YouTube videos and transcripts to train AI models. The dataset reportedly includes content from channels that have since been removed, raising further ethical questions about using such data without consent from content creators or the platforms themselves.

The allegations highlight unresolved questions about copyright and data usage in the tech industry. Scraping content without permission has sparked debate about companies' ethical responsibilities in the AI sector. Critics argue that the practice could set a dangerous precedent for how AI models are trained and for the rights of content creators.
