NVIDIA, the leading AI chip maker, is reportedly developing a sophisticated AI model capable of understanding and generating video content.
An exclusive investigation by 404 Media reveals that NVIDIA has collected vast amounts of data from platforms like Netflix and YouTube to train its new AI model, “Cosmos.” This approach has sparked legal and ethical concerns about using copyrighted material for AI training.
NVIDIA’s internal AI project
According to documents reviewed by 404 Media and discussions with NVIDIA employees, the Cosmos project aims to create a comprehensive video foundation model. This model would integrate simulations of light transport, physics, and intelligence to enable various applications crucial to NVIDIA’s product lineup. These applications include the Omniverse 3D world generator, self-driving car systems, and digital human products.
To achieve this, NVIDIA has reportedly instructed its employees to use tools like the open-source YouTube video downloader yt-dlp. Employees allegedly use virtual machines to download full-length videos while avoiding detection and blocking by YouTube. Virtual machines on Amazon Web Services are reportedly used to refresh IP addresses, enabling the download of approximately 80 years’ worth of video content (roughly 700,000 hours) per day.
Legal and ethical concerns
NVIDIA’s data acquisition methods have raised significant legal and ethical questions. A former NVIDIA employee disclosed that the company also targeted Netflix, even though Netflix’s terms of service explicitly prohibit such scraping. The collection reportedly went beyond video platforms, with NVIDIA also mining academic datasets and other resources for research purposes.
In a Slack conversation, project leaders such as Ming-Yu Liu discussed the benefits of training on high-quality content, including Hollywood films, Discovery Channel documentaries, and gaming footage. Liu singled out Hollywood films for their gaming-like 3D consistency, fictional content, and superior quality. However, he acknowledged the sensitivity of using such content, referencing concerns similar to those raised by artists following the release of Stable Diffusion (SD).
Despite these concerns, project managers reassured employees they had top-level approval to scrape data from websites, labeling it an “executive decision.” NVIDIA has defended its data scraping practices, asserting that they are “in full compliance with the letter and the spirit of copyright law.”
Implications for how AI is developed
NVIDIA’s ambitious AI project underscores the ongoing challenges and complexities surrounding the development of advanced AI technologies. As AI models become increasingly capable of understanding and generating sophisticated content, the ethical and legal implications of data acquisition methods must be carefully considered. The Cosmos project exemplifies the tension between technological innovation and the need to respect intellectual property rights and ethical standards.
While NVIDIA’s efforts to develop cutting-edge AI models are commendable, the company’s data scraping practices highlight the need for clear guidelines and regulations in the AI industry. As NVIDIA continues to push the boundaries of AI technology, it remains to be seen how the legal and ethical issues surrounding the Cosmos project will be addressed and resolved.