An investigation found that tech companies, universities, and research groups have collected at least 15.8 million YouTube videos from more than 2 million channels and placed them in at least 13 public data sets. Nearly 1 million of those videos are how-to clips. Many entries are anonymized, but researchers identified the videos by extracting their unique YouTube identifiers from the data sets. News and educational channels are among the most heavily represented sources: the BBC appears at least 33,000 times and TED nearly 50,000 times. The downloads are distinct from the download features YouTube offers subscribers. Videos are being ripped en masse, a practice that violates YouTube’s terms of service. YouTube did not respond to requests for comment.
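The identifier-extraction step mentioned above can be illustrated with a minimal Python sketch. It is not the investigators' actual method. It assumes that data-set entries contain ordinary YouTube URLs and that a video identifier is an 11-character string of letters, digits, hyphens, and underscores; the function name `extract_video_ids` and the sample records are hypothetical.

```python
import re

# Assumption: a video ID is 11 characters drawn from A-Z, a-z, 0-9, "_" and "-",
# and it follows one of these common URL prefixes.
YOUTUBE_ID = re.compile(
    r"(?:v=|youtu\.be/|/embed/|/shorts/)([A-Za-z0-9_-]{11})(?![A-Za-z0-9_-])"
)

def extract_video_ids(records):
    """Return the set of distinct video IDs found in an iterable of text records."""
    ids = set()
    for record in records:
        ids.update(YOUTUBE_ID.findall(record))
    return ids

# Hypothetical data-set rows, used only for illustration.
sample_records = [
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "clip taken from https://youtu.be/dQw4w9WgXcQ?t=10",
    "no video here",
    "https://www.youtube.com/embed/abcdefghijk",
]
unique_ids = extract_video_ids(sample_records)
```

Collecting the matches in a set means a video that appears in several entries, or in several data sets, is counted once.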
The collected footage is being prepared for training generative AI models by splitting videos into short clips and pairing each clip with an English-language caption. Data-set creators used view counts, automated models, or human curation to prioritize content described as cinematic or high quality, and curators often avoid videos with overlaid text or logos. Captions are produced either by paid workers or by other models. Companies and research teams that have used or published such data sets include Microsoft, Meta, Amazon, Nvidia, Runway, ByteDance, Snap, and Tencent. Meta, Amazon, and Nvidia responded to inquiries by saying that they respect creators and consider their use of creators’ work lawful under current copyright law. Several other companies did not comment.
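The clip-and-caption preparation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not any company's pipeline: it cuts a video into fixed-length segments, whereas real pipelines may cut at scene boundaries instead. The `Clip` type, the `split_into_clips` function, and the `caption_for` callback (a stand-in for a paid annotator or a captioning model) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Clip:
    video_id: str
    start: float   # seconds from the beginning of the source video
    end: float     # seconds from the beginning of the source video
    caption: str   # English-language description paired with the clip

def split_into_clips(
    video_id: str,
    duration: float,
    clip_length: float,
    caption_for: Callable[[float, float], str],
) -> List[Clip]:
    """Cut one video into consecutive clips and pair each with a caption.

    Assumption: fixed-length segments; the last clip may be shorter.
    """
    if clip_length <= 0:
        raise ValueError("clip_length must be positive")
    clips = []
    start = 0.0
    while start < duration:
        end = min(start + clip_length, duration)
        clips.append(Clip(video_id, start, end, caption_for(start, end)))
        start = end
    return clips

# Placeholder captioner that only labels the time span of each clip.
clips = split_into_clips(
    "abcdefghijk", 25.0, 10.0, lambda s, e: f"segment from {s:.0f}s to {e:.0f}s"
)
```

Each resulting record ties a time span of the source video to a text description, which is the pairing a video-generation model would be trained on.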
The presence of these videos in training corpora has immediate industry and legal implications. Generative AI videos are already competing with human-made content on YouTube, and the article links that shift to the earlier disruption that text-generation tools caused in online publishing. Creators and rights holders have responded with lawsuits and public complaints, including suits by major studios against image generators and a recent incident in which a deepfaked TED talk was repurposed in an ad that lost an award and prompted litigation. At the same time, developers and platforms are building commercial video-generation tools, offering consumer editing and face-swap products, and in some cases paying users to post synthetic content. Uncertainty over whether training on downloaded videos is lawful could reshape creators’ incentives to publish on YouTube and similar platforms.