Artificial intelligence is coming for YouTube creators

More than 15.8 million YouTube videos from over 2 million channels appear in at least 13 public data sets used to train generative artificial intelligence video tools, often without creators’ permission. Creators and legal advocates are contesting whether such mass downloading and training is lawful or ethical.

An investigation found that tech companies, universities, and research groups have collected at least 15.8 million YouTube videos from more than 2 million channels and placed them in at least 13 public data sets. Nearly 1 million of those videos are how-to clips. Many entries are anonymized, but researchers identified videos by extracting unique YouTube identifiers from the data sets. Among the most represented sources are news and educational channels, with the BBC appearing at least 33,000 times and TED nearly 50,000 times. The downloads are distinct from YouTube’s subscriber download features: videos are being ripped en masse, a practice that violates YouTube’s terms of service. The platform did not respond to requests for comment.

The collected footage is prepared for training generative artificial intelligence models by splitting videos into short clips and pairing them with English-language captions. Data set creators used view counts, automated models, or human curation to prioritize content described as cinematic or high quality, and curators often avoid videos with overlaid text or logos. Captions are produced either by paid workers or by other models. Companies and research teams that have used or published such data sets include Microsoft, Meta, Amazon, Nvidia, Runway, ByteDance, Snap, and Tencent. Meta, Amazon, and Nvidia responded to inquiries saying they respect creators and view their work as legally usable under current copyright law, while several other companies did not comment.

The presence of these videos in training corpora has immediate industry and legal implications. Generative artificial intelligence videos are already competing with human-made content on YouTube, and the article links that shift to earlier disruptions caused by text-generation tools in online publishing. Creators and rights holders have mounted lawsuits and public complaints, including major studio suits against image generators and a recent incident in which a deepfaked TED talk was repurposed in an ad that lost an award and prompted litigation. Developers and platforms are simultaneously building commercial video-generation tools, offering consumer editing and face-swap products, and in some cases paying users to post synthetic content. The uncertainty over whether training on downloaded videos is lawful could reshape creators’ incentives to publish on YouTube and similar platforms.

How Intel became central to America’s artificial intelligence strategy

The Trump administration took a 10 percent stake in Intel in exchange for early CHIPS Act funding, positioning the struggling chipmaker at the core of U.S. artificial intelligence ambitions. The high-stakes bet could reshape domestic manufacturing while raising questions about government overreach.

NextSilicon unveils processor chip to challenge Intel and AMD

Israeli startup NextSilicon is developing a RISC-V central processor to complement its Maverick-2 chip for precision scientific computing, positioning it against Intel and AMD and in competition with Nvidia’s systems. Sandia National Laboratories has been evaluating the technology, and the company claims faster, lower-power performance without code changes on some workloads.
