Major artificial intelligence training dataset found to contain millions of personal documents

July 19, 2025

A leading open-source artificial intelligence training dataset likely contains hundreds of millions of images exposing personal and sensitive information, according to new research.

Millions of images containing passports, credit cards, birth certificates, and many other documents with personally identifiable information are likely present in one of the largest open-source artificial intelligence training datasets, DataComp CommonPool, according to recent research. The findings come from an audit of only 0.1% of the dataset, which turned up thousands of confirmed sensitive images. Extrapolating from that sample, the researchers estimate that hundreds of millions of images containing personally identifiable information, including identifiable faces and official identity documents, are present in the full set. The study, published on arXiv, highlights that extensive web scraping has swept up both public and private data without meaningful restrictions or filtering mechanisms.

The broad scope of DataComp CommonPool mirrors that of its predecessor, the LAION-5B dataset. Both draw their images and captions from web data gathered by the nonprofit Common Crawl. Although CommonPool was intended for academic research, its license does not prohibit commercial use, and the dataset has been downloaded over 2 million times since its 2023 release. Efforts to mitigate privacy risks, such as automated face blurring, proved inadequate: researchers found that the system missed millions of faces and that captions and metadata often contained additional identifying details such as names and addresses. Some images in the dataset showed résumé materials, exposing sensitive personal data ranging from disability status to government identifiers, alongside the contact information of third parties such as references.

The presence of personally identifiable information in such datasets raises significant privacy and ethical challenges, especially given how easily scraped data propagates across downstream models and platforms. Legal protections are inconsistent: frameworks such as the GDPR and the California Consumer Privacy Act exist, but enforcement gaps persist and carve-outs for “publicly available” information are common, often leaving private data exposed and unprotected if it was sourced from the internet. Researchers and ethicists are urging the artificial intelligence community to rethink indiscriminate web scraping as standard practice, while consumers and policymakers grapple with the limitations of current privacy laws. The research underscores that web-scraped datasets contain far more sensitive data than most people realize, that current privacy mechanisms are insufficient, and that the concept of “consent” is rendered moot when data is reused in unforeseen, often commercial, artificial intelligence applications.
