Millions of images containing passports, credit cards, birth certificates, and other documents with personally identifiable information are likely present in DataComp CommonPool, one of the largest open-source artificial intelligence training datasets, according to recent research. The findings are based on an audit of just 0.1% of the dataset, which revealed thousands of confirmed sensitive images, including identifiable faces and official identity documents; extrapolating from that sample, the researchers estimate that hundreds of millions of such images are present in the full set. The study, published on arXiv, highlights how extensive web-scraping practices have swept up both public and private data without meaningful restrictions or filtering mechanisms.
The broad scope of DataComp CommonPool mirrors that of its predecessor, the LAION-5B dataset; both source images and captions from the nonprofit Common Crawl. Although originally intended for academic research, CommonPool's license does not prohibit commercial use, and the dataset has been downloaded more than 2 million times since its 2023 release. Efforts to mitigate privacy risks, such as automated face blurring, proved inadequate: the researchers found that the system missed millions of faces, and that captions and metadata often contained additional identifying details such as names and addresses. Some images in the data included redacted résumé materials that nonetheless exposed sensitive personal data ranging from disability status to government identifiers, alongside the contact information of third parties such as references.
The presence of personally identifiable information in such datasets raises significant privacy and ethical challenges, especially given the ease with which scraped data propagates into downstream models and platforms. Legal protections are inconsistent: while frameworks such as the GDPR and the California Consumer Privacy Act exist, enforcement gaps persist, and carve-outs for “publicly available” information are common, often leaving personal data exposed and unprotected simply because it was sourced from the internet. Researchers and ethicists are urging the artificial intelligence community to rethink indiscriminate web scraping as standard practice, while consumers and policymakers grapple with the limitations of current privacy laws. The research underscores that web-scraped datasets contain far more sensitive data than most realize, that current privacy mechanisms are insufficient, and that the concept of “consent” is rendered moot when data is reused in unforeseen, often commercial, artificial intelligence applications.