Web-scraping bots overwhelm scientific publishers amid generative artificial intelligence boom

Automated bots gathering training data for artificial intelligence models are straining scientific databases and academic publishers, posing operational and financial risks.

In early 2025, the image repository DiscoverLife found its website bombarded with millions of daily requests, drastically slowing site performance. The surge was attributed to a flood of automated web-scraping bots, designed to harvest large volumes of digital content. Researchers and publishers operating journals, databases, and open-access repositories are increasingly facing similar crises, as bot traffic now routinely exceeds that from human users. These bots, often masked behind anonymized IP addresses, are widely believed to be collecting data to train the latest generation of artificial intelligence tools, such as chatbots and image generators.

Industry leaders highlight the unprecedented scale of disruption. Andrew Pitts, CEO of PSI in Oxford, describes the situation as a "wild west," noting that the overwhelming volume of requests costs money and disrupts access for genuine users. Organizations with limited technical or financial resources are especially vulnerable; some even risk shutting down entirely if the trend continues. Ian Mulvany from BMJ journals and Jes Kainth from the publication platform Highwire Press both report that bot traffic now routinely surpasses legitimate access, repeatedly crashing servers and interrupting services for researchers and professionals who rely on timely access to scholarly materials.

The Confederation of Open Access Repositories (COAR) found in a recent survey that over 90% of repositories had experienced scraping by artificial intelligence bots, resulting in service outages and significant operational headaches. Executive director Kathleen Shearer notes that while open access is central to these platforms' missions, the sheer aggressiveness of the bots is causing major technical and financial stress. The spike in scraper activity is traced, in part, to breakthroughs such as the DeepSeek language model, which demonstrated that powerful artificial intelligence can be developed using publicly scraped data at lower computational cost. As the arms race for training data accelerates, scientific publishers and communities are scrambling to develop mitigation strategies, but viable solutions remain elusive for many operators caught in the data crawl crossfire.
