New prompt injection papers: Agents Rule of Two and The Attacker Moves Second

Two recent works examine prompt injection risks and defenses: Meta AI's Agents Rule of Two proposes limiting what an agent can do within a single session to rule out the highest-impact attacks, while a multi-author arXiv study shows that adaptive attacks can bypass most published jailbreak and prompt injection defenses.

Two new works address prompt injection and language model security. The first, published October 31 on the Meta AI blog and shared by Meta AI security researcher Mick Ayzenberg, proposes an “Agents Rule of Two.” The rule says that until prompt injection can be detected robustly, an agent session should have no more than two of the following three properties, so as to avoid the highest-impact consequences: (A) it processes untrustworthy inputs, (B) it has access to sensitive systems or private data, and (C) it can change state or communicate externally. If a task requires all three without starting a fresh session, the agent should not operate autonomously; it needs human supervision or another reliable form of validation. The post includes a Venn diagram showing the danger zone where all three properties overlap, and it frames the rule as practical advice for system design.
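As a rough illustration, the rule can be read as a check on a session's configuration. The sketch below is not from the Meta post: the `AgentSession` class, its field names, and the `requires_supervision` helper are hypothetical names chosen for this example. It only counts which of the three properties a session has.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSession:
    """Hypothetical description of one agent session's capabilities."""
    processes_untrusted_input: bool      # property (A)
    accesses_sensitive_data: bool        # property (B)
    changes_state_or_communicates: bool  # property (C)


def properties_held(session: AgentSession) -> int:
    """Count how many of the three properties this session has."""
    return sum([
        session.processes_untrusted_input,
        session.accesses_sensitive_data,
        session.changes_state_or_communicates,
    ])


def requires_supervision(session: AgentSession) -> bool:
    """True when the session combines all three properties (A), (B), and (C)."""
    return properties_held(session) > 2


# Example: a session that reads untrusted input and private data but
# cannot act externally stays within the rule; one with all three does not.
read_only_session = AgentSession(True, True, False)
full_access_session = AgentSession(True, True, True)
print(requires_supervision(read_only_session))
print(requires_supervision(full_access_session))
```

In a real system the hard part is deciding honestly whether each property applies, for example whether a tool result counts as untrustworthy input. The check itself is the easy part.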

The second, an arXiv paper dated October 10, 2025, is a multi-author study with contributors from organizations including OpenAI, Anthropic, and Google DeepMind. It evaluates 12 published defenses against prompt injection and jailbreaks using extensive adaptive attacks. The authors report that by tuning and scaling optimization methods they bypassed all 12 defenses, with attack success rates above 90% for most of them, even though many of those defenses had originally reported near-zero attack success rates. A human red-teaming setting achieved a 100% success rate; it involved 500 participants in an online competition with a prize fund. The paper argues that a fixed set of example attacks is insufficient for evaluating a defense and that adaptive evaluations reveal far higher vulnerability.

The arXiv paper describes three families of automated adaptive attack: gradient-based methods, reinforcement learning methods that interact with the defended system, and search-based methods that generate candidates and iteratively refine them using language models as judges. The paper urges defense authors to release simple defenses that are amenable to human analysis and to hold evaluations to a higher standard. The author of the blog post discussing both works calls the paper a forceful reminder of how much remains to be done and endorses the Agents Rule of Two as pragmatic guidance for building more secure language-model agents today, while remaining skeptical that reliable defenses will appear soon.
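The search-based family mentioned above follows a generate, score, refine pattern. The sketch below shows only the shape of that loop and is not the paper's implementation. The function names are hypothetical, and the toy judge simply measures string similarity to a fixed target, where the paper describes language models acting as judges.

```python
from typing import Callable, List


def search_refine(
    seed: str,
    mutate: Callable[[str], List[str]],
    judge: Callable[[str], float],
    rounds: int = 5,
    threshold: float = 1.0,
) -> str:
    """Greedy search: propose variants, score them, keep the best so far."""
    best, best_score = seed, judge(seed)
    for _ in range(rounds):
        if best_score >= threshold:
            break  # the judge is satisfied, so stop refining
        for candidate in mutate(best):
            score = judge(candidate)
            if score > best_score:
                best, best_score = candidate, score
    return best


# Toy stand-ins (assumptions for illustration only).
TARGET = "abc"
ALPHABET = "abc"


def toy_judge(candidate: str) -> float:
    """Fraction of positions that match TARGET."""
    matches = sum(a == b for a, b in zip(candidate, TARGET))
    return matches / len(TARGET)


def toy_mutate(candidate: str) -> List[str]:
    """All strings that differ from the candidate in at most one position."""
    return [
        candidate[:i] + letter + candidate[i + 1:]
        for i in range(len(candidate))
        for letter in ALPHABET
    ]


print(search_refine("ccc", toy_mutate, toy_judge))
```

The point of the pattern is that the attacker adapts to feedback from the judge. A fixed list of attack strings has no equivalent of the inner loop, which is the gap the paper highlights between static and adaptive evaluation.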


