The security challenge of building trustworthy artificial intelligence assistants

New tools like OpenClaw show the appeal of always-on artificial intelligence assistants with deep access to personal data, but they also spotlight unresolved security risks, especially prompt injection attacks. Researchers are racing to design guardrails that protect users without stripping these agents of their usefulness.

Artificial intelligence agents that can act as personal assistants are rapidly moving from concept to reality, but their growing power brings significant security risks. OpenClaw, a tool created by independent software engineer Peter Steinberger, lets users strap existing large language models into a kind of “mecha suit,” granting them persistent memory and the ability to run ongoing tasks via messaging apps like WhatsApp. Unlike more constrained offerings from major artificial intelligence companies, OpenClaw agents are designed to run 24/7 and can manage inboxes, plan trips, and even write code or spin up new applications. To do so, they often require deep access to users’ emails, credit card information, and local files, which has alarmed security experts and even prompted a public warning from the Chinese government about the tool’s vulnerabilities.

Some of the most immediate dangers involve basic operational mistakes and traditional hacking. One user’s Google Antigravity coding agent reportedly wiped his entire hard drive, illustrating how easily automated tools can cause catastrophic damage when given broad permissions. Security researchers have also demonstrated multiple ways attackers could compromise OpenClaw instances using conventional techniques to exfiltrate sensitive data or run malicious code. Users can partially mitigate these problems by isolating agents on separate machines or in the cloud, and many weaknesses could be addressed with established security practices. However, specialists are especially concerned about prompt injection, an insidious form of large language model hijacking in which malicious text or images embedded in websites or emails are misinterpreted as instructions, allowing attackers to redirect an artificial intelligence assistant that holds private user information.

Prompt injection has not yet led to publicly known disasters, but the proliferation of OpenClaw agents potentially makes this attack vector more enticing to cybercriminals. The core issue is that large language models do not inherently distinguish user instructions from data such as web pages or emails: everything is treated as text, which makes the models easy to trick. Researchers are pursuing three broad defense strategies, each with trade-offs. One is to fine-tune models during post-training to ignore known forms of prompt injection, though pushing too hard risks rejecting legitimate requests and cannot fully overcome the models’ inherent randomness. Another approach uses detector models to screen inputs for injected prompts before they reach the main assistant, but recent studies show that even the best detectors miss entire categories of attacks. A third strategy focuses on output policies that constrain what agents are allowed to do, such as limiting email recipients; these can block harmful actions but also sharply reduce utility and are difficult to define precisely.
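The output-policy strategy can be illustrated with a toy sketch. The code below is a minimal, hypothetical example (the action format, function names, and allowlist are all invented for illustration, not any real OpenClaw API): a deterministic allowlist check runs on an agent's proposed action before execution, so it holds even if the model itself has been fully hijacked by injected text.

```python
# Minimal sketch of an output-policy guard (all names hypothetical):
# before the agent executes a proposed action, a deterministic check
# rejects anything outside an explicit allowlist -- here, email recipients.

ALLOWED_RECIPIENTS = {"me@example.com", "assistant@example.com"}

def check_email_action(action: dict) -> bool:
    """Return True only if the action is an email to allowlisted recipients."""
    if action.get("type") != "send_email":
        return False  # this guard only covers email actions
    recipients = action.get("to", [])
    # Empty recipient lists are rejected; every recipient must be allowlisted.
    return bool(recipients) and all(r in ALLOWED_RECIPIENTS for r in recipients)

# A prompt-injected instruction to exfiltrate data is blocked regardless of
# how the model was tricked into proposing it:
print(check_email_action({"type": "send_email", "to": ["me@example.com"]}))        # True
print(check_email_action({"type": "send_email", "to": ["attacker@evil.example"]})) # False
```

Because the check lives outside the model, its guarantees do not depend on the model resisting the injection, which is the appeal of this strategy. The trade-off described above is also visible: the policy must enumerate every legitimate recipient in advance, and anything it cannot anticipate is lost utility.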

The broader artificial intelligence ecosystem is still wrestling with when these agents will be secure enough for mainstream deployment. Dawn Song of UC Berkeley, whose startup Virtue AI develops an agent security platform, believes safe deployment is already possible with the right safeguards, while Duke University’s Neil Gong argues that the field is not there yet. Even if full protection against prompt injection remains elusive, partial mitigations can meaningfully reduce risk, and Steinberger has recently brought a security specialist onto the OpenClaw project to strengthen its defenses. Many enthusiasts, such as OpenClaw maintainer George Pickett, continue to embrace the tool while taking basic precautions like running it in the cloud and locking down access, although some admit they have not implemented specific protections against prompt injection and are effectively betting they will not be the first to be hacked.


Anumana wins FDA clearance for pulmonary hypertension ECG Artificial Intelligence tool

Anumana has received FDA 510(k) clearance for an Artificial Intelligence-enabled pulmonary hypertension algorithm designed for use with standard 12-lead electrocardiograms. The company says the software can help clinicians spot early signs of disease within existing workflows and without moving patient data outside the health system environment.

Anu Bradford on tech sovereignty and regulatory fragmentation

Anu Bradford argues that Europe is wavering in its role as the world’s digital rule-setter just as governments everywhere move toward more state control over technology. Global companies are being pushed to treat geopolitical risk, data sovereignty, and Artificial Intelligence governance as core strategic issues.

Mistral launches text-to-speech model

Mistral has expanded its Voxtral family with a text-to-speech system aimed at enterprise voice applications. The company is positioning the open-weights model as a flexible alternative for organizations that want more control over deployment, cost and customization.
