The security challenge of building trustworthy artificial intelligence assistants

New tools like OpenClaw show the appeal of always-on artificial intelligence assistants with deep access to personal data, but they also spotlight unresolved security risks, especially prompt injection attacks. Researchers are racing to design guardrails that protect users without stripping these agents of their usefulness.

Artificial intelligence agents that can act as personal assistants are rapidly moving from concept to reality, but their growing power brings significant security risks. OpenClaw, a tool created by independent software engineer Peter Steinberger, lets users strap existing large language models into a kind of “mecha suit,” granting them persistent memory and the ability to run ongoing tasks via messaging apps like WhatsApp. Unlike more constrained offerings from major artificial intelligence companies, OpenClaw agents are designed to run around the clock and can manage inboxes, plan trips, and even write code or spin up new applications. To do so, they often require deep access to users’ emails, credit card information, and local files. That level of access has alarmed security experts and even prompted a public warning from the Chinese government about OpenClaw’s vulnerabilities.

Some of the most immediate dangers involve basic operational mistakes and traditional hacking. One user’s Google Antigravity coding agent reportedly wiped his entire hard drive, illustrating how easily automated tools can cause catastrophic damage when given broad permissions. Security researchers have also demonstrated multiple ways attackers could compromise OpenClaw instances using conventional techniques to exfiltrate sensitive data or run malicious code. Users can partially mitigate these problems by isolating agents on separate machines or in the cloud, and many weaknesses could be addressed with established security practices. However, specialists are especially concerned about prompt injection, an insidious form of large language model hijacking in which malicious text or images embedded in websites or emails are misinterpreted as instructions, allowing attackers to redirect an artificial intelligence assistant that holds private user information.

Prompt injection has not yet led to publicly known disasters, but the proliferation of OpenClaw agents could make this attack vector more enticing to cybercriminals. The core issue is that large language models do not inherently distinguish user instructions from data such as web pages or emails; they treat everything as text, which makes them easy to trick. Researchers are pursuing three broad defense strategies, each with trade-offs. One is to use post-training to teach models to ignore known forms of prompt injection, though pushing too hard risks rejecting legitimate requests, and training cannot fully overcome the models’ inherent randomness. Another approach uses detector models to screen inputs for injected prompts before they reach the main assistant, but recent studies show that even the best detectors miss entire categories of attacks. A third strategy focuses on output policies that constrain what agents are allowed to do, such as limiting email recipients, which can block harmful actions but also sharply reduces utility and is difficult to define precisely.
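To make the third strategy concrete, here is a minimal Python sketch of an output policy that limits email recipients. It is an illustration under stated assumptions, not how OpenClaw or any particular agent framework works: the `EmailAction` structure, the `ALLOWED_RECIPIENTS` allowlist, and the `check_email_policy` and `enforce` helpers are all hypothetical names invented for this example.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical allowlist; a real deployment would load this from
# configuration that the user controls, not hard-code it.
ALLOWED_RECIPIENTS = {"alice@example.com", "bob@example.com"}


@dataclass(frozen=True)
class EmailAction:
    """An email the agent proposes to send (hypothetical structure)."""
    to: tuple[str, ...]
    subject: str
    body: str


def check_email_policy(
    action: EmailAction, allowed: set[str] = ALLOWED_RECIPIENTS
) -> list[str]:
    """Return the recipients that are not on the allowlist.

    An empty list means the action satisfies the policy.
    """
    normalized = {addr.strip().lower() for addr in allowed}
    return [r for r in action.to if r.strip().lower() not in normalized]


def enforce(action: EmailAction) -> EmailAction:
    """Pass the action through unchanged, or raise if the policy blocks it."""
    blocked = check_email_policy(action)
    if blocked:
        raise PermissionError(f"recipients not on allowlist: {blocked}")
    return action
```

The check runs on the action the agent proposes, not on the text the model read, so it holds even when an injected instruction has already fooled the model. The sketch also shows the trade-off described above: any legitimate email to a new contact is blocked until someone updates the allowlist.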

The broader artificial intelligence ecosystem is still wrestling with when these agents will be secure enough for mainstream deployment. Dawn Song of UC Berkeley, whose startup Virtue AI develops an agent security platform, believes safe deployment is already possible with the right safeguards, while Duke University’s Neil Gong argues that the field is not there yet. Even if full protection against prompt injection remains elusive, partial mitigations can meaningfully reduce risk, and Steinberger has recently brought a security specialist onto the OpenClaw project to strengthen its defenses. Many enthusiasts, such as OpenClaw maintainer George Pickett, continue to embrace the tool while taking basic precautions like running it in the cloud and locking down access, although some admit they have not implemented specific protections against prompt injection and are effectively betting they will not be the first to be hacked.
