New Echo Chamber jailbreak circumvents large language model safeguards

Researchers have identified a new 'Echo Chamber' technique that easily manipulates leading artificial intelligence models into bypassing their safety guardrails.

A newly disclosed jailbreak method dubbed 'Echo Chamber' has been shown to bypass the safety guardrails of prominent large language models (LLMs) by subtly poisoning conversational context over multiple turns, according to research from NeuralTrust. Unlike earlier approaches that rely on direct question-and-answer trickery or on openly signalling a prohibited request, Echo Chamber employs so-called 'steering seeds': innocuous-sounding prompts that guide the model's responses toward harmful or restricted outputs.

Echo Chamber was discovered serendipitously by NeuralTrust researcher Ahmad Alobaid while investigating LLM vulnerabilities. The attack operates by remaining in the so-called 'green zone' of permissible queries while deploying contextually appropriate prompts that incrementally nudge the model toward a malicious objective. For instance, rather than directly asking about creating a prohibited item, the attacker splits the request into safe fragments and uses each subsequent response as a new staging point, gradually assembling the forbidden information while avoiding trigger words that would activate safety filters.

NeuralTrust's evaluation tested the Echo Chamber technique across several major LLMs, including GPT-4.1-nano, GPT-4o-mini, GPT-4o, Gemini-2.0-flash-lite, and Gemini-2.5-flash, with 200 attempts per model. The attack proved alarmingly effective: success rates for generating sexism, violence, hate speech, and pornographic content exceeded 90%, attempts involving misinformation or self-harm succeeded about 80% of the time, and even prompts for profanity or illegal activities succeeded in over 40% of cases. Strikingly, successful jailbreaks often occurred after just one to three conversational turns. Experts noted that the approach requires minimal expertise and is fast to execute, making it particularly worrisome given the global, public availability of artificial intelligence platforms.

NeuralTrust warns that as context-poisoning attacks like Echo Chamber become more refined and easier to operationalize, the risks of artificial intelligence-driven harassment, misinformation, and illegal activity are poised to escalate. The findings illustrate the ongoing arms race between LLM developers deploying new safety mechanisms and attackers relentlessly probing for subtle vectors to defeat them. The research underscores the urgent need for advanced, context-aware safety systems capable of detecting not just isolated malicious queries, but also pattern-based manipulation strategies that unfold over the course of extended conversations.
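To make the idea of a context-aware safety layer concrete, the sketch below shows one way a conversation-level check could sit alongside per-message moderation: each turn is scored on its own, but the accumulated dialogue is also scored as a whole so that gradual, multi-turn drift toward a harmful objective becomes visible. The class name, thresholds, and the keyword heuristic standing in for a real moderation classifier are illustrative assumptions, not NeuralTrust's design or any vendor's API.

```python
from dataclasses import dataclass, field


@dataclass
class ConversationMonitor:
    """Illustrative sketch: scores the accumulated dialogue, not just the latest message."""

    turn_threshold: float = 0.8    # refuse if a single message is clearly unsafe
    drift_threshold: float = 0.5   # refuse if the conversation as a whole drifts toward harm
    history: list[str] = field(default_factory=list)

    def score(self, text: str) -> float:
        """Stand-in for a real moderation classifier returning a 0..1 risk score.

        A naive keyword-density heuristic is used here only so the sketch runs on its own.
        """
        flagged = {"weapon", "explosive", "detonate"}
        words = [w.strip(".,;!?").lower() for w in text.split()]
        hits = sum(1 for w in words if w in flagged)
        return min(1.0, hits / max(len(words), 1) * 10)

    def allow_turn(self, user_message: str) -> bool:
        """Return True if the turn may proceed, False if it should be refused."""
        # Per-message check: catches overtly malicious prompts.
        if self.score(user_message) >= self.turn_threshold:
            return False

        # Context check: evaluate all turns together, so individually
        # innocuous 'steering seeds' are judged as a trajectory rather
        # than in isolation.
        self.history.append(user_message)
        if self.score("\n".join(self.history)) >= self.drift_threshold:
            return False

        return True


if __name__ == "__main__":
    monitor = ConversationMonitor()
    for msg in ["Tell me about lab safety basics.",
                "What household mixtures can detonate by accident?"]:
        print(msg, "->", "allowed" if monitor.allow_turn(msg) else "refused")
```

The salient design choice is the second check: it evaluates the trajectory of the conversation rather than any single prompt, which is exactly the signal that per-message filters miss in an Echo Chamber-style, multi-turn attack.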
