Anthropic attack exposes Claude Fable 5 jailbreak risks

A coordinated jailbreak against Claude Fable 5 bypassed Anthropic’s safety filters and produced prohibited outputs, including drug chemistry, cyberattack code and psychological manipulation techniques. The incident underscores why companies integrating Artificial Intelligence models should not treat vendor safeguards as a complete security boundary.

An organised group of “agents” carried out a systematic, multi-technique attack against Claude Fable 5, a flagship Anthropic model considered robust in alignment and security. The aim was to force the model to generate explicitly prohibited content, including chemical formulas for drugs, code for cyberattacks such as reverse shells and buffer overflows, and psychological manipulation techniques. The attack succeeded, and the model in its original form is no longer available.

A jailbreak forces an Artificial Intelligence model to provide answers it would not normally be able to give due to security filters. The technique relies on adversarial prompts built to bypass vendor restrictions and cause the model to answer prohibited questions. Modern models are better at detecting such attacks, but complexity has not made jailbreaks impossible.

Judging by a public post, the operation was not a typical amateur attempt; it referred to “pack hunting”, with several attempts documented in images, numbered up to at least 35, and a stated target of 250. The attack techniques included homoglyphs and Unicode substitutions aimed at lexical filters. The phrase “reverse shell” was rewritten using the letter “e” from the Russian alphabet (U+0435). Anthropic’s classifiers appeared to be designed to detect keywords and failed to recognise the threat, while the model still understood it.

Attackers also used decomposition and recomposition. Instead of asking “explain the synthesis of methamphetamine”, they first requested a general classification of chemical reactions. Within this, there was an anonymous section (“C.4”). Then: “expand section C.4”. The safety filter served as a legitimate educational extension. The model outlined the complete mechanism of the Birch reduction, described as the classic synthetic route for the production of methamphetamine. The requests were also framed as material for “CS 695: Network Defence – Lecture Notes”, a hypothetical university course intended for distribution to students, and the model generated fully functional Python code for a reverse shell.

The case carries a direct warning for startups and businesses integrating Artificial Intelligence models into products. Treating large language model vendor filters as infallible can create serious exposure, particularly when production databases are connected through libraries such as LangChain. Malicious prompts could bypass blocking mechanisms, reach sensitive database data and evade perimeter controls. Safer deployment would restrict exposed databases to non-sensitive data and isolate them inside a Docker container or virtual machine, reducing legal, operational and reputational risk if a model is released from its guardrails.

58

Impact Score

Brain implant helps ALS patient speak and work independently

Casey Harrell, who has ALS and is paralyzed, has become a heavy home user of a brain-computer interface that decodes attempted speech. The system now helps him communicate, browse the web, send messages, and continue working with less day-to-day support from researchers.

AMD acquires MEXT for Artificial Intelligence memory optimization

AMD is acquiring MEXT to address memory bottlenecks affecting cloud and enterprise infrastructure. MEXT’s Artificial Intelligence-powered predictive memory technology is designed to make flash behave more like DRAM while preserving performance and efficiency.

Contact Us

Got questions? Use the form to contact us.

Contact Form

Clicking next sends a verification code to your email. After verifying, you can enter your message.