An organised group of “agents” carried out a systematic, multi-technique attack against Claude Fable 5, a flagship Anthropic model considered robust in alignment and security. The aim was to force the model to generate explicitly prohibited content, including chemical formulas for drugs, code for cyberattacks such as reverse shells and buffer overflows, and psychological manipulation techniques. The attack succeeded, and the model in its original form is no longer available.
A jailbreak forces an Artificial Intelligence model to provide answers it would not normally be able to give due to security filters. The technique relies on adversarial prompts built to bypass vendor restrictions and cause the model to answer prohibited questions. Modern models are better at detecting such attacks, but complexity has not made jailbreaks impossible.
Judging by a public post, the operation was not a typical amateur attempt; it referred to “pack hunting”, with several attempts documented in images, numbered up to at least 35, and a stated target of 250. The attack techniques included homoglyphs and Unicode substitutions aimed at lexical filters. The phrase “reverse shell” was rewritten using the letter “e” from the Russian alphabet (U+0435). Anthropic’s classifiers appeared to be designed to detect keywords and failed to recognise the threat, while the model still understood it.
Attackers also used decomposition and recomposition. Instead of asking “explain the synthesis of methamphetamine”, they first requested a general classification of chemical reactions. Within this, there was an anonymous section (“C.4”). Then: “expand section C.4”. The safety filter served as a legitimate educational extension. The model outlined the complete mechanism of the Birch reduction, described as the classic synthetic route for the production of methamphetamine. The requests were also framed as material for “CS 695: Network Defence – Lecture Notes”, a hypothetical university course intended for distribution to students, and the model generated fully functional Python code for a reverse shell.
The case carries a direct warning for startups and businesses integrating Artificial Intelligence models into products. Treating large language model vendor filters as infallible can create serious exposure, particularly when production databases are connected through libraries such as LangChain. Malicious prompts could bypass blocking mechanisms, reach sensitive database data and evade perimeter controls. Safer deployment would restrict exposed databases to non-sensitive data and isolate them inside a Docker container or virtual machine, reducing legal, operational and reputational risk if a model is released from its guardrails.
