Moving the decision boundary of large language model safety classifiers

A new displacement-based fine-tuning method called delta sim shifts how large language model safety classifiers recognize malicious prompts, improving detection but degrading probability calibration on benign inputs.

The article explores why safety classifiers for large language models often fail against prompt injection attacks and proposes a new fine-tuning method called delta sim that operates on activation geometry rather than input tokens. Building on earlier work on the geometry of safety failures, the author argues that a bypass is not primarily about what the input text says, but about where it travels in activation space. Semantics-preserving transformations such as leetspeak, homoglyphs, casing changes, wrappers and DSI barely affect meaning but still move activations significantly, and the amount of movement predicts bypass success better than the final activation location does. The classifier is therefore not fooled because the input appears benign, but because the input followed a path in activation space that the model was never trained to recognize as malicious.
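The displacement idea can be made concrete with a small sketch. The article does not publish code, so the activation vectors below are toy stand-ins for a classifier's hidden state on an original prompt and on a semantics-preserving rewrite of it; the point is only that "how far the input moved" is a measurable quantity distinct from "where it ended up":

```python
import numpy as np

def displacement(act_orig: np.ndarray, act_transformed: np.ndarray) -> dict:
    """Compare where two inputs land in activation space.

    The bypass-prediction claim concerns the movement between the
    original prompt's activations and the transformed prompt's
    activations, not the final location alone.
    """
    delta = act_transformed - act_orig
    cos = float(
        np.dot(act_orig, act_transformed)
        / (np.linalg.norm(act_orig) * np.linalg.norm(act_transformed))
    )
    return {"l2_shift": float(np.linalg.norm(delta)), "cosine_sim": cos}

# Toy vectors standing in for activations of a malicious prompt and
# its leetspeak variant: same meaning, large geometric move.
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.6, 0.8, 0.0])
print(displacement(a, b))
```

In a real measurement the vectors would come from a forward pass of the classifier on both prompt variants; the metric itself is the trivial part.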

The proposed delta sim training methodology assumes that if a classifier already detects some attacks, the internal representations for malicious behavior already exist, and the real problem is a misplaced decision boundary. The procedure runs a baseline classifier on a large dataset in which every data point is malicious, then splits the results into detected attacks and missed attacks. The missed attacks are treated as blind spots, and the model is trained to pull those blind-spot representations toward the detected-malicious cluster in activation space, effectively widening the route labeled as malicious. On the LLMail Inject dataset, which features 208,095 real attacks, baseline Prompt Guard 2 performance was described as far from perfect, with only about 30% of the attacks detected. After delta sim fine-tuning, the reported results were 99.2% detection on in-distribution attacks and 1,826 out of 1,841 bypasses recovered, a dramatic gain in recall on that dataset.
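The article does not give the exact delta sim objective, but the pull-toward-cluster step it describes can be sketched with hypothetical data: split activations into detected and missed, take the centroid of the detected-malicious cluster, and minimize the distance from blind-spot activations to that centroid:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activations; in the described method these would come
# from running the baseline classifier on an all-malicious dataset.
detected = rng.normal(loc=2.0, size=(100, 8))   # attacks the model catches
missed   = rng.normal(loc=-2.0, size=(20, 8))   # blind spots it misses

target = detected.mean(axis=0)  # centroid of the detected-malicious cluster

def pull_loss(acts: np.ndarray, centroid: np.ndarray) -> float:
    """Mean squared distance from blind-spot activations to the
    detected-malicious centroid; minimizing it pulls blind spots
    into the region the classifier already labels malicious."""
    return float(np.mean(np.sum((acts - centroid) ** 2, axis=1)))

before = pull_loss(missed, target)
# One illustrative gradient step taken directly on the activations
# (real training would update model weights instead).
stepped = missed - 0.2 * (missed - target)
after = pull_loss(stepped, target)
print(before, after)
```

In the actual method the loss would backpropagate through the classifier so the blind-spot inputs are re-routed in activation space; the centroid-pull term above is only the geometric intuition.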

However, when the fine-tuned model is evaluated across other datasets, a major trade-off appears in the form of degraded probability calibration. Delta sim achieves the highest AUC-ROC and separates malicious and benign representations better than other models, but benign inputs start receiving extremely high scores, so operators must use absurdly high thresholds to maintain low false positive rates. At very strict operating points such as 0.1% FPR, well-calibrated models like ProtectAI actually perform better. The author concludes that displacement-based training improves geometry, not confidence, and that calibration is a separate problem. Delta sim is positioned as useful in settings where missing attacks is worse than occasional false positives, such as detection, logging, triage, or response with human review or a low-FPR prefilter, but risky for automated blocking where false positives are costly. The deeper lesson is that classifiers already contain useful representations and that failures often stem from where decision boundaries are placed. Geometric fine-tuning can widen malicious routes and sharpen the separation between benign and malicious inputs, but it can also trade one failure mode for another: there is no free lunch for prompt injection defenses.
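The calibration problem is easy to reproduce with synthetic scores. The distributions below are invented (the article reports no raw scores): a model can rank attacks above benign inputs almost perfectly while piling benign scores near 1.0, so the threshold required for a strict 0.1% FPR operating point ends up extremely close to 1:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical score distributions for a sharp but poorly calibrated
# model: benign inputs score just below 1.0, attacks even closer to it.
benign  = 1.0 - 1e-3 * rng.exponential(size=10_000)
attacks = 1.0 - 1e-6 * rng.exponential(size=10_000)

# Operating point: allow at most 0.1% of benign inputs above threshold.
thr = np.quantile(benign, 1 - 0.001)
recall = float(np.mean(attacks >= thr))
print(f"threshold at 0.1% FPR = {thr:.8f}, recall = {recall:.3f}")
```

Ranking quality (AUC-ROC) is unaffected by any monotone inflation of the scores, which is why a model can top the AUC table and still force operators into thresholds like 0.9999 at strict FPR targets.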

Impact Score: 52

What businesses need to know about the EU Cyber Resilience Act

The EU Cyber Resilience Act is turning product cybersecurity into a legal requirement for companies that sell digital products into the European Union. A key compliance milestone arrives in September 2026, well before the full regulation takes effect in 2027.

Claude Mythos and cyber insurance’s next inflection point

Claude Mythos is being treated by governments and regulators as a potential systemic cyber risk with implications for financial stability and insurance markets. Its emergence is intensifying pressure on insurers to clarify whether Artificial Intelligence-enabled cyber losses are covered, excluded, or require new stand-alone products.

OpenAI expands ChatGPT ads with self-serve manager

OpenAI is widening its ChatGPT ads pilot with a beta self-serve Ads Manager, new bidding options and broader measurement tools. The push signals a deeper move into advertising as the company expands the program into several international markets.

OpenAI launches Artificial Intelligence deployment consulting unit

OpenAI has created a new consulting and deployment business aimed at helping enterprises build and roll out Artificial Intelligence systems. The move mirrors a similar push by Anthropic and signals a broader effort by model providers to capture more of the enterprise services market.

SK Group warns DRAM shortages could curb memory use

SK Group chairman Chey Tae-won warned that customers may reduce memory consumption through infrastructure and software optimization if DRAM suppliers fail to raise output. Demand from Artificial Intelligence data centers is keeping the market tight as memory makers weigh expansion against the long timelines for new fabs.
