The article explores why safety classifiers for large language models often fail against prompt injection attacks and proposes a new fine tuning method called delta sim that operates on activation geometry rather than input tokens. Building on earlier work about the geometry of safety failures, the author argues that bypass is not primarily about what the input text says, but about where it travels in activation space. Semantics preserving transformations such as leetspeak, homoglyphs, casing changes, wrappers and DSI barely affect meaning but still move activations significantly, and the amount of movement predicts bypass success better than the final activation location. The classifier is therefore not fooled because the input appears benign, but because the input followed a path in activation space that the model was never trained to recognize as malicious.
The proposed delta sim training methodology assumes that if a classifier already detects some attacks, then internal representations for malicious behavior already exist, and the problem is a misplaced decision boundary. The procedure runs a baseline classifier on a large dataset in which all data points are malicious, then splits the results into detected attacks and missed attacks. The missed attacks are treated as blind spots, and the model is trained to pull those blind spot representations toward the detected malicious cluster in activation space, effectively widening the route labeled as malicious. On the LLMail Inject dataset, which features 208,095 real attacks, baseline Prompt Guard 2 performance was described as far from perfect, with only ±30% of the attacks detected. After delta sim fine tuning the results reported were 99.2% detection on in distribution attacks and 1,826 out of 1,841 bypasses recovered, indicating a dramatic gain in recall on that dataset.
However, when the fine tuned model is evaluated across other datasets, a major trade off appears in the form of degraded probability calibration. Delta sim achieves the highest AUC ROC and separates malicious and benign representations better than other models, but benign inputs start receiving extremely high scores, and operators must use absurdly high thresholds to maintain low false positive rates. At very strict operating points such as 0.1% FPR, well calibrated models like ProtectAI actually perform better. The author concludes that displacement based training improves geometry, not confidence, and that calibration is a separate problem. Delta sim is positioned as useful in settings where missing attacks is worse than occasional false positives, such as detection, logging, triage or response with human review or a low FPR prefilter, but risky for automated blocking where false positives are costly. The deeper lesson is that classifiers already contain useful representations and failures often stem from where decision boundaries are placed, and while geometric fine tuning can widen malicious routes and sharpen separation between benign and malicious inputs, it can also trade one failure mode for another, with no free lunch for prompt injection defenses.
