Researchers at North Carolina State University have identified key components in large language models that play a critical role in ensuring these artificial intelligence systems provide safe responses to user queries. The work focuses on improving safety alignment while minimizing the “alignment tax” — the performance a model loses when it is made safer. The researchers argue that this is increasingly important as large language models are used for advice, instructions, and other tasks where unsafe outputs could cause harm.
The team framed the problem through the Superficial Safety Alignment Hypothesis, or SSAH. The hypothesis holds that current safety alignment often treats a user request as a binary choice: safe or unsafe. If the request is deemed safe, a response is generated and provided to the user. If the request is deemed unsafe, the model declines to generate one. The researchers say this decision is typically made once, at the beginning of the response-generation process, which leaves safety protections brittle and easier to bypass through reworded prompts or later fine-tuning for domain-specific use.
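The contrast between a one-shot safety decision and the continuous re-evaluation the researchers advocate can be sketched in a few lines. This is an illustrative toy, not the paper's method: `is_unsafe` is a hypothetical stand-in for a real safety classifier, and the token lists stand in for a model's generation loop.

```python
def is_unsafe(text: str) -> bool:
    """Toy stand-in for a real safety classifier."""
    return "bomb" in text

def generate_one_shot(prompt: str, tokens: list[str]) -> str:
    """Safety is checked only once, at the start of generation."""
    if is_unsafe(prompt):
        return "I can't help with that."
    # Once past the gate, later unsafe content slips through unchecked.
    return " ".join(tokens)

def generate_continuous(prompt: str, tokens: list[str]) -> str:
    """Safety is re-evaluated at every generation step."""
    out: list[str] = []
    for tok in tokens:
        if is_unsafe(prompt + " " + " ".join(out + [tok])):
            return "I can't help with that."
        out.append(tok)
    return " ".join(out)
```

With a prompt that looks benign, the one-shot gate emits whatever the model produces afterward, while the continuous check catches unsafe content mid-generation — the brittleness the SSAH points to.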
The research also identified safety-critical “neurons” in large language models that help decide whether a model should fulfill or refuse a request. The team found that freezing these specific neurons during fine-tuning allows the model to retain the safety characteristics of the original model while adapting to new tasks in a specific domain. The paper’s abstract says the researchers identified four types of attribute-critical components: Safety Critical Unit (SCU), Utility Critical Unit (UCU), Complex Unit (CU), and Redundant Unit (RU). It also says that leveraging redundant units in the pre-trained model as an “alignment budget” can effectively minimize the alignment tax while achieving the alignment goal.
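The freezing idea can be sketched with a gradient mask: during fine-tuning, updates to pre-identified safety-critical neurons (rows of a weight matrix) are zeroed out so those neurons keep their original values while the rest adapt. This is a minimal NumPy sketch under assumed conditions — the neuron indices and the linear-regression objective are placeholders, not the paper's actual identification procedure or training setup.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))      # weight matrix: 8 neurons x 16 inputs
safety_critical = [1, 4]          # hypothetical pre-identified SCU rows

frozen = np.zeros(W.shape[0], dtype=bool)
frozen[safety_critical] = True    # these rows receive no updates

x = rng.normal(size=(32, 16))     # fine-tuning inputs
y = rng.normal(size=(32, 8))      # new-task targets
lr = 0.01

W0 = W.copy()
for _ in range(100):
    pred = x @ W.T
    grad = 2 * (pred - y).T @ x / len(x)  # dL/dW for mean-squared error
    grad[frozen] = 0.0                    # freeze safety-critical neurons
    W -= lr * grad
```

After training, the frozen rows of `W` match `W0` exactly while the other rows have moved toward the new task — the mechanism by which, per the paper, safety behavior survives domain-specific fine-tuning.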
The researchers say the findings support a view that the atomic functional unit for safety in large language models is at the neuron level. They also argue that future work should focus on methods that let models continuously re-evaluate and re-select their reasoning direction, safe or unsafe, throughout response generation rather than only at the start. The paper, “Superficial Safety Alignment Hypothesis,” will be presented at the Fourteenth International Conference on Learning Representations (ICLR 2026), being held April 23-27 in Rio de Janeiro, Brazil.
