MIT researchers propose self-distillation fine-tuning to add skills to large language models without forgetting

Researchers at MIT's Improbable AI Lab and ETH Zurich have introduced self-distillation fine-tuning, a method that lets large language models gain new enterprise skills while preserving prior capabilities. The approach uses a model's own in-context learning as a teacher, avoiding explicit reward functions and reducing catastrophic forgetting.

Enterprises fine-tuning large language models for specialized tasks often face catastrophic forgetting, where new training causes models to lose previously learned abilities, forcing organizations to maintain multiple separate systems. Researchers at MIT's Improbable AI Lab and ETH Zurich have developed a method called self-distillation fine-tuning that enables large language models to acquire new skills and proprietary knowledge without sacrificing prior performance. The technique leverages the in-context learning abilities of modern models to approximate on-policy learning without requiring reinforcement learning reward functions, offering a path toward adaptive Artificial Intelligence agents that can evolve alongside dynamic business needs.

Self-distillation fine-tuning addresses limitations of both reinforcement learning and supervised fine-tuning. On-policy learning traditionally relies on reinforcement learning with explicit reward functions, which works for domains like math and coding but fails where it is difficult or impossible to define a numerical reward, such as legal writing or meeting summarization, and struggles when the model has no prior knowledge of a topic. Supervised fine-tuning, in which a model mimics a fixed dataset of expert demonstrations, is inherently off-policy; it often fails to generalize to out-of-distribution cases and suffers heavily from catastrophic forgetting. Self-distillation fine-tuning creates a feedback loop inside a single model by splitting it into a frozen teacher and a trainable student: the teacher receives the query plus expert demonstrations and uses in-context learning to infer the correct answer and reasoning, while the student sees only the query and updates its parameters to match the teacher's output distribution, effectively turning prerecorded demonstrations into an on-policy-style learning signal.
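The teacher-student split described above can be sketched with toy numbers. The following is a minimal, hypothetical illustration, not the researchers' actual implementation: the frozen teacher's next-token distribution comes from a prompt that includes expert demonstrations, the trainable student's from the query alone, and the training signal is a KL divergence pulling the student's distribution toward the teacher's. All logits and vocabulary sizes here are made up for illustration.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a vector of logits.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    # KL(p || q) between two discrete token distributions.
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Toy next-token logits over a 4-token vocabulary (hypothetical numbers).
# Teacher pass: the frozen model conditioned on expert demonstrations + query.
# Student pass: the trainable model conditioned on the query alone.
teacher_logits = np.array([2.0, 0.5, -1.0, 0.2])
student_logits = np.array([0.3, 0.4, 0.1, 0.6])

teacher_p = softmax(teacher_logits)
student_p = softmax(student_logits)

# The distillation loss measures how far the student's distribution is
# from the teacher's; in training, gradients would flow only into the
# student's parameters, since the teacher copy is frozen.
loss = kl(teacher_p, student_p)
print(loss > 0.0)
```

In a real pipeline this per-token loss would be summed over the student's generated rollout and minimized with a standard optimizer, which is why the method needs online generation during training.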

In experiments using the open-weight Qwen 2.5 model on science question answering, software tool use and medical reasoning, self-distillation fine-tuning learned new tasks more effectively than standard supervised methods. On the Science Q&A benchmark, the self-distillation fine-tuning model achieved 70.2% accuracy, compared to 66.2% for the standard supervised fine-tuning approach. When taught the science task, the supervised fine-tuning model's performance on general questions collapsed, while the self-distillation fine-tuning model kept its "Previous Tasks" score steady at 64.5%, indicating preservation of earlier knowledge. In a synthetic "2025 Natural Disasters" knowledge-injection test, the self-distillation fine-tuning model, which had internalized reasoning over the new facts, scored 98% on indirect questions that required applying that knowledge. A sequential experiment showed the method could add science, tool-use and medical skills in succession without regression, suggesting that a single model could replace "model zoos" of adapters and reduce inference costs.

The approach is available as open-source code and is being integrated with the Hugging Face TRL (Transformer Reinforcement Learning) library, but it carries computational tradeoffs. Like reinforcement learning, the self-distillation fine-tuning pipeline requires online response generation during training: because the model must generate rollout answers to compare against the teacher, it is approximately four times slower and uses about 2.5 times more compute (FLOPs) than standard fine-tuning. The method also depends on sufficiently capable base models. Current experiments indicate that architectures around 4 billion parameters, such as Qwen 3 4B, have strong enough in-context learning to act as their own teachers, while earlier 3-billion-parameter models struggled. The researchers expect that improving small models could eventually bring self-distillation fine-tuning to 1-billion-parameter systems, moving the field closer to lifelong learning, where models continually improve from real-world user interactions and the growing compute spent on inference can be harnessed to update and refine model behavior over time.

Impact Score: 58

Anumana wins FDA clearance for pulmonary hypertension ECG Artificial Intelligence tool

Anumana has received FDA 510(k) clearance for an Artificial Intelligence-enabled pulmonary hypertension algorithm designed for use with standard 12-lead electrocardiograms. The company says the software can help clinicians spot early signs of disease within existing workflows and without moving patient data outside the health system environment.

Anu Bradford on tech sovereignty and regulatory fragmentation

Anu Bradford argues that Europe is wavering in its role as the world’s digital rule-setter just as governments everywhere move toward more state control over technology. Global companies are being pushed to treat geopolitical risk, data sovereignty, and Artificial Intelligence governance as core strategic issues.

Mistral launches text-to-speech model

Mistral has expanded its Voxtral family with a text-to-speech system aimed at enterprise voice applications. The company is positioning the open-weights model as a flexible alternative for organizations that want more control over deployment, cost and customization.

UK Parliament opens workforce inquiry on Artificial Intelligence

A UK Parliament committee is examining how Artificial Intelligence is changing business and work, with a focus on both economic opportunity and labour disruption. The inquiry is seeking evidence on government priorities as adoption expands across the economy.

Windows 11 tightens kernel trust for older drivers

Microsoft is changing Windows 11 kernel policy so new drivers must be signed through the Windows Hardware Compatibility Program. Older trusted drivers will still be allowed in some cases to preserve compatibility during the transition.
