DeepSeek Unveils New Method for Scaling Reward Models with SPCT

DeepSeek AI reveals a novel approach to making general reward models in AI systems more scalable at inference time.

DeepSeek AI, a leader in the large language model field, has unveiled a new technique for scaling general reward models (GRMs) at inference time. The method, Self-Principled Critique Tuning (SPCT), documented in the company's recent research paper, aims to improve reward generation by training the model to dynamically produce its own principles and critiques, using rejection fine-tuning followed by rule-based online reinforcement learning.

At a time when the focus in scaling large language models has shifted toward the inference phase, DeepSeek's new method aligns with models like OpenAI’s o1, which pair reinforcement learning with additional compute at test time. This reflects a growing trend of using reinforcement learning to improve model performance by refining reasoning processes and strengthening decision-making.

DeepSeek's SPCT approach addresses the challenge of scaling reward modeling for large language models: rejection fine-tuning and rule-based online reinforcement learning teach the GRM to generate evaluation principles and critiques adaptively, and reward quality can then be improved further at inference time by sampling several such generations and aggregating the resulting scores. Experimental results reported in the paper show SPCT outperforming existing reward-modeling methods, setting the stage for further releases, including the anticipated R2 model from DeepSeek.
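To make the inference-time scaling idea concrete, here is a minimal sketch of sampling several principle-and-critique generations from a generative reward model and taking a majority vote over the extracted scores. The function names (`generate_critique`, `extract_score`, `scaled_reward`), the "x/10" score format, and the mocked model call are illustrative assumptions, not DeepSeek's implementation; their full method also uses a meta reward model to guide the voting.

```python
import random
import re
from collections import Counter
from typing import Optional


def generate_critique(question: str, response: str) -> str:
    """Stand-in for a call to a generative reward model (GRM).

    An SPCT-trained GRM writes its own evaluation principles, critiques
    the response against them, and ends with a numeric score. Here that
    behaviour is mocked with a canned critique and a noisy score so the
    voting logic below is runnable.
    """
    score = random.choice([6, 7, 7, 7, 8])  # mock: replace with a real model call
    return (
        "Principle: answers should be factually accurate and complete.\n"
        "Critique: the response addresses the question but omits detail.\n"
        f"Final score: {score}/10"
    )


def extract_score(critique: str) -> Optional[int]:
    """Pull the trailing numeric score out of a generated critique."""
    match = re.search(r"(\d+)\s*/\s*10", critique)
    return int(match.group(1)) if match else None


def scaled_reward(question: str, response: str, k: int = 8) -> Optional[int]:
    """Inference-time scaling: sample k principle+critique generations
    and aggregate them by majority vote over the extracted scores."""
    scores = [extract_score(generate_critique(question, response)) for _ in range(k)]
    scores = [s for s in scores if s is not None]
    if not scores:
        return None
    return Counter(scores).most_common(1)[0][0]


if __name__ == "__main__":
    reward = scaled_reward("What is SPCT?", "A tuning method for reward models.", k=16)
    print(f"Aggregated reward: {reward}/10")
```

The reward model itself never changes at inference in this setup; spending more samples simply yields a more stable reward signal, which is what makes the approach scalable at test time.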
