Kwai AI's SRPO Framework Drastically Boosts LLM Reinforcement Learning Efficiency

Kwai AI's new SRPO method streamlines large language model training, achieving DeepSeek-level math and code results with only a fraction of the effort, reshaping reinforcement learning for Artificial Intelligence.

Recent research from Kwai AI's Kwaipilot team demonstrates a significant leap in large language model (LLM) reinforcement learning, introducing the Two-Staged history-Resampling Policy Optimization (SRPO) framework. The approach addresses well-known inefficiencies and bottlenecks in standard Group Relative Policy Optimization (GRPO), cutting post-training steps by roughly 90% while matching the benchmarks set by models like DeepSeek-R1-Zero in both mathematical and coding domains. By building on the same Qwen2.5-32B base as DeepSeek and open-sourcing their SRPO-Qwen-32B model, the team combines technical transparency with empirical success.

The SRPO framework uses a two-stage reinforcement learning process tailored for cross-domain reasoning. The first stage trains exclusively on challenging mathematical data to develop deep, reflective reasoning, incentivizing behaviors such as step-by-step problem decomposition and self-correction. The second stage integrates code data, reinforcing not only programming skill but also procedural thinking and tool-use capabilities. This staged approach resolves the cross-domain optimization conflicts inherent in mixed training and keeps the model's reasoning skills robust and transferable. Notably, SRPO matches DeepSeek-R1-Zero-level performance on the AIME24 (50) and LiveCodeBench (41.6) benchmarks with only about one-tenth of the RL steps required by comparable methods.
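
The staged curriculum itself is straightforward to express in code. The sketch below is a minimal, illustrative outline of such a two-stage schedule; the step counts, batch handling, and the `PolicyTrainer`/`rl_step` interface are assumptions for illustration, not the Kwaipilot implementation.

```python
# Minimal sketch of a two-stage RL schedule in the spirit of SRPO.
# Step counts, batch size, and the PolicyTrainer interface are assumptions.
import random

STAGE1_STEPS = 300   # stage 1: math-only RL
STAGE2_STEPS = 300   # stage 2: mixed math + code RL
BATCH_SIZE = 32

class PolicyTrainer:
    """Stand-in for a GRPO-style RL trainer (rollout, reward, policy update)."""
    def rl_step(self, prompts):
        # Sample rollouts for each prompt, score them with rule-based rewards,
        # and apply a policy-gradient update; omitted in this sketch.
        pass

def sample_batch(pools, batch_size=BATCH_SIZE):
    """Draw a prompt batch uniformly from the given data pools."""
    merged = [p for pool in pools for p in pool]
    return random.sample(merged, min(batch_size, len(merged)))

def train(trainer, math_prompts, code_prompts):
    # Stage 1: challenging math only, to elicit step-by-step decomposition
    # and self-correction before any code data is introduced.
    for _ in range(STAGE1_STEPS):
        trainer.rl_step(sample_batch([math_prompts]))

    # Stage 2: integrate code data to add procedural, tool-using skills
    # on top of the reasoning behaviors learned in stage 1.
    for _ in range(STAGE2_STEPS):
        trainer.rl_step(sample_batch([math_prompts, code_prompts]))
```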

Key innovations in SRPO include the history resampling technique, which filters out redundant or overly simple samples that provide little learning signal, and instead prioritizes examples that generate informative gradient updates. This substantially improves training efficiency and maintains effective learning curves, even in later training stages where traditional methods plateau. Rigorous data preprocessing further enhances model reliability by verifying solution correctness and categorizing problem difficulty. Experiments reveal that SRPO-trained models not only produce longer, more detailed responses but also spontaneously display advanced reflective behaviors like code-based self-verification, mirroring human-like problem-solving adaptability. This research marks a robust advancement in efficient, cross-domain RL for large language models and has set a new standard for open and reproducible Artificial Intelligence model development.
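
To make the resampling idea concrete, here is a minimal sketch of how such a filter might look. The record format, the reward convention (1 for a verified-correct rollout, 0 otherwise), and the choice to keep never-sampled prompts are illustrative assumptions; the gist is that prompts solved on every recent rollout contribute little learning signal and are dropped from the next epoch.

```python
# Illustrative sketch of history resampling: drop prompts whose recent rollouts
# were all correct (little gradient signal), keep the rest. The reward encoding
# (1 = correct, 0 = incorrect) and the record format are assumptions.

def resample_history(history):
    """history maps prompt_id -> list of rollout rewards from the last epoch.
    Returns the prompt ids to keep for the next epoch."""
    keep = []
    for prompt_id, rewards in history.items():
        if rewards and all(r == 1 for r in rewards):
            continue                   # trivially solved every time: drop
        keep.append(prompt_id)         # mixed, unsolved, or unseen: keep
    return keep

# Example usage
history = {
    "easy_sum":   [1, 1, 1, 1],   # solved in every rollout -> filtered out
    "hard_proof": [0, 0, 1, 0],   # mixed outcomes -> kept (informative)
    "new_item":   [],             # not yet sampled -> kept
}
print(resample_history(history))  # ['hard_proof', 'new_item']
```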
