
Kuaishou Introduces SRPO, Advancing Reinforcement Learning

Global AI Watch · Editorial Team · 5 min read · Synced Review

Kuaishou's Kwaipilot team has unveiled a new reinforcement learning framework called two-Staged history-Resampling Policy Optimization (SRPO). The approach addresses challenges common in traditional reinforcement learning methods for large language models (LLMs), such as performance bottlenecks and inefficient sample utilization. SRPO aims to improve scaling and reasoning capabilities through a two-stage training process that first builds reasoning ability on mathematical data and then extends it to code, combined with a history-resampling scheme that filters out uninformative training samples. The method has already shown promising results on notable benchmarks such as AIME24 and LiveCodeBench, matching or exceeding the performance of existing models with drastically fewer training steps.
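The history-resampling idea can be illustrated with a small sketch. In group-relative methods such as GRPO, a prompt whose rollouts all receive the same reward (all solved or all failed) yields a near-zero advantage and contributes no gradient signal, so such prompts can be filtered out using their recent reward history. The sketch below is a minimal illustration of that general idea, assuming binary rewards; the function and variable names are hypothetical, not taken from Kuaishou's implementation.

```python
def history_resample(prompt_histories, group_size=4):
    """Keep only prompts whose recent rollout rewards are informative.

    In group-relative policy optimization, a group where every rollout
    gets the same reward produces a zero group-relative advantage, so
    such prompts are dropped from the next training batch.

    prompt_histories: dict mapping prompt -> list of recent rewards.
    """
    kept = []
    for prompt, rewards in prompt_histories.items():
        recent = rewards[-group_size:]
        if len(set(recent)) > 1:  # mixed outcomes -> nonzero advantage
            kept.append(prompt)
    return kept

histories = {
    "easy prompt": [1, 1, 1, 1],    # always solved: no learning signal
    "hard prompt": [0, 0, 0, 0],    # never solved: no learning signal
    "useful prompt": [0, 1, 0, 1],  # mixed outcomes: informative
}
print(history_resample(histories))  # -> ['useful prompt']
```

In practice the filter would run between epochs on logged rollout rewards, keeping the effective batch full of samples that actually move the policy.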

The introduction of SRPO signals an important shift in the reinforcement learning landscape, as it not only demonstrates the potential for improved training efficiency but also highlights the challenges faced with existing algorithms like GRPO. This innovation is particularly significant for enhancing China's national AI strategy, paving the way for greater self-sufficiency in AI model training. By open-sourcing the SRPO-Qwen-32B model and sharing detailed technical reports, Kuaishou is contributing to the global AI research community while potentially reducing dependency on foreign-developed AI frameworks and methodologies.

Source: Synced Review
