
Kuaishou Introduces SRPO, Advancing Reinforcement Learning

Global AI Watch · Editorial Team · 5 min read · Synced Review

Kuaishou's Kwaipilot team has unveiled a new reinforcement learning framework called two-Staged history-Resampling Policy Optimization (SRPO). The approach addresses challenges common in traditional reinforcement learning methods for large language models (LLMs), such as performance bottlenecks and inefficient sample utilization. SRPO aims to improve scaling and reasoning capabilities through a two-stage training process that first builds reasoning ability on mathematical data and then extends it to code, combined with a history-resampling scheme that filters out uninformative training samples. The method has already shown promising results on notable benchmarks such as AIME24 and LiveCodeBench, matching or exceeding the performance of existing models with drastically fewer training steps.
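The history-resampling idea can be illustrated with a small sketch. In group-relative methods such as GRPO, a prompt whose rollouts all receive the same reward (all solved or all failed) yields a near-zero advantage and contributes no gradient signal, so such prompts can be filtered out using their recent reward history. The sketch below is a minimal illustration of that general idea, assuming binary rewards; the function and variable names are hypothetical, not taken from Kuaishou's implementation.

```python
def history_resample(prompt_histories, group_size=4):
    """Keep only prompts whose recent rollout rewards are informative.

    In group-relative policy optimization, a group where every rollout
    gets the same reward produces a zero group-relative advantage, so
    such prompts are dropped from the next training batch.

    prompt_histories: dict mapping prompt -> list of recent rewards.
    """
    kept = []
    for prompt, rewards in prompt_histories.items():
        recent = rewards[-group_size:]
        if len(set(recent)) > 1:  # mixed outcomes -> nonzero advantage
            kept.append(prompt)
    return kept

histories = {
    "easy prompt": [1, 1, 1, 1],    # always solved: no learning signal
    "hard prompt": [0, 0, 0, 0],    # never solved: no learning signal
    "useful prompt": [0, 1, 0, 1],  # mixed outcomes: informative
}
print(history_resample(histories))  # -> ['useful prompt']
```

In practice the filter would run between epochs on logged rollout rewards, keeping the effective batch full of samples that actually move the policy.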

The introduction of SRPO signals an important shift in the reinforcement learning landscape, as it not only demonstrates the potential for improved training efficiency but also highlights the challenges faced with existing algorithms like GRPO. This innovation is particularly significant for enhancing China's national AI strategy, paving the way for greater self-sufficiency in AI model training. By open-sourcing the SRPO-Qwen-32B model and sharing detailed technical reports, Kuaishou is contributing to the global AI research community while potentially reducing dependency on foreign-developed AI frameworks and methodologies.

Source: Synced Review
