Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR

1University of California, Los Angeles    2Microsoft
3School of Artificial Intelligence, Chinese Academy of Sciences
4Hong Kong University of Science and Technology 5Hong Kong University of Science and Technology (Guangzhou)

Figure 1: We train Qwen2.5-32B-Instruct on the DAPO-17k dataset using our SvS strategy and standard RLVR. SvS achieves significant improvements in Pass@32 and Pass@1 (averaged over 32 samples) on the competition-level AIME benchmarks.



Introduction

Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a key paradigm for post-training Large Language Models (LLMs), particularly for complex reasoning tasks. However, vanilla RLVR training has been shown to improve Pass@1 performance at the expense of policy entropy, leading to reduced generation diversity and limiting the Pass@k performance, which typically represents the upper bound of LLM reasoning capability. In this paper, we systematically analyze the policy's generation diversity from the perspective of training problems and find that augmenting and updating training problems helps mitigate entropy collapse during training. Based on these observations, we propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, which uses the policy’s correct solutions to synthesize variational problems while ensuring their reference answers remain identical to the originals. This self-improving strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR, sustaining prolonged improvements and achieving absolute gains of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and AIME25 benchmarks. Experiments on 12 reasoning benchmarks across varying model sizes from 3B to 32B consistently demonstrate the generalizability and robustness of SvS.

Methodology

To achieve the ideal data augmentation for RLVR discussed above, we introduce the SvS framework, which leverages the policy itself to augment training problems through online self-play, enabling sustainable self-improvement.

The central idea of SvS is to synthesize variational problems from the policy's correct solutions to challenging (under-performing) training problems, and then prompt the policy to solve these synthetic problems. Ideally, the synthesized problems preserve the semantics and ground-truth answers of the originals, while their representations—such as structures, contexts, or descriptions—may differ substantially, encouraging the policy to explore diverse reasoning strategies. Specifically, as illustrated in Figure 2, each online augmented training batch at step t consists of three components (see the sketch after the list):

  • Original Problem Solving: The policy generates solutions to training problems, with under-performing ones retained as challenging problems.
  • Variational Problem Synthesis: The correct solutions to the challenging problems are used as context to synthesize variational problems that maintain the reference answers of the original ones.
  • Synthetic Problem Solving: The policy is prompted to solve the self-synthesized variational problems.
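The Python sketch below illustrates how such a batch could be assembled. The helper interfaces (sample, verify, synthesize), the dictionary fields, and the accuracy threshold are illustrative assumptions, not the paper's actual implementation or hyperparameters.

# Minimal sketch of assembling one SvS training batch at step t, under assumed
# interfaces: sample(policy, prompt, n) -> list of sampled solution strings,
# verify(solution, reference_answer) -> bool, and
# synthesize(policy, problem, correct_solution) -> a variational problem string.
def build_svs_batch(policy, problems, sample, verify, synthesize,
                    n_samples=8, hard_threshold=0.25):
    batch = []
    for prob in problems:
        # (1) Original problem solving
        sols = sample(policy, prob["question"], n_samples)
        correct = [s for s in sols if verify(s, prob["answer"])]
        batch.append(("solve_original", prob["question"], prob["answer"], sols))

        # Retain under-performing problems (low but nonzero accuracy) as challenging
        acc = len(correct) / n_samples
        if 0 < acc <= hard_threshold:
            # (2) Variational problem synthesis: a correct solution serves as context;
            #     the reference answer stays identical to the original's
            variation = synthesize(policy, prob["question"], correct[0])

            # (3) Synthetic problem solving, graded against the original reference answer
            syn_sols = sample(policy, variation, n_samples)
            batch.append(("solve_synthetic", variation, prob["answer"], syn_sols))
    return batch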

We present an example of variational problem synthesis and its reward shaping. If a synthetic problem is either trivially solvable (too simple) or unsolvable (no sampled solution matches the original answer), it receives a negative reward.
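A minimal sketch of this reward shaping is given below, assuming solve_rate is the fraction of sampled solutions to the synthetic problem whose final answer matches the original reference; the triviality threshold and reward magnitudes are illustrative, not the paper's exact values.

def synthesis_reward(solve_rate, trivial_threshold=0.9):
    # Unsolvable: no sampled solution reproduces the original reference answer
    if solve_rate == 0.0:
        return -1.0
    # Trivially solvable: too simple to provide a useful training signal
    if solve_rate >= trivial_threshold:
        return -1.0
    # Otherwise the synthesized problem earns a positive reward
    return 1.0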

Experiments

SvS significantly improves both Pass@1 and Pass@k. Models trained on the DAPO dataset with our SvS strategy achieve absolute gains of 18.3 and 22.8 points in Pass@32 on AIME24 and AIME25, respectively, compared to the standard RLVR baseline.

The SvS strategy consistently outperforms standard RLVR across all model sizes and evaluation benchmarks.


Analysis

1. SvS stably maintains policy entropy during RL training.

In RLVR training, policy entropy reflects the model’s capacity for sustained exploration. Notably, the standard RLVR baseline shows a continuous decline in entropy, whereas SvS maintains entropy within a relatively stable range, supporting sustained exploration and avoiding training collapse. This stability explains the continuous improvements in both Pass@1 and Pass@32 achieved by SvS, as shown in Figure 1, whereas RLVR saturates after a certain number of training steps.
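For reference, policy entropy is typically tracked as the average token-level entropy of the policy's next-token distribution over generated tokens; the snippet below is a generic diagnostic of this kind, not the paper's exact logging code.

import torch
import torch.nn.functional as F

def mean_token_entropy(logits, mask):
    # logits: (batch, seq_len, vocab); mask: (batch, seq_len) 0/1 tensor marking generated tokens
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    token_entropy = -(probs * log_probs).sum(dim=-1)   # per-token entropy, shape (batch, seq_len)
    return (token_entropy * mask).sum() / mask.sum()   # average over generated tokens only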

2. SvS pushes the reasoning boundary on Pass@k.

We further evaluate SvS's effectiveness and limits in incentivizing reasoning by scaling Pass@k from k = 1 to 1024, testing whether the SvS-trained model can solve problems beyond the capability of the base model. As presented, SvS significantly outperforms both the RLVR and initial-model baselines, even when k is scaled up to 1024. For Pass@k scaling on MATH-500, RLVR outperforms the initial model at small k values but is surpassed at larger k. In contrast, SvS consistently outperforms both RLVR and the initial model as k increases, demonstrating its strong generalization and robust reasoning diversity.
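For context, Pass@k is commonly computed with the standard unbiased estimator: with n sampled solutions per problem of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. Whether the paper uses exactly this estimator is an assumption; the snippet shows the usual numerically stable form.

import numpy as np

def pass_at_k(n, c, k):
    # n: total samples per problem, c: number of correct samples, k: k in pass@k
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k draws, so at least one is correct
    # 1 - C(n-c, k)/C(n, k) expanded as a product to avoid large factorials
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))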

3. SvS generalizes beyond reasoning tasks.

Notably, models trained with standard problem-solving RLVR exhibit a decline in performance on general-domain benchmarks. In contrast, the SvS-trained model not only avoids this degradation but also surpasses the initial instruction-following model on several general tasks. These results indicate that the additional problem-synthesis task helps prevent overfitting to mathematical reasoning.

Cite Us

@misc{liang2025pass1selfplayvariationalproblem,
      title={Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR}, 
      author={Xiao Liang and Zhongzhi Li and Yeyun Gong and Yelong Shen and Ying Nian Wu and Zhijiang Guo and Weizhu Chen},
      year={2025},
      eprint={2508.14029},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.14029}, 
}