SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning

1University of California, Los Angeles
2School of Artificial Intelligence, Chinese Academy of Sciences
3Microsoft
4Tsinghua University

Figure 1: 32B model performance across mainstream reasoning benchmarks (GSM8k, MATH, Minerva-Math, Olympiad-Bench, Gaokao-2023, AMC23, AIME 24 & 25) and different mathematical domains.

Introduction

Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for training large language models (LLMs) on complex reasoning tasks, such as mathematical problem solving. A prerequisite for scaling RLVR is a high-quality problem set with precise, verifiable answers. However, the scarcity of well-crafted, human-labeled math problems and the limited verifiability of answers in existing distillation-oriented synthetic datasets restrict their effectiveness in RL. Additionally, most problem synthesis strategies indiscriminately expand the problem set without considering the model's capabilities, yielding useful questions at low efficiency. To mitigate this issue, we introduce a Self-aware Weakness-driven problem Synthesis framework (SwS) that systematically identifies model deficiencies and leverages them for problem augmentation. Specifically, we define weaknesses as questions that the model consistently fails to learn from its iterative sampling during RL training. We then extract the core concepts from these failure cases and synthesize new problems to strengthen the model's weak areas in subsequent augmented training, enabling it to focus on and gradually overcome its weaknesses. Without relying on external knowledge distillation, our framework enables robust generalization by empowering the model to self-identify and address its weaknesses in RL, yielding average performance gains of 10.0% and 7.7% on 7B and 32B models across eight mainstream reasoning benchmarks.

Methodology

SwS is a targeted problem synthesis framework designed to strengthen a model's weakest reasoning capabilities through RL training. The framework begins with preliminary RL training on an initial problem set and identifies problems that the model fails to learn efficiently as reasoning weaknesses. SwS then groups these recorded questions by category, extracts their underlying concepts, and recombines the concepts to synthesize new problems at appropriate difficulty levels. Finally, the model is further trained on the augmented problem set, which includes the weakness-driven synthetic questions, aiming to address its weaknesses without relying on external knowledge distillation.
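For concreteness, the overall loop can be sketched as below. This is a minimal sketch assuming the caller supplies the RL trainer, weakness test, concept extractor, and problem synthesizer as callables; every name here is an illustrative placeholder, not our released implementation.

def sws_pipeline(policy, initial_problems,
                 run_rl, is_weakness, group_by_category,
                 extract_concepts, synthesize_problems):
    """One SwS round: preliminary RL, weakness identification,
    weakness-driven synthesis, then RL on the augmented set."""
    # Stage 1: preliminary RL on the initial problem set; run_rl is assumed
    # to log per-problem sampling accuracy at every training epoch.
    policy, accuracy_log = run_rl(policy, initial_problems)

    # Identify weaknesses: problems the model fails to learn efficiently.
    # Problems are assumed to be dicts carrying an "id" field.
    weaknesses = [p for p in initial_problems
                  if is_weakness(accuracy_log[p["id"]])]

    # Group failures by category, extract their core concepts, and
    # recombine the concepts into new problems of suitable difficulty.
    synthetic = []
    for category, group in group_by_category(weaknesses).items():
        synthetic += synthesize_problems(extract_concepts(group), category)

    # Stage 2: continue RL on the augmented problem set.
    policy, _ = run_rl(policy, initial_problems + synthetic)
    return policy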

Weakness Identification: We define problems with low learning efficiency as weaknesses using the following rule. A problem is classified as a weakness if it meets two criteria: (1) the model never reaches a response accuracy of 50% at any training epoch, and (2) its accuracy trend declines over time, as indicated by a negative slope.
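A minimal sketch of this rule, assuming `accuracies` is the per-epoch sampling accuracy recorded for one problem; the 50% threshold follows the text, and the least-squares slope is one reasonable reading of the "negative slope" criterion.

import numpy as np

def is_weakness(accuracies, threshold=0.5):
    """Classify a problem as a weakness from its per-epoch accuracies."""
    acc = np.asarray(accuracies, dtype=float)
    if len(acc) < 2:
        return False  # not enough epochs to estimate a trend

    # Criterion 1: accuracy never reaches the threshold at any epoch.
    never_learned = bool(np.all(acc < threshold))

    # Criterion 2: the accuracy trend declines over time, i.e., the
    # least-squares slope of accuracy vs. epoch index is negative.
    slope = np.polyfit(np.arange(len(acc)), acc, deg=1)[0]
    declining = slope < 0

    return never_learned and declining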

Experiment Results

Main Results


Weakness Mitigation

The effectiveness of weakness mitigation: Training the model on the augmented set enables it to solve a higher proportion of its initial consistent-failure cases across most domains, compared to training on the original questions alone. The greatest gains appear in its weakest areas: Intermediate Algebra (20%), Geometry (5%), and Precalculus (5%).

Extensions for SwS

1. Weak-to-Strong Generalization

When training a top-performing reasoning model, no stronger model is available to produce reference answers for the problems identified as its weaknesses. To explore the potential of applying our SwS pipeline to state-of-the-art models, we extend it to the Weak-to-Strong Generalization setting, using a generally weaker teacher, which may nonetheless outperform the stronger model in specific domains, to label reference answers for the synthetic problems.

One concern is that the weaker teacher may produce incorrect answers. However, our self-consistency mechanism, which combines answer generation with difficulty filtering before the second stage of RL training, effectively eliminates most erroneous answers. The accuracy of teacher-labeled answers on the MATH-500 test set is shown in the figure below.
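A minimal sketch of such a labeling-and-filtering step is shown below. The `teacher_sample` and `policy_sample` callables, the sample counts, and the agreement and pass-rate thresholds are illustrative assumptions, not the exact settings used in our pipeline.

from collections import Counter

def self_consistency_label(problem, teacher_sample, n=16, min_agreement=0.5):
    """Label a synthetic problem with the teacher's majority-vote answer.
    Returns None when the samples do not agree strongly enough, which
    discards most erroneous labels from a weaker teacher."""
    answers = [teacher_sample(problem) for _ in range(n)]
    (answer, count), = Counter(answers).most_common(1)
    if count / n < min_agreement:
        return None
    return answer

def difficulty_filter(problem, answer, policy_sample, k=8, keep=(0.1, 0.9)):
    """Keep problems that are neither trivial nor hopeless for the policy,
    judged by its pass rate against the voted reference answer."""
    pass_rate = sum(policy_sample(problem) == answer for _ in range(k)) / k
    return keep[0] <= pass_rate <= keep[1]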

2. Self-evolving Targeted Problem Synthesis

We also explore the Self-evolving paradigm for addressing model weaknesses by executing the full SwS pipeline with the policy itself. This strategy leverages self-consistency to guide the policy toward effective trajectories that reach accurate answers, while question generation and quality filtering additionally exercise its general instruction-following capabilities, further enhancing reasoning.

3. Weakness-driven Selection

We propose an alternative extension that augments the training set using identified weaknesses from a larger mathematical reasoning dataset. The detailed Weakness-driven Selection extension of SwS is presented in the following algorithm.
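The exact procedure is given in the algorithm; as a rough illustrative sketch of the idea, one can sample from the larger pool in proportion to the category distribution of the identified weaknesses. The proportional-sampling rule and the `category_of` helper below are assumptions for illustration, not the algorithm itself.

import random
from collections import Counter

def weakness_driven_selection(weaknesses, pool, budget, category_of, seed=0):
    """Sample problems from a larger pool in proportion to the category
    distribution of identified weaknesses. `category_of` maps a problem
    to its domain label (e.g., Geometry)."""
    rng = random.Random(seed)
    weak_counts = Counter(category_of(w) for w in weaknesses)
    total = sum(weak_counts.values())

    selected = []
    for category, count in weak_counts.items():
        # Allocate the selection budget according to how often this
        # category appears among the identified weaknesses.
        quota = round(budget * count / total)
        candidates = [p for p in pool if category_of(p) == category]
        selected += rng.sample(candidates, min(quota, len(candidates)))
    return selected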

The model trained with weakness-driven augmentation outperforms its randomly augmented counterpart in accuracy on both the full set of evaluation benchmarks and the competition-level subset. In panel (c), the model quickly fits the randomly selected problems during training, whereas the problems selected based on weaknesses remain more challenging, providing richer learning signals and promoting continued development of reasoning skills.

Visualization

The following visualization presents a geometry failure case from the MATH-12k training set, together with its extracted concepts and our weakness-driven synthetic questions at varying difficulty levels, all closely aligned with the original question.

Cite Us

@misc{liang2025swsselfawareweaknessdrivenproblem,
      title={SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning}, 
      author={Xiao Liang and Zhong-Zhi Li and Yeyun Gong and Yang Wang and Hengyuan Zhang and Yelong Shen and Ying Nian Wu and Weizhu Chen},
      year={2025},
      eprint={2506.08989},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.08989}, 
}