SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning

1University of California, Los Angeles
2School of Artificial Intelligence, Chinese Academy of Sciences
3Microsoft
4Tsinghua University

Figure 1: 32B model performance across mainstream reasoning benchmarks (GSM8k, MATH, Minerva-Math, Olympiad-Bench, Gaokao-2023, AMC23, AIME 24 & 25) and different mathematical domains.

Introduction

Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for training large language models (LLMs) on complex reasoning tasks, such as mathematical problem solving. A prerequisite for scaling RLVR is a high-quality problem set with precise, verifiable answers. However, the scarcity of well-crafted, human-labeled math problems and the limited verifiability of answers in existing distillation-oriented synthetic datasets restrict their effectiveness in RL. Additionally, most problem synthesis strategies indiscriminately expand the problem set without considering the model's capabilities, yielding useful questions at low efficiency. To mitigate this issue, we introduce a Self-aware Weakness-driven problem Synthesis framework (SwS) that systematically identifies model deficiencies and leverages them for problem augmentation. Specifically, we define weaknesses as questions that the model consistently fails to learn from its iterative sampling during RL training. We then extract the core concepts from these failure cases and synthesize new problems to strengthen the model's weak areas in subsequent augmented training, enabling it to focus on and gradually overcome its weaknesses. Without relying on external knowledge distillation, our framework enables robust generalization by empowering the model to self-identify and address its weaknesses in RL, yielding average performance gains of 10.0% and 7.7% on 7B and 32B models across eight mainstream reasoning benchmarks.

Methodology

SwS is a targeted problem synthesis framework designed to strengthen a model's weakest reasoning capabilities through RL training. The framework begins with preliminary RL training on an initial problem set and identifies problems that the model fails to learn efficiently as reasoning weaknesses. SwS then groups these recorded questions by category, extracts their underlying concepts, and recombines the concepts to synthesize new problems at appropriate difficulty levels. Finally, the model is further trained on the augmented problem set, which includes the weakness-driven synthetic questions, aiming to address its weaknesses without relying on external knowledge distillation.
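For concreteness, the overall loop can be sketched as below. This is a minimal sketch assuming the caller supplies the RL trainer, weakness test, concept extractor, and problem synthesizer as callables; every name here is an illustrative placeholder, not our released implementation.

def sws_pipeline(policy, initial_problems,
                 run_rl, is_weakness, group_by_category,
                 extract_concepts, synthesize_problems):
    """One SwS round: preliminary RL, weakness identification,
    weakness-driven synthesis, then RL on the augmented set."""
    # Stage 1: preliminary RL on the initial problem set; run_rl is assumed
    # to log per-problem sampling accuracy at every training epoch.
    policy, accuracy_log = run_rl(policy, initial_problems)

    # Identify weaknesses: problems the model fails to learn efficiently.
    # Problems are assumed to be dicts carrying an "id" field.
    weaknesses = [p for p in initial_problems
                  if is_weakness(accuracy_log[p["id"]])]

    # Group failures by category, extract their core concepts, and
    # recombine the concepts into new problems of suitable difficulty.
    synthetic = []
    for category, group in group_by_category(weaknesses).items():
        synthetic += synthesize_problems(extract_concepts(group), category)

    # Stage 2: continue RL on the augmented problem set.
    policy, _ = run_rl(policy, initial_problems + synthetic)
    return policy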

Weakness Identification: We define problems with low learning efficiency as weaknesses using the following rule. A problem is classified as a weakness if it meets two criteria: (1) the model never reaches a response accuracy of 50% at any training epoch, and (2) its accuracy trend declines over time, as indicated by a negative slope.
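A minimal sketch of this rule, assuming `accuracies` is the per-epoch sampling accuracy recorded for one problem; the 50% threshold follows the text, and the least-squares slope is one reasonable reading of the "negative slope" criterion.

import numpy as np

def is_weakness(accuracies, threshold=0.5):
    """Classify a problem as a weakness from its per-epoch accuracies."""
    acc = np.asarray(accuracies, dtype=float)
    if len(acc) < 2:
        return False  # not enough epochs to estimate a trend

    # Criterion 1: accuracy never reaches the threshold at any epoch.
    never_learned = bool(np.all(acc < threshold))

    # Criterion 2: the accuracy trend declines over time, i.e., the
    # least-squares slope of accuracy vs. epoch index is negative.
    slope = np.polyfit(np.arange(len(acc)), acc, deg=1)[0]
    declining = slope < 0

    return never_learned and declining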

Experiment Results

Main Results


Weakness Mitigation

The effectiveness of weakness mitigation: Training the model on the augmented set enables it to solve a higher proportion of its initial consistent-failure cases across most domains, compared to training on the original questions alone. The greatest gains appear in its weakest areas: Intermediate Algebra (20%), Geometry (5%), and Precalculus (5%).

Extensions for SwS

1. Weak-to-Strong Generalization

When training a top-performing reasoning model, no stronger model is available to produce reference answers for the problems identified as its weaknesses. To explore the potential of applying our SwS pipeline to state-of-the-art models, we extend it to the Weak-to-Strong Generalization setting, using a generally weaker teacher, which may nonetheless outperform the stronger model in specific domains, to label reference answers for the synthetic problems.

One concern is that the weaker teacher may produce incorrect answers. However, our self-consistency mechanism, which combines answer generation with difficulty filtering before the second stage of RL training, effectively eliminates most erroneous answers. The accuracy of teacher-labeled answers on the MATH-500 test set is shown in the figure below.
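A minimal sketch of such a labeling-and-filtering step is shown below. The `teacher_sample` and `policy_sample` callables, the sample counts, and the agreement and pass-rate thresholds are illustrative assumptions, not the exact settings used in our pipeline.

from collections import Counter

def self_consistency_label(problem, teacher_sample, n=16, min_agreement=0.5):
    """Label a synthetic problem with the teacher's majority-vote answer.
    Returns None when the samples do not agree strongly enough, which
    discards most erroneous labels from a weaker teacher."""
    answers = [teacher_sample(problem) for _ in range(n)]
    (answer, count), = Counter(answers).most_common(1)
    if count / n < min_agreement:
        return None
    return answer

def difficulty_filter(problem, answer, policy_sample, k=8, keep=(0.1, 0.9)):
    """Keep problems that are neither trivial nor hopeless for the policy,
    judged by its pass rate against the voted reference answer."""
    pass_rate = sum(policy_sample(problem) == answer for _ in range(k)) / k
    return keep[0] <= pass_rate <= keep[1]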

2. Self-evolving Targeted Problem Synthesis

We also explore the Self-evolving paradigm for addressing model weaknesses by executing the full SwS pipeline with the policy itself. This strategy leverages self-consistency to guide the policy toward effective trajectories that reach accurate answers, while question generation and quality filtering additionally exercise its general instruction-following capabilities, further enhancing reasoning.

3. Weakness-driven Selection

We propose an alternative extension that augments the training set using identified weaknesses from a larger mathematical reasoning dataset. The detailed Weakness-driven Selection extension of SwS is presented in the following algorithm.
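The exact procedure is given in the algorithm; as a rough illustrative sketch of the idea, one can sample from the larger pool in proportion to the category distribution of the identified weaknesses. The proportional-sampling rule and the `category_of` helper below are assumptions for illustration, not the algorithm itself.

import random
from collections import Counter

def weakness_driven_selection(weaknesses, pool, budget, category_of, seed=0):
    """Sample problems from a larger pool in proportion to the category
    distribution of identified weaknesses. `category_of` maps a problem
    to its domain label (e.g., Geometry)."""
    rng = random.Random(seed)
    weak_counts = Counter(category_of(w) for w in weaknesses)
    total = sum(weak_counts.values())

    selected = []
    for category, count in weak_counts.items():
        # Allocate the selection budget according to how often this
        # category appears among the identified weaknesses.
        quota = round(budget * count / total)
        candidates = [p for p in pool if category_of(p) == category]
        selected += rng.sample(candidates, min(quota, len(candidates)))
    return selected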

The model trained with weakness-driven augmentation outperforms its randomly augmented counterpart in accuracy on both the full set of evaluation benchmarks and the competition-level subset. In panel (c), the model quickly fits the randomly selected problems during training, whereas the problems selected based on weaknesses remain more challenging, providing richer learning signals and promoting continued development of reasoning skills.

Visualization

The following visualization presents a geometry failure case from the MATH-12k training set, together with its extracted concepts and our weakness-driven synthetic questions at varying difficulty levels, all closely aligned with the original question.

Cite Us

@misc{liang2025swsselfawareweaknessdrivenproblem,
      title={SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning}, 
      author={Xiao Liang and Zhong-Zhi Li and Yeyun Gong and Yang Wang and Hengyuan Zhang and Yelong Shen and Ying Nian Wu and Weizhu Chen},
      year={2025},
      eprint={2506.08989},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.08989}, 
}