To achieve ideal data augmentation for RLVR as discussed in Section rethinking, we introduce the
SvS framework, which leverages the policy itself to augment training problems through
online self-play, enabling sustainable self-improvement.
The central idea of SvS is to synthesize variational problems from the policy's correct solutions to
challenging (under-performing) training problems, and then to prompt the policy to solve these synthetic problems.
Ideally, the synthesized problems preserve the semantics and ground-truth answers of the originals, while their
representations, such as structures, contexts, or descriptions, may differ substantially, thereby encouraging the policy
to explore diverse reasoning strategies.
Specifically, as illustrated in Figure 2, each online augmented training batch at step t consists of three components (a code sketch follows the list):
- Original Problem Solving: The policy generates solutions to training problems, with under-performing ones retained as challenging problems.
- Variational Problem Synthesis: The correct solutions to the challenging problems are used as context to synthesize variational problems that maintain the reference answers of the original ones.
- Synthetic Problem Solving: The policy is prompted to solve the self-synthesized variational problems.
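The batch construction at each step can be summarized as follows. This is a minimal sketch for exposition only: the helpers `policy.generate`, `policy.synthesize_variation`, the answer checker `verify`, and the accuracy thresholds are assumptions, not the actual interfaces or hyperparameters of SvS.

```python
from dataclasses import dataclass

@dataclass
class Problem:
    question: str
    answer: str  # ground-truth reference answer

def build_augmented_batch(policy, verify, problems, n_samples=8, low=0.0, high=0.5):
    """Sketch of one SvS training batch at step t (illustrative only)."""
    batch, challenging = [], []

    # 1) Original problem solving: sample solutions and retain under-performing
    #    (challenging) problems together with one of their correct solutions.
    for prob in problems:
        solutions = [policy.generate(prob.question) for _ in range(n_samples)]
        correct = [s for s in solutions if verify(s, prob.answer)]
        batch.append((prob, solutions))
        if low < len(correct) / n_samples <= high:  # hard, but at least one correct solution
            challenging.append((prob, correct[0]))

    # 2) Variational problem synthesis: condition on a correct solution to
    #    rewrite the problem while keeping the original reference answer.
    synthetic = [
        Problem(policy.synthesize_variation(p.question, sol), p.answer)
        for p, sol in challenging
    ]

    # 3) Synthetic problem solving: the policy attempts its own variations.
    for prob in synthetic:
        solutions = [policy.generate(prob.question) for _ in range(n_samples)]
        batch.append((prob, solutions))

    return batch
```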
We present an example of variational problem synthesis and its reward shaping. If a synthetic problem is either trivially solvable (too simple) or unsolvable (no sampled solution matches the original answer), it receives a negative reward.
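The reward shaping for synthesis can be expressed as a small function. The sketch below assumes difficulty is measured by the policy's solve rate on the synthetic problem; the specific reward values are illustrative, not the paper's exact scheme.

```python
def synthesis_reward(solve_rate: float) -> float:
    """Reward for a synthesized variational problem (illustrative values).

    solve_rate: fraction of the policy's sampled solutions whose final
    answers match the original problem's reference answer.
    """
    if solve_rate == 1.0:   # trivially solvable: too simple to be useful
        return -1.0
    if solve_rate == 0.0:   # unsolvable: no sampled solution matches the answer
        return -1.0
    return 1.0              # informative difficulty: positive reward
```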