Verifier-Guided Reinforcement Learning for GSM8K Math Reasoning
Abstract
This project explores Reinforcement Learning from Verifier Rewards (RLVR) as a technique for improving multi-step math reasoning in language models. Despite advances in fine-tuning, existing instruction-tuned LLMs still produce arithmetic errors in multi-step reasoning tasks, lacking an inherent mechanism to verify and correct intermediate calculations. We apply RLVR to grade-school word problems from the GSM8K dataset, using the Flan-T5-Base model for its instruction-following capabilities. A SymPy-based verifier checks each numeric prediction and issues a reward of 1.0 for an exact match and 0.0 otherwise. In a supervised fine-tuning baseline, our model achieved 3.07% exact-match accuracy on a held-out 10% test split. We then fine-tune the model with RLVR using the PPO algorithm in Hugging Face’s TRL framework, which raises exact-match accuracy to 18%. These preliminary results show that verifier-guided reinforcement learning can yield significant gains in LLM-based math problem solving. Future work will investigate richer reward structures, model scaling, and additional verifier designs to further enhance reasoning performance.
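The verifier reward described above can be sketched as a simple scoring function; the helper names and the answer-extraction heuristic below are illustrative assumptions rather than the authors' implementation.

```python
# A minimal sketch of a SymPy-based exact-match verifier reward (assumed structure,
# not the authors' code): extract the final number from the model's output and
# return 1.0 if it matches the reference answer, 0.0 otherwise.
import re
from sympy import sympify

def extract_final_number(text: str):
    """Return the last number appearing in the text, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

def verifier_reward(prediction: str, reference: str) -> float:
    """Reward 1.0 for an exact numeric match with the reference answer, else 0.0."""
    pred = extract_final_number(prediction)
    ref = extract_final_number(reference)
    if pred is None or ref is None:
        return 0.0
    try:
        # SymPy compares numeric values, so "18" and "18.0" count as the same answer.
        return 1.0 if sympify(pred) == sympify(ref) else 0.0
    except Exception:
        return 0.0

# Example on a GSM8K-style response:
print(verifier_reward("... so she makes $18 per day. Answer: 18", "18"))  # 1.0
```

In RLVR training, a function of this form would be called on each generated solution to produce the scalar reward passed to the PPO update.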
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.