Verifier-Guided Reinforcement Learning for GSM8K Math Reasoning

Authors

  • Michelle Lin, Department of Computer Science, George Mason University, Fairfax, VA
  • Jie Hao, Department of Computer Science, George Mason University, Fairfax, VA
  • Mingrui Liu, Department of Computer Science, George Mason University, Fairfax, VA

Abstract

This project explores Reinforcement Learning from Verifier Rewards (RLVR) as a technique for improving multi-step math reasoning in language models. Despite advances in fine-tuning, existing instruction-tuned LLMs still produce arithmetic errors on multi-step reasoning tasks and lack an inherent mechanism to verify and correct intermediate calculations. We apply RLVR to grade-school word problems from the GSM8K dataset, using the Flan-T5-base model for its instruction-following capabilities. A SymPy-based verifier checks each numeric prediction and issues a reward of 1.0 for an exact match and 0.0 otherwise. A supervised fine-tuning baseline achieves 3.07% exact-match accuracy on a held-out 10% test split. We then fine-tune the model with RLVR using the PPO algorithm in Hugging Face's TRL framework, which raises exact-match accuracy to 18%. These preliminary results show that verifier-guided reinforcement learning can yield significant gains in LLM-based math problem solving. Future work will investigate richer reward structures, model scaling, and additional verifier designs to further enhance reasoning performance.
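
The sketch below illustrates how a SymPy-based exact-match verifier of the kind described above might be implemented; the helper names (extract_final_answer, verifier_reward) and the answer-extraction regex are illustrative assumptions, not code from the paper.

```python
import re
from sympy import sympify, simplify


def extract_final_answer(text: str) -> str | None:
    """Pull the last number-like token from a string.

    GSM8K gold solutions end with '#### <number>'; for model output we
    assume (hypothetically) that the final number in the generation is
    the predicted answer.
    """
    matches = re.findall(r"-?\d+(?:,\d{3})*(?:\.\d+)?", text)
    return matches[-1].replace(",", "") if matches else None


def verifier_reward(prediction: str, gold_answer: str) -> float:
    """Return 1.0 if the predicted number matches the gold answer
    (checked symbolically with SymPy), otherwise 0.0."""
    pred = extract_final_answer(prediction)
    gold = extract_final_answer(gold_answer)
    if pred is None or gold is None:
        return 0.0
    try:
        # SymPy treats '42' and '42.0' as equal after simplification.
        return 1.0 if simplify(sympify(pred) - sympify(gold)) == 0 else 0.0
    except Exception:
        # Unparseable output earns no reward.
        return 0.0
```

In a PPO loop such as TRL's, this scalar would be computed for each sampled completion and passed to the trainer as the per-episode reward.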

Published

2025-09-25

Issue

Section

College of Engineering and Computing: Department of Computer Science