Verifier-Guided Reinforcement Learning for GSM8K Math Reasoning
Abstract
This project explores Reinforcement Learning from Verifier Rewards (RLVR) as a technique for improving multi-step math reasoning in language models. Despite advances in fine-tuning, existing instruction-tuned LLMs still produce arithmetic errors in multi-step reasoning tasks, lacking an inherent mechanism to verify and correct intermediate calculations. We apply RLVR to grade-school word problems from the GSM8K dataset, using the Flan-T5-Base model for its instruction-following capabilities. A SymPy-based verifier checks each numeric prediction and issues a reward of 1.0 for an exact match and 0.0 otherwise. In a supervised fine-tuning baseline, our model achieved 3.07% exact-match accuracy on a held-out 10% test split. We then fine-tune the model with RLVR using the PPO algorithm in Hugging Face’s TRL framework, which raises exact-match accuracy to 18%. These preliminary results show that verifier-guided reinforcement learning can yield significant gains in LLM-based math problem solving. Future work will investigate richer reward structures, model scaling, and additional verifier designs to further enhance reasoning performance.
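The verifier reward described above can be sketched as a simple scoring function; the helper names and the answer-extraction heuristic below are illustrative assumptions rather than the authors' implementation.

```python
# A minimal sketch of a SymPy-based exact-match verifier reward (assumed structure,
# not the authors' code): extract the final number from the model's output and
# return 1.0 if it matches the reference answer, 0.0 otherwise.
import re
from sympy import sympify

def extract_final_number(text: str):
    """Return the last number appearing in the text, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

def verifier_reward(prediction: str, reference: str) -> float:
    """Reward 1.0 for an exact numeric match with the reference answer, else 0.0."""
    pred = extract_final_number(prediction)
    ref = extract_final_number(reference)
    if pred is None or ref is None:
        return 0.0
    try:
        # SymPy compares numeric values, so "18" and "18.0" count as the same answer.
        return 1.0 if sympify(pred) == sympify(ref) else 0.0
    except Exception:
        return 0.0

# Example on a GSM8K-style response:
print(verifier_reward("... so she makes $18 per day. Answer: 18", "18"))  # 1.0
```

In RLVR training, a function of this form would be called on each generated solution to produce the scalar reward passed to the PPO update.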
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.