https://arxiv.org/abs/2312.08935
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
In this paper, we present an innovative process-oriented math process reward model called Math-Shepherd, which assigns a reward score to each step of math problem solutions. The training of Math-Shepherd is achieved using automatically constructed
A PRM (process reward model) trained to assign a reward to each step of a math problem solution.
https://huggingface.co/datasets/peiyi9979/Math-Shepherd?row=89
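To make the per-step reward idea concrete, here is a minimal sketch of how step-level scores could be turned into one score per solution and used to pick the best of several candidates. The function names and the numbers are hypothetical, and the aggregation rule (minimum by default) is an assumption for illustration, not something stated in this note.

```python
from typing import Callable, Sequence


def solution_score(
    step_rewards: Sequence[float],
    aggregate: Callable[[Sequence[float]], float] = min,
) -> float:
    """Collapse per-step rewards into a single score for the whole solution.

    The default (minimum) treats a solution as only as good as its weakest
    step. This choice is an assumption of the sketch; another aggregator,
    such as math.prod, can be passed in instead.
    """
    if not step_rewards:
        raise ValueError("a solution needs at least one scored step")
    return aggregate(step_rewards)


def best_of_n(candidates: Sequence[Sequence[float]]) -> int:
    """Return the index of the candidate solution with the highest score."""
    return max(range(len(candidates)), key=lambda i: solution_score(candidates[i]))


# Hypothetical step rewards for three candidate solutions to one problem.
# In practice these would come from the PRM, one score per reasoning step.
candidates = [
    [0.9, 0.8, 0.2],   # strong start, weak final step
    [0.7, 0.75, 0.8],  # consistently reasonable
    [0.95, 0.1, 0.9],  # one very doubtful middle step
]

best = best_of_n(candidates)
print(best, solution_score(candidates[best]))
```

With a step-level reward, a single doubtful step can sink an otherwise confident solution, which is what separates a PRM from a model that scores only the final answer.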