
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations (Paper Review)

jinuklee 2024. 8. 23. 22:35

https://arxiv.org/abs/2312.08935

 


In this paper, we present an innovative process-oriented math process reward model called Math-Shepherd, which assigns a reward score to each step of math problem solutions. The training of Math-Shepherd is achieved using automatically constructed


A PRM (process reward model) trained to assign a reward to each step of a math problem solution.
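To make "a reward for each step" concrete, here is a minimal sketch of how step-level scores could be used to pick one solution out of several candidates. It is not the authors' code: the function names, the example scores, and the choice of taking the minimum step score as the solution score are all assumptions made for illustration (a product over steps would be another reasonable aggregation).

```python
from typing import List, Sequence, Tuple


def solution_score(step_scores: Sequence[float]) -> float:
    """Collapse per-step PRM scores into one solution-level score.

    Assumption: a solution is only as good as its weakest step,
    so we take the minimum over the step scores.
    """
    if not step_scores:
        raise ValueError("a solution needs at least one scored step")
    return min(step_scores)


def best_of_n(candidates: List[Tuple[str, Sequence[float]]]) -> str:
    """Return the candidate solution with the highest aggregated score.

    Each candidate is (solution_text, step_scores), where step_scores
    would come from a PRM; here they are supplied by hand.
    """
    best_text, _ = max(candidates, key=lambda c: solution_score(c[1]))
    return best_text


# Hypothetical step scores for two candidate solutions.
candidates = [
    ("solution A", [0.95, 0.20, 0.90]),  # one weak step drags it down
    ("solution B", [0.80, 0.75, 0.85]),  # uniformly decent steps
]
chosen = best_of_n(candidates)
```

With these made-up numbers, solution B is selected because its weakest step (0.75) still scores higher than solution A's weakest step (0.20). An outcome reward model would instead give a single score per solution and could not localize the bad step.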

https://huggingface.co/datasets/peiyi9979/Math-Shepherd?row=89
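The sketch below shows one way per-step labels might be read out of a record from this dataset. The record layout is an assumption on my part and should be checked against the dataset card: I assume an `input` string in which every step ends with the step tag `ки`, and a `label` string that is the same text with each tag replaced by `+` (good step) or `-` (bad step). The sample record is invented, not taken from the dataset.

```python
from typing import List

STEP_TAG = "ки"  # assumed step-end marker in the `input` field


def step_labels(input_text: str, label_text: str) -> List[bool]:
    """Return one bool per step: True for '+', False for '-'.

    Walks both strings in parallel: the text before each step tag in
    `input_text` must reappear in `label_text`, followed by '+' or '-'.
    """
    labels: List[bool] = []
    pos = 0
    # Everything before each tag; the piece after the last tag is dropped.
    segments = input_text.split(STEP_TAG)[:-1]
    for segment in segments:
        if not label_text.startswith(segment, pos):
            raise ValueError("input and label texts do not line up")
        pos += len(segment)
        mark = label_text[pos:pos + 1]
        if mark not in ("+", "-"):
            raise ValueError(f"expected '+' or '-' at position {pos}")
        labels.append(mark == "+")
        pos += 1
    return labels


# Invented record in the assumed format.
sample_input = "What is 2 + 3? Step 1: 2 + 3 = 5 ки\nStep 2: The answer is 6 ки"
sample_label = "What is 2 + 3? Step 1: 2 + 3 = 5 +\nStep 2: The answer is 6 -"
parsed = step_labels(sample_input, sample_label)
```

For the invented record, the first step is marked good and the second bad, which is the kind of step-level supervision a PRM is trained on.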