Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning (Paper Review)
jinuklee
2024. 8. 20. 15:14
https://arxiv.org/pdf/2407.00782
• We introduce Step-Controlled DPO (SCDPO), which we empirically show improves the performance of DPO in enhancing LLMs’ mathematical reasoning abilities. We also conduct a qualitative analysis of credit assignment in SCDPO.
• We conduct experiments on chain-of-thought and code-integrated solutions, showing that SCDPO can effectively improve mathematical problem-solving performance of three different SFT models.
• We then apply SCDPO to an InternLM2-20B model, resulting in a 20B model that achieves 88.5% on GSM8K and 58.1% on MATH, rivaling all other open-source LLMs and showing the great potential of our method.
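For readers who want a reminder of the DPO objective that SCDPO builds on, here is a minimal sketch of the standard DPO loss for a single preference pair. This is my own illustration and not code from the paper or its repository: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions, and the way SCDPO constructs its preference pairs from stepwise errors is not shown here.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    the policy or the frozen reference model. All names here are
    hypothetical; beta=0.1 is an assumed default, not a value from the paper.
    """
    # Log-ratio of policy to reference for each response.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin.
    margin = beta * (chosen_ratio - rejected_ratio)

    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # The policy prefers the chosen response more than the reference does,
    # so the loss falls below log(2).
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log(2); it drops as the policy shifts probability toward the chosen response relative to the reference.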
https://github.com/mathllm/Step-Controlled_DPO