Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning (Paper Review)



jinuklee 2024. 8. 20. 15:14

We introduce Step-Controlled DPO (SCDPO), which we empirically show improves on DPO in enhancing LLMs’ mathematical reasoning abilities. We also conduct a qualitative analysis of SCDPO's credit assignment.

• We conduct experiments on chain-of-thought and code-integrated solutions, showing that SCDPO effectively improves the mathematical problem-solving performance of three different SFT models.
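Since SCDPO is presented as an improvement on DPO, it may help to recall what the standard DPO objective computes for a single preference pair. The block below is a minimal sketch of that standard loss only, not of SCDPO's step-controlled procedure, which this summary does not spell out. The function name `dpo_loss`, its parameter names, and the default `beta=0.1` are assumptions chosen for illustration; the inputs are assumed to be summed token log-probabilities of the preferred and dispreferred solutions under the policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All names here are hypothetical; inputs are assumed to be summed
    sequence log-probabilities under the policy and the reference model.
    """
    # How much more (in log space) the policy likes each solution
    # compared with the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin between the chosen and rejected solutions.
    margin = beta * (chosen_logratio - rejected_logratio)

    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log 2. The loss falls as the policy raises the chosen solution's probability relative to the rejected one.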

We then apply SCDPO to an InternLM2-20B model, resulting in a 20B model that achieves high scores of 88.5% on GSM8K and 58.1% on MATH, rivaling all other open-source LLMs and showing the great potential of our method.

Code: https://github.com/mathllm/Step-Controlled_DPO