Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning (Paper Review)
https://arxiv.org/pdf/2407.00782
• We introduce Step-Controlled DPO (SCDPO), which we empirically show improves the performance of DPO in enhancing LLMs’ mathematical reasoning abilities. We also conduct qualitative analysis of credit assignment of SCDPO.
• We conduct experiments on chain-of-thought and code-integrated solutions, showing that SCDPO can effectively improve mathematical problem-solvi..