
LiteSearch: Efficacious Tree Search for LLM (Paper Review)

jinuklee 2024. 9. 17. 17:02

https://arxiv.org/pdf/2407.00320

Advancing Process Verification for Large Language Models via Tree-Based Preference Learning

https://arxiv.org/abs/2407.00390

Monte Carlo Tree Search

Such tree search methods often require more than 10 times the computational resources of greedy decoding due to wasteful search strategies, making them difficult to deploy in practical applications.

Results show that our methods offer competitive performance with significantly lower computation costs (saving around 5×) compared to other baselines.
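
The efficiency gain comes from guiding the tree search with a value network so that only promising nodes get expanded. Below is a minimal sketch of such a value-guided best-first search, assuming hypothetical helpers `generate_steps` (an LLM step sampler), `value_fn` (the value network), and `is_final` (terminal check); the expansion-budget rule here is a placeholder illustration, not the paper's actual formula.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Node:
    neg_value: float                          # heapq is a min-heap, so store -value
    steps: list = field(compare=False, default_factory=list)

def guided_tree_search(question, generate_steps, value_fn, is_final, max_expansions=50):
    """Best-first search over reasoning steps, ranked by a value network."""
    root = Node(neg_value=-value_fn(question, []), steps=[])
    frontier = [root]                         # single-element list is already a valid heap
    for _ in range(max_expansions):
        if not frontier:
            break
        node = heapq.heappop(frontier)        # most promising partial solution so far
        if is_final(node.steps):
            return node.steps                 # complete reasoning chain found
        # Placeholder budget: expand higher-value nodes more widely.
        budget = 4 if -node.neg_value >= 0.5 else 2
        for step in generate_steps(question, node.steps, n=budget):
            child_steps = node.steps + [step]
            child = Node(neg_value=-value_fn(question, child_steps), steps=child_steps)
            heapq.heappush(frontier, child)
    return None                               # expansion budget exhausted
```

Ranking the frontier by estimated value means low-value branches are simply never expanded, which is where the compute saving over exhaustive search strategies comes from.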

 

Related Work

Another line of research focuses on inference-time improvement. Aside from the popular self-consistency (Wang et al., 2022), most of these studies treat the task as a tree search problem and investigate various search algorithms. Yao et al. were the first to introduce Tree-of-Thought (ToT), incorporating Depth-First Search (DFS) and Breadth-First Search (BFS) to address reasoning problems. Other works ("Deductive Beam Search: Decoding Deducible Rationale for Chain-of-Thought Reasoning"; "Self-Evaluation Guided Beam Search for Reasoning") applied step-wise beam search to math problems, which operates similarly to BFS under certain parameter settings. To guide the search process, these studies either directly prompt the LLM to evaluate the quality of each step (ToT; Self-Evaluation Guided Beam Search for Reasoning), or train a verifier on corresponding datasets to achieve better performance (Khalifa et al., 2023; Zhu et al., 2024).

Later research delved into more sophisticated search algorithms, such as Monte Carlo Tree Search (MCTS; Tian et al., 2024; Zhang et al., 2024; Wang et al., 2024b), A* (Ma et al., 2023), and Levin Tree Search (Kang et al., 2024). Nonetheless, these approaches require more robust verifiers to steer the search procedure. Concretely, Tian et al. (2024) use a blend of a value function, a Process-supervised Reward Model (PRM), and an Outcome-supervised Reward Model (ORM). Ma et al. (2023) and Kang et al. (2024) train their PRMs on PRM800K (Lightman et al., 2023), which provides manual annotations for 800k reasoning steps of problems from MATH (Hendrycks et al., 2021).

This paper follows the same research line, but concentrates on developing an efficient algorithm that decreases computation costs while maintaining performance. Besides, it employs a naive but more practical value network as the verifier, trained solely with final answer labels as distant supervision.
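
Since the value network is trained only with final-answer labels, the step-level training signal has to come from distant supervision: every prefix of a sampled solution inherits the correctness of that solution's final answer. The sketch below illustrates one way such training data could be built; `sample_solution` and `extract_answer` are hypothetical helpers, and the exact labeling scheme in the paper may differ.

```python
def build_value_training_data(questions, gold_answers, sample_solution, extract_answer, k=8):
    """Build (question, partial_steps, label) pairs using only final-answer supervision."""
    examples = []
    for question, gold in zip(questions, gold_answers):
        for _ in range(k):                                    # sample k full solutions per question
            steps = sample_solution(question)                 # list of reasoning steps
            label = float(extract_answer(steps[-1]) == gold)  # 1.0 iff the final answer is correct
            # Distant supervision: every partial trajectory gets the outcome label.
            for t in range(1, len(steps) + 1):
                examples.append((question, steps[:t], label))
    return examples
```

When many sampled solutions share a prefix, their outcome labels can also be averaged, turning the binary signal into a Monte-Carlo estimate of that prefix's value.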