https://arxiv.org/pdf/2408.03314
Related Work
Language model reasoning.
Language model performance on challenging mathematical reasoning tasks has rapidly improved in recent years [20, 22, 25, 32, 39]. These improvements can be attributed to three primary factors: 1) running continued pretraining on large corpora of math-focused data [20, 22, 32, 39]; 2) improving the LLM proposal distribution, either by applying targeted optimization on specific reasoning tasks via finetuning with RL [32, 35, 49, 50] or by enabling models to critique and revise their answers iteratively [4, 8, 23, 30]; and 3) enabling LLMs to benefit from additional test-time computation by finetuning verifiers [6, 7, 10, 22, 40, 42, 45, 48]. Our work builds on the second and third lines of research by analyzing the extent to which test-time compute scaling can be improved by 1) refining an LLM’s proposal distribution and 2) conducting search against verifiers.
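As a concrete illustration of the second mechanism, a minimal sketch of best-of-N search against a learned verifier might look as follows; the `sample_solution` and `verifier_score` callables are hypothetical placeholders for an LLM sampler and a verifier, not the paper's actual components.

```python
# Minimal sketch (not the paper's implementation) of best-of-N search against a
# verifier; `sample_solution` and `verifier_score` are hypothetical stand-ins.

def best_of_n(question, n, sample_solution, verifier_score):
    """Sample n candidate solutions and return the one the verifier scores highest."""
    candidates = [sample_solution(question) for _ in range(n)]
    return max(candidates, key=lambda c: verifier_score(question, c))
```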
Analyzing test-time compute scaling.
The tradeoff between train-time and test-time compute using Monte-Carlo tree search applied to the board game Hex was previously studied by Jones [16]. We instead focus our analysis on full-scale language model math reasoning problems. A survey by Villalobos and Atkinson [44] analyzed the tradeoff between training and inference across a number of domains. However, much of their language-model analysis focused on test-time compute scaling in settings where the ground-truth answer is known. In contrast, our analysis focuses on the setting where the ground-truth answer is not known. Additionally, a number of works in the RL literature have proposed methods, such as MCTS [19], which aim to navigate the tradeoff between test-time and training-time compute so as to enable a form of iterative self-play. The findings in our work can be used to help develop similar algorithms that operate on open-ended natural language.
Augmenting LLMs with test-time compute.
Beyond verifiers and revisions, a number of additional works have proposed alternative methods for enabling LMs to use test-time compute for reasoning. Namely, Wang et al. [46] conduct a hierarchical hypothesis search to enable inductive reasoning capabilities. A number of related works have proposed augmenting language models with tools at test-time, which can greatly improve their performance on downstream tasks [11, 26, 27]. Finally, several works have proposed methods for learning thought tokens in an unsupervised manner [12, 51], enabling models to more effectively utilize the additional test-time compute that comes with sampling longer sequences. While we focus our analysis on two primary mechanisms by which test-time compute can be scaled (verifiers and revisions), many of the methods by which we conduct our analysis (e.g. compute-optimal scaling according to question difficulty) could, in principle, also be applied to other methods of scaling test-time compute, and we believe this is an interesting direction for future research.
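To make the last point concrete, a hedged sketch of allocating a test-time compute budget by estimated question difficulty might look like the following; the difficulty bins, budgets, and the `estimate_difficulty` helper are illustrative assumptions, not the paper's actual allocation policy.

```python
# Illustrative sketch of difficulty-aware test-time compute allocation. The bins,
# budgets, and `estimate_difficulty` helper are assumptions for illustration only.

def allocate_budget(question, total_budget, estimate_difficulty):
    difficulty = estimate_difficulty(question)  # assumed to return a value in [0, 1]
    if difficulty < 0.3:
        # easy questions: a few sequential revisions are usually enough
        return {"strategy": "revisions", "samples": min(4, total_budget)}
    if difficulty < 0.7:
        # medium questions: parallel sampling re-ranked by a verifier
        return {"strategy": "best_of_n", "samples": min(16, total_budget)}
    # hard questions: spend the full budget on search against the verifier
    return {"strategy": "verifier_search", "samples": total_budget}
```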
https://arxiv.org/pdf/2407.00320
Despite their impressive capabilities, LLMs still face challenges when tackling problems with increasing reasoning steps due to the nature of autoregressive decoding. This is analogous to the “System 1” mode of thought in psychology (Daniel, 2017), which is characterized by fast, intuitive, but error-prone thinking. Much recent work has focused on enhancing the “System 1” capability of LLMs through prompt engineering, such as hierarchical prompting (Suzgun and Kalai, 2024; Zeng et al., 2023) and automatic prompt refinement (Madaan et al., 2024; Yang et al., 2024; Zhou et al., 2024). On the other hand, growing research attention is being paid to promoting the “System 2” mode of thought (Daniel, 2017) for LLMs, which is characterized by deliberative thinking steps with back-and-forth refinements. These are the key features for solving complex math reasoning tasks. In particular, prior efforts have studied enhancing LLMs both at inference time and through self-improvement using tree search algorithms (e.g., DFS and BFS, Yao et al. 2024) and Monte Carlo Tree Search (MCTS, Feng et al. 2023; Tian et al. 2024; Zhang et al. 2024; Wang et al. 2024b). However, these approaches often necessitate expert-designed utility functions (Tian et al., 2024; Ma et al., 2023; Kang et al., 2024), making them difficult to adapt to new scenarios. Moreover, they are computationally intensive, especially when tackling problems that require numerous logical steps (Xie et al., 2024), because these methods manage the expansion budget (the number of nodes to expand) ineffectively throughout the search process. As a typical example, BFS adopts a constant budget size throughout the search process, overlooking the fact that some tree nodes do not require much expansion. Some MCTS approaches (Tian et al., 2024) adopt an adaptive budget based on the importance of each node, but they still require a large number of simulations or rollouts to gather accurate statistics for decisions, and they overlook other important information, such as the depth (progress) of each node. As a result, there is a pressing need for more efficient and adaptable methods that enhance LLMs’ “System 2” reasoning capabilities to effectively handle complex reasoning tasks.
In this study, we introduce a guided tree search algorithm with dynamic node selection and node-level exploration budget calculation, aiming to maintain performance at a moderate cost. Concretely, we employ the value score as guidance to select the most promising node for the next action and expand it within a dynamically computed budget size, navigating the exploration-exploitation balance of guided tree search. We iterate the selection and expansion operations until the resulting trajectory either meets the expected quality score or exceeds the maximum number of iterations. Notably, the computational budget for each node is inversely correlated with its value score. This is inspired by the observation that nodes with higher value scores are more likely to yield the correct solution upon expansion; hence we allocate fewer computational resources to them to prevent unnecessary computation, and vice versa. This not only promotes efficient exploitation, facilitating faster convergence to the final answer, but also guarantees sufficient exploration to cover enough of the state space to maintain performance.
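A rough sketch of the selection-and-expansion loop described above, under illustrative assumptions: each node carries a value score in [0, 1] from a value network, the node-level budget is set inversely to that score, and the `expand` and `is_terminal` helpers are hypothetical stand-ins rather than the paper's exact implementation.

```python
import heapq

# Hedged sketch of value-guided selection and expansion with a node-level budget that
# decreases as the node's value score increases. `expand(node, budget)` is assumed to
# sample up to `budget` child steps and score them with a value network;
# `is_terminal(node)` is assumed to detect a complete solution. Both are hypothetical.

def guided_search(root, expand, is_terminal, max_iters=50, max_budget=8, threshold=0.9):
    tie = 0                                    # tie-breaker so the heap never compares nodes
    frontier = [(-root.value, tie, root)]      # max-heap on value via negation
    for _ in range(max_iters):
        if not frontier:
            break
        neg_value, _, node = heapq.heappop(frontier)   # select the most promising node
        value = -neg_value
        if is_terminal(node):
            if value >= threshold:
                return node                    # trajectory meets the expected quality score
            continue                           # finished but low-quality: do not expand
        # node-level budget: fewer expansions for high-value nodes, more for low-value ones
        budget = max(1, round(max_budget * (1.0 - value)))
        for child in expand(node, budget):
            tie += 1
            heapq.heappush(frontier, (-child.value, tie, child))
    return None
```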
We conduct experiments on the popular GSM8K (Cobbe et al., 2021) and TabMWP (Lu et al., 2022) benchmarks. Results show that our methods offer competitive performance at significantly lower computational cost (saving around 5×) compared to other baselines. Detailed analyses confirm the usefulness of each component and provide practical options for various settings. Additionally, we identify the limitations of this line of research and suggest possible ways to address them.
Related Work
Thanks to the robust capabilities of LLMs, significant advancements have been made in mathematical reasoning tasks, surpassing traditional approaches that rely on semantic parsing (Matsuzaki et al., 2017; Hopkins et al., 2017) or Abstract Syntax Tree (AST) decoding (Li et al., 2019; Qin et al., 2021; Wu et al., 2021).
Some studies improved the reasoning capabilities of LLMs through further training. These efforts involved either manually annotating or automatically generating feasible and challenging problems to fine-tune the LLMs (Luo et al., 2023; Yu et al., 2023; Liu and Yao, 2024; Toshniwal et al., 2024), as well as devising sophisticated training techniques, such as reinforcement learning (Luo et al., 2023; Wang et al., 2023; Lightman et al., 2023; Chen et al., 2024).
Another line of research focuses on inference-time improvement. Aside from the popular self-consistency (Wang et al., 2022), most of these studies treat the task as a tree search problem and investigate various search algorithms. Yao et al. (2024) were the first to introduce Tree-of-Thought (ToT), incorporating Depth-First Search (DFS) and Breadth-First Search (BFS) to address reasoning problems. Khalifa et al. (2023); Zhu et al. (2024); Xie et al. (2024) applied step-wise Beam Search to math problems, which operates similarly to BFS under certain parameter settings. To guide the search process, these studies either directly prompt the LLMs to evaluate the quality of each step (Yao et al., 2024; Xie et al., 2024), or train a verifier on the corresponding datasets to achieve better performance (Khalifa et al., 2023; Zhu et al., 2024). Later research delved into other sophisticated search algorithms, such as Monte Carlo Tree Search (MCTS, Tian et al. 2024; Zhang et al. 2024; Wang et al. 2024b), A∗ (Ma et al., 2023), and Levin Tree Search (Kang et al., 2024). Nonetheless, these approaches necessitate more robust verifiers to steer the search procedure. Concretely, Tian et al. (2024) utilize a blend of the value function, Process-supervised Reward Model (PRM), and Outcome-supervised Reward Model (ORM). Ma et al. (2023) and Kang et al. (2024) train their PRM models on PRM800K (Lightman et al., 2023), which offers manual annotations for 800k reasoning steps of problems from MATH (Hendrycks et al., 2021).
This study follows the same research line, yet it concentrates on developing an efficient algorithm that decreases computation costs while maintaining performance. In addition, we employ a naive but more practical value network as the verifier, which is trained solely with the final-answer labels as distant supervision.
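To illustrate the distant-supervision idea in the last sentence, here is a hedged sketch in which every step of a sampled solution inherits a binary label from whether the final answer is correct; the `ValueHead` module and the step-embedding input format are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

# Hedged sketch: a small value head scores each reasoning step, trained with binary
# cross-entropy against a label shared by all steps of a solution (1 if the final
# answer is correct, 0 otherwise). The step-embedding input is an assumption.

class ValueHead(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, step_embeddings):
        # step_embeddings: (num_steps, hidden_size) -> per-step value in (0, 1)
        return torch.sigmoid(self.score(step_embeddings)).squeeze(-1)

def train_step(value_head, optimizer, step_embeddings, answer_is_correct):
    # every step inherits the final-answer label (distant supervision)
    target = torch.full((step_embeddings.size(0),), float(answer_is_correct))
    pred = value_head(step_embeddings)
    loss = nn.functional.binary_cross_entropy(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```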
https://arxiv.org/pdf/2405.16265
Related Work
Multi-step Reasoning in LLMs. In recent years, several methods have been proposed to enhance LLM reasoning capability, ranging from fine-tuning the base model [5, 9, 20, 50] to chain-of-thought (CoT) prompting and its variants [19, 45, 51, 43, 6]. Specifically, Wei et al. [45] and Kojima et al. [19] demonstrate that CoT prompting can enhance LLM reasoning in few-shot and zero-shot settings. Such in-context improvement is grounded in the decoder architecture of LLMs; however, a single reasoning path (i.e., greedy decoding) often suffers from stochasticity and lacks the diversity needed for complex reasoning tasks. To mitigate this, Wang et al. [43] propose generating a diverse set of reasoning paths and performing majority voting. Similarly, Cobbe et al. [6] train a solution verifier, and Weng et al. [46] prompt the LLM for self-verification, in order to determine the quality of generated reasoning paths. Despite this, recent studies [10, 26, 41] found that LLMs often produce unfaithful reasoning. This highlights the importance of verifying each step of the reasoning chain [21]. Moreover, CoT does not take different alternatives into account at generation time, and there is no mechanism to evaluate the current generated chain and possibly look ahead or backtrack. Therefore, our work largely differs from the CoT literature, since we utilize step-level feedback to search for a reasoning path within a reasoning tree.
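As a small illustration of the self-consistency baseline mentioned above, a minimal sketch might look like this; the `sample_path` and `extract_answer` callables are hypothetical placeholders for an LLM sampler and an answer parser.

```python
from collections import Counter

# Minimal sketch of self-consistency: sample several reasoning paths, extract each
# final answer, and return the majority answer. Helpers are hypothetical placeholders.

def self_consistency(question, k, sample_path, extract_answer):
    answers = [extract_answer(sample_path(question)) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```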
Feedback-Guided Tree Search for LLM Reasoning. The ToT framework is introduced in [48, 23]. Inspired by this, various methods [8, 27, 12, 47, 3] have been proposed to find a good reasoning path within the tree, employing different heuristics and search algorithms. A straightforward heuristic is to prompt the LLM itself to assess its generated steps, as demonstrated in Yao et al. [48] with breadth/depth-first search, in Hao et al. [12] with Monte Carlo tree search, and in Xie et al. [47] with beam search. However, recent studies have shown that LLMs struggle to evaluate themselves and rectify their initial responses without any external feedback [17, 8]. In contrast, our method’s search heuristic relies on a reward model and is therefore more accurate. In a different approach, Feng et al. [8] and Tian et al. [39] propose learning a value function to estimate the value of the current reasoning path, while Ma et al. [27] train a process-supervised reward model (PRM) and utilize it with A∗-like tree search. In comparison, our method is more computationally efficient since we avoid the sample-complexity issues of value function learning. In particular, we show that incorporating the PRM as a heuristic with Levin tree search guarantees an upper bound on computation cost [30].
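A hedged sketch of what PRM-guided, Levin-style best-first search could look like: nodes are expanded in increasing order of depth(n) / p(n), where p(n) is the product of PRM scores along the path. The `expand_node`, `prm_score`, and `is_solution` helpers are hypothetical placeholders, not the paper's implementation.

```python
import heapq

# Illustrative sketch of PRM-guided best-first search with a Levin-tree-search style
# priority depth(n) / p(n); all helpers are hypothetical placeholders.

def prm_guided_search(root_state, expand_node, prm_score, is_solution, max_expansions=1000):
    # each frontier entry: (priority, tie, depth, path_prob, state)
    frontier = [(0.0, 0, 0, 1.0, root_state)]
    tie = 0
    for _ in range(max_expansions):
        if not frontier:
            return None
        _, _, depth, path_prob, state = heapq.heappop(frontier)
        if is_solution(state):
            return state
        for child in expand_node(state):
            child_prob = path_prob * max(prm_score(child), 1e-8)  # accumulate PRM scores
            priority = (depth + 1) / child_prob                   # Levin-style priority
            tie += 1
            heapq.heappush(frontier, (priority, tie, depth + 1, child_prob, child))
    return None
```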