For tree-search algorithms, the main issue is how to construct a reliable value function and reward model.

LATS
majority voting, LLM evaluation score -> value function
simulation stage -> objective feedback == reward; backpropagate according to whether the task actually succeeded
Example) hotpot task

Alphamath
"We have a value model V and an LLM policy model π, which are the same model but with different final layers in our paper."
preliminary
method
I..
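
To make the LATS notes above concrete, here is a minimal sketch of the two pieces they mention: a value estimate that blends an LLM evaluation score with a majority-voting score, and backpropagation of an objective success reward from the simulation stage. The `Node` class, the function names, the blend weight `lam`, and the running-mean update are all assumptions chosen for illustration, not the exact formulation from the LATS paper.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass
class Node:
    """One state in the search tree (hypothetical structure, for illustration)."""
    parent: Node | None = None
    value: float = 0.0  # running mean of rewards backed up through this node
    visits: int = 0


def self_consistency(answers: list[str], candidate: str) -> float:
    """Majority-voting score: fraction of sampled answers equal to `candidate`."""
    if not answers:
        return 0.0
    return Counter(answers)[candidate] / len(answers)


def heuristic_value(llm_score: float, sc_score: float, lam: float = 0.5) -> float:
    """Blend the LLM evaluation score with the majority-voting score.

    `lam` is an assumed mixing weight in [0, 1].
    """
    return lam * llm_score + (1.0 - lam) * sc_score


def backpropagate(leaf: Node, reward: float) -> None:
    """Push the objective reward (e.g. 1.0 on success, 0.0 on failure)
    from the simulated leaf up to the root, updating running means."""
    node = leaf
    while node is not None:
        node.visits += 1
        node.value += (reward - node.value) / node.visits
        node = node.parent


if __name__ == "__main__":
    root = Node()
    child = Node(parent=root)
    backpropagate(child, 1.0)  # first rollout succeeded
    backpropagate(child, 0.0)  # second rollout failed
    print(child.visits, child.value, root.value)
```

In this sketch the heuristic value would be used to rank newly expanded nodes before any rollout reaches them, while the backed-up reward reflects actual task success, which is the distinction the notes draw between the value function and the reward.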