
V-STaR: Training Verifiers for Self-Taught Reasoners (paper review)

jinuklee 2024. 8. 27. 16:40

https://arxiv.org/abs/2402.06457

Common self-improvement approaches for large language models (LLMs), such as STaR, iteratively fine-tune LLMs on self-generated solutions to improve their problem-solving ability.
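
As a rough sketch of that loop (a minimal illustration, not the paper's implementation), the snippet below stubs out `generate_solutions`, `is_correct`, and `fine_tune` as hypothetical placeholders; only the control flow follows STaR, and the final comment marks where V-STaR departs by also keeping the incorrect solutions.

```python
import random

def generate_solutions(problem: str, k: int) -> list[str]:
    # Placeholder: in practice, sample k completions from the current LLM.
    return [f"candidate {i} for: {problem}" for i in range(k)]

def is_correct(problem: str, solution: str) -> bool:
    # Placeholder: test-case execution for code, answer matching for math.
    return random.random() < 0.3

def fine_tune(model, examples):
    # Placeholder: supervised fine-tuning on (problem, solution) pairs.
    return model

def star_iteration(model, problems: list[str], k: int = 16):
    correct, incorrect = [], []
    for p in problems:
        for s in generate_solutions(p, k):
            (correct if is_correct(p, s) else incorrect).append((p, s))
    # Plain STaR fine-tunes only on `correct` and throws `incorrect` away;
    # V-STaR keeps `incorrect` as well, to train a verifier (see below).
    return fine_tune(model, correct), correct, incorrect
```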


However, these approaches discard the many incorrect solutions generated during this process, potentially neglecting the valuable information those solutions contain.

To address this shortcoming, we propose V-STaR, which utilizes both the correct and incorrect solutions generated during the self-improvement process to train a verifier with DPO that judges the correctness of model-generated solutions.
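
As a minimal sketch of how such preference data might be assembled (the all-pairs scheme and the `prompt`/`chosen`/`rejected` field names are illustrative assumptions, not the paper's exact recipe): each correct solution is paired with an incorrect one as DPO's preferred and dispreferred responses.

```python
from itertools import product

def build_dpo_pairs(problem: str, correct: list[str],
                    incorrect: list[str]) -> list[dict]:
    # Every (correct, incorrect) combination for the same problem becomes
    # one DPO preference pair: chosen = correct, rejected = incorrect.
    return [
        {"prompt": problem, "chosen": c, "rejected": r}
        for c, r in product(correct, incorrect)
    ]

# Example: 2 correct x 3 incorrect solutions yield 6 preference pairs.
pairs = build_dpo_pairs("Solve 2 + 2.", ["4", "four"], ["5", "3", "22"])
assert len(pairs) == 6
```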


This verifier is used at inference time to select one solution among many candidate solutions.
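
A sketch of that best-of-k selection; `verifier_score` is a hypothetical stand-in for whatever score the DPO-trained verifier assigns (e.g., the likelihood it puts on the solution being correct).

```python
def verifier_score(problem: str, solution: str) -> float:
    # Placeholder: in V-STaR this comes from the DPO-trained verifier,
    # e.g. a likelihood that `solution` solves `problem`.
    return float(len(solution))  # dummy score so the example runs

def select_best(problem: str, candidates: list[str]) -> str:
    # Best-of-k: score each sampled candidate, return the top-ranked one.
    return max(candidates, key=lambda s: verifier_score(problem, s))
```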


Running V-STaR for multiple iterations results in progressively better reasoners and verifiers, delivering a 4% to 17% test accuracy improvement over existing self-improvement and verification approaches on common code generation and math reasoning benchmarks with LLaMA2 models.