
BoNBoN Alignment for Large Language Models and the Sweetness of Best-of-n Sampling

https://arxiv.org/pdf/2406.00832 This paper concerns the problem of aligning samples from large language models to human preferences using best-of-n sampling, where we draw n samples, rank them, and return the best one. We consider two fundamental problems. First: what is the relationship between best-of-n and approaches to alignment that train LLMs to output samples with a high expected reward ..
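For reference, the best-of-n procedure described in the abstract reduces to a few lines. The sketch below is illustrative only; the `generate` and `reward` callables are hypothetical stand-ins for the base LLM sampler and a reward model, not anything defined in the paper.

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],        # draws one sample from the base LLM
    reward: Callable[[str, str], float],   # scores a (prompt, response) pair
    n: int = 8,
) -> str:
    """Draw n samples, rank them by reward, and return the best one."""
    samples: List[str] = [generate(prompt) for _ in range(n)]
    return max(samples, key=lambda s: reward(prompt, s))
```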

Uncategorized 2024.11.01

RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models

https://arxiv.org/pdf/2402.10038 Reinforcement learning from human feedback (RLHF) has been extensively employed to align large language models with user intent. However, proximal policy optimization (PPO) based RLHF is occasionally unstable, requires significant hyperparameter tuning, and is computationally expensive to maximize the estimated reward during alignment. Recently, direct preference..
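Reading the title, the rejection-sampling half plausibly looks something like the sketch below: sample several candidate responses, score them with a reward model, and keep contrastive pairs for later DPO training. This is my own hedged reconstruction, not the paper's code; `generate`, `reward`, and the `gap` threshold are assumed stand-ins.

```python
from itertools import combinations
from typing import Callable, List, Tuple

def build_preference_pairs(
    prompt: str,
    generate: Callable[[str], str],        # hypothetical SFT-model sampler
    reward: Callable[[str, str], float],   # hypothetical reward model
    k: int = 8,
    gap: float = 0.5,                      # assumed reward-gap threshold
) -> List[Tuple[str, str]]:
    """Return (chosen, rejected) pairs whose reward gap clears the threshold."""
    responses = [generate(prompt) for _ in range(k)]
    scored = [(resp, reward(prompt, resp)) for resp in responses]
    pairs: List[Tuple[str, str]] = []
    for (a, ra), (b, rb) in combinations(scored, 2):
        if abs(ra - rb) >= gap:
            pairs.append((a, b) if ra > rb else (b, a))
    return pairs
```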

Uncategorized 2024.11.01

Self-Play Preference Optimization for Language Model Alignment (Paper Review)

https://arxiv.org/pdf/2405.00675 Standard reinforcement learning from human feedback (RLHF) approaches relying on parametric models like the Bradley-Terry model fall short in capturing the intransitivity and irrationality in human preferences. Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences, enabling more fl..
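As a quick reminder of the Bradley-Terry assumption the abstract criticizes: preference probabilities are derived from scalar rewards, which makes them transitive by construction, whereas a general preference oracle P(y1 ≻ y2 | x) need not be. A minimal sketch of the Bradley-Terry probability, with `reward` as a hypothetical reward model:

```python
import math
from typing import Callable

def bradley_terry_prob(
    prompt: str,
    y1: str,
    y2: str,
    reward: Callable[[str, str], float],   # hypothetical reward model
) -> float:
    """P(y1 preferred over y2 | prompt) = sigmoid(r(x, y1) - r(x, y2))."""
    return 1.0 / (1.0 + math.exp(reward(prompt, y2) - reward(prompt, y1)))
```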

Uncategorized 2024.11.01