
RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models

https://arxiv.org/pdf/2402.10038
Reinforcement learning from human feedback (RLHF) has been extensively employed to align large language models with user intent. However, proximal policy optimization (PPO) based RLHF is occasionally unstable, requiring significant hyperparameter finetuning, and computationally expensive to maximize the estimated reward during alignment. Recently, direct preference..
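For context on the method being contrasted with PPO-based RLHF, here is the standard DPO objective as usually stated (a reference sketch only; RS-DPO's rejection-sampling step for constructing the preference pairs is not shown, and $y_w$, $y_l$, $\beta$, $\pi_{\mathrm{ref}}$ follow the usual DPO notation rather than anything quoted from this paper):

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

Here $y_w$ and $y_l$ are the preferred and rejected responses for prompt $x$, $\pi_{\mathrm{ref}}$ is a frozen reference (typically the SFT) policy, and $\beta$ controls how far $\pi_\theta$ may drift from it; no reward-model rollouts or PPO loop are required, which is the efficiency argument the excerpt alludes to.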

Uncategorized 2024.11.01

Self-Play Preference Optimization for Language Model Alignment (paper review)

https://arxiv.org/pdf/2405.00675
Standard reinforcement learning from human feedback (RLHF) approaches relying on parametric models like the Bradley-Terry model fall short in capturing the intransitivity and irrationality in human preferences. Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences, enabling more fl..
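For reference, the Bradley-Terry assumption the excerpt criticizes models every pairwise preference through a single scalar reward $r(x, y)$:

$$
P(y_1 \succ y_2 \mid x) = \frac{\exp\big(r(x, y_1)\big)}{\exp\big(r(x, y_1)\big) + \exp\big(r(x, y_2)\big)} = \sigma\big(r(x, y_1) - r(x, y_2)\big)
$$

Because every response is ranked by one scalar, the induced preferences are always transitive, so cyclic human judgments (A over B, B over C, C over A) cannot be represented; working with the preference probability $P(y_1 \succ y_2 \mid x)$ directly, as the excerpt suggests, avoids committing to that structure.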

Uncategorized 2024.11.01

InferAligner (paper review): Inference-Time Alignment for Harmlessness through Cross-Model Guidance, 2024

https://arxiv.org/abs/2401.11206
InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance
With the rapid development of large language models (LLMs), they are not only used as general-purpose AI assistants but are also customized through further fine-tuning to meet the requirements of different applications. A pivotal factor in the success of c..

Uncategorized 2024.10.29

ArmoRM (paper review): Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts

https://arxiv.org/pdf/2406.12845
https://github.com/RLHFlow/RLHF-Reward-Modeling
Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models (LLMs) with human preferences. The RLHF process typically starts by training a reward model (RM) using human preference data. Conventional RMs are trained on pairwise responses to the same user reque..
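As a reference point for the "conventional RM" setup the excerpt describes, here is a minimal PyTorch sketch of the pairwise Bradley-Terry reward-modeling loss (the `reward_model` callable and argument names are illustrative placeholders, not the API of the RLHF-Reward-Modeling repo; ArmoRM's multi-objective rewards and MoE gating are not shown):

```python
import torch.nn.functional as F

def pairwise_rm_loss(reward_model, chosen_ids, rejected_ids):
    """Bradley-Terry loss on (chosen, rejected) responses to the same prompt.

    reward_model maps a batch of token-id sequences to one scalar reward each;
    the loss pushes r(chosen) above r(rejected).
    """
    r_chosen = reward_model(chosen_ids)      # shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # shape: (batch,)
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Judging from the title, ArmoRM replaces this single opaque scalar with multiple interpretable objective scores combined by a mixture-of-experts gate, but that part of the method is outside what the truncated excerpt shows.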

Uncategorized 2024.10.29

MAVIS: Mathematical Visual Instruction Tuning (paper review)

https://arxiv.org/pdf/2407.08739
Multi-modal Large Language Models (MLLMs) have recently emerged as a significant focus in academia and industry. Despite their proficiency in general multi-modal scenarios, the mathematical problem-solving capabilities in visual contexts remain insufficiently explored. We identify three key areas within MLLMs that need to be improved: visual encoding of math diag..