
BoNBoN Alignment for Large Language Models and the Sweetness of Best-of-n Sampling

https://arxiv.org/pdf/2406.00832 This paper concerns the problem of aligning samples from large language models to human preferences using best-of-n sampling, where we draw n samples, rank them, and return the best one. We consider two fundamental problems. First: what is the relationship between best-of-n and approaches to alignment that train LLMs to output samples with a high expected reward ..
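For reference, the best-of-n procedure described in the abstract reduces to a few lines. The sketch below is illustrative only; the `generate` and `reward` callables are hypothetical stand-ins for the base LLM sampler and a reward model, not anything defined in the paper.

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],        # draws one sample from the base LLM
    reward: Callable[[str, str], float],   # scores a (prompt, response) pair
    n: int = 8,
) -> str:
    """Draw n samples, rank them by reward, and return the best one."""
    samples: List[str] = [generate(prompt) for _ in range(n)]
    return max(samples, key=lambda s: reward(prompt, s))
```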

Uncategorized 2024.11.01

RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models

https://arxiv.org/pdf/2402.10038 Reinforcement learning from human feedback (RLHF) has been extensively employed to align large language models with user intent. However, proximal policy optimization (PPO) based RLHF is occasionally unstable, requires significant hyperparameter tuning, and is computationally expensive to maximize the estimated reward during alignment. Recently, direct preference..
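Reading the title, the rejection-sampling half plausibly looks something like the sketch below: sample several candidate responses, score them with a reward model, and keep contrastive pairs for later DPO training. This is my own hedged reconstruction, not the paper's code; `generate`, `reward`, and the `gap` threshold are assumed stand-ins.

```python
from itertools import combinations
from typing import Callable, List, Tuple

def build_preference_pairs(
    prompt: str,
    generate: Callable[[str], str],        # hypothetical SFT-model sampler
    reward: Callable[[str, str], float],   # hypothetical reward model
    k: int = 8,
    gap: float = 0.5,                      # assumed reward-gap threshold
) -> List[Tuple[str, str]]:
    """Return (chosen, rejected) pairs whose reward gap clears the threshold."""
    responses = [generate(prompt) for _ in range(k)]
    scored = [(resp, reward(prompt, resp)) for resp in responses]
    pairs: List[Tuple[str, str]] = []
    for (a, ra), (b, rb) in combinations(scored, 2):
        if abs(ra - rb) >= gap:
            pairs.append((a, b) if ra > rb else (b, a))
    return pairs
```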

Uncategorized 2024.11.01

Self-Play Preference Optimization for Language Model Alignment (Paper Review)

https://arxiv.org/pdf/2405.00675 Standard reinforcement learning from human feedback (RLHF) approaches relying on parametric models like the Bradley-Terry model fall short in capturing the intransitivity and irrationality in human preferences. Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences, enabling more fl..
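As a quick reminder of the Bradley-Terry assumption the abstract criticizes: preference probabilities are derived from scalar rewards, which makes them transitive by construction, whereas a general preference oracle P(y1 ≻ y2 | x) need not be. A minimal sketch of the Bradley-Terry probability, with `reward` as a hypothetical reward model:

```python
import math
from typing import Callable

def bradley_terry_prob(
    prompt: str,
    y1: str,
    y2: str,
    reward: Callable[[str, str], float],   # hypothetical reward model
) -> float:
    """P(y1 preferred over y2 | prompt) = sigmoid(r(x, y1) - r(x, y2))."""
    return 1.0 / (1.0 + math.exp(reward(prompt, y2) - reward(prompt, y1)))
```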

Uncategorized 2024.11.01