
generative reward model paper review

jinuklee 2024. 10. 23. 21:15

https://arxiv.org/pdf/2410.12832

Reinforcement Learning from Human Feedback (RLHF) has greatly improved the performance of modern Large Language Models (LLMs).

 

The RLHF process is resource-intensive and technically challenging, generally requiring a large collection of human preference labels over model-generated outputs.

 

Reinforcement Learning from AI Feedback (RLAIF) addresses this data collection challenge by leveraging synthetic preferences generated by an LLM.

 

However, recent work has shown that synthetic preference labels may not align well with human preference judgments (Zeng et al., 2023).

 

To address this, we propose a hybrid approach that unifies RLHF and RLAIF methodologies.

 

We introduce GenRM, an iterative algorithm that trains an LLM on self-generated reasoning traces, leading to synthetic preference labels matching human preference judgments.

 

Empirically, we show that zero-shot LLM-based judgments underperform Bradley-Terry reward models on in-distribution tasks (by 9-36%).

 

In contrast, GenRM achieves in-distribution accuracy comparable to Bradley-Terry models, while significantly outperforming them on out-of-distribution tasks (by 10-45%).

 

Moreover, GenRM surpasses the performance of using LLMs as judges on both in-distribution (by 9-31%) and out-of-distribution tasks (by 2-6%).

 

Our results show that combining the strengths of RLHF and RLAIF offers a promising approach for improving the quality of synthetic preference labels.

1 INTRODUCTION

 

Reinforcement Learning from Human Feedback (RLHF) has significantly improved the performance of modern Large Language Models (LLMs) (see e.g., Reid et al., 2024; OpenAI, 2023).

 

Despite its effectiveness, the RLHF process presents several challenges. First, it requires a large amount of human preference data to train reward models that reflect human preferences (Stiennon et al., 2022; Bai et al., 2022a).
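For background, the reward models discussed throughout this paper are typically Bradley-Terry classifiers: a scalar scorer $r_\phi(x, y)$ is trained so that the human-preferred response $y_w$ scores above the rejected response $y_l$, with $\sigma$ the logistic sigmoid. The formulation below is the standard setup, not something quoted from the paper:

$$
p_\phi(y_w \succ y_l \mid x) = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big),
\qquad
\mathcal{L}_{\mathrm{BT}}(\phi) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]
$$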

 

Second, it necessitates additional architecture and infrastructure to handle reward model training (Wang et al., 2024a; von Werra et al., 2020; Havrilla et al., 2023).

 

Third, it requires a sophisticated online optimization loop using algorithms such as Proximal Policy Optimization (PPO; Schulman et al., 2017) to fine-tune an LLM-based policy to align with the reward model (Zheng et al., 2023c).
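This online loop optimizes the usual KL-regularized RLHF objective (the notation here is the conventional formulation rather than anything specific to this paper), keeping the policy $\pi_\theta$ close to a frozen reference policy $\pi_{\mathrm{ref}}$ while maximizing reward:

$$
\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big]
\;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y \mid x)\,\|\,\pi_{\mathrm{ref}}(y \mid x)\big]
$$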

 

To address the challenge of collecting large-scale human preference data, synthetic preference data has emerged as a promising alternative.

 

For example, Bai et al. (2022b) introduced Reinforcement Learning from AI Feedback (RLAIF).

 

Instead of relying on human users for feedback, their method utilizes an LLM guided by a predefined set of principles, referred to as a “constitution,” to generate and select model outputs that are helpful and harmless (Askell et al., 2021). Employing AI-generated preference labels has demonstrated meaningful Pareto improvements in balancing helpfulness and harmlessness in assistant responses (Bai et al., 2022b; Kundu et al., 2023).
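To make the RLAIF labeling step concrete, here is a minimal sketch of constitution-guided preference labeling. The principle list, prompt template, and the `generate` callable are illustrative placeholders of my own, not the prompts or API used in the papers above:

```python
from typing import Callable, Literal

# Illustrative principles standing in for a "constitution".
CONSTITUTION = [
    "Choose the response that is more helpful to the user.",
    "Choose the response that is less harmful or offensive.",
]

JUDGE_TEMPLATE = """You are comparing two assistant responses.
Principles:
{principles}

Prompt: {prompt}
Response A: {response_a}
Response B: {response_b}

Which response better follows the principles? Answer with a single letter, A or B."""


def ai_preference_label(
    generate: Callable[[str], str],  # any LLM completion function
    prompt: str,
    response_a: str,
    response_b: str,
) -> Literal["A", "B"]:
    """Ask the judge LLM which response is preferred under the constitution."""
    judge_prompt = JUDGE_TEMPLATE.format(
        principles="\n".join(f"- {p}" for p in CONSTITUTION),
        prompt=prompt,
        response_a=response_a,
        response_b=response_b,
    )
    verdict = generate(judge_prompt).strip().upper()
    # Fall back to "A" if the judge returns anything unexpected.
    return "B" if verdict.startswith("B") else "A"
```

The synthetic label produced this way can then stand in for a human comparison in the usual preference-training pipeline.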

 

Direct alignment algorithms, such as Direct Preference Optimization (DPO) (Rafailov et al., 2023) and Identity Preference Optimization (IPO) (Azar et al., 2023), were developed to address the challenges of reward model training and online optimization.

 

These works demonstrated that the reward model and the optimal policy can be mathematically interchanged, allowing the policy to be trained directly from preference data in an entirely offline manner, significantly simplifying the RLHF pipeline.
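Concretely, DPO uses this interchange to express the reward implicitly through the policy and minimizes a preference-classification loss directly on the policy. The standard form from Rafailov et al. (2023), with frozen reference policy $\pi_{\mathrm{ref}}$ and temperature $\beta$, is:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\left[\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
$$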

 

Benchmark evaluations (Lambert et al., 2024) have shown that DPO-based approaches are competitive with traditional Bradley-Terry reward models.

 

However, recent empirical evidence suggests that purely offline methods may underperform compared to online approaches in both reward model-based reinforcement learning (Xu et al., 2024b;a) and in the RLAIF setting (Guo et al., 2024).

 

As a result, state-of-the-art models such as the LLaMA-3 family (Dubey et al., 2024) have adopted hybrid strategies that combine online DPO optimization with separate reward models.

 

In this work, we identify two key limitations in current alignment approaches:

 

(1) Explicitly parameterized reward models, while effective and accurate for in-distribution tasks, struggle with robustness and generalization to out-of-distribution (OOD) data.

 

(2) RLAIF approaches, such as using an LLM as a judge, offer a more robust alternative but may not always align well with actual user preferences when the LLM acts as the sole evaluator.

 

To address these limitations, we propose a unified framework for RLHF and RLAIF.

 

Our approach begins with a strong pre-trained LLM, which we employ as an evaluator.

 

Using a dataset of user preferences, we adopt a STaR-like methodology (Zelikman et al., 2022) to align the LLM with user choices, effectively training it to function as a reward model.
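As a concrete illustration of what a STaR-like iteration over preference data could look like, the sketch below samples a reasoning trace and verdict from the current judge for each human-labeled pair, keeps only the traces whose verdict agrees with the human choice, and fine-tunes on the kept traces. The function names and data layout (`sample_judgment`, `finetune`, `PreferencePair`) are placeholders of my own, not the paper's implementation:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response the human annotator preferred
    rejected: str  # response the human annotator did not prefer


def star_like_iteration(
    pairs: List[PreferencePair],
    sample_judgment: Callable[[str, str, str], Tuple[str, str]],
    finetune: Callable[[List[Tuple[PreferencePair, str]]], None],
) -> float:
    """One iteration of the STaR-like loop (sketch): filter self-generated
    reasoning traces by agreement with the human label, then fine-tune on them."""
    kept: List[Tuple[PreferencePair, str]] = []
    for pair in pairs:
        # The judge sees (prompt, response A, response B) with the human-chosen
        # response shown first, and returns (reasoning_trace, verdict).
        trace, verdict = sample_judgment(pair.prompt, pair.chosen, pair.rejected)
        if verdict == "A":  # the verdict matches the human preference
            kept.append((pair, trace))
    finetune(kept)  # supervised fine-tuning on the retained traces
    return len(kept) / max(len(pairs), 1)  # fraction of pairs retained
```

Repeating this loop with the fine-tuned judge as the new sampler is the iterative part described in the abstract.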

 

We demonstrate empirically that this fine-tuned judge model matches Bradley-Terry reward models for in-distribution prompts while significantly improving generalization on OOD prompts.

 

Additionally, it outperforms the base LLM on both in-distribution and OOD scenarios.