https://arxiv.org/pdf/2411.00418
Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for aligning language models with human preferences, playing a pivotal role in the success of conversational models like GPT-4, ChatGPT, and Llama 2.
A core challenge in employing RLHF lies in training a reliable reward model (RM), which relies on high-quality labels typically provided by human experts or advanced AI systems.
These methods can be costly and may introduce biases that affect the language model's responses.
As language models improve, human input may become less effective in further enhancing their performance.
In this paper, we propose Self-Evolved Reward Learning (SER), a novel approach where the RM generates additional training data to iteratively improve itself.
We conducted extensive experiments on multiple datasets such as HH-RLHF and UltraFeedback, using models like Mistral and Llama 3, and compared SER against various baselines.
Our results demonstrate that even with limited human-annotated data, learning from self-feedback can robustly enhance RM performance, thereby boosting the capabilities of large language models (LLMs).
[Section 1: INTRODUCTION]
Reinforcement Learning from Human Feedback (RLHF) is a well-established approach for aligning Large Language Models (LLMs) with human preference data (Ouyang et al., 2022; Bai et al., 2022b).
The standard approach learns a reward model (RM) from human preferences; the learned RM is then frozen and used to train LLMs via Reinforcement Learning (RL) algorithms such as Proximal Policy Optimization (PPO) (Schulman et al., 2017a).
Another common approach trains LLMs directly on the human preference data without learning an RM, such as Direct Preference Optimization (DPO) (Rafailov et al., 2024).
Both approaches rely heavily on the size and quality of human-annotated preference data.
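For concreteness, the pairwise objective typically used to fit an RM on preference data can be sketched as follows. This is a generic Bradley-Terry style formulation rather than the specific implementation of any cited work; `rm` here is a hypothetical scorer that maps a batch of (prompt, response) pairs to scalar rewards.

```python
import torch.nn.functional as F

def pairwise_rm_loss(rm, prompts, chosen, rejected):
    """Bradley-Terry style preference loss: push r(chosen) above r(rejected)."""
    r_chosen = rm(prompts, chosen)      # scalar reward per example, shape (batch,)
    r_rejected = rm(prompts, rejected)  # shape (batch,)
    # -log sigmoid(r_chosen - r_rejected): minimized when chosen outscores rejected
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Both the standard RLHF pipeline and DPO consume preference pairs of exactly this form, which is why the quantity and quality of such pairs dominate downstream performance.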
However, the availability of such data is often limited and expensive to acquire, posing a significant bottleneck in the development and performance of RL approaches (Yuan et al., 2024b).
This dependency on human-annotated data hinders the scalability of strong LLMs, which require vast amounts of labeled data to achieve greater performance (Kaplan et al., 2020; Muennighoff et al., 2024).
To mitigate this dependency, recent works leverage AI feedback to train RMs, an approach referred to as Reinforcement Learning from AI Feedback (RLAIF) (Bai et al., 2022b; Lee et al., 2023), which reduces the reliance on human-annotated data.
However, these methods rest on the heuristic assumption that LLMs can provide high-quality feedback, and they often require stronger LLMs to provide that feedback (Pang et al., 2023).
Recent advancements suggest that LLMs have the potential to serve as world models to a certain degree, capable of understanding world knowledge and complex patterns independently of explicit human input (Hao et al., 2023; Guan et al., 2023; Zhao et al., 2024).
Leveraging this ability, LLMs can evaluate responses and provide feedback on them.
In the context of RLHF and RLAIF, this capability allows LLMs to take on the role of RMs, on which RL approaches rely heavily (Dewey, 2014; Li, 2017).
Focusing on training a better RM with limited human-annotated data, we propose a novel reward learning approach that self-evolves the RM through a feedback loop driven by the RM itself.
In our approach, the LLM serves as the RM, generating feedback on the dataset that is subsequently used to refine its own learning.
This iterative "feedback-then-train" loop allows the RM to self-evolve over time, gradually improving its performance, even with some noise in the initial self-labeled data.
As the iterations progress, however, similar data offer diminishing returns and can even degrade performance.
To address this, we identify the RM's learning status in each iteration and introduce data filtering strategies to select high-confidence data, which are then used for more robust RM training.
Figure 1 describes the Self-Evolved Reward Learning (SER) pipeline with its four key steps:
(1) Self-labeling: the RM assigns labels to unlabeled data.
(2) Identifying learning status and selecting data: high-confidence data are selected by assessing the RM's learning status.
(3) Retraining the RM: the RM trains itself on the selected self-labeled data.
(4) Training the LLM: the LLM is trained under the guidance of the self-evolved RM.
Steps (1)-(3) iterate for multiple rounds until the RM converges; a minimal sketch of this loop follows.
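As a rough illustration of the feedback-then-train loop in steps (1)-(3), the sketch below chains self-labeling, confidence-based filtering, and retraining. The helpers `rm.score` and `train_rm`, the margin-based confidence measure, and the threshold value are assumptions made for illustration, not details taken from the paper.

```python
# Illustrative sketch of the SER feedback-then-train loop (steps 1-3).
# `rm` is assumed to expose a .score(prompt, response) -> float method, and
# `train_rm` is a hypothetical routine that fine-tunes the RM on labeled pairs;
# the confidence measure and threshold below are placeholders.

def self_evolve_rm(rm, unlabeled_pairs, train_rm, num_rounds=3, margin=0.8):
    for _ in range(num_rounds):
        selected = []
        for prompt, resp_a, resp_b in unlabeled_pairs:
            # (1) Self-labeling: the current RM scores both candidate responses.
            score_a = rm.score(prompt, resp_a)
            score_b = rm.score(prompt, resp_b)
            label = "a" if score_a > score_b else "b"
            # (2) Identify learning status / select data: keep only self-labels
            #     the RM is confident about (here, the absolute score margin).
            if abs(score_a - score_b) >= margin:
                selected.append((prompt, resp_a, resp_b, label))
        # (3) Retrain the RM on the selected self-labeled data.
        rm = train_rm(rm, selected)
    return rm
```

Step (4) then uses the resulting RM as a frozen reward signal for standard RL training of the LLM, as in conventional RLHF.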
The key contributions are:
• A novel self-evolved reward learning framework that achieves performance comparable to training on the full human-annotated dataset while using only 15% of it as seed data.
• Insights into self-learning paradigms in LLMs, particularly in improving reinforcement learning through enhanced RMs.
• Extensive experiments demonstrating consistent improvement across various LLMs, model sizes, and datasets.
The experiments show significant results:
- An average improvement of 7.88% over the seed models trained on the limited human-labeled data
- Performance at convergence that matches or exceeds that of models trained on the full human-annotated datasets
- Demonstrates potential for model self-improvement
This self-evolved reward learning process reduces dependency on human-labeled data while maintaining or improving model performance.