https://arxiv.org/pdf/2406.12845
https://github.com/RLHFlow/RLHF-Reward-Modeling
Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models (LLMs) with human preferences.
The RLHF process typically starts by training a reward model (RM) using human preference data.
Conventional RMs are trained on pairwise responses to the same user request, with relative ratings indicating which response humans prefer.
The trained RM serves as a proxy for human preferences.
However, due to the black-box nature of RMs, their outputs lack interpretability, as humans cannot intuitively understand why an RM thinks a response is good or not.
As RMs act as human preference proxies, it is desirable for them to be human-interpretable to ensure that their internal decision processes are consistent with human preferences and to prevent reward hacking in LLM alignment.
To build RMs with interpretable preferences, we propose a two-stage approach:
i) train an Absolute-Rating Multi-Objective Reward Model (ArmoRM) with multi-dimensional absolute-rating data, each dimension corresponding to a human-interpretable objective (e.g., honesty, verbosity, safety);
ii) employ a Mixture-of-Experts (MoE) strategy with a gating network that automatically selects the most suitable reward objectives based on the context.
We efficiently trained an ArmoRM with Llama-3 8B and a gating network consisting of a shallow MLP on top of the ArmoRM.
Our trained model, ArmoRM-Llama3-8B, obtains state-of-the-art performance on RewardBench, a benchmark evaluating RMs for language modeling.
Notably, the performance of our model surpasses the LLM-as-a-judge method with GPT-4 judges by a margin, and approaches the performance of the much larger Nemotron-4 340B reward model.
3 Methodology
3.1 Multi-Objective Reward Modeling
Most existing reward models for LLM alignment are trained with Bradley-Terry loss on pairwise data with annotated preferences [Bai et al., 2022; Touvron et al., 2023; Ouyang et al., 2022], using the same approach as InstructGPT [Ouyang et al., 2022].
The pairwise preference annotations are essentially binary labels, e.g., {0, 1}, indicating which response is preferred by the annotator. We call them relative ratings here.
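For reference, the Bradley-Terry loss on a single preference pair can be sketched as below. This is a minimal illustration, assuming the reward model emits one scalar per response; the function name and the example reward values are hypothetical.

```python
import math

def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)), written in softplus form for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical scalar rewards for two responses to the same prompt.
loss_agree = bradley_terry_loss(2.0, -1.0)     # RM agrees with the annotator
loss_disagree = bradley_terry_loss(-1.0, 2.0)  # RM disagrees with the annotator
```

The loss depends only on the reward difference, which is why binary relative ratings are enough to train this kind of model.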
However, in some recent high-quality datasets, the relative ratings are converted from absolute ratings.
For instance, UltraFeedback [Cui et al., 2023] is curated with 5-objective absolute ratings: Overall Score, Instruction Following, Truthfulness, Honesty, and Helpfulness (each objective has 5 distinct ratings based on pre-defined rubrics).
The dataset is further binarized into pairwise comparisons, using the Overall Score, or the average score of the remaining 4 objectives, for training reward models or DPO.
The original ratings are fine-grained, as each objective is scored on a scale of consecutive integers (e.g., 1, 2, 3, 4, 5).
However, the binarization process discards some fine-grained information. For example, a pair of examples with scores 1:5 is labeled in the same way as another pair with scores 2:3.
There is no justification for assuming that discarding this fine-grained preference information is beneficial. Hence, we would like to include all fine-grained information for reward modeling.
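The information loss can be seen in a few lines. The sketch below uses a hypothetical `binarize` helper and two made-up score pairs on a 1-5 scale; it is not the binarization code of any particular dataset.

```python
def binarize(score_a: int, score_b: int) -> int:
    """Return 1 if response A is preferred over B, else 0 (ties are not handled here)."""
    return 1 if score_a > score_b else 0

# Two hypothetical pairs (score of A, score of B) on a 1-5 rating scale.
wide_gap = (5, 1)
narrow_gap = (3, 2)

labels = [binarize(*wide_gap), binarize(*narrow_gap)]
gaps = [wide_gap[0] - wide_gap[1], narrow_gap[0] - narrow_gap[1]]
print(labels)  # both pairs receive the same relative label
print(gaps)    # the absolute ratings still distinguish the two pairs
```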
As the training examples come with multi-objective ratings, the straightforward approach for learning with these ratings is multi-objective regression.
Here, we briefly introduce the training procedure. We consider each example to consist of a prompt x (including contexts from previous conversation turns), a response y, and a k-dimensional rating vector r ∈ R^k, where each dimension corresponds to a reward objective such as helpfulness or truthfulness.
Now, we take a pre-trained decoder-only LLM without the original output linear layer as the feature extractor f_θ.
We pass x ⊕ y, the concatenation of x and y, through the decoder layers and take the hidden state of the final decoder layer on the last token as a d-dimensional feature.
Also, we attach a new linear regression layer w ∈ R^{d×k} on top of f_θ, which outputs a k-dimensional rating prediction.
The model can be trained simply with a regression loss.
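One natural instance of such a regression loss, written with the symbols defined above, is the following. The squared ℓ2 error and the symbol D for the training set are assumptions of this sketch; other regression losses would fit the same description.

```latex
\min_{\theta,\, w} \;
\mathbb{E}_{(x,\, y,\, r) \in \mathcal{D}}
\left\| w^{\top} f_\theta(x \oplus y) - r \right\|_2^2
```

Since w ∈ R^{d×k} and f_θ(x ⊕ y) ∈ R^d, the product w^⊤ f_θ(x ⊕ y) is k-dimensional and can be compared with r entry by entry.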
3.2 Mixture-of-Experts Scalarization of Reward Objectives
An ArmoRM can predict multi-objective rewards for each response. However, the multi-dimensional outputs need to be reduced to a scalar for ranking or pairwise comparisons of test examples.
A straightforward approach is to take a fixed linear combination of the objectives [Hu et al., 2024], as is common in the multi-task learning literature.
However, using fixed combination coefficients is too rigid for complex application scenarios.
For instance, for prompts that could easily trigger unsafe responses, the safety objective should be assigned a large coefficient, as we wish the reward model to rank unsafe responses lower than safe ones.
For prompts for math problem assistance, the safety objective becomes less relevant, and the helpfulness-related objectives should be the primary focus.
With this insight, we propose an MoE-style scalarization of the reward objectives, conditioned on the prompt x.
The gating layer g_ϕ can simply be a shallow MLP (i.e., a fully connected network) that takes the prompt feature f_θ(x) and outputs a k-dimensional vector, followed by a softmax function that ensures the elements of the output vector are non-negative and sum to 1.
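A dependency-free sketch of such a gating layer is given below. It assumes a single hidden layer, toy dimensions, and random weights purely for illustration; all function names, sizes, and numbers are hypothetical and do not reproduce the trained model.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax: outputs are non-negative and sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, bias, vec):
    """Dense layer: `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

def gate(prompt_feature, params):
    """ReLU MLP followed by softmax, mapping the prompt feature to k coefficients."""
    hidden = [max(0.0, h) for h in linear(params["w1"], params["b1"], prompt_feature)]
    return softmax(linear(params["w2"], params["b2"], hidden))

def scalarize(prompt_feature, rewards, params):
    """Gating-weighted sum of the k objective rewards."""
    coeffs = gate(prompt_feature, params)
    return sum(c * r for c, r in zip(coeffs, rewards))

# Toy sizes and random weights, purely illustrative.
d, hidden_units, k = 4, 8, 3
rng = random.Random(0)
params = {
    "w1": [[rng.gauss(0, 1) for _ in range(d)] for _ in range(hidden_units)],
    "b1": [0.0] * hidden_units,
    "w2": [[rng.gauss(0, 1) for _ in range(hidden_units)] for _ in range(k)],
    "b2": [0.0] * k,
}
prompt_feature = [0.5, -1.0, 0.25, 2.0]  # stands in for f_theta(x)
rewards = [0.9, 0.2, 0.6]                # stands in for k objective scores
coeffs = gate(prompt_feature, params)
score = scalarize(prompt_feature, rewards, params)
```

Because the coefficients form a convex combination, the scalar score always lies between the smallest and largest objective reward.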
However, most reward objectives are highly correlated with verbosity, which indicates a strong verbosity bias [Saito et al., 2023].
Using non-negative gating coefficients would make the final output inherit the bias.
To resolve the issue, we adjust each reward objective r_i with a penalty based on the verbosity reward objective.
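One form consistent with this description is sketched below. The symbols λ_i (penalty coefficient), r_verbose (verbosity reward), r' (adjusted reward vector), and R (final scalar score) are notation assumed for this sketch, and the linear form of the penalty is likewise an assumption.

```latex
r'_i = r_i - \lambda_i \, r_{\mathrm{verbose}},
\qquad
\lambda_i \ \text{chosen so that}\
\operatorname{Corr}_{\mathcal{D}}\!\left(r'_i,\, r_{\mathrm{verbose}}\right) = 0,
\qquad
R = g_\phi\!\left(f_\theta(x)\right)^{\top} r'
```

Under this reading, each adjusted objective is decorrelated from verbosity on a reference distribution D before the non-negative gating coefficients are applied, so the final score no longer inherits the bias through the weighted sum.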
4 Experiment
Implementation of ArmoRM. We use the Llama-3 8B [Meta, 2024] architecture and initialize the model backbone with parameters from a Bradley-Terry RM of Llama-3 8B trained by Dong et al. [2024].
We append a linear layer to the backbone, and train it with regression loss while keeping the backbone frozen.
The training involves 19 objectives (including helpfulness, correctness, verbosity, etc.) from 8 datasets, with details presented in Appendix A.
Implementation of MoE. The gating layer is a ReLU MLP of 3 hidden layers with 1024 hidden units.
For the correlation metric Corr in Eq. (3), we adopt the Spearman correlation [Spearman, 1904], and use UltraFeedback [Cui et al., 2023] as the reference data distribution D.
The scaling variable β is initialized with a value of 100, and the gating layer is trained with the LLM backbone kept frozen.
The training is conducted on 10 pairwise preference datasets, with details in Appendix A.
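To make the decorrelation step concrete, the sketch below picks a penalty coefficient by grid search so that the adjusted reward has the smallest absolute Spearman correlation with verbosity. The grid search, the subtractive form of the penalty, the helper names, and the toy scores are all assumptions of this illustration, not the procedure used in the experiments.

```python
def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            result[order[pos]] = mean_rank
        start = end + 1
    return result

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5

def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

def fit_penalty(objective, verbosity, grid):
    """Pick the grid value whose adjusted reward is least rank-correlated with verbosity."""
    def abs_corr(lam):
        adjusted = [r - lam * v for r, v in zip(objective, verbosity)]
        return abs(spearman(adjusted, verbosity))
    return min(grid, key=abs_corr)

# Hypothetical reference scores, not data from the paper.
verbosity = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
helpfulness = [1.5, 1.9, 3.6, 3.8, 5.4, 5.7]
grid = [step / 100 for step in range(0, 201)]  # candidate penalties in [0, 2]
lam = fit_penalty(helpfulness, verbosity, grid)
before = spearman(helpfulness, verbosity)
after = spearman([r - lam * v for r, v in zip(helpfulness, verbosity)], verbosity)
```

On real data the correlation would be estimated on the reference distribution (UltraFeedback in the setup above), and a root-finding method could replace the grid.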
Software. Our training code is built with PyTorch [Paszke et al., 2019], HuggingFace's Transformers [Wolf et al., 2019] and Scikit-learn [Pedregosa et al., 2011].
Hardware. Training ArmoRM (the multi-objective reward modeling stage) only involves training the last linear layer (i.e., linear probing), so we save features extracted from the backbone locally and then conduct linear probing with Scikit-learn's linear regression solver on a CPU.
For the MoE stage, we also save features locally, and then train the gating layer on a single NVIDIA A6000 GPU.
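Linear probing on cached features amounts to a multi-output least-squares fit. The sketch below swaps the Scikit-learn solver for plain gradient descent so that it runs without dependencies; the toy features, the absence of a bias term, and the function name are assumptions of this illustration.

```python
def fit_linear_probe(features, ratings, lr=0.1, steps=500):
    """Fit W (d x k) minimizing the mean squared error of features @ W against ratings."""
    n, d, k = len(features), len(features[0]), len(ratings[0])
    weights = [[0.0] * k for _ in range(d)]
    for _ in range(steps):
        grad = [[0.0] * k for _ in range(d)]
        for feat, target in zip(features, ratings):
            pred = [sum(feat[i] * weights[i][j] for i in range(d)) for j in range(k)]
            for i in range(d):
                for j in range(k):
                    grad[i][j] += 2.0 * feat[i] * (pred[j] - target[j]) / n
        for i in range(d):
            for j in range(k):
                weights[i][j] -= lr * grad[i][j]
    return weights

# Toy cached features (d = 2) and ratings (k = 2) generated from known weights.
cached_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
true_weights = [[0.5, -1.0], [2.0, 0.25]]
ratings = [
    [sum(f[i] * true_weights[i][j] for i in range(2)) for j in range(2)]
    for f in cached_features
]
probe = fit_linear_probe(cached_features, ratings)
```

Because the backbone is frozen, the features are computed once and reused, which is what makes a CPU-only fit practical.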
Hyperparameters. The gating layer is trained using the AdamW optimizer [Loshchilov and Hutter, 2019] with a learning rate of 0.001 for 10,000 steps with a batch size of 1024.
We also apply a cosine decay learning rate scheduler.
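A cosine decay schedule with the stated learning rate and step count can be sketched as follows. The text specifies only that cosine decay is used, so the final learning rate of 0 and the absence of warmup are assumptions of this sketch.

```python
import math

def cosine_lr(step: int, total_steps: int = 10_000,
              base_lr: float = 1e-3, min_lr: float = 0.0) -> float:
    """Cosine-decayed learning rate at a given optimizer step."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Learning rate at the start, midpoint, and end of training.
schedule = [cosine_lr(s) for s in (0, 5_000, 10_000)]
```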
Evaluation Benchmark. RewardBench [Lambert et al., 2024] is the first benchmark constructed to evaluate reward models for language modeling.
It consists of a diverse set of tasks designed to assess the performance of reward models for LLM alignment, including four primary categories (Chat, Chat Hard, Safety, Reasoning) and a category of prior sets.
Each category consists of multiple datasets with pairwise preference data, where each pair includes a chosen and a rejected text response.
The overall score is computed as a weighted average over the five categories, where the four primary categories have weights 1.0 and the prior-sets category has weight 0.5.
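The scoring described above can be sketched in two steps: pairwise accuracy on chosen/rejected pairs, then the weighted average over categories. Counting a pair as correct only when the chosen reward is strictly higher is an assumption, as are the function names; all numbers below are made up and are not results from the paper.

```python
def pairwise_accuracy(pairs):
    """Fraction of (chosen_reward, rejected_reward) pairs ranked correctly."""
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)

# Category weights as described in the text.
CATEGORY_WEIGHTS = {
    "Chat": 1.0,
    "Chat Hard": 1.0,
    "Safety": 1.0,
    "Reasoning": 1.0,
    "Prior Sets": 0.5,
}

def overall_score(category_scores):
    """Weighted average of per-category scores."""
    total_weight = sum(CATEGORY_WEIGHTS.values())
    weighted = sum(CATEGORY_WEIGHTS[name] * category_scores[name]
                   for name in CATEGORY_WEIGHTS)
    return weighted / total_weight

# Hypothetical numbers for illustration only.
toy_pairs = [(0.9, 0.1), (0.4, 0.6), (0.7, 0.2), (0.3, 0.1)]
toy_scores = {"Chat": 90.0, "Chat Hard": 70.0, "Safety": 80.0,
              "Reasoning": 60.0, "Prior Sets": 50.0}
accuracy = pairwise_accuracy(toy_pairs)
overall = overall_score(toy_scores)
```

With weights summing to 4.5, the prior-sets category contributes half as much to the overall score as each primary category.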
Evaluation Results. Table 1 compares the performance of our approach (ArmoRM + MoE) against other reward models. Several key observations can be made from these results:
• Our model significantly outperforms the Llama-3 8B Bradley-Terry RM, which provides the LLM backbone for our model. This demonstrates the effectiveness of our ArmoRM design and the MoE gating mechanism in improving the performance of reward models.
• Our model also outperforms the LLM-as-a-judge approach [Zheng et al., 2023] with GPT-4 judges by a considerable margin, indicating that our model could be used as a cheaper replacement for GPT-4 in many annotation jobs.
• Our model of 8B parameters has performance nearly on par with the Nemotron-4 340B RM [Wang et al., 2024b], a giant reward model of 340B parameters. This highlights the power and potential of our reward modeling approach.