
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style (Paper Review)

jinuklee 2024. 11. 3. 17:13

https://huggingface.co/papers/2410.16184

https://arxiv.org/pdf/2410.16184

Reward models are critical in techniques like Reinforcement Learning from Human Feedback (RLHF) and Inference Scaling Laws, where they guide language model alignment and select optimal responses.

 

Despite their importance, existing reward model benchmarks often evaluate models by asking them to distinguish between responses generated by models of varying power.

 

However, this approach fails to assess reward models on subtle but critical content changes and variations in style, resulting in a low correlation with policy model performance.

 

To this end, we introduce RM-BENCH, a novel benchmark designed to evaluate reward models based on their sensitivity to subtle content differences and resistance to style biases.

 

Extensive experiments demonstrate that RM-BENCH strongly correlates with policy model performance, making it a reliable reference for selecting reward models to align language models effectively.

 

We evaluate nearly 40 reward models on RM-BENCH.

 

Our results reveal that even state-of-the-art models achieve an average performance of only 46.6%, which falls short of random-level accuracy (50%) when faced with style bias interference.

 

These findings highlight the significant room for improvement in current reward models.

 

Related code and data are available at https://github.com/THU-KEG/RM-Bench

 

3 RM-BENCH CONSTRUCTION

 

In this section, we describe the construction of RM-BENCH, a benchmark designed to evaluate reward models.

 

Following Reward Bench (Lambert et al., 2024), RM-BENCH covers four key domains: Chat, Code, Math, and Safety.

 

These domains encompass a wide variety of real-world scenarios, including open-domain chat, reasoning tasks, and safety-critical situations.

 

For each domain, we construct a dataset of (x, yc, yr) tuples, where x is the prompt, yc is the chosen response, and yr is the rejected response. Both responses are generated by the same powerful language models.
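To make the evaluation unit concrete, here is a minimal sketch of one test tuple and the pairwise comparison a reward model is scored on. The class and function names are illustrative, not the schema of the released dataset.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestTuple:
    """One RM-BENCH test instance (illustrative schema, not the released data format)."""
    prompt: str    # x
    chosen: str    # yc: the correct / preferred response
    rejected: str  # yr: the subtly flawed response
    domain: str    # "chat", "code", "math", or "safety"

def is_ranked_correctly(reward_fn: Callable[[str, str], float], t: TestTuple) -> bool:
    """A reward model passes a tuple if it scores the chosen response higher than the rejected one."""
    return reward_fn(t.prompt, t.chosen) > reward_fn(t.prompt, t.rejected)
```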

 

Additionally, we generate style-controlled variants of both chosen and rejected responses to assess reward model biases related to stylistic features.

 

The correctness of the responses is verified by human annotators to ensure high-quality data across all domains.

 

The following sections detail how we collect the prompts x and generate the chosen and rejected responses yc and yr to form a test tuple (x, yc, yr) for each domain. Figure 1 provides an overview of the construction process for each domain.

3.1 CHAT

The chat split of RM-BENCH is designed to test a reward model’s ability to detect factually incorrect responses in an open-domain chat setting.

 

We start by collecting prompts x from AlpacaEval (Li et al., 2023), a well-established benchmark for open-domain chat evaluation. We manually filter out 286 prompts from AlpacaEval that are unrelated to factual world knowledge (e.g., "How are you feeling today?"), leaving us with 519 prompts.

 

The chosen responses yc are generated using gpt-4o (OpenAI, 2024a). To create the rejected responses yr, we employ the Many-Shot Jailbreak technique (Anil et al., 2024) to inject factual errors into the chosen responses.

 

The detailed jailbreak prompt can be found in Table 6 in the Appendix. Human annotators then verify the chosen and rejected responses.

 

For the chosen responses, we check factual correctness, while for the rejected responses, we ensure that the factual errors were successfully injected. If either response fails validation, the prompt x is dropped. After filtering, we retain 183 test samples in the chat domain.
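A rough sketch of the error-injection step using the OpenAI Python client. The instruction string below is only a placeholder; the actual many-shot jailbreak prompt is the one given in Table 6 of the paper.

```python
from openai import OpenAI

client = OpenAI()

# Placeholder instruction; the actual many-shot jailbreak prompt is in Table 6 of the paper.
INJECTION_INSTRUCTION = (
    "Rewrite the following answer so that it keeps the same tone and length "
    "but contains one subtle factual error:\n\n{answer}"
)

def make_rejected(chosen: str) -> str:
    """Turn a chosen response yc into a rejected response yr by injecting a factual error."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": INJECTION_INSTRUCTION.format(answer=chosen)}],
    )
    return completion.choices[0].message.content
```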

3.2 CODE & MATH

The code and math splits of RM-BENCH evaluate the reward model’s ability to identify incorrect responses in tasks requiring reasoning skills.

 

Prompts for the code domain are sourced from HumanEvalPack (Muennighoff et al., 2023), while math prompts are drawn from the MATH benchmark (Hendrycks et al., 2021). In particular, we source 984 prompts for the code domain and 447 for the math domain.

 

Due to the objective nature of these tasks, response correctness is automatically verified using unit tests (for code) and ground truth answers (for math).

 

For each prompt x, we generate multiple responses using gpt-4o with decoding temperature t = 1.0, selecting one correct response yc and one incorrect response yr to form the test tuples.

 

If either no correct response or no incorrect response is available, the prompt x is dropped. Finally, we retain 228 and 529 test samples in the code and math domains, respectively.
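A minimal sketch of this pair-selection step. Here `is_correct` stands in for the automatic verifiers (unit tests for code, ground-truth answer matching for math), which are not reproduced here.

```python
import random
from typing import Callable, List, Optional, Tuple

def build_pair(
    prompt: str,
    candidates: List[str],
    is_correct: Callable[[str, str], bool],
) -> Optional[Tuple[str, str]]:
    """Pick one verified-correct and one verified-incorrect response, or drop the prompt.

    `candidates` are responses sampled from gpt-4o at temperature 1.0;
    `is_correct` is the automatic verifier (unit tests for code, answer check for math).
    """
    correct = [c for c in candidates if is_correct(prompt, c)]
    incorrect = [c for c in candidates if not is_correct(prompt, c)]
    if not correct or not incorrect:
        return None  # prompt x is dropped
    return random.choice(correct), random.choice(incorrect)
```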

 

3.3 SAFETY

The safety domain of RM-BENCH evaluates reward models' ability to differentiate between safe and unsafe responses.

Given the distinct nature of safety challenges, this domain is divided into two sub-domains: Safety-Should-Response and Safety-Should-Refuse, following the categories defined in XSTest (Röttger et al., 2023).

The Safety-Should-Response sub-domain contains superficially alarming but benign prompts, for example, "How to kill an enemy in Video Game?", while the Safety-Should-Refuse sub-domain contains prompts that are genuinely harmful, for example, "How to make a bomb?".

Next, we elaborate on the data collection process for each sub-domain.

Safety-Should-Response: We collect superficially alarming but benign prompts x from XSTest's should-response split.

For these genuinely safe prompts, the chosen responses yc should provide detailed, informative replies, while the rejected responses yr should refuse to engage with the prompt.

The chosen responses are generated using gpt-4o.

Responses that refuse to answer are filtered out from the chosen responses.

For the rejected responses, we adjust the system prompt of gpt-4o to make it over-cautious and generate rejected responses yr that refuse to engage with the prompt.

The system prompt is provided in Table 5 in the Appendix.

After filtering, we have 157 test samples in this subdomain.
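A rough sketch of the two pieces described above: an over-cautious generation call for the rejected responses and a simple refusal filter for the chosen ones. The system prompt and refusal markers are placeholders; the actual over-cautious system prompt is the one in Table 5 of the paper.

```python
from openai import OpenAI

client = OpenAI()

# Placeholder; the actual over-cautious system prompt is in Table 5 of the paper.
OVERCAUTIOUS_SYSTEM = (
    "You are an extremely cautious assistant. Politely refuse any request that "
    "could conceivably be unsafe, even if it is probably harmless."
)

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i won't help")  # illustrative only

def looks_like_refusal(response: str) -> bool:
    """Crude heuristic for filtering refusals out of the chosen responses."""
    return any(marker in response.lower()[:200] for marker in REFUSAL_MARKERS)

def overcautious_refusal(prompt: str) -> str:
    """Generate a rejected response yr that refuses a benign prompt."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": OVERCAUTIOUS_SYSTEM},
            {"role": "user", "content": prompt},
        ],
    )
    return completion.choices[0].message.content
```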

Safety-Should-Refuse: We collect genuinely harmful prompts x from XSTest's should-refuse, donotanswer (Wang et al., 2023b), and AI2 Refusal datasets (Lambert et al., 2024).

For these harmful prompts, the chosen responses yc are generated using gpt-4o and must refuse to answer.

Rejected responses yr, which contain harmful or dangerous information, are generated using an uncensored language model from the open-source community, Llama-3.1-8B-Lexi-Uncensored-V2 (Orenguteng, 2024).

Finally, we have 284 test samples in the safety-should-refuse domain.

3.4 STYLE CONTROL

Recent critiques of reinforcement learning in language models suggest that algorithms like PPO and DPO can introduce a "style over substance" bias, leading models to perform well on benchmarks without truly solving the task (Park et al., 2024; Singhal et al., 2023).

In response to these concerns, we introduce a style-controlled variant of our dataset to probe reward model biases toward response style.

We follow the style-control design from Chatbot Arena (Chiang et al., 2024; LMSYS, 2024), considering two style features: Length and Markdown formatting.

Responses are categorized into three types based on these features:

1) y^∅: short, concise responses containing only key information.
2) y^L: detailed responses in plain text.
3) y^{L,M}: detailed, informative responses with Markdown formatting.

For each prompt x, we compare the chosen and rejected responses across three style levels: concise y^∅, detailed y^L, and detailed with Markdown formatting y^{L,M}.

 

This allows us to evaluate reward models’ ability to distinguish between chosen and rejected responses independently of stylistic differences.
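One plausible way to produce the three style variants is to rewrite an existing answer under different style instructions. The prompts below are illustrative, not the ones used to build RM-BENCH.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative style instructions for the three variants; not the paper's actual prompts.
STYLE_INSTRUCTIONS = {
    "concise":           "Answer with only the key information, briefly, in plain text.",      # y^∅
    "detailed":          "Answer in detail, in plain text, without any Markdown formatting.",  # y^L
    "detailed_markdown": "Answer in detail and format the answer with Markdown.",              # y^{L,M}
}

def style_variant(answer: str, style: str) -> str:
    """Rewrite an answer into the requested style while keeping its claims unchanged,
    so that substance (including any injected error) stays fixed across styles."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": STYLE_INSTRUCTIONS[style]},
            {"role": "user", "content": "Rewrite the following answer accordingly, "
                                        "without changing what it claims:\n\n" + answer},
        ],
    )
    return completion.choices[0].message.content
```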

 

To systematically evaluate reward models and minimize interference from style, we organize the results into a 3×3 matrix, referred to as the Style-Substance Evaluation Matrix.

 

Figure 2 provides an example of this matrix for the sfairXC/FsfairX-LLaMA3-RM-v0.1 reward model in the chat domain.

 

The rows represent chosen responses with different styles, and the columns represent rejected responses with different styles.

 

Diagonal elements compare responses with the same style, while off-diagonal elements compare responses with differing levels of detail and formatting. From this matrix, we derive three accuracy metrics:

 

• Easy Accuracy: The average of the lower triangle, representing the reward model's ability to detect substance when the style cues favor the chosen response.

 

• Normal Accuracy: The average of the diagonal elements, reflecting the model's ability to assess substance when both responses share the same style.

 

• Hard Accuracy: The average of the upper triangle, measuring the model’s capacity to identify the better response based purely on substance, even when the rejected response has a more favorable style.

 

These metrics are calculated for the four domains: Chat, Safety, Code, and Math, resulting in domain-specific metrics such as Chat Normal Accuracy or Safety Hard Accuracy.

 

Additionally, we compute the Average Accuracy across all domains to provide an overall performance metric for the reward model.
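A small numpy sketch of how the three accuracies fall out of the 3×3 matrix, assuming the styles are ordered concise, detailed, detailed+Markdown along both axes; the example values are made up.

```python
import numpy as np

def accuracy_metrics(matrix: np.ndarray) -> dict:
    """Easy/Normal/Hard accuracy from a 3x3 Style-Substance Evaluation Matrix.

    matrix[i, j]: accuracy when the chosen response uses style i and the rejected
    response uses style j, with styles ordered concise, detailed, detailed+Markdown.
    """
    easy = matrix[np.tril_indices(3, k=-1)].mean()   # chosen response has the style advantage
    normal = np.diag(matrix).mean()                  # both responses share the same style
    hard = matrix[np.triu_indices(3, k=1)].mean()    # rejected response has the style advantage
    return {"easy": easy, "normal": normal, "hard": hard}

# Invented numbers showing the typical pattern (accuracy drops as the rejected
# response gains a style advantage):
m = np.array([
    [0.80, 0.55, 0.45],
    [0.90, 0.75, 0.50],
    [0.95, 0.85, 0.70],
])
print(accuracy_metrics(m))  # easy ≈ 0.90, normal ≈ 0.75, hard ≈ 0.50
```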