Dataset

MMHAL-BENCH : ALIGNING LARGE MULTIMODAL MODELS WITH FACTUALLY AUGMENTED RLHF — Paper Review

jinuklee 2024. 9. 30. 18:11

https://arxiv.org/pdf/2309.14525

Large Multimodal Models (LMM) are built across modalities and the misalignment between two modalities can result in "hallucination", generating textual outputs that are not grounded by the multimodal information in context.

To address the multimodal misalignment issue, we adapt the Reinforcement Learning from Human Feedback (RLHF) from the text domain to the task of vision-language alignment, where human annotators are asked to compare two responses and pinpoint the more hallucinated one, and the vision-language model is trained to maximize the simulated human rewards.

We propose a new alignment algorithm called Factually Augmented RLHF that augments the reward model with additional factual information such as image captions and ground-truth multi-choice options, which alleviates the reward hacking phenomenon in RLHF and further improves the performance.

We also enhance the GPT-4-generated training data (for vision instruction tuning) with previously available human-written image-text pairs to improve the general capabilities of our model.

To evaluate the proposed approach in real-world scenarios, we develop a new evaluation benchmark MMHAL-BENCH with a special focus on penalizing hallucinations.

As the first LMM trained with RLHF, our approach achieves remarkable improvement on the LLaVA-Bench dataset, reaching 94% of the performance level of the text-only GPT-4 (while previous best methods only achieve the 87% level), and an improvement of 60% on MMHAL-BENCH over other baselines.

We open-source our code, model, and data at https://llava-rlhf.github.io.


2.2 AUGMENTING LLAVA WITH HIGH-QUALITY INSTRUCTION-TUNING

Recent studies (Zhou et al., 2023; Touvron et al., 2023b) show that high-quality instruction tuning data is essential for aligning Large Language Models (LLMs).

We find this becomes even more salient for LMMs.

As these models traverse vast textual and visual domains, clear tuning instructions are crucial.

Correctly aligned data ensures models produce contextually relevant outputs, effectively bridging language and visual gaps.

For example, LLaVA synthesized 150k visual instruction data using the text-only GPT-4, where an image is represented as the associated captions and bounding boxes to prompt GPT-4.

Though careful filtering has been applied to improve the quality, the pipeline can occasionally generate visually misaligned instruction data that cannot be easily removed with an automatic filtering script, as highlighted in Table 1.
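To make this pipeline more concrete, here is a minimal sketch of how an image might be rendered as plain text for a text-only model. The helper name `image_as_text`, the field layout, and the example values are illustrative assumptions, not the authors' actual prompt template.

```python
# Hypothetical sketch: represent an image as captions plus object bounding boxes,
# in the spirit of the LLaVA data-synthesis pipeline described above.

def image_as_text(captions, boxes):
    """captions: list of strings; boxes: list of (label, (x1, y1, x2, y2)) with coords in [0, 1]."""
    lines = ["Captions:"]
    lines += [f"- {c}" for c in captions]
    lines.append("Objects (normalized xyxy boxes):")
    lines += [
        f"- {label}: ({x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f})"
        for label, (x1, y1, x2, y2) in boxes
    ]
    return "\n".join(lines)

if __name__ == "__main__":
    prompt = image_as_text(
        captions=["A dog catches a frisbee in a park."],
        boxes=[("dog", (0.31, 0.42, 0.58, 0.80)), ("frisbee", (0.55, 0.30, 0.66, 0.38))],
    )
    print(prompt)  # this textual stand-in for the image would be sent to a text-only GPT-4
```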

In this work, we consider enhancing LLaVA (98k conversations, after holding out 60k conversations for preference modeling and RL training) with high-quality instruction-tuning data derived from existing human annotations.

Specifically, we curated three categories of visual instruction data (see the conversion sketch after the list):

“Yes” or “No” queries from VQA-v2 (83k) (Goyal et al., 2017b),

multiple-choice questions from A-OKVQA (16k) (Schwenk et al., 2022),

and grounded captions from Flickr30k (23k) (Young et al., 2014a).
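A rough idea of how such human annotations can be recast as instruction-tuning conversations is sketched below. The field names (`question`, `choices`, `grounded_caption`, etc.) and the prompt wording are assumptions made for illustration, not the released preprocessing code.

```python
# Illustrative conversion of the three annotation sources into (prompt, response) pairs.

def vqa_yes_no(example):
    # VQA-v2 binary questions become short yes/no instructions.
    return {"prompt": example["question"] + " Please answer yes or no.",
            "response": example["answer"]}  # "yes" / "no"

def aokvqa_multi_choice(example):
    # A-OKVQA options are lettered and the model is asked for the letter.
    options = ", ".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(example["choices"]))
    return {"prompt": f"{example['question']} Options: {options}. Answer with the option letter.",
            "response": chr(65 + example["correct_idx"])}

def flickr_grounded_caption(example):
    # Flickr30k grounded captions become localized description targets.
    return {"prompt": "Describe the image, mentioning where each object is located.",
            "response": example["grounded_caption"]}
```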

Our analysis revealed that this amalgamation of datasets significantly improved LMM capabilities on benchmark tests.

Impressively, these results surpassed models (Dai et al., 2023; Li et al., 2023a; Laurençon et al., 2023) trained on datasets an order of magnitude larger than ours, as evidenced by Tables 7 and 4.

For a comprehensive breakdown of each dataset’s influence, refer to Section 3.5.

2.3 HALLUCINATION-AWARE HUMAN PREFERENCE COLLECTION

Inspired by the recent RLHF studies that collect helpfulness and harmlessness preferences (Bai et al., 2022b; Touvron et al., 2023b) separately, in this study, we decide to differentiate between responses that are merely less helpful and those that are inconsistent with the images (often characterized by multimodal hallucinations).

To achieve this, we provide crowdworkers with the template illustrated in Table 2 to guide their annotations when comparing two given responses.

With our current template design, we aim to prompt crowdworkers to identify potential hallucinations in the model's responses. Nonetheless, our training process integrates a single reward model that emphasizes both multimodal alignment and overall helpfulness.

We collect human preferences on 10k hold-out LLaVA data by re-sampling the last response with our SFT model and a temperature of 0.7. The reward model is initialized from the SFT model to obtain the basic multimodal capabilities.
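For context, here is a minimal sketch of the standard pairwise (Bradley-Terry) reward-modeling objective that such binary comparisons are typically trained with. `reward_model` is a placeholder for a scalar-head LMM initialized from the SFT checkpoint; this is not the authors' training code.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, image, prompt, chosen, rejected):
    """Pairwise reward-modeling loss: score the preferred (less hallucinated) response higher."""
    r_chosen = reward_model(image, prompt, chosen)      # scalar score per sample (tensor)
    r_rejected = reward_model(image, prompt, rejected)
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```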
2.4 FACTUALLY AUGMENTED RLHF (FACT-RLHF)

We conduct multimodal RLHF on 50k hold-out LLaVA conversations, with additional 12k multi-choice questions from A-OKVQA and 10k yes/no questions subsampled from VQA-v2. Due to the concerns of existing hallucinations in the synthetic multi-round conversation data of LLaVA, we only use the first question in each conversation for RL training, which avoids the pre-existing hallucinations in the conversational context.
Reward Hacking in RLHF

In preliminary multimodal RLHF experiments, we observe that due to the intrinsic multimodal misalignment in the SFT model, the reward model is weak and sometimes cannot effectively detect hallucinations in the RL model's responses. In the text domain, previous work (Bai et al., 2022a; Touvron et al., 2023b) proposed to iteratively collect "fresh" human feedback. However, this can be quite costly, cannot effectively utilize existing human-annotated data, and offers no guarantee that more preference data will significantly improve the discriminative capabilities of the reward model for multimodal problems.
Factual Augmentation

To augment the capability of the reward model, we propose Factually Augmented RLHF (Fact-RLHF), where the reward model has access to additional ground-truth information, such as image captions, to calibrate its judgment. In original RLHF (Stiennon et al., 2020; OpenAI, 2022), the reward model needs to judge the quality of the response based only on the user query (i.e., the input image and prompt):
Image: [IMAGE]
User: [USER PROMPT]
Assistant: [RESPONSE]
Reward Model: [SCORE]

In Factually Augmented RLHF (Fact-RLHF), the reward model has additional information about the textual descriptions of the image:

Image: [IMAGE]
Factual Information: [5 COCO IMAGE CAPTIONS / 3 A-OKVQA RATIONALES]
User: [USER PROMPT]
Assistant: [RESPONSE]
Augmented Reward Model: [SCORE]
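A small sketch of how the two input formats above could be assembled as plain text for the reward model; the helper name and the exact separators are assumptions, and only the field layout follows the templates shown.

```python
def reward_model_input(image_tokens, user_prompt, response, facts=None):
    """Build the textual reward-model input; pass `facts` to get the Fact-RLHF variant."""
    parts = [f"Image: {image_tokens}"]
    if facts:  # Fact-RLHF: prepend ground-truth captions or A-OKVQA rationales
        parts.append("Factual Information: " + " ".join(facts))
    parts += [f"User: {user_prompt}", f"Assistant: {response}"]
    return "\n".join(parts)

# Vanilla reward-model input vs. the factually augmented input for the same query:
plain = reward_model_input("[IMAGE]", "What is the dog doing?", "It is catching a frisbee.")
augmented = reward_model_input("[IMAGE]", "What is the dog doing?", "It is catching a frisbee.",
                               facts=["A dog leaps to catch a frisbee in a park."])
```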

This prevents the reward model from being hacked by the policy model when the policy model generates hallucinations that are clearly not grounded by the image captions. For general questions with COCO images, we concatenate the five COCO captions as the additional factual information, while for A-OKVQA questions, we use the annotated rationales as the factual information.
The factually augmented reward model is trained on the same binary preference data as the vanilla reward model, except that the factual information is provided both during the model fine-tuning and inference.
Symbolic Rewards: Correctness Penalty & Length Penalty

In some of our RL data, certain questions come with a predetermined ground-truth answer. This includes binary choices (e.g., "Yes/No") in VQA-v2 and multiple-choice options (e.g., "ABCD") in A-OKVQA. These annotations can also be regarded as additional factual information. Therefore, in the Fact-RLHF algorithm, we further introduce a symbolic reward mechanism that penalizes selections that diverge from these ground-truth options.
Furthermore, we observed that RLHF-trained models often produce more verbose outputs, a phenomenon also noted by Dubois et al. (2023). While these verbose outputs might be favored by users or by automated LLM-based evaluation systems (Sun et al., 2023b; Zheng et al., 2023), they tend to introduce more hallucinations for LMMs. In this work, we follow Sun et al. (2023a) and incorporate the response length, measured in the number of tokens, as an auxiliary penalizing factor.
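A hedged sketch of how the learned reward, the symbolic correctness penalty, and the length penalty might be folded into a single scalar during RL. The weights `wrong_answer_penalty` and `length_coef`, and the simple substring-matching rule, are illustrative assumptions rather than the paper's actual values.

```python
def total_reward(rm_score, response_text, response_num_tokens,
                 ground_truth=None, wrong_answer_penalty=1.0, length_coef=0.01):
    """Combine the learned reward with the symbolic penalties described above."""
    reward = rm_score
    # Correctness penalty: ground_truth is e.g. "Yes"/"No" (VQA-v2) or a letter "A"-"D" (A-OKVQA).
    if ground_truth is not None and ground_truth.lower() not in response_text.lower():
        reward -= wrong_answer_penalty
    # Length penalty: discourage verbose, hallucination-prone responses.
    reward -= length_coef * response_num_tokens
    return reward

# Example: a correct multiple-choice answer, lightly penalized for its 6-token length.
score = total_reward(rm_score=1.3, response_text="(B) a frisbee",
                     response_num_tokens=6, ground_truth="B")
```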