
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback

jinuklee 2024. 10. 13. 01:17

https://arxiv.org/abs/2312.00849

Multimodal Large Language Models (MLLMs) have recently demonstrated impressive capabilities in multimodal understanding, reasoning, and interaction.

 

However, existing MLLMs prevalently suffer from serious hallucination problems, generating text that is not factually grounded in associated images.

 

The problem makes existing MLLMs untrustworthy and thus impractical in real-world (especially high-stakes) applications.

 

To address the challenge, we present RLHF-V, which enhances MLLM trustworthiness via behavior alignment from fine-grained correctional human feedback.

 

Specifically, RLHF-V collects human preference in the form of segment-level corrections on hallucinations, and performs dense direct preference optimization over the human feedback.

 

Comprehensive experiments on five benchmarks, in both automatic and human evaluation, show that RLHF-V can enable substantially more trustworthy MLLM behaviors with promising data and computation efficiency.

 

Remarkably, using 1.4k annotated data samples, RLHF-V significantly reduces the hallucination rate of the base MLLM by 34.8%, outperforming the concurrent LLaVA-RLHF trained on 10k annotated data.

 

The final model achieves state-of-the-art performance in trustworthiness among open-source MLLMs, and shows better robustness than GPT-4V in preventing hallucinations arising from over-generalization.

3. Method

 

We introduce the RLHF-V approach, which learns from fine-grained correctional human feedback via dense direct preference optimization.

 

In addition, we also mitigate existing sources of hallucination in MLLM training by addressing the vision-language mismatch problem.

 

3.1. Dense Direct Preference Optimization

 

To leverage the dense and fine-grained human feedback, we present DDPO, a new variant of direct preference optimization [42] for directly optimizing the MLLM policy against dense human preference.

 

The prevalent RLHF approaches involve fitting a reward model on the preference data, and then training the critic, policy, and value models to maximize the reward without deviating too far from the reference model [13, 39, 51].

 

This procedure requires training multiple LLMs with extensive sampling, making it complex and computationally costly. Direct Preference Optimization (DPO) [42] solves this reinforcement learning objective in a simpler, equivalent supervised fashion.

 

Here we briefly introduce the DPO method, and refer readers to the original paper for more details.

 

The key observation of DPO is that the reward function r(x, y) can be analytically expressed by its optimal policy model π∗(y|x) and reference model πref(y|x), and therefore we can directly optimize the policy model under proper forms on the preference data.

 

Specifically, the reward model r(x, y) can be represented as:

r(x, y) = β log [ π∗(y|x) / πref(y|x) ] + β log Z(x),

where β is a constant and Z(x) is the partition function. The reference model πref(y|x) is usually implemented by an instruction-tuned base model we want to improve, and is kept fixed during DPO training.

 

Only the policy model π∗(y|x) is updated.

 

We note that DPO is simpler, more efficient, and more stable in aligning MLLM behaviors compared with traditional RLHF approaches.
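
To make this concrete, here is a minimal PyTorch sketch of the standard DPO loss (not the authors' code; argument names and shapes are assumptions), computed from per-response log-probabilities summed over tokens:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.5) -> torch.Tensor:
    """Standard DPO loss over a batch of preference pairs.
    Each argument is a [batch] tensor of token log-probabilities
    summed over the response."""
    # Implicit rewards; the intractable log Z(x) term cancels in the margin.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the log-sigmoid of the reward margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```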

 

Leveraging dense and fine-grained segment-level feedback essentially requires the model to evaluate the reward of segment-level actions.

 

However, DPO is designed for learning preference in the form of overall response ranking labels.

 

Specifically, the action score of DPO is in practice given by the likelihood of the holistic response, where all segments are treated equally:

s(x, y) = log p(y | x) = Σ_i log p(y(i) | x, y(<i)),

where y(i) is the i-th token of the response y.

 

We argue that, compared with unchanged segments y(u), corrected segments y(c) more directly reveal human judgment on hallucination, and thus should contribute more to the overall action evaluation. Therefore, we propose to score the response as a weighted aggregation of the fine-grained segments:

s(x, y) = Σ_{y(i) ∈ y(u)} log p(y(i) | x, y(<i)) + γ · Σ_{y(i) ∈ y(c)} log p(y(i) | x, y(<i)),

where γ > 1 is a weighting coefficient that amplifies the contribution of the corrected segments.
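
A minimal sketch of how such a weighted score could be computed (my own illustration; tensor names, shapes, and the absence of any length normalization are assumptions), given per-token log-probabilities and a mask marking corrected tokens:

```python
import torch

def ddpo_response_score(token_logps: torch.Tensor,
                        corrected_mask: torch.Tensor,
                        attention_mask: torch.Tensor,
                        gamma: float = 5.0) -> torch.Tensor:
    """Weighted aggregation of per-token log-probs into a response score.

    token_logps:    [batch, seq_len], log p(y(i) | x, y(<i)) under some model
    corrected_mask: [batch, seq_len], 1.0 for tokens in corrected segments y(c),
                    0.0 for tokens in unchanged segments y(u)
    attention_mask: [batch, seq_len], 1.0 for real tokens, 0.0 for padding
    """
    # Unchanged tokens get weight 1, corrected tokens get weight gamma (> 1).
    weights = 1.0 + (gamma - 1.0) * corrected_mask.float()
    return (weights * token_logps * attention_mask.float()).sum(dim=-1)
```

The scores for the preferred and dispreferred responses, under both the policy and the frozen reference model, would then replace the plain sequence log-probabilities in the DPO loss sketched above.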

3.2. Mitigating Hallucination from VL Mismatch

 

DDPO reduces hallucination by learning from human feedback.

 

From a complementary, cause-and-effect perspective, we examine the mainstream MLLM training paradigm and identify sources of hallucination in MLLM training. Based on these observations, we motivate a more trustworthy training recipe.

 

In general, current MLLMs learn multimodal capabilities in a supervised learning paradigm, where the model outputs are supervised against the ground-truth text associated with the image.

 

In such a paradigm, hallucinations can be introduced by mismatches between images and text data.

 

In practice, the mismatch can come from: (1) low-quality text in pre-training and instruction tuning data, and (2) careless image augmentation during training. We specify the issues and solutions in the following.

 

Addressing Low-quality Text Influence.

 

Current pretraining data of MLLMs are automatically crawled from the Web [9, 10, 44], which inevitably suffers from severe noise in the text even after extensive post-processing.

 

Supervising MLLMs against such data is essentially teaching them to hallucinate (e.g., describing elements not present in the image, or producing descriptions inconsistent with the image).

 

Similarly, most existing visual instruction tuning datasets are generated by ChatGPT/GPT-4 according to intermediate text annotations [33, 35, 60], which inevitably introduces hallucination into instruction data.

 

While it can be difficult to repair existing pre-training and instruction-tuning data, we find that this influence can be countered by simply post-training MLLMs on high-quality visual question-answering datasets.

 

Intuitively, human-labeled datasets can provide accurate learning signals to calibrate model behaviors from hallucinations, and also enhance instruction-following capabilities.

 

In our experiments, we find that simply finetuning the model on VQAv2 [18] can significantly reduce the hallucination rate (see Section 4.3).

 

Mitigating Untrustworthy Image Augmentation.

 

The vision-language mismatch can also come from the image domain.

 

Data augmentation is widely adopted to improve the data diversity and model robustness in various multimodal models [14, 27, 41, 53, 60].

 

However, we note that such augmentation must be performed with care in training MLLMs.

 

The key problem is that some image augmentation operations can significantly change the semantics of images, which may make the augmented image inconsistent with the associated text.

 

For example, during augmentation, random cropping can make the objects mentioned in the text absent from the image.

 

This can make the model describe non-existing objects, with wrong numbers, and in wrong positions.

 

In our model training, we exclude image cropping from data augmentation, which improves the trustworthiness of MLLMs (see Section 4.3).
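
For illustration only (the paper states that image cropping is excluded but does not list the remaining transforms, so everything else here is an assumption), a torchvision-style training transform that avoids semantics-changing crops might look like this:

```python
from torchvision import transforms

# Hypothetical augmentation pipeline: resize the whole image instead of
# cropping, so that objects mentioned in the paired text cannot be cut out.
# Normalization statistics are placeholders.
train_transform = transforms.Compose([
    transforms.Resize((448, 448)),        # keep all image content visible
    # transforms.RandomResizedCrop(448),  # excluded: may remove objects
    #                                     # referenced in the text
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```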

 

4. Experiments

 

In this section, we empirically investigate the effectiveness of RLHF-V in aligning MLLM behaviors.

 

In addition to evaluating the trustworthiness and helpfulness of conversation, we also analyze the data efficiency and scalability as well as the robustness.

 

We refer readers to the appendix for more details on benchmarks, baselines and results.

 

4.1. Experimental Settings

 

We first introduce the experimental settings, including evaluation, baselines, and implementation details.

 

Evaluation.

 

We evaluate the models from two perspectives, including trustworthiness reflecting the hallucination degree, and helpfulness reflecting the general interaction quality.

 

Similar to [48], we find that binary classification evaluation (i.e., answering yes/no) [17, 29] cannot adequately reflect MLLM behaviors in open-ended long-form interactions.

 

We thus adopt benchmarks that directly evaluate long-form responses, which are more closely related to the practical usage scenarios of MLLMs.

 

For trustworthiness, we perform evaluation on three benchmarks:

 

(1) Object HalBench [43] is a widely adopted benchmark for assessing object hallucination in detailed image descriptions.

 

It compares the objects in the model output with object labels exhaustively annotated for COCO images [31] to detect object hallucination.

 

To improve the evaluation stability, we augment the benchmark with 8 diverse prompts for detailed image descriptions.

 

We report the response-level hallucination rate (i.e., the percentage of responses that contain hallucinations), as well as the mention-level hallucination rate (i.e., the percentage of hallucinated object mentions among all object mentions); a rough sketch of both metrics is given after the benchmark descriptions below.

 

(2) MMHal-Bench [48] evaluates hallucinations and response informativeness.

 

It employs GPT-4 to compare model output with human response and several object labels to decide the scores.

 

In experiments, we find that GPT-4 cannot reliably detect hallucinations due to the incompleteness of MMHal-Bench text annotations.

 

We therefore only report the informativeness score from GPT-4, and assess response-level hallucination rate by human evaluation.

 

(3) MHumanEval.

 

The above evaluations are either limited to common object hallucination or dominated by short-form question answering (i.e., questions that can be sufficiently answered by a few words).

 

To provide a more reliable and comprehensive evaluation over diverse hallucination types, we present the MHumanEval benchmark, which covers both long-form image descriptions and short-form questions.

 

The benchmark contains 146 samples collected from Object HalBench (50) and MMHal-Bench (96).

 

Given model responses, we ask human annotators to label the hallucinated segments and hallucination types of the segments, including objects, positions, numbers and others.

 

We report the response-level hallucination rate on these types.
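
As referenced above, here is a rough sketch of the two hallucination-rate metrics (assuming object mentions have already been extracted from each response and matched against the images' COCO object annotations):

```python
def hallucination_rates(responses):
    """responses: list of dicts with 'mentions' (all object mentions in a
    response) and 'hallucinated' (the subset not grounded in the image's
    object labels); both fields are assumed to be precomputed."""
    total_mentions = sum(len(r["mentions"]) for r in responses)
    bad_mentions = sum(len(r["hallucinated"]) for r in responses)
    bad_responses = sum(1 for r in responses if r["hallucinated"])

    response_level = bad_responses / max(len(responses), 1)
    mention_level = bad_mentions / max(total_mentions, 1)
    return response_level, mention_level
```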

 

For helpfulness, we adopt two benchmarks:

 

(1) LLaVA Bench [35] is a widely adopted benchmark for assessing multimodal conversation, detailed description and complex reasoning capabilities.

 

It scores model outputs against reference responses via GPT-4. (2) VQAv2 [18] is a popular dataset for short-form visual question answering.

Baselines.

 

We compare our model with state-of-the-art baselines.

 

(1) General baselines. We adopt Qwen-VL-Chat [6], LLaVA [35], LLaVA-1.5 [34], Muffin [60], and InstructBLIP [14] as representative general baselines.

 

These models are mostly pre-trained on large-scale multimodal data, and fine-tuned on high-quality instruction data, achieving strong performance across various multimodal tasks.

 

(2) Baselines tailored for hallucination problems. LRV [33] is fine-tuned on 400k instruction data generated by GPT-4, and mitigates hallucination by limiting the response length.

 

The concurrent LLaVA-RLHF [48] employs the strong 13B Vicuna v1.5 [62] (fine-tuned from LLaMA2 [51]) as LLM backbone.

 

It trains the reward model on 10k human-labeled preference data, and performs proximal policy optimization [45] on 72k factually augmented data.

 

(3) Commercial Baseline.

 

We also include GPT-4V [37] as a strong reference to gauge the gap between open-source models and state-of-the-art commercial models.

 

Implementation Details.

 

We implement the RLHF-V framework based on Muffin [60].

 

The model uses BEiT3 [53] as the visual module, and 13B Vicuna v1.0 [12] (finetuned from LLaMA [50]) as the LLM backbone.

 

The hyperparameter β is 0.5, and the weighting coefficient γ is 5.

 

We train the model with DDPO for 7 epochs, with image resolution 448, learning rate 5e-7 and batch size 32.

 

The training of RLHF-V is computationally efficient, taking less than 1 hour on 8 A100 GPUs in total.
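
For reference, the reported hyperparameters can be collected into a simple configuration object (field names are mine; only the values come from the paper):

```python
from dataclasses import dataclass

@dataclass
class RLHFVTrainingConfig:
    beta: float = 0.5           # DPO/DDPO temperature β
    gamma: float = 5.0          # weight on corrected segments in DDPO
    epochs: int = 7
    image_resolution: int = 448
    learning_rate: float = 5e-7
    batch_size: int = 32
```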

 

4.2. Main Results

 

The main experimental results are reported in Table 1, from which we observe that:

 

(1) RLHF-V achieves state-of-the-art performance in trustworthiness among open-source models, outperforming strong general models and models tailored for hallucination.

 

The framework significantly reduces the hallucination rate of the base model Muffin by 75.8% relative points for common objects on Object HalBench, and by 34.8% for overall objects on MHumanEval.

 

The improvement is consistent in different granularities including response-level and mention-level hallucinations, and different hallucination types including objects, positions, and numbers.

 

The reduction is more significant on the more challenging long-form answers on Object HalBench and MHumanEval.

 

The results show that RLHF-V can effectively learn from fine-grained correctional human feedback to enable more trustworthy MLLM behaviors.

 

(2) RLHF-V achieves promising performance in response helpfulness, where the results on MMHal-Bench, LLaVA Bench, and VQAv2 are strong and comparable to the base model.

 

This shows that RLHF-V can enhance the trustworthiness of MLLMs without sacrificing their helpfulness.

 

4.3. Analysis

In this section, we conduct analyses on the framework, considering the following research questions:

 

(1) How does RLHF-V’s performance scale with feedback data amount?

 

(2) What is the advantage of fine-grained correctional preference data over traditional overall ranking data?

 

(3) Can RLHF-V’s data and method be adopted to enhance the trustworthiness of other MLLMs?

 

(4) How does human feedback alleviate hallucinations intuitively?

Scaling feedback data leads to promising results.

 

We report the hallucination rate and the number of hallucinated segments on MHumanEval under different amounts of feedback data in Figure 2.

 

We observe that the hallucination rate and hallucination count of RLHF-V decrease significantly and rapidly as the data amount grows.

 

This shows that fine-grained correctional human feedback provides effective and efficient learning signals for MLLM behavior alignment.

 

Based on this tendency, we expect better performance can be achieved with an increasing amount of feedback data. We leave this for future work.

 

Fine-grained correctional human feedback enables better learning efficiency.

 

To quantify the advantage of fine-grained correctional human feedback, we replace our data with the 2.2k human preference data on hallucination from LLaVA-RLHF, which gives overall ranking labels following common RLHF practices.

 

From the experimental results in Figure 2, we observe that the model trained on our data shows a more significant and rapid reduction in hallucination rate and count.

 

Notably, using only 200 preference samples, our model achieves a hallucination rate comparable to that of the model trained on an order of magnitude more labeled data from LLaVA-RLHF.

 

The superior data efficiency is due to (1) better data quality since label ambiguity is minimized, and (2) more direct feedback on hallucinated segments, excluding non-robust bias and linguistic variance.

 

RLHF-V generalizes to enhance other MLLMs.

 

To investigate the generalization capability of the framework, we adopt RLHF-V’s data and approach to align the behavior of LLaVA [35], a representative and widely used MLLM.

 

Experimental results show that RLHF-V effectively reduces the hallucination count of LLaVA by 13.8 relative points, as well as the hallucination rate by 5.9 relative points.

 

We also apply RLHF-V to stronger base models to build OmniLMM-12B [3], which achieves new SoTA results on multiple hallucination benchmarks.

 

For example, OmniLMM-12B exhibits only a 4.5% mention-level hallucination rate on Object HalBench.

 

Moreover, OmniLMM-12B also shows leading performance among comparable-sized models on multiple benchmarks (1637 on MME-Perception [17], 71.1 on SeedBench-I [25]).

 

The results demonstrate that RLHF-V is applicable across different MLLMs to improve trustworthiness.

 

RLHF-V reduces hallucination from correlation and over-generalization.

 

LLMs possess rich world knowledge and strong generalization capabilities.

 

Without proper positive/negative human feedback, MLLMs can over-generalize to produce highly correlated and plausible concepts, which leads to hallucinations.

 

For example, a prevalent hallucination case observed across different MLLMs is claiming the presence of a person whenever they see an image of a street.

 

To quantify the problem, we select a set of representative scenes {living room, kitchen, bathroom, street}.

 

For each scene, we identify the corresponding images in COCO by lexically matching the captions with the scene name.

 

Then we obtain the top 10 frequent objects in the scene from the COCO object annotations.

 

We compare the response-level hallucination rate for these objects

 

(1) on average across all test samples, and

 

(2) on samples under the target scene. Models prone to over-generalization are expected to show a significant increase in the hallucination rate (∆).
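
A rough sketch of this analysis (data structures and the matching logic are my assumptions, not the authors' code):

```python
from collections import Counter

def scene_hallucination_gap(samples, scene, top_k=10):
    """samples: list of dicts with 'caption' (COCO caption), 'gt_objects'
    (annotated objects in the image), and 'hallucinated_objects' (objects
    the model mentioned that are absent from the image). Returns the
    response-level hallucination rate on the scene's frequent objects,
    computed over all samples and over in-scene samples."""
    in_scene = [s for s in samples if scene in s["caption"].lower()]

    # Top-k most frequent annotated objects under the target scene.
    counts = Counter(obj for s in in_scene for obj in s["gt_objects"])
    frequent = {obj for obj, _ in counts.most_common(top_k)}

    def rate(subset):
        hallucinated = sum(1 for s in subset
                           if frequent & set(s["hallucinated_objects"]))
        return hallucinated / max(len(subset), 1)

    overall_rate, in_scene_rate = rate(samples), rate(in_scene)
    # Over-generalizing models show a large gap: Δ = in_scene_rate - overall_rate.
    return overall_rate, in_scene_rate
```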

 

From the experimental results in Table 2, we observe that:

 

(1) All models including GPT-4V show a substantial increase in the hallucination rate, which demonstrates the over-generalization hypothesis.

 

(2) RLHF-V exhibits the smallest increase in hallucination rate, showing even better robustness than GPT-4V.

 

The reason for this robustness is that RLHF-V provides crucial positive/negative fine-grained correctional human feedback to MLLMs, which helps them learn clear behavior boundaries between reasonable generalization and over-generalization.

 

(3) RLHF-V achieves the lowest hallucination rates for these common objects, both on average and especially under common scenes.

 

This makes RLHF-V preferable in practical real-world applications.

 

Ablation Study.

 

To investigate the contribution of each component, we perform an ablation study.

 

From the experimental results in Table 3, we can observe that:

 

(1) Learning human feedback with vanilla DPO leads to performance degradation, showing the advantage of DDPO in exploiting the fine-grained human preference.

 

(2) Fine-tuning on VQAv2 leads to a significant reduction in hallucination rates compared with the base model.

 

This reveals the value of traditional human-annotated datasets from a new perspective of hallucination mitigation.

 

(3) Including untrustworthy data augmentation (i.e., image cropping) in training hurts the performance on both hallucination and VQAv2.

 

This shows that careless data augmentation can be a double-edged sword in training MLLMs.

 

Case Study.

 

To provide an intuitive understanding and comparison of different models, we provide qualitative results in Figure 3. We show cases in two representative scenarios:

 

(1) Short-form QA (i.e., questions that can be sufficiently answered in a few words).

 

Our model typically maintains a good balance between helpfulness, engagement and clarity.

 

In comparison, LLaVA-RLHF is usually far more engaging, but it introduces lengthy extensions that can be less reasonable or relevant.

 

(2) Long-form QA (i.e., questions that require long text to answer).

 

We observe that MLLMs are significantly more prone to hallucinations in long-form QA, since it typically requires more comprehensive capabilities from multiple perspectives.

 

For example, InstructBLIP and LLaVA-RLHF can confidently describe non-existing objects in a large proportion of their responses, whereas RLHF-V introduces significantly fewer hallucinations while delivering a comparable amount of effective information.

 

We refer readers to the appendix for more qualitative results.