https://arxiv.org/abs/2402.11411
Instruction-following Vision Large Language Models (VLLMs) have achieved significant progress recently on a variety of tasks.
These approaches merge strong pre-trained vision models and large language models (LLMs).
Since these components are trained separately, the learned representations need to be aligned with joint training on additional image-language pairs.
This procedure is not perfect and can cause the model to hallucinate: it may provide answers that do not accurately reflect the image, even when the core LLM is highly factual and the vision backbone has sufficiently complete representations.
In this work, we frame the hallucination problem as an alignment issue and tackle it with preference tuning.
Specifically, we propose POVID to generate feedback data with AI models.
We use ground-truth instructions as the preferred response and a two-stage approach to generate dispreferred data.
First, we prompt GPT-4V to inject plausible hallucinations into the correct answer.
Second, we distort the image to trigger the inherent hallucination behavior of the VLLM.
This approach is fully automated: it does not rely on human data generation or require a perfect expert, which makes it easily scalable.
Finally, both of these generation strategies are integrated into an RLHF pipeline via Direct Preference Optimization.
In experiments across a broad range of benchmarks, we show that POVID not only reduces hallucinations but also improves performance on standard benchmarks, outperforming prior approaches.
3. Constructing Preferences to Align Modalities in VLLMs
While preference learning approaches (e.g., DPO) facilitate the lightweight and stable training of VLLMs, they require data in the form of preferences.
In contrast to LLMs, which in many scenarios support free-form generation, VLLMs used in applications such as VQA or image captioning produce responses that are tied to input images.
This inherent image-centricity presents distinct challenges in the preference data generation process for VLLMs, setting it apart from the process in LLMs.
Specifically, in VLLMs, when comparing two responses, neither of which is correct for the required task (e.g., image captioning), the model may not be able to accurately align the image with the response.
To address this challenge, we propose Preference Optimization in VLLM with AI-Generated Dispreferences (POVID), a novel approach aimed at better aligning image and text modalities.
As illustrated in Figure 2, POVID leverages AI models to generate dispreferred responses without the need for human labeling efforts.
These generated dispreferred responses, when combined with groundtruth image descriptions (treated as preferred responses), form the preference data pairs. Specifically, we employ two strategies to generate the dispreferred response:
(1) Firstly, we manipulate the groundtruth text response by transforming it into a hallucinated response, which serves as the dispreferred response;
(2) Secondly, we introduce distortion to the image input during the training process, intending to trigger inherent hallucination patterns within the VLLMs.
These patterns are then formalized as the dispreferred response, motivating the model to correct its inherent dispreferred patterns. In the remainder of this section, we will provide detailed explanations of both strategies and demonstrate how to integrate them into the preference training framework.
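Both strategies reduce to the same data shape: a prompt paired with a preferred (groundtruth) response and a dispreferred (AI-generated) response. A minimal sketch of this pairing, with hypothetical field names (the paper does not specify its data schema):

```python
def make_preference_pair(prompt, groundtruth_answer, hallucinated_answer):
    """Pair the groundtruth answer (preferred) with an AI-generated
    hallucinated answer (dispreferred). Field names are hypothetical."""
    return {
        "prompt": prompt,
        "chosen": groundtruth_answer,     # preferred response
        "rejected": hallucinated_answer,  # dispreferred response
    }

pair = make_preference_pair(
    "Describe the image.",
    "A knife and a fork on a plate.",
    "A knife, a fork, and a spoon on a plate.",  # injected co-occurrence hallucination
)
```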
3.1. Hallucinating Textual Responses
In our first strategy, we aim to generate dispreferred hallucinatory responses by hallucinating the groundtruth correct response.
We construct the hallucinatory responses from a subset of 17K examples randomly sampled from the LLaVA-Instruct-150K (Liu et al., 2023b) dataset, which was originally used to train LLaVA with supervised fine-tuning.
The 17K examples cover various task types, including image captioning, simple VQA, and logical reasoning.
To construct the preferences, we treat the original answers in the 17K examples as preferred responses.
In terms of constructing dispreferred responses, we hallucinate the original answers using GPT-4V (OpenAI, 2023).
Here, we adopt two hallucinating approaches tailored to different tasks:
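The subset construction described above amounts to a seeded random sample; the helper name and seed below are illustrative, not from the paper:

```python
import random

def sample_subset(dataset, n, seed=0):
    """Draw a fixed-size random subset, as when sampling ~17K examples
    from LLaVA-Instruct-150K. The seed choice is illustrative."""
    return random.Random(seed).sample(dataset, n)

toy_dataset = [{"id": i} for i in range(150)]  # toy stand-in for the 150K examples
subset = sample_subset(toy_dataset, n=17)
```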
I. Hallucinating Image Captioning Tasks.
First, we hallucinate the image captioning tasks by considering three fundamental causes of hallucination in VLLMs:
(1) Object Co-occurrence: This phenomenon arises when the training data contains spurious co-occurring patterns between objects, leading VLLMs to generate objects based on these learned spurious correlations.
In this context, we aim to leverage GPT-4V to deduce object co-occurrence within the given image and subsequently revise the original responses accordingly;
(2) Logical Relationships Between Entities: This involves using GPT-4V to modify the relationships between the original objects;
(3) Incorrect Attributes: In this case, we employ GPT-4V to alter the attributes of various objects, such as changing their colors.
We illustrate these three distinct hallucination scenarios with an example provided in Figure 3(a).
In addition, the prompt we used to generate the dispreferred response is in Appendix A.2.
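Since the actual prompt is given in Appendix A.2, the template below is only a hypothetical sketch of how the three hallucination causes might be folded into a single GPT-4V instruction:

```python
# Hypothetical prompt template; the paper's actual prompt is in Appendix A.2.
ERROR_TYPES = [
    "add an object that plausibly co-occurs with the scene but is absent",
    "alter a logical relationship between two entities",
    "change an attribute of an object, such as its color",
]

def build_hallucination_prompt(caption: str) -> str:
    """Fold the three targeted hallucination causes into one instruction."""
    edits = "\n".join(f"- {e}" for e in ERROR_TYPES)
    return (
        "Rewrite the following image caption so that it contains plausible "
        "hallucinations. Apply these edits while keeping the caption fluent:\n"
        f"{edits}\n\n"
        f"Caption: {caption}"
    )

prompt = build_hallucination_prompt("A knife and a fork on a plate.")
```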
II. Hallucinating Reasoning Tasks.
Secondly, when dealing with tasks involving reasoning, such as VQA and logical reasoning, we task GPT-4V with modifying the reasoning process.
This entails introducing errors related to logical relationships, entity information, entity attributes, and more.
Additionally, we instruct GPT-4V to make only subtle changes to the reasoning process, keeping those changes independent of the final answer, so that an incorrect reasoning process may still yield a correct result.
However, if the introduction of errors necessitates alterations to the reasoning results, we instruct GPT-4V to adjust the results accordingly.
Likewise, in Figure 3(b), we provide an example to demonstrate both the original and the generated dispreferred responses.
The prompt we used is detailed in Appendix A.2.

3.2. Mitigating Inherent Hallucination Patterns

In addition to generating the dispreferred response using powerful external AI models like GPT-4V, we also aim to provoke inherent hallucination patterns within the VLLM to be fine-tuned.
Our second strategy introduces noise into the image to trigger inherent hallucination patterns in the VLLMs.
This noise disrupts the VLLM’s understanding of the image, leading it to produce uncertain responses that rely more on textual context or acquired knowledge from the training data.
This occurs because, in the presence of noisy images, the model tends to prioritize inherent object associations over visual information.
Notably, the noise step should remain within a reasonable range, ensuring that the image remains easily recognizable by humans.
For example, as depicted in Figure 4, when presented with the context “There are a knife and ”, under specific noisy conditions, the likelihood of “fork” surpasses that of “plate” (ground truth).
This may occur because “plate” is more likely to co-occur with “fork” in the training data. With an increase in noise steps, the term “pixel” becomes predominant, owing to the noticeable noise patterns within the image.
Consequently, choosing an appropriate noise step is essential for activating the inherent hallucination patterns.
To achieve this goal, we introduce diffusion noise into the original image. Denoting the noise step as $k$, the noised image $v_k$ can be expressed as

$v_k = \sqrt{\bar{\alpha}_k}\, v_0 + \sqrt{1-\bar{\alpha}_k}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I}),$

where $v_0$ is the original image and $\bar{\alpha}_k = \prod_{i=1}^{k} \alpha_i$ follows a standard diffusion noise schedule. To trigger the inherent hallucination patterns of the VLLM, we integrate the image noising process into the DPO fine-tuning process.
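The image noising step can be sketched in numpy, assuming a standard DDPM-style linear beta schedule; the concrete schedule values below are common defaults, not necessarily the paper's:

```python
import numpy as np

def add_diffusion_noise(image, k, num_steps=1000,
                        beta_start=1e-4, beta_end=0.02, seed=0):
    """Forward-diffuse an image to noise step k:
    v_k = sqrt(alpha_bar_k) * v_0 + sqrt(1 - alpha_bar_k) * eps."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bar = np.cumprod(1.0 - betas)[k]  # cumulative product of (1 - beta_i)
    eps = np.random.default_rng(seed).standard_normal(image.shape)
    return np.sqrt(alpha_bar) * image + np.sqrt(1.0 - alpha_bar) * eps

image = np.ones((4, 4))                            # toy stand-in for a normalized image
slightly_noised = add_diffusion_noise(image, k=50)   # image still dominates
heavily_noised = add_diffusion_noise(image, k=900)   # noise dominates
```

As the step k grows, the image term shrinks and the noise term dominates, matching the observation above that heavy noise makes the model fall back on textual priors.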
Specifically, for each input prompt x, we take into account the dispreferred responses from both the hallucinated text responses discussed in Section 3.1 and the responses triggered by distorted images.
We then reformulate the DPO loss as follows:

$\mathcal{L} = -\mathbb{E}_{(x, v, y_w, y_l)\sim\mathcal{D}}\Big[\log\sigma\Big(\beta\log\frac{\pi_\theta(y_w\mid x, v)}{\pi_{\mathrm{ref}}(y_w\mid x, v)} - \beta\log\frac{\pi_\theta(y_l\mid x, v)}{\pi_{\mathrm{ref}}(y_l\mid x, v)} - \beta\log\frac{\pi_\theta(y_w\mid x, v_k)}{\pi_{\mathrm{ref}}(y_w\mid x, v_k)}\Big)\Big],$

where $y_w$ is the preferred response, $y_l$ is the hallucinated dispreferred response from Section 3.1, and $y_w$ conditioned on the noised image $v_k$ serves as the image-triggered dispreferred response. We down-weight the image-triggered dispreferred responses, since a substantial portion of the dispreferred response overlaps with the preferred response. The training process of our method is detailed in Algorithm 1.
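A numpy sketch of a DPO-style loss with the two dispreferred terms described above; the `gamma` weighting and the exact form are illustrative assumptions, not the paper's precise formulation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def povid_dpo_loss(logp_w, ref_logp_w,
                   logp_l, ref_logp_l,
                   logp_w_noisy, ref_logp_w_noisy,
                   beta=0.1, gamma=0.5):
    """DPO-style loss with two dispreferred terms: the hallucinated text
    response (y_l) and the preferred response scored under the noised
    image (y_w | x, v_k). `gamma` (an assumed hyperparameter) down-weights
    the image-triggered term."""
    margin = (
        beta * (logp_w - ref_logp_w)                        # preferred, clean image
        - beta * (logp_l - ref_logp_l)                      # hallucinated text response
        - gamma * beta * (logp_w_noisy - ref_logp_w_noisy)  # response under noised image
    )
    return -np.log(sigmoid(margin))

# policy assigns more mass to y_w (and less to both dispreferred terms)
# than the reference model, so the margin is positive and the loss small
loss = povid_dpo_loss(-5.0, -6.0, -7.0, -6.0, -7.0, -6.0)
```

The loss shrinks as the policy raises the preferred response's likelihood relative to the reference while pushing down both dispreferred terms.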