https://arxiv.org/abs/2405.15973
Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement
Large vision-language models (LVLMs) have achieved impressive results in various visual question-answering and reasoning tasks through vision instruction tuning on specific datasets.
However, there is still significant room for improvement in the alignment between visual and language modalities.
Previous methods to enhance this alignment typically require external models or data, heavily depending on their capabilities and quality, which inevitably sets an upper bound on performance.
In this paper, we propose SIMA, a framework that enhances visual and language modality alignment through self-improvement, eliminating the need for external models or data.
SIMA leverages prompts from existing vision instruction tuning datasets to self-generate responses and employs an in-context self-critic mechanism to select response pairs for preference tuning.
The key innovation is the introduction of three vision metrics during the in-context self-critic process, which can guide the LVLM in selecting responses that enhance image comprehension.
Through experiments across 14 hallucination and comprehensive benchmarks, we demonstrate that SIMA not only improves model performance across all benchmarks but also achieves superior modality alignment, outperforming previous approaches.
Our data and code are available at https://github.com/umd-huang-lab/SIMA
2 Self-Improvement Modality Alignment
In this section, we introduce the proposed Self-Improvement Modality Alignment (SIMA) framework.
SIMA consists of three stages: response self-generation, in-context self-critic, and preference tuning.
We first explain how to obtain self-generated response candidates in Sec 2.1, then discuss how the model itself, πθ, critiques the response candidates in Sec 2.2.
Finally, we introduce how the self-rewarded responses are used to update πθ in Sec 2.3.
The pseudo-code of SIMA is provided in Algorithm 1.
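Algorithm 1 is not reproduced here; as a rough illustration, the three-stage loop could be sketched as follows, where generate_responses, self_critic, and dpo_update are hypothetical callables standing in for Stages 1–3, not the authors' implementation.

```python
# Minimal sketch of the SIMA loop; the three stage functions are passed in as
# hypothetical callables, so this is an outline rather than the authors' code.
def sima_loop(model, instruction_data, generate_responses, self_critic, dpo_update,
              num_rounds=1):
    for _ in range(num_rounds):
        preference_pairs = []
        for image, prompt, ground_truth in instruction_data:
            # Stage 1: self-generate two candidate responses with the current model.
            candidates = generate_responses(model, image, prompt)
            # Stage 2: in-context self-critic picks the positive/negative response,
            # guided by the ground truth and the three vision metrics.
            y_pos, y_neg = self_critic(model, image, prompt, ground_truth, candidates)
            preference_pairs.append((image, prompt, y_pos, y_neg))
        # Stage 3: preference tuning on the self-rewarded pairs (e.g., DPO).
        model = dpo_update(model, preference_pairs)
    return model
```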


2.1 Response self-generation
Previous works often require external models to generate a preference dataset for improving the current LVLM [34, 47].
However, due to the significant distribution shift between the external models and the currently optimized LVLM, the datasets generated by these approaches may not be helpful to the LVLM.
For example, a common way to obtain negative responses is to use external models to deliberately modify the ground truth and inject object hallucinations [47], but the hallucinations produced by external models are not necessarily ones that the currently optimized model would generate.
In this case, learning from such data cannot enhance the LVLM.
Since our goal is to identify and correct the potential misunderstandings the current LVLM may have about images and to improve modality alignment, we propose using the currently optimized LVLM to self-generate responses.
This approach avoids the potential distribution shift introduced by external models.
As shown in Stage 1 of Figure 2, given an image and its corresponding prompt, we use the currently optimized model to generate two different response candidates for subsequent ranking and preference tuning.
Specifically, the two responses are generated using greedy decoding and temperature sampling to ensure diversity between the responses.
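For illustration, the two candidates could be produced with a Hugging Face LLaVA-style backbone roughly as follows; the checkpoint name, prompt format, temperature, and token budget here are assumptions, not the authors' exact configuration.

```python
# Sketch of Stage 1 with a Hugging Face LLaVA-style model. The checkpoint, prompt
# format, temperature, and token budget are assumptions for illustration only.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16, device_map="auto"
)

def self_generate(image: Image.Image, question: str) -> tuple[str, str]:
    prompt = f"USER: <image>\n{question} ASSISTANT:"  # LLaVA-1.5 chat format
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    # Candidate 1: greedy decoding (deterministic).
    greedy_ids = model.generate(**inputs, do_sample=False, max_new_tokens=256)
    # Candidate 2: temperature sampling, so the two candidates differ.
    sampled_ids = model.generate(**inputs, do_sample=True, temperature=0.7,
                                 max_new_tokens=256)

    def to_text(ids):
        return processor.batch_decode(ids, skip_special_tokens=True)[0]

    return to_text(greedy_ids), to_text(sampled_ids)
```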
2.2 In-context self-critic
The core part of SIMA is criticizing the self-generated responses without introducing an additional reward model.
As shown in Stage 2 of Figure 2, we directly input the self-generated responses and the critic prompt into the currently optimized LVLM.
The LVLM then selects the better response as the positive response and the other one as the negative response.
The most critical part of this stage is designing an appropriate critic prompt, since the quality of the critic directly determines the performance of the LVLM optimized using the response pairs.
If the worse response is selected as the positive response, it will harm the training of the LVLM.
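For concreteness, the selection step could be sketched as below; the answer-parsing convention (the critic ends its judgment with "1" or "2") and the argument layout are assumptions, and the critic prompt passed in is the one described next.

```python
# Sketch of Stage 2: the current model judges its own two candidates. The parsing
# convention (the critic ends its verdict with "1" or "2") is an assumption.
def self_critic(generate_fn, image, question, ground_truth, critic_template,
                response_1, response_2):
    critic_prompt = critic_template.format(
        question=question, ground_truth=ground_truth,
        response_1=response_1, response_2=response_2,
    )
    # The critic sees the image together with the question, ground truth, both
    # candidates, the three vision metrics, and two formatting demonstrations.
    verdict = generate_fn(image, critic_prompt)
    chosen_is_1 = verdict.strip().endswith("1")
    positive = response_1 if chosen_is_1 else response_2
    negative = response_2 if chosen_is_1 else response_1
    return positive, negative
```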
Our critic prompt consists of the following parts (a paraphrased template is sketched after this list):
• Image, Question, and Ground Truth Response:
Unlike LLMs, which primarily focus on aspects such as the format, helpfulness, and harmlessness of the textual response, LVLMs primarily focus on the accuracy of the response’s understanding of the image content.
This means there is a quantifiable accuracy metric to measure the quality of the response.
Therefore, during in-context self-critic, we must provide the ground truth response as a reference to choose the positive response.
It is worth noting that since the prompts used to generate responses are sampled from the training data of the visual instruction tuning stage,
the corresponding ground truth responses have all been used for visual instruction tuning.
Hence, using the ground truth in the in-context self-critic stage is reasonable.
• Three critic metrics: Although we provide the ground truth response as a reference, without proper guidance, the LVLM might still choose a response that aligns more with the ground truth in terms of output format or harmlessness rather than focusing on the accuracy of visual comprehension.
Therefore, we propose three metrics to guide the LVLM's ranking, ensuring that it selects the positive response from the visual comprehension perspective.
The three critic metrics are: Accuracy in Object Description, Accuracy in Depicting Relationships, and Accuracy in Describing Attributes.
Accuracy in Object Description aims to guide the current LVLM in evaluating the accuracy of the descriptions of the objects mentioned in the ground truth answer.
The responses should minimize the mention of objects not present in the ground truth answer and inaccuracies in the description of existing objects.
Accuracy in Depicting Relationships considers how accurately the relationships between objects are described compared to the ground truth answer, and aims to have the LVLM rank highest the responses that misrepresent these relationships the least.
Accuracy in Describing Attributes assesses the accuracy in depicting objects’ attributes compared to the ground truth answer.
The responses should avoid inaccuracies in describing the characteristics of the objects present.
• Demonstrations: To ensure the correct format of the ranking output, we also leverage in-context learning by providing two ranking demonstrations in the designed ranking prompt for the LVLM to imitate.
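Putting the parts above together, the template passed to the critic might look roughly like the following; this is an illustrative paraphrase reconstructed from the description above, not the verbatim prompt released with the paper.

```python
# Illustrative critic-prompt template assembled from the parts above; a paraphrase,
# not the paper's verbatim prompt. Two ranking demonstrations would precede the query.
CRITIC_PROMPT_TEMPLATE = """You are shown an image, a question about it, the ground
truth answer, and two candidate responses. Using the ground truth as the reference,
decide which candidate understands the image better according to three metrics:
1. Accuracy in Object Description: prefer the response that mentions fewer objects
   absent from the ground truth and describes the existing objects more accurately.
2. Accuracy in Depicting Relationships: prefer the response that least misrepresents
   the relationships between objects relative to the ground truth.
3. Accuracy in Describing Attributes: prefer the response that more accurately
   depicts the attributes of the objects in the ground truth.

[two ranking demonstrations in the required output format would be inserted here]

Question: {question}
Ground truth answer: {ground_truth}
Response 1: {response_1}
Response 2: {response_2}
The better response is:"""
```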


DPO
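Assuming the preference-tuning stage applies the standard DPO objective to the self-rewarded pairs, with y_w and y_l the positive and negative responses selected by the self-critic for image v and prompt x, the loss would take the form:

```latex
% Standard DPO objective, written here under the assumption that SIMA's
% preference-tuning stage uses it unchanged on the self-generated pairs.
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(v, x, y_w, y_l) \sim \mathcal{D}} \left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid v, x)}{\pi_{\mathrm{ref}}(y_w \mid v, x)}
      - \beta \log \frac{\pi_\theta(y_l \mid v, x)}{\pi_{\mathrm{ref}}(y_l \mid v, x)}
    \right)
  \right]
```

Here π_ref is typically the model frozen at the start of preference tuning, and β controls how far πθ may drift from it.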