https://arxiv.org/abs/2312.10665
Silkie: Preference Distillation for Large Visual Language Models
This paper explores preference distillation for large vision language models (LVLMs), improving their ability to generate helpful and faithful responses anchoring the visual context.
We first build a vision-language feedback (VLFeedback) dataset utilizing AI annotation.
Specifically, responses are generated by models sampled from 12 LVLMs, conditioned on multi-modal instructions sourced from various datasets.
We adopt GPT-4V to assess the generated outputs regarding helpfulness, visual faithfulness, and ethical considerations.
Furthermore, the preference supervision is distilled into Qwen-VL-Chat through the direct preference optimization (DPO) method.
The resulting model, Silkie, achieves 6.9% and 9.5% relative improvements on the MME benchmark in perception and cognition capabilities, respectively.
Silkie also demonstrates reduced hallucination by setting a new state-of-the-art score of 3.02 on the MMHal-Bench benchmark.
Further analysis shows that DPO with our VLFeedback dataset mainly boosts the fine-grained perception and complex cognition abilities of LVLMs, leading to more comprehensive improvements compared to human-annotated preference datasets. Project page: https://vlf-silkie.github.io.
2 VISUAL-LANGUAGE FEEDBACK DATASET
In this section, we elaborate on the construction process of our vision-language feedback (VLFeedback) dataset, as illustrated in Figure 1.
We first introduce the multi-modal instruction sources (§2.1), followed by the details of the LVLMs selected for decoding (§2.2) and the annotation process with GPT-4V (§2.3).
Finally, we present the statistics of our VLFeedback dataset (§2.4).
2.1 INSTRUCTION SOURCE
We curate instruction sources from diverse datasets that span various capabilities of LVLMs across different domains.
Our selection encompasses:
• General Vision-Language Instructions:
Datasets such as LLaVA (Liu et al., 2023c) and SVIT (Zhao et al., 2023a) are constructed by feeding textual descriptions of images to ChatGPT/GPT-4, prompting the generation of visual instructions of diverse types, including detailed descriptions, reasoning processes, and interactive conversations.
• Academic Vision-Language Instructions:
Drawn from 20 samples of each task in M3IT (Li et al., 2023c), this set offers comprehensive coverage of prior academic vision-language tasks such as visual question answering, image captioning, and image classification.
• Robustness-oriented Vision-Language Instructions:
To enrich the coverage of our dataset, we incorporate challenging instructions from LRV (Liu et al., 2023a), which demands complex visual reasoning from LVLMs, and ComVint (Du et al., 2023), which introduces misleading queries into the instructions.
• Domain-specific Vision-Language Instructions:
We incorporate LLaVAR (Zhang et al., 2023b), emphasizing text-rich images like documents and logos; PMC-VQA (Zhang et al., 2023a) for medical images; LLaVAMed (Li et al., 2023a) for biomedical images; and PCAEVAL (Chen et al., 2023a), designed for visual decision-making instructions in embodied environments.
These instructions require domain knowledge that is potentially useful for downstream applications.
Table 1 summarizes the characteristics and statistics of instruction sources sampled in our paper.
2.2 MODEL POOL
We have curated a diverse model pool comprising 12 LVLMs to cover recent advancements, including
• GPT-4V (OpenAI, 2023a), the proprietary vision language models developed by OpenAI, which are shown to be powerful on various multi-modal tasks (Yang et al., 2023).
• LLaVA-series models, which adopt Vicuna models as the backbone and are trained on the GPT-4 (text-only) synthesized multi-modal dataset.
We select the enhanced versions LLaVA-v1.5-7B and LLaVA-v1.5-13B (Liu et al., 2023b), as well as the RLHF versions with visual faithfulness alignment, LLaVA-RLHF (Sun et al., 2023), at different image resolutions: LLaVA-RLHF-7b-v1.5-224 and LLaVA-RLHF-13b-v1.5-336.
• Qwen-VL-Chat (Bai et al., 2023), which shows promising capabilities on various vision-language benchmarks thanks to scaled-up multi-modal pre-training and supervised fine-tuning on curated datasets.
• IDEFICS-9b-Instruct (Laurençon et al., 2023), an open-source implementation of Flamingo (Alayrac et al., 2022) supporting interleaved image-text inputs. After training on publicly available image-text alignment pairs and instruction-tuning datasets, it demonstrates results comparable to the original closed-source model on various image-text benchmarks.
• Fuyu-8B (Bavishi et al., 2023), which introduces a novel architecture by segmenting images into patches and training a conditional language model from scratch, showcasing the great potential to deal with high-resolution images.
• InstructBLIP (Dai et al., 2023), which employs an instruction-aware visual feature extraction module based on BLIP2 (Li et al., 2023b).
We select InstructBLIP-Vicuna-7B and InstructBLIP-Vicuna-13B with different LLMs as the backbone models.
• VisualGLM-6B (Du et al., 2022) is an open-sourced, multi-modal dialog language model supporting images, Chinese, and English.
• MM-ICL (Zhao et al., 2023b), which is built on BLIP2 (Li et al., 2023b) and further trained on a curated interleaved image-text dataset to strengthen its in-context learning ability.
We adopt MMICL-Vicuna-13B for decoding.

For each instruction, we randomly sample four models from the pool for decoding.
The decoding hyper-parameters adhere to the recommendations provided in the original implementations.
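The per-instruction sampling step above can be sketched as follows. This is a minimal illustration, not the authors' code: the pool names mirror the models listed in this section, and `sample_decoders` is a hypothetical helper.

```python
import random

# Model pool from Section 2.2 (names are informal labels, not checkpoints).
MODEL_POOL = [
    "GPT-4V", "LLaVA-v1.5-7B", "LLaVA-v1.5-13B",
    "LLaVA-RLHF-7b-v1.5-224", "LLaVA-RLHF-13b-v1.5-336",
    "Qwen-VL-Chat", "IDEFICS-9b-Instruct", "Fuyu-8B",
    "InstructBLIP-Vicuna-7B", "InstructBLIP-Vicuna-13B",
    "VisualGLM-6B", "MMICL-Vicuna-13B",
]

def sample_decoders(pool, k=4, seed=None):
    """Randomly choose k distinct models to decode one instruction."""
    rng = random.Random(seed)
    return rng.sample(pool, k)  # sampling without replacement

decoders = sample_decoders(MODEL_POOL, k=4, seed=0)
```

Sampling without replacement guarantees four distinct models per instruction, so each preference comparison spans different LVLMs.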
2.3 GPT-4V AIDED PREFERENCE ANNOTATION
Inspired by the recent progress in alignment from AI Feedback (Bai et al., 2022b; Lee et al., 2023; Cui et al., 2023), we define Helpfulness for judging whether the response is relevant and helps the user, and Ethical Considerations to avoid potential inappropriate responses that may contain toxic content such as biases or violence.
Furthermore, considering the characteristics of LVLMs involving the interaction between modalities, we design a special Visual Faithfulness criterion to evaluate the response consistency between modalities.
Specifically, we ask the GPT-4V model to assess the response quality given the original image and instruction, rating the visual faithfulness from 1 to 5.
The annotation template for visual faithfulness can be found in Table 2, and we include the annotation templates for helpfulness and harmlessness in Appendix A.
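The annotation flow can be sketched as below. This is a simplified illustration under stated assumptions: the prompt wording is paraphrased (the actual templates are in Table 2 and Appendix A), and `build_rating_prompt`/`parse_rating` are hypothetical helpers; the call to GPT-4V itself is omitted.

```python
import re

ASPECTS = ("helpfulness", "visual faithfulness", "ethical considerations")

def build_rating_prompt(instruction, response, aspect):
    """Assemble a 1-5 rating request for one aspect (paraphrased template)."""
    return (
        f"Given the image and instruction, rate the response for {aspect} "
        f"on a scale of 1 to 5.\n"
        f"Instruction: {instruction}\nResponse: {response}\n"
        f"Reply with 'Rating: <score>' followed by a brief rationale."
    )

def parse_rating(annotator_reply):
    """Extract the 1-5 score from the annotator's reply; None if absent."""
    m = re.search(r"Rating:\s*([1-5])", annotator_reply)
    return int(m.group(1)) if m else None
```

Parsing the score out of a free-form reply keeps the pipeline robust when the annotator model adds a rationale after the rating.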
2.4 PREFERENCE STATISTICS
We present statistics on the annotated results to elucidate the distribution of the annotation scores.
Score Distribution in Different Aspects
In Figure 2, we illustrate the score distributions for three distinct aspects.
(1) Helpfulness: The majority of samples garnered scores exceeding 4, while a notable portion of samples received the lowest score.
This suggests that LVLMs are generally effective at meeting the intended objectives of the instructions, indicating successful instruction tuning.
(2) Visual Faithfulness: Scores for visual faithfulness closely mirror the distribution observed in the helpfulness evaluation, implying a potential correlation between these two aspects during the annotation process.
The similarity in distributions suggests that the perceived helpfulness of the content likely influences judgments on visual faithfulness.
(3) Ethical Considerations: Interestingly, only a limited portion of the annotated instructions exhibit potential ethical considerations.
This observation may be attributed to the predominant nature of the sampled instructions, which may not be primarily geared toward red-teaming prompts (Perez et al., 2022) designed to elicit harmful results from the LVLMs.
Notably, this finding prompts consideration for a more targeted preference annotation focused explicitly on ethical considerations in future endeavors.
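The per-aspect score distributions discussed above amount to a simple tally over the annotated records. A minimal sketch, assuming each record is flattened to a dict of aspect scores (a hypothetical view of the dataset, not its actual schema):

```python
from collections import Counter

def score_distribution(annotations, aspect):
    """Tally how many responses received each score (1-5) for one aspect."""
    counts = Counter(a[aspect] for a in annotations)
    return {s: counts.get(s, 0) for s in range(1, 6)}

# Toy records standing in for annotated responses.
toy = [
    {"helpfulness": 5}, {"helpfulness": 4},
    {"helpfulness": 4}, {"helpfulness": 1},
]
dist = score_distribution(toy, "helpfulness")
```

The fixed 1-5 range in the returned dict keeps zero-count scores visible, which matters when plotting distributions like those in Figure 2.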
Score Differences between Models Table 3 lists the scores of different models regarding three aspects.
As the evaluated LVLMs may have used the annotated instructions as training data, we note that this score comparison could be unfair to certain models.
Nevertheless, GPT-4V demonstrates a clear advantage over open-sourced LVLMs, showcasing its great potential to serve as a proxy for human annotators to provide feedback.
We further select two representative models, GPT-4V and Qwen-VL-Chat, to delve into the distribution of annotated scores.
Figure 3 depicts the distinctions between these models.
Notably, GPT-4V consistently obtains higher ratings across all three facets, evidenced by a prevalence of samples with scores equal to or greater than 4, echoing the results in the average ratings.
It is important to acknowledge that GPT-4V’s dominance may stem from its role as the annotator, introducing a potential bias towards its own characteristics and proclivity for detailed responses.
Despite this, Qwen-VL-Chat still achieves better helpfulness and visual faithfulness scores than the all-model averages presented in Figure 2.
This suggests Qwen-VL-Chat’s commendable competence in addressing diverse user queries, motivating us to adopt it as a backbone model for future explorations.
Preference Agreement between GPT-4V and Human Annotators Given that the efficacy of RLHF hinges on accurately rated human preferences and the AI evaluator can become unstable (Wang et al., 2023), we undertake a validation experiment by calculating the agreement rate between human annotators and GPT-4V.
We asked three human annotators to compare the overall quality of two responses given the same annotation guide for GPT-4V.
The experiment is conducted on a subset of 100 randomly sampled comparisons from our VLFeedback dataset, revealing an impressive average agreement rate of 83.1%.
This finding further underscores the reliability of employing GPT-4V for annotating preference data, substantiating its credibility in this crucial role.
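The agreement computation described above reduces to a fraction of matching pairwise preferences. A minimal sketch, where `'A'`/`'B'` denote which of the two responses was preferred and the human label is assumed to be the majority vote of the three annotators:

```python
from collections import Counter

def majority_vote(votes):
    """Collapse several annotators' choices into one label."""
    return Counter(votes).most_common(1)[0][0]

def agreement_rate(human_prefs, ai_prefs):
    """Fraction of comparisons where the AI annotator picks the same
    preferred response as the human (majority) label."""
    assert len(human_prefs) == len(ai_prefs)
    matches = sum(h == g for h, g in zip(human_prefs, ai_prefs))
    return matches / len(human_prefs)
```

For example, with human labels `["A", "B", "A", "A"]` and AI labels `["A", "B", "B", "A"]`, the rate is 0.75; the paper reports 83.1% over 100 sampled comparisons.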
3 PREFERENCE DISTILLATION FOR LVLMS
Previous results show that performant open-source LVLMs acquire promising abilities after sufficient instruction tuning.
Therefore, in this work, we explore whether learning from the preference data can improve LVLMs regarding helpfulness and visual faithfulness.
Our method builds upon the VLFeedback dataset and distills vision-language AI preferences with direct preference optimization (DPO) (Rafailov et al., 2023b).
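For a single preference pair, the DPO objective of Rafailov et al. (2023) can be written down directly. The sketch below is a scalar, pure-Python illustration (a real implementation would operate on batched tensors); the log-probability arguments are assumed to be summed token log-probabilities under the trained policy and the frozen reference model.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid of the scaled
    log-ratio margin between chosen (w) and rejected (l) responses."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) == log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# When the policy prefers the chosen response more strongly than the
# reference does, the margin is positive and the loss drops below log(2).
loss = dpo_loss(-10.0, -12.0, -11.0, -11.0, beta=0.1)
```

Minimizing this loss pushes the policy to assign a larger likelihood margin to the GPT-4V-preferred response than the reference model does, without training an explicit reward model.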