
LMM-as-a-judge / PROMETHEUS-VISION: Vision-Language Model as a Judge for Fine-Grained Evaluation (paper review)

jinuklee 2024. 10. 7. 01:15

https://arxiv.org/pdf/2401.06591

Assessing long-form responses generated by Vision-Language Models (VLMs) is challenging.

 

It not only requires checking whether the VLM follows the given instruction but also verifying whether the text output is properly grounded on the given image.

 

Inspired by the recent approach of evaluating LMs with LMs, in this work, we propose to evaluate VLMs with VLMs.

 

For this purpose, we present a new feedback dataset called the PERCEPTION COLLECTION, encompassing 15K customized score rubrics that users might care about during assessment.

 

Using the PERCEPTION COLLECTION, we train PROMETHEUS-VISION, the first open-source VLM evaluator model that can understand the user-defined score criteria during evaluation.

 

PROMETHEUS-VISION shows the highest Pearson correlation with human evaluators and GPT-4V among open-source models, showing its effectiveness for transparent and accessible evaluation of VLMs.

 

We open-source our code, dataset, and model at

 

https://github.com/kaistAI/prometheus-vision

 


 

3 The PERCEPTION COLLECTION

 

In contrast to the language domain, to the best of our knowledge, no feedback, critique, or preference datasets are available for training an evaluator VLM that can assess responses in a fine-grained manner.

 

For this purpose, we first construct a comprehensive multi-modal feedback dataset called the PERCEPTION COLLECTION.

 

As shown in Figure 2, each instance in the PERCEPTION COLLECTION consists of five input components (image, instruction, response to evaluate, customized score rubric, reference answer) and two output components (language feedback and score decision).

 

The number of each component in the PERCEPTION COLLECTION is shown in Table 1.

 

Specifically, the five input components are:

 

• Image: A real-world image that the user would provide to the VLM.

 

• Instruction: A text instruction that the user would prompt the VLM. It is also related to the provided image.

 

• Response to Evaluate: A text response that the VLM would generate based on the image and instruction.

 

The evaluator VLM has to assess this response.

 

• Customized Score Rubric: Detailed scoring criteria that the VLM should refer to for assessment.

 

We use fine-grained criteria in contrast to coarse-grained ones such as helpfulness, relevance, accuracy, and comprehensiveness.

 

The rubric consists of (1) a description of the criteria and (2) a description of each scoring decision on a scale of 1 to 5.

 

• Reference Answer: A reference answer that would achieve a score of 5.

 

While this component could be hand-crafted by human annotators, in our experiments, we utilize GPT-4V.

 

Moreover, the two output components are:

 

• Feedback: A rationale pinpointing what is good and bad about the response under assessment.

 

Instead of directly providing a scoring decision, this component makes the judgement process more interpretable.

 

• Score: An integer value on a scale of 1 to 5 that represents the quality of the response given the criteria mentioned in the score rubric.
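
Putting together the components listed above, a single PERCEPTION COLLECTION instance can be pictured as the record below. This is a schematic sketch for readers; the field names are illustrative and not the dataset's actual JSON keys.

```python
from dataclasses import dataclass

@dataclass
class PerceptionInstance:
    """One training instance (field names are illustrative, not the official schema)."""
    # five input components
    image_path: str        # real-world image the user provides to the VLM
    instruction: str       # text instruction related to the image
    response: str          # VLM response to be evaluated
    score_rubric: str      # criteria description plus descriptions for scores 1-5
    reference_answer: str  # answer that would receive a score of 5
    # two output components the evaluator must produce
    feedback: str          # rationale pinpointing strengths and weaknesses
    score: int             # integer from 1 to 5
```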

 

3.1 PERCEPTION COLLECTION Construction

----------------------------------------

We construct a multi-modal feedback dataset called the PERCEPTION COLLECTION.

 

We mainly follow the construction process of Kim et al. (2023d).

 

While creating the PERCEPTION COLLECTION, we utilize 5K real-world images sampled from MS COCO 2017 Challenge (Lin et al., 2014) and the MMMU benchmark (Yue et al., 2023).

 

Concretely, the augmentation process consists of 4 stages:

 

(1) hand-crafting 50 seed score rubrics,

 

(2) brainstorming 15K fine-grained score rubrics,

 

(3) augmenting 30K instructions and reference answers closely tied with the rubric, and

 

(4) augmenting 150K responses and language feedback.

 

We include a detailed analysis of the PERCEPTION COLLECTION in terms of diversity and quality in Appendix A and all the prompts used for augmentation in Appendix F.

 

Step 1: Hand-Crafting Score Rubrics 

============================

We first start by writing 50 examples of fine-grained score rubrics that go beyond their coarse-grained counterparts.

 

For 50 images, we write an instruction and the corresponding rubric that pinpoints which aspect to consider during the assessment.

 

Step 2: Brainstorming Score Rubrics

==============================

 Using GPT-4V, we expand the number of our score rubrics from 50 to 15K.

 

Using an arbitrary image from the 5K pool and the 50 examples as demonstrations, we prompt GPT-4V to generate 3 variants for each image.

 

To ensure quality, we go through an additional stage of prompting GPT-4V to inspect whether the generated score rubric aligns with the image.

 

If it does not, we iteratively prompt it again until we acquire 3 candidates per image.
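
As a reader's sketch, the generate-then-inspect loop described above could look roughly like the following; `generate_fn` and `check_fn` are hypothetical wrappers around GPT-4V prompting calls, not the authors' released code.

```python
def collect_rubrics(image, demonstrations, generate_fn, check_fn, n_needed=3, max_rounds=10):
    """Retry until n_needed score rubrics pass the image-alignment check.

    generate_fn and check_fn are caller-supplied wrappers around GPT-4V
    prompting; only the retry logic is sketched here.
    """
    accepted = []
    for _ in range(max_rounds):
        if len(accepted) >= n_needed:
            break  # enough aligned rubrics collected for this image
        for rubric in generate_fn(image, demonstrations):
            # second prompt: ask the model whether the rubric aligns with the image
            if check_fn(rubric, image) and rubric not in accepted:
                accepted.append(rubric)
            if len(accepted) >= n_needed:
                break
    return accepted[:n_needed]
```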

 

Step 3: Augmenting Instructions and Reference Answers related to the Score Rubric

============================

 

Afterwards, we use the 15K score rubrics and prompt GPT-4V to generate 2 novel instructions for each score rubric, leading to a total of 30K instructions.

 

This process ensures that the instruction is closely tied to the score rubric since the instruction was conditioned on the score rubric.

 

Step 4: Augmenting Training Instances

=============================

Lastly, we augment the remaining components, namely the response to evaluate, the feedback, and the scoring decision.

 

We use the score rubric and instruction generated from the previous stages and prompt GPT-4V to write a response that would get a score of i (1 ≤ i ≤ 5).

 

Importantly, we ensured that there is no length bias (i.e., giving higher scores to longer responses) and included an analysis in Section C.

 

This leads to a total of 150K responses and 150K pieces of feedback, where each score between 1 and 5 accounts for an equal 30K instances.
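
The balanced, per-score construction can be pictured as the following loop. The helper name `query_gpt4v` and the prompt wording are illustrative; the actual prompts are in the paper's Appendix F.

```python
def augment_instance(instruction, rubric, reference_answer, query_gpt4v):
    """For one (instruction, rubric) pair, request a response targeted at each
    score 1..5 plus the matching feedback, yielding 5 balanced training rows.

    query_gpt4v is a caller-supplied function wrapping the GPT-4V API; the
    prompt text below is a paraphrase, not the paper's exact prompt.
    """
    rows = []
    for target_score in range(1, 6):
        prompt = (
            f"Instruction: {instruction}\n"
            f"Score rubric: {rubric}\n"
            f"Reference answer (score 5): {reference_answer}\n"
            f"Write a response that would deserve a score of {target_score}, "
            f"then write feedback explaining why it deserves that score."
        )
        response, feedback = query_gpt4v(prompt)
        rows.append({"response": response, "feedback": feedback, "score": target_score})
    return rows  # 30K instructions x 5 scores = 150K responses and feedback overall
```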

 

We include our analysis of the PERCEPTION COLLECTION in terms of its quality, diversity, and whether there is a length bias among score decisions at Appendix A.

 

3.2 Fine-tuning a VLM as an Evaluator

----------------------------------------

Using the PERCEPTION COLLECTION, we use LLaVA-1.5 (7B & 13B) (Liu et al., 2023a) as our backbone model and train PROMETHEUS-VISION (7B & 13B).

 

Training on the PERCEPTION COLLECTION is analogous to Chain-of-Thought fine-tuning, which requires generating a rationale (the feedback, in our case) and then the score in a sequential manner (Ho et al., 2022; Kim et al., 2023c).

 

We include a fixed phrase ‘So the overall score is’ in between the feedback and the score which we found to prevent degeneration during inference. The detailed hyper-parameters used during training are included in Appendix D.1.
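
A minimal sketch of the target-sequence format and of parsing the score back out at inference, assuming the fixed phrase described above; this mirrors the recipe as described in the paper but is not the repository's exact code.

```python
import re

FIXED_PHRASE = "So the overall score is"

def build_target(feedback: str, score: int) -> str:
    """Serialize the two output components into one training target string."""
    return f"{feedback} {FIXED_PHRASE} {score}"

def parse_score(generated: str) -> int | None:
    """Recover the integer score (1-5) emitted after the fixed phrase, if any."""
    match = re.search(rf"{FIXED_PHRASE}\s*([1-5])", generated)
    return int(match.group(1)) if match else None
```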

 

4 Experimental Settings

 

4.1 Protocol for Evaluating Evaluator VLMs

 

In this section, we explain our experimental setting used to assess the fine-grained judgement capabilities of evaluator VLMs.

 

As it is a non-trivial problem to directly measure ‘How well a VLM is evaluating’, we indirectly compare with two different standards:

 

(1) how closely PROMETHEUS-VISION could simulate human evaluators (Section 5.1) and (2) how closely PROMETHEUS-VISION could simulate the best VLM, which is GPT-4V, for nuanced assessment purposes (Section 5.2).

 

4.2 Evaluator VLM & LM Baselines

 

We employ 9 VLMs as our evaluator VLM baselines, namely LLAVA-1.5 (7B & 13B) (Liu et al., 2023a); LLAVA-RLHF (7B & 13B) (Sun et al., 2023); SHAREGPT4V (7B) (Chen et al., 2023); FUYU (8B) (Bavishi et al., 2023); and GPT-4V (OpenAI, 2023), along with PROMETHEUS-VISION (7B & 13B).

 

In addition, we also compare with using LMs as a judge for evaluating VLMs as in previous work (Bai et al., 2023).

 

We add 4 LMs as our evaluator LM baselines, namely PROMETHEUS (7B & 13B) (Kim et al., 2023d); GPT-3.5-TURBO (OpenAI, 2022); and GPT-4 (OpenAI, 2023).

 

Since LMs cannot receive images as input, we prompt LLaVA-1.5 to generate a caption for the given image and provide the caption as additional input for the LM evaluators.

 

In contrast, for VLM evaluator baselines, we directly provide the image as input.
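
This two-step pipeline for image-blind LM judges might look roughly like the sketch below; `caption_with_llava` and `judge_with_lm` are hypothetical wrappers, and the prompt wording is illustrative rather than taken from the paper.

```python
def evaluate_with_lm_judge(image, instruction, response, rubric, reference,
                           caption_with_llava, judge_with_lm):
    """Caption the image with a VLM, then pass text only to the LM evaluator.

    caption_with_llava / judge_with_lm are caller-supplied wrappers around
    LLaVA-1.5 and the evaluator LM; the prompts here are illustrative.
    """
    caption = caption_with_llava(image, "Describe this image in detail.")
    judge_input = (
        f"Image caption: {caption}\n"
        f"Instruction: {instruction}\n"
        f"Response to evaluate: {response}\n"
        f"Score rubric: {rubric}\n"
        f"Reference answer: {reference}\n"
        "Write feedback, then give an overall score from 1 to 5."
    )
    return judge_with_lm(judge_input)
```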

 

The hyper-parameters used for inference with evaluator LMs and evaluator VLMs are included in Appendix D.1.

 

4.3 Response VLMs

 

During our experiments, we utilize 3 different VLMs to sample the outputs that our VLM evaluators would assess.

 

We denote these 3 VLMs as ‘Response VLMs’.

 

We utilize FUYU (8B), LLAVA-1.5 (13B), and GPT-4V as our response VLMs.

 

The hyper-parameters used for inference with response VLMs are included in Appendix D.1.

 

4.4 Benchmarks

Our evaluation benchmarks are mainly divided into 3 categories:

 

• Visual Instruction Following Benchmarks:

-----------------------------------------------------

Tasks that require writing a long-form text output given an image and a text instruction.

 

We use LLaVA-Bench (Liu et al., 2023a), VisIT-Bench (Bitton et al., 2023), and a held-out test set of the PERCEPTION COLLECTION called the PERCEPTION BENCH.

 

• Visual Question Answering Benchmarks: Tasks that require writing a text output given an image and a text question.

---------------------------------------------------

Compared to instruction following benchmarks, one notable difference is that we use the short-form answers originating from each dataset as reference answers in the input.

 

We use the test set of the OKVQA dataset (Marino et al., 2019), VQAv2 dataset (Goyal et al., 2017), and TextVQA dataset (Singh et al., 2019).

 

• Captioning Benchmarks: Tasks that require writing a text caption of the given image.

---------------------------------------------------

Similar to the visual question answering benchmarks, the ground truth answers tend to be short compared to the reference responses in the instruction following benchmarks.

 

We use the test set of the COCO-Captions dataset (Chen et al., 2015) and the NoCaps dataset (Agrawal et al., 2019).

 

The number of instances and score rubrics for each benchmark is shown in Table 2 and Table 3.

 

Note that while the datasets in the VQA and captioning benchmarks originally have ground-truth answers, the instruction following benchmarks inherently do not have a reference answer.

 

Using the same augmentation process mentioned in Section 3.1, we augment a reference answer and a fine-grained score rubric for each instance within the LLaVA-Bench, VisIT-Bench, and PERCEPTION-BENCH.

 

For the PERCEPTION-BENCH, which is our held-out test set, we also generate new instructions.

 

For the VQA and captioning benchmarks, we generate 5 score rubrics with the original ground-truth answer in consideration.

 

The authors manually checked the quality of the added components.

 

4.5 Setups & Metrics

 

Our evaluation setup is divided into 2 parts.

 

Setup #1 (Table 2)

 

In Section 5.1, we utilize 45 instances with instance-wise hand-crafted score rubrics (15 instances each for LLAVA-BENCH, VISIT-BENCH, and PERCEPTION-BENCH).

 

We ask 9 human annotators proficient in English to provide a scoring decision in the same manner as PROMETHEUS-VISION.

 

Then, we measure the correlation between the scoring decisions, employing Pearson, Kendall-Tau, and Spearman as our metrics.
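
For reference, these three correlation metrics can be computed with SciPy as in the short sketch below (the score arrays are toy examples, not the paper's data):

```python
from scipy.stats import pearsonr, kendalltau, spearmanr

human_scores = [5, 3, 4, 2, 4, 1]  # toy example: scores from a human annotator
model_scores = [4, 3, 5, 2, 3, 1]  # scores from the evaluator VLM on the same items

pearson, _ = pearsonr(human_scores, model_scores)
kendall, _ = kendalltau(human_scores, model_scores)
spearman, _ = spearmanr(human_scores, model_scores)
print(f"Pearson {pearson:.3f}  Kendall-Tau {kendall:.3f}  Spearman {spearman:.3f}")
```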

 

Next, we ask human annotators to compare two pieces of language feedback sampled from GPT-4, GPT-4V, or PROMETHEUS-VISION (13B) and choose which one is better.

 

Then, we measure the Pairwise Preference Win-rate between the 3 candidates. Details of the annotation setting are explained in Appendix D.2.

 

Setup #2 (Table 3)

 

In Section 5.2, we expand the number of instances and utilize 1,085 fine-grained score rubrics tied across 3,560 instances in total.

 

In this setting, we prompt GPT-4V three times and compare the correlation of its scoring decisions with those of the evaluator VLMs and evaluator LMs, which are also prompted three times.

 

As in Setup #1, we use Pearson, Kendall-Tau, and Spearman as our metrics.

 

5 Experimental Results

=============================

5.1 Can PROMETHEUS-VISION Closely Simulate Human Evaluators?

 

In this subsection, to verify whether PROMETHEUS-VISION can emulate human evaluators, we measure the correlation between scores annotated by humans and those predicted by evaluator VLMs.

 

The overall results are shown in Figure 3.

 

5.1.1 Correlation with Human Evaluators

 

Our PROMETHEUS-VISION 13B notably mirrors the high correlation exhibited by leading models GPT-4 and GPT-4V on the LLAVA-BENCH and PERCEPTION-BENCH, achieving correlations of 0.639 and 0.870, respectively.

 

On the VISIT-BENCH, however, although PROMETHEUS-VISION outperforms GPT-3.5-TURBO and PROMETHEUS 13B with a slightly higher correlation, it remains lower than GPT-4 and GPT-4V.

 

We posit that this disparity primarily originates from the differing characteristics of the VISIT-BENCH and the other benchmarks.

 

The former contains a higher proportion of text-rich images, such as graphs and charts, compared to the latter two datasets.

 

Even though the PERCEPTION COLLECTION also includes instruction sets for text-rich images, their amount is relatively limited.

 

These inherent limitations in the model architecture of PROMETHEUS-VISION present challenges in processing such text-rich images during inference.

 

Nevertheless, recent works on vision-language models (Zhang et al., 2023; Ye et al., 2023b; Kim et al., 2022, 2023a) show promising capabilities for handling these image types, providing a better backbone model for future iterations of PROMETHEUS-VISION.

 

In consideration of these findings, the use of text-rich datasets, along with the integration of new methods drawn from recent architectural advancements, could alleviate these limitations.

 

Also, it is worthwhile to compare where GPT-4 (LM Evaluator) and GPT-4V (VLM Evaluator) excel at each benchmark.

 

Similar to PROMETHEUS-VISION, on the VISIT-BENCH, GPT-4 shows a slightly higher correlation with human evaluators compared to GPT-4V.

 

This could mainly be because processing text is just as important as processing the image when assessing responses to text-rich images such as diagrams, charts, and graphs.

 

On the other hand, GPT-4V shows a higher correlation with human evaluators on the LLAVA-BENCH and PERCEPTION-BENCH, which include diverse real-world images.

 

5.1.2 Comparison of the Quality of the Feedback

---------------------------------------------------

Next, we compare the quality of the language feedback generated by GPT-4, GPT-4V, and PROMETHEUS-VISION 13B across 135 instances by hiring 9 human annotators.

 

The detailed experimental setting is explained in Appendix E and the results are shown in Figure 4.

 

Surprisingly, the PROMETHEUS-VISION 13B model is capable of generating feedback of a quality comparable to GPT-4.

 

Among the 135 instances, human annotators determine that 57.78% of the time, PROMETHEUS-VISION’s feedback is better than or as good as GPT-4V’s feedback.

 

Also, human annotators determine that 45.93% of the time, PROMETHEUS-VISION’s feedback is better than or as good as GPT-4’s feedback.

 

These results indicate that PROMETHEUS-VISION could also be utilized as an open-source critique model for assisting assessment by humans (Saunders et al., 2022).

 

5.2 Can PROMETHEUS-VISION Closely Simulate GPT-4 Vision as a Judge?

---------------------------------------------------

In this subsection, to check whether PROMETHEUS-VISION can be used as a reliable evaluator on various multi-modal tasks, we compare the correlation between scores predicted by GPT-4V and scores predicted by baselines including PROMETHEUS-VISION.

 

The results are shown in Tables 4, 5, 6.

 

5.2.1 Visual Instruction Following Benchmarks

---------------------------------------------------

The results in Table 4 show that PROMETHEUS-VISION demonstrates a higher correlation with GPT-4V compared to that of its backbone model, LLAVA-V1.5, across all 3 benchmarks and both model sizes.

 

This indicates that training with PERCEPTION COLLECTION enhances the VLM’s evaluation capabilities.

 

Furthermore, on the LLAVA-BENCH and PERCEPTION-BENCH, PROMETHEUS-VISION 13B exhibits a higher correlation than the LM evaluators GPT-3.5-Turbo and GPT-4.

5.2.2 Visual Question Answering Benchmarks

---------------------------------------------------

Table 5 presents the correlation results in the visual question answering (VQA) benchmarks.

 

In this benchmark, PROMETHEUS-VISION significantly outperforms other open-source models, including LLAVA-V1.5.

 

Also, we observe that PROMETHEUS-VISION’s correlation is generally lower in VQA benchmarks compared to visual instruction following benchmarks.

 

We attribute this to the PERCEPTION COLLECTION training data, which generally involves longer responses, while the answers in the VQA benchmark are mostly short.

 

Future works could consider adding more diversity to the training data to obtain a stronger VLM evaluator.

 

5.2.3 Captioning Benchmarks

 

Unlike visual instruction following or VQA benchmarks, captioning benchmarks do not have a direct question but rather require writing a description of a given image in a short sentence.

 

Therefore, we created prompts such as ‘Generate a coco-style caption.’ and fed them to our evaluator VLM baselines during experiments.

 

The results are shown in Table 6.

 

While most evaluators, including proprietary LMs, show low correlation, PROMETHEUS-VISION 13B surprisingly stands out by showing a correlation above 0.5 on COCO-Captions, indicating it could generalize to evaluate other visual language tasks beyond its training data.

 

6 Analysis of Potential Biases from VLM Evaluators

==============================

6.1 Is there a Length Bias?

 

Previous works have highlighted a phenomenon known as length bias in models, which refers to a tendency of evaluator models to prefer longer responses (Li et al., 2023; Dubois et al., 2023; Zheng et al., 2023).

 

This is a critical factor to consider during evaluation, as evaluators with length bias could give higher scores simply based on the length of the response, regardless of its actual content.

 

To verify if this is the case, we plot and analyze the lengths of responses using our results from Section 5.1.

 

The box plot in Figure 5 shows that GPT-4V and PROMETHEUS-VISION do not indiscriminately favor longer answers, indicating an absence of length bias.
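
A hedged sketch of one way to reproduce such a check, grouping response lengths by the assigned score and drawing a box plot; the loading of records is left abstract and the field names are assumptions.

```python
import matplotlib.pyplot as plt

def plot_length_by_score(records):
    """records: iterable of dicts with a 'response' text and an integer 'score' (1-5).

    If longer responses systematically received higher scores, the boxes would
    shift upward with the score; roughly flat boxes suggest an absence of length bias.
    """
    lengths_by_score = {s: [] for s in range(1, 6)}
    for r in records:
        lengths_by_score[r["score"]].append(len(r["response"].split()))
    plt.boxplot([lengths_by_score[s] for s in range(1, 6)],
                labels=[str(s) for s in range(1, 6)])
    plt.xlabel("Assigned score")
    plt.ylabel("Response length (words)")
    plt.title("Response length vs. assigned score")
    plt.show()
```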

 

This is likely because our experimental setting is in an absolute grading setting where the evaluator VLM assesses the given responses with an absolute score rather than comparing two responses.

 

This also aligns with the previous finding from Zheng et al. (2023) and Kim et al. (2023d).

 

We provide more details of our analysis in Appendix A.3 and Appendix E.

6.2 Is there a Self-Enhancement Bias?

 

Self-enhancement bias is another type of well-known bias where evaluators tend to prefer their own responses (Zheng et al., 2023).

 

Since PROMETHEUS-VISION is a model specialized for evaluation purposes only, it does not directly suffer from this bias.

 

However, since we train PROMETHEUS-VISION with data augmented from GPT-4V and use LLaVA-v1.5 as our base model, this could indirectly influence PROMETHEUS-VISION, making things more complicated.

 

To investigate whether there is a self-enhancement bias, we analyze the trends of which score was given to different response VLMs on the LLaVA-Bench and Perception Bench.

 

Figure 6 illustrates the results.

 

Overall, the results show that PROMETHEUS-VISION and GPT-4V exhibit similar evaluation patterns across the two benchmarks, reinforcing the findings from previous correlation studies with GPT-4V.

 

Notably, PROMETHEUS-VISION gives a higher score to other models compared to its backbone model (LLaVA-v1.5) on the LLaVA-Bench, indicating that evaluator VLMs might not always prefer the responses from their backbone models.

 

While PROMETHEUS-VISION does give the highest score to GPT-4V, it is hard to determine if this is because PROMETHEUS-VISION was trained on data augmented from GPT-4V, or GPT-4V is distinctively better than the open-source VLMs.

 

We leave analysis of this to future research.

 

Lastly, the trends from Figure 6 also highlight the potential of our held-out testset, the PERCEPTION-BENCH, to be used as a testbed for VLM development in future research.

 

Specifically, on the predominant LLaVA-Bench, LLaVA-RLHF shows only a marginal difference of 0.14 points from GPT-4V. However, this gap widens significantly to 1.43 points on the PERCEPTION-BENCH.

 

Since the PERCEPTION-BENCH was generated based on fine-grained rubrics, its instructions are more complex and elicit more extended responses than those of LLaVA-Bench.

 

7 Conclusion

 

In this paper, we expand the ‘LM-as-a-Judge’ paradigm to the multi-modal space and introduce ‘VLM-as-a-Judge’.

 

We first propose a multi-modal feedback dataset called the PERCEPTION COLLECTION, which has unique score criteria for each instance, unlike existing multi-modal datasets, which do not explicitly capture the values that matter during evaluation.

 

Using the PERCEPTION COLLECTION, we train PROMETHEUS-VISION, an open-source model specialized for evaluation purposes.

 

The uniqueness of PROMETHEUS-VISION is that it can adhere to user-defined criteria during evaluation.

 

Through experiments, we show that PROMETHEUS-VISION enables accessible and transparent evaluation of VLMs. We hope our work paves the way for more research on open-source evaluators across different modalities.