inference-time, RLHF/STaR, ResT - LMM

MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

jinuklee 2024. 10. 13. 22:06

https://arxiv.org/abs/2402.04788

 


Multimodal Large Language Models (MLLMs) have gained significant attention recently, showing remarkable potential in artificial general intelligence.
 
However, assessing the utility of MLLMs presents considerable challenges, primarily due to the absence of multimodal benchmarks that align with human preferences.
 
Drawing inspiration from the concept of LLM-as-a-Judge within LLMs, this paper introduces a novel benchmark, termed MLLM-as-a-Judge, to assess the ability of MLLMs in assisting judges across diverse modalities, encompassing three distinct tasks:
 
Scoring Evaluation, Pair Comparison, and Batch Ranking.
 
Our study reveals that, while MLLMs demonstrate remarkable human-like discernment in Pair Comparison, there is a significant divergence from human preferences in Scoring Evaluation and Batch Ranking.
 
Furthermore, a closer examination reveals persistent challenges in the judgment capacities of LLMs, including diverse biases, hallucinatory responses, and inconsistencies in judgment, even in advanced models such as GPT-4V.
 
These findings emphasize the pressing need for enhancements and further research efforts to be undertaken before regarding MLLMs as fully reliable evaluators.
 
In light of this, we advocate for additional efforts dedicated to supporting the continuous development of MLLMs functioning as judges. The code and dataset are publicly available at our project homepage: https://mllm-judge.github.io
 

2. MLLM-as-a-Judge: A Benchmark to Assess Vision-Language Judging Ability

Figure 2 shows an overview of our proposed MLLM-as-a-Judge, consisting of three steps: 1) image-instruction pair collection, 2) MLLM response collection, and 3) comparison with human annotation. The judging itself covers three task settings (a sketch of each task's expected output follows the list):

• Scoring Evaluation: Each individual response is evaluated on a scale from 1 to 5, with the specific criteria for this rating system detailed in Appendix F.
 
• Pair Comparison: It involves a direct comparison between two responses, culminating in the identification of the superior one.

 

Following the principles outlined by Deutsch et al. (2023, "Ties matter: Meta-evaluating modern metrics with pairwise accuracy and tie calibration"), a tie option is incorporated to ensure a more equitable assessment.

 


• Batch Ranking: The responses are systematically arranged in descending order of quality based on a given instruction, without any tie option.
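For concreteness, here is a minimal Python sketch (not from the paper) of the kind of output each task setting expects from a judge model; the class and field names are purely illustrative:

```python
from dataclasses import dataclass
from typing import List, Literal

@dataclass
class ScoringJudgment:
    """Scoring Evaluation: one integer score per response, on a 1-5 scale."""
    score: int  # 1 (worst) to 5 (best)

@dataclass
class PairJudgment:
    """Pair Comparison: pick the better of two responses, with a tie option."""
    winner: Literal["A", "B", "Tie"]

@dataclass
class BatchJudgment:
    """Batch Ranking: responses ordered from best to worst, no ties allowed."""
    ranking: List[str]  # e.g. ["C", "A", "D", "B"]
```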
 
2.1. Step 1: Image-Instruction Pair Collection
 
We meticulously curate a dataset consisting of 4,414 image-text pairs, gathered from a variety of downstream task datasets, as detailed in Table 8 in Appendix B.
 
These pairs are carefully tailored into image-instruction pairs to suit a free-form response format.
 
To illustrate, within the domain of diffusion tasks, our dataset incorporates pairs that challenge models to recognize and articulate connections between the provided images and user-specified keywords.
 
2.2. Step 2: MLLM Response Collection
 
We employ six widely-used MLLMs – GPT-4V (OpenAI, 2023), Gemini (GeminiTeam, 2023), LLaVA (Liu et al., 2023d), Qwen-VL-Max (Bai et al., 2023a), LLaVA-1.6-34b (Liu et al., 2023d), and CogVLM (Wang et al., 2023c) – to generate responses based on the image-instruction pairs, obtaining approximately 17,000 responses.
 
Responses that are either too brief or non-compliant with security regulations (e.g., “I’m sorry, but I cannot assist with this request”) from GPT-4V and Gemini are excluded.
 
The number of responses and the length distributions for different MLLMs are shown in Table 1 and Figure 3, respectively.
 
We show specific hyper-parameter settings in Appendix B.2.
 
In addition, we segment these responses into three non-overlapping groups to prevent response overlap.
 
2.3. Step 3: Comparison with Human Annotations
 
The annotation is conducted by 6 authors of this paper independently.
 
These annotators are proficient in this domain, with different genders, ages, and educational backgrounds to ensure diversity (Sun et al., 2020).
 
They are required to give objective judgments, without considering answer lengths or the names and positions of responses, so as to minimize human bias. More details are provided in Appendix E.
 
3. Experiment Settings

3.1. Settings of MLLM-as-a-Judge

We evaluate the judging performance of eleven leading MLLMs – GPT-4V (OpenAI, 2023), Gemini-Pro-Vision-1.0 (GeminiTeam, 2023), LLaVA-1.5-13b, LLaVA-1.6-7b/13b/34b (Liu et al., 2023d), Qwen-VL-Plus/Max (Bai et al., 2023a), and CogVLM (Wang et al., 2023c) – across three distinct evaluation settings.

Adapting the “Analyze-then-Judge” paradigm from Chiang & Lee (2023b), which is a one-step CoT approach (Wei et al., 2022), we first ask MLLMs to analyze responses and then provide a judgment based on their analysis.
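The actual prompts are provided in Appendix F of the paper; purely as an assumed illustration of the "Analyze-then-Judge" structure, a one-step scoring prompt might be laid out like this (wording hypothetical):

```python
# Hypothetical illustration of a one-step "Analyze-then-Judge" scoring prompt;
# the actual prompt wording used in the paper is given in its Appendix F.
ANALYZE_THEN_JUDGE_SCORING = (
    "You are given an image, an instruction, and a model response.\n"
    "First, analyze how well the response follows the instruction and "
    "reflects the image content. Then, based on your analysis, give a "
    "final score from 1 (worst) to 5 (best) in the form 'Score: <n>'.\n\n"
    "Instruction: {instruction}\n"
    "Response: {response}"
)

prompt = ANALYZE_THEN_JUDGE_SCORING.format(
    instruction="Describe the relationship between the objects in the image.",
    response="The cat is sitting on the laptop keyboard.",
)
print(prompt)
```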

However, because LLaVA and CogVLM lack the capability to perform the “Analyze-then-Judge” setting, we prompt them to output their judgment directly.

We also evaluate whether multi-step CoT will enhance the performance of MLLM serving as a judge.

Furthermore, to explore MLLMs' judging capabilities more extensively, we conduct experiments in various settings, including scenarios without vision input, replacing the vision input with a detailed description generated by GPT-4V acting as a vision expert, and employing multi-step CoT.

Considering that the first two settings do not involve image inputs, we also include tests on the latest GPT-4 (OpenAI, 2023), Gemini (GeminiTeam, 2023), LLaMA-2-70b (Touvron et al., 2023), and Mixtral-8x7b (Jiang et al., 2024) to assess whether LLMs can effectively perform judging tasks without vision perception.

Comprehensive details of these experimental setups are available in Appendix C, and the prompts can be found in Appendix F.

3.2. Judging Metrics

After collecting responses from MLLM judgments, we quantify their alignment with human annotations across three settings, employing distinct metrics as follows:

▷ Scoring Evaluation:

Following LLM-as-a-Judge (Zheng et al., 2023b), we compute the Pearson similarity (Lee Rodgers & Nicewander, 1988) between the MLLMs’ judgments and human ratings across different sub-datasets.
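As a minimal sketch (assuming SciPy, with illustrative score arrays), the per-sub-dataset Pearson similarity could be computed as follows:

```python
import numpy as np
from scipy.stats import pearsonr

# Illustrative arrays: one score per response in a sub-dataset.
mllm_scores = np.array([4, 3, 5, 2, 4, 1])   # scores given by the MLLM judge
human_scores = np.array([5, 3, 4, 2, 3, 1])  # human-annotated scores

similarity, _ = pearsonr(mllm_scores, human_scores)
print(f"Pearson similarity: {similarity:.3f}")
```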

▷ Pair Comparison:

We measure the similarity between MLLM judgments and human decisions using accuracy, F1-score, and recall (Goutte & Gaussier, 2005) to assess the judging abilities of the models.
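A hedged sketch of how such agreement metrics might be computed with scikit-learn, treating human decisions as reference labels; macro-averaging over the A/B/Tie classes is an assumption, not necessarily the paper's exact choice:

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Illustrative labels: "A", "B", or "Tie" for each response pair.
human = ["A", "B", "Tie", "A", "B", "A"]
judge = ["A", "B", "B",   "A", "A", "A"]

print("accuracy:", accuracy_score(human, judge))
# Macro-averaging is one reasonable choice for the multi-class (A/B/Tie) case.
print("f1:      ", f1_score(human, judge, average="macro"))
print("recall:  ", recall_score(human, judge, average="macro"))
```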

▷ Batch Evaluation:

We consolidate the ranking results into a singular sequence and employ the Normalized Levenshtein distance (Levenshtein et al., 1966) to evaluate the similarity between judgments from MLLMs and human annotation.
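A minimal sketch of the Normalized Levenshtein distance between two ranking strings; normalizing the edit distance by the longer sequence length is an assumption about the exact normalization used:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1]; 0 means identical rankings."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))

print(normalized_levenshtein("BACD", "ABCD"))  # 0.5 (edit distance 2 over length 4)
```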

3.3. Human Agreement in MLLM Judgment

Apart from traditional metrics for similarity assessment between judgments from MLLMs and humans, we further evaluate the judgments provided by MLLMs to uncover latent bias and hallucination in 10 datasets.

We also invite human annotators for further validation, focusing on the following aspects:

▷ Human Agreement:

This involves a simple ‘yes’ or ‘no’ response to assess agreement with the MLLM judgments.

While some judgments might appear reasonable, they may still be considered incorrect due to unique human perspectives.

Hence, we conduct experiments on human agreement to address situations that traditional metrics may not adequately capture.

▷ Analysis Grading:

Each MLLM analysis is assigned a score from 1 to 5, considering relevance, accuracy, creativity, and response granularity, detailed in Appendix F.

▷ Hallucination Detection:

Given the propensity for hallucination issues in the complex reasoning chains and long-term vision-language contexts of MLLMs, we task human annotators with identifying any hallucinations in the analyses of MLLM judgments, adhering to established definitions of vision and language hallucination (Sun et al., 2024).

4. Empirical Results and Analysis

4.1. MLLM Judgment vs Human Annotation

As shown in Figure 1 and Table 3, judgments made by GPT-4V are the closest to human annotations across all settings, while Gemini diverges considerably, and LLaVA, CogVLM, and Qwen-VL-Max perform even worse.

Overall, MLLM judgments align better with humans in Pair Comparison, while falling short in Scoring Evaluation and Batch Ranking, revealing a large gap between model and human preferences.

Under the “Analyze-then-Judge” setting, GPT-4V tends to produce longer judgments in all settings, reflecting its ability to reason over long-form text.

▷ Scoring Evaluation:

GPT-4V demonstrates the highest similarity to human scoring with a similarity score of 0.490.

In contrast, Gemini achieves only 0.304, with LLaVA and CogVLM scoring even lower.

This discrepancy is mainly due to Gemini’s tendency to assign scores around 4 points as depicted in Figure 4, seldom giving 1 or 2 points.

LLaVA and CogVLM show a pattern similar to Gemini, predominantly assigning scores around 4 points.

We attribute this to a ‘High-Score’ Bias, akin to the ‘Yes/No’ bias identified by Liu et al. (2023a), which may result from an imbalance of positive and negative judging instructions in their training data (Liu et al., 2023b) and severely limits their ability to provide fair and varied scores in scoring settings.

In comparison, GPT-4V’s scores are more evenly distributed and align closely with human preferences.

▷ Pair Comparison:

As illustrated in Figure 4, GPT-4V outshines other MLLMs in pair comparison tasks, achieving 0.636 in tie settings and 0.773 in non-tie settings, and surpassing 0.8 on many datasets, which indicates a strong alignment with human preferences. Gemini, LLaVA, and CogVLM show a marked preference for declaring a clear winner, possibly due to a lack of tie situations in their training, leading to biased judgments. It is also interesting that the frequency of ties given by GPT-4V closely mirrors that of human judges, suggesting similar thresholds for tie decisions.

▷ Batch Ranking:

GPT-4V aligns more closely with human ranking results, taking a significant lead with a mean Levenshtein Distance of 0.361. However, there is still substantial room for improvement in this task for all MLLMs. Notably, CogVLM is unable to provide a full ranking in this context, offering only the top choice, so it is excluded from this comparison. LLaVA also exhibits position bias influenced by prompt structure, often replicating judgments seen in example prompts, which complicates its ability to produce fair judgments.
 
4.2. MLLM Judging Consistency

To be a reliable judge, consistent decision-making across repeated evaluations of the same query is crucial. For this purpose, we conduct six repeated tests with MLLM judgments and calculate the weighted average consistency scores and Majority Consistency Criterion ratios for GPT-4V and Gemini, as shown in Table 4 and Figure 5.
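The precise definitions of the weighted average consistency score and the Majority Consistency Criterion are not reproduced in this excerpt; one plausible per-query reading, sketched below, is the fraction of repeated runs that agree with the majority judgment:

```python
from collections import Counter

def majority_agreement(judgments: list[str]) -> float:
    """Fraction of repeated judgments that match the most common one.

    This is only one plausible reading of a per-query consistency score;
    the paper's precise definition may differ.
    """
    counts = Counter(judgments)
    return counts.most_common(1)[0][1] / len(judgments)

# Six repeated runs of the same pair-comparison query (illustrative).
runs = ["A", "A", "Tie", "A", "B", "A"]
print(majority_agreement(runs))  # 4/6 ≈ 0.667
```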

Despite a higher temperature setting, GPT-4V substantially outperforms Gemini across all tasks.

 

Particularly in Pair Comparison, GPT-4V achieves a higher consistency score of 0.675, but it struggles to maintain similar levels of consistency in the Scoring and Batch Ranking tasks, with scores dropping to 0.611 and 0.418, indicating the challenge of producing consistent and convincing judgments.

 

4.3. Human Agreement

 

Our manual evaluation of MLLMs on agreement and scoring revealed notable findings.

 

Table 3 shows that GPT-4V achieves around 70% human agreement across all settings, excelling in the Pair Comparison task with 79.3% agreement.

 

Specifically, GPT-4V reached 78% in human agreement for Pair Comparison, with Gemini close at 72%, indicating strong performance in most sample pairs and supporting the idea that large models excel in pairwise distinctions (Zheng et al., 2023b), though improvements are needed in other judging settings.

 

In Scoring Evaluation, GPT-4V achieves a 70% human agreement rate, peaking at 79.9% on MS-COCO, while Gemini averages 67.7%.

 

To assess the consistency of MLLM judging quality across multiple responses to a single image-instruction pair, we use the Mean Absolute Deviation (MAD) metric, which measures the average absolute difference between individual scores and their mean.
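Mean Absolute Deviation as described here is straightforward to compute; a small sketch with illustrative scores:

```python
import numpy as np

def mean_absolute_deviation(scores):
    """Average absolute difference between each score and the mean score."""
    scores = np.asarray(scores, dtype=float)
    return float(np.mean(np.abs(scores - scores.mean())))

# Illustrative: several judge scores for responses to one image-instruction pair.
print(mean_absolute_deviation([4, 4, 5, 3, 4]))  # 0.4
```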

 

Figure 18 shows that GPT-4V exhibits lower variation in quality assessments, indicating more consistent and reliable judgment compared to Gemini.

 

However, in Batch Ranking, both models exhibited decreased alignment with human judgments, especially in Maths and graphic information processing, suggesting that models may lack the capabilities to fully comprehend user instructions, leading to less reliable judgments.

 

4.4. Multi-step CoT Does Not Enhance Performance

 

We have conducted additional tests using GPT-4V and Gemini with a 3-step CoT approach for judging, as detailed in Table 5.

 

Our analysis reveals that while employing CoT with additional steps markedly reduces hallucinations in judgments, it does not align more closely with human preferences.

 

On numerous datasets, this approach even diminishes judging performance.

 

Specifically, Gemini’s effectiveness drops more drastically.

 

With 3-step CoT, the judgment is more likely to be disturbed by the model's own understanding of the figure and its own response to the instruction; if hallucinations exist earlier in the chain, they undermine the final judgment.

 

4.5. Vision Perception Benefits MLLM Judging

 

We explore the feasibility of using LLMs for judging text-based responses without directly analyzing the original images.

 

This involves two approaches:

 

omitting vision information entirely and providing a detailed description of the picture.

 

We choose LLaMA-70b, Mixtral-8x7b-v0.1, and GPT-3.5 as text-only judges in these settings.

 

Surprisingly, as illustrated in Table 6, we find that LLMs' performance in multimodal judging tasks improves significantly when picture descriptions are provided, achieving a Pearson similarity of 0.435 in Scoring Evaluation and markedly outperforming judgments made without any vision perception.

 

Notably, in no-tie Pair Comparison, LLMs given detailed vision descriptions even exceed the standard judging performance of MLLMs.

 

This suggests that MLLMs may lack certain human-like judging capabilities, while LLMs can be potential judges for multimodal tasks when provided with comprehensive task-related descriptions.

 

4.6. Bias and Hallucination

Egocentric Bias.

 

Models tend to assign higher scores to their own responses while scoring others lower (Zheng et al., 2023b; Li et al., 2024).

 

In Figures 19 and 20, GPT-4V exhibits a slight degree of Egocentricity.

 

Conversely, Gemini maintains a uniform scoring distribution across different sources, demonstrating a more equitable approach to judgment.

 

In contrast, GPT-4V shows self-preference, aligning its judgments with its predefined ethical guidelines.

 

For example, GPT-4V consistently emphasizes privacy preservation, leading to higher scores for privacy-related questions based on its own metrics.

 

Despite efforts in prompt engineering to ensure neutrality, these models still rely on judgment criteria set during post-alignment training (Ouyang et al., 2022).

 

This bias can result in judgments that deviate from human preferences, highlighting the complexity of aligning MLLM judgments with humans’.

 

Position Bias.

 

Models consistently favor answers in specific positions, often influenced by training data that typically places correct responses at the beginning or end of prompts (Liu et al., 2023e).

 

Figure 4 illustrates bias in LLaVA and CogVLM during Pair Comparison tasks, where they consistently prefer answers in a specific position.

 

This bias likely arises from their limited ability to follow complex instructions, leading them to be influenced by prompt structure.

 

For example, if a Batch Ranking prompt includes a sequence like ‘ABCD’, LLaVA replicates this sequence in 88.2% of responses, significantly more than other sequences.

 

However, this bias can be reduced by introducing multiple examples, suggesting that prompts with more examples can better direct these models to follow instructions accurately.
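The 88.2% figure is a replication rate; as an assumed sketch (not the paper's code), it could be measured as the fraction of model rankings that exactly copy the example order shown in the prompt:

```python
def example_replication_rate(rankings: list[str], example_order: str = "ABCD") -> float:
    """Fraction of model rankings that exactly copy the example order in the prompt."""
    if not rankings:
        return 0.0
    return sum(r == example_order for r in rankings) / len(rankings)

# Illustrative outputs from a Batch Ranking prompt whose example sequence is "ABCD".
outputs = ["ABCD", "ABCD", "BACD", "ABCD", "ABCD"]
print(example_replication_rate(outputs))  # 0.8
```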

 

Length Bias.

 

Models tend to prefer longer answers over concise but correct ones (Li et al., 2024), also known as verbosity bias (Zheng et al., 2023b).

 

Figure 6 shows that both GPT-4V and Gemini assign higher scores to longer content.

 

We conducted an expanded scoring experiment using GPT-4 (OpenAI, 2023) without vision, increasing the semantic length of answers without changing their original intent.

 

In Figure 7, we observe noticeable score increases, with GPT-4V and Gemini showing average gains of 0.6 and 0.75 points, respectively.

 

These results suggest that MLLMs may favor longer text for higher scores.

 

Hallucination Detection and Mitigation.

 

We observe a higher frequency of hallucinations in Batch Ranking, compared to Pair Comparison and Scoring Evaluation.

 

These hallucinations involved significant misinterpretations and retrieval errors, impacting judgment accuracy and reliability.

 

To address this, we employed a multi-step CoT approach on MLLM-AS-A-JUDGE-HARD, adding reasoning steps before the conventional “Analyze-then-Judge” process.

 

This enhanced procedure included: 1) the image-instruction pair, 2) the image, and 3) the instruction.

 

As shown in Table 7, this strategy effectively reduced hallucinations across all formats, with significant improvements in tasks involving image-related information.

 

In the Batch Ranking task, which requires handling longer text sequences, the detailed reasoning steps were particularly effective in reducing hallucinations.

 

4.7. Scaling Law for MLLM-as-a-Judge

 

We conduct two sets of experiments with models of different sizes, the LLaVA-1.6 series models and the Qwen series models, on four newly added datasets, as illustrated in Figures 10 and 11.

 

In Scoring Evaluation, LLaVA-1.6-34b and Qwen-VL-Max slightly outperform the others in Math, Chart, and Text tasks, showing a relatively strong scaling law.

 

5. Related Work

 

LLM as a Judge.

 

The evolution of LLMs has made them increasingly effective evaluators in Natural Language Processing (NLP) tasks. Zhu et al. (2023) introduced JudgeLM for LLM evaluation, followed by AUTO-J (Li et al., 2023a), aligning closely with human judgment (Bai et al., 2023b; Li et al., 2023d; Kim et al., 2023).

 

Advancements in CoT reasoning (Wei et al., 2022; Chu et al., 2023) and training-free instruction following (Brown et al., 2020; Wei et al., 2021) further extend LLMs’ judging capability in diverse tasks like translation quality assessment (Kocmi & Federmann, 2023) and story generation (Chiang & Lee, 2023a).

 

Hallucination and Bias in Judgments.

 

MLLMs suffer from vision and language hallucinations (Ji et al., 2023; Huang et al., 2023a; Cui et al., 2023; Wang et al., 2023a), often due to vision-language misalignment during the training phase (Sun et al., 2024; Huang et al., 2023b).

 

Recent research focuses on hallucination evaluation (Liu et al., 2023a), detection (Li et al., 2023e; Wang et al., 2023a), and mitigation (Yin et al., 2023; Gunjal et al., 2023; Zhou et al., 2023), noting that even GPT-4V suffers from these issues (Shi et al., 2023; Liu et al., 2023a; Cui et al., 2023).

 

Besides, biases in MLLM-as-a-Judge, similar to those in human decision-making (Blunch, 1984; Raghubir & Valenzuela, 2006) and other ML domains (Wang et al., 2018; Liu et al., 2023e), such as position (Zheng et al., 2023a), egocentric (Li et al., 2024), and verbosity biases (Saito et al., 2023), are compounded by the integration of visual perception, necessitating further investigation.

 

6. Future Directions

Multimodal RLHF/DPO.

 

Our work is highly connected with multimodal RLHF/DPO (Sun et al., 2023; Li et al., 2023c; Yu et al., 2023a).

 

Our dataset includes extensive human annotations, such as manually assigned scores and pairwise preferences, which could serve as invaluable training material for RLHF reward models and supply paired data essential for DPO (Rafailov et al., 2024; Zhang et al., 2024), paving the way for enhancing the training of MLLMs.

 

Exploring the upper bound of MLLM-as-a-Judge.

 

Beyond expanding the steps in the Chain of Thought prompting (Wei et al., 2022), we see significant potential in more sophisticated reasoning frameworks, such as multi-agent debating (Chan et al., 2023) when MLLM acts as a Judge, which could enhance the judging accuracy through improved reasoning capabilities.

 

Additionally, addressing inherent biases in the model during the judgment process is crucial.

 

For instance, position bias in Pair Comparison and Batch Ranking (Zheng et al., 2023a; Wang et al., 2024a), and the tendency to assign higher scores, as discussed in (Lee et al., 2024), are critical areas for improvement.

 

Incorporating a human-in-the-loop approach (Wang et al., 2023b) offers a promising solution to enhance judgment consistency and reliability.

 

For example, if judgments differ across more than half of several repeated runs, human intervention may be needed for consistency checking.

 

When it’s challenging to discern the MLLM’s judgment due to non-compliance with the suggested output format or lack of a clear outcome, human intervention may be required to refine this process by manually verifying judgments.

 

7. Conclusion

 

In this paper, we have presented a new benchmark, termed MLLM-as-a-Judge, to assess the judging capabilities of MLLMs across three critical evaluation settings in the multimodal domain:

 

Scoring Evaluation, Pair Comparison, and Batch Ranking. We further evaluate their agreement with humans.

 

Our results reveal that advanced MLLMs achieve substantial human recognition in Pair Comparison, but perform poorly in Scoring Evaluation and Batch Ranking tasks.

 

Our work highlights potential areas for future refinement and improvement of MLLMs. We advocate for additional efforts dedicated to supporting the continuous development of MLLMs as judges.
 
A. Comprehensive Related Works
 
A.1. Large Model as Judge
 
The rapid development of LLMs has significantly enhanced their capabilities in long-term context perception and reasoning, increasingly popularizing their use as evaluators in various Natural Language Processing (NLP) tasks.
 
Zhu et al. (2023) were pioneers in this area, introducing JudgeLM, a fine-tuned LLM designed for evaluating other LLMs.
 
Building on this, Li et al. (2023a) introduced AUTO-J, a system that evaluates LLMs through both pairwise comparisons and single-response assessments, demonstrating close alignment with human judgment (Bai et al., 2023b; Li et al., 2023d; Kim et al., 2023).
 
Further advancements in LLMs, such as the development of Chain-of-Thought reasoning (Wei et al., 2022; Chu et al., 2023), training-free instruction following (Brown et al., 2020; Wei et al., 2021), and enhanced alignment with human preferences (Ouyang et al., 2022), have solidified their role in diverse tasks like translation quality assessment (Kocmi & Federmann, 2023) and story generation (Chiang & Lee, 2023a).
 
A.2. Hallucination and Bias in Judge
 
MLLMs are known to exhibit both vision hallucination and hallucination originating from LLMs, a phenomenon typically characterized by responses containing information not present in the visual or natural language context (Ji et al., 2023; Huang et al., 2023a; Cui et al., 2023; Wang et al., 2023a).
 
This issue often stems from misalignments in vision-language training (Sun et al., 2024; Huang et al., 2023b). Recent studies have begun to address these hallucination issues, focusing on evaluation (Liu et al., 2023a), detection (Li et al., 2023e; Wang et al., 2023a), and mitigation strategies (Yin et al., 2023; Gunjal et al., 2023; Zhou et al., 2023).
 
Notably, GPT-4V (OpenAI, 2023), despite being a leading model in many fields (Yang et al., 2023; Wu et al., 2023b), has also demonstrated susceptibility to hallucinations (Shi et al., 2023; Liu et al., 2023a; Cui et al., 2023).
 
This raises concerns about the reliability of MLLMs in evaluative roles.
 
In terms of bias, MLLM judging is subject to issues not exclusive to our context of evaluation but also observed in human decision-making (Blunch, 1984; Raghubir & Valenzuela, 2006) and Machine Learning (ML) domains (Wang et al., 2018; Liu et al., 2023e; Huang et al., 2024a) such as position bias (Zheng et al., 2023a), egocentric bias (Li et al., 2024), and verbosity bias (Saito et al., 2023).
 
The integration of visual perception in MLLMs introduces additional complexities, resulting in biases unique to the fusion of dual perceptions, an area that still demands thorough exploration.
 
A.3. Evaluating Large Multimodal Models
 
Evaluating MLLMs typically involves diverse tasks and corresponding metrics, which reflect the models’ ability to comprehend and generate content based on both visual and textual information.
 
For instance, in image captioning tasks, models are tasked with generating descriptive text for a given image.
 
The effectiveness of these models is measured using metrics such as BLEU (Papineni et al., 2002), METEOR (Banerjee & Lavie, 2005), ROUGE (Lin, 2004), and CIDEr (Vedantam et al., 2015).
 
In the context of Visual Question Answering (VQA), models are evaluated based on their ability to answer questions on an image’s content.
 
Here, the accuracy of model responses is compared against human-annotated answers, serving as the primary metric (Antol et al., 2015) to ensure alignment with human preferences.
 
However, when tackling sophisticated visual-language tasks, conventional evaluation metrics often fail to accurately capture the nuanced responses generated by these models, especially in complex or subjective scenarios that involve both visual elements and extended textual content (Liu et al., 2023a).
 
Additionally, while manual annotation offers a more comprehensive and human-like evaluation, it comes with significant challenges.
 
These include high costs (Prendki, 2023), potential biases (Zheng et al., 2023b), and the difficulty of ensuring consistent replication (Chiang & Lee, 2023a).
 
These limitations highlight the need for a more holistic approach to evaluation, one that combines human-like calibration with more fine-grained assessment methods.