https://arxiv.org/abs/2311.01361
Automatically evaluating vision-language tasks is challenging, especially when it comes to reflecting human judgments, because existing metrics struggle to account for fine-grained details.
Although GPT-4V has shown promising results in various multimodal tasks, leveraging GPT-4V as a generalist evaluator for these tasks has not yet been systematically explored.
We comprehensively validate GPT-4V’s capabilities for evaluation purposes, addressing tasks ranging from foundational image-to-text and text-to-image synthesis to high-level image-to-image translation and multi-image-to-text alignment.
We employ two evaluation methods with GPT-4V: single-answer grading and pairwise comparison.
Notably, GPT-4V shows promising agreement with humans across various tasks and evaluation methods, demonstrating immense potential for multi-modal LLMs as evaluators.
Despite limitations such as restricted grading of visual clarity and complex real-world reasoning, its ability to provide human-aligned scores enriched with detailed explanations makes it promising as a universal automatic evaluator.
2 GPT-4V-as-a-Generalist-Evaluator
Large Language Models (LLMs) have shown promising performance in substituting for human annotators on text evaluation tasks (Zheng et al., 2023).
Similarly, GPT-4V has demonstrated strong results across a variety of image-based tasks (Yang et al., 2023).
However, the role of GPT-4V as a multimodal evaluator remains unexplored.
In this section, we demonstrate how we leverage GPT-4V as an evaluator for various multimodal tasks. Inspired by Zheng et al. (2023), we employ two distinct evaluation methods:
(1) single-answer grading and (2) pairwise comparison.
These methods leverage GPT-4V to assess the quality of outputs across various multimodal tasks, taking into consideration the specific inputs associated with each task.
The nature of these input-output pairs varies depending on the type of task.
For example, in an image captioning task, the input is an image and the output is text, whereas in synthetic image generation, the roles are reversed.
• Single-answer grading:
GPT-4V is instructed to generate a score to evaluate the quality and alignment of the input-output pair.
The scoring scale may vary depending on the task.
• Pairwise comparison:
GPT-4V evaluates a single task input along with a pair of answers.
It determines which of the two candidate answers is better, or whether the two answers are of equal quality.
Single-answer grading offers scalability as the number of candidate answers increases.
However, it lacks the ability to directly compare different answers for the same task input.
Pairwise comparison, on the other hand, allows for such direct comparison, but its computational cost grows quadratically with the number of candidate answers.
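For illustration, the two evaluation methods can be issued as multimodal queries to GPT-4V through a chat API with image inputs. The sketch below (in Python, using the OpenAI client) is only a minimal example: the prompts, model identifier, and score scale are illustrative assumptions, not the exact configuration used in our experiments.

```python
# Minimal sketch of the two evaluation methods via the OpenAI chat API with image
# inputs; prompts, model name, and score scale are illustrative assumptions.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4-vision-preview"  # assumed vision-capable model identifier

def single_answer_grade(image_url: str, answer: str, scale: int = 100) -> str:
    """Ask GPT-4V for one quality/alignment score for an input-output pair."""
    prompt = (f"Rate how well the following caption describes the image "
              f"on a scale of 0 to {scale}, and explain briefly. Caption: {answer}")
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]}],
    )
    return resp.choices[0].message.content

def pairwise_compare(image_url: str, answer_1: str, answer_2: str) -> str:
    """Ask GPT-4V which of two candidate answers is better, or whether they tie."""
    prompt = (f"Answer 1: {answer_1}\nAnswer 2: {answer_2}\n"
              "Which answer describes the image better? "
              "Reply with 'Answer 1', 'Answer 2', or 'Tie'.")
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]}],
    )
    return resp.choices[0].message.content
```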
To evaluate the effectiveness of GPT-4V as an evaluator, we compare its judgments with those of human evaluators.
Both human evaluators and GPT-4V assess the output quality of input-output pairs from various tasks.
We categorize the results into “Answer 1,” “Answer 2,” and “Tie,” and calculate alignment scores to measure agreement among different evaluation approaches.
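Concretely, treating each verdict as a label in {Answer 1, Answer 2, Tie}, the agreement between any two evaluators reduces to a simple match rate; the following is a minimal sketch of that computation (our own illustration, not a formal metric definition):

```python
# Illustrative agreement computation between two evaluators' verdicts.
def agreement(verdicts_a, verdicts_b) -> float:
    """Fraction of task inputs on which two evaluators (e.g., a human and GPT-4V)
    give the same verdict out of {"Answer 1", "Answer 2", "Tie"}."""
    assert len(verdicts_a) == len(verdicts_b)
    return sum(a == b for a, b in zip(verdicts_a, verdicts_b)) / len(verdicts_a)

# e.g., agreement(["Answer 1", "Tie", "Answer 2"],
#                 ["Answer 1", "Answer 2", "Answer 2"]) == 2 / 3
```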
We illustrate a single-answer grading and pairwise comparison example in Figure 2.
Single-answer grading. We ask the evaluator to provide a score within a predefined scale range.
The score is based on the multimodal input-output pairs for different tasks.
To obtain a final answer that represents the evaluator’s preference between two candidate outputs, we consider the one with the higher score as the preferred answer.
If the scores are equal, the result is declared a “Tie”.
Pairwise comparison. To ensure a fair assessment, we implement a dual-sided comparison scheme with GPT-4V to mitigate potential position bias (Zheng et al., 2023).
Each output comparison is evaluated twice with the same task input, swapping the positions of Answer 1 and Answer 2 between the two passes.
To determine the final answer in the dual-sided pairwise evaluation, we apply the following rules (see the sketch after this list):
• Answer 1: GPT-4V chooses Answer 1 twice, or chooses Answer 1 once and Tie once.
• Tie: GPT-4V chooses Tie twice, or chooses Answer 1 once and Answer 2 once.
• Answer 2: GPT-4V chooses Answer 2 twice, or chooses Answer 2 once and Tie once.
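These aggregation rules, together with the score-to-preference rule used for single-answer grading, can be transcribed directly into code; the sketch below is a plain Python rendering of the rules as stated, with the label strings chosen for illustration:

```python
# Direct transcription of the aggregation rules; label strings are illustrative.
def preference_from_scores(score_1: float, score_2: float) -> str:
    """Single-answer grading: the higher-scored answer is preferred; equal scores tie."""
    if score_1 > score_2:
        return "Answer 1"
    if score_2 > score_1:
        return "Answer 2"
    return "Tie"

def aggregate_dual_sided(first_pass: str, second_pass: str) -> str:
    """Combine the two verdicts of a dual-sided pairwise comparison.

    `first_pass` is GPT-4V's verdict under the original answer order;
    `second_pass` is its verdict after swapping the answers, already
    mapped back to the original labels.
    """
    votes = sorted([first_pass, second_pass])
    if votes == ["Answer 1", "Answer 1"] or votes == ["Answer 1", "Tie"]:
        return "Answer 1"
    if votes == ["Answer 2", "Answer 2"] or votes == ["Answer 2", "Tie"]:
        return "Answer 2"
    return "Tie"  # two ties, or one vote for each answer
```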
3.1 Image-to-Text Captioning
Data Construction.
We first sample two subsets from the MSCOCO Train 2017 (Lin et al., 2014) and Conceptual Captions 12M (Changpinyo et al., 2021) image-text paired datasets to evaluate GPT-4V on image-to-text generation tasks.
From each dataset, we sample 50 image-text pairs via clustering-based sampling to ensure diversity, yielding a constructed dataset of 100 pairs in total.
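The clustering procedure is not the focus here; one plausible realization, shown purely for illustration, clusters image embeddings with k-means and keeps the pair closest to each centroid (the use of k-means and of image embeddings are assumptions, not a description of our exact pipeline):

```python
# Hypothetical clustering-based sampling: the choice of k-means over precomputed
# image embeddings is an assumption made for illustration only.
import numpy as np
from sklearn.cluster import KMeans

def diverse_sample(image_embeddings: np.ndarray, k: int = 50, seed: int = 0) -> list:
    """Cluster the embeddings into k groups and keep the index of the pair
    closest to each cluster centroid, giving k diverse image-text pairs."""
    km = KMeans(n_clusters=k, random_state=seed).fit(image_embeddings)
    picked = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(image_embeddings[members] - km.cluster_centers_[c], axis=1)
        picked.append(int(members[np.argmin(dists)]))
    return picked
```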
Evaluation Setup. We evaluate whether GPT-4V can provide consistent, accurate, and unbiased scores of generated caption quality in both the single-answer grading and pairwise comparison settings.
Since the datasets lack negative captions to pair against the high-quality ground-truth captions, we follow ALIGN (Jia et al., 2021) and manually construct hard negative captions by introducing the following types of noise:
1) category change, e.g., a Toyota car to a Volkswagen car;
2) color change, e.g., white to gray;
3) location change, e.g., changing the country name;
4) number change, e.g., 20 to 50;
5) position change, e.g., behind to in front;
6) subtle dropping, e.g., deleting an adverbial modifier of the object.
These perturbations keep the noisy caption fluent, and distinguishing it from the original ground-truth caption requires the evaluator to detect the fine-grained misalignment between the altered words and the image details, which makes such negative captions hard to grade.
We ask expert-level human labelers to consistently score both the ground-truth and hard-negative captions on the same 100-point scale.
We first compute the correlation and agreement between GPT-4V’s scores and human expert scores in terms of image-text alignment.
We report Pearson and Spearman correlation scores for both the ground-truth and hard-negative sets.
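Given paired lists of GPT-4V scores and human expert scores for the same captions, both correlations can be computed with scipy; a minimal sketch (variable names are ours):

```python
# Sketch of the correlation computation between GPT-4V and human expert scores.
from scipy.stats import pearsonr, spearmanr

def score_correlations(gpt4v_scores, human_scores):
    """Return (Pearson r, Spearman rho) between the two score lists."""
    pearson_r, _ = pearsonr(gpt4v_scores, human_scores)
    spearman_rho, _ = spearmanr(gpt4v_scores, human_scores)
    return pearson_r, spearman_rho
```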
We also compare GPT-4V-as-an-evaluator with the strong reference-free baseline CLIPScore (Hessel et al., 2021a) in the single-answer grading setting.
The CLIPScore is calculated following Hessel et al. (2021a) using a ViT-L/14 (Dosovitskiy et al., 2020) pre-trained checkpoint and is scaled to [0, 100].
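For reference, CLIPScore as defined by Hessel et al. (2021a) is w · max(cos(image, caption), 0) with w = 2.5. The sketch below computes it with a ViT-L/14 CLIP checkpoint from the transformers library; the specific checkpoint name and the rescaling to [0, 100] are our assumptions about the setup rather than a verified reproduction:

```python
# Sketch of reference-free CLIPScore (Hessel et al., 2021a) with a ViT-L/14 CLIP
# checkpoint; the checkpoint name and the [0, 100] rescaling are assumptions.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def clip_score(image, caption: str) -> float:
    """CLIPScore = w * max(cos(image, caption), 0), rescaled so the maximum is 100."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    img = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    txt = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    cos = float((img * txt).sum(dim=-1))
    return 100.0 * max(cos, 0.0)  # one way to map 2.5 * max(cos, 0) onto [0, 100]
```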
Second, we report the agreement between GPT-4V’s evaluation and human evaluation in both the single-answer scoring and pairwise comparison settings.
In addition, we compute the agreement between GPT-4V’s single-answer scoring and its pairwise comparison to demonstrate its consistency across evaluation variants.
GPT-4V is robust and effective in evaluating image-to-text captioning tasks.
The results on the agreement between human and GPT-4V scoring are presented in Table 1. GPT-4V’s scoring is significantly correlated with human ratings in both the ground-truth and hard-negative subset evaluations.