https://arxiv.org/abs/2306.13394
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Multimodal Large Language Model (MLLM) relies on the powerful LLM to perform multimodal tasks, showing amazing emergent abilities in recent studies, such as writing poems based on an image.
However, it is difficult for these case studies to fully reflect the performance of MLLMs, and a comprehensive evaluation has been lacking.
In this paper, we fill this gap by presenting MME, the first comprehensive MLLM evaluation benchmark. It measures both perception and cognition abilities on a total of 14 subtasks.
In order to avoid data leakage that may arise from the direct use of public datasets for evaluation, the annotations of the instruction-answer pairs are all manually designed.
The concise instruction design allows us to fairly compare MLLMs, instead of struggling with prompt engineering.
Besides, with such an instruction, we can also easily carry out quantitative statistics.
A total of 30 advanced MLLMs are comprehensively evaluated on our MME, which not only suggests that existing MLLMs still have a large room for improvement, but also reveals the potential directions for the subsequent model optimization.
The data application manner and online leaderboards are released at
https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation
2. MME Evaluation Suite
2.1. Instruction Design
In order to facilitate quantitative performance statistics, our instruction design is oriented toward letting the model answer "yes" or "no".
As a result, the instruction consists of two parts, including a concise question and a description “Please answer yes or no.”
For each test image, we manually design two instructions, where the discrepancy lies in the questions.
The ground truth answer of the first question is “yes” and that of the second question is “no”, as shown in Fig. 1.
When an MLLM answers both questions correctly, we are more confident that it actually comprehends the image and the corresponding knowledge behind it, rather than just guessing.
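For concreteness, the sketch below shows one way the two instructions attached to a single test image could be organized in code. The field names, the file name, and the example existence questions are illustrative assumptions on our part, not the released data format.

# Hypothetical layout of one MME test item: two manually designed
# instructions per image, one with ground-truth "yes" and one with "no".
item = {
    "image": "000000123456.jpg",  # hypothetical COCO-style file name
    "questions": [
        {
            "text": "Is there a dog in this image? Please answer yes or no.",
            "answer": "yes",
        },
        {
            "text": "Is there a cat in this image? Please answer yes or no.",
            "answer": "no",
        },
    ],
}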
2.2. Evaluation Metric
Since the output of the model is limited to two types (“yes” or “no”), it is convenient to measure the metrics of accuracy and accuracy+.
The former is calculated based on each question, while the latter is based on each image where both of the two questions need to be answered correctly.
The random-guess accuracies of the two metrics are 50% and 25%, respectively, since random guessing answers a single question correctly with probability 1/2 and both questions of an image with probability 1/2 × 1/2 = 1/4.
Accuracy+ is therefore a stricter measurement, but it better reflects how comprehensively the model understands the image.
In addition, we calculate the score of a subtask as the sum of its accuracy and accuracy+, so each subtask is worth at most 200 points.
The perception score is the sum of the scores of the ten perception subtasks.
The cognition score is calculated in the same way over the four cognition subtasks.
Therefore, the full scores of perception and cognition are 2000 and 800, respectively.
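As a minimal sketch of this scoring scheme, the Python function below computes accuracy, accuracy+, and the resulting subtask score from yes/no predictions. The record format (the image_id, pred, and answer fields) is our own assumption for illustration, not the official evaluation code.

from collections import defaultdict

def score_subtask(records):
    # records: list of dicts with keys 'image_id', 'pred', 'answer' ("yes"/"no").
    per_image = defaultdict(list)
    n_correct = 0
    for r in records:
        hit = r["pred"].strip().lower() == r["answer"].strip().lower()
        n_correct += hit
        per_image[r["image_id"]].append(hit)

    accuracy = 100.0 * n_correct / len(records)  # per-question metric
    # accuracy+ credits an image only if both of its questions are answered correctly
    accuracy_plus = 100.0 * sum(all(hits) for hits in per_image.values()) / len(per_image)
    score = accuracy + accuracy_plus  # subtask score, at most 200
    return accuracy, accuracy_plus, score

Under this scheme, the perception score is the sum of the scores of the ten perception subtasks (at most 2000), and the cognition score is the sum over the four cognition subtasks (at most 800).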
2.3. Data Collection
2.3.1 Perception Tasks
We argue that perception is one of the most fundamental capabilities of MLLMs, and the lack of perception will easily lead to the object hallucination problem [25, 50].
That is, an MLLM answers questions based on its own fantasies rather than on the realistic content of the image, as displayed in Fig. 4.
Coarse-Grained Recognition.
====================
The contents of coarse-grained recognition include the existence of common objects, and their count, color, and position.
The images are sampled from COCO [26], but the instruction-answer pairs are all manually constructed, rather than directly using publicly available annotations.
Even if MLLMs have seen these COCO images, our manually prepared pairs are not present in their training sets.
This requires MLLMs to be able to understand the instructions and infer corresponding answers.
In each perception subtask of existence, count, color, and position, we prepare 30 images with 60 instruction-answer pairs.
Fine-Grained Recognition.
====================
Fine-grained recognition tests more of the knowledge resources of MLLMs.
The subtasks consist of recognizing movie posters, celebrities, scenes, landmarks, and artworks, containing 147, 170, 200, 200, and 200 images respectively.
For the celebrities, we draw a red box around a person with a clearly visible face in the image, and the corresponding instruction is "Is the actor inside the red box named [celebrity name]? Please answer yes or no."
Similar to the above coarse-grained recognition, the images of these subtasks come from publicly available datasets [19, 34, 35, 44, 58], and all of the instructions are manually designed.
OCR.
====================
Optical Character Recognition (OCR) is also a foundational capability of MLLMs, serving as the basis for subsequent text-based tasks such as text translation and text understanding.
The images are sampled from [30] and all of the instruction-answer pairs are manually designed.
Considering that MLLMs are still in their infancy, we only choose relatively simple samples in this version of MME.
The numbers of images and instruction-answer pairs are 20 and 40, respectively.
2.3.2 Cognition Tasks
====================
We evaluate whether MLLMs can carry out further logical reasoning after perceiving the image, which is the most fascinating advantage of MLLMs over previous traditional methods.
In order to infer the correct answer, MLLMs need to follow the instruction, perceive the contents of the image, and invoke the knowledge stored in the LLM, which is much more challenging than the perception tasks alone.
Examples of the following subtasks are shown in Fig. 1.
Commonsense Reasoning.
====================
Unlike the ScienceQA dataset [32], which requires specialized knowledge, commonsense here refers to basic knowledge of daily life.
For example, given a photo of a down jacket, we ask MLLMs whether it is appropriate to wear this clothing when it is cold (or hot).
This is basic knowledge that humans can judge instantly without complex step-by-step reasoning.
Therefore, we expect MLLMs to perform well in a zero-shot setting.
The images are all manually photographed or generated by diffusion models, and the instruction-answer pairs are all manually designed.
There are a total of 70 images and 140 instruction-answer pairs.
Numerical Calculation.
====================
It requires MLLMs to read the arithmetic problem in the image and output the answer in an end-to-end way, which has been demonstrated in [20].
In this version, we only consider relatively easy arithmetic problems, such as addition and multiplication.
There are 20 images and 40 instruction-answer pairs.
The images are all manually taken, and the instruction-answer pairs are all manually designed.
Text Translation.
====================
Considering that the MLLM [5] supports both English and Chinese, we set up the text translation subtask.
It requires MLLMs to translate the Chinese text written in an image into the corresponding English.
In this version, we only design basic translation problems, which will be updated according to the development of MLLMs in the future.
The images of this part are all manually taken, and the instruction-answer pairs are all manually designed.
There are a total of 20 images and 40 instruction-answer pairs.
Code Reasoning.
====================
It requires MLLMs to read the code in the image and automatically complete the logical operations inside the code.
A similar task that writes website code based on an image has been demonstrated in [59].
The images are all manually taken, and the instruction-answer pairs are all manually designed.
We only set basic code problems in this version. There are in total 20 images and 40 instruction-answer pairs.