
VHELM Paper Review: A Holistic Evaluation of Vision Language Models

jinuklee 2024. 10. 17. 16:58

https://arxiv.org/abs/2410.07112
Current benchmarks for assessing vision-language models (VLMs) often focus on their perception or problem-solving capabilities and neglect other critical aspects such as fairness, multilinguality, or toxicity.

Furthermore, they differ in their evaluation procedures and the scope of the evaluation, making it difficult to compare models.

To address these issues, we extend the HELM framework to VLMs to present the Holistic Evaluation of Vision Language Models (VHELM).

VHELM aggregates various datasets to cover one or more of the 9 aspects:

visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.

In doing so, we produce a comprehensive, multi-dimensional view of the capabilities of the VLMs across these important factors.

In addition, we standardize the inference parameters, methods of prompting, and evaluation metrics to enable fair comparisons across models.

Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.

Our initial run evaluates 22 VLMs on 21 existing datasets to provide a holistic snapshot of the models.

We uncover new key findings, such as the fact that efficiency-focused models (e.g., Claude 3 Haiku or Gemini 1.5 Flash) perform significantly worse than their full models (e.g., Claude 3 Opus or Gemini 1.5 Pro) on the bias benchmark but not when evaluated on the other aspects.

For transparency, we release the raw model generations and complete results on our website at https://crfm.stanford.edu/helm/vhelm/v2.0.1.

 

VHELM is intended to be a living benchmark, and we hope to continue adding new datasets and models over time.

3. The VHELM Framework

 

VHELM focuses on vision-language models that take in interleaved images and text input as prompts to produce text completions (see Figure A1).

 

The VHELM evaluation process consists of 4 main components: aspect, scenario, adaptation, and metric (see Figure 2).

An aspect is a specific evaluative dimension that contributes to assessing the overall performance.

 

The aspects considered in VHELM are bias, fairness, knowledge, multilinguality, reasoning, robustness, safety, toxicity, and visual perception (details are in Section 3.1).

 

Aspects are evaluated by computing metrics over scenarios.


A scenario represents a use case for a VLM and is identified by a task (e.g., question answering, code generation, and captioning) and a usage category such as the domain, origin, language, or subject.

 

An example scenario is “visual question answering on medical images” where the task is visual question answering and the usage category is medical images.
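As a rough sketch (not VHELM's actual code; the field names below are my own), a scenario can be thought of as a small record pairing a task with a usage category:

```python
# Illustrative sketch of a scenario: a task plus a usage category.
from dataclasses import dataclass

@dataclass(frozen=True)
class Scenario:
    task: str            # e.g., "visual question answering", "captioning"
    usage_category: str  # e.g., a domain, origin, language, or subject

# The example from the text: visual question answering on medical images.
medical_vqa = Scenario(task="visual question answering",
                       usage_category="medical images")
print(medical_vqa)
```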

 

We consider a wide range of scenarios, with tasks ranging from visual question answering to captioning and usage categories consisting of multiple languages, subjects, and image types.

 

The scenarios used in VHELM are listed in Table 3.

 

A dataset is a set of instances—defined as a pair of prompt and reference—that can be used for evaluating the model performance on one or more scenarios.
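To make the instance/dataset distinction concrete, here is a minimal illustrative sketch (the field names are mine, not VHELM's):

```python
# Illustrative sketch of an instance: a prompt (possibly with images)
# paired with a reference answer used for scoring.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Instance:
    prompt: str                                            # question or instruction text
    image_paths: List[str] = field(default_factory=list)   # interleaved image inputs
    reference: str = ""                                     # gold answer

# A dataset is then simply a collection of such instances, e.g.:
dataset = [
    Instance(prompt="What is shown in the image?",
             image_paths=["example.jpg"],
             reference="a mountain"),
]
```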

 

A dataset can power multiple scenarios, such as in the case of Bingo [6], where the ‘region bias’ or ‘OCR bias’ subsets assess visual question answering of images from different geographic locations (used to test fairness) and visual question answering of images with text in various languages (used to test multilinguality), respectively.

 

A dataset is sometimes synonymous with the scenario, especially in the context of model evaluation.


For example, we may state “MMMU (Accounting)” as a scenario with the understanding that the accounting subset of MMMU tests visual question answering in the domain of accounting.

 

VHELM compiles a total of 21 existing datasets (see Table 3).


An adaptation is a specific procedure for invoking a model.

 

Adaptation strategies include zero-shot prompting, k-shot prompting, and chain-of-thought prompting.

 

In this study, we use only zero-shot prompting as it is the most common strategy used by the layperson.
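As a hedged sketch of what zero-shot adaptation amounts to: the request carries only the image(s) and the question, with no in-context examples. The request format below is generic and illustrative, not the actual HELM/VHELM request schema:

```python
# Illustrative zero-shot request assembly; the dict layout is an assumption.
from typing import Any, Dict, List

def build_zero_shot_request(prompt: str, image_paths: List[str],
                            model_name: str, max_tokens: int = 100) -> Dict[str, Any]:
    """Assemble a generic zero-shot request: images plus the question, no solved examples."""
    return {
        "model": model_name,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # fixed decoding parameters for fair comparison across models
        "prompt": [{"type": "image", "path": p} for p in image_paths]
                  + [{"type": "text", "text": prompt}],
        # k-shot prompting would prepend k solved examples here; zero-shot omits them.
    }
```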

 

A metric quantifies how well a VLM performs on a scenario.

 

Examples of metrics include exact match, or having either a human or a model score the response on a scale of 1 to 5.
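For concreteness, an exact-match metric can be as simple as the following sketch (the normalization choices are my assumptions, not necessarily VHELM's exact implementation):

```python
# Illustrative exact-match metric over a prediction and a reference answer.
def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized prediction equals the normalized reference, else 0.0."""
    def normalize(s: str) -> str:
        return " ".join(s.lower().strip().split())
    return 1.0 if normalize(prediction) == normalize(reference) else 0.0

# A human- or model-judged metric would instead map each response to a 1-5 score.
print(exact_match("A Mountain ", "a mountain"))  # 1.0
```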

3.1 Aspects & Scenarios

 

VHELM considers 9 aspects that are crucial for developing capable, safe, and reliable VLMs (see Table 2).

 

These include fundamental capabilities, such as visual perception, knowledge, and reasoning, and behavior relating to society and ethics, such as bias, fairness, multilinguality, robustness, toxicity, and safety.


VLMs are capable of visual perception, which is the ability to process and understand images.


Visual perception is assessed through image captioning, where VLMs produce descriptions of the input images, or visual-question answering (VQA), where VLMs are asked to answer questions pertaining to the images.

 

VHELM uses scenarios such as Flickr-30k [46], VQAv2 [12], VizWiz [14] and POPE [23] to assess this aspect.


Similar to LLMs, VLMs have knowledge and possess reasoning capabilities.

 

Knowledge is the ability to recall facts or information contained in the models and is assessed by asking questions whose answers cannot be found in the inputs, such as identifying the name of the mountain shown in an image.

 

In VHELM, these instances are provided by A-OKVQA [34], MME [44], MMMU [47], Vibe-Eval [33], and MathVista [29].


Reasoning, on the other hand, is the ability to perform multiple steps of inference to arrive at the answer and is assessed either by asking questions whose answers exist indirectly in the inputs or by explaining a sequence of pictures.

 

For example, the VLM is asked to compute the probability of a category given the unnormalized histogram.
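For intuition, the arithmetic behind that histogram example is just normalizing the counts (the numbers below are made up):

```python
# Toy version of the histogram reasoning example: probability of a category
# is its count divided by the total of the unnormalized counts.
counts = {"cat": 12, "dog": 18, "bird": 10}   # unnormalized histogram
total = sum(counts.values())                  # 40
p_dog = counts["dog"] / total                 # 18 / 40 = 0.45
print(f"P(dog) = {p_dog:.2f}")
```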

 

Reasoning is benchmarked using GQA [15], MathVista [29], Mementos [43], SEED-Bench [22], and RealWorldQA [13] in VHELM.


Bias refers to the ability to avoid unwarranted associations between the input to and output of a VLM, such as associating a specific gender with certain occupations.

 

Compared to LLMs, VLMs’ visual input provides another place where spurious correlations could cause bad behavior.

 

For example, skin tone or hair length can be identified from pictures and used to produce stereotypical associations.


We use PAIRS [10] to probe social biases in VLMs and provide some examples in Appendix E.


Fairness in VHELM refers to either counterfactual fairness or performance disparity.

 

Counterfactual fairness is the expectation that the model gives similar responses when a spurious attribute of the input (e.g., language dialect) is changed.

 

In VHELM, this is assessed by asking questions drawn from A-OKVQA [34] and VQAv2 [12] in African-American English (AAE), and through VQA on images around the world from Bingo [6].

 

See Appendix F for an example of AAE perturbation.

 

Fairness in terms of performance disparity means having similar performance on every subset of the data when an attribute is used as the filter.

 

For example, a VLM should be equally skillful in captioning images from different geographical locations.


VHELM tests for fairness across geographies using Crossmodal-3600 and across race, gender, and age using FairFace [16].
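A minimal sketch of how performance disparity could be measured: compute the metric per subgroup and report the worst-case gap between groups (the function and data below are illustrative, not taken from VHELM):

```python
# Illustrative per-subgroup scoring and worst-case gap computation.
from collections import defaultdict
from statistics import mean

def performance_disparity(scores, groups):
    """scores: per-instance metric values; groups: matching subgroup labels (e.g., geography)."""
    by_group = defaultdict(list)
    for score, group in zip(scores, groups):
        by_group[group].append(score)
    group_means = {g: mean(v) for g, v in by_group.items()}
    return max(group_means.values()) - min(group_means.values()), group_means

gap, per_group = performance_disparity(
    scores=[1.0, 0.0, 1.0, 1.0, 0.0, 1.0],
    groups=["Africa", "Africa", "Europe", "Europe", "Asia", "Asia"],
)
print(per_group, "gap =", gap)  # a fair model keeps this gap small
```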


We believe that a VLM should be multilingual, which is the ability to perform a task when the instruction and/or output languages are changed.

 

We augment A-OKVQA [34] by translating the questions and answers from English to either Chinese, Hindi, Spanish, or Swahili to test whether the VLMs are invariant to VQA in different languages.

 

An experiment to validate the machine translations is presented in Appendix G.

 

In addition, we use the "OCR bias" subset in Bingo [6] to test if VLMs understand an image if the text in it is presented in another language and EXAMS-V [8] to evaluate whether VLMs have reasoning capabilities in multiple languages.


An important property of a good VLM is robustness, defined as producing the desired answers under invariant perturbations of the input text, such as typographic errors (typos).

 

We introduce typos into A-OKVQA [34] and VQAv2 [12] to test the robustness against text perturbations.
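As an illustration of this kind of robustness test, a simple typo perturbation might randomly swap adjacent characters inside words; the exact perturbation recipe used in VHELM may differ:

```python
# Illustrative typo perturbation: randomly swap adjacent alphabetic characters.
import random

def add_typos(text: str, typo_rate: float = 0.05, seed: int = 0) -> str:
    """Return a copy of the text with adjacent-character swaps introduced at typo_rate."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < typo_rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

original = "What color is the traffic light in the picture?"
print(add_typos(original, typo_rate=0.2))
# A robust VLM should answer the perturbed question the same way as the original.
```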