Uncategorized

wildvision-arena WILDVISION: Evaluating Vision-Language Models in the Wild with Human Preference (paper review)

jinuklee 2024. 10. 15. 20:42

https://arxiv.org/pdf/2406.11069

Recent breakthroughs in vision-language models (VLMs) emphasize the necessity of benchmarking human preferences in real-world multimodal interactions.

 

To address this gap, we launched WILDVISION-ARENA (WV-ARENA), an online platform that collects human preferences to evaluate VLMs.

 

We curated WV-BENCH by selecting 500 high-quality samples from 8,000 user submissions in WV-ARENA.

 

WV-BENCH uses GPT-4 as the judge to compare each VLM with Claude-3-Sonnet, achieving a Spearman correlation of 0.94 with the WV-ARENA Elo.
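For reference, a minimal sketch of how such a rank correlation is computed with SciPy; the per-model scores below are made-up placeholders, not numbers from the paper.

```python
from scipy.stats import spearmanr

# Hypothetical per-model scores: WV-Bench judge scores vs. WV-Arena Elo ratings.
wv_bench_score = {"gpt-4o": 89.2, "gpt-4v": 80.0, "reka-flash": 64.7, "llava-next": 63.5}
arena_elo = {"gpt-4o": 1235, "gpt-4v": 1122, "reka-flash": 1074, "llava-next": 1031}

models = sorted(wv_bench_score)
rho, p_value = spearmanr([wv_bench_score[m] for m in models],
                         [arena_elo[m] for m in models])
print(f"Spearman correlation: {rho:.2f} (p={p_value:.3f})")
```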

 

This significantly outperforms other benchmarks like MMVet, MMMU, and MMStar.

 

Our comprehensive analysis of 20K real-world interactions reveals important insights into the failure cases of top-performing VLMs.

 

For example, we find that although GPT-4V surpasses many other models like Reka-Flash, Opus, and Yi-VL-Plus in simple visual recognition and reasoning tasks, it still faces challenges with subtle contextual cues, spatial reasoning, visual imagination, and expert domain knowledge.

 

Additionally, current VLMs exhibit issues with hallucinations and safety when intentionally provoked. We are releasing our chat and feedback data to further advance research in the field of VLMs.

 

2 WILDVISION-ARENA: Ranking VLMs with Human Preference

 

In this section, we introduce WILDVISION-ARENA and present statistics of in-the-wild chat data, along with a deep analysis of the human preferences that form our online VLM leaderboard.

 

2.1 Overview Design of WILDVISION-ARENA

 

Users conduct multi-round chats over uploaded images, during which two models are sampled from the pool of supported models or third-party APIs.

 

Users vote for the better response, with the model’s identity revealed afterward, and can provide reasons for their choices.

 

Votes contribute to a live leaderboard, which is updated every few hours to rank the models.

 

Appendix A shows a screenshot of our user interface.

 

In WILDVISION-ARENA, we currently support 20+ VLMs as shown in the leaderboard on the right part of Figure 1.

 

The generation hyperparameters are set identically when comparing these models, and users can adjust the temperature, top-p, and maximum output tokens to suit their use cases.
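A rough sketch of how such a shared-but-overridable decoding configuration could look; the defaults and field names here are assumptions for illustration, not values from the paper.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class GenerationConfig:
    """Decoding settings applied identically to both models in a battle."""
    temperature: float = 0.7
    top_p: float = 1.0
    max_output_tokens: int = 1024

DEFAULT_CONFIG = GenerationConfig()

def config_for_battle(user_overrides: Optional[dict] = None) -> GenerationConfig:
    """Start from the shared defaults and apply any per-user overrides."""
    return replace(DEFAULT_CONFIG, **(user_overrides or {}))

# Example: a user raises the temperature for a creative prompt.
print(config_for_battle({"temperature": 1.0}))
```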

 

2.2 Statistics of Chat Data with Votes

 

Each chat data point that has a human vote is classified into a category-subcategory and a domain-subdomain using GPT-4V.

 

The prompt template details are provided in Appendix E.1.

 

Key statistics of user voting in WILDVISION-ARENA are presented in Table 1.

 

The number of tokens is estimated with the tiktoken tokenizer corresponding to the ‘gpt-3.5-turbo’ model.
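A minimal sketch of that token estimate with tiktoken (the sample text is arbitrary):

```python
import tiktoken

# Tokenizer associated with 'gpt-3.5-turbo', as used for the Table 1 estimates.
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

def count_tokens(text: str) -> int:
    return len(encoding.encode(text))

print(count_tokens("What breed is the dog in this picture?"))
```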

 

Figure 2 and Figure 3 visualize the distribution of these voting data in terms of question categories and image domains, respectively.

 

In addition to the three dominant question categories (Recognition, Descriptive, Analytical), the Interactive, Instructive, and Creative categories are also receiving increasing interest.

 

Users are mostly interested in chatting about images tagged with the Entertainment domain (most of which are related to games and movies/TV shows), as well as the Urban, Expert, and People domains.

 

2.3 Crowdsourced Human Preference on VLMs in the Wild

 

Pairwise Comparison

 

We visualize the heatmap of battle counts and win fractions of seven models out of the 20+ models supported in the WILDVISION-ARENA in Figure 4.

 

The battle count heatmap highlights the frequency of direct comparisons, with models like GPT-4V vs. Gemini-Pro (252 voted battles) being tested more rigorously.

 

GPT-4o consistently outperforms the others by a large margin, winning 77% of its battles against GPT-4V, which ranks as the second best.

 

Reka-Flash follows closely behind GPT-4V, winning 42% of its battles, while other models demonstrate lower winning rates. Among the open-source models, LLaVA-NEXT leads, though there remains a significant gap between it and both GPT-4V and GPT-4o.
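A hedged sketch of how battle counts and pairwise win fractions like those in Figure 4 could be aggregated from raw vote records; the record format and the few example rows below are assumptions, not actual arena data.

```python
from collections import defaultdict

# Each record: (model_a, model_b, winner) where winner is "a", "b", or "tie".
battles = [
    ("gpt-4o", "gpt-4-vision", "a"),
    ("gpt-4-vision", "gemini-pro-vision", "a"),
    ("reka-flash", "llava-next", "tie"),
]

battle_count = defaultdict(int)  # total battles per unordered model pair
first_wins = defaultdict(int)    # wins for the alphabetically-first model in the pair

for model_a, model_b, winner in battles:
    pair = tuple(sorted((model_a, model_b)))
    battle_count[pair] += 1
    if winner != "tie":
        winning_model = model_a if winner == "a" else model_b
        if winning_model == pair[0]:
            first_wins[pair] += 1

for pair, total in battle_count.items():
    frac = first_wins[pair] / total
    print(f"{pair[0]} vs {pair[1]}: {total} battles, {pair[0]} win fraction {frac:.0%}")
```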

 

Expert Agreement with User Voting To assess the quality of crowdsourced user voting data on our platform, we evaluated inter-annotator agreement by comparing the annotations of our experts with those from users of the WILDVISION-ARENA.

 

This analysis was conducted on a set of 100 samples.

 

Our findings indicate a substantial level of agreement between arena users and the two experts, with an average percentage agreement of 72.5%. Furthermore, the calculated Cohen’s Kappa coefficient was 0.59, suggesting a moderate to high degree of reliability in the annotations across different annotators.
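A minimal sketch of how percentage agreement and Cohen's Kappa can be computed with scikit-learn; the vote labels below are illustrative placeholders, not the study's annotations.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical verdicts on the same battles from an expert and an arena user.
expert_votes = ["left", "right", "tie", "left", "bad", "right", "left", "tie"]
user_votes   = ["left", "right", "left", "left", "bad", "tie", "left", "tie"]

agreement = sum(e == u for e, u in zip(expert_votes, user_votes)) / len(expert_votes)
kappa = cohen_kappa_score(expert_votes, user_votes)
print(f"percentage agreement: {agreement:.1%}, Cohen's kappa: {kappa:.2f}")
```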

 

2.4 Model Ranking with Elo Rating in WILDVISION-ARENA

 

Following Chatbot Arena [12], we adapt the Elo Rating System [17] to provide a dynamic evaluation platform for ranking VLMs by statistical modeling based on our collected direct pairwise comparisons.

 

We briefly introduce the Online Elo Rating and the statistical estimation method.
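As a pointer, here is a minimal sketch of the online Elo update being referred to, with an illustrative K-factor and starting ratings; the statistical estimation mentioned above corresponds, in Chatbot Arena's setup, to a Bradley-Terry style maximum-likelihood fit over all battles rather than this incremental update.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))

def update_elo(rating_a: float, rating_b: float, outcome_a: float, k: float = 32.0):
    """One online update; outcome_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (outcome_a - e_a)
    new_b = rating_b + k * ((1.0 - outcome_a) - (1.0 - e_a))
    return new_a, new_b

# Example: a 1200-rated model beats a 1150-rated model in one voted battle.
print(update_elo(1200.0, 1150.0, outcome_a=1.0))
```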

3 WILDVISION-BENCH: In-the-Wild Testbed for VLMs

 

Recent VLMs reveal a closing gap with GPT-4V on various benchmarks [101, 102], but this improvement is not always reflected in users’ daily experiences.

 

This discrepancy arises from current models’ limited generalizability compared to proprietary ones, which fixed benchmarks fail to capture.

 

To address this, we propose creating WILDVISION-BENCH, a challenging and natural benchmark for VLMs that reflects real-world human use cases, with models’ rankings aligning closely with the WILDVISION-ARENA leaderboard contributed by diverse crowdsourced user votes.

3.1 Data Curation Pipeline

 

Starting with in-the-wild multimodal conversation data from WILDVISION-ARENA’s users, we apply the NSFW detector [36] on the images to filter out unsafe content.

 

We then perform deduplication on the images and apply diversity sampling to formulate a public set of 500 data samples for WILDVISION-BENCH.
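A hedged sketch of these curation steps under assumed interfaces: the `nsfw_detector` callable, the per-sample fields, and category-balanced sampling as the "diversity sampling" step are all assumptions, not the paper's exact implementation.

```python
import hashlib
import random

def curate(samples, nsfw_detector, target_size=500, seed=0):
    """Safety filter -> image deduplication -> sampling down to a public set."""
    # 1) Drop images flagged as unsafe.
    safe = [s for s in samples if not nsfw_detector(s["image_bytes"])]

    # 2) Deduplicate on raw image bytes (a perceptual hash could be used instead).
    seen, unique = set(), []
    for s in safe:
        digest = hashlib.md5(s["image_bytes"]).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(s)

    # 3) Diversity sampling: spread the picks across question categories.
    by_category = {}
    for s in unique:
        by_category.setdefault(s["category"], []).append(s)
    rng = random.Random(seed)
    picked = []
    while len(picked) < target_size and any(by_category.values()):
        for bucket in by_category.values():
            if bucket and len(picked) < target_size:
                picked.append(bucket.pop(rng.randrange(len(bucket))))
    return picked
```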

 

Our experts manually annotate 50 samples as a preview of a hidden set, which will be updated dynamically to avoid contamination. We showcase the model performance on two cases from expert annotations in Table 3.

 

VLMs as a Local Evaluator


Previous work [107, 35] shows alignment between GPT-4V and humans when evaluating the performance of VLMs.

 

We further validate the agreement of GPT-4V with crowdsourced human preferences in WILDVISION-ARENA to ensure its efficacy in the wild.

 

Specifically, we feed a pair of multimodal conversations, along with the votes, into GPT-4V to select among four choices: 1) left/right vote: the left/right model’s response is better; 2) tie/bad vote: both models are equally good/bad.

 

In Appendix E.3, we provide the detailed prompt template for GPT-4V.
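A hedged sketch of what such a pairwise judging call could look like with the OpenAI Python SDK; the prompt wording, the judge model name, and the verdict labels are placeholders (the actual template is in Appendix E.3), not the authors' implementation.

```python
import base64
from openai import OpenAI

client = OpenAI()
VERDICTS = {"leftvote", "rightvote", "tievote", "bothbad_vote"}  # assumed labels

def judge_pair(image_path: str, question: str, left_answer: str, right_answer: str) -> str:
    """Ask a vision-capable judge model to pick among the four arena-style choices."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    prompt = (
        "Given the image and the user question, compare the two answers.\n"
        f"Question: {question}\n"
        f"Left answer: {left_answer}\n"
        f"Right answer: {right_answer}\n"
        "Reply with exactly one of: leftvote, rightvote, tievote, bothbad_vote."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any vision-capable judge model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        max_tokens=10,
    )
    verdict = response.choices[0].message.content.strip()
    return verdict if verdict in VERDICTS else "tievote"
```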

 

We show the GPT-4V vs Arena Human alignment in Figure 6. We observe that GPT-4V has relatively low agreement with humans on tie votes but shows high agreement with humans when both models