
LOOKING INWARD: LANGUAGE MODELS CAN LEARN ABOUT THEMSELVES BY INTROSPECTION (Paper Review)

https://arxiv.org/pdf/2410.13787
Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. Can LLMs introspect? We define introspection as acquiring knowledge that is not contained in or derived from training data but ..

Uncategorized 2024.10.19

wildvision-arena WILDVISION: Evaluating Vision-Language Models in the Wild with Human Preference (Paper Review)

https://arxiv.org/pdf/2406.11069
Recent breakthroughs in vision-language models (VLMs) emphasize the necessity of benchmarking human preferences in real-world multimodal interactions. To address this gap, we launched WILDVISION-ARENA (WV-ARENA), an online platform that collects human preferences to evaluate VLMs. We curated WVBENCH by selecting 500 high-quality samples from 8,000 user submissions..

Uncategorized 2024.10.15

MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities (Paper Review)

https://arxiv.org/pdf/2408.00765
https://huggingface.co/spaces/whyu/MM-Vet-v2_Evaluator
MM-Vet, with open-ended vision-language questions targeting at evaluating integrated capabilities, has become one of the most popular benchmarks for large multimodal model evaluation. MM-Vet assesses six core vision-language (VL) capabilities: r..

Uncategorized 2024.10.15

MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark (Paper Review)

https://arxiv.org/abs/2402.04788
Multimodal Large Language Models (MLLMs) have gained significant attention recently, showing remarkable potential in artificial general intelligence. However, assessing the utility of MLLMs presents considerable challenges, primarily due to the absence of ..

GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks (Paper Review)

https://arxiv.org/abs/2311.01361
Automatically evaluating vision-language tasks is challenging, especially when it comes to reflecting human judgments due to limitations in accounting for fine-grained details. Although GPT-4V has shown promising results in various multimodal tasks, leveraging GPT-4V as a generalist evaluator for these tasks has not yet been systematically explored. We comprehensiv..