inference-time, RLHF (41)

FGAIF: Aligning Large Vision-Language Models with Fine-grained AI Feedback (Paper Review)

https://arxiv.org/pdf/2404.05046v1
Large Vision-Language Models (LVLMs) have demonstrated proficiency in tackling a variety of visual-language tasks. However, current LVLMs suffer from misalignment between the text and image modalities, which causes three kinds of hallucination problems, i.e., object existence, object attribute, and object relationship. To tackle this issue, existing methods mainly ut…

GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models (Paper Review)

https://arxiv.org/pdf/2410.06154
In this work, we propose a novel method (GLOV) enabling Large Language Models (LLMs) to act as implicit optimizers for Vision-Language Models (VLMs) to enhance downstream vision tasks. Our GLOV meta-prompts an LLM with the downstream task description, querying it for suitable VLM prompts (e.g., for zero-shot classification with CLIP). These prompts are ranked accor…
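The ranking step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the fitness function and the accuracy numbers are hypothetical stand-ins for, e.g., zero-shot accuracy of CLIP on a held-out set when using each candidate prompt as the text template.

```python
def rank_prompts(candidates, fitness):
    """Order candidate VLM prompts by a task fitness score, best first.
    `fitness` is a hypothetical callable mapping a prompt string to a
    scalar score (e.g. held-out zero-shot accuracy with that prompt)."""
    return sorted(candidates, key=fitness, reverse=True)

# Toy usage with made-up accuracies (illustrative only, not from the paper):
accuracy = {
    "a photo of a {}": 0.61,
    "a blurry photo of a {}": 0.48,
    "a close-up photo of the {}, a type of object": 0.66,
}
ranked = rank_prompts(list(accuracy), accuracy.get)
```

The top-ranked prompts would then be fed back into the LLM's meta-prompt to guide the next round of prompt proposals.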

DPO for LMMs: Aligning Modalities in Vision Large Language Models via Preference Fine-tuning (Paper Review)

https://arxiv.org/abs/2402.11411
Instruction-following Vision Large Language Models (VLLMs) have achieved significant progress recently on a variety of tasks. These approaches merge strong pre-trained vision models and large language models (LLMs). Since these components are trained separately, the learned representations need to be aligned with joint training on additional image-language pairs. …

[CVPR 2024] Rich Human Feedback for Text-to-Image Generation (Paper Review)

https://arxiv.org/pdf/2312.10240
Recent Text-to-Image (T2I) generation models such as Stable Diffusion and Imagen have made significant progress in generating high-resolution images based on text descriptions. However, many generated images still suffer from issues such as artifacts/implausibility, misalignment with text descriptions, and low aesthetic quality. Inspired by the success of Reinforc…

LLaVA-Video-178K: Video Instruction Tuning With Synthetic Data (Paper Review)

https://llava-vl.github.io/blog/2024-09-30-llava-video/
We fine-tune LLaVA-OneVision (SI) on the joint dataset of video and image data. Specifically, we added video data from the LLaVA-Video-178K dataset and four public datasets: ActivityNet-QA, NExT-QA, PerceptionTest, and LLaVA-Hound-255K, focusing on videos…

LMM-as-a-Judge / PROMETHEUS-VISION: Vision-Language Model as a Judge for Fine-Grained Evaluation (Paper Review)

https://arxiv.org/pdf/2401.06591
Assessing long-form responses generated by Vision-Language Models (VLMs) is challenging. It not only requires checking whether the VLM follows the given instruction but also verifying whether the text output is properly grounded on the given image. Inspired by the recent approach of evaluating LMs with LMs, in this work, we propose to evaluate VLMs with VLMs. For …

Calibrated Self-Rewarding Vision Language Models (Paper Review)

https://arxiv.org/pdf/2405.14622
Summary of the reward scheme: the reward combines (1) a self-generated instruction-following score, calculated using the language decoder of the LVLM (this score alone is insufficient because of modality misalignment: it can overlook the visual input), and (2) an image-response relevance score, R^I(s), computed with the CLIP-score [17]. 3 Calibrated Self-Rewarding Vision Language Models To address t…
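The two-part reward above can be sketched as a small calculation. This is a hedged sketch under assumptions: the snippet does not give the exact combination rule, so the linear mix with weight `alpha` is hypothetical, cosine similarity stands in for the actual CLIP-score, and mean token log-probability stands in for the decoder's instruction-following score.

```python
import math

def language_score(token_logprobs):
    """Instruction-following score: mean token log-probability of the
    response under the LVLM's language decoder (placeholder input)."""
    return sum(token_logprobs) / len(token_logprobs)

def clip_relevance(image_emb, text_emb):
    """Image-response relevance R^I(s): cosine similarity between
    image and response embeddings, a stand-in for the CLIP-score."""
    dot = sum(a * b for a, b in zip(image_emb, text_emb))
    norm_i = math.sqrt(sum(a * a for a in image_emb))
    norm_t = math.sqrt(sum(b * b for b in text_emb))
    return dot / (norm_i * norm_t)

def calibrated_reward(token_logprobs, image_emb, text_emb, alpha=0.5):
    """Hypothetical linear combination of the two scores; the paper's
    actual calibration rule may differ."""
    return (alpha * language_score(token_logprobs)
            + (1 - alpha) * clip_relevance(image_emb, text_emb))
```

The point of the second term is exactly the failure mode the summary names: a decoder-only self-reward can score a fluent but visually ungrounded response highly, while the relevance term pulls the reward back toward the image.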