All Posts (251)

[CVPR 2024] Rich Human Feedback for Text-to-Image Generation Paper Review

https://arxiv.org/pdf/2312.10240
Recent Text-to-Image (T2I) generation models such as Stable Diffusion and Imagen have made significant progress in generating high-resolution images based on text descriptions. However, many generated images still suffer from issues such as artifacts/implausibility, misalignment with text descriptions, and low aesthetic quality. Inspired by the success of Reinforc..

LLaVA-Video-178K: Video Instruction Tuning With Synthetic Data Paper Review

https://llava-vl.github.io/blog/2024-09-30-llava-video/
We fine-tune LLaVA-OneVision (SI) on the joint dataset of video and image data. Specifically, we added video data from the LLaVA-Video-178K dataset and four public datasets: ActivityNet-QA, NExT-QA, PerceptionTest, and LLaVA-Hound-255K, focusing on videos..

LMM-as-a-Judge / PROMETHEUS-VISION: Vision-Language Model as a Judge for Fine-Grained Evaluation Paper Review

https://arxiv.org/pdf/2401.06591
Assessing long-form responses generated by Vision-Language Models (VLMs) is challenging. It not only requires checking whether the VLM follows the given instruction but also verifying whether the text output is properly grounded on the given image. Inspired by the recent approach of evaluating LMs with LMs, in this work, we propose to evaluate VLMs with VLMs. For ..
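To make the judge setup above concrete, here is a minimal sketch of building a rubric-style judge prompt and parsing the final score. The field layout and the "[RESULT] n" convention are modeled on Prometheus-style rubric templates; the exact prompt text and rubric wording are assumptions for illustration, not the released PROMETHEUS-VISION assets.

```python
import re

# Illustrative rubric-style judge prompt (field layout is an assumption
# modeled on Prometheus-style templates, not the official release).
JUDGE_TEMPLATE = """###Task Description:
An instruction, a response to evaluate, a reference answer, and a score rubric are given.
1. Write detailed feedback assessing the response strictly by the rubric.
2. After the feedback, give an integer score between 1 and 5.
3. End your answer with: [RESULT] <score>

###Instruction: {instruction}
###Response to evaluate: {response}
###Reference Answer: {reference}
###Score Rubric: {rubric}
###Feedback:"""

def build_judge_prompt(instruction, response, reference, rubric):
    """Assemble the text prompt; the image is passed to the judge VLM separately."""
    return JUDGE_TEMPLATE.format(instruction=instruction, response=response,
                                 reference=reference, rubric=rubric)

def parse_score(judge_output):
    """Extract the 1-5 score from the judge's '[RESULT] n' suffix, if present."""
    match = re.search(r"\[RESULT\]\s*([1-5])", judge_output)
    return int(match.group(1)) if match else None
```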

Calibrated Self-Rewarding Vision Language Models Paper Review

https://arxiv.org/pdf/2405.14622
Summary of the reward design: a self-generated instruction-following score (calculated using the language decoder of the LVLM; this alone is insufficient because of modality misalignment, potentially overlooking visual input information) combined with the image-response relevance score R^I(s), computed with the CLIP score [17]. Section 3, Calibrated Self-Rewarding Vision Language Models: To address t..
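A minimal sketch of the reward composition summarized above: the LVLM's self-generated instruction-following score is blended with a CLIP-based image-response relevance score R^I(s). Using open_clip and a fixed mixing weight alpha are assumptions for illustration; the paper applies these scores inside its own calibrated, sentence-level rewarding procedure.

```python
import torch
import open_clip

# Load an off-the-shelf CLIP model for the image-response relevance score.
# (Checkpoint choice is illustrative, not the paper's exact setup.)
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

def clip_relevance(image, response_text):
    """R^I(s): cosine similarity between the input image and the response text."""
    with torch.no_grad():
        img_feat = model.encode_image(preprocess(image).unsqueeze(0))
        txt_feat = model.encode_text(tokenizer([response_text]))
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        return (img_feat @ txt_feat.T).item()

def calibrated_reward(self_score, image, response_text, alpha=0.5):
    """Blend the LVLM's self-generated score with the CLIP relevance score."""
    return alpha * self_score + (1 - alpha) * clip_relevance(image, response_text)
```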

Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward

https://arxiv.org/pdf/2404.01258v2
Preference modeling techniques, such as direct preference optimization (DPO), have proven effective in enhancing the generalization abilities of large language models (LLMs). However, in tasks involving video instruction following, providing informative feedback, especially for detecting hallucinations in generated responses, remains a significant challenge. Previous..
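Since DPO is the core technique here, a minimal sketch of the standard DPO objective may help. It is written over summed per-token log-probabilities; the variable names and beta value are illustrative, not taken from this paper's code.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: push the policy to prefer the chosen response over
    the rejected one, measured relative to a frozen reference model."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```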