LLaVA-Video: Video Instruction Tuning with Synthetic Data
Blog: https://llava-vl.github.io/blog/2024-09-30-llava-video/
Paper: https://arxiv.org/pdf/2410.02713
Model: https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2

We fine-tune LLaVA-OneVision (SI) on a joint dataset of video and image data. Specifically, we add video data from the LLaVA-Video-178K dataset and four public datasets: ActivityNet-QA, NExT-QA, PerceptionTest, and LLaVA-Hound-255K.

The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web.
To address this, we consider an alternative approach: creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K.
This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA.
By training on this proposed dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM.
Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.
3 VIDEO INSTRUCTION-FOLLOWING DATA SYNTHESIS
A high-quality dataset for video instruction-tuning is crucial for developing effective video-language models.
We identify a key factor in building such datasets:
ensuring richness and diversity in both video content and its language annotations.
We conduct a comprehensive survey of existing video benchmarks, covering various public video captioning and question-answering datasets, and identify ten unique video sources that contribute to over 40 video-language benchmarks.
From each source, we select videos that exhibit significant temporal dynamics.
To maintain diversity in the annotations, we establish a pipeline capable of generating detailed captions for videos of any length.
Additionally, we define 16 types of questions that guide GPT-4o in creating question-answer pairs to assess the perceptual and reasoning skills of the video-language models.
3.2 VIDEO DETAIL DESCRIPTION
Automated Generation
---------------------------
For selected videos, we use GPT-4o (OpenAI, 2024) to systematically describe their content.
We start by sampling video frames at one frame per second (fps).
However, due to the input size constraints of GPT-4o, we cannot use all sampled frames.
Instead, we describe the videos sequentially, as shown in Fig. 2. We create descriptions at three distinct levels, detailed below.

• Level-1 Description: Every 10 seconds, we provide a level-1 description that outlines the events in that segment. Its inputs are the frames from the current clip and the historical context, i.e., all recent level-1 descriptions not yet summarized into a level-2 description, plus the latest level-2 description.
• Level-2 Description: Every 30 seconds, we create a level-2 summary of the entire video plot up to that point. Its inputs are the last three level-1 descriptions, covering the most recent 30 seconds, and the latest level-2 description.
• Level-3 Description: At the video's end, we generate a level-3 description that encapsulates the entire video. Its inputs are the recent level-1 descriptions not yet summarized, covering the last moments of the plot after the most recent level-2 summary, and the latest level-2 description.
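The loop below is a minimal sketch of this recurrent captioning scheme. It assumes two hypothetical helpers, `sample_frames` (samples a video at 1 fps) and `call_gpt4o` (sends frames plus a text prompt to GPT-4o and returns text); the prompt wording is illustrative, not the exact prompts used in the pipeline.

```python
# Minimal sketch of the three-level recurrent captioning loop described above.
# `sample_frames` and `call_gpt4o` are hypothetical helpers: the first samples a
# video at 1 fps, the second sends frames + a text prompt to GPT-4o and returns text.
from typing import List

CLIP_SECONDS = 10          # one level-1 description per 10-second clip
CLIPS_PER_SUMMARY = 3      # one level-2 summary per 30 seconds (3 clips)

def describe_video(video_path: str) -> str:
    frames = sample_frames(video_path, fps=1)          # one frame per second
    clips = [frames[i:i + CLIP_SECONDS]
             for i in range(0, len(frames), CLIP_SECONDS)]

    pending_level1: List[str] = []   # level-1 captions not yet summarized
    latest_level2 = ""               # most recent level-2 plot summary

    for clip in clips:
        # Level-1: describe the current 10-second clip, conditioned on the
        # un-summarized level-1 captions and the latest level-2 summary.
        level1 = call_gpt4o(
            images=clip,
            prompt=f"Recent events: {pending_level1}\n"
                   f"Plot so far: {latest_level2}\n"
                   "Describe what happens in these frames.")
        pending_level1.append(level1)

        # Level-2: every 30 seconds, fold the last three level-1 captions and
        # the previous level-2 summary into an updated plot summary.
        if len(pending_level1) == CLIPS_PER_SUMMARY:
            latest_level2 = call_gpt4o(
                images=[],
                prompt=f"Previous summary: {latest_level2}\n"
                       f"Last 30 seconds: {pending_level1}\n"
                       "Summarize the plot up to this point.")
            pending_level1 = []

    # Level-3: at the end, combine whatever level-1 captions remain with the
    # latest level-2 summary into a description of the entire video.
    return call_gpt4o(
        images=[],
        prompt=f"Plot summary: {latest_level2}\n"
               f"Final moments: {pending_level1}\n"
               "Write a detailed description of the whole video.")
```

Because each call only ever sees the current 10-second clip plus compact summaries, the scheme stays within GPT-4o's input limits regardless of video length.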

3.3 VIDEO QUESTION ANSWERING
Question Type Definition
-----------------------------------
In addition to detailed video descriptions, our dataset includes a variety of question-answer pairs designed for complex interactions.
This setup improves the video understanding model’s ability to handle real-life queries.
We refer to public video question-answering benchmarks (Xiao et al., 2021; Yu et al., 2019; Khattak et al., 2024; Liu et al., 2024b) to organize these questions into 16 specific categories, as shown in Fig. 3.
Automated Generation
---------------------------
Given a detailed video description, we use GPT-4o to generate at most one question-answer pair for each type of question.
The prompts include:
(1) The task definition for the current question type.
(2) In-context examples for this type, consisting of three video descriptions and their corresponding question-answer pairs of this specific type.
(3) The detailed video description for the current video. We instruct GPT-4o to return None if it cannot generate question-answer pairs for a specific question type.
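The snippet below sketches how such a per-type prompt could be assembled. The structure of `question_types`, the prompt wording, and the `call_gpt4o` helper are illustrative assumptions rather than the exact prompts used.

```python
# Sketch of per-type question-answer generation from a detailed video description.
# `question_types` maps a type name to its task definition and three in-context
# examples (description, question, answer); contents here are placeholders.
def generate_qa(description: str, question_types: dict) -> dict:
    qa_pairs = {}
    for qtype, spec in question_types.items():
        examples = "\n\n".join(
            f"Description: {d}\nQ: {q}\nA: {a}"
            for d, q, a in spec["examples"])             # three in-context examples
        prompt = (
            f"Task: {spec['definition']}\n\n"            # (1) task definition
            f"{examples}\n\n"                            # (2) in-context examples
            f"Description: {description}\n"              # (3) current video description
            "Generate one question-answer pair of this type, "
            "or return None if it is not possible.")
        response = call_gpt4o(images=[], prompt=prompt)  # hypothetical helper
        if response.strip().lower() != "none":
            qa_pairs[qtype] = response
    return qa_pairs
```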
Filtering.
---------------------------------
To filter the generated question-answer pairs, we apply the following strategies:
(1) remove duplicate pairs using a sentence transformer (Reimers & Gurevych, 2020),
(2) discard answers that begin with phrases like “does not specify,” “does not mention,” “does not specifically,” “does not depict,” or “does not show.”
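A minimal sketch of this two-step filter is given below, assuming the sentence-transformers library; the specific model name (all-MiniLM-L6-v2) and the 0.95 cosine-similarity threshold are assumptions for illustration, not values reported by the authors.

```python
# Sketch of the two filtering steps: a phrase blacklist for non-grounded answers
# and duplicate removal with a sentence transformer.
# Model name and similarity threshold below are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

BAD_PREFIXES = ("does not specify", "does not mention", "does not specifically",
                "does not depict", "does not show")

def filter_qa(qa_pairs):
    """qa_pairs: list of (question, answer) tuples for one video."""
    # (2) drop answers that hedge about missing information
    qa_pairs = [(q, a) for q, a in qa_pairs
                if not a.strip().lower().startswith(BAD_PREFIXES)]
    if not qa_pairs:
        return []

    # (1) remove duplicate questions via embedding similarity
    model = SentenceTransformer("all-MiniLM-L6-v2")
    questions = [q for q, _ in qa_pairs]
    emb = model.encode(questions, convert_to_tensor=True)
    sim = util.cos_sim(emb, emb)

    kept = []
    for i in range(len(qa_pairs)):
        # keep a pair only if it is not too similar to an earlier kept question
        if all(sim[i][j] < 0.95 for j in kept):
            kept.append(i)
    return [qa_pairs[i] for i in kept]
```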