
LLaVA-Video-178K: Video Instruction Tuning with Synthetic Data (paper review)

jinuklee 2024. 10. 9. 03:40

https://llava-vl.github.io/blog/2024-09-30-llava-video/

 


https://arxiv.org/pdf/2410.02713

https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2

The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web.

 

To address this, we consider an alternative approach, creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K.

 

This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA.

 

By training on this proposed dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM.

 

Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.

 

3 VIDEO INSTRUCTION-FOLLOWING DATA SYNTHESIS

 

A high-quality dataset for video instruction-tuning is crucial for developing effective video-language models.

 

We identify a key factor in building such datasets:

 

ensuring richness and diversity in both video content and its language annotations.

 

We perform a comprehensive survey of existing video benchmarks, covering various public video captioning and question-answering datasets, and identify ten unique video sources that contribute to over 40 video-language benchmarks.

 

From each source, we select videos that exhibit significant temporal dynamics.

 

To maintain diversity in the annotations, we establish a pipeline capable of generating detailed captions for videos of any length.

 

Additionally, we define 16 types of questions that guide GPT-4o in creating question-answer pairs to assess the perceptual and reasoning skills of the video-language models.

 

3.2 VIDEO DETAIL DESCRIPTION

 

Automated Generation

---------------------------

For selected videos, we use GPT-4o (OpenAI, 2024) to systematically describe their content.

 

We start by sampling video frames at one frame per second (1 fps).
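As a concrete reference point, here is a minimal sketch of 1 fps sampling with OpenCV; the paper does not state which decoding library is used, so `cv2` and the stride-based selection below are assumptions.

```python
import cv2

def sample_frames_1fps(video_path: str):
    """Decode a video and keep roughly one frame per second (assumed OpenCV-based sketch)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
    stride = max(int(round(fps)), 1)         # keep every `stride`-th frame, i.e. ~1 fps
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```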

 

However, due to the input size constraints of GPT-4o, we cannot use all sampled frames.

 

Instead, we describe the videos sequentially, as shown in Fig. 2. We create descriptions at three distinct levels, detailed below; a sketch of the full recurrent loop follows the list.

• Level-1 Description:

================

Every 10 seconds, we provide a level-1 description that outlines the events in that segment.

 

This description is conditioned on the frames from the current clip and on historical context, which includes all recent level-1 descriptions not yet summarized into a level-2 description, plus the latest level-2 description.

 

• Level-2 Description:

================

Every 30 seconds, we create a level-2 summary of the entire video plot up to that point.

 

This is based on the last three level-1 descriptions (covering the most recent 30 seconds) and the latest level-2 description.

 

• Level-3 Description:

================

At the video’s end, we generate a level-3 description to encapsulate the entire video.

 

The inputs for this description are the recent level-1 descriptions not yet summarized (covering the last moments of the plot after the most recent level-2 summary) and the latest level-2 description.
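Putting the three levels together, the sketch below shows the recurrent structure of the captioning pipeline as described above. The `describe(prompt, frames)` helper is a hypothetical wrapper around a GPT-4o call, and the prompt strings are paraphrases, not the paper's actual prompts.

```python
def caption_video(frames_1fps, describe):
    """Hierarchical captioning sketch: level-1 every 10 s, level-2 every 30 s, level-3 at the end.

    `frames_1fps` is a list of frames sampled at 1 fps; `describe(prompt, frames)` is a
    hypothetical wrapper around a GPT-4o call that returns a text description.
    """
    level1_buffer = []      # level-1 descriptions not yet folded into a level-2 summary
    latest_level2 = ""      # most recent level-2 plot summary

    for start in range(0, len(frames_1fps), 10):          # one 10-second clip per step
        clip = frames_1fps[start:start + 10]
        prompt = (f"Recent level-1 descriptions: {level1_buffer}\n"
                  f"Latest level-2 summary: {latest_level2}\n"
                  "Describe the events in the current 10-second clip.")
        level1_buffer.append(describe(prompt, clip))

        if len(level1_buffer) == 3:                        # every 30 seconds
            prompt = (f"Last three level-1 descriptions: {level1_buffer}\n"
                      f"Previous level-2 summary: {latest_level2}\n"
                      "Summarize the entire video plot up to this point.")
            latest_level2 = describe(prompt, frames=None)  # text-only summary step
            level1_buffer = []

    # Level-3: final description encapsulating the entire video
    prompt = (f"Level-1 descriptions after the last summary: {level1_buffer}\n"
              f"Latest level-2 summary: {latest_level2}\n"
              "Write a detailed description of the entire video.")
    return describe(prompt, frames=None)
```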

3.3 VIDEO QUESTION ANSWERING

 

Question Type definition

-----------------------------------

In addition to detailed video descriptions, our dataset includes a variety of question-answer pairs designed for complex interactions.

 

This setup improves the video understanding model’s ability to handle real-life queries.

 

We refer to public video question-answering benchmarks (Xiao et al., 2021; Yu et al., 2019; Khattak et al., 2024; Liu et al., 2024b) to organize these questions into 16 specific categories, as shown in Fig. 3.

 

Automated Generation

---------------------------

Given a detailed video description, we use GPT-4o to generate at most one question-answer pair for each type of question.

 

The prompts include the following (a sketch of the prompt assembly follows this list):

 

(1) The task definition for the current question type.

 

(2) In-context examples for this type: three video descriptions and their corresponding question-answer pairs of this specific type.

 

(3) The detailed video description for the current video. We instruct GPT-4o to return None if it cannot generate question-answer pairs for a specific question type.
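A minimal sketch of this per-type generation step, assuming a hypothetical `call_gpt4o(prompt)` wrapper and a `question_types` mapping from each of the 16 types to its task definition and three in-context examples; the prompt wording here is illustrative, not the paper's.

```python
def generate_qa_pairs(video_description, question_types, call_gpt4o):
    """For each question type, ask GPT-4o for at most one QA pair; 'None' means skip this type."""
    qa_pairs = {}
    for qtype, spec in question_types.items():
        examples = "\n\n".join(
            f"Description: {d}\nQ: {q}\nA: {a}" for d, q, a in spec["examples"]  # 3 in-context examples
        )
        prompt = (
            f"Task: {spec['definition']}\n\n"          # (1) task definition for this question type
            f"{examples}\n\n"                          # (2) three in-context examples of this type
            f"Description: {video_description}\n"      # (3) detailed description of the current video
            "Generate one question-answer pair of this type, or return None if impossible."
        )
        response = call_gpt4o(prompt)
        if response.strip().lower() != "none":
            qa_pairs[qtype] = response
    return qa_pairs
```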

 

Filtering.

---------------------------------

To filter the generated question-answer pairs, we apply the following strategy (sketched in code below):

 

(1) remove duplicates using the sentence-transformer (Reimers & Gurevych, 2020),

 

(2) discard answers that begin with phrases like “does not specify,” “does not mention,” “does not specifically,” “does not depict,” or “does not show.”
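A sketch of this filtering step with the sentence-transformers library; the model name and the 0.95 similarity threshold are assumptions rather than values reported in the paper, and the phrase check is applied anywhere in the answer rather than strictly at its start.

```python
from sentence_transformers import SentenceTransformer, util

# Phrases listed in the paper that signal the description lacks the queried information.
BAD_PHRASES = ("does not specify", "does not mention", "does not specifically",
               "does not depict", "does not show")

def filter_qa_pairs(qa_pairs, sim_threshold=0.95):
    """qa_pairs: list of (question, answer) strings. Model choice and threshold are assumptions."""
    if not qa_pairs:
        return []

    # (1) remove near-duplicate questions via sentence-transformer embeddings
    model = SentenceTransformer("all-MiniLM-L6-v2")        # assumed model choice
    emb = model.encode([q for q, _ in qa_pairs], convert_to_tensor=True)
    sims = util.cos_sim(emb, emb)
    kept = []
    for i in range(len(qa_pairs)):
        if all(float(sims[i][j]) < sim_threshold for j in kept):
            kept.append(i)

    # (2) discard answers that dodge the question with "does not ..." phrases
    return [qa_pairs[i] for i in kept
            if not any(p in qa_pairs[i][1].lower() for p in BAD_PHRASES)]
```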