LLaVA-Video: Video Instruction Tuning with Synthetic Data
Blog: https://llava-vl.github.io/blog/2024-09-30-llava-video/
Paper: https://arxiv.org/pdf/2410.02713
Model: https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2

We fine-tune LLaVA-OneVision (SI) on a joint dataset of video and image data. Specifically, we add video data from the LLaVA-Video-178K dataset and four public datasets: ActivityNet-QA, NExT-QA, PerceptionTest, and LLaVA-Hound-255K.

The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web.
To address this, we consider an alternative approach: creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K.
This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA.
By training on this proposed dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM.
Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.
3 VIDEO INSTRUCTION-FOLLOWING DATA SYNTHESIS
A high-quality dataset for video instruction-tuning is crucial for developing effective video-language models.
We identify a key factor in building such datasets:
ensuring richness and diversity in both video content and its language annotations.
We conduct a comprehensive survey of existing video benchmarks, covering various public video captioning and question-answering datasets, and identify ten unique video sources that contribute to over 40 video-language benchmarks.
From each source, we select videos that exhibit significant temporal dynamics.
To maintain diversity in the annotations, we establish a pipeline capable of generating detailed captions for videos of any length.
Additionally, we define 16 types of questions that guide GPT-4o in creating question-answer pairs to assess the perceptual and reasoning skills of the video-language models.
3.2 VIDEO DETAIL DESCRIPTION
Automated Generation
---------------------------
For selected videos, we use GPT-4o (OpenAI, 2024) to systematically describe their content.
We start by sampling video frames at one frame per second (fps).
However, due to the input size constraints of GPT-4o, we cannot use all sampled frames.
Instead, we describe the videos sequentially, as shown in Fig. 2. We create descriptions at three distinct levels, detailed below.

• Level-1 Description: Every 10 seconds, we provide a level-1 description that outlines the events in that segment. Its inputs are the frames from the current clip and the historical context, i.e., all recent level-1 descriptions not yet summarized into a level-2 description, plus the latest level-2 description.
• Level-2 Description: Every 30 seconds, we create a level-2 summary of the entire video plot up to that point. Its inputs are the last three level-1 descriptions, covering the most recent 30 seconds, and the latest level-2 description.
• Level-3 Description: At the video's end, we generate a level-3 description that encapsulates the entire video. Its inputs are the recent level-1 descriptions not yet summarized, covering the last moments of the plot after the most recent level-2 summary, and the latest level-2 description.
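The loop below is a minimal sketch of this recurrent captioning scheme. It assumes two hypothetical helpers, `sample_frames` (samples a video at 1 fps) and `call_gpt4o` (sends frames plus a text prompt to GPT-4o and returns text); the prompt wording is illustrative, not the exact prompts used in the pipeline.

```python
# Minimal sketch of the three-level recurrent captioning loop described above.
# `sample_frames` and `call_gpt4o` are hypothetical helpers: the first samples a
# video at 1 fps, the second sends frames + a text prompt to GPT-4o and returns text.
from typing import List

CLIP_SECONDS = 10          # one level-1 description per 10-second clip
CLIPS_PER_SUMMARY = 3      # one level-2 summary per 30 seconds (3 clips)

def describe_video(video_path: str) -> str:
    frames = sample_frames(video_path, fps=1)          # one frame per second
    clips = [frames[i:i + CLIP_SECONDS]
             for i in range(0, len(frames), CLIP_SECONDS)]

    pending_level1: List[str] = []   # level-1 captions not yet summarized
    latest_level2 = ""               # most recent level-2 plot summary

    for clip in clips:
        # Level-1: describe the current 10-second clip, conditioned on the
        # un-summarized level-1 captions and the latest level-2 summary.
        level1 = call_gpt4o(
            images=clip,
            prompt=f"Recent events: {pending_level1}\n"
                   f"Plot so far: {latest_level2}\n"
                   "Describe what happens in these frames.")
        pending_level1.append(level1)

        # Level-2: every 30 seconds, fold the last three level-1 captions and
        # the previous level-2 summary into an updated plot summary.
        if len(pending_level1) == CLIPS_PER_SUMMARY:
            latest_level2 = call_gpt4o(
                images=[],
                prompt=f"Previous summary: {latest_level2}\n"
                       f"Last 30 seconds: {pending_level1}\n"
                       "Summarize the plot up to this point.")
            pending_level1 = []

    # Level-3: at the end, combine whatever level-1 captions remain with the
    # latest level-2 summary into a description of the entire video.
    return call_gpt4o(
        images=[],
        prompt=f"Plot summary: {latest_level2}\n"
               f"Final moments: {pending_level1}\n"
               "Write a detailed description of the whole video.")
```

Because each call only ever sees the current 10-second clip plus compact summaries, the scheme stays within GPT-4o's input limits regardless of video length.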

3.3 VIDEO QUESTION ANSWERING
Question Type Definition
-----------------------------------
In addition to detailed video descriptions, our dataset includes a variety of question-answer pairs designed for complex interactions.
This setup improves the video understanding model’s ability to handle real-life queries.
We refer to public video question-answering benchmarks (Xiao et al., 2021; Yu et al., 2019; Khattak et al., 2024; Liu et al., 2024b) to organize these questions into 16 specific categories, as shown in Fig. 3.
Automated Generation
---------------------------
Given a detailed video description, we use GPT-4o to generate at most one question-answer pair for each type of question.
The prompts include:
(1) The task definition for the current question type.
(2) In-context examples for this type, consisting of three video descriptions and their corresponding question-answer pairs of this specific type.
(3) The detailed video description for the current video. We instruct GPT-4o to return None if it cannot generate question-answer pairs for a specific question type.
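The snippet below sketches how such a per-type prompt could be assembled. The structure of `question_types`, the prompt wording, and the `call_gpt4o` helper are illustrative assumptions rather than the exact prompts used.

```python
# Sketch of per-type question-answer generation from a detailed video description.
# `question_types` maps a type name to its task definition and three in-context
# examples (description, question, answer); contents here are placeholders.
def generate_qa(description: str, question_types: dict) -> dict:
    qa_pairs = {}
    for qtype, spec in question_types.items():
        examples = "\n\n".join(
            f"Description: {d}\nQ: {q}\nA: {a}"
            for d, q, a in spec["examples"])             # three in-context examples
        prompt = (
            f"Task: {spec['definition']}\n\n"            # (1) task definition
            f"{examples}\n\n"                            # (2) in-context examples
            f"Description: {description}\n"              # (3) current video description
            "Generate one question-answer pair of this type, "
            "or return None if it is not possible.")
        response = call_gpt4o(images=[], prompt=prompt)  # hypothetical helper
        if response.strip().lower() != "none":
            qa_pairs[qtype] = response
    return qa_pairs
```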
Filtering.
---------------------------------
To filter the generated question-answer pairs, we apply the following strategies:
(1) remove duplicate pairs using a sentence transformer (Reimers & Gurevych, 2020),
(2) discard answers that begin with phrases like “does not specify,” “does not mention,” “does not specifically,” “does not depict,” or “does not show.”
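A minimal sketch of this two-step filter is given below, assuming the sentence-transformers library; the specific model name (all-MiniLM-L6-v2) and the 0.95 cosine-similarity threshold are assumptions for illustration, not values reported by the authors.

```python
# Sketch of the two filtering steps: a phrase blacklist for non-grounded answers
# and duplicate removal with a sentence transformer.
# Model name and similarity threshold below are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

BAD_PREFIXES = ("does not specify", "does not mention", "does not specifically",
                "does not depict", "does not show")

def filter_qa(qa_pairs):
    """qa_pairs: list of (question, answer) tuples for one video."""
    # (2) drop answers that hedge about missing information
    qa_pairs = [(q, a) for q, a in qa_pairs
                if not a.strip().lower().startswith(BAD_PREFIXES)]
    if not qa_pairs:
        return []

    # (1) remove duplicate questions via embedding similarity
    model = SentenceTransformer("all-MiniLM-L6-v2")
    questions = [q for q, _ in qa_pairs]
    emb = model.encode(questions, convert_to_tensor=True)
    sim = util.cos_sim(emb, emb)

    kept = []
    for i in range(len(qa_pairs)):
        # keep a pair only if it is not too similar to an earlier kept question
        if all(sim[i][j] < 0.95 for j in kept):
            kept.append(i)
    return [qa_pairs[i] for i in kept]
```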