
Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward

jinuklee 2024. 10. 5. 02:23

https://arxiv.org/pdf/2404.01258v2 

Preference modeling techniques, such as direct preference optimization (DPO), have proven effective in enhancing the generalization abilities of large language models (LLMs).

 

However, in tasks involving video instruction following, providing informative feedback, especially for detecting hallucinations in generated responses, remains a significant challenge.

 

Previous studies have explored using large multimodal models (LMMs) as reward models to guide preference modeling, but their ability to accurately assess the factuality of generated responses against the corresponding videos has not been conclusively established.

 

This paper introduces a novel framework that utilizes detailed video captions as a proxy for video content, enabling language models to incorporate this information as supporting evidence for scoring video question answering (QA) predictions.

 

3. Method

As shown in fig. 1, our methodology enhances video LMM alignment through the DPO method, using rewards from a language model.
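For reference, the standard DPO objective (Rafailov et al., 2023) that this alignment step builds on can be written as

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]$$

where x is the video-plus-question input, y_w and y_l are the preferred and rejected responses (chosen by the language-model reward described in section 3.3), π_ref is the frozen SFT model, and β controls how far the policy π_θ may drift from that reference.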

 

----------------------------------------

We elaborate on constructing a video caption dataset in section 3.1.

----------------------------------------

3.1 Prompting GPT-4V Model for Detailed Video Caption Distillation

The selected dataset includes videos from three sources: the WebVid, VIDAL, and ActivityNet datasets.

 

To accommodate the requirement that GPT-4V only takes images as input, we preprocess each video by uniformly extracting ten frames.

 

These frames are then concatenated into a sequence to serve as a proxy for the video.
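A minimal sketch of this preprocessing step, assuming OpenCV is available; the `sample_frames` helper is hypothetical, not from the paper's released code:

```python
import base64

import cv2  # pip install opencv-python


def sample_frames(video_path: str, num_frames: int = 10) -> list[bytes]:
    """Uniformly sample `num_frames` frames from a video and JPEG-encode them."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole clip.
    indices = [int(i * (total - 1) / (num_frames - 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            continue
        ok, buf = cv2.imencode(".jpg", frame)
        if ok:
            frames.append(buf.tobytes())
    cap.release()
    return frames


# The resulting frame sequence acts as a proxy for the video when prompting GPT-4V.
frames = sample_frames("example_video.mp4", num_frames=10)
frames_b64 = [base64.b64encode(f).decode("utf-8") for f in frames]
```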

 

This sequence is then fed into GPT-4V to generate a coherent caption for the represented video based on the frame sequence.

 

The prompt adheres to guidelines covering temporal dynamics, world knowledge, object attributes, spatial relationships, aesthetic assessments, etc., with the goal of comprehensively understanding the video contents.
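A hedged sketch of the caption-distillation call, reusing `frames_b64` from the sampling sketch above; the model name, prompt wording, and parameters are placeholders, not the paper's exact settings:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()

# Placeholder prompt; the paper's actual prompt follows guidelines covering temporal
# dynamics, world knowledge, object attributes, spatial relationships, aesthetics, etc.
CAPTION_PROMPT = (
    "These are ten frames sampled uniformly from one video, in temporal order. "
    "Write a single detailed caption describing the video: cover temporal dynamics, "
    "relevant world knowledge, object attributes, spatial relationships, and aesthetics."
)

content = [{"type": "text", "text": CAPTION_PROMPT}]
for b64 in frames_b64:  # frames_b64 produced by the sampling sketch above
    content.append(
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
    )

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed GPT-4V endpoint, not stated in the post
    messages=[{"role": "user", "content": content}],
    max_tokens=512,
)
detailed_caption = response.choices[0].message.content
```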

----------------------------------------

Subsequently, in section 3.2, we discuss the generation of video instruction data and the fine-tuning process of our model.

----------------------------------------

To generate video instruction-following data for SFT, we adopt a methodology similar to the one outlined in Video-ChatGPT (Li et al., 2023b).

 

Specifically, we first randomly sample 20k, 30k, and 30k captions from our dataset for ActivityNet, WebVid, and VIDAL, respectively, and then employ ChatGPT to generate three question-answer pairs for each detailed video caption, resulting in a total of 240k instruction data points for fine-tuning.

 

This approach ensures that the instructional data remains factually consistent with the content of the detailed captions. The specific prompting strategy used for this instruction generation process is detailed in fig. 13.
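A rough sketch of this instruction-generation step; the prompt is only a stand-in for the one in fig. 13, and `gpt-3.5-turbo` is assumed as the "ChatGPT" model:

```python
import json

QA_GEN_PROMPT = (
    "Given the detailed video caption below, write three question-answer pairs that can "
    "be answered using only the caption. Return a JSON list of objects with keys "
    "'question' and 'answer'.\n\nCaption: {caption}"
)


def generate_qa_pairs(caption: str) -> list[dict]:
    """Ask ChatGPT for three QA pairs grounded in a detailed caption."""
    response = client.chat.completions.create(  # `client` from the captioning sketch
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": QA_GEN_PROMPT.format(caption=caption)}],
    )
    # Assumes the model complies with the JSON format requested in the prompt.
    return json.loads(response.choices[0].message.content)
```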

----------------------------------------

Lastly, section 3.3 details the incorporation of the generated captions as a feedback mechanism for the DPO method to refine our model's factual alignment in the video instruction-following task.

----------------------------------------

Acquiring high-quality preference data is both costly and labor-intensive.

 

Although GPT-4V is an effective model for reward distillation, its high cost, slow performance, and limited accessibility hinder scalability, especially for video inputs with multiple frames.

 

We propose a cost-efficient method to generate reward data for DPO using detailed video captions as supporting evidence, as shown in fig. 2.

Initially, we randomly select a subset of 20k instruction pairs from the dataset described in section 3.2.

 

The SFT model uses these sampled questions and their corresponding videos to generate six responses per input pair at a temperature of 1.0.

 

This procedure results in 120k question-answer pairs, which will be evaluated.
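A sketch of this candidate-generation step, assuming a Hugging Face-style video LMM; `sft_model` and `processor` are stand-ins, and the exact preprocessing and decoding settings are not given in the post:

```python
import torch


def sample_candidates(sft_model, processor, video_frames, question, n=6):
    """Sample n candidate answers from the SFT video LMM at temperature 1.0."""
    inputs = processor(text=question, videos=video_frames, return_tensors="pt")
    with torch.no_grad():
        output_ids = sft_model.generate(
            **inputs,
            do_sample=True,           # stochastic decoding for diverse candidates
            temperature=1.0,          # temperature stated in the post
            num_return_sequences=n,   # six responses per (video, question) pair
            max_new_tokens=256,       # assumed cap, not specified in the post
        )
    return processor.batch_decode(output_ids, skip_special_tokens=True)
```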

 

Subsequently, we employ ChatGPT to process inputs consisting of a question, the ground-truth answer, the model's prediction, and a detailed description serving as supporting evidence, using the prompt shown in fig. 15.

 

This generates an output that includes a natural language explanation as a chain-of-thought step, followed by a numerical reward score on a scale from 1 to 5, indicating the level of factual alignment and overall quality.
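A hedged sketch of this reward-labeling call; the prompt is a placeholder for the one in fig. 15, and the score is parsed with a simple regex under the assumption that the model ends its output with "Score: N":

```python
import re

REWARD_PROMPT = (
    "You are given a question about a video, the ground-truth answer, a model prediction, "
    "and a detailed video description as supporting evidence.\n"
    "First explain step by step whether the prediction is factually consistent with the "
    "evidence, then end with a line 'Score: N' where N is an integer from 1 (poor) to 5 (excellent).\n\n"
    "Question: {question}\nGround truth: {answer}\nPrediction: {prediction}\n"
    "Video description: {caption}"
)


def score_prediction(question, answer, prediction, caption) -> tuple[str, int | None]:
    """Return the chain-of-thought explanation and the parsed 1-5 reward score."""
    response = client.chat.completions.create(  # `client` from the captioning sketch
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": REWARD_PROMPT.format(
            question=question, answer=answer, prediction=prediction, caption=caption)}],
    )
    text = response.choices[0].message.content
    match = re.search(r"Score:\s*([1-5])", text)
    return text, int(match.group(1)) if match else None
```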

 

For each video and question pair, we randomly select an answer with a score ≥ 3 as the positive example and an answer with a score below 3 as the negative example.
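A minimal sketch of this pair-selection rule, which also applies the exclusion described just below (all candidates on the same side of the threshold means the example is skipped):

```python
import random


def build_dpo_pair(candidates: list[tuple[str, int]]) -> tuple[str, str] | None:
    """candidates: (response, score) tuples for one (video, question) pair.

    Returns (chosen, rejected), or None when no valid preference pair can be formed.
    """
    positives = [resp for resp, score in candidates if score >= 3]
    negatives = [resp for resp, score in candidates if score < 3]
    # Skip cases where all responses are uniformly scored above or below 3.
    if not positives or not negatives:
        return None
    return random.choice(positives), random.choice(negatives)
```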

 

Cases where all responses are uniformly scored above or below 3 are excluded from the dataset. After the selection process, approximately 17k training instances are compiled for DPO training. Formally, the dataset