Uncategorized

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

jinuklee 2024. 11. 18. 16:43

https://arxiv.org/abs/2411.10440



Large language models have demonstrated substantial advancements in reasoning capabilities, particularly through inference-time scaling, as illustrated by models such as OpenAI's o1.

However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks.

In this work, we introduce LLaVA-o1, a novel VLM designed to conduct autonomous multistage reasoning.

Unlike chain-of-thought prompting, LLaVA-o1 independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation.

This structured approach enables LLaVA-o1 to achieve marked improvements in precision on reasoning-intensive tasks.

To accomplish this, we compile the LLaVA-o1-100k dataset, integrating samples from various visual question answering sources and providing structured reasoning annotations.

In addition, we propose an inference-time stage-level beam search method, which enables effective inference-time scaling.

Remarkably, with only 100k training samples and a simple yet effective inference time scaling method, LLaVA-o1 not only outperforms its base model by 8.9% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct.


Our LLaVA-o1 facilitates a progressive, step-by-step reasoning process that enhances the reasoning capabilities of Vision-Language Models (VLMs) and allows for effective inference time scaling [47].

Using structured thinking, LLaVA-o1 achieves a systematic and efficient reasoning process.

Its inference-time reasoning framework enables it to outperform existing methods in inference-time scalability.

This design ensures both robustness and accuracy in complex tasks requiring reasoning, which separates it from traditional approaches.

Figure 1 illustrates our general framework of the reasoning process.

Our goal during training time is to develop a visual language model capable of extended chains of reasoning, allowing it to engage in systematic and in-depth reasoning.

Our proposed model, LLaVA-o1, decomposes the answer generation process into four structured reasoning stages:

• Summary Stage. In this initial phase, LLaVA-o1 provides a high-level summary interpretation of the question, outlining the primary aspects of the problem it intends to address.

• Caption Stage. If an image is present, LLaVA-o1 offers a concise overview of the visual elements relevant to the question, helping to understand multimodal input.

• Reasoning Stage. Building on the initial summary, LLaVA-o1 conducts structured, logical reasoning to derive a preliminary answer.

• Conclusion Stage. In this final stage, LLaVA-o1 synthesizes an answer based on the preceding reasoning.

Here, the output from the conclusion stage is the direct response provided to the user, while the prior three stages are internal "hidden stages" representing LLaVA-o1's reasoning process.

The output at this stage adapts to the user's requirements: for instance, if the user requests a brief answer, the conclusion will be concise; if detailed explanations are desired, the conclusion provides a thorough, comprehensive response.

Each stage is initiated at the model's discretion, without external prompt engineering frameworks or addi-tional prompting.

Specifically, we provide the model with four pairs of special tags:

<SUMMARY></SUMMARY>,

<CAPTION></CAPTION>,

<REASONING></REASONING>,

and <CONCLUSION></CONCLUSION>.

These tags correspond to summarizing the response approach, describing relevant image content, conducting reasoning, and preparing a final answer, respectively.

Upon training, the model autonomously selects these tags as needed, activating each stage based on its own judgment.

As with OpenAI o1 [63], all stages are completed by the model in a single inference pass.

This structured approach enables the model to independently manage its reasoning process, improving its adaptability and performance on complex reasoning tasks.
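
For illustration, the sketch below (not part of the paper) shows how a tagged LLaVA-o1-style response could be separated into its four stages so that only the conclusion is surfaced to the user. The response text and the split_stages helper are made-up examples, not the model's actual output.

```python
import re

# Illustrative tagged response in the LLaVA-o1 format (content is made up).
response = (
    "<SUMMARY>The question asks how many red cubes appear in the image.</SUMMARY>"
    "<CAPTION>The image shows five cubes: three red, one blue, and one green.</CAPTION>"
    "<REASONING>Counting only the red cubes gives three.</REASONING>"
    "<CONCLUSION>3</CONCLUSION>"
)

STAGES = ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION")

def split_stages(text: str) -> dict:
    """Extract the content of each tagged stage; a missing stage maps to None."""
    parsed = {}
    for tag in STAGES:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        parsed[tag] = match.group(1).strip() if match else None
    return parsed

stages = split_stages(response)
print(stages["CONCLUSION"])  # the only part shown to the user
print(stages["REASONING"])   # the hidden stages remain available for inspection
```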

Most existing VQA datasets lack detailed reasoning processes needed to train the LLaVA-o1 model.

Therefore, we compile a new dataset, integrating samples from several widely used VQA datasets, resulting in a total of 99k samples.



As shown in Figure 3, since no multimodal model currently exists that can directly produce systematic, structured reasoning, we use GPT-4o [3] to generate detailed reasoning processes, including summary, caption, reasoning, and conclusion, and compile these into the LLaVA-o1-100k dataset, which we plan to release for public use.
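
The exact annotation prompt is not reproduced in this post. The snippet below is an illustrative sketch of how such staged annotations could be generated with the OpenAI Python SDK; the system prompt, the annotate helper, and the image handling are placeholder assumptions rather than the authors' actual pipeline.

```python
# Illustrative sketch of generating staged annotations with GPT-4o.
# The prompt wording and the helper below are assumptions, not the paper's pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "Answer the visual question in four tagged stages: "
    "<SUMMARY>...</SUMMARY> <CAPTION>...</CAPTION> "
    "<REASONING>...</REASONING> <CONCLUSION>...</CONCLUSION>. "
    "The conclusion must directly answer the question and agree with the given answer."
)

def annotate(question: str, image_url: str, answer: str) -> str:
    """Rewrite a plain (question, answer) pair as staged, tagged reasoning."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": f"Question: {question}\nGround-truth answer: {answer}"},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            },
        ],
    )
    return completion.choices[0].message.content
```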

We include data from both general-purpose VQA datasets and science-targeted VQA datasets, specified below:

General VQA Datasets.

 

We include several general-purpose VQA datasets with distinct focuses.

 

ShareGPT4V [8] provides multi-turn question-answering data from GPT-4V [57] interactions.

ChartQA [38] focuses on interpreting charts and graphs.

A-OKVQA [45] emphasizes external knowledge beyond visible content.

DocVQA [39] involves document-based questions requiring textual comprehension.

We also include PISC [28] to understand social relationships, and CLEVR [22] to address object properties, spatial relationships, and counting tasks.


Science-Targeted VQA Datasets.
These datasets include GeoQA+ [7] for geometric reasoning, along with AI2D [23] and ScienceQA [34], which target scientific questions.


CLEVR-Math [13], an extension of CLEVR, focuses on arithmetic analysis in visual contexts.

Table 1 shows the number of QA pairs selected from each dataset.

Model Training.

The LLaVA-o1-100k dataset we construct can be used to conduct further Supervised Fine-Tuning (SFT) on any existing model to enhance its reasoning capabilities.

In this work, we select the Llama-3.2-11B-Vision-Instruct [40] model as the base model and perform full-parameter fine-tuning using the LLaVA-o1-100k dataset.

The training is conducted on a single node with 8 H100 GPUs.
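
The paper specifies only the base model, full-parameter SFT on LLaVA-o1-100k, and the 8×H100 hardware. The snippet below is a minimal single-sample sketch using Hugging Face transformers; the optimizer, learning rate, precision, chat formatting, and absence of prompt-token masking are placeholder assumptions, and the distributed multi-GPU setup is omitted.

```python
# Minimal single-sample SFT sketch (illustrative; hyperparameters are assumptions).
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = MllamaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # assumed learning rate

def training_step(image: Image.Image, question: str, staged_response: str) -> float:
    """One SFT step on a single (image, question, tagged response) example."""
    messages = [
        {"role": "user",
         "content": [{"type": "image"}, {"type": "text", "text": question}]},
        {"role": "assistant",
         "content": [{"type": "text", "text": staged_response}]},
    ]
    text = processor.apply_chat_template(messages)
    inputs = processor(images=image, text=text, return_tensors="pt").to(model.device)
    # Standard causal-LM loss over the whole sequence; prompt masking is omitted here.
    outputs = model(**inputs, labels=inputs["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```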

3.2. Effective Inference Time Scaling using Stage-level Beam Search

After training, our objective is to further enhance the model's reasoning ability during inference.

Specifically, we leverage the stage-based outputs of LLaVA-o1, which provide an ideal granularity for inference-time scaling.

Our method follows the steps below; a minimal sketch of the procedure is given after the list:

• Sample N responses for the first stage in the solution.

• Randomly sample 2 responses and let the model determine which is better, keeping the better response.

• Repeat this pairwise comparison N − 1 times, retaining the best response.

• Sample N responses for the next stage, then repeat steps 2-4 until all stages are processed.
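
For illustration, the Python sketch below gives one reading of this procedure; it is not the authors' implementation. Here generate_stage and model_prefers_first are hypothetical stand-ins for sampling a stage continuation from LLaVA-o1 and for its pairwise preference judgment.

```python
import random

STAGE_TAGS = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def generate_stage(context: str, stage: str) -> str:
    """Hypothetical stand-in: sample one candidate continuation for `stage`
    given the prompt plus all previously kept stages."""
    return f"<{stage}>candidate {random.randint(0, 9)}</{stage}>"

def model_prefers_first(context: str, a: str, b: str) -> bool:
    """Hypothetical stand-in: ask the model which of two candidates is better."""
    return random.random() < 0.5

def stage_level_beam_search(prompt: str, n: int = 4) -> str:
    """At each stage, sample n candidates and keep the winner of n - 1
    pairwise comparisons before moving on to the next stage."""
    context = prompt
    for stage in STAGE_TAGS:
        candidates = [generate_stage(context, stage) for _ in range(n)]
        best = candidates[0]
        for challenger in candidates[1:]:      # n - 1 pairwise comparisons
            if not model_prefers_first(context, best, challenger):
                best = challenger
        context += best                        # only the winning candidate is kept
    return context

print(stage_level_beam_search("Question: ...", n=4))
```

Setting n = 1 corresponds to no inference-time scaling, while n between 2 and 4 matches the beam sizes explored in Section 5.3.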

Notably, it is the structured output design of LLaVA-o1 that makes this approach feasible, enabling efficient and accurate verification at each stage.

This validates the effectiveness of structured output in improving inference time scaling.

An illustration of the three approaches is shown in Figure 4.

We provide an example in Figure 5.

When inference time scaling is not applied, although the model generates correct reasoning steps, it fails to arrive at a concrete answer during the reasoning process.

This causes the model to make a guess in the conclusion phase, leading to an incorrect result.

In contrast, with inference time scaling, the model retains the reasoning steps leading to the final result, ensuring the correctness of the answer.

4. Post-Training Performance

In this section, we compare LLaVA-o1 with the base model, Llama-3.2-11B-Vision-Instruct, on six commonly used multimodal benchmarks to demonstrate the effectiveness of our approach during the training phase.

Following this comparison, we conduct ablation studies to evaluate the contribution of each component within our method, addressing the following three key questions: (1) Is our LLaVA-o1-100k dataset more effective than directly using the original dataset's Q&A pairs?

(2) What is the impact of structured tags on the performance?

Specifically, we explore whether LLaVA-o1 can function without tags by implicitly segmenting different stages of the response.

(3) In which specific areas does our model show the most improvement compared to the base model, and does it genuinely enhance reasoning capabilities?

4.1. Experimental Setup

We selected six widely used and challenging benchmarks for our experiments: MMStar [9], MMBench V1.1 [33], MMVet [60], MathVista [35], AI2D [23], and HallusionBench [17].

MMStar, MMBench, and MMVet primarily evaluate the general visual question-answering capabilities of models, while MathVista and AI2D focus on models' proficiency in mathematical and scientific reasoning.

HallusionBench specifically assesses the models' handling of language hallucinations and visual illusions.

For MMBench, we use the V1.1 version of the test set, MathVista is evaluated using the testmini set, and the remaining datasets each have a single test set.

To ensure fairness and reproducibility, all evaluations are conducted using VLMEvalKit [14], an open-source evaluation toolkit for large vision-language models.

The performance metrics of all baseline models are derived from VLMEvalKit's testing results [1].

4.2. Benchmark Results

We found that LLaVA-o1 achieves significant performance improvements despite using only 100k training samples.

According to Table 2, compared to the base model, Llama-3.2-11B-Vision-Instruct, LLaVA-o1 demonstrates notable improvements across general VQA, mathematical reasoning, scientific VQA, and hallucination control tasks, with an average benchmark score increase of 6.9%, thereby validating the effectiveness of our approach.

4.3. Ablation Study

Effectiveness of LLaVA-o1-100k Compared to Original Datasets.

To demonstrate the effectiveness of our improved LLaVA-o1-100k dataset, we present a comparison between LLaVA-o1 and the model trained on the original Q&A pairs across different benchmarks in Table 2.

Although the model trained directly on the original Q&A pairs shows some overall improvement over the base model, its average performance remains significantly lower.

In particular, on the MMVet benchmark, which requires more detailed responses, its performance is even worse than the base model.

This result underscores the importance of the multistage format of our LLaVA-o1-100k dataset for training models capable of advanced reasoning.

Structured Tags are Essential for Enhanced Performance.

To examine whether the four tags we introduced improve the model's performance, we compare LLaVA-o1 with the model trained on the LLaVA-o1-100k dataset with structured tags removed.

As shown in Table 2, our results show a significant drop in performance when the tags are removed, indicating that the structured tagging facilitates reasoning and improves model performance.

To the best of our knowledge, LLaVA-o1 is the first attempt to successfully enhance a model's reasoning ability and overall performance through structured reasoning with tags.

Performance Gains Primarily in Reasoning-Intensive Areas.

To analyze the specific areas in which LLaVA-o1 has improved compared to the base model, we conduct a detailed assessment of the model's performance across different skills on the MMStar benchmark.

MMStar is designed to evaluate six key capabilities: coarse perception, fine-grained perception, instance reasoning, logical reasoning, math, and science & technology.

In Table 3, we compare the base model with LLaVA-o1.

Our analysis reveals that LLaVA-o1 demonstrates notable improvements in tasks requiring systematic reasoning, such as instance reasoning, logical reasoning, math, and science & technology, while showing relatively smaller gains in coarse perception and fine-grained perception.

This suggests that our method mainly improves the reasoning capabilities of the model.

5. Inference Time Scaling

In this section, we aim to compare the effectiveness of our stage-level beam search approach with traditional methods like best-of-N and sentence-level beam search under comparable computational constraints.

The experimental setup mirrors that used in the previous section, with evaluations conducted across the same six benchmarks: MMStar, MMBench V1.1, MMVet, MathVista, AI2D, and HallusionBench.

All methods are evaluated using VLMEvalKit to ensure reproducibility.

5.1. Benchmark Results

As shown in Table 4, stage-level beam search demonstrates substantial effectiveness in leveraging the structured reasoning stages of LLaVA-o1.

By evaluating outputs at each reasoning stage, this approach strikes a balance between rigorous quality control and computational efficiency, yielding higher inference accuracy on complex reasoning tasks without significant computational overhead.

These findings suggest that stage-level beam search, which is made possible by the structured output design of LLaVA-o1, is an effective and powerful approach for inference time scaling.

5.2. Comparison to Baseline Methods

We compare our method with baseline inference scaling methods on the MMVet benchmark to evaluate relative performance.

For a fair comparison, our stage-level beam search method and the baseline methods are evaluated using comparable levels of inference-time compute.

Specifically, we set N = 10 for the best-of-N method, generate 4 candidate responses per stage for our stage-level beam search, and use a sentence-level beam search generating 2 candidates per sentence.

As shown in Table 5, the best-of-N method yields only a modest improvement of 0.6%, while sentence-level beam search even shows a 1.9% decrease in performance.

We examine the sub-scores and find that the main reason for the performance drop in sentence-level beam search is its excessively fine granularity, which struggles to address open-ended questions effectively.

In contrast, our stage-level beam search improves performance by 2.6%, highlighting the superiority of stage-based search.

 

5.3. Scaling Trend of Stage-level Beam Search

To better illustrate the effectiveness of our stage-level beam search as inference time compute increases, we evaluate LLaVA-o1 with different beam sizes on the MMVet benchmark.

As shown in Table 6, we test the performance of the model by generating 1 (i.e., no inference-time scaling), 2, 3, and 4 candidate responses at each reasoning stage, allowing the model to select the best answer from these options.

Our findings show that as the number of candidate responses increases, the model's performance consistently improves, confirming that our stage-level beam search approach is scalable.

Due to computational resource constraints, we only test a beam size of 2 across all benchmarks.

However, it is expected that increasing the beam size will lead to even more significant improvements.

6. Comparison to State-of-the-Art VLMs

As shown in Table 7, we compare LLaVA-o1 with other state-of-the-art open-source and closed-source vision-language models (VLMs) across six benchmarks that require advanced reasoning capabilities: MMStar-R, MMBench-R, MMVet-R, MathVista, AI2D, and HallusionBench.

MMStar-R, MMBench-R, and MMVet-R are custom benchmarks derived from MMStar, MMBench V1.1, and MMVet, respectively, with tasks requiring only coarse perception, fine-grained perception, and OCR removed.

These filtered benchmarks retain tasks that demand complex reasoning.

MathVista, AI2D, and HallusionBench inherently focus on advanced reasoning, so we retained all tasks within these benchmarks.

Our results show that LLaVA-o1 consistently outperforms many open-source models of similar or even larger sizes, such as InternVL2-8B [10], Ovis1.5-Gemma2-9B [36], MiniCPM-V2.6-8B [58], Llama-3.2-90B-Vision-Instruct [40], and VILA-1.5-40B [30].

Remarkably, LLaVA-o1 even surpasses certain closed-source models like GPT-4o-mini [41] and Gemini-1.5-pro [43], underscoring the effectiveness of our structured reasoning approach.

This comparison validates the advantages of our method, particularly in benchmarks that heavily depend on reasoning skills, and highlights LLaVA-o1 as a competitive model in the domain of reasoning-intensive VLM tasks.

7. Conclusion

In this paper, we present LLaVA-o1, a novel vision language model that performs structured, autonomous reasoning in multiple stages.

By introducing four distinct stages—summary, caption, reasoning, and conclusion—LLaVA-o1 achieves a systematic reasoning process.

Our contributions are twofold: first, the creation of the LLaVA-o1-100k dataset with detailed reasoning annotations, which supports training on systematic, structured responses; and second, the proposal of a stage-level beam search method, enabling effective inference time scaling.

Overall, LLaVA-o1 establishes a new standard for multimodal reasoning in VLMs, offering robust performance and scalability, especially at inference time.

Our work paves the way for future research on structured reasoning in VLMs, including potential expansions with external verifiers and the use of reinforcement learning to further enhance complex multimodal reasoning capabilities.