
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM (Summary)

jinuklee 2025. 2. 11. 00:21

2 Method

In this section, we present our preliminary attempts to adapt MLLMs by equipping them with slow-thinking capacities for complex multimodal tasks. We explore two straightforward adaptation methods: (1) transferring slow-thinking abilities using text-based long thought data, and (2) distilling multimodal long thought data from existing slow-thinking MLLMs. Our aim is to investigate how slow-thinking capacities are elicited in MLLMs and to identify which approaches are more effective for achieving this goal. Next, we describe the specific implementation details.

2.1 Capacity Transfer from Text-only Instructions

Previous studies min2024imitate have shown that slow-thinking reasoning is likely a behavioral mode that can be elicited by fine-tuning with a small amount of long thought data. Moreover, this capacity can generalize across different domains. Therefore, our idea is to investigate whether this ability can also transfer to different modalities, given that existing MLLMs are developed with LLMs as their backbone.

2.1.1 Collecting Textual Long Thought Data

We begin by collecting textual long thought data from our previous study min2024imitate. Specifically, we obtain approximately 5K long thought instruction instances distilled from two open slow-thinking reasoning systems: DeepSeek-R1-Lite-Preview r1 (abbreviated as R1) and QwQ-32B-preview qwq (abbreviated as QwQ). The statistics of the collected instruction data are categorized by domain as follows: math (3.7K), science (0.9K), code (0.2K) and puzzle (0.1K). We select the majority of the instruction data from the math domain because it contains more challenging problems that require longer reasoning processes.

These instructional data are formatted with two distinct parts: the thought process, indicated by special symbols “<|begin_of_thought|>” and “<|end_of_thought|>”, and the final solution, indicated by special symbols “<|begin_of_solution|>” and “<|end_of_solution|>”. More details about the data composition and instruction format can be found in our previous paper min2024imitate.
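To make the format concrete, here is a minimal sketch of how one training target could be assembled with these delimiters; the helper and field names are illustrative assumptions, not the paper's actual preprocessing code.

```python
# Minimal sketch of assembling a long-thought training instance. The special
# tokens follow the format described above; the helper name and the
# "instruction"/"output" field layout are illustrative assumptions.

def build_target(thought: str, solution: str) -> str:
    """Concatenate the thought process and the final solution with the
    special delimiter tokens used in the instruction data."""
    return (
        "<|begin_of_thought|>\n" + thought.strip() + "\n<|end_of_thought|>\n\n"
        "<|begin_of_solution|>\n" + solution.strip() + "\n<|end_of_solution|>"
    )

example = {
    "instruction": "What is the sum of the first 100 positive integers?",
    "output": build_target(
        thought="The sum 1 + 2 + ... + n equals n(n+1)/2, so for n = 100 ...",
        solution="The sum is 100 * 101 / 2 = 5050.",
    ),
}
```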

2.1.2 Textual Long Thought Instruction Tuning

After collecting instruction data for long-form reasoning, we fine-tune the base MLLM to emulate slow-thinking reasoning behavior. We choose Qwen2-VL-72B-Instruct as the target model due to its excellent multimodal capabilities. Additionally, our previous work min2024imitate indicates that slow-thinking capacities are more readily achieved in stronger models.

To optimize the target MLLM, we train only the parameters of the LLM and the cross-modal connector while keeping the parameters of the visual encoder frozen. We use the following optimization settings: a learning rate of 7e-6, a batch size of 128, and training for 10 epochs. Based on the performance on the development set, we select the model at the 5th epoch for evaluation.
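As a rough illustration of this freezing scheme, the sketch below assumes the Hugging Face Qwen2-VL implementation, where the vision tower is the `visual` module and the cross-modal connector is its `merger` submodule; the module names may differ across versions, and the actual training was run through a fine-tuning framework rather than a hand-rolled loop.

```python
# Minimal sketch of the parameter-freezing scheme: train the LLM and the
# cross-modal connector, keep the visual encoder frozen. Module names
# ("visual", "merger") are assumptions based on common Qwen2-VL conventions.
import torch
from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-72B-Instruct", torch_dtype=torch.bfloat16
)

for name, param in model.named_parameters():
    # Freeze the visual encoder, but keep the connector ("merger") trainable.
    if name.startswith("visual.") and "merger" not in name:
        param.requires_grad = False

# Optimization settings reported above: lr 7e-6, global batch size 128, 10 epochs.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=7e-6
)
```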

We do not employ more advanced training algorithms, such as DPO rafailov2024direct and RLHF Ouyang2022instruct, as our objective is not to attain the maximum possible performance. Instead, we aim to investigate the effect of textual long thought data and the potential of transferring slow-thinking capacities through this straightforward imitation fine-tuning.

2.2 Capacity Distillation from Slow-thinking MLLMs

The second approach we explore is the direct distillation of multimodal long thought data from slow-thinking MLLMs (e.g., QVQ). This approach aims to achieve two goals: first, to compare the fine-tuning performance of textual long thought data with that of multimodal long thought data, and second, to investigate the potential effects of combining both textual and multimodal instruction data.

2.2.1 Visual Long Thought Data Collection

To construct visual long thought data, a crucial step is to gather a set of high-quality visual problems, which include both task descriptions and images as input. Additionally, these problems should be accompanied by ground-truth answers for correctness verification. We consider selecting problems from visual question answering (VQA) datasets to cover diverse domains such as geometry, tables, figures, and icons. We select these domains because they typically present more challenging problems for MLLMs.

Specifically, we select four geometry datasets (Geos seo2015solving, GeoQA+ chen2021geoqa, Geometry3K lu2021inter, and UniGeo chen2022unigeo), three table and figure datasets (TabMWP lu2022dynamic, FigureQA kahou2017figureqa, and ChartQA masry2022chartqa), and an object dataset (CLEVR johnson2017clevr). These datasets can be accessed from the LLaVA-OneVision li2024llava data collection, where each instance provides a question, image, and answer triple. Detailed statistics for each dataset are presented in Table 1.

 

To complete these problems with long thought processes, we consider two approaches: either distilling from existing slow-thinking MLLMs or utilizing our fine-tuned MLLMs with textual long thought data. We assume that fine-tuning MLLMs with textual long thought data can effectively transform them into slow-thinking MLLMs, essentially engaging in a self-distillation process. For existing slow-thinking MLLMs, we select the recently released QVQ model, which demonstrates superior performance on several challenging benchmarks.

To generate the reasoning process, we use the commonly employed rollout method, randomly sampling responses from both the QVQ model and our own fine-tuned model. We define a special output format so that the final answer can be parsed from each response, and we retain only those problems that the models can successfully solve within a reasonable number of rollouts. Intuitively, simpler problems require fewer rollouts to solve. We will further discuss the impact of problem difficulty on the fine-tuning performance of MLLMs in Section 3.3.
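To make the filtering step concrete, here is a minimal sketch of the rollout-and-filter loop under stated assumptions; the two callables and the rollout budget of 8 are hypothetical, since the text only says problems must be solved within a reasonable number of rollouts.

```python
# Sketch of the rollout-and-filter procedure. `sample_response(problem)` is a
# hypothetical helper that draws one sampled long-thought response from a
# slow-thinking MLLM, and `parse_final_answer(response)` is a hypothetical
# helper that extracts the final answer from the special solution format
# (returning None if parsing fails).
from typing import Callable, Optional

def distill_trajectory(problem: dict,
                       sample_response: Callable[[dict], str],
                       parse_final_answer: Callable[[str], Optional[str]],
                       max_rollouts: int = 8) -> Optional[str]:
    """Return the first correct long-thought trajectory for `problem`, or
    None if the model never solves it within the rollout budget (the budget
    of 8 is an assumed value)."""
    for _ in range(max_rollouts):
        response = sample_response(problem)
        answer = parse_final_answer(response)
        if answer is not None and answer == problem["answer"]:
            return response  # keep only problems the model actually solves
    return None
```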

2.2.2 Visual Long Thought Instruction Tuning

When distilling the long thought data from QVQ (denoted by DQVQ), the training process is straightforward: we fine-tune only the parameters of the LLM and the modality connector, as we do with the textual long thought data described in Section 2.1. Although the visual instruction data includes image information, our experimental results indicate that updating the visual encoder does not result in substantial performance improvement.

As an alternative approach, we design a multi-stage tuning method for self-distillation. Specifically, we first fine-tune the selected MLLM (i.e., Qwen2-VL-72B-Instruct) on the textual long thought instruction set DT, obtaining model ℳ0. Next, we use ℳ0 to generate the visual long thought instruction set DSD by self-distillation, which can subsequently be used for fine-tuning the original MLLM.
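The multi-stage flow can be summarized with a small sketch; `finetune` and `synthesize_visual_data` are hypothetical stand-ins for the actual training and rollout-distillation steps.

```python
# Sketch of the multi-stage self-distillation flow: fine-tune on the textual
# set D_T to obtain M0, use M0 to synthesize the visual set D_SD, then
# fine-tune the *original* base MLLM on the synthesized data.

def self_distill(base_mllm, d_text, visual_problems, finetune, synthesize_visual_data):
    m0 = finetune(base_mllm, d_text)                     # stage 1: obtain M0 from D_T
    d_sd = synthesize_visual_data(m0, visual_problems)   # stage 2: build D_SD
    return finetune(base_mllm, d_sd), d_sd               # stage 3: retrain the original MLLM
```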

In our experiments, our aim is to investigate the effects of individual instruction datasets (i.e., DT, DSD and DQVQ) and their combinations on the slow-thinking performance.

 

3 Experiments

3.1 Evaluation Setup

To validate the effectiveness of our methods, we conduct experiments on four challenging benchmarks: MathVerse zhang2025mathverse, MathVision wang2024measuring, OlympiadBench he2024olympiadbench, and MMMU yue2024mmmu. MathVerse consists of 2,612 multi-subject math problems from diverse sources. MathVision comprises 3,040 high-quality mathematical problems sourced from established mathematics competitions. OlympiadBench features 8,476 bilingual multimodal problems from Olympiad-level mathematics and physics competitions. MMMU encompasses 11,500 problems spanning 30 subjects and 183 subfields. To ensure a fair comparison, we conduct evaluations on the validation set of MMMU and the testmini set of MathVerse. Following VLMEvalKit duan2024vlmevalkit, we exclude the text-only split from MathVerse and the theorem-proof parts from OlympiadBench. Among all the benchmarks, OlympiadBench is the most challenging, while MMMU demonstrates relatively lower difficulty levels and focuses more on comprehensive subject knowledge.

We select Qwen2-VL-72B-Instruct wang2024qwen2 as our base MLLM due to its strong multimodal capabilities. We fine-tune it with LLaMA-Factory zheng2024llamafactory and denote the resulting model as Virgo-72B. We then compare it with a range of models that are capable of conducting o1-like slow-thinking (i.e., OpenAI o1 and QVQ-72B-preview). We also include advanced general-purpose models (i.e., GPT-4o, Gemini-Pro, and Claude-3.5-Sonnet) for comparison. We also train Virgo-7B based on Qwen2-VL-7B-Instruct to further study the influence of model size.

3.2 Main Results

Table 2: Performance comparison of top-tier MLLMs on four representative benchmarks. Here, DT denotes the textual long thought data, and DSD and DQVQ denote the visual long thought data distilled by our model (the version fine-tuned with DT) and by QVQ, respectively. Bold fonts denote the best performance among our training variants, while underlined fonts denote the second-best performance. * Since QVQ has not released its evaluation code, we report the evaluation results reproduced by our team.

 

 

In this section, we provide a comprehensive performance comparison of various methods on the selected evaluation benchmarks, as summarized in Table 2. The results include the performance of o1-like MLLMs, general-purpose MLLMs, and our approaches that extend the backbone model with different long thought instruction datasets.

First, the slow-thinking reasoning ability can be effectively transferred through text-only reasoning data. As demonstrated in the second group of Table 2, after fine-tuning with only 5K textual long thought instructions, our model yields highly competitive results, approaching and even surpassing those of industry counterparts. For instance, our model achieves 38.4% accuracy on MathVision and 29.3% accuracy on OlympiadBench. However, another observation is that our model does not show significant improvement on the MMMU benchmark. To thoroughly analyze the performance limitations on MMMU, we further examine fine-grained performance by using the difficulty annotation of the test samples: easy, medium, and hard. As shown in Table 3, our method lags behind QVQ in overall performance, with the disadvantage mainly concentrated in the easy and medium samples. For samples in the hard bin, our method achieves an accuracy of 54.70%, compared to QVQ’s 48.62%. As we will discuss in Section 3.4, not all visual problems require complex reasoning processes, and enforcing a longer thought process might lead to performance degradation of MLLMs.

Secondly, synthesized visual instructions, whether obtained through distillation or self-distillation, do not significantly outperform textual reasoning instructions when fine-tuning the MLLM. Upon conducting a human review of the synthesized trajectories for visual questions, we find that many questions are not sufficiently complex and rely more on perception than reasoning, even though we carefully selected the data sources and applied a rigorous filtering process to control difficulty. Developing high-quality, complex visual instructions remains a challenging direction for future exploration.

Additionally, we conduct experiments on smaller MLLMs, specifically Qwen2-VL-7B-Instruct, as shown in the third group of Table 2. The performance trends observed with different reasoning instruction datasets deviate somewhat from those of the larger model, Qwen2-VL-72B-Instruct. Notably, Virgo-7B (DSD) outperforms Virgo-7B (DT), particularly on MathVerse and MMMU, suggesting that visual long thought instructions are more effective than textual instructions for smaller MLLMs. Another difference is that after fine-tuning with long thought instructions, performance on MMMU decreases substantially. We speculate that a smaller model might be less capable of managing complex long thought processes, especially when applied to problems that do not necessitate complex reasoning (MMMU appears to be simpler than the other three benchmarks). Incorporating visual instructions may alleviate this degradation.

 

3.3 Further Analysis

After presenting the overall performance analysis, we further investigate the detailed effects of long thought instruction data on visual reasoning. We present the major findings below.

Harder tasks benefit more from long thought reasoning.

We first examine how our approach impacts model performance across tasks of varying difficulty levels. Previous research min2024imitate has indicated a correlation between the average length of responses generated by models and the complexity of the questions: longer responses generally accompany more complex or challenging questions. Building on this insight, we analyze the average length of responses produced by our model on evaluation benchmarks and visualize the corresponding model performance in Figure 2. The results indicate that benchmarks with longer response lengths, such as OlympiadBench, tend to be more difficult, as evidenced by their lower accuracy. Notably, our approach demonstrates substantial improvements on these challenging benchmarks, achieving absolute gains of 18.1% and 12.4% on OlympiadBench and MathVision, respectively. Conversely, we observe limited performance gains on the relatively easier benchmark, MMMU, which is characterized by shorter response lengths.
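As a small illustration of this analysis (not the paper's actual measurement code), the sketch below groups model outputs by benchmark and reports average response length alongside accuracy; the record fields and whitespace tokenization are assumptions.

```python
# Sketch of the response-length analysis: group outputs by benchmark and
# report average length next to accuracy. Record fields ("benchmark",
# "response", "correct") and whitespace tokenization are assumptions.
from collections import defaultdict

def length_vs_accuracy(records, count_tokens=lambda text: len(text.split())):
    stats = defaultdict(lambda: {"tokens": 0, "correct": 0, "n": 0})
    for r in records:
        s = stats[r["benchmark"]]
        s["tokens"] += count_tokens(r["response"])
        s["correct"] += int(r["correct"])
        s["n"] += 1
    return {
        bench: {"avg_length": s["tokens"] / s["n"], "accuracy": s["correct"] / s["n"]}
        for bench, s in stats.items()
    }
```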

 

Longer reasoning does not guarantee better results.

Since reasoning capacity is influenced by the difficulty of the instruction data, we compare fine-tuning performance across different difficulty levels. We use a simple method to determine instruction difficulty based on instruction length. Specifically, we train the model using textual long thought instructions sampled from varying length ranges: (0, 2000], (2000, 4000], and (4000, 8000], and present the results in Table 4. The results indicate that increasing the length of reasoning in the training data from 2000 to 4000 tokens leads to performance improvements across all benchmarks. However, further increasing the length to 8000 tokens results in performance degradation on most benchmarks. To further examine the performance decrease associated with long instructions, we analyze the data composition of each length range and observe that the math domain dominates the long instruction data in the (4000, 8000] range. These math problems may produce instructions that are excessively long compared to the lengths actually required for visual reasoning tasks; even OlympiadBench, the benchmark with the longest responses, has an average response length below 4000 tokens, as shown in Figure 2.
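A minimal sketch of this length-based splitting follows, assuming length is measured in whitespace tokens of the long-thought response (the exact metric is not specified in the text).

```python
# Sketch of bucketing textual long thought instructions into the length
# ranges used above: (0, 2000], (2000, 4000], (4000, 8000]. The length
# metric (whitespace tokens of the "output" field) is an assumption.
BUCKETS = [(0, 2000), (2000, 4000), (4000, 8000)]

def bucket_by_length(instructions, count_tokens=lambda text: len(text.split())):
    buckets = {b: [] for b in BUCKETS}
    for inst in instructions:
        n = count_tokens(inst["output"])  # length of the long-thought target
        for lo, hi in BUCKETS:
            if lo < n <= hi:
                buckets[(lo, hi)].append(inst)
                break
    return buckets
```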

 

Scaling textual instruction leads to improvement.

We further investigate the impact of scaling textual instructions on reasoning performance. The results, presented in Table 5, demonstrate that increasing the number of textual instructions generally leads to performance improvements across most benchmarks. Specifically, increasing the instruction samples from 1K to 5K results in a 7.7% average performance gain for both the 7B and 72B models on MathVision, while showing a modest 1.8% performance gain on OlympiadBench. These observations suggest that while scaling textual instructions is generally effective, its impact varies across different tasks. Another finding is that textual instructions initially diminish the model’s capacity on MMMU, but performance gradually recovers as more instructions are added.

 

Difficulty of visual thought data has limited impacts on performance.

In Section 2.2.1, we select visual problems from various domains and generate visual long thought instructions by distilling from QVQ and Virgo-72B (DT). Our goal is to explore the impact of visual instructions with varying difficulty levels. Specifically, we first use Qwen2-VL-72B-Instruct, which has not been fine-tuned with long thought instructions, to generate responses for visual questions via greedy search. Questions that the base MLLM answers correctly are excluded, as they are considered relatively easy. For the remaining questions, Virgo-72B (DT) performs multiple rollouts, generating five candidate trajectories per question. Based on the ratio of correct trajectories, we define two levels of difficulty: medium, for questions with 4 or 5 correct trajectories, and hard, for those with 2 or 3 correct trajectories. To investigate how question difficulty affects model performance, we also randomly sample some questions, regardless of whether the base MLLM can solve them, and synthesize trajectories based on these questions. This set is referred to as the “random-level”. We combine 5K textual long thought instructions with each of the three splits (medium, hard, and random) to fine-tune Qwen2-VL-72B-Instruct and report the results in Table 6. The results show that visual instructions with different difficulty levels do not lead to significant performance differences. This suggests that advanced strategies for synthesizing visual long thought instructions are needed to enhance multi-modal slow-thinking reasoning capabilities.
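For concreteness, a minimal sketch of this labelling rule follows; the helper callables are hypothetical stand-ins, and the thresholds mirror the counts described above.

```python
# Sketch of the difficulty-labelling rule. `greedy_is_correct(question)` is a
# hypothetical check of whether the base MLLM (no long-thought fine-tuning)
# solves the question with greedy decoding; `sample_is_correct(question)`
# hypothetically checks one sampled trajectory from Virgo-72B (D_T).
from typing import Callable, Optional

def label_difficulty(question: dict,
                     greedy_is_correct: Callable[[dict], bool],
                     sample_is_correct: Callable[[dict], bool],
                     n_rollouts: int = 5) -> Optional[str]:
    """Return 'medium', 'hard', or None for questions that are discarded."""
    if greedy_is_correct(question):
        return None  # solved by the base MLLM: treated as easy and excluded
    n_correct = sum(sample_is_correct(question) for _ in range(n_rollouts))
    if n_correct >= 4:
        return "medium"   # 4 or 5 of 5 trajectories correct
    if n_correct >= 2:
        return "hard"     # 2 or 3 of 5 trajectories correct
    return None           # 0 or 1 correct: no usable trajectory kept
```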

 

3.4 Case Study

In this section, we present several examples to demonstrate the advantages of slow-thinking reasoning in addressing complex multimodal problems. Additionally, we provide examples highlighting some of the negative impacts introduced by our approach.

Textual long thought instruction tuning elicits visual slow-thinking reasoning ability.

In Table 7, the query requires evaluating the integral of a function and involves an image composed of three semi-circles. Qwen2-VL-72B-Instruct directly calculates the radius and center of each semi-circle individually but makes errors in determining their centers. In contrast, our model first describes the image in detail (highlighted in orange), then thoroughly reasons through the question, and finally arrives at the correct answer. Furthermore, the model can reflect on its reasoning process and attempt to verify its solution (highlighted in blue). This case demonstrates that long thought training enhances both the model’s detailed captioning ability and its capacity for self-reflection, which are crucial for performing complex reasoning tasks.

Lack of reflection on perception causes reasoning to fail.

By examining several failure cases, we observe that Virgo fails to reflect on its perception results, which can cause the entire reasoning process to collapse. A representative case is illustrated in Table 9, where Virgo mistakenly perceives the number of unemployed individuals with a “high school diploma” in September (highlighted in red). This leads to the incorrect conclusion that both August and September satisfy the problem’s requirements. While Virgo recognizes the irrationality of the result and begins to reflect on its reasoning process (highlighted in blue), it does not question the validity of its perception. As a result, erroneous conclusions are repeatedly generated, leading to incorrect answers. This case highlights that slow-thinking MLLMs transferred from text-only instructions may have limited capacity for reflecting on perception. Future models should be designed with the ability to reflect on both perception results and reasoning processes.