Mantis: Interleaved multi-image instruction tuning (paper review)

jinuklee 2024. 10. 13. 01:15

https://arxiv.org/abs/2405.01483

Large multimodal models (LMMs) have shown great results on single-image vision-language tasks.

 

However, their ability to solve multi-image vision-language tasks remains limited.

 

Existing LMMs like OpenFlamingo, Emu2, and Idefics gain their multi-image ability through pre-training on hundreds of millions of noisy interleaved image-text examples from the web, which is neither efficient nor effective.

 

In this paper, we aim to build strong multi-image LMMs via instruction tuning with academic-level resources.

 

Therefore, we meticulously construct MANTIS-INSTRUCT, containing 721K multi-image instruction examples, to train a family of models called MANTIS.

 

The instruction tuning empowers MANTIS with different multi-image skills such as co-reference, comparison, reasoning, and temporal understanding.
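
To make the "interleaved multi-image instruction" idea concrete, here is a minimal sketch of what one such training record could look like. This is a hypothetical example I wrote for illustration; the field names and the <image> placeholder token are assumptions, not the actual MANTIS-INSTRUCT schema.

```python
# Hypothetical interleaved multi-image instruction record.
# Field names and the "<image>" placeholder are assumptions for
# illustration only, not the actual MANTIS-INSTRUCT format.
example = {
    "images": ["kitchen_before.jpg", "kitchen_after.jpg"],
    "conversation": [
        {
            "role": "user",
            # Images are referenced inline, interleaved with the text,
            # so the model must co-reference and compare across them.
            "content": (
                "Here is the kitchen before renovation: <image> "
                "and after renovation: <image> "
                "What are the three most noticeable differences?"
            ),
        },
        {
            "role": "assistant",
            "content": (
                "1. The cabinets were repainted white. "
                "2. The tile floor was replaced with hardwood. "
                "3. A kitchen island was added in the center."
            ),
        },
    ],
}

# During instruction tuning, each <image> placeholder would be replaced
# by the visual features of the corresponding image before training.
print(len(example["images"]), "images,", len(example["conversation"]), "turns")
```

A comparison-style example like this targets the comparison and co-reference skills; other records would analogously target reasoning over several images or temporal understanding across video frames.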

 

We evaluate MANTIS on five multi-image benchmarks and seven single-image benchmarks.

 

MANTIS-SigLIP achieves SoTA results on all the multi-image benchmarks, beating the strongest multi-image baseline, Idefics2-8B, by an average of 11 absolute points.

 

Notably, Idefics2-8B was pre-trained on 140M interleaved multi-image examples, which is 200x larger than MANTIS-INSTRUCT.

 

We observe that MANTIS performs equivalently well on the held-in and held-out benchmarks, which shows its generalization ability.

 

Notably, we found that MANTIS can even match the performance of GPT-4V on multi-image benchmarks.

 

We further evaluate MANTIS on single-image benchmarks and show that it also maintains strong single-image performance on par with CogVLM and Emu2.

 

Our results show that multi-image abilities are not necessarily gained through massive pre-training; instead, they can be acquired through low-cost instruction tuning.

 

Our work provides new perspectives on how to improve LMMs’ multi-image abilities.