https://arxiv.org/abs/2405.01483
Large multimodal models (LMMs) have shown great results in single-image vision-language tasks.
However, their ability to solve multi-image vision-language tasks remains limited.
Existing LMMs such as OpenFlamingo, Emu2, and Idefics gain their multi-image ability through pre-training on hundreds of millions of noisy interleaved image-text data from the web, which is neither efficient nor effective.
In this paper, we aim to build strong multi-image LMMs via instruction tuning with academic-level resources.
Therefore, we meticulously construct MANTIS-INSTRUCT, containing 721K multi-image instruction examples, to train MANTIS, a family of models.
This instruction tuning equips MANTIS with multi-image skills such as co-reference, comparison, reasoning, and temporal understanding.
We evaluate MANTIS on five multi-image benchmarks and seven single-image benchmarks.
MANTIS-SigLIP achieves SoTA results on all the multi-image benchmarks and beats the strongest multi-image baseline, Idefics2-8B, by an average of 11 absolute points.
Notably, Idefics2-8B was pre-trained on 140M interleaved multi-image examples, which is 200x larger than MANTIS-INSTRUCT.
We observe that MANTIS performs equally well on held-in and held-out benchmarks, demonstrating its ability to generalize.
Remarkably, MANTIS can even match the performance of GPT-4V on multi-image benchmarks.
We further evaluate MANTIS on single-image benchmarks and show that it maintains strong single-image performance on par with CogVLM and Emu2.
Our results show that multi-image abilities need not be acquired through massive pre-training; instead, they can be gained through low-cost instruction tuning.
Our work provides new perspectives on how to improve LMMs’ multi-image abilities.