https://arxiv.org/abs/2405.01483
Large multimodal models (LMMs) have shown great results in single-image vision-language tasks.
However, their ability to solve multi-image vision-language tasks remains limited.
Existing LMMs such as OpenFlamingo, Emu2, and Idefics gain their multi-image ability through pre-training on hundreds of millions of noisy interleaved image-text data from the web, which is neither efficient nor effective.
In this paper, we aim to build strong multi-image LMMs via instruction tuning with academic-level resources.
Therefore, we meticulously construct MANTIS-INSTRUCT, containing 721K multi-image instruction examples, to train MANTIS, a family of models.
This instruction tuning equips MANTIS with multi-image skills such as co-reference, comparison, reasoning, and temporal understanding.
We evaluate MANTIS on five multi-image benchmarks and seven single-image benchmarks.
MANTIS-SigLIP achieves SoTA results on all the multi-image benchmarks and beats the strongest multi-image baseline, Idefics2-8B, by an average of 11 absolute points.
Notably, Idefics2-8B was pre-trained on 140M interleaved multi-image examples, which is 200x larger than MANTIS-INSTRUCT.
We observe that MANTIS performs equally well on held-in and held-out benchmarks, demonstrating its ability to generalize.
Remarkably, MANTIS can even match the performance of GPT-4V on multi-image benchmarks.
We further evaluate MANTIS on single-image benchmarks and show that it maintains strong single-image performance on par with CogVLM and Emu2.
Our results show that multi-image abilities need not be acquired through massive pre-training; instead, they can be gained through low-cost instruction tuning.
Our work provides new perspectives on how to improve LMMs’ multi-image abilities.