https://arxiv.org/html/2412.14835v1
Progressive Multimodal Reasoning via Active Retrieval
Large Language Models (LLMs) [72, 67, 25, 112, 97] and Multimodal Large Language Models (MLLMs) [54, 6, 104, 125, 13, 14, 42] have rapidly advanced, with broad applications in mathematics [126, 106], programming [31, 91], medicine [45], character recognition […]
Summary
Text Retrieval
They employ Contriever as the text retriever:
https://arxiv.org/pdf/2112.09118
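A minimal sketch of Contriever-style dense retrieval (mean-pooled token embeddings scored by inner product), using the public facebook/contriever checkpoint; the example texts are illustrative and this is not the paper's own code.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
model = AutoModel.from_pretrained("facebook/contriever")

def mean_pooling(token_embeddings, attention_mask):
    # Contriever pools by averaging token embeddings over non-padding positions.
    mask = attention_mask.unsqueeze(-1).float()
    return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

def embed(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return mean_pooling(outputs.last_hidden_state, inputs["attention_mask"])

query_emb = embed(["What is the derivative of x^2?"])
doc_embs = embed(["The derivative of x^2 is 2x.", "Paris is the capital of France."])
scores = query_emb @ doc_embs.T  # inner-product relevance score per document
```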
Cross-Modal Retrieval
Using the CLIP model, they encode text-image pairs (the multimodal queries), following the formulation from https://github.com/DAMO-NLP-SG/multimodal_textbook.
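A sketch of encoding a (text, image) query pair with CLIP via Hugging Face. Fusing the two modalities by averaging their L2-normalized embeddings is an assumption on my part; see the multimodal_textbook repo above for the exact formulation.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def encode_pair(text, image_path):
    image = Image.open(image_path)
    inputs = processor(text=[text], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        t = model.get_text_features(input_ids=inputs["input_ids"],
                                    attention_mask=inputs["attention_mask"])
        v = model.get_image_features(pixel_values=inputs["pixel_values"])
    # L2-normalize each modality, then average into a single query vector
    # (the averaging is assumed, not taken from the paper).
    t = t / t.norm(dim=-1, keepdim=True)
    v = v / v.norm(dim=-1, keepdim=True)
    return (t + v) / 2
```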
They then perform cross-modal retrieval between the encoding of each multimodal query and the entire retrieval database, using FAISS [36] for indexing to retrieve the top-K samples per query.
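A minimal FAISS indexing/search sketch for this step. The corpus embeddings are random placeholders, the dimension matches CLIP ViT-B/32, and the exact-search IndexFlatIP index (over L2-normalized vectors, so inner product equals cosine similarity) is an assumption; these notes don't pin down the index type the paper uses.

```python
import faiss
import numpy as np

d = 512                                               # CLIP ViT-B/32 embedding dim
corpus = np.random.randn(10000, d).astype("float32")  # placeholder corpus embeddings
faiss.normalize_L2(corpus)                            # normalize so IP == cosine

index = faiss.IndexFlatIP(d)   # exact inner-product index
index.add(corpus)

query = np.random.randn(1, d).astype("float32")       # placeholder fused query vector
faiss.normalize_L2(query)

K = 5
scores, ids = index.search(query, K)  # top-K nearest samples for each query
```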
# Knowledge Concept Filtering
r denotes an insight retrieved from the corpus. For each r, compute the cosine similarity against both the multimodal query Q^m and its knowledge concept label L_{kc}; T denotes the filtering threshold, and insights whose similarity falls below T are filtered out.
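A hedged sketch of this filtering step over precomputed embeddings: each retrieved insight r is kept only if its similarity to Q^m or to L_{kc} clears the threshold T. Combining the two scores with max (rather than, say, requiring both) is my assumption, as is the default value of T.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_insights(insight_embs, query_emb, label_emb, T=0.5):
    """Return indices of retrieved insights that pass the concept filter."""
    kept = []
    for i, r in enumerate(insight_embs):
        # Score r against both the multimodal query Q^m and the label L_kc;
        # taking the max of the two similarities is an assumption.
        score = max(cosine(r, query_emb), cosine(r, label_emb))
        if score > T:  # T is the filtering threshold
            kept.append(i)
    return kept
```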