
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model (Summary)

3 IXC-2.5-Reward / Data Preparation: Reward models are trained using pairwise preference annotations (e.g., prompts x with chosen responses yc and rejected responses yr) that reflect human preferences. While existing public preference data is primarily textual, with limited image and scarce video examples, we train IXC-2.5-Reward using both open-source data and a newly collected dataset to ensure broad..
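As a concrete illustration of the pairwise setup above, here is a minimal sketch of the Bradley-Terry-style ranking loss commonly used to train reward models on (prompt, chosen, rejected) triples. The function name and the use of PyTorch are my own assumptions for illustration, not details taken from the paper excerpt.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(score_chosen: torch.Tensor,
                         score_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style pairwise ranking loss:
    # pushes the reward r(x, y_c) of the chosen response above r(x, y_r).
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Toy usage: random scalars stand in for reward-head outputs r(x, y_c), r(x, y_r).
chosen = torch.randn(8, requires_grad=True)
rejected = torch.randn(8, requires_grad=True)
loss = pairwise_reward_loss(chosen, rejected)
loss.backward()
```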

Uncategorized 2025.02.12

Virgo: A Preliminary Exploration on Reproducing o1-like MLLM (Summary)

2 Method: In this section, we present our preliminary attempts to adapt MLLMs by equipping them with slow-thinking capacities for complex multimodal tasks. We explore two straightforward adaptation methods: (1) transferring slow-thinking abilities using text-based long thought data, and (2) distilling multimodal long thought data from existing slow-thinking MLLMs. Our aim is to investigate how slo..
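To make adaptation method (1) concrete, below is a minimal sketch of how a text-only long-thought example might be packed into a chat-style SFT target before mixing it into multimodal fine-tuning data. The field names and the <think>/<answer> delimiters are illustrative assumptions, not the paper's actual data schema.

```python
def format_long_thought_example(question: str, thought: str, answer: str) -> dict:
    # Wrap the long reasoning trace and final answer into one assistant turn,
    # so the model learns to emit slow-thinking traces before answering.
    target = f"<think>\n{thought}\n</think>\n<answer>\n{answer}\n</answer>"
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": target},
        ]
    }

example = format_long_thought_example(
    question="What is 17 * 24?",
    thought="17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
    answer="408",
)
```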

Uncategorized 2025.02.11

insight-V Paper Summary

https://arxiv.org/html/2411.14432v1#S3 In short, two MLLMs are used: one for reasoning and one for summarization. To fully leverage the reasoning capabilities of MLLMs, we propose Insight-V, a novel system comprising two MLLMs dedicated to reasoning and summarization, respectively. Reasoning model: generates the detailed reasoning process. Summary model: uses the reasoning as supplementary information and evaluates its relevance and utility with respect to the answer. 3.2 Constru..
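A minimal sketch of the two-model decomposition described above, assuming hypothetical reasoning_model and summary_model callables with an (image, prompt) -> str interface; this is not Insight-V's actual API, just an illustration of the pipeline shape.

```python
def run_two_model_pipeline(image, question, reasoning_model, summary_model) -> str:
    # Stage 1: the reasoning MLLM produces a detailed chain of reasoning.
    reasoning = reasoning_model(
        image, f"Question: {question}\nThink step by step in detail."
    )
    # Stage 2: the summary MLLM treats that reasoning as supplementary
    # information, judging its relevance and utility while answering.
    answer = summary_model(
        image,
        f"Question: {question}\n"
        f"Supplementary reasoning (may contain errors, use selectively):\n"
        f"{reasoning}\n"
        f"Give the final answer.",
    )
    return answer
```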

Uncategorized 2025.02.10

Forest of Thought Paper Summary

https://arxiv.org/html/2412.09078v1#S4 Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning (Zhenni Bi, Kai Han, Chuanjian Liu, Yehui Tang, Yunhe Wang). Abstract: Large Language Models (LLMs) have shown remarkable abilities across various language tasks. Benchmarks: GSM8K, MATH. 3.1 FoT frame..
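As a rough illustration of scaling test-time compute with multiple reasoning trees, here is a simplified sketch that runs several independent attempts and majority-votes their final answers. solve_once is a hypothetical callable standing in for one reasoning tree; the actual FoT framework adds tree-level sparse activation and correction strategies that this sketch omits.

```python
from collections import Counter
from typing import Callable, List

def forest_vote(solve_once: Callable[[str], str],
                question: str,
                num_trees: int = 4) -> str:
    # Run several independent reasoning "trees" and aggregate their
    # final answers by majority vote (self-consistency style).
    answers: List[str] = [solve_once(question) for _ in range(num_trees)]
    return Counter(answers).most_common(1)[0][0]
```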

Papers on multimodal preference optimization + reward models

Efficient self-improvement in multimodal large language models: A model-level judge-free approach. Strengthening multimodal large language model with bootstrapped preference optimization. CLIP-DPO: Vision-language models as a source of preference for fixing hallucinations in LVLMs. Enhancing large vision language models with self-training on image comprehension. RLAIF-V: Aligning MLLMs through o..

Uncategorized 2025.01.30