
Papers on multimodal preference optimization + reward models

jinuklee 2025. 1. 30. 18:03

Efficient self-improvement in multimodal large language models: A model-level judge-free approach.

 

Strengthening multimodal large language model with bootstrapped preference optimization.

 

CLIP-DPO: Vision-language models as a source of preference for fixing hallucinations in LVLMs.

 

Enhancing large vision language models with self-training on image comprehension.

 

RLAIF-V: Aligning MLLMs through open-source AI feedback for super GPT-4V trustworthiness.

 

Multi-modal hallucination control by visual information grounding.

 

Aligning modalities in vision large language models via preference fine-tuning.

 

Self-supervised visual preference alignment.

 

Modality-fair preference optimization for trustworthy MLLM alignment.

 
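Common thread across the list above: roughly speaking, these works build chosen/rejected response pairs (from CLIP ranking, open-source AI feedback, self-generated corruptions of the image or text, etc.) and then fine-tune with a DPO-style objective; they differ mainly in where the pairs come from, not in the loss itself. A minimal sketch of the standard DPO loss on precomputed log-probabilities (variable names are my own, not from any specific paper):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss on summed per-response log-probabilities.

    Each argument is a tensor of shape (batch,):
    log pi(y|x) summed over response tokens, for the policy and the frozen
    reference model, on chosen (y_w) and rejected (y_l) responses.
    """
    # Implicit reward margin: beta * (chosen log-ratio minus rejected log-ratio)
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    # Maximize the probability that the chosen response is preferred
    return -F.logsigmoid(logits).mean()
```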

Miscellaneous

------------------------------

Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback

https://arxiv.org/pdf/2501.12895
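As I understand the paper, TPO keeps the model weights frozen and "optimizes" the response itself: sample several candidates, score them with a reward model, have the model write a textual critique contrasting the best and worst candidates, then regenerate conditioned on that feedback. A rough sketch of the loop, assuming hypothetical `llm.generate` and `reward_model.score` interfaces and placeholder prompt templates (not the paper's actual API):

```python
def test_time_preference_optimization(llm, reward_model, prompt,
                                      num_candidates=4, num_iters=3):
    """Iteratively refine a response at inference time via textual feedback.

    llm.generate(text, n) -> list of n response strings   (hypothetical API)
    reward_model.score(prompt, response) -> scalar score  (hypothetical API)
    No weights are updated; only the candidate pool evolves.
    """
    candidates = llm.generate(prompt, n=num_candidates)
    for _ in range(num_iters):
        scored = sorted(candidates, key=lambda r: reward_model.score(prompt, r))
        worst, best = scored[0], scored[-1]
        # "Textual loss": ask the model why the best response beats the worst.
        critique = llm.generate(
            f"Prompt: {prompt}\nBetter response: {best}\nWorse response: {worst}\n"
            "Explain what makes the better response better and how to improve it further.",
            n=1)[0]
        # "Textual gradient step": regenerate candidates conditioned on the critique.
        candidates = llm.generate(
            f"Prompt: {prompt}\nDraft: {best}\nFeedback: {critique}\n"
            "Rewrite the draft, applying the feedback.",
            n=num_candidates)
    return max(candidates, key=lambda r: reward_model.score(prompt, r))
```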

Pairwise RM: Perform Best-of-N Sampling with Knockout Tournament
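The title describes the mechanism: instead of assigning each of the N samples an independent scalar score, a pairwise reward model compares two candidates at a time and the winner advances, so Best-of-N selection becomes a single-elimination bracket. A minimal sketch, assuming a `pairwise_judge(prompt, a, b)` that returns the preferred response (my placeholder, not the paper's interface):

```python
def knockout_best_of_n(prompt, candidates, pairwise_judge):
    """Select one of N candidates via a single-elimination (knockout) tournament.

    pairwise_judge(prompt, a, b) returns whichever of a, b it prefers.
    Uses N - 1 pairwise comparisons instead of N independent scalar scores.
    """
    pool = list(candidates)
    while len(pool) > 1:
        next_round = []
        # Compare candidates in pairs; winners advance to the next round.
        for i in range(0, len(pool) - 1, 2):
            next_round.append(pairwise_judge(prompt, pool[i], pool[i + 1]))
        if len(pool) % 2 == 1:  # odd candidate out gets a bye
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]
```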

Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback
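Step-KTO, as the title suggests, applies KTO-style optimization to binary correct/incorrect feedback on individual reasoning steps rather than to whole-response preference pairs. A minimal sketch of a KTO-style loss over per-step binary labels; the reference point `z_ref` is fixed at 0 for simplicity (the original KTO estimates it from a batch KL term), and all names are my own, not the paper's:

```python
import torch

def stepwise_kto_loss(logratio_steps, step_labels, beta=0.1, z_ref=0.0,
                      lambda_d=1.0, lambda_u=1.0):
    """KTO-style loss over per-step binary feedback.

    logratio_steps: log pi_theta(step | context) - log pi_ref(step | context),
                    one value per reasoning step, shape (num_steps,).
    step_labels:    1 for steps judged correct, 0 for steps judged incorrect.
    z_ref:          KTO reference point (batch KL estimate in the original; 0 here).
    """
    r = beta * logratio_steps
    # Correct steps are pushed above the reference point, incorrect steps below it.
    loss_correct = lambda_d * (1.0 - torch.sigmoid(r - z_ref))
    loss_incorrect = lambda_u * (1.0 - torch.sigmoid(z_ref - r))
    loss = torch.where(step_labels.bool(), loss_correct, loss_incorrect)
    return loss.mean()
```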