Efficient self-improvement in multimodal large language models: A model-level judge-free approach.
Strengthening multimodal large language model with bootstrapped preference optimization.
CLIP-DPO: Vision-language models as a source of preference for fixing hallucinations in LVLMs.
Enhancing large vision language models with self-training on image comprehension.
RLAIF-V: Aligning MLLMs through open-source AI feedback for super GPT-4V trustworthiness.
Multi-modal hallucination control by visual information grounding.
Aligning modalities in vision large language models via preference fine-tuning.
Self-supervised visual preference alignment.
Modality-fair preference optimization for trustworthy MLLM alignment.
Others
------------------------------
Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback
https://arxiv.org/pdf/2501.12895
Pairwise RM: Perform Best-of-N Sampling with Knockout Tournament
Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback