Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward
https://arxiv.org/pdf/2404.01258v2 Preference modeling techniques, such as direct preference optimization (DPO), have been shown to be effective in enhancing the generalization abilities of large language models (LLMs). However, in tasks involving video instruction following, providing informative feedback, especially for detecting hallucinations in generated responses, remains a significant challenge. Previous..