2.3 RLAIF AND LLM-AS-A-JUDGE

Reinforcement Learning from AI Feedback (RLAIF) presents an alternative to the standard RLHF pipeline. Bai et al. (2022b) demonstrate the efficacy of RLAIF in training helpful and harmless models without relying on human feedback labels for harmlessness assessment. Their work shows that as language model capabilities improve, AI identification of harms improves substantially, particularly when chain-of-thought reasoning is used. Notably, they demonstrate that using AI-generated preference labels for reinforcement learning can yield improvements in model behavior that are competitive with, or surpass, those achieved using human feedback for harmlessness evaluation.

Zheng et al. (2023a) introduce the LLM-as-a-Judge method, further extending the RLAIF paradigm. They demonstrate that strong language models, even without explicit training for evaluation tasks, can provide judgments that agree well with human preferences. Their study finds that LLMs can achieve over 80% agreement with human preferences, a level comparable to inter-expert agreement. This finding establishes a foundation for developing LLM-based evaluation frameworks.
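As a concrete illustration of the pairwise LLM-as-a-Judge protocol described above, the following minimal Python sketch labels preference pairs with an LLM judge. The prompt template, the `query_llm` callable, and the position-swap heuristic are illustrative assumptions, not the exact setup of Zheng et al. (2023a).

```python
# Minimal sketch of pairwise LLM-as-a-Judge preference labeling.
# `query_llm` is a hypothetical stand-in for any chat-completion call;
# the prompt wording is illustrative, not the exact template from Zheng et al. (2023a).
from typing import Callable

JUDGE_PROMPT = """You are an impartial judge. Compare the two assistant responses
to the user question below and decide which one is better overall
(helpfulness, relevance, accuracy). Answer with exactly "A", "B", or "Tie".

[Question]
{question}

[Response A]
{answer_a}

[Response B]
{answer_b}
"""

def judge_pair(question: str, answer_a: str, answer_b: str,
               query_llm: Callable[[str], str]) -> str:
    """Return 'A', 'B', or 'Tie' as the judge's preference label."""
    verdict = query_llm(JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b)).strip()
    # Fall back to 'Tie' if the judge does not answer in the expected format.
    return verdict if verdict in {"A", "B", "Tie"} else "Tie"

def judge_with_swap(question: str, answer_a: str, answer_b: str,
                    query_llm: Callable[[str], str]) -> str:
    """Query twice with swapped response positions to reduce position bias;
    keep the label only if both orderings agree."""
    first = judge_pair(question, answer_a, answer_b, query_llm)
    second = judge_pair(question, answer_b, answer_a, query_llm)
    swapped = {"A": "B", "B": "A", "Tie": "Tie"}[second]
    return first if first == swapped else "Tie"
```

The position swap is one common mitigation for the judge's sensitivity to answer order; agreement across both orderings is treated as a confident label, and disagreement falls back to a tie.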
Parrot
Multimodal Large Language Models. The domain of MLLMs has witnessed significant advances, particularly in the enhancement of visual and language processing. Current MLLMs are usually a combination of visual encoders [51; 57; 21; 70; 48; 68], LLMs, and fusion modules. Innovations like Flamingo [2] and OpenFlamingo [4] have advanced visual representation by integrating a Perceiver Resampler with vision encoders. BLIP-2 [31] and InstructBLIP [17] employ a Q-Former to connect the frozen LLM and vision encoder. InternVL [14] trains a huge ViT and a Q-Former to integrate visual modalities through a multi-stage training method. MiniGPT-4 [75] leverages both a Q-Former and a linear projector to bridge the gap between the vision module and the LLM. Furthermore, LLaVA [38] adopts a simple MLP projector to promote alignment between the LLM and the vision encoder. mPLUG-Owl [64] introduces an approach that first finetunes the vision encoder to align visual features and then tunes the LLM using LoRA [23]. Qwen-VL [6] raises the visual module resolution to 448 to refine the model's visual processing capabilities. Fuyu-8B [7] directly projects image patches before integration with the LLM. MM1 [43] conducts ablative studies on connector design choices, revealing that the modality adapter type matters less than the number of visual tokens and the image resolution. Mini-Gemini [34] utilizes high-resolution visual tokens and high-quality data to narrow the performance gap with GPT-4 and Gemini. Alongside the rapid advancements of open-source models, proprietary models such as GPT-4V/4o [46; 47], Gemini [58; 54], Qwen-VL-Plus/Max [6], and Claude 3 [3] have achieved outstanding results in evaluations and practical applications. In this work, owing to the simplicity of the LLaVA architecture, we adopt a framework similar to LLaVA to design our model.
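To make the LLaVA-style design mentioned above concrete, here is a minimal PyTorch sketch of the connector: patch features from a frozen vision encoder are mapped into the LLM embedding space by a small MLP projector and prepended to the text embeddings. The hidden sizes and the two-layer GELU MLP follow common open-source configurations but are assumptions here, not the exact settings of any cited model.

```python
# Minimal sketch of a LLaVA-style connector: a two-layer MLP projects frozen
# vision-encoder patch features into the LLM token-embedding space.
# Hidden sizes (1024 for the vision encoder, 4096 for the LLM) are assumptions.
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from the vision encoder
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)

# The projected patches are prepended to the text embeddings, and the
# concatenated sequence is fed to the LLM as usual.
vision_features = torch.randn(1, 576, 1024)   # e.g. 24x24 patches from a ViT
text_embeds = torch.randn(1, 32, 4096)        # embedded instruction tokens
visual_tokens = MLPProjector()(vision_features)
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)  # (1, 608, 4096)
```

This simplicity is the main appeal of the MLP-projector design: the only trainable bridge is a small MLP, while Q-Former- or resampler-based connectors add a learned module that compresses the visual tokens before they reach the LLM.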
5. Related Work

Multimodal Large Language Models. Recent trends in multimodal learning have witnessed the success of building MLLMs by connecting visual encoders with powerful LLMs [11, 20, 26, 58, 61]. The current MLLM training paradigm typically involves two stages: (1) Pretraining. Models are pretrained on large-scale image-text pairs [6, 14, 27, 54, 60] or interleaved data [4, 5, 20] to learn the semantic mapping between visual and text signals. (2) Instruction Tuning. To equip the model with instruction-following capability, MLLMs are further finetuned on visual instruction data, including collections of existing human-annotated datasets [14, 28, 34] and data generated by ChatGPT/GPT-4 [28, 33, 35, 60]. Despite this success, current MLLMs suffer from serious hallucination problems [29, 32, 33, 48]. Notably, even after extensive efforts, GPT-4V has still been found to be prone to hallucinations, confidently making basic factual errors [37]. This problem undermines the practical application of MLLMs, especially in high-stakes scenarios, and has recently drawn increasing attention from the community.

Behavior Alignment for LLMs. Aligning language agent behaviors with human preferences has emerged as a promising research direction [22, 24]. Pivotal approaches for LLMs include instruction tuning (or supervised finetuning) and RLHF [39, 47]. While supervised finetuning is suitable for basic behavior alignment [15, 49], it may introduce or amplify hallucination due to the mismatch between the likelihood-maximization objective and human preferences [38, 39]. Therefore, RLHF is widely adopted for further behavior and preference alignment [8, 13, 38], where proximal policy optimization (PPO) [45] is recognized as the major technique. Later adaptations attempt to stabilize the optimization process [42] and incorporate more fine-grained signals [30, 56]. However, RLHF has rarely been explored in MLLMs to align model behaviors with humans.

Reducing Hallucination for MLLMs. Some preliminary efforts have been made to alleviate hallucination problems in MLLMs. LRV [33] generates instruction data with negative responses and mitigates hallucination by limiting the response length. However, limiting the response length does not essentially address the problem, and it also undermines the helpfulness of responses. VIGC [52] iteratively refines the instruction data for better instruction tuning. Woodpecker [59] proposes to post-edit hallucinations by merging the output of MLLMs and a more accurate expert VQA model using GPT-3.5. The post-editing procedure invokes external tools and LLMs much larger than the target MLLM online in multiple stages, which leads to high inference costs and delays. Gunjal et al. [19] distinguish the inaccurate parts in responses via human annotation and discourage the hallucinated parts through direct preference optimization. However, the positive behaviors for the hallucinated parts remain unknown, leaving the human feedback too incomplete to learn the behavior boundary. The concurrent LLaVA-RLHF [48] applies the traditional RLHF approach [39] to MLLMs and augments the reward model with rich additional text descriptions. It therefore faces similar challenges with label ambiguity, learning efficiency, and training complexity. In comparison, RLHF-V presents the first fine-grained correctional human feedback learning framework for behavior alignment, and systematically addresses different hallucination sources in training MLLMs, achieving strong performance in trustworthiness.
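Since the paragraphs above contrast PPO-based RLHF with direct preference optimization (DPO), a short sketch of the standard DPO objective on (chosen, rejected) pairs may help fix ideas. It follows the generic formulation and is not specific to RLHF-V, LLaVA-RLHF, or Gunjal et al. [19]; the value of beta and the dummy inputs are illustrative assumptions.

```python
# Minimal sketch of the standard DPO loss on a batch of (chosen, rejected) pairs.
# Inputs are summed token log-probabilities of each response under the policy
# being trained and under a frozen reference model; beta is the usual
# temperature-like coefficient (the value here is an illustrative assumption).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """-log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))), averaged over the batch."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example with dummy log-probabilities for a batch of 4 preference pairs.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```

Unlike PPO, this objective needs no separate reward model or online rollouts, which is one reason preference-optimization variants are attractive for hallucination mitigation in MLLMs.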
MAVIS
Visual Instruction Tuning. The advancement of large language models (LLMs) [5, 28, 59, 16] with instruction tuning has significantly enhanced zero-shot capabilities across a range of tasks. Drawing inspiration from this, the LLaMA-Adapter series [70, 20, 26] proposes a zero-initialized attention mechanism to align frozen vision encoders [51] with LLaMA [58] for multi-modal learning. The LLaVA series [41, 39] employs a linear projector for vision-language alignment, establishing visual instruction tuning as a standard training approach in the multi-modal field. Flamingo [2] and OpenFlamingo [3] have honed visual representation by integrating a cross-attention resampler with vision encoders. The SPHINX series [21, 38] utilizes a blend of visual encoders to make the LLM cognizant of various image aspects. The InternVL series [15, 17, 56] employs a large vision encoder and a Q-Former [34] to incorporate high-quality visual information through a multi-stage training methodology. LLaVA-NeXT [40, 31, 33] further introduces the 'AnyRes' technique to handle images at arbitrary resolutions, and LLaVA-NeXT-Interleave [32] extends the scope to interleaved multi-image settings. There are also recent efforts to apply visual instruction tuning to 3D [25, 62] and video [35, 18] scenarios. Despite the impressive strides in both model capability and training efficiency made by multi-modal large language models (MLLMs) through visual instruction tuning, there is currently no MLLM specifically designed for mathematical problem solving, nor a substantial dataset available for this purpose in the open-source community. In this paper, we mitigate this issue by proposing MAVIS with high-quality mathematical visual datasets and training paradigms.
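As a rough illustration of what visual instruction tuning means at the data level, the sketch below packs one (image, instruction, response) sample so that only the response tokens are supervised. The placeholder-token scheme, the chat template, and the toy tokenizer are assumptions for illustration, not the data format of any specific model cited above.

```python
# Minimal sketch of packing one visual-instruction-tuning sample: the image is
# represented by placeholder token ids that the projector later replaces with
# visual embeddings, and the loss is computed only on the response tokens.
from typing import Callable, List, Tuple

IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def pack_sample(tokenize: Callable[[str], List[int]], num_image_tokens: int,
                instruction: str, response: str) -> Tuple[List[int], List[int]]:
    """Return (input_ids, labels) with the loss masked outside the response."""
    image_ids = [0] * num_image_tokens                 # placeholders, later replaced by visual tokens
    prompt_ids = tokenize(f"USER: {instruction}\nASSISTANT: ")
    response_ids = tokenize(response)

    input_ids = image_ids + prompt_ids + response_ids
    labels = ([IGNORE_INDEX] * (len(image_ids) + len(prompt_ids))  # no loss on image/prompt
              + response_ids)                                      # supervise only the answer
    return input_ids, labels

# Toy tokenizer for illustration only: one "token id" per whitespace-separated word.
toy_tokenize = lambda text: [hash(w) % 32000 for w in text.split()]
ids, labels = pack_sample(toy_tokenize, num_image_tokens=576,
                          instruction="What is the value of x in the figure?",
                          response="From the diagram, x = 30 degrees.")
```

Masking the instruction and visual placeholders keeps the optimization focused on response generation, which is the common practice across the instruction-tuning pipelines surveyed above.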