3. IXC-2.5-Reward
Data Preparation
Reward models are trained on pairwise preference annotations (prompts x with chosen responses yc and rejected responses yr) that reflect human preferences. Existing public preference data is primarily textual, with limited image and scarce video examples, so we train IXC-2.5-Reward on both open-source data and a newly collected dataset to ensure broader domain coverage.
Tab. 1 lists the open-source pairwise data used in IXC-2.5-Reward, which primarily covers instruction following, safety, and general knowledge. Tab. 2 details the sources of our newly collected data, which starts from supervised fine-tuning (SFT) data consisting of prompts x and corresponding chosen responses yc across diverse domains: text-rich document understanding, math reasoning, and video understanding. We also collect in-house instruction-following data, which will be released in the future. To obtain rejected responses yr, we prompt the SFT model, InternLM-XComposer-2.5 (IXC-2.5) [112], to generate multiple outputs for each prompt and then apply distinct selection criteria. For general and text-rich data, we use GPT-4o [31] with pairwise evaluation prompts to select, as the rejected response, a generated output judged worse than the SFT ground-truth answer. For math reasoning and instruction-following data, we build verifier functions [40] that compare generated responses against ground-truth solutions to label the chosen and rejected responses, as sketched below. Our newly collected data complements the open-source data, yielding a comprehensive, high-quality multi-modal preference dataset.
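A minimal sketch of the verifier-based labeling for math-reasoning prompts is shown below. The `extract_final_answer` helper, the exact-match rule, and the fallback to the SFT ground-truth answer are illustrative assumptions, not the authors' released verifier functions.

```python
import re

def extract_final_answer(response: str) -> str:
    """Assumed heuristic: take the last number-like token in a response."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", response)
    return matches[-1] if matches else ""

def label_pair(prompt: str, ground_truth: str, candidates: list[str]):
    """Split sampled SFT-model outputs into chosen/rejected by checking them
    against the ground-truth solution (sketch, not the paper's exact criteria)."""
    correct = [c for c in candidates if extract_final_answer(c) == extract_final_answer(ground_truth)]
    wrong = [c for c in candidates if extract_final_answer(c) != extract_final_answer(ground_truth)]
    if not wrong:
        return None  # no usable rejected response for this prompt
    chosen = correct[0] if correct else ground_truth  # fall back to the SFT ground-truth answer
    return {"prompt": prompt, "chosen": chosen, "rejected": wrong[0]}
```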
Model Architecture
Our reward model InternLM-XComposer-2.5-Reward (IXC-2.5-Reward) is built upon the SFT model IXC-2.5 [112]. As shown in Fig. 1 (b), we reuse the pre-trained weights of IXC-2.5 for most components, such as the visual encoder and the MLP projector, which have already aligned image and video inputs with the text modality. Thus, IXC-2.5-Reward only needs to be trained on preference data to predict reward scores, without requiring additional pre-training data for modality alignment.
We replace the final linear layer of IXC-2.5 with a score head f for IXC-2.5-Reward that predicts the reward score. Given an input prompt x and a response y, the score head f transforms the hidden-state features averaged over all tokens into a scalar r(x, y), which serves as the predicted reward score for the input pair. A minimal sketch of this head is given below.
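The sketch below shows such a score head in PyTorch, assuming the backbone exposes per-token hidden states of size `hidden_size`; the module and variable names are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

class ScoreHead(nn.Module):
    """Maps averaged token hidden states to a single scalar reward r(x, y)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # Replaces the LM's final vocabulary projection with a 1-dim output.
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size); attention_mask: (batch, seq_len)
        mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
        pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        return self.score(pooled).squeeze(-1)  # (batch,) reward scores
```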
![](https://blog.kakaocdn.net/dn/bChFd4/btsMgKJTY7e/pfPX4qTevGZZbvSsgSRzK1/img.png)
Training Strategy
As shown in Fig. 1 (b), we freeze the model's vision encoder and projector, which are initialized from IXC-2.5 [112], and train only the LLM (InternLM [112]) and the score head, as sketched below. Other components of IXC-2.5, such as the dynamic image partitioning mechanism for high-resolution inputs, remain unchanged.
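A sketch of this freezing scheme in PyTorch, assuming the model exposes `vision_encoder`, `projector`, `language_model`, and `score_head` attributes (hypothetical names for illustration):

```python
def configure_trainable_parameters(model):
    """Freeze the modality-alignment modules; train only the LLM and the score head."""
    for module in (model.vision_encoder, model.projector):
        for p in module.parameters():
            p.requires_grad_(False)  # frozen: initialized from IXC-2.5 and left untouched
    for module in (model.language_model, model.score_head):
        for p in module.parameters():
            p.requires_grad_(True)   # updated on the preference data
    return [p for p in model.parameters() if p.requires_grad]

# Example usage (optimizer and learning rate are placeholders, not the paper's settings):
# trainable = configure_trainable_parameters(reward_model)
# optimizer = torch.optim.AdamW(trainable, lr=1e-5)
```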
Length Constraints
We remove data pairs where the chosen response yc is significantly longer than the rejected response yr (see the sketch below). This prevents the reward model from learning to associate length with quality. Notably, we found that the vulnerability of LLM-based evaluation to length bias, a known issue in LLMs [23], also has significant implications for LVLMs. Specifically, open-ended Visual Question Answering (VQA) benchmarks that employ LVLMs (e.g., GPT-4o) as judges are susceptible to inflated scores from overly long responses. Consequently, removing the length constraint on the reward model resulted in improved PPO policy performance on these judge-based benchmarks. A detailed analysis is provided in Tab. 7.
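The length-based filtering can be sketched as follows; the word-count heuristic and the ratio threshold are placeholders, not the values reported in the paper.

```python
def filter_length_biased_pairs(pairs, max_ratio: float = 1.5):
    """Drop pairs whose chosen response is much longer than the rejected one.

    `pairs` is an iterable of dicts with 'chosen' and 'rejected' strings;
    `max_ratio` is an assumed threshold, not the paper's reported setting.
    """
    kept = []
    for pair in pairs:
        chosen_len = len(pair["chosen"].split())
        rejected_len = max(len(pair["rejected"].split()), 1)
        if chosen_len / rejected_len <= max_ratio:
            kept.append(pair)
    return kept
```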
![](https://blog.kakaocdn.net/dn/l8bmk/btsMgOk2qmU/WR2lT0sZjs1CKbAamKs8P0/img.png)