CLIP is one of the most important multimodal foundation models today, aligning visual and textual signals into a shared feature space using a simple contrastive learning loss on large-scale image-text pairs.
What powers CLIP's capabilities?
The rich supervision signals provided by natural language — the carrier of human knowledge — shape a powerful cross-modal representation space.
As a result, CLIP supports a variety of tasks, including zero-shot classification, detection, segmentation, and cross-modal retrieval, significantly influencing the entire multimodal domain.
However, with the rapid advancements in large language models (LLMs) like GPT-4 and LLaMA, the boundaries of language comprehension and generation are continually being pushed.
This raises an intriguing question: can the capabilities of LLMs be harnessed to further improve multimodal representation learning?
The potential benefits of incorporating LLMs into CLIP are clear.
LLMs' strong textual understanding can fundamentally improve CLIP's ability to handle image captions, drastically enhancing its ability to process long and complex texts — a well-known limitation of vanilla CLIP.
Moreover, LLMs are trained on a vast corpus of text, possessing open-world knowledge.
This allows them to expand on caption information during training, increasing the efficiency of the learning process.
However, realizing this potential is challenging.
Despite LLMs' powerful internal comprehension, their autoregressive nature hides this capability within the model, leading to output features with poor discriminability.
Our experiments show that directly integrating LLMs into CLIP results in catastrophic performance drops.
In this paper, we propose LLM2CLIP, a novel approach that embraces the power of LLMs to unlock CLIP's potential.
By fine-tuning the LLM in the caption space with contrastive learning, we extract its textual capabilities into the output embeddings, significantly improving the output layer's textual discriminability.
We then design an efficient training process where the fine-tuned LLM acts as a powerful teacher for CLIP's visual encoder.
Thanks to the LLM's presence, we can now incorporate longer and more complex captions without being restricted by the context window and capacity limitations of vanilla CLIP's text encoder.
Our experiments demonstrate that this approach brings substantial improvements in cross-modal tasks.
Our method directly boosted the performance of the previously SOTA EVA02 model by 16.5% on both long-text and short-text retrieval tasks, transforming a CLIP model trained solely on English data into a state-of-the-art cross-lingual model.
Moreover, when integrated into multimodal training with models like Llava 1.5, it consistently outperformed CLIP across nearly all benchmarks, demonstrating comprehensive performance improvements.
The contributions of our methodology are threefold: First, we designed experiments to analyze the key reason preventing LLMs from directly participating in multimodal representation learning — the weak discriminability of their output features.
Second, we introduced the caption contrastive fine-tuning method, which significantly improves feature discriminability.
Third, we developed the LLM2CLIP training framework, which has been proven to be an efficient and effective method for leveraging LLMs to deliver substantial performance improvements to pretrained CLIP models.
Large Language Models (LLMs), such as Llama-3, are trained on massive world corpora through auto-regression, equipping them with open-world knowledge and enabling them to perform various tasks.
We initially hoped to leverage the capabilities of LLMs to directly retrain a CLIP model.
Although LLMs exhibit strong text comprehension, they are difficult to use directly as text embedding models.
This is because their knowledge is encapsulated within the model, and their output features are heavily skewed towards individual word predictions.
As generative models, they are not trained to ensure good linear separability of output features, making them less effective when used to interpret captions for CLIP.
As highlighted by Chen et al. (2022), cross-modal contrastive learning in CLIP requires that each modality possesses strong internal discriminability.
To evaluate the effectiveness of various language models in terms of text discriminability, and to test whether the native output features of LLMs, as hypothesized, struggle to distinguish image captions, we introduce a new metric: the MS COCO Caption Retrieval Accuracy (CRA).
MS COCO is a widely used multimodal dataset, containing over 330K images, each with five captions.
These captions are written by different human annotators and provide diverse descriptions for each image.
In our evaluation on the MS COCO validation set, we use only the first two captions of each image and treat the captions of the same image as positive pairs, while all other captions serve as negative samples.
We then perform caption-to-caption retrieval and assess Top-1 accuracy using different language models, defining the result as their CRA score.
Higher CRA scores indicate better discriminability of the language models on image captions.
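To make the metric concrete, the sketch below shows one way CRA can be computed from paired captions; the `encode` function stands in for whichever language model is being evaluated and is an assumed interface, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def caption_retrieval_accuracy(encode, captions_a, captions_b):
    """CRA sketch: captions_a[i] and captions_b[i] describe the same image.

    `encode` is any text encoder mapping a list of captions to an (N, D)
    embedding tensor (an assumed interface, not the paper's exact code).
    """
    za = F.normalize(encode(captions_a), dim=-1)  # (N, D)
    zb = F.normalize(encode(captions_b), dim=-1)  # (N, D)
    sim = za @ zb.T                               # cosine similarity matrix
    # Top-1 caption-to-caption retrieval: the paired caption should rank first.
    pred = sim.argmax(dim=1)
    target = torch.arange(len(captions_a), device=pred.device)
    return (pred == target).float().mean().item()
```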
As shown in Table 1, using a pure LLM results in a CRA score of only 18.4%, indicating that the majority of captions cannot be well-separated in the output space.
In fact, captions with similar distances may be entirely unrelated, as illustrated in Figure 2.
In contrast, the text encoder from the original state-of-the-art CLIP model achieves a CRA score of 66%, underscoring the inadequacy of native LLM output features in caption discriminability.
Consequently, it is challenging to apply LLMs directly in CLIP model training.
Subsequent experiments in Table 6 also confirm that replacing CLIP's text encoder with Llama-3 8B and training the corresponding ViT with contrastive learning significantly underperforms the original CLIP.
In this section, we aim to fine-tune the LLM's token outputs to better capture features that can distinguish between image captions.
The process of improving the discriminability of LLM output features on caption text is quite straightforward: we need the distances between captions of the same image to be closer, and those of different images to be further apart.
Therefore, we apply a caption contrastive (CC) fine-tuning to the LLM's output features, treating different captions of the same image as positive samples and the rest of the captions as negative samples.
To obtain enough varied descriptions for the same image, we use the ShareCaptioner (Zheng et al., 2024; Chen et al., 2023) modified CC-3M (Sharma et al., 2018) dataset, which provides both original captions and augmented dense captions for each image.
These can be treated as positive pairs.
We followed the training methodology of LLM2Vec (BehnamGhader et al., 2024), first expanding the LLM's attention mechanism to bidirectional attention, and employing Masked Next Token Prediction (MNTP) for initialization to achieve better results.
Specifically: First, we transform the LLM's causal attention mechanism into bidirectional attention, as we no longer need it to retain generative capabilities but rather function as an encoder.
Since autoregressive training is no longer required, switching to bidirectional attention improves its ability to capture contextual information.
Second, we employ MNTP to train the newly added bidirectional attention mechanism, providing a strong initialization.
For a given sequence of N tokens, we mask a subset and predict their values, similar to BERT (Devlin et al., 2018).
However, unlike BERT, we adapt this process to fit the nature of LLMs by predicting each masked token from the position immediately preceding it, matching the next-token prediction head.
We train on both image captions and pure text with equal weighting.
In addition to CC-3M, we also use the Wikitext-103 (Merity et al., 2016) dataset to preserve the LLM's text capabilities and prevent divergence from its original strengths.
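As a rough illustration of the MNTP objective described above, the sketch below masks a fraction of tokens and computes the loss from the logits at the preceding positions. It assumes a Hugging Face-style causal LM whose attention has already been made bidirectional; all names and hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def mntp_loss(model, input_ids, mask_token_id, mask_prob=0.2):
    """Masked Next Token Prediction (sketch).

    A fraction of tokens is masked, and each masked token at position i is
    predicted from the logits at position i - 1, matching the next-token
    head of the (now bidirectional) LLM. `model` is assumed to return
    logits of shape (B, L, V), as Hugging Face causal LMs do.
    """
    labels = input_ids.clone()
    mask = torch.rand_like(input_ids, dtype=torch.float) < mask_prob
    mask[:, 0] = False                      # position 0 has no preceding logit
    corrupted = input_ids.masked_fill(mask, mask_token_id)

    logits = model(corrupted).logits        # (B, L, V)
    # Logits at position i - 1 predict the token masked at position i.
    shifted_logits = logits[:, :-1, :]
    shifted_labels = labels[:, 1:]
    shifted_mask = mask[:, 1:]

    return F.cross_entropy(
        shifted_logits[shifted_mask],       # (num_masked, V)
        shifted_labels[shifted_mask],       # (num_masked,)
    )
```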
Finally, we perform the actual caption contrastive fine-tuning, using a supervised SimCSE loss to bring captions of the same image closer together and push apart captions of different images.
We use two prompt templates: "Given a caption, retrieve a detailed relevant caption" and "Given a detailed caption, retrieve a short relevant caption," which are prepended to the query (either original or dense caption) to retrieve the corresponding dense or original caption.
Similarly, we use a 1.5M-paired general text dataset curated from Springer et al. (2024) to maintain strong performance in pure language tasks.
All training is efficiently conducted using LoRA and is completed in just one epoch, ensuring low computational cost.
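A simplified sketch of the supervised SimCSE-style objective is shown below: each caption is pulled toward another caption of the same image, with all other captions in the batch acting as negatives. The temperature value and in-batch-negative setup are assumptions for illustration, not the authors' exact training code.

```python
import torch
import torch.nn.functional as F

def caption_contrastive_loss(query_emb, pos_emb, temperature=0.05):
    """Supervised SimCSE-style caption contrastive loss (sketch).

    query_emb[i] is the embedding of an (instruction-prefixed) caption and
    pos_emb[i] is the embedding of another caption of the same image; the
    remaining captions in the batch serve as in-batch negatives.
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature                 # (B, B) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)
```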
A remarkable result followed: after fine-tuning the LLM as a caption encoder for only one epoch on CC-3M, the CRA score for Llama-3 8B jumped from 18.4% to 73.0%, surpassing the previous state-of-the-art CLIP and EVA models' text encoders trained on the massive Laion-2B (Schuhmann et al., 2022) and COYO-700M (Byeon et al., 2022) image-text datasets.
As shown in Table 6, subsequent experiments demonstrated that after CC fine-tuning, the LLM finally unleashed its powerful capabilities, significantly boosting the performance of the previously state-of-the-art CLIP model, in stark contrast to the results without CC fine-tuning.
This breakthrough uncovers the potential of LLMs in CLIP training and removes a major obstacle in leveraging LLMs to advance vision foundation models.
With the modifications to the LLM discussed earlier, we have now obtained a super text encoder that is well-suited for CLIP training.
The next step is to use this LLM in conjunction with the pretrained state-of-the-art CLIP visual encoder to reconstruct a more powerful cross-modal feature space.
As shown in Figure 1, in the LLM2CLIP training phase, we freeze the gradients of the LLM to preserve its inherent capabilities, for two primary reasons.
First, this significantly reduces the computational cost and memory footprint of fine-tuning.
CLIP training requires a very large batch size to maintain the effectiveness of negative samples.
Allocating memory to the LLM could compromise CLIP's performance.
Second, by freezing the LLM, we ensure that the open-world knowledge it has acquired from large-scale corpora remains intact during the multimodal alignment process.
To compensate for the frozen LLM, and inspired by methods like FuseMix (Vouitsis et al., 2023) and APE (Rosenfeld et al., 2022), we introduce several new linear layers after the LLM as adapters.
These layers serve as learnable parameters to improve the alignment between the LLM and CLIP visual encoder.
Following the original design of CLIP, we also employ a projector layer to align the dimensions of the two encoders, facilitating the use of CLIP loss for training.
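The structural sketch below summarizes the trainable pieces of this stage: frozen LLM text features pass through a small adapter and a projector, the vision features pass through their own projector, and the two sides are aligned with the standard symmetric CLIP loss. Layer counts, dimensions, and names are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LLM2CLIPHead(nn.Module):
    """Sketch of the trainable components on top of a frozen LLM text encoder."""

    def __init__(self, llm_dim, vision_dim, embed_dim=1280, adapter_layers=4):
        super().__init__()
        # Linear adapter stack compensating for the frozen LLM.
        blocks = []
        for _ in range(adapter_layers):
            blocks += [nn.Linear(llm_dim, llm_dim), nn.GELU()]
        self.adapter = nn.Sequential(*blocks)
        # Projectors mapping both modalities into a shared embedding space.
        self.text_proj = nn.Linear(llm_dim, embed_dim)
        self.vision_proj = nn.Linear(vision_dim, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~ log(1 / 0.07)

    def forward(self, llm_text_feat, vision_feat):
        t = F.normalize(self.text_proj(self.adapter(llm_text_feat)), dim=-1)
        v = F.normalize(self.vision_proj(vision_feat), dim=-1)
        logits = self.logit_scale.exp() * v @ t.T
        targets = torch.arange(v.size(0), device=v.device)
        # Symmetric CLIP loss over image-to-text and text-to-image directions.
        return (F.cross_entropy(logits, targets)
                + F.cross_entropy(logits.T, targets)) / 2
```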
With this powerful LLM-based super text encoder, we achieve a substantial qualitative leap in CLIP's language comprehension capabilities.
The LLM's open-world knowledge allows the CLIP visual encoder to learn more structured and globally informed visual representations that are aligned with human knowledge.
Moreover, this approach enables us to fully utilize high-quality, long, and dense caption datasets without requiring any special architectural adjustments, which previous works like DCI, DreamLip, and Recaption struggled to effectively leverage.
As shown in Table 2, LLM2CLIP makes any existing SOTA CLIP model even more SOTA, significantly surpassing the performance of previous ones.
We propose LLM2CLIP as a method that efficiently incorporates large language models (LLMs) into CLIP training, leveraging the capabilities of LLMs to make cross-modal representation learning significantly more powerful.
In our experiments, we evaluated large language models, including Llama with 1B and 8B parameters, as well as Mistral-Nemo with 12B parameters.
It might seem that incorporating such large LLMs would greatly increase the computational burden of training CLIP, especially since CLIP itself is computationally expensive, requiring a large batch size.
However, our proposed LLM2CLIP is remarkably lightweight.
The training overhead is nearly identical to fine-tuning the original CLIP model, with minimal additional cost, yet the LLM provides much stronger supervision.
Here, we highlight some of the design details that significantly improve training efficiency:
1). During the caption contrastive fine-tuning stage, we employ LoRA training for the LLM.
Even for a 12B LLM, training with a batch size of 512 requires only around 70GB of GPU memory, making it feasible to run on a single node with 8 80GB A100 GPUs.
2). In the LLM2CLIP stage, we freeze the LLM's gradients and only train the learnable adapter, CLIP's original Vision encoder, and two projectors.
The additional trainable parameters are roughly equivalent to those in the original CLIP, minimizing the overhead.
To further reduce the inference cost of using LLMs, we pre-extract all the text features from the training data and store them in memory.
This ensures that, even though LLMs provide powerful textual supervision, the memory and computational costs during training remain nearly identical to those of standard CLIP training.
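A minimal sketch of this offline extraction step is given below, assuming the fine-tuned LLM is exposed through a generic `llm_encode` function that returns one embedding per caption; the interface and batch size are illustrative.

```python
import torch

@torch.no_grad()
def precompute_text_features(llm_encode, captions, batch_size=1024):
    """Offline extraction of caption embeddings with the frozen, fine-tuned LLM.

    `llm_encode` maps a list of captions to a (B, D) embedding tensor
    (an assumed interface). The cached features are reused every epoch,
    so the LLM never needs to run during CLIP training itself.
    """
    feats = []
    for start in range(0, len(captions), batch_size):
        feats.append(llm_encode(captions[start:start + batch_size]).cpu())
    return torch.cat(feats, dim=0)   # held in memory, indexed by caption id
```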
For instance, when we trained LLM2CLIP using a Mistral-Nemo 12B model integrated with the commonly used EVA ViT-L/14-224, with a batch size of 4096 on 8 H100 GPUs, the memory usage per GPU was only 30GB, and the entire training process took just 9 hours.
Despite this efficient training cost, LLM2CLIP brought transformative improvements in downstream tasks such as long and short text retrieval, cross-language retrieval, and LLAVA training.
We utilized the ShareCaptioner-modified CC-3M dataset (Zheng et al., 2024; Chen et al., 2023), which provides both original captions and augmented dense captions for each image, for contrastive learning.
For the Masked Next Token Prediction and caption contrastive fine-tuning stages, we employed the Wikitext-103 dataset (Merity et al., 2016) and the E5 dataset from Springer et al. (2024) to ensure that the general text domain was not excessively biased.
We trained the language model using LoRA, applying lightweight training with 1 epoch on all datasets.
We adopted the average of all LLM output tokens as the text global embedding for a caption.
All training parameters follow the design of BehnamGhader et al. (2024).
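The global caption embedding mentioned above is a masked mean over the LLM's output tokens; a small sketch follows, with tensor names assumed for illustration.

```python
import torch

def mean_pool(token_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average the LLM's output token states into a single caption embedding.

    token_states: (B, L, D) last hidden states.
    attention_mask: (B, L), 1 for real tokens and 0 for padding.
    """
    mask = attention_mask.unsqueeze(-1).type_as(token_states)   # (B, L, 1)
    summed = (token_states * mask).sum(dim=1)                   # (B, D)
    counts = mask.sum(dim=1).clamp(min=1e-6)                    # (B, 1)
    return summed / counts
```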
To compare different configurations, we designed three experimental settings based on dataset size:
LLM2CLIP-3M: This configuration only uses the CC-3M dataset for training, representing our lightweight setting.
LLM2CLIP-15M: This configuration uses both CC-3M and CC-12M datasets, which is our default setting.
All LLM2CLIP models without a specific dataset size label are trained with this 15M data.
LLM2CLIP-60M: This configuration scales up the dataset, combining CC-3M, CC-12M, YFCC-15M, and a randomly selected 30M subset from Recaption-1B (Li et al., 2024b).
All data used are dense captions rewritten by multimodal large language models (MLLMs).
The CC-3M, CC-12M, and YFCC datasets are sourced from Zheng et al. (2024) using the ShareCaptioner for rewriting, while Recaption data was rewritten using Llama3-LLAVA1.5.
We primarily used a mix of original captions and dense captions, with the default mixing ratio being 1:1.
LLM2CLIP-S represents a setting where only original captions are used for training, matching the original CLIP pretraining distribution so that the model's benefits can be analyzed in isolation.
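The 1:1 original-to-dense caption mix can be implemented as a per-sample choice at data-loading time; the helper below is a hypothetical illustration of that mechanism, not the authors' data pipeline.

```python
import random

def sample_caption(original: str, dense: str, dense_ratio: float = 0.5) -> str:
    """Pick either the original or the MLLM-augmented dense caption for a sample.

    dense_ratio is the probability of choosing the dense caption;
    0.5 reproduces the default 1:1 mix (illustrative helper).
    """
    return dense if random.random() < dense_ratio else original
```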
All experiments with different datasets were conducted for 4 epochs.
We froze the LLM gradients by default.
During training, we pre-extracted the text features from the captions using the LLM and stored them in memory to avoid repeated LLM inference, reducing additional computational overhead.
We trained the original CLIP vision encoder, the LLM's learnable adapter, and the projectors for both encoders.
Training CLIP ViT-L/14-224 on the 3M, 15M, and 60M datasets required only 30GB of memory per GPU and took 2, 9, and 45 hours, respectively, on 8 H100 GPUs.
For short-text datasets, we used the commonly available COCO 2014 5k test set and the Flickr 1k test set as our test datasets.
For long-text datasets, we used the 1K subset of the ShareGPT4V (Chen et al., 2023) dataset and the Urban1k (Zhang et al., 2024) dataset, both provided by LongCLIP, along with the DOCCI (Onoe et al., 2024) dataset.
The ShareGPT4V-1M dataset was generated using annotations from GPT-4V and ShareCaptioner, with images from Laion, CC, SBU (Ordonez et al., 2011) and MS COCO.
We used a randomly selected 1K subset of this dataset.
Urban1k consists of captions generated by GPT-4V for 1,000 busy urban view images from Visual Genome.
Each caption is a long and complete sentence that describes the image, including types, colors, and relative locations of various attributes.
The model can only successfully match the images with the correct captions if it accurately understands and models the detailed attributes in both modalities.
The DOCCI dataset contains approximately 15K high-resolution images with detailed human-annotated descriptive captions.
DOCCI is divided into a training set of 9.6K pairs and a test set of 5.1K pairs.
We used the test set for image-to-long-caption retrieval experiments.
For the Chinese retrieval tasks, we used the FlickrCN (Lan et al., 2017) and CNCOCO (Li et al., 2018) datasets, which are translated versions of Flickr30K and MS COCO-1K, respectively.
These datasets were tested in Chinese.
For Llava (Liu et al., 2023) training, we used the standard training and test sets from LLAVA 1.5.
The experiments in this paper primarily focus on how LLM2CLIP enhances the performance of widely used vanilla CLIP across various dimensions.
We mainly compare our approach with the EVA (Fang et al., 2023) and OpenAI CLIP models as baselines, as they are the most widely used SOTA vision encoders in the open-source community.
The CLIP models we use for comparison are ViT-B/16, ViT-L/14, and ViT-L/14-336.
For EVA02 models, we use EVA02 ViT-B/16, EVA02 ViT-L/14, and EVA02 ViT-L/14-336, making up a total of six models.
For language models, we experimented with four different models: Jina-Embeddings-V2, and three popular LLMs from the Llama3 (Dubey et al., 2024) family, namely Llama 3.2 1B, Llama 3 8B, and Mistral-Nemo 12B.
To avoid any misunderstanding, the experiments in this paper, unless otherwise noted, use EVA02 ViT-L/14 as the default baseline for comparison.
The default language model is Llama 3 8B, trained with the 15M dataset setting mentioned above.
As shown in the experiments in Table 4, directly replacing the text encoder of EVA02 ViT-L/14 with Llama3-8B significantly degrades retrieval performance.
For example, performance on the DOCCI benchmark nearly halves, dropping from 75.0/73.4 to 51.7/50.6.
Additionally, the CRA score of the vanilla Llama3-8B is particularly low, indicating that its output features exhibit very poor discriminability for captions.
These results clearly show that such an approach imposes a significant burden on CLIP's learning process.
To enhance the discriminability of LLM output features, we applied Caption Contrastive fine-tuning.
Llama3-8B-TC and Llama3-8B-CC represent the results of using pure text corpora and a mix of text and augmented CC3M corpora from ShareCaptioner, respectively, both trained with supervised SimCSE loss.
As shown in Table 6, contrastive learning on mixed caption corpora yields higher CRA scores than training on generic text, with noticeable improvements across almost all benchmarks.
This difference stems from the fact that LLMs trained on caption-distributed data have stronger discriminability for caption features.
These findings highlight the importance of text model discriminability for CLIP training and underscore the rationale behind the caption contrastive fine-tuning in LLM2CLIP.
We applied the LLM2CLIP fine-tuning method to both EVA02 and CLIP models and observed significant performance improvements in Table 2.
Even with lightweight fine-tuning, the results substantially surpassed those of the original models pretrained on datasets like Laion2B.
Compared to other methods such as LongCLIP and JinaCLIP, which also attempt to fine-tune pretrained CLIP models, our performance gains were transformative.
This demonstrates the effectiveness of LLM2CLIP as a method for introducing LLMs to enhance CLIP's performance.
To validate the knowledge transfer capabilities of the LLM, we designed an out-of-distribution (OOD) experiment by conducting an image-text retrieval task in a completely unfamiliar language.
This is an especially challenging experiment because all of our training was performed on purely English text, yet now we are testing the model on Chinese data.
As shown in Table 3, models like EVA02 CLIP and Jina CLIP, which performed well on English datasets, achieved near-zero accuracy on this task.
However, the magical power of LLM2CLIP became evident: the LLM's capabilities allowed the model to achieve impressive performance on Chinese retrieval tasks, even surpassing models that were originally trained on hundreds of millions of Chinese image-text pairs, such as Wukong (Gu et al., 2022) and CN-CLIP (Yang et al., 2022).
This result once again demonstrates that LLM2CLIP can effectively integrate the inherent abilities of the LLM into CLIP, enabling it to handle tasks far beyond its original scope.
As shown in Table 7, we conducted experiments using the Llava 1.5 (Liu et al., 2023) VLLM (Vision-Language Large Model) training framework.
Llava incorporates a CLIP visual encoder into the LLM for multimodal instruction learning, meaning the quality of the visual encoder can significantly impact Llava's performance.
We compared the original CLIP with a version enhanced by our LLM2CLIP fine-tuning, running two versions of the experiment according to Llava's official implementation for a fair comparison.
The results showed that in over 87.5% of the benchmarks, we achieved substantial performance improvements, with the remaining benchmarks showing results that were very close.
This demonstrates the potential of LLM2CLIP for complex image reasoning and related tasks.
With the integration of an LLM, LLM2CLIP enhances our ability to understand dense captions.
In Table 5, we present an ablation study that examines the impact of different ratios of dense captions versus original captions on CLIP's training performance.
The "Ratio" column represents the proportion of dense captions used during training.
Overall, dense captions contribute to a noticeable improvement in performance, but more is not always better.
For example, when using 0% dense captions, the performance on datasets like COCO and Flickr30K remains decent, but long-text benchmarks such as ShareGPT4V, Urban-1K, and DOCCI show poor results.
Conversely, with 100% dense captions, performance on short-text retrieval benchmarks is the worst.
It's important to note that the dense captions we used were generated by the ShareCaptioner model, and there may be some distribution differences and noise compared to real data captions, which may partly explain why using 100% dense data is suboptimal.
Our findings indicate that the best performance is achieved when dense captions make up 50% to 75% of the training data, striking a balance between both short-text and long-text retrieval tasks.
As shown in Table 2, larger training datasets consistently yield positive results for LLM2CLIP, with clear improvements in both long-text and short-text retrieval tasks.
Our 60M dataset version has already pushed the limits of what CLIP could previously achieve.
Even the 3M version, though lightweight, still delivers significant performance gains, demonstrating the efficiency of the LLM2CLIP approach.
It's worth noting that training on the 3M dataset takes only about 3 hours on a machine with 8 H100 GPUs, yet results in a transformative leap in performance for a well-pretrained CLIP ViT model.