Author: Chonghan Liu

TL;DR: Multimodal approaches for images initially relied on contrastive learning, where image-text pairs were used directly as positive and negative samples for basic image classification tasks. The approach then evolved to include the identification of object locations. With the advent of large language models (LLMs), capabilities such as generating detailed image descriptions, image generation, and image correction have improved significantly. This blog provides a comprehensive analysis of the historical development and the model frameworks at each stage, and discusses some of the open challenges facing current multimodal large language models (MLLMs).

0-Introduction

In today's artificial intelligence landscape, multimodal learning has emerged as a significant area of research. With the rapid advancements in large language models (LLMs) and image processing technologies, researchers have increasingly recognized the potential of integrating visual and linguistic information. Initially, multimodal approaches primarily relied on contrastive learning, using image-text pairs as positive and negative samples for fundamental image classification tasks. As technology progressed, this method evolved to include the ability to identify object locations. Today, with the advent of large language models, capabilities such as generating detailed image descriptions, image generation, and image correction have seen significant enhancement. This blog will provide a comprehensive analysis of the historical development and frameworks of multimodal large language models (MLLMs) at various stages, and will address some of the existing challenges faced by current MLLMs. By comparing the architectures and training methods of different models, we will reveal the potential and limitations of multimodal learning in understanding and generating visual and linguistic information.

1-Contrastive Language-Image Pre-training

1.1 The Core Idea of CLIP

CLIP (Contrastive Language-Image Pre-training), introduced by OpenAI in early 2021, represents a significant advancement in image-text pair pre-training methodologies. This innovative approach, detailed in the paper "Learning Transferable Visual Models From Natural Language Supervision" [1], demonstrates the potential for cross-modal understanding and unified representation of images and text without task-specific annotated data. The researchers at OpenAI curated an extensive dataset, WebImageText, comprising 400 million image-text pairs sourced from the internet. To evaluate the efficacy of their model, they conducted comprehensive benchmark tests across 30 distinct visual datasets. CLIP's architecture enables it to perform remarkably well on various downstream tasks in a zero-shot learning paradigm. This capability underscores the model's ability to transfer knowledge effectively, bridging the gap between visual and textual modalities without the need for task-specific fine-tuning. This groundbreaking work not only showcases the potential of large-scale contrastive learning in multimodal contexts but also opens new avenues for developing more versatile and adaptable AI systems capable of understanding and interpreting visual and textual information in a unified manner.
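To illustrate what zero-shot transfer looks like in practice, the sketch below follows the usage pattern of OpenAI's open-source CLIP repository (https://github.com/openai/CLIP): candidate class names are turned into natural-language prompts, both modalities are encoded, and the most similar prompt is taken as the prediction. The image path `cat.png` and the candidate labels are placeholders for illustration.

```python
# Zero-shot classification sketch with the openai/CLIP package
# (pip install git+https://github.com/openai/CLIP.git); "cat.png" is a placeholder image.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("cat.png")).unsqueeze(0).to(device)
# Class names are wrapped in natural-language prompts rather than used as bare labels.
prompts = clip.tokenize(["a photo of a cat", "a photo of a dog", "a photo of a car"]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, prompts)   # scaled cosine similarities, shape (1, num_prompts)
    probs = logits_per_image.softmax(dim=-1)      # probability over the candidate prompts

print(probs)  # the highest-probability prompt is the zero-shot prediction
```

No fine-tuning is involved: classification reduces to retrieving the text prompt whose embedding lies closest to the image embedding in the shared space.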

[Figure: Overview of CLIP's contrastive pre-training, aligning an image encoder and a text encoder in a shared embedding space]

CLIP's architecture is predicated on mapping both images and text into a shared latent space. This is achieved through contrastive learning, which maximizes the similarity of positive samples (the matched pairs on the diagonal of the batch similarity matrix) while minimizing the similarity of negative samples (the mismatched pairs off the diagonal). This optimization yields a common subspace that directly captures the semantic relationships between images and text.

For the text encoder, CLIP consistently employs a Transformer with 63 million parameters. The image encoder, however, comes in two distinct families: the conventional CNN-based ResNet and the Transformer-based Vision Transformer (ViT). The ResNet family comprises five models of varying complexity: ResNet-50, ResNet-101, RN50x4, RN50x16, and RN50x64; the latter three are obtained by scaling up ResNet-50 following EfficientNet-style scaling rules to roughly 4x, 16x, and 64x the compute. The ViT family includes three scales: ViT-B/32, ViT-B/16, and ViT-L/14.

All models are trained for 32 epochs with the AdamW optimizer and a notably large batch size of 32,768. Given the scale of the dataset, the computational requirements are substantial: the largest ResNet model, RN50x64, required 18 days of training on 592 V100 GPUs, while the largest ViT model, ViT-L/14, required 12 days on 256 V100 GPUs.
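To make the contrastive objective concrete, here is a minimal PyTorch sketch of the batch-level symmetric loss. The function name `clip_contrastive_loss` and the fixed `temperature` value are illustrative assumptions (CLIP actually learns its temperature as a trainable parameter), and the encoders that produce the two feature matrices are assumed to exist outside this snippet.

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss (not OpenAI's exact implementation).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings of shape (N, d)."""
    # L2-normalize so that dot products become cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares image i with text j.
    logits = image_features @ text_features.t() / temperature

    # The matched pairs lie on the diagonal, so the target class for row i is i.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image), averaged.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Because both directions of the similarity matrix are supervised, each image must pick out its own caption from the batch and each caption must pick out its own image, which is exactly the diagonal-versus-off-diagonal behavior described above.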

1.2 The Core Formula of CLIP

This comprehensive approach to model architecture and training demonstrates CLIP's commitment to exploring various network configurations and leveraging substantial computational resources to achieve strong cross-modal understanding. Formally, a training batch can be written as:

$$ \mathcal{D} = \{\tilde{\mathbf{I}}_i, \tilde{\mathbf{T}}_i\}_{i=1}^N $$

  1. The formula denotes a batch of $N$ <image, text> pairs.
  2. $\tilde{\mathbf{I}}_i$ and $\tilde{\mathbf{T}}_i$ denote the $i$-th image and the $i$-th text, respectively.
  3. $\phi_T(\cdot)$ denotes the text encoder and $\phi_I(\cdot)$ denotes the image encoder.
  4. The two encoders map images and texts into a common subspace (a toy mapping of this notation to code is sketched below).
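The following sketch is a minimal, hypothetical mapping of this notation onto tensors. The toy modules standing in for $\phi_I(\cdot)$ and $\phi_T(\cdot)$ are placeholders for illustration only, not CLIP's actual ResNet/ViT and Transformer backbones.

```python
# Toy mapping of the notation to tensors; the encoders below are placeholders, not CLIP's real backbones.
import torch
import torch.nn.functional as F

N, d = 8, 512                                  # batch of N <image, text> pairs, embedding dimension d
images = torch.randn(N, 3, 224, 224)           # \tilde{I}_i stacked into one batch
token_ids = torch.randint(0, 49408, (N, 77))   # \tilde{T}_i as tokenized text (context length 77)

# Stand-ins for phi_I(.) and phi_T(.): both map their modality into R^d.
phi_I = torch.nn.Sequential(
    torch.nn.AdaptiveAvgPool2d(1),             # toy spatial pooling instead of a ResNet/ViT backbone
    torch.nn.Flatten(),
    torch.nn.Linear(3, d),
)
text_embedding = torch.nn.Embedding(49408, d)
phi_T = lambda t: text_embedding(t).mean(dim=1)  # toy mean-pooled embedding instead of a Transformer

# Encode each modality and L2-normalize, placing both in the common subspace.
I_emb = F.normalize(phi_I(images), dim=-1)       # shape (N, d)
T_emb = F.normalize(phi_T(token_ids), dim=-1)    # shape (N, d)
```

These normalized embeddings are exactly what the symmetric contrastive loss sketched earlier consumes.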

First, pass all images and texts through the image and text encoders, respectively, and then perform normalization to obtain: