šŸ‘‹ Welcome to Khazzz1c’s Blog

Email: [email protected]

GitHub: https://github.com/khazic

Google Scholar: khazzz1c

Working on Multimodal Large Language Models

The history of Vision-Language Model development: from image and text alignment to MLLM

Authors: Chonghan Liu

TL;DR: Multimodal modeling of images began with contrastive learning, where image-text pairs were used directly as positive and negative samples for basic image classification tasks. This later evolved toward localizing objects within images. With the advent of large language models (LLMs), these systems gained a much stronger ability to generate detailed image descriptions, along with capabilities such as image generation and correction. This post analyzes the historical development and the model frameworks at each stage, and discusses some open challenges for current Multimodal Large Language Models (MLLMs).
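
For readers unfamiliar with the contrastive setup mentioned above, here is a minimal PyTorch sketch (my own illustrative example, not the implementation of any particular model): matched image-text pairs on the diagonal act as positives, and every other pairing in the batch acts as a negative.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    # Normalize so that dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity matrix: logits[i, j] = similarity of image i with text j.
    logits = image_emb @ text_emb.t() / temperature
    # Positives sit on the diagonal: image i is paired with text i.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over image->text and text->image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```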

A brief thought on multimodal positional encoding

Authors: Chonghan Liu

TL;DR: Both the original Sinusoidal scheme and Su Jianlin's RoPE aim to realize relative positional encoding by way of absolute positional encoding. Sinusoidal achieves this, but imperfectly; RoPE fills that gap well. In a multimodal setting, however, applying RoPE becomes more complicated, because multimodal positional encoding must satisfy compatibility, symmetry, and equivalence. This post reviews Sinusoidal and RoPE, then derives 2D-RoPE and M-RoPE, and finally introduces Su Jianlin's RoPE-Tie and RoPE-TV, which take the model from one-dimensional encoding to three-dimensional video positional encoding that incorporates temporal information.
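
As a quick reference for the discussion above, the sketch below shows a minimal 1D RoPE rotation in PyTorch (a simplified illustration, not code from the post); 2D-RoPE and M-RoPE generalize the same idea by assigning separate rotation angles to the height, width, and time axes.

```python
import torch

def rope_1d(x: torch.Tensor, pos: int, base: float = 10000.0) -> torch.Tensor:
    # x: float tensor of shape (..., d) with d even. Each consecutive pair
    # (x[2i], x[2i+1]) is rotated by angle pos * theta_i, so the dot product of
    # two rotated vectors depends only on their relative position -- relative
    # encoding implemented through absolute encoding.
    d = x.shape[-1]
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = pos * theta                      # shape (d/2,)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```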

Some simple thoughts on InternVL3

Authors: Chonghan Liu

TL;DR: InternVL3 keeps the classic ViT + connector + LM architecture while introducing several key innovations: Variable Visual Position Encoding (V2PE), which handles visual tokens with flexible position increments; a native multimodal pre-training approach that trains simultaneously on text and multimodal data at a 1:3 ratio; and a Mixed Preference Optimization (MPO) strategy that combines preference, quality, and generation losses. Training efficiency is further improved through the InternEVO framework, which decouples the computational loads of the ViT and the LM. While these improvements show promising results in multimodal understanding, the effectiveness of the V2PE position encoding design still requires further validation in real-world applications.
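
To make the "flexible position increments" idea concrete, here is a hedged sketch of how variable increments for visual tokens could be assigned (my own illustration of the core V2PE idea, not InternVL3's actual code): text tokens advance the position index by 1, while visual tokens advance it by a smaller fractional step, so long visual sequences consume less of the positional range.

```python
from typing import List

def assign_positions(is_visual: List[bool], delta: float = 0.25) -> List[float]:
    # delta is a hypothetical fractional increment for visual tokens.
    positions, current = [], 0.0
    for visual in is_visual:
        positions.append(current)
        current += delta if visual else 1.0
    return positions

# Example: two text tokens, four visual tokens, one text token.
print(assign_positions([False, False, True, True, True, True, False]))
# -> [0.0, 1.0, 2.0, 2.25, 2.5, 2.75, 3.0]
```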

Working on Large Language Models

Exploring the Potential of In-Context Learning: New Pathways for Enhancing Chat-Based Large Language Model Performance