Welcome to Khazzz1c's Blog
Email: [email protected]
GitHub: https://github.com/khazic
Google Scholar: khazzz1c
The history of Vision-Language Model development: from image-text alignment to MLLMs
Authors: Chonghan Liu
TL;DR: Multimodal approaches for images initially relied on contrastive learning, where image-text pairs served as positive and negative samples for basic image classification tasks. This approach later expanded to include object localization. With the advent of large language models (LLMs), these systems gained a significantly stronger ability to generate detailed image descriptions, along with capabilities such as image generation and correction. This blog provides a comprehensive analysis of the historical development and the model frameworks at each stage, and discusses some open challenges in current Multimodal Large Language Models (MLLMs).
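The contrastive stage described above can be sketched as a CLIP-style symmetric loss: matched image-text pairs are positives, and all other pairings in the batch are negatives. This is a minimal NumPy illustration, not the implementation of any specific model; the embedding shapes and the temperature value are illustrative assumptions.

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style symmetric contrastive loss over a batch of pairs.

    image_emb, text_emb: (batch, dim) arrays where row i of each is a
    matched image-text pair; all off-diagonal pairings act as negatives.
    """
    # Normalize so the dot product is cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Logits: entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.T / temperature
    n = logits.shape[0]
    labels = np.arange(n)  # matched pairs sit on the diagonal

    def cross_entropy(l):
        # Numerically stable log-softmax, then pick the diagonal targets.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

When the two embedding spaces are perfectly aligned (e.g. identical orthonormal rows), the diagonal dominates and the loss approaches zero, which is the training signal that pulls matched pairs together.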
Understanding Positional Encoding in Multimodal Models
Authors: Chonghan Liu
TL;DR: This blog reviews the evolution of positional encoding from sinusoidal encoding and RoPE to 2D-RoPE and M-RoPE in multimodal settings. It focuses on how positional encoding must handle compatibility, symmetry, and temporal structure when models move from text to images and videos.
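As a quick refresher on the two starting points of that evolution, here is a minimal NumPy sketch of sinusoidal position tables and a basic 1D RoPE rotation; the function names and the half-split channel layout are illustrative choices, not any particular model's implementation. The key RoPE property, checked below, is that the dot product of rotated queries and keys depends only on the relative position offset.

```python
import numpy as np

def sinusoidal_encoding(num_positions, dim):
    # Fixed table from "Attention Is All You Need": even channels use sin,
    # odd channels use cos, over a geometric progression of frequencies.
    positions = np.arange(num_positions)[:, None]
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    table = np.zeros((num_positions, dim))
    table[:, 0::2] = np.sin(positions * freqs)
    table[:, 1::2] = np.cos(positions * freqs)
    return table

def rope(x, position, base=10000.0):
    # RoPE: rotate channel pairs of a query/key vector by an angle
    # proportional to the token position, so relative offsets appear as
    # phase differences in the attention dot product.
    dim = x.shape[-1]
    half = dim // 2
    angles = position / base ** (np.arange(half) * 2.0 / dim)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

2D-RoPE and M-RoPE extend this idea by splitting the channels across multiple coordinate axes (height/width, or time/height/width), which is exactly the compatibility question the post examines.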
InternVL3: Architecture, Training Design, and Open Questions
Authors: Chonghan Liu
TL;DR: This blog reviews the core design of InternVL3, including its architecture, training strategy, and multimodal innovations such as V2PE and MPO. It also discusses which design choices look promising in practice and which parts still need further validation.