Welcome to Khazzz1c's Blog
Email: [email protected]
GitHub: https://github.com/khazic
Google Scholar: khazzz1c
The history of Vision-Language Model development: from image-text alignment to MLLMs
Authors: Chonghan Liu
TL;DR: Multimodal approaches for images initially relied on contrastive learning, where image-text pairs served as positive and negative samples for basic image classification tasks. This approach later expanded to include object localization. With the advent of large language models (LLMs), these systems gained a significantly stronger ability to generate detailed image descriptions, along with capabilities such as image generation and correction. This blog provides a comprehensive analysis of the historical development and the model frameworks at each stage, and discusses some open challenges in current Multimodal Large Language Models (MLLMs).
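The contrastive stage described above can be sketched as a CLIP-style symmetric loss: matched image-text pairs are positives, and all other pairings in the batch are negatives. This is a minimal NumPy illustration, not the implementation of any specific model; the embedding shapes and the temperature value are illustrative assumptions.

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style symmetric contrastive loss over a batch of pairs.

    image_emb, text_emb: (batch, dim) arrays where row i of each is a
    matched image-text pair; all off-diagonal pairings act as negatives.
    """
    # Normalize so the dot product is cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Logits: entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.T / temperature
    n = logits.shape[0]
    labels = np.arange(n)  # matched pairs sit on the diagonal

    def cross_entropy(l):
        # Numerically stable log-softmax, then pick the diagonal targets.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

When the two embedding spaces are perfectly aligned (e.g. identical orthonormal rows), the diagonal dominates and the loss approaches zero, which is the training signal that pulls matched pairs together.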
Understanding Positional Encoding in Multimodal Models
Authors: Chonghan Liu
TL;DR: This blog reviews the evolution of positional encoding from sinusoidal encoding and RoPE to 2D-RoPE and M-RoPE in multimodal settings. It focuses on how positional encoding must handle compatibility, symmetry, and temporal structure when models move from text to images and videos.
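As a quick refresher on the two starting points of that evolution, here is a minimal NumPy sketch of sinusoidal position tables and a basic 1D RoPE rotation; the function names and the half-split channel layout are illustrative choices, not any particular model's implementation. The key RoPE property, checked below, is that the dot product of rotated queries and keys depends only on the relative position offset.

```python
import numpy as np

def sinusoidal_encoding(num_positions, dim):
    # Fixed table from "Attention Is All You Need": even channels use sin,
    # odd channels use cos, over a geometric progression of frequencies.
    positions = np.arange(num_positions)[:, None]
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    table = np.zeros((num_positions, dim))
    table[:, 0::2] = np.sin(positions * freqs)
    table[:, 1::2] = np.cos(positions * freqs)
    return table

def rope(x, position, base=10000.0):
    # RoPE: rotate channel pairs of a query/key vector by an angle
    # proportional to the token position, so relative offsets appear as
    # phase differences in the attention dot product.
    dim = x.shape[-1]
    half = dim // 2
    angles = position / base ** (np.arange(half) * 2.0 / dim)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

2D-RoPE and M-RoPE extend this idea by splitting the channels across multiple coordinate axes (height/width, or time/height/width), which is exactly the compatibility question the post examines.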
InternVL3: Architecture, Training Design, and Open Questions
Authors: Chonghan Liu
TL;DR: This blog reviews the core design of InternVL3, including its architecture, training strategy, and multimodal innovations such as V2PE and MPO. It also discusses which design choices look promising in practice and which parts still need further validation.