Alibaba Releases Qwen2.5-VL Multimodal Model

Alibaba has released Qwen2.5-VL, the latest flagship multimodal model in the Qwen family. It combines strong image and video understanding with natural language processing, accepting multiple forms of input (e.g., images, documents, and videos) and generating text output or carrying out complex tasks.

Simply put, it is an intelligent assistant that understands pictures and videos and can also chat. The goal is to let machines understand and interact with the real world more effectively, and it suits scenarios ranging from edge devices to high-performance servers. The model consists of a large language model (LLM), a vision encoder, and a vision-language fusion module.

The Qwen2.5-VL multimodal model can automate tasks on a computer, understand hours-long videos, communicate more naturally with humans, and support multilingual, complex conversations.

Qwen2.5-VL is available in three sizes:

  • Qwen2.5-VL-3B: A small model for resource-constrained devices such as phones.
  • Qwen2.5-VL-7B: A mid-sized model that balances performance and efficiency.
  • Qwen2.5-VL-72B: The flagship model, with capabilities comparable to the industry's top models such as GPT-4o and Claude 3.5 Sonnet.

Performance Highlights

  • Document and diagram comprehension: Qwen2.5-VL-72B rivals GPT-4o and Claude 3.5 Sonnet on these tasks.
  • Video comprehension: supports understanding of hours-long videos with fine-grained event localization.
  • Small-model advantage: Qwen2.5-VL-3B and 7B outperform similarly sized models, making them well suited to edge deployment.
  • Generalizability: the models perform well across domains without task-specific fine-tuning.

Key Features

  1. Visual recognition and object localization
    Qwen2.5-VL excels at fine-grained vision tasks such as object detection, localization, and counting, and can localize objects precisely with bounding boxes or points.
    Support for absolute pixel coordinates and JSON-formatted output improves spatial reasoning (see the grounding sketch after this list).
  2. Powerful Document Parsing
    Text recognition has been upgraded to full document parsing, covering multi-scenario, multi-language documents that include handwriting, tables, charts, chemical formulas, and sheet music.
    The model excels at structured data extraction (e.g., invoices and tables) as well as chart and layout analysis (a prompt example follows this list).
  3. Dynamic resolution and long video comprehension
    Dynamic resolution processing and absolute time encoding let the model handle images of varying sizes and hours-long videos, with second-level event localization.
    A Vision Transformer (ViT) trained from scratch for dynamic resolution, combined with window attention, reduces computational overhead while preserving native resolution.
  4. Enhanced Agent Capabilities
    With advanced localization, reasoning, and decision-making capabilities, Qwen2.5-VL acts as an interactive visual agent that can perform complex tasks on computers and mobile devices. It reads screen content and reasons about what to do next, which makes it particularly suited to task automation: it can operate computer or mobile interfaces, for example tapping buttons and filling out forms.
    It also generalizes strongly across domains without task-specific fine-tuning.
  5. Efficient Architecture Optimization
    A window attention mechanism in the vision encoder improves inference efficiency.
    Dynamic FPS (frame rate) sampling extends dynamic resolution to the temporal dimension, improving video understanding.
    MRoPE (Multimodal Rotary Position Embedding) is upgraded to align with absolute time, strengthening temporal sequence learning.
  6. Multi-language support
    Supports multiple languages (Chinese, English, French, Japanese, and more) and can handle text and content from around the world.
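
To make the object-localization capability above concrete, here is a minimal sketch of how grounding could be exercised through the Hugging Face transformers integration described in the Qwen2.5-VL repository. The checkpoint name is real, but the image path, prompt wording, and generation settings are illustrative placeholders, and the exact JSON schema the model returns can vary.

```python
# Minimal grounding sketch for Qwen2.5-VL via Hugging Face transformers.
# Assumes transformers >= 4.49 and the qwen_vl_utils helper package;
# the image path, prompt, and generation length are placeholders.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Ask for object locations in JSON; the model typically answers with a list of
# labeled boxes in absolute pixel coordinates.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/street_scene.jpg"},
        {"type": "text",
         "text": "Detect every car in the image and output the bounding boxes in JSON format."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```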
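
The document-parsing capability (item 2) can be exercised with the same loading code; only the prompt changes. The file path and field names requested below are a hypothetical example for invoice extraction, not a fixed schema.

```python
# Reuses the model and processor from the previous sketch: request structured
# fields from a scanned invoice. The file path and field list are hypothetical.
invoice_messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/invoice.png"},
        {"type": "text",
         "text": "Extract the invoice number, date, seller, buyer, and total amount, "
                 "and return them as a JSON object."},
    ],
}]
```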

Model Structure

Qwen2.5-VL consists of three main components:

  1. Large Language Model (LLM)
    Initialized from the Qwen2.5 LLM, with the original 1D RoPE upgraded to a multimodal MRoPE that supports alignment with absolute time.
  2. Vision encoder
    Uses a redesigned Vision Transformer (ViT) that supports native-resolution input.
    Introduces 2D RoPE and window attention; images are split into patches at a 14-pixel stride, and video frames are grouped in pairs to reduce the number of tokens.
    Uses RMSNorm normalization and the SwiGLU activation function to improve efficiency and compatibility with the LLM.
  3. Vision-language fusion module
    Compresses visual feature sequences with an MLP-based approach, dynamically adapting to inputs of different lengths and reducing computational cost (a conceptual sketch follows this list).
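
As a rough picture of that compression step, the sketch below groups adjacent patch features and projects them into the language model's embedding space with a small MLP, so a longer visual sequence becomes proportionally fewer tokens. The group size and layer dimensions are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class PatchMerger(nn.Module):
    """Conceptual sketch of an MLP-based vision-language fusion module.

    Groups of adjacent ViT patch features are concatenated and projected into
    the LLM's hidden size, shrinking the visual token sequence. The group size
    and dimensions below are illustrative assumptions.
    """
    def __init__(self, vit_dim: int = 1280, llm_dim: int = 3584, group: int = 4):
        super().__init__()
        self.group = group
        self.mlp = nn.Sequential(
            nn.Linear(vit_dim * group, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (num_patches, vit_dim); num_patches must be divisible by group.
        n, d = patch_features.shape
        grouped = patch_features.reshape(n // self.group, d * self.group)
        return self.mlp(grouped)  # (num_patches / group, llm_dim)

# Example: 1024 patch features are compressed into 256 visual tokens for the LLM.
tokens = PatchMerger()(torch.randn(1024, 1280))
print(tokens.shape)  # torch.Size([256, 3584])
```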

Try it

Online experience: https://chat.qwenlm.ai

Model download (Hugging Face): https://huggingface.co/Qwen

Model download (ModelScope): https://modelscope.cn/organization/qwen

GitHub: https://github.com/QwenLM/Qwen2.5-VL
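
When running the model locally, the visual token budget of the dynamic-resolution pipeline can be bounded through the processor, and videos use the same chat-message format as images. The pixel limits, file path, and question below are illustrative values in the style of the model-card usage, not required settings.

```python
from transformers import AutoProcessor

# Bound the dynamic-resolution range: images are resized so their visual token
# count stays within a budget (expressed here in 28x28-pixel units). These
# particular limits are illustrative, not mandated values.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)

# A video request uses the same message structure as an image request; the
# file path and question are placeholders.
video_messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/lecture.mp4"},
        {"type": "text", "text": "Summarize the main events and note when each one happens."},
    ],
}]
```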
