Alibaba Releases Qwen2.5-VL Multimodal Model

Alibaba has released Qwen2.5-VL, the latest flagship multimodal model in the Qwen family. It combines strong image and video understanding with natural language processing, accepting multiple forms of input (e.g., images, documents, and videos) and generating text output or carrying out complex tasks.

Simply put, it is an intelligent assistant that understands pictures and videos and can also chat. The goal is to let machines understand and interact with the real world more effectively, and it suits scenarios ranging from edge devices to high-performance servers. The model consists of a large language model (LLM), a vision encoder, and a vision-language fusion module.

The Qwen2.5-VL multimodal model can automate tasks on a computer, understand hours-long videos, communicate more naturally with humans, and support multilingual, complex conversations.

Qwen2.5-VL is available in three sizes:

  • Qwen2.5-VL-3B: A small model for resource-constrained devices such as phones.
  • Qwen2.5-VL-7B: A mid-sized model that balances performance and efficiency.
  • Qwen2.5-VL-72B: The flagship model, with capabilities comparable to the industry's top models such as GPT-4o and Claude 3.5 Sonnet.

Performance Highlights

  • Document and diagram comprehension: Qwen2.5-VL-72B rivals GPT-4o and Claude 3.5 Sonnet on these tasks.
  • Video comprehension: supports understanding of hours-long videos with fine-grained event localization.
  • Small-model advantage: Qwen2.5-VL-3B and 7B outperform similarly sized models, making them well suited to edge deployment.
  • Generalizability: the models perform well across domains without task-specific fine-tuning.

Key Features

  1. Visual recognition and object localization
    Qwen2.5-VL excels at fine-grained vision tasks such as object detection, localization, and counting, and can localize objects precisely with bounding boxes or points.
    Support for absolute pixel coordinates and JSON-formatted output improves spatial reasoning (see the grounding sketch after this list).
  2. Powerful Document Parsing
    Text recognition has been upgraded to full document parsing, covering multi-scenario, multi-language documents that include handwriting, tables, charts, chemical formulas, and sheet music.
    The model excels at structured data extraction (e.g., invoices and tables) as well as chart and layout analysis (a prompt example follows this list).
  3. Dynamic resolution and long video comprehension
    Dynamic resolution processing and absolute time encoding let the model handle images of varying sizes and hours-long videos, with second-level event localization.
    A Vision Transformer (ViT) trained from scratch for dynamic resolution, combined with window attention, reduces computational overhead while preserving native resolution.
  4. Enhanced Agent Capabilities
    With advanced localization, reasoning, and decision-making capabilities, Qwen2.5-VL acts as an interactive visual agent that can perform complex tasks on computers and mobile devices. It reads screen content and reasons about what to do next, which makes it particularly suited to task automation: it can operate computer or mobile interfaces, for example tapping buttons and filling out forms.
    It also generalizes strongly across domains without task-specific fine-tuning.
  5. Efficient Architecture Optimization
    A window attention mechanism in the vision encoder improves inference efficiency.
    Dynamic FPS (frame rate) sampling extends dynamic resolution to the temporal dimension, improving video understanding.
    MRoPE (Multimodal Rotary Position Embedding) is upgraded to align with absolute time, strengthening temporal sequence learning.
  6. Multi-language support
    Supports multiple languages (Chinese, English, French, Japanese, and more) and can handle text and content from around the world.
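
To make the object-localization capability above concrete, here is a minimal sketch of how grounding could be exercised through the Hugging Face transformers integration described in the Qwen2.5-VL repository. The checkpoint name is real, but the image path, prompt wording, and generation settings are illustrative placeholders, and the exact JSON schema the model returns can vary.

```python
# Minimal grounding sketch for Qwen2.5-VL via Hugging Face transformers.
# Assumes transformers >= 4.49 and the qwen_vl_utils helper package;
# the image path, prompt, and generation length are placeholders.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Ask for object locations in JSON; the model typically answers with a list of
# labeled boxes in absolute pixel coordinates.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/street_scene.jpg"},
        {"type": "text",
         "text": "Detect every car in the image and output the bounding boxes in JSON format."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```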
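
The document-parsing capability (item 2) can be exercised with the same loading code; only the prompt changes. The file path and field names requested below are a hypothetical example for invoice extraction, not a fixed schema.

```python
# Reuses the model and processor from the previous sketch: request structured
# fields from a scanned invoice. The file path and field list are hypothetical.
invoice_messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/invoice.png"},
        {"type": "text",
         "text": "Extract the invoice number, date, seller, buyer, and total amount, "
                 "and return them as a JSON object."},
    ],
}]
```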

Model Structure

Qwen2.5-VL consists of three main components:

  1. Large Language Model (LLM)
    Initialized from the Qwen2.5 LLM, with the original 1D RoPE upgraded to a multimodal MRoPE that supports alignment with absolute time.
  2. Vision encoder
    Uses a redesigned Vision Transformer (ViT) that supports native-resolution input.
    Introduces 2D RoPE and window attention; images are split into patches at a 14-pixel stride, and video frames are grouped in pairs to reduce the number of tokens.
    Uses RMSNorm normalization and the SwiGLU activation function to improve efficiency and compatibility with the LLM.
  3. Vision-language fusion module
    Compresses visual feature sequences with an MLP-based approach, dynamically adapting to inputs of different lengths and reducing computational cost (a conceptual sketch follows this list).
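
As a rough picture of that compression step, the sketch below groups adjacent patch features and projects them into the language model's embedding space with a small MLP, so a longer visual sequence becomes proportionally fewer tokens. The group size and layer dimensions are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class PatchMerger(nn.Module):
    """Conceptual sketch of an MLP-based vision-language fusion module.

    Groups of adjacent ViT patch features are concatenated and projected into
    the LLM's hidden size, shrinking the visual token sequence. The group size
    and dimensions below are illustrative assumptions.
    """
    def __init__(self, vit_dim: int = 1280, llm_dim: int = 3584, group: int = 4):
        super().__init__()
        self.group = group
        self.mlp = nn.Sequential(
            nn.Linear(vit_dim * group, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (num_patches, vit_dim); num_patches must be divisible by group.
        n, d = patch_features.shape
        grouped = patch_features.reshape(n // self.group, d * self.group)
        return self.mlp(grouped)  # (num_patches / group, llm_dim)

# Example: 1024 patch features are compressed into 256 visual tokens for the LLM.
tokens = PatchMerger()(torch.randn(1024, 1280))
print(tokens.shape)  # torch.Size([256, 3584])
```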

Try it

Online experience: https://chat.qwenlm.ai

Model download (Hugging Face): https://huggingface.co/Qwen

Model download (ModelScope): https://modelscope.cn/organization/qwen

GitHub: https://github.com/QwenLM/Qwen2.5-VL
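
When running the model locally, the visual token budget of the dynamic-resolution pipeline can be bounded through the processor, and videos use the same chat-message format as images. The pixel limits, file path, and question below are illustrative values in the style of the model-card usage, not required settings.

```python
from transformers import AutoProcessor

# Bound the dynamic-resolution range: images are resized so their visual token
# count stays within a budget (expressed here in 28x28-pixel units). These
# particular limits are illustrative, not mandated values.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)

# A video request uses the same message structure as an image request; the
# file path and question are placeholders.
video_messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/lecture.mp4"},
        {"type": "text", "text": "Summarize the main events and note when each one happens."},
    ],
}]
```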
