Alibaba has unveiled Qwen2-VL, a cutting-edge vision-language model that can process information from multiple sources, including images and videos. The model can also recognize and understand text in images across multiple languages, making it accessible to a global audience.
The flagship 72-billion-parameter version of Qwen2-VL has achieved state-of-the-art visual understanding across 20 benchmarks, with top-tier performance in areas such as document understanding and mathematical reasoning. The model excels at tasks like summarizing videos longer than 20 minutes, answering real-world questions using visual information, and operating devices based on visual cues and text commands.
Qwen2-VL can also adapt to images of varying size and clarity thanks to its Naive Dynamic Resolution support, which maps each image to a variable number of visual tokens instead of forcing a fixed input size. This lets the model process images more like human vision perceives them, a significant improvement over its predecessor.
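A rough way to picture dynamic resolution is that the visual token count scales with image area rather than being fixed. The Python sketch below estimates this; the 14x14 patch size and 2x2 patch merging are figures reported for Qwen2-VL, but the function itself is an illustrative simplification, not the model's actual preprocessing code.

```python
def visual_token_count(width, height, patch=14, merge=2):
    """Estimate how many visual tokens a dynamic-resolution model would
    spend on an image, assuming 14x14 ViT patches merged in 2x2 groups
    (so one final token covers a 28x28 pixel block). Simplified: real
    preprocessing also resizes and rounds dimensions."""
    cell = patch * merge  # pixels covered by one final token, per side
    return (width // cell) * (height // cell)

# A 1024x768 photo maps to (1024 // 28) * (768 // 28) = 36 * 27 = 972 tokens,
# while a 224x224 thumbnail needs only 8 * 8 = 64, instead of a fixed count.
print(visual_token_count(1024, 768))  # 972
print(visual_token_count(224, 224))   # 64
```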
Another key innovation is Multimodal Rotary Position Embedding (M-ROPE), which decomposes positional encoding so the model can capture 1D textual, 2D visual, and 3D video positional information. This allows Qwen2-VL to process text, images, and videos seamlessly within a single architecture.
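Conceptually, M-ROPE splits the channels of a rotary position embedding into groups for temporal, height, and width indices; for plain text all three indices coincide, so it degenerates to ordinary 1D RoPE. The snippet below is a minimal illustration of that idea, with an even three-way channel split assumed for simplicity rather than taken from the released implementation.

```python
import numpy as np

def rotary_angles(pos, dim, base=10000.0):
    """Standard RoPE angles for a scalar position over `dim` channels (dim even)."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return pos * inv_freq  # shape (dim // 2,)

def mrope_angles(t, h, w, dim):
    """Conceptual M-ROPE: split the channel dimension into three groups and
    encode temporal, height, and width positions separately.
    The even 3-way split is an assumption made for illustration."""
    d = dim // 3
    return np.concatenate([
        rotary_angles(t, d),  # temporal component (frame index; equals the text position for text)
        rotary_angles(h, d),  # vertical position within the frame
        rotary_angles(w, d),  # horizontal position within the frame
    ])

# Text token at sequence position 7: all three components share the same index,
# so this reduces to ordinary 1D RoPE.
text_angles = mrope_angles(7, 7, 7, dim=48)
# Video patch at frame 3, patch row 5, patch column 9.
video_angles = mrope_angles(3, 5, 9, dim=48)
```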
The potential applications of this technology are vast. For instance, it can be used for handwritten text recognition (HTR), as demonstrated by digital humanities researcher William J.B. Mattingly on X. It can also be applied to problem-solving, as shown by user Ashutosh Shrivastava, who successfully used the model to solve a calculus problem.
Qwen2-VL is available both through an API and as open-source models, offering developers and researchers a powerful tool for advancing AI capabilities.
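For the open-source checkpoints, a typical route is Hugging Face Transformers. The snippet below is a minimal sketch assuming the Qwen/Qwen2-VL-7B-Instruct checkpoint and a Transformers release with Qwen2-VL support; example.jpg is a placeholder path.

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Build a chat-formatted prompt with one image placeholder.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What text appears in this image?"},
    ],
}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

image = Image.open("example.jpg")  # placeholder path
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

# Generate and decode only the newly produced tokens.
output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```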
Source: https://analyticsindiamag.com/ai-news-updates/alibaba-launches-qwen2-vl-surpasses-gpt-4o-claude-3-5-sonnet/