Skip to Content

ERNIE 4.5 VL-28B-A3B-Thinking: The Next Leap in Multimodal AI

Welcome to the Future of Multimodal AI

Get All The Latest to Your Inbox!

Thanks for registering!

 

Advertise Here!

Gain premium exposure to our growing audience of professionals. Learn More

ERNIE 4.5 VL-28B-A3B-Thinking is a potentially transformative advancement for anyone building or researching sophisticated visual-language applications. This model stands out by leveraging an enormous and diverse blend of images and text during training. 

Its underlying architecture uses advanced multimodal reinforcement learning techniques, specifically GSPO and IcePo, paired with dynamic difficulty sampling. This ensures robust, stable learning in a Mixture of Experts (MoE) framework. Despite activating only 3 billion parameters at inference, ERNIE 4.5 matches or even surpasses much larger models, delivering efficient yet powerful performance.

What Makes ERNIE 4.5 Exceptionally Capable?

  • Visual Reasoning: Tackles multi-step logic tasks, such as interpreting complex charts and deriving insights from visual data.

  • STEM Problem Solving: Solves science and math challenges directly from images, like calculating electrical resistance from a circuit diagram.

  • Visual Grounding: Links language instructions to visual elements, excelling at object identification and localization within images.

  • Thinking with Images: Uses functions like zoom to inspect fine details, mirroring human analysis strategies.

  • Tool Integration: Connects with external resources (e.g., image search) to enhance understanding and answer questions beyond its initial training set.

  • Video Understanding: Demonstrates temporal and spatial awareness by extracting subtitles, timestamps, and identifying specific scenes within videos.

Flexible Deployment and Customization

ERNIE 4.5 is engineered for easy integration. Developers can deploy it with major frameworks including Hugging Face Transformers, vLLM, and FastDeploy by using detailed configurations for optimized inference. For teams with specialized requirements, the ERNIEKit toolkit enables advanced fine-tuning, including instruction and function-specific training. This adaptability makes it simple to tailor the model for unique business or research needs.

Open-Source Impact and Licensing

Distributed under the Apache License 2.0, ERNIE 4.5 is accessible for both commercial and research endeavors. Its open-source nature not only fosters innovation and reproducibility but also encourages collaborative development, establishing ERNIE 4.5 as a key driver in the next generation of AI solutions.

Real-World Problem Solving in Action

  • Visual Data Analysis: Accurately determines optimal business hours from a customer traffic chart, considering both visual data and user-defined constraints.

  • STEM Expertise: Reads a bridge circuit diagram, applies physics laws, and calculates equivalent resistance with clear, stepwise reasoning.

  • Grounded Object Detection: Identifies people in suits within an image and outputs bounding boxes in a structured format.

  • Detail-Oriented Image Reading: Zooms in on a blue sign within a photo to precisely read and relay the text “HOTEL BUZA.”

  • Adaptive Tool Use: Recognizes an unfamiliar plush toy by leveraging image search tools, confirming its identity as “Dundun” from MINISO.

  • Advanced Video Analysis: Extracts and timestamps subtitles from video and identifies scenes filmed on a bridge, showcasing nuanced temporal reasoning.

The Takeaway: Setting the Standard for AI Reasoning

ERNIE 4.5 VL-28B-A3B-Thinking redefines the boundaries of multimodal AI. Its ability to analyze, reason, and interact across images, text, and videos while seamlessly integrating external tools, empowers developers to create intelligent systems with human-like perception and decision-making. As multimodal agents become central to AI progress, ERNIE 4.5 stands out as a robust, efficient, and open foundation for the future.

Models page: https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-Thinking

Source: ernie.baidu.com | HuggingFace

ERNIE 4.5 VL-28B-A3B-Thinking: The Next Leap in Multimodal AI
Joshua Berkowitz November 12, 2025
Views 2794
Share this post