
IBM's Granite Docling: A Compact VLM for End-to-End Document Conversion

From PDFs to LLM-Ready Data with a 258M-Parameter Vision-Language Workhorse

Background: Docling's Journey

Docling addresses a persistent bottleneck in AI workflows: converting messy, unstructured PDFs and scans into clean, structured, model-ready data. In its first year, the project amassed more than 37,000 GitHub stars, was contributed to the Linux Foundation, and expanded with the Docling OpenShift Operator for large-scale ingestion, a capability particularly relevant to regulated industries.

The team emphasizes local-first efficiency, high precision/recall, and rapid iteration, and is now pushing toward agentic systems that not only parse documents but also generate and manipulate them (Nicoud, 2025).

GitHub: docling-project/docling ("Get your documents ready for gen AI"; Python; ~39.8k stars, 2.8k forks)

Key Takeaways

  • Purpose-built for documents: A 258M-parameter image+text-to-text VLM optimized for end-to-end conversion inside the Docling toolkit, not a general image reasoning model (IBM Granite, 2025).

  • Higher fidelity outputs: Improved OCR, equations (LaTeX), tables (OTSL), code, and charts, with better stability and region-guided inference (IBM Granite, 2025).

  • Open source momentum: Docling has 37k+ GitHub stars, joined the Linux Foundation, and collaborates with Red Hat on an OpenShift Operator for large-scale ingestion (Nicoud, 2025).

  • Enterprise-ready building block: Enables cleaner RAG corpora, audit/compliance extraction, enterprise search enrichment, and scientific parsing.

Why It Matters

For teams bottlenecked by messy PDFs and scans, IBM's Docling project has become a go-to pipeline for turning unstructured pages into structured, model-ready text and tables, amassing tens of thousands of GitHub stars and graduating into the Linux Foundation. 

The latest step is Granite Docling, a compact yet capable vision-language model (VLM) that brings end-to-end document conversion closer to a single-pass workflow and points Docling toward agentic systems that not only read documents but also generate and manipulate them.

What's New in Granite Docling

Granite Docling 258M is an image+text-to-text model designed to sit inside the Docling toolkit rather than replace it. Released under Apache 2.0 on September 17, 2025, it focuses on accurate, layout-aware conversion with strong OCR, table, code, formula, and chart handling, all aligned to Docling's structured outputs and tags (IBM Granite, 2025).

  • Better math and inline equations: improved detection and LaTeX rendering.
  • Flexible inference: full-page or bbox-guided region processing for precision.
  • Stability: reduced risk of looping and stuck generations.
  • Element-level QA: ask about structure, order, headers, or footers.
  • Multi-language (experimental): early support for Japanese, Arabic, and Chinese.

It also supports targeted instructions such as converting charts to tables, formulas to LaTeX, code blocks to text, and tables to OTSL, plus location-scoped actions that let you OCR or identify content inside specific regions, useful for invoices, forms, and scientific figures.
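
To make targeted instructions concrete, the sketch below scopes the model to one cropped region and asks only for LaTeX. The file name, crop coordinates, and prompt wording are illustrative assumptions; cropping with PIL is simply one practical way to focus the model on a single element, while Granite Docling's native location-tag prompting is described in the model card (IBM Granite, 2025).

from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "ibm-granite/granite-docling-258M"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Crop a formula region found by an upstream layout step (coordinates are illustrative).
formula = Image.open("page.png").crop((120, 430, 880, 520))  # (left, top, right, bottom)

# Targeted instruction: ask only for LaTeX for this element.
messages = [{"role": "user", "content": [{"type": "image"},
                                         {"type": "text", "text": "Convert formula to LaTeX."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=[formula], text=prompt, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
print(processor.batch_decode(model.generate(**inputs, max_new_tokens=256), skip_special_tokens=True)[0])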

How It Works (Architecture)

Under the hood, Granite Docling builds on IDEFICS3 with three notable choices: a SigLIP2-base-patch16-512 vision encoder, a pixel-shuffle projector to connect vision and text, and a Granite 165M language model. 
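
The pixel-shuffle projector is worth a closer look: it folds each small neighborhood of vision tokens into the channel dimension, shrinking the visual sequence the language model must attend to, before a linear projection maps it into the text embedding space. The sketch below illustrates the general technique with made-up dimensions; it is not IBM's implementation.

# Illustration of a pixel-shuffle (space-to-depth) connector: fold each r x r block of
# vision tokens into the channel dimension, shrinking the visual token count by r^2,
# then project into the language model's embedding space.
import torch
import torch.nn as nn

class PixelShuffleProjector(nn.Module):
    def __init__(self, vision_dim: int, text_dim: int, ratio: int = 2):
        super().__init__()
        self.ratio = ratio
        self.proj = nn.Linear(vision_dim * ratio * ratio, text_dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, dim) with seq = h * w from the vision encoder
        b, seq, dim = tokens.shape
        h = w = int(seq ** 0.5)
        r = self.ratio
        x = tokens.view(b, h, w, dim)
        # Group each r x r block of neighboring tokens and concatenate along the channel axis.
        x = x.view(b, h // r, r, w // r, r, dim).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(b, (h // r) * (w // r), dim * r * r)
        return self.proj(x)  # (batch, seq / r^2, text_dim)

# Example: a 32x32 grid of 768-d vision tokens becomes 16x16 tokens in a 576-d LM space.
# Dimensions are illustrative, not Granite Docling's actual sizes.
projector = PixelShuffleProjector(vision_dim=768, text_dim=576, ratio=2)
out = projector(torch.randn(1, 32 * 32, 768))
print(out.shape)  # torch.Size([1, 256, 576])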

Training used the nanoVLM framework with supervised fine-tuning that bakes DocTags into the model's habits, speeding convergence. The corpus blends real pages (DoclingMatix) with synthetic sets (SynthCodeNet for code, SynthFormulaNet for math, and SynthChartNet for charts) to explicitly teach the outputs Docling needs.

IBM trained on the Blue Vela supercomputing cluster with NVIDIA H100 GPUs. Benchmarks in the model card show across-the-board gains versus prior SmolDocling releases, including OCR accuracy and layout fidelity.

Intended Use and Limitations

This model is purpose-built for document understanding and conversion as part of the Docling pipeline. It is not intended for generic image understanding. For broader image-text tasks, IBM recommends Granite Vision models paired with Granite Guardian for safety filtering. As with any compact model, validate outputs on your own domain and monitor for hallucinations in generative scenarios (IBM Granite, 2025).

Evaluation Highlights

The model card reports gains over prior SmolDocling releases on document-oriented benchmarks, including OCR accuracy and layout fidelity. Evaluations can be reproduced with the docling-eval framework for document tasks and lmms-eval for multimodal benchmarks (IBM Granite, 2025). As always, performance should be validated on domain-specific corpora.

Open-Source Momentum and Use Cases

Docling's philosophy of local-first efficiency, high precision/recall, and rapid iteration underpins adoption across industries. Contributions via Red Hat have yielded an OpenShift Operator for bank-scale ingestion, while IBM teams use Docling to accelerate consulting and client engineering work. The roadmap extends beyond conversion toward structured extraction at scale and agent workflows that can synthesize and transform documents on the fly (Nicoud, 2025).

With Granite Docling slotted directly into the pipeline, teams can aim for cleaner RAG corpora, smarter audit/compliance extraction, enterprise search enrichment, and reproducible scientific parsing, without stitching together a patchwork of single-purpose models.
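
As one concrete illustration of the RAG use case, the sketch below converts a document with Docling and splits the exported Markdown into heading-scoped chunks ready for embedding. The naive regex splitter is an assumption for illustration, not Docling's own chunking utilities, and the URL is just an example document.

import re
from docling.document_converter import DocumentConverter

# Convert a PDF to structured Markdown with Docling's default pipeline.
result = DocumentConverter().convert("https://arxiv.org/pdf/2501.17887")
markdown = result.document.export_to_markdown()

# Naive chunking: split at Markdown headings so each chunk keeps its section context.
chunks = [c.strip() for c in re.split(r"\n(?=#+ )", markdown) if c.strip()]
for i, chunk in enumerate(chunks[:3]):
    print(f"--- chunk {i} ({len(chunk)} chars) ---")
    print(chunk[:200])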

Getting Started

The fastest path is the Docling CLI and SDK, which automatically fetch Granite Docling and output HTML/Markdown with layout-aware fidelity. One-line example:

docling --to html --to md --pipeline vlm --vlm-model granite_docling "https://arxiv.org/pdf/2501.17887"
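
From Python, a minimal SDK sketch routes PDFs through the VLM pipeline; the module paths here follow Docling's documented minimal VLM pipeline example and may shift between versions:

from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

# Route PDF inputs through the VLM pipeline; recent Docling releases fetch Granite Docling by default.
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_cls=VlmPipeline)}
)
result = converter.convert("https://arxiv.org/pdf/2501.17887")
print(result.document.export_to_markdown())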

You can also run it with plain Transformers, with vLLM for throughput, or with MLX variants on Apple Silicon. The model ships under Apache 2.0. Remember that it is optimized for documents and is best used inside Docling pipelines for stable, structured outputs (IBM Granite, 2025).

# Minimal illustration with Transformers (single-page image).
# Prefer the Docling SDK for multi-page documents and structured outputs.
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "ibm-granite/granite-docling-258M"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Build the prompt with the processor's chat template, which inserts the image placeholder.
messages = [{"role": "user", "content": [{"type": "image"},
                                         {"type": "text", "text": "Convert this page to docling."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Encode one rendered page image plus the instruction, then decode the generated markup.
inputs = processor(images=[Image.open("page.png")], text=prompt, return_tensors="pt")
inputs = inputs.to(model.device, dtype=torch.bfloat16)  # casts only floating-point tensors
outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])

Bottom Line

Granite Docling is a pragmatic release: small enough to run widely, specialized enough to deliver high-quality conversions, and open enough to fit enterprise pipelines. For anyone turning PDFs into features, it is a welcome upgrade that keeps Docling's momentum, and the march toward agentic document systems, moving forward.

References

(1) IBM Granite Docling model card. (IBM Granite, 2025).
(2) Docling's rise: The IBM toolkit turning unstructured documents into LLM-ready data. (Nicoud, 2025).


Joshua Berkowitz, September 26, 2025