From documents to answers: inside Azure's GPT-RAG Data Ingestion service

Enterprise knowledge rarely lives in one tidy place. Files sprawl across shares and SharePoint sites, formats range from PDFs and images to spreadsheets and transcripts, and AI teams need a reliable way to transform that chaos into retrieval-ready chunks.
Azure/gpt-rag-ingestion tackles exactly that: a production-minded, open-source ingestion service that prepares heterogeneous content for Azure AI Search and Retrieval-Augmented Generation (RAG) applications.
Key features & functionality
Out of the box, the repository ships with pragmatic building blocks that map cleanly to real-world operations. In code terms, look to main.py for the HTTP API and scheduler, chunking/ for document strategies, and connectors/ for data sources:
- Format-aware chunking: chunking/chunker_factory.py routes each file extension to a purpose-built chunker (PDF, images, DOCX, PPTX, VTT transcripts, JSON, spreadsheets). Multimodal mode pairs text with figure captions and embeddings.
- Embeddings service: The /text-embedding endpoint uses Azure OpenAI via managed identity to produce vectors consistently (see tools/aoai.py); a minimal sketch of that flow follows this list.
- SharePoint ingestion: connectors/sharepoint streams metadata from Microsoft Graph, downloads content, chunks, and indexes it into Azure AI Search, tracking successes and failures.
- Scheduled freshness: APScheduler cron jobs (configured with CRON_RUN_* variables in main.py) periodically index new files and purge deleted ones, including a dedicated image purge for multimodal scenarios.
- Operational safety: App Configuration-driven settings, API key validation for HTTP endpoints, and OpenTelemetry instrumentation for FastAPI and HTTPX support safe rollout and observability.
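Here is a minimal sketch of that managed-identity embeddings flow, assuming the openai and azure-identity Python packages; the endpoint, API version, and deployment name are placeholders rather than the exact values tools/aoai.py uses:

import asyncio
from azure.identity.aio import DefaultAzureCredential, get_bearer_token_provider
from openai import AsyncAzureOpenAI

async def embed(texts: list[str]) -> list[list[float]]:
    credential = DefaultAzureCredential()
    token_provider = get_bearer_token_provider(
        credential, "https://cognitiveservices.azure.com/.default"
    )
    client = AsyncAzureOpenAI(
        azure_endpoint="https://<your-aoai>.openai.azure.com",  # placeholder
        azure_ad_token_provider=token_provider,                 # no API keys
        api_version="2024-06-01",                               # placeholder
    )
    response = await client.embeddings.create(
        model="text-embedding-3-large",  # your deployment name
        input=texts,
    )
    await credential.close()
    return [item.embedding for item in response.data]

Because the token provider is passed instead of an API key, the same code runs locally (Azure CLI credentials) and in production (managed identity) without secrets.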
The problem it tackles
RAG systems live or die by their data prep. It is not enough to toss documents at an index; you need format-aware chunking, consistent embeddings, metadata that preserves provenance, and connectors that keep content fresh. Without this discipline, you get hallucinations, brittle results, or indexes that drift out of sync with the source of truth.
The solution at a glance
The GPT-RAG Data Ingestion service is a FastAPI application that automates document processing end to end. It detects each file's shape, applies a fit-for-purpose chunking strategy, generates text and image embeddings, and writes enriched chunks into an Azure AI Search index.
It also includes a SharePoint connector with streaming metadata reads, scheduled jobs to keep indexes current, and optional multimodal support that attaches figure captions and embeddings to the right text passages. In short, it is the missing intake valve for the broader GPT-RAG solution.
Under the hood
Two HTTP endpoints power programmatic ingestion: /document-chunking validates requests with JSON Schema, downloads the document (via tools.blob.BlobClient when given a SAS URL), and returns chunk objects; /text-embedding turns text into embeddings using Azure OpenAI.
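A client-side sketch of the first endpoint, using httpx; the payload field names below are assumptions, so consult the JSON Schema the service validates against for the real contract:

import httpx

async def chunk_document(base_url: str, api_key: str, sas_url: str) -> dict:
    # Hypothetical field names; the endpoint's JSON Schema is authoritative.
    payload = {"documentUrl": sas_url, "documentName": "report.pdf"}
    async with httpx.AsyncClient(base_url=base_url, timeout=120) as client:
        response = await client.post(
            "/document-chunking",
            json=payload,
            headers={"X-API-KEY": api_key},  # enforced on every endpoint
        )
        response.raise_for_status()
        return response.json()  # chunk objects ready for indexing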
A FastAPI lifespan hook starts an AsyncIO scheduler that registers three cron-driven jobs for SharePoint indexing, SharePoint purge, and multimodal image purge. Configuration is pulled once at startup through dependencies.get_config() and used across chunkers, connectors, and tools.
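The startup pattern looks roughly like the sketch below, assuming standard crontab expressions in the CRON_RUN_* settings; the job and variable names are illustrative, not the service's exact ones:

import os
from contextlib import asynccontextmanager

from apscheduler.schedulers.asyncio import AsyncIOScheduler
from apscheduler.triggers.cron import CronTrigger
from fastapi import FastAPI

scheduler = AsyncIOScheduler()

async def run_sharepoint_index(): ...   # placeholder job bodies
async def run_sharepoint_purge(): ...
async def run_images_purge(): ...

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Each CRON_RUN_* variable holds a crontab expression; defaults are examples.
    jobs = {
        run_sharepoint_index: os.getenv("CRON_RUN_SHAREPOINT_INDEX", "0 * * * *"),
        run_sharepoint_purge: os.getenv("CRON_RUN_SHAREPOINT_PURGE", "30 * * * *"),
        run_images_purge: os.getenv("CRON_RUN_IMAGES_PURGE", "0 3 * * *"),
    }
    for func, expr in jobs.items():
        scheduler.add_job(func, CronTrigger.from_crontab(expr))
    scheduler.start()
    yield
    scheduler.shutdown()

app = FastAPI(lifespan=lifespan)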
The chunking architecture is intentionally pluggable. ChunkerFactory inspects each filename and returns the right implementation: TranscriptionChunker for VTT, SpreadsheetChunker for Excel, JSONChunker, LangChainChunker as a fallback, and for PDFs and images, either DocAnalysisChunker or MultimodalChunker.
The multimodal path calls Azure Document Intelligence to identify figures, uploads extracted images to blob storage, generates captions and embeddings, and appends those to the chunk that references the figure. By preserving page offsets and figure IDs, retrieved answers can cite specific evidence, not just generic text.
class ChunkerFactory:
    """Routes each document to a format-specific chunker."""

    # Chunker classes and the filename helpers come from the repo's
    # chunking/ package; feature flags normally arrive via App Configuration.
    def __init__(self, multimodality: bool = False, docint_40_api: bool = False):
        self.multimodality = multimodality
        self.docint_40_api = docint_40_api

    def get_chunker(self, data):
        extension = get_file_extension(get_filename_from_data(data))
        if extension == 'vtt':
            return TranscriptionChunker(data)  # meeting/call transcripts
        elif extension == 'json':
            return JSONChunker(data)
        elif extension in ('xlsx', 'xls'):
            return SpreadsheetChunker(data)
        elif extension in ('pdf', 'png', 'jpeg', 'jpg', 'bmp', 'tiff'):
            # Document Intelligence path; multimodal mode adds figure handling.
            return MultimodalChunker(data) if self.multimodality else DocAnalysisChunker(data)
        elif extension in ('docx', 'pptx'):
            if self.docint_40_api:
                return MultimodalChunker(data) if self.multimodality else DocAnalysisChunker(data)
            raise RuntimeError('Processing docx and pptx requires Document Intelligence 4.0.')
        else:
            return LangChainChunker(data)  # generic text fallback
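The factory keeps call sites simple: callers hand over the raw document payload and get back a chunker whose output shares a common shape. For a multimodal PDF, an enriched chunk might look roughly like the dictionary below; field names are illustrative, not the index's exact schema:

chunk = {
    "content": "Figure 3 shows quarterly revenue by region...",
    "source": "https://<storage>/contoso-q3.pdf",  # provenance back to the blob
    "page": 12,                                    # page offset preserved
    "figure_id": "3",                              # ties the caption to its figure
    "caption": "Bar chart of Q3 revenue by region",
    "content_vector": [0.012, -0.034],             # truncated text embedding
    "caption_vector": [0.051, 0.007],              # truncated image embedding
}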
On the indexing side, tools/aisearch.py wraps the Azure AI Search async client with managed identity or Azure CLI credentials, batches uploads, and offers helpers to delete and query by filters. The SharePoint ingestor consults the index before re-chunking, skipping unchanged files and deleting old chunks when updates are detected. Telemetry spans are emitted through OpenTelemetry and Azure Monitor exporters so you can tie ingestion health to infrastructure signals (Microsoft, 2025).
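A hedged sketch of that indexing pattern with the async Azure AI Search SDK; the index name, key and filter fields, and batch size are placeholders, and tools/aisearch.py wraps this with its own helpers:

from azure.identity.aio import DefaultAzureCredential
from azure.search.documents.aio import SearchClient

async def index_chunks(endpoint: str, index_name: str, chunks: list[dict]) -> None:
    credential = DefaultAzureCredential()
    async with SearchClient(endpoint, index_name, credential) as client:
        # Upload in batches to stay under per-request service limits.
        for i in range(0, len(chunks), 1000):
            await client.upload_documents(documents=chunks[i : i + 1000])
    await credential.close()

async def purge_parent(endpoint: str, index_name: str, parent_id: str) -> None:
    credential = DefaultAzureCredential()
    async with SearchClient(endpoint, index_name, credential) as client:
        # "parent_id" and "id" are placeholder field names for this index.
        results = await client.search(
            search_text="*", filter=f"parent_id eq '{parent_id}'", select=["id"]
        )
        stale = [{"id": doc["id"]} async for doc in results]
        if stale:
            await client.delete_documents(documents=stale)
    await credential.close()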
Usage & deployment
Provision the GPT-RAG infrastructure first as described in the parent project, then deploy this service. The repository assumes Azure resource names and secrets are provided via App Configuration and Key Vault. When you are ready to roll out the web app:
azd env refresh
azd deploy
Ensure you reuse the same subscription, resource group, and environment name you used for the infrastructure so components resolve to the right endpoints. For HTTP calls, include a valid X-API-KEY header as enforced by dependencies.validate_api_key_header.
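A quick smoke test against a deployed instance might look like the command below; the hostname, key, and request body field are placeholders, so check the endpoint's schema before relying on them:

curl -X POST "https://<your-ingestion-app>/text-embedding" \
  -H "X-API-KEY: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"input": "What does our travel policy cover?"}'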
Community & contribution
The project is maintained under the Microsoft Azure organization with active contributors and regular releases. Contributions follow a standard CLA and code of conduct; see CONTRIBUTING.md in the repository. Issues and PRs focus on new connectors, chunker improvements, and operational robustness. If you are building adjacent RAG components, consider contributing chunkers for domain-specific formats, or telemetry dashboards for ingestion SLAs.
License & terms
The repository is MIT licensed as stated in LICENSE.md. This grants broad rights to use, modify, and distribute the software with attribution and without warranty. Microsoft trademarks and brand assets require adherence to Microsoft's Trademark & Brand Guidelines, and third-party trademarks remain subject to their respective policies. Review the SECURITY.md and CODE_OF_CONDUCT guidance in the repo for responsible disclosure and community expectations.
About Microsoft Azure
Azure maintains a wide portfolio of SDKs, services, and open-source tooling for building cloud-native and AI applications. Notable related projects include the Azure SDKs for .NET and Python and platform services like Azure AI Search and Azure OpenAI (Microsoft, 2025). The GPT-RAG family exemplifies a practical, batteries-included approach to applied AI on Azure.
Impact & what comes next
By codifying ingestion patterns, this project reduces the time from messy documents to answerable knowledge. Its modular chunkers, SharePoint connector, and managed identity-first design make it a strong starting point for enterprise RAG. Looking ahead, richer connector coverage (data lakes, wikis, ticketing systems), adaptive chunking guided by retrieval feedback, and tighter observability could push quality even further. Because the code is open and MIT-licensed, teams can fork, specialize, and share improvements back.
Conclusion
Azure's GPT-RAG Data Ingestion service turns diverse enterprise files into reliable retrieval units for modern AI assistants. If you are standing up a RAG application on Azure, this is a pragmatic, extensible intake layer that meets you where your content lives and prepares it for fast, explainable answers. Explore the code, run it in your environment, and help shape the next wave of open tooling for enterprise AI.
Explore the repository and join the community:
https://github.com/Azure/gpt-rag-ingestion