LangExtract: Grounded, Structured Extraction for Long Text
LangExtract is a focused open-source library from Google that turns unstructured text into structured data you can trust. It combines schema-guided prompts, precise span alignment to the source text, and an interactive HTML visualizer into a single workflow that scales from short notes to full-length documents.
If your work involves finding entities and relationships in messy prose - think clinical notes, reports, or novels - this repository gives you production-grade building blocks and a clear path from prototype to repeatable pipelines.
The problem and the solution
Extracting reliable structure from long, noisy text is hard. LLMs can identify entities, but outputs often drift, lose grounding, or get tangled in format mistakes. LangExtract tackles this by orchestrating the full pipeline: chunk long text sensibly, prompt with few-shot examples that specify the schema, run batched inferences across chunks and passes, then align every extraction back to exact character positions in the original text. The result is traceable, reviewable data that can be visualized and audited - without fine-tuning a model or writing bespoke alignment code.
Key features
- Source-grounded extractions: Every entity includes an exact char interval, enabling deep linking and reliable HTML highlighting (see the snippet after this list).
- Schema guidance from examples: Few-shot examples define classes and attributes; supported models can enforce structured output via controlled generation.
- Chunking and multi-pass recall: Long documents are split into sensible chunks, processed in parallel, and optionally re-run across multiple passes to merge non-overlapping findings (longer_text_example.md).
- Interactive visualization: Turn JSONL results into a self-contained HTML viewer you can share with reviewers or stakeholders.
- Pluggable providers: Gemini and Ollama are built in; OpenAI is available via an optional extra; external providers are discovered via entry points (providers/README.md, COMMUNITY_PROVIDERS.md).
- Modern Python packaging: Clean pyproject.toml with optional extras for dev, test, and OpenAI.
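To make the first bullet concrete, here is a minimal sketch of consuming grounded output; the field names (result.extractions, char_interval with start_pos and end_pos) follow langextract's data model but are worth verifying against your installed version:

```python
import langextract as lx

# Minimal sketch of source grounding: every extraction carries exact
# character offsets back into the input (field names assumed from
# langextract's data model; verify against your installed version).
examples = [
    lx.data.ExampleData(
        text="JULIET. O happy dagger!",
        extractions=[
            lx.data.Extraction(extraction_class="character", extraction_text="JULIET"),
        ],
    )
]

result = lx.extract(
    text_or_documents="ROMEO. But soft! what light through yonder window breaks?",
    prompt_description="Extract characters using exact text spans.",
    examples=examples,
    model_id="gemini-2.5-flash",
)

for extraction in result.extractions:
    span = extraction.char_interval  # exact offsets into the original string
    print(extraction.extraction_class, repr(extraction.extraction_text),
          span.start_pos, span.end_pos)
```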
Why I like it
The repository is practical and opinionated in the right places. The README shows working examples (including a full Romeo and Juliet run) and the code foregrounds the hard parts - alignment, chunking, and provider pluggability.
The annotation.py logic is readable and explicit about tradeoffs (e.g., merging non-overlapping extractions across multiple passes).
The provider design uses a registry and entry points so you can swap between Gemini, OpenAI (optional extra), Ollama, or a third-party plugin with minimal friction. And the HTML visualizer is built-in, so you get a review tool for free.
Under the hood
LangExtract's architecture revolves around four core components that work together to transform messy text into reliable structured data. Understanding these components reveals how you can customize the system for your specific domain and requirements.
Document Processing Pipeline
The process begins with the Annotator class, which orchestrates the entire extraction flow. Documents are first tokenized and intelligently chunked using the ChunkIterator, which respects sentence boundaries and maintains context while staying within your model's character limits. Each TextChunk preserves token intervals and character positions, ensuring every extraction can be traced back to its exact source location.
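The real ChunkIterator lives in the library; purely as an illustration of the idea, a sentence-respecting chunker that keeps true character offsets might look like this:

```python
import re

def chunk_text(text: str, max_chars: int = 1000):
    """Illustrative stand-in (not the library's ChunkIterator): group
    sentences into chunks under max_chars while recording each chunk's
    exact character offset in the original text."""
    # Sentence spans with true offsets, trailing whitespace included.
    spans = [m.span() for m in re.finditer(r"[^.!?]+[.!?]*\s*", text)]
    chunks, chunk_start, chunk_end = [], None, None
    for start, end in spans:
        if chunk_start is None:
            chunk_start, chunk_end = start, end
        elif end - chunk_start > max_chars:
            chunks.append((chunk_start, text[chunk_start:chunk_end]))
            chunk_start, chunk_end = start, end
        else:
            chunk_end = end
    if chunk_start is not None:
        chunks.append((chunk_start, text[chunk_start:chunk_end]))
    return chunks  # [(offset, chunk_text), ...]; offsets index the original

print(chunk_text("One sentence. Two sentences! A third? And a fourth.", max_chars=30))
```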
Prompt Engineering System
The QAPromptGenerator transforms your few-shot examples into model-specific prompts. It handles both JSON and YAML output formats, optionally wrapping responses in code fences for models that benefit from explicit structure markers. The prompt system includes attribute handling, where entities can carry additional metadata (like "severity" for medical conditions), and automatically formats examples to establish clear input-output patterns for the model.
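As an illustration only (not QAPromptGenerator's exact output), the assembly amounts to serializing each few-shot example as a question/answer pair whose answer is the structured output the model should imitate:

```python
import json

def build_prompt(description: str, examples: list[dict], document: str) -> str:
    """Illustrative few-shot prompt assembly in the spirit of
    QAPromptGenerator (not its exact format): each example becomes a
    Q/A pair whose answer is the JSON the model should reproduce."""
    parts = [description, ""]
    for ex in examples:
        parts += [f"Q: {ex['text']}",
                  "A: " + json.dumps({"extractions": ex["extractions"]}),
                  ""]
    parts += [f"Q: {document}", "A:"]
    return "\n".join(parts)

print(build_prompt(
    "Extract characters and emotions. Use exact text.",
    [{"text": "ROMEO. But soft!",
      "extractions": [{"character": "ROMEO"}, {"emotion": "But soft!"}]}],
    "JULIET. O happy dagger!",
))
```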
Multi-Pass Extraction Logic
When you set extraction_passes > 1, LangExtract implements a sophisticated "first-pass wins" strategy. The system processes each chunk multiple times, but when extractions from different passes overlap at the character level, earlier extractions take precedence. This prevents double-counting while improving recall, which is particularly valuable for complex documents where entities might be missed in a single pass. The merge logic checks character intervals to resolve conflicts deterministically.
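The rule is easy to sketch. The snippet below is illustrative rather than the library's code: spans are half-open character intervals, passes are visited in execution order, and a later pass only contributes extractions that overlap nothing already kept:

```python
def merge_passes(passes):
    """Illustrative "first-pass wins" merge: earlier passes keep their
    spans; later passes add only non-overlapping extractions."""
    kept = []

    def overlaps(a, b):
        # Half-open [start, end) character intervals.
        return a[0] < b[1] and b[0] < a[1]

    for extractions in passes:  # passes in execution order
        for ex in extractions:  # ex = (start, end, label)
            if not any(overlaps(ex[:2], k[:2]) for k in kept):
                kept.append(ex)
    return sorted(kept)

# Pass 1 found a medication span; pass 2 re-finds it (dropped) and adds a new span.
print(merge_passes([
    [(10, 17, "medication")],
    [(10, 17, "medication"), (25, 30, "frequency")],
]))
# -> [(10, 17, 'medication'), (25, 30, 'frequency')]
```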
Provider Registry and Customization
LangExtract's provider system uses regex patterns to automatically route model IDs to the appropriate backend. Patterns like ^gemini, ^openai, and ^ollama map to their respective provider classes, while entry points enable community plugins to register themselves at import time. You can override auto-detection by explicitly choosing a provider, or write custom providers by extending BaseLanguageModel and registering new patterns. The resolver system handles output parsing with fuzzy text alignment: when the model's extraction text doesn't exactly match the source, difflib-based alignment finds the best character positions with configurable similarity thresholds.
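To show the shape of that last step, here is a toy alignment using only difflib from the standard library; LangExtract's resolver is more involved, but the threshold-gated matching idea is the same:

```python
import difflib

def fuzzy_align(extraction_text: str, source: str, min_ratio: float = 0.6):
    """Illustrative fuzzy alignment (not LangExtract's resolver): locate
    where a possibly-reformatted extraction best matches the source, and
    accept the span only above a configurable similarity threshold."""
    s, e = source.lower(), extraction_text.lower()
    matcher = difflib.SequenceMatcher(None, s, e, autojunk=False)
    m = matcher.find_longest_match(0, len(s), 0, len(e))
    if m.size == 0:
        return None
    start, end = m.a, m.a + m.size
    ratio = difflib.SequenceMatcher(None, s[start:end], e).ratio()
    return (start, end) if ratio >= min_ratio else None

source = "Patient was given 250 mg of Aspirin twice daily."
print(fuzzy_align("250mg aspirin", source))  # finds a span despite formatting drift
```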
Outputs are Portable
JSONL is the canonical artifact, and the visualizer consumes that file to render an interactive, shareable review surface. The pyproject shows a modest dependency footprint, pytest-based tests, and import-linter contracts to keep boundaries clean between core and provider code.
Provider architecture: pluggable backends for any LLM
One of LangExtract's most powerful features is its provider system: a pluggable architecture that lets you swap between different LLM backends without changing your extraction code.
Whether you're running local models with Ollama, cloud models from OpenAI, or enterprise solutions like AWS Bedrock, the provider system handles the complexity of API differences while maintaining a consistent interface.
How it Works
The system uses a registry pattern with automatic discovery. When you call lx.extract(model_id="gemini-2.5-flash"), the provider registry matches your model ID against regex patterns like ^gemini, ^openai, or ^ollama. The first matching provider handles your request, instantiating the appropriate client and translating your extraction parameters into provider-specific API calls. This pattern-based routing means you can switch from "gemini-2.5-flash" to "gpt-4" to "ollama/llama3.2" with just a model ID change.
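Conceptually, that routing is a first-match scan over registered patterns. The sketch below is illustrative rather than LangExtract's actual registry (which also defers heavy imports, as noted later); the ^gpt- pattern is my addition so that "gpt-4" resolves in this toy version:

```python
import re

# Illustrative pattern registry; real providers are classes, and ^gpt- is
# an assumed pattern added so IDs like "gpt-4" resolve in this sketch.
REGISTRY: list[tuple[str, str]] = [
    (r"^gemini", "GeminiProvider"),
    (r"^openai|^gpt-", "OpenAIProvider"),
    (r"^ollama", "OllamaProvider"),
]

def resolve_provider(model_id: str) -> str:
    for pattern, provider in REGISTRY:
        if re.match(pattern, model_id):  # first match wins
            return provider
    raise ValueError(f"No provider registered for {model_id!r}")

print(resolve_provider("gemini-2.5-flash"))  # GeminiProvider
print(resolve_provider("ollama/llama3.2"))   # OllamaProvider
```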
Three Tiers of Providers
LangExtract ships with core providers (Gemini and Ollama) that have no extra dependencies, built-in providers with optional dependencies (OpenAI requires pip install langextract[openai]), and external community plugins. The community has built providers for AWS Bedrock, LiteLLM (which supports 100+ models), and llama.cpp for local GGUF files - all discoverable via Python entry points and installable with standard pip commands like pip install langextract-bedrock.
When to use Custom Providers
The provider system shines when you need enterprise models behind custom APIs, want to optimize for specific hardware (like GPU clusters), or need to integrate proprietary models. Creating a provider involves extending BaseLanguageModel, implementing the infer method, and registering patterns with the @register decorator. The provider documentation includes a complete plugin template and checklist for publishing to PyPI, making it straightforward to share custom backends with the community.
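As a shape sketch only: the docs name BaseLanguageModel, infer, and @register, but the module paths and the ScoredOutput return shape below are my assumptions; the plugin template has the authoritative signatures:

```python
import langextract as lx
from langextract.providers import registry  # module path is an assumption

@registry.register(r"^acme")  # hypothetical pattern for IDs like "acme-large"
class AcmeProvider(lx.inference.BaseLanguageModel):  # base-class path assumed
    """Shape sketch of a custom provider; consult the plugin template
    for the real signatures before publishing."""

    def __init__(self, model_id: str, **kwargs):
        super().__init__()
        self.model_id = model_id

    def infer(self, batch_prompts, **kwargs):
        # Call your proprietary API here; yield one list of scored
        # candidates per prompt (ScoredOutput shape is an assumption).
        for prompt in batch_prompts:
            yield [lx.inference.ScoredOutput(score=1.0, output="{}")]
```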
Advanced Customization
Beyond basic model switching, providers can implement schema support for structured output constraints, handle provider-specific parameters (like temperature or token limits), and optimize API usage patterns. The system supports explicit provider selection when multiple providers support the same model, lazy loading to avoid importing heavy dependencies until needed, and environment variable configuration for seamless deployment across different environments.
Putting it all together, here is the README's Romeo and Juliet run, condensed, from prompt to visualization:

```python
import langextract as lx

# Prompt plus a schema-defining few-shot example.
prompt = "Extract characters and emotions in order of appearance. Use exact text; no overlaps."
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft!",
        extractions=[
            lx.data.Extraction(extraction_class="character", extraction_text="ROMEO"),
            lx.data.Extraction(extraction_class="emotion", extraction_text="But soft!"),
        ],
    )
]

result = lx.extract(
    text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    extraction_passes=3,    # re-run chunks to improve recall
    max_workers=16,         # parallel chunk processing
    max_char_buffer=1000,   # chunk size in characters
)

# Write the JSONL artifact and render the interactive HTML viewer.
lx.io.save_annotated_documents([result], output_name="romeo.jsonl", output_dir=".")
html = lx.visualize("romeo.jsonl")
```
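In notebooks, lx.visualize returns an object whose .data attribute holds the HTML string; following the README's pattern, persisting it looks like this:

```python
# Write the visualization to disk; handle both notebook and script contexts.
with open("romeo_viz.html", "w") as f:
    f.write(html.data if hasattr(html, "data") else html)
```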
Use cases
Medical and healthcare teams are using LangExtract to extract medication information, dosages, and relationships from clinical notes with reliable source grounding - the repository's medical examples mirror techniques first studied in ML4H (Goel et al., 2023). For a live demonstration, try the radiology report structuring app, RadExtract, on HuggingFace (google/radextract).
Legal and contract analysis teams are processing large documents to extract key-value pairs, though users report challenges with noisy content and recommend careful prompt engineering for domain-specific extraction (Issue #200). Document processing companies are tackling large PDF files (20-30 pages) for structured data extraction, with some success using chunking strategies to handle complex layouts (Issue #178).
Research and academic teams are applying LangExtract to literature analysis, structured data extraction from markdown tables (Issue #169), and comparing local versus cloud model performance for cost-effective extraction pipelines. One community comparison found that local models like Gemma2:2b through Ollama provide reasonable quality while maintaining privacy and cost control.
Data engineering teams are integrating LangExtract with local model setups to replace more complex Langchain-based extraction pipelines, finding the library's simplified approach more efficient for structured extraction tasks. The built-in visualization and JSONL output format make it particularly appealing for teams that need to review and validate extractions before downstream processing.
Community and contribution
The project welcomes issues and pull requests under an Apache-2.0 license. Community providers - like Bedrock, LiteLLM, and llama.cpp - are cataloged in COMMUNITY_PROVIDERS.md. If you are adding a new backend, the plugin architecture keeps the core small and your dependencies isolated; follow the checklist in the provider docs and announce your plugin by opening a tracking issue. Contribution basics, CI expectations, and formatting conventions are documented in CONTRIBUTING.md.
Usage and license terms
LangExtract is licensed under Apache License 2.0, which allows use, modification, and distribution with attribution and a broad patent grant. You must include a copy of the license and preserve notices when redistributing; modified files should be marked as such. See LICENSE for the full terms. For health-related applications, the README also points to the Health AI Developer Foundations terms that apply to Google's health-related tooling (Google, 2025).
About Google
LangExtract comes from Google's open-source efforts around practical, safe deployment of LLMs for information extraction. The contribution and community links reference the Health AI Developer Foundations program and guidelines, which emphasize responsible development and reproducibility (Google, 2025). The maintainers publish the package to PyPI and track releases on the repository's Releases page.
Impact and what's next
By standardizing three pillars - schema guidance, source alignment, and visualization - LangExtract compresses the time from idea to reliable output. The design choices (registry-driven providers, JSONL artifacts, multi-pass merging) are simple enough to run anywhere but expressive enough to cover many verticals. Logical extensions include broader provider coverage via community plugins, richer cross-chunk coreference, and more robust parsing for non-Latin scripts. The existing issues and examples already hint at steady iteration in these areas.
Conclusion
If you need structured data from text and want grounded, reviewable results, LangExtract is a strong starting point. Read the README, scan the annotation code and the provider guide, then run one of the examples - literary, medical, or radiology - to see the full loop from prompt to visualization. The API surface is small, the outputs are auditable, and the path to customization is clear.