- The Problem. You have unstructured documents (research papers, contracts, medical notes) and need structured data. Plain LLMs hallucinate and provide no source verification.
- The Solution. Google's LangExtract library extracts structured information while mapping every extraction to its exact location in the source text. No fine-tuning required - just provide a few examples.
- The Setup. About 15 minutes: install LangExtract, configure your LLM provider, define your extraction schema with a few examples, and run your first extraction.
Want to skip local setup? LangExtract works directly in Google Colab with Gemini (no bug fix needed). Create a new Colab notebook and run:
```python
!pip install langextract
import langextract as lx
# Uses Gemini by default - just set GOOGLE_API_KEY in Colab secrets
```
For Together AI or other providers, follow the full guide below.
Why LangExtract
Every data pipeline has the same problem: valuable information is trapped in unstructured text. Research findings buried in papers. Contract terms scattered across paragraphs. Patient information in clinical notes.
Traditional approaches have tradeoffs:
| Approach | Pros | Cons |
|---|---|---|
| Regex patterns | Fast, deterministic | Breaks on natural language variation |
| NER models | Accurate for trained entities | Fixed entity types, needs training data |
| Plain LLM prompts | Flexible, understands context | No source verification, hallucination risk |
Traditional vs LangExtract pipeline (figure): source grounding is the key difference.
LangExtract fills the gap. It uses LLMs for flexible extraction but adds source grounding - every extracted piece of information is mapped to its exact location in the original text. When someone asks "where did this number come from?", you can point to the exact sentence.
Key features:
- Source grounding: Click on any extraction to see the original text
- Schema enforcement: Define your output structure with few-shot examples
- Long document support: Automatic chunking and parallel processing
- Provider flexibility: Works with Gemini, OpenAI, Together AI, and local models via Ollama
- Interactive visualization: Generate HTML reports to review extractions
This guide walks through setting up LangExtract with Together AI's Llama models, including a bug fix we discovered during testing.
The business case
Executives need numbers. Here's how LangExtract compares on cost:
| Approach | Cost per 100 Docs | Accuracy | Rework Rate | True Cost |
|---|---|---|---|---|
| Manual data entry | $500-1000 (labor) | 95-99% | 2-5% | $500-1050 |
| Plain LLM extraction | $2-5 (tokens) | 80-90% | 15-25% | $30-80 (rework) |
| LangExtract + verification | $3-8 (tokens) | 90-95% | 5-10% | $8-20 (rework) |
The math: Plain LLM extraction is cheap until you factor in hallucination rework. A 20% error rate on 100 legal documents means 20 manual reviews at $15-30 each. LangExtract's source grounding cuts verification time from minutes to seconds - you click the extraction and see the original text immediately.
Break-even point: LangExtract pays for itself after ~50 documents where accuracy matters. For one-off tasks, plain LLM prompts are faster. For production pipelines, the audit trail alone justifies the setup time.
What you will need
Before starting, make sure you have:
- Python 3.10 or higher installed
- A Together AI account with API key (or Google/OpenAI API key)
- About 15 minutes for setup
Together AI provides access to open-source models like Llama 3.3 70B at competitive prices. You get strong extraction quality without vendor lock-in. The same code works with Gemini, OpenAI, or local models by changing one line.
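As a concrete illustration, only the model_id string passed to lx.extract() changes between providers - a minimal sketch (the Together AI and Ollama IDs are the ones used later in this guide; the Gemini and OpenAI IDs are assumptions):
```python
# Candidate model_id values for lx.extract() - swap providers by changing this one string.
MODEL_IDS = {
    "together": "litellm/together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo",
    "ollama":   "litellm/ollama/llama3.1:8b",    # local, no API key required
    "gemini":   "gemini-2.5-flash",              # assumed Gemini model name
    "openai":   "litellm/openai/gpt-4o-mini",    # assumed OpenAI model name
}
```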
Step 1: Install LangExtract
Create a project directory and set up a virtual environment. This keeps LangExtract's dependencies isolated from your other Python projects.
Create your project folder:
```bash
mkdir langextract-project
cd langextract-project
```
Create a virtual environment:
```bash
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```
Install LangExtract:
```bash
pip install langextract
```
This installs the core library with support for Google's Gemini models by default.
Step 2: Install the LiteLLM provider
To use Together AI (or OpenAI, Anthropic, and 100+ other providers), install the community LiteLLM provider:
```bash
pip install langextract-litellm
```
LiteLLM is a unified interface for calling different LLM providers. Instead of learning each provider's API, you use one consistent format. The langextract-litellm package bridges LangExtract to this ecosystem, giving you access to Together AI, OpenAI, Anthropic, Azure, and many more.
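To see what that unified interface looks like on its own (independent of LangExtract), the same litellm.completion call shape works for any provider; only the model string changes. A minimal sketch:
```python
import litellm

# One call shape for every provider; the "together_ai/" prefix routes the request to Together AI.
response = litellm.completion(
    model="together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Reply with one word: hello"}],
)
print(response.choices[0].message.content)
```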
Step 3: Fix the Together AI bug
Here is something the documentation does not tell you: the LiteLLM provider has a bug when used with Together AI. It passes internal LangExtract parameters that Together AI's API cannot serialize.
The error you will see:
```
InternalServerError: Together_aiException - Object of type FormatType is not JSON serializable
```
The fix:
Open the provider file and filter out non-standard parameters. Find the file at:
```
.venv/lib/python3.12/site-packages/langextract_litellm/provider.py
```
(replace python3.12 with your interpreter's version if it differs)
Locate this code block (around line 89):
```python
response = litellm.completion(
    model=self.model_id,
    messages=messages,
    **self.provider_kwargs,
)
```
Replace it with:
```python
# Filter out non-serializable kwargs that LiteLLM doesn't understand
filtered_kwargs = {
    k: v for k, v in self.provider_kwargs.items()
    if k in ('temperature', 'max_tokens', 'top_p', 'frequency_penalty',
             'presence_penalty', 'timeout', 'api_base', 'api_key',
             'stop', 'n', 'stream', 'response_format')
}
response = litellm.completion(
    model=self.model_id,
    messages=messages,
    **filtered_kwargs,
)
```
This filters the kwargs to only include parameters that LiteLLM and Together AI understand, preventing the serialization error.
LangExtract passes internal configuration objects (like FormatType enums) through to the model provider. The Gemini provider handles these correctly, but the LiteLLM provider passes everything through to the underlying API. Together AI's API tries to JSON-serialize all parameters for the request body and fails on Python enum objects.
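You can reproduce the root cause with nothing but the standard library - a minimal sketch where FormatType is a stand-in for LangExtract's internal enum:
```python
import json
from enum import Enum

class FormatType(Enum):  # stand-in for LangExtract's internal enum
    JSON = "json"

json.dumps({"temperature": 0.2})  # plain values serialize fine

try:
    json.dumps({"format": FormatType.JSON})  # enum objects do not
except TypeError as e:
    print(e)  # Object of type FormatType is not JSON serializable
```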
Step 4: Set up your API key
Get your Together AI API key from api.together.xyz and set it as an environment variable:
```bash
export TOGETHER_AI_API_KEY="your-api-key-here"
```
On Windows PowerShell:
```powershell
$env:TOGETHER_AI_API_KEY = "your-api-key-here"
```
To make this permanent, add the export line to your ~/.bashrc or ~/.zshrc file.
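A small pre-flight check avoids a confusing authentication error mid-run - a minimal sketch:
```python
import os

# Fail fast with a clear message if the key is missing from this shell session.
if not os.environ.get("TOGETHER_AI_API_KEY"):
    raise RuntimeError("TOGETHER_AI_API_KEY is not set - export it before running extractions.")
```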
Alternative: Local execution with Ollama
For regulated industries (healthcare, legal, finance) where data cannot leave your network, run extraction locally with Ollama:
Install Ollama and pull a model:
```bash
# Install Ollama (macOS)
brew install ollama

# Pull Llama 3.1 8B (runs on most laptops)
ollama pull llama3.1:8b

# Or pull a larger model if you have a GPU
ollama pull llama3.1:70b
```
Change one line in your extraction code:
```python
# Instead of Together AI:
# model_id="litellm/together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo"

# Use local Ollama:
model_id="litellm/ollama/llama3.1:8b"
```
No API keys needed. No data leaves your machine. Same extraction code, different model ID.
Local models are slower (10-30 seconds per extraction vs 1-3 seconds with cloud APIs) and smaller models may have lower accuracy. Test extraction quality on your specific documents before committing to local-only deployment. Many teams use a hybrid: local for sensitive documents, cloud for bulk processing.
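A sketch of that hybrid pattern, assuming you can flag sensitivity per document (the is_sensitive flag and the routing rule are illustrative):
```python
import langextract as lx

CLOUD_MODEL = "litellm/together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo"
LOCAL_MODEL = "litellm/ollama/llama3.1:8b"

def extract_document(text, is_sensitive, prompt, examples):
    """Route sensitive documents to the local model; send the rest to the cloud."""
    model_id = LOCAL_MODEL if is_sensitive else CLOUD_MODEL
    return lx.extract(
        text_or_documents=text,
        prompt_description=prompt,
        examples=examples,
        model_id=model_id,
    )
```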
Step 5: Define your extraction schema
LangExtract learns what to extract from examples you provide. No fine-tuning, no training data preparation - just show it what you want.
Create a new Python file called extract.py:
```python
import langextract as lx
import textwrap
# 1. Define what you want to extract
prompt = textwrap.dedent("""\
Extract key information from this text.
Identify methods, metrics, comparisons, and applications.
Use exact text from the source. Do not paraphrase.
Extract in order of appearance.""")
# 2. Provide an example to guide the model
examples = [
lx.data.ExampleData(
text="We propose FastNet, achieving 95% accuracy on ImageNet, 2x faster than ResNet. Applications include medical imaging.",
extractions=[
lx.data.Extraction(
extraction_class="method",
extraction_text="FastNet",
attributes={"type": "model"}
),
lx.data.Extraction(
extraction_class="metric",
extraction_text="95% accuracy on ImageNet",
attributes={"benchmark": "ImageNet", "value": "95%"}
),
lx.data.Extraction(
extraction_class="comparison",
extraction_text="2x faster than ResNet",
attributes={"baseline": "ResNet", "improvement": "2x"}
),
lx.data.Extraction(
extraction_class="application",
extraction_text="medical imaging",
attributes={"domain": "healthcare"}
),
]
)
]
```
The example teaches LangExtract:
- What classes to extract: method, metric, comparison, application
- What attributes to capture: type, benchmark, value, baseline, domain
- How to format extractions: exact text spans, structured attributes
This approach is called few-shot learning. Instead of training a model on thousands of examples, you provide a handful of high-quality examples that demonstrate exactly what you want. The LLM generalizes from these examples to handle new, unseen text.
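If your documents vary in style, a second example usually helps more than a longer prompt. A sketch of adding one more ExampleData with the same classes (the abstract text here is invented purely for illustration):
```python
# A second few-shot example covering a different writing style (invented text).
examples.append(
    lx.data.ExampleData(
        text="Our retrieval-augmented approach cuts hallucination rate to 12% on TruthfulQA, a 40% reduction versus GPT-3.5.",
        extractions=[
            lx.data.Extraction(
                extraction_class="method",
                extraction_text="retrieval-augmented approach",
                attributes={"type": "technique"},
            ),
            lx.data.Extraction(
                extraction_class="metric",
                extraction_text="12% on TruthfulQA",
                attributes={"benchmark": "TruthfulQA", "value": "12%"},
            ),
            lx.data.Extraction(
                extraction_class="comparison",
                extraction_text="a 40% reduction versus GPT-3.5",
                attributes={"baseline": "GPT-3.5", "improvement": "40% reduction"},
            ),
        ],
    )
)
```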
Step 6: Run your first extraction
Add this code to your extract.py file:
```python
# 3. Your input document
input_text = """
We introduce MAXS (Meta-Adaptive eXploration Strategy), a framework
that enables LLM agents to look ahead before committing to actions.
On five reasoning benchmarks, MAXS achieves 63.46% average accuracy
compared to 52.93% for Chain-of-Thought, while using 100x fewer
tokens than Monte Carlo Tree Search. The framework supports tool use
including code execution and web search.
"""
# 4. Run the extraction
result = lx.extract(
text_or_documents=input_text,
prompt_description=prompt,
examples=examples,
model_id="litellm/together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo",
)
# 5. Print results
print(f"Found {len(result.extractions)} extractions:\n")
for ext in result.extractions:
    print(f"[{ext.extraction_class}] {ext.extraction_text}")
    if ext.attributes:
        for key, value in ext.attributes.items():
            print(f"  {key}: {value}")
    print()
```
Run it:
```bash
python extract.py
```
Expected output:
```
Found 6 extractions:

[method] MAXS (Meta-Adaptive eXploration Strategy)
  type: framework

[metric] 63.46% average accuracy
  benchmark: five reasoning benchmarks
  value: 63.46%

[comparison] compared to 52.93% for Chain-of-Thought
  baseline: Chain-of-Thought
  improvement: +10.5 points

[comparison] using 100x fewer tokens than Monte Carlo Tree Search
  baseline: Monte Carlo Tree Search
  improvement: 100x fewer tokens

[application] code execution
  domain: tool use

[application] web search
  domain: tool use
```
Step 7: Visualize results
LangExtract can generate interactive HTML visualizations showing extractions highlighted in their original context:
```python
# Save results to JSONL
lx.io.save_annotated_documents(
    [result],
    output_name="extractions.jsonl",
    output_dir=".",
)

# Generate visualization
html_content = lx.visualize("extractions.jsonl")
with open("visualization.html", "w") as f:
    f.write(html_content if isinstance(html_content, str) else html_content.data)

print("Open visualization.html in your browser")
```
The visualization shows:
- Original text with highlighted extractions
- Color-coded by extraction class
- Click any extraction to see its attributes
- Navigate between multiple documents
Practical examples
Research paper mining
Extract methods, results, and comparisons from paper abstracts to build a structured research database:
```python
prompt = """\
Extract research findings from AI paper abstracts.
- methods: The proposed technique or model name
- metrics: Quantitative results with numbers and benchmarks
- comparisons: How it compares to baselines
- limitations: Any mentioned drawbacks or constraints"""
examples = [
lx.data.ExampleData(
text="Our model achieves 92% F1 on SQuAD, outperforming BERT by 3 points. Training requires 8 GPUs for 2 days.",
extractions=[
lx.data.Extraction(
extraction_class="metric",
extraction_text="92% F1 on SQuAD",
attributes={"benchmark": "SQuAD", "metric_type": "F1", "value": "92%"}
),
lx.data.Extraction(
extraction_class="comparison",
extraction_text="outperforming BERT by 3 points",
attributes={"baseline": "BERT", "delta": "+3 points"}
),
lx.data.Extraction(
extraction_class="limitation",
extraction_text="Training requires 8 GPUs for 2 days",
attributes={"type": "compute_cost"}
),
]
)
]
```
Contract clause extraction
Extract obligations, dates, and parties from legal contracts:
```python
prompt = """\
Extract key clauses from legal contracts.
- party: Named entities (companies, individuals)
- obligation: What someone must do
- date: Deadlines and time periods
- penalty: Consequences for non-compliance"""
examples = [
lx.data.ExampleData(
text="Acme Corp shall deliver the software by December 31, 2026. Late delivery incurs a 5% penalty per week.",
extractions=[
lx.data.Extraction(
extraction_class="party",
extraction_text="Acme Corp",
attributes={"role": "obligor"}
),
lx.data.Extraction(
extraction_class="obligation",
extraction_text="shall deliver the software",
attributes={"type": "delivery"}
),
lx.data.Extraction(
extraction_class="date",
extraction_text="December 31, 2026",
attributes={"type": "deadline"}
),
lx.data.Extraction(
extraction_class="penalty",
extraction_text="5% penalty per week",
attributes={"trigger": "late delivery"}
),
]
)
]
```
Medical note structuring
Extract medications, dosages, and symptoms from clinical notes:
```python
prompt = """\
Extract medical information from clinical notes.
- medication: Drug names
- dosage: Amounts and frequencies
- symptom: Patient complaints or findings
- diagnosis: Identified conditions"""
examples = [
lx.data.ExampleData(
text="Patient reports persistent headache for 3 days. Prescribed ibuprofen 400mg twice daily. Suspected tension headache.",
extractions=[
lx.data.Extraction(
extraction_class="symptom",
extraction_text="persistent headache for 3 days",
attributes={"duration": "3 days"}
),
lx.data.Extraction(
extraction_class="medication",
extraction_text="ibuprofen",
attributes={"type": "NSAID"}
),
lx.data.Extraction(
extraction_class="dosage",
extraction_text="400mg twice daily",
attributes={"amount": "400mg", "frequency": "twice daily"}
),
lx.data.Extraction(
extraction_class="diagnosis",
extraction_text="tension headache",
attributes={"certainty": "suspected"}
),
]
)
]
```
Understanding source grounding
Source grounding is LangExtract's key differentiator. Every extraction includes character offsets pointing to the original text:
```python
for ext in result.extractions:
    print(f"Text: {ext.extraction_text}")
    print(f"Start: {ext.start_offset}, End: {ext.end_offset}")
    # Verify by slicing the original text
    original_slice = input_text[ext.start_offset:ext.end_offset]
    print(f"Verified: {original_slice == ext.extraction_text}")
```
Why this matters:
| Use Case | Why Source Grounding Helps |
|---|---|
| Compliance audits | Prove where every data point originated |
| Legal discovery | Link extracted clauses to exact document locations |
| Medical records | Trace diagnoses back to clinical notes |
| Research databases | Cite exact sources for extracted findings |
| Quality assurance | Quickly verify LLM extractions are accurate |
Without source grounding, you are trusting the LLM blindly. With it, you can verify every extraction in seconds.
Confidence and verification patterns
Business users always ask: "How do I know it's right?" Here are patterns to build trust in your extraction pipeline.
Add confidence scores to extractions
Request a confidence attribute in your schema:
```python
examples = [
lx.data.ExampleData(
text="Revenue increased 23% to $4.2 billion in Q3 2026.",
extractions=[
lx.data.Extraction(
extraction_class="metric",
extraction_text="$4.2 billion",
attributes={
"metric_type": "revenue",
"period": "Q3 2026",
"confidence": "high" # Model self-reports confidence
}
),
]
)
]
```
The model will learn to add confidence ratings. Use these to flag extractions for review.
Human-in-the-loop flagging
Route low-confidence extractions to human reviewers:
```python
def process_with_review(result, confidence_threshold=0.8):
    approved = []
    needs_review = []
    for ext in result.extractions:
        confidence = ext.attributes.get("confidence", "medium")
        confidence_score = {"high": 0.9, "medium": 0.7, "low": 0.5}.get(confidence, 0.5)
        if confidence_score >= confidence_threshold:
            approved.append(ext)
        else:
            needs_review.append({
                "extraction": ext,
                "source_text": input_text[ext.start_offset:ext.end_offset],
                "reason": f"Confidence: {confidence}"
            })
    return approved, needs_review

approved, flagged = process_with_review(result)
print(f"Auto-approved: {len(approved)}, Needs review: {len(flagged)}")
```
Verification checklist
For critical applications, implement these checks:
| Check | Implementation | When to Use |
|---|---|---|
| Source match | input_text[start:end] == extraction_text | Always |
| Attribute completeness | all(k in ext.attributes for k in required_keys) | Structured data |
| Value validation | Regex or Pydantic for dates, numbers, emails | Financial/medical |
| Duplicate detection | Compare against previous extractions | Incremental processing |
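Below is a sketch of those checks as plain functions, assuming offsets and attributes are exposed as in the snippets above (required_keys and the regex pattern are illustrative):
```python
import re

def check_source_match(ext, source_text):
    """Source match: the extraction must be a verbatim slice of the source."""
    return source_text[ext.start_offset:ext.end_offset] == ext.extraction_text

def check_attribute_completeness(ext, required_keys):
    """Attribute completeness: all required attribute keys are present."""
    return all(k in ext.attributes for k in required_keys)

def check_value_format(value, pattern=r"^\d+(\.\d+)?%$"):
    """Value validation: e.g. percentages like '95%' (pattern is illustrative)."""
    return bool(re.match(pattern, value))

def check_duplicates(extractions):
    """Duplicate detection: flag repeated (class, text) pairs."""
    seen, dupes = set(), []
    for ext in extractions:
        key = (ext.extraction_class, ext.extraction_text)
        if key in seen:
            dupes.append(ext)
        seen.add(key)
    return dupes
```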
When to use LangExtract
The sweet spot
LangExtract shines when you need structured data + audit trail from documents. The source grounding is the killer feature - without it, you're just using an LLM with extra steps.
Ideal use cases:
| Use Case | Why LangExtract Fits | Example |
|---|---|---|
| Compliance pipelines | Auditors ask 'where did this come from?' | Extracting financial figures from SEC filings |
| Legal document review | Every clause must trace to source | Contract analysis for M&A due diligence |
| Medical data extraction | Clinical decisions need verification | Structuring EHR notes for research |
| Building searchable databases | Need structured fields from many docs | Research paper database with queryable metrics |
| Bulk document processing | Same extraction across 100s of files | Processing insurance claims or invoices |
When it's overkill
Do not use LangExtract just because it exists. For many tasks, a simple LLM prompt is faster and cheaper.
Skip LangExtract when:
| Situation | Why It's Overkill | Better Alternative |
|---|---|---|
| Writing articles/summaries | You need synthesis, not extraction | Direct LLM conversation |
| One-off analysis | Setup time exceeds task time | ChatGPT or Claude chat |
| Highly structured data | JSON, XML, CSV already have structure | Traditional parsers |
| Real-time applications | LLM latency is 1-5 seconds per call | Pre-trained NER models |
| You don't need source proof | Source grounding is the main value | Plain LLM extraction |
Extraction vs RAG: Know the difference
AI enthusiasts often confuse these. They solve different problems:
| Aspect | LangExtract (Extraction) | RAG (Retrieval-Augmented Generation) |
|---|---|---|
| Goal | Structure documents into databases | Answer questions using documents |
| Output | Structured records with source offsets | Natural language responses |
| Query type | Extract all metrics from this paper | What was the accuracy on SQuAD? |
| Data flow | Documents → Structured data → Database | Question → Retrieve chunks → Generate answer |
| Best for | Building searchable datasets | Conversational Q&A over docs |
When to use extraction:
- "I need to populate a database with contract terms"
- "Build a dashboard from 500 research papers"
- "Create a structured feed from news articles"
When to use RAG:
- "Let users ask questions about our documentation"
- "Build a chatbot that answers from internal knowledge"
- "Help analysts explore reports conversationally"
Combining both: Extract structured data with LangExtract, store in a database, then use RAG to answer questions over that structured data. This gives you the best of both: queryable fields AND conversational access.
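A rough sketch of that combination, reusing the SQLite table built in the Downstream integration section below and calling the LLM directly through litellm (the prompt wording is illustrative):
```python
import sqlite3
import litellm

def answer_over_extractions(question, db_path="extractions.db"):
    """Hybrid pattern: retrieve structured rows first, then let an LLM phrase the answer."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT extraction_class, extraction_text, attributes FROM extractions"
    ).fetchall()
    context = "\n".join(str(row) for row in rows)
    response = litellm.completion(
        model="together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo",
        messages=[{"role": "user",
                   "content": f"Answer using only these extracted records:\n{context}\n\nQuestion: {question}"}],
    )
    return response.choices[0].message.content
```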
The decision framework
Ask yourself these questions:
| Question | Yes | No |
|---|---|---|
| Will someone ask 'where did this data come from?' | LangExtract (source grounding matters) | Plain LLM is simpler |
| Are you processing many similar documents? | LangExtract (define schema once, reuse) | One-off LLM prompt is faster |
| Do you need a structured database? | LangExtract (enforces consistent schema) | Unstructured summaries work fine |
| Is accuracy critical enough to verify every extraction? | LangExtract (click to verify each one) | Trust the LLM, spot-check occasionally |
Real example: Research paper analysis
Scenario A: Writing an article about one paper
- You read the paper, understand context, write narrative
- Need: synthesis, editorial judgment, visualization ideas
- Verdict: Skip LangExtract - use direct LLM conversation
Scenario B: Building a paper comparison database
- Extract (method, benchmark, accuracy, model_size) from 200 papers
- Query: "papers with >80% on MathVista using fewer than 10B parameters"
- Verdict: Use LangExtract - structured extraction at scale
Scenario C: Daily paper discovery pipeline
- Auto-extract key metrics from new arXiv papers
- Filter by thresholds, surface interesting ones
- Verdict: Use LangExtract - consistent schema across documents
Troubleshooting
| Problem | Likely Cause | Solution |
|---|---|---|
| FormatType serialization error | LiteLLM provider bug | Apply the fix from Step 3 |
| Empty extractions | Examples do not match input format | Improve example quality and coverage |
| Wrong extraction classes | Ambiguous prompt | Be more specific in prompt description |
| Missed extractions | Text is too long | Increase extraction_passes parameter |
| API rate limits | Too many parallel requests | Reduce max_workers parameter |
Checking extraction quality
If extractions seem wrong, check the prompt alignment:
```python
# Enable warnings for misaligned examples
import warnings
warnings.filterwarnings("default")

result = lx.extract(...)  # Warnings will show if examples have issues
```
Common alignment issues (a pre-flight check sketch follows this list):
- Extraction text not verbatim from example text
- Extractions not in order of appearance
- Overlapping extraction spans
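A sketch of that pre-flight check, assuming ExampleData exposes its text and extractions as attributes (as its constructor arguments suggest); the helper name is illustrative:
```python
def lint_example(example):
    """Flag the three common issues: non-verbatim spans, out-of-order spans, overlapping spans."""
    problems = []
    last_end = -1
    for ext in example.extractions:
        start = example.text.find(ext.extraction_text)
        if start == -1:
            problems.append(f"not verbatim in example text: {ext.extraction_text!r}")
            continue
        if start < last_end:
            problems.append(f"out of order or overlapping: {ext.extraction_text!r}")
        last_end = max(last_end, start + len(ext.extraction_text))
    return problems

for example in examples:
    for problem in lint_example(example):
        print("Example issue:", problem)
```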
Production-ready error handling
API calls fail. Add retry logic with exponential backoff:
```python
import time
from functools import wraps

def retry_with_backoff(max_retries=3, base_delay=1):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == max_retries - 1:
                        raise
                    delay = base_delay * (2 ** attempt)
                    print(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay}s...")
                    time.sleep(delay)
            return None
        return wrapper
    return decorator

@retry_with_backoff(max_retries=3)
def extract_with_retry(text, prompt, examples, model_id):
    return lx.extract(
        text_or_documents=text,
        prompt_description=prompt,
        examples=examples,
        model_id=model_id,
    )
```
Or use the tenacity library for more control:
```python
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=30),
    retry=retry_if_exception_type((TimeoutError, ConnectionError))
)
def robust_extract(text, prompt, examples, model_id):
    return lx.extract(text_or_documents=text, prompt_description=prompt,
                      examples=examples, model_id=model_id)
```
Tips for production use
Batch your documents
For large document sets, process in parallel:
```python
result = lx.extract(
    text_or_documents=documents,  # List of texts or URLs
    prompt_description=prompt,
    examples=examples,
    model_id="litellm/together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo",
    max_workers=10,        # Parallel processing
    extraction_passes=2,   # Multiple passes for better recall
)
```
Monitor costs
Track token usage to avoid surprise bills:
```python
# Estimate tokens before running
from langextract.tokenizer import count_tokens

token_count = count_tokens(your_text)
estimated_cost = token_count * 0.0001  # Adjust for your model pricing
```
Cache results
Save extractions to avoid re-processing:
```python
import os
import json
import hashlib

def get_cached_or_extract(text, cache_dir="./cache"):
    text_hash = hashlib.md5(text.encode()).hexdigest()
    cache_file = f"{cache_dir}/{text_hash}.json"
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            return json.load(f)  # cache hits return the saved dict, not a result object
    result = lx.extract(
        text_or_documents=text,
        prompt_description=prompt,
        examples=examples,
        model_id="litellm/together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo",
    )
    with open(cache_file, "w") as f:
        json.dump(result.to_dict(), f)
    return result
```
Validate with Pydantic
Wrap extractions in Pydantic models to catch type errors before they hit your database:
```python
from pydantic import BaseModel, field_validator
from datetime import datetime
from typing import Optional

class MetricExtraction(BaseModel):
    text: str
    benchmark: str
    value: float
    period: Optional[str] = None

    @field_validator('value', mode='before')
    @classmethod
    def parse_percentage(cls, v):
        if isinstance(v, str):
            return float(v.strip('%')) / 100
        return v

class ContractExtraction(BaseModel):
    party: str
    obligation: str
    deadline: datetime

    @field_validator('deadline', mode='before')
    @classmethod
    def parse_date(cls, v):
        if isinstance(v, str):
            # Handle common date formats
            for fmt in ['%Y-%m-%d', '%B %d, %Y', '%m/%d/%Y']:
                try:
                    return datetime.strptime(v, fmt)
                except ValueError:
                    continue
        return v

# Validate extractions
def validate_extraction(ext, model_class):
    try:
        return model_class(**ext.attributes, text=ext.extraction_text)
    except Exception as e:
        print(f"Validation failed: {e}")
        return None

validated = [validate_extraction(e, MetricExtraction)
             for e in result.extractions
             if e.extraction_class == "metric"]
```
Version your examples
As requirements change, track schema versions for backward compatibility:
```python
EXTRACTION_SCHEMA_VERSION = "1.2"

# Schema changelog:
# v1.2: Added "limitation" class, improved metric attributes
# v1.1: Added confidence scores to all extractions
# v1.0: Initial schema with method, metric, comparison

examples = [
    lx.data.ExampleData(...)
]

# Store version with extractions for migration support
def save_extractions(result, version=EXTRACTION_SCHEMA_VERSION):
    return {
        "schema_version": version,
        "extracted_at": datetime.now().isoformat(),
        "extractions": [e.to_dict() for e in result.extractions]
    }
```
When business needs shift (e.g., capturing a new "clause type"), bump the version and add migration logic for historical data.
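A sketch of what that migration logic might look like, assuming records were saved by save_extractions above (the migration rules themselves are illustrative):
```python
def migrate_record(record):
    """Upgrade a stored extraction record to the current schema version (illustrative rules)."""
    version = record.get("schema_version", "1.0")
    if version == "1.0":
        # v1.1 added confidence scores; default missing ones to "medium".
        for ext in record["extractions"]:
            ext.setdefault("attributes", {}).setdefault("confidence", "medium")
        version = "1.1"
    if version == "1.1":
        # v1.2 added the "limitation" class; nothing to backfill, just bump the version.
        version = "1.2"
    record["schema_version"] = version
    return record
```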
Downstream integration
Extraction isn't the end goal - using the data is. Here's how to get extractions into your data stack.
Load into Pandas for analysis:
```python
import pandas as pd

def extractions_to_dataframe(result):
    rows = []
    for ext in result.extractions:
        row = {
            "class": ext.extraction_class,
            "text": ext.extraction_text,
            "start": ext.start_offset,
            "end": ext.end_offset,
            **ext.attributes  # Flatten attributes into columns
        }
        rows.append(row)
    return pd.DataFrame(rows)

df = extractions_to_dataframe(result)
print(df[df["class"] == "metric"][["text", "benchmark", "value"]])
```
Insert into SQLite/PostgreSQL:
```python
import json
import sqlite3

def create_extractions_table(conn):
    conn.execute("""
        CREATE TABLE IF NOT EXISTS extractions (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            document_id TEXT,
            extraction_class TEXT,
            extraction_text TEXT,
            start_offset INTEGER,
            end_offset INTEGER,
            attributes JSON,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)

def insert_extractions(conn, doc_id, result):
    for ext in result.extractions:
        conn.execute("""
            INSERT INTO extractions (document_id, extraction_class, extraction_text,
                                     start_offset, end_offset, attributes)
            VALUES (?, ?, ?, ?, ?, ?)
        """, (doc_id, ext.extraction_class, ext.extraction_text,
              ext.start_offset, ext.end_offset, json.dumps(ext.attributes)))
    conn.commit()

# Usage
conn = sqlite3.connect("extractions.db")
create_extractions_table(conn)
insert_extractions(conn, "paper_001", result)

# Query: Find all metrics above 90% (assumes attributes.value was stored as a number)
df = pd.read_sql("""
    SELECT * FROM extractions
    WHERE extraction_class = 'metric'
      AND json_extract(attributes, '$.value') > 90
""", conn)
```
Export for visualization tools:
```python
# Export to JSON for dashboards
with open("extractions.json", "w") as f:
    json.dump([{
        "class": e.extraction_class,
        "text": e.extraction_text,
        **e.attributes
    } for e in result.extractions], f)

# Export to CSV for spreadsheets
df.to_csv("extractions.csv", index=False)
```
Next steps
Now that LangExtract is set up, start with a small document set to validate extraction quality before scaling up.
The combination of LLM flexibility and source grounding solves a real problem in data pipelines: getting structured data from messy documents while maintaining an audit trail. Unlike regex (brittle) or plain LLMs (no verification), LangExtract gives you the best of both worlds.
For teams processing documents at scale, this can replace hours of manual extraction work. The few-shot approach means you can adapt to new document types in minutes by adding examples, not retraining models.
Resources:
- LangExtract GitHub - Source code and documentation
- LiteLLM Providers - Full list of supported LLM providers
- Together AI Models - Available models and pricing