How to Extract Structured Data from Documents with LangExtract and Together AI

Learn how to use Google's LangExtract library to extract structured information from unstructured text. Includes a critical bug fix for Together AI integration and practical examples for research papers, contracts, and medical notes.

TL;DR
  1. The Problem. You have unstructured documents (research papers, contracts, medical notes) and need structured data. Plain LLMs hallucinate and provide no source verification.

  2. The Solution. Google's LangExtract library extracts structured information while mapping every extraction to its exact location in the source text. No fine-tuning required - just provide a few examples.

  3. The Setup. About 15 minutes: install LangExtract, configure your LLM provider, define your extraction schema with a few examples, and run your first extraction.

Try it now

Want to skip local setup? LangExtract works directly in Google Colab with Gemini (no bug fix needed). Create a new Colab notebook and run:

!pip install langextract
import langextract as lx
# Uses Gemini by default - just set GOOGLE_API_KEY in Colab secrets

For Together AI or other providers, follow the full guide below.

Why LangExtract

Every data pipeline has the same problem: valuable information is trapped in unstructured text. Research findings buried in papers. Contract terms scattered across paragraphs. Patient information in clinical notes.

Traditional approaches have tradeoffs:

| Approach | Pros | Cons |
|---|---|---|
| Regex patterns | Fast, deterministic | Breaks on natural language variation |
| NER models | Accurate for trained entities | Fixed entity types, needs training data |
| Plain LLM prompts | Flexible, understands context | No source verification, hallucination risk |

Figure: Traditional vs LangExtract pipeline - source grounding is the key difference

LangExtract fills the gap. It uses LLMs for flexible extraction but adds source grounding - every extracted piece of information is mapped to its exact location in the original text. When someone asks "where did this number come from?", you can point to the exact sentence.

Key features:

  • Source grounding: Click on any extraction to see the original text
  • Schema enforcement: Define your output structure with few-shot examples
  • Long document support: Automatic chunking and parallel processing
  • Provider flexibility: Works with Gemini, OpenAI, Together AI, local models via Ollama
  • Interactive visualization: Generate HTML reports to review extractions

This guide walks through setting up LangExtract with Together AI's Llama models, including a bug fix we discovered during testing.

The business case

Executives need numbers. Here's how LangExtract compares on cost:

| Approach | Cost per 100 Docs | Accuracy | Rework Rate | True Cost |
|---|---|---|---|---|
| Manual data entry | $500-1000 (labor) | 95-99% | 2-5% | $500-1050 |
| Plain LLM extraction | $2-5 (tokens) | 80-90% | 15-25% | $30-80 (rework) |
| LangExtract + verification | $3-8 (tokens) | 90-95% | 5-10% | $8-20 (rework) |

The math: Plain LLM extraction is cheap until you factor in hallucination rework. A 20% error rate on 100 legal documents means 20 manual reviews at $15-30 each. LangExtract's source grounding cuts verification time from minutes to seconds - you click the extraction and see the original text immediately.

Break-even point: LangExtract pays for itself after ~50 documents where accuracy matters. For one-off tasks, plain LLM prompts are faster. For production pipelines, the audit trail alone justifies the setup time.

What you will need

Before starting, make sure you have:

  • Python 3.10 or higher installed
  • A Together AI account with API key (or Google/OpenAI API key)
  • About 15 minutes for setup

Why Together AI?

Together AI provides access to open-source models like Llama 3.3 70B at competitive prices. You get strong extraction quality without vendor lock-in. The same code works with Gemini, OpenAI, or local models by changing one line.

Step 1: Install LangExtract

Create a project directory and set up a virtual environment. This keeps LangExtract's dependencies isolated from your other Python projects.

Create your project folder

mkdir langextract-project
cd langextract-project

Create a virtual environment

python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Install LangExtract

pip install langextract

This installs the core library with support for Google's Gemini models by default.
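
To confirm the install worked, run a quick import check (this only verifies the package is importable; it makes no API calls):

python -c "import langextract as lx; print('LangExtract imported:', lx.__name__)"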

Step 2: Install the LiteLLM provider

To use Together AI (or OpenAI, Anthropic, and 100+ other providers), install the community LiteLLM provider:

pip install langextract-litellm

What is LiteLLM?

LiteLLM is a unified interface for calling different LLM providers. Instead of learning each provider's API, you use one consistent format. The langextract-litellm package bridges LangExtract to this ecosystem, giving you access to Together AI, OpenAI, Anthropic, Azure, and many more.
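
You select a LiteLLM-backed model in LangExtract by prefixing the model string with litellm/. The two model IDs used later in this guide illustrate the pattern (other providers follow the same litellm/<provider>/<model> shape; check LiteLLM's docs for the exact names):

# Together AI (cloud, used in the examples below)
model_id = "litellm/together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo"
 
# Ollama (local, see the alternative setup later in this guide)
model_id = "litellm/ollama/llama3.1:8b"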

Step 3: Fix the Together AI bug

Here is something the documentation does not tell you: the LiteLLM provider has a bug when used with Together AI. It passes internal LangExtract parameters that Together AI's API cannot serialize.

The error you will see:

InternalServerError: Together_aiException - Object of type FormatType is not JSON serializable

The fix:

Open the provider file and filter out non-standard parameters. Find the file at (adjust python3.12 to match your Python version):

.venv/lib/python3.12/site-packages/langextract_litellm/provider.py

Locate this code block (around line 89):

response = litellm.completion(
    model=self.model_id,
    messages=messages,
    **self.provider_kwargs,
)

Replace it with:

# Keep only standard completion kwargs; drop LangExtract-internal objects (like FormatType) that cannot be JSON-serialized
filtered_kwargs = {
    k: v for k, v in self.provider_kwargs.items()
    if k in ('temperature', 'max_tokens', 'top_p', 'frequency_penalty',
             'presence_penalty', 'timeout', 'api_base', 'api_key',
             'stop', 'n', 'stream', 'response_format')
}
 
response = litellm.completion(
    model=self.model_id,
    messages=messages,
    **filtered_kwargs,
)

This filters the kwargs to only include parameters that LiteLLM and Together AI understand, preventing the serialization error.

Why does this bug exist?

LangExtract passes internal configuration objects (like FormatType enums) through to the model provider. The Gemini provider handles these correctly, but the LiteLLM provider passes everything through to the underlying API. Together AI's API tries to JSON-serialize all parameters for the request body and fails on Python enum objects.

Step 4: Set up your API key

Get your Together AI API key from api.together.xyz and set it as an environment variable:

export TOGETHER_AI_API_KEY="your-api-key-here"

On Windows PowerShell:

$env:TOGETHER_AI_API_KEY = "your-api-key-here"

To make this permanent, add the export line to your ~/.bashrc or ~/.zshrc file.
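
Before running any extractions, a quick check that Python can see the key (prints whether the variable is set, never the key itself):

import os
 
if os.environ.get("TOGETHER_AI_API_KEY"):
    print("TOGETHER_AI_API_KEY is set")
else:
    print("TOGETHER_AI_API_KEY is missing - set it before running extractions")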

Alternative: Local execution with Ollama

For regulated industries (healthcare, legal, finance) where data cannot leave your network, run extraction locally with Ollama:

Install Ollama and pull a model:

# Install Ollama (macOS)
brew install ollama
 
# Pull Llama 3.1 8B (runs on most laptops)
ollama pull llama3.1:8b
 
# Or pull a larger model if you have GPU
ollama pull llama3.1:70b

Change one line in your extraction code:

# Instead of Together AI:
# model_id="litellm/together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo"
 
# Use local Ollama:
model_id="litellm/ollama/llama3.1:8b"

No API keys needed. No data leaves your machine. Same extraction code, different model ID.

Performance note

Local models are slower (10-30 seconds per extraction vs 1-3 seconds with cloud APIs) and smaller models may have lower accuracy. Test extraction quality on your specific documents before committing to local-only deployment. Many teams use a hybrid: local for sensitive documents, cloud for bulk processing.
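
A minimal sketch of that hybrid routing, assuming you tag each document with a sensitive flag yourself (the helper and flag are illustrative, not part of LangExtract):

CLOUD_MODEL = "litellm/together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo"
LOCAL_MODEL = "litellm/ollama/llama3.1:8b"
 
def pick_model_id(sensitive: bool) -> str:
    # Sensitive documents stay on the local Ollama model;
    # everything else goes to the faster cloud model
    return LOCAL_MODEL if sensitive else CLOUD_MODEL
 
result = lx.extract(
    text_or_documents=doc_text,   # your document text
    prompt_description=prompt,    # defined in Step 5
    examples=examples,            # defined in Step 5
    model_id=pick_model_id(sensitive=True),
)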

Step 5: Define your extraction schema

LangExtract learns what to extract from examples you provide. No fine-tuning, no training data preparation - just show it what you want.

Create a new Python file called extract.py:

import langextract as lx
import textwrap
 
# 1. Define what you want to extract
prompt = textwrap.dedent("""\
    Extract key information from this text.
    Identify methods, metrics, comparisons, and applications.
    Use exact text from the source. Do not paraphrase.
    Extract in order of appearance.""")
 
# 2. Provide an example to guide the model
examples = [
    lx.data.ExampleData(
        text="We propose FastNet, achieving 95% accuracy on ImageNet, 2x faster than ResNet. Applications include medical imaging.",
        extractions=[
            lx.data.Extraction(
                extraction_class="method",
                extraction_text="FastNet",
                attributes={"type": "model"}
            ),
            lx.data.Extraction(
                extraction_class="metric",
                extraction_text="95% accuracy on ImageNet",
                attributes={"benchmark": "ImageNet", "value": "95%"}
            ),
            lx.data.Extraction(
                extraction_class="comparison",
                extraction_text="2x faster than ResNet",
                attributes={"baseline": "ResNet", "improvement": "2x"}
            ),
            lx.data.Extraction(
                extraction_class="application",
                extraction_text="medical imaging",
                attributes={"domain": "healthcare"}
            ),
        ]
    )
]

The example teaches LangExtract:

  • What classes to extract: method, metric, comparison, application
  • What attributes to capture: type, benchmark, value, baseline, domain
  • How to format extractions: exact text spans, structured attributes

Few-shot learning

This approach is called few-shot learning. Instead of training a model on thousands of examples, you provide a handful of high-quality examples that demonstrate exactly what you want. The LLM generalizes from these examples to handle new, unseen text.
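
If the model keeps missing a particular pattern, the usual fix is to append another example that covers it rather than rewriting the prompt. For instance (an illustrative extra example, not part of the schema above):

examples.append(
    lx.data.ExampleData(
        text="The dataset contains 1.2M labeled images collected from public repositories.",
        extractions=[
            lx.data.Extraction(
                extraction_class="metric",
                extraction_text="1.2M labeled images",
                attributes={"benchmark": "dataset size", "value": "1.2M"},
            ),
        ],
    )
)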

Step 6: Run your first extraction

Add this code to your extract.py file:

# 3. Your input document
input_text = """
We introduce MAXS (Meta-Adaptive eXploration Strategy), a framework
that enables LLM agents to look ahead before committing to actions.
On five reasoning benchmarks, MAXS achieves 63.46% average accuracy
compared to 52.93% for Chain-of-Thought, while using 100x fewer
tokens than Monte Carlo Tree Search. The framework supports tool use
including code execution and web search.
"""
 
# 4. Run the extraction
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="litellm/together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo",
)
 
# 5. Print results
print(f"Found {len(result.extractions)} extractions:\n")
for ext in result.extractions:
    print(f"[{ext.extraction_class}] {ext.extraction_text}")
    if ext.attributes:
        for key, value in ext.attributes.items():
            print(f"    {key}: {value}")
    print()

Run it:

python extract.py

Expected output:

Found 6 extractions:

[method] MAXS (Meta-Adaptive eXploration Strategy)
    type: framework

[metric] 63.46% average accuracy
    benchmark: five reasoning benchmarks
    value: 63.46%

[comparison] compared to 52.93% for Chain-of-Thought
    baseline: Chain-of-Thought
    improvement: +10.5 points

[comparison] using 100x fewer tokens than Monte Carlo Tree Search
    baseline: Monte Carlo Tree Search
    improvement: 100x fewer tokens

[application] code execution
    domain: tool use

[application] web search
    domain: tool use

Step 7: Visualize results

LangExtract can generate interactive HTML visualizations showing extractions highlighted in their original context:

# Save results to JSONL
lx.io.save_annotated_documents([result],
    output_name="extractions.jsonl",
    output_dir=".")
 
# Generate visualization
html_content = lx.visualize("extractions.jsonl")
with open("visualization.html", "w") as f:
    f.write(html_content if isinstance(html_content, str) else html_content.data)
 
print("Open visualization.html in your browser")

The visualization shows:

  • Original text with highlighted extractions
  • Color-coded by extraction class
  • Click any extraction to see its attributes
  • Navigate between multiple documents

Practical examples

Research paper mining

Extract methods, results, and comparisons from paper abstracts to build a structured research database:

prompt = """\
    Extract research findings from AI paper abstracts.
    - methods: The proposed technique or model name
    - metrics: Quantitative results with numbers and benchmarks
    - comparisons: How it compares to baselines
    - limitations: Any mentioned drawbacks or constraints"""
 
examples = [
    lx.data.ExampleData(
        text="Our model achieves 92% F1 on SQuAD, outperforming BERT by 3 points. Training requires 8 GPUs for 2 days.",
        extractions=[
            lx.data.Extraction(
                extraction_class="metric",
                extraction_text="92% F1 on SQuAD",
                attributes={"benchmark": "SQuAD", "metric_type": "F1", "value": "92%"}
            ),
            lx.data.Extraction(
                extraction_class="comparison",
                extraction_text="outperforming BERT by 3 points",
                attributes={"baseline": "BERT", "delta": "+3 points"}
            ),
            lx.data.Extraction(
                extraction_class="limitation",
                extraction_text="Training requires 8 GPUs for 2 days",
                attributes={"type": "compute_cost"}
            ),
        ]
    )
]

Contract clause extraction

Extract obligations, dates, and parties from legal contracts:

prompt = """\
    Extract key clauses from legal contracts.
    - party: Named entities (companies, individuals)
    - obligation: What someone must do
    - date: Deadlines and time periods
    - penalty: Consequences for non-compliance"""
 
examples = [
    lx.data.ExampleData(
        text="Acme Corp shall deliver the software by December 31, 2026. Late delivery incurs a 5% penalty per week.",
        extractions=[
            lx.data.Extraction(
                extraction_class="party",
                extraction_text="Acme Corp",
                attributes={"role": "obligor"}
            ),
            lx.data.Extraction(
                extraction_class="obligation",
                extraction_text="shall deliver the software",
                attributes={"type": "delivery"}
            ),
            lx.data.Extraction(
                extraction_class="date",
                extraction_text="December 31, 2026",
                attributes={"type": "deadline"}
            ),
            lx.data.Extraction(
                extraction_class="penalty",
                extraction_text="5% penalty per week",
                attributes={"trigger": "late delivery"}
            ),
        ]
    )
]

Medical note structuring

Extract medications, dosages, and symptoms from clinical notes:

prompt = """\
    Extract medical information from clinical notes.
    - medication: Drug names
    - dosage: Amounts and frequencies
    - symptom: Patient complaints or findings
    - diagnosis: Identified conditions"""
 
examples = [
    lx.data.ExampleData(
        text="Patient reports persistent headache for 3 days. Prescribed ibuprofen 400mg twice daily. Suspected tension headache.",
        extractions=[
            lx.data.Extraction(
                extraction_class="symptom",
                extraction_text="persistent headache for 3 days",
                attributes={"duration": "3 days"}
            ),
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="ibuprofen",
                attributes={"type": "NSAID"}
            ),
            lx.data.Extraction(
                extraction_class="dosage",
                extraction_text="400mg twice daily",
                attributes={"amount": "400mg", "frequency": "twice daily"}
            ),
            lx.data.Extraction(
                extraction_class="diagnosis",
                extraction_text="tension headache",
                attributes={"certainty": "suspected"}
            ),
        ]
    )
]

Understanding source grounding

Source grounding is LangExtract's key differentiator. Every extraction includes character offsets pointing to the original text:

for ext in result.extractions:
    print(f"Text: {ext.extraction_text}")
    print(f"Start: {ext.start_offset}, End: {ext.end_offset}")
 
    # Verify by slicing original
    original_slice = input_text[ext.start_offset:ext.end_offset]
    print(f"Verified: {original_slice == ext.extraction_text}")

Why this matters:

| Use Case | Why Source Grounding Helps |
|---|---|
| Compliance audits | Prove where every data point originated |
| Legal discovery | Link extracted clauses to exact document locations |
| Medical records | Trace diagnoses back to clinical notes |
| Research databases | Cite exact sources for extracted findings |
| Quality assurance | Quickly verify LLM extractions are accurate |

Without source grounding, you are trusting the LLM blindly. With it, you can verify every extraction in seconds.

Confidence and verification patterns

Business users always ask: "How do I know it's right?" Here are patterns to build trust in your extraction pipeline.

Add confidence scores to extractions

Request a confidence attribute in your schema:

examples = [
    lx.data.ExampleData(
        text="Revenue increased 23% to $4.2 billion in Q3 2026.",
        extractions=[
            lx.data.Extraction(
                extraction_class="metric",
                extraction_text="$4.2 billion",
                attributes={
                    "metric_type": "revenue",
                    "period": "Q3 2026",
                    "confidence": "high"  # Model self-reports confidence
                }
            ),
        ]
    )
]

The model will learn to add confidence ratings. Use these to flag extractions for review.

Human-in-the-loop flagging

Route low-confidence extractions to human reviewers:

def process_with_review(result, confidence_threshold=0.8):
    approved = []
    needs_review = []
 
    for ext in result.extractions:
        confidence = ext.attributes.get("confidence", "medium")
        confidence_score = {"high": 0.9, "medium": 0.7, "low": 0.5}.get(confidence, 0.5)
 
        if confidence_score >= confidence_threshold:
            approved.append(ext)
        else:
            needs_review.append({
                "extraction": ext,
                "source_text": input_text[ext.start_offset:ext.end_offset],
                "reason": f"Confidence: {confidence}"
            })
 
    return approved, needs_review
 
approved, flagged = process_with_review(result)
print(f"Auto-approved: {len(approved)}, Needs review: {len(flagged)}")

Verification checklist

For critical applications, implement these checks:

| Check | Implementation | When to Use |
|---|---|---|
| Source match | input_text[start:end] == extraction_text | Always |
| Attribute completeness | all(k in ext.attributes for k in required_keys) | Structured data |
| Value validation | Regex or Pydantic for dates, numbers, emails | Financial/medical |
| Duplicate detection | Compare against previous extractions | Incremental processing |
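
A sketch of the first two checks, using the start_offset/end_offset fields shown earlier in this guide (adapt the required keys to your own schema):

def verify_extraction(ext, source_text, required_keys=()):
    # Source match: the extracted span must appear verbatim at the recorded offsets
    source_match = source_text[ext.start_offset:ext.end_offset] == ext.extraction_text
    # Attribute completeness: every required attribute key must be present
    complete = all(k in (ext.attributes or {}) for k in required_keys)
    return source_match and complete
 
failed = [
    ext for ext in result.extractions
    if ext.extraction_class == "metric"
    and not verify_extraction(ext, input_text, required_keys=("benchmark", "value"))
]
print(f"{len(failed)} metric extractions failed verification")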

When to use LangExtract

The sweet spot

LangExtract shines when you need structured data + audit trail from documents. The source grounding is the killer feature - without it, you're just using an LLM with extra steps.

Ideal use cases:

| Use Case | Why LangExtract Fits | Example |
|---|---|---|
| Compliance pipelines | Auditors ask 'where did this come from?' | Extracting financial figures from SEC filings |
| Legal document review | Every clause must trace to source | Contract analysis for M&A due diligence |
| Medical data extraction | Clinical decisions need verification | Structuring EHR notes for research |
| Building searchable databases | Need structured fields from many docs | Research paper database with queryable metrics |
| Bulk document processing | Same extraction across 100s of files | Processing insurance claims or invoices |

When it's overkill

Do not use LangExtract just because it exists. For many tasks, a simple LLM prompt is faster and cheaper.

Skip LangExtract when:

| Situation | Why It's Overkill | Better Alternative |
|---|---|---|
| Writing articles/summaries | You need synthesis, not extraction | Direct LLM conversation |
| One-off analysis | Setup time exceeds task time | ChatGPT or Claude chat |
| Highly structured data | JSON, XML, CSV already have structure | Traditional parsers |
| Real-time applications | LLM latency is 1-5 seconds per call | Pre-trained NER models |
| You don't need source proof | Source grounding is the main value | Plain LLM extraction |

Extraction vs RAG: Know the difference

AI enthusiasts often confuse these. They solve different problems:

| Aspect | LangExtract (Extraction) | RAG (Retrieval-Augmented Generation) |
|---|---|---|
| Goal | Structure documents into databases | Answer questions using documents |
| Output | Structured records with source offsets | Natural language responses |
| Query type | "Extract all metrics from this paper" | "What was the accuracy on SQuAD?" |
| Data flow | Documents → Structured data → Database | Question → Retrieve chunks → Generate answer |
| Best for | Building searchable datasets | Conversational Q&A over docs |

When to use extraction:

  • "I need to populate a database with contract terms"
  • "Build a dashboard from 500 research papers"
  • "Create a structured feed from news articles"

When to use RAG:

  • "Let users ask questions about our documentation"
  • "Build a chatbot that answers from internal knowledge"
  • "Help analysts explore reports conversationally"

Combining both: Extract structured data with LangExtract, store in a database, then use RAG to answer questions over that structured data. This gives you the best of both: queryable fields AND conversational access.

The decision framework

Ask yourself these questions:

| Question | Yes | No |
|---|---|---|
| Will someone ask 'where did this data come from?' | LangExtract (source grounding matters) | Plain LLM is simpler |
| Are you processing many similar documents? | LangExtract (define schema once, reuse) | One-off LLM prompt is faster |
| Do you need a structured database? | LangExtract (enforces consistent schema) | Unstructured summaries work fine |
| Is accuracy critical enough to verify every extraction? | LangExtract (click to verify each one) | Trust the LLM, spot-check occasionally |

Real example: Research paper analysis

Scenario A: Writing an article about one paper

  • You read the paper, understand context, write narrative
  • Need: synthesis, editorial judgment, visualization ideas
  • Verdict: Skip LangExtract - use direct LLM conversation

Scenario B: Building a paper comparison database

  • Extract (method, benchmark, accuracy, model_size) from 200 papers
  • Query: "papers with >80% on MathVista using fewer than 10B parameters"
  • Verdict: Use LangExtract - structured extraction at scale

Scenario C: Daily paper discovery pipeline

  • Auto-extract key metrics from new arXiv papers
  • Filter by thresholds, surface interesting ones
  • Verdict: Use LangExtract - consistent schema across documents

Troubleshooting

| Problem | Likely Cause | Solution |
|---|---|---|
| FormatType serialization error | LiteLLM provider bug | Apply the fix from Step 3 |
| Empty extractions | Examples do not match input format | Improve example quality and coverage |
| Wrong extraction classes | Ambiguous prompt | Be more specific in prompt description |
| Missed extractions | Text is too long | Increase the extraction_passes parameter |
| API rate limits | Too many parallel requests | Reduce the max_workers parameter |

Checking extraction quality

If extractions seem wrong, check the prompt alignment:

# Enable warnings for misaligned examples
import warnings
warnings.filterwarnings("default")
 
result = lx.extract(...)  # Warnings will show if examples have issues

Common alignment issues:

  • Extraction text not verbatim from example text
  • Extractions not in order of appearance
  • Overlapping extraction spans
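
A small pre-flight check can catch the first two issues before you call lx.extract. This sketch only inspects the ExampleData objects you defined; it makes no API calls:

def check_examples(examples):
    problems = []
    for ex in examples:
        last_pos = 0
        for ext in ex.extractions:
            pos = ex.text.find(ext.extraction_text)
            if pos == -1:
                # Extraction text is not verbatim in the example text
                problems.append(f"Not verbatim: {ext.extraction_text!r}")
            elif pos < last_pos:
                # Extractions are not in order of appearance
                problems.append(f"Out of order: {ext.extraction_text!r}")
            else:
                last_pos = pos
    return problems
 
for problem in check_examples(examples):
    print("Example issue:", problem)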

Production-ready error handling

API calls fail. Add retry logic with exponential backoff:

import time
from functools import wraps
 
def retry_with_backoff(max_retries=3, base_delay=1):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == max_retries - 1:
                        raise
                    delay = base_delay * (2 ** attempt)
                    print(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay}s...")
                    time.sleep(delay)
            return None
        return wrapper
    return decorator
 
@retry_with_backoff(max_retries=3)
def extract_with_retry(text, prompt, examples, model_id):
    return lx.extract(
        text_or_documents=text,
        prompt_description=prompt,
        examples=examples,
        model_id=model_id,
    )

Or use the tenacity library for more control:

from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
 
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=30),
    retry=retry_if_exception_type((TimeoutError, ConnectionError))
)
def robust_extract(text, prompt, examples, model_id):
    return lx.extract(text_or_documents=text, prompt_description=prompt,
                      examples=examples, model_id=model_id)

Tips for production use

Batch your documents

For large document sets, process in parallel:

result = lx.extract(
    text_or_documents=documents,  # List of texts or URLs
    prompt_description=prompt,
    examples=examples,
    model_id="litellm/together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo",
    max_workers=10,  # Parallel processing
    extraction_passes=2,  # Multiple passes for better recall
)

Monitor costs

Track token usage to avoid surprise bills:

# Estimate tokens before running
from langextract.tokenizer import count_tokens
token_count = count_tokens(your_text)
estimated_cost = token_count * 0.0001  # Adjust for your model pricing

Cache results

Save extractions to avoid re-processing:

import json
import hashlib
import os
 
def get_cached_or_extract(text, cache_dir="./cache"):
    os.makedirs(cache_dir, exist_ok=True)  # Make sure the cache directory exists
    text_hash = hashlib.md5(text.encode()).hexdigest()
    cache_file = f"{cache_dir}/{text_hash}.json"
 
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            return json.load(f)
 
    result = lx.extract(text_or_documents=text, ...)
 
    with open(cache_file, "w") as f:
        json.dump(result.to_dict(), f)
 
    return result

Validate with Pydantic

Wrap extractions in Pydantic models to catch type errors before they hit your database:

from pydantic import BaseModel, field_validator
from datetime import datetime
from typing import Optional
 
class MetricExtraction(BaseModel):
    text: str
    benchmark: str
    value: float
    period: Optional[str] = None
 
    @field_validator('value', mode='before')
    @classmethod
    def parse_percentage(cls, v):
        if isinstance(v, str):
            return float(v.strip('%')) / 100
        return v
 
class ContractExtraction(BaseModel):
    party: str
    obligation: str
    deadline: datetime
 
    @field_validator('deadline', mode='before')
    @classmethod
    def parse_date(cls, v):
        if isinstance(v, str):
            # Handle common date formats
            for fmt in ['%Y-%m-%d', '%B %d, %Y', '%m/%d/%Y']:
                try:
                    return datetime.strptime(v, fmt)
                except ValueError:
                    continue
        return v
 
# Validate extractions
def validate_extraction(ext, model_class):
    try:
        return model_class(**ext.attributes, text=ext.extraction_text)
    except Exception as e:
        print(f"Validation failed: {e}")
        return None
 
validated = [validate_extraction(e, MetricExtraction)
             for e in result.extractions
             if e.extraction_class == "metric"]

Version your examples

As requirements change, track schema versions for backward compatibility:

from datetime import datetime
 
EXTRACTION_SCHEMA_VERSION = "1.2"
 
# Schema changelog:
# v1.2: Added "limitation" class, improved metric attributes
# v1.1: Added confidence scores to all extractions
# v1.0: Initial schema with method, metric, comparison
 
examples = [
    lx.data.ExampleData(...)
]
 
# Store version with extractions for migration support
def save_extractions(result, version=EXTRACTION_SCHEMA_VERSION):
    return {
        "schema_version": version,
        "extracted_at": datetime.now().isoformat(),
        "extractions": [e.to_dict() for e in result.extractions]
    }

When business needs shift (e.g., capturing a new "clause type"), bump the version and add migration logic for historical data.
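
A minimal migration sketch for the record format produced by save_extractions above (the backfill rules are hypothetical examples tied to the changelog comments):

def migrate_record(record):
    # Upgrade stored records one schema version at a time
    if record["schema_version"] == "1.0":
        # v1.1 added confidence scores; backfill a neutral default for old records
        for ext in record["extractions"]:
            ext.setdefault("attributes", {}).setdefault("confidence", "unknown")
        record["schema_version"] = "1.1"
    if record["schema_version"] == "1.1":
        # v1.2 added the "limitation" class; nothing to backfill, just re-stamp
        record["schema_version"] = "1.2"
    return record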

Downstream integration

Extraction isn't the end goal - using the data is. Here's how to get extractions into your data stack.

Load into Pandas for analysis:

import pandas as pd
 
def extractions_to_dataframe(result):
    rows = []
    for ext in result.extractions:
        row = {
            "class": ext.extraction_class,
            "text": ext.extraction_text,
            "start": ext.start_offset,
            "end": ext.end_offset,
            **ext.attributes  # Flatten attributes into columns
        }
        rows.append(row)
    return pd.DataFrame(rows)
 
df = extractions_to_dataframe(result)
print(df[df["class"] == "metric"][["text", "benchmark", "value"]])

Insert into SQLite/PostgreSQL:

import sqlite3
import json
 
def create_extractions_table(conn):
    conn.execute("""
        CREATE TABLE IF NOT EXISTS extractions (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            document_id TEXT,
            extraction_class TEXT,
            extraction_text TEXT,
            start_offset INTEGER,
            end_offset INTEGER,
            attributes JSON,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
 
def insert_extractions(conn, doc_id, result):
    for ext in result.extractions:
        conn.execute("""
            INSERT INTO extractions (document_id, extraction_class, extraction_text,
                                     start_offset, end_offset, attributes)
            VALUES (?, ?, ?, ?, ?, ?)
        """, (doc_id, ext.extraction_class, ext.extraction_text,
              ext.start_offset, ext.end_offset, json.dumps(ext.attributes)))
    conn.commit()
 
# Usage
conn = sqlite3.connect("extractions.db")
create_extractions_table(conn)
insert_extractions(conn, "paper_001", result)
 
# Query: Find all metrics above 90%
df = pd.read_sql("""
    SELECT * FROM extractions
    WHERE extraction_class = 'metric'
    AND json_extract(attributes, '$.value') > 90
""", conn)

Export for visualization tools:

# Export to JSON for dashboards
with open("extractions.json", "w") as f:
    json.dump([{
        "class": e.extraction_class,
        "text": e.extraction_text,
        **e.attributes
    } for e in result.extractions], f)
 
# Export to CSV for spreadsheets
df.to_csv("extractions.csv", index=False)

Next steps

Now that LangExtract is set up, start with a small document set to validate extraction quality before scaling up.

The combination of LLM flexibility and source grounding solves a real problem in data pipelines: getting structured data from messy documents while maintaining an audit trail. Unlike regex (brittle) or plain LLMs (no verification), LangExtract gives you the best of both worlds.

For teams processing documents at scale, this can replace hours of manual extraction work. The few-shot approach means you can adapt to new document types in minutes by adding examples, not retraining models.

Resources: