- The Problem. You have unstructured documents (research papers, contracts, medical notes) and need structured data. Plain LLMs hallucinate and provide no source verification.
- The Solution. Google's LangExtract library extracts structured information while mapping every extraction to its exact location in the source text. No fine-tuning required - just provide a few examples.
- The Setup. About 15 minutes: install LangExtract, configure your LLM provider, define your extraction schema with a few examples, and run your first extraction.
Want to skip local setup? LangExtract works directly in Google Colab with Gemini (no bug fix needed). Create a new Colab notebook and run:
```python
!pip install langextract
import langextract as lx
# Uses Gemini by default - just set GOOGLE_API_KEY in Colab secrets
```
For Together AI or other providers, follow the full guide below.
Why LangExtract
Every data pipeline has the same problem: valuable information is trapped in unstructured text. Research findings buried in papers. Contract terms scattered across paragraphs. Patient information in clinical notes.
Traditional approaches have tradeoffs:
| Approach | Pros | Cons |
|---|---|---|
| Regex patterns | Fast, deterministic | Breaks on natural language variation |
| NER models | Accurate for trained entities | Fixed entity types, needs training data |
| Plain LLM prompts | Flexible, understands context | No source verification, hallucination risk |
Traditional vs LangExtract pipeline (figure): source grounding is the key difference.
LangExtract fills the gap. It uses LLMs for flexible extraction but adds source grounding - every extracted piece of information is mapped to its exact location in the original text. When someone asks "where did this number come from?", you can point to the exact sentence.
Key features:
- Source grounding: Click on any extraction to see the original text
- Schema enforcement: Define your output structure with few-shot examples
- Long document support: Automatic chunking and parallel processing
- Provider flexibility: Works with Gemini, OpenAI, Together AI, and local models via Ollama
- Interactive visualization: Generate HTML reports to review extractions
This guide walks through setting up LangExtract with Together AI's Llama models, including a bug fix we discovered during testing.
The business case
Executives need numbers. Here's how LangExtract compares on cost:
| Approach | Cost per 100 Docs | Accuracy | Rework Rate | True Cost |
|---|---|---|---|---|
| Manual data entry | $500-1000 (labor) | 95-99% | 2-5% | $500-1050 |
| Plain LLM extraction | $2-5 (tokens) | 80-90% | 15-25% | $30-80 (rework) |
| LangExtract + verification | $3-8 (tokens) | 90-95% | 5-10% | $8-20 (rework) |
The math: Plain LLM extraction is cheap until you factor in hallucination rework. A 20% error rate on 100 legal documents means 20 manual reviews at $15-30 each. LangExtract's source grounding cuts verification time from minutes to seconds - you click the extraction and see the original text immediately.
Break-even point: LangExtract pays for itself after ~50 documents where accuracy matters. For one-off tasks, plain LLM prompts are faster. For production pipelines, the audit trail alone justifies the setup time.
What you will need
Before starting, make sure you have:
- Python 3.10 or higher installed
- A Together AI account with API key (or Google/OpenAI API key)
- About 15 minutes for setup
Together AI provides access to open-source models like Llama 3.3 70B at competitive prices. You get strong extraction quality without vendor lock-in. The same code works with Gemini, OpenAI, or local models by changing one line.
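As a concrete illustration, only the model_id string passed to lx.extract() changes between providers - a minimal sketch (the Together AI and Ollama IDs are the ones used later in this guide; the Gemini and OpenAI IDs are assumptions):
```python
# Candidate model_id values for lx.extract() - swap providers by changing this one string.
MODEL_IDS = {
    "together": "litellm/together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo",
    "ollama":   "litellm/ollama/llama3.1:8b",    # local, no API key required
    "gemini":   "gemini-2.5-flash",              # assumed Gemini model name
    "openai":   "litellm/openai/gpt-4o-mini",    # assumed OpenAI model name
}
```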
Step 1: Install LangExtract
Create a project directory and set up a virtual environment. This keeps LangExtract's dependencies isolated from your other Python projects.
Create your project folder:
```bash
mkdir langextract-project
cd langextract-project
```
Create a virtual environment:
```bash
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```
Install LangExtract:
```bash
pip install langextract
```
This installs the core library with support for Google's Gemini models by default.
Step 2: Install the LiteLLM provider
To use Together AI (or OpenAI, Anthropic, and 100+ other providers), install the community LiteLLM provider:
```bash
pip install langextract-litellm
```
LiteLLM is a unified interface for calling different LLM providers. Instead of learning each provider's API, you use one consistent format. The langextract-litellm package bridges LangExtract to this ecosystem, giving you access to Together AI, OpenAI, Anthropic, Azure, and many more.
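To see what that unified interface looks like on its own (independent of LangExtract), the same litellm.completion call shape works for any provider; only the model string changes. A minimal sketch:
```python
import litellm

# One call shape for every provider; the "together_ai/" prefix routes the request to Together AI.
response = litellm.completion(
    model="together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Reply with one word: hello"}],
)
print(response.choices[0].message.content)
```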
Step 3: Fix the Together AI bug
Here is something the documentation does not tell you: the LiteLLM provider has a bug when used with Together AI. It passes internal LangExtract parameters that Together AI's API cannot serialize.
The error you will see:
```
InternalServerError: Together_aiException - Object of type FormatType is not JSON serializable
```
The fix:
Open the provider file and filter out non-standard parameters. Find the file at:
```
.venv/lib/python3.12/site-packages/langextract_litellm/provider.py
```
(replace python3.12 with your interpreter's version if it differs)
Locate this code block (around line 89):
```python
response = litellm.completion(
    model=self.model_id,
    messages=messages,
    **self.provider_kwargs,
)
```
Replace it with:
```python
# Filter out non-serializable kwargs that LiteLLM doesn't understand
filtered_kwargs = {
    k: v for k, v in self.provider_kwargs.items()
    if k in ('temperature', 'max_tokens', 'top_p', 'frequency_penalty',
             'presence_penalty', 'timeout', 'api_base', 'api_key',
             'stop', 'n', 'stream', 'response_format')
}
response = litellm.completion(
    model=self.model_id,
    messages=messages,
    **filtered_kwargs,
)
```
This filters the kwargs to only include parameters that LiteLLM and Together AI understand, preventing the serialization error.
LangExtract passes internal configuration objects (like FormatType enums) through to the model provider. The Gemini provider handles these correctly, but the LiteLLM provider passes everything through to the underlying API. Together AI's API tries to JSON-serialize all parameters for the request body and fails on Python enum objects.
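You can reproduce the root cause with nothing but the standard library - a minimal sketch where FormatType is a stand-in for LangExtract's internal enum:
```python
import json
from enum import Enum

class FormatType(Enum):  # stand-in for LangExtract's internal enum
    JSON = "json"

json.dumps({"temperature": 0.2})  # plain values serialize fine

try:
    json.dumps({"format": FormatType.JSON})  # enum objects do not
except TypeError as e:
    print(e)  # Object of type FormatType is not JSON serializable
```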
Step 4: Set up your API key
Get your Together AI API key from api.together.xyz and set it as an environment variable:
```bash
export TOGETHER_AI_API_KEY="your-api-key-here"
```
On Windows PowerShell:
```powershell
$env:TOGETHER_AI_API_KEY = "your-api-key-here"
```
To make this permanent, add the export line to your ~/.bashrc or ~/.zshrc file.
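A small pre-flight check avoids a confusing authentication error mid-run - a minimal sketch:
```python
import os

# Fail fast with a clear message if the key is missing from this shell session.
if not os.environ.get("TOGETHER_AI_API_KEY"):
    raise RuntimeError("TOGETHER_AI_API_KEY is not set - export it before running extractions.")
```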
Alternative: Local execution with Ollama
For regulated industries (healthcare, legal, finance) where data cannot leave your network, run extraction locally with Ollama:
Install Ollama and pull a model:
```bash
# Install Ollama (macOS)
brew install ollama

# Pull Llama 3.1 8B (runs on most laptops)
ollama pull llama3.1:8b

# Or pull a larger model if you have a GPU
ollama pull llama3.1:70b
```
Change one line in your extraction code:
```python
# Instead of Together AI:
# model_id="litellm/together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo"

# Use local Ollama:
model_id="litellm/ollama/llama3.1:8b"
```
No API keys needed. No data leaves your machine. Same extraction code, different model ID.
Local models are slower (10-30 seconds per extraction vs 1-3 seconds with cloud APIs) and smaller models may have lower accuracy. Test extraction quality on your specific documents before committing to local-only deployment. Many teams use a hybrid: local for sensitive documents, cloud for bulk processing.
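A sketch of that hybrid pattern, assuming you can flag sensitivity per document (the is_sensitive flag and the routing rule are illustrative):
```python
import langextract as lx

CLOUD_MODEL = "litellm/together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo"
LOCAL_MODEL = "litellm/ollama/llama3.1:8b"

def extract_document(text, is_sensitive, prompt, examples):
    """Route sensitive documents to the local model; send the rest to the cloud."""
    model_id = LOCAL_MODEL if is_sensitive else CLOUD_MODEL
    return lx.extract(
        text_or_documents=text,
        prompt_description=prompt,
        examples=examples,
        model_id=model_id,
    )
```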
Step 5: Define your extraction schema
LangExtract learns what to extract from examples you provide. No fine-tuning, no training data preparation - just show it what you want.
Create a new Python file called extract.py:
```python
import langextract as lx
import textwrap
# 1. Define what you want to extract
prompt = textwrap.dedent("""\
Extract key information from this text.
Identify methods, metrics, comparisons, and applications.
Use exact text from the source. Do not paraphrase.
Extract in order of appearance.""")
# 2. Provide an example to guide the model
examples = [
lx.data.ExampleData(
text="We propose FastNet, achieving 95% accuracy on ImageNet, 2x faster than ResNet. Applications include medical imaging.",
extractions=[
lx.data.Extraction(
extraction_class="method",
extraction_text="FastNet",
attributes={"type": "model"}
),
lx.data.Extraction(
extraction_class="metric",
extraction_text="95% accuracy on ImageNet",
attributes={"benchmark": "ImageNet", "value": "95%"}
),
lx.data.Extraction(
extraction_class="comparison",
extraction_text="2x faster than ResNet",
attributes={"baseline": "ResNet", "improvement": "2x"}
),
lx.data.Extraction(
extraction_class="application",
extraction_text="medical imaging",
attributes={"domain": "healthcare"}
),
]
)
]
```
The example teaches LangExtract:
- What classes to extract: method, metric, comparison, application
- What attributes to capture: type, benchmark, value, baseline, domain
- How to format extractions: exact text spans, structured attributes
This approach is called few-shot learning. Instead of training a model on thousands of examples, you provide a handful of high-quality examples that demonstrate exactly what you want. The LLM generalizes from these examples to handle new, unseen text.
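If your documents vary in style, a second example usually helps more than a longer prompt. A sketch of adding one more ExampleData with the same classes (the abstract text here is invented purely for illustration):
```python
# A second few-shot example covering a different writing style (invented text).
examples.append(
    lx.data.ExampleData(
        text="Our retrieval-augmented approach cuts hallucination rate to 12% on TruthfulQA, a 40% reduction versus GPT-3.5.",
        extractions=[
            lx.data.Extraction(
                extraction_class="method",
                extraction_text="retrieval-augmented approach",
                attributes={"type": "technique"},
            ),
            lx.data.Extraction(
                extraction_class="metric",
                extraction_text="12% on TruthfulQA",
                attributes={"benchmark": "TruthfulQA", "value": "12%"},
            ),
            lx.data.Extraction(
                extraction_class="comparison",
                extraction_text="a 40% reduction versus GPT-3.5",
                attributes={"baseline": "GPT-3.5", "improvement": "40% reduction"},
            ),
        ],
    )
)
```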
Step 6: Run your first extraction
Add this code to your extract.py file:
```python
# 3. Your input document
input_text = """
We introduce MAXS (Meta-Adaptive eXploration Strategy), a framework
that enables LLM agents to look ahead before committing to actions.
On five reasoning benchmarks, MAXS achieves 63.46% average accuracy
compared to 52.93% for Chain-of-Thought, while using 100x fewer
tokens than Monte Carlo Tree Search. The framework supports tool use
including code execution and web search.
"""
# 4. Run the extraction
result = lx.extract(
text_or_documents=input_text,
prompt_description=prompt,
examples=examples,
model_id="litellm/together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo",
)
# 5. Print results
print(f"Found {len(result.extractions)} extractions:\n")
for ext in result.extractions:
    print(f"[{ext.extraction_class}] {ext.extraction_text}")
    if ext.attributes:
        for key, value in ext.attributes.items():
            print(f"  {key}: {value}")
    print()
```
Run it:
```bash
python extract.py
```
Expected output:
```
Found 6 extractions:

[method] MAXS (Meta-Adaptive eXploration Strategy)
  type: framework

[metric] 63.46% average accuracy
  benchmark: five reasoning benchmarks
  value: 63.46%

[comparison] compared to 52.93% for Chain-of-Thought
  baseline: Chain-of-Thought
  improvement: +10.5 points

[comparison] using 100x fewer tokens than Monte Carlo Tree Search
  baseline: Monte Carlo Tree Search
  improvement: 100x fewer tokens

[application] code execution
  domain: tool use

[application] web search
  domain: tool use
```
Step 7: Visualize results
LangExtract can generate interactive HTML visualizations showing extractions highlighted in their original context:
```python
# Save results to JSONL
lx.io.save_annotated_documents(
    [result],
    output_name="extractions.jsonl",
    output_dir=".",
)

# Generate visualization
html_content = lx.visualize("extractions.jsonl")
with open("visualization.html", "w") as f:
    f.write(html_content if isinstance(html_content, str) else html_content.data)

print("Open visualization.html in your browser")
```
The visualization shows:
- Original text with highlighted extractions
- Color-coded by extraction class
- Click any extraction to see its attributes
- Navigate between multiple documents
Practical examples
Research paper mining
Extract methods, results, and comparisons from paper abstracts to build a structured research database:
```python
prompt = """\
Extract research findings from AI paper abstracts.
- methods: The proposed technique or model name
- metrics: Quantitative results with numbers and benchmarks
- comparisons: How it compares to baselines
- limitations: Any mentioned drawbacks or constraints"""
examples = [
lx.data.ExampleData(
text="Our model achieves 92% F1 on SQuAD, outperforming BERT by 3 points. Training requires 8 GPUs for 2 days.",
extractions=[
lx.data.Extraction(
extraction_class="metric",
extraction_text="92% F1 on SQuAD",
attributes={"benchmark": "SQuAD", "metric_type": "F1", "value": "92%"}
),
lx.data.Extraction(
extraction_class="comparison",
extraction_text="outperforming BERT by 3 points",
attributes={"baseline": "BERT", "delta": "+3 points"}
),
lx.data.Extraction(
extraction_class="limitation",
extraction_text="Training requires 8 GPUs for 2 days",
attributes={"type": "compute_cost"}
),
]
)
]
```
Contract clause extraction
Extract obligations, dates, and parties from legal contracts:
```python
prompt = """\
Extract key clauses from legal contracts.
- party: Named entities (companies, individuals)
- obligation: What someone must do
- date: Deadlines and time periods
- penalty: Consequences for non-compliance"""
examples = [
lx.data.ExampleData(
text="Acme Corp shall deliver the software by December 31, 2026. Late delivery incurs a 5% penalty per week.",
extractions=[
lx.data.Extraction(
extraction_class="party",
extraction_text="Acme Corp",
attributes={"role": "obligor"}
),
lx.data.Extraction(
extraction_class="obligation",
extraction_text="shall deliver the software",
attributes={"type": "delivery"}
),
lx.data.Extraction(
extraction_class="date",
extraction_text="December 31, 2026",
attributes={"type": "deadline"}
),
lx.data.Extraction(
extraction_class="penalty",
extraction_text="5% penalty per week",
attributes={"trigger": "late delivery"}
),
]
)
]
```
Medical note structuring
Extract medications, dosages, and symptoms from clinical notes:
```python
prompt = """\
Extract medical information from clinical notes.
- medication: Drug names
- dosage: Amounts and frequencies
- symptom: Patient complaints or findings
- diagnosis: Identified conditions"""
examples = [
lx.data.ExampleData(
text="Patient reports persistent headache for 3 days. Prescribed ibuprofen 400mg twice daily. Suspected tension headache.",
extractions=[
lx.data.Extraction(
extraction_class="symptom",
extraction_text="persistent headache for 3 days",
attributes={"duration": "3 days"}
),
lx.data.Extraction(
extraction_class="medication",
extraction_text="ibuprofen",
attributes={"type": "NSAID"}
),
lx.data.Extraction(
extraction_class="dosage",
extraction_text="400mg twice daily",
attributes={"amount": "400mg", "frequency": "twice daily"}
),
lx.data.Extraction(
extraction_class="diagnosis",
extraction_text="tension headache",
attributes={"certainty": "suspected"}
),
]
)
]
```
Understanding source grounding
Source grounding is LangExtract's key differentiator. Every extraction includes character offsets pointing to the original text:
```python
for ext in result.extractions:
    print(f"Text: {ext.extraction_text}")
    print(f"Start: {ext.start_offset}, End: {ext.end_offset}")
    # Verify by slicing the original text
    original_slice = input_text[ext.start_offset:ext.end_offset]
    print(f"Verified: {original_slice == ext.extraction_text}")
```
Why this matters:
| Use Case | Why Source Grounding Helps |
|---|---|
| Compliance audits | Prove where every data point originated |
| Legal discovery | Link extracted clauses to exact document locations |
| Medical records | Trace diagnoses back to clinical notes |
| Research databases | Cite exact sources for extracted findings |
| Quality assurance | Quickly verify LLM extractions are accurate |
Without source grounding, you are trusting the LLM blindly. With it, you can verify every extraction in seconds.
Confidence and verification patterns
Business users always ask: "How do I know it's right?" Here are patterns to build trust in your extraction pipeline.
Add confidence scores to extractions
Request a confidence attribute in your schema:
```python
examples = [
lx.data.ExampleData(
text="Revenue increased 23% to $4.2 billion in Q3 2026.",
extractions=[
lx.data.Extraction(
extraction_class="metric",
extraction_text="$4.2 billion",
attributes={
"metric_type": "revenue",
"period": "Q3 2026",
"confidence": "high" # Model self-reports confidence
}
),
]
)
]
```
The model will learn to add confidence ratings. Use these to flag extractions for review.
Human-in-the-loop flagging
Route low-confidence extractions to human reviewers:
```python
def process_with_review(result, confidence_threshold=0.8):
    approved = []
    needs_review = []
    for ext in result.extractions:
        confidence = ext.attributes.get("confidence", "medium")
        confidence_score = {"high": 0.9, "medium": 0.7, "low": 0.5}.get(confidence, 0.5)
        if confidence_score >= confidence_threshold:
            approved.append(ext)
        else:
            needs_review.append({
                "extraction": ext,
                "source_text": input_text[ext.start_offset:ext.end_offset],
                "reason": f"Confidence: {confidence}"
            })
    return approved, needs_review

approved, flagged = process_with_review(result)
print(f"Auto-approved: {len(approved)}, Needs review: {len(flagged)}")
```
Verification checklist
For critical applications, implement these checks:
| Check | Implementation | When to Use |
|---|---|---|
| Source match | input_text[start:end] == extraction_text | Always |
| Attribute completeness | all(k in ext.attributes for k in required_keys) | Structured data |
| Value validation | Regex or Pydantic for dates, numbers, emails | Financial/medical |
| Duplicate detection | Compare against previous extractions | Incremental processing |
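Below is a sketch of those checks as plain functions, assuming offsets and attributes are exposed as in the snippets above (required_keys and the regex pattern are illustrative):
```python
import re

def check_source_match(ext, source_text):
    """Source match: the extraction must be a verbatim slice of the source."""
    return source_text[ext.start_offset:ext.end_offset] == ext.extraction_text

def check_attribute_completeness(ext, required_keys):
    """Attribute completeness: all required attribute keys are present."""
    return all(k in ext.attributes for k in required_keys)

def check_value_format(value, pattern=r"^\d+(\.\d+)?%$"):
    """Value validation: e.g. percentages like '95%' (pattern is illustrative)."""
    return bool(re.match(pattern, value))

def check_duplicates(extractions):
    """Duplicate detection: flag repeated (class, text) pairs."""
    seen, dupes = set(), []
    for ext in extractions:
        key = (ext.extraction_class, ext.extraction_text)
        if key in seen:
            dupes.append(ext)
        seen.add(key)
    return dupes
```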
When to use LangExtract
The sweet spot
LangExtract shines when you need structured data + audit trail from documents. The source grounding is the killer feature - without it, you're just using an LLM with extra steps.
Ideal use cases:
| Use Case | Why LangExtract Fits | Example |
|---|---|---|
| Compliance pipelines | Auditors ask 'where did this come from?' | Extracting financial figures from SEC filings |
| Legal document review | Every clause must trace to source | Contract analysis for M&A due diligence |
| Medical data extraction | Clinical decisions need verification | Structuring EHR notes for research |
| Building searchable databases | Need structured fields from many docs | Research paper database with queryable metrics |
| Bulk document processing | Same extraction across 100s of files | Processing insurance claims or invoices |
When it's overkill
Do not use LangExtract just because it exists. For many tasks, a simple LLM prompt is faster and cheaper.
Skip LangExtract when:
| Situation | Why It's Overkill | Better Alternative |
|---|---|---|
| Writing articles/summaries | You need synthesis, not extraction | Direct LLM conversation |
| One-off analysis | Setup time exceeds task time | ChatGPT or Claude chat |
| Highly structured data | JSON, XML, CSV already have structure | Traditional parsers |
| Real-time applications | LLM latency is 1-5 seconds per call | Pre-trained NER models |
| You don't need source proof | Source grounding is the main value | Plain LLM extraction |
Extraction vs RAG: Know the difference
AI enthusiasts often confuse these. They solve different problems:
| Aspect | LangExtract (Extraction) | RAG (Retrieval-Augmented Generation) |
|---|---|---|
| Goal | Structure documents into databases | Answer questions using documents |
| Output | Structured records with source offsets | Natural language responses |
| Query type | Extract all metrics from this paper | What was the accuracy on SQuAD? |
| Data flow | Documents → Structured data → Database | Question → Retrieve chunks → Generate answer |
| Best for | Building searchable datasets | Conversational Q&A over docs |
When to use extraction:
- "I need to populate a database with contract terms"
- "Build a dashboard from 500 research papers"
- "Create a structured feed from news articles"
When to use RAG:
- "Let users ask questions about our documentation"
- "Build a chatbot that answers from internal knowledge"
- "Help analysts explore reports conversationally"
Combining both: Extract structured data with LangExtract, store in a database, then use RAG to answer questions over that structured data. This gives you the best of both: queryable fields AND conversational access.
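A rough sketch of that combination, reusing the SQLite table built in the Downstream integration section below and calling the LLM directly through litellm (the prompt wording is illustrative):
```python
import sqlite3
import litellm

def answer_over_extractions(question, db_path="extractions.db"):
    """Hybrid pattern: retrieve structured rows first, then let an LLM phrase the answer."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT extraction_class, extraction_text, attributes FROM extractions"
    ).fetchall()
    context = "\n".join(str(row) for row in rows)
    response = litellm.completion(
        model="together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo",
        messages=[{"role": "user",
                   "content": f"Answer using only these extracted records:\n{context}\n\nQuestion: {question}"}],
    )
    return response.choices[0].message.content
```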
The decision framework
Ask yourself these questions:
| Question | Yes | No |
|---|---|---|
| Will someone ask 'where did this data come from?' | LangExtract (source grounding matters) | Plain LLM is simpler |
| Are you processing many similar documents? | LangExtract (define schema once, reuse) | One-off LLM prompt is faster |
| Do you need a structured database? | LangExtract (enforces consistent schema) | Unstructured summaries work fine |
| Is accuracy critical enough to verify every extraction? | LangExtract (click to verify each one) | Trust the LLM, spot-check occasionally |
Real example: Research paper analysis
Scenario A: Writing an article about one paper
- You read the paper, understand context, write narrative
- Need: synthesis, editorial judgment, visualization ideas
- Verdict: Skip LangExtract - use direct LLM conversation
Scenario B: Building a paper comparison database
- Extract (method, benchmark, accuracy, model_size) from 200 papers
- Query: "papers with >80% on MathVista using fewer than 10B parameters"
- Verdict: Use LangExtract - structured extraction at scale
Scenario C: Daily paper discovery pipeline
- Auto-extract key metrics from new arXiv papers
- Filter by thresholds, surface interesting ones
- Verdict: Use LangExtract - consistent schema across documents
Troubleshooting
| Problem | Likely Cause | Solution |
|---|---|---|
| FormatType serialization error | LiteLLM provider bug | Apply the fix from Step 3 |
| Empty extractions | Examples do not match input format | Improve example quality and coverage |
| Wrong extraction classes | Ambiguous prompt | Be more specific in prompt description |
| Missed extractions | Text is too long | Increase extraction_passes parameter |
| API rate limits | Too many parallel requests | Reduce max_workers parameter |
Checking extraction quality
If extractions seem wrong, check the prompt alignment:
```python
# Enable warnings for misaligned examples
import warnings
warnings.filterwarnings("default")

result = lx.extract(...)  # Warnings will show if examples have issues
```
Common alignment issues (a pre-flight check sketch follows this list):
- Extraction text not verbatim from example text
- Extractions not in order of appearance
- Overlapping extraction spans
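A sketch of that pre-flight check, assuming ExampleData exposes its text and extractions as attributes (as its constructor arguments suggest); the helper name is illustrative:
```python
def lint_example(example):
    """Flag the three common issues: non-verbatim spans, out-of-order spans, overlapping spans."""
    problems = []
    last_end = -1
    for ext in example.extractions:
        start = example.text.find(ext.extraction_text)
        if start == -1:
            problems.append(f"not verbatim in example text: {ext.extraction_text!r}")
            continue
        if start < last_end:
            problems.append(f"out of order or overlapping: {ext.extraction_text!r}")
        last_end = max(last_end, start + len(ext.extraction_text))
    return problems

for example in examples:
    for problem in lint_example(example):
        print("Example issue:", problem)
```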
Production-ready error handling
API calls fail. Add retry logic with exponential backoff:
```python
import time
from functools import wraps

def retry_with_backoff(max_retries=3, base_delay=1):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == max_retries - 1:
                        raise
                    delay = base_delay * (2 ** attempt)
                    print(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay}s...")
                    time.sleep(delay)
            return None
        return wrapper
    return decorator

@retry_with_backoff(max_retries=3)
def extract_with_retry(text, prompt, examples, model_id):
    return lx.extract(
        text_or_documents=text,
        prompt_description=prompt,
        examples=examples,
        model_id=model_id,
    )
```
Or use the tenacity library for more control:
```python
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=30),
    retry=retry_if_exception_type((TimeoutError, ConnectionError))
)
def robust_extract(text, prompt, examples, model_id):
    return lx.extract(text_or_documents=text, prompt_description=prompt,
                      examples=examples, model_id=model_id)
```
Tips for production use
Batch your documents
For large document sets, process in parallel:
```python
result = lx.extract(
    text_or_documents=documents,  # List of texts or URLs
    prompt_description=prompt,
    examples=examples,
    model_id="litellm/together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo",
    max_workers=10,        # Parallel processing
    extraction_passes=2,   # Multiple passes for better recall
)
```
Monitor costs
Track token usage to avoid surprise bills:
```python
# Estimate tokens before running
from langextract.tokenizer import count_tokens

token_count = count_tokens(your_text)
estimated_cost = token_count * 0.0001  # Adjust for your model pricing
```
Cache results
Save extractions to avoid re-processing:
```python
import os
import json
import hashlib

def get_cached_or_extract(text, cache_dir="./cache"):
    text_hash = hashlib.md5(text.encode()).hexdigest()
    cache_file = f"{cache_dir}/{text_hash}.json"
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            return json.load(f)  # cache hits return the saved dict, not a result object
    result = lx.extract(
        text_or_documents=text,
        prompt_description=prompt,
        examples=examples,
        model_id="litellm/together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo",
    )
    with open(cache_file, "w") as f:
        json.dump(result.to_dict(), f)
    return result
```
Validate with Pydantic
Wrap extractions in Pydantic models to catch type errors before they hit your database:
```python
from pydantic import BaseModel, field_validator
from datetime import datetime
from typing import Optional

class MetricExtraction(BaseModel):
    text: str
    benchmark: str
    value: float
    period: Optional[str] = None

    @field_validator('value', mode='before')
    @classmethod
    def parse_percentage(cls, v):
        if isinstance(v, str):
            return float(v.strip('%')) / 100
        return v

class ContractExtraction(BaseModel):
    party: str
    obligation: str
    deadline: datetime

    @field_validator('deadline', mode='before')
    @classmethod
    def parse_date(cls, v):
        if isinstance(v, str):
            # Handle common date formats
            for fmt in ['%Y-%m-%d', '%B %d, %Y', '%m/%d/%Y']:
                try:
                    return datetime.strptime(v, fmt)
                except ValueError:
                    continue
        return v

# Validate extractions
def validate_extraction(ext, model_class):
    try:
        return model_class(**ext.attributes, text=ext.extraction_text)
    except Exception as e:
        print(f"Validation failed: {e}")
        return None

validated = [validate_extraction(e, MetricExtraction)
             for e in result.extractions
             if e.extraction_class == "metric"]
```
Version your examples
As requirements change, track schema versions for backward compatibility:
```python
EXTRACTION_SCHEMA_VERSION = "1.2"

# Schema changelog:
# v1.2: Added "limitation" class, improved metric attributes
# v1.1: Added confidence scores to all extractions
# v1.0: Initial schema with method, metric, comparison

examples = [
    lx.data.ExampleData(...)
]

# Store version with extractions for migration support
def save_extractions(result, version=EXTRACTION_SCHEMA_VERSION):
    return {
        "schema_version": version,
        "extracted_at": datetime.now().isoformat(),
        "extractions": [e.to_dict() for e in result.extractions]
    }
```
When business needs shift (e.g., capturing a new "clause type"), bump the version and add migration logic for historical data.
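A sketch of what that migration logic might look like, assuming records were saved by save_extractions above (the migration rules themselves are illustrative):
```python
def migrate_record(record):
    """Upgrade a stored extraction record to the current schema version (illustrative rules)."""
    version = record.get("schema_version", "1.0")
    if version == "1.0":
        # v1.1 added confidence scores; default missing ones to "medium".
        for ext in record["extractions"]:
            ext.setdefault("attributes", {}).setdefault("confidence", "medium")
        version = "1.1"
    if version == "1.1":
        # v1.2 added the "limitation" class; nothing to backfill, just bump the version.
        version = "1.2"
    record["schema_version"] = version
    return record
```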
Downstream integration
Extraction isn't the end goal - using the data is. Here's how to get extractions into your data stack.
Load into Pandas for analysis:
```python
import pandas as pd

def extractions_to_dataframe(result):
    rows = []
    for ext in result.extractions:
        row = {
            "class": ext.extraction_class,
            "text": ext.extraction_text,
            "start": ext.start_offset,
            "end": ext.end_offset,
            **ext.attributes  # Flatten attributes into columns
        }
        rows.append(row)
    return pd.DataFrame(rows)

df = extractions_to_dataframe(result)
print(df[df["class"] == "metric"][["text", "benchmark", "value"]])
```
Insert into SQLite/PostgreSQL:
```python
import json
import sqlite3

def create_extractions_table(conn):
    conn.execute("""
        CREATE TABLE IF NOT EXISTS extractions (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            document_id TEXT,
            extraction_class TEXT,
            extraction_text TEXT,
            start_offset INTEGER,
            end_offset INTEGER,
            attributes JSON,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)

def insert_extractions(conn, doc_id, result):
    for ext in result.extractions:
        conn.execute("""
            INSERT INTO extractions (document_id, extraction_class, extraction_text,
                                     start_offset, end_offset, attributes)
            VALUES (?, ?, ?, ?, ?, ?)
        """, (doc_id, ext.extraction_class, ext.extraction_text,
              ext.start_offset, ext.end_offset, json.dumps(ext.attributes)))
    conn.commit()

# Usage
conn = sqlite3.connect("extractions.db")
create_extractions_table(conn)
insert_extractions(conn, "paper_001", result)

# Query: Find all metrics above 90% (assumes attributes.value was stored as a number)
df = pd.read_sql("""
    SELECT * FROM extractions
    WHERE extraction_class = 'metric'
      AND json_extract(attributes, '$.value') > 90
""", conn)
```
Export for visualization tools:
```python
# Export to JSON for dashboards
with open("extractions.json", "w") as f:
    json.dump([{
        "class": e.extraction_class,
        "text": e.extraction_text,
        **e.attributes
    } for e in result.extractions], f)

# Export to CSV for spreadsheets
df.to_csv("extractions.csv", index=False)
```
Next steps
Now that LangExtract is set up, start with a small document set to validate extraction quality before scaling up.
The combination of LLM flexibility and source grounding solves a real problem in data pipelines: getting structured data from messy documents while maintaining an audit trail. Unlike regex (brittle) or plain LLMs (no verification), LangExtract gives you the best of both worlds.
For teams processing documents at scale, this can replace hours of manual extraction work. The few-shot approach means you can adapt to new document types in minutes by adding examples, not retraining models.
Resources:
- LangExtract GitHub - Source code and documentation
- LiteLLM Providers - Full list of supported LLM providers
- Together AI Models - Available models and pricing