- Problem. RAG systems re-embed entire documents on every update, wasting 85-95% of compute on unchanged content. Version history is typically lost, breaking compliance requirements.
- Solution. LiveVectorLake splits storage into hot (Milvus for current content) and cold (Delta Lake for history) tiers. SHA-256 chunk hashing detects exactly what changed, so only modified content is embedded.
- Results. 10-15% content reprocessing vs an 85-95% baseline. Current queries: 65ms median. Historical queries: 1.2s median. 100% change detection accuracy with zero temporal leakage.
- Viability. Open-source implementation with Python, Milvus, and Delta Lake. Designed for regulated industries needing both fast queries and complete audit trails.
Consider a knowledge base of 1,000 policy documents, each split into ~500 paragraphs (500K chunks total). At $0.0001 per embedding, a full daily re-index costs $50/day. With 5% of paragraphs actually changing, LiveVectorLake re-embeds only the 25K modified chunks: $2.50/day. That's $1,400/month in savings on a single corpus. For enterprises with millions of documents, the savings justify the architecture overhead in days.
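If you want to sanity-check that arithmetic, here's a throwaway sketch using the numbers from the example above (prices and change rate are the example's assumptions, not measured values):

```python
# Back-of-the-envelope embedding cost for the example corpus above.
chunks = 1_000 * 500            # 1,000 documents x ~500 paragraphs
cost_per_embedding = 0.0001     # USD per chunk embedding (example rate)
change_rate = 0.05              # 5% of paragraphs change per day

full_reindex_per_day = chunks * cost_per_embedding                 # $50.00
selective_per_day = chunks * change_rate * cost_per_embedding      # $2.50
monthly_savings = (full_reindex_per_day - selective_per_day) * 30  # ~$1,425

print(full_reindex_per_day, selective_per_day, round(monthly_savings))
```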
Temporal leakage. When a historical query accidentally returns information that didn't exist at the requested point in time. For example, asking "What was our policy in January 2024?" and getting content added in March 2024. LiveVectorLake's validity filtering prevents this by tracking when each chunk was active.
Research overview
If you have built a RAG system that handles documents that change, you know the frustration. Someone updates a policy document. Your pipeline re-embeds all 500 chunks, even though only three paragraphs changed. That's 99% wasted compute, and you're paying for it.
But the bigger problem is what happens next. Compliance calls. "What did our policy say in January?" You check the knowledge base. It shows today's version. The January version? Gone. Overwritten. You have no answer.
This is the update-query tradeoff that LiveVectorLake solves. Think of it like a newspaper archive: you want today's edition on your desk for quick reference, but you also need every back issue in the basement for when someone asks about old coverage.
The system uses two storage tiers:
- Hot tier (Milvus): Only current content, optimized for fast queries (65ms)
- Cold tier (Delta Lake): Complete version history for audits and compliance
The clever part is how it avoids re-embedding unchanged content. Each paragraph gets a fingerprint (SHA-256 hash). If the fingerprint matches the previous version, skip re-embedding. Simple, but effective: 85% less compute.
Vector database. A database optimized for storing and searching "embeddings" (lists of numbers that represent the meaning of text). Instead of exact keyword matching, vector databases find semantically similar content. "What's our refund policy?" matches documents about "return procedures" even without shared keywords. Milvus, Pinecone, and Weaviate are popular options.
[Figure: LiveVectorLake system architecture. Five-layer architecture with CDC ingestion and dual-tier storage.]
Hierarchical Navigable Small World. A graph-based algorithm for approximate nearest neighbor search. It builds a multi-layer graph where each node connects to semantically similar neighbors. Queries navigate from coarse upper layers to precise lower layers, achieving sub-linear search time. Milvus uses HNSW for fast vector similarity search.
The update problem in RAG
Standard RAG pipelines follow a simple pattern: document changes, re-embed everything, replace the old index. This works for static corpora but breaks down in production:
| Approach | Content reprocessed | Update latency |
|---|---|---|
| Full re-index | 100% | Minutes+ |
| Upsert | 85-95% | 2-4s |
| Batch (12h) | 15-20% | 12h delay |
| LiveVectorLake | 10-15% | 1.2-1.8s |
All approaches except LiveVectorLake lose version history on each update.
The inefficiency compounds. A 1,000-document corpus with 5% daily changes means re-embedding 50 documents. With standard upsert, roughly 45 of those documents' worth of embedding work is wasted each day, since only a few paragraphs in each changed document are actually new. Over a year, you've paid for 16,000+ unnecessary embedding operations.
[Figure: Update performance comparison. LiveVectorLake vs traditional update approaches.]
Text diffing finds character-level changes but doesn't map to semantic chunks. A single character edit might span chunk boundaries or leave embedding-relevant content unchanged. SHA-256 hashing at the chunk level captures exactly what the embedding model will see, making change detection precise and deterministic.
The compliance gap
Beyond efficiency, version history matters for regulated industries.
Concrete example: On March 12, 2024, a customer's loan application was denied. The system referenced "Standard Credit Policy v3.2", which set the minimum credit score at 680. On March 15, the policy was updated to v3.3, lowering the threshold to 660.
The regulator's inquiry arrives on March 20: "Why was this loan denied on March 12?"
With a standard RAG, you'd show them v3.3 (threshold 660) and look incompetent or dishonest. With LiveVectorLake, you query with timestamp 2024-03-12T08:30:00Z and retrieve the exact policy text that was active at that moment: v3.2, threshold 680. The denial was correct under the rules that existed at the time.
Other use cases:
- Healthcare: Audit trails for treatment protocol changes
- Legal: Contract version history for dispute resolution
- Technical docs: "What did the API say when this bug was reported?"
Debugging AI agents
Here's a use case the paper doesn't emphasize but practitioners will appreciate: debugging non-deterministic AI agents.
Your agent hallucinated yesterday. Users complained. You want to reproduce the error, but today's knowledge base has been updated three times since then. The context the agent saw yesterday no longer exists.
With LiveVectorLake, you can "time travel" to the exact state of the world when the error occurred. Query with yesterday's timestamp, get yesterday's context, reproduce the bug. This turns agent debugging from guesswork into science.
Five-layer architecture
Understanding the full system helps when debugging or extending it. LiveVectorLake separates concerns into five layers, from raw document ingestion down to user interfaces. Each layer has one job.
[Figure: Implementation technologies. LiveVectorLake technology stack components.]
Layer 1: Change Detection & Ingestion
- Semantic chunking at paragraph boundaries
- SHA-256 content-addressable hashing
- In-memory hash store for sub-millisecond CDC comparison
Change data capture (CDC). A technique for identifying what changed between two versions of data. Instead of comparing entire documents character-by-character, CDC uses fingerprints (hashes) to quickly detect which pieces changed. Database systems use CDC to sync replicas. LiveVectorLake uses it to avoid re-embedding unchanged paragraphs.
Layer 2: Embedding Generation
- Selective processing of only modified/new chunks
- SentenceTransformers (all-MiniLM-L6-v2, 384 dimensions)
- Temporal metadata attached: valid_from, valid_to, version_number
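To make Layer 2 concrete, here's a minimal sketch of selective embedding with SentenceTransformers, attaching the temporal metadata listed above. The chunk dict layout and function name are assumptions for illustration, not the paper's code.

```python
# Sketch of Layer 2: embed only the changed chunks and attach temporal metadata.
# The record layout and helper name are assumptions, not the paper's API.
from datetime import datetime, timezone
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

def embed_changed_chunks(changed_chunks, version_number):
    now = datetime.now(timezone.utc).isoformat()
    vectors = model.encode([c["text"] for c in changed_chunks])
    records = []
    for chunk, vec in zip(changed_chunks, vectors):
        records.append({
            "chunk_id": chunk["chunk_id"],
            "text": chunk["text"],
            "embedding": vec.tolist(),
            "valid_from": now,      # becomes active now
            "valid_to": None,       # open-ended until superseded
            "version_number": version_number,
        })
    return records
```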
Layer 3: Dual-Tier Storage
- Hot tier: Milvus with HNSW indexing (M=16, efConstruction=200)
- Cold tier: Delta Lake with Parquet and Snappy compression
- Write-ahead logging for ACID consistency
Layer 4: Query Engine
- Automatic routing based on query type
- Current queries hit hot tier only
- Temporal queries scan cold tier with validity filtering
Layer 5: Interfaces
- CLI for batch operations
- Streamlit UI with version timeline visualization
Chunk-level change detection
This is the heart of the efficiency gain. Think of it as git diff for your knowledge base. Instead of asking "has this document changed?" the system asks "which specific paragraphs changed?" Just like Git only commits the lines you modified, LiveVectorLake only embeds the paragraphs that actually changed.
The mechanism is content fingerprinting:
chunk_id = SHA256(normalize(content))
Normalization ensures consistent hashing by stripping whitespace and applying case-folding:
import re

def normalize(text):
    # Collapse whitespace, lowercase, strip
    return re.sub(r'\s+', ' ', text.lower().strip())

Two properties emerge from this approach:
- Automatic deduplication: Identical paragraphs across documents share one embedding
- Deterministic change detection: Hash change guarantees content change
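Here's what that fingerprinting looks like with Python's standard hashlib; chunk_fingerprint is an illustrative name, not the paper's API.

```python
import hashlib
import re

def normalize(text: str) -> str:
    # Collapse whitespace, lowercase, strip (same rule as above)
    return re.sub(r"\s+", " ", text.lower().strip())

def chunk_fingerprint(content: str) -> str:
    # chunk_id = SHA256(normalize(content))
    return hashlib.sha256(normalize(content).encode("utf-8")).hexdigest()

# Reformatting that doesn't change the text yields the same fingerprint:
assert chunk_fingerprint("Refunds within 30 days.") == chunk_fingerprint("  refunds   within 30 days. ")
```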
On document ingestion:
- Compute all chunk hashes for new version
- Compare against stored hashes
- Classify each chunk as new, modified, unchanged, or deleted
- Embed only new and modified chunks
The hash store lives in memory (persisted to JSON), enabling sub-millisecond comparison. Database queries for change detection would add ~100ms latency per document.
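A minimal sketch of that comparison, with a JSON-persisted store and position-based matching. The matching strategy (paragraph i of the new version vs paragraph i of the old one) and the file name are assumptions.

```python
# Sketch: JSON-persisted hash store plus new/modified/unchanged/deleted classification.
import json
from dataclasses import dataclass, field
from pathlib import Path

STORE_PATH = Path("hash_store.json")  # persisted in-memory store (path is an assumption)

@dataclass
class Diff:
    new: list = field(default_factory=list)
    modified: list = field(default_factory=list)
    unchanged: list = field(default_factory=list)
    deleted: list = field(default_factory=list)

def load_store() -> dict:
    return json.loads(STORE_PATH.read_text()) if STORE_PATH.exists() else {}

def save_store(store: dict) -> None:
    STORE_PATH.write_text(json.dumps(store))

def classify_chunks(doc_id: str, new_hashes: list, store: dict) -> Diff:
    # Position-based comparison of chunk fingerprints, old version vs new version.
    old_hashes = store.get(doc_id, [])
    diff = Diff()
    for i, h in enumerate(new_hashes):
        if i >= len(old_hashes):
            diff.new.append(i)
        elif h == old_hashes[i]:
            diff.unchanged.append(i)
        else:
            diff.modified.append(i)
    # Old chunks with no counterpart in the new version were deleted.
    diff.deleted = list(range(len(new_hashes), len(old_hashes)))
    return diff
```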
Collision probability is 2^-256, essentially zero. More importantly, SHA-256 is fast (microseconds per chunk) and deterministic. The same content always produces the same hash, making it perfect for change detection. Cryptographic strength is overkill here, but the speed and ubiquity make it practical.
Dual-tier storage
Not all queries are equal. Current queries need speed. Historical queries need completeness. Trying to optimize for both in one system means compromising on both. LiveVectorLake sidesteps this by using separate storage for each access pattern.
Picture a bustling café kitchen (the hot tier) and a deep cellar pantry (the cold tier). The kitchen holds only the ingredients the chef needs right now: fresh herbs, pre-sliced vegetables, today's soup. Dishes get plated in seconds. When a batch of soup is finished, the leftover broth goes down to the cellar, where barrels sit for weeks, aging slowly but safely preserved.
When a customer asks for today's special, the chef reaches into the kitchen. When a historian asks for the soup recipe from last winter, someone descends to the cellar and retrieves the exact barrel. The kitchen stays lean and lightning-fast; the cellar provides the full, unaltered record of every batch ever made.
[Figure: Query latency by tier. Hot tier (current) vs cold tier (historical) performance in milliseconds.]
Hot tier: Milvus
- Purpose: Current-state queries at interactive latency
- Contents: Only active (non-superseded) chunks
- Index: HNSW with M=16, efConstruction=200
- Latency: 65ms p50, 145ms p99
When a chunk is updated, the old version is marked inactive in the hot tier and migrated to cold storage. This keeps the hot tier small (90% fewer chunks than full history) while maintaining query speed.
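Here's a minimal sketch of creating such a hot-tier collection with those HNSW parameters, using pymilvus's MilvusClient. Collection and field names are my assumptions, and the exact signatures are worth double-checking against the pymilvus 2.4 docs.

```python
# Sketch: hot-tier collection with HNSW (M=16, efConstruction=200).
# Collection/field names are assumptions; verify signatures against pymilvus 2.4 docs.
from pymilvus import MilvusClient, DataType

client = MilvusClient(uri="http://localhost:19530")

schema = MilvusClient.create_schema(auto_id=False)
schema.add_field(field_name="chunk_id", datatype=DataType.VARCHAR, is_primary=True, max_length=64)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=65535)
schema.add_field(field_name="status", datatype=DataType.VARCHAR, max_length=16)  # 'active' vs superseded
schema.add_field(field_name="embedding", datatype=DataType.FLOAT_VECTOR, dim=384)

index_params = client.prepare_index_params()
index_params.add_index(
    field_name="embedding",
    index_type="HNSW",
    metric_type="COSINE",
    params={"M": 16, "efConstruction": 200},
)

client.create_collection("hot_tier", schema=schema, index_params=index_params)
```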
Cold tier: Delta Lake
- Purpose: Historical queries and compliance audits
- Contents: Complete version history with temporal metadata
- Format: Parquet with Snappy compression
- Latency: 1,200ms p50, 2,100ms p99
Parquet. A column-oriented file format optimized for analytical workloads. Because data for each column is stored together, queries that need only a subset of fields (like timestamps for validity filtering) can read far less data than row-oriented formats. Combined with Snappy compression, Parquet files are typically 70-90% smaller than equivalent JSON or CSV.
Delta Lake provides ACID transactions, time travel queries, and efficient columnar storage. The 18x latency difference versus the hot tier is acceptable for audit use cases, where latency requirements are relaxed.
Delta Lake. An open-source storage layer that adds reliability to data lakes. It stores data in Parquet files (columnar format, highly compressed) with a transaction log that tracks all changes. The "time travel" feature lets you query data as it existed at any past point. Think of it as Git for your data: every version is preserved and accessible.
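Appending a version to the cold tier is an ordinary Delta write. Here's a minimal sketch using Polars, which the stack already includes; the table path, column layout, and dates are illustrative assumptions built from the compliance example above.

```python
# Sketch: append versioned chunks to the cold tier (Delta Lake via Polars),
# then read it back with validity filtering. Paths and values are assumptions.
import polars as pl

records = pl.DataFrame({
    "chunk_id": ["hash-v32-para1", "hash-v33-para1"],  # SHA-256 hex digests in practice
    "doc_id": ["credit-policy", "credit-policy"],
    "text": ["Minimum credit score: 680", "Minimum credit score: 660"],
    "valid_from": ["2024-03-01T00:00:00Z", "2024-03-15T00:00:00Z"],
    "valid_to": ["2024-03-15T00:00:00Z", None],   # open until superseded
    "version_number": [2, 3],
    "embedding": [[0.01] * 384, [0.02] * 384],
})

records.write_delta("cold_tier/chunks", mode="append")

# Historical read: only chunks valid at the requested timestamp survive the filter.
ts = "2024-03-12T08:30:00Z"
active_at_ts = (
    pl.scan_delta("cold_tier/chunks")
      .filter(
          (pl.col("valid_from") <= ts)
          & (pl.col("valid_to").is_null() | (pl.col("valid_to") > ts))
      )
      .collect()
)
```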
Consistency guarantees
Write-ahead logging ensures atomicity across tiers:
Write-ahead logging (WAL). A durability technique where changes are first recorded in a sequential log before being applied to the actual data stores. If a failure occurs mid-operation, the system can replay or roll back the log to restore consistency. WAL is how databases like PostgreSQL guarantee that committed transactions survive crashes.
1. Write new chunks to WAL
2. Update hot tier (Milvus)
3. Append to cold tier (Delta Lake)
4. Mark WAL entry complete
What happens on failure? If step 3 (cold tier write) fails after step 2 (hot tier) succeeded, you have "phantom data" in the hot tier that doesn't exist in cold storage. The compensating transaction rolls back the hot tier update by deleting the newly inserted chunks. On restart, the system replays uncommitted WAL entries from scratch.
This is the "scary" part for architects: two storage systems must stay synchronized. The WAL is the single source of truth that makes recovery deterministic.
Performance results
Numbers matter more than architecture diagrams. The authors tested on a 100-document corpus with 5 versions per document, simulating a typical enterprise knowledge base with regular updates.
| Metric | Value |
|---|---|
| Content reprocessed | 10-15% |
| Current query latency (p50) | 65ms |
| Historical query latency (p50) | 1,200ms |
| Change detection accuracy | 100% (147/147) |
| Hot tier storage reduction | 90% |
| Temporal query accuracy | 100% (0% leakage) |
The 10-15% reprocessing rate assumes typical document edits (a few paragraphs changed per version). Complete rewrites would approach 100%, but that's rare in practice.
Every historical query returned only chunks that were valid at the requested timestamp. No future information leaked into past queries. This is verified by comparing query results against ground-truth document states at each point in time.
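In code, that check is a one-liner over the returned metadata; a minimal sketch, assuming the result rows carry the valid_from/valid_to fields described earlier:

```python
def assert_no_temporal_leakage(results, ts):
    # Every returned chunk must have been valid at the requested timestamp.
    for chunk in results:
        assert chunk["valid_from"] <= ts
        assert chunk["valid_to"] is None or chunk["valid_to"] > ts
```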
Implementation blueprint
If you want to build this yourself, here's what the authors actually used. No exotic dependencies. Everything runs in Docker Compose on a single machine.
| Component | Technology |
|---|---|
| Language | Python 3.11+ |
| Embedding | SentenceTransformers (all-MiniLM-L6-v2) |
| Hot Tier | Milvus 2.4+ |
| Cold Tier | Delta Lake (deltalake-python) |
| Data Processing | Polars |
| UI | Streamlit |
| Orchestration | Docker Compose |
This architecture does not fit the standard LangChain/LlamaIndex "VectorStore" interface, which assumes a single storage backend. You'll need to implement a custom Retriever class that routes queries to the appropriate tier based on timestamp. Plan for this: it's not a drop-in replacement for your existing vector store.
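A minimal sketch of what that retriever could look like against langchain-core; hot_store and cold_store are placeholder clients you'd wire up yourself, and the overall shape is an assumption rather than code from the paper.

```python
# Sketch of a timestamp-aware retriever for LangChain; hot_store/cold_store are
# placeholder clients (assumptions), not part of the paper or of LangChain.
from typing import Any, List, Optional

from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever

class TieredRetriever(BaseRetriever):
    hot_store: Any               # current-state search client (e.g. Milvus wrapper)
    cold_store: Any              # historical search client (e.g. Delta Lake reader)
    as_of: Optional[str] = None  # ISO timestamp; None means "current"
    k: int = 5

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        if self.as_of is None:
            hits = self.hot_store.search(query, k=self.k)
        else:
            hits = self.cold_store.search(query, k=self.k, as_of=self.as_of)
        return [Document(page_content=h["text"], metadata=h.get("meta", {})) for h in hits]
```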
Ingestion pipeline
The core workflow: load document, fingerprint each chunk, compare against stored fingerprints, embed only what changed. The six-step flow handles both new documents and updates.
def ingest_doc(path, doc_id, ts):
# Load and chunk
chunks = load_and_chunk(path)
# Fingerprint each chunk
new_fps = [sha256(normalize(c)) for c in chunks]
old_fps = hash_store.get(doc_id, [])
# Find what changed
diff = compare_hashes(new_fps, old_fps)
# Embed only changed chunks
for c in diff.new + diff.modified:
c.embedding = embed(c.text)
# Write to both tiers
write_hot_tier(diff.new + diff.modified)
write_cold_tier(chunks, ts)
# Update fingerprint store
    hash_store[doc_id] = new_fps

Query routing
The query router decides which tier to hit based on whether a timestamp is provided. Current queries go to the fast hot tier. Historical queries scan the cold tier with validity filtering.
def query(text, ts=None, k=5):
vec = embed(text)
if ts is None:
# Current: hot tier only
return milvus.search(
vectors=[vec], limit=k,
filter="status == 'active'"
)
else:
# Historical: cold tier with time filter
time_filter = f"valid_from <= {ts}"
time_filter += f" AND valid_to > {ts}"
return delta.search(
vectors=[vec], limit=k,
filter=time_filter
        )

Pitfalls to avoid
These mistakes won't show up in unit tests but will cause problems in production.
Hash collisions on short chunks. SHA-256 won't collide, but if your chunking produces very short paragraphs (under 50 characters), identical boilerplate text will legitimately hash the same and get deduplicated across documents. Consider minimum chunk sizes.
Timestamp granularity. If two versions are ingested in the same second, validity filtering breaks. Use millisecond-precision timestamps or version counters.
Cold tier scan costs. Historical queries scan Parquet files. For large corpora with many versions, this gets expensive. Consider partitioning by time period (year/quarter) to limit scan scope.
Memory pressure from hash store. The in-memory hash store grows with corpus size. A 100K document corpus with 500 chunks each means 50M hashes in memory (~3GB). For larger deployments, use Redis (or KeyDB for clustering). Redis handles hash lookups in microseconds and persists to disk automatically. The switch is straightforward: replace hash_store.get() with redis.hget() calls.
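A minimal sketch of that swap, keeping each document's fingerprint list as JSON inside a Redis hash; the key and field names are assumptions.

```python
# Sketch: move the fingerprint store from an in-process dict to Redis.
# Key/field names are assumptions; redis-py's hget/hset are standard calls.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def get_fingerprints(doc_id: str) -> list:
    raw = r.hget("chunk_hashes", doc_id)               # was: hash_store.get(doc_id, [])
    return json.loads(raw) if raw else []

def set_fingerprints(doc_id: str, fps: list) -> None:
    r.hset("chunk_hashes", doc_id, json.dumps(fps))    # was: hash_store[doc_id] = fps
```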
Embedding model changes. If you swap embedding models, all hashes remain valid but vectors become incompatible. You'll need a full re-embed of the hot tier. Plan for this.
When to use LiveVectorLake
Every architecture has tradeoffs. This one optimizes for frequently-updated corpora with compliance requirements. If that's not your situation, simpler approaches work fine.
Good fit:
- Frequently updated document corpora (daily/weekly changes)
- Compliance requirements for version history
- Mixed workload: fast current queries + occasional historical lookups
- Cost sensitivity around embedding compute
Not ideal:
- Static corpora (no updates = no benefit from CDC)
- Real-time streaming updates (current implementation is synchronous)
- Multi-modal content (text-only support currently)
- Distributed deployment (single-machine architecture)
Limitations
The paper acknowledges several constraints:
- Synchronous processing only: No streaming ingestion pipeline yet
- Text-only support: Images, tables, and other modalities not handled
- Single-machine deployment: No horizontal scaling for hot or cold tier
- Fixed chunking strategy: Paragraph-level only, no learned boundaries
- Embedding model fixed: all-MiniLM-L6-v2 hardcoded, no pluggable models
Future work mentions standard dataset benchmarking (BEIR, MS MARCO), learned temporal embeddings, and semantic change detection beyond hash comparison.
The bottom line
For ML/data engineers: If you're running a RAG system with documents that change weekly or more often, this architecture pays for itself in embedding costs. The dual-tier approach is straightforward to implement with the provided code.
For engineering managers: The ROI calculation is simple. Take your current embedding spend, multiply by 0.15 (the fraction you'd actually need), and compare against the engineering time to implement. For most teams with compliance requirements, it's worth building.
For compliance/legal teams: This is one of the few RAG architectures that can answer "what did the system know on date X?" with confidence. If you're in financial services, healthcare, or legal, that capability matters.
For researchers: The chunk-level CDC approach is the novel contribution. The dual-tier storage is standard practice. Interesting extensions would be learned chunk boundaries and cross-document deduplication.
Paper: arXiv:2601.05270
Authors: Tarun Prajapati et al.
Code: github.com/praj-tarun/livevectorlake
Cite this paper
Tarun Prajapati et al. (2026). LiveVectorLake: Real-Time Versioned Knowledge Base for RAG. arXiv:2601.05270.