- Problem. RAG systems re-embed entire documents on every update, wasting 85-95% of compute on unchanged content. Version history is typically lost, breaking compliance requirements.
- Solution. LiveVectorLake splits storage into hot (Milvus for current content) and cold (Delta Lake for history) tiers. SHA-256 chunk hashing detects exactly what changed, so only modified content is embedded.
- Results. 10-15% content reprocessing vs an 85-95% baseline. Current queries: 65ms median. Historical queries: 1.2s median. 100% change detection accuracy with zero temporal leakage.
- Viability. Open-source implementation with Python, Milvus, and Delta Lake. Designed for regulated industries needing both fast queries and complete audit trails.
Consider a knowledge base of 1,000 policy documents, each split into ~500 paragraphs (500K chunks total). At $0.0001 per embedding, a full daily re-index costs $50/day. With 5% of paragraphs actually changing, LiveVectorLake re-embeds only the 25K modified chunks: $2.50/day. That's $1,400/month in savings on a single corpus. For enterprises with millions of documents, the savings justify the architecture overhead in days.
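If you want to sanity-check that arithmetic, here's a throwaway sketch using the numbers from the example above (prices and change rate are the example's assumptions, not measured values):

```python
# Back-of-the-envelope embedding cost for the example corpus above.
chunks = 1_000 * 500            # 1,000 documents x ~500 paragraphs
cost_per_embedding = 0.0001     # USD per chunk embedding (example rate)
change_rate = 0.05              # 5% of paragraphs change per day

full_reindex_per_day = chunks * cost_per_embedding                 # $50.00
selective_per_day = chunks * change_rate * cost_per_embedding      # $2.50
monthly_savings = (full_reindex_per_day - selective_per_day) * 30  # ~$1,425

print(full_reindex_per_day, selective_per_day, round(monthly_savings))
```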
Temporal leakage. When a historical query accidentally returns information that didn't exist at the requested point in time. For example, asking "What was our policy in January 2024?" and getting content added in March 2024. LiveVectorLake's validity filtering prevents this by tracking when each chunk was active.
Research overview
If you have built a RAG system that handles documents that change, you know the frustration. Someone updates a policy document. Your pipeline re-embeds all 500 chunks, even though only three paragraphs changed. That's 99% wasted compute, and you're paying for it.
But the bigger problem is what happens next. Compliance calls. "What did our policy say in January?" You check the knowledge base. It shows today's version. The January version? Gone. Overwritten. You have no answer.
This is the update-query tradeoff that LiveVectorLake solves. Think of it like a newspaper archive: you want today's edition on your desk for quick reference, but you also need every back issue in the basement for when someone asks about old coverage.
The system uses two storage tiers:
- Hot tier (Milvus): Only current content, optimized for fast queries (65ms)
- Cold tier (Delta Lake): Complete version history for audits and compliance
The clever part is how it avoids re-embedding unchanged content. Each paragraph gets a fingerprint (SHA-256 hash). If the fingerprint matches the previous version, skip re-embedding. Simple, but effective: 85% less compute.
Vector database. A database optimized for storing and searching "embeddings" (lists of numbers that represent the meaning of text). Instead of exact keyword matching, vector databases find semantically similar content. "What's our refund policy?" matches documents about "return procedures" even without shared keywords. Milvus, Pinecone, and Weaviate are popular options.
[Figure: LiveVectorLake system architecture. Five-layer architecture with CDC ingestion and dual-tier storage.]
Hierarchical Navigable Small World. A graph-based algorithm for approximate nearest neighbor search. It builds a multi-layer graph where each node connects to semantically similar neighbors. Queries navigate from coarse upper layers to precise lower layers, achieving sub-linear search time. Milvus uses HNSW for fast vector similarity search.
The update problem in RAG
Standard RAG pipelines follow a simple pattern: document changes, re-embed everything, replace the old index. This works for static corpora but breaks down in production:
| Approach | Content reprocessed | Update latency |
|---|---|---|
| Full re-index | 100% | Minutes+ |
| Upsert | 85-95% | 2-4s |
| Batch (12h) | 15-20% | 12h delay |
| LiveVectorLake | 10-15% | 1.2-1.8s |
All approaches except LiveVectorLake lose version history on each update.
The inefficiency compounds. A 1,000-document corpus with 5% daily changes means re-embedding 50 documents. With standard upsert, roughly 45 of those documents' worth of embedding work is wasted each day, since only a few paragraphs in each changed document are actually new. Over a year, you've paid for 16,000+ unnecessary embedding operations.
[Figure: Update performance comparison. LiveVectorLake vs traditional update approaches.]
Text diffing finds character-level changes but doesn't map to semantic chunks. A single character edit might span chunk boundaries or leave embedding-relevant content unchanged. SHA-256 hashing at the chunk level captures exactly what the embedding model will see, making change detection precise and deterministic.
The compliance gap
Beyond efficiency, version history matters for regulated industries.
Concrete example: On March 12, 2024, a customer's loan application was denied. The system referenced "Standard Credit Policy v3.2", which set the minimum credit score at 680. On March 15, the policy was updated to v3.3, lowering the threshold to 660.
The regulator's inquiry arrives on March 20: "Why was this loan denied on March 12?"
With a standard RAG, you'd show them v3.3 (threshold 660) and look incompetent or dishonest. With LiveVectorLake, you query with timestamp 2024-03-12T08:30:00Z and retrieve the exact policy text that was active at that moment: v3.2, threshold 680. The denial was correct under the rules that existed at the time.
Other use cases:
- Healthcare: Audit trails for treatment protocol changes
- Legal: Contract version history for dispute resolution
- Technical docs: "What did the API say when this bug was reported?"
Debugging AI agents
Here's a use case the paper doesn't emphasize but practitioners will appreciate: debugging non-deterministic AI agents.
Your agent hallucinated yesterday. Users complained. You want to reproduce the error, but today's knowledge base has been updated three times since then. The context the agent saw yesterday no longer exists.
With LiveVectorLake, you can "time travel" to the exact state of the world when the error occurred. Query with yesterday's timestamp, get yesterday's context, reproduce the bug. This turns agent debugging from guesswork into science.
Five-layer architecture
Understanding the full system helps when debugging or extending it. LiveVectorLake separates concerns into five layers, from raw document ingestion down to user interfaces. Each layer has one job.
[Figure: Implementation technologies. LiveVectorLake technology stack components.]
Layer 1: Change Detection & Ingestion
- Semantic chunking at paragraph boundaries
- SHA-256 content-addressable hashing
- In-memory hash store for sub-millisecond CDC comparison
Change data capture (CDC). A technique for identifying what changed between two versions of data. Instead of comparing entire documents character-by-character, CDC uses fingerprints (hashes) to quickly detect which pieces changed. Database systems use CDC to sync replicas. LiveVectorLake uses it to avoid re-embedding unchanged paragraphs.
Layer 2: Embedding Generation
- Selective processing of only modified/new chunks
- SentenceTransformers (all-MiniLM-L6-v2, 384 dimensions)
- Temporal metadata attached: valid_from, valid_to, version_number
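To make Layer 2 concrete, here's a minimal sketch of selective embedding with SentenceTransformers, attaching the temporal metadata listed above. The chunk dict layout and function name are assumptions for illustration, not the paper's code.

```python
# Sketch of Layer 2: embed only the changed chunks and attach temporal metadata.
# The record layout and helper name are assumptions, not the paper's API.
from datetime import datetime, timezone
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

def embed_changed_chunks(changed_chunks, version_number):
    now = datetime.now(timezone.utc).isoformat()
    vectors = model.encode([c["text"] for c in changed_chunks])
    records = []
    for chunk, vec in zip(changed_chunks, vectors):
        records.append({
            "chunk_id": chunk["chunk_id"],
            "text": chunk["text"],
            "embedding": vec.tolist(),
            "valid_from": now,      # becomes active now
            "valid_to": None,       # open-ended until superseded
            "version_number": version_number,
        })
    return records
```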
Layer 3: Dual-Tier Storage
- Hot tier: Milvus with HNSW indexing (M=16, efConstruction=200)
- Cold tier: Delta Lake with Parquet and Snappy compression
- Write-ahead logging for ACID consistency
Layer 4: Query Engine
- Automatic routing based on query type
- Current queries hit hot tier only
- Temporal queries scan cold tier with validity filtering
Layer 5: Interfaces
- CLI for batch operations
- Streamlit UI with version timeline visualization
Chunk-level change detection
This is the heart of the efficiency gain. Think of it as git diff for your knowledge base. Instead of asking "has this document changed?" the system asks "which specific paragraphs changed?" Just like Git only commits the lines you modified, LiveVectorLake only embeds the paragraphs that actually changed.
The mechanism is content fingerprinting:
chunk_id = SHA256(normalize(content))
Normalization ensures consistent hashing by stripping whitespace and applying case-folding:
import re

def normalize(text):
    # Collapse whitespace, lowercase, strip
    return re.sub(r'\s+', ' ', text.lower().strip())

Two properties emerge from this approach:
- Automatic deduplication: Identical paragraphs across documents share one embedding
- Deterministic change detection: Hash change guarantees content change
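Here's what that fingerprinting looks like with Python's standard hashlib; chunk_fingerprint is an illustrative name, not the paper's API.

```python
import hashlib
import re

def normalize(text: str) -> str:
    # Collapse whitespace, lowercase, strip (same rule as above)
    return re.sub(r"\s+", " ", text.lower().strip())

def chunk_fingerprint(content: str) -> str:
    # chunk_id = SHA256(normalize(content))
    return hashlib.sha256(normalize(content).encode("utf-8")).hexdigest()

# Reformatting that doesn't change the text yields the same fingerprint:
assert chunk_fingerprint("Refunds within 30 days.") == chunk_fingerprint("  refunds   within 30 days. ")
```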
On document ingestion:
- Compute all chunk hashes for new version
- Compare against stored hashes
- Classify each chunk as new, modified, unchanged, or deleted
- Embed only new and modified chunks
The hash store lives in memory (persisted to JSON), enabling sub-millisecond comparison. Database queries for change detection would add ~100ms latency per document.
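A minimal sketch of that comparison, with a JSON-persisted store and position-based matching. The matching strategy (paragraph i of the new version vs paragraph i of the old one) and the file name are assumptions.

```python
# Sketch: JSON-persisted hash store plus new/modified/unchanged/deleted classification.
import json
from dataclasses import dataclass, field
from pathlib import Path

STORE_PATH = Path("hash_store.json")  # persisted in-memory store (path is an assumption)

@dataclass
class Diff:
    new: list = field(default_factory=list)
    modified: list = field(default_factory=list)
    unchanged: list = field(default_factory=list)
    deleted: list = field(default_factory=list)

def load_store() -> dict:
    return json.loads(STORE_PATH.read_text()) if STORE_PATH.exists() else {}

def save_store(store: dict) -> None:
    STORE_PATH.write_text(json.dumps(store))

def classify_chunks(doc_id: str, new_hashes: list, store: dict) -> Diff:
    # Position-based comparison of chunk fingerprints, old version vs new version.
    old_hashes = store.get(doc_id, [])
    diff = Diff()
    for i, h in enumerate(new_hashes):
        if i >= len(old_hashes):
            diff.new.append(i)
        elif h == old_hashes[i]:
            diff.unchanged.append(i)
        else:
            diff.modified.append(i)
    # Old chunks with no counterpart in the new version were deleted.
    diff.deleted = list(range(len(new_hashes), len(old_hashes)))
    return diff
```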
Collision probability is 2^-256, essentially zero. More importantly, SHA-256 is fast (microseconds per chunk) and deterministic. The same content always produces the same hash, making it perfect for change detection. Cryptographic strength is overkill here, but the speed and ubiquity make it practical.
Dual-tier storage
Not all queries are equal. Current queries need speed. Historical queries need completeness. Trying to optimize for both in one system means compromising on both. LiveVectorLake sidesteps this by using separate storage for each access pattern.
Picture a bustling café kitchen (the hot tier) and a deep cellar pantry (the cold tier). The kitchen holds only the ingredients the chef needs right now: fresh herbs, pre-sliced vegetables, today's soup. Dishes get plated in seconds. When a batch of soup is finished, the leftover broth goes down to the cellar, where barrels sit for weeks, aging slowly but safely preserved.
When a customer asks for today's special, the chef reaches into the kitchen. When a historian asks for the soup recipe from last winter, someone descends to the cellar and retrieves the exact barrel. The kitchen stays lean and lightning-fast; the cellar provides the full, unaltered record of every batch ever made.
[Figure: Query latency by tier. Hot tier (current) vs cold tier (historical) performance in milliseconds.]
Hot tier: Milvus
- Purpose: Current-state queries at interactive latency
- Contents: Only active (non-superseded) chunks
- Index: HNSW with M=16, efConstruction=200
- Latency: 65ms p50, 145ms p99
When a chunk is updated, the old version is marked inactive in the hot tier and migrated to cold storage. This keeps the hot tier small (90% fewer chunks than full history) while maintaining query speed.
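Here's a minimal sketch of creating such a hot-tier collection with those HNSW parameters, using pymilvus's MilvusClient. Collection and field names are my assumptions, and the exact signatures are worth double-checking against the pymilvus 2.4 docs.

```python
# Sketch: hot-tier collection with HNSW (M=16, efConstruction=200).
# Collection/field names are assumptions; verify signatures against pymilvus 2.4 docs.
from pymilvus import MilvusClient, DataType

client = MilvusClient(uri="http://localhost:19530")

schema = MilvusClient.create_schema(auto_id=False)
schema.add_field(field_name="chunk_id", datatype=DataType.VARCHAR, is_primary=True, max_length=64)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=65535)
schema.add_field(field_name="status", datatype=DataType.VARCHAR, max_length=16)  # 'active' vs superseded
schema.add_field(field_name="embedding", datatype=DataType.FLOAT_VECTOR, dim=384)

index_params = client.prepare_index_params()
index_params.add_index(
    field_name="embedding",
    index_type="HNSW",
    metric_type="COSINE",
    params={"M": 16, "efConstruction": 200},
)

client.create_collection("hot_tier", schema=schema, index_params=index_params)
```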
Cold tier: Delta Lake
- Purpose: Historical queries and compliance audits
- Contents: Complete version history with temporal metadata
- Format: Parquet with Snappy compression
- Latency: 1,200ms p50, 2,100ms p99
Parquet. A column-oriented file format optimized for analytical workloads. Because data for each column is stored together, queries that need only a subset of fields (like timestamps for validity filtering) can read far less data than row-oriented formats. Combined with Snappy compression, Parquet files are typically 70-90% smaller than equivalent JSON or CSV.
Delta Lake provides ACID transactions, time travel queries, and efficient columnar storage. The 18x latency difference versus the hot tier is acceptable for audit use cases, where latency requirements are relaxed.
Delta Lake. An open-source storage layer that adds reliability to data lakes. It stores data in Parquet files (columnar format, highly compressed) with a transaction log that tracks all changes. The "time travel" feature lets you query data as it existed at any past point. Think of it as Git for your data: every version is preserved and accessible.
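Appending a version to the cold tier is an ordinary Delta write. Here's a minimal sketch using Polars, which the stack already includes; the table path, column layout, and dates are illustrative assumptions built from the compliance example above.

```python
# Sketch: append versioned chunks to the cold tier (Delta Lake via Polars),
# then read it back with validity filtering. Paths and values are assumptions.
import polars as pl

records = pl.DataFrame({
    "chunk_id": ["hash-v32-para1", "hash-v33-para1"],  # SHA-256 hex digests in practice
    "doc_id": ["credit-policy", "credit-policy"],
    "text": ["Minimum credit score: 680", "Minimum credit score: 660"],
    "valid_from": ["2024-03-01T00:00:00Z", "2024-03-15T00:00:00Z"],
    "valid_to": ["2024-03-15T00:00:00Z", None],   # open until superseded
    "version_number": [2, 3],
    "embedding": [[0.01] * 384, [0.02] * 384],
})

records.write_delta("cold_tier/chunks", mode="append")

# Historical read: only chunks valid at the requested timestamp survive the filter.
ts = "2024-03-12T08:30:00Z"
active_at_ts = (
    pl.scan_delta("cold_tier/chunks")
      .filter(
          (pl.col("valid_from") <= ts)
          & (pl.col("valid_to").is_null() | (pl.col("valid_to") > ts))
      )
      .collect()
)
```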
Consistency guarantees
Write-ahead logging ensures atomicity across tiers:
Write-ahead logging (WAL). A durability technique where changes are first recorded in a sequential log before being applied to the actual data stores. If a failure occurs mid-operation, the system can replay or roll back the log to restore consistency. WAL is how databases like PostgreSQL guarantee that committed transactions survive crashes.
1. Write new chunks to WAL
2. Update hot tier (Milvus)
3. Append to cold tier (Delta Lake)
4. Mark WAL entry complete
What happens on failure? If step 3 (cold tier write) fails after step 2 (hot tier) succeeded, you have "phantom data" in the hot tier that doesn't exist in cold storage. The compensating transaction rolls back the hot tier update by deleting the newly inserted chunks. On restart, the system replays uncommitted WAL entries from scratch.
This is the "scary" part for architects: two storage systems must stay synchronized. The WAL is the single source of truth that makes recovery deterministic.
Performance results
Numbers matter more than architecture diagrams. The authors tested on a 100-document corpus with 5 versions per document, simulating a typical enterprise knowledge base with regular updates.
| Metric | Value |
|---|---|
| Content reprocessed | 10-15% |
| Current query latency (p50) | 65ms |
| Historical query latency (p50) | 1,200ms |
| Change detection accuracy | 100% (147/147) |
| Hot tier storage reduction | 90% |
| Temporal query accuracy | 100% (0% leakage) |
The 10-15% reprocessing rate assumes typical document edits (a few paragraphs changed per version). Complete rewrites would approach 100%, but that's rare in practice.
Every historical query returned only chunks that were valid at the requested timestamp. No future information leaked into past queries. This is verified by comparing query results against ground-truth document states at each point in time.
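In code, that check is a one-liner over the returned metadata; a minimal sketch, assuming the result rows carry the valid_from/valid_to fields described earlier:

```python
def assert_no_temporal_leakage(results, ts):
    # Every returned chunk must have been valid at the requested timestamp.
    for chunk in results:
        assert chunk["valid_from"] <= ts
        assert chunk["valid_to"] is None or chunk["valid_to"] > ts
```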
Implementation blueprint
If you want to build this yourself, here's what the authors actually used. No exotic dependencies. Everything runs in Docker Compose on a single machine.
| Component | Technology |
|---|---|
| Language | Python 3.11+ |
| Embedding | SentenceTransformers (all-MiniLM-L6-v2) |
| Hot Tier | Milvus 2.4+ |
| Cold Tier | Delta Lake (deltalake-python) |
| Data Processing | Polars |
| UI | Streamlit |
| Orchestration | Docker Compose |
This architecture does not fit the standard LangChain/LlamaIndex "VectorStore" interface, which assumes a single storage backend. You'll need to implement a custom Retriever class that routes queries to the appropriate tier based on timestamp. Plan for this: it's not a drop-in replacement for your existing vector store.
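A minimal sketch of what that retriever could look like against langchain-core; hot_store and cold_store are placeholder clients you'd wire up yourself, and the overall shape is an assumption rather than code from the paper.

```python
# Sketch of a timestamp-aware retriever for LangChain; hot_store/cold_store are
# placeholder clients (assumptions), not part of the paper or of LangChain.
from typing import Any, List, Optional

from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever

class TieredRetriever(BaseRetriever):
    hot_store: Any               # current-state search client (e.g. Milvus wrapper)
    cold_store: Any              # historical search client (e.g. Delta Lake reader)
    as_of: Optional[str] = None  # ISO timestamp; None means "current"
    k: int = 5

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        if self.as_of is None:
            hits = self.hot_store.search(query, k=self.k)
        else:
            hits = self.cold_store.search(query, k=self.k, as_of=self.as_of)
        return [Document(page_content=h["text"], metadata=h.get("meta", {})) for h in hits]
```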
Ingestion pipeline
The core workflow: load document, fingerprint each chunk, compare against stored fingerprints, embed only what changed. The six-step flow handles both new documents and updates.
def ingest_doc(path, doc_id, ts):
# Load and chunk
chunks = load_and_chunk(path)
# Fingerprint each chunk
new_fps = [sha256(normalize(c)) for c in chunks]
old_fps = hash_store.get(doc_id, [])
# Find what changed
diff = compare_hashes(new_fps, old_fps)
# Embed only changed chunks
for c in diff.new + diff.modified:
c.embedding = embed(c.text)
# Write to both tiers
write_hot_tier(diff.new + diff.modified)
write_cold_tier(chunks, ts)
# Update fingerprint store
    hash_store[doc_id] = new_fps

Query routing
The query router decides which tier to hit based on whether a timestamp is provided. Current queries go to the fast hot tier. Historical queries scan the cold tier with validity filtering.
def query(text, ts=None, k=5):
vec = embed(text)
if ts is None:
# Current: hot tier only
return milvus.search(
vectors=[vec], limit=k,
filter="status == 'active'"
)
else:
# Historical: cold tier with time filter
time_filter = f"valid_from <= {ts}"
time_filter += f" AND valid_to > {ts}"
return delta.search(
vectors=[vec], limit=k,
filter=time_filter
        )

Pitfalls to avoid
These mistakes won't show up in unit tests but will cause problems in production.
Hash collisions on short chunks. SHA-256 won't collide, but if your chunking produces very short paragraphs (under 50 characters), identical boilerplate text will legitimately hash the same and get deduplicated across documents. Consider minimum chunk sizes.
Timestamp granularity. If two versions are ingested in the same second, validity filtering breaks. Use millisecond-precision timestamps or version counters.
Cold tier scan costs. Historical queries scan Parquet files. For large corpora with many versions, this gets expensive. Consider partitioning by time period (year/quarter) to limit scan scope.
Memory pressure from hash store. The in-memory hash store grows with corpus size. A 100K document corpus with 500 chunks each means 50M hashes in memory (~3GB). For larger deployments, use Redis (or KeyDB for clustering). Redis handles hash lookups in microseconds and persists to disk automatically. The switch is straightforward: replace hash_store.get() with redis.hget() calls.
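A minimal sketch of that swap, keeping each document's fingerprint list as JSON inside a Redis hash; the key and field names are assumptions.

```python
# Sketch: move the fingerprint store from an in-process dict to Redis.
# Key/field names are assumptions; redis-py's hget/hset are standard calls.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def get_fingerprints(doc_id: str) -> list:
    raw = r.hget("chunk_hashes", doc_id)               # was: hash_store.get(doc_id, [])
    return json.loads(raw) if raw else []

def set_fingerprints(doc_id: str, fps: list) -> None:
    r.hset("chunk_hashes", doc_id, json.dumps(fps))    # was: hash_store[doc_id] = fps
```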
Embedding model changes. If you swap embedding models, all hashes remain valid but vectors become incompatible. You'll need a full re-embed of the hot tier. Plan for this.
When to use LiveVectorLake
Every architecture has tradeoffs. This one optimizes for frequently-updated corpora with compliance requirements. If that's not your situation, simpler approaches work fine.
Good fit:
- Frequently updated document corpora (daily/weekly changes)
- Compliance requirements for version history
- Mixed workload: fast current queries + occasional historical lookups
- Cost sensitivity around embedding compute
Not ideal:
- Static corpora (no updates = no benefit from CDC)
- Real-time streaming updates (current implementation is synchronous)
- Multi-modal content (text-only support currently)
- Distributed deployment (single-machine architecture)
Limitations
The paper acknowledges several constraints:
- Synchronous processing only: No streaming ingestion pipeline yet
- Text-only support: Images, tables, and other modalities not handled
- Single-machine deployment: No horizontal scaling for hot or cold tier
- Fixed chunking strategy: Paragraph-level only, no learned boundaries
- Embedding model fixed: all-MiniLM-L6-v2 hardcoded, no pluggable models
Future work mentions standard dataset benchmarking (BEIR, MS MARCO), learned temporal embeddings, and semantic change detection beyond hash comparison.
The bottom line
For ML/data engineers: If you're running a RAG system with documents that change weekly or more often, this architecture pays for itself in embedding costs. The dual-tier approach is straightforward to implement with the provided code.
For engineering managers: The ROI calculation is simple. Take your current embedding spend, multiply by 0.15 (the fraction you'd actually need), and compare against the engineering time to implement. For most teams with compliance requirements, it's worth building.
For compliance/legal teams: This is one of the few RAG architectures that can answer "what did the system know on date X?" with confidence. If you're in financial services, healthcare, or legal, that capability matters.
For researchers: The chunk-level CDC approach is the novel contribution. The dual-tier storage is standard practice. Interesting extensions would be learned chunk boundaries and cross-document deduplication.
Paper: arXiv:2601.05270
Authors: Tarun Prajapati et al.
Code: github.com/praj-tarun/livevectorlake
Cite this paper
Tarun Prajapati et al. (2026). LiveVectorLake: Real-Time Versioned Knowledge Base for RAG. arXiv:2601.05270.