arXiv 2025 · December 3, 2025

RAGVue: Diagnostic Evaluation That Tells You Why Your RAG System Fails

Keerthana Murugaraj et al.

RAGVue addresses a critical gap in RAG evaluation: existing metrics collapse heterogeneous behaviors into single scores without explaining failure modes. The framework decomposes assessment into retrieval quality, answer relevance and completeness, claim-level faithfulness, and evaluator calibration. On StrategyQA benchmarks, RAGVue shows strong correlation with RAGAS on generation metrics but reveals systematic differences in retrieval assessment, demonstrating that existing tools often conflate insufficient retrieval with unsupported reasoning.

Categories: Information Retrieval, Natural Language Processing, Evaluation Methods

Key Findings

  1. Turns opaque scores into debugging reports: instead of 'faithfulness: 0.52', you get 'claim X is unsupported because retrieved chunk Y lacks evidence for date Z'.

  2. Seven metrics across three dimensions: retrieval relevance and coverage; answer clarity, relevance, and completeness; strict claim-level faithfulness with entity matching; and evaluator calibration.

  3. Catches what RAGAS misses: the two frameworks agree on generation quality but diverge sharply on retrieval assessment, revealing hidden failure modes.

  4. Reference-free evaluation: no gold-standard answers required, making it practical for production systems where ground truth is expensive.

  5. Evaluator stability measurement: an explicit calibration metric detects when your LLM judge is unreliable across different prompts.

  6. Production-ready interfaces: Python API, CLI tools, and Streamlit UI with only 3.4% overhead versus RAGAS.

TL;DR
  1. The Problem. Your internal AI assistant searches company documents and gives confident answers. But how do you know it is not inventing details? Current testing tools say "85% accurate" when the answer contains fabricated facts.

  2. The Solution. RAGVue breaks each answer into individual claims and verifies them against the source documents. Instead of "faithfulness: 0.7" you get "claim X has no evidential basis in the retrieved context."

  3. The Results. RAGVue catches hallucinations that aggregate scoring misses, runs at comparable speed (3.4% overhead), and requires no golden datasets. Deploy it on production logs today.

Research Overview

Your company deployed an AI assistant that searches internal documents. An employee asks: "When did the Acme acquisition close?" The system finds the relevant contract, which says the deal was "finalized in early December." The AI responds: "The Acme acquisition closed on December 5th, 2024."

December 5th sounds specific. Professional. Correct.

It is completely invented.

What is RAG?

RAG (Retrieval-Augmented Generation) is how modern AI assistants answer questions about your data. The system first searches your documents for relevant passages, then generates an answer based on what it found. Enterprise tools like Microsoft Copilot, internal knowledge bases, and customer support bots all use this pattern.

This is the hidden problem with AI systems that search your documents. They sound authoritative even when they fabricate details. Current testing tools miss these errors because they check whether the answer sounds similar to the source, not whether each fact is actually supported.

RAGVue takes a different approach. Instead of asking "does this sound right?", it asks "can we prove each claim?"

It breaks the answer into pieces:

  • "The Acme acquisition closed" — ✓ Supported by context
  • "in December" — ✓ Matches "early December" in source
  • "on the 5th, 2024" — ✗ No specific date in retrieved documents

One invented detail. Traditional testing said 85% accurate. RAGVue caught the fabrication and told you exactly what was wrong.

Think of RAGVue as a forensic audit for your AI's reasoning. With the rise of AI assistants in legal, financial, and healthcare settings, "checking the work" has become the central challenge. RAGVue is the external auditor that ensures your AI system is not just confident, but correct.

Why this matters now

RAG systems are moving into production at scale: customer support bots, legal research tools, financial analysis assistants, medical information systems. These applications need more than pass/fail testing. They need debugging intelligence that helps teams systematically improve retrieval strategies, prompt engineering, and model selection.

The enterprise blocker is gone. The biggest obstacle to deploying RAG evaluation in production has been the lack of "golden datasets" with human-verified answers. Creating test sets requires expensive expert annotation. RAGVue is fully reference-free. It requires zero human labeling. You can deploy it on your live production logs today without asking your domain experts to write thousands of test answers.

The research community has recognized the evaluation gap. Papers on RAG metrics have proliferated, but most propose new scores without addressing the fundamental opacity problem. RAGVue is one of the first frameworks to prioritize explainability alongside accuracy.

The Evaluation Gap

Existing RAG metrics suffer from three structural problems that create what we call the Silent Failure Crisis: your system passes evaluation while actively misinforming users.

Component confusion. A single "faithfulness" score blends retrieval failures, reasoning errors, and generation hallucinations. If retrieval returns irrelevant chunks, the model cannot generate faithful answers. But the faithfulness metric penalizes both scenarios equally, without distinguishing the root cause. Worse, when RAGAS gives a hallucination a "passing grade" (as in the December 5th example below), it hides business risk rather than exposing it.

Semantic permissiveness. Most frameworks use semantic similarity for faithfulness checking. If the answer "paraphrases" the source, it passes. But semantic similarity misses entity errors, temporal mismatches, and numerical inaccuracies. The answer might be semantically similar to the source while being factually wrong.
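To see the permissiveness concretely, here is a quick sketch with sentence-transformers (an illustration, not part of RAGVue; the exact value depends on the embedding model, but it will typically be high despite the invented date):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
source = "The deal was finalized in early December."
answer = "The Acme acquisition closed on December 5th, 2024."
embeddings = model.encode([source, answer])

# Cosine similarity stays high even though "December 5th" appears nowhere in the source
print(util.cos_sim(embeddings[0], embeddings[1]).item())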

Evaluator instability. LLM-as-judge approaches vary wildly based on prompt phrasing, model choice, and temperature settings. A score of 0.7 from one configuration might be 0.5 from another. Without explicit reliability measures, you cannot trust the evaluation itself.

What is LLM-as-judge?

Instead of human evaluators or rule-based metrics, LLM-as-judge uses a language model to assess answer quality. The model receives the question, answer, and context, then scores aspects like faithfulness and relevance. This approach scales better than human evaluation but introduces its own biases and instabilities.
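A minimal sketch of the pattern, using a generic judge prompt against the OpenAI API (not RAGVue's internal prompt):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_faithfulness(question: str, answer: str, context: str) -> str:
    """Ask an LLM judge to rate how well the answer is supported by the context."""
    prompt = (
        "Rate how well the answer is supported by the context, from 0 to 1.\n"
        f"Question: {question}\n"
        f"Context: {context}\n"
        f"Answer: {answer}\n"
        "Reply with only the number."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content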

Consider this concrete example. A user asks: "When did the merger close?"

The retrieved context says: "The acquisition was finalized in early December following regulatory approval."

The RAG system answers: "The merger closed on December 5th."

RAGAS result: Faithfulness: 0.85 (passes). The answer is semantically similar to the source. Both mention December, both mention the merger closing. The embedding similarity is high.

RAGVue result: Strict Faithfulness: 0.0 (fails).

  • Claim: "December 5th"
  • Status: UNSUPPORTED
  • Reason: "Source says 'early December' but no specific date. The date '5th' has no evidential basis."
  • Fix hint: "Remove specific date or retrieve documents with exact closing date."

This is the difference. RAGAS sees semantic overlap. RAGVue sees that the model invented a specific date that does not exist in the source. In a legal or financial context, this fabricated precision could be actionable misinformation.

The cost of false precision: In a contract review, this difference is not just a metric—it is a lawsuit. Semantic similarity says "close enough." Strict faithfulness says "breach of contract." When your RAG system invents the specific date a merger closed, and that date ends up in a legal filing, "the embeddings were similar" is not a defense.

[Figure: RAGVue's seven diagnostic metrics, three dimensions of evaluation with structured explanations]

Seven Diagnostic Metrics

RAGVue organizes evaluation into three dimensions with seven specific metrics.

Retrieval dimension

Retrieval Relevance scores each retrieved chunk individually. Instead of averaging relevance across all chunks, it identifies which specific chunks are useful and which are noise. The formula divides relevant chunks by total chunks, with a relevance threshold of 0.7.

Retrieval Coverage measures whether retrieved content addresses all aspects of the question. A question about "revenue growth in Q3 and Q4" requires chunks covering both quarters. Coverage identifies missing aspects, not just overall relevance.
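As a sketch of the relevance formula above (the per-chunk scores here are hypothetical; in RAGVue they come from the LLM judge):

def retrieval_relevance(chunk_scores: list[float], threshold: float = 0.7) -> float:
    """Fraction of retrieved chunks whose judged relevance clears the threshold."""
    if not chunk_scores:
        return 0.0
    relevant = sum(1 for score in chunk_scores if score >= threshold)
    return relevant / len(chunk_scores)

# Three chunks judged at 0.9, 0.8, and 0.3: two are relevant, one is noise
print(retrieval_relevance([0.9, 0.8, 0.3]))  # ≈ 0.67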

Answer quality dimension

Clarity assesses linguistic quality: grammar, fluency, logical flow, and readability. This catches answers that are technically correct but poorly structured or confusing.

Answer Relevance checks alignment with the question's intent. An answer might be factually accurate but miss what the user actually asked. This metric identifies off-topic or incomplete responses.

Answer Completeness measures coverage of question aspects. If the question has three parts, the answer should address all three. Completeness calculates aspects covered divided by total aspects.

Grounding dimension

Strict Faithfulness is the key differentiator. Unlike semantic faithfulness, it decomposes answers into individual claims and verifies each against the retrieved context with exact entity and temporal matching. The formula: supported claims divided by (supported + hallucinated claims).

Think of Strict Faithfulness as a forensic lab examiner. When a suspect (the answer) is brought in, the examiner does not just look at the whole person; they isolate each fingerprint, DNA strand, and shoe-print (each claim) and compare them against the evidence locker (the retrieved chunks). If any trace does not match, the examiner flags that specific piece as contaminated. This is why RAGVue catches the invented "December 5th" date that semantic similarity misses.

Why claim-level verification matters

An answer might contain five claims. Four are supported by evidence; one is hallucinated. Aggregate faithfulness scores average these together, hiding the specific hallucination. Claim-level verification identifies exactly which statement lacks support, enabling targeted fixes.
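As a sketch of the formula, applied to the five-claim example above (the per-claim verdicts are hypothetical; RAGVue derives them from the judge):

def strict_faithfulness(claim_supported: list[bool]) -> float:
    """Supported claims divided by (supported + hallucinated claims)."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)

# Five claims, four supported and one hallucinated: the score is 0.8,
# and the per-claim verdicts point at exactly which statement failed
print(strict_faithfulness([True, True, True, True, False]))  # 0.8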

Generic Calibration measures evaluator stability. It runs the same evaluation across multiple judge configurations (different models, temperatures) and computes agreement. High variance signals unreliable evaluation. Formula: 1 minus (max score minus min score).

The Second Opinion Problem

If three doctors give three different diagnoses, you do not trust any of them. RAGVue's Calibration metric automates this second opinion: it queries multiple LLM judges (GPT-4o, Claude, Llama) and measures how much they agree. If GPT-4o says faithfulness is 0.9 but Claude says 0.4, the calibration score drops to 0.5, warning you that the evaluation itself is unreliable. Fix the evaluator before trusting the scores.
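The calibration formula, applied to that second opinion (judge scores are the illustrative ones above):

def generic_calibration(judge_scores: list[float]) -> float:
    """1 minus (max score minus min score); 1.0 means the judges fully agree."""
    return 1.0 - (max(judge_scores) - min(judge_scores))

# GPT-4o rates faithfulness 0.9, Claude rates it 0.4: calibration drops to 0.5
print(generic_calibration([0.9, 0.4]))  # 0.5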

These seven metrics form the diagnostic foundation. But RAGVue also previews where evaluation is heading.

Agentic Mode: The Future of Auto-Evaluation

RAGVue includes an experimental "Agentic Mode" that previews where evaluation is heading: an autonomous quality assurance agent that adapts its scrutiny level based on complexity. For simple factoid questions, it runs lightweight checks. For multi-hop reasoning, it enables full claim decomposition. If context is missing, it skips retrieval metrics entirely. This is not just metric selection—it is an evaluator that thinks about how to evaluate before evaluating.
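As an illustration of the adaptive idea only (this is not RAGVue's agent; the metric identifiers follow the names used elsewhere in this article, and the complexity check is a crude stand-in):

def is_multi_hop(question: str) -> bool:
    """Crude stand-in for a complexity check; a real agent would use an LLM."""
    return " and " in question or question.count("?") > 1

def select_metrics(item: dict) -> list[str]:
    """Pick a metric set based on what the evaluation item actually contains."""
    metrics = ["clarity", "answer_relevance"]
    if item.get("context"):
        # Only grade retrieval and grounding when retrieved chunks are present
        metrics += ["retrieval_relevance", "retrieval_coverage", "faithfulness"]
    if is_multi_hop(item["question"]):
        metrics += ["answer_completeness"]
    return metrics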

How RAGVue Differs from RAGAS

RAGAS is the most widely used RAG evaluation framework. Understanding how RAGVue differs clarifies its value.

RAGAS

RAGAS (Retrieval-Augmented Generation Assessment) scores RAG pipelines using aggregate, semantic-similarity-based measures like faithfulness, relevance, and answer quality. It provides a single numeric score per metric but does not break down failures by individual chunk or claim. Think of it as a thermometer: it tells you the temperature but not what is causing the fever.

If RAGAS is a general practitioner doing a checkup, RAGVue is an MRI scan looking for the root cause. Both have their place, but you do not debug production failures with a checkup.

Aspect | RAGAS | RAGVue
Metric granularity | Aggregate scores | Per-claim, per-chunk
Retrieval assessment | Combined context | Chunk + coverage
Faithfulness | Semantic similarity | Entity/temporal match
Explanations | Minimal | Structured diagnostics
Evaluator reliability | Implicit | Explicit calibration
Reference requirement | Some metrics need gold | Fully reference-free
Best for | CI/CD pass/fail gates | Root cause analysis

[Figure: RAGAS vs RAGVue, where they agree and diverge: strong agreement on generation, but retrieval metrics show different signals]

The correlation analysis reveals a critical finding. RAGVue and RAGAS show strong agreement (0.64-0.96 correlation) on generation-focused metrics. Both frameworks similarly assess answer quality. But on retrieval metrics, correlation drops to 0.006-0.708.

This divergence matters. RAGAS often conflates insufficient retrieval with unsupported reasoning. When retrieval fails, RAGAS penalizes faithfulness even though the generation component performed correctly given its inputs. RAGVue separates these concerns, enabling teams to fix the actual problem.

Benchmark Results

The paper evaluates RAGVue on 100 queries from StrategyQA, a multi-hop reasoning dataset.

StrategyQA and Multi-hop Reasoning

StrategyQA is a benchmark of open-domain questions that require chaining together multiple facts to arrive at an answer. For example: "Did the founder of Tesla come from a country in the Southern Hemisphere?" requires knowing (1) who founded Tesla, (2) where they were born, and (3) whether that country is in the Southern Hemisphere. This "multi-hop" retrieval stress-tests RAG systems because missing any one piece causes failure, making it ideal for testing claim-level faithfulness.

[Figure: StrategyQA benchmark, RAGAS vs RAGVue: RAGVue's strict faithfulness catches more errors (lower score = stricter)]

Speed comparison

RAGVue adds minimal overhead versus RAGAS:

  • RAGVue: 18.87 seconds per query
  • RAGAS: 18.26 seconds per query
  • Overhead: 3.4%

This negligible difference means teams can switch frameworks without performance concerns.

Metric comparison

Framework | Metric | Mean | Std Dev
RAGAS | Faithfulness | 0.521 | 0.403
RAGVue | Strict Faithfulness | 0.400 | 0.492
RAGAS | Answer Relevancy | 0.240 | 0.307
RAGVue | Answer Relevance | 0.372 | 0.255
RAGVue | Clarity | 0.698 | 0.100

RAGVue's strict faithfulness scores lower than RAGAS because it catches errors that semantic matching misses. The higher variance reflects the binary nature of claim verification: a claim is either supported or it is not.

Qualitative analysis

The paper analyzes three example queries in detail. In each case, RAGAS provides mid-range scores without diagnostic clarity. RAGVue's structured signals identify whether failures originated from retrieval gaps, generation errors, or unsupported reasoning. This actionability is the framework's core contribution.

Implementation Blueprint

RAGVue provides three deployment options: Python API, CLI, and Streamlit interface. This section walks through production-ready integration patterns.

The paper's evaluation pipeline reveals the components that actually work together.

Component | Recommended | Notes
LLM Judge | GPT-4o | Best accuracy/cost
Fallback Judge | Claude 3.5 | For calibration
Data Format | JSONL | Streaming-friendly
Async Processing | asyncio | For batch eval

Core data structures

RAGVue expects a specific input format. Getting this right avoids debugging headaches later.

# The evaluation item structure
# Each item represents one RAG interaction
eval_item = {
    "question": str,      # User query
    "answer": str,        # RAG system response
    "context": list[str]  # Retrieved chunks
}

The context field is critical. RAGVue evaluates each chunk individually for relevance, then checks whether claims in the answer trace back to specific chunks. If your RAG system does not preserve chunk boundaries, you lose the granular diagnostics.

# Good: Preserves chunk boundaries
context = [
    "Q3 revenue reached $4.2M, up 23%...",
    "The board approved the dividend...",
    "Market conditions remained stable..."
]
 
# Bad: Concatenated into single string
context = ["Q3 revenue reached $4.2M... " +
           "board approved... market stable..."]

Installation and setup

pip install ragvue
 
# Set your LLM API key
export OPENAI_API_KEY="sk-..."

RAGVue uses OpenAI by default. For other providers, configure the judge model explicitly.

from ragvue import RAGVueConfig, evaluate
 
config = RAGVueConfig(
    judge_model="gpt-4o",
    temperature=0.0,  # Deterministic scoring
    max_retries=3
)

Quick start for skeptics: You can run RAGVue on your last 10 bad responses in under 5 minutes. No integration required—just paste the JSON. Export a few failed examples from your logs, format them as {question, answer, context} objects, and run evaluate(). If the diagnostics do not immediately tell you something you did not know, uninstall it.

Running your first evaluation

Start with a small batch to verify your data format works correctly.

from ragvue import evaluate, load_metrics
 
# Load test data
items = [
    {
        "question": "What was Q3 revenue?",
        "answer": "Q3 revenue was $4.2M, up 23%.",
        "context": [
            "Q3 2024 revenue: $4.2M (+23% YoY)",
            "Operating margin improved to 12%"
        ]
    }
]
 
# Run all seven metrics
metrics = list(load_metrics().keys())
report = evaluate(items, metrics=metrics)
 
# Inspect the diagnostic output
print(report.summary())

The summary shows aggregate scores, but the real value is in the per-item diagnostics.

Interpreting diagnostic output

RAGVue returns structured diagnostics for each evaluation item. Here is how to extract actionable insights.

for item_report in report.items:
    print(f"Question: {item_report.question}")
 
    # Retrieval diagnostics
    retrieval = item_report.retrieval
    print(f"  Relevant chunks: {retrieval.relevant_count}/{retrieval.total_count}")
    for chunk in retrieval.chunks:
        status = "relevant" if chunk.relevant else "noise"
        print(f"    [{status}] {chunk.text[:50]}...")
 
    # Faithfulness diagnostics (the key insight)
    faith = item_report.faithfulness
    print(f"  Claims: {faith.supported}/{faith.total}")
    for claim in faith.claims:
        if not claim.supported:
            print(f"    UNSUPPORTED: {claim.text}")
            print(f"    Reason: {claim.reason}")
            print(f"    Missing: {claim.missing_evidence}")

This is the debugging power of RAGVue. Instead of "faithfulness: 0.67", you see exactly which claim failed and why. Maybe the answer mentioned a date that does not appear in any retrieved chunk. Maybe an entity name was slightly different. The diagnostic tells you what to fix.

The debugging workflow

When your RAG system produces wrong answers, follow this diagnostic sequence.

Step 1: Check retrieval first

Most RAG failures are retrieval failures. If the right information was not retrieved, the model cannot generate a faithful answer.

def diagnose_failure(report):
    item = report.items[0]
 
    # Is this a retrieval problem?
    if item.retrieval.coverage < 0.5:
        print("RETRIEVAL ISSUE: Missing aspects")
        print(f"  Covered: {item.retrieval.covered_aspects}")
        print(f"  Missing: {item.retrieval.missing_aspects}")
        return "retrieval"
 
    # Or a generation problem?
    if item.faithfulness.score < 0.7:
        print("FAITHFULNESS ISSUE: Unsupported claims")
        for claim in item.faithfulness.unsupported_claims:
            print(f"  Claim: {claim.text}")
            print(f"  Evidence gap: {claim.reason}")
        return "generation"
 
    return "ok"

Step 2: Trace unsupported claims

For each unsupported claim, RAGVue explains what evidence is missing. Use this to improve your retrieval or prompt.

# Example diagnostic output
# Claim: "Revenue grew 23% in Q3"
# Reason: "Retrieved context mentions $4.2M but
#          no percentage growth figure found"
# Fix hint: "Add financial metrics to retrieval"

Step 3: Check evaluator stability

If scores vary across runs, your evaluation itself may be unreliable.

# Run calibration check
calibration = report.calibration
if calibration.variance > 0.15:
    print("WARNING: High evaluator variance")
    print(f"  Score range: {calibration.min}-{calibration.max}")
    print("  Consider: different judge model or lower temperature")

Batch evaluation for CI/CD

For production integration, evaluate against a golden test set on every deployment.

import json
from ragvue import evaluate, load_metrics
 
def run_evaluation_pipeline(test_file: str):
    """CI/CD evaluation gate."""
 
    # Load test cases
    with open(test_file) as f:
        items = [json.loads(line) for line in f]
 
    # Run evaluation
    report = evaluate(items, metrics=list(load_metrics().keys()))
 
    # Define quality gates
    gates = {
        "faithfulness": 0.7,
        "retrieval_relevance": 0.6,
        "answer_completeness": 0.8
    }
 
    # Check each gate
    failures = []
    for metric, threshold in gates.items():
        score = report.aggregate[metric]
        if score < threshold:
            failures.append(f"{metric}: {score:.2f} < {threshold}")
 
    if failures:
        print("DEPLOYMENT BLOCKED")
        for f in failures:
            print(f"  {f}")
        return False
 
    print("All quality gates passed")
    return True

CLI for batch processing

For large-scale evaluation without writing code, use the CLI tools.

# Basic batch evaluation
ragvue-cli --input test_data.jsonl \
           --metrics all \
           --output report.json
 
# Specific metrics only
ragvue-cli --input test_data.jsonl \
           --metrics faithfulness,retrieval_relevance \
           --output report.json
 
# With custom judge model
ragvue-cli --input test_data.jsonl \
           --judge-model claude-3-5-sonnet \
           --output report.json

Streamlit interface for exploration

During development, the visual interface helps you understand metric behavior.

ragvue-ui
# Opens browser at localhost:8501

The Streamlit UI lets you:

  • Upload individual test cases
  • See per-claim faithfulness breakdowns visually
  • Compare different judge models side-by-side
  • Export diagnostic reports

Key parameters to tune

Parameter | Default | When to change
Relevance threshold | 0.7 | Lower (0.5) for exploratory RAG, higher (0.85) for precision-critical
Calibration models | 3 | Increase to 5 for high-stakes applications
Claim granularity | sentence | Use "clause" for complex legal/medical text
Entity matching | fuzzy | Use "strict" when exact names matter
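A sketch of how these knobs might map onto the config object shown earlier; the keyword names below are assumptions inferred from the table, so check the RAGVue documentation for the exact spelling.

from ragvue import RAGVueConfig

# Parameter names are illustrative guesses derived from the table above
config = RAGVueConfig(
    judge_model="gpt-4o",
    temperature=0.0,
    relevance_threshold=0.85,    # precision-critical retrieval
    calibration_models=5,        # high-stakes application
    claim_granularity="clause",  # complex legal or medical text
    entity_matching="strict",    # exact names matter
)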

Integration with RAGAS

RAGVue complements RAGAS rather than replacing it. Use both for different purposes.

from ragas import evaluate as ragas_eval
from ragvue import evaluate as ragvue_eval
 
def hybrid_evaluation(items):
    """RAGAS for quick scores, RAGVue for debugging."""
 
    # Fast aggregate check
    ragas_result = ragas_eval(items)
 
    if ragas_result['faithfulness'] >= 0.8:
        return {"status": "pass", "detail": None}
 
    # Detailed diagnosis for failures
    ragvue_result = ragvue_eval(items)
 
    failures = []
    for item in ragvue_result.items:
        if item.faithfulness.score < 0.7:
            failures.append({
                "question": item.question,
                "unsupported": [c.text for c in item.faithfulness.unsupported_claims],
                "fix_hints": [c.reason for c in item.faithfulness.unsupported_claims]
            })
 
    return {"status": "fail", "detail": failures}

Building a self-healing RAG pipeline

Detecting errors is good. Fixing them automatically is better. The holy grail of production RAG is a self-healing system that catches its own failures and recovers before the user notices. Use RAGVue diagnostics to trigger corrective actions before showing answers to users.

async def self_correcting_rag(query: str, max_retries: int = 2):
    """RAG with automatic retry on low coverage."""
 
    for attempt in range(max_retries + 1):
        # Standard RAG flow; retrieve(), generate(), and expand_query() below are
        # placeholders for your own pipeline functions
        chunks = retrieve(query)
        answer = generate(query, chunks)
 
        # Evaluate before returning
        report = ragvue_eval([{
            "question": query,
            "answer": answer,
            "context": chunks
        }])
 
        item = report.items[0]
 
        # Check retrieval coverage
        if item.retrieval.coverage >= 0.7:
            return answer  # Good enough
 
        if attempt < max_retries:
            # Expand query and retry
            expanded = expand_query(query, item.retrieval.missing_aspects)
            query = expanded
            continue
 
        # Final attempt: return with warning
        return f"{answer}\n\n[Note: Some aspects of your question may not be fully covered.]"

This pattern catches retrieval failures before they become user-facing hallucinations. Instead of logging "coverage: 0.4" and moving on, the system actively tries to fix the problem.

Observability integration

Python print statements do not scale. Pipe RAGVue metrics to your observability stack to track "truthfulness drift" over time.

import json
from datadog import statsd  # or prometheus_client
 
def log_evaluation_metrics(report, tags: list[str]):
    """Send RAGVue metrics to an observability platform; tags like ["env:prod", "service:rag"]."""
 
    for item in report.items:
        # Core metrics as gauges
        statsd.gauge('ragvue.faithfulness',
                     item.faithfulness.score,
                     tags=tags)
        statsd.gauge('ragvue.retrieval_coverage',
                     item.retrieval.coverage,
                     tags=tags)
 
        # Track unsupported claims as events
        for claim in item.faithfulness.unsupported_claims:
            statsd.event(
                title='Unsupported Claim Detected',
                text=json.dumps({
                    'claim': claim.text,
                    'reason': claim.reason
                }),
                tags=tags,
                alert_type='warning'
            )
 
        # Alert on calibration drift
        if item.calibration.variance > 0.2:
            statsd.event(
                title='High Evaluator Variance',
                text=f'Variance: {item.calibration.variance}',
                tags=tags,
                alert_type='error'
            )

Track unsupported_claim_count over time. A sudden spike means your retrieval index is stale or your prompt drifted. Catch regressions before users report them.
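A small companion to the logging function above that emits that count directly (a sketch assuming the same Datadog-style client and report structure):

from datadog import statsd

def track_unsupported_claims(report, tags: list[str]):
    """Emit the per-item unsupported claim count so dashboards can plot truthfulness drift."""
    for item in report.items:
        statsd.gauge(
            'ragvue.unsupported_claim_count',
            len(item.faithfulness.unsupported_claims),
            tags=tags,
        )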

Common pitfalls

Pitfall 1: Concatenated context

If you pass context as a single string, RAGVue cannot identify which chunk supports which claim. Always preserve chunk boundaries.

Pitfall 2: Inconsistent judge models

Switching judge models between runs makes scores incomparable. Lock your judge model version in production configs.

Pitfall 3: Ignoring calibration warnings

High calibration variance means your scores are noise. Do not ship based on unstable evaluations.

Pitfall 4: Over-relying on aggregate scores

A faithfulness score of 0.7 could mean "all claims 70% supported" or "70% of claims fully supported, 30% hallucinated." The per-claim diagnostics tell you which. Always drill into item-level reports for failures.

Limitations

This is an offline tool

RAGVue takes approximately 18 seconds per query. This is not suitable for real-time evaluation in the user request loop. Do not block user responses waiting for evaluation to complete.

Recommended deployment patterns:

  • Sampling: Evaluate 10% of production traffic asynchronously
  • Nightly batch: Run comprehensive evaluation on daily logs
  • Pre-deployment: Gate releases on test set evaluation
  • Post-incident: Deep-dive specific failure cases

For real-time quality signals, use lightweight heuristics (response length, retrieval score) and reserve RAGVue for thorough offline analysis.
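A minimal sketch of the sampling pattern, assuming each interaction is logged as a {question, answer, context} dict and the evaluation runs in a background job:

import random
from ragvue import evaluate, load_metrics

def sample_for_evaluation(interaction: dict, queue: list, sample_rate: float = 0.10) -> None:
    """Queue roughly 10% of production traffic for asynchronous RAGVue evaluation."""
    if random.random() < sample_rate:
        queue.append(interaction)

def run_offline_evaluation(queue: list):
    """Run from a background worker or nightly job, never inside the user request path."""
    if not queue:
        return None
    return evaluate(queue, metrics=list(load_metrics().keys()))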

LLM judge dependency

All metrics rely on LLM-as-judge. Model selection significantly affects scores. The calibration metric helps detect instability but does not eliminate it. Teams should benchmark judge models on domain-specific examples before trusting scores.
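One rough way to run such a benchmark; note that passing the config into evaluate() is an assumption about the API, so adapt the call to however your version selects the judge model:

from ragvue import RAGVueConfig, evaluate, load_metrics

def benchmark_judges(domain_items: list[dict], judge_models: list[str]) -> dict:
    """Score the same domain-specific examples with different judge models and compare."""
    results = {}
    for model in judge_models:
        config = RAGVueConfig(judge_model=model, temperature=0.0)
        # The config keyword below is assumed, not confirmed by the RAGVue docs
        report = evaluate(domain_items, metrics=list(load_metrics().keys()), config=config)
        results[model] = report.aggregate
    return results

# Example: benchmark_judges(domain_examples, ["gpt-4o", "claude-3-5-sonnet"])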

Seven metrics only

The current release covers seven metrics. Complex failure modes (multi-hop reasoning errors, implicit contradictions) may require additional metrics not yet implemented.

Agentic mode limitations

The automatic metric selection mode works for standard scenarios but may miss edge cases. Manual metric selection remains more reliable for production evaluation.

Domain adaptation

Default prompts assume general-domain text. Legal, medical, or technical domains may need prompt customization for accurate claim decomposition and entity matching.

Business Implications

The cost of not evaluating

The cost of roughly 18 seconds per query sounds expensive until you consider the alternative. A hallucinated answer in a legal research tool could expose your company to malpractice claims. A fabricated statistic from a financial analysis assistant could trigger compliance violations. A wrong medication interaction in a medical chatbot could harm patients.

RAGVue acts as an automated compliance auditor. It catches "silent failures" where an agent sounds confident but is factually wrong. These are the failures that pass human spot-checks because the answer reads well. Only systematic claim-level verification catches the invented date, the misattributed quote, the fabricated percentage.

The ROI of truth

Consider a customer support RAG system handling 10,000 queries per day. At 18 seconds per evaluation, you cannot evaluate everything in real-time. But you can:

  • Sample 10%: 1,000 evaluations/day = 5 hours of compute
  • Catch 2% hallucination rate: 20 hallucinated answers identified daily
  • Cost of one bad answer: Customer escalation, potential churn, brand damage

The math is stark:

Scenario | Daily Cost
Without RAGVue | 20 hallucinated answers slip through → 20 × $500 = $10,000 in escalations, churn, and brand damage
With RAGVue (10% sampling) | 5 hours compute (~$2) + catches all 20 hallucinations → $2

If preventing one escalation per day saves $500 in support costs and reputation, the evaluation compute pays for itself immediately. The diagnostic output also feeds back into retrieval and prompt improvements, compounding returns over time.

RAGVue costs pennies to run. The hallucination it catches costs your reputation. This is the cheapest insurance policy you will ever buy for your AI.

Regulatory and compliance value

For regulated industries (healthcare, finance, legal), RAGVue provides audit trails. Every claim in every answer can be traced to supporting evidence or flagged as unsupported. This is the documentation regulators want to see when they ask: "How do you ensure your AI system is not making things up?"

The Bottom Line

For ML engineers: RAGVue is not just another metric; it is the missing "Diagnostic Layer" in the modern AI stack. If you have spent hours staring at faithfulness scores trying to figure out what went wrong, you have been flying blind. This framework tells you exactly which claim failed and why, and the Python API integrates cleanly into existing pipelines.

For engineering managers: Evaluation is often the bottleneck in RAG quality improvement. Teams cannot fix what they cannot diagnose. RAGVue reduces debugging cycles by converting opaque scores into actionable signals. The 3.4% overhead is negligible compared to the time saved. The ROI is immediate and measurable.

For researchers: The claim-level faithfulness approach and explicit calibration metric represent meaningful advances over aggregate scoring. The correlation analysis with RAGAS reveals systematic blind spots in current evaluation practices worth investigating further.


Original paper: arXiv (PDF, HTML)

Code: GitHub

Authors: Keerthana Murugaraj, Salima Lamsiyah, Martin Theobald (University of Luxembourg)


Cite this paper

Keerthana Murugaraj, Salima Lamsiyah, Martin Theobald (2025). RAGVue: Diagnostic Evaluation That Tells You Why Your RAG System Fails. arXiv 2025.
