arXiv 2026 · January 8, 2026

CIEA: How to Search for Visual Details Your Captions Miss

Delong Zeng et al.

CIEA addresses a blind spot in multimodal retrieval: existing methods capture information shared between images and captions, but ignore unique visual content. The approach calculates patch-level distances between image regions and text tokens, identifying dissimilar patches and upweighting them via attention. A dual contrastive loss aligns standard query-document pairs while forcing image embeddings to match query portions absent from document text. On WebQA-Multi (1.17M documents), WebQA-Image (390K image documents), and EDIS (1M image-text pairs), CIEA outperforms all baselines including MARVEL, the previous state-of-the-art.

Categories: Information Retrieval, Computer Vision, Multimodal Learning

Key Findings

  1. Focuses on what captions miss: instead of matching images to their text descriptions, CIEA extracts the visual details that captions forget to mention.

  2. 66.16 MRR@10 on WebQA-Multi (1.1 million documents), beating the previous best method by learning from overlooked image content.

  3. Finds dissimilar patches: calculates which parts of an image differ most from the text, then amplifies those parts in search results.

  4. Two-part training: one loss teaches standard query-document matching while a second loss forces images to encode details missing from captions.

  5. Works with different language models: tested with T5, BERT, BART, and GPT-2, showing consistent improvements across architectures.

  6. Minimal extra cost: training takes 6.5 minutes versus 6.1 minutes for the baseline, a 7% overhead for meaningful accuracy gains.

TL;DR
  1. The Problem. Your image search finds pictures that match captions. But captions are lazy summaries. A photo of "a park bench" might show flowers, birds, and a clock tower in the background, and your search engine has no idea any of that exists.

  2. The Solution. CIEA calculates which image patches differ most from the accompanying text, then amplifies those patches in the final embedding. Instead of learning what images and text share, it learns what images add.

  3. The Results. 66.16 MRR@10 on WebQA-Multi with 1.1 million documents, beating the previous best by 0.73 points. Training overhead is only 7% (6.5 minutes vs 6.1 minutes). Code available now.

Research Overview

If you have ever built an image search feature, you know the frustration: a user searches for "cafe with a view of the Eiffel Tower," and your system finds an image with the caption "Outdoor seating at a Parisian cafe." Good match for "cafe" and "Parisian." But does the image actually show the Eiffel Tower? The caption does not say.

Standard multimodal retrieval cannot help you here. It learned to match queries to captions. If the caption does not mention the tower, the embedding does not encode it.

What is Multimodal Retrieval?

Multimodal retrieval finds documents containing both text and images in response to text queries. Unlike image-only search (which uses visual similarity) or text-only search (which ignores images), multimodal retrieval creates unified embeddings that represent both modalities. E-commerce product search, visual Q&A, and image archives all rely on this capability.

CIEA (Complementary Information Extraction and Alignment) takes a different approach. Instead of learning what images and captions have in common, it learns what images contain that captions leave out.

The Core Insight

Captions are summaries, not inventories. A photo of a park might show a bench, trees, a pond, ducks, joggers, a clock tower, fallen leaves, and a dog. The caption says "park bench." Everything else is complementary information that standard methods throw away.

CIEA identifies image patches that differ from the text, upweights them in the attention mechanism, and trains them to match the parts of queries that captions miss. When someone searches for "park with a clock tower," CIEA surfaces the right image even though "clock tower" never appeared in the caption.

Why this matters now

Multimodal RAG is becoming standard in production systems. If your retrieval only matches queries to captions, your system inherits every caption writer's oversight. CIEA fixes this at the embedding level without requiring re-annotation.

The training cost is negligible. CIEA adds about 6.5% training time versus projection-only baselines (6.5 minutes vs 6.1 minutes on WebQA-Multi). On a single NVIDIA A100 40GB GPU with batch size 64, this translates to roughly 0.4 extra minutes per epoch, or about 16 extra GPU-minutes over a full 40-epoch run. The architecture adds one attention layer and one contrastive loss term. You can swap it into existing multimodal pipelines without infrastructure changes.

The Caption Blind Spot

Multimodal retrieval pipelines typically work like this:

  1. Encode images with a vision model (CLIP, ViT)
  2. Project visual features into the text model's embedding space
  3. Concatenate or fuse with text embeddings
  4. Train with contrastive loss to match queries to relevant documents

Standard vs CIEA Pipeline

CIEA adds a complementary extractor branch between projection and fusion

The problem lies in steps 2 and 3. When you project and fuse, you are learning alignment: how to make the image embedding similar to the text embedding. This works when queries ask about captioned content. It fails when queries ask about visual details the caption omits.

Projection-based Fusion

Most multimodal retrievers use a linear projection to map CLIP's visual features into the language model's embedding space. This lets the transformer process image patches as if they were text tokens. MARVEL, UniVL-DR, and similar systems all use this approach. The limitation: projection optimizes for similarity to text, not for capturing unique visual content.

Picture the projection step as a translator who renders a poem word-for-word into another language, preserving the rhyme but losing the subtle imagery. The complementary extractor is a bilingual illustrator who, after the translation, sketches the missing scenes that the poem never described, adding the colors and shapes that the translator ignored.

Before and after: A concrete example

Consider searching for a specific product detail:

            Standard Retrieval                      CIEA Retrieval
Query       "Modern chair with wooden legs"         "Modern chair with wooden legs"
Caption     "Sleek modern chair"                    "Sleek modern chair"
Result      Returns chair (misses leg material)     Returns chair with wooden legs visible
Why         "Wooden legs" not in caption            Visual patches of legs upweighted

The paper shows several such examples. CIEA extracts terms like "green," "leaf," "clock," and "flowers" from images when those terms do not appear in captions. MARVEL (the previous state-of-the-art) does not.

How CIEA Extracts Complementary Information

CIEA's core innovation is a complementary information extractor that identifies and amplifies image patches differing from text. The process has four steps.

What is a Patch?

Vision transformers divide images into small square regions called patches, typically 16x16 or 32x32 pixels. Each patch is processed as a separate "token," similar to how text models process words. A 224x224 image with 16x16 patches produces 196 patch tokens. CIEA analyzes each patch independently to find which ones contain information missing from the text.
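
To see where the 196 tokens come from, here is a small PyTorch sketch that cuts a 224x224 image into non-overlapping 16x16 patches. It is purely illustrative: CLIP's actual patch embedding is a learned convolution, but the token count is the same.

import torch

image = torch.rand(3, 224, 224)                       # channels, height, width
patches = image.unfold(1, 16, 16).unfold(2, 16, 16)   # (3, 14, 14, 16, 16)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * 16 * 16)
print(patches.shape)                                  # torch.Size([196, 768]): 196 patch tokens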

Step 1: Calculate patch-text dissimilarity

Cosine Similarity

A measure of similarity between two vectors based on the angle between them, ranging from -1 (opposite) to 1 (identical). In CIEA it quantifies how closely an image patch aligns with a text token. A score near 1 means the patch and token represent similar concepts; near 0 means they are unrelated.

For each image patch, CIEA computes cosine similarity to every text token. It then takes the maximum similarity (the closest text token) and inverts it:

dissimilarity[j] = -max_c cosine(patch_j, token_c)
# the max is taken over all caption tokens c

High dissimilarity means no text token closely matches this patch. These are the "complementary" patches containing visual information the text does not describe.

Think of each image patch as a guest at a party and every word in the caption as a conversation partner. The guest wanders from table to table, listening for the most familiar topic. If the guest finds a table where someone speaks their language, they feel at home; if not, they stand awkwardly in the corner, unnoticed. The model acts as a host who shines a spotlight on the unnoticed guests, inviting the crowd to pay attention to what the caption overlooked.

Concrete example: Consider a 224x224 pixel photo of a park bench. After splitting into 16x16 pixel patches (196 total), the patch covering the clock tower in the background has a maximum cosine similarity of 0.12 against the tokens of the caption "park bench," while the patch covering the bench itself scores 0.78. Using the weighting from Step 2, CIEA assigns an attention weight of (1 - 0.12) / 2 = 0.44 to the clock-tower patch and only (1 - 0.78) / 2 = 0.11 to the bench patch. The clock tower now contributes far more to the final embedding, allowing the model to retrieve this image for queries like "park with a clock tower."
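
As a minimal sketch of Step 1 (not the reference implementation), the per-patch dissimilarity can be computed in a few lines of PyTorch; here patches and tokens stand for the projected patch embeddings and the caption token embeddings:

import torch
import torch.nn.functional as F

def patch_dissimilarity(patches, tokens):
    """Negative cosine similarity of each patch to its closest caption token.

    patches: (num_patches, dim) projected image patch embeddings
    tokens:  (num_tokens, dim) caption token embeddings
    Returns (num_patches,) values in [-1, 1]; higher means the patch has no
    close textual counterpart.
    """
    sim = F.normalize(patches, dim=-1) @ F.normalize(tokens, dim=-1).T  # cosine matrix
    return -sim.max(dim=-1).values  # negate the best match per patch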

Step 2: Convert to attention weights

The dissimilarities are normalized to a 0-1 range:

weight[j] = (1 + dissimilarity[j]) / 2

Patches with high dissimilarity get higher weights. Patches that match the text get lower weights.
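
Continuing the sketch from Step 1, the conversion is a one-liner; plugging in the park-bench numbers reproduces the weights quoted above:

def dissimilarity_to_weights(dissim):
    """Map dissimilarity in [-1, 1] to an attention weight in [0, 1]."""
    return (1.0 + dissim) / 2.0

# Park-bench example: max similarities of 0.78 (bench) and 0.12 (clock tower)
# give weights of (1 - 0.78) / 2 = 0.11 and (1 - 0.12) / 2 = 0.44.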

Step 3: Reweight via attention

A self-attention layer processes the image patches, but attention scores are multiplied by the dissimilarity weights. This amplifies complementary patches and dampens redundant ones.

Q, K, V = project(patches)
scores = softmax(
  Q @ K.T * weights / sqrt(d)
)
output = scores @ V
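
A runnable single-head version of this step, as a sketch that assumes the weights scale the pre-softmax scores per key patch, exactly as in the pseudocode above; the reference implementation may use multi-head attention or place the weighting slightly differently:

import math
import torch
import torch.nn as nn

class WeightedPatchAttention(nn.Module):
    """Self-attention over image patches with dissimilarity-weighted scores."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, patches, weights):
        # patches: (num_patches, dim), weights: (num_patches,) in [0, 1]
        q, k, v = self.q(patches), self.k(patches), self.v(patches)
        scores = q @ k.T / math.sqrt(patches.size(-1))  # (P, P) attention scores
        scores = scores * weights.unsqueeze(0)          # amplify complementary key patches
        attn = torch.softmax(scores, dim=-1)
        return attn @ v                                 # reweighted patch representations

Here weights is exactly the per-patch output of Step 2.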

Step 4: Dual contrastive training

Contrastive Loss

A training objective that pushes embeddings of matching query-document pairs together while pulling apart non-matching pairs. It teaches retrieval models to rank relevant items higher. CIEA uses two contrastive losses: one for standard matching, one for complementary information.

Standard contrastive loss aligns queries with relevant documents. CIEA adds a complementary loss that masks query tokens also appearing in the document caption, leaving only the "complementary" query portions.

Here is the masking logic in pseudo-code:

def tokenize(text):
    # Simple whitespace tokenizer for illustration; the real pipeline
    # would use the language model's tokenizer.
    return text.lower().split()

def build_complementary_query(query, caption):
    query_tokens = tokenize(query)
    caption_tokens = set(tokenize(caption))

    # Replace tokens shared with the caption by <mask>, keep the rest
    masked = [
        t if t not in caption_tokens else "<mask>"
        for t in query_tokens
    ]
    return " ".join(masked)

# Example:
# query:   "woman with flowers"
# caption: "woman in garden"
# result:  "<mask> with flowers"

Think of the query and caption as two overlapping maps of a treasure island. The overlapping area marks the well-known landmarks (the shared words). By covering those landmarks with a tarp, the explorer is forced to rely on hidden clues, like a distant lighthouse or a strange rock formation, that only appear on the image. The loss function rewards the explorer for finding those hidden clues, not for retracing the obvious paths.

The intuition: if the query is "woman with flowers" and the caption is "woman in garden," mask out "woman" (shared) and train the image embedding to match "flowers" (complementary). This forces the image embedding to carry information that text does not.

Why Masked Queries?

If we trained images to match full queries, they would just learn to duplicate text. By removing shared terms, we force images to encode the visual details that make them uniquely valuable. This is a data pre-processing step, not a black-box loss function.

Architecture Deep Dive

The full CIEA pipeline has three components that work together to extract and preserve complementary visual information.

1. Query Encoder

Text queries pass through a language model (T5-ANCE by default) to produce query embeddings. This is standard dense retrieval.

2. Document Encoder

Documents contain text and images processed separately before fusion:

  • Text path: Caption goes through the LM embedding layer (but not transformer blocks yet)
  • Image path: Image goes through frozen CLIP visual encoder, then a linear projector to match LM dimensions, then the complementary information extractor

The weighted image patches and text embeddings are concatenated with special <start> and <end> tokens separating them, then passed through the LM transformer blocks for fusion.

3. Complementary Information Extractor

This is the new component sitting between the CLIP projector and the transformer fusion:

  1. Receive projected image patches
  2. Receive text embeddings
  3. Compute patch-token cosine similarities
  4. Extract max similarity per patch, invert to get dissimilarity
  5. Normalize to attention weights
  6. Apply attention layer with weighted scores
  7. Output reweighted patches

The attention layer has learnable Q, K, V projection matrices (standard transformer attention parameters).

Training setup

These are the exact choices the authors used to get the benchmark results.

Parameter         Value
Backbone          T5-ANCE
Visual encoder    CLIP (frozen)
Optimizer         AdamW
Learning rate     5e-6
Batch size        64
Max epochs        40
Temperature       0.01
Temperature (Scaling Factor)

A small constant used to scale similarity scores before applying softmax in contrastive loss. Lower temperatures (like 0.01) make the model more confident, creating sharper distinctions between correct and incorrect pairs. Higher temperatures produce smoother probability distributions.

The lambda hyperparameter balances standard and complementary losses. The paper uses 0.0011 for WebQA-Multi, 0.019 for WebQA-Image, and 0.001 for EDIS. Smaller values on larger datasets prevent the complementary loss from dominating.
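
As a hedged sketch of how the two terms combine (illustrative names, standard in-batch negatives, and the temperature from the table above; not lifted from the released code):

import torch
import torch.nn.functional as F

def info_nce(queries, docs, temperature=0.01):
    """In-batch contrastive loss: the i-th query's positive is the i-th document."""
    q = F.normalize(queries, dim=-1)
    d = F.normalize(docs, dim=-1)
    logits = q @ d.T / temperature
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)

def dual_contrastive_loss(query_emb, doc_emb, masked_query_emb, image_emb, lam=0.001):
    """Standard query-document alignment plus the lambda-weighted complementary term.

    masked_query_emb: queries with caption-shared tokens replaced by <mask>
    image_emb:        image-side embeddings from the complementary extractor
    """
    return info_nce(query_emb, doc_emb) + lam * info_nce(masked_query_emb, image_emb)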

Benchmark Results

MRR@10 (Mean Reciprocal Rank at 10)

An information retrieval metric that averages the reciprocal rank of the first relevant result, considering only the top 10 retrieved items. If the correct answer is ranked 1st, the score is 1.0; if 2nd, it is 0.5; if 10th, it is 0.1. Higher values mean relevant documents appear earlier in results.
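
For reference, a minimal computation of the per-query value (illustrative, not the paper's evaluation script):

def mrr_at_10(ranked_doc_ids, relevant_ids):
    """Reciprocal rank of the first relevant document within the top 10, else 0."""
    for rank, doc_id in enumerate(ranked_doc_ids[:10], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# The benchmark score is the mean of this value over all queries.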

CIEA was evaluated on three multimodal retrieval benchmarks covering different scales and document types.

WebQA-Multi Benchmark: CIEA vs Baselines

CIEA outperforms MARVEL by extracting visual details captions miss

WebQA-Multi

The primary benchmark contains 1,177,447 documents (787,697 text-only, 389,750 with images) with open-domain QA queries.

Model       MRR@10    NDCG@10
BM25        22.11     22.92
CLIP-DPR    48.83     46.32
UniVL-DR    62.40     59.32
T5-ANCE     64.13     62.03
MARVEL      65.43     63.07
CIEA        66.16     63.89

CIEA beats MARVEL by 0.73 MRR@10 points. More importantly, it beats T5-ANCE (text-only) by 2.03 points, showing the value of the complementary visual information.

WebQA-Image

Image-only subset of WebQA with 389,750 documents, all containing images.

Model       MRR@10    NDCG@10
CLIP-DPR    59.78     61.05
UniVL-DR    65.95     67.33
MARVEL      66.43     67.60
CIEA        67.40     68.77

EDIS

Different distribution with 1M image-text pairs from Google, where text queries retrieve images.

Model       MRR@10    NDCG@10
CLIP-DPR    62.52     39.19
MARVEL      67.00     42.19
CIEA        68.11     42.57

Ablation results

Removing components hurts performance, confirming each part contributes.

Configuration                    MRR@10    NDCG@10
Full CIEA                        66.03     63.70
Without complementary loss       65.66     63.44
Without attention reweighting    65.90     63.39
Neither (baseline)               65.30     63.19

Ablation Study: Component Contributions

Both complementary loss and attention reweighting contribute to CIEA's gains

Both components contribute. The attention reweighting has slightly more impact on NDCG@10; the complementary loss has more impact on MRR@10.

Backbone comparison

CIEA works across language model architectures with consistent improvements.

Backbone       Text Only    CIEA     Gain
T5-ANCE        64.13        66.16    +2.03
GPT-2-Large    63.58        65.38    +1.80
BERT           61.39        63.63    +2.24
BART           60.14        63.73    +3.59
GPT-2          54.68        59.25    +4.57

Consistent 1.8 to 4.6 point improvements regardless of backbone architecture.

Implementation Blueprint

Code is available at github.com/zengdlong/CIEA. These instructions will help you adapt it to your own data.

The components below reflect the setup used in the paper's codebase.

Component         Recommended    Notes
LM Backbone       T5-ANCE        Best results
Visual Encoder    CLIP           Frozen weights
Framework         PyTorch        Required
GPU               8GB+ VRAM      Batch 64 fits

Vector database compatibility: CIEA produces standard dense vectors. The output works with Pinecone, Milvus, Weaviate, Qdrant, or any vector store you already use. It changes how vectors are calculated, not how they are stored or searched.
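
Because the output is an ordinary dense vector, retrieval is plain nearest-neighbor search. A minimal brute-force sketch with NumPy (an ANN index would replace this at production scale):

import numpy as np

def top_k(query_emb, doc_embs, k=10):
    """Return indices and scores of the k most similar documents.

    query_emb: (dim,) L2-normalized CIEA query embedding
    doc_embs:  (num_docs, dim) L2-normalized CIEA document embeddings
    """
    scores = doc_embs @ query_emb  # cosine similarity for normalized vectors
    idx = np.argsort(-scores)[:k]
    return idx, scores[idx]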

Data format

CIEA expects documents with text and image paths.

document = {
    "text": "Woman in garden",
    "image_path": "/path/to/img.jpg",
    "doc_id": "doc_001"
}
 
query = {
    "text": "Woman with flowers",
    "relevant_docs": ["doc_001"]
}

Key parameters to tune

These hyperparameters produced the benchmark numbers. Start here and adjust for your domain.

Parameter        Small Dataset    Large Dataset
Lambda           0.01-0.02        0.001-0.005
Learning rate    5e-6             5e-6
Temperature      0.01             0.01
Batch size       64               64
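
One way to capture the starting point is a single config dict; the values mirror the training setup above, and the key names are illustrative rather than taken from the repository:

ciea_config = {
    "backbone": "T5-ANCE",
    "visual_encoder": "CLIP (frozen)",
    "optimizer": "AdamW",
    "learning_rate": 5e-6,
    "batch_size": 64,
    "max_epochs": 40,
    "temperature": 0.01,
    # lambda per dataset: 0.0011 (WebQA-Multi), 0.019 (WebQA-Image), 0.001 (EDIS)
    "lambda": 0.001,
}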

Lambda tuning strategy

Lambda is the most dangerous parameter. It controls how much the complementary loss influences training. Here is a debugging workflow:

  1. Start with Lambda = 0.001
  2. If the model ignores text constraints (returns visually similar but semantically wrong results), lower Lambda to 0.0005
  3. If the model ignores visual details (behaves like text-only retrieval), raise Lambda to 0.005 or 0.01
  4. For small datasets (under 100K documents), start higher at 0.01

What happens when lambda is wrong:

Lambda                        MRR@10    Effect
0 (no complementary loss)     65.43     Baseline, ignores visual details
0.0011 (paper setting)        66.16     Optimal balance
0.01 (too high)               65.80     Over-focuses on visual patches, misses text signal

The symptoms are distinct: too high makes caption-matching queries worse; too low makes visual-detail queries no better than baseline.

Adapting to your data

Follow these steps to integrate CIEA into your pipeline.

  1. Prepare image-text pairs: Each document needs text and at least one image
  2. Format queries: Text queries with relevance judgments for training
  3. Start with low lambda: Use 0.001, increase if complementary features are underweighted
  4. Evaluate incrementally: Check whether visual-only queries improve

Integration with existing systems

If you have a MARVEL or similar projection-based system, the integration is straightforward.

  1. Keep your CLIP encoder and projector
  2. Add the complementary information extractor after projection
  3. Add the complementary loss term to training
  4. Tune lambda for your domain

The extractor is approximately 100 lines of code. The loss term is one additional forward pass through images.

Pitfalls and gotchas

These mistakes will not show up in unit tests but will hurt production quality.

Lambda too high: Model ignores caption-relevant content. Queries matching captions perform worse.

Lambda too low: Complementary signal gets lost. Behaves like standard projection baseline.

Small images: Fewer patches means less granular dissimilarity. Consider minimum 224x224 input.

Short captions: Very short captions make everything "dissimilar." May need caption augmentation.

Limitations

The paper is honest about constraints that may affect your implementation.

Caption dependency for training

CIEA still needs text-image pairs for training. The complementary loss requires captions to identify what is "missing." Purely visual datasets with no captions cannot use this method directly.

Single-image documents

The current implementation handles one image per document. For products with multiple images (galleries), consider two workarounds:

  • Average the image vectors: Compute CIEA embeddings for each image, then mean-pool for a single document vector
  • Treat each image as a separate chunk: Index each gallery image independently, then deduplicate at retrieval time

Both approaches trade off precision for coverage. Test which works better for your query patterns.
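
For the first workaround, a minimal mean-pooling sketch, assuming each gallery image already has a CIEA embedding:

import numpy as np

def pool_gallery(image_embs):
    """Average per-image CIEA embeddings into one document vector, then renormalize."""
    doc_vec = np.asarray(image_embs).mean(axis=0)
    return doc_vec / np.linalg.norm(doc_vec)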

Static dissimilarity

Patch dissimilarity is computed once during encoding. Query-dependent dissimilarity (what is complementary for this specific query) is not modeled. A query about "flowers" does not dynamically upweight flower patches at retrieval time.

Limited to text queries

CIEA retrieves documents from text queries. Image-to-image or image-to-text retrieval is not addressed.

Evaluation scale

WebQA-Multi has 1.1M documents. Production systems with 100M+ documents may face scaling challenges not observed in these benchmarks.

Business Implications

Understanding the ROI helps justify implementation effort.

The metric that matters: Zero-result searches

CIEA is most valuable for reducing "zero-result" or "low-relevance" rates on long-tail queries. When users search for specific visual attributes like "distressed leather texture," "nighttime cityscape," or "Art Deco details," standard retrieval often returns nothing useful.

These failed searches are quantifiable revenue loss. If 15% of your product searches fail because users ask for visual details not in your titles, CIEA directly addresses that gap.

The real cost of caption laziness

Every product listing, image archive, and content library has lazy captions. Annotators describe the obvious. The subtle details, the background elements, and the contextual visual information get lost.

Before CIEA, fixing this meant re-annotation. After CIEA, the model extracts what captions miss automatically. The ROI calculation changes from "cost of annotating 1M images" to "cost of training one model."

Where CIEA fits

E-commerce: Product images contain details (color, material, context) that titles abbreviate. A search for "blue leather couch in modern living room" benefits from complementary extraction.

Stock photography: Image libraries have generic captions. CIEA surfaces images for specific queries even when tags do not match.

Visual RAG: Multimodal Q&A systems that retrieve image-text documents can now answer questions about visual details absent from text.

Content discovery: Social media, news archives, and digital asset management all have caption blind spots.

What CIEA does not fix

CIEA improves recall for visual details but has limits.

Wrong images: If the image does not contain what the user wants, no amount of complementary extraction helps.

Caption errors: Factually incorrect captions still mislead.

Missing images: Text-only documents still need text-only retrieval.

The Bottom Line

For ML engineers: CIEA is a drop-in upgrade for projection-based multimodal retrieval. Add one attention layer, one loss term, tune lambda. Training overhead is 7%. If your users search for things captions do not mention, this is worth trying. Start with the reference implementation and your existing data. Output vectors work with your existing Pinecone/Milvus/Weaviate setup.

For product managers: Your image search is only as good as your captions. CIEA lets you capture value from visual details without re-annotation. Quantify how many user queries reference things your captions miss, and measure your zero-result rate on long-tail searches. That is your opportunity size.

For researchers: The complementary information framing is the contribution here. Instead of asking "how similar are images and text?", CIEA asks "what does the image add?" This reframing generalizes beyond retrieval to any multimodal system that fuses modalities by alignment.


Original paper: arXiv (PDF / HTML)

Code: GitHub


Authors

Delong Zeng (Sun Yat-sen University), Yuexiang Xie (Alibaba Group), Yaliang Li (Alibaba Group), Ying Shen (Sun Yat-sen University)

Cite this paper

Delong Zeng, Yuexiang Xie, Yaliang Li, Ying Shen (2026). CIEA: How to Search for Visual Details Your Captions Miss. arXiv 2026.
