- The Problem. Your image search finds pictures that match captions. But captions are lazy summaries. A photo of "a park bench" might show flowers, birds, and a clock tower in the background, and your search engine has no idea any of that exists.
- The Solution. CIEA calculates which image patches differ most from the accompanying text, then amplifies those patches in the final embedding. Instead of learning what images and text share, it learns what images add.
- The Results. 66.16 MRR@10 on WebQA-Multi with 1.1 million documents, beating the previous best by 0.73 points. Training overhead is only 7% (6.5 minutes vs 6.1 minutes). Code available now.
Research Overview
If you have ever built an image search feature, you know the frustration: a user searches for "cafe with a view of the Eiffel Tower," and your system finds an image with the caption "Outdoor seating at a Parisian cafe." Good match for "cafe" and "Parisian." But does the image actually show the Eiffel Tower? The caption does not say.
Standard multimodal retrieval cannot help you here. It learned to match queries to captions. If the caption does not mention the tower, the embedding does not encode it.
Multimodal retrieval finds documents containing both text and images in response to text queries. Unlike image-only search (which uses visual similarity) or text-only search (which ignores images), multimodal retrieval creates unified embeddings that represent both modalities. E-commerce product search, visual Q&A, and image archives all rely on this capability.
CIEA (Complementary Information Extraction and Alignment) takes a different approach. Instead of learning what images and captions have in common, it learns what images contain that captions leave out.
Captions are summaries, not inventories. A photo of a park might show a bench, trees, a pond, ducks, joggers, a clock tower, fallen leaves, and a dog. The caption says "park bench." Everything else is complementary information that standard methods throw away.
CIEA identifies image patches that differ from the text, upweights them in the attention mechanism, and trains them to match the parts of queries that captions miss. When someone searches for "park with a clock tower," CIEA surfaces the right image even though "clock tower" never appeared in the caption.
Why this matters now
Multimodal RAG is becoming standard in production systems. If your retrieval only matches queries to captions, your system inherits every caption writer's oversight. CIEA fixes this at the embedding level without requiring re-annotation.
The training cost is negligible. CIEA adds about 6.5% training time versus projection-only baselines (6.5 minutes vs 6.1 minutes per epoch on WebQA-Multi). On a single NVIDIA A100 40GB GPU with batch size 64, this translates to roughly 0.4 extra minutes per epoch, or about 16 extra GPU-minutes for a full 40-epoch run. The architecture adds one attention layer and one contrastive loss term. You can swap it into existing multimodal pipelines without infrastructure changes.
The Caption Blind Spot
Multimodal retrieval pipelines typically work like this:
1. Encode images with a vision model (CLIP, ViT)
2. Project visual features into the text model's embedding space
3. Concatenate or fuse with text embeddings
4. Train with contrastive loss to match queries to relevant documents
[Figure: Standard vs CIEA pipeline. CIEA adds a complementary extractor branch between projection and fusion.]
The problem is step 3. When you project and fuse, you are learning alignment: how to make the image embedding similar to the text embedding. This works when queries ask about captioned content. It fails when queries ask about visual details the caption omits.
Most multimodal retrievers use a linear projection to map CLIP's visual features into the language model's embedding space. This lets the transformer process image patches as if they were text tokens. MARVEL, UniVL-DR, and similar systems all use this approach. The limitation: projection optimizes for similarity to text, not for capturing unique visual content.
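To make that baseline concrete, here is a minimal sketch of the projection step, assuming hypothetical dimensions (768-d CLIP patch features mapped into a 1024-d LM embedding space); CIEA keeps this component unchanged and inserts its extractor after it.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Maps frozen CLIP patch features into the LM embedding space."""

    def __init__(self, clip_dim: int = 768, lm_dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(clip_dim, lm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, clip_dim) from the frozen CLIP encoder
        # returns:        (batch, num_patches, lm_dim), processed like extra text tokens
        return self.proj(patch_features)
```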
Picture the projection step as a translator who renders a poem word-for-word into another language, preserving the literal meaning but losing the subtle imagery. The complementary extractor is a bilingual illustrator who, after the translation, sketches the missing scenes that the poem never described, adding the colors and shapes that the translator ignored.
Before and after: A concrete example
Consider searching for a specific product detail:
| | Standard Retrieval | CIEA Retrieval |
|---|---|---|
| Query | "Modern chair with wooden legs" | "Modern chair with wooden legs" |
| Caption | "Sleek modern chair" | "Sleek modern chair" |
| Result | Returns chair (misses leg material) | Returns chair with wooden legs visible |
| Why | "Wooden legs" not in caption | Visual patches of legs upweighted |
The paper shows several such examples. CIEA extracts terms like "green," "leaf," "clock," and "flowers" from images when those terms do not appear in captions. MARVEL (the previous state-of-the-art) does not.
How CIEA Extracts Complementary Information
CIEA's core innovation is a complementary information extractor that identifies and amplifies image patches differing from text. The process has four steps.
Vision transformers divide images into small square regions called patches, typically 16x16 or 32x32 pixels. Each patch is processed as a separate "token," similar to how text models process words. A 224x224 image with 16x16 patches produces 196 patch tokens. CIEA analyzes each patch independently to find which ones contain information missing from the text.
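As a quick sanity check of the patch arithmetic (not code from the paper), the snippet below cuts a 224x224 image into 196 non-overlapping 16x16 patches:

```python
import torch

image = torch.randn(3, 224, 224)           # (channels, height, width)
patch = 16
num_patches = (224 // patch) ** 2          # 14 x 14 = 196 patch tokens

# Cut into non-overlapping 16x16 patches, one flattened row per patch
patches = image.unfold(1, patch, patch).unfold(2, patch, patch)
patches = patches.permute(1, 2, 0, 3, 4).reshape(num_patches, -1)
print(patches.shape)                       # torch.Size([196, 768])
```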
Step 1: Calculate patch-text dissimilarity
Cosine similarity: a measure of similarity between two vectors based on the angle between them, ranging from -1 (opposite) to 1 (identical). In CIEA it quantifies how closely an image patch aligns with a text token. A score near 1 means the patch and token represent similar concepts; near 0 means they are unrelated.
For each image patch, CIEA computes cosine similarity to every text token. It then takes the maximum similarity (the closest text token) and inverts it:
```
# for each image patch j:
dissimilarity[j] = -max(cosine(patch_j, token_c) for token_c in text_tokens)
```
High dissimilarity means no text token closely matches this patch. These are the "complementary" patches containing visual information the text does not describe.
Think of each image patch as a guest at a party and every word in the caption as a conversation partner. The guest wanders from table to table, listening for the most familiar topic. If the guest finds a table where someone speaks their language, they feel at home; if not, they stand awkwardly in the corner, unnoticed. The model acts as a host who shines a spotlight on the unnoticed guests, inviting the crowd to pay attention to what the caption overlooked.
Concrete example: Consider a 224x224 pixel photo of a park bench. After splitting into 16x16 pixel patches (196 total), the patch covering the clock tower in the background has a maximum cosine similarity of 0.12 across the tokens of the caption "park bench," while the patch covering the bench itself scores 0.78. CIEA computes a dissimilarity weight of 0.44 for the clock-tower patch and only 0.11 for the bench patch. The clock tower now contributes far more to the final embedding, allowing the model to retrieve this image for queries like "park with a clock tower."
Step 2: Convert to attention weights
The dissimilarities are normalized to a 0-1 range:
```
weight[j] = (1 + dissimilarity[j]) / 2
```
Patches with high dissimilarity get higher weights. Patches that match the text get lower weights.
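A minimal PyTorch sketch of Steps 1 and 2, assuming the patches and text tokens already live in the same embedding space (the function name is mine, not from the released code):

```python
import torch
import torch.nn.functional as F

def complementary_weights(patches: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
    """patches: (P, d) projected image patches; text: (T, d) caption token embeddings."""
    # Cosine similarity between every patch and every text token
    sim = F.cosine_similarity(patches.unsqueeze(1), text.unsqueeze(0), dim=-1)  # (P, T)
    # Step 1: dissimilarity is the negated similarity to the best-matching token
    dissimilarity = -sim.max(dim=1).values                                      # (P,)
    # Step 2: map from [-1, 1] to [0, 1]; unmatched patches get weights near 1
    return (1.0 + dissimilarity) / 2.0
```

A patch identical to some caption token gets weight 0, an orthogonal patch gets 0.5, and a patch dissimilar to every token approaches 1.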
Step 3: Reweight via attention
A self-attention layer processes the image patches, but attention scores are multiplied by the dissimilarity weights. This amplifies complementary patches and dampens redundant ones.
```
Q, K, V = project(patches)
scores = softmax(Q @ K.T * weights / sqrt(d))
output = scores @ V
```
Step 4: Dual contrastive training
Contrastive loss: a training objective that pushes embeddings of matching query-document pairs together while pulling apart non-matching pairs. It teaches retrieval models to rank relevant items higher. CIEA uses two contrastive losses: one for standard matching, one for complementary information.
Standard contrastive loss aligns queries with relevant documents. CIEA adds a complementary loss that masks text in the query appearing in the document caption, leaving only the "complementary" query portions.
Here is the masking logic in pseudo-code:
```python
def build_complementary_query(query: str, caption: str) -> str:
    query_tokens = query.split()
    caption_tokens = set(caption.split())
    # Replace tokens shared with the caption by <mask>,
    # keeping only the query's complementary portions visible
    masked = [
        t if t not in caption_tokens else "<mask>"
        for t in query_tokens
    ]
    return " ".join(masked)

# Example:
# query:   "woman with flowers"
# caption: "woman in garden"
# result:  "<mask> with flowers"
```

Think of the query and caption as two overlapping maps of a treasure island. The overlapping area marks the well-known landmarks (the shared words). By covering those landmarks with a tarp, the explorer is forced to rely on hidden clues, like a distant lighthouse or a strange rock formation, that only appear on the image. The loss function rewards the explorer for finding those hidden clues, not for retracing the obvious paths.
The intuition: if the query is "woman with flowers" and the caption is "woman in garden," mask out "woman" (shared) and train the image embedding to match "flowers" (complementary). This forces the image embedding to carry information that text does not.
If we trained images to match full queries, they would just learn to duplicate text. By removing shared terms, we force images to encode the visual details that make them uniquely valuable. This is a data pre-processing step, not a black-box loss function.
Architecture Deep Dive
The full CIEA pipeline has three components that work together to extract and preserve complementary visual information.
1. Query Encoder
Text queries pass through a language model (T5-ANCE by default) to produce query embeddings. This is standard dense retrieval.
2. Document Encoder
Documents contain text and images processed separately before fusion:
- Text path: Caption goes through the LM embedding layer (but not transformer blocks yet)
- Image path: Image goes through frozen CLIP visual encoder, then a linear projector to match LM dimensions, then the complementary information extractor
The weighted image patches and text embeddings are concatenated with special <start> and <end> tokens separating them, then passed through the LM transformer blocks for fusion.
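A rough sketch of that fusion step, under the assumption that the image span is wrapped in learned <start>/<end> embeddings and placed before the caption embeddings (the released code may order things differently):

```python
import torch

def fuse_document_inputs(weighted_patches, text_embeds, start_embed, end_embed):
    # weighted_patches: (P, d) output of the complementary extractor
    # text_embeds:      (T, d) caption embeddings from the LM embedding layer
    # start_embed, end_embed: (d,) learned boundary embeddings marking the image span
    image_span = torch.cat(
        [start_embed.unsqueeze(0), weighted_patches, end_embed.unsqueeze(0)], dim=0
    )
    # The fused sequence then runs through the LM transformer blocks
    return torch.cat([image_span, text_embeds], dim=0)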
3. Complementary Information Extractor
This is the new component sitting between the CLIP projector and the transformer fusion:
- Receive projected image patches
- Receive text embeddings
- Compute patch-token cosine similarities
- Extract max similarity per patch, invert to get dissimilarity
- Normalize to attention weights
- Apply attention layer with weighted scores
- Output reweighted patches
The attention layer has learnable Q, K, V projection matrices (standard transformer attention parameters).
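Putting Steps 1 through 3 together, here is a compact single-head sketch (batch dimension omitted); the head count, the assumed 1024-d embedding size, and exactly where the weights multiply the attention scores are assumptions, not details confirmed by the paper:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComplementaryExtractor(nn.Module):
    """Single-head sketch of the complementary information extractor."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.dim = dim

    def forward(self, patches: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # patches: (P, dim) projected image patches; text: (T, dim) caption embeddings
        # Steps 1-2: per-patch dissimilarity weights in [0, 1]
        sim = F.cosine_similarity(patches.unsqueeze(1), text.unsqueeze(0), dim=-1)  # (P, T)
        weights = (1.0 - sim.max(dim=1).values) / 2.0                               # (P,)

        # Step 3: self-attention over patches, scores scaled by the weights
        q, k, v = self.q(patches), self.k(patches), self.v(patches)
        scores = q @ k.T / math.sqrt(self.dim)                                      # (P, P)
        attn = F.softmax(scores * weights.unsqueeze(0), dim=-1)  # amplify attention to complementary keys
        return attn @ v                                           # reweighted patches, (P, dim)
```

Here `patches` are the outputs of the linear projector, so `dim` matches the LM embedding size.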
Training setup
These are the exact choices the authors used to get the benchmark results.
| Parameter | Value |
|---|---|
| Backbone | T5-ANCE |
| Visual encoder | CLIP (frozen) |
| Optimizer | AdamW |
| Learning rate | 5e-6 |
| Batch size | 64 |
| Max epochs | 40 |
| Temperature | 0.01 |
Temperature: a small constant used to scale similarity scores before applying softmax in contrastive loss. Lower temperatures (like 0.01) make the model more confident, creating sharper distinctions between correct and incorrect pairs. Higher temperatures produce smoother probability distributions.
The lambda hyperparameter balances standard and complementary losses. The paper uses 0.0011 for WebQA-Multi, 0.019 for WebQA-Image, and 0.001 for EDIS. Smaller values on larger datasets prevent the complementary loss from dominating.
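A sketch of how the two losses might be combined, assuming L2-normalized embeddings, in-batch negatives, and the hyperparameters from the table above; the released code may structure this differently:

```python
import torch
import torch.nn.functional as F

def dual_contrastive_loss(query, comp_query, docs, temperature=0.01, lam=0.0011):
    """query, comp_query, docs: (batch, dim) L2-normalized embeddings of the
    query, the masked (complementary) query, and the positive document.
    Uses in-batch negatives; lam weights the complementary term."""
    targets = torch.arange(query.size(0), device=query.device)

    standard = F.cross_entropy(query @ docs.T / temperature, targets)
    complementary = F.cross_entropy(comp_query @ docs.T / temperature, targets)
    return standard + lam * complementary
```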
Benchmark Results
MRR@10 (Mean Reciprocal Rank at 10): an information retrieval metric that averages the reciprocal rank of the first relevant result, considering only the top 10 retrieved items. If the correct answer is ranked 1st, the score is 1.0; if 2nd, it is 0.5; if 10th, it is 0.1. Higher values mean relevant documents appear earlier in results.
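For reference, a minimal per-query implementation (the reported numbers average this score over all queries):

```python
def mrr_at_10(ranked_doc_ids, relevant_ids):
    """Reciprocal rank of the first relevant document within the top 10, else 0."""
    for rank, doc_id in enumerate(ranked_doc_ids[:10], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# mrr_at_10(["d3", "d7", "d1"], {"d7"}) -> 0.5 (first hit at rank 2)
```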
CIEA was evaluated on three multimodal retrieval benchmarks covering different scales and document types.
[Figure: WebQA-Multi benchmark, CIEA vs baselines. CIEA outperforms MARVEL by extracting visual details captions miss.]
WebQA-Multi
The primary benchmark contains 1,177,447 documents (787,697 text-only, 389,750 with images) with open-domain QA queries.
| Model | MRR@10 | NDCG@10 |
|---|---|---|
| BM25 | 22.11 | 22.92 |
| CLIP-DPR | 48.83 | 46.32 |
| UniVL-DR | 62.40 | 59.32 |
| T5-ANCE | 64.13 | 62.03 |
| MARVEL | 65.43 | 63.07 |
| CIEA | 66.16 | 63.89 |
CIEA beats MARVEL by 0.73 MRR@10 points. More importantly, it beats T5-ANCE (text-only) by 2.03 points, showing the value of the complementary visual information.
WebQA-Image
Image-only subset of WebQA with 389,750 documents, all containing images.
| Model | MRR@10 | NDCG@10 |
|---|---|---|
| CLIP-DPR | 59.78 | 61.05 |
| UniVL-DR | 65.95 | 67.33 |
| MARVEL | 66.43 | 67.60 |
| CIEA | 67.40 | 68.77 |
EDIS
Different distribution with 1M image-text pairs from Google, where text queries retrieve images.
| Model | MRR@10 | NDCG@10 |
|---|---|---|
| CLIP-DPR | 62.52 | 39.19 |
| MARVEL | 67.00 | 42.19 |
| CIEA | 68.11 | 42.57 |
Ablation results
Removing components hurts performance, confirming each part contributes.
| Configuration | MRR@10 | NDCG@10 |
|---|---|---|
| Full CIEA | 66.03 | 63.70 |
| Without complementary loss | 65.66 | 63.44 |
| Without attention reweighting | 65.90 | 63.39 |
| Neither (baseline) | 65.30 | 63.19 |
[Figure: Ablation study of component contributions. Both complementary loss and attention reweighting contribute to CIEA's gains.]
Both components contribute. The attention reweighting has slightly more impact on NDCG; the complementary loss has more impact on MRR@10.
Backbone comparison
CIEA works across language model architectures with consistent improvements.
| Backbone | Text Only | CIEA | Gain |
|---|---|---|---|
| T5-ANCE | 64.13 | 66.16 | +2.03 |
| GPT-2-Large | 63.58 | 65.38 | +1.80 |
| BERT | 61.39 | 63.63 | +2.24 |
| BART | 60.14 | 63.73 | +3.59 |
| GPT-2 | 54.68 | 59.25 | +4.57 |
Consistent 1.8 to 4.6 point improvements regardless of backbone architecture.
Implementation Blueprint
Code is available at github.com/zengdlong/CIEA. These instructions will help you adapt it to your own data.
Recommended tech stack
These recommendations follow the paper's released codebase.
| Component | Recommended | Notes |
|---|---|---|
| LM Backbone | T5-ANCE | Best results |
| Visual Encoder | CLIP | Frozen weights |
| Framework | PyTorch | Required |
| GPU | NVIDIA A100 40GB | Paper's training setup (batch size 64) |
Vector database compatibility: CIEA produces standard dense vectors. The output works with Pinecone, Milvus, Weaviate, Qdrant, or any vector store you already use. It changes how vectors are calculated, not how they are stored or searched.
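As an illustration with FAISS (the managed stores work the same way); the random arrays below stand in for CIEA document and query vectors:

```python
import numpy as np
import faiss

dim = 768  # use whatever dimension your CIEA encoder outputs
doc_vectors = np.random.rand(10_000, dim).astype("float32")   # stand-in for CIEA document embeddings
query_vector = np.random.rand(1, dim).astype("float32")       # stand-in for a CIEA query embedding

index = faiss.IndexFlatIP(dim)        # exact inner-product search; normalize vectors for cosine
index.add(doc_vectors)

scores, ids = index.search(query_vector, 10)
print(ids[0])                         # indices of the top-10 documents
```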
Data format
CIEA expects documents with text and image paths.
```python
document = {
    "text": "Woman in garden",
    "image_path": "/path/to/img.jpg",
    "doc_id": "doc_001",
}

query = {
    "text": "Woman with flowers",
    "relevant_docs": ["doc_001"],
}
```
Key parameters to tune
These hyperparameters produced the benchmark numbers. Start here and adjust for your domain.
| Parameter | Small Dataset | Large Dataset |
|---|---|---|
| Lambda | 0.01-0.02 | 0.001-0.005 |
| Learning rate | 5e-6 | 5e-6 |
| Temperature | 0.01 | 0.01 |
| Batch size | 64 | 64 |
Lambda tuning strategy
Lambda is the most dangerous parameter. It controls how much the complementary loss influences training. Here is a debugging workflow:
- Start with Lambda = 0.001
- If the model ignores text constraints (returns visually similar but semantically wrong results), lower Lambda to 0.0005
- If the model ignores visual details (behaves like text-only retrieval), raise Lambda to 0.005 or 0.01
- For small datasets (under 100K documents), start higher at 0.01
What happens when lambda is wrong:
| Lambda | MRR@10 | Effect |
|---|---|---|
| 0 (no complementary loss) | 65.43 | Baseline, ignores visual details |
| 0.0011 (paper setting) | 66.16 | Optimal balance |
| 0.01 (too high) | 65.80 | Over-focuses on visual patches, misses text signal |
The symptoms are distinct: too high makes caption-matching queries worse; too low makes visual-detail queries no better than baseline.
Adapting to your data
Follow these steps to integrate CIEA into your pipeline.
- Prepare image-text pairs: Each document needs text and at least one image
- Format queries: Text queries with relevance judgments for training
- Start with low lambda: Use 0.001, increase if complementary features are underweighted
- Evaluate incrementally: Check whether visual-only queries improve
Integration with existing systems
If you have a MARVEL or similar projection-based system, the integration is straightforward.
- Keep your CLIP encoder and projector
- Add the complementary information extractor after projection
- Add the complementary loss term to training
- Tune lambda for your domain
The extractor is approximately 100 lines of code. The loss term is one additional forward pass through images.
Pitfalls and gotchas
These mistakes will not show up in unit tests but will hurt production quality.
Lambda too high: Model ignores caption-relevant content. Queries matching captions perform worse.
Lambda too low: Complementary signal gets lost. Behaves like standard projection baseline.
Small images: Fewer patches means less granular dissimilarity. Consider minimum 224x224 input.
Short captions: Very short captions make everything "dissimilar." May need caption augmentation.
Limitations
The paper is honest about constraints that may affect your implementation.
Caption dependency for training
CIEA still needs text-image pairs for training. The complementary loss requires captions to identify what is "missing." Purely visual datasets with no captions cannot use this method directly.
Single-image documents
The current implementation handles one image per document. For products with multiple images (galleries), consider two workarounds:
- Average the image vectors: Compute CIEA embeddings for each image, then mean-pool for a single document vector
- Treat each image as a separate chunk: Index each gallery image independently, then deduplicate at retrieval time
Both approaches trade off precision for coverage. Test which works better for your query patterns.
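A minimal sketch of the first workaround, assuming your index expects unit-normalized vectors:

```python
import torch

def pool_gallery(image_embeddings: torch.Tensor) -> torch.Tensor:
    # image_embeddings: (num_images, dim) CIEA vectors for one document's image gallery
    pooled = image_embeddings.mean(dim=0)
    # Re-normalize so cosine / inner-product scores stay comparable across documents
    return pooled / pooled.norm()
```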
Static dissimilarity
Patch dissimilarity is computed once during encoding. Query-dependent dissimilarity (what is complementary for this specific query) is not modeled. A query about "flowers" does not dynamically upweight flower patches at retrieval time.
Limited to text queries
CIEA retrieves documents from text queries. Image-to-image or image-to-text retrieval is not addressed.
Evaluation scale
WebQA-Multi has 1.1M documents. Production systems with 100M+ documents may face scaling challenges not observed in these benchmarks.
Business Implications
Understanding the ROI helps justify implementation effort.
The metric that matters: Zero-result searches
CIEA is most valuable for reducing "zero-result" or "low-relevance" rates on long-tail queries. When users search for specific visual attributes like "distressed leather texture," "nighttime cityscape," or "Art Deco details," standard retrieval often returns nothing useful.
These failed searches are quantifiable revenue loss. If 15% of your product searches fail because users ask for visual details not in your titles, CIEA directly addresses that gap.
The real cost of caption laziness
Every product listing, image archive, and content library has lazy captions. Annotators describe the obvious. The subtle details, background elements, and contextual visual information get lost.
Before CIEA, fixing this meant re-annotation. After CIEA, the model extracts what captions miss automatically. The ROI calculation changes from "cost of annotating 1M images" to "cost of training one model."
Where CIEA fits
E-commerce: Product images contain details (color, material, context) that titles abbreviate. A search for "blue leather couch in modern living room" benefits from complementary extraction.
Stock photography: Image libraries have generic captions. CIEA surfaces images for specific queries even when tags do not match.
Visual RAG: Multimodal Q&A systems that retrieve image-text documents can now answer questions about visual details absent from text.
Content discovery: Social media, news archives, and digital asset management all have caption blind spots.
What CIEA does not fix
CIEA improves recall for visual details but has limits.
Wrong images: If the image does not contain what the user wants, no amount of complementary extraction helps.
Caption errors: Factually incorrect captions still mislead.
Missing images: Text-only documents still need text-only retrieval.
The Bottom Line
For ML engineers: CIEA is a drop-in upgrade for projection-based multimodal retrieval. Add one attention layer, one loss term, tune lambda. Training overhead is 7%. If your users search for things captions do not mention, this is worth trying. Start with the reference implementation and your existing data. Output vectors work with your existing Pinecone/Milvus/Weaviate setup.
For product managers: Your image search is only as good as your captions. CIEA lets you capture value from visual details without re-annotation. Quantify how many user queries reference things your captions miss, and measure your zero-result rate on long-tail searches. That is your opportunity size.
For researchers: The complementary information framing is the contribution here. Instead of asking "how similar are images and text?", CIEA asks "what does the image add?" This reframing generalizes beyond retrieval to any multimodal system that fuses modalities by alignment.
Original paper: arXiv ・ PDF ・ HTML
Code: GitHub
Authors: Delong Zeng, Yuexiang Xie, Yaliang Li, Ying Shen (Sun Yat-sen University, Alibaba Group)
Cite this paper
Delong Zeng, Yuexiang Xie, Yaliang Li, Ying Shen (2026). CIEA: How to Search for Visual Details Your Captions Miss. arXiv 2026.