- The Problem. Contextual embeddings (like sentence transformers) capture semantic similarity but miss topical structure. Two documents about "contract termination" and "employment law" might be semantically similar but belong to different legal topics.
- The Solution. Fuse classical NLP signals (TF-IDF for term importance, LSA for latent semantics, LDA for topic membership) with modern embeddings. Two fusion approaches: concatenation or weighted averaging.
- The Results. A 4 percentage point improvement in Precision@10 (87% vs 83%) on Spanish legal documents. Clustering coherence improves from 0.64 to 0.70 Silhouette score. Ablation confirms the gains come from topic structure, not added dimensions.
Executive impact
For every 10 documents your system retrieves, topic enrichment means on average 0.4 fewer irrelevant results (8.7 relevant vs 8.3). In practice:
- Reduced review time. Legal and compliance teams spend less time filtering wrong documents from search results
- Lower compliance risk. Fewer off-topic documents means less chance of missing critical information buried in noise
- Better user trust. When the first page of results is consistently on-topic, users trust the system and use it more
- Lower hallucination risk. RAG systems hallucinate when they try to answer questions using irrelevant retrieved context. If your retriever feeds the LLM an employment document when the user asked about tax law, the model may confidently generate incorrect answers. Higher precision means the model sees more on-topic context, reducing the chance of generating plausible-sounding but wrong responses
Concrete example: A compliance analyst running 1,000 search queries per month on a legal knowledge base must manually discard roughly 1,700 off-topic documents from the top-10 results with standard embeddings (1.7 per query). With topic enrichment, that drops to roughly 1,300—400 fewer manual reviews per month. At 30 seconds per review, that's about 45 minutes of analyst time recovered each week.
The tradeoff is upfront indexing cost (LDA training adds hours to initial setup) and periodic retraining when your document collection evolves. For stable, domain-specific corpora where precision matters—especially where wrong answers carry legal or financial risk—the ROI is positive.
Research Overview
Modern RAG systems rely on embedding models to find relevant documents. You embed a query, compare it to document embeddings, and retrieve the closest matches. This works well when semantic similarity aligns with relevance. But semantic similarity is not the same as topical relevance.
Dataset context. This paper tests on a specific corpus: 12,436 Argentine legal documents in Spanish, covering industrial promotion legislation (Law 19.640) from 1972 to 2020. The documents come from InfoLeg and SAIJ legal repositories. Results should be interpreted with this scope in mind. Domains with less clear topical structure may see different outcomes.
Consider a legal database. A query about "industrial tax exemptions" should retrieve documents about tax law, not employment contracts that happen to mention "industrial" in passing. Contextual embeddings see both as semantically related (they share vocabulary and context patterns). Topic modeling sees them as distinct categories.
Topic modeling algorithms like LDA (Latent Dirichlet Allocation) discover hidden thematic structure in document collections. Each document gets a probability distribution over topics. A legal document might be 60% "tax law," 30% "corporate governance," and 10% "contract law." This topical fingerprint captures something embeddings miss.
This paper proposes a hybrid approach: take the contextual embeddings you already use and enrich them with topic signals. The intuition is simple. Embeddings capture what words mean. Topic models capture what documents are about. Both signals matter for retrieval.
This paper provides academic validation for the industry trend called "Hybrid Search" or "Hybrid RAG"—combining keyword/lexical signals with semantic vectors. If you've followed the "Vector vs. Keyword" debate, this research shows that the answer is "both." TF-IDF and topic models capture explicit term matches and thematic structure that pure embeddings miss. The paper's three-stream architecture (lexical + topical + contextual) is a rigorous implementation of what practitioners call hybrid retrieval.
The core contribution
| Metric | Topic-Enriched | Contextual Only | Δ (pp) |
|---|---|---|---|
| Precision@10 | 0.87 | 0.83 | +4 |
| Recall@10 | 0.72 | 0.67 | +5 |
| F1@10 | 0.79 | 0.74 | +5 |
| Silhouette | 0.70 | 0.64 | +0.06 |
All retrieval metrics measured at k=10. Δ is the absolute difference: percentage points for the retrieval metrics, raw score for Silhouette.
The improvements are modest but consistent. Ablation studies confirm the gains come from meaningful topic integration rather than dimensionality expansion alone.
Before and after: a retrieval example
To illustrate the difference, consider a query about tax exemptions in industrial zones. The paper uses artificial examples for illustration, but they demonstrate the core dynamic.
Query: "tax exemption requirements for industrial promotion zones"
| Rank | Contextual-only Result | Topic-Enriched Result |
|---|---|---|
| 1 | Tax exemption procedures for Law 19.640 beneficiaries ✓ | Tax exemption procedures for Law 19.640 beneficiaries ✓ |
| 2 | Employment termination procedures in industrial facilities ✗ | Tax credit calculations for promotional regimes ✓ |
| 3 | Environmental compliance for industrial zones ✗ | Documentation requirements for tax benefits ✓ |
✓ = relevant, ✗ = off-topic. Contextual-only retrieval is confused by shared vocabulary ("industrial," "zones"). Topic-enriched retrieval recognizes the query is about tax law.
The topic model recognizes that "tax exemption" queries should retrieve tax law documents, not employment or environmental documents that happen to share vocabulary. This is the difference between "semantically similar" and "topically relevant."
The Topic Blindness Problem
Contextual embeddings encode meaning at the sentence level. They excel at capturing semantic relationships: synonyms, paraphrases, contextual usage. But they have a blind spot for document-level topical structure.
In specialized domains, documents cluster around topics. Legal documents fall into categories: tax law, employment law, property law. Medical records cluster by specialty. Technical documentation groups by product area. When users search these collections, they often want topical relevance, not just semantic similarity.
Consider what happens when you embed these two sentences:
- "The employee terminated the contract after 90 days notice"
- "Tax obligations terminate upon dissolution of the industrial entity"
Both use "terminate" in a legal context. Contextual embeddings see high similarity. But topically, they belong to different domains (employment law vs. tax law). A legal professional searching for termination clauses in employment contracts does not want tax documents polluting their results.
The paper identifies three limitations of contextual-only approaches:
1. Local context dominance. Sentence transformers process text in windows. Document-level themes get diluted across many embeddings.
2. Vocabulary overlap confusion. Domain-specific corpora reuse terminology across topics. "Industrial" appears in tax law, environmental law, and employment law with different meanings.
3. No explicit topic representation. The model has no mechanism to say "this document is primarily about X." Topic membership is implicit, buried in the embedding space.
Beyond legal: a customer support example. This problem applies to any domain with topical structure. Consider a support knowledge base where a user asks "How do I return a product?" Two articles might share nearly identical vocabulary:
- "Product Return Policy" (topic: policy/rules) — explains eligibility windows and restocking fees
- "How to Return a Broken Item" (topic: instructions/process) — step-by-step guide with shipping labels
Embeddings see high similarity (both mention "return," "product," "item"). But one is policy, the other is procedure. A user wanting to actually return something needs the instructions, not the policy. Topic modeling distinguishes these categories.
Why classical methods still matter
TF-IDF, LSA, and LDA are techniques that are 20 or more years old; all three predate transformers. But they capture something transformers do not: explicit statistical and probabilistic structure.
| Method | Pipeline role | What it captures | Limitation |
|---|---|---|---|
| TF-IDF | Lexical stream | Term importance | No semantics |
| LSA | Lexical stream | Latent semantics | Linear only |
| LDA | Topical stream | Topic membership | Bag of words |
| Embeddings | Contextual stream | Contextual meaning | No explicit topics |
The insight is that these methods are complementary, not competing. The paper's architecture fuses all three streams to produce richer document representations.
Architecture
LSA (Latent Semantic Analysis) uses linear algebra (SVD) to compress TF-IDF vectors while preserving word co-occurrence patterns. It's fast and deterministic. LDA (Latent Dirichlet Allocation) is a probabilistic model that assumes documents are mixtures of topics. It's slower but produces interpretable topic labels. This pipeline uses both: LSA for efficient dimensionality reduction, LDA for explicit topic membership.
The system operates in two phases: offline index construction and online query processing.
Topic Enrichment Pipeline
Two-phase architecture: index construction and query processing
The pipeline fuses three signal types: lexical (TF-IDF/LSA), topical (LDA), and contextual (sentence transformer). All three streams process documents in parallel during indexing, then combine via concatenation or weighted averaging.
How it works (plain English)
Index phase (run once per corpus):
- Chunk documents into 500-word segments with overlap
- Build TF-IDF vectors to capture which terms matter in each chunk
- Apply LSA to reduce TF-IDF dimensions while preserving semantic relationships
- Train LDA on the corpus to discover latent topics (e.g., "tax law," "employment," "contracts")
- Generate topic vectors showing each chunk's probability distribution over topics
- Compute contextual embeddings using all-MiniLM-L6-v2 (384 dimensions)
- Fuse signals via concatenation or weighted averaging
- Store enriched embeddings in a vector database
Query phase (run per search):
- Transform query using the same TF-IDF vocabulary, LSA projection, and LDA model
- Compute query embedding with the same sentence transformer
- Fuse query signals identically to documents
- Retrieve nearest neighbors from the vector database
The TF-IDF vocabulary and LDA topics are learned from the corpus. A query must be projected using the same learned structures to enable meaningful comparison. This is why the system stores trained artifacts during indexing.
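A minimal sketch of that query-time projection, assuming the fitted TF-IDF vectorizer, LSA (truncated SVD) model, Gensim LDA model and dictionary, and sentence transformer were saved during indexing (variable names here are illustrative, not from the paper):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer('all-MiniLM-L6-v2')  # same model used for documents

def project_query(query, tfidf_vectorizer, lsa, lda_model, dictionary):
    # Lexical stream: same fitted vocabulary and LSA projection as the index
    lsa_vec = lsa.transform(tfidf_vectorizer.transform([query]))[0]

    # Topical stream: distribution over the topics learned during indexing
    bow = dictionary.doc2bow(query.lower().split())
    topic_vec = np.zeros(lda_model.num_topics)
    for topic_id, prob in lda_model.get_document_topics(bow, minimum_probability=0.0):
        topic_vec[topic_id] = prob

    # Contextual stream: same sentence transformer as the documents
    ctx_vec = encoder.encode(query)
    return lsa_vec, topic_vec, ctx_vec
```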
Fusion Strategies
Two approaches combine the different signal types into unified embeddings. Think of it like preparing food: you can either keep ingredients in separate compartments (bento box) or blend them together (smoothie).
Concatenation (the "bento box")
The simplest approach: stack the vectors end-to-end.
e_new = [e_context, t_topic]
A 384-dimensional contextual embedding concatenated with a 12-dimensional topic vector produces a 396-dimensional enriched embedding. Each signal type retains its distinct contribution.
Like a bento box where rice, fish, and vegetables sit in separate compartments, concatenation preserves signal independence. The retrieval system sees both semantic similarity and topic alignment as separate dimensions. You can taste each flavor distinctly. This works well when both signals matter equally.
Weighted averaging (the "smoothie")
When you want tighter integration:
e_new = alpha * e_context + (1-alpha) * t_topic
The paper finds alpha=0.45 works well through empirical validation. This means topic signals get slightly more weight than contextual embeddings. The resulting vector has the same dimensionality as the contextual embedding (384).
Like blending fruits into a smoothie, weighted averaging forces the signals to interact and creates a new unified flavor. Topic information modulates semantic similarity rather than adding independent dimensions. You cannot separate the ingredients afterward, but the result may be more balanced than any single component.
Which to choose?
| Strategy | Alpha | Output dim | When to use |
|---|---|---|---|
| Concatenation | n/a | 384 + K (396 with K=12) | Need both signals separately, have memory budget |
| Weighted averaging | 0.45 | 384 (preserved) | Memory constrained, want simpler similarity |
The paper sets α=0.45 for weighted averaging (slight topic bias). Concatenation expands dimensionality (384 + K topics), while weighted averaging preserves the original 384 dimensions. Both approaches show similar retrieval performance on the legal corpus.
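A minimal sketch of the two fusion strategies, assuming a 384-dimensional contextual embedding and a K-dimensional LDA topic vector as NumPy arrays. Note that weighted averaging requires both vectors to share a dimensionality; the zero-padding used here is an illustrative assumption, not necessarily the paper's exact mechanism:

```python
import numpy as np

def fuse(ctx_emb, topic_vec, strategy="concat", alpha=0.45):
    """Combine a contextual embedding with an LDA topic vector."""
    if strategy == "concat":
        # Bento box: 384 + K dimensions, each signal kept separate
        return np.concatenate([ctx_emb, topic_vec])
    # Smoothie: weighted average keeps the original 384 dimensions.
    # Assumption: zero-pad the K-dim topic vector to 384 dims so the
    # two vectors can be summed; the paper does not spell this step out.
    padded = np.zeros_like(ctx_emb)
    padded[:len(topic_vec)] = topic_vec
    return alpha * ctx_emb + (1 - alpha) * padded
```

With alpha = 0.45, the topic component carries weight 0.55, matching the paper's slight topic bias.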
Benchmark Results
The evaluation uses a 12,436-document corpus of Argentine legal documents in Spanish from the InfoLeg and SAIJ repositories. Documents span 1972 to 2020 and cover industrial promotion legislation (Law 19.640).
Precision@k = fraction of top k results that are relevant. P@10 of 0.87 means 8.7 of 10 results are relevant.
Recall@k = fraction of all relevant documents found in top k. R@10 of 0.72 means 72% of relevant docs appear in top 10.
F1@k = harmonic mean of precision and recall, balancing both. Higher is better.
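Worked example: if 8 of the top 10 results are relevant and the corpus contains 20 relevant documents for that query, then P@10 = 0.8, R@10 = 8/20 = 0.4, and F1@10 = 2 × (0.8 × 0.4) / (0.8 + 0.4) ≈ 0.53.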
Figure: Retrieval performance comparison at k=10 — mean Precision@10, Recall@10, and F1@10 across embedding techniques.
Retrieval performance
The paper compares five embedding techniques across retrieval depths (k=10, 20, 50). Table 2 in the paper reports k=10 results; we show those here.
| Embedding Technique | P@10 | R@10 | F1@10 |
|---|---|---|---|
| Topic-Enriched Embeddings | 0.87 | 0.72 | 0.79 |
| Contextual Embeddings | 0.83 | 0.67 | 0.74 |
| LDA-Enriched | 0.79 | 0.62 | 0.70 |
| LSA-Enriched | 0.75 | 0.58 | 0.65 |
| TF-IDF Enriched | 0.68 | 0.51 | 0.58 |
The topic-enriched approach outperforms all baselines. The contextual-only baseline (all-MiniLM-L6-v2) comes second, confirming that modern embeddings are strong but improvable.
Note on evaluation methodology. The paper states that "artificial data and outputs are used for illustration" in some examples. The benchmark numbers come from the actual corpus, but readers should verify results on their own data before production deployment.
Precision–Recall tradeoff
The paper's Figure 3 shows precision-recall curves across the full recall range. Topic-enriched embeddings maintain superior precision as recall increases, with the gap widening at higher recall levels.
Figure: Precision-recall curves across the full recall range (mean ± 1 standard deviation over 5 random seeds). Topic-enriched (gold) maintains precision better than contextual-only (steel) as recall increases.
Clustering quality
Beyond retrieval, the paper evaluates clustering coherence (Table 1 in the paper). Better embeddings should produce cleaner document clusters. The progression from TF-IDF through contextual to topic-enriched shows consistent improvement.
| Embedding Technique | Silhouette ↑ | Calinski-Harabasz ↑ | Davies-Bouldin ↓ |
|---|---|---|---|
| TF-IDF Enriched | 0.47 | 330.5 | 1.85 |
| LSA-Enriched | 0.54 | 425.2 | 1.50 |
| LDA-Enriched | 0.57 | 465.7 | 1.39 |
| Contextual Embeddings | 0.64 | 525.4 | 1.30 |
| Topic-Enriched Embeddings | 0.70 | 580.6 | 1.19 |
↑ = higher is better, ↓ = lower is better
Silhouette score measures how similar documents are to their own cluster versus other clusters (higher is better, max 1.0). Calinski-Harabasz measures cluster compactness (higher is better). Davies-Bouldin measures cluster overlap (lower is better).
The improvements indicate that topic enrichment creates embeddings that better reflect true document similarity. Each method in the progression adds signal that improves cluster separation.
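All three metrics are available in scikit-learn if you want to run the same check on your own embeddings. A minimal sketch, assuming an `embeddings` array of enriched document vectors and a KMeans clustering (the file path and K=12 are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

embeddings = np.load("enriched_embeddings.npy")  # hypothetical path
labels = KMeans(n_clusters=12, n_init=10, random_state=42).fit_predict(embeddings)

print("Silhouette:        ", silhouette_score(embeddings, labels))
print("Calinski-Harabasz: ", calinski_harabasz_score(embeddings, labels))
print("Davies-Bouldin:    ", davies_bouldin_score(embeddings, labels))
```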
Visual cluster separation
t-SNE (t-distributed Stochastic Neighbor Embedding) is an algorithm that maps high-dimensional vectors into 2D or 3D for visualization. It preserves local similarity—points close in high-dimensional space stay close in the plot. Useful for spotting cluster structure, but can distort global distances.
The paper's Figure 2 shows t-SNE projections of document embeddings. Topic-enriched embeddings produce compact, well-separated clusters, while TF-IDF shows significant overlap between topic groups.
Figure: t-SNE visualization of document clusters by embedding method. Each color represents a topic cluster. Topic-enriched embeddings (right) show cleaner separation than TF-IDF (left) or LDA alone (center).
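To produce a similar plot for your own corpus, a minimal sketch with scikit-learn and matplotlib, reusing the `embeddings` and `labels` arrays from the clustering sketch above:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Project the high-dimensional enriched embeddings down to 2D for inspection
coords = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab20", s=5)
plt.title("t-SNE projection of enriched document embeddings")
plt.show()
```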
Ablation Analysis
Do the gains come from meaningful topic signals, or just added dimensions? The ablation study (Table 3 in the paper) tests this by comparing fusion variants and replacing topic vectors with random vectors.
Figure: Ablation study of component contributions (compressed F1@10 view). The table below gives the full results, distinguishing concatenation from weighted-averaging variants.
Component contributions by fusion method
| Variant | P@10 | R@10 | F1@10 |
|---|---|---|---|
| Contextual Only | 0.845 | 0.60 | 0.74 |
| + LSA (concatenation) | 0.845 | 0.69 | 0.78 |
| + LDA (concatenation) | 0.852 | 0.68 | 0.77 |
| + LSA (weighted average) | 0.851 | 0.71 | 0.78 |
| + LDA (weighted average) | 0.866 | 0.71 | 0.78 |
| Topic-Enriched (full) | 0.870 | 0.72 | 0.80 |
| Random Topic Vectors | 0.814 | 0.65 | 0.71 |
The random baseline performs worse than contextual-only (F1: 0.71 vs 0.74). This confirms that gains come from meaningful topic structure, not dimensionality expansion.
Both fusion strategies show improvement. Weighted averaging with LDA achieves slightly higher precision (0.866) than the concatenation variants. The full pipeline combining all signals provides the best F1 (+6 pp over the contextual-only baseline).
Implementation Blueprint
Recommended stack
The paper provides code for reproducing results. The table below reflects tools explicitly named in the paper where available.
| Component | Tool | Source |
|---|---|---|
| Embeddings | all-MiniLM-L6-v2 | Paper (384-dim, 33M params) |
| Topic Model | Gensim LDA (Gibbs sampling) | Paper |
| LSA | SVD-based (e.g., TruncatedSVD) | Paper mentions SVD |
| TF-IDF | scikit-learn TfidfVectorizer | Implementation suggestion |
| Vector DB | ChromaDB | Paper (local instance) |
| Web scraping | BeautifulSoup, Docling | Paper (for corpus prep) |
Key parameters
These values produced the benchmark results on the legal corpus.
| Parameter | Value | Source | Notes |
|---|---|---|---|
| LDA topics (K) | 12 | Paper | "Optimal semantic distinction"; similar results for K=11–14 |
| LSA components | — | Not specified | Paper does not state LSA component count; tune for your corpus |
| Chunk size | 500 words | Paper | With 50-word overlap |
| Fusion alpha | 0.45 | Paper | "Empirically validated on the dataset" |
Tuning guide for K and alpha
Choosing number of topics (K). Start with K between 10 and 20 for most domain corpora. Use coherence scores to evaluate:
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

dictionary = Dictionary(tokenized_docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

coherence_scores = []
for k in range(5, 25, 5):
    model = LdaModel(bow_corpus, num_topics=k, id2word=dictionary)
    cm = CoherenceModel(model=model,
                        texts=tokenized_docs,
                        dictionary=dictionary,
                        coherence='c_v')
    coherence_scores.append((k, cm.get_coherence()))
# Pick the K with the highest coherence, or the elbow point

Validate with a small labeled query set: run retrieval with different K values and measure P@10 on queries with known relevant documents.
Choosing alpha for weighted averaging. The paper found 0.45 optimal (slight topic bias). Start at 0.5 (equal weight) and adjust:
- Lower alpha (0.3-0.4): stronger topic influence, better for corpora with distinct topics
- Higher alpha (0.5-0.6): stronger semantic influence, better for nuanced similarity
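One practical way to tune alpha is a small sweep on a labeled query set. A minimal sketch, where `rebuild_index` (re-fuse and re-index with the given alpha) and `evaluate_p_at_k` (defined in the monitoring section below) are assumed helpers:

```python
def sweep_alpha(queries, relevance_labels, alphas=(0.3, 0.4, 0.45, 0.5, 0.6)):
    # Re-index with each candidate alpha and keep the best Precision@10
    results = {}
    for alpha in alphas:
        rebuild_index(alpha=alpha)  # hypothetical: re-fuse embeddings, rebuild vector index
        results[alpha] = evaluate_p_at_k(queries, relevance_labels, k=10)
    best_alpha = max(results, key=results.get)
    return best_alpha, results
```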
Cost and tradeoffs
Before implementing, understand the resource impact.
| Factor | Impact | Mitigation |
|---|---|---|
| LDA training time | Hours for 10K+ docs | Train offline, cache model |
| Index size (concat) | +3% larger | Use weighted avg instead |
| Query latency | +5-10ms for topic inference | Pre-warm LDA model |
| Retraining frequency | Monthly for evolving corpora | Monitor topic drift |
Critical point for engineering leads: The heavy computation (LDA training, LSA fitting) happens entirely offline during index construction. Users never wait for topic models to train. At query time, you only apply pre-trained transformations—a few matrix multiplications that add ~5-10ms to the retrieval pipeline. The "slowness" is a one-time setup cost, not a user-facing latency penalty.
For a 12,000-document corpus, expect:
- Initial LDA training: 30-60 minutes on CPU (offline, once)
- Embedding generation: 10-20 minutes (batch, offline)
- Total index build: 1-2 hours first time, minutes for incremental updates
- Query-time overhead: 5-10ms (negligible vs network latency)
Core workflow
Step 1: Preprocessing. Chunk documents into 500-word segments with 50-word overlap.
def chunk_document(text, size=500, overlap=50):
    # Split a document into overlapping 500-word chunks (50-word overlap)
    words = text.split()
    chunks = []
    for i in range(0, len(words), size - overlap):
        chunk = ' '.join(words[i:i + size])
        chunks.append(chunk)
    return chunks

Step 2: Train topic model. Fit LDA on the full corpus to discover latent topics.
from gensim import corpora
from gensim.models import LdaModel

# Build the bag-of-words corpus and fit LDA with K=12 topics
dictionary = corpora.Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
lda_model = LdaModel(corpus, num_topics=12, id2word=dictionary)

Step 3: Generate enriched embeddings. For each document, compute contextual embedding and topic vector, then fuse.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

def enrich_embedding(text, lda_model, dictionary):
    # Contextual stream: 384-dim sentence-transformer embedding
    ctx_emb = model.encode(text)
    # Topical stream: probability distribution over the K learned topics
    bow = dictionary.doc2bow(text.split())
    topic_dist = lda_model.get_document_topics(bow)
    topic_vec = [0.0] * lda_model.num_topics
    for topic_id, prob in topic_dist:
        topic_vec[topic_id] = prob
    # Fuse by concatenation: (384 + K)-dimensional enriched embedding
    return np.concatenate([ctx_emb, topic_vec])

Step 4: Index and retrieve. Store enriched embeddings in a vector database. At query time, apply the same enrichment to the query.
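A minimal sketch of this step with ChromaDB, using `enrich_embedding` from Step 3 (the collection name, ID scheme, and a pre-built `chunks` list are illustrative assumptions):

```python
import chromadb

client = chromadb.Client()  # local in-memory instance; persistent clients also exist
collection = client.create_collection(name="legal_chunks")

# Index: store each chunk alongside its enriched embedding
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    embeddings=[enrich_embedding(c, lda_model, dictionary).tolist() for c in chunks],
    documents=chunks,
)

# Query: apply the identical enrichment to the query text, then search
query = "tax exemption requirements for industrial promotion zones"
query_emb = enrich_embedding(query, lda_model, dictionary).tolist()
results = collection.query(query_embeddings=[query_emb], n_results=10)
```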
Evaluation and monitoring loop
Production systems need ongoing measurement. Set up this loop before launch:
1. Create a labeled evaluation set. Select 50-100 representative queries. For each, label 10-20 documents as relevant or not. This takes a few hours but enables all future measurement.
2. Track Precision@10 weekly. Run your evaluation queries against the live index. Plot P@10 over time. Drops indicate topic drift or corpus changes.
import numpy as np

def evaluate_p_at_k(queries, relevance_labels, k=10):
    # Mean Precision@k over a labeled query set; retrieve() is your search call
    scores = []
    for query, relevant_ids in zip(queries, relevance_labels):
        results = retrieve(query, k=k)
        hits = sum(1 for r in results if r.id in relevant_ids)
        scores.append(hits / k)
    return np.mean(scores)

3. Monitor topic drift. Frame topic drift as a key metric for your MLOps dashboard alongside latency and error rates. When new documents arrive, check their topic distributions. If many documents fall outside existing topics (low max probability), that's your signal to retrain LDA.
Set up an "Unknown Topic" alert: for each incoming document, compute the maximum topic probability. If max_prob < 0.3, the document doesn't fit any learned topic well. Track the percentage of such "orphan" documents daily:
def compute_unknown_topic_rate(new_docs, lda_model, dictionary, threshold=0.3):
    # Fraction of incoming documents whose best topic probability is below threshold
    orphan_count = 0
    for doc in new_docs:
        bow = dictionary.doc2bow(doc.split())
        topic_dist = lda_model.get_document_topics(bow)
        max_prob = max((prob for _, prob in topic_dist), default=0)
        if max_prob < threshold:
            orphan_count += 1
    return orphan_count / len(new_docs) if new_docs else 0

Alert thresholds: Fire a warning when the unknown topic rate exceeds 10% for 3 consecutive days. Fire a critical alert at 20%. These thresholds turn the abstract problem of "topics change over time" into a concrete operational task: "retrain the topic model."
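A minimal sketch of that alert rule, using the thresholds above (how you track daily rates and deliver alerts is up to your monitoring stack):

```python
def check_topic_drift_alert(daily_orphan_rates):
    # daily_orphan_rates: list of unknown-topic rates, most recent last
    if daily_orphan_rates and daily_orphan_rates[-1] > 0.20:
        return "critical"  # over 20% of new documents fit no learned topic
    if len(daily_orphan_rates) >= 3 and all(r > 0.10 for r in daily_orphan_rates[-3:]):
        return "warning"   # above 10% for three consecutive days
    return "ok"
```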
4. Retrain schedule. For stable corpora: quarterly. For growing corpora: when P@10 drops 5+ points or 20%+ new documents arrive.
Pitfalls to avoid
Topic count selection. K=12 worked for legal documents. Your corpus may need different values. Use coherence scores and validation queries to tune.
Domain drift. The topic model learns from your corpus. If new documents introduce new topics, retrain periodically.
Embedding alignment. Query and document embeddings must use identical processing. Store all trained artifacts (TF-IDF vocabulary, LSA projection, LDA model).
Memory overhead. Concatenation increases embedding dimensionality. For large corpora, weighted averaging may be more practical.
Limitations
The paper acknowledges several constraints that affect applicability.
Cold start problem. LDA requires a pre-existing corpus to train meaningful topics. You cannot use this method effectively on Day 1 with zero or few documents. Topic models learn from document co-occurrence patterns; with insufficient data, the "topics" will be noise. Plan for at least 1,000+ documents before topic enrichment adds value. Until then, use contextual embeddings alone.
Dataset scope. All quantitative results in this paper come from one specific corpus: 12,436 Argentine legal documents in Spanish covering Law 19.640 (industrial promotion). This is a domain with clear topical structure, specialized vocabulary, and limited topic diversity. The 4-5 percentage point improvements reported are validated only for this legal corpus. Generalization to other domains (healthcare, customer support, technical documentation) requires independent validation. The authors note that the approach "can be applied to other domains with topical structure" but this claim has not been tested.
Computational overhead. LDA training is CPU-intensive. Processing 12,000+ documents takes 30-60 minutes. Real-time applications need pre-computed topic models. The paper notes that LSA helps "minimize computational overhead" through dimensionality reduction—without it, the full TF-IDF vocabulary would make fusion impractical at scale.
Parameter sensitivity. LDA topic count (K) requires corpus-specific tuning. There is no universal value. Poor choices degrade performance. The paper selected K=12 "based on qualitative interpretability tailored to the specific characteristics of the corpus"—meaning manual inspection of topic coherence, not automated optimization. Your corpus may need K=5 or K=50 depending on domain complexity.
Scalability questions. The paper tests on ~12K documents. Behavior at millions of documents remains unexplored. LDA training does not scale linearly.
Language-specific topic models. While modern embedding models (like multilingual-e5 or multilingual MiniLM) work across languages, LDA and TF-IDF are fundamentally language-dependent. Topic models learn from word co-occurrence patterns in one language; a Spanish LDA model cannot analyze English documents. For multilingual corpora, you need either: (a) separate topic models per language, (b) translation to a common language before topic modeling, or (c) cross-lingual topic models (which are less mature). Do not assume a topic-enriched pipeline trained on Spanish will work for mixed Spanish/English documents.
Illustration vs. validation. The paper notes that some examples use artificial data for illustration. While benchmark numbers come from real evaluation, readers should validate on their own data.
When to use this approach
Topic enrichment makes sense when:
- Your corpus has clear topical structure (legal, medical, technical docs)
- Users search for topical relevance, not just semantic similarity
- You have sufficient documents to train meaningful topic models (thousands, not hundreds)
- Retrieval precision matters more than processing speed
- Your corpus is relatively stable (not changing daily)
Skip it when:
- Documents lack clear topics (e.g., general web content, social media)
- Real-time processing is critical and you cannot pre-compute topics
- Your corpus changes rapidly (constant retraining needed)
- You only have hundreds of documents (insufficient for topic modeling)
Cite this paper
Rodrigo Kataishi (2026). Topic-Enriched Embeddings: Combining Classical NLP with Modern RAG. arXiv 2026.