Tekta.ai
arXiv 2026 · January 1, 2026

Topic-Enriched Embeddings: Combining Classical NLP with Modern RAG

Rodrigo Kataishi

This research proposes a hybrid embedding technique that combines term-frequency signals (TF-IDF), dimensionality-reduced semantics (LSA), and probabilistic topic modeling (LDA) with contextual sentence embeddings. Testing on a 12,436-document legal corpus demonstrates improvements in both clustering coherence and retrieval metrics compared to statistical, probabilistic, and contextual-only baselines.

Categories: Information Retrieval · Natural Language Processing · Machine Learning

Key Findings

  1. Adds 'topic awareness' to embeddings: combines classical topic modeling (LDA) with modern sentence transformers for better document clustering
  2. Improves retrieval precision: 87% precision at k=10 compared to 83% for contextual embeddings alone on legal documents
  3. Uses three signal types: term frequency (TF-IDF), latent semantics (LSA), and probabilistic topics (LDA) fused with transformer embeddings
  4. Tested on Spanish legal corpus: 12,436 Argentine documents covering industrial promotion law (Law 19.640), 1972-2020
  5. Two fusion strategies: concatenation preserves distinct signals, weighted averaging (alpha=0.45) balances contribution
  6. Reproducible framework with code for custom corpora

TL;DR
  1. The Problem. Contextual embeddings (like sentence transformers) capture semantic similarity but miss topical structure. Two documents about "contract termination" and "employment law" might be semantically similar but belong to different legal topics

  2. The Solution. Fuse classical NLP signals (TF-IDF for term importance, LSA for latent semantics, LDA for topic membership) with modern embeddings. Two fusion approaches: concatenation or weighted averaging

  3. The Results. 4 percentage point improvement in Precision@10 (87% vs 83%) on Spanish legal documents. Clustering coherence improves from 0.64 to 0.70 Silhouette score. Ablation confirms gains come from topic structure, not added dimensions

Executive impact

For every 10 documents your system retrieves, topic enrichment means roughly 0.4 fewer irrelevant results on average (8.7 relevant vs 8.3). In practice:

  • Reduced review time. Legal and compliance teams spend less time filtering wrong documents from search results
  • Lower compliance risk. Fewer off-topic documents means less chance of missing critical information buried in noise
  • Better user trust. When the first page of results is consistently on-topic, users trust the system and use it more
  • Lower hallucination risk. RAG systems hallucinate when they try to answer questions using irrelevant retrieved context. If your retriever feeds the LLM an employment document when the user asked about tax law, the model may confidently generate incorrect answers. Higher precision means the model sees more on-topic context, reducing the chance of generating plausible-sounding but wrong responses

Concrete example: A compliance analyst running 1,000 search queries per month on a legal knowledge base must manually discard roughly 1,700 off-topic documents with standard embeddings (1.7 per query × 1,000 queries). With topic enrichment, that drops to roughly 1,300 off-topic documents: 400 fewer manual reviews per month. At 30 seconds per review, that is about 200 minutes of analyst time recovered per month, or roughly 50 minutes per week.

The tradeoff is upfront indexing cost (LDA training adds hours to initial setup) and periodic retraining when your document collection evolves. For stable, domain-specific corpora where precision matters—especially where wrong answers carry legal or financial risk—the ROI is positive.

Research Overview

Modern RAG systems rely on embedding models to find relevant documents. You embed a query, compare it to document embeddings, and retrieve the closest matches. This works well when semantic similarity aligns with relevance. But semantic similarity is not the same as topical relevance.

Dataset context. This paper tests on a specific corpus: 12,436 Argentine legal documents in Spanish, covering industrial promotion legislation (Law 19.640) from 1972 to 2020. The documents come from InfoLeg and SAIJ legal repositories. Results should be interpreted with this scope in mind. Domains with less clear topical structure may see different outcomes.

Consider a legal database. A query about "industrial tax exemptions" should retrieve documents about tax law, not employment contracts that happen to mention "industrial" in passing. Contextual embeddings see both as semantically related (they share vocabulary and context patterns). Topic modeling sees them as distinct categories.

What is topic modeling?

Topic modeling algorithms like LDA (Latent Dirichlet Allocation) discover hidden thematic structure in document collections. Each document gets a probability distribution over topics. A legal document might be 60% "tax law," 30% "corporate governance," and 10% "contract law." This topical fingerprint captures something embeddings miss.

This paper proposes a hybrid approach: take the contextual embeddings you already use and enrich them with topic signals. The intuition is simple. Embeddings capture what words mean. Topic models capture what documents are about. Both signals matter for retrieval.

Connection to "Hybrid RAG" and Hybrid Search

This paper provides academic validation for the industry trend called "Hybrid Search" or "Hybrid RAG"—combining keyword/lexical signals with semantic vectors. If you've followed the "Vector vs. Keyword" debate, this research shows that the answer is "both." TF-IDF and topic models capture explicit term matches and thematic structure that pure embeddings miss. The paper's three-stream architecture (lexical + topical + contextual) is a rigorous implementation of what practitioners call hybrid retrieval.

The core contribution

Metric | Topic-Enriched | Contextual Only | Δ
Precision@10 | 0.87 | 0.83 | +4 pp
Recall@10 | 0.72 | 0.67 | +5 pp
F1@10 | 0.79 | 0.74 | +5 pp
Silhouette | 0.70 | 0.64 | +0.06

All retrieval metrics measured at k=10. Δ is the absolute improvement: percentage points for the retrieval metrics, raw score difference for Silhouette.

The improvements are modest but consistent. Ablation studies confirm the gains come from meaningful topic integration rather than dimensionality expansion alone.

Before and after: a retrieval example

To illustrate the difference, consider a query about tax exemptions in industrial zones. The paper uses artificial examples for illustration, but they demonstrate the core dynamic.

Query: "tax exemption requirements for industrial promotion zones"

Rank | Contextual-only Result | Topic-Enriched Result
1 | Tax exemption procedures for Law 19.640 beneficiaries ✓ | Tax exemption procedures for Law 19.640 beneficiaries ✓
2 | Employment termination procedures in industrial facilities ✗ | Tax credit calculations for promotional regimes ✓
3 | Environmental compliance for industrial zones ✗ | Documentation requirements for tax benefits ✓

✓ = relevant, ✗ = off-topic. Contextual-only retrieval is confused by shared vocabulary ("industrial," "zones"). Topic-enriched retrieval recognizes the query is about tax law.

The topic model recognizes that "tax exemption" queries should retrieve tax law documents, not employment or environmental documents that happen to share vocabulary. This is the difference between "semantically similar" and "topically relevant."

The Topic Blindness Problem

Contextual embeddings encode meaning at the sentence level. They excel at capturing semantic relationships: synonyms, paraphrases, contextual usage. But they have a blind spot for document-level topical structure.

Why does topic structure matter?

In specialized domains, documents cluster around topics. Legal documents fall into categories: tax law, employment law, property law. Medical records cluster by specialty. Technical documentation groups by product area. When users search these collections, they often want topical relevance, not just semantic similarity.

Consider what happens when you embed these two sentences:

  • "The employee terminated the contract after 90 days notice"
  • "Tax obligations terminate upon dissolution of the industrial entity"

Both use "terminate" in a legal context. Contextual embeddings see high similarity. But topically, they belong to different domains (employment law vs. tax law). A legal professional searching for termination clauses in employment contracts does not want tax documents polluting their results.

The paper identifies three limitations of contextual-only approaches:

1. Local context dominance. Sentence transformers process text in windows. Document-level themes get diluted across many embeddings.

2. Vocabulary overlap confusion. Domain-specific corpora reuse terminology across topics. "Industrial" appears in tax law, environmental law, and employment law with different meanings.

3. No explicit topic representation. The model has no mechanism to say "this document is primarily about X." Topic membership is implicit, buried in the embedding space.

Beyond legal: a customer support example. This problem applies to any domain with topical structure. Consider a support knowledge base where a user asks "How do I return a product?" Two articles might share nearly identical vocabulary:

  • "Product Return Policy" (topic: policy/rules) — explains eligibility windows and restocking fees
  • "How to Return a Broken Item" (topic: instructions/process) — step-by-step guide with shipping labels

Embeddings see high similarity (both mention "return," "product," "item"). But one is policy, the other is procedure. A user wanting to actually return something needs the instructions, not the policy. Topic modeling distinguishes these categories.

Why classical methods still matter

TF-IDF, LSA, and LDA are 20+ year old techniques. They predate transformers by decades. But they capture something transformers do not: explicit statistical and probabilistic structure.

Method | Pipeline role | What it captures | Limitation
TF-IDF | Lexical stream | Term importance | No semantics
LSA | Lexical stream | Latent semantics | Linear only
LDA | Topical stream | Topic membership | Bag of words
Embeddings | Contextual stream | Contextual meaning | No explicit topics

The insight is that these methods are complementary, not competing. The paper's architecture fuses all three streams to produce richer document representations.

Architecture

LSA vs LDA: Two different approaches

LSA (Latent Semantic Analysis) uses linear algebra (SVD) to compress TF-IDF vectors while preserving word co-occurrence patterns. It's fast and deterministic. LDA (Latent Dirichlet Allocation) is a probabilistic model that assumes documents are mixtures of topics. It's slower but produces interpretable topic labels. This pipeline uses both: LSA for efficient dimensionality reduction, LDA for explicit topic membership.
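
For the lexical stream, a minimal sketch with scikit-learn (the paper names TfidfVectorizer and an SVD-based LSA; the vocabulary cap and component count below are assumptions to tune for your corpus):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# `chunks` is assumed to be the list of 500-word chunk strings produced during preprocessing
tfidf = TfidfVectorizer(max_features=20000)
X_tfidf = tfidf.fit_transform(chunks)          # sparse term-importance vectors

# LSA: compress TF-IDF while keeping word co-occurrence structure
# (100 components is an assumption; the paper does not state a count)
lsa = TruncatedSVD(n_components=100, random_state=42)
X_lsa = lsa.fit_transform(X_tfidf)             # dense lexical-stream vectors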

The system operates in two phases: offline index construction and online query processing.

Topic Enrichment Pipeline

Two-phase architecture: index construction and query processing

The pipeline fuses three signal types: lexical (TF-IDF/LSA), topical (LDA), and contextual (sentence transformer). All three streams process documents in parallel during indexing, then combine via concatenation or weighted averaging.

How it works (plain English)

Index phase (run once per corpus):

  1. Chunk documents into 500-word segments with overlap
  2. Build TF-IDF vectors to capture which terms matter in each chunk
  3. Apply LSA to reduce TF-IDF dimensions while preserving semantic relationships
  4. Train LDA on the corpus to discover latent topics (e.g., "tax law," "employment," "contracts")
  5. Generate topic vectors showing each chunk's probability distribution over topics
  6. Compute contextual embeddings using all-MiniLM-L6-v2 (384 dimensions)
  7. Fuse signals via concatenation or weighted averaging
  8. Store enriched embeddings in a vector database

Query phase (run per search):

  1. Transform query using the same TF-IDF vocabulary, LSA projection, and LDA model
  2. Compute query embedding with the same sentence transformer
  3. Fuse query signals identically to documents
  4. Retrieve nearest neighbors from the vector database

Why alignment matters

The TF-IDF vocabulary and LDA topics are learned from the corpus. A query must be projected using the same learned structures to enable meaningful comparison. This is why the system stores trained artifacts during indexing.
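
A minimal sketch of query-side alignment, assuming the fitted artifacts (tfidf, lsa, lda_model, dictionary, and the same sentence-transformer encoder used for documents) are loaded at query time. The fusion step at the end must mirror whatever fusion was used when building the index.

import numpy as np

def embed_query(query, tfidf, lsa, lda_model, dictionary, encoder):
    # Lexical stream: same TF-IDF vocabulary and LSA projection as the index
    q_lsa = lsa.transform(tfidf.transform([query]))[0]

    # Topical stream: same LDA model and dictionary as the index
    bow = dictionary.doc2bow(query.lower().split())
    q_topics = np.zeros(lda_model.num_topics)
    for topic_id, prob in lda_model.get_document_topics(bow, minimum_probability=0.0):
        q_topics[topic_id] = prob

    # Contextual stream: same sentence transformer as the index
    q_ctx = encoder.encode(query)

    # Fuse exactly as documents were fused (all three streams concatenated here;
    # drop q_lsa if your index fuses only context + topic)
    return np.concatenate([q_ctx, q_topics, q_lsa])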

Fusion Strategies

Two approaches combine the different signal types into unified embeddings. Think of it like preparing food: you can either keep ingredients in separate compartments (bento box) or blend them together (smoothie).

Concatenation (the "bento box")

The simplest approach: stack the vectors end-to-end.

e_new = [e_context, t_topic]

A 384-dimensional contextual embedding concatenated with a 12-dimensional topic vector produces a 396-dimensional enriched embedding. Each signal type retains its distinct contribution.

Like a bento box where rice, fish, and vegetables sit in separate compartments, concatenation preserves signal independence. The retrieval system sees both semantic similarity and topic alignment as separate dimensions. You can taste each flavor distinctly. This works well when both signals matter equally.

Weighted averaging (the "smoothie")

When you want tighter integration:

e_new = alpha * e_context + (1-alpha) * t_topic

The paper finds alpha=0.45 works well through empirical validation. This means topic signals get slightly more weight than contextual embeddings. The resulting vector has the same dimensionality as the contextual embedding (384).

Like blending fruits into a smoothie, weighted averaging forces the signals to interact and creates a new unified flavor. Topic information modulates semantic similarity rather than adding independent dimensions. You cannot separate the ingredients afterward, but the result may be more balanced than any single component.
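
A minimal sketch of the weighted average with alpha=0.45 as in the paper. The paper does not spell out how the K-dimensional topic vector is brought up to the 384 dimensions of the contextual embedding; zero-padding it, as below, is an assumption used for illustration.

import numpy as np

def fuse_weighted(ctx_emb, topic_vec, alpha=0.45):
    # Assumption: zero-pad the K-dim topic vector to the embedding size
    # so the two vectors can be averaged element-wise
    padded_topics = np.zeros_like(ctx_emb)
    padded_topics[:len(topic_vec)] = topic_vec
    return alpha * ctx_emb + (1 - alpha) * padded_topics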

Which to choose?

Strategy | Alpha | Output dim | When to use
Concatenation | n/a | 384 + K (396 with K=12) | Need both signals separately, have memory budget
Weighted averaging | 0.45 | 384 (preserved) | Memory constrained, want simpler similarity

The paper sets α=0.45 for weighted averaging (slight topic bias). Concatenation expands dimensionality (384 + K topics), while weighted averaging preserves the original 384 dimensions. Both approaches show similar retrieval performance on the legal corpus.

Benchmark Results

The evaluation uses a 12,436-document corpus of Argentine legal documents in Spanish from the InfoLeg and SAIJ repositories. Documents span 1972 to 2020 and cover industrial promotion legislation (Law 19.640).

Understanding retrieval metrics

Precision@k = fraction of top k results that are relevant. P@10 of 0.87 means 8.7 of 10 results are relevant.
Recall@k = fraction of all relevant documents found in top k. R@10 of 0.72 means 72% of relevant docs appear in top 10.
F1@k = harmonic mean of precision and recall, balancing both. Higher is better.
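
A minimal sketch of these metrics for a single query, given the ranked list of retrieved document IDs and the set of all relevant IDs (both names are illustrative):

def retrieval_metrics(retrieved_ids, relevant_ids, k=10):
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)

    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1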

Retrieval Performance Comparison (k=10)

Mean Precision@10, Recall@10, and F1@10 across embedding techniques

Retrieval performance

The paper compares five embedding techniques across retrieval depths (k=10, 20, 50). Table 2 in the paper reports k=10 results; we show those here.

Embedding Technique | P@10 | R@10 | F1@10
Topic-Enriched Embeddings | 0.87 | 0.72 | 0.79
Contextual Embeddings | 0.83 | 0.67 | 0.74
LDA-Enriched | 0.79 | 0.62 | 0.70
LSA-Enriched | 0.75 | 0.58 | 0.65
TF-IDF Enriched | 0.68 | 0.51 | 0.58

Note: The paper's evaluation mentions k=10, 20, and 50, but Table 2 reports k=10 only.

The topic-enriched approach outperforms all baselines. The contextual-only baseline (all-MiniLM-L6-v2) comes second, confirming that modern embeddings are strong but improvable.

Note on evaluation methodology. The paper states that "artificial data and outputs are used for illustration" in some examples. The benchmark numbers come from the actual corpus, but readers should verify results on their own data before production deployment.

Precision–Recall tradeoff

The paper's Figure 3 shows precision-recall curves across the full recall range. Topic-enriched embeddings maintain superior precision as recall increases, with the gap widening at higher recall levels.

Precision-Recall Curves

Performance across recall range (mean +/- std over 5 runs)

Shaded regions show ±1 standard deviation over 5 random seeds. Topic-enriched (gold) maintains precision better than contextual-only (steel) as recall increases.

Clustering quality

Beyond retrieval, the paper evaluates clustering coherence (Table 1 in the paper). Better embeddings should produce cleaner document clusters. The progression from TF-IDF through contextual to topic-enriched shows consistent improvement.

Embedding Technique | Silhouette ↑ | Calinski-Harabasz ↑ | Davies-Bouldin ↓
TF-IDF Enriched | 0.47 | 330.5 | 1.85
LSA-Enriched | 0.54 | 425.2 | 1.50
LDA-Enriched | 0.57 | 465.7 | 1.39
Contextual Embeddings | 0.64 | 525.4 | 1.30
Topic-Enriched Embeddings | 0.70 | 580.6 | 1.19

↑ = higher is better, ↓ = lower is better

Interpreting clustering metrics

Silhouette score measures how similar documents are to their own cluster versus other clusters (higher is better, max 1.0). Calinski-Harabasz measures cluster compactness (higher is better). Davies-Bouldin measures cluster overlap (lower is better).
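
All three scores are available in scikit-learn. A minimal sketch, assuming `embeddings` is the matrix of document vectors and clusters come from KMeans with one cluster per LDA topic (an assumption; the paper does not state its clustering setup):

from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

labels = KMeans(n_clusters=12, random_state=42).fit_predict(embeddings)

print("Silhouette:", silhouette_score(embeddings, labels))                # higher is better
print("Calinski-Harabasz:", calinski_harabasz_score(embeddings, labels))  # higher is better
print("Davies-Bouldin:", davies_bouldin_score(embeddings, labels))        # lower is better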

The improvements indicate that topic enrichment creates embeddings that better reflect true document similarity. Each method in the progression adds signal that improves cluster separation.

Visual cluster separation

What is t-SNE?

t-SNE (t-distributed Stochastic Neighbor Embedding) is an algorithm that maps high-dimensional vectors into 2D or 3D for visualization. It preserves local similarity—points close in high-dimensional space stay close in the plot. Useful for spotting cluster structure, but can distort global distances.
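
A minimal sketch of such a projection with scikit-learn and matplotlib, assuming `embeddings` and per-document `cluster_labels` from the clustering step above (the perplexity value is an assumption to tune for corpus size):

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

coords = TSNE(n_components=2, perplexity=30,
              random_state=42).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], c=cluster_labels, s=5, cmap="tab20")
plt.title("t-SNE projection of topic-enriched embeddings")
plt.show()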

The paper's Figure 2 shows t-SNE projections of document embeddings. Topic-enriched embeddings produce compact, well-separated clusters, while TF-IDF shows significant overlap between topic groups.

t-SNE Cluster Visualization

Document clusters by embedding method - topic-enriched shows cleaner separation

t-SNE visualization of document clusters by embedding method. Each color represents a topic cluster. Topic-enriched embeddings (right) show cleaner separation than TF-IDF (left) or LDA alone (center).

Ablation Analysis

Do the gains come from meaningful topic signals, or just added dimensions? The ablation study (Table 3 in the paper) tests this by comparing fusion variants and replacing topic vectors with random vectors.

Ablation Study: Component Contributions

Compressed F1@10 view - see table below for concat vs weighted variants

Note: The chart above shows a compressed F1 view. The full ablation results below distinguish concatenation vs weighted averaging variants.

Component contributions by fusion method

Variant | P@10 | R@10 | F1@10
Contextual Only | 0.845 | 0.60 | 0.74
+ LSA (concatenation) | 0.845 | 0.69 | 0.78
+ LDA (concatenation) | 0.852 | 0.68 | 0.77
+ LSA (weighted average) | 0.851 | 0.71 | 0.78
+ LDA (weighted average) | 0.866 | 0.71 | 0.78
Topic-Enriched (full) | 0.870 | 0.72 | 0.80
Random Topic Vectors | 0.814 | 0.65 | 0.71

The random baseline performs worse than contextual-only (F1: 0.71 vs 0.74). This confirms that gains come from meaningful topic structure, not dimensionality expansion.

Both fusion strategies show improvement. Weighted averaging with LDA achieves slightly higher precision (0.866) than the concatenation variants. The full pipeline combining all signals provides the best F1 (0.80, a gain of 6 points over the 0.74 contextual-only baseline).

Implementation Blueprint

The paper provides code for reproducing results. The table below reflects tools explicitly named in the paper where available.

Component | Tool | Source
Embeddings | all-MiniLM-L6-v2 | Paper (384-dim, 33M params)
Topic Model | Gensim LDA (Gibbs sampling) | Paper
LSA | SVD-based (e.g., TruncatedSVD) | Paper mentions SVD
TF-IDF | scikit-learn TfidfVectorizer | Implementation suggestion
Vector DB | ChromaDB | Paper (local instance)
Web scraping | BeautifulSoup, Docling | Paper (for corpus prep)

Key parameters

These values produced the benchmark results on the legal corpus.

Parameter | Value | Source | Notes
LDA topics (K) | 12 | Paper | "Optimal semantic distinction"; similar results for K=11–14
LSA components | Not specified | Paper | Component count not stated; tune for your corpus
Chunk size | 500 words | Paper | With 50-word overlap
Fusion alpha | 0.45 | Paper | "Empirically validated on the dataset"

Tuning guide for K and alpha

Choosing number of topics (K). Start with K between 10 and 20 for most domain corpora. Use coherence scores to evaluate:

from gensim import corpora
from gensim.models import LdaModel, CoherenceModel

# `tokenized_docs` is a list of token lists, one per chunk
dictionary = corpora.Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

coherence_scores = []
for k in range(5, 25, 5):
    model = LdaModel(corpus, num_topics=k, id2word=dictionary)
    cm = CoherenceModel(model=model,
                        texts=tokenized_docs,
                        dictionary=dictionary,
                        coherence='c_v')
    coherence_scores.append((k, cm.get_coherence()))

# Pick K with highest coherence, or the elbow point

Validate with a small labeled query set: run retrieval with different K values and measure P@10 on queries with known relevant documents.

Choosing alpha for weighted averaging. The paper found 0.45 optimal (slight topic bias). Start at 0.5 (equal weight) and adjust (a small sweep sketch follows the list):

  • Lower alpha (0.3-0.4): stronger topic influence, better for corpora with distinct topics
  • Higher alpha (0.5-0.6): stronger semantic influence, better for nuanced similarity
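
A minimal sketch of that sweep, assuming a labeled query set, the `evaluate_p_at_k` helper from the evaluation loop below, and a `build_index(alpha)` function of your own that re-fuses and re-indexes embeddings for a given weighting (both helpers are assumptions about your pipeline):

results = {}
for alpha in (0.3, 0.4, 0.45, 0.5, 0.6):
    build_index(alpha=alpha)        # re-fuse and re-index with this weighting
    results[alpha] = evaluate_p_at_k(queries, relevance_labels, k=10)

best_alpha = max(results, key=results.get)
print(f"Best alpha by P@10: {best_alpha} ({results[best_alpha]:.3f})")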

Cost and tradeoffs

Before implementing, understand the resource impact.

Factor | Impact | Mitigation
LDA training time | Hours for 10K+ docs | Train offline, cache model
Index size (concat) | +3% larger | Use weighted avg instead
Query latency | +5-10 ms for topic inference | Pre-warm LDA model
Retraining frequency | Monthly for evolving corpora | Monitor topic drift

Critical point for engineering leads: The heavy computation (LDA training, LSA fitting) happens entirely offline during index construction. Users never wait for topic models to train. At query time, you only apply pre-trained transformations—a few matrix multiplications that add ~5-10ms to the retrieval pipeline. The "slowness" is a one-time setup cost, not a user-facing latency penalty.

For a 12,000-document corpus, expect:

  • Initial LDA training: 30-60 minutes on CPU (offline, once)
  • Embedding generation: 10-20 minutes (batch, offline)
  • Total index build: 1-2 hours first time, minutes for incremental updates
  • Query-time overhead: 5-10ms (negligible vs network latency)

Core workflow

Step 1: Preprocessing. Chunk documents into 500-word segments with 50-word overlap.

def chunk_document(text, size=500, overlap=50):
    words = text.split()
    chunks = []
    for i in range(0, len(words), size - overlap):
        chunk = ' '.join(words[i:i + size])
        chunks.append(chunk)
    return chunks

Step 2: Train topic model. Fit LDA on the full corpus to discover latent topics.

from gensim import corpora
from gensim.models import LdaModel
 
dictionary = corpora.Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
lda_model = LdaModel(corpus, num_topics=12,
                     id2word=dictionary)
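
Once trained, the topics can be inspected for interpretability (the paper chose K=12 by qualitative inspection of topic coherence):

# Print the top words for each learned topic
for topic_id, top_words in lda_model.print_topics(num_topics=12, num_words=8):
    print(topic_id, top_words)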

Step 3: Generate enriched embeddings. For each document, compute contextual embedding and topic vector, then fuse.

from sentence_transformers import SentenceTransformer
import numpy as np
 
model = SentenceTransformer('all-MiniLM-L6-v2')
 
def enrich_embedding(text, lda_model, dictionary):
    # Contextual stream: 384-dim sentence-transformer embedding
    ctx_emb = model.encode(text)

    # Topical stream: fixed-length topic distribution from the trained LDA model
    bow = dictionary.doc2bow(text.split())
    topic_dist = lda_model.get_document_topics(bow)
    topic_vec = [0.0] * lda_model.num_topics
    for topic_id, prob in topic_dist:
        topic_vec[topic_id] = prob

    # Fusion by concatenation: 384 + K dimensions
    return np.concatenate([ctx_emb, topic_vec])

Step 4: Index and retrieve. Store enriched embeddings in a vector database. At query time, apply the same enrichment to the query.
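
A minimal sketch with ChromaDB, the paper's local vector store; the collection name, ID scheme, and cosine-distance setting are illustrative choices, not the paper's exact configuration.

import chromadb

client = chromadb.PersistentClient(path="./index")
collection = client.get_or_create_collection(
    name="legal_chunks", metadata={"hnsw:space": "cosine"})

# Index: one enriched embedding per chunk
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    embeddings=[enrich_embedding(c, lda_model, dictionary).tolist() for c in chunks],
    documents=chunks,
)

# Query: enrich the query identically, then retrieve nearest neighbors
query_emb = enrich_embedding("tax exemption requirements", lda_model, dictionary)
results = collection.query(query_embeddings=[query_emb.tolist()], n_results=10)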

Evaluation and monitoring loop

Production systems need ongoing measurement. Set up this loop before launch:

1. Create a labeled evaluation set. Select 50-100 representative queries. For each, label 10-20 documents as relevant or not. This takes a few hours but enables all future measurement.

2. Track Precision@10 weekly. Run your evaluation queries against the live index. Plot P@10 over time. Drops indicate topic drift or corpus changes.

def evaluate_p_at_k(queries, relevance_labels, k=10):
    scores = []
    for query, relevant_ids in zip(queries, relevance_labels):
        results = retrieve(query, k=k)   # your retrieval function over the live index
        hits = sum(1 for r in results if r.id in relevant_ids)
        scores.append(hits / k)
    return np.mean(scores)

3. Monitor topic drift. Frame topic drift as a key metric for your MLOps dashboard alongside latency and error rates. When new documents arrive, check their topic distributions. If many documents fall outside existing topics (low max probability), that's your signal to retrain LDA.

Set up an "Unknown Topic" alert: for each incoming document, compute the maximum topic probability. If max_prob < 0.3, the document doesn't fit any learned topic well. Track the percentage of such "orphan" documents daily:

def compute_unknown_topic_rate(new_docs, lda_model, dictionary, threshold=0.3):
    orphan_count = 0
    for doc in new_docs:
        bow = dictionary.doc2bow(doc.split())
        topic_dist = lda_model.get_document_topics(bow)
        max_prob = max((prob for _, prob in topic_dist), default=0)
        if max_prob < threshold:
            orphan_count += 1
    return orphan_count / len(new_docs) if new_docs else 0

Alert thresholds: Fire a warning when the unknown topic rate exceeds 10% for 3 consecutive days. Fire a critical alert at 20%. These thresholds turn the abstract problem of "topics change over time" into a concrete operational task: "retrain the topic model."
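
A minimal sketch of that alert rule, assuming a daily job that receives the day's new documents (the alert sink here is a plain print):

rate = compute_unknown_topic_rate(todays_docs, lda_model, dictionary)

if rate > 0.20:
    print(f"CRITICAL: unknown-topic rate {rate:.0%}: retrain the topic model")
elif rate > 0.10:
    print(f"WARNING: unknown-topic rate {rate:.0%}: watch for 3 consecutive days")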

4. Retrain schedule. For stable corpora: quarterly. For growing corpora: when P@10 drops 5+ points or 20%+ new documents arrive.

Pitfalls to avoid

Topic count selection. K=12 worked for legal documents. Your corpus may need different values. Use coherence scores and validation queries to tune.

Domain drift. The topic model learns from your corpus. If new documents introduce new topics, retrain periodically.

Embedding alignment. Query and document embeddings must use identical processing. Store all trained artifacts (TF-IDF vocabulary, LSA projection, LDA model).

Memory overhead. Concatenation increases embedding dimensionality. For large corpora, weighted averaging may be more practical.

Limitations

The paper acknowledges several constraints that affect applicability.

Cold start problem. LDA requires a pre-existing corpus to train meaningful topics. You cannot use this method effectively on Day 1 with zero or few documents. Topic models learn from document co-occurrence patterns; with insufficient data, the "topics" will be noise. Plan for at least 1,000+ documents before topic enrichment adds value. Until then, use contextual embeddings alone.

Dataset scope. All quantitative results in this paper come from one specific corpus: 12,436 Argentine legal documents in Spanish covering Law 19.640 (industrial promotion). This is a domain with clear topical structure, specialized vocabulary, and limited topic diversity. The 4-5 percentage point improvements reported are validated only for this legal corpus. Generalization to other domains (healthcare, customer support, technical documentation) requires independent validation. The authors note that the approach "can be applied to other domains with topical structure" but this claim has not been tested.

Computational overhead. LDA training is CPU-intensive. Processing 12,000+ documents takes 30-60 minutes. Real-time applications need pre-computed topic models. The paper notes that LSA helps "minimize computational overhead" through dimensionality reduction—without it, the full TF-IDF vocabulary would make fusion impractical at scale.

Parameter sensitivity. LDA topic count (K) requires corpus-specific tuning. There is no universal value. Poor choices degrade performance. The paper selected K=12 "based on qualitative interpretability tailored to the specific characteristics of the corpus"—meaning manual inspection of topic coherence, not automated optimization. Your corpus may need K=5 or K=50 depending on domain complexity.

Scalability questions. The paper tests on ~12K documents. Behavior at millions of documents remains unexplored. LDA training does not scale linearly.

Language-specific topic models. While modern embedding models (like multilingual-e5 or multilingual MiniLM) work across languages, LDA and TF-IDF are fundamentally language-dependent. Topic models learn from word co-occurrence patterns in one language; a Spanish LDA model cannot analyze English documents. For multilingual corpora, you need either: (a) separate topic models per language, (b) translation to a common language before topic modeling, or (c) cross-lingual topic models (which are less mature). Do not assume a topic-enriched pipeline trained on Spanish will work for mixed Spanish/English documents.

Illustration vs. validation. The paper notes that some examples use artificial data for illustration. While benchmark numbers come from real evaluation, readers should validate on their own data.

When to use this approach

Topic enrichment makes sense when:

  • Your corpus has clear topical structure (legal, medical, technical docs)
  • Users search for topical relevance, not just semantic similarity
  • You have sufficient documents to train meaningful topic models (thousands, not hundreds)
  • Retrieval precision matters more than processing speed
  • Your corpus is relatively stable (not changing daily)

Skip it when:

  • Documents lack clear topics (e.g., general web content, social media)
  • Real-time processing is critical and you cannot pre-compute topics
  • Your corpus changes rapidly (constant retraining needed)
  • You only have hundreds of documents (insufficient for topic modeling)

Authors

Rodrigo Kataishi, CONICET / National University of Tierra del Fuego, Argentina

Cite this paper

Rodrigo Kataishi (2026). Topic-Enriched Embeddings: Combining Classical NLP with Modern RAG. arXiv 2026.
