- The Problem. E-commerce ML models need labeled product data (attribute-text pairs), but manual annotation is expensive and does not scale. Real catalogs have messy, inconsistent data.
- The Solution. Use LLMs to generate synthetic products with three controlled strategies: correct attributes, intentional errors, and missing attributes. This trains models to handle real-world data quality issues.
- The Results. 100% synthetic data matches 100% real data performance (60.5% vs 60.8%). A hybrid 75/25 real/synthetic mix reaches 68.8%. Human evaluators rated 99.6% of outputs as natural.
- The Business Case. Human-level quality at machine-level cost. LLM generation runs ~$0.80-$4.00 per million tokens vs human annotation at ~$0.11 per 50 tokens. Particularly valuable for new markets or languages where labeled data is scarce.
Research overview
If you manage product catalogs or build ML models for e-commerce, you know the data problem. Product listings are messy. The same attribute appears in different formats: "waterproof" in one listing, "water-resistant material" in another, "keeps you dry in rain" in a third. Training models to extract structured attributes from this chaos requires labeled data. Lots of it.
Manual labeling does not scale. Annotators cost money, make mistakes, and cannot keep up with catalog growth. You need thousands of examples per attribute, across categories, in every language you support.
The task: given a product title and description, extract structured attributes such as color, size, material, or brand. For example, "Blue Nike Running Shoes Size 10" should yield color="blue", brand="Nike", type="running shoes", size="10". This powers search filters, recommendations, and catalog organization.
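For illustration, the structured output for that example can be represented as a simple mapping (the field names below are ours, not a schema from the paper):

# Illustrative target output for "Blue Nike Running Shoes Size 10".
# Attribute names and casing are assumptions, not the paper's schema.
extracted = {
    "color": "blue",
    "brand": "Nike",
    "type": "running shoes",
    "size": "10",
}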
This paper from Amazon researchers presents a practical framework: use LLMs to generate synthetic product data that trains ML models as effectively as real labeled data. The key insight is generating not just correct examples, but also controlled errors and missing data that mirror real-world catalog quality issues.
Key results
| Configuration | Accuracy |
|---|---|
| Zero-shot (no training) | 13.4% |
| 100% Real data | 60.8% |
| 100% Synthetic data | 60.5% |
| 75% Real + 25% Synthetic | 68.8% |
The synthetic-only result is remarkable: LLM-generated data matches human-labeled data performance. But the hybrid result is the practical takeaway. Adding synthetic data to real data improves results by 8 percentage points. You do not have to choose between real and synthetic; combining them works best.
Low-resource scenarios
This framework is especially valuable when labeled data is scarce or nonexistent:
- New product categories: Launching smart home devices when your training data is all clothing
- New markets: Expanding to Germany when all your labeled data is English
- Niche verticals: Industrial equipment where no public training data exists
- Cold start: Building an ML system before you have any customer data
In these scenarios, you can bootstrap entirely from synthetic data (matching real-data performance at 60.5%), then improve with hybrid training as real labeled data becomes available.
The data labeling problem
E-commerce catalogs have three common data quality issues that ML models must handle:
Correct but varied. The attribute is present and accurate, but expressed differently across products. "Red leather jacket" vs "Jacket in crimson genuine leather" vs "Fire-engine red coat, real leather."
Incorrect or conflicting. The structured attribute says one thing, the text says another. A product listed as "blue" in filters but described as "navy" or "midnight" in the description. Sometimes outright wrong: winter boots described as "perfect for summer hiking."
Missing entirely. The attribute exists in the schema but is not mentioned in the text. You cannot infer the material if the listing only discusses color and size.
Traditional synthetic data generation focuses on correct examples. This paper argues you need all three types to train robust models. A model that only sees perfect data will fail on real catalogs.
[Figure: Synthetic product generation pipeline — three strategies create diverse training examples]
Three generation strategies
The framework implements three distinct strategies, each with a specific training purpose:
Strategy 1: Positive examples (50% of data)
Generate products where structured attributes correctly match text content. Given a base product, modify an attribute value and update all text references consistently.
Example transformation:
- Original: "Red leather handbag with gold hardware"
- New attribute: color = "blue"
- Generated: "Blue leather handbag with gold hardware"
The LLM must update every mention of color across title, description, and bullet points while preserving structure and style.
When changing "vanilla" flavor to "chocolate" in a food product, the model automatically updated the origin from "Madagascar" to "Switzerland" without explicit instruction. This semantic reasoning is impossible with rule-based augmentation and demonstrates why LLMs produce more realistic synthetic data.
Strategy 2: Negative examples (25% of data)
Introduce controlled inconsistencies that mirror real data quality issues. The structured attribute remains correct, but the text contains a subtle conflicting reference.
Example transformation:
- Attribute: season = "summer"
- Original: "Perfect for summer hiking adventures"
- Generated: "Perfect for summer hiking, though not as warm as winter boots"
This teaches models to handle conflicting signals and identify the authoritative attribute value.
Strategy 3: Incomplete examples (25% of data)
Remove all references to the target attribute while maintaining product coherence. This simulates products where attribute values cannot be inferred from text.
Example transformation:
- Original: "Waterproof hiking boots for all-weather adventures"
- Target attribute: waterproof = true
- Generated: "Hiking boots for outdoor adventures" (waterproof mentions removed)
Models trained on this learn to output "unknown" rather than hallucinating values.
Real catalogs are not clean. Sellers make mistakes, copy descriptions between products, or omit details. A model trained only on perfect examples will confidently predict wrong answers when it encounters messy real data. Training on controlled imperfections teaches appropriate uncertainty.
The pipeline
The generation process follows a structured workflow using multiple LLM calls:
Step 1: Attribute selection
For each base product, select a relevant structured attribute based on category. "Heel height" for shoes, "material" for clothing, "screen size" for electronics. This ensures generated data covers category-specific attributes.
Step 2: Value generation
A "Value Provider" LLM generates new attribute values considering:
- Category constraints (valid ranges, units)
- Marketplace requirements (metric vs imperial)
- Diversity (avoiding repetition of common values)
Critical detail: The Value Provider uses high temperature (T=1.0) to force diversity. Without this, your synthetic dataset will lack the variance needed to robustly train downstream models. The main generation step uses lower temperature (T=0.7) for consistency.
For negative examples, a separate "Similarity LLM" (sentence-transformers model) ensures the incorrect value is semantically distinct from the correct one. This is crucial: "blue" is a bad negative for "navy" since they are near-synonyms, and training on confusing synonyms degrades model performance. The system generates a pool of candidate values, then filters to those with low cosine similarity to the correct value.
Cosine similarity: a measure of similarity between two vectors (e.g., word embeddings) ranging from -1 (opposite) to 1 (identical). Here it ensures a "negative" attribute value is semantically far from the correct one, preventing the model from learning ambiguous synonyms.
Step 3: Product generation
The main LLM modifies the base product using a carefully designed prompt with four components:
PROMPT = ROLE + INSTRUCTION + CONTEXT + FORMAT
- Role: "You are an e-commerce product expert"
- Instruction: Generation rules, brand anonymization, consistency requirements
- Context: Base product text, target attribute, new value, strategy type
- Format: JSON output structure
Step 4: Brand anonymization (Responsible AI)
All real brand names are replaced with plausible fictional alternatives. "Nike" becomes "AthleteX", "Samsung" becomes "TechPro".
This is not just a nice-to-have. Brand anonymization is a critical Responsible AI safeguard that prevents: (1) Data leakage where models memorize brand-attribute associations from training data, (2) Brand bias in downstream recommendations that could create legal liability, (3) Intellectual property issues with generated content containing real trademarks. The 95.8% success rate means this works automatically without manual review.
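The paper does not publish its verification step, but a minimal post-generation guard is easy to sketch. The brand list and helper names below are assumptions for illustration, not part of the published pipeline.

import re

# Hypothetical denylist: in practice, build this from your catalog's brand metadata.
KNOWN_BRANDS = {"nike", "samsung", "adidas", "sony"}

def contains_real_brand(text: str) -> bool:
    """Return True if a known brand name survives in the generated text."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return bool(KNOWN_BRANDS & tokens)

def needs_regeneration(product: dict) -> bool:
    """Flag generations that leaked a real brand, for review or regeneration."""
    fields = [product.get("title", ""), product.get("description", "")]
    return any(contains_real_brand(f) for f in fields)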
[Figure: Human evaluation of synthetic product quality — N=2,000 products evaluated by expert reviewers]
Results
Human evaluation (N=2,000)
Expert reviewers assessed synthetic products across multiple dimensions:
| Metric | Score |
|---|---|
| Natural e-commerce language | 99.6% |
| Valid attribute values | 96.5% |
| Brand anonymization success | 95.8% |
| Correct attribute consistency | 94.2% |
| Negative example consistency | 93.0% |
| Unknown example consistency | 88.3% |
| No unintended changes | 88.8% |
The 4.2% of products with major issues primarily occurred when base products had empty descriptions. The model successfully generated appropriate content for all 47 empty-description cases.
Downstream ML performance
[Figure: Attribute extraction accuracy by training data — 75% real + 25% synthetic achieves the best results]
Training a FLAN-T5-base model for attribute extraction:
| Training Data | Test Accuracy |
|---|---|
| Zero-shot | 13.4% |
| 100% Real | 60.8% |
| 100% Synthetic | 60.5% |
| 75% Real + 25% Synthetic | 68.8% |
| 50% Real + 50% Synthetic | 66.1% |
| 25% Real + 75% Synthetic | 64.4% |
Key findings:
- Synthetic matches real. The 0.3-point gap (60.5% vs 60.8%) is within noise. Synthetic data captures the essential patterns without adding noise of its own.
- Hybrid beats both. The 75/25 mix outperforms either alone by 8+ points. Synthetic data provides augmentation value beyond just more examples.
- Diminishing returns. More synthetic is not always better: performance drops as the synthetic proportion rises past 25%. Two likely causes: (1) distribution shift, where synthetic data, however good, has slightly different patterns than real data and the model overfits to those patterns at higher proportions; (2) the 25% sweet spot provides augmentation benefits without overwhelming the real-data signal. Start with 25% synthetic and tune from there (a minimal mixing sketch follows the definition below).
Distribution shift: a change in the statistical properties of training data compared to real-world data. When synthetic data differs subtly from real catalog data, the model may overfit to synthetic patterns, reducing real-world accuracy. This is why hybrid training with only 25% synthetic outperforms 75% synthetic.
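As a minimal sketch (not from the paper), a hybrid training set at a chosen synthetic ratio can be assembled like this; the 0.25 default mirrors the best-performing configuration above.

import random

def build_hybrid_dataset(
    real: list[dict],
    synthetic: list[dict],
    synthetic_ratio: float = 0.25
) -> list[dict]:
    """Mix real and synthetic examples so synthetic makes up the given fraction."""
    n_synth = int(len(real) * synthetic_ratio / (1 - synthetic_ratio))
    mixed = real + random.sample(synthetic, min(n_synth, len(synthetic)))
    random.shuffle(mixed)
    return mixed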
Error analysis
Of 441 predictions initially marked incorrect (on the synthetic-only config), manual review revealed many were semantically valid but differed from ground truth annotations. Seven types of "false errors" were identified:
| Type | Example | Explanation |
|---|---|---|
| Granularity | "running shoe" vs "running" | Model more specific than annotation |
| Morphology | "wall sticker" vs "wall stickers" | Singular/plural variation |
| Multiple valid | "t-shirt" or "tank top" | Product matches both equally |
| Missing units | "1200" vs "1200 thread count" | Annotation omitted unit |
| Equivalent definitions | "type=processor" vs "type=food processor" | Category ambiguity |
| Contextual synonyms | "striped" vs "stripe" | Same meaning, different form |
| Format variations | "ipod touch" vs "for apple ipod" | Same product, different phrasing |
Practical implication: Do not blindly trust accuracy numbers. Manual review of "errors" often reveals that synthetic data is producing more precise, standardized values than the original human annotations. Your synthetic data might be better than your ground truth.
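A hedged sketch of a more lenient scorer that absorbs several of these false-error types (morphology, granularity, missing units). The normalization rules are illustrative; they are not the paper's evaluation protocol.

import re

def normalize(value: str) -> str:
    """Lowercase, strip punctuation, and drop a trailing plural 's'."""
    v = re.sub(r"[^a-z0-9 ]", "", value.lower()).strip()
    return v[:-1] if v.endswith("s") and len(v) > 3 else v

def lenient_match(predicted: str, gold: str) -> bool:
    """Treat a prediction as correct if either value contains the other after normalization."""
    p, g = normalize(predicted), normalize(gold)
    return p == g or p in g or g in p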
Implementation blueprint
This section provides production-ready Python code you can adapt for your own synthetic data pipeline. The implementation follows the paper's methodology: generate diverse attribute values, apply three strategies (positive, negative, incomplete), and filter negative examples for semantic distinctness.
Core dependencies
You need the Anthropic SDK for generation, sentence-transformers for the semantic similarity filtering used in negative example creation, and scikit-learn for the cosine similarity computation.
import anthropic
import json
import random
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

client = anthropic.Anthropic()
embedder = SentenceTransformer("all-MiniLM-L6-v2")

Strategy selection
Why 50/25/25? The paper tested multiple ratios and found this balance optimal. Positive examples (50%) teach the model what correct attribute-text alignment looks like. Negative examples (25%) teach it to recognize conflicts—critical for real catalogs where sellers make mistakes. Incomplete examples (25%) teach it to output "unknown" rather than hallucinate when information is genuinely missing.
If your catalog has more errors than average, increase the negative proportion. If your downstream task penalizes false positives heavily, increase the share of incomplete examples.
def select_strategy() -> str:
    """50% positive, 25% negative, 25% incomplete."""
    roll = random.random()
    if roll < 0.50:
        return "positive"
    elif roll < 0.75:
        return "negative"
    return "incomplete"

Value generation with diversity
The temperature=1.0 setting is intentional and critical. At lower temperatures, the LLM tends to generate the most "obvious" values—if the current color is "red", it will suggest "blue" repeatedly. At T=1.0, you get "cerulean", "burnt orange", "sage green"—the long tail of realistic values that appear in real catalogs.
The prompt asks for a single value with no explanation. This is important: LLMs love to explain themselves, but we need parseable output. The "Return ONLY the value" instruction suppresses the natural tendency to add context.
We pass current_value to ensure the generated value differs from the original. Without this constraint, the model might regenerate the same value, producing a useless training example.
def generate_new_value(
    attribute: str,
    category: str,
    current_value: str,
    strategy: str
) -> str | None:
    """Generate a new attribute value. High temperature for diversity."""
    if strategy == "incomplete":
        return None  # No value needed for removal

    prompt = f"""Generate a single realistic {attribute} value for a {category} product.
Current value: {current_value}
Requirements:
- Must be different from current value
- Must be realistic for this category
- Return ONLY the value, no explanation
Value:"""

    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=50,
        temperature=1.0,  # High temp for diversity
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()

Negative value filtering
For negative examples, the "wrong" value must be semantically distinct from the correct one. If the correct color is "navy" and you generate "blue" as the negative, the model learns to treat near-synonyms as errors—the opposite of what you want.
The threshold=0.3 means we reject any candidate with cosine similarity above 0.3 to the original. This is conservative; values like "red" vs "crimson" (similarity ~0.6) get rejected, while "red" vs "large" (similarity ~0.1) passes. The threshold ensures the negative example represents a genuine category error, not a formatting difference.
The retry loop (max_attempts=5) handles cases where the LLM keeps generating similar values. For common attributes like "color", this rarely triggers. For niche attributes with limited valid values, you may need to increase attempts or fall back to a predefined value pool.
def filter_negative_value(
    correct_value: str,
    candidate_value: str,
    threshold: float = 0.3
) -> bool:
    """Reject candidates too similar to correct value."""
    embeddings = embedder.encode([correct_value, candidate_value])
    similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
    return similarity < threshold  # True if sufficiently different

def get_valid_negative_value(
    attribute: str,
    category: str,
    correct_value: str,
    max_attempts: int = 5
) -> str:
    """Generate negative values until one passes similarity filter."""
    for _ in range(max_attempts):
        candidate = generate_new_value(
            attribute, category, correct_value, "negative"
        )
        if filter_negative_value(correct_value, candidate):
            return candidate
    raise ValueError(f"Could not find distinct negative for: {correct_value}")

Main generation function
The strategy_instructions dictionary encodes the core difference between strategies. For positive examples, we want ALL mentions updated consistently. For negative examples, we want exactly ONE subtle conflict—not obvious errors that would be trivial to detect. For incomplete, we remove references entirely while keeping the product description coherent.
The temperature=0.7 (lower than value generation) balances creativity with consistency. We want realistic modifications, not hallucinated product features.
The JSON parsing block handles a common LLM quirk: models often wrap JSON in markdown code blocks even when told not to. The split logic extracts the actual JSON regardless of wrapper format. In production, wrap this in a retry loop for malformed responses.
We attach metadata (_strategy, _attribute, _new_value) to each result. This is essential for debugging and for creating properly labeled training data downstream.
def generate_synthetic_product(
    product: dict,
    attribute: str,
    new_value: str | None,
    strategy: str
) -> dict:
    """Generate a modified product using the specified strategy."""
    strategy_instructions = {
        "positive": f"Update ALL mentions of {attribute} to consistently reflect: {new_value}",
        "negative": f"Keep {attribute}={product.get(attribute)} as the correct value, but add ONE subtle conflicting reference to {new_value} in the description",
        "incomplete": f"Remove ALL references to {attribute} from the text while keeping the product coherent"
    }

    prompt = f"""You are an e-commerce product expert. Modify this product listing.
TASK: {strategy_instructions[strategy]}
RULES:
1. Replace real brand names with fictional alternatives
2. Preserve the original writing style and structure
3. Keep all other attributes unchanged
4. Return valid JSON only
ORIGINAL PRODUCT:
Title: {product["title"]}
Description: {product["description"]}
Bullets: {json.dumps(product.get("bullets", []))}
Return JSON with keys: title, description, bullets"""

    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=1000,
        temperature=0.7,  # Lower temp for consistency
        messages=[{"role": "user", "content": prompt}]
    )

    # Parse and validate JSON response
    text = response.content[0].text
    # Handle markdown code blocks if present
    if "```json" in text:
        text = text.split("```json")[1].split("```")[0]
    elif "```" in text:
        text = text.split("```")[1].split("```")[0]

    result = json.loads(text.strip())
    result["_strategy"] = strategy
    result["_attribute"] = attribute
    result["_new_value"] = new_value
    return result

Batch processing loop
This function ties everything together. A few implementation decisions worth noting:
The 50-character minimum for descriptions is a quality gate. The paper found that products with empty or minimal descriptions produced the 4.2% of outputs with major issues. Adjust this threshold based on your catalog—some domains have naturally shorter descriptions.
The attributes_by_category dictionary lets you target meaningful attributes per category. "Heel height" makes sense for shoes but not electronics. Fallback to generic attributes ("color", "material") handles unknown categories gracefully.
The try/except is essential for production resilience. LLM calls fail for many reasons: rate limits, malformed responses, network issues. Logging failures and continuing ensures you get partial results rather than losing an entire batch.
def generate_synthetic_batch(
    products: list[dict],
    attributes_by_category: dict[str, list[str]],
    batch_size: int = 100
) -> list[dict]:
    """Generate synthetic products from a batch of real products."""
    synthetic = []
    for product in products:
        # Skip products with minimal content
        if len(product.get("description", "")) < 50:
            continue

        category = product.get("category", "general")
        attributes = attributes_by_category.get(category, ["color", "material"])
        attribute = random.choice(attributes)
        strategy = select_strategy()
        current_value = product.get(attribute, "")

        try:
            if strategy == "positive":
                new_value = generate_new_value(
                    attribute, category, current_value, strategy
                )
            elif strategy == "negative":
                new_value = get_valid_negative_value(
                    attribute, category, current_value
                )
            else:  # incomplete
                new_value = None

            result = generate_synthetic_product(
                product, attribute, new_value, strategy
            )
            synthetic.append(result)
        except Exception as e:
            print(f"Skipping product {product.get('id')}: {e}")
            continue

    return synthetic

Configuration reference
These are the key parameters to tune. Start with the paper's defaults and adjust based on your data quality and domain.
| Parameter | Value | Why |
|---|---|---|
| Generation model | claude-3-haiku | Cost-effective, sufficient quality |
| Value provider temp | 1.0 | Forces diversity in attributes |
| Generator temp | 0.7 | Balances creativity/consistency |
| Similarity threshold | 0.3 | Ensures negatives are distinct |
| Min description length | 50 chars | Avoids hallucination on empty input |
| Strategy split | 50/25/25 | Matches paper's optimal ratio |
Temperature: a sampling parameter that controls randomness in LLM generation. Higher values (e.g., 1.0) produce more varied outputs; lower values (e.g., 0.7) make responses more deterministic. The Value Provider needs high temperature to create diverse attribute values; the generator needs lower temperature for consistent product text.
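For reproducible runs, these defaults can be collected into one configuration object. This wrapper is a convenience we added, not part of the paper's code.

from dataclasses import dataclass

@dataclass(frozen=True)
class GenerationConfig:
    model: str = "claude-3-haiku-20240307"
    value_provider_temperature: float = 1.0   # diversity in attribute values
    generator_temperature: float = 0.7        # consistency in product text
    similarity_threshold: float = 0.3         # negatives must be semantically distinct
    min_description_chars: int = 50           # skip near-empty seed products
    strategy_split: tuple[float, float, float] = (0.50, 0.25, 0.25)  # positive/negative/incomplete

CONFIG = GenerationConfig()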
Where teams get stuck
Based on the paper's findings and common production issues, here are the failure modes to watch for:
Problem 1: Empty base content. Products with minimal descriptions produce low-quality synthetics. The code above filters products with less than 50 characters—adjust this threshold based on your catalog. The paper's 4.2% major issue rate came primarily from empty descriptions.
Problem 2: JSON parsing failures. LLMs sometimes return malformed JSON or wrap it in markdown code blocks. The generation function handles common cases. For production, add retries with exponential backoff.
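A sketch of that retry pattern, wrapping the generation function from the blueprint above; the delay schedule and exception choices are assumptions to tune for your setup.

import json
import time
import anthropic

def generate_with_retries(
    product: dict,
    attribute: str,
    new_value: str,
    strategy: str,
    max_retries: int = 3,
    base_delay: float = 1.0
) -> dict:
    """Retry generation on malformed JSON or transient API errors with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return generate_synthetic_product(product, attribute, new_value, strategy)
        except (json.JSONDecodeError, anthropic.APIError):
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...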
Problem 3: Rate limits. Haiku allows approximately 1000 requests per minute. For large batches, add time.sleep(0.1) between calls or use async with semaphores. At scale, consider batching products into parallel workers.
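A minimal sketch of the async-with-semaphore option; the concurrency limit of 10 is an assumption, not a documented quota.

import asyncio
import anthropic

async_client = anthropic.AsyncAnthropic()
semaphore = asyncio.Semaphore(10)  # cap concurrent in-flight requests

async def generate_one(prompt: str) -> str:
    """Send a single generation request, respecting the concurrency cap."""
    async with semaphore:
        response = await async_client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=1000,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text

async def generate_all(prompts: list[str]) -> list[str]:
    """Run many generation requests concurrently under the semaphore."""
    return await asyncio.gather(*(generate_one(p) for p in prompts))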
Problem 4: Evaluation mismatch. Your accuracy metrics may be pessimistic. Manual review of "incorrect" predictions often reveals the model produced more precise values than the original annotations. Sample 50-100 generations for human review before scaling to validate quality.
Limitations
Garbage in, garbage out
Synthetic data quality depends heavily on seed product quality. This is not a magic fix for broken catalogs.
Specific failure modes:
- Empty or minimal descriptions produce hallucinated content (the 4.2% major issue rate came primarily from empty descriptions)
- Wrong category assignments propagate to synthetic data
- Duplicate listings create redundant training examples
- Overly generic attributes like "type" or "style" produce inconsistent generations
Recommendation: Filter your seed products before generation. Require minimum description length, validate category assignments, and focus on concrete attributes (color, size, material) rather than abstract ones (quality, style).
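A small seed-filtering pass that implements those recommendations; the thresholds are assumptions to adjust for your catalog.

def filter_seed_products(
    products: list[dict],
    min_description_chars: int = 50,
    valid_categories: set[str] | None = None
) -> list[dict]:
    """Keep only seed products likely to produce clean synthetic data."""
    seen_descriptions = set()
    kept = []
    for p in products:
        desc = p.get("description", "")
        if len(desc) < min_description_chars:
            continue  # near-empty descriptions drive most major generation issues
        if valid_categories and p.get("category") not in valid_categories:
            continue  # drop listings with unvalidated or wrong categories
        if desc in seen_descriptions:
            continue  # skip duplicates copied between listings
        seen_descriptions.add(desc)
        kept.append(p)
    return kept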
Single attribute per product
The framework modifies one attribute at a time. Multi-attribute changes (simultaneously updating color and size) are not addressed. This limits generation efficiency.
English-focused evaluation
While the framework supports multilingual generation, evaluation was primarily on English data. Performance may vary for languages with different product description conventions.
No public dataset release
The synthetic dataset is not released, limiting reproducibility. However, the methodology is fully described and implementable with public tools.
Original paper: arXiv ・ PDF ・ HTML
Authors: Virginia Negri, Víctor Martínez Gómez, Sergio A. Balanya, Subburam Rajaram (Amazon)
Venue: AAAI'26 Workshop on Responsible Synthetic Data
Cite this paper
Virginia Negri, Víctor Martínez Gómez, Sergio A. Balanya, Subburam Rajaram (2025). Synthetic Product Data That Actually Works: Amazon's LLM Pipeline for E-commerce. AAAI'26 Workshop on Responsible Synthetic Data.