- The Problem. E-commerce ML models need labeled product data (attribute-text pairs), but manual annotation is expensive and does not scale. Real catalogs have messy, inconsistent data.
- The Solution. Use LLMs to generate synthetic products with three controlled strategies: correct attributes, intentional errors, and missing attributes. This trains models to handle real-world data quality issues.
- The Results. 100% synthetic data matches 100% real data performance (60.5% vs 60.8%). A hybrid 75/25 real/synthetic mix reaches 68.8%. Human evaluators rated 99.6% of outputs as natural.
- The Business Case. Human-level quality at machine-level cost. LLM generation runs ~$0.80-$4.00 per million tokens vs human annotation at ~$0.11 per 50 tokens. Particularly valuable for new markets or languages where labeled data is scarce.
Research overview
If you manage product catalogs or build ML models for e-commerce, you know the data problem. Product listings are messy. The same attribute appears in different formats: "waterproof" in one listing, "water-resistant material" in another, "keeps you dry in rain" in a third. Training models to extract structured attributes from this chaos requires labeled data. Lots of it.
Manual labeling does not scale. Annotators cost money, make mistakes, and cannot keep up with catalog growth. You need thousands of examples per attribute, across categories, in every language you support.
The task: given a product title and description, extract structured attributes such as color, size, material, or brand. For example, "Blue Nike Running Shoes Size 10" should yield color="blue", brand="Nike", type="running shoes", size="10". This powers search filters, recommendations, and catalog organization.
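For illustration, the structured output for that example can be represented as a simple mapping (the field names below are ours, not a schema from the paper):

# Illustrative target output for "Blue Nike Running Shoes Size 10".
# Attribute names and casing are assumptions, not the paper's schema.
extracted = {
    "color": "blue",
    "brand": "Nike",
    "type": "running shoes",
    "size": "10",
}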
This paper from Amazon researchers presents a practical framework: use LLMs to generate synthetic product data that trains ML models as effectively as real labeled data. The key insight is generating not just correct examples, but also controlled errors and missing data that mirror real-world catalog quality issues.
Key results
| Configuration | Accuracy |
|---|---|
| Zero-shot (no training) | 13.4% |
| 100% Real data | 60.8% |
| 100% Synthetic data | 60.5% |
| 75% Real + 25% Synthetic | 68.8% |
The synthetic-only result is remarkable: LLM-generated data matches human-labeled data performance. But the hybrid result is the practical takeaway. Adding synthetic data to real data improves results by 8 percentage points. You do not have to choose between real and synthetic; combining them works best.
Low-resource scenarios
This framework is especially valuable when labeled data is scarce or nonexistent:
- New product categories: Launching smart home devices when your training data is all clothing
- New markets: Expanding to Germany when all your labeled data is English
- Niche verticals: Industrial equipment where no public training data exists
- Cold start: Building an ML system before you have any customer data
In these scenarios, you can bootstrap entirely from synthetic data (matching real-data performance at 60.5%), then improve with hybrid training as real labeled data becomes available.
The data labeling problem
E-commerce catalogs have three common data quality issues that ML models must handle:
Correct but varied. The attribute is present and accurate, but expressed differently across products. "Red leather jacket" vs "Jacket in crimson genuine leather" vs "Fire-engine red coat, real leather."
Incorrect or conflicting. The structured attribute says one thing, the text says another. A product listed as "blue" in filters but described as "navy" or "midnight" in the description. Sometimes outright wrong: winter boots described as "perfect for summer hiking."
Missing entirely. The attribute exists in the schema but is not mentioned in the text. You cannot infer the material if the listing only discusses color and size.
Traditional synthetic data generation focuses on correct examples. This paper argues you need all three types to train robust models. A model that only sees perfect data will fail on real catalogs.
[Figure: Synthetic product generation pipeline — three strategies create diverse training examples]
Three generation strategies
The framework implements three distinct strategies, each with a specific training purpose:
Strategy 1: Positive examples (50% of data)
Generate products where structured attributes correctly match text content. Given a base product, modify an attribute value and update all text references consistently.
Example transformation:
- Original: "Red leather handbag with gold hardware"
- New attribute: color = "blue"
- Generated: "Blue leather handbag with gold hardware"
The LLM must update every mention of color across title, description, and bullet points while preserving structure and style.
When changing "vanilla" flavor to "chocolate" in a food product, the model automatically updated the origin from "Madagascar" to "Switzerland" without explicit instruction. This semantic reasoning is impossible with rule-based augmentation and demonstrates why LLMs produce more realistic synthetic data.
Strategy 2: Negative examples (25% of data)
Introduce controlled inconsistencies that mirror real data quality issues. The structured attribute remains correct, but the text contains a subtle conflicting reference.
Example transformation:
- Attribute: season = "summer"
- Original: "Perfect for summer hiking adventures"
- Generated: "Perfect for summer hiking, though not as warm as winter boots"
This teaches models to handle conflicting signals and identify the authoritative attribute value.
Strategy 3: Incomplete examples (25% of data)
Remove all references to the target attribute while maintaining product coherence. This simulates products where attribute values cannot be inferred from text.
Example transformation:
- Original: "Waterproof hiking boots for all-weather adventures"
- Target attribute: waterproof = true
- Generated: "Hiking boots for outdoor adventures" (waterproof mentions removed)
Models trained on this learn to output "unknown" rather than hallucinating values.
Real catalogs are not clean. Sellers make mistakes, copy descriptions between products, or omit details. A model trained only on perfect examples will confidently predict wrong answers when it encounters messy real data. Training on controlled imperfections teaches appropriate uncertainty.
The pipeline
The generation process follows a structured workflow using multiple LLM calls:
Step 1: Attribute selection
For each base product, select a relevant structured attribute based on category. "Heel height" for shoes, "material" for clothing, "screen size" for electronics. This ensures generated data covers category-specific attributes.
Step 2: Value generation
A "Value Provider" LLM generates new attribute values considering:
- Category constraints (valid ranges, units)
- Marketplace requirements (metric vs imperial)
- Diversity (avoiding repetition of common values)
Critical detail: The Value Provider uses high temperature (T=1.0) to force diversity. Without this, your synthetic dataset will lack the variance needed to robustly train downstream models. The main generation step uses lower temperature (T=0.7) for consistency.
For negative examples, a separate "Similarity LLM" (sentence-transformers model) ensures the incorrect value is semantically distinct from the correct one. This is crucial: "blue" is a bad negative for "navy" since they are near-synonyms, and training on confusing synonyms degrades model performance. The system generates a pool of candidate values, then filters to those with low cosine similarity to the correct value.
Cosine similarity: a measure of similarity between two vectors (e.g., word embeddings) ranging from -1 (opposite) to 1 (identical). Here it ensures a "negative" attribute value is semantically far from the correct one, preventing the model from learning ambiguous synonyms.
Step 3: Product generation
The main LLM modifies the base product using a carefully designed prompt with four components:
PROMPT = ROLE + INSTRUCTION + CONTEXT + FORMAT
- Role: "You are an e-commerce product expert"
- Instruction: Generation rules, brand anonymization, consistency requirements
- Context: Base product text, target attribute, new value, strategy type
- Format: JSON output structure
Step 4: Brand anonymization (Responsible AI)
All real brand names are replaced with plausible fictional alternatives. "Nike" becomes "AthleteX", "Samsung" becomes "TechPro".
This is not just a nice-to-have. Brand anonymization is a critical Responsible AI safeguard that prevents: (1) Data leakage where models memorize brand-attribute associations from training data, (2) Brand bias in downstream recommendations that could create legal liability, (3) Intellectual property issues with generated content containing real trademarks. The 95.8% success rate means this works automatically without manual review.
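The paper does not publish its verification step, but a minimal post-generation guard is easy to sketch. The brand list and helper names below are assumptions for illustration, not part of the published pipeline.

import re

# Hypothetical denylist: in practice, build this from your catalog's brand metadata.
KNOWN_BRANDS = {"nike", "samsung", "adidas", "sony"}

def contains_real_brand(text: str) -> bool:
    """Return True if a known brand name survives in the generated text."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return bool(KNOWN_BRANDS & tokens)

def needs_regeneration(product: dict) -> bool:
    """Flag generations that leaked a real brand, for review or regeneration."""
    fields = [product.get("title", ""), product.get("description", "")]
    return any(contains_real_brand(f) for f in fields)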
[Figure: Human evaluation of synthetic product quality — N=2,000 products evaluated by expert reviewers]
Results
Human evaluation (N=2,000)
Expert reviewers assessed synthetic products across multiple dimensions:
| Metric | Score |
|---|---|
| Natural e-commerce language | 99.6% |
| Valid attribute values | 96.5% |
| Brand anonymization success | 95.8% |
| Correct attribute consistency | 94.2% |
| Negative example consistency | 93.0% |
| Unknown example consistency | 88.3% |
| No unintended changes | 88.8% |
The 4.2% of products with major issues primarily occurred when base products had empty descriptions. The model successfully generated appropriate content for all 47 empty-description cases.
Downstream ML performance
[Figure: Attribute extraction accuracy by training data — 75% real + 25% synthetic achieves the best results]
Training a FLAN-T5-base model for attribute extraction:
| Training Data | Test Accuracy |
|---|---|
| Zero-shot | 13.4% |
| 100% Real | 60.8% |
| 100% Synthetic | 60.5% |
| 75% Real + 25% Synthetic | 68.8% |
| 50% Real + 50% Synthetic | 66.1% |
| 25% Real + 75% Synthetic | 64.4% |
Key findings:
- Synthetic matches real. The 0.3-point gap (60.5% vs 60.8%) is within noise. Synthetic data captures the essential patterns without adding noise of its own.
- Hybrid beats both. The 75/25 mix outperforms either alone by 8+ points. Synthetic data provides augmentation value beyond just more examples.
- Diminishing returns. More synthetic is not always better: performance drops as the synthetic proportion rises past 25%. Two likely causes: (1) distribution shift, where synthetic data, however good, has slightly different patterns than real data and the model overfits to those patterns at higher proportions; (2) the 25% sweet spot provides augmentation benefits without overwhelming the real-data signal. Start with 25% synthetic and tune from there (a minimal mixing sketch follows the definition below).
Distribution shift: a change in the statistical properties of training data compared to real-world data. When synthetic data differs subtly from real catalog data, the model may overfit to synthetic patterns, reducing real-world accuracy. This is why hybrid training with only 25% synthetic outperforms 75% synthetic.
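As a minimal sketch (not from the paper), a hybrid training set at a chosen synthetic ratio can be assembled like this; the 0.25 default mirrors the best-performing configuration above.

import random

def build_hybrid_dataset(
    real: list[dict],
    synthetic: list[dict],
    synthetic_ratio: float = 0.25
) -> list[dict]:
    """Mix real and synthetic examples so synthetic makes up the given fraction."""
    n_synth = int(len(real) * synthetic_ratio / (1 - synthetic_ratio))
    mixed = real + random.sample(synthetic, min(n_synth, len(synthetic)))
    random.shuffle(mixed)
    return mixed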
Error analysis
Of 441 predictions initially marked incorrect (on the synthetic-only config), manual review revealed many were semantically valid but differed from ground truth annotations. Seven types of "false errors" were identified:
| Type | Example | Explanation |
|---|---|---|
| Granularity | "running shoe" vs "running" | Model more specific than annotation |
| Morphology | "wall sticker" vs "wall stickers" | Singular/plural variation |
| Multiple valid | "t-shirt" or "tank top" | Product matches both equally |
| Missing units | "1200" vs "1200 thread count" | Annotation omitted unit |
| Equivalent definitions | "type=processor" vs "type=food processor" | Category ambiguity |
| Contextual synonyms | "striped" vs "stripe" | Same meaning, different form |
| Format variations | "ipod touch" vs "for apple ipod" | Same product, different phrasing |
Practical implication: Do not blindly trust accuracy numbers. Manual review of "errors" often reveals that synthetic data is producing more precise, standardized values than the original human annotations. Your synthetic data might be better than your ground truth.
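A hedged sketch of a more lenient scorer that absorbs several of these false-error types (morphology, granularity, missing units). The normalization rules are illustrative; they are not the paper's evaluation protocol.

import re

def normalize(value: str) -> str:
    """Lowercase, strip punctuation, and drop a trailing plural 's'."""
    v = re.sub(r"[^a-z0-9 ]", "", value.lower()).strip()
    return v[:-1] if v.endswith("s") and len(v) > 3 else v

def lenient_match(predicted: str, gold: str) -> bool:
    """Treat a prediction as correct if either value contains the other after normalization."""
    p, g = normalize(predicted), normalize(gold)
    return p == g or p in g or g in p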
Implementation blueprint
This section provides production-ready Python code you can adapt for your own synthetic data pipeline. The implementation follows the paper's methodology: generate diverse attribute values, apply three strategies (positive, negative, incomplete), and filter negative examples for semantic distinctness.
Core dependencies
You need the Anthropic SDK for generation, sentence-transformers for the semantic similarity filtering used in negative example creation, and scikit-learn for the cosine similarity computation.
import anthropic
import json
import random
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

client = anthropic.Anthropic()
embedder = SentenceTransformer("all-MiniLM-L6-v2")

Strategy selection
Why 50/25/25? The paper tested multiple ratios and found this balance optimal. Positive examples (50%) teach the model what correct attribute-text alignment looks like. Negative examples (25%) teach it to recognize conflicts—critical for real catalogs where sellers make mistakes. Incomplete examples (25%) teach it to output "unknown" rather than hallucinate when information is genuinely missing.
If your catalog has more errors than average, increase the negative proportion. If your downstream task penalizes false positives heavily, increase the share of incomplete examples.
def select_strategy() -> str:
    """50% positive, 25% negative, 25% incomplete."""
    roll = random.random()
    if roll < 0.50:
        return "positive"
    elif roll < 0.75:
        return "negative"
    return "incomplete"

Value generation with diversity
The temperature=1.0 setting is intentional and critical. At lower temperatures, the LLM tends to generate the most "obvious" values—if the current color is "red", it will suggest "blue" repeatedly. At T=1.0, you get "cerulean", "burnt orange", "sage green"—the long tail of realistic values that appear in real catalogs.
The prompt asks for a single value with no explanation. This is important: LLMs love to explain themselves, but we need parseable output. The "Return ONLY the value" instruction suppresses the natural tendency to add context.
We pass current_value to ensure the generated value differs from the original. Without this constraint, the model might regenerate the same value, producing a useless training example.
def generate_new_value(
    attribute: str,
    category: str,
    current_value: str,
    strategy: str
) -> str | None:
    """Generate a new attribute value. High temperature for diversity."""
    if strategy == "incomplete":
        return None  # No value needed for removal

    prompt = f"""Generate a single realistic {attribute} value for a {category} product.
Current value: {current_value}
Requirements:
- Must be different from current value
- Must be realistic for this category
- Return ONLY the value, no explanation
Value:"""

    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=50,
        temperature=1.0,  # High temp for diversity
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()

Negative value filtering
For negative examples, the "wrong" value must be semantically distinct from the correct one. If the correct color is "navy" and you generate "blue" as the negative, the model learns to treat near-synonyms as errors—the opposite of what you want.
The threshold=0.3 means we reject any candidate with cosine similarity above 0.3 to the original. This is conservative; values like "red" vs "crimson" (similarity ~0.6) get rejected, while "red" vs "large" (similarity ~0.1) passes. The threshold ensures the negative example represents a genuine category error, not a formatting difference.
The retry loop (max_attempts=5) handles cases where the LLM keeps generating similar values. For common attributes like "color", this rarely triggers. For niche attributes with limited valid values, you may need to increase attempts or fall back to a predefined value pool.
def filter_negative_value(
    correct_value: str,
    candidate_value: str,
    threshold: float = 0.3
) -> bool:
    """Reject candidates too similar to correct value."""
    embeddings = embedder.encode([correct_value, candidate_value])
    similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
    return similarity < threshold  # True if sufficiently different

def get_valid_negative_value(
    attribute: str,
    category: str,
    correct_value: str,
    max_attempts: int = 5
) -> str:
    """Generate negative values until one passes similarity filter."""
    for _ in range(max_attempts):
        candidate = generate_new_value(
            attribute, category, correct_value, "negative"
        )
        if filter_negative_value(correct_value, candidate):
            return candidate
    raise ValueError(f"Could not find distinct negative for: {correct_value}")

Main generation function
The strategy_instructions dictionary encodes the core difference between strategies. For positive examples, we want ALL mentions updated consistently. For negative examples, we want exactly ONE subtle conflict—not obvious errors that would be trivial to detect. For incomplete, we remove references entirely while keeping the product description coherent.
The temperature=0.7 (lower than value generation) balances creativity with consistency. We want realistic modifications, not hallucinated product features.
The JSON parsing block handles a common LLM quirk: models often wrap JSON in markdown code blocks even when told not to. The split logic extracts the actual JSON regardless of wrapper format. In production, wrap this in a retry loop for malformed responses.
We attach metadata (_strategy, _attribute, _new_value) to each result. This is essential for debugging and for creating properly labeled training data downstream.
def generate_synthetic_product(
    product: dict,
    attribute: str,
    new_value: str | None,
    strategy: str
) -> dict:
    """Generate a modified product using the specified strategy."""
    strategy_instructions = {
        "positive": f"Update ALL mentions of {attribute} to consistently reflect: {new_value}",
        "negative": f"Keep {attribute}={product.get(attribute)} as the correct value, but add ONE subtle conflicting reference to {new_value} in the description",
        "incomplete": f"Remove ALL references to {attribute} from the text while keeping the product coherent"
    }

    prompt = f"""You are an e-commerce product expert. Modify this product listing.
TASK: {strategy_instructions[strategy]}
RULES:
1. Replace real brand names with fictional alternatives
2. Preserve the original writing style and structure
3. Keep all other attributes unchanged
4. Return valid JSON only
ORIGINAL PRODUCT:
Title: {product["title"]}
Description: {product["description"]}
Bullets: {json.dumps(product.get("bullets", []))}
Return JSON with keys: title, description, bullets"""

    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=1000,
        temperature=0.7,  # Lower temp for consistency
        messages=[{"role": "user", "content": prompt}]
    )

    # Parse and validate JSON response
    text = response.content[0].text
    # Handle markdown code blocks if present
    if "```json" in text:
        text = text.split("```json")[1].split("```")[0]
    elif "```" in text:
        text = text.split("```")[1].split("```")[0]

    result = json.loads(text.strip())
    result["_strategy"] = strategy
    result["_attribute"] = attribute
    result["_new_value"] = new_value
    return result

Batch processing loop
This function ties everything together. A few implementation decisions worth noting:
The 50-character minimum for descriptions is a quality gate. The paper found that products with empty or minimal descriptions produced the 4.2% of outputs with major issues. Adjust this threshold based on your catalog—some domains have naturally shorter descriptions.
The attributes_by_category dictionary lets you target meaningful attributes per category. "Heel height" makes sense for shoes but not electronics. Fallback to generic attributes ("color", "material") handles unknown categories gracefully.
The try/except is essential for production resilience. LLM calls fail for many reasons: rate limits, malformed responses, network issues. Logging failures and continuing ensures you get partial results rather than losing an entire batch.
def generate_synthetic_batch(
    products: list[dict],
    attributes_by_category: dict[str, list[str]],
    batch_size: int = 100
) -> list[dict]:
    """Generate synthetic products from a batch of real products."""
    synthetic = []
    for product in products:
        # Skip products with minimal content
        if len(product.get("description", "")) < 50:
            continue

        category = product.get("category", "general")
        attributes = attributes_by_category.get(category, ["color", "material"])
        attribute = random.choice(attributes)
        strategy = select_strategy()
        current_value = product.get(attribute, "")

        try:
            if strategy == "positive":
                new_value = generate_new_value(
                    attribute, category, current_value, strategy
                )
            elif strategy == "negative":
                new_value = get_valid_negative_value(
                    attribute, category, current_value
                )
            else:  # incomplete
                new_value = None

            result = generate_synthetic_product(
                product, attribute, new_value, strategy
            )
            synthetic.append(result)
        except Exception as e:
            print(f"Skipping product {product.get('id')}: {e}")
            continue

    return synthetic

Configuration reference
These are the key parameters to tune. Start with the paper's defaults and adjust based on your data quality and domain.
| Parameter | Value | Why |
|---|---|---|
| Generation model | claude-3-haiku | Cost-effective, sufficient quality |
| Value provider temp | 1.0 | Forces diversity in attributes |
| Generator temp | 0.7 | Balances creativity/consistency |
| Similarity threshold | 0.3 | Ensures negatives are distinct |
| Min description length | 50 chars | Avoids hallucination on empty input |
| Strategy split | 50/25/25 | Matches paper's optimal ratio |
Temperature: a sampling parameter that controls randomness in LLM generation. Higher values (e.g., 1.0) produce more varied outputs; lower values (e.g., 0.7) make responses more deterministic. The Value Provider needs high temperature to create diverse attribute values; the generator needs lower temperature for consistent product text.
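For reproducible runs, these defaults can be collected into one configuration object. This wrapper is a convenience we added, not part of the paper's code.

from dataclasses import dataclass

@dataclass(frozen=True)
class GenerationConfig:
    model: str = "claude-3-haiku-20240307"
    value_provider_temperature: float = 1.0   # diversity in attribute values
    generator_temperature: float = 0.7        # consistency in product text
    similarity_threshold: float = 0.3         # negatives must be semantically distinct
    min_description_chars: int = 50           # skip near-empty seed products
    strategy_split: tuple[float, float, float] = (0.50, 0.25, 0.25)  # positive/negative/incomplete

CONFIG = GenerationConfig()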
Where teams get stuck
Based on the paper's findings and common production issues, here are the failure modes to watch for:
Problem 1: Empty base content. Products with minimal descriptions produce low-quality synthetics. The code above filters products with less than 50 characters—adjust this threshold based on your catalog. The paper's 4.2% major issue rate came primarily from empty descriptions.
Problem 2: JSON parsing failures. LLMs sometimes return malformed JSON or wrap it in markdown code blocks. The generation function handles common cases. For production, add retries with exponential backoff.
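A sketch of that retry pattern, wrapping the generation function from the blueprint above; the delay schedule and exception choices are assumptions to tune for your setup.

import json
import time
import anthropic

def generate_with_retries(
    product: dict,
    attribute: str,
    new_value: str,
    strategy: str,
    max_retries: int = 3,
    base_delay: float = 1.0
) -> dict:
    """Retry generation on malformed JSON or transient API errors with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return generate_synthetic_product(product, attribute, new_value, strategy)
        except (json.JSONDecodeError, anthropic.APIError):
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...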
Problem 3: Rate limits. Haiku allows approximately 1000 requests per minute. For large batches, add time.sleep(0.1) between calls or use async with semaphores. At scale, consider batching products into parallel workers.
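A minimal sketch of the async-with-semaphore option; the concurrency limit of 10 is an assumption, not a documented quota.

import asyncio
import anthropic

async_client = anthropic.AsyncAnthropic()
semaphore = asyncio.Semaphore(10)  # cap concurrent in-flight requests

async def generate_one(prompt: str) -> str:
    """Send a single generation request, respecting the concurrency cap."""
    async with semaphore:
        response = await async_client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=1000,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text

async def generate_all(prompts: list[str]) -> list[str]:
    """Run many generation requests concurrently under the semaphore."""
    return await asyncio.gather(*(generate_one(p) for p in prompts))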
Problem 4: Evaluation mismatch. Your accuracy metrics may be pessimistic. Manual review of "incorrect" predictions often reveals the model produced more precise values than the original annotations. Sample 50-100 generations for human review before scaling to validate quality.
Limitations
Garbage in, garbage out
Synthetic data quality depends heavily on seed product quality. This is not a magic fix for broken catalogs.
Specific failure modes:
- Empty or minimal descriptions produce hallucinated content (the 4.2% major issue rate came primarily from empty descriptions)
- Wrong category assignments propagate to synthetic data
- Duplicate listings create redundant training examples
- Overly generic attributes like "type" or "style" produce inconsistent generations
Recommendation: Filter your seed products before generation. Require minimum description length, validate category assignments, and focus on concrete attributes (color, size, material) rather than abstract ones (quality, style).
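A small seed-filtering pass that implements those recommendations; the thresholds are assumptions to adjust for your catalog.

def filter_seed_products(
    products: list[dict],
    min_description_chars: int = 50,
    valid_categories: set[str] | None = None
) -> list[dict]:
    """Keep only seed products likely to produce clean synthetic data."""
    seen_descriptions = set()
    kept = []
    for p in products:
        desc = p.get("description", "")
        if len(desc) < min_description_chars:
            continue  # near-empty descriptions drive most major generation issues
        if valid_categories and p.get("category") not in valid_categories:
            continue  # drop listings with unvalidated or wrong categories
        if desc in seen_descriptions:
            continue  # skip duplicates copied between listings
        seen_descriptions.add(desc)
        kept.append(p)
    return kept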
Single attribute per product
The framework modifies one attribute at a time. Multi-attribute changes (simultaneously updating color and size) are not addressed. This limits generation efficiency.
English-focused evaluation
While the framework supports multilingual generation, evaluation was primarily on English data. Performance may vary for languages with different product description conventions.
No public dataset release
The synthetic dataset is not released, limiting reproducibility. However, the methodology is fully described and implementable with public tools.
Original paper: arXiv ・ PDF ・ HTML
Authors: Virginia Negri, Víctor Martínez Gómez, Sergio A. Balanya, Subburam Rajaram (Amazon)
Venue: AAAI'26 Workshop on Responsible Synthetic Data
Cite this paper
Virginia Negri, Víctor Martínez Gómez, Sergio A. Balanya, Subburam Rajaram (2025). Synthetic Product Data That Actually Works: Amazon's LLM Pipeline for E-commerce. AAAI'26 Workshop on Responsible Synthetic Data.