- The Problem. LLMs are terrible at direct prediction tasks. Ask GPT-5 to estimate house prices and you get 38% error. The model has broad knowledge but lacks dataset-specific patterns.
- The Solution. Use LLMs as semantic feature extractors, not predictors. GenZ asks "does this house have a modern kitchen?" and feeds yes/no answers into traditional regression. When predictions are wrong, it discovers new features by contrasting error groups.
- The Results. 3.2x improvement on house prices (12% vs 38% error). Cold-start recommendations equivalent to 4,000 user ratings. By decoupling "reasoning" (LLM) from "calculation" (statistics), you get the best of both worlds: interpretable features and accurate numbers.
Research overview
Foundation models know a lot about the world. They can describe architectural styles, explain movie genres, and discuss real estate markets. But ask them to predict a specific house price or recommend movies to a new user, and they struggle.
The problem is not a knowledge gap, but a "calibration gap." LLMs know general real estate principles but cannot calibrate them to the specific dollar-value weights of your local market. They do not know that houses in your city's north side command a 15% premium, or that granite countertops add $12,000 in your price range.
GenZ solves this by treating foundation models as feature generators rather than predictors. Instead of asking "what is this house worth?", GenZ asks "does this house have granite countertops?" and "is it in a walkable neighborhood?" The yes/no answers become features in a traditional statistical model. When predictions are wrong, the system discovers new features by examining what the errors have in common.
On house pricing, GenZ achieved 12% median relative error compared to 38% for zero-shot GPT-5. For movie recommendations, semantic features alone matched the predictive power of 4,000 user ratings from collaborative filtering.
Collaborative filtering: A recommendation technique that predicts preferences based on similar users: "people who liked X also liked Y." It requires existing user behavior data, which is why GenZ's ability to match 4,000 ratings without any user data is significant.
Zero-shot: Asking an LLM to perform a task without any task-specific training examples. "Here's a house description, estimate its price" is zero-shot. The model relies entirely on its pre-training knowledge.
The prediction gap
Foundation models excel at understanding and generation. They fail at prediction for a specific reason: they lack access to your data's statistical structure.
Consider house pricing. An LLM knows that granite countertops are desirable and that good school districts matter. But it does not know:
- How much granite countertops add to prices in your market
- Which specific neighborhoods command premiums
- The interaction effects (granite matters more in luxury segments)
- The baseline price anchors: GPT-5 does not know if "average" in your dataset means $300K or $3M without seeing the distribution. In San Francisco, a "fixer-upper" might cost $1M. In Detroit, it might be $10K. An LLM trained on the whole internet averages these out, leading to massive errors in local contexts.
When researchers tested GPT-5 on 535 house listings, asking it to predict log-prices directly, the median relative error was 38%. The model understood real estate concepts but could not map them to specific dollar values.
Figure: GenZ vs zero-shot performance (LLM as feature generator beats LLM as direct predictor).
The same gap appears in recommendations. An LLM can describe movies eloquently, but it cannot predict which movies a user will enjoy without knowing how those descriptions correlate with ratings in your system.
Traditional machine learning solves this through features: measurable attributes that statistical models can weight and combine. The challenge is defining those features. Hand-crafting them requires domain expertise and misses patterns humans do not anticipate.
How GenZ works
GenZ bridges foundation models and statistical learning through a four-step loop:
Figure: The GenZ iterative loop (four steps that refine semantic features based on prediction errors).
Step 1: Semantic classification. The foundation model classifies items against a set of feature descriptions. "Does this house have a modern kitchen?" yields a probability. These probabilities become the feature matrix.
Step 2: Statistical prediction. A traditional model (linear regression, neural network) predicts targets from the semantic features. For houses, it predicts log-price. For movies, it predicts user preference embeddings.
Step 3: Error analysis. The system identifies items where predictions are most wrong. It groups these errors and looks for commonalities. What do the overpriced houses share? What do the underpriced ones have?
Step 4: Feature discovery. Using the error groups, GenZ prompts the foundation model to discover distinguishing features. "What makes Group A different from Group B?" The model proposes new semantic features. Useful ones get added; redundant ones get pruned. Think of this as a detective asking a witness: "You said these two suspects looked different. Was it their height? Their hair color?" The statistical model points out the suspects (errors), and the LLM identifies the distinguishing trait.
This cycle repeats. Each iteration refines the feature set based on prediction errors, not just LLM intuition.
If the model overprices certain houses consistently, those houses share something the current features miss. By contrasting error groups, GenZ finds patterns that general LLM knowledge overlooks. A house near a highway might be undervalued by features focused on interior quality.
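A minimal sketch of this loop in Python, assuming two placeholder callables that stand in for the LLM: `classify(desc, feature)` returns the probability that a feature applies to an item, and `propose(over, under)` returns new candidate yes/no features from contrasting the two error groups. Neither name comes from the paper.

```python
import numpy as np
from sklearn.linear_model import Ridge

def genz_loop(descriptions, targets, features, classify, propose, n_iterations=10):
    """Repeat GenZ's four steps: classify, predict, analyze errors, discover features."""
    targets = np.asarray(targets)
    features = list(features)
    model = None
    for _ in range(n_iterations):
        # Step 1: semantic classification -> one probability per (item, feature)
        X = np.array([[classify(d, f) for f in features] for d in descriptions])

        # Step 2: statistical prediction on top of the semantic features
        model = Ridge(alpha=1.0).fit(X, targets)
        residuals = model.predict(X) - targets

        # Step 3: error analysis -> contrast the most over- and under-predicted items
        order = np.argsort(residuals)
        over = [descriptions[i] for i in order[-20:]]   # predicted too high
        under = [descriptions[i] for i in order[:20]]   # predicted too low

        # Step 4: feature discovery -> ask the LLM what separates the two groups
        for f in propose(over, under):
            if f not in features:
                features.append(f)
    return model, features
```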
Feature discovery
The most interesting aspect of GenZ is what it discovers. The features that emerge often surprise domain experts.
House pricing discoveries
Starting with generic real estate features (bedrooms, bathrooms, square footage), GenZ discovered:
- Location signals: Specific neighborhood and proximity factors the LLM initially overlooked. The zero-shot baseline focused on cosmetic features like "fenced yard" while GenZ learned that location quality dominated pricing.
- Build quality indicators: Architectural details signaling luxury versus budget construction, quality of remodel work
- Listing metadata: Factors in how the property was presented, not just the property itself
These features were not in the LLM's initial suggestions. They emerged from contrasting prediction errors.
Figure: Statistical feature discovery (accuracy in discovering data characteristics).
Movie recommendation discoveries
For Netflix cold-start recommendations, GenZ discovered features that diverged from content-based intuitions:
- Prestige signals: Fine-grained award distinctions (Best Picture, Best Actor, Best Original Screenplay) predicted preferences better than genre
- Specific talent: Individual actors, directors, and composers. The "John Williams score" feature cuts across genres to identify a coherent aesthetic preference invisible to content-based analysis.
- Franchise membership: Being part of a series mattered more than plot similarity
- Precise temporal windows: Not just "old vs new" but specific periods like 1995-2000 or 2004-2005 that captured cultural cohorts in the Netflix audience
A content-focused system might emphasize "action movie with car chases." GenZ learned that shared preferences cluster around prestige signals, specific creative talent, and cultural/temporal cohorts rather than plot similarity.
The expand-contract cycle
Feature discovery uses an expand-contract pattern that runs 10-20 times:
- Expand: Propose new features based on error analysis
- Evaluate: Test each feature's predictive contribution
- Contract: Remove features that do not improve predictions or overlap with existing ones
Think of it like a sculptor refining a statue. First, add more clay (propose new features), then carefully remove excess material (drop unhelpful ones) to reveal the true form. Each iteration brings the model closer to the essential features that actually drive prediction.
This iteration is critical. Early cycles show modest gains as the system finds obvious features. Later cycles compound: each refined feature set exposes subtler error patterns, which reveal more nuanced features. The 3.2x improvement over GPT-5 and the 4,000-ratings equivalence both depend on running the full cycle, not stopping early.
The system tested scenarios with 50+ candidate features but typically stabilized around 20-30 useful ones after the expand-contract process pruned redundancies.
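A sketch of the contract half of the cycle, under the assumption that a feature is kept only if removing it hurts cross-validated accuracy; the Ridge model, 5-fold split, and tolerance are illustrative choices rather than details from the paper.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def contract(X, y, feature_names, tolerance=1e-4):
    """Greedily drop features whose removal does not reduce cross-validated R^2."""
    keep = list(range(len(feature_names)))
    baseline = cross_val_score(Ridge(alpha=1.0), X[:, keep], y, cv=5).mean()
    for j in sorted(keep, reverse=True):
        candidate = [k for k in keep if k != j]
        if not candidate:
            continue
        score = cross_val_score(Ridge(alpha=1.0), X[:, candidate], y, cv=5).mean()
        if score >= baseline - tolerance:   # feature adds nothing measurable: prune it
            keep, baseline = candidate, score
    return [feature_names[k] for k in keep]
```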
Results
GenZ was tested on three domains with increasing complexity.
Binary representation recovery
As a sanity check, researchers created items with hidden 9-bit binary codes determining their values. GenZ recovered the exact binary representation through iterative discovery, demonstrating the coarse-to-fine learning mechanism works.
House price prediction
On 535 house listings with detailed descriptions:
| Method | Median Relative Error |
|---|---|
| Zero-shot GPT-5 | 38% |
| GenZ (discovered features) | 12% |
| Improvement | 3.2x |
For context, Zillow's Zestimate achieves ~7% median error on off-market homes, but uses proprietary data, hand-crafted features, and vastly larger datasets. GenZ's 12% using only listing descriptions and discovered semantic features is competitive for an automated approach.
GenZ discovered that location and quality signals the LLM initially missed were driving prediction errors. The zero-shot baseline focused on cosmetic features like "fenced yard" or "green lawn" while GenZ learned that build quality and location dominated pricing in this dataset.
Movie cold-start recommendations
For 512 movies, GenZ predicted 32-dimensional user preference embeddings from semantic features alone:
| Method | Cosine Similarity |
|---|---|
| Zero-shot baseline | 0.48 |
| GenZ (linear model) | 0.59 |
| Equivalent ratings | ~4,000 |
Cosine similarity: A measure of how similar two lists of numbers are, ranging from -1 (opposite) to 1 (identical). Here, it compares predicted user preferences to actual preferences. A jump from 0.48 to 0.59 means predictions align much better with reality.
The 0.11 improvement in cosine similarity corresponds to approximately 2,000 additional user ratings worth of information. Reaching 0.59 cosine similarity from zero would require roughly 4,000 ratings through collaborative filtering alone. GenZ extracts that value from metadata instantly.
Linear versus neural models
An interesting pattern emerged across domains. For houses, both linear and neural models reached similar accuracy (~12% error), though neural networks showed more overfitting. For movies, linear models clearly outperformed neural networks (0.59 vs ~0.52 cosine similarity). The neural variant showed severe overfitting: training performance kept improving while test performance plateaued early.
This has practical implications. Start with linear models. They are more stable and often match or beat neural networks on small datasets. Add complexity only if you have thousands of items and linear models clearly plateau.
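If you want to verify the linear-first advice on your own feature matrix, a quick cross-validation comparison is enough. The data below is a random placeholder just to make the snippet self-contained, and the MLP size is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

# Placeholder data in the shape of a GenZ feature matrix (n_items, n_features)
rng = np.random.default_rng(0)
feature_matrix = rng.random((500, 30))
targets = rng.random(500)

linear = Ridge(alpha=1.0)
neural = MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000, random_state=0)

# Compare mean cross-validated R^2; keep the neural model only if it clearly wins
for name, model in [("ridge", linear), ("mlp", neural)]:
    scores = cross_val_score(model, feature_matrix, targets, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```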
Practical applications
Cold-start recommendations
The cold-start problem affects any recommendation system adding new items. Without user interactions, collaborative filtering fails. GenZ offers an alternative:
When a new product, movie, or listing has zero user interactions, traditional recommendation systems cannot work. They rely on patterns like "users who bought X also bought Y." With no purchase history, there is nothing to learn from. GenZ sidesteps this by predicting from item descriptions instead of user behavior.
The key insight: GenZ's 4,000-ratings-equivalent accuracy comes from iterative feature discovery, not one-shot extraction. The process:
- Start with generic features ("Is this a comedy?", "Is it critically acclaimed?")
- Train a model and identify where predictions fail
- Ask the LLM to contrast high-error items ("What do these misses have in common?")
- Discover domain-specific features ("Released between 1995-2000", "Has John Williams score")
- Repeat 10-20 cycles until error stabilizes
Without the iterative refinement, you get modest improvements. With it, you match thousands of real user ratings. This works for products, content, job listings, or any domain with describable items.
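A sketch of the cold-start setup under the paper's framing (semantic features in, 32-dimensional preference embeddings out). The arrays are random placeholders with plausible shapes; scikit-learn's Ridge handles the multi-output target directly.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_train, Y_train = rng.random((400, 30)), rng.random((400, 32))  # semantic features -> embeddings
X_test, Y_test = rng.random((112, 30)), rng.random((112, 32))

model = Ridge(alpha=1.0).fit(X_train, Y_train)   # multi-output regression, one fit
Y_pred = model.predict(X_test)

# Mean cosine similarity between predicted and true preference embeddings
cos = np.sum(Y_pred * Y_test, axis=1) / (
    np.linalg.norm(Y_pred, axis=1) * np.linalg.norm(Y_test, axis=1)
)
print(f"mean cosine similarity: {cos.mean():.2f}")
```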
Dynamic pricing
E-commerce pricing often relies on competitor data and historical sales. For new products without history, GenZ can:
- Extract semantic features from product descriptions and images
- Predict price sensitivity and optimal pricing
- Discover which attributes drive willingness-to-pay in your market
Most pricing tools scrape competitors. But unique items (vintage clothing, custom art, exclusive real estate) have no direct competitors. GenZ prices these "unpriceable" items by analyzing their intrinsic semantic value. The feature discovery aspect is valuable here. GenZ might find that "free shipping eligible" or "Prime badge" matters more than product specifications.
Real estate valuation
Automated valuation models (AVMs) typically use structured data: bedrooms, square footage, location. GenZ can augment these with:
- Quality signals from listing descriptions
- Neighborhood characteristics from text
- Temporal factors affecting market conditions
The 3.2x improvement over zero-shot suggests substantial value in combining LLM understanding with statistical rigor.
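One straightforward way to augment an AVM (not prescribed by the paper) is to concatenate its structured columns with the discovered semantic features before fitting the same regularized model. The arrays below are placeholders for your own data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Placeholder arrays: swap in real structured columns, LLM feature
# probabilities, and sale prices.
rng = np.random.default_rng(0)
structured = rng.random((535, 5))   # e.g., bedrooms, bathrooms, sqft, lot size, year
semantic = rng.random((535, 25))    # LLM yes/no probabilities from listing descriptions
prices = rng.uniform(1e5, 1e6, 535)

# Scale structured columns, then concatenate with the semantic features
X = np.hstack([StandardScaler().fit_transform(structured), semantic])
model = Ridge(alpha=1.0).fit(X, np.log(prices))   # predict log-price, as in the paper
```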
Why not just RAG?
Developers often ask: "Why can't I just retrieve similar houses and average their prices?" RAG retrieves similar documents but does not learn feature weights. GenZ explicitly learns that "feature X adds $Y to the price," which is more robust for prediction. RAG might retrieve five similar houses with prices ranging from $400K to $600K. GenZ tells you why the prices differ and predicts based on which features your target house has.
Implementation blueprint
A basic GenZ prototype requires standard ML infrastructure plus an LLM API. Production quality requires 10-20 iteration cycles with proper batching and error handling. Here is how to get started.
System architecture
The data flow is straightforward:
```
[Data]
  ↓
[LLM Extractor]
  ↓
[Feature Matrix]
  ↓
[Ridge Regression] ←────────┐
  ↓                         │
[Predictions]               │
  ↓                         │
[Error Loop] ───────────────┘
      (Feedback Signal)
```
The system has two LLM touchpoints: (1) feature extraction (high volume, cheap model) and (2) feature discovery (low volume, reasoning model). Everything else is standard ML infrastructure.
Recommended tech stack
No exotic infrastructure required. Standard ML tooling plus an orchestration layer for the iterative loop.
| Component | Recommended | Alt |
|---|---|---|
| LLM | GPT-4 | Claude, Llama 3 |
| Stats Model | scikit-learn | PyTorch |
| Feature Store | JSON/SQLite | Redis |
| Orchestration | Airflow/Prefect | Dagster |
| Observability | LangSmith | W&B, Arize |
Why orchestration matters: Scripts fail when API rate limits hit or connections drop mid-run. A workflow engine handles backoff-and-retry automatically, which is critical when making 5,000+ API calls across 10-20 iterations.
Observability tracks two things: (1) LLM extraction accuracy (are Yes/No answers consistent?) and (2) regression model drift (is prediction error creeping up?). You need both to debug production issues.
Cost tip: Use a cheaper model (GPT-4o-mini) for yes/no extraction. Reserve GPT-4 for feature discovery where reasoning matters.
Core workflow
Steps 1-4 get you a working baseline. Steps 5-9 improve accuracy through iteration.
1. Prepare your dataset: Items with text descriptions and numeric targets (prices, ratings, embeddings).

2. Initialize features: Prompt the LLM for domain-relevant binary features.

   Prompt: "List 20 yes/no questions that would help predict house prices. Examples: Does it have hardwood floors? Is it in a gated community?"

3. Build feature matrix: Batch 20-50 yes/no questions into a single prompt. Do NOT make one API call per feature.

   ```python
   def extract_features(desc, features):
       # Batch every yes/no question for this item into one prompt
       questions = numbered_list(features)
       prompt = TEMPLATE.format(desc=desc, questions=questions)
       # Parse one yes/no answer per feature from the LLM response
       return parse_response(llm.complete(prompt), len(features))
   ```

   Prompt template: "Answer Yes/No for each: [desc]. Questions: [list]"

4. Train statistical model: Start with linear regression or logistic regression.

   ```python
   from sklearn.linear_model import Ridge

   model = Ridge(alpha=1.0)
   model.fit(feature_matrix, targets)
   ```

   Ridge regression is linear regression with a penalty that prevents any single feature from dominating. When LLM-generated features overlap (e.g., "modern kitchen" and "recently renovated"), Ridge handles the correlation gracefully instead of producing unstable weights. The alpha parameter controls how much to penalize large weights.

5. Identify error groups: Find the items with the largest prediction errors.

   ```python
   import numpy as np

   errors = np.abs(predictions - targets)
   worst_50 = np.argsort(errors)[-50:]
   ```

6. Discover new features: Prompt the LLM to contrast the error groups.

   Prompt: "Overpriced by model: [descriptions]. Underpriced: [descriptions]. What distinguishes them? Suggest 5 new yes/no features."

7. Optional human review: Have a domain expert quickly review proposed features before the next extraction run. This prevents the model from learning garbage correlations (e.g., "ID number starts with 5") that waste compute on useless features.

8. Expand and contract: Add promising features, remove redundant ones based on cross-validation performance.

9. Iterate: Repeat steps 3-8 until performance plateaus (typically 10-20 cycles).
Key parameters
Start with these defaults, then adjust based on your dataset size and domain.
- Initial features: 15-25 seed features from domain knowledge
- Error threshold: Top 10% of errors for contrast groups
- Feature limit: Cap at 30-40 features to prevent overfitting
- Regularization: Use Ridge regression (alpha=1.0) to handle correlated features
- Iterations: 10-20 expand-contract cycles
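These defaults are easy to keep in one config object that the extraction, pruning, and iteration code all read from; the names below are illustrative, not from the paper.

```python
GENZ_DEFAULTS = {
    "initial_features": 20,   # 15-25 seed features from domain knowledge
    "error_quantile": 0.10,   # top 10% of errors form the contrast groups
    "max_features": 35,       # cap at 30-40 to limit overfitting
    "ridge_alpha": 1.0,       # regularization for correlated features
    "iterations": 15,         # 10-20 expand-contract cycles
}
```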
Pitfalls to avoid
Common failure modes from the paper's experiments and similar ML pipelines.
- Starting with neural networks: Linear models are more stable for this task. Only add complexity if you have thousands of items.
- Ignoring feature redundancy: New features often overlap with existing ones. Use correlation checks or feature importance to prune (see the sketch after this list).
- Over-expanding features: More features is not better. The expand-contract cycle matters.
- Skipping numeric features: If you have structured data (bedrooms, price history), include it directly. Do not make the LLM rediscover obvious attributes.
- Expecting instant results: The iterative process takes 10-20 cycles. Early iterations show modest gains; later ones compound.
- Data leakage in descriptions: Ensure your LLM prompts do NOT include the target variable (price) or proxies for it. If the listing description says "million dollar view," the LLM can cheat by detecting price signals in the text itself rather than learning genuine predictive features.
- Over-engineering prompts: You do not need complex Chain-of-Thought for simple Yes/No feature extraction. The statistical model corrects for noisy labels. Keep prompts simple and cheap.
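For the redundancy pitfall, a simple correlation screen over the feature matrix is often enough to flag near-duplicate questions; the 0.9 threshold below is an arbitrary starting point.

```python
import numpy as np

def redundant_features(X, feature_names, threshold=0.9):
    """Flag feature pairs whose columns are nearly collinear."""
    corr = np.corrcoef(X, rowvar=False)   # (n_features, n_features) correlation matrix
    pairs = []
    for i in range(len(feature_names)):
        for j in range(i + 1, len(feature_names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((feature_names[i], feature_names[j], corr[i, j]))
    return pairs
```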
Cost considerations
LLM API calls are the main expense, but proper batching keeps costs low.
The math: API calls = (Items x Features x Iterations) / Batch Size
Example: 500 items x 30 features x 10 iterations = 150,000 questions. With batch size 30 (all features in one prompt per item), that is only 5,000 API calls total.
- LLM API costs: With proper batching, 500 items over 10 iterations = 5,000 calls. At GPT-4o-mini rates, roughly $5-10 total.
- Iteration costs: Each expand-contract cycle re-extracts features. The batch size divisor is what makes this affordable.
- Caching: Cache LLM responses aggressively. Feature questions are deterministic. If an item's description has not changed, reuse the cached answers.
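Caching can be as simple as keying each call on the item description plus the exact question list, so unchanged items are never re-sent to the LLM. In this sketch, `extract_features` is the batched helper from the workflow above.

```python
import hashlib
import json

_cache = {}

def cached_extract(desc, features):
    # Same description + same questions -> same answers, so reuse them
    key = hashlib.sha256(json.dumps([desc, list(features)]).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = extract_features(desc, features)
    return _cache[key]
```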
Limitations
Scale uncertainty
Experiments used 535 houses and 512 movies. Performance at larger scales (millions of items) remains untested. The iterative feature discovery may become computationally expensive.
Foundation model dependency
GenZ requires capable foundation models that understand the domain. For highly specialized fields (genomics, materials science), general LLMs may lack necessary knowledge.
Feature interpretability trade-offs
Discovered features are not always intuitive. "Movies released between 2008-2012" might predict preferences accurately but offers limited insight into why.
Incomplete baselines
The paper's zero-shot baseline provided the LLM with the full dataset including labels, which is somewhat generous. Real zero-shot scenarios might show larger gaps.
Overfitting with neural models
Neural network variants showed significant overfitting. This limits applicability for small datasets, though linear models remained robust.
Data privacy considerations
Your dataset's descriptions must be sent to the LLM for feature extraction. For highly sensitive data (medical records, financial details), this requires either a private LLM deployment or careful anonymization before processing.
Paper: arXiv:2512.24834
Authors: Marko Jojic (Arizona State University), Nebojsa Jojic (Microsoft Research)
Cite this paper
Marko Jojic, Nebojsa Jojic (2025). GenZ: Using Foundation Models as Feature Generators. arXiv:2512.24834.