
GenZ: Foundational models as latent variable generators within traditional statistical models

Marko Jojic
Arizona State University, Tempe, AZ, USA

Nebojsa Jojic
Microsoft Research, Redmond, WA, USA
Abstract

We present GenZ, a hybrid model that bridges foundational models and statistical modeling through interpretable semantic features. While large language models possess broad domain knowledge, they often fail to capture dataset-specific patterns critical for prediction tasks. Our approach addresses this by discovering semantic feature descriptions through an iterative process that contrasts groups of items identified via statistical modeling errors, rather than relying solely on the foundational model’s domain understanding. We formulate this as a generalized EM algorithm that jointly optimizes semantic feature descriptors and statistical model parameters. The method prompts a frozen foundational model to classify items based on discovered features, treating these judgments as noisy observations of latent binary features that predict real-valued targets through learned statistical relationships. We demonstrate the approach on two domains: house price prediction (hedonic regression) and cold-start collaborative filtering for movie recommendations. On house prices, our model achieves 12% median relative error using discovered semantic features from multimodal listing data, substantially outperforming a GPT-5 baseline (38% error) that relies on the LLM’s general domain knowledge. For Netflix movie embeddings, our model predicts collaborative filtering representations with 0.59 cosine similarity purely from semantic descriptions—matching the performance that would require approximately 4000 user ratings through traditional collaborative filtering. The discovered features reveal dataset-specific patterns (e.g., architectural details predicting local housing markets, franchise membership predicting user preferences) that diverge from the model’s domain knowledge alone.

1 Introduction

Given an item’s semantics, such as a text description, image, or audio recording, large foundational models can answer specific questions about the content, and thus convert the item’s semantic representation into a canonical discrete representation. Such a representation is useful for traditional statistical models—such as graphical generative models, belief networks, or probabilistic circuits—which can capture statistics from external, possibly proprietary, datasets related to these items.

Movies are an example of such items. We can use an LLM to featurize them as binary vectors based on answers to questions as illustrated in Fig. 1. These binary vectors can help explain the patterns found in the movie-user ratings matrix. Collaborative filtering techniques reduce the observation matrix to embeddings, which capture usage statistics absent from the foundational models’ training data. The foundational models, however, can reason about the semantic content—such as the movie synopsis, critical response, credits, and other metadata—either from their training or by retrieving information from a database as part of a RAG system.

We describe how appropriate prompts can be learned from such item statistics by jointly modeling the semantic features and the statistical patterns in the data.

We study a general hybrid statistical-foundational model defined as follows. Let $s$ be the semantic representation of an item, e.g., text with a movie title or its description. Given a collection of semantic feature descriptors $\theta_f$, such as 'historical war film', a foundational model, which we refer to as the function $h$, makes a judgment whether the feature applies to the item (Figure 1). Let ${\bf z}$ be the true binary feature vector, which may differ from the model's judgments due to errors in either the judgments themselves or wrongly worded feature descriptors. We model this uncertainty by defining a conditional distribution $p({\bf z}|s,\theta_f,\theta_e)$, where ${\bf z}=[z_1,z_2,\ldots,z_{n_f}]^T$ is a discrete feature vector (binary, in the example in the figure and in our experiments), $\theta_f$ is the collection of semantic feature descriptors, and $\theta_e=[p^e_1,p^e_2,\ldots,p^e_{n_f}]$ is a set of estimated probabilities of feature classification errors. A feature vector is then linked to an observation ${\bf y}$ through a classical statistical model $p({\bf y}|{\bf z},\theta_y)$, possibly involving latent variables or some other way of modeling structure, with its own parameters $\theta_y$. By jointly optimizing the parameters of the model $p({\bf y}|s)=\sum_{\bf z}p({\bf y}|{\bf z})p({\bf z}|s)$ on a large number of pairs $({\bf y},s)$, we can discover the feature descriptions, or the right questions to ask, in order to explain the statistics in the dataset. More generally, hybrid models could have more than one category of items with a (learnable) statistical model of their relationships, for example geographic regions and native plants.

Figure 1: Feature classification $h(s,\theta_f)$ with extraction prompt strings shown in Python syntax. Items $s$ in the list are featurized using feature descriptions in $\theta_f$. The first prompt is used within a double loop $(t,i)$ over items and features. The second is looped only over $i$ (and batches of items $\{s\}$), as each call retrieves the indices of items for which $z_i=1$ (with $z_i=0$ for all other items).

The idea of using interpretable intermediate representations to bridge inputs and predictions has a long history in machine learning. Concept bottleneck models (CBM) koh2020concept; kim2023probabilistic formalized this approach by training neural networks to predict pre-defined human-interpretable concepts from inputs, then using those concepts to make final predictions. Recent work has leveraged large language models (LLMs) to overcome the need for manually labeled concepts, e.g., benara2024craftinginterpretableembeddingsasking; feng2025bayesianconceptbottleneckmodels; chattopadhyay2023variationalinformationpursuitinterpretable; zhong2025explainingdatasetswordsstatistical; ludan2024interpretablebydesigntextunderstandingiteratively. These methods may or may not use the CBM formulation but share with it the goal of feature interpretability. In addition, they have a goal of feature discovery, which is also one of the goals here. They address the difficulty of modeling and jointly adjusting the semantic and statistical parts of the model—in our notation $p({\bf z}|s)$ and $p({\bf y}|{\bf z})$—in different ways.

In most cases, the target ${\bf y}$ is a discrete variable (the goal is classification through intermediate concepts), and thus the classification model $p({\bf y}|{\bf z})$ is a simple logistic regressor or an exhaustive conditional table chattopadhyay2023variationalinformationpursuitinterpretable; feng2025bayesianconceptbottleneckmodels; zhong2025explainingdatasetswordsstatistical. The optimization of the featurizer $p({\bf z}|s)$ primarily relies on the LLM's domain understanding (such as a task description), but some methods also provide examples of misclassified items, e.g., zhong2025explainingdatasetswordsstatistical; ludan2024interpretablebydesigntextunderstandingiteratively.

When real-valued multi-dimensional targets ${\bf y}$ are modeled, showing items with large prediction errors is much less informative because each item may have a different error vector, and thus item-driven discovery is usually abandoned. For example, in benara2024craftinginterpretableembeddingsasking multidimensional targets (fMRI recordings) are modeled, and the method relies on direct LLM prompting about potentially predictive features based on domain knowledge alone, followed by feature selection, rather than showing specific items to contrast.

A fundamental limitation shared by these approaches is their reliance on the LLM's domain understanding to propose discriminative features. This works well when the LLM has been trained on relevant data and the task aligns with common patterns in its training distribution. However, as our experiments demonstrate, domain understanding alone is insufficient when dataset-specific patterns diverge from general knowledge—such as house pricing dynamics in particular markets, or user preferences captured in collaborative filtering from specific time periods or demographics. Moreover, for high-dimensional real-valued targets ${\bf y}$, individual prediction errors have multidimensional structure that cannot be easily communicated to an LLM by simply showing "incorrect" examples.

Our approach differs in several key aspects. First, as in some of the recent work, we avoid concept supervision by using a frozen foundational model as an oracle within $p({\bf z}|s)$, but with trainable uncertainty parameters $\theta_e$. Second, our feature mining is driven by contrasting groups of items identified through the statistical model's posterior $q({\bf z})$ rather than by querying the LLM's domain knowledge or showing individual misclassified examples (Figure 2). This group-based contrast allows the method to discover dataset-specific patterns that may not align with the LLM's training distribution—or even cases where the LLM has no understanding of what the target ${\bf y}$ represents. (We do, however, require the (M)LLM to reason about the semantic items $s$, either based on knowledge encoded in its weights or by retrieving information from a database as in RAG systems.) Third, we support arbitrary mappings $p({\bf y}|{\bf z})$, including nonlinear functions of binary features for high-dimensional real-valued observations, shifting the burden of modeling complex feature interactions from the LLM to the statistical model. The LLM is only responsible for explaining the semantic coherence of binary splits discovered from the model's prediction errors, making the approach applicable to domains where the relationship between semantics and observations is not well understood in advance.

We refer to our hybrid model as GenZ (or GenAIz?) as it relies on a pre-trained generative AI model to generate candidate latents ${\bf z}$ in a prediction model $s\rightarrow{\bf z}\rightarrow{\bf y}$. We derive a generalized EM algorithm for training it, which naturally requires investigation of items $s$ when tuning the parameters $\theta_f$ of the featurizer $p({\bf z}|s)$.

Figure 2: Discovering feature descriptions for the prompts in Fig. 1. When the statistical model identifies a binary split in the data, we can prompt a foundational model to generate an explanation by providing example items from each group. This explanation can then be used as a feature descriptor in $\theta_f$. The prompt need not include all members of each group, only representative examples.

2 Semantic model

Before developing the full algorithm, we first describe the semantic part: the semantic parameters $\theta_f$, how $p({\bf z}|s)$ depends on them and on a foundational model, and how statistical signals from the model enable updating these parameters in an iterative EM-like algorithm.

2.1 Extracting semantic features: Modeling $p({\bf z}|s)$

The foundational model's reasoning abilities are sufficient to featurize a semantic item using standard classification prompts. A generic version of this is illustrated in Fig. 1. In the example shown, the items are represented with text, in this case movie titles that large language models are likely familiar with. Each string in the list is an item description $s$—however, the text describing each item could be expanded to include more details about the movie, or a RAG system with access to a movie database could be used in place of an LLM. We use the output $h(s,\theta_f)$ of such systems cautiously: we associate an uncertainty level, i.e., a probability $p^e_j$ of mislabeling ($h\neq z_j$), with each entry $z_j$ of ${\bf z}$. Assuming that ${\bf z}$ is not deterministically linked to $s$ makes it a latent variable in the $p({\bf y},{\bf z}|s)$ model, allowing flexibility in inference and enabling updates to the semantic feature descriptions in $\theta_f$: if the posterior given ${\bf y}$ can override the foundational model's determination of whether an item satisfies the feature, this provides a signal for updating the feature semantics.
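Since the exact prompt strings of Fig. 1 are not reproduced in the text, the following is a minimal sketch of what a per-item extraction call might look like; the prompt wording and the `llm` wrapper (any callable mapping a prompt string to a text reply) are our own assumptions, not the paper's implementation.

```python
# A minimal sketch of the per-item extraction call h(s, theta_f_i) described in Fig. 1.
# `llm` is assumed to be any callable that maps a prompt string to a text reply
# (e.g., a thin wrapper around a chat-completion API); the prompt wording below is
# illustrative, not the paper's exact string.

def h(item: str, feature_descriptor: str, llm) -> int:
    """Ask the frozen foundational model whether a feature applies to one item."""
    prompt = (
        f"Item: {item}\n"
        f"Feature: {feature_descriptor}\n"
        "Does the feature apply to the item? Answer with a single word: yes or no."
    )
    reply = llm(prompt).strip().lower()
    return 1 if reply.startswith("yes") else 0

def featurize(items, feature_descriptors, llm):
    """Binary matrix H[t][i] = h(items[t], theta_f[i]); the double (t, i) loop of Fig. 1."""
    return [[h(s, f, llm) for f in feature_descriptors] for s in items]
```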

2.2 Mining semantic features

As already alluded to, the feature semantics $\theta_f$ are not hand-crafted but learned from the dataset statistics. The statistical signal from the data indicates how the items should be separated along a particular feature $j$, i.e., how the $z_j$ for different items can be altered so that the model explains the data better. In our setup we deal with binary features, so the signal separates the items into two groups, and we seek a semantic explanation for this separation, which becomes the descriptor ${\theta_f}_j$.

Specifically, for fixed model parameters and focusing on a particular binary feature $z_i\in\{0,1\}$ for one semantic item $s$, the prior and posterior distributions are generally different:

$$p(z_i|s)\;\neq\;p(z_i|{\bf y},s)=\sum_{\{z_j\}_{j\neq i}}p\big(z_i,\{z_j\}_{j\neq i}\,\big|\,{\bf y},s\big)$$

Thus, sampling the posterior distribution (referred to below as $q$) for several different items $s^t$ yields new feature assignments $z_i^t$. These help update the feature descriptor ${\theta_f}_i$ used in $p(z_i|s,\theta_f)$ by using the prompt in Fig. 2. There, the positive group has a few items $s^t$ for which $z_i^t=1$ and the negative group has a few items for which $z_i^t=0$. Furthermore, we can select group exemplars to contain both semantic items where the posterior and the prior agree, and where they disagree. This encourages small refinements, e.g., from "fearless female characters in an action movie" to "strong female leads".
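A minimal sketch of this mining step is given below, again with illustrative prompt wording rather than the exact strings of Fig. 2; `llm` is the same hypothetical text-in/text-out wrapper as above.

```python
# A sketch of the mining prompt of Fig. 2 (wording is illustrative).
# Given representative items from the positive (z_i = 1) and negative (z_i = 0) groups,
# the foundational model is asked for a short yes/no feature description that separates
# them; the returned string becomes the new descriptor theta_f[i].

def mine_feature(positive_items, negative_items, llm) -> str:
    pos = "\n".join(f"- {s}" for s in positive_items)
    neg = "\n".join(f"- {s}" for s in negative_items)
    prompt = (
        "The following items were split into two groups by a statistical model.\n"
        f"Group A (feature present):\n{pos}\n\n"
        f"Group B (feature absent):\n{neg}\n\n"
        "Propose a short yes/no characteristic that is true of Group A items "
        "and false of Group B items. Reply with the characteristic only."
    )
    return llm(prompt).strip()
```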

3 Statistical model

Given an item's semantics $s$, e.g., a movie description, and its corresponding observed vector ${\bf y}$, e.g., the movie's embedding derived from the statistics in the movie-user matrix, the log likelihood can be bounded as

$$\log p({\bf y}|s)\;\geq\;B=\sum_{\bf z}q({\bf z})\log\frac{p({\bf y}|{\bf z})\,p({\bf z}|s)}{q({\bf z})}=E_q[\log p({\bf y}|{\bf z})]+E_q[\log p({\bf z}|s)]-E_q[\log q({\bf z})],\qquad(1)$$

where $E_q$ denotes an expectation under the distribution $q$. Depending on the variational approximation of $q$ and the model $p({\bf y}|{\bf z})$, these expectations can be computed exactly, as illustrated by the example of a mean-field $q$ and a linear $p({\bf y}|{\bf z})$ in Section B. Otherwise, a sampling method can be used, often with very few samples due to the iterative nature of the learning algorithm. In the experiments (Section 5), we simply evaluate at the mode of $q$.

The bound is tight when the variational distribution $q$ is equal to the true posterior $p({\bf z}|{\bf y},s)$; otherwise $q({\bf z})$ serves as an approximate posterior. We refer to the three additive terms as the observation part $T_1=E_q[\log p({\bf y}|{\bf z})]$, the feature part $T_2=E_q[\log p({\bf z}|s)]$, and the posterior entropy $T_3=-E_q[\log q({\bf z})]$.

Given a dataset of pairs $\{s^t,{\bf y}^t\}_{t=1}^T$, the optimization criterion—the bound on the log likelihood of the entire dataset—is

$$L(\{q^t\},\theta_f,\theta_e,\theta_y)=\sum_t B_t=\sum_t\Big(E_{q^t}[\log p({\bf y}^t|{\bf z})]+E_{q^t}[\log p({\bf z}|s^t)]-E_{q^t}[\log q^t({\bf z})]\Big)\qquad(2)$$

In general, maximizing the criterion $L$ w.r.t. all variables except the semantic model parameters (feature descriptions) $\theta_f$ is straightforward: we can use existing packages or derive an EM algorithm specific to the choice of the model's statistical distributions. The variational distributions $q^t({\bf z})$ depend only on the individual bounds for each data pair $({\bf y}^t,s^t)$ in (1) and can be fitted independently for given model parameters $\theta_f,\theta_e,\theta_y$.

The (approximate) posteriors $q^t(z_i)$ provide the statistical signal discussed in Section 2 for tuning the semantic feature descriptions ${\theta_f}_i$: sampling them separates the samples $s^t$ into two groups, and exemplars from each are used to prompt the foundational model, with the prompt in Fig. 2, for a new definition of the $i$-th feature. The optimization of $L$ then consists of steps that improve it w.r.t. the statistical model parameters and the (approximate) posteriors $q$, alternating with updates of the feature descriptions using those posteriors.

This general recipe can be used in a variety of ways, with different choices for the structure of the conditionals in the model $p$ and different approximations of the posterior $q$ leading to a variety of algorithms, including variational learning, belief propagation and sampling frey-jojic-pami-tutorial. We now develop the specific modeling choices and the learning algorithm we use in our experiments.

Binary feature model

As discussed, the feature model uses prompt-based feature classification, where each feature is treated independently based on its description ${\theta_f}_i$ (Fig. 1):

$$p({\bf z}|s,\theta_f,\theta_e)=\prod_i p(z_i|s,{\theta_f}_i,p^e_i),\qquad(3)$$

with the uncertainty (probability of error) $p^e_i$, so that

$$p(z_i|s)=(1-p^e_i)^{[h(s,{\theta_f}_i)=z_i]}\,(p^e_i)^{[h(s,{\theta_f}_i)\neq z_i]}\qquad(4)$$

where $h$ is the binary function that uses one of the prompts in Fig. 1 to detect the feature ${\theta_f}_i$ in the semantic item $s$. The expression above uses the indicator function $[\cdot]$ to express that, with probability of error $p^e_i$, the feature indicator $z_i$ is different from $h(s,{\theta_f}_i)$. Denoting $h_i=h(s,{\theta_f}_i)$, we write

$$\log p(z_i|s)=[z_i=1]\,\ell^1_i+[z_i=0]\,\ell^0_i,\qquad(5)$$
$$\ell^1_i=[h_i=1]\log(1-p^e_i)+[h_i=0]\log p^e_i,$$
$$\ell^0_i=[h_i=1]\log p^e_i+[h_i=0]\log(1-p^e_i)$$

Using a factorized posterior $q=\prod_i q(z_i)$, the feature term $T_2$ simplifies to

$$T_2=E_q[\log p({\bf z}|s)]=\sum_i\sum_{z_i}q(z_i)\log p(z_i|s)=\sum_i\big(q_i\ell^1_i+(1-q_i)\ell^0_i\big),\qquad(6)$$

with $\ell^1_i$ and $\ell^0_i$ defined in (5). Finally, the entropy term becomes

$$T_3=-E_q[\log q({\bf z})]=-\sum_i\big(q_i\log q_i+(1-q_i)\log(1-q_i)\big),\qquad q_i=q(z_i=1)\qquad(7)$$
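The feature and entropy terms are simple to evaluate for the factorized posterior. Below is a direct transcription of Eqs. (5)-(7) into NumPy, written as a sketch with a small `eps` added for numerical stability (our choice, not specified in the paper).

```python
import numpy as np

# Feature (T2) and entropy (T3) terms of the bound for one item, per Eqs. (5)-(7).
# h is the vector of LLM judgments h_i, p_err the error probabilities p^e_i,
# and q the factorized posterior q_i = q(z_i = 1).

def feature_and_entropy_terms(h, p_err, q, eps=1e-12):
    h, p_err, q = map(np.asarray, (h, p_err, q))
    l1 = np.where(h == 1, np.log(1 - p_err + eps), np.log(p_err + eps))  # ell^1_i
    l0 = np.where(h == 1, np.log(p_err + eps), np.log(1 - p_err + eps))  # ell^0_i
    T2 = np.sum(q * l1 + (1 - q) * l0)
    T3 = -np.sum(q * np.log(q + eps) + (1 - q) * np.log(1 - q + eps))
    return T2, T3, l1, l0
```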

General observation model

We can soften an arbitrary mapping function ${\bf y}=f({\bf z};\theta_m)$ by defining a distribution:

$$p({\bf y}|{\bf z},\theta_y=\{\theta_m,\sigma^2_y\})={\cal N}\big({\bf y}\,\big|\,f({\bf z},\theta_m),\,\Sigma=\mathrm{diag}(\sigma^2_y)\big),\qquad(8)$$
$$T_1=E_q[\log p({\bf y}|{\bf z})]=E_q\Big[-\tfrac{1}{2}\big({\bf y}-f({\bf z})\big)^T\Sigma^{-1}\big({\bf y}-f({\bf z})\big)\Big]-\tfrac{1}{2}\log\big|2\pi\Sigma\big|,\qquad(9)$$

with $\theta_m$ being the mapping parameters of the function $f$, e.g., regression weights in ${\bf y}={\bf w}^T{\bf z}$, and $\sigma^2_y$ being a vector of variances, one for each dimension of ${\bf y}$. Different specialized inference algorithms can be derived for specific forms of the function $f$. For example, see the appendix for a procedure derived for a linear model (factor analysis). However, in our experiments we use a generic inference technique that uses $f$ in a plug-and-play fashion, described below.
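For reference, a minimal sketch of the observation term $T_1$ evaluated at a single assignment ${\bf z}$ (the mode of $q$, as used in the experiments), for an arbitrary plug-and-play mapping `f`:

```python
import numpy as np

# Observation term T1 evaluated at one assignment z (e.g., the mode of q), per Eq. (9),
# with diagonal covariance sigma2_y.

def observation_term(y, z, f, sigma2_y):
    r = np.asarray(y) - np.asarray(f(z))            # error vector y - f(z)
    sigma2_y = np.asarray(sigma2_y)
    return -0.5 * np.sum(r * r / sigma2_y) - 0.5 * np.sum(np.log(2 * np.pi * sigma2_y))
```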

Inference of $q(z_i)$ in the fully factorized (mean-field) $q$ model

Keeping all but one $q(z_i)$ fixed, we derive the update for the $i$-th feature's posterior by setting to zero the derivative of the bound $B$ w.r.t. the parameters $q(z_i=1)$ and $q(z_i=0)$, subject to $q(z_i=1)+q(z_i=0)=1$. We can express the result as:

$$q(z_i=1)\;\propto\;e^{E_{q_{+i}}[\log p({\bf y}|{\bf z})]+E_{q_{+i}}[\log p({\bf z}|s)]-E_{q_{+i}}[\log q({\bf z})]},$$
$$q(z_i=0)\;\propto\;e^{E_{q_{-i}}[\log p({\bf y}|{\bf z})]+E_{q_{-i}}[\log p({\bf z}|s)]-E_{q_{-i}}[\log q({\bf z})]}\qquad(10)$$

where $q_{+i}=[z_i=1]\prod_{j\neq i}q(z_j)$ and $q_{-i}=[z_i=0]\prod_{j\neq i}q(z_j)$, i.e., the "+i" and "-i" posteriors have their $z_i$ value clamped to 1 and 0, respectively, in the factorized $q=\prod_i q(z_i)$. Upon normalization, as the entropy terms $E_{q_{+i}}[\log q({\bf z})]$ and $E_{q_{-i}}[\log q({\bf z})]$ cancel each other, we obtain:

$$q_i=q(z_i=1)=1-q(z_i=0)=\frac{1}{1+e^{\,E_{q_{-i}}[\log p({\bf y}|{\bf z})]-E_{q_{+i}}[\log p({\bf y}|{\bf z})]+\ell^0_i-\ell^1_i}}\qquad(11)$$

Interpretation: whether $z_i$ should take value 1 or 0 for a particular semantic item $s$ depends on:

  • the assignment $h_i$ that the foundational model produces and the level of trust we have in that assignment (the probability of error $p^e_i$); the two are captured in the prior log likelihoods $\ell^1_i$ and $\ell^0_i$;

  • the error vector ${\bf y}-f({\bf z})$, scaled by the currently estimated variances $\sigma^2_y$ for the different elements of the target ${\bf y}$, evaluated for the two possible assignments $z_i=1$ and $z_i=0$, with the other assignments $z_j$ following the current distributions $q(z_j)$ over the other features.

While the expectations $E_{q_{+i}}$ and $E_{q_{-i}}$ could potentially be estimated better through sampling, in our experiments we simply compute $\log p({\bf y}|{\bf z})$ at the mode (assuming the most likely values for the other $z_j$ variables).
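A sketch of the resulting update, with the expectations in Eq. (11) approximated at the mode of $q$ as described above; the helper reuses the $\ell^1_i,\ell^0_i$ arrays computed earlier and reflects our reading of the procedure rather than the authors' code.

```python
import numpy as np

# Mean-field update of q_i per Eq. (11), with expectations evaluated at the mode of q:
# the other features z_j are set to their most likely values.

def update_qi(i, q, y, f, sigma2_y, l1, l0):
    z = (np.asarray(q) > 0.5).astype(float)   # mode of the current factorized posterior
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[i], z_minus[i] = 1.0, 0.0

    def loglik(z_vec):                        # log p(y | z) up to the shared normalizer
        r = np.asarray(y) - np.asarray(f(z_vec))
        return -0.5 * np.sum(r * r / np.asarray(sigma2_y))

    logit = loglik(z_minus) - loglik(z_plus) + l0[i] - l1[i]
    return 1.0 / (1.0 + np.exp(logit))
```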

Optimizing numerical parameters of $p({\bf y}|{\bf z})$ and $p({\bf z}|s)$

Optimizing the bound $B$ w.r.t. the parameters $\theta_m$ of the mapping in ${\bf y}\approx f({\bf z};\theta_m)$ and the remaining uncertainty/variance $\sigma^2_y$ reduces to maximizing the sum of the $T_1$ terms over the training set, i.e., $\sum_t E_{q^t}[\log p({\bf y}^t|{\bf z})]$. Again, while sampling of $q$ might be helpful, in our experiments we simply use the mode and iterate:

$$\theta_m=\arg\min_{\theta_m}\sum_t\big({\bf y}^t-f({\bf z}^t;\theta_m)\big)^T\Sigma^{-1}\big({\bf y}^t-f({\bf z}^t;\theta_m)\big)\qquad(12)$$
$$\sigma^2_y=\frac{1}{T}\sum_t\big({\bf y}^t-f({\bf z}^t;\theta_m)\big)\circ\big({\bf y}^t-f({\bf z}^t;\theta_m)\big)\qquad(13)$$

where ${\bf z}^t$ is the binary vector of assignments of the features $z_i$, either a sample from $q^t$ or the mode of the posterior $q^t$ for the $t$-th of the $T$ entries in the dataset of pairs $(s^t,{\bf y}^t)$, and $\circ$ denotes point-wise multiplication.
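For the linear case ${\bf y}\approx {\bf z}^T{\bf W}$ with diagonal $\Sigma$, the weighted least-squares problem in Eq. (12) decouples across output dimensions, so the M-step reduces to ordinary least squares followed by the variance update of Eq. (13). A sketch (the variance floor is our own numerical safeguard):

```python
import numpy as np

# M-step sketch for a linear mapping, per Eqs. (12)-(13). Z is the (T, n_f) matrix of
# binary assignments z^t (at the mode of q^t), Y is the (T, d) matrix of targets.

def m_step_linear(Z, Y):
    W, *_ = np.linalg.lstsq(Z, Y, rcond=None)        # theta_m: weights of y ~ Z @ W
    residual = Y - Z @ W
    sigma2_y = np.mean(residual * residual, axis=0)  # Eq. (13), element-wise square
    return W, np.maximum(sigma2_y, 1e-8)             # floor variances for stability
```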

The semantic model has numeric parameters $p^e_i$, modeling the hybrid model's uncertainty in the semantic function $h$ that makes an API call to the foundational model. By optimizing the bound w.r.t. these parameters we obtain the update:

$$p^e_i=\frac{1}{T}\sum_{t=1}^T\Big(q_i^t\,[h_i^t=0]+(1-q_i^t)\,[h_i^t=1]\Big)\qquad(14)$$

Thus, for a given set of semantic feature descriptors $\theta_f$, iterating equation (11) for each item $t$, and (12) and (14) using all items in the training set, fits the numerical model parameters and creates the soft feature assignments $q^t$ that best balance those feature descriptors against the real-valued, possibly multi-dimensional targets ${\bf y}^t$.
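The error-probability update of Eq. (14) is just the expected disagreement rate between the LLM judgments and the posteriors:

```python
import numpy as np

# Update of the error probabilities p^e_i per Eq. (14), where H[t, i] = h_i^t are the
# LLM judgments and Q[t, i] = q_i^t are the current posteriors.

def update_error_probs(H, Q):
    H, Q = np.asarray(H), np.asarray(Q)
    return np.mean(Q * (H == 0) + (1 - Q) * (H == 1), axis=0)
```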

4 Algorithms

In Section 2 we pointed out that the individual $q(z_i)$ distributions carry information about how the data should be split into two groups to create the mining prompt in Fig. 2. But as each of these distributions depends on the others, there is the question of good initialization. In our experiments we use a method akin to the variational EM view of boosting neal1998boosting, albeit in our case applied to learning to predict real-valued multi-dimensional targets, rather than just for classification.

Note that if we iterate (11), (12), (14) while keeping $p^e_i=0.5$ for a selected feature $i$, the entire hybrid model and the posterior $q(z_i)$ are decoupled from the foundational model's prediction $h_i$, which depends on the feature description ${\theta_f}_i$ to provide its guess at $z_i$. Therefore, the iterative process is free to come up with the assignment of $z^t_i$ for each item $t$ in a way that maximizes the likelihood (and minimizes the prediction error). This provides a natural binary split of the data into two groups based on error vectors. In the case of predicting house prices (one-dimensional $y$), those two groups would be the lower- and higher-priced houses relative to the current prediction, but in the case of multi-dimensional observations, such as movie embeddings, the split would be into two clusters, each with its own multi-dimensional adjustment to the current prediction.

A simple algorithm for adding features is shown in Algorithm 1. The key idea is to fit a model with only the features $j\in[1..i]$ while keeping the new feature $i$ decoupled from the LLM prediction by fixing $p^e_i=0.5$. This allows the data to determine the optimal split before discovering its semantic explanation.

The algorithm operates on the model $p(\{{\bf y}^t,\{z_j^t\}_{j=1}^i,s^t\}_{t=1}^T)$, in which features $j>i$ do not participate. By keeping $p^e_i=0.5$ fixed during the inner loop, the posterior $q_i^t$ is determined purely by how well $z_i^t$ helps explain ${\bf y}^t$, independent of any LLM prediction. This provides the statistical signal needed to discover meaningful features through the mining prompt.
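Since the algorithm boxes are not reproduced here, the following condensed sketch shows one way the feature-addition step could be organized, reusing the `h` and `mine_feature` sketches above; `fit_statistical_model` is a placeholder for the inner loop of Eqs. (11), (12), (14), and all signatures are assumptions rather than the paper's implementation.

```python
import numpy as np

# A condensed sketch of Algorithm 1 (basic, non-targeted feature addition).
# H: (T, i) matrix of existing LLM judgments; theta_f: list of current descriptors;
# fit_statistical_model is a placeholder that iterates Eqs. (11), (12), (14) and
# returns the (T, i+1) matrix of posteriors Q.

def add_feature(items, Y, H, p_err, theta_f, llm, n_exemplars=8):
    T = len(items)
    i = len(theta_f)                                  # index of the new feature
    H = np.column_stack([H, np.zeros(T, dtype=int)])  # LLM column ignored while p^e_i = 0.5
    p_err = np.append(p_err, 0.5)                     # decouples z_i from the LLM judgment
    Q = fit_statistical_model(H, Y, p_err, freeze_error_of=i)
    z_i = Q[:, i] > 0.5                               # data-driven split for feature i
    pos = [items[t] for t in np.flatnonzero(z_i)[:n_exemplars]]
    neg = [items[t] for t in np.flatnonzero(~z_i)[:n_exemplars]]
    theta_f.append(mine_feature(pos, neg, llm))       # explain the split (Fig. 2)
    H[:, i] = [h(s, theta_f[-1], llm) for s in items] # re-extract with the new descriptor
    return theta_f, H, p_err
```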

As we show in experiments on the simple task of discovering the binary representation of numbers, iteratively running Algorithm 1 with increasing $i$ adds new features in a coarse-to-fine manner, from those that split the data based on large differences in the target ${\bf y}$ to those that explain increasingly smaller differences.

Conditional (targeted) feature addition

A possible problem with such a coarse-to-fine approach, especially for nonlinear and easy-to-overfit functions $f({\bf z})$, is that items with different combinations of features can be arbitrarily split by the feature $z_i$ we are updating. For example, suppose we are working on the house pricing dataset and are running Algorithm 1 for the second time, having added only one feature so far: ${\theta_f}_1=$ 'Located in Arizona desert communities'. Then, as we iteratively look for a natural split $q_2$, it is possible that for items with high $q_1$ the procedure splits the data along a different semantic axis (e.g., presence of swimming pools) than for the items with low $q_1$. To deal with this, we alter Algorithm 1 very slightly: when we sample positive and negative examples, we take them from items whose other features are identical. This is akin to using the structured variational approximation $q=q(z_i|\{z_j\}_{j\neq i})\prod_{j\neq i}q(z_j)$, but instead of fitting this for all combinations of other features, we target only one combination at a time. The discovered feature ${\theta_f}_i$ is still used to assign values $h_i$ for all data samples, whether or not they were in the targeted group, so that it can be used in the next step.

To select the combination of features $\{z_j\}_{j=1}^{i-1}$ on which we condition the $i$-th feature discovery, we start with the cost being optimized in (12), i.e., the observation part ($T_1$) of the approximate log likelihood based on a sample (or the mode) ${\bf z}^t$ for each item:

$$\sum_t\big({\bf y}^t-f({\bf z}^t)\big)^T\Sigma^{-1}\big({\bf y}^t-f({\bf z}^t)\big),\qquad(15)$$

or equivalently, by partitioning over all observed combinations ${\bf c}$ of the other features:

$$\sum_{\bf c}\;\sum_{t:\,{\bf z}_{-i}^t={\bf c}}\big({\bf y}^t-f({\bf z}^t)\big)^T\Sigma^{-1}\big({\bf y}^t-f({\bf z}^t)\big)=\sum_{\bf c}T_1^{\bf c},\qquad(16)$$

where the $T_1^{\bf c}$ are the summands giving the total error for each observed combination ${\bf c}$ of the other features ${\bf z}_{-i}=[z_1,\ldots,z_{i-1}]^T$. We select the combination ${\bf c}^*$ with the largest total error, ${\bf c}^*=\arg\max_{\bf c}T_1^{\bf c}$. Algorithm 2 shows the conditional (targeted) feature addition procedure. The key difference from Algorithm 1 is that positive and negative examples are sampled from a targeted subgroup $G^*$ (items sharing the same values for features $1$ through $i-1$), selected by identifying which combination has the largest total error. The discovered feature ${\theta_f}_i$ is then applied to all items.
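A small sketch of the selection of ${\bf c}^*$ and the targeted subgroup $G^*$, following Eq. (16); here `f` is the current model fitted on the already-discovered features:

```python
import numpy as np

# Select the conditioning combination c* of Eq. (16): group items by their current
# assignment of features 1..i-1 and pick the group with the largest total weighted error.

def select_target_group(Z_prev, Y, f, sigma2_y):
    errs = {}
    for t in range(len(Y)):
        c = tuple(Z_prev[t])                               # combination of the other features
        r = np.asarray(Y[t]) - np.asarray(f(Z_prev[t]))    # current model's error vector
        errs[c] = errs.get(c, 0.0) + np.sum(r * r / np.asarray(sigma2_y))  # T_1^c summand
    c_star = max(errs, key=errs.get)
    return [t for t in range(len(Y)) if tuple(Z_prev[t]) == c_star]        # targeted group G*
```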

These algorithms add a feature but can also be used to update a feature by removing it from the set and then adding a new one. While we can select a feature to update at random, we use a more efficient approach: we choose the feature whose removal affects the prediction (likelihood bound) the least, as shown in Algorithm 3. A single feature update becomes a removal of one feature (Algorithm 3) followed by adding a feature (Algorithm 1). Generally, we alternate an arbitrary number of feature adding steps with feature removal steps, as shown in Algorithm 4.
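A sketch of the pruning step of Algorithm 3 as we read it: tentatively drop each feature and keep the removal that hurts the re-evaluated bound of Eq. (2) the least. `evaluate_bound` is a placeholder for refitting the numerical parameters and computing $L$ on the training set.

```python
import numpy as np

# Prune the least useful feature (Algorithm 3 sketch); evaluate_bound is a placeholder.

def prune_least_useful(H, Y, p_err, theta_f):
    scores = []
    for j in range(len(theta_f)):
        keep = [k for k in range(len(theta_f)) if k != j]
        scores.append(evaluate_bound(H[:, keep], Y, p_err[keep]))  # bound without feature j
    j_star = int(np.argmax(scores))                                # its removal hurts least
    keep = [k for k in range(len(theta_f)) if k != j_star]
    return H[:, keep], p_err[keep], [theta_f[k] for k in keep]
```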

Algorithm 1: Basic feature discovery (non-targeted)
Algorithm 2: Conditional (targeted) feature addition
Algorithm 3: Prune feature (remove least useful feature)
Algorithm 4: Expand-contract (iterative feature refinement)

5 Experiments

Especially exciting applications of the hybrid models discussed above involve situations in which the foundational model is capable of recognizing patterns in the data, i.e., features that some items share and others do not, but the high-dimensional target ${\bf y}$ is not well understood by the model, e.g., in scientific discovery based on new experiments. However, there is value in analyzing situations where we (humans) understand the links, so that we can better understand what the model is discovering. This is especially interesting when the data involves well-understood semantic items but also novel (or dated!), or otherwise skewed, statistics linking these items to the targets ${\bf y}$ we want to discover, despite a potential domain shift w.r.t. what the foundational model was trained on.

We analyzed three datasets we expected to be illustrative:

  • Learning binary: A toy dataset with items $s$ being the strings "0", "1", ..., "511", and corresponding targets $y=\text{str2num}(s)$. We demonstrate that iterating Algorithm 1 recovers a 9-bit binary representation.

  • Hedonic regression on house prices: A multimodal dataset from ahmed2016house, consisting of four listing images and metadata per house, which we turned into semantic items $s$ describing all information about a house, with the target being the log price $y$ (Fig. 3).

  • Cold start in recommendation systems: The Netflix prize dataset. Singular value decomposition of the user-movie matrix is used to produce a 32-dimensional embedding ${\bf y}$ for each movie, and the associated semantic item is the movie title and the year of release.

Since LLMs can understand similarities between movies or houses and prices, there was a chance that the models could discover relevant features without our iterative algorithms. In fact, as discussed in Section 1, most prior work on using off-the-shelf LLMs to discover features useful in classification and, especially, regression focused on probing LLMs with the description of the task, rather than the data itself, and subsequent feature selection. Our baseline that represents such approaches, which we refer to as 0-shot, is to provide an LLM with an example list of semantic items, describe the task, and ask for around 50 yes/no characteristics that in the LLM’s judgment would be most useful for the task.

Except for the first experiment, whose purpose is to illustrate the coarse-to-fine nature of feature discovery with a linear model, we tested two $f({\bf z})$ functions (softened in $p({\bf y}|{\bf z})$ through the learned variances $\sigma^2_y$): a linear model and a neural network with two hidden layers (64 and 32 units, with ReLU activations and a dropout rate of 0.1), trained using the Adam optimizer with learning rate 0.001 for 100 epochs and batch size 32.
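A minimal sketch of the nonlinear $f({\bf z})$ described above; the use of PyTorch and the exact layer ordering are our assumptions, as the paper does not specify the framework.

```python
import torch
import torch.nn as nn

# Sketch of the nonlinear mapping f(z): two hidden layers of 64 and 32 ReLU units with
# dropout 0.1, mapping the binary features (plus any numeric inputs) to the target.

class FeatureMLP(nn.Module):
    def __init__(self, n_inputs, target_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, 64), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(32, target_dim),
        )

    def forward(self, z):
        return self.net(z)

# Training as described in the text: Adam with learning rate 0.001, 100 epochs,
# batch size 32, minimizing the variance-weighted squared error of Eq. (12), e.g.:
# model = FeatureMLP(n_inputs=Z.shape[1], target_dim=Y.shape[1])
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```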

For the two more involved datasets (houses and movies), we used the conditional feature addition of Algorithm 2, with 10-20 cycles of adding 5 features and removing 2. During the process, whenever a discovered feature is active for less than 2% or more than 98% of the data, it is discarded; as a result, the feature count generally grows until it stabilizes towards the end of the runs.

Mining prompts are executed with GPT-5, while extraction prompts are executed with GPT-4.

5.1 Learning, or rather, recalling binary

The primary purpose of this experiment was to illustrate what iterating Algorithm 1 tends to do. The semantic items $s\in\{\text{"0"},\text{"1"},\ldots,\text{"511"}\}$ have the corresponding targets $y=\text{str2num}(s)$. We map the characteristics ${\bf z}$ to predictions $y$ with a linear function $f({\bf z})={\bf w}^T{\bf z}+b$.

When we add the first feature by running Algorithm 1, $q_i$ splits the data into high and low values (above or below the mean). When the positive and negative groups are incorporated into the mining prompt (Figure 2) and given to GPT-5, it easily detects the pattern: depending on the random initialization of ${\bf w}$, the positive group's characteristic is one of two sorts, either an equivalent of "numbers 0-255 inclusive" or an equivalent of "numbers greater than 255".

The first feature splits the data into high and low values, and for all items the model will now predict the mean of one of these two subsets: 127.5 for numbers less than 256, and 383.5 for the rest. The next execution of Algorithm 1 will then assign $z_2=1$ and $z_2=0$ to items based on whether the model over- or underestimates $y$. (Due to randomness in initialization, $z_2=1$ could correspond to either all of the overshoots or all of the undershoots.) So the positive group will consist of either the upper halves or the lower halves of the $z_1$ groups. For example, the positive group might include all items in both the range [0..127] and the range [256..383]. Given these as the positive group, and the rest as negative, in the mining prompt, GPT-5 sets the next feature to be the equivalent of "Integers with the 2^7 bit unset." As before, due to random initialization, the positive and negative groups could be switched, but that would still yield an equivalent feature, as the model simply adjusts weights.

Here is an example of a set of features $\theta_f$ obtained by running Algorithm 1 nine times, with $i$ increasing from 1 to 9:

  • Integers greater than or equal to 256 (i.e., with the 2^8 bit set)

  • Integers with the 2^7 bit set (i.e., numbers whose value modulo 256 is between 128 and 255)

  • Integers with the 2^6 (64) bit unset (i.e., numbers whose value modulo 128 is 0–63)

  • Integers with the 2^5 (32) bit set (i.e., numbers whose value modulo 64 is between 32 and 63)

  • Integers with the 2^4 (16) bit unset (i.e., numbers whose value modulo 32 is 0–15)

  • Integers with the 2^3 (8) bit set (i.e., value modulo 16 is between 8 and 15)

  • Integers with the 2^2 (4) bit unset (i.e., numbers whose value modulo 8 is 0–3)

  • Integers with the 2^1 (2) bit set (i.e., numbers congruent to 2 or 3 modulo 4)

  • Integers with the 2^0 (1) bit unset (even numbers)

And here is another one:

  • Integers from 256 to 511 (inclusive)

  • Integers whose value modulo 256 is in the range 128–255 (i.e., their low byte has its highest bit set)

  • Integers whose value modulo 128 is in the range 0–63 (inclusive)

  • Integers whose value modulo 64 is in the range 0–31 (the lower half of each 64-block)

  • Integers whose value modulo 32 is in the range 16–31 (the upper half of each 32-number block)

  • Integers whose value modulo 16 is in the range 0–7 (the lower half of each 16-number block)

  • Integers whose value modulo 8 is in the range 0–3 (the lower half of each 8-number block)

  • Integers congruent to 2 or 3 modulo 4 (i.e., with the second least significant bit set)

  • Even integers (including 0)

Both reconstructions are perfect. We see the same for other ground-truth choices of $y$ that are linearly related to the value of the number in the string $s$, $y=a_1\cdot\text{str2num}(s)+a_2$. The feature discovery is driven by the structure of the algorithm, which breaks the data in a coarse-to-fine manner through binary splits, perfectly matching the binary number representation. All that a foundational model has to do is recognize this in the provided positive/negative splits. In many applications, such coarse-to-fine binary splitting of the data can be problematic, as features may not interact additively. In the next two experiments, in more realistic scenarios, we use Algorithm 2, which chooses the groups conditional on the assignment of the already discovered features.

5.2 Hedonic regression of house prices

The multi-modal dataset ahmed2016house consists of information for 535 houses listed in 2016, the first of which is shown in Figure 3. While our approach is directly applicable to multi-modal data, to save tokens and avoid limitations on the number of images we can submit in our mining prompt, we first transformed the information to pure text by running all images through GPT-5 with the prompt "Describe the listing image in detail." Each semantic item is a JSON file with the metadata (without the price) and the image descriptions, and the target $y$ is the log of the price. (Log price modeling is standard in hedonic regression, as some features affect prices multiplicatively; for example, an upgraded interior is costlier for a larger home, or for a home in an expensive ZIP code.)

Figure 3: House listing images (left) and the corresponding MLLM-generated semantic item (right). The strikethrough price indicates that the price is withheld from the items $s$ used in training our model. However, in the alternative to our approach, all items, including prices, were included in the baseline prompting, where GPT-5 was given the entire dataset and asked to generate features predictive of the price.
(a) Linear model. (b) Neural network.
Figure 4: Hedonic regression results on the log of house prices. Both plots show the median absolute error (MAE, left axis) and the number of features (right axis) over iterations. The horizontal dashed line shows the GPT-5 0-shot baseline (MAE = 0.38). The video showing which features are added (blue) or removed in (b) is available at: https://github.com/mjojic/genZ/tree/main/media.

As discussed in Section 5, we fit $y$ using both the linear and the nonlinear (neural net) forms of the function $f$. In both cases, we supply $f$ with additional real-valued inputs $x_1=1$, $x_2=\text{area}$, $x_3=\text{bedrooms}$, $x_4=\text{bathrooms}$, as we do not wish to rediscover these numerical features as we did in the toy experiment in the previous section. However, in the semantic items $s$, the full metadata (the number of bedrooms and bathrooms, the area in square feet, but not the price) is still provided, as it can participate in the features ${\bf z}$ to account for both nonlinear transformations of price and combined feature effects. For example, in some locales luxury homes have more bathrooms than bedrooms, and that signal can be captured in a discrete $z_i$ as an indicator of higher-priced homes.

We split the data into a training set (80% of the data) and a test set. We show the learning curves in Figure 4 for the linear model $f$ and a neural network $f$. The video showing which features are added (blue) or removed (red) in Figure 4(b) is available at: https://github.com/mjojic/genZ/tree/main/media. We plot the progress in terms of the median absolute error, as this corresponds to how the performance of hedonic regression of house prices tends to be evaluated in industry. For example, Zillow zillow reported that its Zestimate for off-market homes has a median error rate of 7.01%. (The Zestimate for on-market homes is much better because it benefits from the known listing price set by an agent; in our experiments we are trying to predict the price without such professional appraisal.) As we model log prices, small median absolute errors (MAE) are close to the relative (percentage) error of the actual (not log) price. For example, achieving an MAE of 0.10-0.11, as our model does, is equivalent to a median error rate of around 12%.
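As a quick check of that conversion (our own arithmetic, not a figure reported in the paper): a median absolute error of $\delta$ in log price means the predicted price is typically off by a factor of $e^{\pm\delta}$, so
$$\delta=0.11\;\Rightarrow\;e^{\pm 0.11}\in[0.896,\;1.116],$$
i.e., a median relative error of roughly 11-12% on the raw price.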

The 0-shot baseline is indicated by the horizontal line. The baseline is computed as follows. First, we provided the entire set of semantic items, including the prices, to GPT-5 and asked for 50 yes/no questions whose answers would be helpful in predicting the price (GPT gave us 52 instead). These were then set as the features $\theta_f$, and the pruning Algorithm 3 was applied repeatedly on the training set using the linear $f({\bf z},{\bf x})$, keeping track of the test performance. (Note that the function $f$ still takes the numeric features $x_1,x_2,x_3,x_4$.) The horizontal line on the graph represents the best test set performance achieved during pruning. As discussed in Sections 1 and 5, this baseline is meant to simulate what could be expected from methods that rely on the LLM's understanding of the domain, rather than iteratively contrasting items based on modeling error as our approach does. The baseline vastly underperforms, and by comparing the baseline's features with the features our method discovered (Section A.1) we can see why. For example, the baseline's features rarely focus on location or on the architecture and quality of the build/remodel, and instead capture relatively cosmetic features such as fenced yards or green lawns.

The differences in the learning curves provide another illustration that the hybrid model adjusts all its parameters in unison during learning. The train and test median absolute errors drop together until over-training happens. In the case of the linear model, the training error continues to drop while the testing error stays stable, whereas in the case of the nonlinear (neural) model, both train and test errors reach lower levels of around 0.11 (roughly 12% relative error on the raw, not log, price), after which the model begins to overtrain more severely, with the training error dropping even lower as the testing error rises. (Compare Zillow's 7% error zillow, which is based on a different, much larger and richer proprietary dataset of hand-crafted features.)

5.3 Cold start recommendations: Netflix prize dataset

The cold start problem in collaborative filtering systems refers to the challenge of making predictions for new items or users without sufficient interaction history. We use the classic Netflix prize dataset bennett2007netflix; netflix-prize-data to demonstrate how our hybrid model can address the cold start problem for new movies by discovering semantic features that predict collaborative filtering embeddings.

For 480,189 users and 17,770 movies, the dataset provides user-movie ratings (0-5, where 0 indicates that the movie was not rated by that user). We computed 32-dimensional embeddings ${\bf y}$ for each movie using singular value decomposition (SVD) of the ratings matrix. It is well understood that such embeddings capture latent user preferences and movie characteristics discovered through collaborative filtering, and they are remarkably good at modeling similarity between movies and at matching users to movies via simple vector projection of the embeddings. (See also the discussion in Section C on how, instead of embeddings, the raw matrix could be modeled with two different types of semantic items – users and movies – using a formulation similar to the one above.)
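A sketch of how such embeddings could be computed with a truncated SVD; the scaling of the movie factors by the singular values and the normalization are our assumptions, chosen because the paper evaluates predictions with cosine similarity.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Sketch: 32-dimensional movie embeddings y from the sparse user-movie ratings matrix
# via rank-k truncated SVD (zeros = unrated).

def movie_embeddings(ratings, k=32):
    """ratings: scipy.sparse matrix of shape (n_users, n_movies)."""
    U, S, Vt = svds(ratings.asfptype(), k=k)          # truncated SVD of the ratings matrix
    Y = (Vt * S[:, None]).T                           # one k-dim embedding per movie
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)  # unit-normalize for cosine similarity
```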

To illustrate the cold start problem, we simulate adding a new movie with limited exposure to viewers by removing a single movie and gradually adding user ratings. For a selected movie, we set the values in the ratings matrix to zero for all users, then gradually add users' actual ratings in a random order and recompute the movie's embedding after each addition. Figure 5 shows how the cosine similarity (CS) between the original (full observation) movie embedding and the embedding based on sparse observations improves as we add ratings for up to 4000 users. The plot shows the mean and 90% confidence interval over 100 randomly selected movies from the dataset. The similarity reaches only about 0.38 after 1000 users and approximately 0.57 after 4000 users. In other words, thousands of users would need to be exposed to the movie, many skipping it and some rating it, before the CS could grow to around 0.6, leading to more reliable recommendations. We show below that our linear hybrid model can predict embeddings immediately from the semantic items describing the movie, achieving a test CS of approximately 0.59 without requiring any user ratings—a performance level that would otherwise require roughly 4000 user ratings to achieve through collaborative filtering alone.

Figure 5: Cold start convergence simulation. Cosine similarity between embeddings computed from partial ratings vs. full ratings, as user ratings are gradually added in random order. The shaded region shows the 90% confidence interval over 100 randomly selected movies. Hundreds of ratings are typically needed before embeddings converge, illustrating the severity of the cold start problem that our hybrid model addresses by predicting embeddings from semantic features alone.

In our experiments, we focused on the 512 most watched movies, again randomly split into a training set (80%) and a test set. This allowed us to set the semantic items $s$ for these movies to consist only of the movie title and year of release: modern LLMs are familiar with the most popular movies watched in the period from 1998 to 2006. Otherwise, we would need the items $s$ to contain more information, as was the case in the house experiment, or allow the LLM to use a RAG system to retrieve relevant information when executing the extraction and mining prompts. The quality of the predictions is measured using the cosine similarity (CS) between the predicted and true 32-dimensional embeddings ${\bf y}$.

The learning curves are shown in Figure 6 for both the linear and the neural network models. The video showing which features are added (blue) or removed (red) in Figure 6(a) is available at: https://github.com/mjojic/genZ/tree/main/media. As with the houses experiment, we compare against a strong 0-shot baseline (horizontal dashed line at CS = 0.48), obtained by asking GPT-5 to suggest 50 features based on the task description and movie titles and then, via pruning on the training set, finding the best feature set in terms of test error, as this is the best the baseline could possibly achieve for that training-test split.

The linear model substantially outperforms the 0-shot baseline, reaching a test cosine similarity of approximately 0.59, indicating that the iterative feature discovery process finds movie characteristics more predictive of user preferences than those suggested by the LLM based on task understanding alone. The 0.11 improvement in cosine similarity (from 0.48 to 0.59) is substantial: as the cold start simulation in Figure 5 shows, this difference equals the improvement gained from adding roughly 2000 additional user ratings. Unlike in the house price prediction experiment, using the neural network as the function $f({\bf z})$ in the GenZ system leads to more dramatic overfitting, where the test CS score quickly levels off while the train CS continues to grow.

Examining the discovered features reveals interpretable patterns aligned with user preferences, though notably different in character from the 0-shot baseline features (see Section A.2 for complete feature lists). Where the 0-shot features focus on content-based attributes (genre, plot structure, themes, pacing, narrative devices), the discovered features gravitate toward external markers: specific talent (individual actors and composers), multiple fine-grained award distinctions (Best Picture, Best Actor/Actress for a specific role, Best Original Screenplay), franchise membership, and surprisingly precise temporal windows (1995-2000, 2004-2005, or post-1970). The John Williams feature is particularly revealing—it cuts across multiple genres to identify a coherent aesthetic or cultural preference that would be invisible to a content-based analysis. This suggests that collaborative filtering embeddings capture user similarity not through "stories like this" but through shared preferences for prestige signals, specific creative talent, and cultural/temporal cohorts (at least in this dataset). In another dataset, focused on a different selection of movies, a different time frame, or a different cross-section of users, the system would likely discover a different emergent structure in collective viewing, possibly at odds with conventional theories of content similarity.

(a) Linear model. (b) Neural network.
Figure 6: Cold start recommendation results on the Netflix dataset. Both plots show the cosine similarity (CS) between predicted and true embeddings (left axis) and the number of features (right axis) over iterations. The horizontal dashed line shows the GPT-5 0-shot baseline (CS = 0.5). The linear model achieves better test performance (CS $\approx$ 0.59) than the neural network (CS $\approx$ 0.52) despite using fewer features, suggesting that the relationship between semantic features and collaborative filtering embeddings is predominantly linear. The video showing which features are added (blue) or removed in (a) is available at: https://github.com/mjojic/genZ/tree/main/media.

6 Conclusions

The differences and similarities of the learning curves relative to the baseline in both experiments indicate that the components of the model jointly adjust to the different semantics of the data as well as to the target statistics and their relationships to the semantics. Even though the models used here (GPT-4/5) are familiar with the semantics of the data, their ability to predict the real-valued, possibly multi-dimensional targets benefits from the rest of the system and the modeling choices. For example, while a neural network better models the interactions of features needed to predict house prices, the linear model is better suited for modeling movie embeddings. In both applications, the learning curves show the classic behavior of iterative improvement until over-training. On the other hand, features generated by the LLMs alone are highly redundant, preventing both effective prediction and overtraining. As evident in the discovered features (Section A), our system focuses on properties of the semantic items that fit the statistics of the links $s\rightarrow{\bf y}$, rather than the 'conventional wisdom' of a zero-shot approach: for houses, the system discovers that location and the quality of the build/remodel matter most in the data it was given for listings in 2016, while for movies, the system discovers the features important to the audience in the period 1998-2006, such as the difference between newer and older content, the importance of franchises, specific actors/directors, and so on.

Appendix A Learned features

A.1 Learned features for the house dataset

The following are the features discovered by GenZ with the nonlinear (neural net) $f({\bf z})$ (on top of the automatically supplied numerical values $x_1,x_2,x_3,x_4$ for the intercept, area, number of bedrooms, and number of bathrooms), reaching test and train median relative errors of around 12% of the raw target price (Fig. 4(b)):

  • Site-built construction (not a manufactured/mobile home)

  • Located in Arizona desert communities (Scottsdale/Carefree area: zip codes 85255, 85262, 85377)

  • Located in the 92880 zip code (Eastvale/Corona area)

  • Located in the 62234 zip code (Collinsville, IL area)

  • Located in the 94531 zip code (Antioch, CA area)

  • Located in the 92276 zip code (Desert Hot Springs, CA)

  • Located in the 93111 zip code (Goleta/Santa Barbara area)

  • Located in the 81524 zip code (Loma, CO area)

  • Located in the 93510 zip code (Acton, CA area)

  • Located in the 91901 zip code (Alpine, CA area)

  • Bathrooms feature retro/vintage elements (e.g., colored fixtures or tile like powder blue/pink, clawfoot tubs, or checkered floors)

  • Located in the 92677 zip code (Laguna Niguel, CA)

  • Bathrooms featuring Hollywood-style bulb strip lighting above the mirror

  • Located in the 94501 zip code (Alameda, CA)

  • Homes featuring plantation shutters on windows

  • Located in the 96019 zip code

  • Located in Chicago-area suburbs with zip codes starting with 600 (e.g., 60002, 60016, 60046)

  • Bathrooms feature built-in vanities with cabinetry (no pedestal sinks)

  • Interior walls feature wood paneling

  • Kitchen does not include an island (no central island present)

  • Home includes a carport for covered parking (rather than an attached garage)

  • Located in zip codes starting with 90 (Los Angeles/Long Beach area)

  • Home includes a three-car garage

In Fig. 4(b), the horizontal line represents the minimum test error after training on the training set with an ever-shrinking set of baseline features obtained in what we refer to as a 0-shot manner. Specifically, to create those baseline features, GPT-5 was given the entire dataset (training and test) together with the prices and asked to hypothesize features useful for price prediction. The following are the 52 features from the 0-shot baseline used in feature selection:

  • House has 4 or more bedrooms

  • House has 5 or more bedrooms

  • House has 3 or more full bathrooms

  • House has 4 or more full bathrooms

  • House has an attached garage

  • House has a 3-car (or larger) garage

  • House has a circular driveway

  • House has a gated entry (wrought-iron gate/entry gate)

  • House has a covered front porch

  • House has a covered patio

  • House has a pergola for shade

  • House has a raised deck

  • House has a balcony

  • House has a private swimming pool

  • House has a resort-style pool with water features

  • House has waterfront access (dock/marina/pond view)

  • House has mountain or panoramic scenic views

  • House has extensive desert/xeriscape landscaping (gravel/cacti/rocks)

  • House has a green lawn (grass yard)

  • House has a fenced yard

  • House has solar panels on the roof

  • House has a tile roof (e.g., clay/Spanish tile)

  • House has stucco exterior walls

  • House has exterior stone/stacked-stone accents

  • House has Mediterranean-inspired architecture (arches/turret/columns)

  • House has Craftsman-style exterior elements (stone columns/shingle siding)

  • House has an open-concept kitchen/living layout

  • House has a large kitchen island

  • House has a breakfast bar or peninsula seating in the kitchen

  • House has granite or stone kitchen countertops

  • House has a tile or mosaic kitchen backsplash

  • House has stainless steel kitchen appliances

  • House has a double oven

  • House has recessed lighting

  • House has decorative chandelier lighting

  • House has hardwood flooring in main living areas

  • House has tile flooring in main living areas

  • House has wall-to-wall carpeting in bedrooms

  • House has plantation shutters

  • House has vaulted or tray ceilings

  • House has crown molding

  • House has exposed wood ceiling beams

  • House has a fireplace (in any room)

  • House has a fireplace in the primary bedroom

  • House has a walk-in shower with glass enclosure

  • House has a soaking tub in the primary bathroom

  • House has a jetted tub (spa tub)

  • House has a freestanding bathtub

  • House has a double vanity (two sinks) in a bathroom

  • House has glass block windows in a bathroom

  • House has a skylight

  • House has a built-in desk or built-in shelving/storage cabinetry
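To make the feature-pruning procedure behind the horizontal baseline in Fig. 4(b) concrete, the following is a minimal sketch of greedy backward elimination with a least-squares price model, tracking test error along the pruning path. The names (prune_features, X_train, etc.) and the use of median relative error as the ranking criterion are illustrative assumptions, not our exact implementation.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares price model with a bias column appended."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def median_rel_error(w, X, y):
    """Median relative error of the linear predictions."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.median(np.abs(Xb @ w - y) / np.abs(y))

def prune_features(X_train, y_train, X_test, y_test):
    """Greedy backward elimination: repeatedly drop the feature whose removal
    hurts the error criterion the least, recording test error along the way."""
    active = list(range(X_train.shape[1]))
    path = []
    while len(active) > 1:
        w = fit_linear(X_train[:, active], y_train)
        path.append((len(active), median_rel_error(w, X_test[:, active], y_test)))
        scores = []
        for j in active:
            rest = [k for k in active if k != j]
            w_j = fit_linear(X_train[:, rest], y_train)
            scores.append((median_rel_error(w_j, X_test[:, rest], y_test), j))
        active.remove(min(scores)[1])
    return path  # test error as a function of the number of retained features
```

The horizontal baseline then corresponds to the minimum error encountered anywhere along this pruning path.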

A.2 Discovered features for Netflix dataset

The following 27 features were discovered by our approach with the linear model f(𝐳)f({\bf z}), reaching a test CS of 0.59 (Fig. 6(a)):

  • Not science fiction (no movies centered on aliens, space, or futuristic tech)

  • Not released between 1995 and 2000 (inclusive)

  • Not primarily comedies

  • Not primarily romance-focused films (i.e., avoids romantic dramas and romantic comedies)

  • Mostly PG-13/PG-rated mainstream releases (i.e., generally avoids hard-R films)

  • Not animated (live-action films)

  • Not primarily action-adventure, heist, or spy thrillers (i.e., avoids high-octane studio action crowd-pleasers)

  • Not horror (i.e., avoids horror and horror-leaning supernatural thrillers)

  • Includes many Academy Award Best Picture nominees/winners (i.e., more prestige/canon titles)

  • Includes a notable number of pre-1995 titles (1960s–early ’90s), whereas the negatives are almost entirely 1995–2004 releases

  • Primarily entries in major film franchises or multi‑film series (including first installments that launched franchises)

  • Includes multiple war films set amid real historical conflicts (e.g., WWII/Vietnam), which are absent in the negative group

  • Not crude/raunchy, man‑child/SNL‑style comedies (when comedic, entries skew quirky/indie or gentle rather than broad/frat humor)

  • Skews toward female-led or women-centered narratives (e.g., Legally Blonde, The First Wives Club, Freaky Friday, The Pelican Brief, Lara Croft), whereas the negative group is predominantly male-led

  • Excludes 2004–2005 releases (the positives stop at 2003, while many negatives are from 2004–2005)

  • Includes multiple films centered on high school/college life or set primarily in educational settings (e.g., The Breakfast Club, Sixteen Candles, Fast Times at Ridgemont High, Billy Madison, Good Will Hunting, Top Gun), a theme largely absent from the negatives

  • Not children/family-oriented fare (avoids Disney-style live‑action family movies and musicals; skews toward teen/adult titles instead)

  • Not quirky/indie/cult, auteur‑driven or formally experimental films; the positives skew toward conventional, mainstream studio storytelling

  • Not ancient/medieval ‘sword‑and‑sandal’ or knightly epics (i.e., avoids Greco‑Roman/medieval battlefield sagas like Braveheart or Troy)

  • Includes multiple films scored by John Williams (e.g., Return of the Jedi, Raiders of the Lost Ark, E.T., Harry Potter and the Sorcerer’s Stone), whereas none of the negatives are Williams-scored

  • Includes multiple films that earned their lead performer an Academy Award (Best Actor/Best Actress) for that specific role (e.g., Monster, Training Day, Monster’s Ball, Dead Man Walking), whereas the negatives have few or none

  • Standard theatrical releases only (no titles labeled as Special/Extended/Anniversary/Platinum Edition)

  • Includes multiple films starring Angelina Jolie or Brad Pitt (none of the negatives star either)

  • Includes multiple films that won or were nominated for the Academy Award for Best Original Screenplay

  • No pre-1970 releases (all positives are 1972 or later, while the negatives include several 1950s–1960s films)

  • Includes multiple films starring Tom Hanks (e.g., Forrest Gump, Sleepless in Seattle), whereas none of the negatives feature him

On the other hand, a 0-shot approach produces very different features. The following 50 features were suggested by GPT-5 based on the task description alone and then pruned to achieve the baseline test CS of 0.48 (a small sketch of the cosine-similarity evaluation follows the list):

  • Is animated

  • Is part of a franchise/sequel/prequel/remake

  • Is based on a true story or real events

  • Is adapted from a non-comic book or short story

  • Is adapted from a comic book/graphic novel

  • Features a superhero or superpowered protagonist

  • Is set primarily in a historical period (pre-1950)

  • Is set primarily in the future

  • Is set primarily in a contemporary/modern time

  • Is set primarily outside the United States

  • Has predominantly non-English dialogue

  • Has an ensemble cast with multiple co-leads

  • Has a female lead or co-lead driving the story

  • Centers on a male buddy duo

  • Has a strong, central romantic plot

  • Has a dominant comedic tone

  • Is action-driven with frequent set pieces

  • Contains graphic/intense violence

  • Contains significant gunplay

  • Contains extensive swordplay or martial arts

  • Features large-scale war or battle scenes

  • Contains high-speed car chases/vehicular action

  • Centers on a heist/caper

  • Centers on spy/espionage activities

  • Centers on a crime investigation/detective story

  • Centers on courtroom/legal drama

  • Centers on supernatural or paranormal elements

  • Centers on science fiction elements (tech, AI, space)

  • Centers on fantasy worldbuilding (magic, mythical beings)

  • Has prominent horror elements and scares

  • Hinges on a mystery with a major twist/reveal

  • Uses a nonlinear or fractured timeline

  • Uses an unreliable narrator or ambiguous reality

  • Has slow-burn, mood-driven pacing

  • Has fast-paced, high-energy pacing

  • Has a feel-good/optimistic tone with an uplifting resolution

  • Has a tragic or bittersweet ending

  • Has a dark or cynical tone

  • Explores strong themes of redemption or forgiveness

  • Is driven by a revenge motivation

  • Is a coming-of-age or personal growth story

  • Features a prominent mentor–protégé relationship

  • Centers on friendship/buddy dynamics

  • Centers on family relationships (parents/children/siblings)

  • Is a musical (characters sing to advance the story)

  • Is sports-centered

  • Has a military setting or procedural focus

  • Relies heavily on CGI/visual effects spectacle

  • Won at least one Academy Award

  • Has a runtime over 140 minutes
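For reference, the CS numbers quoted above and in Fig. 6(a) are mean cosine similarities between predicted and held-out collaborative-filtering movie embeddings; a minimal sketch of this evaluation, under that assumed definition (array names are illustrative):

```python
import numpy as np

def mean_cosine_similarity(Y_pred, Y_true, eps=1e-12):
    """Average cosine similarity between rows of predicted and true embeddings."""
    num = np.sum(Y_pred * Y_true, axis=1)
    den = np.linalg.norm(Y_pred, axis=1) * np.linalg.norm(Y_true, axis=1) + eps
    return float(np.mean(num / den))
```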

Appendix B Tractably summing out z in joint updates for the linear observation model (factor analysis)

In our experiments, we used a generic version of the GenZ learning algorithm with an arbitrary (plug-and-play) functional link 𝐳𝐲{\bf z}\rightarrow{\bf y}. Instead of computing exact expectations of the form Eq(𝐳)g(𝐳)E_{q({\bf z})}g({\bf z}), we simply evaluated gg at the mode of qq. While this allowed us to run the same algorithm for the linear and nonlinear (neural net) models, we note that in some cases these expectations can be computed tractably under the full distribution qq. The linear model is such a case, and we derive it here for illustration.

A linear link between 𝐳{\bf z} and 𝐲{\bf y} takes a form

p(𝐲|𝐳,θy={𝚲,σ2})=𝒩(𝐲|𝚲𝐳,σ2I),p({\bf y}|{\bf z},\theta_{y}=\{{\bf\Lambda},\sigma^{2}\})={\cal N}({\bf y}|{\bf\Lambda}{\bf z},\sigma^{2}I), (17)

where 𝐳{\bf z} is now an (nf+1)×1(n_{f}+1)\times 1 feature vector with an additional element that is always 11, 𝐲{\bf y} is an nd×1n_{d}\times 1 observation vector, and 𝚲{\bf\Lambda} is an nd×(nf+1)n_{d}\times(n_{f}+1) parameter matrix. (Note that the overall model is still nonlinear because the relationships among the features 𝐳{\bf z} in p(𝐳|s)p({\bf z}|s) are nonlinear.) The additional constant element znf+1=1z_{n_{f}+1}=1 in 𝐳{\bf z} allows the bias vector to be learned within 𝚲{\bf\Lambda}.
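For concreteness, a minimal NumPy sketch of the observation model (17) with the constant feature appended (function and variable names are illustrative):

```python
import numpy as np

def augment(z):
    """Append the constant feature z_{n_f+1} = 1 so the last column of Lambda acts as a bias."""
    return np.concatenate([np.asarray(z, dtype=float), [1.0]])

def log_p_y_given_z(y, z, Lam, sigma2):
    """log N(y | Lambda z, sigma^2 I) of Eq. (17) for a binary feature vector z."""
    za = augment(z)
    resid = y - Lam @ za
    n_d = y.shape[0]
    return -0.5 / sigma2 * (resid @ resid) - 0.5 * n_d * np.log(2.0 * np.pi * sigma2)
```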

T1=Eq[logp(𝐲|𝐳)]\displaystyle T_{1}=E_{q}[\log p({\bf y}|{\bf z})] =\displaystyle= Eq[12σ2(y𝚲𝐳)T(y𝚲𝐳)]nd2log2πσ2\displaystyle E_{q}\big[-\frac{1}{2\sigma^{2}}(y-{\bf\Lambda}{\bf z})^{T}(y-{\bf\Lambda}{\bf z})\big]-\frac{n_{d}}{2}\log 2\pi\sigma^{2} (18)
=\displaystyle= 12σ2Eq[𝐲T𝐲2𝐲T𝚲𝐳+𝐳T𝚲T𝚲𝐳]nd2log2πσ2\displaystyle-\frac{1}{2\sigma^{2}}E_{q}\big[{\bf y}^{T}{\bf y}-2{\bf y}^{T}{\bf\Lambda}{\bf z}+{\bf z}^{T}{\bf\Lambda}^{T}{\bf\Lambda}{\bf z}\big]-\frac{n_{d}}{2}\log 2\pi\sigma^{2}
=\displaystyle= 12σ2(𝐲T𝐲2𝐲T𝚲Eq[𝐳]+Eq[tr(𝚲𝐳𝐳T𝚲T)])nd2log2πσ2\displaystyle-\frac{1}{2\sigma^{2}}\big({\bf y}^{T}{\bf y}-2{\bf y}^{T}{\bf\Lambda}E_{q}[{\bf z}]+E_{q}[tr({\bf\Lambda}{\bf z}{\bf z}^{T}{\bf\Lambda}^{T})]\big)-\frac{n_{d}}{2}\log 2\pi\sigma^{2}
=\displaystyle= 12σ2(𝐲T𝐲2𝐲T𝚲Eq[𝐳]+tr(𝚲Eq[𝐳𝐳T]𝚲T))nd2log2πσ2\displaystyle-\frac{1}{2\sigma^{2}}\big({\bf y}^{T}{\bf y}-2{\bf y}^{T}{\bf\Lambda}E_{q}[{\bf z}]+tr({\bf\Lambda}E_{q}[{\bf z}{\bf z}^{T}]{\bf\Lambda}^{T})\big)-\frac{n_{d}}{2}\log 2\pi\sigma^{2}

We use the approximate posterior limited to the form q(𝐳)=iq(zi)q({\bf z})=\prod_{i}q(z_{i}). Defining q(zi=1)=qiq(z_{i}=1)=q_{i}, q(zi=0)=1qiq(z_{i}=0)=1-q_{i}, and 𝐪=[q1,q2,qnf,1]T{\bf q}=[q_{1},q_{2},...q_{n_{f}},1]^{T} (as znf+1:=1z_{n_{f}+1}:=1), we have

Eq[𝐳]\displaystyle E_{q}[{\bf z}] =\displaystyle= 𝐪,\displaystyle{\bf q}, (19)
Eq[𝐳𝐳T]\displaystyle E_{q}[{\bf z}{\bf z}^{T}] =\displaystyle= 𝐪𝐪T+diag(𝐪𝐪𝐪),\displaystyle{\bf q}{\bf q}^{T}+\operatorname{diag}({\bf q}-{\bf q}\odot{\bf q}), (20)

where \odot denotes point-wise multiplication. (To see this, note that for nondiagonal elements zizjz_{i}z_{j} in 𝐳𝐳T{\bf z}{\bf z}^{T} we have E[zizj]=qiqjE[z_{i}z_{j}]=q_{i}q_{j} but the expectation for the diagonal elements is E[zi2]=qiE[z_{i}^{2}]=q_{i}, rather than qi2q_{i}^{2}). Thus,

tr(𝚲Eq[𝐳𝐳T]𝚲T)\displaystyle tr({\bf\Lambda}E_{q}[{\bf z}{\bf z}^{T}]{\bf\Lambda}^{T}) =\displaystyle= tr(𝚲𝐪𝐪T𝚲T)+tr(𝚲diag(𝐪𝐪𝐪)𝚲T)\displaystyle\operatorname{tr}({\bf\Lambda}{\bf q}{\bf q}^{T}{\bf\Lambda}^{T})+\operatorname{tr}({\bf\Lambda}\operatorname{diag}({\bf q}-{\bf q}\odot{\bf q}){\bf\Lambda}^{T}) (21)
=\displaystyle= 𝐪T𝚲T𝚲𝐪+tr(𝚲diag(𝐪𝐪𝐪)𝚲T),\displaystyle{\bf q}^{T}{\bf\Lambda}^{T}{\bf\Lambda}{\bf q}+\operatorname{tr}({\bf\Lambda}\operatorname{diag}({\bf q}-{\bf q}\odot{\bf q}){\bf\Lambda}^{T}),
T1=12σ2(𝐲𝚲𝐪)T(𝐲𝚲𝐪)12σ2tr(𝚲diag(𝐪𝐪𝐪)𝚲T)nd2log2πσ2.T_{1}=-\frac{1}{2\sigma^{2}}({\bf y}-{\bf\Lambda}{\bf q})^{T}({\bf y}-{\bf\Lambda}{\bf q})-\frac{1}{2\sigma^{2}}\operatorname{tr}({\bf\Lambda}\operatorname{diag}({\bf q}-{\bf q}\odot{\bf q}){\bf\Lambda}^{T})-\frac{n_{d}}{2}\log 2\pi\sigma^{2}. (22)

This is the only term that involves the parameter θy=𝚲\theta_{y}={\bf\Lambda}, and in the M step, for a given 𝐪{\bf q} we will solve for 𝚲{\bf\Lambda} by maximizing it. The derivation is also useful for the E step, in which we estimate 𝐪{\bf q}.
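A small sketch of the expectations (19)–(20) and the observation term T1T_{1} of (22) under the factorized posterior; the vector q is assumed to already include the trailing constant entry equal to 1 (names are illustrative):

```python
import numpy as np

def expectations(q):
    """E[z] and E[z z^T] under the factorized Bernoulli posterior (Eqs. 19-20)."""
    Ez = q
    Ezz = np.outer(q, q) + np.diag(q - q * q)   # diagonal uses E[z_i^2] = q_i
    return Ez, Ezz

def observation_term(y, q, Lam, sigma2):
    """The bound term T1 of Eq. (22)."""
    resid = y - Lam @ q
    D = np.diag(q - q * q)
    n_d = y.shape[0]
    return (-0.5 / sigma2 * (resid @ resid)
            - 0.5 / sigma2 * np.trace(Lam @ D @ Lam.T)
            - 0.5 * n_d * np.log(2 * np.pi * sigma2))
```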

With our choice of factorized posterior, the feature term T2T_{2} simplifies to

T2=Eq[logp(𝐳|s)]=iziq(zi)logp(zi|s)=i(qii1+(1qi)i0),\displaystyle T_{2}=E_{q}[\log p({\bf z}|s)]=\sum_{i}\sum_{z_{i}}q(z_{i})\log p(z_{i}|s)=\sum_{i}\big(q_{i}\ell^{1}_{i}+(1-q_{i})\ell^{0}_{i}\big), (23)

with i1\ell^{1}_{i} and i0\ell^{0}_{i} defined in (5). Finally, the entropy part of the bound becomes

T3=Eq[logq(𝐳)]=i(qilogqi+(1qi)log(1qi)).T_{3}=-E_{q}[\log q({\bf z})]=-\sum_{i}\big(q_{i}\log q_{i}+(1-q_{i})\log(1-q_{i})\big). (24)

To perform inference, i.e. to approximate the posterior p(𝐳|𝐲,s)p({\bf z}|{\bf y},s), we optimize the bound B=T1+T2+T3B=T_{1}+T_{2}+T_{3} with respect to {qi}\{q_{i}\}, the posterior probabilities q(zi=1)q(z_{i}=1). This is the part of the (variational) E-step needed to compute the expectations (20) used in the observation term T1T_{1} in (22). By maximizing T1T_{1}, and therefore the log-likelihood bound (1), with respect to 𝚲{\bf\Lambda} and σ2\sigma^{2}, we then perform the part of the M step related to the parameters of the observation conditional (17). In the M-step we can also maximize the bound with respect to the uncertainty parameters θe={pie}\theta_{e}=\{p^{e}_{i}\} of the feature model (5), and we use the mining prompt in Fig. 2 to re-estimate the semantic model parameters, i.e. the feature descriptions θf\theta_{f}, as discussed in Sect. 2. While the optimization can be done with off-the-shelf optimization packages, we derive the EM updates here.

E step: For a given (s,𝐲)(s,{\bf y}) pair we optimize the qiq_{i} one at a time, keeping the rest fixed. A simple way to approximate qiq_{i} for a tight bound is to introduce vectors 𝐪+i{\bf q}^{+i} and 𝐪i{\bf q}^{-i}, which are both equal to 𝐪{\bf q} in all entries but the ii-th. In the ii-th entry, 𝐪+i{\bf q}^{+i} equals 1 and 𝐪i{\bf q}^{-i} equals 0. In other words, the two are parameter vectors for posterior distributions in which the ii-th feature ziz_{i} is forced to be either 1 or 0 with certainty. Given the above derivations, we can thus compute the approximate log-likelihood

logp(𝐲,zi=1|s)\displaystyle\log p({\bf y},z_{i}=1|s) =\displaystyle= log{zj}jip(𝐲|{zj}ji,zi=1)p(zi=1,{zj}ji|s)\displaystyle\log\sum_{\{z_{j}\}_{j\neq i}}p({\bf y}|\{z_{j}\}_{j\neq i},z_{i}=1)\,p(z_{i}=1,\{z_{j}\}_{j\neq i}|s) (25)
\displaystyle\approx E𝐪+i[logp(𝐲|𝐳)]+E𝐪+i[logp(𝐳|s)]E𝐪+i[logq(𝐳)],\displaystyle E_{{\bf q}^{+i}}[\log p({\bf y}|{\bf z})]+E_{{\bf q}^{+i}}[\log p({\bf z}|s)]-E_{{\bf q}^{+i}}[\log q({\bf z})],

and similarly for p(𝐲,zi=0|s)p({\bf y},z_{i}=0|s), but using expectations wrt 𝐪i{\bf q}^{-i}. In each case the three terms are computed as above in (22), (23), and (24), but with the appropriate distribution parameter vector 𝐪+i{\bf q}^{+i} or 𝐪i{\bf q}^{-i} replacing 𝐪{\bf q}. The posterior parameter qi=p(𝐲,zi=1|s)p(𝐲,zi=1|s)+p(𝐲,zi=0|s)q_{i}=\frac{p({\bf y},z_{i}=1|s)}{p({\bf y},z_{i}=1|s)+p({\bf y},z_{i}=0|s)} is thus approximated as

qi=eE𝐪+i[logp(𝐲|𝐳)]+E𝐪+i[logp(𝐳|s)]E𝐪+i[logq(𝐳)]eE𝐪+i[logp(𝐲|𝐳)]+E𝐪+i[logp(𝐳|s)]E𝐪+i[logq(𝐳)]+eE𝐪i[logp(𝐲|𝐳)]+E𝐪i[logp(𝐳|s)]E𝐪i[logq(𝐳)]q_{i}=\frac{e^{E_{{\bf q}^{+i}}[\log p({\bf y}|{\bf z})]+E_{{\bf q}^{+i}}[\log p({\bf z}|s)]-E_{{\bf q}^{+i}}[\log q({\bf z})]}}{e^{E_{{\bf q}^{+i}}[\log p({\bf y}|{\bf z})]+E_{{\bf q}^{+i}}[\log p({\bf z}|s)]-E_{{\bf q}^{+i}}[\log q({\bf z})]}+e^{E_{{\bf q}^{-i}}[\log p({\bf y}|{\bf z})]+E_{{\bf q}^{-i}}[\log p({\bf z}|s)]-E_{{\bf q}^{-i}}[\log q({\bf z})]}} (26)

The entropy terms for 𝐪+i{\bf q}^{+i} and 𝐪i{\bf q}^{-i} are the same: applying equation (7) to 𝐪+i{\bf q}^{+i} and 𝐪i{\bf q}^{-i}, we see that all summands agree since 𝐪j+i=𝐪ji{\bf q}^{+i}_{j}={\bf q}^{-i}_{j} for jij\neq i, and the ii-th summand is zero in both cases as 𝐪i+i=1{\bf q}^{+i}_{i}=1 and 𝐪ii=0{\bf q}^{-i}_{i}=0. Therefore the entropy terms cancel. The feature terms E𝐪+i[logp(𝐳|s)]E_{{\bf q}^{+i}}[\log p({\bf z}|s)] and E𝐪i[logp(𝐳|s)]E_{{\bf q}^{-i}}[\log p({\bf z}|s)] are not the same: their summands still agree for jij\neq i, but the ii-th terms differ, being i1\ell^{1}_{i} and i0\ell^{0}_{i}, respectively. Therefore, we can cancel the shared parts, and the expression simplifies to

qi=eE𝐪+i[logp(𝐲|𝐳)]+i1eE𝐪+i[logp(𝐲|𝐳)]+i1+eE𝐪i[logp(𝐲|𝐳)]+i0=11+eE𝐪i[logp(𝐲|𝐳)]E𝐪+i[logp(𝐲|𝐳)]+i0i1q_{i}=\frac{e^{E_{{\bf q}^{+i}}[\log p({\bf y}|{\bf z})]+\ell^{1}_{i}}}{e^{E_{{\bf q}^{+i}}[\log p({\bf y}|{\bf z})]+\ell^{1}_{i}}+e^{E_{{\bf q}^{-i}}[\log p({\bf y}|{\bf z})]+\ell^{0}_{i}}}=\frac{1}{1+e^{E_{{\bf q}^{-i}}[\log p({\bf y}|{\bf z})]-E_{{\bf q}^{+i}}[\log p({\bf y}|{\bf z})]+\ell^{0}_{i}-\ell^{1}_{i}}} (27)

To obtain all parameters qiq_{i}, these equations need to be iterated, but the iterations can be interspersed with the parameter updates of the M step. In this update rule we see that the foundational model’s prediction of ziz_{i} is balanced against the observation 𝐲{\bf y}, which may be better explained if a different value of ziz_{i} is chosen. As discussed in Sect. 2, this is the signal needed to update the feature descriptor θfi{\theta_{f}}_{i}. For example, if 𝐲\bf y is a movie embedding vector, then the presence or absence of the feature "historical war movie" helps explain, through Λ\Lambda, the variation in 𝐲{\bf y}, which in turn reflects user preferences. If the users do not in fact have a preference specifically for historical war movies, but instead for movies based on real historical events, war-related or not, then this feature will only partially explain the variation, and the posterior might indicate the positive feature, zi=1z_{i}=1, with high probability for a movie that is about historical events but that the foundational model correctly classified as not a "historical war movie."
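A minimal sketch of these E-step coordinate updates (27); l1 and l0 hold the per-feature log-probabilities i1\ell^{1}_{i} and i0\ell^{0}_{i} from (5), the trailing constant feature is never updated, and the fixed number of sweeps as well as all names are illustrative:

```python
import numpy as np

def expected_loglik(y, q, Lam, sigma2):
    """E_q[log p(y|z)] for a factorized q (the same quantity as T1 in Eq. 22)."""
    resid = y - Lam @ q
    D = q - q * q
    n_d = y.shape[0]
    return (-0.5 / sigma2 * (resid @ resid)
            - 0.5 / sigma2 * np.sum((Lam ** 2) * D)
            - 0.5 * n_d * np.log(2 * np.pi * sigma2))

def e_step(y, q, Lam, sigma2, l1, l0, n_sweeps=5):
    """Coordinate updates of q_i per Eq. (27); q[-1] is the constant feature."""
    q = q.copy()
    for _ in range(n_sweeps):
        for i in range(len(q) - 1):
            q_plus, q_minus = q.copy(), q.copy()
            q_plus[i], q_minus[i] = 1.0, 0.0
            delta = (expected_loglik(y, q_minus, Lam, sigma2)
                     - expected_loglik(y, q_plus, Lam, sigma2)
                     + l0[i] - l1[i])
            q[i] = 1.0 / (1.0 + np.exp(delta))   # Eq. (27)
    return q
```

In practice these sweeps can be interspersed with the M-step parameter updates below, as noted above.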

M step: While the E-step is performed independently for each item ss, in the M step we consider a collection of item-observation pairs (st,𝐲t)(s^{t},{\bf y}^{t}), together with the posterior distributions 𝐪t{\bf q}^{t} from the E-step, and optimize the sum of the bounds BtB^{t}. From (22), we see that

(tBt)𝚲=1σ2t((𝐲t𝚲𝐪t)(𝐪t)T𝚲diag(𝐪t𝐪t𝐪t))\frac{\partial\big(\sum_{t}B^{t}\big)}{\partial{\bf\Lambda}}=\frac{1}{\sigma^{2}}\sum_{t}\bigg(({\bf y}^{t}-{\bf\Lambda}{\bf q}^{t})({\bf q}^{t})^{T}-{\bf\Lambda}\operatorname{diag}({\bf q}^{t}-{\bf q}^{t}\odot{\bf q}^{t})\bigg) (28)

Setting the derivative to zero we derive the update

𝚲=(t𝐲t(𝐪t)T)(t(𝐪t(𝐪t)T+diag(𝐪t𝐪t𝐪t)))1{\bf\Lambda}=\bigg(\sum_{t}{\bf y}^{t}({\bf q}^{t})^{T}\bigg)\bigg(\sum_{t}\big({\bf q}^{t}({\bf q}^{t})^{T}+\operatorname{diag}({\bf q}^{t}-{\bf q}^{t}\odot{\bf q}^{t})\big)\bigg)^{-1} (29)

The noise parameter σ2\sigma^{2} is optimized when

σ2=1ndTt=1T((𝐲t𝚲𝐪t)T(𝐲t𝚲𝐪t)+tr(𝚲diag(𝐪t𝐪t𝐪t)𝚲T)).\sigma^{2}=\frac{1}{n_{d}T}\sum_{t=1}^{T}\Big(({\bf y}^{t}-{\bf\Lambda}{\bf q}^{t})^{T}({\bf y}^{t}-{\bf\Lambda}{\bf q}^{t})+\operatorname{tr}({\bf\Lambda}\operatorname{diag}({\bf q}^{t}-{\bf q}^{t}\odot{\bf q}^{t}){\bf\Lambda}^{T})\Big). (30)
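A sketch of the closed-form M-step updates (29)–(30), with observations and posteriors stacked row-wise (array names are illustrative):

```python
import numpy as np

def m_step(Y, Q):
    """Closed-form updates (29)-(30). Y: (T, n_d) observations; Q: (T, n_f+1)
    posterior means, with the last column all ones for the constant feature."""
    T, n_d = Y.shape
    A = Y.T @ Q                                   # sum_t y^t (q^t)^T
    V = Q * (1.0 - Q)                             # per-item Bernoulli variances
    B = Q.T @ Q + np.diag(V.sum(axis=0))          # sum_t (q q^T + diag(q - q*q))
    Lam = np.linalg.solve(B, A.T).T               # Eq. (29): A B^{-1}
    resid = Y - Q @ Lam.T
    var_term = np.sum((Lam ** 2) * V.sum(axis=0)) # sum_t tr(Lam diag(...) Lam^T)
    sigma2 = (np.sum(resid ** 2) + var_term) / (n_d * T)  # Eq. (30)
    return Lam, sigma2
```

Note that the constant feature's column of Q is all ones, so its Bernoulli variance is zero and it contributes nothing to the diagonal correction.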

We can (and in our experiments do) also optimize the feature uncertainty parameters piep^{e}_{i}. These parameters could in principle be extracted from the foundational model using token probabilities or its own claim of confidence in the response (see the prompt in Fig. 2). However, the foundational model’s own uncertainty only captures its confidence in judging the characteristic. In our hybrid model, we can have piep^{e}_{i} model the additional uncertainty about whether the feature is well-defined. This is especially useful during learning, as features evolve. As we discussed above using the "historical war movie" feature as an example, in the E step the posterior may diverge from the foundational model’s feature classification if the feature only approximately aligns with the true variation in the observations. To estimate, and then reuse, the full uncertainty, we maximize the bound with respect to piep^{e}_{i} to arrive at:

pie=1Tt=1Tqit[hit=0]+(1qit)[hit=1],p^{e}_{i}=\frac{1}{T}\sum_{t=1}^{T}q_{i}^{t}[h_{i}^{t}=0]+(1-q_{i}^{t})[h_{i}^{t}=1], (31)

where hith_{i}^{t} are the binary responses of the foundational model for feature ii on the semantic item sts^{t}. (To derive this, see (5); the update is also intuitive, as it simply averages the evidence of error indicated by the posterior.)
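The update (31) in array form, where Q holds the posteriors qitq_{i}^{t} and H the binary LLM judgments hith_{i}^{t} (names are illustrative):

```python
import numpy as np

def update_feature_error_rates(Q, H):
    """Eq. (31): per-feature average posterior evidence that the LLM's judgment was wrong.
    Q, H: arrays of shape (T, n_f)."""
    return np.mean(Q * (H == 0) + (1.0 - Q) * (H == 1), axis=0)
```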

Finally, to update the feature descriptors θfi{\theta_{f}}_{i}, we sample or threshold the posteriors qitq^{t}_{i} to get some positive (zit=1z_{i}^{t}=1) and negative (zit=0z_{i}^{t}=0) examples. We then use exemplars from each group to form the feature mining prompt in Fig. 2. The foundational model’s response in the "characteristic" field is the updated feature description θfi{\theta_{f}}_{i}.

Appendix C Statistical modeling of co-occurrence/association/preferences between two sets of items

We studied a model that links a semantic description of an item ss with an observation vector 𝐲{\bf y}. Among other applications, the approach allows us to find features that determine user preferences for movies when 𝐲{\bf y} are movie embeddings inferred from the movie-user observation matrix. Here we show that such an approach is a simplification of a model that involves two sets of interacting items (e.g. movies/viewers, products/buyers, plants/geographic areas, images/captions, congress members/bills). That model (without the simplification) can also be learned from data.

Suppose that the matrix of the pairwise interactions xt1,t2x_{t_{1},t_{2}} is observed for item pairs from the two groups (s1t1,s2t2)(s_{1}^{t_{1}},s_{2}^{t_{2}}). We would then model the joint distribution p(x|𝐳1,𝐳2)p(𝐳1|s1)p(𝐳2|s2)p(x|{\bf z}_{1},{\bf z}_{2})p({\bf z}_{1}|s_{1})p({\bf z}_{2}|s_{2}) where, for example, s1s_{1} would be a textual description (or an image!) of a plant and s2s_{2} a textual description (or a satellite observation!) of a geographic area, and the scalar xx would be the presence (or thriving) signal for the combination. We could develop a number of different complex models with suitable latent variables (in addition to 𝐳{\bf z}), but honoring the traditionally very successful collaborative filtering methods for dealing with such data, we will focus on embedding methods (which typically rely on matrix decomposition methods, at least in recommender systems).

Thus, for each set of items our latent variables 𝐳{\bf z} should determine a real-valued embedding 𝐲{\bf y} probabilistically, through a distribution p(𝐲|𝐳)p({\bf y}|{\bf z}).

Then, given the low-dimensional embeddings 𝐲1,𝐲2{\bf y}_{1},{\bf y}_{2} for the items in the pair (e.g. a movie and a user), we would model the association variable as x𝐲1T𝐲2x\approx{\bf y}_{1}^{T}{\bf y}_{2}:

logp(x|𝐲1,𝐲2)=log𝒩(x;𝐲1T𝐲2,σx2)=12σx2(x𝐲1T𝐲2)212log(2πσx2)\log p(x|{\bf y}_{1},{\bf y}_{2})=\log{\cal N}(x;{\bf y}_{1}^{T}{\bf y}_{2},\sigma^{2}_{x})=-\frac{1}{2\sigma_{x}^{2}}(x-{\bf y}_{1}^{T}{\bf y}_{2})^{2}-\frac{1}{2}\log(2\pi\sigma^{2}_{x}) (32)

In this model over the observed association strength xx, the item embeddings 𝐲1{\bf y}_{1} and 𝐲2{\bf y}_{2} are each based on their own set’s feature vectors 𝐳1{\bf z}_{1}, 𝐳2{\bf z}_{2}, which are themselves probabilistically dependent on the items’ semantics s1,s2s_{1},s_{2}. The likelihood can be approximated as

logp(x|s1,s2)\displaystyle\log p(x|s_{1},s_{2}) =\displaystyle= log𝐳1,𝐳2,𝐲1,𝐲2p(x|𝐲1,𝐲2)p(𝐲1|𝐳1)p(𝐲2|𝐳2)p(𝐳1|s1)p(𝐳2|s2)\displaystyle\log\sum_{{\bf z}_{1},{\bf z}_{2},{\bf y}_{1},{\bf y}_{2}}p(x|{\bf y}_{1},{\bf y}_{2})p({\bf y}_{1}|{\bf z}_{1})p({\bf y}_{2}|{\bf z}_{2})p({\bf z}_{1}|s_{1})p({\bf z}_{2}|s_{2}) (33)
\displaystyle\geq Eq(𝐳1,𝐳2,𝐲1,𝐲2)[logp(x|𝐲1,𝐲2)p(𝐲1|𝐳1)p(𝐲2|𝐳2)p(𝐳1|s1)p(𝐳2|s2)q(𝐳1,𝐳2,𝐲1,𝐲2)]\displaystyle E_{q({\bf z}_{1},{\bf z}_{2},{\bf y}_{1},{\bf y}_{2})}\bigg[\log\frac{p(x|{\bf y}_{1},{\bf y}_{2})p({\bf y}_{1}|{\bf z}_{1})p({\bf y}_{2}|{\bf z}_{2})p({\bf z}_{1}|s_{1})p({\bf z}_{2}|s_{2})}{q({\bf z}_{1},{\bf z}_{2},{\bf y}_{1},{\bf y}_{2})}\bigg]

where q(𝐳1,𝐳2,𝐲1,𝐲2)q({\bf z}_{1},{\bf z}_{2},{\bf y}_{1},{\bf y}_{2}) is an approximate posterior. If we assume a factored posterior q=q(𝐳1)q(𝐳2)q(𝐲1)q(𝐲2)q=q({\bf z}_{1})q({\bf z}_{2})q({\bf y}_{1})q({\bf y}_{2}), and that the posteriors over embeddings are Dirac deltas, q(𝐲1)=δ(𝐲1𝐲^1)q({\bf y}_{1})=\delta({\bf y}_{1}-\hat{{\bf y}}_{1}), q(𝐲2)=δ(𝐲2𝐲^2)q({\bf y}_{2})=\delta({\bf y}_{2}-\hat{{\bf y}}_{2}), then based on (32):

logp(x|s1,s2)12σx2(x𝐲^1T𝐲^2)212log(2πσx2)+Eq[logp(𝐲^1,𝐳1|s1)q(𝐳1)]+Eq[logp(𝐲^2,𝐳2|s2)q(𝐳2)],\log p(x|s_{1},s_{2})\approx-\frac{1}{2\sigma_{x}^{2}}(x-\hat{{\bf y}}_{1}^{T}\hat{{\bf y}}_{2})^{2}-\frac{1}{2}\log(2\pi\sigma^{2}_{x})+E_{q}\big[\log\frac{p(\hat{{\bf y}}_{1},{\bf z}_{1}|s_{1})}{q({\bf z}_{1})}\big]+E_{q}\big[\log\frac{p(\hat{{\bf y}}_{2},{\bf z}_{2}|s_{2})}{q({\bf z}_{2})}\big],

where the expected embeddings are 𝐲^1=Eq(𝐳1)[𝐲1]\hat{{\bf y}}_{1}=E_{q({\bf z}_{1})}[{\bf y}_{1}], 𝐲^2=Eq(𝐳2)[𝐲2]\hat{{\bf y}}_{2}=E_{q({\bf z}_{2})}[{\bf y}_{2}]. If we observe the full association matrix 𝐗{\bf X}, e.g. over n118000n_{1}\approx 18000 movies, indexed by t1t_{1}, and n2500000n_{2}\approx 500000 users, indexed by t2t_{2}, then the approximate log-likelihood of the data, broken into two terms, is:

L\displaystyle L =\displaystyle= 12σx2t1,t2(xt1,t2(𝐲^1t1)T𝐲^2t2)2n1n22log(2πσx2)+\displaystyle-\frac{1}{2\sigma_{x}^{2}}\sum_{t_{1},t_{2}}(x_{t_{1},t_{2}}-(\hat{{\bf y}}_{1}^{t_{1}})^{T}\hat{{\bf y}}_{2}^{t_{2}})^{2}-\frac{n_{1}n_{2}}{2}\log(2\pi\sigma^{2}_{x})+ (34)
+\displaystyle+ t1Eq[logp(𝐲1^t1,𝐳1t1|s1t1)q(𝐳1t1)]+t2Eq[logp(𝐲2^t2,𝐳2t2|s2t2)q(𝐳2t2)].\displaystyle\sum_{t_{1}}E_{q}\big[\log\frac{p(\hat{{\bf y}_{1}}^{t_{1}},{\bf z}_{1}^{t_{1}}|s_{1}^{t_{1}})}{q({\bf z}_{1}^{t_{1}})}\big]+\sum_{t_{2}}E_{q}\big[\log\frac{p(\hat{{\bf y}_{2}}^{t_{2}},{\bf z}_{2}^{t_{2}}|s_{2}^{t_{2}})}{q({\bf z}_{2}^{t_{2}})}\big].

This can be seen as a regularized matrix decomposition problem: collecting xt1,t2x_{t_{1},t_{2}} into the observation matrix 𝐗{\bf X} and all items’ embedding vectors into matrices 𝐘^1\hat{{\bf Y}}_{1} and 𝐘^2\hat{{\bf Y}}_{2}, the first term (first line) above can be written as

T1=12σx2𝐗𝐘^1T𝐘^2F2n1n22log(2πσx2),T_{1}=-\frac{1}{2\sigma_{x}^{2}}||{\bf X}-\hat{{\bf Y}}_{1}^{T}\hat{{\bf Y}}_{2}||_{F}^{2}-\frac{n_{1}n_{2}}{2}\log(2\pi\sigma^{2}_{x}), (35)

while the second term (second line) serves as regularization. This regularizer in fact consists of two models, one for each type of item, of the form studied in the previous sections. When the observation matrix is very large and has low intrinsic dimension, as is typically the case in collaborative filtering applications, the first term can be fit very well, resulting in a low σx2\sigma_{x}^{2} and making the regularization term adapt to the natural data embedding. Therefore, since our goal for the two p(𝐲,𝐳|s)p({\bf y},{\bf z}|s) components is not to interfere with the natural embeddings of the data, but rather to explain the statistical patterns so that they transfer to new semantic items with new descriptions and previously unseen interaction patterns at test time, we can fit the first term independently of the other two, after which the joint modeling problem separates into two independent ones, e.g. extracting the plant features that explain plant embeddings and extracting the geographic-area features that explain area embeddings.

For a given embedding dimension nen_{e}, the term T1T_{1} is maximized when 𝐘^1\hat{{\bf Y}}_{1} and 𝐘^2\hat{{\bf Y}}_{2} are related to the rank-nen_{e} truncated SVD of 𝐗{\bf X}, 𝐗𝐔𝐒𝐕T{\bf X}\approx{\bf U}{\bf S}{\bf V}^{T}, where the n1×nen_{1}\times n_{e} matrix UU and the n2×nen_{2}\times n_{e} matrix VV have orthonormal columns and SS is a diagonal matrix. For example, the term is maximized when

𝐘1^=𝐑𝐒α𝐔T,𝐘2^=𝐑𝐒1α𝐕T,\hat{{\bf Y}_{1}}={\bf R}{\bf S}^{\alpha}{\bf U}^{T},\quad\hat{{\bf Y}_{2}}={\bf R}{\bf S}^{1-\alpha}{\bf V}^{T}, (36)

for an arbitrary rotation matrix 𝐑{\bf R} and a scalar α[0,1]\alpha\in[0,1].
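A minimal sketch of computing such embeddings with a dense truncated SVD and 𝐑{\bf R} taken to be the identity; for a Netflix-scale matrix a sparse or randomized solver would be used instead, and the names are illustrative:

```python
import numpy as np

def embeddings_from_svd(X, n_e, alpha=0.5):
    """Rank-n_e factorization X ~ Y1^T Y2 as in Eq. (36), with R = I.
    Columns of Y1 (n_e x n_1) and Y2 (n_e x n_2) are the item embeddings."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U, s, Vt = U[:, :n_e], s[:n_e], Vt[:n_e, :]
    Y1 = np.diag(s ** alpha) @ U.T
    Y2 = np.diag(s ** (1.0 - alpha)) @ Vt
    return Y1, Y2
```

Any choice of α and 𝐑{\bf R} leaves the product 𝐘^1T𝐘^2\hat{{\bf Y}}_{1}^{T}\hat{{\bf Y}}_{2} unchanged, so they can be chosen for the convenience of the downstream feature models.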

Therefore, if we solve the matrix decomposition problem in T1T_{1}, we can then fit the features for the two sets of items separately, since the second term decomposes as:

T2=t1Eq[logp(𝐲1^t1,𝐳1t1|s1t1)q(𝐳1t1)]+t2Eq[logp(𝐲2^t2,𝐳2t2|s2t2)q(𝐳2t2)].T_{2}=\sum_{t_{1}}E_{q}\big[\log\frac{p(\hat{{\bf y}_{1}}^{t_{1}},{\bf z}_{1}^{t_{1}}|s_{1}^{t_{1}})}{q({\bf z}_{1}^{t_{1}})}\big]+\sum_{t_{2}}E_{q}\big[\log\frac{p(\hat{{\bf y}_{2}}^{t_{2}},{\bf z}_{2}^{t_{2}}|s_{2}^{t_{2}})}{q({\bf z}_{2}^{t_{2}})}\big]. (37)

In other words, by computing the embeddings using a standard matrix decomposition technique, we can build models that generate these item embeddings 𝐲{\bf y} using the semantics of the items ss.

In the case of movie preference datasets, like the Netflix Prize dataset, and data from many other recommender systems, the second set of items – users – is usually not described semantically (through information such as demographics or their own writing). Privacy is often cited as the reason, but it is also true that people’s complex overall behavior, as it relates to the task, is better represented simply by their preferences: knowing which movies someone has watched is a better indicator of whether they will like a particular new movie than a description of their life story, even if one were available. In other words, in our Netflix Prize experiments we only fit the movie embeddings.