
GenZ: Foundational models as latent variable generators within traditional statistical models

Marko Jojic
Arizona State University, Tempe, AZ, USA

Nebojsa Jojic
Microsoft Research, Redmond, WA, USA
Abstract

We present GenZ, a hybrid model that bridges foundational models and statistical modeling through interpretable semantic features. While large language models possess broad domain knowledge, they often fail to capture dataset-specific patterns critical for prediction tasks. Our approach addresses this by discovering semantic feature descriptions through an iterative process that contrasts groups of items identified via statistical modeling errors, rather than relying solely on the foundational model’s domain understanding. We formulate this as a generalized EM algorithm that jointly optimizes semantic feature descriptors and statistical model parameters. The method prompts a frozen foundational model to classify items based on discovered features, treating these judgments as noisy observations of latent binary features that predict real-valued targets through learned statistical relationships. We demonstrate the approach on two domains: house price prediction (hedonic regression) and cold-start collaborative filtering for movie recommendations. On house prices, our model achieves 12% median relative error using discovered semantic features from multimodal listing data, substantially outperforming a GPT-5 baseline (38% error) that relies on the LLM’s general domain knowledge. For Netflix movie embeddings, our model predicts collaborative filtering representations with 0.59 cosine similarity purely from semantic descriptions—matching the performance that would require approximately 4000 user ratings through traditional collaborative filtering. The discovered features reveal dataset-specific patterns (e.g., architectural details predicting local housing markets, franchise membership predicting user preferences) that diverge from the model’s domain knowledge alone.

1 Introduction

Given an item’s semantics, such as a text description, image, or audio recording, large foundational models can answer specific questions about the content, and thus convert the item’s semantic representation into a canonical discrete representation. Such a representation is useful for traditional statistical models—such as graphical generative models, belief networks, or probabilistic circuits—which can capture statistics from external, possibly proprietary, datasets related to these items.

Movies are an example of such items. We can use an LLM to featurize them as binary vectors based on answers to questions as illustrated in Fig. 1. These binary vectors can help explain the patterns found in the movie-user ratings matrix. Collaborative filtering techniques reduce the observation matrix to embeddings, which capture usage statistics absent from the foundational models’ training data. The foundational models, however, can reason about the semantic content—such as the movie synopsis, critical response, credits, and other metadata—either from their training or by retrieving information from a database as part of a RAG system.

We describe how appropriate prompts can be learned from such item statistics by jointly modeling the semantic features and the statistical patterns in the data.

We study a general hybrid statistical-foundational model defined as follows. Let $s$ be the semantic representation of an item, e.g., text with a movie title or its description. Given a collection of semantic feature descriptors $\theta_f$, such as 'historical war film', a foundational model, which we refer to as the function $h$, makes a judgment whether the feature applies to the item (Figure 1). Let ${\bf z}$ be the true binary feature vector, which may differ from the model's judgments due to errors in either the judgments themselves or wrongly worded feature descriptors. We model this uncertainty by defining a conditional distribution $p({\bf z}|s,\theta_f,\theta_e)$, where ${\bf z}=[z_1,z_2,\ldots,z_{n_f}]^T$ is a discrete feature vector (binary, in the example in the figure and in our experiments), $\theta_f$ is the collection of semantic feature descriptors, and $\theta_e=[p^e_1,p^e_2,\ldots,p^e_{n_f}]$ is a set of estimated probabilities of feature classification errors. A feature vector is then linked to an observation ${\bf y}$ through a classical statistical model $p({\bf y}|{\bf z},\theta_y)$, possibly involving latent variables or some other way of modeling structure, with its own parameters $\theta_y$. By jointly optimizing the parameters of the model $p({\bf y}|s)=\sum_{\bf z}p({\bf y}|{\bf z})p({\bf z}|s)$ on a large number of pairs $({\bf y},s)$, we can discover the feature descriptions, or the right questions to ask, in order to explain the statistics in the dataset. More generally, hybrid models could have more than one category of items with a (learnable) statistical model of their relationships, for example geographic regions and native plants.

Figure 1: Feature classification $h(s,\theta_f)$ with extraction prompt strings shown in Python syntax. Items $s$ in the list are featurized using feature descriptions in $\theta_f$. The first prompt is used within a double loop $(t,i)$ over items and features. The second is looped only over $i$ (and batches of items $\{s\}$), as each call retrieves the indices of items for which $z_i=1$ (with $z_i=0$ for all other items).

The idea of using interpretable intermediate representations to bridge inputs and predictions has a long history in machine learning. Concept bottleneck models (CBM) koh2020concept; kim2023probabilistic formalized this approach by training neural networks to predict pre-defined human-interpretable concepts from inputs, then using those concepts to make final predictions. Recent work has leveraged large language models (LLMs) to overcome the need for manually labeled concepts, e.g., benara2024craftinginterpretableembeddingsasking; feng2025bayesianconceptbottleneckmodels; chattopadhyay2023variationalinformationpursuitinterpretable; zhong2025explainingdatasetswordsstatistical; ludan2024interpretablebydesigntextunderstandingiteratively. These methods may or may not use the CBM formulation but share with it the goal of feature interpretability. In addition, they have a goal of feature discovery, which is also one of the goals here. They address the difficulty of modeling and jointly adjusting the semantic and statistical parts of the model—in our notation $p({\bf z}|s)$ and $p({\bf y}|{\bf z})$—in different ways.

In most cases, the target ${\bf y}$ is a discrete variable (the goal is classification through intermediate concepts), and thus the classification model $p({\bf y}|{\bf z})$ is a simple logistic regressor or an exhaustive conditional table chattopadhyay2023variationalinformationpursuitinterpretable; feng2025bayesianconceptbottleneckmodels; zhong2025explainingdatasetswordsstatistical. The optimization of the featurizer $p({\bf z}|s)$ primarily relies on the LLM's domain understanding (such as a task description), but some methods also provide examples of misclassified items, e.g., zhong2025explainingdatasetswordsstatistical; ludan2024interpretablebydesigntextunderstandingiteratively.

When real-valued multi-dimensional targets ${\bf y}$ are modeled, showing items with large prediction errors is much less informative because each item may have a different error vector, and thus item-driven discovery is usually abandoned. For example, in benara2024craftinginterpretableembeddingsasking multidimensional targets (fMRI recordings) are modeled, and the method relies on direct LLM prompting about potentially predictive features based on domain knowledge alone, followed by feature selection, rather than showing specific items to contrast.

A fundamental limitation shared by these approaches is their reliance on the LLM's domain understanding to propose discriminative features. This works well when the LLM has been trained on relevant data and the task aligns with common patterns in its training distribution. However, as our experiments demonstrate, domain understanding alone is insufficient when dataset-specific patterns diverge from general knowledge—such as house pricing dynamics in particular markets, or user preferences captured in collaborative filtering from specific time periods or demographics. Moreover, for high-dimensional real-valued targets ${\bf y}$, individual prediction errors have multidimensional structure that cannot be easily communicated to an LLM by simply showing "incorrect" examples.

Our approach differs in several key aspects. First, as in some of the recent work, we avoid concept supervision by using a frozen foundational model as an oracle within $p({\bf z}|s)$, but with trainable uncertainty parameters $\theta_e$. Second, our feature mining is driven by contrasting groups of items identified through the statistical model's posterior $q({\bf z})$ rather than by querying the LLM's domain knowledge or showing individual misclassified examples (Figure 2). This group-based contrast allows the method to discover dataset-specific patterns that may not align with the LLM's training distribution—or even cases where the LLM has no understanding of what the target ${\bf y}$ represents. (We do, however, require the (M)LLM to reason about the semantic items $s$, either based on knowledge encoded in its weights or by retrieving information from a database as in RAG systems.) Third, we support arbitrary mappings $p({\bf y}|{\bf z})$, including nonlinear functions of binary features for high-dimensional real-valued observations, shifting the burden of modeling complex feature interactions from the LLM to the statistical model. The LLM is only responsible for explaining the semantic coherence of binary splits discovered from the model's prediction errors, making the approach applicable to domains where the relationship between semantics and observations is not well understood in advance.

We refer to our hybrid model as GenZ (or GenAIz?) as it relies on a pre-trained generative AI model to generate candidate latents ${\bf z}$ in a prediction model $s\rightarrow{\bf z}\rightarrow{\bf y}$. We derive a generalized EM algorithm for training it, which naturally requires investigation of items $s$ when tuning the parameters $\theta_f$ of the featurizer $p({\bf z}|s)$.

Figure 2: Discovering feature descriptions for the prompts in Fig. 1. When the statistical model identifies a binary split in the data, we can prompt a foundational model to generate an explanation by providing example items from each group. This explanation can then be used as a feature descriptor in $\theta_f$. The prompt need not include all members of each group, only representative examples.

2 Semantic model

Before developing the full algorithm, we first describe the semantic part: the semantic parameters $\theta_f$, how $p({\bf z}|s)$ depends on them and on a foundational model, and how statistical signals from the model enable updating these parameters in an iterative EM-like algorithm.

2.1 Extracting semantic features: Modeling $p({\bf z}|s)$

The foundational model's reasoning abilities are sufficient to featurize a semantic item using standard classification prompts. A generic version of this is illustrated in Fig. 1. In the example shown, the items are represented with text, in this case movie titles that large language models are likely familiar with. Each string in the list is an item description $s$—however, the text describing each item could be expanded to include more details about the movie, or a RAG system with access to a movie database could be used in place of an LLM. We use the output $h(s,\theta_f)$ of such systems cautiously: we associate an uncertainty level, i.e., a probability $p^e_j$ of mislabeling ($h\neq z_j$), with each entry $z_j$ of ${\bf z}$. Assuming that ${\bf z}$ is not deterministically linked to $s$ makes it a latent variable in the $p({\bf y},{\bf z}|s)$ model, allowing flexibility in inference and enabling updates to the semantic feature descriptions in $\theta_f$: if the posterior given ${\bf y}$ can override the foundational model's determination of whether an item satisfies the feature, this provides a signal for updating the feature semantics.
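Since the exact prompt strings of Fig. 1 are not reproduced in the text, the following is a minimal sketch of what a per-item extraction call might look like; the prompt wording and the `llm` wrapper (any callable mapping a prompt string to a text reply) are our own assumptions, not the paper's implementation.

```python
# A minimal sketch of the per-item extraction call h(s, theta_f_i) described in Fig. 1.
# `llm` is assumed to be any callable that maps a prompt string to a text reply
# (e.g., a thin wrapper around a chat-completion API); the prompt wording below is
# illustrative, not the paper's exact string.

def h(item: str, feature_descriptor: str, llm) -> int:
    """Ask the frozen foundational model whether a feature applies to one item."""
    prompt = (
        f"Item: {item}\n"
        f"Feature: {feature_descriptor}\n"
        "Does the feature apply to the item? Answer with a single word: yes or no."
    )
    reply = llm(prompt).strip().lower()
    return 1 if reply.startswith("yes") else 0

def featurize(items, feature_descriptors, llm):
    """Binary matrix H[t][i] = h(items[t], theta_f[i]); the double (t, i) loop of Fig. 1."""
    return [[h(s, f, llm) for f in feature_descriptors] for s in items]
```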

2.2 Mining semantic features

As already alluded to, the feature semantics $\theta_f$ are not hand-crafted but learned from the dataset statistics. The statistical signal from the data indicates how the items should be separated along a particular feature $j$, i.e., how the $z_j$ for different items can be altered so that the model explains the data better. In our setup we deal with binary features, so the signal separates the items into two groups, and we seek a semantic explanation for this separation, which becomes the descriptor ${\theta_f}_j$.

Specifically, for fixed model parameters and focusing on a particular binary feature $z_i\in\{0,1\}$ for one semantic item $s$, the prior and posterior distributions are generally different:

$$p(z_i|s)\;\neq\;p(z_i|{\bf y},s)=\sum_{\{z_j\}_{j\neq i}}p\big(z_i,\{z_j\}_{j\neq i}\,\big|\,{\bf y},s\big)$$

Thus, sampling the posterior distribution (referred to below as $q$) for several different items $s^t$ yields new feature assignments $z_i^t$. These help update the feature descriptor ${\theta_f}_i$ used in $p(z_i|s,\theta_f)$ by using the prompt in Fig. 2. There, the positive group has a few items $s^t$ for which $z_i^t=1$ and the negative group has a few items for which $z_i^t=0$. Furthermore, we can select group exemplars to contain both semantic items where the posterior and the prior agree, and where they disagree. This encourages small refinements, e.g., from "fearless female characters in an action movie" to "strong female leads".
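A minimal sketch of this mining step is given below, again with illustrative prompt wording rather than the exact strings of Fig. 2; `llm` is the same hypothetical text-in/text-out wrapper as above.

```python
# A sketch of the mining prompt of Fig. 2 (wording is illustrative).
# Given representative items from the positive (z_i = 1) and negative (z_i = 0) groups,
# the foundational model is asked for a short yes/no feature description that separates
# them; the returned string becomes the new descriptor theta_f[i].

def mine_feature(positive_items, negative_items, llm) -> str:
    pos = "\n".join(f"- {s}" for s in positive_items)
    neg = "\n".join(f"- {s}" for s in negative_items)
    prompt = (
        "The following items were split into two groups by a statistical model.\n"
        f"Group A (feature present):\n{pos}\n\n"
        f"Group B (feature absent):\n{neg}\n\n"
        "Propose a short yes/no characteristic that is true of Group A items "
        "and false of Group B items. Reply with the characteristic only."
    )
    return llm(prompt).strip()
```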

3 Statistical model

Given an item's semantics $s$, e.g., a movie description, and its corresponding observed vector ${\bf y}$, e.g., the movie's embedding derived from the statistics in the movie-user matrix, the log likelihood can be bounded as

$$\log p({\bf y}|s)\;\geq\;B=\sum_{\bf z}q({\bf z})\log\frac{p({\bf y}|{\bf z})\,p({\bf z}|s)}{q({\bf z})}=E_q[\log p({\bf y}|{\bf z})]+E_q[\log p({\bf z}|s)]-E_q[\log q({\bf z})],\qquad(1)$$

where $E_q$ denotes an expectation under the distribution $q$. Depending on the variational approximation of $q$ and the model $p({\bf y}|{\bf z})$, these expectations can be computed exactly, as illustrated by the example of a mean-field $q$ and a linear $p({\bf y}|{\bf z})$ in Section B. Otherwise, a sampling method can be used, often with very few samples due to the iterative nature of the learning algorithm. In the experiments (Section 5), we simply evaluate at the mode of $q$.

The bound is tight when the variational distribution $q$ is equal to the true posterior $p({\bf z}|{\bf y},s)$; otherwise $q({\bf z})$ serves as an approximate posterior. We refer to the three additive terms as the observation part $T_1=E_q[\log p({\bf y}|{\bf z})]$, the feature part $T_2=E_q[\log p({\bf z}|s)]$, and the posterior entropy $T_3=-E_q[\log q({\bf z})]$.

Given a dataset of pairs $\{s^t,{\bf y}^t\}_{t=1}^T$, the optimization criterion—the bound on the log likelihood of the entire dataset—is

$$L(\{q^t\},\theta_f,\theta_e,\theta_y)=\sum_t B_t=\sum_t\Big(E_{q^t}[\log p({\bf y}^t|{\bf z})]+E_{q^t}[\log p({\bf z}|s^t)]-E_{q^t}[\log q^t({\bf z})]\Big)\qquad(2)$$

In general, maximizing the criterion $L$ w.r.t. all variables except the semantic model parameters (feature descriptions) $\theta_f$ is straightforward: we can use existing packages or derive an EM algorithm specific to the choice of the model's statistical distributions. The variational distributions $q^t({\bf z})$ depend only on the individual bounds for each data pair $({\bf y}^t,s^t)$ in (1) and can be fitted independently for given model parameters $\theta_f,\theta_e,\theta_y$.

The (approximate) posteriors $q^t(z_i)$ provide the statistical signal discussed in Section 2 for tuning the semantic feature descriptions ${\theta_f}_i$: sampling them separates the samples $s^t$ into two groups, and exemplars from each are used to prompt the foundational model, with the prompt in Fig. 2, for a new definition of the $i$-th feature. The optimization of $L$ then consists of steps that improve it w.r.t. the statistical model parameters and the (approximate) posteriors $q$, alternating with updates of the feature descriptions using those posteriors.

This general recipe can be used in a variety of ways, with different choices for the structure of the conditionals in the model $p$ and different approximations of the posterior $q$ leading to a variety of algorithms, including variational learning, belief propagation and sampling frey-jojic-pami-tutorial. We now develop the specific modeling choices and the learning algorithm we use in our experiments.

Binary feature model

As discussed, the feature model uses prompt-based feature classification, where each feature is treated independently based on its description ${\theta_f}_i$ (Fig. 1):

$$p({\bf z}|s,\theta_f,\theta_e)=\prod_i p(z_i|s,{\theta_f}_i,p^e_i),\qquad(3)$$

with the uncertainty (probability of error) $p^e_i$, so that

$$p(z_i|s)=(1-p^e_i)^{[h(s,{\theta_f}_i)=z_i]}\,(p^e_i)^{[h(s,{\theta_f}_i)\neq z_i]}\qquad(4)$$

where $h$ is the binary function that uses one of the prompts in Fig. 1 to detect the feature ${\theta_f}_i$ in the semantic item $s$. The expression above uses the indicator function $[\cdot]$ to express that, with probability of error $p^e_i$, the feature indicator $z_i$ is different from $h(s,{\theta_f}_i)$. Denoting $h_i=h(s,{\theta_f}_i)$, we write

$$\log p(z_i|s)=[z_i=1]\,\ell^1_i+[z_i=0]\,\ell^0_i,\qquad(5)$$
$$\ell^1_i=[h_i=1]\log(1-p^e_i)+[h_i=0]\log p^e_i,$$
$$\ell^0_i=[h_i=1]\log p^e_i+[h_i=0]\log(1-p^e_i)$$

Using a factorized posterior $q=\prod_i q(z_i)$, the feature term $T_2$ simplifies to

$$T_2=E_q[\log p({\bf z}|s)]=\sum_i\sum_{z_i}q(z_i)\log p(z_i|s)=\sum_i\big(q_i\ell^1_i+(1-q_i)\ell^0_i\big),\qquad(6)$$

with $\ell^1_i$ and $\ell^0_i$ defined in (5). Finally, the entropy term becomes

$$T_3=-E_q[\log q({\bf z})]=-\sum_i\big(q_i\log q_i+(1-q_i)\log(1-q_i)\big),\qquad q_i=q(z_i=1)\qquad(7)$$
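The feature and entropy terms are simple to evaluate for the factorized posterior. Below is a direct transcription of Eqs. (5)-(7) into NumPy, written as a sketch with a small `eps` added for numerical stability (our choice, not specified in the paper).

```python
import numpy as np

# Feature (T2) and entropy (T3) terms of the bound for one item, per Eqs. (5)-(7).
# h is the vector of LLM judgments h_i, p_err the error probabilities p^e_i,
# and q the factorized posterior q_i = q(z_i = 1).

def feature_and_entropy_terms(h, p_err, q, eps=1e-12):
    h, p_err, q = map(np.asarray, (h, p_err, q))
    l1 = np.where(h == 1, np.log(1 - p_err + eps), np.log(p_err + eps))  # ell^1_i
    l0 = np.where(h == 1, np.log(p_err + eps), np.log(1 - p_err + eps))  # ell^0_i
    T2 = np.sum(q * l1 + (1 - q) * l0)
    T3 = -np.sum(q * np.log(q + eps) + (1 - q) * np.log(1 - q + eps))
    return T2, T3, l1, l0
```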

General observation model

We can soften an arbitrary mapping function ${\bf y}=f({\bf z};\theta_m)$ by defining a distribution:

$$p({\bf y}|{\bf z},\theta_y=\{\theta_m,\sigma^2_y\})={\cal N}\big({\bf y}\,\big|\,f({\bf z},\theta_m),\,\Sigma=\mathrm{diag}(\sigma^2_y)\big),\qquad(8)$$
$$T_1=E_q[\log p({\bf y}|{\bf z})]=E_q\Big[-\tfrac{1}{2}\big({\bf y}-f({\bf z})\big)^T\Sigma^{-1}\big({\bf y}-f({\bf z})\big)\Big]-\tfrac{1}{2}\log\big|2\pi\Sigma\big|,\qquad(9)$$

with $\theta_m$ being the mapping parameters of the function $f$, e.g., regression weights in ${\bf y}={\bf w}^T{\bf z}$, and $\sigma^2_y$ being a vector of variances, one for each dimension of ${\bf y}$. Different specialized inference algorithms can be derived for specific forms of the function $f$. For example, see the appendix for a procedure derived for a linear model (factor analysis). However, in our experiments we use a generic inference technique that uses $f$ in a plug-and-play fashion, described below.
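For reference, a minimal sketch of the observation term $T_1$ evaluated at a single assignment ${\bf z}$ (the mode of $q$, as used in the experiments), for an arbitrary plug-and-play mapping `f`:

```python
import numpy as np

# Observation term T1 evaluated at one assignment z (e.g., the mode of q), per Eq. (9),
# with diagonal covariance sigma2_y.

def observation_term(y, z, f, sigma2_y):
    r = np.asarray(y) - np.asarray(f(z))            # error vector y - f(z)
    sigma2_y = np.asarray(sigma2_y)
    return -0.5 * np.sum(r * r / sigma2_y) - 0.5 * np.sum(np.log(2 * np.pi * sigma2_y))
```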

Inference of $q(z_i)$ in the fully factorized (mean-field) $q$ model

Keeping all but one $q(z_i)$ fixed, we derive the update for the $i$-th feature's posterior by setting to zero the derivative of the bound $B$ w.r.t. the parameters $q(z_i=1)$ and $q(z_i=0)$, subject to $q(z_i=1)+q(z_i=0)=1$. We can express the result as:

$$q(z_i=1)\;\propto\;e^{E_{q_{+i}}[\log p({\bf y}|{\bf z})]+E_{q_{+i}}[\log p({\bf z}|s)]-E_{q_{+i}}[\log q({\bf z})]},$$
$$q(z_i=0)\;\propto\;e^{E_{q_{-i}}[\log p({\bf y}|{\bf z})]+E_{q_{-i}}[\log p({\bf z}|s)]-E_{q_{-i}}[\log q({\bf z})]}\qquad(10)$$

where $q_{+i}=[z_i=1]\prod_{j\neq i}q(z_j)$ and $q_{-i}=[z_i=0]\prod_{j\neq i}q(z_j)$, i.e., the "+i" and "-i" posteriors have their $z_i$ value clamped to 1 and 0, respectively, in the factorized $q=\prod_i q(z_i)$. Upon normalization, as the entropy terms $E_{q_{+i}}[\log q({\bf z})]$ and $E_{q_{-i}}[\log q({\bf z})]$ cancel each other, we obtain:

$$q_i=q(z_i=1)=1-q(z_i=0)=\frac{1}{1+e^{\,E_{q_{-i}}[\log p({\bf y}|{\bf z})]-E_{q_{+i}}[\log p({\bf y}|{\bf z})]+\ell^0_i-\ell^1_i}}\qquad(11)$$

Interpretation: whether $z_i$ should take value 1 or 0 for a particular semantic item $s$ depends on:

  • the assignment $h_i$ that the foundational model produces and the level of trust we have in that assignment (the probability of error $p^e_i$); the two are captured in the prior log likelihoods $\ell^1_i$ and $\ell^0_i$;

  • the error vector ${\bf y}-f({\bf z})$, scaled by the currently estimated variances $\sigma^2_y$ for the different elements of the target ${\bf y}$, evaluated for the two possible assignments $z_i=1$ and $z_i=0$, with the other assignments $z_j$ following the current distributions $q(z_j)$ over the other features.

While the expectations $E_{q_{+i}}$ and $E_{q_{-i}}$ could potentially be estimated better through sampling, in our experiments we simply compute $\log p({\bf y}|{\bf z})$ at the mode (assuming the most likely values for the other $z_j$ variables).
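A sketch of the resulting update, with the expectations in Eq. (11) approximated at the mode of $q$ as described above; the helper reuses the $\ell^1_i,\ell^0_i$ arrays computed earlier and reflects our reading of the procedure rather than the authors' code.

```python
import numpy as np

# Mean-field update of q_i per Eq. (11), with expectations evaluated at the mode of q:
# the other features z_j are set to their most likely values.

def update_qi(i, q, y, f, sigma2_y, l1, l0):
    z = (np.asarray(q) > 0.5).astype(float)   # mode of the current factorized posterior
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[i], z_minus[i] = 1.0, 0.0

    def loglik(z_vec):                        # log p(y | z) up to the shared normalizer
        r = np.asarray(y) - np.asarray(f(z_vec))
        return -0.5 * np.sum(r * r / np.asarray(sigma2_y))

    logit = loglik(z_minus) - loglik(z_plus) + l0[i] - l1[i]
    return 1.0 / (1.0 + np.exp(logit))
```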

Optimizing numerical parameters of $p({\bf y}|{\bf z})$ and $p({\bf z}|s)$

Optimizing the bound $B$ w.r.t. the parameters $\theta_m$ of the mapping in ${\bf y}\approx f({\bf z};\theta_m)$ and the remaining uncertainty/variance $\sigma^2_y$ reduces to maximizing the sum of the $T_1$ terms over the training set, i.e., $\sum_t E_{q^t}[\log p({\bf y}^t|{\bf z})]$. Again, while sampling of $q$ might be helpful, in our experiments we simply use the mode and iterate:

$$\theta_m=\arg\min_{\theta_m}\sum_t\big({\bf y}^t-f({\bf z}^t;\theta_m)\big)^T\Sigma^{-1}\big({\bf y}^t-f({\bf z}^t;\theta_m)\big)\qquad(12)$$
$$\sigma^2_y=\frac{1}{T}\sum_t\big({\bf y}^t-f({\bf z}^t;\theta_m)\big)\circ\big({\bf y}^t-f({\bf z}^t;\theta_m)\big)\qquad(13)$$

where ${\bf z}^t$ is the binary vector of assignments of the features $z_i$, either a sample from $q^t$ or the mode of the posterior $q^t$ for the $t$-th of the $T$ entries in the dataset of pairs $(s^t,{\bf y}^t)$, and $\circ$ denotes point-wise multiplication.
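For the linear case ${\bf y}\approx {\bf z}^T{\bf W}$ with diagonal $\Sigma$, the weighted least-squares problem in Eq. (12) decouples across output dimensions, so the M-step reduces to ordinary least squares followed by the variance update of Eq. (13). A sketch (the variance floor is our own numerical safeguard):

```python
import numpy as np

# M-step sketch for a linear mapping, per Eqs. (12)-(13). Z is the (T, n_f) matrix of
# binary assignments z^t (at the mode of q^t), Y is the (T, d) matrix of targets.

def m_step_linear(Z, Y):
    W, *_ = np.linalg.lstsq(Z, Y, rcond=None)        # theta_m: weights of y ~ Z @ W
    residual = Y - Z @ W
    sigma2_y = np.mean(residual * residual, axis=0)  # Eq. (13), element-wise square
    return W, np.maximum(sigma2_y, 1e-8)             # floor variances for stability
```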

The semantic model has numeric parameters $p^e_i$, modeling the hybrid model's uncertainty in the semantic function $h$ that makes an API call to the foundational model. By optimizing the bound w.r.t. these parameters we obtain the update:

$$p^e_i=\frac{1}{T}\sum_{t=1}^T\Big(q_i^t\,[h_i^t=0]+(1-q_i^t)\,[h_i^t=1]\Big)\qquad(14)$$

Thus, for a given set of semantic feature descriptors $\theta_f$, iterating equation (11) for each item $t$, and (12) and (14) using all items in the training set, fits the numerical model parameters and creates the soft feature assignments $q^t$ that best balance those feature descriptors against the real-valued, possibly multi-dimensional targets ${\bf y}^t$.
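The error-probability update of Eq. (14) is just the expected disagreement rate between the LLM judgments and the posteriors:

```python
import numpy as np

# Update of the error probabilities p^e_i per Eq. (14), where H[t, i] = h_i^t are the
# LLM judgments and Q[t, i] = q_i^t are the current posteriors.

def update_error_probs(H, Q):
    H, Q = np.asarray(H), np.asarray(Q)
    return np.mean(Q * (H == 0) + (1 - Q) * (H == 1), axis=0)
```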

4 Algorithms

In Section 2 we pointed out that the individual $q(z_i)$ distributions carry information about how the data should be split into two groups to create the mining prompt in Fig. 2. But as each of these distributions depends on the others, there is the question of good initialization. In our experiments we use a method akin to the variational EM view of boosting neal1998boosting, albeit in our case applied to learning to predict real-valued multi-dimensional targets, rather than just for classification.

Note that if we iterate (11), (12), (14) while keeping $p^e_i=0.5$ for a selected feature $i$, the entire hybrid model and the posterior $q(z_i)$ are decoupled from the foundational model's prediction $h_i$, which depends on the feature description ${\theta_f}_i$ to provide its guess at $z_i$. Therefore, the iterative process is free to come up with the assignment of $z^t_i$ for each item $t$ in a way that maximizes the likelihood (and minimizes the prediction error). This provides a natural binary split of the data into two groups based on error vectors. In the case of predicting house prices (one-dimensional $y$), those two groups would be the lower- and higher-priced houses relative to the current prediction, but in the case of multi-dimensional observations, such as movie embeddings, the split would be into two clusters, each with its own multi-dimensional adjustment to the current prediction.

A simple algorithm for adding features is shown in Algorithm 1. The key idea is to fit a model with only the features $j\in[1..i]$ while keeping the new feature $i$ decoupled from the LLM prediction by fixing $p^e_i=0.5$. This allows the data to determine the optimal split before discovering its semantic explanation.

The algorithm operates on the model $p(\{{\bf y}^t,\{z_j^t\}_{j=1}^i,s^t\}_{t=1}^T)$, in which features $j>i$ do not participate. By keeping $p^e_i=0.5$ fixed during the inner loop, the posterior $q_i^t$ is determined purely by how well $z_i^t$ helps explain ${\bf y}^t$, independent of any LLM prediction. This provides the statistical signal needed to discover meaningful features through the mining prompt.
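Since the algorithm boxes are not reproduced here, the following condensed sketch shows one way the feature-addition step could be organized, reusing the `h` and `mine_feature` sketches above; `fit_statistical_model` is a placeholder for the inner loop of Eqs. (11), (12), (14), and all signatures are assumptions rather than the paper's implementation.

```python
import numpy as np

# A condensed sketch of Algorithm 1 (basic, non-targeted feature addition).
# H: (T, i) matrix of existing LLM judgments; theta_f: list of current descriptors;
# fit_statistical_model is a placeholder that iterates Eqs. (11), (12), (14) and
# returns the (T, i+1) matrix of posteriors Q.

def add_feature(items, Y, H, p_err, theta_f, llm, n_exemplars=8):
    T = len(items)
    i = len(theta_f)                                  # index of the new feature
    H = np.column_stack([H, np.zeros(T, dtype=int)])  # LLM column ignored while p^e_i = 0.5
    p_err = np.append(p_err, 0.5)                     # decouples z_i from the LLM judgment
    Q = fit_statistical_model(H, Y, p_err, freeze_error_of=i)
    z_i = Q[:, i] > 0.5                               # data-driven split for feature i
    pos = [items[t] for t in np.flatnonzero(z_i)[:n_exemplars]]
    neg = [items[t] for t in np.flatnonzero(~z_i)[:n_exemplars]]
    theta_f.append(mine_feature(pos, neg, llm))       # explain the split (Fig. 2)
    H[:, i] = [h(s, theta_f[-1], llm) for s in items] # re-extract with the new descriptor
    return theta_f, H, p_err
```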

As we show in experiments on the simple task of discovering the binary representation of numbers, iteratively running Algorithm 1 with increasing $i$ adds new features in a coarse-to-fine manner, from those that split the data based on large differences in the target ${\bf y}$ to those that explain increasingly smaller differences.

Conditional (targeted) feature addition

A possible problem with such a coarse-to-fine approach, especially for nonlinear and easy-to-overfit functions $f({\bf z})$, is that items with different combinations of features can be arbitrarily split by the feature $z_i$ we are updating. For example, suppose we are working on the house pricing dataset and are running Algorithm 1 for the second time, having added only one feature so far: ${\theta_f}_1=$ 'Located in Arizona desert communities'. Then, as we iteratively look for a natural split $q_2$, it is possible that for items with high $q_1$ the procedure splits the data along a different semantic axis (e.g., presence of swimming pools) than for the items with low $q_1$. To deal with this, we alter Algorithm 1 very slightly: when we sample positive and negative examples, we take them from items whose other features are identical. This is akin to using the structured variational approximation $q=q(z_i|\{z_j\}_{j\neq i})\prod_{j\neq i}q(z_j)$, but instead of fitting this for all combinations of other features, we target only one combination at a time. The discovered feature ${\theta_f}_i$ is still used to assign values $h_i$ for all data samples, whether or not they were in the targeted group, so that it can be used in the next step.

To select the combination of features $\{z_j\}_{j=1}^{i-1}$ on which we condition the $i$-th feature discovery, we start with the cost being optimized in (12), i.e., the observation part ($T_1$) of the approximate log likelihood based on a sample (or the mode) ${\bf z}^t$ for each item:

$$\sum_t\big({\bf y}^t-f({\bf z}^t)\big)^T\Sigma^{-1}\big({\bf y}^t-f({\bf z}^t)\big),\qquad(15)$$

or equivalently, by partitioning over all observed combinations ${\bf c}$ of the other features:

$$\sum_{\bf c}\;\sum_{t:\,{\bf z}_{-i}^t={\bf c}}\big({\bf y}^t-f({\bf z}^t)\big)^T\Sigma^{-1}\big({\bf y}^t-f({\bf z}^t)\big)=\sum_{\bf c}T_1^{\bf c},\qquad(16)$$

where the $T_1^{\bf c}$ are the summands giving the total error for each observed combination ${\bf c}$ of the other features ${\bf z}_{-i}=[z_1,\ldots,z_{i-1}]^T$. We select the combination ${\bf c}^*$ with the largest total error, ${\bf c}^*=\arg\max_{\bf c}T_1^{\bf c}$. Algorithm 2 shows the conditional (targeted) feature addition procedure. The key difference from Algorithm 1 is that positive and negative examples are sampled from a targeted subgroup $G^*$ (items sharing the same values for features $1$ through $i-1$), selected by identifying which combination has the largest total error. The discovered feature ${\theta_f}_i$ is then applied to all items.
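A small sketch of the selection of ${\bf c}^*$ and the targeted subgroup $G^*$, following Eq. (16); here `f` is the current model fitted on the already-discovered features:

```python
import numpy as np

# Select the conditioning combination c* of Eq. (16): group items by their current
# assignment of features 1..i-1 and pick the group with the largest total weighted error.

def select_target_group(Z_prev, Y, f, sigma2_y):
    errs = {}
    for t in range(len(Y)):
        c = tuple(Z_prev[t])                               # combination of the other features
        r = np.asarray(Y[t]) - np.asarray(f(Z_prev[t]))    # current model's error vector
        errs[c] = errs.get(c, 0.0) + np.sum(r * r / np.asarray(sigma2_y))  # T_1^c summand
    c_star = max(errs, key=errs.get)
    return [t for t in range(len(Y)) if tuple(Z_prev[t]) == c_star]        # targeted group G*
```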

These algorithms add a feature but can also be used to update a feature by removing it from the set and then adding a new one. While we can select a feature to update at random, we use a more efficient approach: we choose the feature whose removal affects the prediction (likelihood bound) the least, as shown in Algorithm 3. A single feature update becomes a removal of one feature (Algorithm 3) followed by adding a feature (Algorithm 1). Generally, we alternate an arbitrary number of feature adding steps with feature removal steps, as shown in Algorithm 4.
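A sketch of the pruning step of Algorithm 3 as we read it: tentatively drop each feature and keep the removal that hurts the re-evaluated bound of Eq. (2) the least. `evaluate_bound` is a placeholder for refitting the numerical parameters and computing $L$ on the training set.

```python
import numpy as np

# Prune the least useful feature (Algorithm 3 sketch); evaluate_bound is a placeholder.

def prune_least_useful(H, Y, p_err, theta_f):
    scores = []
    for j in range(len(theta_f)):
        keep = [k for k in range(len(theta_f)) if k != j]
        scores.append(evaluate_bound(H[:, keep], Y, p_err[keep]))  # bound without feature j
    j_star = int(np.argmax(scores))                                # its removal hurts least
    keep = [k for k in range(len(theta_f)) if k != j_star]
    return H[:, keep], p_err[keep], [theta_f[k] for k in keep]
```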

Algorithm 1: Basic feature discovery (non-targeted)
Algorithm 2: Conditional (targeted) feature addition
Algorithm 3: Prune feature (remove least useful feature)
Algorithm 4: Expand-contract (iterative feature refinement)

5 Experiments

Especially exciting applications of the hybrid models discussed above involve situations in which the foundational model is capable of recognizing patterns in the data, i.e., features that some items share and others do not, but the high-dimensional target ${\bf y}$ is not well understood by the model, e.g., in scientific discovery based on new experiments. However, there is value in analyzing situations where we (humans) understand the links, so that we can better understand what the model is discovering. This is especially interesting when the data involves well-understood semantic items but also novel (or dated!), or otherwise skewed, statistics linking these items to the targets ${\bf y}$ we want to discover, despite a potential domain shift w.r.t. what the foundational model was trained on.

We analyzed three datasets we expected to be illustrative:

  • Learning binary: A toy dataset with items $s$ being the strings "0", "1", ..., "511", and corresponding targets $y=\text{str2num}(s)$. We demonstrate that iterating Algorithm 1 recovers a 9-bit binary representation.

  • Hedonic regression on house prices: A multimodal dataset from ahmed2016house, consisting of four listing images and metadata per house, which we turned into semantic items $s$ describing all information about a house, with the target being the log price $y$ (Fig. 3).

  • Cold start in recommendation systems: The Netflix prize dataset. Singular value decomposition of the user-movie matrix is used to produce a 32-dimensional embedding ${\bf y}$ for each movie, and the associated semantic item is the movie title and the year of release.

Since LLMs can understand similarities between movies or houses and prices, there was a chance that the models could discover relevant features without our iterative algorithms. In fact, as discussed in Section 1, most prior work on using off-the-shelf LLMs to discover features useful in classification and, especially, regression focused on probing LLMs with the description of the task, rather than the data itself, and subsequent feature selection. Our baseline that represents such approaches, which we refer to as 0-shot, is to provide an LLM with an example list of semantic items, describe the task, and ask for around 50 yes/no characteristics that in the LLM’s judgment would be most useful for the task.

Except for the first experiment, whose purpose is to illustrate the coarse-to-fine nature of feature discovery with a linear model, we tested two $f({\bf z})$ functions (softened in $p({\bf y}|{\bf z})$ through the learned variances $\sigma^2_y$): a linear model and a neural network with two hidden layers (64 and 32 units, with ReLU activations and a dropout rate of 0.1), trained using the Adam optimizer with learning rate 0.001 for 100 epochs and batch size 32.
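A minimal sketch of the nonlinear $f({\bf z})$ described above; the use of PyTorch and the exact layer ordering are our assumptions, as the paper does not specify the framework.

```python
import torch
import torch.nn as nn

# Sketch of the nonlinear mapping f(z): two hidden layers of 64 and 32 ReLU units with
# dropout 0.1, mapping the binary features (plus any numeric inputs) to the target.

class FeatureMLP(nn.Module):
    def __init__(self, n_inputs, target_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, 64), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(32, target_dim),
        )

    def forward(self, z):
        return self.net(z)

# Training as described in the text: Adam with learning rate 0.001, 100 epochs,
# batch size 32, minimizing the variance-weighted squared error of Eq. (12), e.g.:
# model = FeatureMLP(n_inputs=Z.shape[1], target_dim=Y.shape[1])
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```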

For the two more involved datasets (houses and movies), we used the conditional feature addition of Algorithm 2, with 10-20 cycles of adding 5 features and removing 2. During the process, whenever a discovered feature is active for less than 2% or more than 98% of the data, it is discarded; as a result, the feature count generally grows until it stabilizes towards the end of the runs.

Mining prompts are executed with GPT-5, while extraction prompts are executed with GPT-4.

5.1 Learning, or rather, recalling binary

The primary purpose of this experiment was to illustrate what iterating Algorithm 1 tends to do. The semantic items $s\in\{\text{"0"},\text{"1"},\ldots,\text{"511"}\}$ have the corresponding targets $y=\text{str2num}(s)$. We map the characteristics ${\bf z}$ to predictions $y$ with a linear function $f({\bf z})={\bf w}^T{\bf z}+b$.

When we add the first feature by running Algorithm 1, $q_i$ splits the data into high and low values (above or below the mean). When the positive and negative groups are incorporated into the mining prompt (Figure 2) and given to GPT-5, it easily detects the pattern: depending on the random initialization of ${\bf w}$, the positive group's characteristic is one of two sorts, either an equivalent of "numbers 0-255 inclusive" or an equivalent of "numbers greater than 255".

The first feature splits the data into high and low values, and for all items the model will now predict the mean of one of these two subsets: 127.5 for numbers less than 256, and 383.5 for the rest. The next execution of Algorithm 1 will then assign $z_2=1$ and $z_2=0$ to items based on whether the model over- or underestimates $y$. (Due to randomness in initialization, $z_2=1$ could correspond to either all of the overshoots or all of the undershoots.) So the positive group will consist of either the upper halves or the lower halves of the $z_1$ groups. For example, the positive group might include all items in both the range [0..127] and the range [256..383]. Given these as the positive group, and the rest as negative, in the mining prompt, GPT-5 sets the next feature to be the equivalent of "Integers with the 2^7 bit unset." As before, due to random initialization, the positive and negative groups could be switched, but that would still yield an equivalent feature, as the model simply adjusts weights.

Here is an example of a set of features $\theta_f$ obtained by running Algorithm 1 nine times, with $i$ increasing from 1 to 9:

  • Integers greater than or equal to 256 (i.e., with the 2^8 bit set)

  • Integers with the 2^7 bit set (i.e., numbers whose value modulo 256 is between 128 and 255)

  • Integers with the 2^6 (64) bit unset (i.e., numbers whose value modulo 128 is 0–63)

  • Integers with the 2^5 (32) bit set (i.e., numbers whose value modulo 64 is between 32 and 63)

  • Integers with the 2^4 (16) bit unset (i.e., numbers whose value modulo 32 is 0–15)

  • Integers with the 2^3 (8) bit set (i.e., value modulo 16 is between 8 and 15)

  • Integers with the 2^2 (4) bit unset (i.e., numbers whose value modulo 8 is 0–3)

  • Integers with the 2^1 (2) bit set (i.e., numbers congruent to 2 or 3 modulo 4)

  • Integers with the 2^0 (1) bit unset (even numbers)

And here is another one:

  • Integers from 256 to 511 (inclusive)

  • Integers whose value modulo 256 is in the range 128–255 (i.e., their low byte has its highest bit set)

  • Integers whose value modulo 128 is in the range 0–63 (inclusive)

  • Integers whose value modulo 64 is in the range 0–31 (the lower half of each 64-block)

  • Integers whose value modulo 32 is in the range 16–31 (the upper half of each 32-number block)

  • Integers whose value modulo 16 is in the range 0–7 (the lower half of each 16-number block)

  • Integers whose value modulo 8 is in the range 0–3 (the lower half of each 8-number block)

  • Integers congruent to 2 or 3 modulo 4 (i.e., with the second least significant bit set)

  • Even integers (including 0)

Both reconstructions are perfect. We see the same for other ground-truth choices of $y$ that are linearly related to the value of the number in the string $s$, $y=a_1\cdot\text{str2num}(s)+a_2$. The feature discovery is driven by the structure of the algorithm, which breaks the data in a coarse-to-fine manner through binary splits, perfectly matching the binary number representation. All that a foundational model has to do is recognize this in the provided positive/negative splits. In many applications, such coarse-to-fine binary splitting of the data can be problematic, as features may not interact additively. In the next two experiments, in more realistic scenarios, we use Algorithm 2, which chooses the groups conditional on the assignment of the already discovered features.

5.2 Hedonic regression of house prices

The multi-modal dataset ahmed2016house consists of information for 535 houses listed in 2016, the first of which is shown in Figure 3. While our approach is directly applicable to multi-modal data, to save tokens and avoid limitations on the number of images we can submit in our mining prompt, we first transformed the information to pure text by running all images through GPT-5 with the prompt "Describe the listing image in detail." Each semantic item is a JSON file with the metadata (without the price) and the image descriptions, and the target $y$ is the log of the price. (Log price modeling is standard in hedonic regression, as some features affect prices multiplicatively; for example, an upgraded interior is costlier for a larger home, or for a home in an expensive ZIP code.)

Figure 3: House listing images (left) and the corresponding MLLM-generated semantic item (right). The strikethrough price indicates that the price is withheld from the items $s$ used in training our model. However, in the alternative to our approach, all items, including prices, were included in the baseline prompting, where GPT-5 was given the entire dataset and asked to generate features predictive of the price.
(a) Linear model. (b) Neural network.
Figure 4: Hedonic regression results on the log of house prices. Both plots show the median absolute error (MAE, left axis) and the number of features (right axis) over iterations. The horizontal dashed line shows the GPT-5 0-shot baseline (MAE = 0.38). The video showing which features are added (blue) or removed in (b) is available at: https://github.com/mjojic/genZ/tree/main/media.

As discussed in Section 5, we fit $y$ using both the linear and the nonlinear (neural net) forms of the function $f$. In both cases, we supply $f$ with additional real-valued inputs $x_1=1$, $x_2=\text{area}$, $x_3=\text{bedrooms}$, $x_4=\text{bathrooms}$, as we do not wish to rediscover these numerical features as we did in the toy experiment in the previous section. However, in the semantic items $s$, the full metadata (the number of bedrooms and bathrooms, the area in square feet, but not the price) is still provided, as it can participate in the features ${\bf z}$ to account for both nonlinear transformations of price and combined feature effects. For example, in some locales luxury homes have more bathrooms than bedrooms, and that signal can be captured in a discrete $z_i$ as an indicator of higher-priced homes.

We split the data into a training set (80% of the data) and a test set. We show the learning curves in Figure 4 for the linear model $f$ and a neural network $f$. The video showing which features are added (blue) or removed (red) in Figure 4(b) is available at: https://github.com/mjojic/genZ/tree/main/media. We plot the progress in terms of the median absolute error, as this corresponds to how the performance of hedonic regression of house prices tends to be evaluated in industry. For example, Zillow zillow reported that its Zestimate for off-market homes has a median error rate of 7.01%. (The Zestimate for on-market homes is much better because it benefits from the known listing price set by an agent; in our experiments we are trying to predict the price without such professional appraisal.) As we model log prices, small median absolute errors (MAE) are close to the relative (percentage) error of the actual (not log) price. For example, achieving an MAE of 0.10-0.11, as our model does, is equivalent to a median error rate of around 12%.
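As a quick check of that conversion (our own arithmetic, not a figure reported in the paper): a median absolute error of $\delta$ in log price means the predicted price is typically off by a factor of $e^{\pm\delta}$, so
$$\delta=0.11\;\Rightarrow\;e^{\pm 0.11}\in[0.896,\;1.116],$$
i.e., a median relative error of roughly 11-12% on the raw price.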

The 0-shot baseline is indicated by the horizontal line. The baseline is computed as follows. First, we provided the entire set of semantic items, including the prices, to GPT-5 and asked for 50 yes/no questions whose answers would be helpful in predicting the price (GPT gave us 52 instead). These were then set as the features $\theta_f$, and the pruning Algorithm 3 was applied repeatedly on the training set using the linear $f({\bf z},{\bf x})$, keeping track of the test performance. (Note that the function $f$ still takes the numeric features $x_1,x_2,x_3,x_4$.) The horizontal line on the graph represents the best test set performance achieved during pruning. As discussed in Sections 1 and 5, this baseline is meant to simulate what could be expected from methods that rely on the LLM's understanding of the domain, rather than iteratively contrasting items based on modeling error as our approach does. The baseline vastly underperforms, and by comparing the baseline's features with the features our method discovered (Section A.1) we can see why. For example, the baseline's features rarely focus on location or on the architecture and quality of the build/remodel, and instead capture relatively cosmetic features such as fenced yards or green lawns.

The differences in the learning curves provide another illustration that the hybrid model adjusts all its parameters in unison during learning. The train and test median absolute errors drop together until over-training happens. In the case of the linear model, the training error continues to drop while the testing error stays stable, whereas in the case of the nonlinear (neural) model, both train and test errors reach lower levels of around 0.11 (roughly 12% relative error on the raw, not log, price), after which the model begins to overtrain more severely, with the training error dropping even lower as the testing error rises. (Compare Zillow's 7% error zillow, which is based on a different, much larger and richer proprietary dataset of hand-crafted features.)

5.3 Cold start recommendations: Netflix prize dataset

The cold start problem in collaborative filtering systems refers to the challenge of making predictions for new items or users without sufficient interaction history. We use the classic Netflix prize dataset bennett2007netflix; netflix-prize-data to demonstrate how our hybrid model can address the cold start problem for new movies by discovering semantic features that predict collaborative filtering embeddings.

For 480,189 users and 17,770 movies, the dataset provides user-movie ratings (0-5, where 0 indicates that the movie was not rated by that user). We computed 32-dimensional embeddings ${\bf y}$ for each movie using singular value decomposition (SVD) of the ratings matrix. It is well understood that such embeddings capture latent user preferences and movie characteristics discovered through collaborative filtering, and they are remarkably good at modeling similarity between movies and at matching users to movies via simple vector projection of the embeddings. (See also the discussion in Section C on how, instead of embeddings, the raw matrix could be modeled with two different types of semantic items – users and movies – using a formulation similar to the one above.)
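A sketch of how such embeddings could be computed with a truncated SVD; the scaling of the movie factors by the singular values and the normalization are our assumptions, chosen because the paper evaluates predictions with cosine similarity.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Sketch: 32-dimensional movie embeddings y from the sparse user-movie ratings matrix
# via rank-k truncated SVD (zeros = unrated).

def movie_embeddings(ratings, k=32):
    """ratings: scipy.sparse matrix of shape (n_users, n_movies)."""
    U, S, Vt = svds(ratings.asfptype(), k=k)          # truncated SVD of the ratings matrix
    Y = (Vt * S[:, None]).T                           # one k-dim embedding per movie
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)  # unit-normalize for cosine similarity
```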

To illustrate the cold start problem, we simulate adding a new movie with limited exposure to viewers by removing a single movie and gradually adding user ratings. For a selected movie, we set the values in the ratings matrix to zero for all users, then gradually add users' actual ratings in a random order and recompute the movie's embedding after each addition. Figure 5 shows how the cosine similarity (CS) between the original (full observation) movie embedding and the embedding based on sparse observations improves as we add ratings for up to 4000 users. The plot shows the mean and 90% confidence interval over 100 randomly selected movies from the dataset. The similarity reaches only about 0.38 after 1000 users and approximately 0.57 after 4000 users. In other words, thousands of users would need to be exposed to the movie, many skipping it and some rating it, before the CS could grow to around 0.6, leading to more reliable recommendations. We show below that our linear hybrid model can predict embeddings immediately from the semantic items describing the movie, achieving a test CS of approximately 0.59 without requiring any user ratings—a performance level that would otherwise require roughly 4000 user ratings to achieve through collaborative filtering alone.

Figure 5: Cold start convergence simulation. Cosine similarity between embeddings computed from partial ratings vs. full ratings, as user ratings are gradually added in random order. The shaded region shows the 90% confidence interval over 100 randomly selected movies. Hundreds of ratings are typically needed before embeddings converge, illustrating the severity of the cold start problem that our hybrid model addresses by predicting embeddings from semantic features alone.

In our experiments, we focused on the 512 most watched movies, again randomly split into a training set (80%) and a test set. This allowed us to set the semantic items $s$ for these movies to consist only of the movie title and year of release: modern LLMs are familiar with the most popular movies watched in the period from 1998 to 2006. Otherwise, we would need the items $s$ to contain more information, as was the case in the house experiment, or allow the LLM to use a RAG system to retrieve relevant information when executing the extraction and mining prompts. The quality of the predictions is measured using the cosine similarity (CS) between the predicted and true 32-dimensional embeddings ${\bf y}$.

The learning curves are shown in Figure 6 for both the linear and the neural network models. The video showing which features are added (blue) or removed (red) in Figure 6(a) is available at: https://github.com/mjojic/genZ/tree/main/media. As with the houses experiment, we compare against a strong 0-shot baseline (horizontal dashed line at CS = 0.48), obtained by asking GPT-5 to suggest 50 features based on the task description and movie titles and then, via pruning on the training set, finding the best feature set in terms of test error, as this is the best the baseline could possibly achieve for that training-test split.

The linear model substantially outperforms the 0-shot baseline, reaching a test cosine similarity of approximately 0.59, indicating that the iterative feature discovery process finds movie characteristics more predictive of user preferences than those suggested by the LLM based on task understanding alone. The 0.11 improvement in cosine similarity (from 0.48 to 0.59) is substantial: as the cold start simulation in Figure 5 shows, this difference equals the improvement gained from adding roughly 2000 additional user ratings. Unlike in the house price prediction experiment, using the neural network as the function $f({\bf z})$ in the GenZ system leads to more dramatic overfitting, where the test CS score quickly levels off while the train CS continues to grow.

Examining the discovered features reveals interpretable patterns aligned with user preferences, though notably different in character from the 0-shot baseline features (see Section A.2 for complete feature lists). Where the 0-shot features focus on content-based attributes (genre, plot structure, themes, pacing, narrative devices), the discovered features gravitate toward external markers: specific talent (individual actors and composers), multiple fine-grained award distinctions (Best Picture, Best Actor/Actress for a specific role, Best Original Screenplay), franchise membership, and surprisingly precise temporal windows (1995-2000, 2004-2005, or post-1970). The John Williams feature is particularly revealing—it cuts across multiple genres to identify a coherent aesthetic or cultural preference that would be invisible to a content-based analysis. This suggests that collaborative filtering embeddings capture user similarity not through "stories like this" but through shared preferences for prestige signals, specific creative talent, and cultural/temporal cohorts (at least in this dataset). In another dataset, focused on a different selection of movies, a different time frame, or a different cross-section of users, the system would likely discover a different emergent structure in collective viewing, possibly at odds with conventional theories of content similarity.

(a) Linear model. (b) Neural network.
Figure 6: Cold start recommendation results on the Netflix dataset. Both plots show the cosine similarity (CS) between predicted and true embeddings (left axis) and the number of features (right axis) over iterations. The horizontal dashed line shows the GPT-5 0-shot baseline (CS = 0.5). The linear model achieves better test performance (CS $\approx$ 0.59) than the neural network (CS $\approx$ 0.52) despite using fewer features, suggesting that the relationship between semantic features and collaborative filtering embeddings is predominantly linear. The video showing which features are added (blue) or removed in (a) is available at: https://github.com/mjojic/genZ/tree/main/media.

6 Conclusions

The differences and similarities of the learning curves relative to the baseline in both experiments indicate that the components of the model jointly adjust to the different semantics of the data as well as to the target statistics and their relationships to the semantics. Even though the models used here (GPT-4/5) are familiar with the semantics of the data, their ability to predict the real-valued, possibly multi-dimensional targets benefits from the rest of the system and the modeling choices. For example, while a neural network better models the interactions of features needed to predict house prices, the linear model is better suited for modeling movie embeddings. In both applications, the learning curves show the classic behavior of iterative improvement until over-training. On the other hand, features generated by the LLMs alone are highly redundant, preventing both effective prediction and overtraining. As evident in the discovered features (Section A), our system focuses on properties of the semantic items that fit the statistics of the links $s\rightarrow{\bf y}$, rather than the 'conventional wisdom' of a zero-shot approach: for houses, the system discovers that location and the quality of the build/remodel matter most in the data it was given for listings in 2016, while for movies, the system discovers the features important to the audience in the period 1998-2006, such as the difference between newer and older content, the importance of franchises, specific actors/directors, and so on.

Appendix A Learned features

A.1 Learned features for the house dataset

The following are the features discovered by GenZ with the nonlinear (neural net) $f({\bf z})$ (on top of the automatically supplied numerical values $x_1,x_2,x_3,x_4$ for the intercept, area, number of bedrooms, and number of bathrooms), reaching test and train median relative errors of around 12% of the raw target price (Fig. 4(b)):

  • Site-built construction (not a manufactured/mobile home)

  • Located in Arizona desert communities (Scottsdale/Carefree area: zip codes 85255, 85262, 85377)

  • Located in the 92880 zip code (Eastvale/Corona area)

  • Located in the 62234 zip code (Collinsville, IL area)

  • Located in the 94531 zip code (Antioch, CA area)

  • Located in the 92276 zip code (Desert Hot Springs, CA)

  • Located in the 93111 zip code (Goleta/Santa Barbara area)

  • Located in the 81524 zip code (Loma, CO area)

  • Located in the 93510 zip code (Acton, CA area)

  • Located in the 91901 zip code (Alpine, CA area)

  • Bathrooms feature retro/vintage elements (e.g., colored fixtures or tile like powder blue/pink, clawfoot tubs, or checkered floors)

  • Located in the 92677 zip code (Laguna Niguel, CA)

  • Bathrooms featuring Hollywood-style bulb strip lighting above the mirror

  • Located in the 94501 zip code (Alameda, CA)

  • Homes featuring plantation shutters on windows

  • Located in the 96019 zip code

  • Located in Chicago-area suburbs with zip codes starting with 600 (e.g., 60002, 60016, 60046)

  • Bathrooms feature built-in vanities with cabinetry (no pedestal sinks)

  • Interior walls feature wood paneling

  • Kitchen does not include an island (no central island present)

  • Home includes a carport for covered parking (rather than an attached garage)

  • Located in zip codes starting with 90 (Los Angeles/Long Beach area)

  • Home includes a three-car garage

In Fig. 4(b), the horizontal line represents the minimum test error after training on the training set with an ever-shrinking set of baseline features obtained in what we refer to as a 0-shot manner. Specifically, to create those baseline features, GPT-5 was given the entire dataset (training and test) together with the prices and asked to hypothesize features useful for price prediction. The following are the 52 features from the 0-shot baseline used in feature selection:

  • House has 4 or more bedrooms

  • House has 5 or more bedrooms

  • House has 3 or more full bathrooms

  • House has 4 or more full bathrooms

  • House has an attached garage

  • House has a 3-car (or larger) garage

  • House has a circular driveway

  • House has a gated entry (wrought-iron gate/entry gate)

  • House has a covered front porch

  • House has a covered patio

  • House has a pergola for shade

  • House has a raised deck

  • House has a balcony

  • House has a private swimming pool

  • House has a resort-style pool with water features

  • House has waterfront access (dock/marina/pond view)

  • House has mountain or panoramic scenic views

  • House has extensive desert/xeriscape landscaping (gravel/cacti/rocks)

  • House has a green lawn (grass yard)

  • House has a fenced yard

  • House has solar panels on the roof

  • House has a tile roof (e.g., clay/Spanish tile)

  • House has stucco exterior walls

  • House has exterior stone/stacked-stone accents

  • House has Mediterranean-inspired architecture (arches/turret/columns)

  • House has Craftsman-style exterior elements (stone columns/shingle siding)

  • House has an open-concept kitchen/living layout

  • House has a large kitchen island

  • House has a breakfast bar or peninsula seating in the kitchen

  • House has granite or stone kitchen countertops

  • House has a tile or mosaic kitchen backsplash

  • House has stainless steel kitchen appliances

  • House has a double oven

  • House has recessed lighting

  • House has decorative chandelier lighting

  • House has hardwood flooring in main living areas

  • House has tile flooring in main living areas

  • House has wall-to-wall carpeting in bedrooms

  • House has plantation shutters

  • House has vaulted or tray ceilings

  • House has crown molding

  • House has exposed wood ceiling beams

  • House has a fireplace (in any room)

  • House has a fireplace in the primary bedroom

  • House has a walk-in shower with glass enclosure

  • House has a soaking tub in the primary bathroom

  • House has a jetted tub (spa tub)

  • House has a freestanding bathtub

  • House has a double vanity (two sinks) in a bathroom

  • House has glass block windows in a bathroom

  • House has a skylight

  • House has a built-in desk or built-in shelving/storage cabinetry
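To make the feature-pruning procedure behind the horizontal baseline in Fig. 4(b) concrete, the following is a minimal sketch of greedy backward elimination with a least-squares price model, tracking test error along the pruning path. The names (prune_features, X_train, etc.) and the use of median relative error as the ranking criterion are illustrative assumptions, not our exact implementation.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares price model with a bias column appended."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def median_rel_error(w, X, y):
    """Median relative error of the linear predictions."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.median(np.abs(Xb @ w - y) / np.abs(y))

def prune_features(X_train, y_train, X_test, y_test):
    """Greedy backward elimination: repeatedly drop the feature whose removal
    hurts the error criterion the least, recording test error along the way."""
    active = list(range(X_train.shape[1]))
    path = []
    while len(active) > 1:
        w = fit_linear(X_train[:, active], y_train)
        path.append((len(active), median_rel_error(w, X_test[:, active], y_test)))
        scores = []
        for j in active:
            rest = [k for k in active if k != j]
            w_j = fit_linear(X_train[:, rest], y_train)
            scores.append((median_rel_error(w_j, X_test[:, rest], y_test), j))
        active.remove(min(scores)[1])
    return path  # test error as a function of the number of retained features
```

The horizontal baseline then corresponds to the minimum error encountered anywhere along this pruning path.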

A.2 Discovered features for Netflix dataset

The following 27 features were discovered by our approach with the linear model f(𝐳)f({\bf z}), reaching a test CS of 0.59 (Fig. 6(a)):

  • Not science fiction (no movies centered on aliens, space, or futuristic tech)

  • Not released between 1995 and 2000 (inclusive)

  • Not primarily comedies

  • Not primarily romance-focused films (i.e., avoids romantic dramas and romantic comedies)

  • Mostly PG-13/PG-rated mainstream releases (i.e., generally avoids hard-R films)

  • Not animated (live-action films)

  • Not primarily action-adventure, heist, or spy thrillers (i.e., avoids high-octane studio action crowd-pleasers)

  • Not horror (i.e., avoids horror and horror-leaning supernatural thrillers)

  • Includes many Academy Award Best Picture nominees/winners (i.e., more prestige/canon titles)

  • Includes a notable number of pre-1995 titles (1960s–early ’90s), whereas the negatives are almost entirely 1995–2004 releases

  • Primarily entries in major film franchises or multi‑film series (including first installments that launched franchises)

  • Includes multiple war films set amid real historical conflicts (e.g., WWII/Vietnam), which are absent in the negative group

  • Not crude/raunchy, man‑child/SNL‑style comedies (when comedic, entries skew quirky/indie or gentle rather than broad/frat humor)

  • Skews toward female-led or women-centered narratives (e.g., Legally Blonde, The First Wives Club, Freaky Friday, The Pelican Brief, Lara Croft), whereas the negative group is predominantly male-led

  • Excludes 2004–2005 releases (the positives stop at 2003, while many negatives are from 2004–2005)

  • Includes multiple films centered on high school/college life or set primarily in educational settings (e.g., The Breakfast Club, Sixteen Candles, Fast Times at Ridgemont High, Billy Madison, Good Will Hunting, Top Gun), a theme largely absent from the negatives

  • Not children/family-oriented fare (avoids Disney-style live‑action family movies and musicals; skews toward teen/adult titles instead)

  • Not quirky/indie/cult, auteur‑driven or formally experimental films; the positives skew toward conventional, mainstream studio storytelling

  • Not ancient/medieval ‘sword‑and‑sandal’ or knightly epics (i.e., avoids Greco‑Roman/medieval battlefield sagas like Braveheart or Troy)

  • Includes multiple films scored by John Williams (e.g., Return of the Jedi, Raiders of the Lost Ark, E.T., Harry Potter and the Sorcerer’s Stone), whereas none of the negatives are Williams-scored

  • Includes multiple films that earned their lead performer an Academy Award (Best Actor/Best Actress) for that specific role (e.g., Monster, Training Day, Monster’s Ball, Dead Man Walking), whereas the negatives have few or none

  • Standard theatrical releases only (no titles labeled as Special/Extended/Anniversary/Platinum Edition)

  • Includes multiple films starring Angelina Jolie or Brad Pitt (none of the negatives star either)

  • Includes multiple films that won or were nominated for the Academy Award for Best Original Screenplay

  • No pre-1970 releases (all positives are 1972 or later, while the negatives include several 1950s–1960s films)

  • Includes multiple films starring Tom Hanks (e.g., Forrest Gump, Sleepless in Seattle), whereas none of the negatives feature him

On the other hand, a 0-shot approach produces very different features. The following 50 features were suggested by GPT-5 based on the task description alone and then pruned to achieve the baseline test CS of 0.48 (a small sketch of the cosine-similarity evaluation follows the list):

  • Is animated

  • Is part of a franchise/sequel/prequel/remake

  • Is based on a true story or real events

  • Is adapted from a non-comic book or short story

  • Is adapted from a comic book/graphic novel

  • Features a superhero or superpowered protagonist

  • Is set primarily in a historical period (pre-1950)

  • Is set primarily in the future

  • Is set primarily in a contemporary/modern time

  • Is set primarily outside the United States

  • Has predominantly non-English dialogue

  • Has an ensemble cast with multiple co-leads

  • Has a female lead or co-lead driving the story

  • Centers on a male buddy duo

  • Has a strong, central romantic plot

  • Has a dominant comedic tone

  • Is action-driven with frequent set pieces

  • Contains graphic/intense violence

  • Contains significant gunplay

  • Contains extensive swordplay or martial arts

  • Features large-scale war or battle scenes

  • Contains high-speed car chases/vehicular action

  • Centers on a heist/caper

  • Centers on spy/espionage activities

  • Centers on a crime investigation/detective story

  • Centers on courtroom/legal drama

  • Centers on supernatural or paranormal elements

  • Centers on science fiction elements (tech, AI, space)

  • Centers on fantasy worldbuilding (magic, mythical beings)

  • Has prominent horror elements and scares

  • Hinges on a mystery with a major twist/reveal

  • Uses a nonlinear or fractured timeline

  • Uses an unreliable narrator or ambiguous reality

  • Has slow-burn, mood-driven pacing

  • Has fast-paced, high-energy pacing

  • Has a feel-good/optimistic tone with an uplifting resolution

  • Has a tragic or bittersweet ending

  • Has a dark or cynical tone

  • Explores strong themes of redemption or forgiveness

  • Is driven by a revenge motivation

  • Is a coming-of-age or personal growth story

  • Features a prominent mentor–protégé relationship

  • Centers on friendship/buddy dynamics

  • Centers on family relationships (parents/children/siblings)

  • Is a musical (characters sing to advance the story)

  • Is sports-centered

  • Has a military setting or procedural focus

  • Relies heavily on CGI/visual effects spectacle

  • Won at least one Academy Award

  • Has a runtime over 140 minutes
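For reference, the CS numbers quoted above and in Fig. 6(a) are mean cosine similarities between predicted and held-out collaborative-filtering movie embeddings; a minimal sketch of this evaluation, under that assumed definition (array names are illustrative):

```python
import numpy as np

def mean_cosine_similarity(Y_pred, Y_true, eps=1e-12):
    """Average cosine similarity between rows of predicted and true embeddings."""
    num = np.sum(Y_pred * Y_true, axis=1)
    den = np.linalg.norm(Y_pred, axis=1) * np.linalg.norm(Y_true, axis=1) + eps
    return float(np.mean(num / den))
```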

Appendix B Tractably summing out z in joint updates for the linear observation model (factor analysis)

In our experiments, we used a generic version of the GenZ learning algorithm with an arbitrary (plug-and-play) functional link 𝐳𝐲{\bf z}\rightarrow{\bf y}. Instead of computing exact expectations of the form Eq(𝐳)g(𝐳)E_{q({\bf z})}g({\bf z}), we simply evaluated gg at the mode of qq. While this allowed us to run the same algorithm for the linear and nonlinear (neural net) models, we note that in some cases these expectations can be computed tractably under the full distribution qq. The linear model is such a case, and we derive it here for illustration.

A linear link between 𝐳{\bf z} and 𝐲{\bf y} takes a form

p(𝐲|𝐳,θy={𝚲,σ2})=𝒩(𝐲|𝚲𝐳,σ2I),p({\bf y}|{\bf z},\theta_{y}=\{{\bf\Lambda},\sigma^{2}\})={\cal N}({\bf y}|{\bf\Lambda}{\bf z},\sigma^{2}I), (17)

where 𝐳{\bf z} is now an (nf+1)×1(n_{f}+1)\times 1 feature vector with an additional element that is always 11, 𝐲{\bf y} is an nd×1n_{d}\times 1 observation vector, and 𝚲{\bf\Lambda} is an nd×(nf+1)n_{d}\times(n_{f}+1) parameter matrix. (Note that the overall model is still nonlinear because the relationships among the features 𝐳{\bf z} in p(𝐳|s)p({\bf z}|s) are nonlinear.) The additional constant element znf+1=1z_{n_{f}+1}=1 in 𝐳{\bf z} allows the bias vector to be learned within 𝚲{\bf\Lambda}.
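For concreteness, a minimal NumPy sketch of the observation model (17) with the constant feature appended (function and variable names are illustrative):

```python
import numpy as np

def augment(z):
    """Append the constant feature z_{n_f+1} = 1 so the last column of Lambda acts as a bias."""
    return np.concatenate([np.asarray(z, dtype=float), [1.0]])

def log_p_y_given_z(y, z, Lam, sigma2):
    """log N(y | Lambda z, sigma^2 I) of Eq. (17) for a binary feature vector z."""
    za = augment(z)
    resid = y - Lam @ za
    n_d = y.shape[0]
    return -0.5 / sigma2 * (resid @ resid) - 0.5 * n_d * np.log(2.0 * np.pi * sigma2)
```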

T1=Eq[logp(𝐲|𝐳)]\displaystyle T_{1}=E_{q}[\log p({\bf y}|{\bf z})] =\displaystyle= Eq[12σ2(y𝚲𝐳)T(y𝚲𝐳)]nd2log2πσ2\displaystyle E_{q}\big[-\frac{1}{2\sigma^{2}}(y-{\bf\Lambda}{\bf z})^{T}(y-{\bf\Lambda}{\bf z})\big]-\frac{n_{d}}{2}\log 2\pi\sigma^{2} (18)
=\displaystyle= 12σ2Eq[𝐲T𝐲2𝐲T𝚲𝐳+𝐳T𝚲T𝚲𝐳]nd2log2πσ2\displaystyle-\frac{1}{2\sigma^{2}}E_{q}\big[{\bf y}^{T}{\bf y}-2{\bf y}^{T}{\bf\Lambda}{\bf z}+{\bf z}^{T}{\bf\Lambda}^{T}{\bf\Lambda}{\bf z}\big]-\frac{n_{d}}{2}\log 2\pi\sigma^{2}
=\displaystyle= 12σ2(𝐲T𝐲2𝐲T𝚲Eq[𝐳]+Eq[tr(𝚲𝐳𝐳T𝚲T)])nd2log2πσ2\displaystyle-\frac{1}{2\sigma^{2}}\big({\bf y}^{T}{\bf y}-2{\bf y}^{T}{\bf\Lambda}E_{q}[{\bf z}]+E_{q}[tr({\bf\Lambda}{\bf z}{\bf z}^{T}{\bf\Lambda}^{T})]\big)-\frac{n_{d}}{2}\log 2\pi\sigma^{2}
=\displaystyle= 12σ2(𝐲T𝐲2𝐲T𝚲Eq[𝐳]+tr(𝚲Eq[𝐳𝐳T]𝚲T))nd2log2πσ2\displaystyle-\frac{1}{2\sigma^{2}}\big({\bf y}^{T}{\bf y}-2{\bf y}^{T}{\bf\Lambda}E_{q}[{\bf z}]+tr({\bf\Lambda}E_{q}[{\bf z}{\bf z}^{T}]{\bf\Lambda}^{T})\big)-\frac{n_{d}}{2}\log 2\pi\sigma^{2}

We use the approximate posterior limited to the form q(𝐳)=iq(zi)q({\bf z})=\prod_{i}q(z_{i}). Defining q(zi=1)=qiq(z_{i}=1)=q_{i}, q(zi=0)=1qiq(z_{i}=0)=1-q_{i}, and 𝐪=[q1,q2,qnf,1]T{\bf q}=[q_{1},q_{2},...q_{n_{f}},1]^{T} (as znf+1:=1z_{n_{f}+1}:=1), we have

Eq[𝐳]\displaystyle E_{q}[{\bf z}] =\displaystyle= 𝐪,\displaystyle{\bf q}, (19)
Eq[𝐳𝐳T]\displaystyle E_{q}[{\bf z}{\bf z}^{T}] =\displaystyle= 𝐪𝐪T+diag(𝐪𝐪𝐪),\displaystyle{\bf q}{\bf q}^{T}+\operatorname{diag}({\bf q}-{\bf q}\odot{\bf q}), (20)

where \odot denotes point-wise multiplication. (To see this, note that for nondiagonal elements zizjz_{i}z_{j} in 𝐳𝐳T{\bf z}{\bf z}^{T} we have E[zizj]=qiqjE[z_{i}z_{j}]=q_{i}q_{j} but the expectation for the diagonal elements is E[zi2]=qiE[z_{i}^{2}]=q_{i}, rather than qi2q_{i}^{2}). Thus,

tr(𝚲Eq[𝐳𝐳T]𝚲T)\displaystyle tr({\bf\Lambda}E_{q}[{\bf z}{\bf z}^{T}]{\bf\Lambda}^{T}) =\displaystyle= tr(𝚲𝐪𝐪T𝚲T)+tr(𝚲diag(𝐪𝐪𝐪)𝚲T)\displaystyle\operatorname{tr}({\bf\Lambda}{\bf q}{\bf q}^{T}{\bf\Lambda}^{T})+\operatorname{tr}({\bf\Lambda}\operatorname{diag}({\bf q}-{\bf q}\odot{\bf q}){\bf\Lambda}^{T}) (21)
=\displaystyle= 𝐪T𝚲T𝚲𝐪+tr(𝚲diag(𝐪𝐪𝐪)𝚲T),\displaystyle{\bf q}^{T}{\bf\Lambda}^{T}{\bf\Lambda}{\bf q}+\operatorname{tr}({\bf\Lambda}\operatorname{diag}({\bf q}-{\bf q}\odot{\bf q}){\bf\Lambda}^{T}),
T1=12σ2(𝐲𝚲𝐪)T(𝐲𝚲𝐪)12σ2tr(𝚲diag(𝐪𝐪𝐪)𝚲T)nd2log2πσ2.T_{1}=-\frac{1}{2\sigma^{2}}({\bf y}-{\bf\Lambda}{\bf q})^{T}({\bf y}-{\bf\Lambda}{\bf q})-\frac{1}{2\sigma^{2}}\operatorname{tr}({\bf\Lambda}\operatorname{diag}({\bf q}-{\bf q}\odot{\bf q}){\bf\Lambda}^{T})-\frac{n_{d}}{2}\log 2\pi\sigma^{2}. (22)

This is the only term that involves the parameter θy=𝚲\theta_{y}={\bf\Lambda}, and in the M step, for a given 𝐪{\bf q} we will solve for 𝚲{\bf\Lambda} by maximizing it. The derivation is also useful for the E step, in which we estimate 𝐪{\bf q}.
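A small sketch of the expectations (19)–(20) and the observation term T1T_{1} of (22) under the factorized posterior; the vector q is assumed to already include the trailing constant entry equal to 1 (names are illustrative):

```python
import numpy as np

def expectations(q):
    """E[z] and E[z z^T] under the factorized Bernoulli posterior (Eqs. 19-20)."""
    Ez = q
    Ezz = np.outer(q, q) + np.diag(q - q * q)   # diagonal uses E[z_i^2] = q_i
    return Ez, Ezz

def observation_term(y, q, Lam, sigma2):
    """The bound term T1 of Eq. (22)."""
    resid = y - Lam @ q
    D = np.diag(q - q * q)
    n_d = y.shape[0]
    return (-0.5 / sigma2 * (resid @ resid)
            - 0.5 / sigma2 * np.trace(Lam @ D @ Lam.T)
            - 0.5 * n_d * np.log(2 * np.pi * sigma2))
```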

With our choice of factorized posterior, the feature term T2T_{2} simplifies to

T2=Eq[logp(𝐳|s)]=iziq(zi)logp(zi|s)=i(qii1+(1qi)i0),\displaystyle T_{2}=E_{q}[\log p({\bf z}|s)]=\sum_{i}\sum_{z_{i}}q(z_{i})\log p(z_{i}|s)=\sum_{i}\big(q_{i}\ell^{1}_{i}+(1-q_{i})\ell^{0}_{i}\big), (23)

with i1\ell^{1}_{i} and i0\ell^{0}_{i} defined in (5). Finally, the entropy part of the bound becomes

T3=Eq[logq(𝐳)]=i(qilogqi+(1qi)log(1qi)).T_{3}=-E_{q}[\log q({\bf z})]=-\sum_{i}\big(q_{i}\log q_{i}+(1-q_{i})\log(1-q_{i})\big). (24)

To perform inference, i.e. to approximate the posterior p(𝐳|𝐲,s)p({\bf z}|{\bf y},s), we optimize the bound B=T1+T2+T3B=T_{1}+T_{2}+T_{3} with respect to {qi}\{q_{i}\}, the posterior probabilities q(zi=1)q(z_{i}=1). This is the part of the (variational) E-step needed to compute the expectations (20) used in the observation term T1T_{1} in (22). By maximizing T1T_{1}, and therefore the log-likelihood bound (1), with respect to 𝚲{\bf\Lambda} and σ2\sigma^{2}, we then perform the part of the M step related to the parameters of the observation conditional (17). In the M-step we can also maximize the bound with respect to the uncertainty parameters θe={pie}\theta_{e}=\{p^{e}_{i}\} of the feature model (5), and we use the mining prompt in Fig. 2 to re-estimate the semantic model parameters, i.e. the feature descriptions θf\theta_{f}, as discussed in Sect. 2. While the optimization can be done with off-the-shelf optimization packages, we derive the EM updates here.

E step: For a given (s,𝐲)(s,{\bf y}) pair we optimize the qiq_{i} one at a time, keeping the rest fixed. A simple way to approximate qiq_{i} for a tight bound is to introduce vectors 𝐪+i{\bf q}^{+i} and 𝐪i{\bf q}^{-i}, which are both equal to 𝐪{\bf q} in all entries but the ii-th. In the ii-th entry, 𝐪+i{\bf q}^{+i} equals 1 and 𝐪i{\bf q}^{-i} equals 0. In other words, the two are parameter vectors for posterior distributions in which the ii-th feature ziz_{i} is forced to be either 1 or 0 with certainty. Given the above derivations, we can thus compute the approximate log-likelihood

logp(𝐲,zi=1|s)\displaystyle\log p({\bf y},z_{i}=1|s) =\displaystyle= log{zj}jip(𝐲|{zj}ji,zi=1)p(zi=1,{zj}ji|s)\displaystyle\log\sum_{\{z_{j}\}_{j\neq i}}p({\bf y}|\{z_{j}\}_{j\neq i},z_{i}=1)\,p(z_{i}=1,\{z_{j}\}_{j\neq i}|s) (25)
\displaystyle\approx E𝐪+i[logp(𝐲|𝐳)]+E𝐪+i[logp(𝐳|s)]E𝐪+i[logq(𝐳)],\displaystyle E_{{\bf q}^{+i}}[\log p({\bf y}|{\bf z})]+E_{{\bf q}^{+i}}[\log p({\bf z}|s)]-E_{{\bf q}^{+i}}[\log q({\bf z})],

and similarly for p(𝐲,zi=0|s)p({\bf y},z_{i}=0|s), but using expectations wrt 𝐪i{\bf q}^{-i}. In each case the three terms are computed as above in (22), (23), and (24), but with the appropriate distribution parameter vector 𝐪+i{\bf q}^{+i} or 𝐪i{\bf q}^{-i} replacing 𝐪{\bf q}. The posterior parameter qi=p(𝐲,zi=1|s)p(𝐲,zi=1|s)+p(𝐲,zi=0|s)q_{i}=\frac{p({\bf y},z_{i}=1|s)}{p({\bf y},z_{i}=1|s)+p({\bf y},z_{i}=0|s)} is thus approximated as

qi=eE𝐪+i[logp(𝐲|𝐳)]+E𝐪+i[logp(𝐳|s)]E𝐪+i[logq(𝐳)]eE𝐪+i[logp(𝐲|𝐳)]+E𝐪+i[logp(𝐳|s)]E𝐪+i[logq(𝐳)]+eE𝐪i[logp(𝐲|𝐳)]+E𝐪i[logp(𝐳|s)]E𝐪i[logq(𝐳)]q_{i}=\frac{e^{E_{{\bf q}^{+i}}[\log p({\bf y}|{\bf z})]+E_{{\bf q}^{+i}}[\log p({\bf z}|s)]-E_{{\bf q}^{+i}}[\log q({\bf z})]}}{e^{E_{{\bf q}^{+i}}[\log p({\bf y}|{\bf z})]+E_{{\bf q}^{+i}}[\log p({\bf z}|s)]-E_{{\bf q}^{+i}}[\log q({\bf z})]}+e^{E_{{\bf q}^{-i}}[\log p({\bf y}|{\bf z})]+E_{{\bf q}^{-i}}[\log p({\bf z}|s)]-E_{{\bf q}^{-i}}[\log q({\bf z})]}} (26)

The entropy terms for 𝐪+i{\bf q}^{+i} and 𝐪i{\bf q}^{-i} are the same: applying equation (7) to 𝐪+i{\bf q}^{+i} and 𝐪i{\bf q}^{-i}, we see that all summands agree since 𝐪j+i=𝐪ji{\bf q}^{+i}_{j}={\bf q}^{-i}_{j} for jij\neq i, and the ii-th summand is zero in both cases as 𝐪i+i=1{\bf q}^{+i}_{i}=1 and 𝐪ii=0{\bf q}^{-i}_{i}=0. Therefore the entropy terms cancel. The feature terms E𝐪+i[logp(𝐳|s)]E_{{\bf q}^{+i}}[\log p({\bf z}|s)] and E𝐪i[logp(𝐳|s)]E_{{\bf q}^{-i}}[\log p({\bf z}|s)] are not the same: their summands still agree for jij\neq i, but the ii-th terms differ, being i1\ell^{1}_{i} and i0\ell^{0}_{i}, respectively. Therefore, we can cancel the shared parts, and the expression simplifies to

qi=eE𝐪+i[logp(𝐲|𝐳)]+i1eE𝐪+i[logp(𝐲|𝐳)]+i1+eE𝐪i[logp(𝐲|𝐳)]+i0=11+eE𝐪i[logp(𝐲|𝐳)]E𝐪+i[logp(𝐲|𝐳)]+i0i1q_{i}=\frac{e^{E_{{\bf q}^{+i}}[\log p({\bf y}|{\bf z})]+\ell^{1}_{i}}}{e^{E_{{\bf q}^{+i}}[\log p({\bf y}|{\bf z})]+\ell^{1}_{i}}+e^{E_{{\bf q}^{-i}}[\log p({\bf y}|{\bf z})]+\ell^{0}_{i}}}=\frac{1}{1+e^{E_{{\bf q}^{-i}}[\log p({\bf y}|{\bf z})]-E_{{\bf q}^{+i}}[\log p({\bf y}|{\bf z})]+\ell^{0}_{i}-\ell^{1}_{i}}} (27)

To obtain all parameters qiq_{i}, these equations need to be iterated, but the iterations can be interspersed with the parameter updates of the M step. In this update rule we see that the foundational model’s prediction of ziz_{i} is balanced against the observation 𝐲{\bf y}, which may be better explained if a different value of ziz_{i} is chosen. As discussed in Sect. 2, this is the signal needed to update the feature descriptor θfi{\theta_{f}}_{i}. For example, if 𝐲\bf y is a movie embedding vector, then the presence or absence of the feature "historical war movie" helps explain, through Λ\Lambda, the variation in 𝐲{\bf y}, which in turn reflects user preferences. If the users do not in fact have a preference specifically for historical war movies, but instead for movies based on real historical events, war-related or not, then this feature will only partially explain the variation, and the posterior might indicate the positive feature, zi=1z_{i}=1, with high probability for a movie that is about historical events but that the foundational model correctly classified as not a "historical war movie."
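A minimal sketch of these E-step coordinate updates (27); l1 and l0 hold the per-feature log-probabilities i1\ell^{1}_{i} and i0\ell^{0}_{i} from (5), the trailing constant feature is never updated, and the fixed number of sweeps as well as all names are illustrative:

```python
import numpy as np

def expected_loglik(y, q, Lam, sigma2):
    """E_q[log p(y|z)] for a factorized q (the same quantity as T1 in Eq. 22)."""
    resid = y - Lam @ q
    D = q - q * q
    n_d = y.shape[0]
    return (-0.5 / sigma2 * (resid @ resid)
            - 0.5 / sigma2 * np.sum((Lam ** 2) * D)
            - 0.5 * n_d * np.log(2 * np.pi * sigma2))

def e_step(y, q, Lam, sigma2, l1, l0, n_sweeps=5):
    """Coordinate updates of q_i per Eq. (27); q[-1] is the constant feature."""
    q = q.copy()
    for _ in range(n_sweeps):
        for i in range(len(q) - 1):
            q_plus, q_minus = q.copy(), q.copy()
            q_plus[i], q_minus[i] = 1.0, 0.0
            delta = (expected_loglik(y, q_minus, Lam, sigma2)
                     - expected_loglik(y, q_plus, Lam, sigma2)
                     + l0[i] - l1[i])
            q[i] = 1.0 / (1.0 + np.exp(delta))   # Eq. (27)
    return q
```

In practice these sweeps can be interspersed with the M-step parameter updates below, as noted above.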

M step: While the E-step is performed independently for each item ss, in the M step we consider a collection of item-observation pairs (st,𝐲t)(s^{t},{\bf y}^{t}), together with the posterior distributions 𝐪t{\bf q}^{t} from the E-step, and optimize the sum of the bounds BtB^{t}. From (22), we see that

(tBt)𝚲=1σ2t((𝐲t𝚲𝐪t)(𝐪t)T𝚲diag(𝐪t𝐪t𝐪t))\frac{\partial\big(\sum_{t}B^{t}\big)}{\partial{\bf\Lambda}}=\frac{1}{\sigma^{2}}\sum_{t}\bigg(({\bf y}^{t}-{\bf\Lambda}{\bf q}^{t})({\bf q}^{t})^{T}-{\bf\Lambda}\operatorname{diag}({\bf q}^{t}-{\bf q}^{t}\odot{\bf q}^{t})\bigg) (28)

Setting the derivative to zero we derive the update

𝚲=(t𝐲t(𝐪t)T)(t(𝐪t(𝐪t)T+diag(𝐪t𝐪t𝐪t)))1{\bf\Lambda}=\bigg(\sum_{t}{\bf y}^{t}({\bf q}^{t})^{T}\bigg)\bigg(\sum_{t}\big({\bf q}^{t}({\bf q}^{t})^{T}+\operatorname{diag}({\bf q}^{t}-{\bf q}^{t}\odot{\bf q}^{t})\big)\bigg)^{-1} (29)

The noise parameter σ2\sigma^{2} is optimized when

σ2=1ndTt=1T((𝐲t𝚲𝐪t)T(𝐲t𝚲𝐪t)+tr(𝚲diag(𝐪t𝐪t𝐪t)𝚲T)).\sigma^{2}=\frac{1}{n_{d}T}\sum_{t=1}^{T}\Big(({\bf y}^{t}-{\bf\Lambda}{\bf q}^{t})^{T}({\bf y}^{t}-{\bf\Lambda}{\bf q}^{t})+\operatorname{tr}({\bf\Lambda}\operatorname{diag}({\bf q}^{t}-{\bf q}^{t}\odot{\bf q}^{t}){\bf\Lambda}^{T})\Big). (30)
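A sketch of the closed-form M-step updates (29)–(30), with observations and posteriors stacked row-wise (array names are illustrative):

```python
import numpy as np

def m_step(Y, Q):
    """Closed-form updates (29)-(30). Y: (T, n_d) observations; Q: (T, n_f+1)
    posterior means, with the last column all ones for the constant feature."""
    T, n_d = Y.shape
    A = Y.T @ Q                                   # sum_t y^t (q^t)^T
    V = Q * (1.0 - Q)                             # per-item Bernoulli variances
    B = Q.T @ Q + np.diag(V.sum(axis=0))          # sum_t (q q^T + diag(q - q*q))
    Lam = np.linalg.solve(B, A.T).T               # Eq. (29): A B^{-1}
    resid = Y - Q @ Lam.T
    var_term = np.sum((Lam ** 2) * V.sum(axis=0)) # sum_t tr(Lam diag(...) Lam^T)
    sigma2 = (np.sum(resid ** 2) + var_term) / (n_d * T)  # Eq. (30)
    return Lam, sigma2
```

Note that the constant feature's column of Q is all ones, so its Bernoulli variance is zero and it contributes nothing to the diagonal correction.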

We can (and in our experiments do) also optimize the feature uncertainty parameters piep^{e}_{i}. These parameters could in principle be extracted from the foundational model using token probabilities or its own claim of confidence in the response (see the prompt in Fig. 2). However, the foundational model’s own uncertainty only captures its confidence in judging the characteristic. In our hybrid model, we can have piep^{e}_{i} model the additional uncertainty about whether the feature is well-defined. This is especially useful during learning, as features evolve. As we discussed above using the "historical war movie" feature as an example, in the E step the posterior may diverge from the foundational model’s feature classification if the feature only approximately aligns with the true variation in the observations. To estimate, and then reuse, the full uncertainty, we maximize the bound with respect to piep^{e}_{i} to arrive at:

pie=1Tt=1Tqit[hit=0]+(1qit)[hit=1],p^{e}_{i}=\frac{1}{T}\sum_{t=1}^{T}q_{i}^{t}[h_{i}^{t}=0]+(1-q_{i}^{t})[h_{i}^{t}=1], (31)

where hith_{i}^{t} are the binary responses of the foundational model for feature ii on the semantic item sts^{t}. (To derive this, see (5); the update is also intuitive, as it simply averages the evidence of error indicated by the posterior.)
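The update (31) in array form, where Q holds the posteriors qitq_{i}^{t} and H the binary LLM judgments hith_{i}^{t} (names are illustrative):

```python
import numpy as np

def update_feature_error_rates(Q, H):
    """Eq. (31): per-feature average posterior evidence that the LLM's judgment was wrong.
    Q, H: arrays of shape (T, n_f)."""
    return np.mean(Q * (H == 0) + (1.0 - Q) * (H == 1), axis=0)
```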

Finally, to update the feature descriptors θfi{\theta_{f}}_{i}, we sample or threshold the posteriors qitq^{t}_{i} to get some positive (zit=1z_{i}^{t}=1) and negative (zit=0z_{i}^{t}=0) examples. We then use exemplars from each group to form the feature mining prompt in Fig. 2. The foundational model’s response in the "characteristic" field is the updated feature description θfi{\theta_{f}}_{i}.

Appendix C Statistical modeling of co-occurrence/association/preferences between two sets of items

We studied a model that links a semantic description of an item ss with an observation vector 𝐲{\bf y}. Among other applications, the approach allows us to find features that determine user preferences for movies when 𝐲{\bf y} are movie embeddings inferred from the movie-user observation matrix. Here we show that such an approach is a simplification of a model that involves two sets of interacting items (e.g. movies/viewers, products/buyers, plants/geographic areas, images/captions, congress members/bills). That model (without the simplification) can also be learned from data.

Suppose that the matrix of the pairwise interactions xt1,t2x_{t_{1},t_{2}} is observed for item pairs from the two groups (s1t1,s2t2)(s_{1}^{t_{1}},s_{2}^{t_{2}}). We would then model the joint distribution p(x|𝐳1,𝐳2)p(𝐳1|s1)p(𝐳2|s2)p(x|{\bf z}_{1},{\bf z}_{2})p({\bf z}_{1}|s_{1})p({\bf z}_{2}|s_{2}) where, for example, s1s_{1} would be a textual description (or an image!) of a plant and s2s_{2} a textual description (or a satellite observation!) of a geographic area, and the scalar xx would be the presence (or thriving) signal for the combination. We could develop a number of different complex models with suitable latent variables (in addition to 𝐳{\bf z}), but honoring the traditionally very successful collaborative filtering methods for dealing with such data, we will focus on embedding methods (which typically rely on matrix decomposition methods, at least in recommender systems).

Thus, for each set of items our latent variables 𝐳{\bf z} should determine a real-valued embedding 𝐲{\bf y} probabilistically, through a distribution p(𝐲|𝐳)p({\bf y}|{\bf z}).

Then, given the low-dimensional embeddings 𝐲1,𝐲2{\bf y}_{1},{\bf y}_{2} for the items in the pair (e.g. a movie and a user), we would model the association variable as x𝐲1T𝐲2x\approx{\bf y}_{1}^{T}{\bf y}_{2}:

logp(x|𝐲1,𝐲2)=log𝒩(x;𝐲1T𝐲2,σx2)=12σx2(x𝐲1T𝐲2)212log(2πσx2)\log p(x|{\bf y}_{1},{\bf y}_{2})=\log{\cal N}(x;{\bf y}_{1}^{T}{\bf y}_{2},\sigma^{2}_{x})=-\frac{1}{2\sigma_{x}^{2}}(x-{\bf y}_{1}^{T}{\bf y}_{2})^{2}-\frac{1}{2}\log(2\pi\sigma^{2}_{x}) (32)

In this model over the observed association strength xx, the item embeddings 𝐲1{\bf y}_{1} and 𝐲2{\bf y}_{2} are each based on their own set’s feature vectors 𝐳1{\bf z}_{1}, 𝐳2{\bf z}_{2}, which are themselves probabilistically dependent on the items’ semantics s1,s2s_{1},s_{2}. The likelihood can be approximated as

logp(x|s1,s2)\displaystyle\log p(x|s_{1},s_{2}) =\displaystyle= log𝐳1,𝐳2,𝐲1,𝐲2p(x|𝐲1,𝐲2)p(𝐲1|𝐳1)p(𝐲2|𝐳2)p(𝐳1|s1)p(𝐳2|s2)\displaystyle\log\sum_{{\bf z}_{1},{\bf z}_{2},{\bf y}_{1},{\bf y}_{2}}p(x|{\bf y}_{1},{\bf y}_{2})p({\bf y}_{1}|{\bf z}_{1})p({\bf y}_{2}|{\bf z}_{2})p({\bf z}_{1}|s_{1})p({\bf z}_{2}|s_{2}) (33)
\displaystyle\geq Eq(𝐳1,𝐳2,𝐲1,𝐲2)[logp(x|𝐲1,𝐲2)p(𝐲1|𝐳1)p(𝐲2|𝐳2)p(𝐳1|s1)p(𝐳2|s2)q(𝐳1,𝐳2,𝐲1,𝐲2)]\displaystyle E_{q({\bf z}_{1},{\bf z}_{2},{\bf y}_{1},{\bf y}_{2})}\bigg[\log\frac{p(x|{\bf y}_{1},{\bf y}_{2})p({\bf y}_{1}|{\bf z}_{1})p({\bf y}_{2}|{\bf z}_{2})p({\bf z}_{1}|s_{1})p({\bf z}_{2}|s_{2})}{q({\bf z}_{1},{\bf z}_{2},{\bf y}_{1},{\bf y}_{2})}\bigg]

where q(𝐳1,𝐳2,𝐲1,𝐲2)q({\bf z}_{1},{\bf z}_{2},{\bf y}_{1},{\bf y}_{2}) is an approximate posterior. If we assume a factored posterior q=q(𝐳1)q(𝐳2)q(𝐲1)q(𝐲2)q=q({\bf z}_{1})q({\bf z}_{2})q({\bf y}_{1})q({\bf y}_{2}), and that the posteriors over embeddings are Dirac deltas, q(𝐲1)=δ(𝐲1𝐲^1)q({\bf y}_{1})=\delta({\bf y}_{1}-\hat{{\bf y}}_{1}), q(𝐲2)=δ(𝐲2𝐲^2)q({\bf y}_{2})=\delta({\bf y}_{2}-\hat{{\bf y}}_{2}), then based on (32):

logp(x|s1,s2)12σx2(x𝐲^1T𝐲^2)212log(2πσx2)+Eq[logp(𝐲^1,𝐳1|s1)q(𝐳1)]+Eq[logp(𝐲^2,𝐳2|s2)q(𝐳2)],\log p(x|s_{1},s_{2})\approx-\frac{1}{2\sigma_{x}^{2}}(x-\hat{{\bf y}}_{1}^{T}\hat{{\bf y}}_{2})^{2}-\frac{1}{2}\log(2\pi\sigma^{2}_{x})+E_{q}\big[\log\frac{p(\hat{{\bf y}}_{1},{\bf z}_{1}|s_{1})}{q({\bf z}_{1})}\big]+E_{q}\big[\log\frac{p(\hat{{\bf y}}_{2},{\bf z}_{2}|s_{2})}{q({\bf z}_{2})}\big],

where the expected embeddings are 𝐲^1=Eq(𝐳1)[𝐲1]\hat{{\bf y}}_{1}=E_{q({\bf z}_{1})}[{\bf y}_{1}], 𝐲^2=Eq(𝐳2)[𝐲2]\hat{{\bf y}}_{2}=E_{q({\bf z}_{2})}[{\bf y}_{2}]. If we observe the full association matrix 𝐗{\bf X}, e.g. over n118000n_{1}\approx 18000 movies, indexed by t1t_{1}, and n2500000n_{2}\approx 500000 users, indexed by t2t_{2}, then the approximate log-likelihood of the data, broken into two terms, is:

L\displaystyle L =\displaystyle= 12σx2t1,t2(xt1,t2(𝐲^1t1)T𝐲^2t2)2n1n22log(2πσx2)+\displaystyle-\frac{1}{2\sigma_{x}^{2}}\sum_{t_{1},t_{2}}(x_{t_{1},t_{2}}-(\hat{{\bf y}}_{1}^{t_{1}})^{T}\hat{{\bf y}}_{2}^{t_{2}})^{2}-\frac{n_{1}n_{2}}{2}\log(2\pi\sigma^{2}_{x})+ (34)
+\displaystyle+ t1Eq[logp(𝐲1^t1,𝐳1t1|s1t1)q(𝐳1t1)]+t2Eq[logp(𝐲2^t2,𝐳2t2|s2t2)q(𝐳2t2)].\displaystyle\sum_{t_{1}}E_{q}\big[\log\frac{p(\hat{{\bf y}_{1}}^{t_{1}},{\bf z}_{1}^{t_{1}}|s_{1}^{t_{1}})}{q({\bf z}_{1}^{t_{1}})}\big]+\sum_{t_{2}}E_{q}\big[\log\frac{p(\hat{{\bf y}_{2}}^{t_{2}},{\bf z}_{2}^{t_{2}}|s_{2}^{t_{2}})}{q({\bf z}_{2}^{t_{2}})}\big].

This can be seen as a regularized matrix decomposition problem: collecting xt1,t2x_{t_{1},t_{2}} into the observation matrix 𝐗{\bf X} and all items’ embedding vectors into matrices 𝐘^1\hat{{\bf Y}}_{1} and 𝐘^2\hat{{\bf Y}}_{2}, the first term (first line) above can be written as

T1=12σx2𝐗𝐘^1T𝐘^2F2n1n22log(2πσx2),T_{1}=-\frac{1}{2\sigma_{x}^{2}}||{\bf X}-\hat{{\bf Y}}_{1}^{T}\hat{{\bf Y}}_{2}||_{F}^{2}-\frac{n_{1}n_{2}}{2}\log(2\pi\sigma^{2}_{x}), (35)

while the second term (second line) serves as regularization. This regularizer in fact consists of two models, one for each type of item, of the form studied in the previous sections. When the observation matrix is very large and has low intrinsic dimension, as is typically the case in collaborative filtering applications, the first term can be fit very well, resulting in a low σx2\sigma_{x}^{2} and making the regularization term adapt to the natural data embedding. Therefore, since our goal for the two p(𝐲,𝐳|s)p({\bf y},{\bf z}|s) components is not to interfere with the natural embeddings of the data, but rather to explain the statistical patterns so that they transfer to new semantic items with new descriptions and previously unseen interaction patterns at test time, we can fit the first term independently of the other two, after which the joint modeling problem separates into two independent ones, e.g. extracting the plant features that explain plant embeddings and extracting the geographic-area features that explain area embeddings.

For a given embedding dimension nen_{e}, the term T1T_{1} is maximized when 𝐘^1\hat{{\bf Y}}_{1} and 𝐘^2\hat{{\bf Y}}_{2} are related to the rank-nen_{e} truncated SVD of 𝐗{\bf X}, 𝐗𝐔𝐒𝐕T{\bf X}\approx{\bf U}{\bf S}{\bf V}^{T}, where the n1×nen_{1}\times n_{e} matrix UU and the n2×nen_{2}\times n_{e} matrix VV have orthonormal columns and SS is a diagonal matrix. For example, the term is maximized when

𝐘1^=𝐑𝐒α𝐔T,𝐘2^=𝐑𝐒1α𝐕T,\hat{{\bf Y}_{1}}={\bf R}{\bf S}^{\alpha}{\bf U}^{T},\quad\hat{{\bf Y}_{2}}={\bf R}{\bf S}^{1-\alpha}{\bf V}^{T}, (36)

for an arbitrary rotation matrix 𝐑{\bf R} and a scalar α[0,1]\alpha\in[0,1].
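A minimal sketch of computing such embeddings with a dense truncated SVD and 𝐑{\bf R} taken to be the identity; for a Netflix-scale matrix a sparse or randomized solver would be used instead, and the names are illustrative:

```python
import numpy as np

def embeddings_from_svd(X, n_e, alpha=0.5):
    """Rank-n_e factorization X ~ Y1^T Y2 as in Eq. (36), with R = I.
    Columns of Y1 (n_e x n_1) and Y2 (n_e x n_2) are the item embeddings."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U, s, Vt = U[:, :n_e], s[:n_e], Vt[:n_e, :]
    Y1 = np.diag(s ** alpha) @ U.T
    Y2 = np.diag(s ** (1.0 - alpha)) @ Vt
    return Y1, Y2
```

Any choice of α and 𝐑{\bf R} leaves the product 𝐘^1T𝐘^2\hat{{\bf Y}}_{1}^{T}\hat{{\bf Y}}_{2} unchanged, so they can be chosen for the convenience of the downstream feature models.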

Therefore, if we solve the matrix decomposition problem in T1T_{1}, we can then fit the features for the two sets of items separately, since the second term decomposes as:

T2=t1Eq[logp(𝐲1^t1,𝐳1t1|s1t1)q(𝐳1t1)]+t2Eq[logp(𝐲2^t2,𝐳2t2|s2t2)q(𝐳2t2)].T_{2}=\sum_{t_{1}}E_{q}\big[\log\frac{p(\hat{{\bf y}_{1}}^{t_{1}},{\bf z}_{1}^{t_{1}}|s_{1}^{t_{1}})}{q({\bf z}_{1}^{t_{1}})}\big]+\sum_{t_{2}}E_{q}\big[\log\frac{p(\hat{{\bf y}_{2}}^{t_{2}},{\bf z}_{2}^{t_{2}}|s_{2}^{t_{2}})}{q({\bf z}_{2}^{t_{2}})}\big]. (37)

In other words, by computing the embeddings using a standard matrix decomposition technique, we can build models that generate these item embeddings 𝐲{\bf y} using the semantics of the items ss.

In the case of movie preference datasets, like the Netflix Prize dataset, and data from many other recommender systems, the second set of items – users – is usually not described semantically (through information such as demographics or their own writing). Privacy is often cited as the reason, but it is also true that people’s complex overall behavior, as it relates to the task, is better represented simply by their preferences: knowing which movies someone has watched is a better indicator of whether they will like a particular new movie than a description of their life story, even if one were available. In other words, in our Netflix Prize experiments we only fit the movie embeddings.