YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation
Abstract
Steering Large Language Models (LLMs) through activation interventions has emerged as a lightweight alternative to fine-tuning for alignment and personalization. Recent work on Bi-directional Preference Optimization (BiPO) shows that dense steering vectors can be learned directly from preference data in a Direct Preference Optimization (DPO) fashion, enabling control over truthfulness, hallucinations, and safety behaviors. However, dense steering vectors often entangle multiple latent factors due to neuron multi-semanticity, limiting their effectiveness and stability in fine-grained settings such as cultural alignment, where closely related values and behaviors (e.g., among Middle Eastern cultures) must be distinguished. In this paper, we propose Yet Another Policy Optimization (YaPO), a reference-free method that learns sparse steering vectors in the latent space of a Sparse Autoencoder (SAE). By optimizing sparse codes, YaPO produces disentangled, interpretable, and efficient steering directions. Empirically, we show that YaPO converges faster, achieves stronger performance, and exhibits improved training stability compared to dense steering baselines. Beyond cultural alignment, YaPO generalizes to a range of alignment-related behaviors, including hallucination, wealth-seeking, jailbreak, and power-seeking. Importantly, YaPO preserves general knowledge, with no measurable degradation on MMLU. Overall, our results show that YaPO provides a general recipe for efficient, stable, and fine-grained alignment of LLMs, with broad applications to controllability and domain adaptation. The associated code and data are publicly available at https://github.com/MBZUAI-Paris/YaPO.
Abdelaziz Bounhar1,∗, Rania Hossam Elmohamady Elbadry1, Hadi Abdine1, Preslav Nakov1, Michalis Vazirgiannis1,2, Guokan Shang1,∗ 1MBZUAI, 2Ecole Polytechnique ∗Correspondence: {abdelaziz.bounhar, guokan.shang}@mbzuai.ac.ae
1 Introduction
Large language models have achieved remarkable progress in generating coherent, contextually appropriate, and useful text across domains. However, controlling their behavior in a fine-grained and interpretable manner remains a central challenge for alignment and personalization. Traditional approaches such as Reinforcement Learning from Human Feedback (RLHF) (Ziegler et al., 2019) are effective but costly, difficult to scale, and often inflexible, while also offering little transparency into how specific behaviors are modulated. Prompt engineering provides a lightweight alternative but is brittle and usually less effective than fine-tuning. Moreover, these fine-tuning pipelines scale poorly: modulating a single behavior may require updating millions of parameters or collecting large amounts of preference data, with the risk of degrading performance on unrelated tasks. These limitations have motivated growing interest in activation steering, a lightweight paradigm that guides model outputs by directly modifying hidden activations at inference time, via steering-vector injection at specific layers, without retraining or altering the original model weights (Turner et al., 2023).
Early activation steering methods such as Contrastive Activation Addition (CAA) (Panickssery et al., 2024) compute steering vectors by averaging activation differences over contrastive prompts. While simple, this approach captures only coarse behavioral signals and often fails in complex settings. Bi-directional Preference Optimization (BiPO) (Cao et al., 2024) introduced a DPO-style objective to directly learn dense steering vectors from preference data, enabling improved control over behaviors such as hallucination and refusal.
However, both CAA and BiPO rely on dense steering vectors, which are prone to entangling multiple latent factors due to neuron multi-semanticity and superposition (Elhage et al., 2022). This limits their stability, interpretability, and effectiveness in fine-grained alignment settings. In parallel, Sparse Activation Steering (SAS) (Bayat et al., 2025) leverages Sparse Autoencoders (SAEs) to operate on approximately monosemantic features, enabling more interpretable interventions, but relies on static averaged activations rather than learnable sparse vectors.
In this work, we introduce Yet Another Policy Optimization (YaPO), a reference-free method that learns trainable sparse steering vectors directly in the latent space of a pretrained SAE using a BiPO-style objective. YaPO combines the preference optimization of BiPO with the interpretability of SAS, yielding sparse, stable, and effective steering directions with minimal training overhead.
We study cultural adaptation as a representative domain adaptation setting, introducing a new benchmark spanning five language families and fifteen cultural contexts. Our results identify a substantial implicit–explicit localization gap in baseline models as in (Veselovsky et al., 2025), and show that YaPO consistently closes this gap through improved fine-grained alignment. We further assess the generalization of YaPO on MMLU and on established alignment benchmarks from prior studies (Cao et al., 2024; Panickssery et al., 2024; Bayat et al., 2025).
In summary, our contributions are threefold: (i) we propose YaPO, the first reference-free method for learning sparse steering vectors (in the latent space of an SAE) from preference data; (ii) we curate a new dataset and benchmark for cultural alignment that targets fine-grained cultural distinctions, including same-language cultures with subtle differences in values and norms, spanning five language families and fifteen cultural contexts; and (iii) we empirically show that YaPO converges faster, exhibits improved training stability, and yields more interpretable steering directions than dense baselines, while also generalizing beyond cultural alignment to broader alignment tasks and benchmarks.
2 Related Work
Alignment and controllability. RLHF (Christiano et al., 2017; Ziegler et al., 2019; Stiennon et al., 2020; Ouyang et al., 2022) has become the standard approach to align LLMs, training a reward model on human preference data and fine-tuning with PPO (Schulman et al., 2017) under the Bradley–Terry framework (Bradley and Terry, 1952). Recent methods simplify this pipeline by bypassing explicit reward modeling: DPO (Rafailov et al., 2024) directly optimizes on preference pairs, while SLiC (Zhao et al., 2023) introduces a contrastive calibration loss with regularization toward the SFT model. Statistical rejection sampling (Liu et al., 2024) unifies both objectives and provides a tighter policy estimate.
Activation engineering. Activation-based methods steer LLMs by freezing weights and intervening on hidden activations. Early approaches optimized sentence-specific latent vectors (Subramani et al., 2022), while activation addition (Turner et al., 2023) and CAA (Rimsky et al., 2023) compute averaged activation differences from contrastive prompts. Although simple, these methods are often noisy and unstable, particularly for long-form or alignment-critical generation (Wang and Shu, 2023). More recent work perturbs attention heads (Li et al., 2024; Liu et al., 2023). BiPO (Cao et al., 2024) improves over prior work by framing steering as preference optimization, learning dense steering vectors via a bi-directional DPO-style objective.
Sparse activation steering. To mitigate superposition, Sparse Autoencoders (SAEs) (Lieberum et al., 2024) decompose activations into sparse, approximately monosemantic features. Sparse Activation Steering (SAS) (Bayat et al., 2025) exploits this structure by averaging sparse activations from contrastive data, yielding interpretable and fine-grained control. However, SAS does not optimize steering directions against preferences, limiting its effectiveness.
SAE-based steering and editing. Recent work combines activation steering with sparse or structured representation bases (Wu et al., 2025a, b; Chalnev et al., 2024; He et al., 2025; Sun et al., 2025; Xu et al., 2025). ReFT-r1 (Wu et al., 2025a) learns a single dense steering direction on frozen models using a language-modeling objective with sparsity constraints. RePS (Wu et al., 2025b) introduces a reference-free, bi-directional preference objective to train intervention-based steering methods. Other approaches operate directly in SAE space: SAE-TS and SAE-SSV (Chalnev et al., 2024; He et al., 2025) optimize or select sparse SAE features for controlled steering, while HyperSteer (Sun et al., 2025) generates steering vectors on demand via a hypernetwork.
Positioning of YaPO. BiPO provides strong optimization but suffers from dense entanglement; SAS offers interpretability but lacks optimization. YaPO unifies these lines by learning preference-optimized, sparse steering vectors in SAE space. This yields disentangled, interpretable, and stable steering, with improved convergence and generalization across cultural alignment, truthfulness, hallucination suppression, and jailbreak defense.
3 Method
3.1 Motivation: From Dense to Sparse Steering
Existing approaches extract steering vectors by directly operating in the dense activation space of LLMs (Rimsky et al., 2023; Wang and Shu, 2023). While effective in some cases, these methods inherit the multi-semantic entanglement of neurons: individual dense features often conflate multiple latent factors (Elhage et al., 2022), leading to noisy and unstable control signals. As a result, vectors obtained from contrastive prompt pairs can misalign with actual generation behaviors, especially in alignment-critical tasks.
To address this, we leverage SAEs, which have recently been shown to disentangle latent concepts in LLM activations into sparse, interpretable features (Bayat et al., 2025; Lieberum et al., 2024). By mapping activations into this sparse basis, steering vectors can be optimized along dimensions that correspond more cleanly to relevant semantic factors, improving both precision and interpretability.
3.2 Preference-Optimized Steering in Sparse Space
Let $h_\ell(x)$ denote the hidden activations of the transformer at layer $\ell$ for input $x$, and let $f_{>\ell}$ denote the upper part of the transformer (from layer $\ell+1$ to the output). BiPO (Cao et al., 2024) learns a steering vector $v \in \mathbb{R}^{d_{\text{model}}}$ in the dense activation space of dimension $d_{\text{model}}$ using the bi-directional preference optimization objective (see Equation 1). Here, $y_w$ and $y_l$ are respectively the preferred and dispreferred responses, jointly drawn with the prompt $x$ from the preference dataset $\mathcal{D}$; $\sigma$ is the logistic function, $\beta$ a deviation control parameter, and $d$ a uniformly random coefficient enforcing bi-directionality. At inference time, the learned steering vector is injected into the hidden state to perturb it towards the desired steering behavior as follows:

$$h_\ell(x) \leftarrow h_\ell(x) + d \cdot c \cdot v \qquad (2)$$

with $d$ fixed to either $-1$ or $+1$ (negative or positive steering) and $c$ a multiplicative factor controlling the strength of steering.
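To make the intervention concrete, the following is a minimal PyTorch sketch of this dense injection implemented as a forward hook on a single transformer block. The module path (`model.model.layers[15]`, matching a Gemma-2-style Hugging Face checkpoint) and the multiplier value are illustrative assumptions, not the exact implementation used in the paper.

```python
import torch

def make_dense_steering_hook(v: torch.Tensor, d: int = +1, c: float = 1.0):
    """Return a forward hook that adds the steering term d * c * v (Equation 2)
    to the block's output hidden states."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + d * c * v.to(device=hidden.device, dtype=hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return hook

# Illustrative usage on a loaded causal LM (hypothetical variable names):
# handle = model.model.layers[15].register_forward_hook(
#     make_dense_steering_hook(v, d=+1, c=1.5))
# ...generate as usual, then handle.remove() to restore the unsteered model.
```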
In contrast, with YaPO, we introduce a sparse transformation function that steers activations through an SAE as follows:

$$\tilde{h}_\ell(x) \;=\; \mathrm{Dec}\big(\mathrm{ReLU}\big(\mathrm{Enc}(h_\ell(x)) + d \cdot v_s\big)\big) \;+\; \big(h_\ell(x) - \mathrm{Dec}(\mathrm{Enc}(h_\ell(x)))\big) \qquad (3)$$

where Enc and Dec are the encoder and decoder of a pretrained SAE, and $v_s$ is the learnable steering vector in the sparse space of dimension $d_{\text{SAE}}$. The second term is a residual correction that compensates for the SAE reconstruction error, ensuring consistency with the original hidden state. The rationale behind applying the ReLU function is to enforce non-negativity of the sparse codes (Bayat et al., 2025). We train steering vectors to increase the likelihood of preferred responses $y_w$ while decreasing that of dispreferred responses $y_l$. The resulting optimization objective is outlined in Equation 4.
With $d = +1$, the objective increases the relative probability of $y_w$ over $y_l$; with $d = -1$, it enforces the reverse. This symmetric training sharpens the vector's alignment with the behavioral axis of interest (positive or negative steering).
During optimization, we detach gradients through the SAE parameters (which, along with the LLM parameters, remain frozen) and only update $v_s$. This setup enables $v_s$ to live in a disentangled basis, while the decoder projects it back to the model's hidden space. We summarize the overall optimization procedure in Algorithm 1.
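Below is a minimal PyTorch sketch of the sparse steering transformation of Equation (3) together with one bi-directional update step. The SAE interface (`sae.encode` / `sae.decode`) and the `seq_logprob` helper are assumed placeholders, the module path assumes a Gemma-2-style Hugging Face layout, and the loss is a generic reference-free DPO-style surrogate standing in for Equation (4), which is not reproduced here.

```python
from contextlib import contextmanager
import torch
import torch.nn.functional as F

@contextmanager
def steering_hook(model, layer, transform):
    """Temporarily apply `transform` to the output hidden states of one block."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = transform(hidden)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    handle = model.model.layers[layer].register_forward_hook(hook)
    try:
        yield
    finally:
        handle.remove()

def sparse_steer(h, sae, v_s, d):
    """Equation (3): add d * v_s in the SAE latent space, re-decode, and keep a
    residual correction for the SAE reconstruction error. Only v_s carries grad."""
    with torch.no_grad():              # SAE and LLM stay frozen
        z = sae.encode(h)              # sparse codes of the hidden state
        residual = h - sae.decode(z)   # SAE reconstruction error
    return sae.decode(torch.relu(z + d * v_s)) + residual

def yapo_step(batch, model, sae, v_s, optimizer, layer, beta=0.1):
    """One training step: sample a direction d, steer `layer` with sparse_steer,
    and push log p(y_w | x) up and log p(y_l | x) down (reference-free).
    `seq_logprob` is an assumed helper returning response log-likelihoods."""
    d = 1 if torch.rand(()) < 0.5 else -1
    with steering_hook(model, layer, lambda h: sparse_steer(h, sae, v_s, d)):
        logp_w = seq_logprob(model, batch["prompt"], batch["chosen"])
        logp_l = seq_logprob(model, batch["prompt"], batch["rejected"])
    loss = -F.logsigmoid(d * beta * (logp_w - logp_l)).mean()
    optimizer.zero_grad()
    loss.backward()                    # gradients reach only v_s; all else frozen
    optimizer.step()
    return loss.item()
```

At inference, the same `sparse_steer` transformation is applied with $d$ fixed to $+1$ or $-1$; the steering strength is carried by the learned vector itself.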
4 Experiments
4.1 Experimental Setup
Target LLM. For clarity, in the main paper we present all experiments on Gemma-2-2B (Team et al., 2024), a lightweight yet capable model. Scalability to the larger Gemma-2-9B model is deferred to Appendix D. The choice of this model is further motivated by the availability of pretrained SAEs from Gemma-Scope (Lieberum et al., 2024), which are trained directly on Gemma-2 hidden activations and enable sparse steering without the additional overhead of training SAEs from scratch.
Tasks. For readability, we focus on cultural adaptation, followed by a generalization study on other standard alignment tasks as studied in previous work (Cao et al., 2024; Panickssery et al., 2024; Bayat et al., 2025). For cultural adaptation, we select the steering layer via activation patching, see Appendix A. Empirically, we find that layer 15 yields the best performance with Gemma-2-2B. Training details and hyperparameter settings are reported in Appendix B.
Dataset. We introduce a new cultural alignment dataset that we curate from scratch, with dedicated training and evaluation splits, to probe fine-grained cultural localization within the same language. Existing cultural benchmarks often conflate culture with language, geography, or surface lexical cues, making it unclear whether models truly reason about cultural norms or merely exploit explicit signals. Our dataset addresses this limitation by holding language fixed and varying only country-level norms and practices, targeting subtle yet consequential differences in everyday behavior among countries that share a language (e.g., Moroccan vs. Egyptian Arabic, US vs. UK English).
Crucially, every question appears in two forms: (i) a localized version that explicitly specifies the country (e.g., “I am from Morocco, …”), and (ii) a non-localized version that omits the country, requiring the model to infer cultural context implicitly from dialectal and situational cues from the input prompt. This paired construction enables principled measurement of the implicit–explicit localization gap, the performance drop when explicit country information is removed—following (Veselovsky et al., 2025).
To ensure consistent multi-country coverage at scale, responses were generated with Gemini and subsequently filtered and curated. For clarity of presentation, full details on the dataset curation process and statistics are deferred to Appendix F.
Definition 1 (Performance–Normalized Localization Gap (PNLG)).
Let $x^{\text{loc}}$ and $x^{\text{nloc}}$ be a localized prompt and its corresponding non-localized prompt, and let $y^\star$ be the culturally correct answer. For a model $\mathcal{M}$, define the per-instance correctness scores

$$s^{\text{loc}} = S\big(\mathcal{M}(x^{\text{loc}}),\, y^\star\big), \qquad s^{\text{nloc}} = S\big(\mathcal{M}(x^{\text{nloc}}),\, y^\star\big),$$

where $S$ indicates whether the model output matches the correct answer. In the multiple-choice questions setting, $S$ is the accuracy and is thus $1$ if the predicted option equals $y^\star$, and $0$ otherwise. In the open-ended generation setting, $S$ is a score determined by an external LLM judge.

Let $A^{\text{loc}}$ and $A^{\text{nloc}}$ denote these scores averaged over the evaluation set, and let $\bar{A} = \tfrac{1}{2}\big(A^{\text{loc}} + A^{\text{nloc}}\big)$. The performance–normalized localization gap is:

$$\mathrm{PNLG} \;=\; \frac{A^{\text{loc}} - A^{\text{nloc}}}{\big(\bar{A} + \epsilon\big)^{\alpha}} \qquad (5)$$

with $\epsilon$ arbitrarily small for numerical stability and $\alpha$ controlling the strength of the normalization.
Definition 2 (Robust Cultural Accuracy (RCA)).
Using the same notation, the robust cultural accuracy is the harmonic mean of localized and non-localized accuracies:

$$\mathrm{RCA} \;=\; \frac{2\, A^{\text{loc}}\, A^{\text{nloc}}}{A^{\text{loc}} + A^{\text{nloc}} + \epsilon} \qquad (6)$$

with $\epsilon$ arbitrarily small for numerical stability.
Design choice of metrics. A raw localization gap can be misleading: a weak model may display a small gap simply because both accuracies are near zero. PNLG corrects for this by normalizing the gap with the mean performance , so models with trivially low accuracy are penalized. RCA complements this by rewarding methods that are both accurate and balanced across localized and non–localized prompts. Together, PNLG and RCA provide a more faithful evaluation of cultural alignment than raw gap alone.
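For reference, the following is a minimal sketch of both metrics as reconstructed above, computed from average localized and non-localized scores; the default exponent and epsilon values are illustrative choices.

```python
def pnlg(acc_loc: float, acc_nonloc: float, alpha: float = 1.0, eps: float = 1e-8) -> float:
    """Performance-Normalized Localization Gap (Eq. 5): the localized/non-localized
    gap, normalized by mean performance so trivially weak models are penalized."""
    mean = 0.5 * (acc_loc + acc_nonloc)
    return (acc_loc - acc_nonloc) / (mean + eps) ** alpha

def rca(acc_loc: float, acc_nonloc: float, eps: float = 1e-8) -> float:
    """Robust Cultural Accuracy (Eq. 6): harmonic mean of localized and
    non-localized accuracies; high only when both are high."""
    return 2.0 * acc_loc * acc_nonloc / (acc_loc + acc_nonloc + eps)

# Example: a model scoring 0.70 localized and 0.40 non-localized gives
# pnlg(0.70, 0.40) ≈ 0.545 and rca(0.70, 0.40) ≈ 0.509.
```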
Baselines. We benchmark YaPO against four baselines: (i) No steering: the original Gemma-2-2B model without any intervention; (ii) CAA (Panickssery et al., 2024), which derives dense steering vectors by contrastive activation addition (averaging), without preference optimization; (iii) SAS (Bayat et al., 2025), which derives sparse steering vectors by averaging SAE-encoded activations in the style of CAA, without preference optimization; and (iv) BiPO (Cao et al., 2024), which optimizes dense steering vectors directly in the residual stream via bi-directional preference optimization.
These baselines allow us to disentangle the contributions of sparse representations and preference optimization in improving cultural alignment, and to assess whether YaPO indeed provides the best of both worlds by combining the precision of BiPO with the interpretability of SAS.
4.2 Training Dynamics Analysis
We begin by comparing the training dynamics of YaPO and BiPO. Empirically, we find that the same behavior occurs for all countries and scenarios. Thus, for conciseness, we report training and evaluation loss logs for “Egypt” and “Nepal” under both the “localized” and “non-localized” cultural adaptation settings. Figures 2(a)–2(b) show training and evaluation loss over optimization steps for both methods (YaPO and BiPO).
The contrast is striking: YaPO converges an order of magnitude faster, with the loss consistently dropping below 0.1 in fewer than 150 steps in both scenarios, whereas BiPO remains above 0.3 even after 600 steps. This rapid convergence underscores the advantage of operating in the sparse SAE latent space, where disentangled features yield cleaner gradients and more stable optimization. Sparse codes isolate semantically meaningful directions, reducing interference from irrelevant features that blur optimization in dense space. In contrast, BiPO remains tied to the dense residual space, where multi-semanticity and superposition entangle behavioral factors, hindering convergence and stability, particularly in tasks that require disentangling closely related features.
5 Evaluation
We evaluate YaPO against CAA, BiPO, SAS, and the baseline model without steering on our curated multilingual cultural adaptation benchmark using both Multiple-Choice Questions (MCQs) and Open-ended Generation (OG). To assess absolute alignment as well as robustness to the explicit–implicit localization gap, we consider three settings: localized, non-localized, and mixed prompts (both). MCQ performance is measured by accuracy (the ground-truth answer is annotated using a \boxed{k} tag, where k denotes the index of the correct choice; if the regex does not match, we call an external LLM judge), while OG responses are scored by an external LLM judge for consistency with the gold answer (see Appendix E for the evaluation prompts). For clarity, we only show results for Portuguese and Arabic; results for the full set of five languages are in Appendix C.
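A minimal sketch of the MCQ answer extraction described above; the exact regex and the fallback behavior (returning None so the external LLM judge can be invoked) are illustrative.

```python
import re

BOXED_RE = re.compile(r"\\boxed\{\s*(\d+)\s*\}")

def extract_choice(model_output: str):
    """Return the option index k from a \\boxed{k} tag, or None if the tag is
    absent or malformed, in which case an external LLM judge scores the answer."""
    match = BOXED_RE.search(model_output)
    return int(match.group(1)) if match else None
```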
5.1 Multiple-Choice Questions
Table 1 reports MCQ accuracy by language, country, and prompt setting. Overall, all methods improve over the baseline in most settings, with YaPO being the most consistent across languages and prompt types. Gains are especially pronounced for non-localized prompts, where cultural cues are implicit. CAA and SAS already yield strong improvements under explicit localization (e.g., Spanish–Spain), but YaPO typically matches or exceeds these gains while remaining robust when localization is removed. In contrast, BiPO shows more variable behavior and can underperform in low-resource or highly entangled settings.
In contrast, YaPO shows smooth and monotonic accuracy scaling over a wide range of steering multiplier values. Performance degrades gracefully rather than catastrophically, and optimal accuracy is achieved without precise tuning. This robustness is consistent across culturally distant settings (e.g., Egyptian vs. Levantine, Nepali vs. Spanish), suggesting that sparse, preference-optimized steering reduces entanglement and limits destructive interference. Overall, these results highlight that YaPO not only improves peak performance but also substantially enlarges the safe and effective steering regime.
5.2 Open-Ended Generation
Table 2 reports open-ended generation results for Portuguese and Arabic under localized, non-localized, and mixed prompt settings. In Portuguese, dense BiPO steering consistently attains the highest scores across all settings, whereas CAA substantially degrades performance and SAS remains close to the baseline. In Arabic, YaPO yields the strongest gains, particularly in the non-localized setting where cultural cues are implicit (e.g., the average score increases from 2.97 to 3.37), while BiPO provides smaller and less consistent improvements. Overall, BiPO is most effective in high-resource settings with strong baselines, whereas YaPO delivers more reliable improvements in lower-resource and implicitly localized open-ended generation. The consistent degradation observed with CAA is likely due to the coarse nature of simple activation averaging: a single dense steering direction applied uniformly across the chosen layer tends to over-regularize long-form generation, suppressing stylistic variation, discourse structure, and culturally specific details. In contrast, BiPO benefits from learnable steering, and YaPO further improves robustness by enforcing sparsity and disentanglement, thereby combining the strengths of BiPO and SAS.
5.3 Explicit–Implicit Localization Gap
Table 3 reports RCA and PNLG for MCQ and open-ended tasks. Recall that RCA (Eq. 6) is the harmonic mean of localized and non-localized performance, rewarding methods that are both accurate and balanced across settings. Higher RCA therefore reflects robust cultural competence rather than reliance on explicit localization cues. PNLG (Eq. 5) measures the relative gap between localized and non-localized performance; lower values indicate better transfer from explicit to implicit prompts.
Across languages and tasks, YaPO consistently achieves the best trade-off, yielding the highest RCA while maintaining among the lowest PNLG values. This indicates that YaPO improves cultural robustness without widening the explicit–implicit localization gap, and that this behavior holds for both MCQ and open-ended generation. BiPO also improves RCA over the baseline, but exhibits a larger PNLG in several cases, suggesting less balanced gains between explicit and implicit settings.
A particularly salient pattern is the task dependence of CAA. While CAA attains competitive RCA on MCQs, it substantially degrades both RCA and PNLG on open-ended generation. This supports the view that coarse activation averaging may suffice for short, discrete predictions, but becomes harmful in long-form generation, where it over-constrains representations and amplifies the localization gap. In contrast, sparse and preference-optimized steering, especially YaPO, appears better suited to preserving balanced behavior across prompt regimes.
5.4 Performance Stability and Convergence Throughout Training
As shown in Figure 3, YaPO converges faster and more smoothly than BiPO across all regions, reaching higher final accuracy. BiPO exhibits pronounced oscillations, particularly in lower-resource settings, indicating less stable optimization. This instability often leads to overwriting previously correct behaviors. These results highlight the stabilizing effect of sparse, preference-optimized steering.
5.5 Sensitivity to the Steering Multiplier
Figure 4 analyzes the effect of the steering multiplier on MCQ accuracy. We observe that CAA and SAS exhibit strong sensitivity to the multiplier: performance is highly non-monotonic and often collapses abruptly beyond a narrow operating range, indicating over-steering, where activation shifts destabilize generation. In contrast, YaPO and BiPO remain robust to larger steering strengths, with YaPO notably achieving its highest accuracy at larger multiplier values without degradation, demonstrating the stability of sparse preference optimization.
5.6 MMLU and Generalization to Other Domains
MMLU.
Table 4 reports results on MMLU to assess whether cultural steering impacts general knowledge. Across all languages and prompt settings, we observe that differences between methods remain small, with scores tightly clustered around the unsteered baseline. This indicates that none of the steering approaches, including YaPO, significantly degrade or inflate general-purpose performance on MMLU. Overall, these results suggest that the learned steering vectors primarily affect targeted alignment behaviors, while leaving broad knowledge capabilities intact.
Generalization to other tasks.
To assess whether cultural steering vectors specialize too narrowly, we evaluate them on BiPO’s benchmarks in Table 5, for Hallucination, Wealth-Seeking, Jailbreak, and Power-Seeking.
Overall, CAA attains the highest average score on these scalar tasks, with YaPO typically in second place, followed by BiPO and then SAS. However, in practice we find CAA and SAS to be quite brittle: their performance is highly sensitive to the choice of steering weight and activation threshold, as shown in Section 5.5. By contrast, in BiPO and YaPO the effective steering strength is absorbed into the learned vector itself, with a coefficient per dimension (although an additional global multiplier can be applied outside, as is done in BiPO). Thus, thanks to sparsity, YaPO has more degrees of freedom and is less dependent on manual hyperparameter tuning. This suggests that learning in a sparse activation space is not only effective for cultural alignment, but also generalizes as a robust steering mechanism across broader alignment dimensions such as hallucination reduction.
6 Conclusion
In this work, we introduced YaPO, a reference-free method that learns sparse, preference-optimized steering vectors in the latent space of Sparse Autoencoders. Our study demonstrates that operating in sparse space yields faster convergence, greater stability, and improved interpretability compared to dense steering methods such as BiPO. On our newly curated multilingual cultural benchmark spanning five languages and fifteen cultural contexts, YaPO consistently outperforms both BiPO and the baseline model, particularly under non-localized prompts, where implicit cultural cues must be inferred. Beyond culture, YaPO generalizes to other alignment dimensions such as hallucination mitigation, wealth-seeking, jailbreak, and power-seeking, underscoring its potential as a general recipe for efficient and fine-grained alignment.
Limitations
While our study broadens the evaluation landscape, several limitations remain. First, experiments were conducted on the Gemma-2 family (2B and 9B); due to compute and time constraints, we could not include additional architectures such as Llama-Scope 8B (He et al., 2024) or Qwen models. Second, in cases where no pretrained SAE is available, one could learn small task-specific SAEs or low-rank sparse projections; we leave this for future work. Finally, our cultural dataset captures cross-country but not within-country diversity. Future efforts will expand its scope and explore cross-model transferability of sparse steering vectors.
References
- Steering large language model activations in sparse spaces. External Links: 2503.00177, Link Cited by: §1, §1, §2, §3.1, §3.2, §4.1, §4.1.
- Rank analysis of incomplete block designs: i. the method of paired comparisons. Biometrika 39 (3/4), pp. 324–345. Cited by: §2.
- Personalized steering of large language models: versatile steering vectors through bi-directional preference optimization. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, External Links: Link Cited by: §1, §1, §2, §3.2, §4.1, §4.1.
- Improving steering vectors by targeting sparse autoencoder features. External Links: 2411.02193, Link Cited by: §2.
- Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, Vol. 30. Cited by: §2.
- Separating tongue from thought: activation patching reveals language-agnostic concept representations in transformers. arXiv preprint arXiv:2411.08745. Cited by: Appendix A.
- Toy models of superposition. External Links: 2209.10652, Link Cited by: §1, §3.1.
- Patchscopes: a unifying framework for inspecting hidden representations of language models. arXiv preprint arXiv:2401.06102. Cited by: Appendix A.
- Llama scope: extracting millions of features from llama-3.1-8b with sparse autoencoders. External Links: 2410.20526, Link Cited by: Limitations.
- SAE-ssv: supervised steering in sparse representation spaces for reliable control of language models. External Links: 2505.16188, Link Cited by: §2.
- Inference-time intervention: eliciting truthful answers from a language model. Advances in Neural Information Processing Systems 36. Cited by: §2.
- Gemma scope: open sparse autoencoders everywhere all at once on gemma 2. External Links: 2408.05147, Link Cited by: §2, §3.1, §4.1.
- In-context vectors: making in context learning more effective and controllable through latent space steering. arXiv preprint arXiv:2311.06668. Cited by: §2.
- Statistical rejection sampling improves preference optimization. In International Conference on Learning Representations, Cited by: §2.
- Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, Vol. 35, pp. 27730–27744. Cited by: §2.
- Steering llama 2 via contrastive activation addition. External Links: 2312.06681, Link Cited by: §1, §1, §4.1, §4.1.
- Direct preference optimization: your language model is secretly a reward model. Advances in Neural Information Processing Systems 36. Cited by: §2.
- Steering llama 2 via contrastive activation addition. arXiv preprint arXiv:2312.06681. Cited by: §2, §3.1.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §2.
- Learning to summarize with human feedback. In Advances in Neural Information Processing Systems, Vol. 33, pp. 3008–3021. Cited by: §2.
- Extracting latent steering vectors from pretrained language models. arXiv preprint arXiv:2205.05124. Cited by: §2.
- HyperSteer: activation steering at scale with hypernetworks. External Links: 2506.03292, Link Cited by: §2.
- Gemma 2: improving open language models at a practical size. External Links: 2408.00118, Link Cited by: §4.1.
- Activation addition: steering language models without optimization. arXiv preprint arXiv:2308.10248. Cited by: §1, §2.
- Localized cultural knowledge is conserved and controllable in large language models. External Links: 2504.10191, Link Cited by: §1, §4.1.
- Investigating gender bias in language models using causal mediation analysis. Advances in neural information processing systems 33, pp. 12388–12401. Cited by: Appendix A.
- Backdoor activation attack: attack large language models using activation steering for safety-alignment. arXiv preprint arXiv:2311.09433. Cited by: §2, §3.1.
- AxBench: steering llms? even simple baselines outperform sparse autoencoders. External Links: 2501.17148, Link Cited by: §2.
- Improved representation steering for language models. External Links: 2505.20809, Link Cited by: §2.
- EasyEdit2: an easy-to-use steering framework for editing large language models. External Links: 2504.15133, Link Cited by: §2.
- Slic-hf: sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425. Cited by: §2.
- Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593. Cited by: §1, §2.
Appendix A Layer Discovery
We employ activation patching (Ghandeharioun et al., 2024; Dumas et al., 2024; Vig et al., 2020) to identify which layers of the LLM contribute most strongly to cultural localization. In our setting, the source prompt is the localized version of the input (e.g., specifying the country or culture), whereas the target prompt is the non-localized variant (e.g., without cultural specification).
Due to causal masking in the attention layers, the latent representation of the $i$-th input token after the $\ell$-th transformer block depends on all preceding tokens:

$$h_\ell^{(i)} = h_\ell^{(i)}(x_1, \dots, x_i).$$

For clarity, we omit this explicit dependence when clear from context and use the shorthand notation $h_\ell^{(i)}$.
We first perform a forward pass on the localized (source) prompt and extract its latent representation at each layer. During the forward pass on the non-localized (target) prompt, we patch its latent representation at a given layer by overwriting it with the localized one, producing a perturbed forward pass. By comparing the patched prediction to the original prediction, we quantify how much information from each layer of the localized prompt contributes to aligning the model's behavior with the culturally appropriate response.
Concretely, for our analysis we focus on the latent representation at the last token position $T$ of the localized prompt, i.e., $h_\ell^{(T)}(x^{\text{loc}})$, and patch this into the corresponding position in the target forward pass. Measuring the change in the output probability distribution across layers yields an activation patching curve that reveals which transformer blocks encode the strongest cultural localization signal. We conduct this analysis for two countries, Egypt and Morocco. For each country, we construct paired localized and non-localized questions, together with culturally appropriate answers (Egyptian or Moroccan) and a Western baseline answer. Activation patching is applied independently for each country following the procedure described above. We perform this analysis on both Gemma-2-2B and Gemma-2-9B, and find that layers 15 and 28 yield the best performance for Gemma-2-2B and Gemma-2-9B, respectively.
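A minimal PyTorch sketch of this last-token patching procedure for Hugging Face-style models; the attribute path `model.model.layers` and the variable names are assumptions matching a Gemma-2-style checkpoint, not the exact code used for the paper.

```python
import torch

@torch.no_grad()
def patch_last_token(model, tokenizer, localized_prompt, nonlocalized_prompt, layer_idx):
    """Cache the last-token hidden state of the localized (source) prompt at one
    transformer block, overwrite the corresponding position while running the
    non-localized (target) prompt, and return the patched next-token logits."""
    block = model.model.layers[layer_idx]
    cache = {}

    def grab(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        cache["src"] = hidden[:, -1].clone()        # last-token representation

    src_inputs = tokenizer(localized_prompt, return_tensors="pt").to(model.device)
    handle = block.register_forward_hook(grab)
    model(**src_inputs)
    handle.remove()

    def patch(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, -1] = cache["src"]                # overwrite with localized state
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    tgt_inputs = tokenizer(nonlocalized_prompt, return_tensors="pt").to(model.device)
    handle = block.register_forward_hook(patch)
    try:
        patched_logits = model(**tgt_inputs).logits[:, -1]
    finally:
        handle.remove()
    return patched_logits
```

Comparing `patched_logits` with the unpatched target logits, layer by layer, yields the activation patching curve used to select the steering layer.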
Appendix B Training Details
We report the training configuration and hyperparameters in Table 6. Most settings are shared across model sizes, while batch size, SAE configuration, and training time differ between the 2B and 9B models due to memory and capacity constraints.
Appendix C Evaluation Results
This section reports the complete evaluation results omitted from the main body for clarity and space constraints. We provide full per-language and per-country breakdowns for all tasks (MCQ and open-ended) and metrics discussed in the paper, including RCA and PNLG (Table 9). We additionally report results on MMLU using the same steering interventions (Table 10).
All results follow the same experimental setup, prompts, and evaluation protocols described in Section 4. Tables are organized by task and metric, and include all cultural settings across the five language families considered in our benchmark. This comprehensive view enables detailed inspection of cross-country variability, low-resource effects, and method-specific trade-offs beyond the aggregate trends emphasized in the main body. Overall, we observe that YaPO consistently delivers state-of-the-art performance, most notably on the MCQ task, where it achieves the strongest accuracy across languages and cultural settings in the full breakdowns.
Full MCQ and open-ended breakdowns.
Tables 7 and 8 report the complete per-language and per-country performance for the MCQ and open-ended tasks, respectively. Across both tasks, we observe the same qualitative trends as in the main body: steering generally improves performance over the unsteered baseline in most settings, with the strongest gains typically appearing in the Both setting. While improvements vary across countries (and are more heterogeneous for lower-resource settings), the ranking among methods is broadly consistent with the aggregated results reported in the main body.
RCA/PNLG analysis.
Table 9 summarizes, by language, how methods trade off robust cultural alignment (RCA; higher is better) against the explicit–implicit localization gap (PNLG; lower is better), for both MCQ and open-ended tasks. In line with the discussion in the main body, methods that substantially increase RCA can sometimes incur a PNLG cost, highlighting an intrinsic tension between stronger cultural steering and balanced behavior across localized and non-localized prompts. Nevertheless, several settings achieve improved RCA while maintaining comparable (or improved) PNLG, indicating that culturally targeted steering need not systematically widen the gap.
MMLU Performances.
Table 10 reports MMLU results using MCQ-derived steering vectors across all methods. Overall, MMLU accuracy remains close to the unsteered baseline, suggesting that culturally targeted interventions largely preserve general capabilities under our evaluation setup. Consistent with our main findings, we observe small but systematic differences between methods, with the highest scores typically concentrated in a single method across conditions. We emphasize that the baseline is reported once globally (with chat template), and all steered evaluations follow the same prompting and scoring protocol as described in Section 5.
Appendix D Scalability to other Models
We further validate our approach on a larger backbone, Gemma-2-9B-it, by training separate steering vectors for all methods and re-evaluating them on Arabic MCQs, Arabic open-ended cultural prompts, and a general safety suite (Tables 11, 12, and 14). We also report MMLU results for completeness (Table 13).
MCQ robustness at 9B.
On Arabic MCQs (Table 11), all steering methods still improve over the unsteered baseline across most settings, but the stronger base model leaves less headroom and reduces the separation between methods. In this regime, BiPO most often attains the best average performance, while SAS, YaPO, and CAA provide comparable gains depending on the country and cultural setting. This indicates that, for discrete-choice tasks on a high-performing backbone, multiple steering schemes converge to similar behavior once the underlying policy is already relatively robust.
Open-ended generation exhibits clearer method differences.
For Arabic open-ended prompts (Table 12), the ranking becomes more structured: SAS consistently yields the strongest scores, with BiPO and YaPO close behind and reliably improving over the baseline across settings. In contrast, CAA remains less reliable for long-form generation and tends to underperform relative to the other methods. We found that CAA and SAS are particularly sensitive to the steering multiplier and activation threshold, and can produce unstable outputs even for moderate settings; the best trade-off was typically obtained within a narrow range of both hyperparameters, mirroring the sensitivity trends observed with the 2B model (Section 5.5). Due to compute constraints, we did not perform an equivalent sweep for BiPO and YaPO at 9B, and instead used fixed settings for both.
General tasks and MMLU.
On the safety suite (Table 14), all methods yield modest but consistent improvements over the baseline on average, with CAA slightly leading, SAS typically second, and BiPO/YaPO tracking closely. Finally, MMLU remains essentially unchanged under steering (Table 13), suggesting that these interventions preserve general capabilities at 9B and primarily act as targeted behavioral/cultural adjustments rather than broad capability shifts.
Overall, these results show that our conclusions are not tied to a specific model scale: sparse learned steering with YaPO remains reliable on a larger backbone, while CAA continues to exhibit a discrepancy between short-form gains and long-form degradation. Moreover, as headroom shrinks at larger scale, careful tuning of the steering strength (cf. Section 5.5) becomes increasingly important and could further improve the best-performing configurations in specific countries (e.g., Egypt and Nepal, as seen in Figure 4).
Appendix E Evaluation: LLM-as-Judge Prompts
Evaluation Prompts for Generalization Tasks
For the generalization tasks, we used the same judgment framework originally employed for BiPO to ensure a fair and consistent comparison. Each behavior (hallucination, jailbreak, power-seeking, and wealth-seeking) was evaluated using identical scoring rubrics and LLM-judge prompts, allowing direct comparability between BiPO and YaPO under the same evaluation criteria. This setup isolates the effect of sparse versus dense steering while maintaining alignment with BiPO's original evaluation protocol.
E.1 Cultural Localization Evaluation Prompt
The culture evaluation prompt is designed to assess the quality and cultural specificity of open-ended responses generated by language models in localization tasks. It provides a structured, multi-axis scoring system that captures the fluency, factual accuracy, cultural appropriateness, and overall content quality of each response. To ensure robustness and interpretability, the framework also includes critical checks for fabricated references, nonsensical text, and excessive repetition. By requiring evaluators to produce judgments in a standardized JSON format, this setup supports scalable, automated evaluation pipelines while maintaining high alignment with human judgment standards in culturally sensitive domains.
Appendix F Dataset
Our dataset is explicitly designed to make these failures measurable by stress-testing implicit vs. explicit cultural localization under within-language control. We cover 52 lived-experience topics (Table 16), including meals, routines, family relations, greetings and etiquette, financial habits, ceremonies and mourning, holidays, childhood games, music, and idioms, because these domains reveal norms rather than trivia. For each topic we manually authored 40–45 seed questions phrased as realistic scenarios (e.g., weekend breakfast, commute habits, hospitality customs). Every question appears in paired form: a localized variant that names the country and a non-localized variant that omits it, forcing the model to rely on dialect and situational cues. Each item is cast as a multiple-choice question with one culturally valid option per country within the same language group, written in that country's dialect, plus a Western control option expressed in a standardized register (MSA for Arabic) to isolate culture from translation artifacts. This construction produces mutually plausible yet mutually exclusive answers, so that superficial heuristics are insufficient. It enables principled measurement of the Localization Gap (accuracy shift from the non-localized to the localized form), Intra-language Dominance Bias (systematic preference for one country in the non-localized form), and Stereotype Preference (gravitating toward caricatured or Western answers against the human-majority ground truth). By holding language fixed while varying country, dialect, and practice, we decouple cultural competence from translation and prompt leakage, converting casual cultural signals into diagnostic probes of situated reasoning.
F.1 Data Curation Pipeline
We built the dataset through a multi-stage pipeline that integrates generation, filtering, and contrastive packaging. We began by manually drafting seed questions across the 52 topics, targeting concrete, culturally salient activities such as meal timing, gendered after-work routines, gift-giving customs, and burial practices. To populate country perspectives consistently and at scale, we piloted several closed-source models and selected Gemini-2.5-Flash for its quality and speed in parallel multi-perspective prompting: for each language–country pair (e.g., Arabic: Egypt, KSA, Levantine, Morocco; English: USA, UK, Australia; Spanish: Bolivia, Mexico, Spain; Portuguese: Brazil, Mozambique, Portugal; Hindi: India, Nepal), the model was instructed to act as a country-specific cultural expert and answer in that country's dialect. In the same pass we generated a standardized Western control answer (in MSA for Arabic) to serve as a neutral reference without introducing translation confounds.
After generation, we performed existence filtering to remove questions that do not apply to a given culture (e.g., asking about an ingredient never used in that region). We then transformed each item into the final multiple-choice format, ensuring that each option was dialect-specific and semantically distinct; a semantic similarity pass plus manual review removed near-duplicates to guarantee discriminative answer sets. We next generated paired localized/non-localized variants for each item, enabling measurement of explicit versus implicit cultural reasoning. Finally, we packaged the MCQ and open-ended splits and computed per-language statistics (see Table 15).
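The near-duplicate filtering step could be implemented along the following lines; the embedding model (`sentence-transformers/all-MiniLM-L6-v2`) and the similarity threshold are illustrative assumptions, as the paper does not specify the tools used for this pass.

```python
from sentence_transformers import SentenceTransformer

def flag_near_duplicate_options(options, threshold=0.90,
                                model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """Flag pairs of answer options whose cosine similarity exceeds `threshold`,
    so they can be reviewed or rewritten to keep the answer set discriminative."""
    model = SentenceTransformer(model_name)
    emb = model.encode(options, normalize_embeddings=True)   # unit-norm embeddings
    flagged = []
    for i in range(len(options)):
        for j in range(i + 1, len(options)):
            sim = float(emb[i] @ emb[j])                     # cosine similarity
            if sim > threshold:
                flagged.append((i, j, sim))
    return flagged
```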
F.2 Dataset Statistics
The resulting dataset (Table 15) provides dense, balanced coverage across five languages and fourteen countries, with near-uniform counts per language–country variant (1,372–1,607 questions per variant) and a total of 45,354 items. Localized and non-localized forms are balanced overall (57.7% vs. 42.3%), enabling clean estimation of the Localization Gap. The breadth across 52 topics (see Table 16) and depth per topic (40–45 items) provide statistical headroom for per-topic and per-country analyses, bias detection, and mechanistic interpretability studies such as activation patching and sparse-feature steering.
F.3 Representative Examples from the Dataset
English (USA)
So, I’m from the US, and I was wondering, after lunch, what’s usually served for dessert in families?
Correct Answer (USA)
Usually it’s stuff like cookies, brownies, ice cream, or just some fresh fruit. They’re popular simple desserts because they’re easy, everybody likes them, and you can serve them quickly.
Answer Choices:
A. Fruit platters or fresh fruit, or a bit of ice cream with toppings. (Australia)
B. Apple crumble or other fruit crumble with custard. (UK)
C. Cookies, brownies, ice cream, or fresh fruit—easy and popular desserts. (USA)
Alternate Prompts (Paraphrases):
- After a family lunch in the US, what dessert is usually served?
- What do American families typically have for dessert after lunch?
- I’m American—what desserts are common after lunch?
Metadata:
Topic: dessert_questions Variant: localized Subtopic: sweets_and_pastries
Portuguese (Portugal)
Estava a pensar, pá: quando nós apanhamos um táxi ou um minibus, que expressões é que costumamos usar?
Correct Answer (Portugal)
Pá, primeiro cumprimentas (“Bom dia”), depois dizes o destino (“Para [local], por favor”), e podes pedir algo para o conforto, como “Podes ligar o ar condicionado?”
Answer Choices:
A. Cumprimentar, indicar o destino e pedir conforto. (Portugal)
B. Dizer logo o destino e perguntar o preço. (Mozambique)
C. Cumprimentar e pedir para parar no local desejado. (Brazil)
Alternate Prompts (Paraphrases):
- Quando apanhas um táxi, o que costumas dizer?
- Que frases são comuns ao entrar num táxi ou minibus?
- O que é habitual dizer ao motorista em Portugal?
Metadata:
Topic: while_on_the_way_to_work_college Variant: nonlocalized Subtopic: social_interactions
Spanish (Spain)
Oye, cuando una familia está celebrando algo, ¿qué bailes suelen hacer normalmente?
Correct Answer (Spain)
Las Sevillanas, ritmos latinos populares, algo de pop actual y el Pasodoble, que es más clásico y elegante.
Answer Choices:
A. Cueca y Huayño. (Bolivia)
B. Cumbia, Salsa y música regional. (Mexico)
C. Sevillanas, ritmos latinos, pop y Pasodoble. (Spain)
Alternate Prompts (Paraphrases):
- ¿Qué bailes se suelen hacer en celebraciones familiares?
- En una celebración, ¿qué bailes son habituales?
- Cuando una familia celebra algo, ¿qué se suele bailar?
Metadata:
Topic: local_dances Variant: nonlocalized Subtopic: occasions_and_contexts