- The Problem. Dense steering vectors entangle multiple behaviors in single neurons. When you try to steer for one trait, you accidentally affect others. This makes fine-grained alignment unstable and unpredictable.
- The Solution. YaPO learns steering vectors in Sparse Autoencoder (SAE) latent space. Each SAE feature corresponds to a single concept, so steering becomes disentangled and precise.
- The Results. 10x faster convergence than BiPO. Best cultural alignment on 3 of 4 benchmarks. Smooth optimization without the oscillations that plague dense methods. And it generalizes to hallucination, jailbreak, and power-seeking reduction.
Research Overview
You want to steer your LLM toward a specific behavior. Maybe you want it to be more helpful, less likely to hallucinate, or better aligned with a particular culture's values. The standard approaches either require expensive fine-tuning or rely on brittle prompt engineering.
Activation steering offers a middle path: modify the model's hidden activations at inference time to shift behavior without changing weights. It is lightweight, reversible, and increasingly popular. But current methods have a problem.
Dense steering vectors operate directly on the model's activation space. The issue is that individual neurons encode multiple concepts simultaneously. This phenomenon, called superposition, means that when you push activations in one direction, you are affecting many behaviors at once. Your "be more helpful" vector might also make the model more verbose, less accurate, or culturally biased in unexpected ways.
Superposition: In large language models, individual neurons frequently represent several unrelated features simultaneously. This entangled encoding makes it hard to modify one behavior without inadvertently altering others.
YaPO operates in SAE latent space for disentangled, interpretable steering
YaPO addresses this by moving steering into the latent space of a Sparse Autoencoder. SAEs decompose entangled activations into sparse, monosemantic features. Each feature corresponds roughly to one concept. Steering in this space is like adjusting individual knobs rather than yanking on a tangled bundle of wires.
Sparse Autoencoder (SAE): A neural network that re-encodes activations as a much higher-dimensional code in which only a few dimensions are active for any input. The sparsity forces each code dimension to capture a distinct, often human-interpretable concept, enabling precise manipulation of model behavior.
The results: 10x faster training convergence, better performance on cultural alignment benchmarks, and stable optimization that does not oscillate wildly. The approach also generalizes beyond its training domain to other alignment tasks.
The Problem with Dense Steering
Current activation steering methods fall into two categories:
Contrastive methods (CAA, SAS) compute steering vectors by averaging activation differences between contrastive prompts. "What would a helpful response look like?" versus "What would an unhelpful response look like?" The difference becomes your steering vector.
Learned methods (BiPO) train steering vectors directly from preference data using a DPO-style objective. This captures more nuanced behavioral signals than simple averaging.
DPO-style objective: A loss that directly increases the likelihood of preferred responses while decreasing that of dispreferred ones, using only pairwise preference data. The variant used here sidesteps the need for a separate reference model, reducing memory and compute requirements.
Both approaches work in dense activation space, and both inherit its problems.
Neuron multi-semanticity
A single neuron in an LLM often encodes multiple unrelated concepts. One neuron might activate for both "French language" and "cooking" and "formal tone." This is not a bug; it is how neural networks efficiently pack information into limited dimensions.
Imagine a single telephone line that carries several conversations at the same time, each on a different frequency. When you increase the volume on that line, you amplify every conversation, not just the one you care about. A quiet chat about recipes becomes a shouting match about diplomacy.
When you add a dense steering vector to activations, you affect all the concepts encoded in each neuron simultaneously. For instance, if neuron #237 in Gemma-2-2B fires both when the model generates a French phrase and when it mentions cooking techniques, nudging that neuron upward to increase "helpfulness" may cause the model to produce more French sentences and culinary references even for technical questions. Your cultural alignment vector ends up affecting formality, topic preferences, and response length in ways you did not intend.
Training instability
BiPO learns dense steering vectors through gradient descent. But because neurons are multi-semantic, the gradient signal is noisy. Improving one behavior can degrade another. The optimization landscape has local minima where the model gets stuck with partially aligned behavior.
In practice, BiPO training shows pronounced oscillations. Accuracy goes up, then crashes, then partially recovers. The final result depends heavily on when you stop training and which checkpoint you keep.
The cultural alignment challenge
Cultural alignment is a particularly demanding test case. You want the model to respond appropriately for Egyptian users versus Moroccan users versus Portuguese users. These cultures share many values but differ in subtle ways.
Dense steering struggles here because the cultural concepts are closely related in activation space. Pushing toward "Egyptian culture" inevitably affects nearby "Moroccan culture" representations. You cannot isolate the target behavior.
How YaPO Works
YaPO introduces three key innovations: operating in SAE latent space, preference-based optimization, and residual correction for reconstruction error.
Sparse Autoencoder projection
Instead of learning a steering vector in the model's dense activation space, YaPO learns it in the latent space of a pretrained Sparse Autoencoder.
SAEs are trained to decompose activations into sparse, interpretable features. A typical SAE might expand 4096-dimensional activations into 65536 sparse features, where only ~100 are active for any given input. These features tend to be monosemantic: each corresponds to a single concept or behavior.
The steering process works as follows:
- Encode: Project activations into SAE latent space
- Steer: Add the learned sparse vector to the sparse codes
- Decode: Project back to activation space
- Correct: Add residual to account for SAE reconstruction error
The residual correction is important. SAEs do not perfectly reconstruct their inputs. Without correction, steering would introduce artifacts from reconstruction error.
Picture a sculptor who first carves a rough marble block (the SAE reconstruction) and then wants to add a tiny detail, like a feather on a bird's wing. If the sculptor simply glues a new piece onto the rough block, the mismatch in texture shows. Instead, the sculptor first smooths the whole surface (the residual correction) so the added feather blends seamlessly with the original stone.
In practice, after encoding and decoding a paragraph about climate policy, the SAE might drop "carbon" and replace "emissions" with "pollution." Without residual correction, the steered output reads "reduce pollution..." instead of "reduce carbon emissions..." YaPO adds back the difference between original activations and SAE reconstruction, preserving exact wording while applying the behavioral shift.
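As a compact sketch of this pipeline (assuming hypothetical `sae.encode`/`sae.decode` helpers and a learned sparse vector `v`; the full training and inference pseudocode appears under Implementation Notes):

```python
from torch.nn.functional import relu

def apply_sparse_steering(a, sae, v, multiplier=1.0):
    """Encode -> steer -> decode -> correct (sketch; the `sae` helpers are assumed)."""
    z = sae.encode(a)                       # project activations into sparse SAE features
    z_steered = relu(z + multiplier * v)    # nudge the sparse codes with the learned vector
    a_steered = sae.decode(z_steered)       # map back to the model's activation space
    residual = a - sae.decode(z)            # what the SAE failed to reconstruct
    return a_steered + residual             # add it back so only the intended shift remains
```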
Preference optimization objective
YaPO uses a BiPO-style objective but operates on sparse codes rather than dense activations:
The objective increases the likelihood of preferred responses while decreasing the likelihood of dispreferred responses. The bi-directional term (randomly flipping the steering direction during training) ensures the learned vector works for both positive and negative steering.
YaPO is reference-free. Unlike standard DPO, it does not require a reference model. This simplifies training and reduces memory requirements.
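Written out (following the pseudocode given later under Implementation Notes; here $d \in \{-1, +1\}$ is the randomly sampled steering direction, $\beta$ the deviation-control coefficient, $a$ the base activations, and $a_v$ the steered activations), the per-example loss is roughly:

$$
\mathcal{L}(v) = -\log\sigma\!\Big(d\,\beta\,\big[\big(\log \pi(y_w \mid x, a_v) - \log \pi(y_w \mid x, a)\big) - \big(\log \pi(y_l \mid x, a_v) - \log \pi(y_l \mid x, a)\big)\big]\Big)
$$

where $y_w$ and $y_l$ are the preferred and dispreferred responses and $\pi(\cdot \mid x, a)$ denotes the model's likelihood when the target layer's activations are replaced by $a$. The unsteered base activations play the role that a separate reference model plays in standard DPO.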
Why sparsity helps
The sparse representation has several advantages:
Disentanglement: Each SAE feature roughly corresponds to one concept. Think of the SAE's latent space as a wall of light switches in a theater. Each switch controls exactly one spotlight: one for romance, one for suspense, one for comedy. Flipping the romance switch brightens only the love scenes without dimming the thriller or slapstick, because each lamp is wired to a dedicated circuit. Steering one feature does not accidentally affect unrelated features.
Monosemantic features: Features that each encode a single, distinct concept or behavior. In a sparse autoencoder, monosemanticity emerges because only a few active dimensions are allowed, encouraging each to specialize. For example, feature #12,842 in Gemma-2-2B's SAE activates only for date expressions ("April 5, 2024"), while feature #7,315 fires exclusively for monetary amounts ("$3,200").
Interpretability: You can inspect which SAE features the learned vector modifies. This provides insight into what behaviors are being steered.
Efficiency: Sparse vectors require fewer effective parameters. The optimization has fewer degrees of freedom, making convergence faster and more stable.
Robustness: Because steering is localized to specific features, the model's general capabilities remain intact.
Convergence Speed
The most immediate benefit of YaPO is faster training.
Training loss comparison on cultural alignment task
BiPO requires 600+ steps to converge on cultural alignment tasks, and even then it plateaus at a loss floor around 0.3. YaPO reaches target performance in approximately 150 steps, a 10x speedup.
Why the difference
Dense steering vectors have high effective dimensionality. Every neuron can be modified, creating a complex optimization landscape with many local minima.
Sparse steering vectors have low effective dimensionality. Only the active SAE features matter, and their monosemantic nature creates a smoother optimization landscape. Gradients flow more cleanly toward the desired behavior.
The loss floor that BiPO hits reflects entanglement. At some point, improving the target behavior starts degrading correlated behaviors. The optimizer gets stuck trying to balance competing objectives. YaPO avoids this by operating on disentangled features.
Practical implications
Faster convergence means:
- Lower training cost: 10x fewer steps means 10x less compute
- Faster iteration: You can experiment with more configurations
- Reduced overfitting risk: Less time for the model to memorize training data
- Better final performance: The optimizer reaches better solutions
Cultural Alignment Results
The paper introduces a new cultural alignment benchmark spanning 15 cultural contexts across 5 language families. The task: steer the model to respond appropriately for users from specific countries.
MCQ accuracy on cultural adaptation benchmark
| Country | Base | CAA | SAS | BiPO | YaPO |
|---|---|---|---|---|---|
| Portugal | 32.2% | 44.1% | 52.2% | 34.5% | 54.0% |
| Egypt | 36.1% | 44.7% | 37.5% | 42.2% | 50.2% |
| Brazil | 19.9% | 42.0% | 19.9% | 27.3% | 39.1% |
| Morocco | 11.6% | 10.8% | 19.5% | 13.8% | 13.6% |
YaPO achieves the best accuracy on 3 of 4 countries. The one exception (Morocco) has low baseline accuracy for all methods, suggesting the base model has limited Moroccan cultural knowledge to steer.
Explicit vs implicit localization
The benchmark tests two prompt types:
- Localized: Prompts explicitly mention the target culture ("As someone from Egypt...")
- Non-localized: Prompts contain no cultural cues; the model must infer from context
All methods perform better on localized prompts, but the gap varies:
| Method | Localized | Non-localized | Gap |
|---|---|---|---|
| Base | 38.2% | 28.1% | 10.1 |
| CAA | 42.3% | 32.5% | 9.8 |
| BiPO | 41.8% | 33.2% | 8.6 |
| YaPO | 44.6% | 37.8% | 6.8 |
YaPO not only achieves the highest accuracy on both prompt types but also has the smallest gap between them. This indicates more robust cultural alignment that does not depend on explicit prompting.
Generalization to Other Tasks
Cultural alignment vectors were trained on cultural preference data. But do they help with other alignment tasks?
Performance on alignment-related behaviors (higher is better)
The paper tests on BiPO's original benchmarks: hallucination reduction, jailbreak resistance, wealth-seeking, and power-seeking mitigation.
| Task | Base | CAA | SAS | BiPO | YaPO |
|---|---|---|---|---|---|
| Wealth-Seeking | 2.10 | 2.23 | 2.14 | 2.17 | 2.31 |
| Power-Seeking | 1.89 | 2.09 | 1.81 | 1.93 | 2.03 |
| Hallucination | 1.60 | 2.18 | 1.46 | 1.60 | 1.69 |
| Jailbreak | 1.00 | 1.08 | 1.00 | 1.02 | 1.00 |
Higher scores indicate better alignment. CAA achieves the best raw scores on most tasks, but with a caveat: CAA is highly sensitive to hyperparameters. Small changes in steering strength cause performance to collapse.
YaPO consistently improves over the baseline and BiPO while maintaining stability. The gains are smaller than CAA's peaks but more reliable across configurations.
Why cultural steering generalizes
This cross-task transfer is surprising. Why would a vector trained for cultural alignment help with hallucination?
The likely explanation: cultural alignment requires the model to be more grounded and factual about cultural practices. This same "be more grounded" signal helps with hallucination reduction. The sparse representation captures abstract behavioral dimensions that transfer across tasks.
Stability Analysis
Beyond raw performance, YaPO offers improved training stability.
Optimization dynamics
BiPO training curves show pronounced oscillations:
- Accuracy spikes up, then crashes
- Different runs converge to different final values
- Performance is sensitive to learning rate and batch size
YaPO training curves are smooth and monotonic:
- Accuracy increases steadily
- Different runs converge to similar final values
- Performance is robust to hyperparameter choices
Steering multiplier sensitivity
At inference time, you apply the steering vector with a multiplier λ that controls steering strength. How sensitive is performance to this choice?
For YaPO on the Egyptian-culture benchmark, setting λ = 0.5 yields a modest 2% boost in MCQ accuracy, while λ = 1.8 gives the full 5% gain. In contrast, CAA collapses when λ exceeds 0.4: responses become nonsensical, repeating prompts verbatim and dropping cultural cues. YaPO's smooth degradation means even at λ = 2.5 accuracy only falls to 48% (still above the 36% baseline).
| Method | Safe Range | Failure Mode |
|---|---|---|
| CAA | λ < 0.5 | Collapses abruptly |
| SAS | λ < 0.5 | Collapses abruptly |
| BiPO | λ < 1.5 | Gradual degradation |
| YaPO | λ < 2.0+ | Gradual degradation |
CAA and SAS are brittle. Cross a threshold and performance drops catastrophically. YaPO degrades gracefully and achieves optimal performance at higher multipliers (λ = 1.5-2.0), demonstrating that the sparse representation enables stronger steering without destabilization.
MMLU preservation
Does steering degrade general knowledge?
| Configuration | MMLU Accuracy |
|---|---|
| Base (no steering) | 57.58% |
| CAA steering | 56.94-57.29% |
| BiPO steering | 57.39-57.72% |
| YaPO steering | 57.02-57.36% |
All methods keep MMLU accuracy within about 0.7 percentage points of the unsteered baseline. Steering vectors modify targeted behaviors without degrading general capabilities.
Practical Takeaways
When to use YaPO
YaPO is most valuable when you need:
- Fine-grained behavioral control: Distinguishing between closely related behaviors (e.g., different cultures)
- Stable training: Predictable optimization without oscillations
- Interpretable steering: Understanding which features are being modified
- Cross-task generalization: Steering vectors that help beyond their training domain
When dense steering suffices
For coarse behavioral changes (helpful vs unhelpful, safe vs unsafe), dense methods like CAA may be sufficient and simpler to implement. If you just need a rough directional shift, the overhead of SAE projection is not necessary.
Implementation considerations
YaPO requires a pretrained SAE for your target model and layer. SAEs are available for popular models (Gemma, Llama) through projects like Gemma Scope. Training custom SAEs adds significant overhead.
The sparse steering vector has dimension equal to the SAE latent space (typically 16x-64x the activation dimension). Storage is minimal since vectors are sparse, but you need SAE infrastructure for encoding/decoding.
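Because only a small fraction of SAE dimensions end up nonzero, the learned vector can be stored compactly as index/value pairs. A minimal PyTorch sketch (the vector and its active indices here are illustrative):

```python
import torch

# Illustrative learned vector: 65,536 SAE features, only a handful nonzero
v = torch.zeros(65_536)
v[[17, 980, 12_345]] = torch.tensor([0.8, -0.3, 1.2])

# Store only the active coordinates
idx = v.nonzero(as_tuple=True)[0]
vals = v[idx]
torch.save({"indices": idx, "values": vals, "dim": v.numel()}, "steering_vector.pt")

# Rebuild the dense vector when needed for encoding/decoding
state = torch.load("steering_vector.pt")
v_dense = torch.zeros(state["dim"])
v_dense[state["indices"]] = state["values"]
```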
Implementation Notes
Prerequisites
YaPO requires:
- Base LLM: The paper uses Gemma-2-2B-it
- Pretrained SAE: Available from Gemma Scope for Gemma models
- Preference data: (prompt, preferred_response, dispreferred_response) tuples
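For concreteness, one preference tuple for the cultural-alignment setting might look like the following (the field names and content are illustrative, not taken from the paper's dataset):

```python
# A hypothetical cultural-alignment preference example (illustrative only)
example = {
    "prompt": "A guest is visiting your home for dinner. What would you typically serve?",
    "preferred_response": "A typical Egyptian dinner for guests might include mahshi "
                          "(stuffed vine leaves), roasted chicken, and mint tea.",
    "dispreferred_response": "Most hosts serve roast turkey with cranberry sauce and "
                             "pumpkin pie.",
}
```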
Training loop
```python
# Pseudocode for YaPO training (model/sae helper methods and sample_batch are assumed)
import random
import torch
from torch.nn.functional import logsigmoid, relu
from torch.optim import Adam

def train_yapo(model, sae, preference_data, layer, num_steps=150, beta=0.1, lr=1e-3):
    # Initialize the sparse steering vector in SAE latent space
    v = torch.zeros(sae.latent_dim, requires_grad=True)
    optimizer = Adam([v], lr=lr)

    for step in range(num_steps):
        x, y_w, y_l = sample_batch(preference_data)

        # Random direction for bi-directional training
        d = random.choice([-1, 1])

        # Get base activations at the target layer
        a_L = model.get_activations(x, layer=layer)

        # Compute steered activations via the SAE: encode, steer, decode
        z = sae.encode(a_L)
        z_steered = relu(z + d * v)
        a_steered = sae.decode(z_steered)

        # Residual correction for SAE reconstruction error
        a_steered = a_steered + (a_L - sae.decode(z))

        # Reference-free preference loss comparing steered vs. base activations
        log_ratio_w = model.log_prob(y_w, a_steered) - model.log_prob(y_w, a_L)
        log_ratio_l = model.log_prob(y_l, a_steered) - model.log_prob(y_l, a_L)
        loss = -logsigmoid(d * beta * (log_ratio_w - log_ratio_l))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return v
```

Inference
```python
def steer_generation(model, sae, v, prompt, layer, direction=1, multiplier=1.5):
    # Forward hook that applies the sparse steering vector at the target layer
    def steering_hook(module, inputs, output):
        a = output
        z = sae.encode(a)
        z_steered = relu(z + direction * multiplier * v)
        a_steered = sae.decode(z_steered)
        # Residual correction for SAE reconstruction error
        a_steered = a_steered + (a - sae.decode(z))
        return a_steered

    handle = model.layers[layer].register_forward_hook(steering_hook)
    output = model.generate(prompt)
    handle.remove()
    return output
```

Hyperparameters
| Parameter | Recommended Value |
|---|---|
| Learning rate | 1e-3 |
| Training steps | 150-300 |
| Batch size | 32 |
| Beta (deviation control) | 0.1 |
| Steering multiplier (λ) | 1.5-2.0 |
| Target layer | Middle layers (L/2) |
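Putting the pieces together, a minimal end-to-end sketch with these recommended settings (the `load_model_and_sae` loader and `preference_data` are placeholders, not an actual API):

```python
# End-to-end sketch using the pseudocode functions above; loader and data are placeholders
model, sae = load_model_and_sae("gemma-2-2b-it")   # hypothetical loader
layer = model.num_layers // 2                      # middle layer, per the table above

v = train_yapo(model, sae, preference_data, layer=layer,
               num_steps=150, beta=0.1, lr=1e-3)

reply = steer_generation(model, sae, v,
                         "What should I bring when invited to a neighbor's home?",
                         layer=layer, direction=1, multiplier=1.5)
```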
Limitations
SAE dependency
YaPO requires a pretrained SAE for your target model. Quality SAEs exist for Gemma and some Llama variants, but not for all architectures. Training a custom SAE is expensive and requires expertise.
Layer selection
Performance depends on which layer you steer. The paper uses middle layers, but optimal choice varies by task. Grid search over layers adds to development time.
Sparse feature coverage
SAEs do not perfectly capture all concepts. Some behaviors may not have clean corresponding features in the SAE latent space. For these, sparse steering provides less benefit over dense methods.
Cultural benchmark scope
The cultural alignment benchmark covers 15 contexts but is not exhaustive. Performance on other cultures may differ. The benchmark also focuses on explicit cultural questions rather than subtle cultural preferences in open-ended generation.
Computational overhead
SAE encoding/decoding adds latency at inference time. For high-throughput applications, this overhead may be significant. The paper does not report inference timing comparisons.
Original paper: arXiv ・ PDF ・ HTML
Code: GitHub
Authors: Abdelaziz Bounhar, Rania Hossam Elmohamady Elbadry, Hadi Abdine, Preslav Nakov, Michalis Vazirgiannis, Guokan Shang (MBZUAI, Ecole Polytechnique)
Cite this paper
Abdelaziz Bounhar, Rania Hossam Elmohamady Elbadry, Hadi Abdine, Preslav Nakov, Michalis Vazirgiannis, Guokan Shang (2026). YaPO: Sparse Steering Vectors That Actually Work. arXiv 2026.