- The Problem. Dense steering vectors entangle multiple behaviors in single neurons. When you try to steer for one trait, you accidentally affect others. This makes fine-grained alignment unstable and unpredictable.
- The Solution. YaPO learns steering vectors in Sparse Autoencoder (SAE) latent space. Each SAE feature corresponds to a single concept, so steering becomes disentangled and precise.
- The Results. 10x faster convergence than BiPO. Best cultural alignment on 3 of 4 benchmarks. Smooth optimization without the oscillations that plague dense methods. And it generalizes to hallucination, jailbreak, and power-seeking reduction.
Research Overview
You want to steer your LLM toward a specific behavior. Maybe you want it to be more helpful, less likely to hallucinate, or better aligned with a particular culture's values. The standard approaches either require expensive fine-tuning or rely on brittle prompt engineering.
Activation steering offers a middle path: modify the model's hidden activations at inference time to shift behavior without changing weights. It is lightweight, reversible, and increasingly popular. But current methods have a problem.
Dense steering vectors operate directly on the model's activation space. The issue is that individual neurons encode multiple concepts simultaneously. This phenomenon, called superposition, means that when you push activations in one direction, you are affecting many behaviors at once. Your "be more helpful" vector might also make the model more verbose, less accurate, or culturally biased in unexpected ways.
Superposition: In large language models, individual neurons frequently represent several unrelated features simultaneously. This entangled encoding makes it hard to modify one behavior without inadvertently altering others.
YaPO operates in SAE latent space for disentangled, interpretable steering
YaPO addresses this by moving steering into the latent space of a Sparse Autoencoder. SAEs decompose entangled activations into sparse, monosemantic features. Each feature corresponds roughly to one concept. Steering in this space is like adjusting individual knobs rather than yanking on a tangled bundle of wires.
Sparse Autoencoder (SAE): A neural network that re-encodes activations as a much higher-dimensional code in which only a few dimensions are active for any input. The sparsity forces each code dimension to capture a distinct, often human-interpretable concept, enabling precise manipulation of model behavior.
The results: 10x faster training convergence, better performance on cultural alignment benchmarks, and stable optimization that does not oscillate wildly. The approach also generalizes beyond its training domain to other alignment tasks.
The Problem with Dense Steering
Current activation steering methods fall into two categories:
Contrastive methods (CAA, SAS) compute steering vectors by averaging activation differences between contrastive prompts. "What would a helpful response look like?" versus "What would an unhelpful response look like?" The difference becomes your steering vector.
Learned methods (BiPO) train steering vectors directly from preference data using a DPO-style objective. This captures more nuanced behavioral signals than simple averaging.
DPO-style objective: A loss that directly increases the likelihood of preferred responses while decreasing that of dispreferred ones, using only pairwise preference data. The variant used here sidesteps the need for a separate reference model, reducing memory and compute requirements.
Both approaches work in dense activation space, and both inherit its problems.
Neuron multi-semanticity
A single neuron in an LLM often encodes multiple unrelated concepts. One neuron might activate for both "French language" and "cooking" and "formal tone." This is not a bug; it is how neural networks efficiently pack information into limited dimensions.
Imagine a single telephone line that carries several conversations at the same time, each on a different frequency. When you increase the volume on that line, you amplify every conversation, not just the one you care about. A quiet chat about recipes becomes a shouting match about diplomacy.
When you add a dense steering vector to activations, you affect all the concepts encoded in each neuron simultaneously. For instance, if neuron #237 in Gemma-2-2B fires both when the model generates a French phrase and when it mentions cooking techniques, nudging that neuron upward to increase "helpfulness" may cause the model to produce more French sentences and culinary references even for technical questions. Your cultural alignment vector ends up affecting formality, topic preferences, and response length in ways you did not intend.
Training instability
BiPO learns dense steering vectors through gradient descent. But because neurons are multi-semantic, the gradient signal is noisy. Improving one behavior can degrade another. The optimization landscape has local minima where the model gets stuck with partially aligned behavior.
In practice, BiPO training shows pronounced oscillations. Accuracy goes up, then crashes, then partially recovers. The final result depends heavily on when you stop training and which checkpoint you keep.
The cultural alignment challenge
Cultural alignment is a particularly demanding test case. You want the model to respond appropriately for Egyptian users versus Moroccan users versus Portuguese users. These cultures share many values but differ in subtle ways.
Dense steering struggles here because the cultural concepts are closely related in activation space. Pushing toward "Egyptian culture" inevitably affects nearby "Moroccan culture" representations. You cannot isolate the target behavior.
How YaPO Works
YaPO introduces three key innovations: operating in SAE latent space, preference-based optimization, and residual correction for reconstruction error.
Sparse Autoencoder projection
Instead of learning a steering vector in the model's dense activation space, YaPO learns it in the latent space of a pretrained Sparse Autoencoder.
SAEs are trained to decompose activations into sparse, interpretable features. A typical SAE might expand 4096-dimensional activations into 65536 sparse features, where only ~100 are active for any given input. These features tend to be monosemantic: each corresponds to a single concept or behavior.
The steering process works as follows:
- Encode: Project activations into SAE latent space
- Steer: Add the learned sparse vector to the sparse codes
- Decode: Project back to activation space
- Correct: Add residual to account for SAE reconstruction error
The residual correction is important. SAEs do not perfectly reconstruct their inputs. Without correction, steering would introduce artifacts from reconstruction error.
Picture a sculptor who first carves a rough marble block (the SAE reconstruction) and then wants to add a tiny detail, like a feather on a bird's wing. If the sculptor simply glues a new piece onto the rough block, the mismatch in texture shows. Instead, the sculptor first smooths the whole surface (the residual correction) so the added feather blends seamlessly with the original stone.
In practice, after encoding and decoding a paragraph about climate policy, the SAE might drop "carbon" and replace "emissions" with "pollution." Without residual correction, the steered output reads "reduce pollution..." instead of "reduce carbon emissions..." YaPO adds back the difference between original activations and SAE reconstruction, preserving exact wording while applying the behavioral shift.
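As a compact sketch of this pipeline (assuming hypothetical `sae.encode`/`sae.decode` helpers and a learned sparse vector `v`; the full training and inference pseudocode appears under Implementation Notes):

```python
from torch.nn.functional import relu

def apply_sparse_steering(a, sae, v, multiplier=1.0):
    """Encode -> steer -> decode -> correct (sketch; the `sae` helpers are assumed)."""
    z = sae.encode(a)                       # project activations into sparse SAE features
    z_steered = relu(z + multiplier * v)    # nudge the sparse codes with the learned vector
    a_steered = sae.decode(z_steered)       # map back to the model's activation space
    residual = a - sae.decode(z)            # what the SAE failed to reconstruct
    return a_steered + residual             # add it back so only the intended shift remains
```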
Preference optimization objective
YaPO uses a BiPO-style objective but operates on sparse codes rather than dense activations:
The objective increases the likelihood of preferred responses while decreasing the likelihood of dispreferred responses. The bi-directional term (randomly flipping the steering direction during training) ensures the learned vector works for both positive and negative steering.
YaPO is reference-free. Unlike standard DPO, it does not require a reference model. This simplifies training and reduces memory requirements.
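Written out (following the pseudocode given later under Implementation Notes; here $d \in \{-1, +1\}$ is the randomly sampled steering direction, $\beta$ the deviation-control coefficient, $a$ the base activations, and $a_v$ the steered activations), the per-example loss is roughly:

$$
\mathcal{L}(v) = -\log\sigma\!\Big(d\,\beta\,\big[\big(\log \pi(y_w \mid x, a_v) - \log \pi(y_w \mid x, a)\big) - \big(\log \pi(y_l \mid x, a_v) - \log \pi(y_l \mid x, a)\big)\big]\Big)
$$

where $y_w$ and $y_l$ are the preferred and dispreferred responses and $\pi(\cdot \mid x, a)$ denotes the model's likelihood when the target layer's activations are replaced by $a$. The unsteered base activations play the role that a separate reference model plays in standard DPO.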
Why sparsity helps
The sparse representation has several advantages:
Disentanglement: Each SAE feature roughly corresponds to one concept. Think of the SAE's latent space as a wall of light switches in a theater. Each switch controls exactly one spotlight: one for romance, one for suspense, one for comedy. Flipping the romance switch brightens only the love scenes without dimming the thriller or slapstick, because each lamp is wired to a dedicated circuit. Steering one feature does not accidentally affect unrelated features.
Monosemantic features: Features that each encode a single, distinct concept or behavior. In a sparse autoencoder, monosemanticity emerges because only a few active dimensions are allowed, encouraging each to specialize. For example, feature #12,842 in Gemma-2-2B's SAE activates only for date expressions ("April 5, 2024"), while feature #7,315 fires exclusively for monetary amounts ("$3,200").
Interpretability: You can inspect which SAE features the learned vector modifies. This provides insight into what behaviors are being steered.
Efficiency: Sparse vectors require fewer effective parameters. The optimization has fewer degrees of freedom, making convergence faster and more stable.
Robustness: Because steering is localized to specific features, the model's general capabilities remain intact.
Convergence Speed
The most immediate benefit of YaPO is faster training.
Training loss comparison on cultural alignment task
BiPO requires 600+ steps to converge on cultural alignment tasks, and even then it plateaus at a loss floor around 0.3. YaPO reaches target performance in approximately 150 steps, a 10x speedup.
Why the difference
Dense steering vectors have high effective dimensionality. Every neuron can be modified, creating a complex optimization landscape with many local minima.
Sparse steering vectors have low effective dimensionality. Only the active SAE features matter, and their monosemantic nature creates a smoother optimization landscape. Gradients flow more cleanly toward the desired behavior.
The loss floor that BiPO hits reflects entanglement. At some point, improving the target behavior starts degrading correlated behaviors. The optimizer gets stuck trying to balance competing objectives. YaPO avoids this by operating on disentangled features.
Practical implications
Faster convergence means:
- Lower training cost: 10x fewer steps means 10x less compute
- Faster iteration: You can experiment with more configurations
- Reduced overfitting risk: Less time for the model to memorize training data
- Better final performance: The optimizer reaches better solutions
Cultural Alignment Results
The paper introduces a new cultural alignment benchmark spanning 15 cultural contexts across 5 language families. The task: steer the model to respond appropriately for users from specific countries.
MCQ accuracy on cultural adaptation benchmark
| Country | Base | CAA | SAS | BiPO | YaPO |
|---|---|---|---|---|---|
| Portugal | 32.2% | 44.1% | 52.2% | 34.5% | 54.0% |
| Egypt | 36.1% | 44.7% | 37.5% | 42.2% | 50.2% |
| Brazil | 19.9% | 42.0% | 19.9% | 27.3% | 39.1% |
| Morocco | 11.6% | 10.8% | 19.5% | 13.8% | 13.6% |
YaPO achieves the best accuracy on 3 of 4 countries. The one exception (Morocco) has low baseline accuracy for all methods, suggesting the base model has limited Moroccan cultural knowledge to steer.
Explicit vs implicit localization
The benchmark tests two prompt types:
- Localized: Prompts explicitly mention the target culture ("As someone from Egypt...")
- Non-localized: Prompts contain no cultural cues; the model must infer from context
All methods perform better on localized prompts, but the gap varies:
| Method | Localized | Non-localized | Gap |
|---|---|---|---|
| Base | 38.2% | 28.1% | 10.1 |
| CAA | 42.3% | 32.5% | 9.8 |
| BiPO | 41.8% | 33.2% | 8.6 |
| YaPO | 44.6% | 37.8% | 6.8 |
YaPO not only achieves the highest accuracy on both prompt types but also has the smallest gap between them. This indicates more robust cultural alignment that does not depend on explicit prompting.
Generalization to Other Tasks
Cultural alignment vectors were trained on cultural preference data. But do they help with other alignment tasks?
Performance on alignment-related behaviors (higher is better)
The paper tests on BiPO's original benchmarks: hallucination reduction, jailbreak resistance, wealth-seeking, and power-seeking mitigation.
| Task | Base | CAA | SAS | BiPO | YaPO |
|---|---|---|---|---|---|
| Wealth-Seeking | 2.10 | 2.23 | 2.14 | 2.17 | 2.31 |
| Power-Seeking | 1.89 | 2.09 | 1.81 | 1.93 | 2.03 |
| Hallucination | 1.60 | 2.18 | 1.46 | 1.60 | 1.69 |
| Jailbreak | 1.00 | 1.08 | 1.00 | 1.02 | 1.00 |
Higher scores indicate better alignment. CAA achieves the best raw scores on most tasks, but with a caveat: CAA is highly sensitive to hyperparameters. Small changes in steering strength cause performance to collapse.
YaPO consistently improves over the baseline and BiPO while maintaining stability. The gains are smaller than CAA's peaks but more reliable across configurations.
Why cultural steering generalizes
This cross-task transfer is surprising. Why would a vector trained for cultural alignment help with hallucination?
The likely explanation: cultural alignment requires the model to be more grounded and factual about cultural practices. This same "be more grounded" signal helps with hallucination reduction. The sparse representation captures abstract behavioral dimensions that transfer across tasks.
Stability Analysis
Beyond raw performance, YaPO offers improved training stability.
Optimization dynamics
BiPO training curves show pronounced oscillations:
- Accuracy spikes up, then crashes
- Different runs converge to different final values
- Performance is sensitive to learning rate and batch size
YaPO training curves are smooth and monotonic:
- Accuracy increases steadily
- Different runs converge to similar final values
- Performance is robust to hyperparameter choices
Steering multiplier sensitivity
At inference time, you apply the steering vector with a multiplier λ that controls steering strength. How sensitive is performance to this choice?
For YaPO on the Egyptian-culture benchmark, setting λ = 0.5 yields a modest 2% boost in MCQ accuracy, while λ = 1.8 gives the full 5% gain. In contrast, CAA collapses when λ exceeds 0.4: responses become nonsensical, repeating prompts verbatim and dropping cultural cues. YaPO's smooth degradation means even at λ = 2.5 accuracy only falls to 48% (still above the 36% baseline).
| Method | Safe Range | Failure Mode |
|---|---|---|
| CAA | λ < 0.5 | Collapses abruptly |
| SAS | λ < 0.5 | Collapses abruptly |
| BiPO | λ < 1.5 | Gradual degradation |
| YaPO | λ < 2.0+ | Gradual degradation |
CAA and SAS are brittle. Cross a threshold and performance drops catastrophically. YaPO degrades gracefully and achieves optimal performance at higher multipliers (λ = 1.5-2.0), demonstrating that the sparse representation enables stronger steering without destabilization.
MMLU preservation
Does steering degrade general knowledge?
| Configuration | MMLU Accuracy |
|---|---|
| Base (no steering) | 57.58% |
| CAA steering | 56.94-57.29% |
| BiPO steering | 57.39-57.72% |
| YaPO steering | 57.02-57.36% |
All methods keep MMLU accuracy within about 0.7 percentage points of the unsteered baseline. Steering vectors modify targeted behaviors without degrading general capabilities.
Practical Takeaways
When to use YaPO
YaPO is most valuable when you need:
- Fine-grained behavioral control: Distinguishing between closely related behaviors (e.g., different cultures)
- Stable training: Predictable optimization without oscillations
- Interpretable steering: Understanding which features are being modified
- Cross-task generalization: Steering vectors that help beyond their training domain
When dense steering suffices
For coarse behavioral changes (helpful vs unhelpful, safe vs unsafe), dense methods like CAA may be sufficient and simpler to implement. If you just need a rough directional shift, the overhead of SAE projection is not necessary.
Implementation considerations
YaPO requires a pretrained SAE for your target model and layer. SAEs are available for popular models (Gemma, Llama) through projects like Gemma Scope. Training custom SAEs adds significant overhead.
The sparse steering vector has dimension equal to the SAE latent space (typically 16x-64x the activation dimension). Storage is minimal since vectors are sparse, but you need SAE infrastructure for encoding/decoding.
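Because only a small fraction of SAE dimensions end up nonzero, the learned vector can be stored compactly as index/value pairs. A minimal PyTorch sketch (the vector and its active indices here are illustrative):

```python
import torch

# Illustrative learned vector: 65,536 SAE features, only a handful nonzero
v = torch.zeros(65_536)
v[[17, 980, 12_345]] = torch.tensor([0.8, -0.3, 1.2])

# Store only the active coordinates
idx = v.nonzero(as_tuple=True)[0]
vals = v[idx]
torch.save({"indices": idx, "values": vals, "dim": v.numel()}, "steering_vector.pt")

# Rebuild the dense vector when needed for encoding/decoding
state = torch.load("steering_vector.pt")
v_dense = torch.zeros(state["dim"])
v_dense[state["indices"]] = state["values"]
```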
Implementation Notes
Prerequisites
YaPO requires:
- Base LLM: The paper uses Gemma-2-2B-it
- Pretrained SAE: Available from Gemma Scope for Gemma models
- Preference data: (prompt, preferred_response, dispreferred_response) tuples
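For concreteness, one preference tuple for the cultural-alignment setting might look like the following (the field names and content are illustrative, not taken from the paper's dataset):

```python
# A hypothetical cultural-alignment preference example (illustrative only)
example = {
    "prompt": "A guest is visiting your home for dinner. What would you typically serve?",
    "preferred_response": "A typical Egyptian dinner for guests might include mahshi "
                          "(stuffed vine leaves), roasted chicken, and mint tea.",
    "dispreferred_response": "Most hosts serve roast turkey with cranberry sauce and "
                             "pumpkin pie.",
}
```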
Training loop
```python
# Pseudocode for YaPO training (model/sae helper methods and sample_batch are assumed)
import random
import torch
from torch.nn.functional import logsigmoid, relu
from torch.optim import Adam

def train_yapo(model, sae, preference_data, layer, num_steps=150, beta=0.1, lr=1e-3):
    # Initialize the sparse steering vector in SAE latent space
    v = torch.zeros(sae.latent_dim, requires_grad=True)
    optimizer = Adam([v], lr=lr)

    for step in range(num_steps):
        x, y_w, y_l = sample_batch(preference_data)

        # Random direction for bi-directional training
        d = random.choice([-1, 1])

        # Get base activations at the target layer
        a_L = model.get_activations(x, layer=layer)

        # Compute steered activations via the SAE: encode, steer, decode
        z = sae.encode(a_L)
        z_steered = relu(z + d * v)
        a_steered = sae.decode(z_steered)

        # Residual correction for SAE reconstruction error
        a_steered = a_steered + (a_L - sae.decode(z))

        # Reference-free preference loss comparing steered vs. base activations
        log_ratio_w = model.log_prob(y_w, a_steered) - model.log_prob(y_w, a_L)
        log_ratio_l = model.log_prob(y_l, a_steered) - model.log_prob(y_l, a_L)
        loss = -logsigmoid(d * beta * (log_ratio_w - log_ratio_l))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return v
```

Inference
```python
def steer_generation(model, sae, v, prompt, layer, direction=1, multiplier=1.5):
    # Forward hook that applies the sparse steering vector at the target layer
    def steering_hook(module, inputs, output):
        a = output
        z = sae.encode(a)
        z_steered = relu(z + direction * multiplier * v)
        a_steered = sae.decode(z_steered)
        # Residual correction for SAE reconstruction error
        a_steered = a_steered + (a - sae.decode(z))
        return a_steered

    handle = model.layers[layer].register_forward_hook(steering_hook)
    output = model.generate(prompt)
    handle.remove()
    return output
```

Hyperparameters
| Parameter | Recommended Value |
|---|---|
| Learning rate | 1e-3 |
| Training steps | 150-300 |
| Batch size | 32 |
| Beta (deviation control) | 0.1 |
| Steering multiplier (λ) | 1.5-2.0 |
| Target layer | Middle layers (L/2) |
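Putting the pieces together, a minimal end-to-end sketch with these recommended settings (the `load_model_and_sae` loader and `preference_data` are placeholders, not an actual API):

```python
# End-to-end sketch using the pseudocode functions above; loader and data are placeholders
model, sae = load_model_and_sae("gemma-2-2b-it")   # hypothetical loader
layer = model.num_layers // 2                      # middle layer, per the table above

v = train_yapo(model, sae, preference_data, layer=layer,
               num_steps=150, beta=0.1, lr=1e-3)

reply = steer_generation(model, sae, v,
                         "What should I bring when invited to a neighbor's home?",
                         layer=layer, direction=1, multiplier=1.5)
```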
Limitations
SAE dependency
YaPO requires a pretrained SAE for your target model. Quality SAEs exist for Gemma and some Llama variants, but not for all architectures. Training a custom SAE is expensive and requires expertise.
Layer selection
Performance depends on which layer you steer. The paper uses middle layers, but optimal choice varies by task. Grid search over layers adds to development time.
Sparse feature coverage
SAEs do not perfectly capture all concepts. Some behaviors may not have clean corresponding features in the SAE latent space. For these, sparse steering provides less benefit over dense methods.
Cultural benchmark scope
The cultural alignment benchmark covers 15 contexts but is not exhaustive. Performance on other cultures may differ. The benchmark also focuses on explicit cultural questions rather than subtle cultural preferences in open-ended generation.
Computational overhead
SAE encoding/decoding adds latency at inference time. For high-throughput applications, this overhead may be significant. The paper does not report inference timing comparisons.
Original paper: arXiv ・ PDF ・ HTML
Code: GitHub
Authors: Abdelaziz Bounhar, Rania Hossam Elmohamady Elbadry, Hadi Abdine, Preslav Nakov, Michalis Vazirgiannis, Guokan Shang (MBZUAI, Ecole Polytechnique)
Cite this paper
Abdelaziz Bounhar, Rania Hossam Elmohamady Elbadry, Hadi Abdine, Preslav Nakov, Michalis Vazirgiannis, Guokan Shang (2026). YaPO: Sparse Steering Vectors That Actually Work. arXiv 2026.