- The Problem. RLHF causes "mode collapse." Models give predictable, safe answers because human annotators favor familiar text (typicality bias).
- The Fix. Add ~20 words to your prompt asking for "multiple responses with probabilities." This bypasses the aligned persona and accesses the model's full creative distribution.
- No Training Required. This is a prompt-only technique. It works immediately with ChatGPT, Claude, and Gemini, boosting diversity 1.6-2.1x without any fine-tuning.
Research Overview
You've probably noticed: ChatGPT often gives the same style of answer. Ask for a joke, and you get the same structure. Ask for a story, and the pacing feels familiar. This is a side effect of how models are trained to be helpful.
Verbalized Sampling (VS) is a simple prompting technique that fixes this. By adding about 20 words to your prompt, you can boost LLM creativity by 1.6-2.1x without any model changes.
The core insight: aligned models haven't lost their creativity. They've just learned to hide it. VS forces them to show it again.
The Key Result
| Metric | Improvement |
|---|---|
| Creative diversity | 1.6-2.1x increase |
| Human-rated diversity | +25.7% |
| Creativity recovery | 66.8% of pre-alignment levels |
| Quality maintained | No degradation |
The Problem: Mode Collapse
Post-training alignment (RLHF, RLAIF, DPO) makes LLMs helpful and safe. But it has a hidden cost: mode collapse.
RLHF stands for Reinforcement Learning from Human Feedback. After an LLM is initially trained on text from the internet (pre-training), companies like OpenAI and Anthropic run a second phase where human raters judge the model's responses. The model learns to produce answers that humans rate highly. This makes the model helpful and safe, but as this paper shows, it also makes responses more predictable.
Mode collapse means the model's output distribution becomes narrow. Instead of sampling from a rich space of possible responses, the model converges on a small set of "safe" answers.
Ask ChatGPT for 10 jokes about coffee. You'll notice they follow similar patterns: setup/punchline structure, similar topics (caffeine addiction, morning routines), similar tone. The model has learned that these patterns score well with humans, so it rarely deviates.
Mode collapse has real consequences:
- Creative applications: Stories, poems, and marketing copy become predictable
- Synthetic data generation: Training data lacks diversity
- Dialogue systems: Conversations feel scripted
- Brainstorming: Models suggest the obvious rather than the novel
Root Cause: Typicality Bias
Previous work blamed mode collapse on algorithmic issues (reward hacking, KL penalty tuning, training instability). This paper identifies a more fundamental cause: typicality bias in preference data.
How Preference Data Gets Collected
Here's how companies train models to be helpful:
1. An LLM generates multiple responses to a prompt
2. Human annotators rate which response is "better"
3. A reward model learns from these preferences
4. The LLM is fine-tuned to maximize reward
A reward model is a separate AI that learns to predict which responses humans will prefer. Once trained, it can score millions of responses automatically, guiding the main LLM toward "better" outputs. The problem: if the reward model learns biased preferences, it teaches those biases to the LLM.
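Concretely, step 3 usually amounts to fitting a pairwise (Bradley-Terry-style) objective: the reward model is trained so that the response the annotator preferred scores higher. The snippet below is a minimal sketch of that loss, not any lab's actual training code. If annotators systematically prefer typical text, minimizing this loss bakes that preference into the reward model.

```python
import math

def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Loss is small when the reward model scores the human-preferred response higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected))))

print(round(pairwise_preference_loss(2.0, 0.5), 3))  # ~0.201: reward model agrees with the annotator
print(round(pairwise_preference_loss(0.5, 2.0), 3))  # ~1.701: reward model is pushed hard to agree
```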
The problem is step 2. When annotators judge responses, they systematically prefer:
- Familiar phrasing over novel expression
- Expected structure over creative format
- Predictable content over surprising ideas
This isn't annotator failure. It's cognitive psychology. Humans process familiar text more easily (cognitive fluency), and ease of processing feels like quality.
When you read something that matches patterns you've seen before, your brain processes it faster. That speed creates a subtle feeling of "rightness." Annotators aren't consciously choosing boring text. Their brains just signal "this feels good" when they read familiar phrases. The result: creative, unusual responses get systematically downvoted even when they're objectively good.
The Cascade Effect
Typicality bias in preference data creates a cascade:
- Annotators favor typical responses → Preference data is biased
- Reward model learns bias → High rewards for predictable outputs
- LLM optimizes for reward → Model suppresses diverse responses
- Distribution collapses → Creativity disappears
The paper provides empirical evidence: analyzing preference datasets shows that "typical" responses (closer to the mean of the output distribution) consistently receive higher human ratings.
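One way to probe for this bias in your own preference data (a rough illustrative sketch, not the paper's exact measurement): embed both responses in each pair, call the one closer to the dataset's mean embedding the "more typical" one, and check how often it is also the one the annotator chose. The distance-to-mean proxy and the use of precomputed embeddings are assumptions made here for illustration.

```python
import numpy as np

def typicality_win_rate(chosen_emb: np.ndarray, rejected_emb: np.ndarray) -> float:
    """chosen_emb, rejected_emb: (n_pairs, dim) embeddings of the preferred
    and rejected response in each preference pair."""
    # "Typical" proxy: closer to the mean response embedding of the whole dataset.
    mean_vec = np.concatenate([chosen_emb, rejected_emb]).mean(axis=0)
    d_chosen = np.linalg.norm(chosen_emb - mean_vec, axis=1)
    d_rejected = np.linalg.norm(rejected_emb - mean_vec, axis=1)
    # Fraction of pairs where the preferred response is also the more typical one.
    # Values well above 0.5 suggest typicality bias.
    return float((d_chosen < d_rejected).mean())
```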
The Solution: Verbalized Sampling
Verbalized Sampling bypasses mode collapse by changing what you ask the model to do.
Why This Works
The key insight: aligned models still contain their pre-trained distribution. RLHF doesn't delete creative capabilities. It just adds a layer that suppresses them.
When you ask for a single response, the aligned layer dominates. When you ask for a probability distribution, the model must engage its full generative capacity to produce multiple plausible responses with varying likelihoods.
Imagine all possible responses to a prompt arranged by how likely the model is to generate them. The most common responses sit in the middle (the "peak"). Rare, unusual responses live in the "tails" on either side. These tail responses are often the most creative because they're unexpected. Standard prompting only gets you peak responses. VS explicitly asks for tail responses, unlocking creativity the model already has but normally hides.
Think of an aligned LLM as having two personalities: (1) the original pre-trained model with rich, diverse knowledge, and (2) the RLHF-tuned assistant that prefers safe responses. Standard prompting activates personality #2. Verbalized Sampling forces the model to consult personality #1.
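A toy numerical picture of that idea (purely illustrative, not the paper's analysis): alignment sharpens the response distribution toward its peak, so repeated single-response prompting keeps landing on the same few answers, while enumerating the top candidates with probabilities necessarily reaches further into the tails.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical response space: 10 candidate answers with their pre-trained probabilities.
pretrained = np.array([0.22, 0.18, 0.14, 0.11, 0.09, 0.08, 0.07, 0.05, 0.04, 0.02])

# Alignment sharpens the distribution toward the mode (modeled here as a low temperature).
aligned = pretrained ** 4
aligned /= aligned.sum()

# Standard prompting: five independent single responses from the aligned distribution.
standard_picks = set(rng.choice(10, size=5, p=aligned).tolist())

# Verbalized Sampling (idealized): the model enumerates its five most likely candidates
# with probabilities, which reaches further down the ranking than repeated sampling does.
vs_picks = set(np.argsort(pretrained)[::-1][:5].tolist())

print("standard prompting hit answers:", sorted(standard_picks))  # usually only 2-3 distinct answers
print("verbalized sampling lists answers:", sorted(vs_picks))     # 5 distinct answers, incl. less likely ones
```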
Prompt Templates
Here are ready-to-use templates. Replace [N] with how many responses you want (5 is a good starting point), and [YOUR PROMPT] with your actual question.
Basic VS:
Generate [N] responses to the following query, each with their corresponding probability. Query: [YOUR PROMPT]
VS with Tail Sampling:
This version explicitly requests creative, low-probability responses. Replace [M] with how many unusual responses you want (try 2-3).
Generate [N] responses with probabilities. Include at least [M] responses from distribution tails (probability < 0.10). Query: [YOUR PROMPT]
VS-CoT (Chain of Thought):
This version asks the model to explain its reasoning, which can surface even more variety.
Generate [N] responses with probabilities and reasoning for each. Query: [YOUR PROMPT]
The probability values the model attaches to each response (0.35, 0.08, and so on) indicate how likely it considers that answer. Higher numbers mean more "typical" answers; lower numbers (under 0.10) are the creative outliers. Don't take these as precise statistics. Think of them as a rough ranking from "obvious" to "unexpected."
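If you want to script this rather than paste prompts by hand, the sketch below wraps the Basic VS template and pulls out the (response, probability) pairs. The output format it parses, a line ending in "(probability: 0.35)", and the `ask_llm` call are assumptions; adapt them to whatever your model and client actually return.

```python
import re

VS_TEMPLATE = (
    "Generate {n} responses to the following query, each with their "
    "corresponding probability. Query: {query}"
)

def parse_vs_output(text: str) -> list[tuple[str, float]]:
    """Extract (response, probability) pairs from a verbalized-sampling reply."""
    pairs = []
    for line in text.splitlines():
        stripped = line.strip()
        match = re.search(r"\(probability:?\s*([0-9.]+)\)\s*$", stripped)
        if match:
            response = stripped[: match.start()].strip()
            response = re.sub(r"^\d+[.)]\s*", "", response)  # drop "1." / "2)" numbering
            pairs.append((response, float(match.group(1))))
    return pairs

def tail_responses(pairs: list[tuple[str, float]], threshold: float = 0.10):
    """Keep only the low-probability (more unusual) responses."""
    return [(resp, p) for resp, p in pairs if p < threshold]

# prompt = VS_TEMPLATE.format(n=5, query="Write a tagline for a coffee shop")
# reply = ask_llm(prompt)   # ask_llm is a placeholder for your LLM client call
# print(tail_responses(parse_vs_output(reply)))
```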
Results
Creative Writing
VS was tested on poems, stories, and jokes. Results across models:
[Chart: Diversity Improvement by Model, Verbalized Sampling vs Standard Prompting (1.0x = baseline)]
Human evaluators rated VS outputs as more diverse (+25.7%) while maintaining or improving quality scores.
Capability Scaling
An interesting finding: more capable models benefit more from VS.
This suggests that larger models have richer pre-trained distributions that are more heavily suppressed by alignment. VS unlocks more potential when there's more potential to unlock.
Domain Performance
VS improved diversity across:
- Creative writing: Poems, stories, jokes
- Dialogue simulation: Character conversations, roleplay
- Open-ended QA: Questions with multiple valid answers
- Synthetic data generation: Training data creation
VS maintained factual accuracy on knowledge tasks and didn't compromise safety guardrails.
Creativity Recovery
Compared to pre-alignment base models, aligned models show significant creativity loss. VS recovers 66.8% of this lost creativity without requiring any model changes.
[Chart: Creativity Through Alignment Stages, showing how Verbalized Sampling recovers lost creative diversity]
Practical Applications
Content Generation
Marketing copy often suffers from sameness. A single VS prompt produces a spectrum of options, from safe phrasing to creative outliers, so you can choose the register that fits.
Synthetic Data
Training data diversity directly impacts model robustness. If you're using AI to generate training examples for another AI (increasingly common), VS helps you avoid the "garbage in, garbage out" problem.
When you train a customer service bot on 1,000 examples that all sound the same, it learns to handle one type of conversation. When you train it on diverse examples (including edge cases like angry customers, refund disputes, or language barriers), it handles real-world variety better. VS automatically generates this diversity.
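As a concrete sketch of that workflow, you can wrap the tail-sampling template around a handful of seed scenarios and send one prompt per scenario. The scenario list, template wording, and the `ask_llm` client are illustrative assumptions, not part of the paper.

```python
SEED_SCENARIOS = [
    "customer asks about a late delivery",
    "customer disputes a refund",
    "customer is angry about a billing error",
]

VS_DATA_TEMPLATE = (
    "Generate {n} realistic customer messages for this scenario, each with "
    "their corresponding probability. Include at least {m} low-probability "
    "(unusual) messages. Scenario: {scenario}"
)

def build_prompts(n: int = 5, m: int = 2) -> list[str]:
    """One VS prompt per seed scenario; run each through your LLM client."""
    return [VS_DATA_TEMPLATE.format(n=n, m=m, scenario=s) for s in SEED_SCENARIOS]

for prompt in build_prompts():
    print(prompt)
    # reply = ask_llm(prompt)   # hypothetical call to your LLM client
```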
Brainstorming
VS naturally produces a spectrum of ideas from conventional to novel, which makes it well suited to ideation.
Dialogue Systems
Conversations feel scripted when responses are too predictable. VS adds natural variation to simulated dialogue.
Limitations
VS is powerful but not perfect. Here's what to watch for:
Computational Overhead
VS requires generating multiple responses per query. For applications where latency matters, this adds cost. The paper suggests that VS-based approaches can be used selectively: apply VS when diversity matters, use standard prompting otherwise.
Use VS for brainstorming, creative work, or generating training data. Use standard prompting for quick factual queries, code completion, or any task where you need one correct answer fast. VS adds 2-5x more tokens to your output, which means 2-5x more cost and latency.
Probability Calibration
LLMs don't produce well-calibrated probabilities. The "probabilities" in VS outputs are more like relative confidence scores than true statistical probabilities. This doesn't affect diversity improvement but limits applications that need actual probability estimates.
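If a downstream step does need the numbers to behave like a distribution, for example to sample one of the returned responses in proportion to its weight, a simple renormalization of the parsed values is usually good enough. This sketch reuses the (response, probability) pairs from the parsing helper above; treat the result as relative weights, not calibrated probabilities.

```python
import random

def renormalize(pairs: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Rescale verbalized probabilities so they sum to 1 (as returned, they rarely do)."""
    total = sum(p for _, p in pairs) or 1.0
    return [(resp, p / total) for resp, p in pairs]

def weighted_pick(pairs: list[tuple[str, float]]) -> str:
    """Sample one response, treating the verbalized numbers as relative weights."""
    pairs = renormalize(pairs)
    return random.choices(
        [resp for resp, _ in pairs],
        weights=[p for _, p in pairs],
        k=1,
    )[0]
```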
Model-Specific Tuning
Optimal VS parameters (number of responses, tail probability threshold) vary by model. The paper provides starting points, but production use may require experimentation.
Not a Replacement for Fine-Tuning
VS improves diversity at inference time but doesn't change the model's underlying distribution. For applications requiring specific styles or domains, fine-tuning may still be necessary.
Conclusion
Verbalized Sampling offers a practical fix for one of RLHF's unintended consequences. By understanding that mode collapse stems from typicality bias (not algorithmic limitations), the researchers developed a training-free solution that works with any aligned LLM.
Key Takeaways:
- Mode collapse is a data problem: Typicality bias in preference data, not RLHF algorithms, drives diversity loss
- Aligned models retain creativity: The pre-trained distribution isn't deleted, it's suppressed
- Simple prompting recovers diversity: Adding ~20 words to prompts can double creative output
- No quality tradeoff: VS improves diversity without sacrificing accuracy or safety
- Scales with capability: Better models benefit more from VS
For practitioners, VS is immediately applicable. No fine-tuning, no API changes, no model access required. Just change your prompts.
Original paper: arXiv ・ PDF ・ HTML
Authors: Jiayi Zhang, Simon Yu, Derek Chong, Anthony Sicilia, Michael R. Tomz, Christopher D. Manning, Weiyan Shi
Institutions: Northeastern University, Stanford University, West Virginia University
Code: GitHub - CHATS-lab/verbalized-sampling
Website: verbalized-sampling.com
Cite this paper
Jiayi Zhang, Simon Yu, Derek Chong, Christopher D. Manning, Weiyan Shi (2025). Verbalized Sampling: Unlocking LLM Creativity with a Simple Prompt. arXiv 2025.