- The Problem. RLHF causes "mode collapse." Models give predictable, safe answers because human annotators favor familiar text (typicality bias).
- The Fix. Add ~20 words to your prompt asking for "multiple responses with probabilities." This bypasses the aligned persona and accesses the model's full creative distribution.
- No Training Required. This is a prompt-only technique. It works immediately with ChatGPT, Claude, and Gemini, boosting diversity 1.6-2.1x without any fine-tuning.
Research Overview
You've probably noticed: ChatGPT often gives the same style of answer. Ask for a joke, and you get the same structure. Ask for a story, and the pacing feels familiar. This is a side effect of how models are trained to be helpful.
Verbalized Sampling (VS) is a simple prompting technique that fixes this. By adding about 20 words to your prompt, you can boost LLM creativity by 1.6-2.1x without any model changes.
The core insight: aligned models haven't lost their creativity. They've just learned to hide it. VS forces them to show it again.
The Key Result
| Metric | Improvement |
|---|---|
| Creative diversity | 1.6-2.1x increase |
| Human-rated diversity | +25.7% |
| Creativity recovery | 66.8% of pre-alignment levels |
| Quality maintained | No degradation |
The Problem: Mode Collapse
Post-training alignment (RLHF, RLAIF, DPO) makes LLMs helpful and safe. But it has a hidden cost: mode collapse.
RLHF stands for Reinforcement Learning from Human Feedback. After an LLM is initially trained on text from the internet (pre-training), companies like OpenAI and Anthropic run a second phase where human raters judge the model's responses. The model learns to produce answers that humans rate highly. This makes the model helpful and safe, but as this paper shows, it also makes responses more predictable.
Mode collapse means the model's output distribution becomes narrow. Instead of sampling from a rich space of possible responses, the model converges on a small set of "safe" answers.
Ask ChatGPT for 10 jokes about coffee. You'll notice they follow similar patterns: setup/punchline structure, similar topics (caffeine addiction, morning routines), similar tone. The model has learned that these patterns score well with humans, so it rarely deviates.
Mode collapse has real consequences:
- Creative applications: Stories, poems, and marketing copy become predictable
- Synthetic data generation: Training data lacks diversity
- Dialogue systems: Conversations feel scripted
- Brainstorming: Models suggest the obvious rather than the novel
Root Cause: Typicality Bias
Previous work blamed mode collapse on algorithmic issues (reward hacking, KL penalty tuning, training instability). This paper identifies a more fundamental cause: typicality bias in preference data.
How Preference Data Gets Collected
Here's how companies train models to be helpful:
1. An LLM generates multiple responses to a prompt
2. Human annotators rate which response is "better"
3. A reward model learns from these preferences
4. The LLM is fine-tuned to maximize reward
A reward model is a separate AI that learns to predict which responses humans will prefer. Once trained, it can score millions of responses automatically, guiding the main LLM toward "better" outputs. The problem: if the reward model learns biased preferences, it teaches those biases to the LLM.
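Concretely, step 3 usually amounts to fitting a pairwise (Bradley-Terry-style) objective: the reward model is trained so that the response the annotator preferred scores higher. The snippet below is a minimal sketch of that loss, not any lab's actual training code. If annotators systematically prefer typical text, minimizing this loss bakes that preference into the reward model.

```python
import math

def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Loss is small when the reward model scores the human-preferred response higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected))))

print(round(pairwise_preference_loss(2.0, 0.5), 3))  # ~0.201: reward model agrees with the annotator
print(round(pairwise_preference_loss(0.5, 2.0), 3))  # ~1.701: reward model is pushed hard to agree
```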
The problem is step 2. When annotators judge responses, they systematically prefer:
- Familiar phrasing over novel expression
- Expected structure over creative format
- Predictable content over surprising ideas
This isn't annotator failure. It's cognitive psychology. Humans process familiar text more easily (cognitive fluency), and ease of processing feels like quality.
When you read something that matches patterns you've seen before, your brain processes it faster. That speed creates a subtle feeling of "rightness." Annotators aren't consciously choosing boring text. Their brains just signal "this feels good" when they read familiar phrases. The result: creative, unusual responses get systematically downvoted even when they're objectively good.
The Cascade Effect
Typicality bias in preference data creates a cascade:
- Annotators favor typical responses → Preference data is biased
- Reward model learns bias → High rewards for predictable outputs
- LLM optimizes for reward → Model suppresses diverse responses
- Distribution collapses → Creativity disappears
The paper provides empirical evidence: analyzing preference datasets shows that "typical" responses (closer to the mean of the output distribution) consistently receive higher human ratings.
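One way to probe for this bias in your own preference data (a rough illustrative sketch, not the paper's exact measurement): embed both responses in each pair, call the one closer to the dataset's mean embedding the "more typical" one, and check how often it is also the one the annotator chose. The distance-to-mean proxy and the use of precomputed embeddings are assumptions made here for illustration.

```python
import numpy as np

def typicality_win_rate(chosen_emb: np.ndarray, rejected_emb: np.ndarray) -> float:
    """chosen_emb, rejected_emb: (n_pairs, dim) embeddings of the preferred
    and rejected response in each preference pair."""
    # "Typical" proxy: closer to the mean response embedding of the whole dataset.
    mean_vec = np.concatenate([chosen_emb, rejected_emb]).mean(axis=0)
    d_chosen = np.linalg.norm(chosen_emb - mean_vec, axis=1)
    d_rejected = np.linalg.norm(rejected_emb - mean_vec, axis=1)
    # Fraction of pairs where the preferred response is also the more typical one.
    # Values well above 0.5 suggest typicality bias.
    return float((d_chosen < d_rejected).mean())
```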
The Solution: Verbalized Sampling
Verbalized Sampling bypasses mode collapse by changing what you ask the model to do.
Why This Works
The key insight: aligned models still contain their pre-trained distribution. RLHF doesn't delete creative capabilities. It just adds a layer that suppresses them.
When you ask for a single response, the aligned layer dominates. When you ask for a probability distribution, the model must engage its full generative capacity to produce multiple plausible responses with varying likelihoods.
Imagine all possible responses to a prompt arranged by how likely the model is to generate them. The most common responses sit in the middle (the "peak"). Rare, unusual responses live in the "tails" on either side. These tail responses are often the most creative because they're unexpected. Standard prompting only gets you peak responses. VS explicitly asks for tail responses, unlocking creativity the model already has but normally hides.
Think of an aligned LLM as having two personalities: (1) the original pre-trained model with rich, diverse knowledge, and (2) the RLHF-tuned assistant that prefers safe responses. Standard prompting activates personality #2. Verbalized Sampling forces the model to consult personality #1.
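A toy numerical picture of that idea (purely illustrative, not the paper's analysis): alignment sharpens the response distribution toward its peak, so repeated single-response prompting keeps landing on the same few answers, while enumerating the top candidates with probabilities necessarily reaches further into the tails.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical response space: 10 candidate answers with their pre-trained probabilities.
pretrained = np.array([0.22, 0.18, 0.14, 0.11, 0.09, 0.08, 0.07, 0.05, 0.04, 0.02])

# Alignment sharpens the distribution toward the mode (modeled here as a low temperature).
aligned = pretrained ** 4
aligned /= aligned.sum()

# Standard prompting: five independent single responses from the aligned distribution.
standard_picks = set(rng.choice(10, size=5, p=aligned).tolist())

# Verbalized Sampling (idealized): the model enumerates its five most likely candidates
# with probabilities, which reaches further down the ranking than repeated sampling does.
vs_picks = set(np.argsort(pretrained)[::-1][:5].tolist())

print("standard prompting hit answers:", sorted(standard_picks))  # usually only 2-3 distinct answers
print("verbalized sampling lists answers:", sorted(vs_picks))     # 5 distinct answers, incl. less likely ones
```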
Prompt Templates
Here are ready-to-use templates. Replace [N] with how many responses you want (5 is a good starting point), and [YOUR PROMPT] with your actual question.
Basic VS:
Generate [N] responses to the following query, each with their corresponding probability. Query: [YOUR PROMPT]
VS with Tail Sampling:
This version explicitly requests creative, low-probability responses. Replace [M] with how many unusual responses you want (try 2-3).
Generate [N] responses with probabilities. Include at least [M] responses from distribution tails (probability < 0.10). Query: [YOUR PROMPT]
VS-CoT (Chain of Thought):
This version asks the model to explain its reasoning, which can surface even more variety.
Generate [N] responses with probabilities and reasoning for each. Query: [YOUR PROMPT]
The probability values the model attaches to each response (0.35, 0.08, and so on) indicate how likely it considers that answer. Higher numbers mean more "typical" answers; lower numbers (under 0.10) are the creative outliers. Don't take these as precise statistics. Think of them as a rough ranking from "obvious" to "unexpected."
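If you want to script this rather than paste prompts by hand, the sketch below wraps the Basic VS template and pulls out the (response, probability) pairs. The output format it parses, a line ending in "(probability: 0.35)", and the `ask_llm` call are assumptions; adapt them to whatever your model and client actually return.

```python
import re

VS_TEMPLATE = (
    "Generate {n} responses to the following query, each with their "
    "corresponding probability. Query: {query}"
)

def parse_vs_output(text: str) -> list[tuple[str, float]]:
    """Extract (response, probability) pairs from a verbalized-sampling reply."""
    pairs = []
    for line in text.splitlines():
        stripped = line.strip()
        match = re.search(r"\(probability:?\s*([0-9.]+)\)\s*$", stripped)
        if match:
            response = stripped[: match.start()].strip()
            response = re.sub(r"^\d+[.)]\s*", "", response)  # drop "1." / "2)" numbering
            pairs.append((response, float(match.group(1))))
    return pairs

def tail_responses(pairs: list[tuple[str, float]], threshold: float = 0.10):
    """Keep only the low-probability (more unusual) responses."""
    return [(resp, p) for resp, p in pairs if p < threshold]

# prompt = VS_TEMPLATE.format(n=5, query="Write a tagline for a coffee shop")
# reply = ask_llm(prompt)   # ask_llm is a placeholder for your LLM client call
# print(tail_responses(parse_vs_output(reply)))
```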
Results
Creative Writing
VS was tested on poems, stories, and jokes. Results across models:
[Chart: Diversity Improvement by Model, Verbalized Sampling vs Standard Prompting (1.0x = baseline)]
Human evaluators rated VS outputs as more diverse (+25.7%) while maintaining or improving quality scores.
Capability Scaling
An interesting finding: more capable models benefit more from VS.
This suggests that larger models have richer pre-trained distributions that are more heavily suppressed by alignment. VS unlocks more potential when there's more potential to unlock.
Domain Performance
VS improved diversity across:
- Creative writing: Poems, stories, jokes
- Dialogue simulation: Character conversations, roleplay
- Open-ended QA: Questions with multiple valid answers
- Synthetic data generation: Training data creation
VS maintained factual accuracy on knowledge tasks and didn't compromise safety guardrails.
Creativity Recovery
Compared to pre-alignment base models, aligned models show significant creativity loss. VS recovers 66.8% of this lost creativity without requiring any model changes.
[Chart: Creativity Through Alignment Stages, showing how Verbalized Sampling recovers lost creative diversity]
Practical Applications
Content Generation
Marketing copy often suffers from sameness. A single VS prompt produces a spectrum of options, from safe phrasing to creative outliers, so you can choose the register that fits.
Synthetic Data
Training data diversity directly impacts model robustness. If you're using AI to generate training examples for another AI (increasingly common), VS helps you avoid the "garbage in, garbage out" problem.
When you train a customer service bot on 1,000 examples that all sound the same, it learns to handle one type of conversation. When you train it on diverse examples (including edge cases like angry customers, refund disputes, or language barriers), it handles real-world variety better. VS automatically generates this diversity.
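As a concrete sketch of that workflow, you can wrap the tail-sampling template around a handful of seed scenarios and send one prompt per scenario. The scenario list, template wording, and the `ask_llm` client are illustrative assumptions, not part of the paper.

```python
SEED_SCENARIOS = [
    "customer asks about a late delivery",
    "customer disputes a refund",
    "customer is angry about a billing error",
]

VS_DATA_TEMPLATE = (
    "Generate {n} realistic customer messages for this scenario, each with "
    "their corresponding probability. Include at least {m} low-probability "
    "(unusual) messages. Scenario: {scenario}"
)

def build_prompts(n: int = 5, m: int = 2) -> list[str]:
    """One VS prompt per seed scenario; run each through your LLM client."""
    return [VS_DATA_TEMPLATE.format(n=n, m=m, scenario=s) for s in SEED_SCENARIOS]

for prompt in build_prompts():
    print(prompt)
    # reply = ask_llm(prompt)   # hypothetical call to your LLM client
```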
Brainstorming
VS naturally produces a spectrum of ideas from conventional to novel, which makes it well suited to ideation.
Dialogue Systems
Conversations feel scripted when responses are too predictable. VS adds natural variation to simulated dialogue.
Limitations
VS is powerful but not perfect. Here's what to watch for:
Computational Overhead
VS requires generating multiple responses per query. For applications where latency matters, this adds cost. The paper suggests that VS-based approaches can be used selectively: apply VS when diversity matters, use standard prompting otherwise.
Use VS for brainstorming, creative work, or generating training data. Use standard prompting for quick factual queries, code completion, or any task where you need one correct answer fast. VS adds 2-5x more tokens to your output, which means 2-5x more cost and latency.
Probability Calibration
LLMs don't produce well-calibrated probabilities. The "probabilities" in VS outputs are more like relative confidence scores than true statistical probabilities. This doesn't affect diversity improvement but limits applications that need actual probability estimates.
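If a downstream step does need the numbers to behave like a distribution, for example to sample one of the returned responses in proportion to its weight, a simple renormalization of the parsed values is usually good enough. This sketch reuses the (response, probability) pairs from the parsing helper above; treat the result as relative weights, not calibrated probabilities.

```python
import random

def renormalize(pairs: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Rescale verbalized probabilities so they sum to 1 (as returned, they rarely do)."""
    total = sum(p for _, p in pairs) or 1.0
    return [(resp, p / total) for resp, p in pairs]

def weighted_pick(pairs: list[tuple[str, float]]) -> str:
    """Sample one response, treating the verbalized numbers as relative weights."""
    pairs = renormalize(pairs)
    return random.choices(
        [resp for resp, _ in pairs],
        weights=[p for _, p in pairs],
        k=1,
    )[0]
```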
Model-Specific Tuning
Optimal VS parameters (number of responses, tail probability threshold) vary by model. The paper provides starting points, but production use may require experimentation.
Not a Replacement for Fine-Tuning
VS improves diversity at inference time but doesn't change the model's underlying distribution. For applications requiring specific styles or domains, fine-tuning may still be necessary.
Conclusion
Verbalized Sampling offers a practical fix for one of RLHF's unintended consequences. By understanding that mode collapse stems from typicality bias (not algorithmic limitations), the researchers developed a training-free solution that works with any aligned LLM.
Key Takeaways:
- Mode collapse is a data problem: Typicality bias in preference data, not RLHF algorithms, drives diversity loss
- Aligned models retain creativity: The pre-trained distribution isn't deleted, it's suppressed
- Simple prompting recovers diversity: Adding ~20 words to prompts can double creative output
- No quality tradeoff: VS improves diversity without sacrificing accuracy or safety
- Scales with capability: Better models benefit more from VS
For practitioners, VS is immediately applicable. No fine-tuning, no API changes, no model access required. Just change your prompts.
Original paper: arXiv ・ PDF ・ HTML
Authors: Jiayi Zhang, Simon Yu, Derek Chong, Anthony Sicilia, Michael R. Tomz, Christopher D. Manning, Weiyan Shi
Institutions: Northeastern University, Stanford University, West Virginia University
Code: GitHub - CHATS-lab/verbalized-sampling
Website: verbalized-sampling.com
Cite this paper
Jiayi Zhang, Simon Yu, Derek Chong, Christopher D. Manning, Weiyan Shi (2025). Verbalized Sampling: Unlocking LLM Creativity with a Simple Prompt. arXiv 2025.