arXiv 2025 · October 1, 2025

Verbalized Sampling: Unlocking LLM Creativity with a Simple Prompt

Jiayi Zhang et al.

Post-training alignment methods like RLHF make LLMs helpful and safe, but they unintentionally cause mode collapse, where models favor narrow, predictable responses. This paper identifies typicality bias in preference data as the root cause: annotators systematically favor familiar text, and reward models learn to suppress creative output. Verbalized Sampling (VS) is a training-free fix that asks models to generate multiple responses with probabilities, forcing them to access their full pre-trained distribution rather than defaulting to safe answers.

Categories: Large Language Models, Machine Learning

Key Findings

  1. RLHF causes mode collapse due to typicality bias in human preference data, not algorithmic limitations
  2. Verbalized Sampling increases diversity by 1.6-2.1x over direct prompting without sacrificing quality
  3. Adding ~20 words to prompts recovers 66.8% of creativity lost during alignment
  4. More capable models benefit more from Verbalized Sampling
  5. Training-free method works at inference time with no fine-tuning required
  6. Effective across creative writing, dialogue simulation, QA, and synthetic data generation

TL;DR
  1. The Problem. RLHF causes "mode collapse": models give predictable, safe answers because human annotators favor familiar text (typicality bias).

  2. The Fix. Add ~20 words to your prompt asking for "multiple responses with probabilities." This bypasses the aligned persona and accesses the model's full creative distribution.

  3. No Training Required. This is a prompt-only technique. It works immediately with ChatGPT, Claude, and Gemini, boosting diversity 1.6-2.1x without any fine-tuning.

Research Overview

You've probably noticed: ChatGPT often gives the same style of answer. Ask for a joke, and you get the same structure. Ask for a story, and the pacing feels familiar. This is a side effect of how models are trained to be helpful.

Verbalized Sampling (VS) is a simple prompting technique that fixes this. By adding about 20 words to your prompt, you can boost LLM creativity by 1.6-2x without any model changes.

No Training Required: This is purely a prompt technique. It works immediately with any aligned LLM (ChatGPT, Claude, Gemini) without fine-tuning, API changes, or model access.

The core insight: aligned models haven't lost their creativity. They've just learned to hide it. VS forces them to show it again.

The Key Result

  • Creative diversity: 1.6-2.1x increase
  • Human-rated diversity: +25.7%
  • Creativity recovery: 66.8% of pre-alignment levels
  • Quality: maintained (no degradation)

The Problem: Mode Collapse

Post-training alignment (RLHF, RLAIF, DPO) makes LLMs helpful and safe. But it has a hidden cost: mode collapse.

What is RLHF?

RLHF stands for Reinforcement Learning from Human Feedback. After an LLM is initially trained on text from the internet (pre-training), companies like OpenAI and Anthropic run a second phase where human raters judge the model's responses. The model learns to produce answers that humans rate highly. This makes the model helpful and safe, but as this paper shows, it also makes responses more predictable.

Pre-trained model: wide distribution with diverse outputs
After RLHF: narrow distribution → predictable outputs

Mode collapse means the model's output distribution becomes narrow. Instead of sampling from a rich space of possible responses, the model converges on a small set of "safe" answers.

What Mode Collapse Looks Like

Ask ChatGPT for 10 jokes about coffee. You'll notice they follow similar patterns: setup/punchline structure, similar topics (caffeine addiction, morning routines), similar tone. The model has learned that these patterns score well with humans, so it rarely deviates.

Mode collapse has real consequences:

  • Creative applications: Stories, poems, and marketing copy become predictable
  • Synthetic data generation: Training data lacks diversity
  • Dialogue systems: Conversations feel scripted
  • Brainstorming: Models suggest the obvious rather than the novel

Root Cause: Typicality Bias

Previous work blamed mode collapse on algorithmic issues (reward hacking, KL penalty tuning, training instability). This paper identifies a more fundamental cause: typicality bias in preference data.

How Preference Data Gets Collected

Here's how companies train models to be helpful:

  1. An LLM generates multiple responses to a prompt
  2. Human annotators rate which response is "better"
  3. A reward model learns from these preferences
  4. The LLM is fine-tuned to maximize reward
What is a Reward Model?

A reward model is a separate AI that learns to predict which responses humans will prefer. Once trained, it can score millions of responses automatically, guiding the main LLM toward "better" outputs. The problem: if the reward model learns biased preferences, it teaches those biases to the LLM.

The problem is step 2. When annotators judge responses, they systematically prefer:

  • Familiar phrasing over novel expression
  • Expected structure over creative format
  • Predictable content over surprising ideas

This isn't annotator failure. It's cognitive psychology. Humans process familiar text more easily (cognitive fluency), and ease of processing feels like quality.

Cognitive Fluency Explained

When you read something that matches patterns you've seen before, your brain processes it faster. That speed creates a subtle feeling of "rightness." Annotators aren't consciously choosing boring text. Their brains just signal "this feels good" when they read familiar phrases. The result: creative, unusual responses get systematically downvoted even when they're objectively good.

The Cascade Effect

Typicality bias in preference data creates a cascade:

  1. Annotators favor typical responses → Preference data is biased
  2. Reward model learns bias → High rewards for predictable outputs
  3. LLM optimizes for reward → Model suppresses diverse responses
  4. Distribution collapses → Creativity disappears

The paper provides empirical evidence: analyzing preference datasets shows that "typical" responses (closer to the mean of the output distribution) consistently receive higher human ratings.
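
This cascade is easy to see in a toy simulation. The sketch below is illustrative only, not code or data from the paper: five candidate responses have identical true quality, simulated annotators carry a small typicality bias, a Bradley-Terry reward model is fit on their pairwise preferences, and a softmax policy over that reward collapses onto the most typical response. The alpha, beta, and typicality values are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Five candidate responses with EQUAL true quality but different "typicality"
# (log-probability under the pre-trained model). Values are made up.
true_quality = np.zeros(5)
typicality = np.array([-1.0, -2.0, -3.0, -4.0, -5.0])  # log p_base(response)

alpha = 0.5                                    # strength of typicality bias
perceived = true_quality + alpha * typicality  # what annotators implicitly score

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Simulate pairwise preference labels, then fit a Bradley-Terry reward model.
reward = np.zeros(5)
lr = 0.05
for _ in range(20000):
    i, j = rng.choice(5, size=2, replace=False)
    # The annotator prefers i over j with probability driven by the *biased* score.
    label = rng.random() < sigmoid(perceived[i] - perceived[j])
    p = sigmoid(reward[i] - reward[j])
    grad = (1.0 if label else 0.0) - p
    reward[i] += lr * grad
    reward[j] -= lr * grad

# RLHF-style policy: softmax over the learned reward.
beta = 4.0
policy = np.exp(beta * reward)
policy /= policy.sum()

print("learned reward:", np.round(reward, 2))
print("policy probs:  ", np.round(policy, 2))
# Even though all responses were equally good, the policy concentrates on the
# most typical one, because the reward model absorbed the annotators' bias.
```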

The Solution: Verbalized Sampling

Verbalized Sampling bypasses mode collapse by changing what you ask the model to do.

Standard Prompting
You
Tell me a joke about coffee.
AI

"Why did the coffee file a police report? It got mugged!"

Verbalized Sampling
You
Generate 5 jokes about coffee with their corresponding probabilities. Include responses from the distribution tails.
AI
0.35: "Why did the coffee file a police report? It got mugged!"
0.25: "My coffee and I have trust issues—it keeps getting cold on me."
0.08: "Coffee is just anxiety that you drink on purpose."
0.05: "I'm not addicted to coffee. We're just in a committed relationship."

Why This Works

The key insight: aligned models still contain their pre-trained distribution. RLHF doesn't delete creative capabilities. It just adds a layer that suppresses them.

When you ask for a single response, the aligned layer dominates. When you ask for a probability distribution, the model must engage its full generative capacity to produce multiple plausible responses with varying likelihoods.

What are "Distribution Tails"?

Imagine all possible responses to a prompt arranged by how likely the model is to generate them. The most common responses sit in the middle (the "peak"). Rare, unusual responses live in the "tails" on either side. These tail responses are often the most creative because they're unexpected. Standard prompting only gets you peak responses. VS explicitly asks for tail responses, unlocking creativity the model already has but normally hides.

The Two Personalities

Think of an aligned LLM as having two personalities: (1) the original pre-trained model with rich, diverse knowledge, and (2) the RLHF-tuned assistant that prefers safe responses. Standard prompting activates personality #2. Verbalized Sampling forces the model to consult personality #1.

Prompt Templates

Here are ready-to-use templates. Replace [N] with how many responses you want (5 is a good starting point), and [YOUR PROMPT] with your actual question.

Basic VS:

Generate [N] responses to the following query, each with their corresponding probability. Query: [YOUR PROMPT]

VS with Tail Sampling:

This version explicitly requests creative, low-probability responses. Replace [M] with how many unusual responses you want (try 2-3).

Generate [N] responses with probabilities. Include at least [M] responses from distribution tails (probability < 0.10). Query: [YOUR PROMPT]

VS-CoT (Chain of Thought):

This version asks the model to explain its reasoning, which can surface even more variety.

Generate [N] responses with probabilities and reasoning for each. Query: [YOUR PROMPT]

What Do the Probabilities Mean?

The numbers (0.35, 0.08, etc.) represent how likely the model thinks each response is. Higher numbers mean more "typical" answers. Lower numbers (under 0.10) are the creative outliers. Don't take these as precise statistics. Think of them as a rough ranking from "obvious" to "unexpected."
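
As a concrete sketch of using the basic template programmatically, the snippet below assumes an OpenAI-compatible Python client; the model name, the added output-format instruction, the parsing regex, and the 0.10 tail threshold are illustrative choices, not anything prescribed by the paper.

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

VS_TEMPLATE = (
    "Generate {n} responses to the following query, each with their "
    "corresponding probability. Format each line as 'probability: response'. "
    "Query: {query}"
)

def verbalized_sample(query: str, n: int = 5, model: str = "gpt-4o-mini"):
    """Send a Verbalized Sampling prompt and parse 'probability: response' lines."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": VS_TEMPLATE.format(n=n, query=query)}],
    )
    text = completion.choices[0].message.content
    results = []
    for line in text.splitlines():
        match = re.match(r"\s*(\d*\.?\d+)\s*[:\-]\s*(.+)", line)
        if match:
            results.append((float(match.group(1)), match.group(2).strip()))
    return sorted(results, key=lambda item: item[0], reverse=True)

samples = verbalized_sample("Tell me a joke about coffee.")
tail = [(p, r) for p, r in samples if p < 0.10]  # the creative outliers
print("all:", samples)
print("tail:", tail)
```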

Results

Creative Writing

VS was tested on poems, stories, and jokes. Results across models:

Figure: Diversity improvement by model (Verbalized Sampling vs. standard prompting; 1.0x = baseline).

Human evaluators rated VS outputs as more diverse (+25.7%) while maintaining or improving quality scores.

Capability Scaling

An interesting finding: more capable models benefit more from VS.

This suggests that larger models have richer pre-trained distributions that are more heavily suppressed by alignment. VS unlocks more potential when there's more potential to unlock.

Domain Performance

VS improved diversity across:

  • Creative writing: Poems, stories, jokes
  • Dialogue simulation: Character conversations, roleplay
  • Open-ended QA: Questions with multiple valid answers
  • Synthetic data generation: Training data creation

VS maintained factual accuracy on knowledge tasks and didn't compromise safety guardrails.

Creativity Recovery

Compared to pre-alignment base models, aligned models show significant creativity loss. VS recovers 66.8% of this lost creativity without requiring any model changes.

Figure: Creativity through alignment stages, showing how Verbalized Sampling recovers lost creative diversity.

Practical Applications

Content Generation

Marketing copy often suffers from sameness. VS produces a spectrum from safe to creative:

Standard Prompting
You
Write a marketing tagline for a coffee subscription service.
AI

"Fresh coffee, delivered to your door."

Verbalized Sampling
You
Generate 5 marketing taglines for a coffee subscription service with probabilities. Include options from distribution tails.
AI
0.40: "Fresh coffee, delivered to your door."
0.25: "Wake up to something worth waking up for."
0.15: "Your morning routine, upgraded."
0.08: "Break up with bad coffee. We'll be your rebound."
0.04: "Subscription? No. A standing ovation for your taste buds."

Synthetic Data

Training data diversity directly impacts model robustness. If you're using AI to generate training examples for another AI (increasingly common), VS helps you avoid the "garbage in, garbage out" problem.

Why Diversity Matters for Training Data

When you train a customer service bot on 1,000 examples that all sound the same, it learns to handle one type of conversation. When you train it on diverse examples (including edge cases like angry customers, refund disputes, or language barriers), it handles real-world variety better. VS automatically generates this diversity.

Standard Prompting
You
Write a customer support conversation about a shipping delay.
AI

Customer: Where is my order?

Agent: I apologize for the delay. Let me check the status for you. It looks like your package is delayed due to high volume. It should arrive within 2-3 business days.

Verbalized Sampling
You
Generate 4 customer support conversations about shipping delays with probabilities. Include edge cases from distribution tails.
AI
0.45: Polite inquiry → Standard apology + tracking update
0.25: Frustrated repeat customer → Escalation + compensation offer
0.06: Customer claims package marked delivered but not received
0.03: International shipping + customs hold + language barrier
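
A quick way to sanity-check that VS is actually buying you diversity in generated data is a distinct-n metric over the outputs. The sketch below is illustrative only; the tiny hand-written examples stand in for real generations, and this is not the paper's evaluation protocol.

```python
def distinct_n(texts, n=2):
    """Fraction of n-grams that are unique across a set of generated texts.

    Higher values mean more lexical diversity; 1.0 means no n-gram repeats.
    """
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / max(len(ngrams), 1)

# Hypothetical outputs: direct prompting tends to repeat phrasing, VS varies it.
direct_outputs = [
    "Customer: Where is my order? Agent: I apologize for the delay.",
    "Customer: Where is my order? Agent: I apologize for the delay in shipping.",
]
vs_outputs = [
    "Customer: Where is my order? Agent: I apologize for the delay.",
    "Customer: This says delivered but nothing arrived. Agent: Let me open a claim.",
]

print("direct distinct-2:", round(distinct_n(direct_outputs), 2))
print("VS distinct-2:    ", round(distinct_n(vs_outputs), 2))
```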

Brainstorming

VS naturally produces a spectrum from conventional to novel, perfect for ideation:

Standard Prompting
You
Suggest a new feature for a fitness app.
AI

Add a social feature where users can share their workouts and follow friends for motivation.

Verbalized Sampling
You
Generate 6 product feature ideas for a fitness app with probabilities. Include unconventional ideas from the tails.
AI
0.35: Social sharing and friend activity feeds
0.25: AI-personalized workout recommendations
0.15: Integration with smart home (dim lights for yoga, energize playlist for HIIT)
0.07: "Workout debt" system (missed days add to future sessions)
0.04: Reverse mode: app suggests what NOT to do based on yesterday's strain
0.02: Anonymous local leaderboards (compete with neighbors without knowing who)

Dialogue Systems

Conversations feel scripted when responses are too predictable. VS adds natural variation:

Standard Prompting
You
How would an enthusiastic coworker respond to "How was your weekend?"
AI

"It was great, thanks for asking! I went hiking and spent time with family. How was yours?"

Verbalized Sampling
You
Generate 5 responses to "How was your weekend?" with probabilities. Character: Enthusiastic coworker.
AI
0.35: "So good! Finally tried that new brunch place—totally worth the wait."
0.25: "Honestly? I did absolutely nothing and it was everything."
0.18: "Don't even get me started—I have PICTURES." pulls out phone
0.06: "You know that feeling when Monday comes too fast? Yeah."
0.03: "I impulse-bought a kayak. I don't know how to kayak."

Limitations

VS is powerful but not perfect. Here's what to watch for:

Computational Overhead

VS requires generating multiple responses per query. For applications where latency matters, this adds cost. The paper suggests that VS-based approaches can be used selectively: apply VS when diversity matters, use standard prompting otherwise.

When to Use VS vs. Standard Prompting

Use VS for brainstorming, creative work, or generating training data. Use standard prompting for quick factual queries, code completion, or any task where you need one correct answer fast. VS adds 2-5x more tokens to your output, which means 2-5x more cost and latency.

Probability Calibration

LLMs don't produce well-calibrated probabilities. The "probabilities" in VS outputs are more like relative confidence scores than true statistical probabilities. This doesn't affect diversity improvement but limits applications that need actual probability estimates.
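
If you do want to reuse the verbalized numbers downstream, one pragmatic option is to treat them as unnormalized weights rather than probabilities. This is a small illustrative sketch with made-up parsed output, not a calibration method from the paper.

```python
import random

# Hypothetical parsed VS output: (verbalized probability, response) pairs.
samples = [
    (0.35, "Why did the coffee file a police report? It got mugged!"),
    (0.25, "My coffee and I have trust issues."),
    (0.08, "Coffee is just anxiety you drink on purpose."),
    (0.05, "I'm not addicted to coffee. We're in a committed relationship."),
]

# Verbalized probabilities rarely sum to 1 and aren't calibrated, so renormalize
# before using them as weights for ranking or sampling.
total = sum(p for p, _ in samples)
weights = [p / total for p, _ in samples]

# A weighted pick favors "typical" answers; a uniform pick over the tail favors novelty.
weighted_choice = random.choices([r for _, r in samples], weights=weights, k=1)[0]
tail_choice = random.choice([r for p, r in samples if p < 0.10])

print(weighted_choice)
print(tail_choice)
```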

Model-Specific Tuning

Optimal VS parameters (number of responses, tail probability threshold) vary by model. The paper provides starting points, but production use may require experimentation.

Not a Replacement for Fine-Tuning

VS improves diversity at inference time but doesn't change the model's underlying distribution. For applications requiring specific styles or domains, fine-tuning may still be necessary.

Conclusion

Verbalized Sampling offers a practical fix for one of RLHF's unintended consequences. By understanding that mode collapse stems from typicality bias (not algorithmic limitations), the researchers developed a training-free solution that works with any aligned LLM.

Key Takeaways:

  1. Mode collapse is a data problem: Typicality bias in preference data, not RLHF algorithms, drives diversity loss

  2. Aligned models retain creativity: The pre-trained distribution isn't deleted, it's suppressed

  3. Simple prompting recovers diversity: Adding ~20 words to prompts can double creative output

  4. No quality tradeoff: VS improves diversity without sacrificing accuracy or safety

  5. Scales with capability: Better models benefit more from VS

For practitioners, VS is immediately applicable. No fine-tuning, no API changes, no model access required. Just change your prompts.


Original paper: arXiv (PDF / HTML)

Authors: Jiayi Zhang, Simon Yu, Derek Chong, Anthony Sicilia, Michael R. Tomz, Christopher D. Manning, Weiyan Shi

Institutions: Northeastern University, Stanford University, West Virginia University

Code: GitHub - CHATS-lab/verbalized-sampling

Website: verbalized-sampling.com

Authors

Jiayi Zhang (Northeastern University), Simon Yu (Northeastern University), Derek Chong (Stanford University), Christopher D. Manning (Stanford University), Weiyan Shi (Northeastern University)

Cite this paper

Jiayi Zhang, Simon Yu, Derek Chong, Anthony Sicilia, Michael R. Tomz, Christopher D. Manning, Weiyan Shi (2025). Verbalized Sampling: Unlocking LLM Creativity with a Simple Prompt. arXiv 2025.
