- The Problem. LLMs produce outputs that answer the question but ignore constraints. "Write a 200-word summary" returns 450 words. "Include exactly 3 bullet points" returns 5. Conceptually correct, procedurally broken.
- The Solution. A team of specialized LLM agents that separates task descriptions from constraints, scores compliance quantitatively, and iteratively edits the prompt until the generator follows instructions. No retraining required.
- The Results. On hard prompts (baseline ~30% compliance), the workflow achieves a 53.74% improvement rate. Simply separating constraints from tasks improves compliance from 82% to 91.5% before any iteration begins.
Non-compliant outputs create rework. A chatbot that ignores formatting rules forces manual editing. An API that returns 500-word responses when you asked for 100 breaks downstream parsing. This workflow lets you fix prompts once during development, then reuse the optimized versions across thousands of generations. The ROI is front-loaded: invest in prompt curation, save on production firefighting.
Research overview
You've written the perfect prompt. It describes exactly what you want. It specifies the format, the length, the tone, and the structure. You hit enter. The LLM returns something useful but completely ignores half your requirements.
This is the instruction following problem, and it costs engineering teams more time than they'd like to admit.
Instruction following is the ability of an LLM not just to generate relevant content, but to satisfy formal constraints: word limits, bullet counts, required sections, forbidden topics, specific formatting rules. A model can be "helpful" while still being non-compliant.
The paper from Capital One's Card Intelligence team attacks this problem with an insight that seems obvious in hindsight: treat prompt refinement as a collaborative engineering task, not a single-shot generation.
Think of it like a magazine editing team. The writer drafts the piece, the fact-checker scores each claim, the editor translates feedback into suggestions, the managing editor decides what to fix first, and a copy editor makes the actual changes. Each person has a narrow job. The loop continues until the piece passes muster.
If you've followed the buzz around "reasoning models" like OpenAI's o1, you'll recognize a pattern here. Those models internalize slow, deliberate thinking (System 2) into their weights. This paper achieves similar benefits externally: the multi-agent loop forces the system to think before speaking, evaluate its own output, and iteratively correct mistakes. You get reasoning behavior without waiting for a new model release.
Their system deploys five specialized agents working in a loop:
| Agent | Role |
|---|---|
| Generator | Produces candidate responses |
| Evaluator | Scores each constraint 0-10 |
| Translator | Converts scores to actionable feedback |
| Planner | Decides which constraint to fix and how |
| Editor | Modifies the constraint text |
The workflow continues until compliance hits a target, patience runs out, or the maximum iterations are reached.
What makes this different from "just prompt better"? The system uses quantitative feedback. Instead of vague "try again" signals, each constraint gets a numeric score. The planner can see exactly which requirements are failing and by how much. This precision enables intelligent, targeted edits rather than random rewrites.
The compliance problem
How bad is instruction following in practice? In the paper's experiments with Llama 3.1 8B and Mixtral-8x7B, these models only follow complex instructions about 82% of the time. That sounds acceptable until you consider what 18% failure means at scale: nearly 1 in 5 outputs needs manual review or correction.
Consider a prompt for generating hotel reviews:
Write exactly 10 hotel reviews. Each review should be between 50-100 words. Include at least one review mentioning room service. Do not mention any specific hotel chains by name.
A typical LLM might return 8 reviews, one with 150 words, none mentioning room service, and two naming Marriott. The content is useful, readable, and completely non-compliant.
Training objectives optimize for next-token prediction, not constraint satisfaction. The model learns to produce plausible text, not to count bullets or measure word lengths. Constraints require a different kind of attention than content generation.
The problem compounds in production systems:
| Use Case | Constraint Type | Failure Mode |
|---|---|---|
| Legal documents | Exact section headings | Missing required sections |
| API responses | JSON structure | Invalid syntax, extra fields |
| Marketing copy | Word limits | Overlong content, truncation |
| Report generation | Citation format | Inconsistent references |
| Chatbot responses | Tone requirements | Professional/casual mismatch |
Manual prompt iteration can fix individual cases, but it doesn't scale. Each new edge case requires human attention. The authors wanted a system that could automatically diagnose and repair compliance failures.
Core innovation
The paper introduces three key ideas that work together:
1. Constraint decoupling
Traditional prompts mix task descriptions with constraints in a single block. The authors separate them:
Before (coupled):
Write a 200-word professional summary of the quarterly report, including exactly 3 key metrics, formatted as bullet points, and ending with a call to action.
After (decoupled):
Task: Write a professional summary of the quarterly report ending with a call to action.
Constraints:
- The summary must be exactly 200 words
- Include exactly 3 key metrics
- Format the metrics as bullet points
This separation alone improved compliance from 82% to 91.5% across both tested models. Why? The model can attend to constraints as a distinct checklist rather than parsing them from flowing prose.
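As a minimal sketch, a decoupled prompt can be assembled from a task string and a constraint list. The function and field names here are illustrative, not from the paper:

```python
def build_decoupled_prompt(task: str, constraints: list[str]) -> str:
    """Render the task and its constraints as two distinct sections,
    so the model can treat the constraints as a checklist."""
    lines = [f"Task: {task}", "", "Constraints:"]
    lines += [f"- {c}" for c in constraints]
    return "\n".join(lines)

prompt = build_decoupled_prompt(
    task="Write a professional summary of the quarterly report "
         "ending with a call to action.",
    constraints=[
        "The summary must be exactly 200 words",
        "Include exactly 3 key metrics",
        "Format the metrics as bullet points",
    ],
)
```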
Think of it like a chef following a recipe. A cramped sticky note that mixes ingredients, cooking times, and plating instructions together is hard to follow. But a recipe card with the method in one section and a separate checklist for timing and presentation? The chef can focus on the cooking while glancing at the checklist to hit every specification.
2. Quantitative scoring
Instead of binary pass/fail, each constraint gets a score from 0-10, normalized to [0,1]:
| Constraint | Score | Interpretation |
|---|---|---|
| Word count (200) | 0.8 | Close, slightly over |
| 3 key metrics | 0.3 | Only 1 metric found |
| Bullet formatting | 1.0 | Perfect compliance |
The planner sees these scores and can prioritize: fix the metric constraint first, it's the worst offender.
Example walkthrough. For the constraint "Exactly 3 bullet points," the evaluator returns a score of 0.2 (only 1 bullet found). The translator converts this to: "Bullet-point constraint failed: 1/3 bullets present." The planner chooses the Rephrase action and outputs: "Provide exactly three bullet points, each on its own line starting with a dash." After the next iteration, the evaluator raises the score to 0.9. The loop continues or terminates based on the remaining constraints.
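A toy sketch of this prioritization step: normalize the raw 0-10 scores to [0, 1] and surface the worst offender for the planner. The code is illustrative, not from the paper:

```python
def worst_constraint(scores_0_to_10: dict[str, float]) -> tuple[str, float]:
    """Normalize raw 0-10 scores and return the lowest-scoring constraint."""
    normalized = {name: score / 10.0 for name, score in scores_0_to_10.items()}
    name = min(normalized, key=normalized.get)
    return name, normalized[name]

name, score = worst_constraint({"word_count": 8, "three_metrics": 3, "bullets": 10})
print(name, score)  # three_metrics 0.3
```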
3. Strategic editing
The system doesn't randomly rewrite constraints. It chooses from four actions:
| Action | When to Use | Example |
|---|---|---|
| Rephrase | Constraint is unclear | "be concise" → "use 50-75 words" |
| Split | Constraint bundles multiple requirements | "3 short bullets" → "exactly 3 bullets" + "each under 20 words" |
| Merge | Related constraints confuse the model | Combine formatting rules |
| Reorder | Critical constraints get buried | Move failing constraint to top |
[Figure: Editing Action Distribution. Rephrasing dominates, especially in successful cases.]
In practice, rephrasing dominated (88% of actions). This suggests that clarity is the main issue. Models can follow precise instructions; they struggle with ambiguous ones.
Multi-agent architecture
The workflow orchestrates seven components using LangGraph for state management. The first two (Input Prompt and Constraints Extractor) run once at the start. The remaining five agents loop until compliance is achieved.
[Figure: Multi-Agent Instruction Refinement Workflow. A constraint-driven architecture for iterative content generation, evaluation, and refinement.]
Generator agent
Takes the current prompt version and produces three candidate responses. Using multiple outputs provides a better signal than a single sample.
Configuration:
- Temperature: 0.9
- Top-p (nucleus sampling): 0.95
- Outputs: 3 responses per iteration
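A minimal sketch of the generator step with these settings, assuming an OpenAI-compatible chat API; the model name and client setup are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint serving your generator model

def generate_candidates(prompt: str, n: int = 3) -> list[str]:
    """Sample multiple candidate responses so the evaluator gets a better signal."""
    resp = client.chat.completions.create(
        model="llama-3.1-8b-instruct",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        n=n,                # 3 responses per iteration
        temperature=0.9,    # generator temperature reported in the paper
        top_p=0.95,         # nucleus sampling as reported
    )
    return [choice.message.content for choice in resp.choices]
```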
Evaluator agent
Uses the LLM-as-judge pattern. For each response, it scores every constraint independently on a 0-10 scale.
LLM-as-judge means using a capable model to grade the output of another model (or itself). Think of it as having a senior developer review a junior's code. The judge doesn't generate the answer; it evaluates whether the answer meets requirements. This pattern lets you automate quality checks that would otherwise require human reviewers.
Validation results:
- Human annotator agreement: 96%
- LLM-judge vs human agreement: 79-81%
The per-constraint scoring is critical. Aggregate scores ("overall this response is 7/10") don't tell the planner which constraint needs work.
Translator agent
The bridge between numbers and actionable insight. The translator converts raw scores (8/10, 3/10) into qualitative feedback the planner can act on: "Constraint 2 is failing because the model included 5 bullet points instead of 3. Previous rephrase attempt did not help."
The translation includes:
- Which constraints improved or degraded since the last iteration
- The edit history (what was already tried)
- Concrete response excerpts showing where compliance failed
This context is why the planner knows what to fix, not just that something failed. Without the translator, the planner would be flying blind.
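As an illustration of what the translator produces, here is a sketch that turns per-constraint scores, their history, and prior edits into a short feedback string. All names are hypothetical and the state shape matches the schema shown later under Data structures:

```python
def translate(constraints: list[dict]) -> str:
    """Summarize per-constraint score movement and prior edits for the planner."""
    lines = []
    for c in constraints:
        history = c["history"]
        if len(history) >= 2:
            trend = "improved" if history[-1] > history[-2] else "degraded or flat"
        else:
            trend = "no prior iteration"
        lines.append(
            f"Constraint {c['id']} ('{c['text']}'): score {c['score']:.1f}, "
            f"{trend}; last action: {c.get('last_action') or 'none'}"
        )
    return "\n".join(lines)
```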
Planner agent
The strategic decision-maker. It receives the translator's summary and decides:
- Which constraint to modify
- Which editing action to apply
- What the new constraint text should look like
The planner generates three parallel strategies, evaluated simultaneously. Temperature 0.9 encourages diverse approaches.
Editor agent
Executes the planner's selected strategy. Takes the old constraint list and produces a new version with the specified modification. Greedy decoding (temperature 0) ensures deterministic execution.
How the agents collaborate
A single iteration flows like this:
1. Generator produces 3 responses from current prompt
2. Evaluator scores each constraint for each response
3. Translator summarizes: "Constraint 2 dropped from 0.7 to 0.4 after last edit"
4. Planner proposes: "Rephrase constraint 2 to be more specific"
5. Editor outputs: new constraint list
6. Loop back to Generator with updated prompt
Termination conditions:
- Perfect compliance achieved (all constraints score 1.0)
- Maximum iterations reached (N_max = 5)
- No improvement for patience threshold (P_max = 2 iterations)
The patience mechanism prevents infinite loops when a constraint is fundamentally unsatisfiable by the model.
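Putting the termination rules together, a bare-bones control loop might look like this. It is a sketch under the paper's reported parameters; the agent callables passed in via `agents` are placeholders for the five LLM calls:

```python
def refine(prompt, constraints, agents, n_max=5, p_max=2, target=1.0):
    """Run the generate -> evaluate -> translate -> plan -> edit loop
    with all three stop rules: full compliance, max iterations, patience."""
    best, patience = 0.0, 0
    for _ in range(n_max):
        responses = agents["generate"](prompt, constraints)
        scores = agents["evaluate"](responses, constraints)  # one [0, 1] score per constraint
        if min(scores.values()) >= target:
            break                                            # perfect compliance
        avg = sum(scores.values()) / len(scores)
        if avg > best:
            best, patience = avg, 0
        else:
            patience += 1
            if patience >= p_max:
                break                                        # no improvement for p_max iterations
        feedback = agents["translate"](scores, constraints)
        plan = agents["plan"](feedback)
        constraints = agents["edit"](constraints, plan)
    return constraints
```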
Experimental results
The authors extended InfoBench with a "Translated Constraints" column, creating 500 test samples with explicitly decomposed constraints.
InfoBench is a curated benchmark of prompts designed to test LLMs' ability to follow detailed instructions and constraints. It provides a standardized set of tasks for measuring compliance, making it the go-to dataset for evaluating instruction-following improvements.
[Figure: Compliance Improvement Across Stages. Progressive gains from constraint separation and workflow optimization.]
Baseline comparison
| Model | Without Constraints | With Constraints |
|---|---|---|
| Llama 3.1 8B | 82.1% | 91.5% |
| Mixtral-8x7B | 81.8% | 91.4% |
Just adding explicit constraints improved compliance by ~10 percentage points. The multi-agent workflow then builds on this baseline.
Workflow improvements
The multi-agent workflow adds another layer of gains on top of constraint separation.
| Model | Gain | Success Rate |
|---|---|---|
| Mixtral-8x7B | +13.1% | 41.0% |
| Llama 3.1 8B | +13.0% | 35.2% |
These numbers are for prompts that actually improved. The "Success Rate" shows what percentage of prompts saw any improvement at all.
Hard vs. easy prompts
The workflow shines on difficult cases:
| Prompt Difficulty | Baseline Compliance | Improvement Rate |
|---|---|---|
| Hard (~30% baseline) | 29.8% | 53.74% |
| Easy (~71% baseline) | 71.2% | 18.33% |
When prompts start with low compliance, there's more room for the system to help. Already-compliant prompts don't need optimization.
Iteration analysis
| Outcome | Average Iterations |
|---|---|
| Already compliant | 0 |
| No improvement | 2.00 |
| Compliance increased | 2.38 |
Successful improvements required slightly more iterations on average. This suggests that patience (continuing to refine) correlates with positive outcomes.
What works and what doesn't
Ablation: removing quantitative feedback
When the evaluator provided only qualitative feedback ("constraint 2 is failing") without scores, performance dropped.
| Model | With Scores | Without |
|---|---|---|
| Mixtral | 41.0% | 38.1% |
| Llama | 35.2% | 34.5% |
Quantitative scores provide meaningful signal, though the effect varies by model. The planner makes better decisions when it knows a constraint scored 0.3 versus 0.7.
Action effectiveness
The planner overwhelmingly chose rephrasing.
| Action | Avg. Uses (All Prompts) | Avg. Uses (Improved Prompts) |
|---|---|---|
| Rephrase | 0.88 | 1.81 |
| Split | 0.14 | 0.35 |
| Reorder | 0.12 | 0.18 |
| Merge | 0.00 | 0.00 |
Successful cases used rephrasing twice as often as average. Split operations also appeared more frequently when improvements happened, suggesting that decomposing complex constraints helps.
Merge was never selected. The initial constraint decomposition already separates concerns, leaving nothing to combine.
Implementation blueprint
The system requires no model fine-tuning. You can build this with off-the-shelf components.
Recommended tech stack
These are the tools the authors used to get their results.
| Component | Recommended |
|---|---|
| Orchestration | LangGraph |
| Generator | Llama 3 8B |
| Planner | Llama 3 70B |
| State | In-memory |
Alternatives: LangChain for orchestration, Mixtral for generator, GPT-4o for planner, Redis for distributed state.
LangGraph handles stateful graph execution, letting you define agent transitions as edges in a graph. Each node is an agent, each edge is a conditional transition based on scores or iteration counts. It's purpose-built for multi-agent loops.
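A minimal LangGraph sketch of that loop structure follows. The node bodies are placeholders and this is a scaffold under our own naming, not the authors' code:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class RefineState(TypedDict):
    prompt: str
    constraints: list
    scores: dict
    iteration: int
    patience: int

def _todo(name):
    def node(state: RefineState) -> dict:
        raise NotImplementedError(f"implement the {name} agent call here")
    return node

generator_node = _todo("generator")
evaluator_node = _todo("evaluator")
translator_node = _todo("translator")
planner_node = _todo("planner")
editor_node = _todo("editor")

def should_continue(state: RefineState) -> str:
    """Conditional edge: stop on full compliance, max iterations, or exhausted patience."""
    if state["scores"] and min(state["scores"].values()) >= 1.0:
        return "done"
    if state["iteration"] >= 5 or state["patience"] >= 2:
        return "done"
    return "refine"

graph = StateGraph(RefineState)
graph.add_node("generator", generator_node)
graph.add_node("evaluator", evaluator_node)
graph.add_node("translator", translator_node)
graph.add_node("planner", planner_node)
graph.add_node("editor", editor_node)

graph.set_entry_point("generator")
graph.add_edge("generator", "evaluator")
graph.add_conditional_edges("evaluator", should_continue,
                            {"refine": "translator", "done": END})
graph.add_edge("translator", "planner")
graph.add_edge("planner", "editor")
graph.add_edge("editor", "generator")  # loop back with the edited constraints

app = graph.compile()
```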
Core workflow
The system operates in a single loop with five steps per iteration.
1. Generate candidates. The generator LLM produces 3 responses from the current prompt. Multiple outputs give a statistical signal.
2. Score each constraint. The evaluator grades every constraint 0-10 for each response. Store scores in your state object.
3. Translate to feedback. Convert numeric scores to natural language. Include which constraints improved or degraded since the last iteration.
4. Plan the edit. The planner picks one constraint and one action (rephrase, split, merge, reorder). Generate 3 strategies in parallel.
5. Execute the edit. The editor rewrites the constraint list. Loop back to step 1.
Termination: Stop when all constraints score 1.0, when you hit 5 iterations, or when 2 iterations pass with no improvement.
Key parameters
These produced the benchmark numbers. Start here, tune later.
| Parameter | Value |
|---|---|
| Generator temp | 0.9 |
| Planner temp | 0.9 |
| Editor temp | 0.0 |
| Max iterations | 5 |
| Patience | 2 |
| Samples per iter | 3 |
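These knobs can live in a small config object so they stay in one place. A sketch: the field names are ours, the values come from the table above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WorkflowConfig:
    generator_temperature: float = 0.9
    planner_temperature: float = 0.9
    editor_temperature: float = 0.0   # greedy decoding for deterministic edits
    max_iterations: int = 5
    patience: int = 2
    samples_per_iteration: int = 3

config = WorkflowConfig()
```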
Data structures
Build your state object around this constraint schema:
```json
{
  "constraints": [
    {
      "id": 1,
      "text": "Exactly 200 words",
      "score": 0.85,
      "history": [0.6, 0.75, 0.85],
      "last_action": "rephrase"
    },
    {
      "id": 2,
      "text": "Include 3 bullet points",
      "score": 0.3,
      "history": [0.3],
      "last_action": null
    }
  ],
  "iteration": 2,
  "patience_counter": 0
}
```

The history array tracks score progression. The planner uses this to avoid repeating failed strategies on the same constraint.
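One way to use that history, sketched below, is to skip constraints whose last edit produced no gain so the planner doesn't keep hammering the same target. The function is illustrative, not from the paper:

```python
def pick_target(constraints: list[dict]) -> dict | None:
    """Return the lowest-scoring constraint whose last edit still helped (or was never edited)."""
    candidates = []
    for c in constraints:
        history = c["history"]
        stalled = len(history) >= 2 and history[-1] <= history[-2] and c["last_action"]
        if c["score"] < 1.0 and not stalled:
            candidates.append(c)
    return min(candidates, key=lambda c: c["score"]) if candidates else None
```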
Prompt templates
The evaluator needs structured output for parsing:
```
Score each constraint 0-10.
Constraints: {constraints}
Response: {response}
Format: "C1: [score] - [reason]"
```
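A sketch of parsing that output format, including the "8/10"-style scores mentioned under Pitfalls below; the regex and function name are ours:

```python
import re

LINE = re.compile(r"C(\d+):\s*(\d+(?:\.\d+)?)(?:\s*/\s*10)?\s*-\s*(.*)")

def parse_scores(raw: str) -> dict[int, float]:
    """Parse 'C1: 8 - reason' or 'C1: 8/10 - reason' lines into normalized [0, 1] scores."""
    scores = {}
    for line in raw.splitlines():
        match = LINE.match(line.strip())
        if match:
            cid, score, _reason = match.groups()
            scores[int(cid)] = min(float(score), 10.0) / 10.0
    return scores

print(parse_scores("C1: 8/10 - slightly over the word limit\nC2: 3 - only 1 metric"))
# {1: 0.8, 2: 0.3}
```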
The planner needs context to avoid loops:
```
Current: {constraints}
Failed: {failing_constraints}
Previous edits: {edit_history}
Pick ONE action: REPHRASE/SPLIT/REORDER
Output: Action, Target, New text
```
Pitfalls and gotchas
These issues won't show up in unit tests but will hurt production.
Subjective constraints fail. Constraints like "be funny" or "sound professional" are hard for the judge to score consistently. The evaluator may give 6/10 one run and 8/10 the next, causing the planner to chase its tail. Stick to objectively verifiable constraints (word counts, required sections, forbidden terms).
Score parsing failures. LLMs sometimes return "8/10" instead of "8". Normalize with regex before comparison.
Infinite rephrase loops. The planner may keep rephrasing the same constraint without improving it. Track edit history and penalize repeated targets.
Evaluator drift. The LLM-as-judge may score differently across sessions. Use fixed seeds or ensemble multiple evaluations.
Cost explosion. A 5-iteration run burns 55 LLM calls. Cache optimized constraints and reuse them across similar tasks.
Cost considerations
Each iteration involves multiple model calls. Here's the breakdown.
| Agent | Calls | Model Size |
|---|---|---|
| Generator | 3 | Small (8B) |
| Evaluator | 3 | Large (70B) |
| Translator | 1 | Large (70B) |
| Planner | 3 | Large (70B) |
| Editor | 1 | Large (70B) |
For a 5-iteration run, that's 55 calls total. The latency adds up because agents run sequentially.
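A quick back-of-the-envelope helper for the call budget; plug in your own per-call rates if you want dollar estimates:

```python
CALLS_PER_ITERATION = {"generator": 3, "evaluator": 3, "translator": 1, "planner": 3, "editor": 1}

def total_calls(iterations: int = 5) -> int:
    """Total LLM calls for a full run at the paper's per-iteration call counts."""
    return iterations * sum(CALLS_PER_ITERATION.values())

print(total_calls(5))  # 55 calls for a full 5-iteration run
```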
The Patience parameter is your cost control. Setting patience=2 means the workflow stops after 2 iterations with no improvement. This prevents wasting tokens on constraints that are fundamentally unsatisfiable. In the paper's experiments, most improvements happened within the first 2-3 iterations anyway. If you're not seeing gains by iteration 3, you probably won't see them at iteration 5.
Mitigation strategies:
- Run offline for prompt library curation (the primary use case)
- Cache optimized constraints and reuse across similar tasks
- Use smaller planner models (7B) if latency matters more than quality
- Batch multiple prompts through the same workflow
Limitations
The authors acknowledge several constraints:
Quality ceiling
The system is only as good as its initial constraint decomposition. If the original prompt's constraints are fundamentally unclear or contradictory, the workflow can't fix that.
Evaluation brittleness
LLM-as-judge has known biases:
- Verbose responses often score higher
- Self-generated content rates better than human content
- Edge cases get inconsistent scores
The 79-81% human agreement means roughly 20% of scores are potentially wrong, propagating errors through the refinement loop.
Fixed action space
The four actions (rephrase, split, merge, reorder) can't handle all failure modes:
- Can't add new constraints the user forgot
- Can't remove constraints that conflict
- Can't restructure the entire prompt architecture
Computational overhead
Sequential multi-agent calls create latency. The paper doesn't quantify this, but production deployments need to consider whether the compliance gains justify the additional inference cost.
Model dependence
Results were validated on Llama and Mixtral. Other model families may require different planner strategies or evaluation calibration.
Best for: Curating prompt libraries, high-stakes compliance requirements, offline optimization before deployment. Less suitable for: Real-time applications, simple prompts, budget-constrained environments.
Business implications
For executives evaluating this approach, here's the trade-off calculus:
The cost side. Running a 5-iteration optimization loop burns 55 LLM calls per prompt. At typical API rates, that's $0.10-0.50 per prompt optimization. The workflow adds latency (sequential agent calls) and infrastructure complexity (LangGraph, state management).
The benefit side. In regulated industries (finance, legal, healthcare), "90% accuracy" isn't acceptable. A loan document that omits a required disclosure, or a medical summary that exceeds the word limit for a form field, creates compliance risk and rework. Manual review of non-compliant outputs costs $10-50 per instance in analyst time.
The math. If your prompts currently fail 18% of the time and each failure costs $20 to fix manually, you're spending $3.60 per generation on rework. Optimizing the prompt once for $0.50 and reusing it across thousands of generations is a clear win.
Before vs. after (1,000 generations):
- Before optimization: 18% failure rate x $20 rework = $3,600 in analyst time
- After optimization: one-time $0.50 optimization + roughly 2% failure rate x $20 rework ≈ $400 total
That's roughly a 90% reduction in compliance costs for a single prompt template.
When it doesn't work. If your prompts already hit 95%+ compliance, or if your domain tolerates occasional format violations, the optimization overhead isn't justified. This is a tool for high-stakes, high-volume scenarios where precision matters.
Practical applications
The multi-agent prompt refinement workflow fits several production scenarios:
Enterprise document generation: Legal, financial, and compliance documents require precise formatting. Run the workflow once to optimize prompt templates, then reuse them across thousands of generations.
Chatbot quality assurance: Before deploying a new conversational prompt, run it through the workflow to catch compliance failures early. Cheaper than fixing production bugs.
RAG pipeline hardening: Retrieval-augmented systems often have strict output format requirements for downstream parsing. Optimized prompts reduce parsing errors.
A/B testing prompt variants: Use the workflow to generate multiple compliant prompt versions, then test which performs best with real users.
The key insight is treating prompt engineering as a systematic, measurable process rather than ad-hoc iteration. The workflow provides a reproducible methodology that can be integrated into CI/CD pipelines for AI systems.
Cite this paper
Alberto Purpura, Li Wang, Sahil Badyal, Eugenio Beaufrand, Adam Faulkner (2026). From Vague to Precise: How Multi-Agent LLM Teams Fix Your Prompts. arXiv 2026.