- The Problem. LLMs produce outputs that answer the question but ignore constraints. "Write a 200-word summary" returns 450 words. "Include exactly 3 bullet points" returns 5. Conceptually correct, procedurally broken.
- The Solution. A team of specialized LLM agents that separates task descriptions from constraints, scores compliance quantitatively, and iteratively edits the prompt until the generator follows instructions. No retraining required.
- The Results. On hard prompts (baseline ~30% compliance), the workflow achieves a 53.74% improvement rate. Simply separating constraints from tasks improves compliance from 82% to 91.5% before any iteration begins.
Non-compliant outputs create rework. A chatbot that ignores formatting rules forces manual editing. An API that returns 500-word responses when you asked for 100 breaks downstream parsing. This workflow lets you fix prompts once during development, then reuse the optimized versions across thousands of generations. The ROI is front-loaded: invest in prompt curation, save on production firefighting.
Research overview
You've written the perfect prompt. It describes exactly what you want. It specifies the format, the length, the tone, and the structure. You hit enter. The LLM returns something useful but completely ignores half your requirements.
This is the instruction following problem, and it costs engineering teams more time than they'd like to admit.
Instruction following is the ability of an LLM not just to generate relevant content, but to satisfy formal constraints: word limits, bullet counts, required sections, forbidden topics, specific formatting rules. A model can be "helpful" while still being non-compliant.
The paper from Capital One's Card Intelligence team attacks this problem with an insight that seems obvious in hindsight: treat prompt refinement as a collaborative engineering task, not a single-shot generation.
Think of it like a magazine editing team. The writer drafts the piece, the fact-checker scores each claim, the editor translates feedback into suggestions, the managing editor decides what to fix first, and a copy editor makes the actual changes. Each person has a narrow job. The loop continues until the piece passes muster.
If you've followed the buzz around "reasoning models" like OpenAI's o1, you'll recognize a pattern here. Those models internalize slow, deliberate thinking (System 2) into their weights. This paper achieves similar benefits externally: the multi-agent loop forces the system to think before speaking, evaluate its own output, and iteratively correct mistakes. You get reasoning behavior without waiting for a new model release.
Their system deploys five specialized agents working in a loop:
| Agent | Role |
|---|---|
| Generator | Produces candidate responses |
| Evaluator | Scores each constraint 0-10 |
| Translator | Converts scores to actionable feedback |
| Planner | Decides which constraint to fix and how |
| Editor | Modifies the constraint text |
The workflow continues until compliance hits a target, patience runs out, or the maximum iterations are reached.
What makes this different from "just prompt better"? The system uses quantitative feedback. Instead of vague "try again" signals, each constraint gets a numeric score. The planner can see exactly which requirements are failing and by how much. This precision enables intelligent, targeted edits rather than random rewrites.
The compliance problem
How bad is instruction following in practice? In the paper's experiments with Llama 3.1 8B and Mixtral-8x7B, these models only follow complex instructions about 82% of the time. That sounds acceptable until you consider what 18% failure means at scale: nearly 1 in 5 outputs needs manual review or correction.
Consider a prompt for generating hotel reviews:
Write exactly 10 hotel reviews. Each review should be between 50-100 words. Include at least one review mentioning room service. Do not mention any specific hotel chains by name.
A typical LLM might return 8 reviews, one with 150 words, none mentioning room service, and two naming Marriott. The content is useful, readable, and completely non-compliant.
Training objectives optimize for next-token prediction, not constraint satisfaction. The model learns to produce plausible text, not to count bullets or measure word lengths. Constraints require a different kind of attention than content generation.
The problem compounds in production systems:
| Use Case | Constraint Type | Failure Mode |
|---|---|---|
| Legal documents | Exact section headings | Missing required sections |
| API responses | JSON structure | Invalid syntax, extra fields |
| Marketing copy | Word limits | Overlong content, truncation |
| Report generation | Citation format | Inconsistent references |
| Chatbot responses | Tone requirements | Professional/casual mismatch |
Manual prompt iteration can fix individual cases, but it doesn't scale. Each new edge case requires human attention. The authors wanted a system that could automatically diagnose and repair compliance failures.
Core innovation
The paper introduces three key ideas that work together:
1. Constraint decoupling
Traditional prompts mix task descriptions with constraints in a single block. The authors separate them:
Before (coupled):
Write a 200-word professional summary of the quarterly report, including exactly 3 key metrics, formatted as bullet points, and ending with a call to action.
After (decoupled):
Task: Write a professional summary of the quarterly report ending with a call to action.
Constraints:
- The summary must be exactly 200 words
- Include exactly 3 key metrics
- Format the metrics as bullet points
This separation alone improved compliance from 82% to 91.5% across both tested models. Why? The model can attend to constraints as a distinct checklist rather than parsing them from flowing prose.
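As a minimal sketch, a decoupled prompt can be assembled from a task string and a constraint list. The function and field names here are illustrative, not from the paper:

```python
def build_decoupled_prompt(task: str, constraints: list[str]) -> str:
    """Render the task and its constraints as two distinct sections,
    so the model can treat the constraints as a checklist."""
    lines = [f"Task: {task}", "", "Constraints:"]
    lines += [f"- {c}" for c in constraints]
    return "\n".join(lines)

prompt = build_decoupled_prompt(
    task="Write a professional summary of the quarterly report "
         "ending with a call to action.",
    constraints=[
        "The summary must be exactly 200 words",
        "Include exactly 3 key metrics",
        "Format the metrics as bullet points",
    ],
)
```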
Think of it like a chef following a recipe. A cramped sticky note that mixes ingredients, cooking times, and plating instructions together is hard to follow. But a recipe card with the method in one section and a separate checklist for timing and presentation? The chef can focus on the cooking while glancing at the checklist to hit every specification.
2. Quantitative scoring
Instead of binary pass/fail, each constraint gets a score from 0-10, normalized to [0,1]:
| Constraint | Score | Interpretation |
|---|---|---|
| Word count (200) | 0.8 | Close, slightly over |
| 3 key metrics | 0.3 | Only 1 metric found |
| Bullet formatting | 1.0 | Perfect compliance |
The planner sees these scores and can prioritize: fix the metric constraint first, it's the worst offender.
Example walkthrough. For the constraint "Exactly 3 bullet points," the evaluator returns a score of 0.2 (only 1 bullet found). The translator converts this to: "Bullet-point constraint failed: 1/3 bullets present." The planner chooses the Rephrase action and outputs: "Provide exactly three bullet points, each on its own line starting with a dash." After the next iteration, the evaluator raises the score to 0.9. The loop continues or terminates based on the remaining constraints.
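A toy sketch of this prioritization step: normalize the raw 0-10 scores to [0, 1] and surface the worst offender for the planner. The code is illustrative, not from the paper:

```python
def worst_constraint(scores_0_to_10: dict[str, float]) -> tuple[str, float]:
    """Normalize raw 0-10 scores and return the lowest-scoring constraint."""
    normalized = {name: score / 10.0 for name, score in scores_0_to_10.items()}
    name = min(normalized, key=normalized.get)
    return name, normalized[name]

name, score = worst_constraint({"word_count": 8, "three_metrics": 3, "bullets": 10})
print(name, score)  # three_metrics 0.3
```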
3. Strategic editing
The system doesn't randomly rewrite constraints. It chooses from four actions:
| Action | When to Use | Example |
|---|---|---|
| Rephrase | Constraint is unclear | "be concise" → "use 50-75 words" |
| Split | Constraint bundles multiple requirements | "3 short bullets" → "exactly 3 bullets" + "each under 20 words" |
| Merge | Related constraints confuse the model | Combine formatting rules |
| Reorder | Critical constraints get buried | Move failing constraint to top |
[Figure: Editing Action Distribution. Rephrasing dominates, especially in successful cases.]
In practice, rephrasing dominated (88% of actions). This suggests that clarity is the main issue. Models can follow precise instructions; they struggle with ambiguous ones.
Multi-agent architecture
The workflow orchestrates seven components using LangGraph for state management. The first two (Input Prompt and Constraints Extractor) run once at the start. The remaining five agents loop until compliance is achieved.
[Figure: Multi-Agent Instruction Refinement Workflow. A constraint-driven architecture for iterative content generation, evaluation, and refinement.]
Generator agent
Takes the current prompt version and produces three candidate responses. Using multiple outputs provides a better signal than a single sample.
Configuration:
- Temperature: 0.9
- Top-p (nucleus sampling): 0.95
- Outputs: 3 responses per iteration
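A minimal sketch of the generator step with these settings, assuming an OpenAI-compatible chat API; the model name and client setup are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint serving your generator model

def generate_candidates(prompt: str, n: int = 3) -> list[str]:
    """Sample multiple candidate responses so the evaluator gets a better signal."""
    resp = client.chat.completions.create(
        model="llama-3.1-8b-instruct",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        n=n,                # 3 responses per iteration
        temperature=0.9,    # generator temperature reported in the paper
        top_p=0.95,         # nucleus sampling as reported
    )
    return [choice.message.content for choice in resp.choices]
```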
Evaluator agent
Uses the LLM-as-judge pattern. For each response, it scores every constraint independently on a 0-10 scale.
LLM-as-judge means using a capable model to grade the output of another model (or itself). Think of it as having a senior developer review a junior's code. The judge doesn't generate the answer; it evaluates whether the answer meets requirements. This pattern lets you automate quality checks that would otherwise require human reviewers.
Validation results:
- Human annotator agreement: 96%
- LLM-judge vs human agreement: 79-81%
The per-constraint scoring is critical. Aggregate scores ("overall this response is 7/10") don't tell the planner which constraint needs work.
Translator agent
The bridge between numbers and actionable insight. The translator converts raw scores (8/10, 3/10) into qualitative feedback the planner can act on: "Constraint 2 is failing because the model included 5 bullet points instead of 3. Previous rephrase attempt did not help."
The translation includes:
- Which constraints improved or degraded since the last iteration
- The edit history (what was already tried)
- Concrete response excerpts showing where compliance failed
This context is why the planner knows what to fix, not just that something failed. Without the translator, the planner would be flying blind.
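As an illustration of what the translator produces, here is a sketch that turns per-constraint scores, their history, and prior edits into a short feedback string. All names are hypothetical and the state shape matches the schema shown later under Data structures:

```python
def translate(constraints: list[dict]) -> str:
    """Summarize per-constraint score movement and prior edits for the planner."""
    lines = []
    for c in constraints:
        history = c["history"]
        if len(history) >= 2:
            trend = "improved" if history[-1] > history[-2] else "degraded or flat"
        else:
            trend = "no prior iteration"
        lines.append(
            f"Constraint {c['id']} ('{c['text']}'): score {c['score']:.1f}, "
            f"{trend}; last action: {c.get('last_action') or 'none'}"
        )
    return "\n".join(lines)
```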
Planner agent
The strategic decision-maker. It receives the translator's summary and decides:
- Which constraint to modify
- Which editing action to apply
- What the new constraint text should look like
The planner generates three parallel strategies, evaluated simultaneously. Temperature 0.9 encourages diverse approaches.
Editor agent
Executes the planner's selected strategy. Takes the old constraint list and produces a new version with the specified modification. Greedy decoding (temperature 0) ensures deterministic execution.
How the agents collaborate
A single iteration flows like this:
1. Generator produces 3 responses from current prompt
2. Evaluator scores each constraint for each response
3. Translator summarizes: "Constraint 2 dropped from 0.7 to 0.4 after last edit"
4. Planner proposes: "Rephrase constraint 2 to be more specific"
5. Editor outputs: new constraint list
6. Loop back to Generator with updated prompt
Termination conditions:
- Perfect compliance achieved (all constraints score 1.0)
- Maximum iterations reached (N_max = 5)
- No improvement for patience threshold (P_max = 2 iterations)
The patience mechanism prevents infinite loops when a constraint is fundamentally unsatisfiable by the model.
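Putting the termination rules together, a bare-bones control loop might look like this. It is a sketch under the paper's reported parameters; the agent callables passed in via `agents` are placeholders for the five LLM calls:

```python
def refine(prompt, constraints, agents, n_max=5, p_max=2, target=1.0):
    """Run the generate -> evaluate -> translate -> plan -> edit loop
    with all three stop rules: full compliance, max iterations, patience."""
    best, patience = 0.0, 0
    for _ in range(n_max):
        responses = agents["generate"](prompt, constraints)
        scores = agents["evaluate"](responses, constraints)  # one [0, 1] score per constraint
        if min(scores.values()) >= target:
            break                                            # perfect compliance
        avg = sum(scores.values()) / len(scores)
        if avg > best:
            best, patience = avg, 0
        else:
            patience += 1
            if patience >= p_max:
                break                                        # no improvement for p_max iterations
        feedback = agents["translate"](scores, constraints)
        plan = agents["plan"](feedback)
        constraints = agents["edit"](constraints, plan)
    return constraints
```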
Experimental results
The authors extended InfoBench with a "Translated Constraints" column, creating 500 test samples with explicitly decomposed constraints.
InfoBench is a curated benchmark of prompts designed to test LLMs' ability to follow detailed instructions and constraints. It provides a standardized set of tasks for measuring compliance, making it the go-to dataset for evaluating instruction-following improvements.
[Figure: Compliance Improvement Across Stages. Progressive gains from constraint separation and workflow optimization.]
Baseline comparison
| Model | Without Constraints | With Constraints |
|---|---|---|
| Llama 3.1 8B | 82.1% | 91.5% |
| Mixtral-8x7B | 81.8% | 91.4% |
Just adding explicit constraints improved compliance by ~10 percentage points. The multi-agent workflow then builds on this baseline.
Workflow improvements
The multi-agent workflow adds another layer of gains on top of constraint separation.
| Model | Gain | Success Rate |
|---|---|---|
| Mixtral-8x7B | +13.1% | 41.0% |
| Llama 3.1 8B | +13.0% | 35.2% |
These numbers are for prompts that actually improved. The "Success Rate" shows what percentage of prompts saw any improvement at all.
Hard vs. easy prompts
The workflow shines on difficult cases:
| Prompt Difficulty | Baseline Compliance | Improvement Rate |
|---|---|---|
| Hard (~30% baseline) | 29.8% | 53.74% |
| Easy (~71% baseline) | 71.2% | 18.33% |
When prompts start with low compliance, there's more room for the system to help. Already-compliant prompts don't need optimization.
Iteration analysis
| Outcome | Average Iterations |
|---|---|
| Already compliant | 0 |
| No improvement | 2.00 |
| Compliance increased | 2.38 |
Successful improvements required slightly more iterations on average. This suggests that patience (continuing to refine) correlates with positive outcomes.
What works and what doesn't
Ablation: removing quantitative feedback
When the evaluator provided only qualitative feedback ("constraint 2 is failing") without scores, performance dropped.
| Model | With Scores | Without |
|---|---|---|
| Mixtral | 41.0% | 38.1% |
| Llama | 35.2% | 34.5% |
Quantitative scores provide meaningful signal, though the effect varies by model. The planner makes better decisions when it knows a constraint scored 0.3 versus 0.7.
Action effectiveness
The planner overwhelmingly chose rephrasing.
| Action | Avg. Uses (All Prompts) | Avg. Uses (Improved Prompts) |
|---|---|---|
| Rephrase | 0.88 | 1.81 |
| Split | 0.14 | 0.35 |
| Reorder | 0.12 | 0.18 |
| Merge | 0.00 | 0.00 |
Successful cases used rephrasing twice as often as average. Split operations also appeared more frequently when improvements happened, suggesting that decomposing complex constraints helps.
Merge was never selected. The initial constraint decomposition already separates concerns, leaving nothing to combine.
Implementation blueprint
The system requires no model fine-tuning. You can build this with off-the-shelf components.
Recommended tech stack
These are the tools the authors used to get their results.
| Component | Recommended |
|---|---|
| Orchestration | LangGraph |
| Generator | Llama 3 8B |
| Planner | Llama 3 70B |
| State | In-memory |
Alternatives: LangChain for orchestration, Mixtral for generator, GPT-4o for planner, Redis for distributed state.
LangGraph handles stateful graph execution, letting you define agent transitions as edges in a graph. Each node is an agent, each edge is a conditional transition based on scores or iteration counts. It's purpose-built for multi-agent loops.
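A minimal LangGraph sketch of that loop structure follows. The node bodies are placeholders and this is a scaffold under our own naming, not the authors' code:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class RefineState(TypedDict):
    prompt: str
    constraints: list
    scores: dict
    iteration: int
    patience: int

def _todo(name):
    def node(state: RefineState) -> dict:
        raise NotImplementedError(f"implement the {name} agent call here")
    return node

generator_node = _todo("generator")
evaluator_node = _todo("evaluator")
translator_node = _todo("translator")
planner_node = _todo("planner")
editor_node = _todo("editor")

def should_continue(state: RefineState) -> str:
    """Conditional edge: stop on full compliance, max iterations, or exhausted patience."""
    if state["scores"] and min(state["scores"].values()) >= 1.0:
        return "done"
    if state["iteration"] >= 5 or state["patience"] >= 2:
        return "done"
    return "refine"

graph = StateGraph(RefineState)
graph.add_node("generator", generator_node)
graph.add_node("evaluator", evaluator_node)
graph.add_node("translator", translator_node)
graph.add_node("planner", planner_node)
graph.add_node("editor", editor_node)

graph.set_entry_point("generator")
graph.add_edge("generator", "evaluator")
graph.add_conditional_edges("evaluator", should_continue,
                            {"refine": "translator", "done": END})
graph.add_edge("translator", "planner")
graph.add_edge("planner", "editor")
graph.add_edge("editor", "generator")  # loop back with the edited constraints

app = graph.compile()
```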
Core workflow
The system operates in a single loop with five steps per iteration.
1. Generate candidates. The generator LLM produces 3 responses from the current prompt. Multiple outputs give a statistical signal.
2. Score each constraint. The evaluator grades every constraint 0-10 for each response. Store scores in your state object.
3. Translate to feedback. Convert numeric scores to natural language. Include which constraints improved or degraded since the last iteration.
4. Plan the edit. The planner picks one constraint and one action (rephrase, split, merge, reorder). Generate 3 strategies in parallel.
5. Execute the edit. The editor rewrites the constraint list. Loop back to step 1.
Termination: Stop when all constraints score 1.0, when you hit 5 iterations, or when 2 iterations pass with no improvement.
Key parameters
These produced the benchmark numbers. Start here, tune later.
| Parameter | Value |
|---|---|
| Generator temp | 0.9 |
| Planner temp | 0.9 |
| Editor temp | 0.0 |
| Max iterations | 5 |
| Patience | 2 |
| Samples per iter | 3 |
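These knobs can live in a small config object so they stay in one place. A sketch: the field names are ours, the values come from the table above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WorkflowConfig:
    generator_temperature: float = 0.9
    planner_temperature: float = 0.9
    editor_temperature: float = 0.0   # greedy decoding for deterministic edits
    max_iterations: int = 5
    patience: int = 2
    samples_per_iteration: int = 3

config = WorkflowConfig()
```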
Data structures
Build your state object around this constraint schema:
```json
{
  "constraints": [
    {
      "id": 1,
      "text": "Exactly 200 words",
      "score": 0.85,
      "history": [0.6, 0.75, 0.85],
      "last_action": "rephrase"
    },
    {
      "id": 2,
      "text": "Include 3 bullet points",
      "score": 0.3,
      "history": [0.3],
      "last_action": null
    }
  ],
  "iteration": 2,
  "patience_counter": 0
}
```

The history array tracks score progression. The planner uses this to avoid repeating failed strategies on the same constraint.
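One way to use that history, sketched below, is to skip constraints whose last edit produced no gain so the planner doesn't keep hammering the same target. The function is illustrative, not from the paper:

```python
def pick_target(constraints: list[dict]) -> dict | None:
    """Return the lowest-scoring constraint whose last edit still helped (or was never edited)."""
    candidates = []
    for c in constraints:
        history = c["history"]
        stalled = len(history) >= 2 and history[-1] <= history[-2] and c["last_action"]
        if c["score"] < 1.0 and not stalled:
            candidates.append(c)
    return min(candidates, key=lambda c: c["score"]) if candidates else None
```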
Prompt templates
The evaluator needs structured output for parsing:
```
Score each constraint 0-10.
Constraints: {constraints}
Response: {response}
Format: "C1: [score] - [reason]"
```
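A sketch of parsing that output format, including the "8/10"-style scores mentioned under Pitfalls below; the regex and function name are ours:

```python
import re

LINE = re.compile(r"C(\d+):\s*(\d+(?:\.\d+)?)(?:\s*/\s*10)?\s*-\s*(.*)")

def parse_scores(raw: str) -> dict[int, float]:
    """Parse 'C1: 8 - reason' or 'C1: 8/10 - reason' lines into normalized [0, 1] scores."""
    scores = {}
    for line in raw.splitlines():
        match = LINE.match(line.strip())
        if match:
            cid, score, _reason = match.groups()
            scores[int(cid)] = min(float(score), 10.0) / 10.0
    return scores

print(parse_scores("C1: 8/10 - slightly over the word limit\nC2: 3 - only 1 metric"))
# {1: 0.8, 2: 0.3}
```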
The planner needs context to avoid loops:
```
Current: {constraints}
Failed: {failing_constraints}
Previous edits: {edit_history}
Pick ONE action: REPHRASE/SPLIT/REORDER
Output: Action, Target, New text
```
Pitfalls and gotchas
These issues won't show up in unit tests but will hurt production.
Subjective constraints fail. Constraints like "be funny" or "sound professional" are hard for the judge to score consistently. The evaluator may give 6/10 one run and 8/10 the next, causing the planner to chase its tail. Stick to objectively verifiable constraints (word counts, required sections, forbidden terms).
Score parsing failures. LLMs sometimes return "8/10" instead of "8". Normalize with regex before comparison.
Infinite rephrase loops. The planner may keep rephrasing the same constraint without improving it. Track edit history and penalize repeated targets.
Evaluator drift. The LLM-as-judge may score differently across sessions. Use fixed seeds or ensemble multiple evaluations.
Cost explosion. A 5-iteration run burns 55 LLM calls. Cache optimized constraints and reuse them across similar tasks.
Cost considerations
Each iteration involves multiple model calls. Here's the breakdown.
| Agent | Calls | Model Size |
|---|---|---|
| Generator | 3 | Small (8B) |
| Evaluator | 3 | Large (70B) |
| Translator | 1 | Large (70B) |
| Planner | 3 | Large (70B) |
| Editor | 1 | Large (70B) |
For a 5-iteration run, that's 55 calls total. The latency adds up because agents run sequentially.
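A quick back-of-the-envelope helper for the call budget; plug in your own per-call rates if you want dollar estimates:

```python
CALLS_PER_ITERATION = {"generator": 3, "evaluator": 3, "translator": 1, "planner": 3, "editor": 1}

def total_calls(iterations: int = 5) -> int:
    """Total LLM calls for a full run at the paper's per-iteration call counts."""
    return iterations * sum(CALLS_PER_ITERATION.values())

print(total_calls(5))  # 55 calls for a full 5-iteration run
```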
The Patience parameter is your cost control. Setting patience=2 means the workflow stops after 2 iterations with no improvement. This prevents wasting tokens on constraints that are fundamentally unsatisfiable. In the paper's experiments, most improvements happened within the first 2-3 iterations anyway. If you're not seeing gains by iteration 3, you probably won't see them at iteration 5.
Mitigation strategies:
- Run offline for prompt library curation (the primary use case)
- Cache optimized constraints and reuse across similar tasks
- Use smaller planner models (7B) if latency matters more than quality
- Batch multiple prompts through the same workflow
Limitations
The authors acknowledge several constraints:
Quality ceiling
The system is only as good as its initial constraint decomposition. If the original prompt's constraints are fundamentally unclear or contradictory, the workflow can't fix that.
Evaluation brittleness
LLM-as-judge has known biases:
- Verbose responses often score higher
- Self-generated content rates better than human content
- Edge cases get inconsistent scores
The 79-81% human agreement means roughly 20% of scores are potentially wrong, propagating errors through the refinement loop.
Fixed action space
The four actions (rephrase, split, merge, reorder) can't handle all failure modes:
- Can't add new constraints the user forgot
- Can't remove constraints that conflict
- Can't restructure the entire prompt architecture
Computational overhead
Sequential multi-agent calls create latency. The paper doesn't quantify this, but production deployments need to consider whether the compliance gains justify the additional inference cost.
Model dependence
Results were validated on Llama and Mixtral. Other model families may require different planner strategies or evaluation calibration.
Best for: Curating prompt libraries, high-stakes compliance requirements, offline optimization before deployment. Less suitable for: Real-time applications, simple prompts, budget-constrained environments.
Business implications
For executives evaluating this approach, here's the trade-off calculus:
The cost side. Running a 5-iteration optimization loop burns 55 LLM calls per prompt. At typical API rates, that's $0.10-0.50 per prompt optimization. The workflow adds latency (sequential agent calls) and infrastructure complexity (LangGraph, state management).
The benefit side. In regulated industries (finance, legal, healthcare), "90% accuracy" isn't acceptable. A loan document that omits a required disclosure, or a medical summary that exceeds the word limit for a form field, creates compliance risk and rework. Manual review of non-compliant outputs costs $10-50 per instance in analyst time.
The math. If your prompts currently fail 18% of the time and each failure costs $20 to fix manually, you're spending $3.60 per generation on rework. Optimizing the prompt once for $0.50 and reusing it across thousands of generations is a clear win.
Before vs. after (1,000 generations):
- Before optimization: 18% failure rate x $20 rework = $3,600 in analyst time
- After optimization: one-time $0.50 optimization + roughly 2% failure rate x $20 rework ≈ $400 total
That's roughly a 90% reduction in compliance costs for a single prompt template.
When it doesn't work. If your prompts already hit 95%+ compliance, or if your domain tolerates occasional format violations, the optimization overhead isn't justified. This is a tool for high-stakes, high-volume scenarios where precision matters.
Practical applications
The multi-agent prompt refinement workflow fits several production scenarios:
Enterprise document generation: Legal, financial, and compliance documents require precise formatting. Run the workflow once to optimize prompt templates, then reuse them across thousands of generations.
Chatbot quality assurance: Before deploying a new conversational prompt, run it through the workflow to catch compliance failures early. Cheaper than fixing production bugs.
RAG pipeline hardening: Retrieval-augmented systems often have strict output format requirements for downstream parsing. Optimized prompts reduce parsing errors.
A/B testing prompt variants: Use the workflow to generate multiple compliant prompt versions, then test which performs best with real users.
The key insight is treating prompt engineering as a systematic, measurable process rather than ad-hoc iteration. The workflow provides a reproducible methodology that can be integrated into CI/CD pipelines for AI systems.
Cite this paper
Alberto Purpura, Li Wang, Sahil Badyal, Eugenio Beaufrand, Adam Faulkner (2026). From Vague to Precise: How Multi-Agent LLM Teams Fix Your Prompts. arXiv 2026.