- The Problem. Building multi-agent systems today means manually defining workflows: which agent handles what, when to replan, how to route failures. This is like writing a giant if-else tree that cannot anticipate every edge case.
- The Solution. CORAL replaces predefined workflows with an "information flow orchestrator" that monitors task progress and coordinates agents through natural language. Agents communicate via a simple toolkit (send_message, wait_for_mention) instead of hardcoded routing rules.
- The Results. When worker agents use weaker models (GPT-4.1 Mini), CORAL achieves 63.6% accuracy vs 55.2% for workflow-based OWL (+8.5 pp). The orchestrator catches edge cases that workflow systems miss, like partial data retrieval or semantic mismatches.
Research Overview
If you have built a multi-agent system, you know the black box problem. Your web agent fails to retrieve data, but you do not know why. Was the page unavailable? Did it retrieve partial data? Did it misinterpret the query entirely? The workflow marks the subtask as "failed" and triggers replanning, but you still have no insight into what actually happened. You are debugging a state machine that cannot explain itself.
A multi-agent system (MAS) uses multiple AI agents, each with specialized capabilities, to solve complex tasks. One agent might browse the web, another processes documents, a third writes code. The challenge is coordinating them: who does what, in what order, and what happens when something goes wrong.
Most multi-agent frameworks treat coordination as a workflow engineering problem. Systems like OWL, MetaGPT, and AutoAgent require developers to predefine task states and routing rules. The result is a fragile state machine: you enumerate every anticipated failure mode, write routing logic for each, and hope nothing falls through the cracks. When something does, the system either fails silently or triggers expensive replanning that re-executes work already completed.
CORAL takes a different approach. Instead of predefined workflows, it uses an "information flow orchestrator" that monitors task progress and coordinates agents dynamically through natural language. Think of it as replacing the state machine with an auditable checklist. The orchestrator does not follow a decision tree. It observes what agents produce, evaluates whether results meet the actual requirements, and explains its coordination decisions in natural language. When something goes wrong, you can read the orchestrator's reasoning to understand why.
Key results
| Configuration | CORAL | OWL (Baseline) | Difference |
|---|---|---|---|
| All agents: Grok 4.1 Fast | 64.2% | 64.2% | 0 pp |
| Main: Grok 4.1 Fast, Workers: GPT-4.1 Mini | 63.6% | 55.2% | +8.5 pp |
The performance gap appears when worker agents use weaker models. Weaker models produce more partial results and edge cases. Workflow-based systems cannot handle these gracefully because their predefined states do not account for every failure mode. The information flow orchestrator detects and corrects these issues in real time.
The Workflow Problem
Consider a concrete example from the GAIA benchmark. The task: "Find all U.S. Survivor winners through August 2023, including their names and birth dates."
A web agent retrieves all 46 winner names successfully. But for several winners, birth dates are not available on the pages it accessed. The agent returns results with some birth dates marked as "unknown."
GAIA is a benchmark for general-purpose AI assistants. It includes 165 tasks requiring web search, document processing, code execution, and multi-step reasoning. Each task has a single correct answer that can be objectively verified. Difficulty ranges from Level 1 (simple) to Level 3 (complex multi-step reasoning).
In a workflow-based system like OWL, the subtask is not explicitly marked as "failed" because names were retrieved. The workflow proceeds to the next step, which now operates on incomplete data. Downstream agents compute statistics or filter results based on birth dates, producing wrong answers.
In CORAL, the information flow orchestrator reviews the web agent's output. It notices that several entries lack birth dates, which violates the implicit success criteria of the original query. The orchestrator explicitly identifies this mismatch and instructs the web agent to search specifically for the missing birth dates before proceeding.
[Figure: Workflow-Based vs Information-Flow Paradigm, comparing traditional rule-based coordination with dynamic A2A orchestration]
Before vs After: A Real Edge Case
Here is how a standard workflow agent and CORAL handle the same partial data scenario:
Before (workflow agent):
- Task: "Get birth dates for all Survivor winners"
- Web agent retrieves 46 names, 40 birth dates
- Subtask status: SUCCESS (names retrieved)
- Downstream agent calculates average age
- Result: Wrong answer (missing 6 birth dates)
- Debug insight: None (workflow logs show "success")
After (CORAL):
- Task: "Get birth dates for all Survivor winners"
- Web agent retrieves 46 names, 40 birth dates
- Orchestrator: "6 winners lack birth dates. This violates completeness for the original query."
- Orchestrator: "Web agent, search Wikipedia for Richard Hatch, Tina Wesson... birth dates."
- Web agent retrieves remaining 6 birth dates
- Result: Correct answer (all 46 birth dates)
The orchestrator catches the partial completion because it continuously evaluates results against the original requirements, not just the subtask definition.
Why workflows struggle
| Limitation | Consequence |
|---|---|
| Predefined states | Cannot represent partial completion or semantic mismatches |
| Manual routing rules | Engineers cannot anticipate every edge case |
| Binary success/failure | No mechanism for "mostly correct but needs refinement" |
| Static decomposition | Cannot adapt task breakdown based on intermediate results |
The fundamental issue is that real-world tasks have a continuous state space, but workflows discretize this into a finite set of predefined states. Complex tasks inevitably encounter states that fall between the cracks.
Architecture
CORAL organizes agents around a central information flow orchestrator that coordinates all communication.
[Figure: Information-Flow-Orchestrated Multi-Agent Paradigm, showing dynamic coordination through Agent-to-Agent (A2A) communication]
Traditional coordinator agents follow predefined routing rules: "if subtask fails, trigger replanning." The information flow orchestrator has no predefined rules. It continuously monitors task progress, evaluates whether results actually satisfy requirements, and decides coordination actions based on the current state of execution.
A useful analogy: the orchestrator is like a Product Manager who coordinates a team of developers. The PM does not write code themselves. Instead, they review deliverables, check if they meet requirements, provide clarifying feedback, and occasionally reassign tasks when someone is stuck. The developers (agents) do the actual work; the PM (orchestrator) ensures the pieces fit together correctly.
Components
Information Flow Orchestrator. The central agent that receives tasks, monitors progress, and coordinates other agents. It maintains a complete history of all inter-agent communication and decides when to dispatch tasks, request clarification, or submit final answers.
Specialized Agents. Domain-specific agents handle particular capabilities:
- Planner: Decomposes complex tasks into subtasks
- Web Agent: Searches and retrieves information from the web
- Document Agent: Processes files, PDFs, and structured data
- Reasoning & Coding Agent: Performs calculations and writes code
A2A Communication Toolkit. A minimal API that enables natural language communication between agents. All coordination happens through this toolkit rather than hardcoded function calls.
Communication flow
- Task arrives at the information flow orchestrator
- Orchestrator evaluates task complexity and dispatches to appropriate agent(s)
- Agent executes and returns results via A2A toolkit
- Orchestrator evaluates results against task requirements
- If requirements met: proceed to next step or submit answer
- If not met: refine instructions, substitute agents, or escalate to planner
The orchestrator maintains a message history that captures all interactions. This history provides context for evaluating whether accumulated results satisfy the original task.
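The sketch below shows one way this loop could be implemented on top of the A2A toolkit. The `llm` callable, the SUBMIT/REFINE/DISPATCH response format, and the helper names are illustrative assumptions, not the paper's implementation.

```python
def orchestrate(task, toolkit, llm, max_rounds=20):
    """Sketch of the orchestrator's coordination loop (hypothetical helpers)."""
    history = [f"TASK: {task}"]
    agent, instruction = "web_agent", task  # initial dispatch; a real orchestrator chooses via the LLM
    for _ in range(max_rounds):
        toolkit.send_message("orchestrator", agent, instruction)
        reply = toolkit.wait_for_mention("orchestrator")
        history.append(f"{agent}: {reply['content']}")
        # Ask the orchestrator LLM to evaluate progress and choose the next action
        decision = llm(
            "You are an information flow orchestrator.\n" + "\n".join(history) +
            "\nAnswer with SUBMIT: <answer>, REFINE: <new instruction>, "
            "or DISPATCH <agent>: <instruction>."
        )
        if decision.startswith("SUBMIT:"):
            return decision.removeprefix("SUBMIT:").strip()
        if decision.startswith("REFINE:"):
            instruction = decision.removeprefix("REFINE:").strip()
        elif decision.startswith("DISPATCH"):
            header, rest = decision.split(":", 1)
            agent, instruction = header.split()[1], rest.strip()
    return None  # round budget exhausted
```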
A2A Communication
The A2A (Agent-to-Agent) communication toolkit is intentionally minimal. It provides two operations:
```python
# Agent waits for a message from another agent
wait_for_mention(agent_id) -> message

# Agent sends a message to another agent
send_message(sender_id, recipient_id, content)
```
Predefined workflows use structured data for coordination: task IDs, status codes, typed parameters. Natural language is more flexible. The orchestrator can express nuanced instructions like "search again, but focus on Wikipedia pages" or "the birth dates you found are correct, but you missed three winners from seasons 12-15."
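On the agent side, the whole coordination surface reduces to these two calls. A minimal worker loop sketch, where the agent id, the shutdown convention, and `handle_task` are placeholders:

```python
def run_worker(toolkit, agent_id, handle_task):
    """Worker loop: wait for an instruction, act on it, report back."""
    while True:
        msg = toolkit.wait_for_mention(agent_id)
        if msg is None:            # timed out without an instruction
            continue
        if msg["content"] == "SHUTDOWN":
            break
        result = handle_task(msg["content"])   # e.g. run a web search
        toolkit.send_message(agent_id, "orchestrator", result)
```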
Asymmetric Communication: A Critical Design Choice
CORAL enforces a hub-and-spoke pattern: all agents communicate exclusively with the information flow orchestrator. Agents cannot message each other directly. This asymmetric constraint is not a limitation; it is a deliberate architectural decision that prevents coordination chaos.
Without this constraint, agents could enter infinite loops or reach contradictory conclusions through conversations the orchestrator never sees. The hub-and-spoke pattern ensures there is always one source of truth: the orchestrator's message history.
What an infinite loop looks like:
- Planner (to Web Agent): "I need the list of AI papers to create a summary."
- Web Agent (to Planner): "Please specify which papers you want me to search for."
- Planner (to Web Agent): "I asked you to provide the list of papers."
- Web Agent (to Planner): "I cannot search without knowing what to search for."
- (Loop continues indefinitely, consuming tokens until timeout)
With hub-and-spoke, this cannot happen. The orchestrator receives each message and can break the deadlock: "Web Agent, search arXiv for 'multi-agent LLM coordination' papers from 2024-2025."
Think of it this way: too many cooks in the kitchen leads to chaos. The orchestrator acts as the head chef who coordinates all communication. Agents report to the orchestrator, not to each other. This keeps the system predictable and debuggable.
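One way to enforce the constraint mechanically is a guard at the toolkit layer that rejects any message not involving the orchestrator. The subclass below builds on the A2AToolkit shown in the blueprint later in this article; it is our illustration, not something the paper specifies.

```python
class HubAndSpokeToolkit(A2AToolkit):
    def send_message(self, sender, recipient, content):
        # Every message must have the orchestrator as sender or recipient
        if "orchestrator" not in (sender, recipient):
            raise ValueError(
                f"direct agent-to-agent message blocked: {sender} -> {recipient}"
            )
        super().send_message(sender, recipient, content)
```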
Message structure
Messages are natural language with implicit structure:
Inquiry: "Based on the search results, do we have complete birth date information for all 46 Survivor winners?"
Instruction: "Search specifically for the birth dates of Richard Hatch, Tina Wesson, and Ethan Zohn. Focus on their Wikipedia pages."
Response: "Found birth dates: Richard Hatch (April 8, 1961), Tina Wesson (December 26, 1961), Ethan Zohn (November 12, 1973)."
The orchestrator interprets these messages using the same LLM capabilities that power the agents themselves.
Emergent Coordination Patterns
Through case-level analysis of GAIA benchmark executions, the authors identified four distinct coordination patterns that emerge from the orchestrator's behavior. These patterns are not programmed explicitly. They arise from the orchestrator's continuous monitoring and adaptive decision-making.
[Figure: Emergent Task Coordination Patterns, the four coordination strategies that emerge from the information flow orchestrator]
Pattern 1: Direct Agent Dispatch
For simple, non-decomposable tasks, the orchestrator skips planning entirely. It directly assigns the task to an appropriate agent.
Example: "What is the capital of France?" The orchestrator dispatches directly to the web agent without invoking the planner. This avoids unnecessary decomposition overhead.
Benefit: Reduces token consumption and avoids the failure modes that excessive decomposition can introduce.
Pattern 2: Planner-Mediated Decomposition
For complex tasks that benefit from explicit structure, the orchestrator consults the planner to decompose the task into subtasks.
Example: "Compare the GDP growth rates of G7 countries over the past decade and identify which country had the most consistent growth."
The orchestrator recognizes this requires multiple data retrieval steps followed by analysis. It asks the planner to create a structured decomposition.
Benefit: Provides explicit task structure when beneficial, while remaining compatible with traditional workflow patterns.
Pattern 3: Instruction Refinement
When an agent struggles with a task, the orchestrator does not immediately escalate to replanning. Instead, it refines the original instruction and lets the same agent retry.
Example dialogue:
Web Agent: "Search results for 'Survivor winners birth dates' returned a list of 46 names but birth dates for only 40 winners. The remaining 6 entries show 'Date of birth: Unknown'."
Orchestrator: "The current results are incomplete. 6 winners lack birth dates: Richard Hatch, Tina Wesson, Ethan Zohn, Vecepia Towery, Brian Heidik, and Jenna Morasca. Please search specifically for each of these names on their Wikipedia pages and return the exact birth date strings (e.g., 'April 8, 1961')."
Web Agent: "Found: Richard Hatch (April 8, 1961), Tina Wesson (December 26, 1961), Ethan Zohn (November 12, 1973), Vecepia Towery (March 21, 1965), Brian Heidik (March 5, 1962), Jenna Morasca (February 15, 1981)."
Benefit: Maintains context continuity and avoids the token overhead of full replanning.
Pattern 4: Agent Substitution
When one agent cannot complete a task, the orchestrator reassigns it to a different agent without full replanning.
Example: The web agent cannot access a paywalled article. The orchestrator substitutes: "Document agent, can you process this PDF version instead?"
Benefit: Explores alternative execution paths without restarting the entire task.
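The four patterns can be read as the orchestrator's action space. The sketch below makes that taxonomy explicit in code; the type and field names are ours, not the paper's.

```python
from dataclasses import dataclass
from enum import Enum, auto

class CoordinationAction(Enum):
    DIRECT_DISPATCH = auto()         # Pattern 1: hand the task straight to one agent
    PLANNER_DECOMPOSITION = auto()   # Pattern 2: ask the planner for subtasks
    INSTRUCTION_REFINEMENT = auto()  # Pattern 3: re-instruct the same agent
    AGENT_SUBSTITUTION = auto()      # Pattern 4: reassign to a different agent

@dataclass
class CoordinationDecision:
    action: CoordinationAction
    target_agent: str   # which agent receives the next message
    instruction: str    # the natural-language instruction to send
```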
Benchmark Results
CORAL was evaluated on the GAIA validation set (165 tasks) against OWL, a state-of-the-art workflow-based multi-agent system.
[Figure: GAIA Benchmark Pass@1 accuracy across difficulty levels with different model configurations]
Homogeneous configuration (all agents use Grok 4.1 Fast)
| Difficulty | CORAL | OWL | Difference |
|---|---|---|---|
| Level 1 (53 tasks) | 75.5% | 81.1% | -5.6 pp |
| Level 2 (86 tasks) | 61.6% | 58.1% | +3.5 pp |
| Level 3 (26 tasks) | 50.0% | 50.0% | 0 pp |
| Overall | 64.2% | 64.2% | 0 pp |
With strong models everywhere, both systems achieve identical overall accuracy. CORAL performs slightly worse on simple tasks (Level 1) where the overhead of dynamic coordination provides no benefit. It performs better on medium-complexity tasks (Level 2) where edge cases are more common.
Heterogeneous configuration (main agents: Grok 4.1 Fast, workers: GPT-4.1 Mini)
| Difficulty | CORAL | OWL | Difference |
|---|---|---|---|
| Level 1 (53 tasks) | 79.3% | 73.6% | +5.7 pp |
| Level 2 (86 tasks) | 60.5% | 51.2% | +9.3 pp |
| Level 3 (26 tasks) | 42.3% | 30.8% | +11.5 pp |
| Overall | 63.6% | 55.2% | +8.5 pp |
When worker agents use weaker models, the performance gap becomes substantial. Weaker models produce more partial results, errors, and edge cases. OWL's predefined workflows cannot handle these gracefully. CORAL's orchestrator detects and corrects issues in real time.
Token consumption
With homogeneous models, CORAL uses slightly more tokens due to A2A communication overhead. With heterogeneous models, CORAL uses fewer tokens on complex tasks (those consuming over 0.6M tokens). This is because workflow-based systems trigger full replanning on failures, which re-executes completed subtasks. The orchestrator often resolves issues through instruction refinement without re-execution.
Edge Case Handling
The authors identified three strategies that the information flow orchestrator uses to handle edge cases that workflow systems miss.
The orchestrator uses its own LLM reasoning as an implicit confidence check. It does not rely on explicit confidence scores from agents. Instead, it evaluates each response against the original task requirements, acting as a "quality assurance" layer. When results seem incomplete or semantically wrong, the orchestrator's reasoning triggers corrective action. This is why using a strong model for the orchestrator matters even when workers use weaker models.
Example of orchestrator reasoning:
When the web agent returns "The average age of Survivor winners is approximately 32 years," the orchestrator internally evaluates:
"The original task asked for 'the exact average age of all Survivor winners.' The response says 'approximately 32 years,' which indicates an estimate rather than a precise calculation. This does not satisfy the 'exact' requirement. I should ask the agent to compute the average using the actual birth dates rather than an approximation."
This reasoning happens automatically as part of the orchestrator's response generation. No explicit confidence scores or thresholds are needed.
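A sketch of this quality-assurance check as a single LLM call; the prompt wording, the SATISFIED convention, and the `llm` callable are assumptions rather than the paper's implementation.

```python
def evaluate_response(llm, original_task, agent_response):
    """Ask the orchestrator LLM whether a response satisfies the task."""
    verdict = llm(
        f"Original task: {original_task}\n"
        f"Agent response: {agent_response}\n"
        "Does the response fully satisfy the task requirements "
        "(completeness, accuracy, alignment)? "
        "Answer SATISFIED, or describe exactly what is missing."
    )
    return verdict.strip().upper().startswith("SATISFIED"), verdict
```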
Strategy 1: Dynamic Success Criteria
Scenario: Web agent retrieves Survivor winners but some birth dates are "unknown."
Workflow behavior: Subtask marked as success because names were retrieved. Downstream tasks proceed with incomplete data.
Orchestrator behavior: Detects that "unknown" entries violate the implicit requirement for complete birth date information. Explicitly refines success criteria and instructs the agent to search specifically for missing data.
Strategy 2: Semantic Assumption Auditing
Scenario: Task asks for albums released "before 1999." Agent returns albums from 1999.
Workflow behavior: Results passed through because the query structure was satisfied.
Orchestrator behavior: Audits the semantic assumption that "1999" satisfies "before 1999." Prunes invalid entries before they propagate to downstream tasks.
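Once the orchestrator has flagged the assumption, the pruning step itself is simple; for the albums example it amounts to something like the following (album titles are placeholders):

```python
# Prune entries that violate the literal requirement "released before 1999"
albums = [
    {"title": "Album A", "year": 1997},
    {"title": "Album B", "year": 1999},  # satisfies "1999" but not "before 1999"
]
valid = [a for a in albums if a["year"] < 1999]  # keeps only Album A
```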
Strategy 3: Instruction Alignment Monitoring
Scenario: Task asks for reading rates in "words per day." Agent uses page counts as a proxy because word counts are unavailable.
Workflow behavior: Subtask marked as complete because a calculation was performed.
Orchestrator behavior: Detects the mismatch between requested metric and proxy used. Escalates to planner with specific instruction to retrieve actual word counts from external sources.
Implementation Blueprint
Business impact
The switch from workflow-based to information-flow coordination has measurable ROI:
| Metric | Before (Workflow) | After (CORAL) | Impact |
|---|---|---|---|
| Workflow maintenance | ~120 engineer-hours/quarter | ~70 hours/quarter | 40% reduction |
| Worker model costs | ~$2,400/month (GPT-4o everywhere) | ~$240/month (GPT-4.1 Mini workers) | 10x cheaper |
| Silent failure rate | ~12% of complex tasks | ~3.5% of complex tasks | 8.5 pp improvement |
| Debug turnaround | 2-4 hours per incident | 10-15 minutes per incident | 10x faster |
These estimates are based on the GAIA benchmark results extrapolated to production workloads. Your mileage will vary based on task complexity and current failure rates.
Cost analysis: When to use CORAL
CORAL is not always the right choice. The orchestrator adds communication overhead that does not pay off for simple tasks.
| Task Complexity | Token Overhead | Recommendation |
|---|---|---|
| Simple Q&A (Level 1) | +15-20% | Use standard agents. CORAL overhead not justified. |
| Medium complexity (Level 2) | +5-10% (often breaks even) | Use CORAL. Edge case handling pays for overhead. |
| Complex multi-step (Level 3) | -10-15% (saves tokens) | Use CORAL. Avoids expensive full replanning. |
Rule of thumb: If your agents fail more than 20% of the time on edge cases, switch to CORAL. If they fail less than 5%, the overhead is not worth it.
Recommended tech stack
| Component | Recommended | Alternative |
|---|---|---|
| Orchestrator LLM | Grok 4.1 Fast, GPT-4o | Claude 3.5 Sonnet |
| Worker LLMs | GPT-4.1 Mini | Llama 3.1 8B |
| A2A Framework | CORAL Protocol | Custom implementation |
| Web Search | Tavily, SerpAPI | Browser automation |
| Code Execution | E2B, Modal | Local sandbox |
Core data structures
The A2A toolkit requires two simple operations. Here is a minimal implementation:
```python
import threading
import time

class A2AToolkit:
    def __init__(self):
        self.messages = []
        self.waiting = {}  # agent_id -> threading.Event

    def send_message(self, sender, recipient, content):
        """Send a message; wakes up the recipient if it is blocking."""
        msg = {
            "from": sender,
            "to": recipient,
            "content": content,
            "ts": time.time(),
        }
        self.messages.append(msg)
        # Wake up the recipient if it is blocked in wait_for_mention
        if recipient in self.waiting:
            self.waiting[recipient].set()

    def wait_for_mention(self, agent_id, timeout=300):
        """Block until this agent receives a message.

        This is the key coordination primitive. Agents call this
        to wait for instructions from the orchestrator. Without it,
        agents would need to poll for new messages.
        """
        event = threading.Event()
        self.waiting[agent_id] = event
        event.wait(timeout=timeout)
        self.waiting.pop(agent_id, None)
        return self._get_latest(agent_id)

    def _get_latest(self, agent_id):
        """Return the most recent message addressed to agent_id, if any."""
        for msg in reversed(self.messages):
            if msg["to"] == agent_id:
                return msg
        return None
```
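A quick single-process demo of the toolkit, using threads to stand in for separate agent processes; the agent behavior and message text are mocked.

```python
import threading
import time

toolkit = A2AToolkit()

def web_agent():
    # Worker blocks here until the orchestrator mentions it
    instruction = toolkit.wait_for_mention("web_agent")
    print("web_agent received:", instruction["content"])
    toolkit.send_message("web_agent", "orchestrator",
                         "46 winners found; 6 birth dates are missing.")

threading.Thread(target=web_agent).start()
time.sleep(0.1)  # give the worker time to register its wait

toolkit.send_message("orchestrator", "web_agent",
                     "List all U.S. Survivor winners with birth dates.")
time.sleep(0.1)  # give the worker time to reply
reply = toolkit._get_latest("orchestrator")
print("orchestrator received:", reply["content"])
```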
Orchestrator prompt structure
The information flow orchestrator requires a prompt that specifies:
- Monitoring responsibility: Continuously evaluate whether agent outputs satisfy task requirements
- Coordination authority: Dispatch tasks, refine instructions, substitute agents
- Submission criteria: When accumulated results are sufficient to answer the original query
```
You are an information flow orchestrator. Your role is to:

1. MONITOR: Evaluate each agent response against the
   original task requirements. Check for:
   - Completeness (all requested data present?)
   - Accuracy (semantic assumptions correct?)
   - Alignment (methodology matches requirements?)

2. COORDINATE: Based on your evaluation:
   - If complete: proceed to next step or submit
   - If partial: refine instructions for same agent
   - If stuck: substitute with different agent
   - If complex: consult planner for decomposition

3. SUBMIT: When you have sufficient information to
   answer the original query with confidence.

Available agents: planner, web_agent, doc_agent, code_agent
```
Key parameters
Temperature controls randomness in LLM outputs. At 0.0, the model is deterministic: the same prompt always produces the same response. For coordination, this is essential. If the orchestrator randomly decides to dispatch tasks differently each time, debugging becomes impossible and results are not reproducible. Save temperature creativity for content generation, not for routing decisions.
| Parameter | Value | Why It Matters |
|---|---|---|
| Temperature | 0.0 | Deterministic coordination. Same input always produces same routing decision. |
| Max replanning | 3 | Prevents infinite loops. After 3 attempts, task is marked as failed rather than retrying forever. |
| Execution timeout | 30 min | Bounds resource consumption. Complex GAIA tasks rarely exceed 20 minutes. |
| Message history | Full | Orchestrator needs complete context to evaluate partial results. Summarize only when hitting context limits. |
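These defaults could be collected into a single configuration object, sketched below with field names of our choosing:

```python
from dataclasses import dataclass

@dataclass
class OrchestratorConfig:
    temperature: float = 0.0         # deterministic routing decisions
    max_replanning: int = 3          # stop retrying after 3 full replans
    execution_timeout_s: int = 1800  # 30-minute bound per task
    keep_full_history: bool = True   # summarize only near the context limit
```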
Common pitfalls
Overly aggressive decomposition. The orchestrator may invoke the planner for simple tasks. Monitor task complexity and prefer direct dispatch for straightforward queries.
Infinite refinement loops. The orchestrator may keep refining instructions without progress. Implement a refinement counter and escalate to replanning after 2-3 attempts.
Context overflow. Full message history can exceed context limits on long tasks. Implement summarization for completed subtask threads.
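A minimal guard for the second pitfall, assuming the orchestrator tracks refinement attempts per subtask; the function and threshold names are ours.

```python
MAX_REFINEMENTS = 3

def next_action(refinement_count, requirements_met):
    """Escalate to replanning once refinement stops paying off."""
    if requirements_met:
        return "proceed"
    if refinement_count < MAX_REFINEMENTS:
        return "refine"   # re-instruct the same agent
    return "replan"       # hand the subtask back to the planner

# e.g. next_action(refinement_count=3, requirements_met=False) -> "replan"
```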
Resources
- GitHub: github.com/Coral-Protocol/Beyond-Rule-Based-Workflows
- CORAL Protocol: coral.dev
- GAIA Benchmark: huggingface.co/datasets/gaia-benchmark/GAIA
Limitations
Single orchestrator bottleneck. All coordination flows through one agent. For highly parallel tasks, this could become a throughput constraint.
No learning from experience. The orchestrator does not improve over time. Each task starts fresh without memory of successful coordination patterns from previous tasks.
Communication overhead. A2A messaging adds tokens compared to direct function calls in workflow systems. This overhead is only justified for tasks complex enough to benefit from dynamic coordination.
Evaluation scope. Results are on GAIA only. Performance on domain-specific benchmarks (code generation, scientific reasoning) may differ.
Model dependence. The orchestrator's effectiveness depends heavily on the underlying LLM's ability to evaluate task completion and generate appropriate coordination messages.
The Bottom Line
Should you switch to CORAL?
Use this decision framework:
| Your Situation | Recommendation |
|---|---|
| Edge case failures over 20% | Switch to CORAL. The orchestrator will catch partial completions and semantic mismatches that your workflows miss. |
| Edge case failures 5-20% | Evaluate complexity. If tasks are Level 2+ (multi-step, multi-agent), CORAL likely pays off. For simple tasks, stick with workflows. |
| Edge case failures under 5% | Keep current system. CORAL's overhead is not justified if your workflows already handle edge cases well. |
| Using expensive worker models | Switch to CORAL + cheaper workers. GPT-4.1 Mini workers with a strong orchestrator can match GPT-4o workflows at 10x lower cost. |
| Spending over 30% of time on workflow maintenance | Switch to CORAL. Eliminate the routing rule treadmill. |
Audience-specific takeaways
For ML engineers: Start with the reference implementation and adapt the orchestrator prompt to your domain. Focus on the monitoring prompt (what constitutes "complete" results for your use case).
For engineering managers: The 8.5 pp improvement matters most with heterogeneous models. Budget 1-2 weeks for integration. The real savings come from eliminating ongoing workflow maintenance, not just accuracy gains.
For researchers: The four emergent patterns (direct dispatch, planner-mediated decomposition, instruction refinement, agent substitution) provide a taxonomy for analyzing dynamic coordination. Future work could explore learning these patterns rather than letting them emerge implicitly.
Paper: arXiv:2601.09883. Authors: Xinxing Ren, Quagmire Zang, Caelum Forder, Suman Deb, Ahsen Tahir, et al. (Coral Protocol, multiple universities).
Cite this paper
Xinxing Ren, Quagmire Zang, Caelum Forder, Suman Deb, Ahsen Tahir, Zekun Guo (2026). CORAL: Dynamic Multi-Agent Coordination Without Predefined Workflows. arXiv 2026.