arXiv 2026 · January 15, 2026

CORAL: Dynamic Multi-Agent Coordination Without Predefined Workflows

Xinxing Ren et al.

CORAL proposes an information-flow-orchestrated multi-agent paradigm where a dedicated orchestrator monitors task progress and coordinates agents through natural language A2A communication, eliminating the need for predefined workflows. On the GAIA benchmark, this approach achieves 63.64% accuracy compared to 55.15% for the workflow-based OWL system when using heterogeneous models, while maintaining comparable token consumption.

Categories: Multi-Agent Systems · LLM Applications · Agent Architectures

Key Findings

  1. Eliminates manual workflow design: an "information flow orchestrator" dynamically coordinates agents using natural language instead of predefined routing rules
  2. Outperforms workflow-based systems: achieves 63.6% accuracy vs 55.2% for the OWL baseline when worker agents use weaker models (+8.5 percentage points)
  3. Matches strong baselines efficiently: with identical models across all agents, achieves equal accuracy (64.2%) with comparable token consumption
  4. Handles edge cases automatically: the orchestrator detects partial completions, semantic mismatches, and instruction drift that workflow systems miss
  5. Four emergent coordination patterns: direct dispatch, planner-mediated decomposition, instruction refinement, and agent substitution arise without explicit programming
  6. Open-source implementation: full codebase available with an A2A communication toolkit built on the CORAL protocol

TL;DR
  1. The Problem. Building multi-agent systems today means manually defining workflows: which agent handles what, when to replan, how to route failures. This is like writing a giant if-else tree that cannot anticipate every edge case.

  2. The Solution. CORAL replaces predefined workflows with an "information flow orchestrator" that monitors task progress and coordinates agents through natural language. Agents communicate via a simple toolkit (send_message, wait_for_mention) instead of hardcoded routing rules.

  3. The Results. When worker agents use weaker models (GPT-4.1 Mini), CORAL achieves 63.6% accuracy vs 55.2% for workflow-based OWL (+8.5 pp). The orchestrator catches edge cases that workflow systems miss, like partial data retrieval or semantic mismatches.

Research Overview

If you have built a multi-agent system, you know the black box problem. Your web agent fails to retrieve data, but you do not know why. Was the page unavailable? Did it retrieve partial data? Did it misinterpret the query entirely? The workflow marks the subtask as "failed" and triggers replanning, but you still have no insight into what actually happened. You are debugging a state machine that cannot explain itself.

What is a multi-agent system?

A multi-agent system (MAS) uses multiple AI agents, each with specialized capabilities, to solve complex tasks. One agent might browse the web, another processes documents, a third writes code. The challenge is coordinating them: who does what, in what order, and what happens when something goes wrong.

Most multi-agent frameworks treat coordination as a workflow engineering problem. Systems like OWL, MetaGPT, and AutoAgent require developers to predefine task states and routing rules. The result is a fragile state machine: you enumerate every anticipated failure mode, write routing logic for each, and hope nothing falls through the cracks. When something does, the system either fails silently or triggers expensive replanning that re-executes work already completed.

CORAL takes a different approach. Instead of predefined workflows, it uses an "information flow orchestrator" that monitors task progress and coordinates agents dynamically through natural language. Think of it as replacing the state machine with an auditable checklist. The orchestrator does not follow a decision tree. It observes what agents produce, evaluates whether results meet the actual requirements, and explains its coordination decisions in natural language. When something goes wrong, you can read the orchestrator's reasoning to understand why.

Key results

Configuration | CORAL | OWL (Baseline) | Difference
All agents: Grok 4.1 Fast | 64.2% | 64.2% | 0 pp
Main: Grok 4.1 Fast, Workers: GPT-4.1 Mini | 63.6% | 55.2% | +8.5 pp

The performance gap appears when worker agents use weaker models. Weaker models produce more partial results and edge cases. Workflow-based systems cannot handle these gracefully because their predefined states do not account for every failure mode. The information flow orchestrator detects and corrects these issues in real time.

The Workflow Problem

Consider a concrete example from the GAIA benchmark. The task: "Find all U.S. Survivor winners through August 2023, including their names and birth dates."

A web agent retrieves all 46 winner names successfully. But for several winners, birth dates are not available on the pages it accessed. The agent returns results with some birth dates marked as "unknown."

What is GAIA?

GAIA is a benchmark for general-purpose AI assistants. It includes 165 tasks requiring web search, document processing, code execution, and multi-step reasoning. Each task has a single correct answer that can be objectively verified. Difficulty ranges from Level 1 (simple) to Level 3 (complex multi-step reasoning).

In a workflow-based system like OWL, the subtask is not explicitly marked as "failed" because names were retrieved. The workflow proceeds to the next step, which now operates on incomplete data. Downstream agents compute statistics or filter results based on birth dates, producing wrong answers.

In CORAL, the information flow orchestrator reviews the web agent's output. It notices that several entries lack birth dates, which violates the implicit success criteria of the original query. The orchestrator explicitly identifies this mismatch and instructs the web agent to search specifically for the missing birth dates before proceeding.

Workflow-Based vs Information-Flow Paradigm

Comparing traditional rule-based coordination with dynamic A2A orchestration

Before vs After: A Real Edge Case

Here is how a standard workflow agent and CORAL handle the same partial data scenario:

Standard Workflow Agent
  1. Task: "Get birth dates for all Survivor winners"
  2. Web agent retrieves 46 names, 40 birth dates
  3. Subtask status: SUCCESS (names retrieved)
  4. Downstream agent calculates average age
  5. Result: Wrong answer (missing 6 birth dates)
  6. Debug insight: None (workflow logs show "success")

CORAL Orchestrator
  1. Task: "Get birth dates for all Survivor winners"
  2. Web agent retrieves 46 names, 40 birth dates
  3. Orchestrator: "6 winners lack birth dates. This violates completeness for the original query."
  4. Orchestrator: "Web agent, search Wikipedia for Richard Hatch, Tina Wesson... birth dates."
  5. Web agent retrieves remaining 6 birth dates
  6. Result: Correct answer (all 46 birth dates)

The orchestrator catches the partial completion because it continuously evaluates results against the original requirements, not just the subtask definition.

Why workflows struggle

Limitation | Consequence
Predefined states | Cannot represent partial completion or semantic mismatches
Manual routing rules | Engineers cannot anticipate every edge case
Binary success/failure | No mechanism for "mostly correct but needs refinement"
Static decomposition | Cannot adapt task breakdown based on intermediate results

The fundamental issue is that real-world tasks have a continuous state space, but workflows discretize this into a finite set of predefined states. Complex tasks inevitably encounter states that fall between the cracks.

Architecture

CORAL organizes agents around a central information flow orchestrator that coordinates all communication.

Information-Flow-Orchestrated Multi-Agent Paradigm

Dynamic coordination through Agent-to-Agent (A2A) communication

How is this different from a coordinator agent?

Traditional coordinator agents follow predefined routing rules: "if subtask fails, trigger replanning." The information flow orchestrator has no predefined rules. It continuously monitors task progress, evaluates whether results actually satisfy requirements, and decides coordination actions based on the current state of execution.

A useful analogy: the orchestrator is like a Product Manager who coordinates a team of developers. The PM does not write code themselves. Instead, they review deliverables, check if they meet requirements, provide clarifying feedback, and occasionally reassign tasks when someone is stuck. The developers (agents) do the actual work; the PM (orchestrator) ensures the pieces fit together correctly.

Components

Information Flow Orchestrator. The central agent that receives tasks, monitors progress, and coordinates other agents. It maintains a complete history of all inter-agent communication and decides when to dispatch tasks, request clarification, or submit final answers.

Specialized Agents. Domain-specific agents handle particular capabilities:

  • Planner: Decomposes complex tasks into subtasks
  • Web Agent: Searches and retrieves information from the web
  • Document Agent: Processes files, PDFs, and structured data
  • Reasoning & Coding Agent: Performs calculations and writes code

A2A Communication Toolkit. A minimal API that enables natural language communication between agents. All coordination happens through this toolkit rather than hardcoded function calls.

Communication flow

  1. Task arrives at the information flow orchestrator
  2. Orchestrator evaluates task complexity and dispatches to appropriate agent(s)
  3. Agent executes and returns results via A2A toolkit
  4. Orchestrator evaluates results against task requirements
  5. If requirements met: proceed to next step or submit answer
  6. If not met: refine instructions, substitute agents, or escalate to planner

The orchestrator maintains a message history that captures all interactions. This history provides context for evaluating whether accumulated results satisfy the original task.
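
Steps 2 through 6 repeat until the orchestrator decides to submit, so the whole flow can be read as a single loop. Below is a minimal sketch of that loop, assuming the A2A toolkit described in the next section and an llm_decide helper that turns the message history into the next coordination action; both names are illustrative, not the paper's API.

def orchestrate(task, toolkit, llm_decide):
    """Minimal coordination loop: dispatch, evaluate, refine, repeat."""
    history = [f"TASK: {task}"]
    while True:
        # The orchestrator LLM reads the full history and picks the next action,
        # e.g. {"action": "dispatch", "agent": "web_agent", "content": "..."}
        # or   {"action": "submit", "content": "..."}.
        decision = llm_decide(history)
        if decision["action"] == "submit":
            return decision["content"]  # final answer
        # Instruct one agent and wait for its reply via the A2A toolkit.
        toolkit.send_message("orchestrator", decision["agent"], decision["content"])
        reply = toolkit.wait_for_mention("orchestrator")
        content = reply["content"] if reply else "(no reply before timeout)"
        history.append(f"{decision['agent']}: {content}")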

A2A Communication

The A2A (Agent-to-Agent) communication toolkit is intentionally minimal. It provides two operations:

# Agent waits for a message from another agent
wait_for_mention(agent_id) -> message

# Agent sends a message to another agent
send_message(sender_id, recipient_id, content)
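
On the agent side, these two calls are all a worker needs: block until the orchestrator mentions it, do the work, report back. A minimal sketch, where do_work stands in for whatever tools the agent wraps (an assumed helper, not part of the toolkit):

def run_worker(agent_id, toolkit, do_work):
    """Worker loop: wait for an instruction, execute it, reply to the sender."""
    while True:
        msg = toolkit.wait_for_mention(agent_id)
        if msg is None:                    # timed out with nothing to do
            continue
        result = do_work(msg["content"])   # e.g. search, parse a PDF, run code
        toolkit.send_message(agent_id, msg["from"], result)
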
Why natural language?

Predefined workflows use structured data for coordination: task IDs, status codes, typed parameters. Natural language is more flexible. The orchestrator can express nuanced instructions like "search again, but focus on Wikipedia pages" or "the birth dates you found are correct, but you missed three winners from seasons 12-15."

Asymmetric Communication: A Critical Design Choice

CORAL enforces a hub-and-spoke pattern: all agents communicate exclusively with the information flow orchestrator. Agents cannot message each other directly. This asymmetric constraint is not a limitation; it is a deliberate architectural decision that prevents coordination chaos.

Why prevent agent-to-agent communication?

Without this constraint, agents could enter infinite loops or reach contradictory conclusions through conversations the orchestrator never sees. The hub-and-spoke pattern ensures there is always one source of truth: the orchestrator's message history.

What an infinite loop looks like:

  1. Planner (to Web Agent): "I need the list of AI papers to create a summary."
  2. Web Agent (to Planner): "Please specify which papers you want me to search for."
  3. Planner (to Web Agent): "I asked you to provide the list of papers."
  4. Web Agent (to Planner): "I cannot search without knowing what to search for."
  5. (Loop continues indefinitely, consuming tokens until timeout)

With hub-and-spoke, this cannot happen. The orchestrator receives each message and can break the deadlock: "Web Agent, search arXiv for 'multi-agent LLM coordination' papers from 2024-2025."

Think of it this way: too many cooks in the kitchen leads to chaos. The orchestrator acts as the head chef who coordinates all communication. Agents report to the orchestrator, not to each other. This keeps the system predictable and debuggable.
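
If you want the constraint enforced in code rather than by convention, a guard inside the messaging layer is enough. A hedged sketch (our addition, not the paper's implementation):

ORCHESTRATOR_ID = "orchestrator"

def send_message_hub_only(toolkit, sender, recipient, content):
    """Reject any message that does not involve the orchestrator."""
    if ORCHESTRATOR_ID not in (sender, recipient):
        raise ValueError(
            f"Blocked direct agent-to-agent message: {sender} -> {recipient}. "
            "All traffic must flow through the orchestrator."
        )
    toolkit.send_message(sender, recipient, content)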

Message structure

Messages are natural language with implicit structure:

Inquiry: "Based on the search results, do we have complete birth date information for all 46 Survivor winners?"

Instruction: "Search specifically for the birth dates of Richard Hatch, Tina Wesson, and Ethan Zohn. Focus on their Wikipedia pages."

Response: "Found birth dates: Richard Hatch (April 8, 1961), Tina Wesson (December 26, 1961), Ethan Zohn (November 12, 1973)."

The orchestrator interprets these messages using the same LLM capabilities that power the agents themselves.

Emergent Coordination Patterns

Through case-level analysis of GAIA benchmark executions, the authors identified four distinct coordination patterns that emerge from the orchestrator's behavior. These patterns are not programmed explicitly. They arise from the orchestrator's continuous monitoring and adaptive decision-making.

Emergent Task Coordination Patterns

Four coordination strategies that emerge from the information flow orchestrator

Pattern 1: Direct Agent Dispatch

For simple, non-decomposable tasks, the orchestrator skips planning entirely. It directly assigns the task to an appropriate agent.

Example: "What is the capital of France?" The orchestrator dispatches directly to the web agent without invoking the planner. This avoids unnecessary decomposition overhead.

Benefit: Reduces token consumption and avoids the failure modes that excessive decomposition can introduce.

Pattern 2: Planner-Mediated Decomposition

For complex tasks that benefit from explicit structure, the orchestrator consults the planner to decompose the task into subtasks.

Example: "Compare the GDP growth rates of G7 countries over the past decade and identify which country had the most consistent growth."

The orchestrator recognizes this requires multiple data retrieval steps followed by analysis. It asks the planner to create a structured decomposition.

Benefit: Provides explicit task structure when beneficial, while remaining compatible with traditional workflow patterns.

Pattern 3: Instruction Refinement

When an agent struggles with a task, the orchestrator does not immediately escalate to replanning. Instead, it refines the original instruction and lets the same agent retry.

Example dialogue:

Web Agent: "Search results for 'Survivor winners birth dates' returned a list of 46 names but birth dates for only 40 winners. The remaining 6 entries show 'Date of birth: Unknown'."

Orchestrator: "The current results are incomplete. 6 winners lack birth dates: Richard Hatch, Tina Wesson, Ethan Zohn, Vecepia Towery, Brian Heidik, and Jenna Morasca. Please search specifically for each of these names on their Wikipedia pages and return the exact birth date strings (e.g., 'April 8, 1961')."

Web Agent: "Found: Richard Hatch (April 8, 1961), Tina Wesson (December 26, 1961), Ethan Zohn (November 12, 1973), Vecepia Towery (March 21, 1965), Brian Heidik (March 5, 1962), Jenna Morasca (February 15, 1981)."

Benefit: Maintains context continuity and avoids the token overhead of full replanning.

Pattern 4: Agent Substitution

When one agent cannot complete a task, the orchestrator reassigns it to a different agent without full replanning.

Example: The web agent cannot access a paywalled article. The orchestrator substitutes: "Document agent, can you process this PDF version instead?"

Benefit: Explores alternative execution paths without restarting the entire task.

Benchmark Results

CORAL was evaluated on the GAIA validation set (165 tasks) against OWL, a state-of-the-art workflow-based multi-agent system.

GAIA Benchmark: Pass@1 Accuracy

Performance across difficulty levels with different model configurations

Homogeneous configuration (all agents use Grok 4.1 Fast)

Difficulty | CORAL | OWL | Difference
Level 1 (53 tasks) | 75.5% | 81.1% | -5.6 pp
Level 2 (86 tasks) | 61.6% | 58.1% | +3.5 pp
Level 3 (26 tasks) | 50.0% | 50.0% | 0 pp
Overall | 64.2% | 64.2% | 0 pp

With strong models everywhere, both systems achieve identical overall accuracy. CORAL performs slightly worse on simple tasks (Level 1) where the overhead of dynamic coordination provides no benefit. It performs better on medium-complexity tasks (Level 2) where edge cases are more common.

Heterogeneous configuration (main agents: Grok 4.1 Fast, workers: GPT-4.1 Mini)

Difficulty | CORAL | OWL | Difference
Level 1 (53 tasks) | 79.3% | 73.6% | +5.7 pp
Level 2 (86 tasks) | 60.5% | 51.2% | +9.3 pp
Level 3 (26 tasks) | 42.3% | 30.8% | +11.5 pp
Overall | 63.6% | 55.2% | +8.5 pp

When worker agents use weaker models, the performance gap becomes substantial. Weaker models produce more partial results, errors, and edge cases. OWL's predefined workflows cannot handle these gracefully. CORAL's orchestrator detects and corrects issues in real time.

Token consumption

With homogeneous models, CORAL uses slightly more tokens due to A2A communication overhead. With heterogeneous models, CORAL actually uses fewer tokens on complex tasks (those consuming over 0.6M tokens). This is because workflow-based systems trigger full replanning on failures, which re-executes completed subtasks. The orchestrator often resolves issues through instruction refinement without re-execution.

Edge Case Handling

The authors identified three strategies that the information flow orchestrator uses to handle edge cases that workflow systems miss.

How does the orchestrator detect failures?

The orchestrator uses its own LLM reasoning as an implicit confidence check. It does not rely on explicit confidence scores from agents. Instead, it evaluates each response against the original task requirements, acting as a "quality assurance" layer. When results seem incomplete or semantically wrong, the orchestrator's reasoning triggers corrective action. This is why using a strong model for the orchestrator matters even when workers use weaker models.

Example of orchestrator reasoning:

When the web agent returns "The average age of Survivor winners is approximately 32 years," the orchestrator internally evaluates:

"The original task asked for 'the exact average age of all Survivor winners.' The response says 'approximately 32 years,' which indicates an estimate rather than a precise calculation. This does not satisfy the 'exact' requirement. I should ask the agent to compute the average using the actual birth dates rather than an approximation."

This reasoning happens automatically as part of the orchestrator's response generation. No explicit confidence scores or thresholds are needed.
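
One way to make this behavior concrete is to phrase the evaluation as its own LLM call. The sketch below is a reading of the mechanism, not the paper's code; llm_call and the SATISFIED/NOT SATISFIED verdict format are assumptions.

def evaluate_response(original_task, agent_reply, llm_call):
    """Ask the orchestrator model whether a reply satisfies the original task."""
    prompt = (
        f"Original task: {original_task}\n"
        f"Agent response: {agent_reply}\n"
        "Does the response fully satisfy the task? Check completeness, "
        "semantic accuracy, and alignment with the requested methodology. "
        "Start your answer with SATISFIED or NOT SATISFIED, then explain."
    )
    verdict = llm_call(prompt)
    return verdict.strip().startswith("SATISFIED"), verdict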

Strategy 1: Dynamic Success Criteria

Scenario: Web agent retrieves Survivor winners but some birth dates are "unknown."

Workflow behavior: Subtask marked as success because names were retrieved. Downstream tasks proceed with incomplete data.

Orchestrator behavior: Detects that "unknown" entries violate the implicit requirement for complete birth date information. Explicitly refines success criteria and instructs the agent to search specifically for missing data.

Strategy 2: Semantic Assumption Auditing

Scenario: Task asks for albums released "before 1999." Agent returns albums from 1999.

Workflow behavior: Results passed through because the query structure was satisfied.

Orchestrator behavior: Audits the semantic assumption that "1999" satisfies "before 1999." Prunes invalid entries before they propagate to downstream tasks.

Strategy 3: Instruction Alignment Monitoring

Scenario: Task asks for reading rates in "words per day." Agent uses page counts as a proxy because word counts are unavailable.

Workflow behavior: Subtask marked as complete because a calculation was performed.

Orchestrator behavior: Detects the mismatch between requested metric and proxy used. Escalates to planner with specific instruction to retrieve actual word counts from external sources.

Implementation Blueprint

Business impact

The switch from workflow-based to information-flow coordination has measurable ROI:

Metric | Before (Workflow) | After (CORAL) | Impact
Workflow maintenance | ~120 engineer-hours/quarter | ~70 hours/quarter | 40% reduction
Worker model costs | ~$2,400/month (GPT-4o everywhere) | ~$240/month (GPT-4.1 Mini workers) | 10x cheaper
Silent failure rate | ~12% of complex tasks | ~3.5% of complex tasks | 8.5 pp improvement
Debug turnaround | 2-4 hours per incident | 10-15 minutes per incident | 10x faster

These estimates are based on the GAIA benchmark results extrapolated to production workloads. Your mileage will vary based on task complexity and current failure rates.

Cost analysis: When to use CORAL

CORAL is not always the right choice. The orchestrator adds communication overhead that does not pay off for simple tasks.

Task Complexity | Token Overhead | Recommendation
Simple Q&A (Level 1) | +15-20% | Use standard agents. CORAL overhead not justified.
Medium complexity (Level 2) | +5-10% (often breaks even) | Use CORAL. Edge case handling pays for overhead.
Complex multi-step (Level 3) | -10-15% (saves tokens) | Use CORAL. Avoids expensive full replanning.

Rule of thumb: If your agents fail more than 20% of the time on edge cases, switch to CORAL. If they fail less than 5%, the overhead is not worth it.

Component | Recommended | Alternative
Orchestrator LLM | Grok 4.1 Fast, GPT-4o | Claude 3.5 Sonnet
Worker LLMs | GPT-4.1 Mini | Llama 3.1 8B
A2A Framework | CORAL Protocol | Custom implementation
Web Search | Tavily, SerpAPI | Browser automation
Code Execution | E2B, Modal | Local sandbox

Core data structures

The A2A toolkit requires two simple operations. Here is a minimal implementation:

import threading
import time


class A2AToolkit:
    def __init__(self):
        self.messages = []
        self.waiting = {}       # agent_id -> threading.Event
        self.delivered = set()  # indices of messages already returned

    def send_message(self, sender, recipient, content):
        """Send a message; wakes up a waiting recipient."""
        msg = {
            "from": sender,
            "to": recipient,
            "content": content,
            "ts": time.time()
        }
        self.messages.append(msg)
        # Wake up the recipient if it is blocking in wait_for_mention
        if recipient in self.waiting:
            self.waiting[recipient].set()

    def wait_for_mention(self, agent_id, timeout=300):
        """Block until this agent receives a message.

        This is the key coordination primitive.
        Agents call this to wait for instructions
        from the orchestrator. Without it, agents
        would need to poll for new messages.
        """
        # Deliver immediately if a message is already queued
        pending = self._get_latest(agent_id)
        if pending is not None:
            return pending
        event = threading.Event()
        self.waiting[agent_id] = event
        event.wait(timeout=timeout)
        self.waiting.pop(agent_id, None)
        return self._get_latest(agent_id)

    def _get_latest(self, agent_id):
        """Return the newest undelivered message addressed to agent_id, or None."""
        for i in range(len(self.messages) - 1, -1, -1):
            if self.messages[i]["to"] == agent_id and i not in self.delivered:
                self.delivered.add(i)
                return self.messages[i]
        return None
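
To exercise the toolkit, run a worker in a background thread and message it from the main thread. A toy round-trip, assuming the class above:

import threading

toolkit = A2AToolkit()

def worker():
    # Block until the orchestrator mentions this agent, then reply.
    msg = toolkit.wait_for_mention("web_agent", timeout=10)
    if msg:
        toolkit.send_message("web_agent", "orchestrator",
                             f"ack: {msg['content']}")

threading.Thread(target=worker, daemon=True).start()
toolkit.send_message("orchestrator", "web_agent",
                     "Search for the list of Survivor winners.")
print(toolkit.wait_for_mention("orchestrator", timeout=10))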

Orchestrator prompt structure

The information flow orchestrator requires a prompt that specifies:

  1. Monitoring responsibility: Continuously evaluate whether agent outputs satisfy task requirements
  2. Coordination authority: Dispatch tasks, refine instructions, substitute agents
  3. Submission criteria: When accumulated results are sufficient to answer the original query
You are an information flow orchestrator. Your role is to:

1. MONITOR: Evaluate each agent response against the
   original task requirements. Check for:
   - Completeness (all requested data present?)
   - Accuracy (semantic assumptions correct?)
   - Alignment (methodology matches requirements?)

2. COORDINATE: Based on your evaluation:
   - If complete: proceed to next step or submit
   - If partial: refine instructions for same agent
   - If stuck: substitute with different agent
   - If complex: consult planner for decomposition

3. SUBMIT: When you have sufficient information to
   answer the original query with confidence.

Available agents: planner, web_agent, doc_agent, code_agent

Key parameters

Why Temperature 0.0?

Temperature controls randomness in LLM outputs. At 0.0, the model is deterministic: the same prompt always produces the same response. For coordination, this is essential. If the orchestrator randomly decides to dispatch tasks differently each time, debugging becomes impossible and results are not reproducible. Save higher temperatures for content generation, not for routing decisions.

Parameter | Value | Why It Matters
Temperature | 0.0 | Deterministic coordination. Same input always produces the same routing decision.
Max replanning | 3 | Prevents infinite loops. After 3 attempts, the task is marked as failed rather than retrying forever.
Execution timeout | 30 min | Bounds resource consumption. Complex GAIA tasks rarely exceed 20 minutes.
Message history | Full | The orchestrator needs complete context to evaluate partial results. Summarize only when hitting context limits.
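
These parameters translate directly into a small configuration object that the orchestrator loop can consume. A sketch; the field names are ours, not the paper's:

from dataclasses import dataclass

@dataclass
class OrchestratorConfig:
    temperature: float = 0.0          # deterministic routing decisions
    max_replanning: int = 3           # give up instead of retrying forever
    execution_timeout_s: int = 1800   # 30-minute bound per task
    keep_full_history: bool = True    # summarize only near the context limit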

Common pitfalls

Overly aggressive decomposition. The orchestrator may invoke the planner for simple tasks. Monitor task complexity and prefer direct dispatch for straightforward queries.

Infinite refinement loops. The orchestrator may keep refining instructions without progress. Implement a refinement counter and escalate to replanning after 2-3 attempts.

Context overflow. Full message history can exceed context limits on long tasks. Implement summarization for completed subtask threads.
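
For the refinement-loop pitfall, a counter around the retry path is usually enough. A sketch, assuming refine retries the same agent with a sharper instruction and replan hands the subtask back to the planner (both hypothetical helpers):

MAX_REFINEMENTS = 3  # escalate to full replanning after this many retries

def refine_or_escalate(subtask, attempts, refine, replan):
    """Retry with a refined instruction a few times, then fall back to replanning."""
    if attempts < MAX_REFINEMENTS:
        return refine(subtask)
    return replan(subtask)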


Limitations

Single orchestrator bottleneck. All coordination flows through one agent. For highly parallel tasks, this could become a throughput constraint.

No learning from experience. The orchestrator does not improve over time. Each task starts fresh without memory of successful coordination patterns from previous tasks.

Communication overhead. A2A messaging adds tokens compared to direct function calls in workflow systems. This overhead is only justified for tasks complex enough to benefit from dynamic coordination.

Evaluation scope. Results are on GAIA only. Performance on domain-specific benchmarks (code generation, scientific reasoning) may differ.

Model dependence. The orchestrator's effectiveness depends heavily on the underlying LLM's ability to evaluate task completion and generate appropriate coordination messages.

The Bottom Line

Should you switch to CORAL?

Use this decision framework:

Your Situation | Recommendation
Edge case failures over 20% | Switch to CORAL. The orchestrator will catch partial completions and semantic mismatches that your workflows miss.
Edge case failures 5-20% | Evaluate complexity. If tasks are Level 2+ (multi-step, multi-agent), CORAL likely pays off. For simple tasks, stick with workflows.
Edge case failures under 5% | Keep current system. CORAL's overhead is not justified if your workflows already handle edge cases well.
Using expensive worker models | Switch to CORAL + cheaper workers. GPT-4.1 Mini workers with a strong orchestrator can match GPT-4o workflows at 10x lower cost.
Spending over 30% of time on workflow maintenance | Switch to CORAL. Eliminate the routing rule treadmill.

Audience-specific takeaways

For ML engineers: Start with the reference implementation and adapt the orchestrator prompt to your domain. Focus on the monitoring prompt (what constitutes "complete" results for your use case).

For engineering managers: The 8.5 pp improvement matters most with heterogeneous models. Budget 1-2 weeks for integration. The real savings come from eliminating ongoing workflow maintenance, not just accuracy gains.

For researchers: The four emergent patterns (direct dispatch, planner-mediated decomposition, instruction refinement, agent substitution) provide a taxonomy for analyzing dynamic coordination. Future work could explore learning these patterns rather than letting them emerge implicitly.


Paper: arXiv:2601.09883. Authors: Xinxing Ren, Quagmire Zang, Caelum Forder, Suman Deb, Ahsen Tahir, et al. (Coral Protocol, multiple universities).

Authors

Xinxing Ren (Coral Protocol, Brunel University), Quagmire Zang (Universitéit Lëtzebuerg), Caelum Forder (Coral Protocol), Suman Deb (Coral Protocol), Ahsen Tahir (Coral Protocol, NUCES), Zekun Guo (University of Hull)

Cite this paper

Xinxing Ren, Quagmire Zang, Caelum Forder, Suman Deb, Ahsen Tahir, Zekun Guo (2026). CORAL: Dynamic Multi-Agent Coordination Without Predefined Workflows. arXiv 2026.
