Partial Info, Full Drama: When Agents Just Don’t “Get the Message”
AsymPuzl shows why effective signaling and shared understanding are critical to trustworthy, scalable multi-agent AI systems.
At the center of today’s rush toward agentic AI lies a deceptively simple question: Can multiple AI agents reliably work together when each sees only part of the picture? On paper, collaboration seems like a solved problem—LLMs can chat, debate, and hand tasks back and forth. But in practice, most “multi-agent” demos operate in open-ended chats with generous context, abundant hints, and no real penalties for misunderstanding one another. That’s a world away from how collaboration works in real enterprises, where different agents or systems hold partial, sometimes conflicting information and must coordinate through structured interactions.
This is the gap the Dartmouth research addresses. The authors set out to answer a fundamental, often unexamined question: When two LLM agents each have incomplete information about a shared task, can they communicate effectively enough to reconstruct the truth and act on it? And further: How does the design of their communication environment—especially the feedback they receive—shape their ability to succeed or fail?
This problem matters because most real-world workflows already resemble the scenario the research studies. Underwriting agents, claims agents, routing agents, compliance agents, analytics agents—each holds different slices of data, different constraints, and different objectives. Whether they can synthesize their perspectives into a shared, correct understanding is the difference between automation that compounds value and automation that compounds errors.
To study this dynamic with scientific precision, the researchers built a minimal but powerful environment: AsymPuzl, a synthetic puzzle designed to isolate the pure mechanics of collaboration under information asymmetry. Each agent sees a different part of the puzzle—one sees positions and shapes, the other sees shapes and colors. Neither can solve the puzzle alone. To succeed, they must talk, infer, and iteratively update their internal hypothesis of the puzzle’s true configuration.
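To make that asymmetric split concrete, here is a minimal Python sketch of how such views could be represented. The `Item` fields and the exact attribute split follow the description above, but the data structures themselves are illustrative assumptions, not the paper's actual encoding.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    position: int   # where the piece sits in the puzzle
    shape: str      # e.g., "circle" or "square"
    color: str      # e.g., "red" or "blue"


def make_views(puzzle: list[Item]) -> tuple[list[dict], list[dict]]:
    """Split the ground-truth puzzle into two partial, complementary views."""
    # Agent A observes positions and shapes, but never colors.
    view_a = [{"position": it.position, "shape": it.shape} for it in puzzle]
    # Agent B observes shapes and colors, but never positions.
    view_b = [{"shape": it.shape, "color": it.color} for it in puzzle]
    return view_a, view_b


puzzle = [Item(0, "circle", "red"), Item(1, "square", "blue")]
view_a, view_b = make_views(puzzle)
# Neither view alone pins down the full (position, shape, color) assignment:
# A must learn colors from B's messages, and B must learn positions from A's.
```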
Crucially, the environment doesn’t just test whether agents can exchange messages; it tests whether they can use those messages productively. Every turn, an agent receives a structured prompt containing their private information, their current working hypothesis, the other agent’s last message, and—depending on the experiment—some form of feedback. Their response must include both a message to the partner and a set of structured “actions” to revise their hypothesis.
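A rough sketch of that per-turn exchange, again in illustrative Python: the field names (`private_view`, `hypothesis`, `partner_message`, `feedback`) and the action format are assumptions for readability, not the paper's exact schema.

```python
from __future__ import annotations

import copy
from dataclasses import dataclass, field


@dataclass
class TurnPrompt:
    private_view: list[dict]      # the agent's own partial observation
    hypothesis: dict              # its current reconstruction of the puzzle
    partner_message: str          # what the other agent said last turn
    feedback: str | None = None   # environment feedback, if the condition provides any


@dataclass
class TurnResponse:
    message: str                                        # free-text message to the partner
    actions: list[dict] = field(default_factory=list)   # structured edits to the hypothesis


def apply_actions(hypothesis: dict, actions: list[dict]) -> dict:
    """Apply edits such as {"position": 0, "shape": "circle", "color": "red"}."""
    updated = copy.deepcopy(hypothesis)
    for act in actions:
        slot = updated.setdefault(act["position"], {})
        slot.update({k: v for k, v in act.items() if k != "position"})
    return updated
```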
The researchers systematically vary the forms of feedback: no feedback at all, feedback only on one’s own correctness, detailed feedback on which parts are wrong, joint feedback on whether the two agents together have solved the puzzle, and combinations of these. They also vary puzzle size to measure how coordination scales with complexity.
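Expressed as code, those feedback regimes might look something like the enum below. The labels and rendered messages are paraphrases of the conditions just described, not the authors' actual prompts.

```python
from __future__ import annotations

from enum import Enum, auto


class Feedback(Enum):
    NONE = auto()             # no signal at all
    SELF_CORRECT = auto()     # whether the agent's own hypothesis is right or wrong
    DETAILED = auto()         # which parts of the agent's hypothesis are wrong
    JOINT = auto()            # whether the two agents together have solved the puzzle
    SELF_PLUS_JOINT = auto()  # a combination of self and joint signals


def render_feedback(kind: Feedback, own_errors: list[int], jointly_solved: bool) -> str:
    """Turn the environment's bookkeeping into the text an agent would see."""
    if kind is Feedback.NONE:
        return ""
    if kind is Feedback.SELF_CORRECT:
        return "Your hypothesis has errors." if own_errors else "Your hypothesis is correct."
    if kind is Feedback.DETAILED:
        return f"Incorrect positions: {own_errors}" if own_errors else "Your hypothesis is correct."
    if kind is Feedback.JOINT:
        return "Puzzle solved." if jointly_solved else "Puzzle not yet solved."
    # Combined regime: self-correctness plus the joint signal.
    return (render_feedback(Feedback.SELF_CORRECT, own_errors, jointly_solved)
            + " " + render_feedback(Feedback.JOINT, own_errors, jointly_solved))
```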
Taken together, these design choices isolate a core capability: whether LLM agents can establish a reliable communication protocol and converge on a correct shared understanding—without seeing the full world and without being told exactly how to cooperate.
Once the research framework was in place, the authors put a range of leading language models through a battery of controlled tests. The goal was not simply to determine whether a single model could solve a puzzle, but whether pairs of models could achieve coordinated problem-solving when each held only part of the necessary information. This subtle shift in evaluation—from intelligence to interaction—revealed behavioral patterns that are often invisible in traditional benchmarks.
The experiments varied three main factors: the model pair, the type of feedback the environment provided, and the complexity of the puzzle. By mixing and matching these conditions, the researchers could observe not only whether agents succeeded but how they behaved along the way.
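Conceptually, the experimental design is a grid over those three factors. The sketch below shows one way such a grid could be enumerated; the model names, feedback labels, puzzle sizes, and seed count are placeholders, not the configurations actually run in the paper.

```python
from itertools import product

# Placeholder factor levels; the actual model pairs, regimes, and sizes differ.
model_pairs = [("model-a", "model-a"), ("model-a", "model-b"), ("model-b", "model-b")]
feedback_regimes = ["none", "self", "detailed", "joint", "self+joint"]
puzzle_sizes = [3, 5, 8]   # number of puzzle items
seeds = range(5)           # independent repetitions per configuration

experiments = [
    {"pair": pair, "feedback": fb, "size": size, "seed": seed}
    for pair, fb, size, seed in product(model_pairs, feedback_regimes, puzzle_sizes, seeds)
]
# Each configuration is run as an independent dialogue episode, so success rates
# can be compared across feedback regimes and complexity levels.
```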
One discovery stood out immediately: the difficulty of the task had almost nothing to do with the puzzle itself and everything to do with the communication between agents. When a single agent was given full information, even relatively modest models solved the task reliably. But when two agents had to coordinate under partial information, the same models that seemed competent in isolation began to struggle in diverse and surprising ways.
Some pairs adopted an almost surgical approach to coordination—exchanging just enough information to form accurate internal representations before updating their hypotheses efficiently. Others were indecisive, adjusting their beliefs tentatively without ever converging. Still others overcorrected, repeatedly undoing and redoing changes in a kind of collaborative thrashing. And a few barely cooperated at all, talking past each other despite having the tools to communicate effectively.
Feedback design proved to be one of the most influential factors shaping these behaviors. Certain forms of feedback, particularly simple signals about whether an agent’s own understanding was correct, often made coordination smoother. More elaborate or highly detailed feedback could paradoxically make things worse, overwhelming the agents and disrupting their ability to form stable communication patterns. This finding underscored a counterintuitive principle: more information is not always more helpful in multi-agent settings.
To determine whether a pair of agents had succeeded, the evaluation focused on whether each agent converged on a complete and accurate reconstruction of the puzzle within a fixed number of turns. The researchers also examined qualitative dimensions of the collaboration: Were agents making deliberate, minimal updates? Were they sending messages that meaningfully reduced uncertainty? Did their behavior remain stable as the puzzle grew more complex?
This richer style of evaluation painted a picture of multi-agent intelligence that goes beyond binary pass-fail metrics. It showed that success in collaborative AI is as much about forming effective communication protocols as it is about raw reasoning power, and that small differences in setup can profoundly reshape collective behavior.
The evaluation framework in this research goes beyond simply checking whether the final answer is correct. Instead, it examines the quality and stability of coordination. Success is defined not only by whether agents ultimately reconstruct the full puzzle, but also by whether they do so within a bounded number of turns and through coherent, intelligible updates. This makes the benchmark feel closer to real-world collaborative work, where speed, clarity, and the ability to avoid unnecessary churn are part of what “good” looks like.
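A minimal sketch of that success criterion, reusing the hypothesis format from the earlier sketches (and inheriting its assumptions): both agents must hold a complete, correct reconstruction before the turn budget runs out. The 10-turn budget is a placeholder, not the paper's actual limit.

```python
def is_solved(hypothesis: dict, ground_truth: dict) -> bool:
    """True if the hypothesis fills every slot and agrees with the ground truth."""
    return all(hypothesis.get(pos) == attrs for pos, attrs in ground_truth.items())


def episode_succeeds(hyp_a: dict, hyp_b: dict, ground_truth: dict,
                     turns_used: int, max_turns: int = 10) -> bool:
    """Joint success: both agents reconstruct the puzzle within the turn budget."""
    return (turns_used <= max_turns
            and is_solved(hyp_a, ground_truth)
            and is_solved(hyp_b, ground_truth))
```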
To operationalize this, the researchers assessed several dimensions. They looked at how decisively and consistently each agent refined its internal representation of the puzzle. Steady, purposeful adjustments suggested that the pair had established a useful communication rhythm. By contrast, repeated reversals, aimless edits, or long stretches without meaningful action signaled a lack of shared understanding. Message content was another indicator: successful agents tended to communicate in ways that directly reduced uncertainty, while unsuccessful pairs often produced verbose but uninformative exchanges. These behavioral markers gave evaluators a way to differentiate between superficial progress and genuine cooperative reasoning.
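Two of those behavioral markers, reversals and idle turns, are easy to approximate from an agent's action log. The sketch below shows one illustrative way to count them; these metrics are in the spirit of the analysis described here, not the paper's exact measures.

```python
from __future__ import annotations


def count_reversals(action_log: list[list[dict]]) -> int:
    """Count edits that overwrite a value the agent itself set on an earlier turn."""
    seen: dict[tuple, object] = {}   # (position, attribute) -> last value written
    reversals = 0
    for turn_actions in action_log:
        for act in turn_actions:
            for attr, value in act.items():
                if attr == "position":
                    continue
                key = (act["position"], attr)
                if key in seen and seen[key] != value:
                    reversals += 1   # the agent changed its mind about this slot
                seen[key] = value
    return reversals


def count_idle_turns(action_log: list[list[dict]]) -> int:
    """Turns in which the agent sent a message but made no hypothesis update."""
    return sum(1 for turn_actions in action_log if not turn_actions)
```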
Against this backdrop, the authors also acknowledged that their environment—useful as it is—comes with inherent constraints. The puzzle world is clean, structured, and free of noise. Each agent reliably receives the same type of input on every turn, and the environment keeps meticulous track of state on their behalf. In real deployment environments, data may be incomplete, contradictory, or delayed. Agents may have asynchronous access to information, or they may need to maintain their own memories rather than rely on the environment to restate the essentials each round. These gaps highlight areas where future iterations of the benchmark could push coordination tests closer to production realities.
The road ahead includes several promising extensions. Introducing ambiguity or noise would make the environment more reflective of messy enterprise data. Testing systems with more than two agents would expose coordination dynamics that only emerge in larger collectives. And exploring alternative forms of environmental feedback—especially ones that more closely mirror the lagging indicators common in real-world performance data—could help reveal how fragile or resilient multi-agent strategies truly are.
Yet even within its boundaries, the impact of the work is significant. It offers a controlled scaffold for probing a problem that is otherwise hard to isolate: how AI agents collaborate when none of them hold the full picture. This is the underlying challenge behind nearly every agentic workflow—from financial risk reviews to supply chain orchestration to multi-step decision support. By showing how different models behave under the same constraints, and how subtle shifts in feedback govern their cooperation, the research provides a vocabulary and a testing ground for designing better agent systems. It pushes the conversation away from “How smart is the model?” and toward a more relevant question: “How well can multiple models work together when real-world information is fragmented?”
Further Reading
- Cadet, X., Koh, E., & Chin, P. (2025, December 3). AsymPuzl: an asymmetric puzzle for multi-agent cooperation. arXiv.org. https://arxiv.org/abs/2512.03466
- Mallari, M. (2025, December 5). Signal strength: getting agents on the same wavelength. AI-First Product Management by Michael Mallari. https://michaelmallari.bitbucket.io/case-study/signal-strength-getting-agents-on-the-same-wavelength/