Multi-Agent Safety
The study of what happens when multiple AI systems interact with each other and with you.
Every other PRISM pillar studies what one AI does to one human. Pillar P watches the AI fail. Pillar I watches the human react. Pillar R watches the dynamics between them. Pillar S traces the invisible training decisions underneath. All of that assumes one AI, one human, one conversation.
That assumption no longer holds.
You use Claude for thinking, ChatGPT for drafting, Gemini for research, Perplexity for verification, and an AI coding assistant for building. You ask the same question to two different models and get two different answers. You paste one model’s output into another model’s context and watch the second model react to the first. You run a workflow where three AI systems hand off to each other and something emerges at the end that none of them would have produced alone.
You are living inside a multi-agent environment. So is everyone else who uses AI. And nobody is studying what happens in there.
Pillar M exists because the safety properties of individual AI systems do not predict the safety properties of AI systems working together. A model that is safe in isolation can produce unsafe outcomes when paired with another safe model. Behaviors emerge from interaction that no single model contains. The safety of the ecosystem is not the sum of the safety of its parts.
This is the most forward-looking pillar in the PRISM framework. It is also the one that will matter most as AI agents become more autonomous, more numerous, and more interconnected. The age of single-model safety evaluation is ending. The age of multi-agent safety has already begun.
You are already the unmonitored subject of a multi-agent experiment.
Here is what is happening right now that almost nobody has named: millions of people use multiple AI systems every day, and the interactions between those systems are producing effects that no evaluation, no benchmark, and no safety framework measures.
When you ask Claude a question and then ask ChatGPT the same question and get a different answer, something has happened to you. You now hold two conflicting pieces of AI-generated information. You have to decide which one to trust. You have to make that decision without any transparency into why the answers differ: whether it is a training difference, a data difference, a reasoning difference, or a framing difference. You are resolving a conflict between two systems that neither system knows about, using criteria that no one has taught you.
That is a multi-agent safety event. And it happens millions of times a day.
When a software team uses an AI coding assistant to write code, an AI testing tool to evaluate it, and an AI project manager to track progress, those three systems are operating in the same workflow without any shared understanding of what the other systems are doing. The coding assistant writes something. The testing tool flags it. The project manager logs the delay. And somewhere in that handoff, context is lost, priorities shift, and the human in the middle is managing an ecosystem of AI opinions that do not know each other exist.
At the research frontier, the implications are even more significant. Project Sid, a 2024 experiment by Altera, placed up to 1,000 autonomous AI agents in a shared Minecraft environment. The agents autonomously developed specialized roles, drafted constitutions, created voting systems, transmitted cultural practices, and even spread religion.
“None of those behaviors were programmed. They emerged from interaction.”
In a separate 2026 simulation study, AI agents in virtual environments produced outcomes ranging from stable democratic orders to complete social collapse with over 600 recorded deviant incidents within two weeks.
“Same starting conditions. Different emergent paths.”
The Multi-Agent Emergent Behavior Evaluation (MAEBE) framework, a 2025 research contribution, demonstrated that the moral reasoning of AI agent ensembles is not directly predictable from the behavior of individual agents. Ensembles exhibited peer pressure effects, convergence toward positions that no individual agent held, and shifts in moral preferences that only appeared in group contexts.
“Evaluating agents in isolation missed the very behaviors that mattered most.”
All of this points to a single conclusion: single-agent safety evaluation is necessary but not sufficient. The safety challenges of the next decade will be multi-agent challenges. And the only people who can observe those challenges in real time are the people living inside multi-agent workflows every day. That is you. And that is Pillar M.
Three layers: signals, ecosystems, and emergence.
We organize Multi-Agent Safety research into three layers based on observational complexity. The first layer is specific multi-agent events you can identify from your experience using multiple AI systems. The second is ecosystem patterns that emerge when multi-agent interactions are tracked across workflows and time. The third is the emergence landscape: systemic findings that only become visible when thousands of multi-agent observations are analyzed across models, workflows, and months.
Layer 1: The Signal
Multi-agent events you can observe and report from your experience using AI.
I Asked Two AIs the Same Thing and Got Very Different Results
You asked one AI a question. Then you asked a different AI the same question. And the answers were not just different in style or phrasing. They were different in substance. Different facts. Different conclusions. Different recommendations. Maybe even directly contradictory.
This is cross-model divergence: the same input produces meaningfully different output across different AI models. It happens because every model was trained on different data, tuned with different methods, reinforced with different human feedback, and optimized for different objectives. But you, the user, experience it as confusion. Which answer is right? Is either answer right? How do you decide?
The problem is not that two models disagree. The problem is that neither model tells you it might disagree with other models. Each one presents its answer with the same confidence, the same authority, the same polish. You have no way to know, from the output alone, whether the answer you are reading is the consensus of the field or the idiosyncratic product of one model’s training.
This matters at scale because most people do not cross-check AI outputs. They ask one model, get one answer, and act on it. The people who DO cross-check, who ask the same question to Claude and GPT and Gemini, are generating some of the most valuable data in AI safety research. They are documenting where the models agree, where they diverge, and what the divergence reveals about each model’s training and reasoning.
Same input. Different outputs. Neither one tells you the other exists.
One AI Contradicted What Another AI Told Me
This goes beyond divergence. This is conflict. One AI told you X was true. Another AI told you X was false. Not just a different perspective. A direct contradiction.
When a human expert contradicts another human expert, you can evaluate their credentials, ask for their reasoning, check their sources, and weigh the evidence. When one AI contradicts another, you have none of that. You have two confident assertions pointing in opposite directions, generated by systems whose reasoning is opaque and whose training data is proprietary.
Inter-model conflict is particularly dangerous when the topic has real stakes. If one AI tells you a medication interaction is safe and another tells you it is dangerous, the conflict is not academic. If one AI tells you a legal filing deadline is Friday and another says it is Monday, someone is going to be wrong, and the human in the middle has to figure out which.
This was documented through operational research when the same factual question produced conflicting information from two different AI platforms. The conflict was not a matter of interpretation. One system was wrong. But both presented their answers with identical confidence.
I Use Multiple AI Tools Together and Noticed Patterns
You do not just use one AI. You use several. And over time, you have noticed things. Maybe one AI is better at analysis but worse at writing. Maybe you always start with one model and switch to another when the first one starts drifting. Maybe you have developed a personal workflow where each AI handles a different stage and you manage the handoffs.
This is multi-model workflow observation: the ecological data that comes from living inside a multi-agent environment every day. It is not about catching a single error or a single contradiction. It is about noticing the patterns that emerge when AI systems are used together over time.
This observation is different from all the others on this page because it is not about something that went wrong. It is about what happens when AI tools become an ecosystem. How do you manage that ecosystem? What patterns have you developed? What breaks when one model updates? What workflows have you built that depend on the specific strengths and weaknesses of each model?
Nobody has this data at scale. Enterprise teams are building multi-model workflows, but they are not publishing their behavioral observations. Individual users are developing personal AI ecosystems, but they have no place to report what they have noticed. The patterns exist in millions of workflows. They just have not been collected.
This observation is not about errors. It is about ecosystems.
An AI Agent Interacted With Another AI and Something Unexpected Happened
This is the frontier observation. An AI did not interact with you. It interacted with another AI. And what came out of that interaction was something neither system would have produced alone.
As AI systems become more autonomous, agent-to-agent interactions are increasing. AI coding assistants generate code that AI testing tools evaluate. AI research tools produce summaries that AI writing tools incorporate. AI scheduling agents negotiate with other AI scheduling agents. In each of these handoffs, information transfers, context shifts, and behaviors emerge that neither agent was designed for.
The most revealing cases are the ones where something unexpected happens. An AI agent is given a file organization task and deletes a database. An AI research agent pulls information from another AI’s output and amplifies an error. Two AI agents in a shared environment develop a coordination pattern that nobody programmed.
These emergent behaviors are, by definition, unpredictable from studying either agent in isolation. They arise from interaction. And they will increase in frequency and significance as agentic AI deployment grows across every industry.
Audacion AI Labs has direct operational experience with multi-agent environments. A 54-agent AI workforce we operate and observe runs on governance frameworks designed for agent-to-agent dynamics: the Inter-Agent Environmental Leakage Protocol, the Peer Coordination Stall Protocol, and the Byzantine Fault Tolerance Memory Protocol. These frameworks were built because the problem is real: agents interacting in shared infrastructure produce behaviors that individual agent evaluation never predicted.
Two AIs Each Left Out Completely Different Information
You asked two AIs the same question. Both gave you answers. Neither answer was wrong. But when you compared them side by side, you noticed something: each model had left out completely different information. Model A covered points 1, 2, and 3 but omitted 4 and 5. Model B covered 3, 4, and 5 but omitted 1 and 2. Neither was incorrect. Both were incomplete. And each one was incomplete in a different direction.
This is invisible filtering detection: the moment when cross-model comparison reveals that each AI is curating the information it gives you without telling you what it is leaving out. Not misinformation. Selective information. Each model gives you a different slice of reality, and neither one tells you the slice exists.
This observation is classified as a micro-event because in a single instance it looks like normal variation. But aggregated across thousands of observations, it becomes one of the most significant findings in Pillar M. If the same question consistently produces different omissions across different models, what you are seeing is not random variation. You are seeing the fingerprints of different training philosophies, different content policies, and different value alignments shaping what information reaches you.
A person using only one AI model never sees this. They get one slice and assume it is the whole picture. Only someone using multiple models, or a research program aggregating observations across users and models, can detect invisible filtering at scale.
Neither is wrong. Both are incomplete. Each one is incomplete in a DIFFERENT direction.
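The Model A and Model B example above can be written down as a simple set comparison. This is a minimal sketch, not a description of how Audacion AI Labs processes observations: the function name `compare_coverage` and the idea of numbering the points each answer covered are assumptions made for illustration.

```python
from __future__ import annotations


def compare_coverage(covered_a: set[int], covered_b: set[int]) -> dict[str, set[int]]:
    """Compare which points two answers covered (hypothetical helper)."""
    return {
        "shared": covered_a & covered_b,    # points both answers mention
        "only_a": covered_a - covered_b,    # points Model B left out
        "only_b": covered_b - covered_a,    # points Model A left out
        "combined": covered_a | covered_b,  # everything either answer mentions
    }


# The example from the text: Model A covers 1, 2, 3 and Model B covers 3, 4, 5.
model_a = {1, 2, 3}
model_b = {3, 4, 5}
report = compare_coverage(model_a, model_b)

print(report["only_a"])  # points you would miss if you only asked Model B
print(report["only_b"])  # points you would miss if you only asked Model A
```

Neither set contains a wrong point. The finding is in `only_a` and `only_b`: each answer is missing something the other one has, and a person who asked only one model would see neither gap.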
[Your own words: something not listed above]
If you have observed something in a multi-agent interaction that does not match any of the behaviors above, describe it in your own words. When citizen observations cluster around patterns we have not documented, we formalize new categories. The citizen who first reports the pattern is credited as the discoverer.
Layer 2: The Ecosystem
Patterns that emerge when multi-agent interactions are tracked across workflows and time.
Cross-Model Trust Calibration
How do you allocate trust across models? The shape tells the story.
When you use multiple AI systems, you develop trust patterns. You learn which model to trust for which task. You learn which model’s confidence you can rely on and which model sounds confident about things it does not know. You learn which model admits uncertainty and which one fills gaps with fabrication.
These trust calibration patterns are invisible to every individual AI company. OpenAI does not know how much you trust Claude. Anthropic does not know how much you trust Gemini. But your trust allocation across models, which model you go to first, which one you use to verify the first, which one you avoid for certain tasks, is a behavioral dataset that maps the real-world trust landscape of deployed AI.
At population scale, trust calibration data answers questions that no single-model evaluation can ask. Do users trust certain models more for medical questions? Do they systematically verify one model’s output with another? Does cross-model verification actually improve accuracy, or does it just give people confidence in whichever answer they preferred?
Divergence Amplification
Cross-model divergence (OBS-M01) is a Layer 1 event. Layer 2 asks what happens to that divergence over a workflow. When a person gets conflicting answers from two models, do they resolve the conflict or carry both forward? When conflicting information enters a multi-step workflow, does it amplify, dampen, or compound?
Initial observations from operational research suggest that unresolved divergence does not stay contained. When a user does not resolve a conflict between two AI outputs and instead proceeds with one (often whichever came first or whichever matched their expectation), the unresolved conflict can propagate through downstream decisions. The first error is the divergence. The second is the silent resolution. The third is the downstream consequences of the resolution nobody evaluated.
Ecosystem Fragility and Model Updates
Every AI model updates. New versions ship. Capabilities change. Behaviors shift. For a person using a single model, an update means one adjustment. For a person using a multi-model ecosystem, an update to any model can destabilize the entire workflow. The handoff patterns they developed break. The trust calibrations they built no longer hold. The workflow they optimized for three models at their current capabilities fails when one model changes.
Nobody tracks this. No AI company considers how their update affects users who depend on interactions between their model and other models. The ecosystem fragility is invisible from the inside of any one company.
Layer 3: The Emergence Landscape
Systemic findings that only become visible when thousands of multi-agent observations are analyzed across models, workflows, and months.
How we collect Multi-Agent Safety data.
You asked two AIs the same thing and got different answers. You record a quick emotional signal, which models were involved, and a one-sentence description of the divergence. This is OBS-M01 and OBS-M02 data.
You use multiple AI tools in a workflow. At the end of a session, you describe the ecosystem: which models, what roles each played, what patterns you noticed, what worked, what broke. This is OBS-M03 data.
You observed an AI-to-AI interaction or a complex multi-agent dynamic and want to document it in detail. What happened, what was unexpected, what emerged. This is OBS-M04 and OBS-M06 data. These are the highest-value observations in the pillar.
Citizens are already running multi-agent experiments.
Every person who uses Claude AND ChatGPT AND Gemini is conducting a daily comparative study of deployed AI systems. They are not doing it on purpose. They are doing it because that is how people use AI. Our methodology turns that natural behavior into research data.
Multi-agent observations flag automatically.
When a citizen’s session metadata indicates multiple models in use, the observation is flagged for multi-agent classification. The citizen does not need to know they are contributing to Pillar M. The system detects the comparative element.
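One way to picture this flagging step is a check on how many distinct models appear in a session. The sketch below is an assumption-heavy illustration: the `Observation` class, its `models_in_session` field, and the `is_multi_agent` function are hypothetical names, not the actual schema or detection logic.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Observation:
    # Hypothetical fields; the real observation schema is not described here.
    description: str
    models_in_session: list[str] = field(default_factory=list)


def is_multi_agent(obs: Observation) -> bool:
    """Flag an observation when its session involved more than one distinct model."""
    # Normalize names so "Claude" and "claude " count as the same model.
    distinct = {name.strip().lower() for name in obs.models_in_session if name.strip()}
    return len(distinct) > 1


compared = Observation("Two different answers to one question", ["Claude", "ChatGPT"])
single = Observation("One model, one answer", ["Claude", "claude "])

print(is_multi_agent(compared))
print(is_multi_agent(single))
```

The normalization matters in a sketch like this because the same model typed two different ways should not be counted as a comparison between two systems.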
Micro-events aggregate into macro-findings.
OBS-M06 (invisible filtering detection) is a single observation in isolation. A thousand OBS-M06 observations across models, topics, and regions become a map of how different AI systems curate different realities. The methodology is designed for this aggregation: individual observations are small. The findings are large.
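The aggregation idea can be sketched as a tally of omissions per model. Everything in this example is invented for illustration: the `reports` records, the model labels, and the omitted topics are made-up placeholders, and `omission_map` is a hypothetical helper rather than the methodology's actual code.

```python
from collections import Counter

# Made-up example records: (model label, topic that model's answer left out).
reports = [
    ("model_a", "side effects"),
    ("model_b", "cost"),
    ("model_a", "side effects"),
    ("model_b", "cost"),
    ("model_a", "cost"),
]


def omission_map(records):
    """Count how often each model omitted each topic across many observations."""
    counts = {}
    for model, omitted_topic in records:
        counts.setdefault(model, Counter())[omitted_topic] += 1
    return counts


result = omission_map(reports)
for model, tally in result.items():
    # most_common(1) gives the topic this model omitted most often.
    print(model, tally.most_common(1))
```

A single record says little. The tally is where a consistent pattern, if there is one, would start to show.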
Cross-model comparison is the core analytical tool.
Where other pillars analyze patterns within one model’s behavior, Pillar M analyzes patterns ACROSS models. Every observation that includes the same input to multiple models is a natural experiment. The divergence between outputs is itself a research finding.
What we have found so far.
Preliminary. Based on founder operational research and published external research. Will be validated, refined, or revised as citizen data flows.
Cross-model divergence is real, substantial, and invisible to single-model users.
Documented through operational research using Claude, Perplexity, and Replit in daily multi-model workflows beginning February 2026. The same question produces meaningfully different answers across models. The divergence is not random: it correlates with training differences, safety boundary differences, and reasoning depth differences.
Inter-model conflict produces epistemic confusion that no individual model addresses.
When two models produce contradictory information, neither model warns the user that other models disagree. The human is left to resolve the conflict without context. Documented in DI-2026-009 (cross-platform incident).
Multi-model workflow patterns are real and undocumented.
Users develop sophisticated multi-model ecosystems with role assignments, trust hierarchies, and handoff patterns. These workflows represent ecological data about how AI systems function together in the wild. No published research systematically documents these patterns.
Agent-to-agent emergence has been observed and is increasing.
The Audacion AI Labs operational multi-agent environment has documented emergent behaviors between agents operating in shared infrastructure, including unintentional information transfer across shared surfaces (17 documented), coordination stall patterns (6 detection patterns documented), and agent coordination dynamics that required purpose-built governance frameworks.
Ensemble moral reasoning diverges from individual agent behavior.
The MAEBE framework (2025) demonstrated that AI agent ensembles exhibit peer pressure effects and moral preference shifts that appear only in group contexts and are not predictable from individual agent evaluation. This externally validates the foundational premise of Pillar M: multi-agent safety is not the sum of single-agent safety.
AI agent societies develop emergent governance autonomously.
Project Sid (Altera, 2024) demonstrated that up to 1,000 AI agents in a shared environment autonomously developed specialized roles, governance structures, constitutions, and cultural practices. A separate 2026 simulation study showed these dynamics can diverge catastrophically: similar starting conditions produced outcomes ranging from democratic stability to complete social collapse.
What comes next.
Pillar M has a unique requirement: you need to be using more than one AI system. But in 2026, that describes almost everyone.
We show our work because we expect others to build on it.
All five behavioral observations documented on this page are original to Audacion AI Labs.
Cross-model divergence, inter-model conflict, multi-model workflow ecology, agent-to-agent emergence, and invisible filtering detection were all identified through direct operational observation by Dee Williams, Founder of Audacion AI Labs, beginning February 2026. They emerged from intensive multi-model operational work and from building and operating a 54-agent AI workforce with purpose-built governance frameworks for multi-agent dynamics.
External research has validated the foundational importance of multi-agent safety as a field. Project Sid (Altera, 2024) demonstrated emergent governance in AI agent societies. The MAEBE framework (2025) confirmed that ensemble behavior diverges from individual behavior. The Open Challenges in Multi-Agent Security paper (2026) cataloged cascading vulnerabilities in multi-agent systems.
But the specific behavioral categories documented here, the citizen-observable events that real people experience when they use multiple AI systems, exist in no other framework. We built them because we saw them. We formalized them because nobody else had.
External research has validated the importance of the field. The specific behavioral categories came from direct operational observation. We show our work because we expect others to build on it.