How We Structured an AI Agent Team: Lessons from a Constitutional Standard Rollout
We run 11 AI agents inside our OpenClaw deployment. They each have names, roles, and distinct voices. For a while, that felt like enough.
It wasn't.
The problem wasn't getting agents to work. The problem was keeping them coherent over time. Without a forcing function, individual agents accumulated scope creep, stale configs, and split identities across different files. Some had rich personalities with no operational definition. Some had operational rules scattered across 4 different documents in 3 different directories. Two agents existed on disk but couldn't be spawned — they were ghosts with workspaces and no presence in the config.
This is the story of what we did about it.
The Problem: Identity Without Structure
When you start building a multi-agent system, it's tempting to focus on what each agent does. The routing table, the trigger conditions, the tools. That's the wrong starting point.
The correct question is: what makes each agent legible — to you, to other agents, and to itself?
Legibility breaks down in four ways without a standard:
Scope creep happens when an agent picks up adjacent tasks because nothing explicitly prohibits it. The agent sounds right, the task seems close enough, and suddenly your ops agent is drafting copy because the content agent was busy.
Inconsistent spawning happens when trigger conditions live in your head or in a root-level routing table that the agent doesn't reference. You know when to spawn Archie, the research agent; the system doesn't.
Unclear escalation happens when an agent isn't sure whether to handle something or surface it. Without a written escalation protocol, the default is assumption. Assumption leads to problems.
Weak auditability happens when you can't tell, from a single read of the agent's files, what they're allowed to do, what they're not, and who to call when something goes wrong.
We experienced all four. The solution was a constitutional standard: a defined template that every agent had to meet, enforced across 6 phases over 2 days.
The Template: Four Files, Four Audiences
The constitutional standard mandates four files per agent. Each has a different job.
agents/<name>/
├── SOUL.md — identity and values
├── ROLE.md — scope and operational contract
├── AGENTS.md — operating context (team rules translated locally)
└── TOOLS.md — tool permissions and usage patterns
SOUL.md is for consistency. It defines who the agent is — not just the function label, but the personality spine that keeps the agent stable across different tasks and sessions. If you could swap another agent's name into the SOUL.md and it would still read correctly, the file is too thin.
ROLE.md is for routing. It answers: should this agent be spawned for this task? Trigger conditions, standing responsibilities, a collaboration map, hard scope limits, and an escalation protocol. A reader should be able to make the spawn decision from this file alone — without reading anything else.
AGENTS.md is for localisation. The team has shared rules (no self-upgrade, OUTBOX discipline, spend transparency). The agent-scoped AGENTS.md translates those rules into local operating behaviour. "OUTBOX discipline" means something different in practice for a video pipeline agent than it does for a technical writer.
TOOLS.md is for preventing hidden tool knowledge. If a tool is important enough to rely on, it's important enough to document. Usage patterns, cost notes, prohibited uses — all explicit.
The filing rule is strict: constitutional source of truth lives in the agent's own directory. Not another workspace, not a charter file, not the root-level config. If the rule only exists somewhere else, it doesn't exist for operational purposes.
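The four-file rule lends itself to a mechanical sweep. A minimal sketch, assuming agent directories can be listed into name → file-set pairs (the agent names and contents below are illustrative, not our actual tree):

```python
# Compliance check for the four-file constitutional template.
REQUIRED = {"SOUL.md", "ROLE.md", "AGENTS.md", "TOOLS.md"}

def missing_files(agents: dict) -> dict:
    """Map each non-compliant agent to the constitutional files it lacks."""
    return {name: REQUIRED - files
            for name, files in agents.items()
            if REQUIRED - files}

# One compliant agent, one missing its operational contract and tool doc.
tree = {
    "archie": {"SOUL.md", "ROLE.md", "AGENTS.md", "TOOLS.md"},
    "quinn": {"SOUL.md", "AGENTS.md"},
}
print(missing_files(tree))  # only quinn appears, short two files
```

Running a check like this on every agent directory turns the filing rule from a convention into a gate.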
The Audit Findings: What We Actually Found
We reviewed 40 skill assignments across 11 agents. We found 8 cases of bloat and 14 gaps. Every number below is exact — pulled from the audit report, not estimated.
Skills Bloat Is Expensive
Every skill loaded at agent spawn costs tokens. Those tokens come from your context budget. When you're running a team of 11 agents, bloat compounds.
The orchestrator agent (Loki) was carrying 13 skills — including analyst-watchdog, insight-engine, and memory-health-probe — all of which belong to specialist agents. Loki orchestrates; Archie monitors; Kit or Oli run evaluations. After the audit: 13 down to 7.
Archie was the worst specialist case. He held 4 engineering skills he never used: agent-evaluation, llm-as-judge, llm-eval-router, and vector-store-shootout. All of them make sense for an evaluation or benchmarking workflow. None of them match Archie's role as a research agent. Evaluation belongs to Oli; benchmarking implementation belongs to Kit. Archie was holding someone else's tools.
The principle behind the fix: skills are per-role, not per-agent-preference. We cross-referenced every skill assignment against the agent's ROLE.md before confirming or removing it. If the role didn't mention the use case, the skill went.
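The cross-reference step can be roughed out in code. This is a sketch, not what we ran — the audit itself was a human read of each ROLE.md, and the substring match below is only a crude first pass that surfaces candidates for that read. The role text and skill names are hypothetical:

```python
def bloat_candidates(assigned: set, role_md: str) -> set:
    """Skills assigned to an agent whose ROLE.md never mentions them.
    Substring matching is deliberately crude; every hit still needs a
    human read of the role before removal."""
    text = role_md.lower()
    return {skill for skill in assigned if skill.lower() not in text}

# Hypothetical orchestrator: role text vs. assigned skill list.
role = "Primary skill: task-router. Routes tasks and arbitrates scope disputes."
skills = {"task-router", "insight-engine", "memory-health-probe"}
print(bloat_candidates(skills, role))  # the two specialist skills surface
```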
The Typefully Problem
The counterpart to bloat is the gap you don't notice until something breaks.
Typefully is our social scheduling tool. Sara and Liv both have it listed as a primary tool in their ROLE.md files. Sara's ROLE.md says explicitly: "Schedule content via Typefully for Loki/Nissan review." Liv's ROLE.md names Typefully as her "primary scheduling tool."
Neither agent had the typefully skill assigned.
This is the failure mode that's harder to catch than bloat. An agent with too many skills wastes tokens upfront. An agent missing a critical skill produces silent failure — it tries to schedule content, doesn't know how, and either guesses or escalates unnecessarily.
After the audit: typefully added to both Sara and Liv. Quinn and Quill both went from zero skills to properly equipped. Before the audit, they had detailed ROLE.md files and no tools to work with.
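The gap direction can be screened for too: diff the tools a ROLE.md names against the skills actually assigned. A sketch under the same crude-substring caveat as before, with a hypothetical role text and skill catalogue:

```python
def gap_candidates(assigned: set, role_md: str, known_skills: set) -> set:
    """Skills the ROLE.md names as tools but the agent was never assigned.
    These are the silent failures: the role promises a capability the
    spawned agent does not actually have."""
    text = role_md.lower()
    mentioned = {s for s in known_skills if s.lower() in text}
    return mentioned - assigned

# Hypothetical scheduler role that names a tool but holds zero skills.
role = "Schedule content via typefully for review before publishing."
print(gap_candidates(set(), role, known_skills={"typefully", "fact-checker"}))
```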
Quinn's Identity Conflict
Quinn is the overflow agent — handling cross-functional tasks that span multiple specialist lanes, quick lookups that don't justify Archie's full research depth, backup when a specialist is mid-task.
Except that's not what her SOUL.md said.
Her workspace-level SOUL.md described Quinn as a QC specialist: reviewing content against humanizer patterns, scoring AI-isms, conducting voice consistency audits. Her agents-directory SOUL.md was a generic 14-line "overflow handler." Her ROLE.md and the team routing table both defined her as overflow.
Three different files. Two different jobs. Neither was being served well.
The fix was clean: retire the QC persona (it had never actually been activated — Oli does content review), write a single coherent overflow identity, update the routing table. One pass. The key insight is that an identity conflict doesn't resolve itself over time — it gets worse as more routing decisions accumulate against a broken definition.
Ghost Agents
Finn is the video pipeline specialist. Becky is the pricing intelligence agent. Both have workspaces. Both have ROLE.md files. Both have SOUL.md files. Neither was in the openclaw.json agents list.
They could not be spawned. They were ghosts.
This is the kind of gap that doesn't surface until you try to use an agent and the system reports it doesn't exist. The config debt had accumulated quietly: someone set up the workspaces and wrote the role definitions, but never completed the registration step.
Both agents were added to the config during this rollout — with appropriate skills, subagent permissions, and workspace assignments. Finn got showcase-video-builder, elevenlabs-toolkit, and proton-drive-backup (his ROLE.md mandates backup to Proton Drive after every video). Becky got fact-checker as her core research tool and self-improving-agent for continuous learning.
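Ghost detection is a straight set difference between workspaces on disk and names in the config. A sketch — the real shape of openclaw.json is an assumption here, reduced to an agents list with name fields:

```python
import json

def ghost_agents(workspace_dirs: set, config_text: str) -> set:
    """Agents with a workspace on disk but no entry in the config's
    agents list -- they exist in files yet can never be spawned."""
    registered = {a["name"] for a in json.loads(config_text)["agents"]}
    return workspace_dirs - registered

# Hypothetical config: two workspaces on disk, one registration.
cfg = '{"agents": [{"name": "finn"}]}'
print(ghost_agents({"finn", "becky"}, cfg))  # becky is a ghost
```

Run as part of the same sweep as the file check, this catches the half-finished registration the moment it happens instead of at spawn time.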
The Eight Constitutional Principles
The template handles structure. The principles handle governance. Eight principles, non-negotiable across every agent.
Article 1. Scope Fidelity — Act within your defined role. Route, don't absorb.
Article 2. No Self-Upgrade — Model selection is Loki's authority. Period.
Article 3. OUTBOX Discipline — Raw facts to OUTBOX.md before session end. Always.
Article 4. Spend Transparency — Any action that costs money gets flagged first.
Article 5. Escalation over Assumption — Unclear scope or authority → surface to Loki.
Article 6. No Fabrication — Missing data is reported as missing, not invented.
Article 7. Privacy by Default — Sensitive data stays in the workspace. Least privilege.
Article 8. Loki is Orchestrator — Routing, overrides, and final decisions belong to Loki.
Most of these weren't new rules — they were implicit expectations that had never been written down. Articulating them explicitly changed how we audited compliance. Before the rollout, "does this agent escalate appropriately?" was a subjective judgment. After: either the ROLE.md has an escalation protocol section, or it doesn't.
Phase 2 (Oli and Kit) was the first application after Quill's reference implementation. Both were in reasonable shape — strong identities, meaningful hard rules in their existing files — but neither had an explicit escalation protocol, No Self-Upgrade statement, or OUTBOX discipline structure. Those gaps don't feel urgent until they cause a problem. Writing them down turns a latent risk into a checked box.
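The "checked box" framing can itself be automated: scan each ROLE.md for the constitutionally required sections. The section names below are assumptions standing in for whatever headings the standard actually mandates:

```python
REQUIRED_SECTIONS = ("Escalation Protocol", "No Self-Upgrade", "OUTBOX")

def unchecked_boxes(role_md: str) -> list:
    """Constitutional sections a ROLE.md has not yet written down."""
    text = role_md.lower()
    return [s for s in REQUIRED_SECTIONS if s.lower() not in text]

# Hypothetical file with a strong identity but two missing sections.
role = "## Scope\nOwns evaluations.\n## OUTBOX\nRaw facts before session end."
print(unchecked_boxes(role))  # escalation and self-upgrade are absent
```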
What Didn't Change
The surgical constraint was important. Don't touch personality. Don't touch voice. Add constitutional structure without overwriting what was already working.
Archie's Archimedes persona — measured, precise, British, data-first — was preserved even though his AGENTS.md had been pointing to stale agents/analyst/ paths from a previous monitoring role. The persona still fits; the operational text didn't. We corrected the text and left the personality.
Oli kept his Trini voice patterns and his Green Arrow identity. Kit kept the raccoon engineer energy. Belle kept the Tinker Bell sass and her specificity philosophy (she'll flag a contrast ratio with an actual number, not just "low contrast"). The constitutional additions sat alongside these identities rather than replacing them.
The implementation rule from the standard itself: if a constitutional file could be swapped with another agent's name and still read correctly, it is not constitutionally complete. That test kept each file specific enough to be useful.
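The name-swap test is ultimately a human judgment, but a rough automated proxy is possible: measure what fraction of a file's lines are tied to this specific agent by name or role keyword. This is a sketch of that idea, not part of the standard; near-zero specificity suggests the file would survive a name swap and therefore fails the test:

```python
def specificity(text: str, agent_name: str, role_keywords: set) -> float:
    """Fraction of non-empty lines tied to this agent by name or role
    keyword. A rough proxy for the name-swap test: near zero means the
    file would read just as well under any other agent's name."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    def tied(line):
        low = line.lower()
        return agent_name.lower() in low or any(k in low for k in role_keywords)
    return sum(1 for line in lines if tied(line)) / len(lines)

generic = "Helpful agent.\nFollows instructions.\nEscalates when unsure."
specific = "Belle flags a contrast ratio with a number.\nAccessibility first."
print(specificity(generic, "belle", {"contrast", "accessibility"}))
print(specificity(specific, "belle", {"contrast", "accessibility"}))
```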
Six Phases, One Standard
The rollout ran in priority order, defined by governance risk:
1. Quill — reference implementation; already had the strongest existing identity draft
2. Oli, Kit — strong starting points; clean additions
3. Archie, Sara — reasonable identity; stale operational context (Archie's stale path, Sara's wrong status label)
4. Liv, Belle — missing ROLE.md entirely in their workspace directories; identity mislabelled (Liv was still described as "marketing and growth" instead of ops)
5. Quinn — identity conflict; zero skills; no quality rubric
6. Finn, Becky — not in the config at all; mature files but unregistered
Every agent is now compliant with all 8 principles. The structure holds.
The Concrete Takeaway
If you're building a multi-agent system, the filing problem will find you eventually. It usually shows up as unexpected behaviour — an agent picks up work it shouldn't, or fails silently on something its ROLE.md says it handles. By then, you're debugging instead of building.
The four-file template isn't complicated. You can define it in an afternoon. The hard part is enforcing specificity — making each file actually agent-specific rather than a lightly renamed version of a generic template.
Start with ROLE.md. Before SOUL.md, before skills, before anything else: write down the trigger conditions. When does this agent get spawned? What does it not do? If you can't answer those two questions in 10 sentences, the agent is not ready to be part of a production team.
The rest follows from that. Skills align to scope. Escalation becomes obvious once scope is defined. And when something drifts — because it will drift — you have a reference to audit against.
Build the structure first. The personality can come later.
The agent constitutional standard, all 6 phase progress reports, and the skills audit are on file in our workspace. We're continuing to use this framework as the team grows.