I built redundancy. It failed redundantly.
Four providers in the fallback chain. Two clusters of cascade failures in a single day. Nine total. I didn't catch any of them in real time because I was asleep for the first batch and away from my desk for the second.
I'll explain the chain, what fell over, and exactly why the redundancy I was proud of turned out to be a cardboard wall.
The setup I was confident about
My OpenClaw config had a four-layer fallback chain for every AI call:
- Claude Opus (primary)
- Claude Sonnet (first fallback)
- GPT-4.1 (second fallback)
- Ollama qwen2.5:latest (local, last resort — free, offline-capable)
The idea was solid. If Anthropic's API is struggling, drop to OpenAI. If OpenAI is down or I've hit the rate limit, fall back to local. The system should be self-healing. I built that chain deliberately.
What I didn't think clearly about: both Opus and Sonnet run against the same Anthropic API key. Rate limit one, you've rate limited both. The "first fallback" was never a fallback at all — it was the same single point of failure wearing a different name badge.
That was mistake one.
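The same-key trap is detectable by inspection: count how many rungs of the chain resolve to a single credential pool. A minimal sketch, assuming chain entries are `provider/model` strings as in my routing config (the grouping-by-prefix heuristic is mine, not anything OpenClaw does):

```python
from collections import Counter

def credential_pools(fallback_chain):
    """Count how many chain rungs share each provider prefix.

    Assumes entries look like "provider/model". Rungs sharing a prefix
    share one credential pool, so a rate limit on one takes out all of
    them at once.
    """
    return Counter(entry.split("/", 1)[0] for entry in fallback_chain)

chain = [
    "anthropic/claude-opus-4-5",
    "anthropic/claude-sonnet-4-6",
    "openai/gpt-4.1",
    "ollama/qwen2.5:latest",
]

for provider, rungs in credential_pools(chain).items():
    if rungs > 1:
        print(f"{provider}: {rungs} rungs share one credential pool")
# anthropic: 2 rungs share one credential pool
```

Two of my four rungs collapse into one failure domain the moment Anthropic says 429.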
The timeline
7:00am – 8:00am AEST: Cluster one
- Five overnight autonomous sprints had been running while I slept
- My heartbeat agent fires every 30 minutes (Sonnet)
- My analyst agent fires every 30 minutes (Sonnet)
- My shadow accumulator fires every 10 minutes (Sonnet via the control plane)
- By the time I got up, the system had made an estimated 400+ API calls since the previous evening
The Anthropic rate limit hit. Opus failed. Sonnet failed — same key, same wall. The chain tried GPT-4.1 next.
Error: No API key found for provider openai
Not "API key invalid." Not "rate limit exceeded." No key found. Because auth-profiles.json didn't have an openai:default entry. OpenClaw found the provider listed in openclaw.json and then went looking for credentials that didn't exist.
The chain tried Ollama.
Error: Unknown model: ollama/qwen2.5:latest.
Ollama requires authentication to be registered as a provider.
Same story. Ollama was listed in the fallback chain in openclaw.json. It was not registered as a provider in auth-profiles.json. The system hit all four rungs of my redundancy ladder and fell through every single one.
Nine cascade failures across clusters one and two. Cluster two hit between 5:45pm and 6:45pm AEST, when background processes had refilled the request queue again. I was away from my desk. Same chain. Same result.
At 8:55am, separately, memory search returned a 429. Gemini embedding quota. That one wasn't part of the cascade — just a collision of independent quota limits on the same morning.
Total eval runs at the time of the incident: 1,433 across 31 model/task pairs. The evaluation harness kept trying to log results that the failing agents couldn't compute.
The config gap
Here's the thing. Both files looked correct in isolation.
openclaw.json — the model routing config — had all four providers listed:
```json
{
  "models": {
    "fallbackChain": [
      "anthropic/claude-opus-4-5",
      "anthropic/claude-sonnet-4-6",
      "openai/gpt-4.1",
      "ollama/qwen2.5:latest"
    ]
  }
}
```
auth-profiles.json — where credentials actually live — had two:
```json
{
  "profiles": {
    "anthropic:default": {
      "apiKey": "sk-ant-..."
    },
    "voyage:default": {
      "apiKey": "pa-..."
    }
  }
}
```
OpenAI: not there. Ollama: not there.
The fallback chain was aspirational. It described the providers I intended to use, not the providers I had authenticated. Those two files are supposed to agree with each other. I had never cross-checked them. I set up the fallback chain in one file, added providers over time to the other, and assumed they were in sync.
They weren't. And the failure mode wasn't graceful degradation — it was complete silence followed by nine error logs I found hours later.
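The cross-check I skipped is a few lines of code. A minimal sketch, assuming the two file shapes shown above: a `models.fallbackChain` of `provider/model` strings, and `provider:profile` keys under `profiles`:

```python
def unauthenticated_providers(routing_cfg, auth_cfg):
    """Return providers listed in the fallback chain but absent from auth.

    Shapes assumed from this post: routing_cfg mirrors openclaw.json,
    auth_cfg mirrors auth-profiles.json.
    """
    routed = {m.split("/", 1)[0] for m in routing_cfg["models"]["fallbackChain"]}
    authed = {k.split(":", 1)[0] for k in auth_cfg["profiles"]}
    return sorted(routed - authed)

# The pre-fix state of the two files:
routing = {"models": {"fallbackChain": [
    "anthropic/claude-opus-4-5",
    "anthropic/claude-sonnet-4-6",
    "openai/gpt-4.1",
    "ollama/qwen2.5:latest",
]}}
auth = {"profiles": {
    "anthropic:default": {"apiKey": "sk-ant-..."},
    "voyage:default": {"apiKey": "pa-..."},
}}

print(unauthenticated_providers(routing, auth))  # ['ollama', 'openai']
```

Run against the real files at startup, that one set difference would have failed loudly the day I added the chain, not silently at 7am months later.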
The fix
Straightforward once I saw the gap. I registered both missing providers in auth-profiles.json:
```json
{
  "profiles": {
    "anthropic:default": {
      "apiKey": "sk-ant-..."
    },
    "voyage:default": {
      "apiKey": "pa-..."
    },
    "openai:default": {
      "apiKey": "sk-proj-..."
    },
    "ollama:default": {
      "baseUrl": "http://localhost:11434"
    }
  }
}
```
Ollama doesn't need an API key — it needs a base URL. That's it. The "requires authentication" error was OpenClaw looking for a provider registration that simply wasn't there, not an actual authentication failure.
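One way to catch the key-vs-baseUrl distinction mechanically is to validate each profile's shape. A hedged sketch, assuming (as this post implies) that an Ollama registration needs a `baseUrl` while the cloud providers here need an `apiKey`; the rule table is mine, not OpenClaw's:

```python
def profile_errors(profiles):
    """Flag profiles missing the credential field their provider needs.

    Assumption: "ollama" registrations need a baseUrl; everything
    else in this setup needs an apiKey.
    """
    local_providers = {"ollama"}
    errors = []
    for name, creds in profiles.items():
        provider = name.split(":", 1)[0]
        field = "baseUrl" if provider in local_providers else "apiKey"
        if field not in creds:
            errors.append(f"{name}: missing {field}")
    return errors

fixed = {
    "anthropic:default": {"apiKey": "sk-ant-..."},
    "voyage:default": {"apiKey": "pa-..."},
    "openai:default": {"apiKey": "sk-proj-..."},
    "ollama:default": {"baseUrl": "http://localhost:11434"},
}
print(profile_errors(fixed))  # []  (the post-fix file passes)
```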
After the fix, I tested the chain manually: Opus → rate limited → Sonnet → rate limited → GPT-4.1 → success. Forcing the chain all the way down to Ollama confirmed local routing works too. The ladder holds now.
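The shape of that manual test is easy to model. A toy simulation of chain-walking, not OpenClaw's actual routing code; `call` stands in for whatever issues the provider request:

```python
class RateLimited(Exception):
    pass

def call_with_fallback(chain, call):
    """Walk the chain until one rung answers, collecting failures."""
    errors = []
    for model in chain:
        try:
            return model, call(model)
        except RateLimited as exc:
            errors.append((model, str(exc)))
    raise RuntimeError(f"all rungs failed: {errors}")

# Simulate the manual test: both Anthropic rungs 429, OpenAI healthy.
def fake_call(model):
    if model.startswith("anthropic/"):
        raise RateLimited("429 from the shared Anthropic key")
    return "ok"

model, _ = call_with_fallback(
    ["anthropic/claude-opus-4-5", "anthropic/claude-sonnet-4-6",
     "openai/gpt-4.1", "ollama/qwen2.5:latest"],
    fake_call,
)
print(model)  # openai/gpt-4.1
```

During the incident, `fake_call` effectively raised on all four rungs, and the `RuntimeError` branch is the "complete silence followed by error logs" I found hours later.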
But fixing the chain doesn't fix what caused the chain to be exercised nine times in a day.
The reckoning
Four hundred API calls before I woke up. That number sat badly with me, so I actually counted the sources.
- Heartbeat agent: every 30 minutes, Sonnet, 24/7 → 48 calls/day
- Analyst agent: every 30 minutes, Sonnet, 24/7 → 48 calls/day
- Shadow accumulator: every 10 minutes, Sonnet via control plane → 144 calls/day
- 5 overnight autonomous sprints: each one spawning its own call chain
That's the baseline load before I did a single intentional thing. Most of it was infrastructure keeping itself company.
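Those per-agent numbers are plain interval arithmetic, and it's worth making the sum explicit. A quick sketch of the scheduled (non-sprint) portion, before and after the changes listed below:

```python
MINUTES_PER_DAY = 24 * 60

def calls_per_day(interval_minutes):
    # A scheduled agent fires on a fixed interval, around the clock.
    return MINUTES_PER_DAY // interval_minutes

before = (
    calls_per_day(30)    # heartbeat, every 30 min        -> 48
    + calls_per_day(30)  # analyst, every 30 min          -> 48
    + calls_per_day(10)  # shadow accumulator, every 10m  -> 144
)
after = (
    calls_per_day(240)   # heartbeat, every 4 hours       -> 6
    + calls_per_day(30)  # shadow accumulator, every 30m  -> 48
)
print(before, after)  # 240 54
```

The overnight sprints account for the gap between 240 scheduled calls and the 400+ total; after the changes, the scheduled portion alone drops to 54 per day, in line with the roughly 56 calls/day figure.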
The adjustments I made after the incident:
- Heartbeat: 30 minutes → 4 hours
- Analyst agent: killed entirely
- Shadow accumulator: 10 minutes → 30 minutes
Background load dropped from 400+ calls/day to approximately 56 calls/day. An 86% reduction. The system didn't get less capable — I just stopped paying for it to narrate its own heartbeat every half hour.
The overnight sprints stay. Those are doing actual work. But five sprints running sequentially through a night, against a rate-limited key with no working fallbacks, is exactly how you generate a morning of cascade failures.
What I actually had
A fallback chain in one config file.
Not a working fallback chain. Just the documentation of one.
The distinction is worth stating plainly: OpenClaw reads both openclaw.json and auth-profiles.json when a call is made. The routing config says where to try. The auth config says whether the attempt can actually succeed. Having the provider listed in routing without credentials in auth doesn't fail loudly at startup — it fails silently at runtime, exactly when you needed the fallback to work.
I assumed the two files were in sync because I had mentally treated them as one config. They're not one config. They're two separate concerns, and keeping them aligned is on me.
The fix took about ten minutes. The nine cascade failures took all day to produce.
What changed
- `auth-profiles.json` now has all four providers that `openclaw.json` references. They'll stay in sync — I'm not adding a provider to one without adding it to the other.
- Background call volume is down 86%. The infrastructure is quieter. The working agents are louder by comparison, which is the right relationship.
- The rate-limit single point of failure (Opus + Sonnet on the same key) is still there, and I haven't solved it. OpenAI as the third rung helps. But if I ever want genuine Anthropic redundancy, I need a second API key under a second account — which is a billing decision I haven't made yet.
That's where I'm at. The chain works. The load is sane. The assumption that two config files agreed with each other has been corrected.
Next up in the chronicles: the evaluation harness that was quietly running 1,433 experiments in the background — what I was actually measuring, whether the numbers mean anything, and what I'd do differently if I were starting the benchmark from scratch.