My OpenClaw Chronicles — How I built a statistical proof that local AI can replace Claude
"The local model was good" means nothing. I've read enough AI demos to know what "it looks good in testing" actually means: someone ran a prompt they liked, got an answer they liked, and stopped there. That's not a proof. It's anecdote dressed up in tech clothing. The question I wanted to answer was different: can local Ollama models replace Claude Sonnet for delegated sub-tasks? Not in one demo. Not in my personal judgment. But statistically, at 200 evaluated runs, with independent judges, across multiple task types. Building the measurement machine took longer than building the thing being measured.
The ground truth problem
Any evaluation needs a reference answer. For this system, the reference is Claude Sonnet 4: the current production model, the one that would be replaced if the evaluation succeeded. Every shadow test works like this. A task arrives: summarise this text, classify this intent, extract this entity. Sonnet runs it and generates the ground-truth response. The local model runs the same task in parallel. Then the evaluator scores the local model's output against Sonnet's. This means every evaluation run costs money on the Sonnet side, even as you're trying to prove you don't need Sonnet. The first 45 runs cost about $0.05 in GT generation; across our 4,193 accumulated runs, the GT cost scaled proportionally. You can't evaluate your way to cost savings without spending some money on the evaluation. Accept this upfront.
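A minimal sketch of that shadow loop (function names like `run_sonnet` and `run_local` are illustrative stand-ins for the actual clients, and the two model calls are shown sequentially rather than in parallel):

```python
from dataclasses import dataclass

@dataclass
class ShadowResult:
    task_id: str
    gt_output: str       # Sonnet's ground-truth response
    local_output: str    # candidate model's response
    score: float         # evaluator's weighted score vs GT

def shadow_test(task, run_sonnet, run_local, evaluate):
    """One shadow run: Sonnet produces the reference (this is the
    side of every run that costs money), the local candidate runs
    the same task, and the evaluator scores the local output
    against the reference."""
    gt = run_sonnet(task)
    candidate = run_local(task)
    score = evaluate(candidate, reference=gt)
    return ShadowResult(task["id"], gt, candidate, score)
```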
Six dimensions, not one score
The first version of the evaluator used a single similarity score. This was wrong. A response can be structurally correct but semantically off. It can be semantically close but factually drifted. It can complete the task while running 3× slower than the baseline. That matters if the task is user-facing; it doesn't matter at all if it's a background summarisation job. The evaluation dimensions we settled on, with weights:
| Dimension | Weight | How it's measured |
|---|---|---|
| Structural accuracy | 25% | Layer 1: JSON validity, required keys, forbidden patterns |
| Semantic similarity | 25% | Voyage AI's voyage-3-lite cosine similarity vs GT |
| Factual drift | 20% | Layer 2 heuristic: novel numbers, entities, URLs not in GT or context |
| Task completion | 15% | LLM-as-judge (Zheng et al., 2023), sampled on 15% of runs; None on unsampled runs |
| Tool call correctness | 10% | Pattern matching |
| Latency | 5% | Normalised ratio: GT latency ÷ local latency, capped at 1.0 (1.0 = at or better than GT) |
The distinction between None and 0.0 for unsampled task completion scores turns out to matter. Logging None as 0.0 would penalise every un-judged run. The Langfuse (open-source LLM observability) logging layer crashed on float(None) at run 100 before this was fixed. That's one of the more embarrassing bugs in the project: a metric that was silently wrong for dozens of runs.
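A sketch of the aggregation and logging logic this implies. The dimension names and the choice to renormalise weights over the measured dimensions are my assumptions, not the production code:

```python
WEIGHTS = {
    "structural": 0.25, "semantic": 0.25, "drift": 0.20,
    "completion": 0.15, "tool_calls": 0.10, "latency": 0.05,
}

def weighted_score(scores: dict) -> float:
    """Weighted mean over the six dimensions. A None score means
    'not measured' (e.g. the judge wasn't sampled on this run) and
    is dropped, with the remaining weights renormalised --
    coercing None to 0.0 would penalise every un-judged run."""
    present = {k: v for k, v in scores.items() if v is not None}
    total_weight = sum(WEIGHTS[k] for k in present)
    return sum(WEIGHTS[k] * v for k, v in present.items()) / total_weight

def safe_log(name: str, value):
    """Only log numeric scores; float(None) is exactly the crash
    the observability layer hit at run 100."""
    if value is not None:
        print(f"{name}={float(value):.3f}")
```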
Why two judges and a tiebreaker
One LLM judge introduces correlated bias, a known risk in the LLM-as-judge evaluation framework (Zheng et al., 2023). If the judge is Claude Opus, it'll naturally lean toward outputs that look like Claude's style. If the judge is GPT, it'll have its own systematic preferences. The solution: two independent judges, neither of which is a candidate model. Primary judges: Claude Sonnet 4 and gpt-4o-mini. Different providers, different training data, different evaluation biases. They score independently. When their scores diverge by 0.20 or more, Gemini 2.5 Flash fires as a tiebreaker. This happens on less than 1% of runs. Most tasks do not produce scores that diverge that dramatically between two competent judges. The tiebreaker is cheap insurance for the edge cases. Judges are sampled at 15%. Most runs go through Layer 1 (structural validation) and Layer 2 (factual drift heuristics) only. If the output fails Layer 1 (invalid JSON when JSON was required, missing required keys), it scores 0.0 immediately and no judge is called. Across our 4,193 runs, this prevented hundreds of judge API calls for outputs that were obviously broken. The cost logic is direct: at 15% sampling, roughly 600 out of 4,000 runs called a judge. At full 100% sampling, that's 4,000 judge calls. At gpt-4o-mini pricing, that difference is real money over a month of accumulation.
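The cascade can be sketched as follows. Averaging the two agreeing judges and taking the median when the tiebreaker fires are assumed aggregation choices; only the 0.20 divergence threshold and the Layer 1 short-circuit come from the description above:

```python
import json

DIVERGENCE_THRESHOLD = 0.20

def layer1_pass(output: str, required_keys: tuple) -> bool:
    """Layer 1 structural validation: invalid JSON or a missing
    required key scores 0.0 immediately, and no judge is called."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return all(k in data for k in required_keys)

def judge_score(output, reference, judge_a, judge_b, tiebreaker):
    """Two independent judges score the output. If they diverge by
    0.20 or more, a third judge fires (on <1% of runs) and the
    median of all three scores decides."""
    a = judge_a(output, reference)
    b = judge_b(output, reference)
    if abs(a - b) < DIVERGENCE_THRESHOLD:
        return (a + b) / 2
    c = tiebreaker(output, reference)
    return sorted([a, b, c])[1]  # median of the three
```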
The promotion gate: why 0.95 and not 0.85
The original spec was 100 runs and a mean score of 0.85 for promotion. I tightened this to 200 runs and 0.95. 0.85 is a B+ in school. In a production routing system where the promoted model handles real agent tasks, B+ creates noticeable degradation. The cost of a false promotion (routing 10% of real tasks to a model that's actually 0.85 quality and degrading outputs for weeks before you notice) exceeds the cost of 100 extra shadow tests by a wide margin. 200 runs gives a meaningful confidence interval on a 0-1 bounded score. 100 runs does not. At 200 runs with a mean of 0.95 and reasonable variance, you can say with some statistical confidence that the model genuinely performs at that level. At 100 runs, you might be looking at a lucky streak. The promotion gate is binary: either a model clears all three criteria (200 runs, mean ≥ 0.95, cost efficiency > 10× Sonnet baseline) or it stays as a shadow candidate. There's no partial promotion, no A/B routing at lower confidence. The gate exists because a premature promotion is harder to detect and fix than a delayed one.
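The gate itself reduces to a three-condition check. A minimal sketch; `standard_error` is included only to illustrate why 200 runs gives a tighter interval than 100 (standard error shrinks with √n), and the cost-efficiency figure is assumed to arrive pre-computed:

```python
import math

MIN_RUNS = 200
MIN_MEAN = 0.95
MIN_COST_EFFICIENCY = 10.0   # x the Sonnet baseline

def promotable(scores: list, cost_efficiency: float) -> bool:
    """Binary promotion gate: clear all three criteria or stay a
    shadow candidate. No partial promotion, no low-confidence
    A/B routing."""
    if len(scores) < MIN_RUNS:
        return False
    mean = sum(scores) / len(scores)
    return mean >= MIN_MEAN and cost_efficiency > MIN_COST_EFFICIENCY

def standard_error(scores: list) -> float:
    """Standard error of the mean: doubling the run count from 100
    to 200 tightens the confidence interval by a factor of ~1.4."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return math.sqrt(var / n)
```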
The first result
The first live score on the RAG synthesis task type came from ministral-3:8b in Sprint 4, on the same day the infrastructure was built.
| Dimension | Score |
|---|---|
| Structural accuracy | 0.818 |
| Semantic similarity | 0.954 |
| Factual drift | 1.000 |
| Task completion | 1.000 |
| Tool call correctness | 1.000 |
| Latency | 0.292 |
| Weighted total | 0.933 |
0.933 on the first RAG run. In our testing, this was higher than llama3.1 had reached on any task type across its 16 accumulated runs. The reason is instructive: the corpus is in the prompt. RAG synthesis turns a knowledge-retrieval problem into a reading-comprehension problem. The model can't hallucinate facts that weren't in the retrieved chunks. The factual drift score is 1.000 because there was nothing to drift to. 7B models handle reading comprehension well. They don't handle memory retrieval reliably. The latency score of 0.292 is the drag: 6.5 seconds against Sonnet's 1.9 seconds pulls the score from a theoretical ~0.96 to 0.933. For a background summarisation task, that tradeoff is fine. For a user-facing call with a 2-second SLA, it isn't. The evaluation system captures this distinction; a single-score evaluator wouldn't.
What the measurement machine is actually for
By the time our system had logged 4,193 total runs, the question 'can local models replace Claude' had a more precise answer: on some task types, yes. On others, not yet. The specific tasks where local models compete are the ones you'd expect: structured output, classification, faithful retrieval. Not the ones that require genuine reasoning or synthesis from incomplete information. That specificity only exists because the measurement machine ran continuously for weeks, across multiple models and task types, with enough data to distinguish a lucky streak from a real result. Building the machine cost more than the answer was worth on day one. On day 28, with 4,193 runs and clear promotion candidates, it was the most valuable thing in the project.
What the machine taught us next
The hybrid control plane experiment did what it was designed to do: it measured. Across our 4,193 accumulated runs, a 6-dimension rubric, 15% judge sampling, a 0.95 promotion gate. By Sprint 6 we had statistical evidence that local models could match Claude Sonnet on specific task types at specific quality thresholds. That's not a marketing claim. That's a result. But having a result is different from having a system, and the complexity of running shadow tests, maintaining ground truth, and calibrating a promotion gate started to feel like overhead for the actual question I wanted to answer next. That question is: which local model is best for which type of task? Not whether local can match Claude in aggregate, but whether you can get Claude-quality outputs on specific task types by routing to the right local model.

That's what Ralph Lab is built to answer. It's a task-type specialisation benchmark: simpler in structure than the hybrid control plane, but more targeted. In our testing so far: 505 runs across 7 local models (qwen2.5-coder, qwen3:8b, ministral-3:8b, deepseek-r1:8b, qwen3:1.7b, granite4:tiny-h, and one more slot rotating through candidates), 10 task types, LLM-as-judge scoring on a 10-point scale. No shadow testing, no promotion gate. Just direct head-to-head comparisons across task categories. The output of Ralph Lab is a routing table: for each task type, which model scores highest?

The table is already taking shape. What it's becoming is the empirical foundation for something more structured: an arXiv paper on task-type routing for local LLMs. The benchmark design has four conditions: local single-model (best average performer), local routed (routing table applied), hybrid cloud/local (smart escalation to Claude on low-confidence tasks), and frontier ceiling (Claude Sonnet on everything). The research question is whether smart routing closes the gap, and by how much. We don't know yet. That's not a hedge. It's the honest state of the experiment.
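A routing table of this kind is just an argmax per task type. The scores below are invented placeholders on the lab's 10-point judge scale, not Ralph Lab's actual results:

```python
def build_routing_table(results: dict) -> dict:
    """results maps task_type -> {model: mean judge score}.
    The routing table picks, for each task type, the
    highest-scoring model."""
    return {
        task: max(scores, key=scores.get)
        for task, scores in results.items()
    }

# Illustrative data only -- not measured benchmark scores:
example = {
    "classification": {"qwen3:8b": 8.4, "ministral-3:8b": 7.9},
    "rag_synthesis": {"qwen3:8b": 7.1, "ministral-3:8b": 8.8},
}
```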
The hybrid control plane proved local models can compete. Ralph Lab is mapping where they compete best. The paper will quantify the gap that routing closes. A future post will be that story.