
My OpenClaw chronicles — 958 shadow test runs later: what the data actually shows about local AI quality

Nissan Dookeran · 8 min read
Tags: ai, evaluation, models

I didn't set out to build an evaluation framework. I set out to answer one question: can a local Ollama model running on a Mac Mini M4 do the work Claude Sonnet is currently doing, without me noticing the difference?

That question turned into 958 scored runs across 38 model/task pairs, seven task types, a two-judge ensemble, a tiebreaker API, nineteen sprints, and a grand total of zero promoted models.

That last part is important. I'll get back to it.


What I'm actually testing

The setup: a Mac Mini M4 with 24GB unified memory, running eight candidate models simultaneously via Ollama. No cloud inference for the candidates — everything local, $0.00 per inference call. The task is to prove — statistically, not by demo — that one or more of these models can replace Claude Sonnet on specific delegated sub-tasks.

The candidates: llama3.1, ministral-3:8b, phi4, qwen2.5, qwen2.5-coder, deepseek-r1:8b, deepseek-coder, and granite4:tiny-h.

That last one, granite4:tiny-h, was designated as the control floor. The worst expected performer. The baseline everything else had to beat. I'll come back to what actually happened with it.

The task types I'm evaluating across: classify, format, summarize, extract, analyze, rag_synthesis, code_transform. For each, Claude Sonnet generates ground truth. The candidate model runs the same prompt. I score the difference across six dimensions:

| Dimension | Weight |
|---|---|
| Structural accuracy | 25% |
| Semantic similarity | 25% |
| Factual drift | 20% |
| Task completion | 15% |
| Tool call correctness | 10% |
| Latency | 5% |
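The composite is just a weighted sum of those six dimensions. A minimal sketch — the dimension names and weights come from the table above, but the function itself is illustrative, not the production scorer:

```python
# Sketch of the weighted composite score. Dimension keys and weights
# match the table above; the real scorer has more plumbing around this.
WEIGHTS = {
    "structural_accuracy": 0.25,
    "semantic_similarity": 0.25,
    "factual_drift": 0.20,
    "task_completion": 0.15,
    "tool_call_correctness": 0.10,
    "latency": 0.05,
}

def composite(dimensions: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each in [0.0, 1.0]."""
    return sum(WEIGHTS[k] * dimensions[k] for k in WEIGHTS)

# A run with a small structural wobble but everything else strong
# just clears the promotion bar:
print(round(composite({
    "structural_accuracy": 0.90,
    "semantic_similarity": 0.95,
    "factual_drift": 1.00,
    "task_completion": 1.00,
    "tool_call_correctness": 1.00,
    "latency": 0.80,
}), 4))  # -> 0.9525, just above the 0.95 promotion bar
```

Because structural accuracy and semantic similarity together carry half the weight, a model that reaches the right answer in the wrong shape bleeds score fast — which matters later in this post.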

The judges: Claude Sonnet and gpt-4o-mini as independent scorers on a 15% random sample. If they disagree by more than 0.20, Gemini 2.5-flash breaks the tie. The other 85% of runs are scored by Layer 1 and Layer 2 validators — structural checks and heuristic drift detection — before any LLM judge is called. A run that fails Layer 1 scores 0.0 and never reaches the judge. At scale, that saves real money. Over 958 runs, I've avoided somewhere around 130-140 judge API calls I didn't need to make.
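The control flow is easier to see in code than in prose. This is a sketch of the routing logic, not the real pipeline — the `judge_*` and `tiebreak` callables stand in for the Sonnet, gpt-4o-mini, and Gemini API calls, and the validator checks are stubbed as plain predicates:

```python
import random

DISAGREEMENT_THRESHOLD = 0.20
SAMPLE_RATE = 0.15

def score_run(output, layer1_check, layer2_check,
              judge_a, judge_b, tiebreak, rng=random.random):
    """Sketch of the scoring pipeline's routing logic."""
    # Layer 1: structural validation. A failure scores 0.0 and never
    # reaches an LLM judge -- this is where the API savings come from.
    if not layer1_check(output):
        return 0.0, "layer1_fail"
    # Layer 2: heuristic drift detection, still free.
    if not layer2_check(output):
        return 0.0, "layer2_fail"
    # Only a 15% random sample gets LLM-judged at all.
    if rng() >= SAMPLE_RATE:
        return None, "validators_only"  # scored by validators, no judge call
    a, b = judge_a(output), judge_b(output)
    if abs(a - b) > DISAGREEMENT_THRESHOLD:
        return tiebreak(output), "tiebreak"  # third judge breaks the tie
    return (a + b) / 2, "ensemble"
```

The `rng` parameter is only there so the sampling branch can be pinned down in tests; in real use it's just `random.random`.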

The promotion gate: 200 runs with a mean score of 0.95 or above. Not 100 runs, not 0.85. I tightened it from the original spec because 0.85 is a B+, and B+ doesn't belong in production routing.
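The gate itself is two conditions, and both have to hold — a sketch (my illustration of the rule, not the accumulator's actual code):

```python
PROMOTION_RUNS = 200
PROMOTION_MEAN = 0.95

def promotable(scores: list[float]) -> bool:
    """Promotion gate: enough runs AND a high enough mean.
    A perfect mean at small n still fails on sample size."""
    return (len(scores) >= PROMOTION_RUNS
            and sum(scores) / len(scores) >= PROMOTION_MEAN)

print(promotable([1.0] * 23))    # False: phi4/classify today -- mean 1.000, n too small
print(promotable([0.96] * 200))  # True: clears both bars
```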


What the scoreboard shows at run 958

Total scored runs: 958. Total cost: roughly $1.80 in Sonnet API calls for ground truth. Local inference: $0.00.

Top performers by mean score, requiring at least 20 runs for inclusion:

| Model | Task | n | Mean |
|---|---|---|---|
| phi4 | classify | 23 | 1.000 |
| qwen2.5 | classify | 34 | 0.989 |
| phi4 | format | 23 | 0.982 |
| granite4:tiny-h | format | 38 | 0.932 |

Those top two numbers — 1.000 and 0.989 — are real. phi4 has hit a perfect score on 23 consecutive classify runs. qwen2.5 is at 0.989 over 34. These aren't cherry-picked demo outputs. They're scored by a two-judge ensemble with a third-party tiebreaker, applied to a random 15% sample. The other 85% pass Layer 1 and Layer 2 validators.

For context on what these tasks look like: classify is a structured single-label classification task. Format is a structured output formatting task where exact form matters as much as content. Both reward compact, deterministic responses — which maps well to instruction-tuned models.

The weakest task type: analyze. No model has broken 0.60 mean. There's a reason for that, and it's partially my fault.


The analyze bug

For the first eighteen sprints, every analyze score came back between 0.44 and 0.59, regardless of which model I tested. Every single run failed — a 100% failure rate. I kept looking at the models. The problem was the evaluator.

The structural accuracy dimension — 25% of the total score — uses difflib.SequenceMatcher.ratio() for non-JSON tasks. That's a character-level edit distance metric. For extraction and classification tasks where format matters, it's fine. For open-ended analysis prose, it's catastrophically wrong.

Two analyses that reach the same correct conclusion in different words score approximately 0.29 on SequenceMatcher. With structural accuracy weighted at 25%, that alone caps the total score at roughly 0.74 even if the model is perfect on every other dimension.

I verified this with a manual test. I took two semantically identical analyses of the same question — same conclusion, different phrasing — and ran them through difflib.SequenceMatcher. Result: 0.295. Same meaning. 29.5% character overlap.
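The failure mode is trivial to reproduce with any two paraphrases. These strings are stand-ins, not the actual eval pair, but the effect is the same:

```python
from difflib import SequenceMatcher

# Two stand-in analyses: same conclusion, different words.
a = "The outage traces back to an unbounded cache that leaked memory; cap its growth."
b = "Recommendation: bound the cache. Root cause is a memory leak from unbounded caching."

# ratio() is character-level edit similarity, not semantic similarity.
ratio = SequenceMatcher(None, a, b).ratio()
print(ratio)  # well under any sensible pass threshold, despite identical meaning
```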

The fix was a per-task weight override: drop structural accuracy to 10% for analyze tasks, raise semantic similarity (cosine distance via Voyage embeddings) to 40%. The embedding space captures meaning. Character edit distance doesn't. I also added structured output templates — a system_prompt forcing both GT and candidates into the same Finding / Recommendation / Confidence / Reasoning format — which lifted semantic similarity scores from 0.55–0.70 to 0.83–0.92.
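The override mechanics are simple — the numbers below are the post-fix weights from this post, while the merge-a-dict structure is just how I'd sketch it:

```python
# Default dimension weights, as in the scoring table earlier.
DEFAULT_WEIGHTS = {
    "structural_accuracy": 0.25,
    "semantic_similarity": 0.25,
    "factual_drift": 0.20,
    "task_completion": 0.15,
    "tool_call_correctness": 0.10,
    "latency": 0.05,
}

# Per-task overrides: analyze drops structural accuracy to 10% and
# raises semantic similarity (Voyage embedding cosine) to 40%.
TASK_OVERRIDES = {
    "analyze": {"structural_accuracy": 0.10, "semantic_similarity": 0.40},
}

def weights_for(task_type: str) -> dict[str, float]:
    """Merge the defaults with any per-task override; weights still sum to 1.0."""
    return {**DEFAULT_WEIGHTS, **TASK_OVERRIDES.get(task_type, {})}
```

Note the override keeps the total at 1.0: the 15 points taken from structural accuracy go straight to semantic similarity.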

Post-fix: mean scores of 0.70, with 7 out of 10 manual evals above the 0.75 pass floor. Pre-fix: 0 out of 112.

That's 112 analyze runs I can't use. Historical scores from before Sprint 18 are not retroactively corrected — they represent the broken evaluation. The accumulator is re-running with the fixed weights.


The control floor that wasn't

I designated granite4:tiny-h as the control floor — the lowest expected performer, there to set a baseline everything else had to beat. IBM's 4.2GB enterprise model, small and conservative. I assumed it would anchor the bottom of the scoreboard.

It didn't.

At Sprint 9, granite4 was scoring 0.926 on format. That beat every other candidate I had in the format chain. By Sprint 19 — with 38 runs accumulated — it sits at 0.932 mean on format, with a last-10 of 0.935. It's not the floor. It's the fourth-highest performer on the scoreboard.

I've since moved phi4 to the control floor role. phi4 then proceeded to hit 1.000 on classify over 23 runs.

What this tells me about the scoring system: 50% of the total score rewards resembling Claude Sonnet's output (structural accuracy plus semantic similarity). A compact, instruction-tuned enterprise model naturally produces outputs that look like Sonnet's compact outputs. General-purpose conversational models don't. This is correct behaviour — Sonnet is the production target — but it means I was wrong about which models would compete. I assumed model size and brand would correlate with evaluation scores. They didn't.

I should have measured first and labelled second. I didn't, and it took nine sprints to sort out.


The thing nobody has done yet

No model has been promoted.

The promotion gate requires 200 runs at a mean of 0.95 or above. The closest candidates — phi4/classify at 1.000 (n=23) and qwen2.5/classify at 0.989 (n=34) — are both above threshold on mean but are at 11-17% of the required run count. At the current accumulator rate, running seven task types every ten minutes, the promotion gate is roughly five to six weeks out for the fastest-accumulating task type.

This was the point of setting 200 runs, not 50. At n=23, even a perfect score has a confidence interval wide enough to matter. I want to know if phi4 holds 1.000 through 200. I don't know yet.
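A back-of-the-envelope way to see how wide that interval is: the "rule of three" says that with zero failures in n trials, a roughly 95% upper confidence bound on the true failure rate is about 3/n.

```python
# Rule of three: 0 failures observed in n trials implies a ~95%
# upper bound on the true failure rate of roughly 3/n.
def failure_rate_upper_bound(n: int) -> float:
    return 3 / n

print(round(failure_rate_upper_bound(23), 3))   # 0.13: a 13% failure rate is still consistent with 23 perfect runs
print(round(failure_rate_upper_bound(200), 3))  # 0.015: at n=200 the bound tightens to 1.5%
```

In other words, 23 perfect runs can't rule out a model that fails more than one run in ten. Two hundred runs can.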

What I do know: the system is working. The scores are moving in the right direction. The bugs I've found — the round-robin router that wasn't round-robining, the classify prompts that were demanding JSON from a plain-text task, the accumulator that silently died for six hours on an IndentationError, the deepseek-r1 that scored 0.096 because I was evaluating its chain-of-thought scratchpad instead of its answer — all of those found their way into the data before I caught them. That's exactly what a statistical evaluation framework is supposed to do: run long enough that you can't hide the bugs.


Where things actually stand

The two models I'd bet on for promotion, if I had to pick today: phi4 on classify and qwen2.5 on classify. Both above 0.95 mean, both trending stable. Format is the secondary priority — phi4 at 0.982 and qwen2.5 at an earlier read of 0.987 are both strong there too.

The task types I wouldn't promote anything on yet: analyze (still validating the post-fix weights), rag_synthesis (best is phi4 at 0.730, n=15 — not close), code_transform (deepseek-coder scoring 0.428 despite being purpose-built for code; phi4 beating it at 0.869).

The next article in this series will cover the vector backend comparison — seven FOSS vector databases tested head-to-head for RAG retrieval, and why the hybrid search model returned the wrong document with higher confidence than the correct one. The short version: BM25 is not universally better. The longer version involves alpha tuning, keyword ambiguity, and a retrieval failure that looked like success until I checked what document it actually returned.

The accumulator is running. Scores are coming in every ten minutes. Nothing has been promoted yet.


Hardware: Mac Mini M4, 24GB unified memory. Models running simultaneously via Ollama. Scoring infrastructure: Python 3.14.3 on pyenv, Langfuse self-hosted for observability, Voyage voyage-3-lite for semantic similarity, 1Password service account for secret management.

958 runs. $1.80 in API costs. 0 promotions.