My OpenClaw Chronicles

My eval was silently giving every analysis task a failing score for weeks (and why)

Nissan Dookeran · 7 min read
ai · evaluation · debugging

The whole point of building a statistical evaluation framework is that it tells you the truth. That's the pitch. Not vibes, not "it looked good in the demo" — numbers. Reproducible numbers. Numbers you can trust.

So I built one. Ground truth from Claude Sonnet, two independent judges, a tiebreaker, six weighted scoring dimensions, a 200-run promotion gate. The works. I wrote about the architecture in the first post. I was proud of it.

Then I ran it on analyze tasks for about nine sprints. Across five different models. And every single run came back below the pass floor.

Not some runs. Not the weaker models. Every run. Every model. Zero exceptions across 112+ scored candidate runs on analyze.

I thought local 7B–14B models just couldn't do open-ended analysis. I wrote it off. I moved on.

I was wrong about what I was measuring.


The setup

My eval framework scores model outputs across six dimensions:

| Dimension | Default weight |
|---|---|
| Structural accuracy | 25% |
| Semantic similarity | 25% |
| Factual drift | 20% |
| Task completion | 15% |
| Tool call correctness | 10% |
| Latency | 5% |

For JSON-structured tasks, structural accuracy uses a real schema validator — required keys, forbidden patterns, JSON validity. It's meaningful.
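For context, the JSON path looks something like this — a minimal sketch, not the framework's actual code; the parameter names `required_keys` and `forbidden_patterns` are my shorthand for the schema checks described above:

```python
import json
import re

def json_structural_accuracy(output: str, required_keys: list[str],
                             forbidden_patterns: list[str]) -> float:
    """Score a JSON output: validity, required keys present, no forbidden patterns."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0.0  # invalid JSON scores zero outright
    if not isinstance(data, dict):
        return 0.0
    checks = [k in data for k in required_keys]
    checks += [re.search(p, output) is None for p in forbidden_patterns]
    return sum(checks) / len(checks) if checks else 1.0
```

Each check is a concrete, binary question about the output, which is why the metric carries real signal on structured tasks.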

For prose tasks — summarise, extract, analyse — structural accuracy falls back to difflib.SequenceMatcher.ratio(): a character-level similarity ratio (roughly, the fraction of characters the two strings share in matching order) between the ground truth and the candidate. Two texts with the same meaning but different wording will score near zero on this metric.

I knew that. I set the weight anyway. I figured 25% on structural plus 25% on semantic (Voyage embeddings, cosine similarity) would balance each other out. Prose doesn't need to match character-for-character — it just needs to be semantically close, and semantic was there to capture that.

What I didn't think through: on an open-ended analysis task, a candidate model saying "The data shows qwen2.5 significantly outperforms llama3.1 on classification, which suggests it should be prioritised for that task type" and Claude Sonnet saying "qwen2.5 is the stronger classifier — llama3.1 lags by 28 points and shouldn't be the primary candidate" might mean essentially the same thing.

difflib.SequenceMatcher doesn't care. They share maybe 30% of their characters.

```python
import difflib

# gt_text / cand_text: the ground-truth and candidate analyses quoted above
difflib.SequenceMatcher(None, gt_text, cand_text).ratio()
# 0.295
```

With structural accuracy at 25% weight, a score of 0.29 on that dimension contributes just 0.07. Semantic similarity, realistically topping out around 0.70 on these outputs, contributes at most another 0.18. That caps the weighted total at roughly 0.74 — below my 0.75 pass floor — even with perfect scores on every remaining dimension.
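The ceiling arithmetic, spelled out — using 0.295 for structural and 0.70 as a realistic semantic cap, with the other four dimensions assumed perfect:

```python
# Weighted-score ceiling for analyze under the original weight profile.
weights = {"structural": 0.25, "semantic": 0.25, "factual": 0.20,
           "completion": 0.15, "tools": 0.10, "latency": 0.05}

# Structural is pinned near 0.295 by difflib on prose; semantic realistically
# tops out around 0.70; assume every other dimension is perfect.
scores = {"structural": 0.295, "semantic": 0.70, "factual": 1.0,
          "completion": 1.0, "tools": 1.0, "latency": 1.0}

ceiling = sum(weights[d] * scores[d] for d in weights)
# ceiling ≈ 0.749 — below the 0.75 pass floor
```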

It was never going to pass.


How long it ran broken

Nine sprints. From the day I added analyze as a task type to the day I finally dug into the data.

During that time, the demotion tracker was quietly doing its job. For every model on analyze, consecutive_failures was incrementing on every run. By the time I caught it, deepseek-r1/analyze had 33 consecutive failures — because it had only run 33 times. Same for every other model. The failure counter was just their total run count.
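The tracker logic itself is trivial — something like this sketch (class and field names are my assumption, as is the threshold of 5): increment on fail, reset on pass. Which is exactly why a counter equal to the total run count was such a loud signal.

```python
from collections import defaultdict

class DemotionTracker:
    """Track consecutive failures per (model, task_type) pair."""

    def __init__(self, demotion_threshold: int = 5):
        self.consecutive_failures = defaultdict(int)
        self.demotion_threshold = demotion_threshold

    def record(self, model: str, task: str, passed: bool) -> None:
        key = (model, task)
        if passed:
            self.consecutive_failures[key] = 0  # any pass resets the streak
        else:
            self.consecutive_failures[key] += 1

    def should_demote(self, model: str, task: str) -> bool:
        return self.consecutive_failures[(model, task)] >= self.demotion_threshold
```

A model that has *ever* passed would show a counter below its run count; a counter that equals the run count means the task has never passed once.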

The system wasn't broken. The eval was telling me something. I just read it as "local models are bad at analysis" rather than "the scoring function for this task type is wrong."

I didn't notice for nine sprints because the other task types were fine. format, classify, summarize, extract — all producing sane-looking scores, a spread of values, some models clearly better than others. The eval felt like it was working. And it was, for those tasks.

analyze was just one task type. It was accumulating data in the background. I wasn't watching it.


What the numbers actually looked like

All 112+ analyze candidate runs: mean between 0.44 and 0.59 depending on the model. None above 0.75.

Here's what I should have caught earlier: the semantic similarity scores were actually decent. 0.55 to 0.70 on semantic. The models were producing semantically similar outputs to ground truth. The cosine similarity in embedding space was picking it up.

But 0.55–0.70 on semantic, multiplied by 25% weight, is 0.14–0.18 of your final score. Then structural at 0.29, multiplied by 25%, is 0.07. Those two dominant dimensions combined were contributing maybe 0.21–0.25 to the final score.

The right signal was there. It was just buried under the wrong metric at the wrong weight.


The fix

Two things, both applied in Sprint 18.

First: per-task weight overrides.

For analyze, the weights now look like this:

```python
TASK_WEIGHT_OVERRIDES = {
    "analyze": {
        "structural_accuracy": 0.10,   # demoted — difflib is wrong for prose
        "semantic_similarity": 0.40,   # promoted — this is what actually matters
        "factual_drift":       0.25,
        "task_completion":     0.18,
        "tool_correctness":    0.04,
        "latency_score":       0.03,
    },
}
```

Structural from 25% to 10%. Semantic from 25% to 40%.
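Applying the override is just a dictionary lookup with a fallback — a minimal sketch (the override values are from the post; `DEFAULT_WEIGHTS`, the function names, and the scoring helper are my assumptions, not the framework's actual code):

```python
DEFAULT_WEIGHTS = {
    "structural_accuracy": 0.25, "semantic_similarity": 0.25,
    "factual_drift": 0.20, "task_completion": 0.15,
    "tool_correctness": 0.10, "latency_score": 0.05,
}

TASK_WEIGHT_OVERRIDES = {
    "analyze": {
        "structural_accuracy": 0.10, "semantic_similarity": 0.40,
        "factual_drift": 0.25, "task_completion": 0.18,
        "tool_correctness": 0.04, "latency_score": 0.03,
    },
}

def weights_for(task_type: str) -> dict[str, float]:
    # Any task without an override keeps the global profile.
    return TASK_WEIGHT_OVERRIDES.get(task_type, DEFAULT_WEIGHTS)

def weighted_score(task_type: str, dim_scores: dict[str, float]) -> float:
    w = weights_for(task_type)
    return sum(w[d] * dim_scores.get(d, 0.0) for d in w)
```

The fallback is the important design choice: format, classify, summarize, and extract keep the profile that was already working for them, and only analyze gets the new balance.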

Second: a structured output format via system_prompt on all 10 analyze inputs.

Without structure, Sonnet might write a paragraph, and a candidate might write a numbered list. Both correct, completely different character sequences. I added a system prompt requiring both GT and candidates to follow the same template:

```
You are an expert AI evaluation analyst. Always respond using this exact format:
Finding: [key finding in one sentence]
Recommendation: [clear actionable recommendation]
Confidence: [high/medium/low]
Reasoning: [1-2 sentences of supporting logic]
```

When both outputs share the same four section headers, difflib finally sees something structurally common. The character overlap goes from near-zero to something workable. But more importantly, the semantic similarity goes up too — constrained structure forces both outputs toward more comparable vocabulary.
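A quick illustration of the effect — the strings below are made-up outputs, not real model responses:

```python
import difflib

def ratio(a: str, b: str) -> float:
    return difflib.SequenceMatcher(None, a, b).ratio()

# Free-form: same conclusion, completely different shapes.
free_gt = ("qwen2.5 is the stronger classifier -- llama3.1 lags by 28 points "
           "and shouldn't be the primary candidate")
free_cand = ("The data shows qwen2.5 significantly outperforms llama3.1 on "
             "classification, so it should be prioritised for that task type")

# Templated: identical section headers framing similar content.
tmpl_gt = ("Finding: qwen2.5 outperforms llama3.1 on classification.\n"
           "Recommendation: prioritise qwen2.5 for classification tasks.\n"
           "Confidence: high\n"
           "Reasoning: llama3.1 trails by 28 points on this task type.")
tmpl_cand = ("Finding: qwen2.5 is clearly stronger than llama3.1 on classification.\n"
             "Recommendation: make qwen2.5 the primary classification model.\n"
             "Confidence: high\n"
             "Reasoning: the 28-point gap makes llama3.1 a weak candidate.")

assert ratio(free_gt, free_cand) < ratio(tmpl_gt, tmpl_cand)
```

The shared headers and constrained phrasing raise the in-order character overlap, which is all SequenceMatcher measures.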


What changed after

10 manual evals, post-fix.

Mean score: 0.70. Pre-fix range was 0.44–0.59. 7 of the 10 runs came in above the 0.75 pass floor. Before the fix, it was 0 of 112.

The semantic similarity on those runs: 0.83–0.92. That's what the models were actually achieving the whole time. I was just weighting a character-comparison metric at 25% of the score on outputs that were never going to match character-for-character.

One number worth sitting with: 0 out of 112. For nine sprints. That's not one bad eval run — that's a systematic misconfiguration that made an entire task type unpassable from day one, and I didn't notice because the rest of the system looked healthy.


The actual mistake

I assumed one weight profile would work across all task types.

That's the mistake. Not the choice of difflib. Not the specific weights. The assumption that the same balance of metrics — structural accuracy, semantic similarity, factual drift — would mean the same thing on a JSON classification task as it does on a prose analysis task.

It doesn't. On classification, structural accuracy is important — the model either outputs the right label or it doesn't. On open-ended analysis, structural accuracy using character-level edit distance is noise. It tells you whether two responses happen to share vocabulary, not whether they're both correct.

The eval system I built is inherently task-specific in what it should measure. I built it as if it were task-agnostic.

The fix isn't complicated. Per-task weight overrides are maybe twenty lines of code. But I had to first accept that the signal I was ignoring — 112 runs, zero passes, consecutive_failures equals total run count — wasn't telling me that local models can't do analysis.

It was telling me I was measuring the wrong thing.


Historical pre-Sprint-18 analyze scores are not retroactively corrected — they represent the old eval, clearly labelled as such. New runs from Sprint 18 onward use the corrected weight profile.