# The detective and the surgeon: what 395 experiments taught us about routing AI agents

Nissan Dookeran · 6 min read · Local AI · Multi-Agent · AI Engineering · LLM

We build software with AI agents. One of those agents — Kit — writes code. And every time Kit picks up a task, it needs to pick a local model to work with. For a while, that choice was easy. qwen2.5-coder led the coding leaderboards. So we defaulted to it everywhere. Boilerplate? qwen2.5-coder. Bug fixes? qwen2.5-coder. Schema generation? qwen2.5-coder. We never tested whether that was actually right. So we built Ralph Lab — a scoring harness that runs local models against specific coding task types and scores the output. We ran 395 experiments across 8 models and 15+ task types. Here's what we found, and how it changed how we build.
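The harness itself is conceptually simple: run every model against every task several times, score each output, record everything. A minimal sketch of that loop, assuming tasks are dicts with a `"type"` key and that `runner` and `scorer` are supplied by the caller (all names here are illustrative, not Ralph Lab's actual code):

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One scored experiment: which model, which task type, what grade."""
    model: str
    task_type: str
    score: float  # 0-10, matching the score scale in the tables below


def run_experiments(models, tasks, runner, scorer, replicates=3):
    """Run every model against every task `replicates` times, scoring each output."""
    runs = []
    for model in models:
        for task in tasks:
            for _ in range(replicates):
                output = runner(model, task)   # invoke the local model
                score = scorer(task, output)   # grade the output 0-10
                runs.append(Run(model, task["type"], score))
    return runs
```

Replication matters: several of the findings below (zero variance, a hard score ceiling) only show up when the same task is run repeatedly.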


The obvious move: look at the leaderboard

The first instinct when choosing a model is to look at aggregate scores. We looked at them. Then we stopped trusting them. Here's what our own data showed:

| Model | Avg Score | Runs |
|---|---|---|
| qwen2.5-coder:latest | 7.48 | 98 |
| qwen3:8b | 6.70 | 94 |
| ministral-3:8b | 6.15 | 89 |
| deepseek-r1:8b | 5.34 | 31 |
| granite4:tiny-h | 5.04 | 37 |
| qwen3:1.7b | 4.68 | 44 |

qwen2.5-coder leads. That looks like a clean answer. Use it everywhere, done.

Except an average score tells you which model is best on average. It doesn't tell you which model is best for any specific task. Those are different questions. And when you're routing agent tasks, the per-task question is the only one that matters.
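The gap between the two questions is easy to show with a few lines of stdlib Python. The scores below are made up for illustration, but the shape matches what we saw: the model with the best average still loses a category.

```python
from collections import defaultdict
from statistics import mean

# Illustrative run records: (model, task_type, score). Not real data.
runs = [
    ("qwen2.5-coder", "bug_detection", 9.1), ("qwen2.5-coder", "bug_repair", 7.0),
    ("qwen3:8b", "bug_detection", 6.5),      ("qwen3:8b", "bug_repair", 9.0),
]

by_model = defaultdict(list)          # model -> all scores
by_task = defaultdict(dict)           # task -> model -> scores
for model, task, score in runs:
    by_model[model].append(score)
    by_task[task].setdefault(model, []).append(score)

# The leaderboard question: best average overall.
overall = max(by_model, key=lambda m: mean(by_model[m]))

# The routing question: best model per task type.
winners = {t: max(s, key=lambda m: mean(s[m])) for t, s in by_task.items()}

print(overall)                 # qwen2.5-coder
print(winners["bug_repair"])   # qwen3:8b
```

Same data, two different answers. A router needs the second one.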

---

## Bug detection and bug repair are not the same job

We had been routing all bug work to qwen2.5-coder. Seemed obvious. Then we tested it.
We split `bug_handling` into two separate task types: `bug_detection` (find the problem) and `bug_repair` (write the fix). The scores came back completely different.

| Task | Winner | Score |
|---|---|---|
| bug_detection | qwen2.5-coder | 9.05 avg |
| bug_repair | qwen3:8b | 9.00 avg, zero variance across 6 runs |

That's when we realised we'd been routing two different jobs to the same model.
qwen2.5-coder is strong at pattern recognition — scanning a pile of code and spotting what's wrong. qwen3:8b is strong at precision — writing a targeted fix without disturbing anything else. One model is the detective. The other is the surgeon. They're not interchangeable.
Before this split, every bug repair went to qwen2.5-coder. That was a real quality cost we hadn't noticed because we had no comparison.

---

## The smallest model won instruction-following

We expected the larger models to dominate constraint-following tasks. They didn't.
qwen3:1.7b — the smallest model in the pool — scored 6.08 average across 12 replicated runs on `instruction_following` tasks. It beat every other model in the set, including the 8b models.
What made qwen3:8b's result stranger wasn't that it lost. It was how it lost. It scored exactly 4.51 on all 12 runs. Not approximately. Exactly. Twelve identical scores.
That's not variance — that's a model hitting a hard ceiling it cannot get past. It's consistent, which means it's reliable, but it's reliably capped on this task type. And a 1.7b model clears it every time. Bigger isn't always better. For constraint-following, precision mattered more than raw capability.
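Distinguishing "reliably capped" from ordinary run-to-run noise is a one-liner once you have replicated scores. A sketch, with an illustrative threshold:

```python
from statistics import pstdev


def looks_capped(scores, eps=1e-9):
    """True when replicated scores are effectively identical:
    consistent, but pinned to one value rather than noisy.
    Requires at least 3 replicates to mean anything."""
    return len(scores) >= 3 and pstdev(scores) < eps


print(looks_capped([4.51] * 12))       # True  (the twelve identical runs)
print(looks_capped([6.1, 5.8, 6.3]))   # False (normal variance)
```

A capped model is still useful information for routing: you know exactly what you'll get, and exactly what you won't.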

---

## The probe we built badly

We designed a simplified version of the instruction-following task to test a specific hypothesis. We ran it. The results came back and didn't make sense — qwen3:8b, which was scoring consistently on the full task, completely collapsed on the simplified one.
Our first instinct was to ask what was wrong with the model.
It wasn't the model.
The probe was poorly constructed. It wasn't isolating the variable we thought it was isolating. The numbers looked like data but weren't — they were measuring our probe design, not model behaviour.
We threw those results out entirely.
Task design is a real skill in evaluation work. Not every experiment produces clean signal, and the discipline isn't just running experiments — it's knowing which results to keep.

---

## What we actually shipped

After 395 runs, this is the routing table Kit uses today:

| Task Type | Model | Reason |
|---|---|---|
| code_generation | qwen2.5-coder | Highest general code quality |
| bug_detection | qwen2.5-coder | 9.05 avg, strong pattern recognition |
| bug_repair | qwen3:8b | 9.00 avg, zero variance |
| instruction_following | qwen3:1.7b | 6.08 avg, best constraint precision |
| sql_complex | qwen3:8b | Strongest structured reasoning |
| schema_generation | qwen2.5-coder | granite4 scored 0.0 here — never use it |
| boilerplate_floor | granite4:tiny-h | 100% completion, fast, cheap |

The granite4 `schema_generation` result is worth naming explicitly. It scored 0.0. Not a low score. Zero. That's information the aggregate leaderboard doesn't give you.

The route that surprised us most wasn't the bug split. It was `instruction_following` going to qwen3:1.7b — a 1.7b model, winning a category outright, in a pool that includes 8b models. We checked it twice.
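In code, the shipped table is just a lookup with a fallback. The dict below mirrors the table above; the fallback choice and function names are illustrative (the real table lives in Kit's spec file), and routing untested task types to the best-average model is an assumption on our part:

```python
# Routing table from the experiments above. Untested task types fall
# back to the best-average model rather than failing.
ROUTES = {
    "code_generation": "qwen2.5-coder",
    "bug_detection": "qwen2.5-coder",
    "bug_repair": "qwen3:8b",
    "instruction_following": "qwen3:1.7b",
    "sql_complex": "qwen3:8b",
    "schema_generation": "qwen2.5-coder",
    "boilerplate_floor": "granite4:tiny-h",
}

DEFAULT_MODEL = "qwen2.5-coder"  # best average score in the pool


def pick_model(task_type: str) -> str:
    """Map a task type to a local model, falling back for unknown types."""
    return ROUTES.get(task_type, DEFAULT_MODEL)


print(pick_model("bug_repair"))   # qwen3:8b
print(pick_model("refactoring"))  # qwen2.5-coder (untested, falls back)
```

The fallback is the important design choice: an unlabeled or novel task type degrades to the old behaviour (best-average everywhere) instead of breaking the pipeline.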

---

## How the pipeline changed

Before Ralph Lab, the phase headers in `BUILD_PLAN.md` looked like this:

```text
## Phase 2: Implement auth middleware
```

Now they look like this:

```text
## Phase 2: Implement auth middleware
task_type: code_generation
```

Firefly, our build planner agent, labels every phase with a `task_type`. Kit reads that label and picks the right local model before writing a single line of code. The routing table lives in Kit's spec file. The experiments update the spec. The spec updates the pipeline.
It's one extra line in a phase header. The work that justified it was 395 runs.
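Reading those labels back out is a small parsing job. A minimal sketch, assuming the `BUILD_PLAN.md` layout shown above; the code is illustrative, not Kit's actual implementation:

```python
import re

# A phase header line, e.g. "## Phase 2: Implement auth middleware"
PHASE = re.compile(r"^## (Phase \d+: .+)$")
# Its task_type label, e.g. "task_type: code_generation"
LABEL = re.compile(r"^task_type:\s*(\S+)$")


def parse_phases(text):
    """Return (phase title, task_type or None) pairs from a build plan."""
    phases, current = [], None
    for line in text.splitlines():
        if m := PHASE.match(line):
            current = [m.group(1), None]
            phases.append(current)
        elif current and (m := LABEL.match(line)):
            current[1] = m.group(1)
    return [tuple(p) for p in phases]


plan = """## Phase 2: Implement auth middleware
task_type: code_generation
"""
print(parse_phases(plan))
# [('Phase 2: Implement auth middleware', 'code_generation')]
```

Returning `None` for an unlabeled phase matters: combined with a routing fallback, old plans that predate the labels keep working.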

---

## What we don't know yet

Wave 3 is queued. We haven't tested refactoring tasks properly — whether qwen3:8b's precision carries over to restructuring existing code rather than writing new code. We haven't run test generation. We haven't looked at how task type interacts with context length in multi-turn work.
The routing table we shipped is a snapshot. It's the best data we have right now, for the task types we've tested. It's not a finished document.
That's fine. The point of building the harness wasn't to produce a final answer. It was to build a feedback loop — where running more experiments makes the routing table better, and a better routing table makes Kit's output better. We still don't know which model wins at everything. But we know more than we did 395 runs ago, and we have the machinery to keep finding out.

---

*Ralph Lab is part of the reddi.tech agent stack. Wave 3 results will be published here.*