The model I designed as my floor outperformed nearly every candidate
When I designed the evaluation framework for my hybrid control plane, I needed a control floor. In clinical trials, the placebo arm isn't there because anyone thinks the sugar pill will work — it's there to give you a baseline. Something definitively at the bottom, so you know what "not working" looks like and can measure everything else above it.
I needed the same thing. A model so clearly outclassed by the others that its scores would anchor the floor of the scoreboard. A reference point for "this is what failure looks like."
I picked granite4:tiny-h.
IBM's smallest Granite 4.0 variant. 2GB on disk. A model that — in my head — was the obvious choice for the basement. The candidates I was actually excited about were phi4:latest at 9.1GB and qwen2.5:latest at 4.7GB. Those were the ones with the headroom, the parameter count, the benchmarks I'd read about. Granite4 was the placeholder. The thing I'd point at and say: "see, the others are better."
What the accumulator told me instead
After 38 scored runs, granite4/format had a mean of 0.932.
llama3.1 — the first model I tested, the one I had the most data on after 12 runs — was sitting at 0.835. ministral-3 was at 0.820. My supposed "floor" had just outperformed both of them on format tasks, by roughly 10 and 11 points respectively.
qwen2.5 was still ahead at 0.948 (n=13). But granite4 at 0.932 wasn't trailing it by much. Not "floor" distance. Competitive distance.
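The accumulator behind those means can be pictured as a running (model, task) → score tally. A minimal sketch — the class and method names are assumptions, not the real implementation; all it promises is per-task means and run counts:

```python
from collections import defaultdict


class ScoreAccumulator:
    """Running mean of evaluation scores keyed by (model, task).

    Hypothetical sketch -- the actual accumulator's structure
    is not shown in this post.
    """

    def __init__(self):
        self._sum = defaultdict(float)  # (model, task) -> total score
        self._n = defaultdict(int)      # (model, task) -> run count

    def record(self, model: str, task: str, score: float) -> None:
        key = (model, task)
        self._sum[key] += score
        self._n[key] += 1

    def mean(self, model: str, task: str) -> float:
        key = (model, task)
        return self._sum[key] / self._n[key] if self._n[key] else 0.0

    def runs(self, model: str, task: str) -> int:
        return self._n[(model, task)]
```

The point of keeping the run count next to the mean is that an 0.948 at n=13 and an 0.932 at n=38 are not the same kind of evidence.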
I looked at classify. Same story. After 34 runs: granite4/classify at 0.920. llama3.1/classify was at 0.706. ministral-3/classify was worse than that.
The model I'd designated as the control floor was, statistically, one of the strongest performers in the study.
Why I got it wrong
My mental model was: bigger model → better performance. At 2GB, granite4:tiny-h was supposed to be hamstrung by its size. phi4 at 9.1GB had more than four times the parameter budget. qwen2.5 at 4.7GB had more than twice. This felt like a law of nature.
It isn't.
The clue was in what the format and classify tasks actually require. Both tasks are essentially: take an input, produce a compact, structured output that matches a specific template. My ground truth model — Claude Sonnet — is extremely good at this. Compact. Precise. Deterministic in its formatting.
IBM designed Granite 4.0 for enterprise structured output. The "tiny-h" variant in particular: efficient, instruction-following, low verbosity. When granite4 receives a format task, it produces something that looks a lot like what Sonnet produces — because both models are tuned to produce clean, concise, structured responses.
50% of my evaluation score comes from resembling Sonnet's output (structural accuracy + semantic similarity). A model that naturally produces compact, structured text will score well on those dimensions regardless of its parameter count. A model that tends toward verbose, conversational responses — even a large one — will pay a penalty every time.
That's why ministral-3:8b — a solid general-purpose model — underperforms on classify. Sonnet answers "crypto". ministral answers "Based on the context provided, I would classify this as related to cryptocurrency..." The structural accuracy dimension sees those as very different outputs: equivalent in meaning, but not in score.
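You can see the size of that penalty with a toy version of the metric. Plain string similarity here is a stand-in for the real structural-accuracy dimension, whose definition isn't shown in this post — but any resemblance-based metric behaves the same way on this pair:

```python
import difflib


def structural_accuracy(candidate: str, reference: str) -> float:
    """Toy structural-accuracy score: normalized string similarity
    against the ground-truth (Sonnet) output. A stand-in for the
    real dimension, not the actual metric."""
    return difflib.SequenceMatcher(
        None, candidate.strip().lower(), reference.strip().lower()
    ).ratio()


reference = "crypto"
compact = "crypto"
verbose = ("Based on the context provided, I would classify this as "
           "related to cryptocurrency...")

print(structural_accuracy(compact, reference))  # 1.0 -- exact match
print(structural_accuracy(verbose, reference))  # well under the compact score
```

The verbose answer contains the right token, but the padding around it drags the similarity score down on every single run.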
But it's not a miracle
I want to be careful here. Granite4 didn't beat everything.
On extract tasks: 0.738. On summarize: 0.729. On RAG synthesis: 0.394.
That 0.394 on RAG is bad. Not "needs more data" bad. Just bad. The model that performs at 0.932 on format fails at retrieval-augmented generation. The same model. Completely different task type, completely different result.
This is task-specific excellence, not general competence. IBM built granite4:tiny-h to do certain things extremely well. Those things happen to include exactly what I tested it on when I assumed it would fail.
The control floor concept worked — just not how I planned. Instead of anchoring the bottom of the scoreboard, granite4 highlighted that my assumptions about model quality were the thing being tested, not just the models.
What I've changed
In Sprint 9, I promoted granite4:tiny-h to local_candidate tier. It's now the primary model on format tasks: granite4 (0.932) > llama3.1 (0.835) > ministral-3 (0.820). I moved phi4:latest to control floor duty — not because phi4 is bad, but because it actually is sitting at the bottom of the scoreboard on the tasks it's been evaluated on so far.
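As data, that tier assignment might look something like the sketch below — the key names and the fallback helper are my assumptions; only the orderings come from the scores:

```python
# Hypothetical representation of the Sprint 9 tier assignment.
TIERS = {
    # Primary-first fallback chain for format tasks, ordered by mean score:
    # granite4 (0.932) > llama3.1 (0.835) > ministral-3 (0.820).
    "format_chain": ["granite4:tiny-h", "llama3.1", "ministral-3"],
    # phi4 moved to control-floor duty: it anchors the bottom of the
    # scoreboard on the tasks evaluated so far.
    "control_floor": "phi4:latest",
}


def next_model(chain: list[str], failed: str) -> str:
    """Fall back to the next model in the chain after a failure."""
    return chain[chain.index(failed) + 1]
```

The chain is just the scoreboard read top to bottom — which is the whole point of having scores instead of a size heuristic.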
I also updated my model selection heuristic. Or rather: I removed the one I had. I don't have a size heuristic anymore. I have scores.
The evaluation system didn't just tell me which models performed well. It told me I was ranking models before running a single evaluation. I'd decided the outcome before the experiment. The control floor failing to be the floor was the experiment doing what it was supposed to do.
The number that keeps me honest
0.394.
That's granite4 on RAG synthesis. The same model, the same 2GB binary, scores 0.932 on one task and 0.394 on another.
If I'd only run format evaluations, I'd have promoted granite4 everywhere. If I'd only run RAG evaluations, I'd have demoted it on day one. The right answer is: you run both, you look at the task, and you route accordingly.
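That routing rule reduces to a score-driven lookup. The scores below are the means reported above; the floor threshold and the remote_fallback name are illustrative, not part of the actual control plane:

```python
# Mean scores per task type, from the evaluation runs in this post
# (only the models with reported numbers are included).
TASK_SCORES = {
    "format":   {"granite4:tiny-h": 0.932, "qwen2.5": 0.948,
                 "llama3.1": 0.835, "ministral-3": 0.820},
    "classify": {"granite4:tiny-h": 0.920, "llama3.1": 0.706},
    "rag":      {"granite4:tiny-h": 0.394},
}


def route(task: str, floor: float = 0.5) -> str:
    """Pick the best-scoring local model for a task; escalate if
    even the best one sits below the quality floor."""
    scores = TASK_SCORES[task]
    best = max(scores, key=scores.get)
    if scores[best] < floor:
        return "remote_fallback"  # hypothetical: escalate off the local tier
    return best
```

The same 2GB binary is the right answer for classify and the wrong answer for RAG — the router only knows that because both task types were measured.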
That's what the control plane is for.