Model Size ≠ Model Fit: How Haiku Beat Mistral Large With a Real Deadline on the Line
We burned two timeouts—40 minutes, then 60 minutes—watching Mistral Large try to build a browser automation script for our hackathon demo. Zero deliverables. Then we handed the same task to Claude Haiku. Twenty-four minutes later, we had 8 MB of professional video clips, screenshots, and a working README.
Same task. Different model. Completely different result.
I've been building with LLMs long enough to know that "bigger = better" is one of the most dangerous assumptions in this space. But this week, with real money on the line and a submission deadline ticking, we got the most visceral proof I've ever seen.
The task
We needed a browser automation script for our SandSync hackathon submission. Specifically:
- Open our web app in a headless browser
- Record the screen across 3 demo scenarios
- Generate professional video clips + screenshots
- Save everything to disk, ready to upload
Deadline: fixed. Stakes: real. No room for elegant architecture.
We had two agents available:
- Firefly running Mistral Large — our "sophisticated reasoning" agent
- Kit running Claude Haiku — our pragmatic code agent
We tried Firefly first. Big mistake.
Two failures, then a win
Firefly / Mistral Large — attempt 1 (40-minute timeout)
Firefly started by importing dependencies. Hit a `ModuleNotFoundError` for `requests` partway through. Spent time troubleshooting the environment. Ran out of time. Zero output.
Firefly / Mistral Large — attempt 2 (60-minute timeout)
Dependencies supposedly fixed. Firefly started fresh. Sixty minutes later: nothing. Hard timeout. We don't know exactly what it was building—but based on what Mistral Large tends to produce, I have a theory.
Kit / Haiku — attempt 1 (24 minutes, success)
Switched models. Kit delivered:
✅ demo-scenario-1.mp4 (2.1 MB, 30s)
✅ demo-scenario-2.mp4 (2.4 MB, 30s)
✅ demo-scenario-3.mp4 (1.8 MB, 30s)
✅ screenshots/ (12 PNG files, ~600 KB total)
✅ README.md (documentation, usage, notes)
Total: 8.0 MB of production-ready collateral. Execution time: 5m38s. Total elapsed (including setup): 24 minutes.
Sophistication is a liability under a clock
Here's what I think Mistral Large was doing while Kit was shipping.
Mistral is a powerful model. That power manifests as thoroughness: it considers edge cases, builds robust error handling, designs for maintainability. When it approaches "build a browser automation script," it probably built something like this:
```python
# What Mistral Large probably built (inferred from silence)
class VideoCapturePipeline:
    def __init__(self, config_path: str):
        self.config = self._load_and_validate_config(config_path)
        self.encoder = VideoEncoder(
            codec=self.config.get('codec', 'h264'),
            quality=self.config.get('quality', 'high'),
            bitrate=self.config.get('bitrate', '2000k'),
        )
        self.browser = BrowserManager(
            headless=self.config.get('headless', True),
            viewport=self.config.get('viewport', {'width': 1280, 'height': 720}),
            retry_policy=RetryPolicy(max_attempts=3, backoff='exponential'),
        )

    def run_all_scenarios(self):
        results = []
        for scenario in self.config['scenarios']:
            result = self._run_with_error_handling(scenario)
            results.append(result)
        return self._generate_summary_report(results)

    # ... 400 more lines of abstraction layers
```
That's excellent engineering. That would score well in a code review. It's also completely useless when you have 40 minutes and zero time to debug config parsing.
What Haiku built was probably closer to this:
```python
#!/usr/bin/env python3
# What Haiku actually built (inferred from the output)
from playwright.sync_api import sync_playwright
import os
import subprocess
import time

SCENARIOS = [
    {"name": "demo-scenario-1", "url": "http://localhost:3000/demo/story"},
    {"name": "demo-scenario-2", "url": "http://localhost:3000/demo/voices"},
    {"name": "demo-scenario-3", "url": "http://localhost:3000/demo/export"},
]

os.makedirs("screenshots", exist_ok=True)

with sync_playwright() as p:
    browser = p.chromium.launch()
    for scenario in SCENARIOS:
        # Playwright's built-in recorder works headless; an ffmpeg screen
        # capture would never see a headless browser window at all.
        context = browser.new_context(
            viewport={"width": 1280, "height": 720},
            record_video_dir="videos/",
        )
        page = context.new_page()
        page.goto(scenario["url"])
        time.sleep(2)  # let it load
        # Screenshot
        page.screenshot(path=f"screenshots/{scenario['name']}.png", full_page=True)
        time.sleep(30)  # keep recording for the 30s demo
        context.close()  # flushes the video to disk
        # Playwright records .webm; convert to .mp4 with ffmpeg
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(page.video.path()), f"{scenario['name']}.mp4"],
            check=True,
        )
        print(f"✅ {scenario['name']} done")
    browser.close()

print("All done.")
```
Forty lines. No config parsing. No retry logic. No abstraction layers. Inline constants. time.sleep(2) where a sophisticated model would write a proper wait condition.
And it shipped.
The right lens: model-task fit
This isn't a story about Haiku being a better model than Mistral Large. Mistral Large would absolutely destroy Haiku at complex reasoning, architecture planning, or multi-step analysis. That's the wrong lens.
The right lens is model-task fit:
| Task Type | Recommended Model | Why | Avoid |
|---|---|---|---|
| Tight deadline, execute + ship | Haiku | Fast, pragmatic, minimal over-engineering | Mistral Large (overthinks) |
| Architecture / planning | Opus | Novel reasoning, deep synthesis | Haiku (scope too narrow) |
| Straight coding, KISS required | Haiku | Ships fast, fewer abstraction layers | Mistral (gets in its own way) |
| Complex problem-solving | Opus | Creative reasoning, edge cases | Haiku (will hit limits) |
| Fast iteration loops | Haiku | Sub-second feedback, low cost | Any large model |
| Production-grade design | Sonnet / Opus | Balance of quality and cost | Haiku on anything nuanced |
The insight: large models are optimised to be thorough. Small models are optimised to be fast. Under a deadline, thoroughness is the enemy.
Three lessons worth keeping
1. A 40-minute timeout on a large model is data, not bad luck.
If a sophisticated model is still running at 40 minutes on a "simple" automation task, it's not slow—it's over-engineering. The model is solving a harder problem than you asked it to solve. That's a routing failure, not a model failure.
2. Fresh model, fresh approach.
When Firefly failed twice, we didn't give Mistral Large a third try. We changed models. That wasn't giving up—it was recognising that the same model with the same training will make the same architectural choices. Kit wasn't "smarter" than Firefly; it just approached the problem differently because it's a different model.
3. Incremental beats monolithic, under any clock.
Haiku probably saved files after each scenario. Mistral was probably building the entire pipeline before running anything. If something fails mid-way through Haiku's approach, you have partial results. Mistral's approach is all-or-nothing—and "nothing" is what we got twice.
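The incremental pattern is easy to sketch. This is a hypothetical stand-in (the `run_scenario` body simulates a mid-run crash, which is not from the original), but it shows why partial results survive: each scenario's output hits disk before the next one starts.

```python
# Hypothetical sketch of the incremental pattern: persist each scenario's
# output as soon as it exists, so a mid-run failure leaves partial results.
import os

SCENARIOS = ["demo-scenario-1", "demo-scenario-2", "demo-scenario-3"]

def run_scenario(name: str) -> bytes:
    # Stand-in for the real capture step; imagine scenario 3 hanging the browser.
    if name == "demo-scenario-3":
        raise RuntimeError("browser hung")
    return f"video bytes for {name}".encode()

os.makedirs("out", exist_ok=True)
completed = []
for name in SCENARIOS:
    try:
        data = run_scenario(name)
    except Exception as e:
        print(f"❌ {name} failed: {e}")
        continue  # earlier files are already safe on disk
    with open(f"out/{name}.mp4", "wb") as f:
        f.write(data)  # written immediately, not held for a final report
    completed.append(name)

print(f"Shipped {len(completed)}/{len(SCENARIOS)} scenarios")
```

A monolithic pipeline inverts this: everything lives in memory until a final report step, so the crash above would have cost all three scenarios instead of one.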
The routing heuristic I'm using now
If you're running multi-agent infrastructure, or even just picking models for different tasks, here's the routing heuristic I'm now using:
- "Ship something that works" → small, fast model
- "Figure out the right approach" → large, thorough model
- "Write this cleanly, no deadline" → large or mid-size model
- "Deadline in 2 hours" → small model, always
Cost is a nice bonus: Haiku at $0.80/$4 per million tokens versus Mistral Large's pricing means we saved money and shipped on time. Model-task fit isn't just about quality—it's about economics.
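The economics are back-of-envelope arithmetic. The token counts below are hypothetical (we didn't publish usage numbers); only the $0.80/$4 per-million rates come from the post.

```python
# Back-of-envelope cost check. Token counts are assumed for illustration,
# not measured from the actual run; rates are Haiku's from the post.
HAIKU_IN, HAIKU_OUT = 0.80, 4.00          # $ per million tokens
in_tokens, out_tokens = 200_000, 50_000   # hypothetical usage for the 24-min task

cost = in_tokens / 1e6 * HAIKU_IN + out_tokens / 1e6 * HAIKU_OUT
print(f"${cost:.2f}")  # ~$0.36 for the whole job under these assumptions
```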
Three hackathons and dozens of agent tasks later, the pattern holds: models that overthink consistently lose to models built for speed when there's a clock on the wall. The calculus might flip in some domains — complex reasoning tasks where thoroughness is the actual requirement. But under a deadline, routing by constraint first and capability second is the right call.
Biggest lesson: a 40-minute timeout isn't a model failing. It's data. It's telling you the wrong model is in the seat.
Next: Why we run Kit on Haiku by default but upgrade to Sonnet for anything touching production architecture — and where we draw that line.