Model Size ≠ Model Fit: How Haiku Beat Mistral Large With a Real Deadline on the Line

Nissan Dookeran · 6 min read · ai-engineering · llm-selection · model-evaluation · hackathon

We burned two timeouts—40 minutes, then 60 minutes—watching Mistral Large try to build a browser automation script for our hackathon demo. Zero deliverables. Then we handed the same task to Claude Haiku. Twenty-four minutes later, we had 8 MB of professional video clips, screenshots, and a working README.

Same task. Different model. Completely different result.

I've been building with LLMs long enough to know that "bigger = better" is one of the most dangerous assumptions in this space. But this week, with real money on the line and a submission deadline ticking, we got the most visceral proof I've ever seen.

The task

We needed a browser automation script for our SandSync hackathon submission. Specifically:

  • Open our web app in a headless browser
  • Record the screen across 3 demo scenarios
  • Generate professional video clips + screenshots
  • Save everything to disk, ready to upload

Deadline: fixed. Stakes: real. No room for elegant architecture.

We had two agents available:

  • Firefly running Mistral Large — our "sophisticated reasoning" agent
  • Kit running Claude Haiku — our pragmatic code agent

We tried Firefly first. Big mistake.

Two failures, then a win

Firefly / Mistral Large — attempt 1 (40-minute timeout)

Firefly started by importing dependencies. Hit a ModuleNotFoundError: requests partway through. Spent time troubleshooting the environment. Ran out of time. Zero output.

Firefly / Mistral Large — attempt 2 (60-minute timeout)

Dependencies supposedly fixed. Firefly started fresh. Sixty minutes later: nothing. Hard timeout. We don't know exactly what it was building—but based on what Mistral Large tends to produce, I have a theory.

Kit / Haiku — attempt 1 (24 minutes, success)

Switched models. Kit delivered:

✅ demo-scenario-1.mp4  (2.1 MB, 30s)
✅ demo-scenario-2.mp4  (2.4 MB, 30s)
✅ demo-scenario-3.mp4  (1.8 MB, 30s)
✅ screenshots/          (12 PNG files, ~600 KB total)
✅ README.md             (documentation, usage, notes)

Total: 8.0 MB of production-ready collateral. Execution time: 5m38s. Total elapsed (including setup): 24 minutes.

Sophistication is a liability under a clock

Here's what I think Mistral Large was doing while Kit was shipping.

Mistral Large is a powerful model. That power manifests as thoroughness: it considers edge cases, builds robust error handling, designs for maintainability. Handed "build a browser automation script," it probably produced something like this:

# What Mistral Large probably built (inferred from silence)

class VideoCapturePipeline:
    def __init__(self, config_path: str):
        self.config = self._load_and_validate_config(config_path)
        self.encoder = VideoEncoder(
            codec=self.config.get('codec', 'h264'),
            quality=self.config.get('quality', 'high'),
            bitrate=self.config.get('bitrate', '2000k')
        )
        self.browser = BrowserManager(
            headless=self.config.get('headless', True),
            viewport=self.config.get('viewport', {'width': 1280, 'height': 720}),
            retry_policy=RetryPolicy(max_attempts=3, backoff='exponential')
        )

    def run_all_scenarios(self):
        results = []
        for scenario in self.config['scenarios']:
            result = self._run_with_error_handling(scenario)
            results.append(result)
        return self._generate_summary_report(results)

    # ... 400 more lines of abstraction layers

That's excellent engineering. That would score well in a code review. It's also completely useless when you have 40 minutes and zero time to debug config parsing.

What Haiku built was probably closer to this:

#!/usr/bin/env python3
# What Haiku actually built (inferred from the output)

from playwright.sync_api import sync_playwright
import subprocess, os, time

SCENARIOS = [
    {"name": "demo-scenario-1", "url": "http://localhost:3000/demo/story"},
    {"name": "demo-scenario-2", "url": "http://localhost:3000/demo/voices"},
    {"name": "demo-scenario-3", "url": "http://localhost:3000/demo/export"},
]

os.makedirs("screenshots", exist_ok=True)

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)  # ffmpeg records the screen, so the window must be visible
    page = browser.new_page(viewport={"width": 1280, "height": 720})

    for scenario in SCENARIOS:
        page.goto(scenario["url"])
        time.sleep(2)  # let it load

        # Screenshot
        page.screenshot(path=f"screenshots/{scenario['name']}.png", full_page=True)

        # Video via ffmpeg screen capture (avfoundation is macOS-only;
        # list device names with: ffmpeg -f avfoundation -list_devices true -i "")
        subprocess.run([
            "ffmpeg", "-y", "-f", "avfoundation", "-i", "Capture screen 0",
            "-t", "30", "-vcodec", "libx264", f"{scenario['name']}.mp4"
        ], check=True)

        print(f"✅ {scenario['name']} done")

    browser.close()

print("All done.")

Forty lines. No config parsing. No retry logic. No abstraction layers. Inline constants. time.sleep(2) where a sophisticated model would write a proper wait condition.
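For contrast, the "proper wait condition" doesn't have to be heavy either. A minimal polling helper — a hypothetical, stdlib-only sketch of the idea (Playwright's own auto-waiting does this for you) — looks like:

```python
import time

def wait_for(predicate, timeout=10.0, interval=0.1):
    """Poll predicate() until it returns truthy or timeout expires.

    Unlike a blind time.sleep(2), this returns as soon as the page is
    ready, and reports failure instead of silently racing a slow load.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False
```

Under a deadline, though, a time.sleep(2) that works beats a wait condition you spent ten minutes polishing.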

And it shipped.

The right lens: model-task fit

This isn't a story about Haiku being a better model than Mistral Large. Mistral Large would absolutely destroy Haiku at complex reasoning, architecture planning, or multi-step analysis. That's the wrong lens.

The right lens is model-task fit:

| Task Type | Recommended Model | Why | Avoid |
| --- | --- | --- | --- |
| Tight deadline, execute + ship | Haiku | Fast, pragmatic, minimal over-engineering | Mistral Large (overthinks) |
| Architecture / planning | Opus | Novel reasoning, deep synthesis | Haiku (scope too narrow) |
| Straight coding, KISS required | Haiku | Ships fast, fewer abstraction layers | Mistral (gets in its own way) |
| Complex problem-solving | Opus | Creative reasoning, edge cases | Haiku (will hit limits) |
| Fast iteration loops | Haiku | Sub-second feedback, low cost | Any large model |
| Production-grade design | Sonnet / Opus | Balance of quality and cost | Haiku on anything nuanced |

The insight: large models are optimised to be thorough. Small models are optimised to be fast. Under a deadline, thoroughness is the enemy.

Three lessons worth keeping

1. A 40-minute timeout on a large model is data, not bad luck.

If a sophisticated model is still running at 40 minutes on a "simple" automation task, it's not slow—it's over-engineering. The model is solving a harder problem than you asked it to solve. That's a routing failure, not a model failure.
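One way to act on that data is to treat a timeout as a routing signal rather than a retry trigger. A sketch of the pattern, with hypothetical agent callables standing in for Firefly and Kit (the real budget would be 40 minutes, not the toy value here):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as AgentTimeout

def route_on_timeout(task, agents, timeout_s):
    """Try each (name, agent) in order; a timeout moves on to the
    next agent instead of giving the same model a longer clock."""
    pool = ThreadPoolExecutor(max_workers=len(agents))
    try:
        for name, agent in agents:
            try:
                return name, pool.submit(agent, task).result(timeout=timeout_s)
            except AgentTimeout:
                continue  # the timeout is the data: re-route, don't retry
        raise RuntimeError(f"every agent timed out on: {task}")
    finally:
        pool.shutdown(wait=False, cancel_futures=True)
```

Our second attempt gave the same model a longer clock; this version would have handed the task to a different model instead.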

2. Fresh model, fresh approach.

When Firefly failed twice, we didn't give Mistral Large a third try. We changed models. That wasn't giving up—it was recognising that the same model with the same training will make the same architectural choices. Kit wasn't "smarter" than Firefly; it just approached the problem differently because it's a different model.

3. Incremental beats monolithic, under any clock.

Haiku probably saved files after each scenario. Mistral was probably building the entire pipeline before running anything. If something fails mid-way through Haiku's approach, you have partial results. Mistral's approach is all-or-nothing—and "nothing" is what we got twice.

The routing heuristic I'm using now

If you're running multi-agent infrastructure, or even just picking models for different tasks, here's the routing heuristic I'm now using:

  • "Ship something that works" → small, fast model
  • "Figure out the right approach" → large, thorough model
  • "Write this cleanly, no deadline" → large or mid-size model
  • "Deadline in 2 hours" → small model, always
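Encoded as code — the goal labels, threshold, and tier names here are my own shorthand, not anything a provider publishes:

```python
def route(goal, deadline_hours=None):
    """Constraint first, capability second: the deadline check
    runs before any judgment about task type."""
    if deadline_hours is not None and deadline_hours <= 2:
        return "small"        # deadline in 2 hours -> small model, always
    if goal == "ship":
        return "small"        # ship something that works
    if goal == "figure-out-approach":
        return "large"        # thoroughness is the actual requirement
    if goal == "clean-code-no-deadline":
        return "large-or-mid"
    raise ValueError(f"unknown goal: {goal}")
```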

Cost is a nice bonus: at $0.80 per million input tokens and $4 per million output tokens, Haiku comes in well below Mistral Large's pricing, so we saved money and shipped on time. Model-task fit isn't just about quality—it's about economics.
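The arithmetic behind that claim, using the per-million-token rates quoted above (the token counts are made up, but hackathon-sized):

```python
HAIKU_RATES = {"input": 0.80, "output": 4.00}  # $ per million tokens

def cost_usd(tokens_in, tokens_out, rates):
    """Dollar cost of one run at the given per-million-token rates."""
    return (tokens_in * rates["input"] + tokens_out * rates["output"]) / 1_000_000

# A hypothetical 24-minute run: 50k prompt tokens, 20k generated tokens,
# lands around twelve cents.
run_cost = cost_usd(50_000, 20_000, HAIKU_RATES)
```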

Three hackathons and dozens of agent tasks later, the pattern holds: models that overthink consistently lose to models built for speed when there's a clock on the wall. The calculus might flip in some domains — complex reasoning tasks where thoroughness is the actual requirement. But under a deadline, routing by constraint first and capability second is the right call.

Biggest lesson: a 40-minute timeout isn't a model failing. It's data. It's telling you the wrong model is in the seat.

Next: Why we run Kit on Haiku by default but upgrade to Sonnet for anything touching production architecture — and where we draw that line.