After PMF, Your Agent Stack Needs a Router

The early version of this argument sounded too clean:

Startups begin with Codex and Claude because those systems have the best managed loops, desktop integrations, subagents, cloud task environments, and day-to-day product polish. Then, once the work gets repetitive, open/local stacks such as OpenCode become the next step because you can assign cheaper models to specific actions and stop burning frontier tokens on everything.

That felt right. It also sounded like the kind of tidy AI take I would usually distrust.

So I checked it against what we can actually prove from OpenClaw.

The stronger version is narrower:

Frontier agents are the right default when the startup is still discovering what the work even is. After product-market fit, the repeated parts of the work become stable enough to classify. At that point, the infrastructure advantage shifts from "use the strongest model everywhere" to "route by task type, risk, context size, and failure cost."

Not every task needs a frontier model. Some need a small model. Some need a deterministic script. Some need a cache lookup. Some should still go straight to Codex, Claude, or a human reviewer because the blast radius is too high.

That is the real post-PMF agent stack: not a cheaper model, but a router.

Why frontier-first makes sense early

When you are pre-PMF, token spend is not the main bottleneck. Confusion is.

You do not yet know which workflows matter. You do not know which tasks will repeat. You do not know where the quality bar actually sits. You are still discovering the product, the customer, the operational loop, and the failure modes.

That is why Codex and Claude have such a strong early advantage. They are not just models. They are managed work environments.

Codex can run local CLI workflows, cloud tasks, and GitHub code review flows. Claude Code has subagents, hooks, SDK subagents, and desktop-to-code workflows. These systems collapse a lot of the "how do I even run this work?" overhead.

For an early startup, that is worth paying for.

In OpenClaw, the expensive weeks were not caused by one silly prompt. They came from exactly the kind of work you would expect in exploration mode: hackathon pipelines, media generation, long-form writing, code review, orchestration, context consolidation, and recovery from failures.

One historical token-burn analysis recorded about 1.62M tokens in a single week and a quota limit hit before reset. The major contributors were not mysterious: a hackathon pipeline, course/content generation, agent orchestration, and memory/compaction work.

That is what early-stage agent work looks like. It is messy. It is broad. It is not yet shaped.

Trying to over-optimize that phase too early can be its own waste. You save tokens, but lose learning speed.

What changes after PMF

After PMF, the work changes shape.

You still need high-quality reasoning, but not for everything. The repeated tasks start to look familiar:

classify this request
extract fields from this source
convert this markdown into a target format
triage this test failure
summarize this run
check whether this artifact has evidence
draft a social derivative from an approved blog
review whether a task can safely publish or spend money

These tasks are not equally hard. They are not equally risky. They should not all hit the same model.

The model router becomes the economic layer.

OpenCode's current docs are interesting in this context because its configuration supports model selection at the system level and the agent level, including a small_model concept and local model support. That does not mean "OpenCode replaces Codex" or "local models beat frontier systems." It means the product surface matches a scaling pattern that appears once work has been sorted into task types.

The question becomes:

What task is this?

What permissions does it need?

What is the cost of being wrong?

How much context does it need?

Can this be handled by retrieval, a validator, or a deterministic workflow instead of a model?

Only after those questions do you choose a model.

The local model lesson: uneven is useful

OpenClaw already has evidence for this.

In the Hybrid Control Plane work, the cost-efficiency dataset covers tens of thousands of local-model-style eval runs across seven task types. The result was not "local models are good." That would be too blunt.

The result was: local models are uneven in useful ways.

Qwen 2.5 was strong on classification, with a 0.984 mean score in the evaluation notes. Phi-4 hit 1.000 on classification in the available runs. Granite4 tiny-h, despite being small, was surprisingly strong on format transformation, with a 0.931 mean score and good structured-output adherence.

DeepSeek-R1 8B was more interesting. Its extraction score looked bad until we stripped <think> blocks before evaluation. After that preprocessing fix, extraction improved from 0.41 to 0.906.
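The preprocessing fix can be sketched in a few lines. This is a minimal illustration, not the OpenClaw evaluation code: the function name strip_think_blocks is hypothetical, and the handling of an unclosed <think> tag (drop everything after it) is an assumption about what a scorer would want.

```python
import re

# Matches a complete <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)


def strip_think_blocks(raw: str) -> str:
    """Remove reasoning wrappers so only the final answer is scored."""
    without_blocks = THINK_BLOCK.sub("", raw)
    # Assumption: an unclosed <think> (for example, truncated output)
    # means nothing after it is a usable answer, so drop the tail.
    unclosed = without_blocks.lower().find("<think>")
    if unclosed != -1:
        without_blocks = without_blocks[:unclosed]
    return without_blocks.strip()
```

Running the scorer on strip_think_blocks(output) instead of the raw output keeps the extraction check from failing on text the model never meant as its answer.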

GPT-OSS 20B went the other direction. It was effectively demoted because latency averaged 39 seconds and extraction quality was poor in that run set.

The point is not that one model is best.

The point is that task type matters more than brand, size, or vibes.

A small model that is excellent at format work is more valuable than a larger model that is mediocre everywhere. A reasoning model that wraps its output in <think> blocks may be good underneath but needs preprocessing before it can be scored or used. A model whose name suggests it should be strong may fail a specific task badly enough that you do not want it in the route.

This is why post-PMF agent infrastructure needs evaluation, not taste.
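One way to turn evaluation into a routing input is a per-task score table with a minimum bar. The sketch below uses only the mean scores quoted above. The model labels, the 0.90 threshold, and the pick_model helper are hypothetical, and the scores come from different run sets, so treat the comparison as illustrative rather than as a ranking.

```python
from typing import Dict, Optional, Tuple

# Mean scores quoted in this section, keyed by (task_type, model label).
SCORES: Dict[Tuple[str, str], float] = {
    ("classification", "qwen2.5"): 0.984,
    ("classification", "phi-4"): 1.000,
    ("format_transformation", "granite4-tiny-h"): 0.931,
    ("extraction", "deepseek-r1-8b"): 0.906,  # after stripping <think> blocks
}

MIN_SCORE = 0.90  # hypothetical quality bar; tune per task family


def pick_model(task_type: str,
               scores: Dict[Tuple[str, str], float] = SCORES,
               min_score: float = MIN_SCORE) -> Optional[str]:
    """Return the best-scoring evaluated model for a task, or None to escalate."""
    candidates = [
        (score, model)
        for (task, model), score in scores.items()
        if task == task_type and score >= min_score
    ]
    if not candidates:
        return None  # no evaluated model clears the bar: escalate
    return max(candidates)[1]
```

A task type with no evaluated model returns None, which keeps unevaluated work off the cheap path by default.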

The cheapest model call is the one you do not make

The routing conversation often turns into a model-price conversation.

That is only half right.

The mature move is not always "send it to a cheaper model." Sometimes the mature move is "do not send it to a model at all."

OpenClaw's trajectory evaluation summary from 2026-05-25 covered 200 trajectory events. It recorded a known success rate of 0.9583 and 133 model turns avoided.

Those avoided turns matter. A validator does not need to be creative. A script that checks whether a manifest hash matches the local file does not need a frontier model. A workflow that already knows the state transition does not need to ask a chatbot what to do next.
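The manifest hash check is a good example of work that needs no model. A minimal sketch, assuming the manifest stores a SHA-256 hex digest (the article does not say which hash is used) and using hypothetical function names:

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts do not load into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_hash_matches(path: Path, expected_sha256: str) -> bool:
    """Deterministic validator: True only if the file exists and matches."""
    if not path.is_file():
        return False
    return sha256_of_file(path) == expected_sha256.lower()
```

The check is fast, repeatable, and either passes or fails, which is exactly the property a model call cannot offer here.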

This is where agent startups can get margin back.

Not by pretending all reasoning is cheap.

By turning repeated work into:

workflow state
schemas
fixtures
validators
cached summaries
retrieval-backed context
artifact references
approval gates

The frontier model remains in the system. It just stops being the only system.

Context is also cost

The other hidden cost is context.

As agent systems mature, they accumulate transcripts, artifacts, screenshots, videos, test outputs, review notes, research memos, and operational logs. If the only way an agent remembers work is by stuffing more of that history into the next prompt, the cost curve gets ugly.

OpenClaw's current QMD memory telemetry shows why retrieval matters. On 2026-06-20, the QMD backend had 15,753 indexed files/chunks across 14 collections, with a 0.95 BM25 hit rate in the probe set and zero QMD gateway errors.

That does not prove retrieval is solved. It does prove the shape of the answer.

The agent should not carry every artifact in context. It should retrieve the relevant state, reference larger artifacts by digest, and make its evidence trail inspectable.

That is why we added a local artifact/workspace manifest spike. The manifest binds evidence to content hashes, source boundaries, redaction state, publication state, approval state, and workspace references.

This sounds boring until you need to know why an agent made a decision three weeks later.

Then it is the difference between "trust me" and "here is the exact file, hash, source boundary, and validation state."
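A manifest entry of this kind could look roughly like the record below. The field names, the ApprovalState values, and the evidence_line helper are assumptions made for illustration. The actual OpenClaw manifest schema is not shown in this article.

```python
from dataclasses import dataclass
from enum import Enum


class ApprovalState(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass(frozen=True)
class ManifestEntry:
    """One piece of evidence, referenced by digest instead of carried in context."""
    workspace_ref: str      # where the artifact lives in the workspace
    content_sha256: str     # content hash the evidence is bound to
    source_boundary: str    # for example "internal" or "external"
    redacted: bool
    published: bool
    approval: ApprovalState


def evidence_line(entry: ManifestEntry) -> str:
    """Render the audit answer: exact file, hash, boundary, and state."""
    return (
        f"{entry.workspace_ref} sha256={entry.content_sha256} "
        f"source={entry.source_boundary} redacted={entry.redacted} "
        f"published={entry.published} approval={entry.approval.value}"
    )
```

The agent passes the short reference around and only loads the artifact itself when a task actually needs its contents.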

The router cannot only optimize price

A bad router is dangerous.

If the router only asks "what is the cheapest model that can answer this?", it will eventually send the wrong task to the wrong place.

Some work has low cognitive difficulty and high operational risk.

Publishing a post is not intellectually hard. Sending money is not intellectually hard. Mutating credentials is not intellectually hard. But those actions have blast radius.

So the router needs more than a model leaderboard. It needs command contracts and permission boundaries.

In OpenClaw, command-contract work already separates task intent families and keeps runtime registration report-only until reviewed. The artifact manifest work tracks source boundary, redaction state, publication state, and approval state.

That is the right direction.

The real routing fields are not just model and price. They are:

task family
required permissions
context size
source trust
external side effects
acceptable failure mode
evidence requirement
escalation rule

Cheap models are great for narrow, low-risk, evaluable work. They are not a license to automate judgment away.
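The routing fields listed above map naturally onto a typed request record. This is a sketch under stated assumptions: RouteRequest, LOW_RISK_PERMISSIONS, the 8000-token ceiling, and cheap_route_allowed are all hypothetical names and values, not OpenClaw's command-contract format.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Hypothetical: permissions a cheap route may hold without review.
LOW_RISK_PERMISSIONS: FrozenSet[str] = frozenset({"read_workspace"})


@dataclass(frozen=True)
class RouteRequest:
    task_family: str
    required_permissions: FrozenSet[str]
    context_tokens: int
    source_trusted: bool
    external_side_effects: bool
    acceptable_failure: str   # for example "retry", "escalate", "block"
    evidence_required: bool
    escalation_rule: str


def cheap_route_allowed(req: RouteRequest, max_context_tokens: int = 8000) -> bool:
    """Risk gate that runs before any price comparison."""
    return (
        not req.external_side_effects
        and req.source_trusted
        and req.required_permissions <= LOW_RISK_PERMISSIONS
        and req.context_tokens <= max_context_tokens
    )
```

The gate never looks at model price. Cost only becomes a question for requests that have already passed the risk checks.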

A practical post-PMF routing pattern

Here is the version I would actually use.

Classification goes to a proven cheap/local model first. If confidence is low, labels disagree with history, or the class enables an external action, escalate.

Format transformation goes to a small model plus schema validation. If the schema fails or the transformation is lossy, escalate.

Extraction goes to a task-proven model where the source is low risk. If the downstream action is legal, financial, security-sensitive, or public, escalate.

Summarization can be cheap for short internal logs. Multi-source synthesis with conflicting evidence should use a stronger model.

RAG synthesis can start cheap when retrieval confidence is high. If sources conflict or the answer becomes publishable, escalate.

Code transforms can use local/cheap models for mechanical edits. Architecture, security, data migration, auth, payments, or public API changes should go to a frontier coding agent and a review gate.

External actions should not be routed to cheap autonomous models by default. Publishing, paying, deploying, credential mutation, and reputation changes need approval gates.
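The pattern above compresses into a small decision function. This is a simplified sketch: the Route names, the task family labels, the single sensitive flag (standing in for "legal, financial, security-sensitive, public, or enabling an external action"), and the 0.8 confidence floor are assumptions for illustration.

```python
from enum import Enum


class Route(Enum):
    CHEAP_LOCAL = "cheap_local"
    FRONTIER = "frontier"
    FRONTIER_WITH_REVIEW = "frontier_with_review"
    APPROVAL_GATE = "approval_gate"


# Task families that may start on a cheap or local model.
CHEAP_ELIGIBLE = {
    "classification", "format_transformation", "extraction",
    "summarization", "rag_synthesis",
}


def route(task_family: str, *, confidence: float = 1.0,
          schema_valid: bool = True, sensitive: bool = False,
          sources_conflict: bool = False,
          min_confidence: float = 0.8) -> Route:
    """Pick a path by task family first, then apply escalation rules."""
    if task_family == "external_action":
        return Route.APPROVAL_GATE  # publish, pay, deploy, mutate credentials
    if task_family == "code_transform":
        # Mechanical edits stay cheap; sensitive areas get a frontier agent and review.
        return Route.FRONTIER_WITH_REVIEW if sensitive else Route.CHEAP_LOCAL
    if task_family not in CHEAP_ELIGIBLE:
        return Route.FRONTIER  # unknown work is treated as exploration
    escalate = (
        sensitive
        or sources_conflict
        or not schema_valid
        or confidence < min_confidence
    )
    return Route.FRONTIER if escalate else Route.CHEAP_LOCAL
```

For example, route("classification", confidence=0.95) stays on the cheap path, while route("format_transformation", schema_valid=False) escalates. Unknown task families default to the frontier path, so new kinds of work are never silently downgraded.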

That is not glamorous. It is what the operating system for agents has to look like.

Where this leaves Codex, Claude, and OpenCode

I do not think this is a simple replacement story.

Codex and Claude are strong where ambiguity is high and the work environment matters. They are excellent for early discovery, complex code work, review, planning, and the kind of multi-step tasks where you want the system to carry a lot of operational context.

OpenCode-style routing becomes more interesting once the work is known. If a startup can say "this agent/action should use this model unless this condition fails," then it can start moving repeated work off the most expensive path.

The post-PMF stack probably uses both patterns.

Frontier managed agents for hard reasoning.

Open/local routing for repeated, typed work.

Deterministic workflows for work that should not need a model anymore.

Evidence manifests and approval gates so the system remains auditable.

That is the path I believe in because it matches what we have seen inside OpenClaw.

The honest evidence note

This is not a market benchmark.

The quantitative examples here come from OpenClaw internal operational telemetry and local model evaluation fixtures. They are useful because they show the shape of the scaling problem in a real agentic operating system. They do not prove that every startup will see the same numbers, and they do not prove that one product category wins over another.

What they do prove is enough for an operator thesis:

frontier-heavy exploration can create real token pressure
local models can win specific task families
some model calls can be avoided entirely
retrieval and artifact references can control context growth
routing has to include permission boundaries, not just model cost

The early-stage question is: how fast can we learn?

The post-PMF question becomes: how much of this work have we understood well enough to route?

That is when the agent stack changes.

Not because the frontier models stop mattering.

Because everything else finally starts to matter too.

Sources and Evidence

Internal evidence:

memory/token-burn-analysis-2026-03-10.md
memory/spend-report-2026-06-20.md
projects/hybrid-control-plane/data/corpus/model_evaluation_findings.md
projects/hybrid-control-plane/data/paper/cost_efficiency.json
data/trajectory-eval/openclaw-trajectory-summary-2026-05-25.json
data/memory_telemetry/2026-06-20.json
projects/openclaw-enhancements/research/STARTUP-EVOLUTION-ARTICLE-EVIDENCE-PACK-2026-06-20.md

External product docs:

OpenCode config: https://opencode.ai/docs/config/
OpenCode models: https://opencode.ai/docs/models/
OpenCode agents: https://opencode.ai/docs/agents/
Claude Code subagents: https://code.claude.com/docs/en/sub-agents
Claude Agent SDK subagents: https://code.claude.com/docs/en/agent-sdk/subagents
Claude Code hooks: https://code.claude.com/docs/en/hooks-guide
Codex Cloud: https://developers.openai.com/codex/cloud
Codex CLI: https://developers.openai.com/codex/cli
Codex GitHub integration: https://developers.openai.com/codex/integrations/github