The free TTS model that beats OpenAI
I've been running OpenClaw on a Mac Mini M4 for a few weeks now. Four AI agents, a hybrid control plane, local Ollama models — the whole setup. One thing kept bugging me: every time an agent needed to speak, it hit the OpenAI TTS API. At $0.015 per thousand characters for TTS-1 and $0.030 for TTS-1-HD, it adds up. Not fast, but steadily.
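To put those rates in perspective, here's a back-of-the-envelope cost sketch. The per-character prices come from the post; the daily character volume is a hypothetical I made up for illustration:

```python
# Published per-character rates: $0.015 per 1,000 chars (TTS-1), $0.030 (TTS-1-HD).
RATES = {"tts-1": 0.015 / 1000, "tts-1-hd": 0.030 / 1000}

def monthly_cost(model: str, chars_per_day: int, days: int = 30) -> float:
    """Dollar cost for a given daily character volume over a month."""
    return RATES[model] * chars_per_day * days

# Hypothetical volume: four agents speaking ~5,000 characters each per day.
print(round(monthly_cost("tts-1", 4 * 5000), 2))
print(round(monthly_cost("tts-1-hd", 4 * 5000), 2))
```

Not ruinous on its own, but it's a recurring line item for something a local model can do for free.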
So I decided to test everything. Every TTS option I could run on this machine, scored objectively, with real numbers.
The test
Four models:
- macOS say — the built-in speech synthesiser that ships with every Mac. Free.
- sherpa-onnx VITS — an open-source offline neural TTS model. Free.
- OpenAI TTS-1 — the standard API tier. $0.015/1k chars.
- OpenAI TTS-1-HD — the premium API tier. $0.030/1k chars.
Ten prompts across different categories: pangrams, tongue twisters, technical content with numbers, error messages, notifications, statistics readouts, confirmations, greetings, and status updates.
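The exact benchmark prompts aren't listed in the post, so these are illustrative stand-ins for a few of the categories, just to show the shape of the test set:

```python
# Illustrative stand-ins only; the real benchmark prompts aren't published here.
PROMPTS = {
    "pangram": "The quick brown fox jumps over the lazy dog.",
    "technical_numbers": "Build 4.2.1 finished in 387 seconds using 12 workers.",
    "error_message": "Connection refused on port 8080. Retrying in 5 seconds.",
    "status_update": "CPU at 42 percent, 9.3 GB of memory free.",
}

for category, text in PROMPTS.items():
    print(f"{category}: {len(text)} chars")
```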
Scoring: each model synthesises the prompt to audio, then Whisper transcribes it back. Word Error Rate (WER) and Character Error Rate (CER) are combined into a weighted score where 1.0 is perfect. The round-trip approach means I'm measuring what a listener would actually hear — not some subjective quality rating.
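The scoring step can be sketched like this. The 70/30 weighting between WER and CER is my illustrative assumption, not a figure from the post:

```python
def levenshtein(a, b):
    """Edit distance between two sequences (words or characters)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def error_rate(ref, hyp):
    return levenshtein(ref, hyp) / max(len(ref), 1)

def round_trip_score(prompt: str, transcript: str, w_wer: float = 0.7) -> float:
    """Combine WER and CER into one score where 1.0 is a perfect round trip.
    The 0.7/0.3 weighting is an assumed example, not the benchmark's actual value."""
    wer = error_rate(prompt.lower().split(), transcript.lower().split())
    cer = error_rate(list(prompt.lower()), list(transcript.lower()))
    return max(0.0, 1.0 - (w_wer * wer + (1 - w_wer) * cer))
```

A transcript identical to the prompt scores 1.0; every word or character Whisper mishears pulls the score down.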
The early failure
First attempt: run macOS say in a headless LaunchDaemon context. Silence. No audio device when there's no GUI session. I had to route the output through afconvert and write directly to a file instead of playing audio. Half an hour of debugging for something that works perfectly in Terminal. Classic daemon gotcha.
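The workaround, roughly: have say render straight to a file so no audio device is needed, then convert with afconvert. A minimal sketch, where the voice name and paths are placeholders (say's `-o` flag and afconvert's `-f WAVE -d LEI16` are real options, but verify them against your macOS version):

```python
import subprocess

def build_cmds(text: str, wav_path: str, voice: str = "Samantha"):
    """Build the two commands: `say -o` writes an AIFF file directly
    (works without a GUI audio session), then afconvert turns it into
    a 16-bit WAV. The voice name is a placeholder, not from the post."""
    aiff = wav_path.rsplit(".", 1)[0] + ".aiff"
    return [
        ["say", "-v", voice, "-o", aiff, text],
        ["afconvert", "-f", "WAVE", "-d", "LEI16", aiff, wav_path],
    ]

def synthesize_headless(text: str, wav_path: str, voice: str = "Samantha"):
    """Run the pipeline; only meaningful on macOS."""
    for cmd in build_cmds(text, wav_path, voice):
        subprocess.run(cmd, check=True)
```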
The results
| Model | Avg Score | Cost/1000 chars |
|-------|-----------|-----------------|
| sherpa-onnx VITS | 0.933 | $0.00 |
| macOS say | 0.918 | $0.00 |
| OpenAI TTS-1 | 0.913 | $0.015 |
| OpenAI TTS-1-HD | 0.888 | $0.030 |
Read that again. The free, offline, open-source model wins. The one that costs $0.030 per thousand characters comes last.
Per-category breakdown
| Category | macOS say | TTS-1 | TTS-1-HD | sherpa VITS |
|----------|-----------|-------|----------|-------------|
| pangram | 1.000 | 0.990 | 0.980 | 1.000 |
| tongue_twister | 0.948 | 0.876 | 0.857 | 0.895 |
| technical_numbers | 0.958 | 0.941 | 0.924 | 0.958 |
| technical_error | 0.934 | 0.948 | 0.930 | 0.967 |
| notification | 0.902 | 0.888 | 0.834 | 1.000 |
| statistics | 0.826 | 0.848 | 0.830 | 0.865 |
| confirmation | 0.969 | 0.949 | 0.930 | 0.969 |
| error_message | 0.961 | 0.981 | 0.923 | 1.000 |
| greeting_time | 0.925 | 0.903 | 0.881 | 0.925 |
| status_update | 0.758 | 0.806 | 0.787 | 0.754 |
sherpa-onnx wins or ties in 8 of 10 categories. It loses on tongue twisters (macOS say wins at 0.948) and on status updates with abbreviations like "CPU" and "GB" (TTS-1 at 0.806 vs 0.754). Status updates were the hardest category for every model.
TTS-1-HD is the puzzle. Twice the price of TTS-1, lower score. My guess: HD optimises for sounding good, not being precise. On numbers and technical terms, that costs it. At $0.030/1k chars, hard to justify.
What I didn't test
ElevenLabs — no API key yet, wasn't going to sign up just for this benchmark. If it scores above 0.933, the story changes. Also skipped Kokoro (another promising open-source model). Only 10 prompts, English only. Enough for a clear pattern, not enough for statistical confidence on edge cases.
What this means for me
sherpa-onnx VITS is now the default TTS for my agents. Not ElevenLabs, not the OpenAI API. The free offline model. It runs on-device, needs no API key, no internet, no per-character billing. And it produces the most accurate speech in the set.
I think this applies beyond my setup too. If you're building voice-enabled agents on a Mac or Linux box, you probably don't need to pay for TTS at all.
Next
Add ElevenLabs to the benchmark once I have a key. Expand the prompt set to 25+ with more edge cases — mixed languages, long-form paragraphs, code readouts. Test Kokoro. Run the suite on the Mac Mini's GPU for latency numbers.
The accuracy question might already be answered. The latency question hasn't been asked yet.