
Replacing OpenAI Embeddings with Ollama — $0 vs $5/hour

Nissan Dookeran · 5 min read · ollama · embeddings · qdrant · cost-engineering


We were paying for a background process and didn't notice for weeks.

When we finally checked the OpenAI billing dashboard, the number wasn't catastrophic, but it was embarrassing. A conversation indexer running in the background, polling every 5 seconds, embedding every chat session into a Qdrant vector store. The indexer only ran during active sessions, not 24/7. Annualised: somewhere between $720 and $900. For a background job. For embeddings.

We fixed it in an afternoon. Here's how.


## The problem

Our setup was straightforward: every 5 seconds, the indexer would check for new or updated conversations, embed them using OpenAI's text-embedding-ada-002, and upsert them into Qdrant. The polling rate was a lazy default we never revisited. The embedding model was the obvious choice when we first built it.
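Structurally, there's nothing exotic about that loop. A minimal sketch, with illustrative function names rather than our actual code:

```python
import time

def index_pass(fetch_updated, embed, upsert):
    """One polling pass: embed each new or updated conversation, then upsert it."""
    for conv_id, text in fetch_updated():
        upsert(conv_id, embed(text))

def run_indexer(fetch_updated, embed, upsert, interval_seconds: float = 5.0):
    """The background loop: wake, index, sleep. 5 seconds was the lazy default."""
    while True:
        index_pass(fetch_updated, embed, upsert)
        time.sleep(interval_seconds)
```

Everything that matters here is behind `embed`: the rest of the loop never needed to know which provider it was calling.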

The issue isn't that OpenAI embeddings are bad. They're excellent. The issue is that we were using a remote paid API for a background task that had no latency requirements, no uptime SLA, and didn't need the absolute frontier of embedding quality. We were paying cloud prices for work a local model could do just as well.


## The migration plan

The target was nomic-embed-text via Ollama. It's a solid open-source embedding model: 768 dimensions, trained on a large diverse corpus, consistently benchmarks well against ada-002 on standard retrieval tasks.
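If you want to follow along, fetching the model is one command, assuming Ollama is installed and its daemon is listening on the default port:

```shell
# Download the embedding model (a few hundred MB on disk)
ollama pull nomic-embed-text

# Sanity check: the response should contain an "embeddings" array
curl -s http://localhost:11434/api/embed \
  -d '{"model": "nomic-embed-text", "input": "hello"}'
```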

The migration plan:

  1. Pull nomic-embed-text via Ollama locally
  2. Swap the embedding call in the indexer
  3. Re-embed all existing Qdrant points (you can't mix embedding spaces)
  4. Dial back the polling frequency while we're in there

Step 4 was a freebie. There was no reason to poll every 5 seconds. We bumped the interval to 300 seconds. That alone cut the indexer from 720 polls an hour to 12.
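The arithmetic behind that is worth making explicit, since the polling interval turned out to be the lever that did the most work:

```python
def polls_per_hour(interval_seconds: float) -> float:
    """How many times per hour the indexer wakes at a given polling interval."""
    return 3600 / interval_seconds

# 5-second default vs. the 300-second interval we moved to
assert polls_per_hour(5) == 720
assert polls_per_hour(300) == 12
```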


## The #1 gotcha: the API response format

This will get you if you're not careful.

OpenAI's embedding API returns:

```python
# OpenAI
response = openai_client.embeddings.create(
    input=text,
    model="text-embedding-ada-002"
)
embedding = response.data[0].embedding
```

Ollama's embedding API returns:

```python
# Ollama (via requests or httpx)
response = requests.post(
    "http://localhost:11434/api/embed",
    json={"model": "nomic-embed-text", "input": text}
)
embedding = response.json()["embeddings"][0]
```

Note the difference:
- OpenAI (Python SDK): `response.data[0].embedding`
- Ollama (raw JSON): `response.json()["embeddings"][0]`

If you copy-paste your OpenAI code and just swap the URL, the lookup fails: an exception at best, `None` for every embedding at worst. If you're not validating before upsert, you'll silently corrupt your vector store with null vectors. They'll look like valid points. Queries will return garbage.

The fix is one line. The debugging is not.

A safe wrapper that handles both:

```python
import requests

def get_embedding(text: str, provider: str = "ollama") -> list[float]:
    if provider == "openai":
        response = openai_client.embeddings.create(
            input=text, model="text-embedding-ada-002"
        )
        return response.data[0].embedding
    elif provider == "ollama":
        response = requests.post(
            "http://localhost:11434/api/embed",
            json={"model": "nomic-embed-text", "input": text},
            timeout=30,
        )
        response.raise_for_status()
        result = response.json()
        # Guard against the silent-failure mode: missing or empty "embeddings"
        embedding = result.get("embeddings", [[]])[0]
        if not embedding:
            raise ValueError(f"Empty embedding returned: {result}")
        return embedding
    else:
        raise ValueError(f"Unknown provider: {provider}")
```

Always validate. Always raise on empty.
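Beyond raising on empty, it's cheap to check the vector's shape before it ever reaches the store. A sketch, where the 768 default assumes `nomic-embed-text`:

```python
def validate_embedding(embedding: list[float], expected_dim: int = 768) -> list[float]:
    """Reject vectors that would silently corrupt the vector store.

    768 is nomic-embed-text's dimension; ada-002 would be 1536.
    """
    if not embedding:
        raise ValueError("empty embedding")
    if len(embedding) != expected_dim:
        raise ValueError(f"expected {expected_dim} dims, got {len(embedding)}")
    if all(v == 0.0 for v in embedding):
        raise ValueError("all-zero embedding")
    return embedding
```

A dimension check also catches the other migration hazard: accidentally upserting an old 1536-dim vector into the new collection.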

---

## The results

Numbers from the migration, measured on a local Mac Mini:

| Metric | Before | After |
|--------|--------|-------|
| Model | `text-embedding-ada-002` | `nomic-embed-text` |
| Dimensions | 1536 | 768 |
| Latency (avg) | ~80ms (network) | 79.5ms (local) |
| Polling interval | 5s (720 calls/hr) | 300s (12 calls/hr) |
| Annual cost | $720–900 | $0 |

**Migration time:** 1,761 points re-embedded and upserted in 1 minute 25 seconds.
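The re-embedding pass itself is just a batched loop. A sketch with the store calls stubbed out (`migrate` and its parameters are illustrative; our real version wrapped Qdrant's scroll and upsert):

```python
from typing import Callable, Iterable, Iterator

def batched(items: Iterable, size: int) -> Iterator[list]:
    """Yield fixed-size batches so the store sees one upsert per chunk."""
    batch: list = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def migrate(points: Iterable[tuple[str, str]],
            embed: Callable[[str], list[float]],
            upsert_batch: Callable[[list[tuple[str, list[float]]]], None],
            batch_size: int = 64) -> int:
    """Re-embed every (id, text) pair and upsert in batches. Returns the count."""
    total = 0
    for batch in batched(points, batch_size):
        upsert_batch([(pid, embed(text)) for pid, text in batch])
        total += len(batch)
    return total
```

Batching matters less for 1,761 points than for the failure mode: if the loop dies mid-run, you know exactly which chunk to resume from.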

**Semantic quality:** We ran our eval harness against both models on a held-out set of conversation queries. `nomic-embed-text` retained 100% of the quality we measured: same top-k results, same relevance scores. For this use case (conversational search within a personal assistant), the models are functionally equivalent.

The latency difference is noise. Remote API calls add round-trip overhead; local inference eats CPU. On a Mac Mini M2, 79.5ms per embedding is entirely acceptable for a background indexer.

---

## Should you do this?

**Yes, if:**
- You're embedding for a background process with no user-facing latency requirements
- Your data stays on-device or on your own infrastructure (privacy win too)
- You're self-hosting other infrastructure already (Qdrant, etc.)
- You can run Ollama locally or on a VPS
- Your embedding use case is standard retrieval / semantic search

**No, if:**
- You need multilingual embedding quality that frontier models provide
- You're on a device without enough RAM/CPU for inference (Ollama needs ~500MB–1GB for nomic-embed-text)
- You're already on a hosted provider that bundles embeddings cheaply (e.g., Supabase pgvector with their embedding endpoint)
- You genuinely need 1536-dim embeddings for compatibility with an existing system you can't re-embed

One note: switching embedding models requires re-embedding everything in your vector store. You can't mix `ada-002` vectors with `nomic-embed-text` vectors in the same collection. Plan for a one-time migration window.

---

## The takeaway

Local embeddings are production-ready for most use cases. The quality is there. The latency is fine. The cost is zero.

`nomic-embed-text` via Ollama is a drop-in replacement for `ada-002` in the vast majority of retrieval applications. The only real friction is the API response format difference, which is a one-line fix once you know about it.

Save the OpenAI API for things that actually need it: frontier reasoning, multi-modal tasks, production applications where you genuinely need managed uptime. Not background indexers. Not internal tooling. Not anything that can tolerate a 5-minute polling interval.

We're now paying $0/month for conversation indexing. The quality is identical. The only thing we lost was a billing line item we should have cut months earlier.

---

*The full migration code and Qdrant upsert wrapper are in the follow-up post. Next up: how we structured the Qdrant collection for hybrid search across conversation history.*