
When a chat model changes, you notice quickly: the responses read differently or the formatting shifts. Someone on the team says "this doesn't feel right" and you investigate. Embedding models don't give you that opportunity.
An embedding model converts text into a vector, a list of numbers that represents the meaning of that text in a geometric space. Your retrieval system (RAG, semantic search, recommendation engine) works by comparing those vectors. Documents embedded last month sit in the database alongside documents embedded today. The system assumes they all live in the same geometric space.
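The comparison at the core of that retrieval step can be sketched in a few lines. The vectors below are toy three-dimensional stand-ins; real embedding models produce hundreds or thousands of dimensions, but the math is the same.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": a query and two stored documents.
query = [0.9, 0.1, 0.2]
doc_relevant = [0.8, 0.2, 0.1]
doc_unrelated = [0.1, 0.9, 0.3]

# The relevant document scores closer to the query than the unrelated one.
assert cosine_similarity(query, doc_relevant) > cosine_similarity(query, doc_unrelated)
```

Every stored vector is compared against the query vector this way, which is why all of them must come from the same geometric space.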
When the model provider updates the weights behind the same tag, the geometry changes. New vectors land in slightly different positions. The old vectors don't move. They can't. They're already stored.
The result is a slow rot. Retrieval quality degrades because the query vector and the document vectors were computed by different versions of the same model. Cosine similarity scores drift. Results that should rank first start appearing third or fifth. The system still returns results. It still looks functional. Nothing throws an error.
This is what makes embedding drift worse than chat model drift. Chat changes are visible in the output. Embedding changes are invisible in the output. You find out weeks later when someone notices the search results feel less relevant, or you don't find out at all.
How This Happens in Practice
Model registries like Ollama, Hugging Face, and cloud provider endpoints all use mutable tags. When you pull `nomic-embed-text:latest`, you get whatever version is current today. Pull it again next month and the bytes might be different. The tag is the same. The model is not.
This is the same class of problem that container registries solved years ago. Nobody deploys `python:latest` to production anymore because the community learned that mutable tags cause silent environment drift. The same lesson applies to model tags, but most teams haven't internalized it yet for ML artifacts.
The failure mode is particularly dangerous for systems that embed incrementally. If you embed your entire corpus once and never update the model, you're fine because everything lives in the same vector space. But most production systems embed new documents as they arrive and refresh documents that have changed. The moment the underlying model changes, new documents get embedded in a different space than old ones. The database doesn't know. The retrieval layer doesn't know. The degradation is gradual and cumulative.
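A toy simulation makes the failure concrete. Here the "model update" is injected deterministic noise rather than a real weight change, so the numbers are illustrative only: the same text embedded by "v1" and "v2" no longer produces identical vectors, even though nothing in the application changed.

```python
import math
import random

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def embed_v1(text: str, dim: int = 8) -> list[float]:
    """Stand-in for an embedding model: a deterministic pseudo-random vector per text."""
    rng = random.Random(text)
    return [rng.uniform(-1, 1) for _ in range(dim)]

def embed_v2(text: str, dim: int = 8) -> list[float]:
    """The same 'model' after a silent weight update: each coordinate nudged slightly."""
    rng = random.Random(text + "#drift")
    return [x + 0.3 * rng.uniform(-1, 1) for x in embed_v1(text, dim)]

doc = "quarterly revenue report"
same_space = cosine(embed_v1(doc), embed_v1(doc))   # identical vectors: similarity 1.0
cross_space = cosine(embed_v1(doc), embed_v2(doc))  # same text, different geometry: < 1.0
assert same_space > cross_space
```

Once part of the corpus lives in the v1 space and part in the v2 space, cross-space similarity scores like `cross_space` silently contaminate every ranking.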
What Pinning Looks Like
The fix is version pinning. Use explicit version tags, never `:latest`. Record the exact model version in every row that stores an embedding so you can audit which vectors came from which model version. Validate at application startup that the model tag matches what the system expects.
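Recording the model version per row can be this simple. The sketch below uses sqlite3 and a hypothetical `embeddings` table; the column names are illustrative, and a real system would use a vector column type rather than JSON text.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE embeddings (
        doc_id        TEXT PRIMARY KEY,
        vector        TEXT NOT NULL,   -- JSON-encoded floats for illustration
        model_name    TEXT NOT NULL,   -- e.g. 'nomic-embed-text'
        model_version TEXT NOT NULL    -- pinned tag or digest, never 'latest'
    )
""")

def store_embedding(doc_id: str, vector: list[float],
                    model_name: str, model_version: str) -> None:
    conn.execute(
        "INSERT INTO embeddings VALUES (?, ?, ?, ?)",
        (doc_id, json.dumps(vector), model_name, model_version),
    )

store_embedding("doc-1", [0.1, 0.2], "nomic-embed-text", "v1.5")

# Auditing which vectors came from which model version becomes a query:
rows = conn.execute(
    "SELECT model_version, COUNT(*) FROM embeddings GROUP BY model_version"
).fetchall()
```

When that audit query ever returns more than one version, you know exactly which rows need re-embedding.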
For local inference (Ollama, llama.cpp), this means extracting the model files and storing them under your own control. The Modelfile that ships with an Ollama model references the weights and bundles the chat template, the runtime parameters, and the stop tokens. Grabbing only the weights and losing the rest means losing the behavioral recipe the publisher intended. Pin the whole package.
A Pydantic validator that rejects any model reference without an explicit version tag, or any reference ending in `:latest`, catches this at config load. The application refuses to start with an unpinned model. That's a few lines of code doing the same job as the policy engines (OPA, Kyverno) that enterprise ML platforms use to gate deployments. The discipline scales down.
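A sketch of that validator using Pydantic v2. The config shape and field name are assumptions; the point is that an unpinned reference fails validation before the application ever embeds anything.

```python
from pydantic import BaseModel, field_validator

class EmbeddingConfig(BaseModel):
    embedding_model: str  # e.g. "nomic-embed-text:v1.5"

    @field_validator("embedding_model")
    @classmethod
    def require_pinned_version(cls, value: str) -> str:
        # Reject references with no tag at all, and references pinned to :latest.
        name, sep, tag = value.partition(":")
        if not sep or not tag:
            raise ValueError(f"{value!r} has no explicit version tag")
        if tag == "latest":
            raise ValueError(f"{value!r} is unpinned; refusing to start with :latest")
        return value

# Accepted: an explicit version tag.
config = EmbeddingConfig(embedding_model="nomic-embed-text:v1.5")
```

Constructing `EmbeddingConfig(embedding_model="nomic-embed-text:latest")` raises a `ValidationError`, so the process exits at config load instead of quietly embedding into the wrong space.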
When to Re-embed
If you do need to upgrade the model (better quality, smaller size, new capabilities), re-embed the entire corpus with the new version. This is a migration, not an update. Treat it like a database schema change: planned, tested, and applied as a discrete operation. Run the old and new versions side by side, compare retrieval quality on known queries, and cut over when you're confident.
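The side-by-side comparison can start as simply as measuring top-k overlap on a fixed set of known queries. In this sketch, the old-space and new-space vectors stand in for a corpus embedded by the two pinned versions; real evaluation would also check labeled relevance, not just overlap.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_k(query_vec: list[float], doc_vecs: dict, k: int = 3) -> list[str]:
    """Rank document ids by cosine similarity to the query, best first."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:k]

def overlap_at_k(old_ids: list[str], new_ids: list[str]) -> float:
    """Fraction of the old top-k that survives in the new top-k."""
    return len(set(old_ids) & set(new_ids)) / len(old_ids)

# Hypothetical corpus embedded by both versions, plus the same query in each space.
old_space = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0]}
new_space = {"a": [0.9, 0.1], "b": [0.7, 0.7], "c": [0.1, 1.0]}
q_old, q_new = [1.0, 0.1], [0.95, 0.15]

score = overlap_at_k(top_k(q_old, old_space, k=2), top_k(q_new, new_space, k=2))
```

Run this over a representative query set, and cut over only when the overlap (and your relevance checks) stay above the threshold you set before the migration.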
The cost of re-embedding is real but bounded. The cost of silent retrieval degradation is unbounded because you don't know it's happening.
By the Numbers
Embedding model updates can shift vector positions by 5-15% in cosine distance, enough to reorder top-k retrieval results without triggering any application error (observed in production RAG systems, 2025).
Container image pinning adoption exceeded 90% in production Kubernetes clusters after years of silent drift incidents from mutable tags (CNCF Survey, 2023).