Embedding Similarity Scores That Lie: Why High Cosine Scores Miss Intent
You've built a retrieval pipeline. The embedding model is solid, the vector store is indexed, and similarity scores are coming back at 0.92 or higher. Then someone tests it and asks "How do I cancel my subscription?" and your system confidently returns a passage about contract renewal cycles. The score was high. The result was wrong.
This is one of the most common and quietly damaging failure modes in production retrieval systems. Understanding why it happens is the first step toward fixing it.
What You'll Learn
- Why cosine similarity measures vector geometry, not semantic intent
- How embedding models encode meaning in ways that confuse similar vocabulary with similar purpose
- The specific retrieval failure modes you should test for
- Practical techniques for catching and correcting intent mismatch
- How to evaluate retrieval quality beyond raw similarity scores
Prerequisites
This article assumes you have some familiarity with vector embeddings and have worked with a retrieval pipeline, even a basic one using sentence-transformers, OpenAI embeddings, or a similar library. You don't need to be a machine learning researcher to follow along.
What Cosine Similarity Actually Measures
Cosine similarity measures the angle between two vectors in high-dimensional space. A score of 1.0 means they point in exactly the same direction, 0.0 means they're perpendicular, and -1.0 means they point in opposite directions. It does not measure meaning. It measures geometric proximity in a space that was trained to approximate meaning, which is a critical distinction.
Embedding models are trained to place semantically related text close together. But "related" is a broad concept. During training, a model sees patterns like "cancel subscription" and "subscription renewal" appearing in similar contexts: FAQs, support docs, billing pages. The model learns they are topically related, so their vectors end up nearby. The model doesn't know that one is a request to end a relationship and the other is a pitch to extend it.
Cosine similarity tells you that two pieces of text live in the same neighborhood. It doesn't tell you if they're going to the same destination.
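To make the geometry concrete, here is a minimal sketch of the computation in plain Python. The function name `cosine_similarity` and the toy 2-D vectors are illustrative assumptions; real embeddings have hundreds of dimensions, and libraries compute the same quantity in batch.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors chosen only to illustrate the three reference points.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # parallel, different lengths
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 3.0])     # 90 degrees apart
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])        # 180 degrees apart

print(same_direction, perpendicular, opposite)
```

Notice that the first pair scores as (approximately) identical even though one vector is twice as long as the other. Only direction matters, and nothing in the formula knows what the text was for.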
The Intent vs. Topic Problem
Most embedding models are trained primarily on topic similarity, not intent similarity. This means they're very good at answering "is this text about the same subject?" and significantly weaker at answering "does this text serve the same purpose?"
Consider these two sentences:
- "How do I cancel my account?"
- "Here's why customers should renew their accounts before the deadline."
Both are about accounts. Both involve the word or concept of account continuity. An embedding model trained on general text will cluster them because they share domain vocabulary. But one is a user in distress, possibly frustrated, looking for an exit. The other is a retention pitch. Serving the second in response to the first is not just unhelpful; it can actively damage trust.
This is the intent gap: the difference between what a piece of text is about and what it is for.
Common Failure Modes in Production Retrieval
Polarity Confusion
Affirmations and negations often land close together in embedding space. "This feature is stable" and "This feature is not stable" share almost all their vocabulary. The embeddings for both sentences will be geometrically close, even though they carry opposite information. In a RAG pipeline, this means a user asking "is X safe to use?" might get a retrieval hit from a document saying "X is not safe to use" with a high similarity score, after which your summarization step might smooth over the negation.
Question-Document Asymmetry
Queries are short. Documents are long. Embedding a 10-word question and a 500-word document into the same space creates an inherent mismatch. The document vector is a weighted average over a large body of text; the query vector is sharp and specific. Even if the document contains the answer, the vector for the whole document may be dominated by surrounding context, making the similarity score misleadingly low or creating false positives from documents that mention the topic without answering the question.
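The dilution effect is easy to see with toy numbers. The sketch below assumes a document embedding behaves like a plain mean of its chunk embeddings, which is a simplification of what real models do, and every vector in it is made up for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mean_vector(vectors):
    """Element-wise mean, standing in for one whole-document embedding."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical 3-D "embeddings" for the chunks of a single document.
# Only the first chunk answers the query; the rest is surrounding context.
chunk_vectors = [
    [0.9, 0.1, 0.0],   # the passage that answers the question
    [0.1, 0.9, 0.1],   # background
    [0.0, 0.2, 0.9],   # unrelated detail
    [0.1, 0.8, 0.3],   # more background
]
query_vector = [1.0, 0.0, 0.0]

whole_doc_score = cosine(query_vector, mean_vector(chunk_vectors))
best_chunk_score = max(cosine(query_vector, c) for c in chunk_vectors)

print(f"whole document: {whole_doc_score:.3f}")
print(f"best chunk:     {best_chunk_score:.3f}")
```

The answering passage scores far higher on its own than the document does as a whole, because the averaged vector is pulled toward the surrounding context.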
Jargon and Domain Drift
General-purpose embedding models are trained on general text. If your corpus uses domain-specific terminology (medical codes, legal phrases, internal product names), the model may map those terms onto their closest general-language neighbors. A query using precise domain language may miss highly relevant documents that use the same concepts with different surface phrasing, or match irrelevant documents that share vocabulary by coincidence.
Surface Form Matching
Sometimes high cosine scores come from shared stopwords, shared named entities, or shared syntactic patterns rather than shared meaning. Two paragraphs can both mention "Python", "2024", and "API" while being about completely different topics. The model may score them as highly similar because those tokens dominated the embedding.
A Concrete Example You Can Test
Here's a minimal test you can run with sentence-transformers to see this in action:
from sentence_transformers import SentenceTransformer, util

# Small general-purpose bi-encoder; the weights are downloaded on first use.
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "How do I cancel my subscription?"
candidates = [
    "To cancel your subscription, go to Account Settings and click Cancel Plan.",
    "Renewing your subscription before it expires saves you 20% on your annual plan.",
    "Our subscription service gives you access to all premium features.",
    "Contact support if you need help managing your billing preferences.",
]

# Encode the query and the candidates independently of each other.
query_vec = model.encode(query, convert_to_tensor=True)
candidate_vecs = model.encode(candidates, convert_to_tensor=True)

# cos_sim returns a 1 x N matrix; take its single row of scores.
scores = util.cos_sim(query_vec, candidate_vecs)[0]

for candidate, score in zip(candidates, scores):
    print(f"{score:.4f} | {candidate[:60]}")
Run this and look at which candidates score highest. Depending on the model, you may find that the renewal pitch and the generic subscription description score close to, or even higher than, the actual cancellation instructions. Even when the right passage wins, the margin is often small. The model is matching topic, not intent.
Techniques for Catching Intent Mismatch
Use a Reranker
Cross-encoder rerankers take a (query, document) pair and score them jointly, rather than encoding them independently. Because the model sees both texts at once, it can judge whether the document actually answers the question, not just whether it's topically related. Models like cross-encoder/ms-marco-MiniLM-L-6-v2 are designed for this. The pattern is: retrieve the top 20 to 50 candidates by cosine similarity, then rerank with a cross-encoder and take the top 5.
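from sentence_transformers import CrossEncoder

# Reuses `query` and `candidates` from the previous example.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Score each (query, candidate) pair jointly.
pairs = [(query, candidate) for candidate in candidates]
rerank_scores = reranker.predict(pairs)

# Sort on the score only, highest first.
ranked = sorted(zip(rerank_scores, candidates), key=lambda pair: pair[0], reverse=True)
for score, text in ranked:
    print(f"{score:.4f} | {text[:60]}")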
Cross-encoders are slower than bi-encoders because they can't pre-compute document embeddings. That's why you use them as a second-stage filter, not as your primary retrieval mechanism.
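The two-stage pattern can be sketched without committing to any particular model. In the version below, `retrieve_then_rerank`, `first_stage_score`, and `rerank_score` are hypothetical names: the two scorers are injected as plain functions, and the toy scorers in the usage section are placeholders so the example runs on its own. In a real pipeline the first stage would usually be a vector-store lookup rather than a per-document function call.

```python
from typing import Callable, List, Sequence, Tuple

def retrieve_then_rerank(
    query: str,
    documents: Sequence[str],
    first_stage_score: Callable[[str, str], float],
    rerank_score: Callable[[str, str], float],
    shortlist_size: int = 20,
    final_k: int = 5,
) -> List[Tuple[float, str]]:
    """Cheap scorer over everything, expensive scorer over the shortlist only."""
    # Stage 1: rank the whole corpus with the cheap scorer and keep a shortlist.
    shortlist = sorted(
        documents, key=lambda doc: first_stage_score(query, doc), reverse=True
    )[:shortlist_size]
    # Stage 2: rescore only the shortlist with the slower, more precise scorer.
    rescored = [(rerank_score(query, doc), doc) for doc in shortlist]
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return rescored[:final_k]

# Toy usage with placeholder scorers (not real models).
def shared_words(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

def prefers_instructions(query: str, doc: str) -> float:
    return 1.0 if "click" in doc.lower() else 0.0

docs = [
    "Renewing your subscription saves you 20% on your annual plan.",
    "To cancel your subscription, open Account Settings and click Cancel Plan.",
    "Our office is closed on public holidays.",
]
top = retrieve_then_rerank(
    "cancel your subscription", docs, shared_words, prefers_instructions,
    shortlist_size=2, final_k=1,
)
print(top)
```

The design point is that the expensive scorer only ever sees `shortlist_size` documents, so its cost stays fixed no matter how large the corpus grows.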
Embed Hypothetical Answers
Instead of embedding the raw query, generate a hypothetical answer to the query and embed that. If the user asks "How do I cancel my subscription?", generate a short synthetic answer like "To cancel your subscription, navigate to your account settings and select the cancellation option." Embed that answer and use it as your query vector. This shifts the search from query-to-document into answer-to-document space, which often produces significantly better intent alignment.
This technique is sometimes called HyDE (Hypothetical Document Embeddings) and is worth testing any time your query-document similarity feels off.
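Here is a minimal sketch of the idea, assuming you already have some way to generate text and some way to embed it. The names `hyde_query_vector`, `generate_answer`, and `embed` are hypothetical, and the two stand-in functions below only exist so the example runs without an LLM or an embedding model.

```python
from typing import Callable, List, Sequence

def hyde_query_vector(
    query: str,
    generate_answer: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
) -> List[float]:
    """Embed a generated hypothetical answer instead of the raw query."""
    hypothetical_answer = generate_answer(query)
    return list(embed(hypothetical_answer))

# Placeholder for an LLM call: returns a canned answer regardless of input.
def canned_answer(query: str) -> str:
    return ("To cancel your subscription, navigate to your account settings "
            "and select the cancellation option.")

# Placeholder for a real embedding model: two crude text statistics.
def toy_embed(text: str) -> List[float]:
    return [float(len(text)), float(text.count(" "))]

vector = hyde_query_vector("How do I cancel my subscription?", canned_answer, toy_embed)
print(vector)
```

The vector you search with now comes from answer-shaped text. The trade-off is an extra generation call per query, and a generated answer that is wrong can steer retrieval in the wrong direction, so measure before adopting it.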
Add Metadata Filtering Before Similarity Search
If your documents are tagged with intent categories ("how-to", "policy", "marketing", "troubleshooting"), filter by the likely intent category before running similarity search. A query that contains "how do I" or "steps to" almost certainly wants procedural content, not conceptual overviews or promotional material. Pre-filtering by category reduces the pool of candidates and dramatically reduces the chance of a high-scoring but intent-mismatched result.
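A rule-based version of this pre-filter might look like the sketch below. The cue phrases, the `intent` field name, and the sample documents are all assumptions for illustration; a production system would more likely use a small classifier and the filter syntax of its vector store.

```python
PROCEDURAL_CUES = ("how do i", "how to", "steps to")

def guess_intent(query):
    """Very rough rule: procedural phrasing maps to the 'how-to' category."""
    lowered = query.lower()
    if any(cue in lowered for cue in PROCEDURAL_CUES):
        return "how-to"
    return None  # no confident guess, so do not filter at all

def filter_by_intent(documents, intent):
    """Keep only documents tagged with the given intent; None keeps everything."""
    if intent is None:
        return list(documents)
    return [doc for doc in documents if doc["intent"] == intent]

# Hypothetical tagged corpus.
documents = [
    {"text": "To cancel your subscription, go to Account Settings and click Cancel Plan.",
     "intent": "how-to"},
    {"text": "Renewing your subscription before it expires saves you 20% on your annual plan.",
     "intent": "marketing"},
    {"text": "Subscriptions renew automatically unless cancelled.",
     "intent": "policy"},
]

filtered = filter_by_intent(documents, guess_intent("How do I cancel my subscription?"))
for doc in filtered:
    print(doc["intent"], "|", doc["text"])
```

Returning `None` when no cue matches is deliberate: a wrong filter hides the right document entirely, so falling back to the unfiltered pool is the safer default.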
Test With Adversarial Pairs
Build a small evaluation set of adversarial query-document pairs: cases where topic similarity is high but intent is mismatched. Run your retrieval pipeline against this set regularly. If your pipeline ranks the wrong document above the right one on more than a small fraction of adversarial pairs, you have a real problem that will show up in production.
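One way to turn that into a number you can track is sketched below. Each case pairs a query with the document that should win and a topically similar document that should not. The function name, the case format, and the word-overlap scorer are all hypothetical; in practice you would pass in whatever scoring function your pipeline uses.

```python
import re

def adversarial_failure_rate(cases, score):
    """Fraction of cases where the intent-mismatched document ties or beats the right one."""
    failures = sum(
        1 for case in cases
        if score(case["query"], case["wrong"]) >= score(case["query"], case["right"])
    )
    return failures / len(cases)

# Placeholder scorer: counts shared lowercase word tokens. Swap in your real scorer.
def word_overlap(query, doc):
    tokens = lambda text: set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(tokens(query) & tokens(doc))

# Hypothetical adversarial cases: same topic, different intent.
cases = [
    {"query": "How do I cancel my subscription?",
     "right": "To cancel your subscription, go to Account Settings and click Cancel Plan.",
     "wrong": "Renewing your subscription before it expires saves you 20% on your annual plan."},
    {"query": "How do I pause my plan?",
     "right": "Open Billing and choose Pause to suspend payments.",
     "wrong": "Upgrade your plan to unlock more features."},
]

failure_rate = adversarial_failure_rate(cases, word_overlap)
print(f"adversarial failure rate: {failure_rate:.2f}")
```

Counting ties as failures is a conservative choice: if the pipeline cannot separate the right document from the wrong one, ordering is left to chance.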
Evaluating Retrieval Beyond Similarity Scores
Cosine similarity is a retrieval signal, not a quality metric. Stop using it as a proxy for retrieval success. Instead, track metrics that reflect whether your users are actually getting what they need.
Mean Reciprocal Rank (MRR) averages, across queries, the reciprocal of the rank at which the first correct answer appears, so it drops as correct answers sink down the result list. If your correct document is always in position 3 or 4, MRR sits around 0.25 to 0.33 and your users are getting frustrated even if the similarity scores look decent.
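As a quick sketch, MRR can be computed from labeled results like this. The input format (one list of relevance flags per query, in rank order) is an assumption made for the example, as is the sample data.

```python
def mean_reciprocal_rank(ranked_relevance):
    """ranked_relevance: one list of booleans per query, in rank order (True = relevant)."""
    total = 0.0
    for flags in ranked_relevance:
        for rank, is_relevant in enumerate(flags, start=1):
            if is_relevant:
                total += 1.0 / rank
                break  # only the first relevant result counts
    return total / len(ranked_relevance)

# Made-up labels for three queries.
results = [
    [True, False, False],    # first relevant result at rank 1
    [False, False, True],    # first relevant result at rank 3
    [False, False, False],   # nothing relevant retrieved, contributes 0
]
mrr = mean_reciprocal_rank(results)
print(f"MRR: {mrr:.3f}")
```

Queries with no relevant result contribute zero rather than being skipped, which keeps total misses from quietly inflating the average.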
Precision at K (P@K) measures what fraction of your top K results are actually relevant. A high cosine score with P@5 of 0.4 means 3 out of your top 5 results are wrong.
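The arithmetic behind that example is small enough to write down. This is a minimal sketch; the function name and the boolean-flag input format are assumptions.

```python
def precision_at_k(relevance_flags, k):
    """Fraction of the top k results that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = relevance_flags[:k]
    # Divide by k, not len(top_k), so a short result list is penalized.
    return sum(1 for flag in top_k if flag) / k

# Two relevant results among the top five: the P@5 = 0.4 case described above.
labels = [True, False, True, False, False]
p_at_5 = precision_at_k(labels, 5)
print(f"P@5: {p_at_5}")
```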
Human relevance labeling is the ground truth. Sample 50 to 100 real queries from your production logs, have someone label the top-3 retrieved documents as relevant or not, and build a baseline. Then track whether it improves or degrades as you change your pipeline.
Common Pitfalls When Fixing This
Over-relying on a single fix. Adding a reranker helps, but if your embedding model has poor domain coverage, reranking a bad candidate set gives you the best of a bad set. Layer multiple improvements: better chunking, metadata filtering, and reranking together.
Chunking too coarsely. If your document chunks are 2,000 tokens of mixed content, the embedding for each chunk will be a blurry average. Smaller, focused chunks, ideally with a single coherent topic each, give the embedding model cleaner signal to work with.
Ignoring query preprocessing. Spelling errors, ambiguous pronouns, and conversational fillers degrade query embeddings. A quick cleanup step (spelling correction, pronoun resolution for chat contexts) can improve similarity scores for the right reasons.
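A simple starting point is to split on paragraph boundaries and pack paragraphs into chunks under a size budget. The sketch below counts whitespace-separated words rather than model tokens, which is an approximation, and both `chunk_paragraphs` and the `max_words` default are hypothetical.

```python
def chunk_paragraphs(text, max_words=200):
    """Split on blank lines, then pack whole paragraphs into chunks of at most max_words words."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for paragraph in paragraphs:
        n_words = len(paragraph.split())
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and current_len + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(paragraph)
        current_len += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks

sample = "a b c\n\nd e\n\nf g h i"
print(chunk_paragraphs(sample, max_words=5))
```

A single paragraph longer than `max_words` is kept whole here rather than cut mid-sentence. Size alone does not guarantee one topic per chunk, so treat this as a baseline to compare against heading-aware or semantic splitting.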
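Even a very small cleanup pass can help with conversational input. The sketch below only normalizes whitespace and strips leading filler words; the filler list is an assumption you would tune to your own traffic, and spelling correction and pronoun resolution are left out because they need dedicated tooling.

```python
import re

# Leading greetings and fillers, each followed by at least one separator character.
LEADING_FILLER = re.compile(
    r"^(?:(?:hi|hey|hello|um|uh|so|ok|okay|please)\b[\s,!.]+)+",
    re.IGNORECASE,
)

def clean_query(raw):
    """Collapse whitespace and strip leading conversational filler."""
    text = re.sub(r"\s+", " ", raw).strip()
    return LEADING_FILLER.sub("", text)

print(clean_query("  um,   hey so how do I cancel   my subscription? "))
```

Only leading fillers are removed because words like "so" or "please" can carry meaning in the middle of a sentence.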
Tuning thresholds on clean test data. If you set a similarity threshold of 0.75 because it works on your benchmark, check whether that threshold holds on actual production queries. Distribution shift between your test set and real users is one of the most common sources of unexpected degradation.
Wrapping Up
High cosine similarity is not, on its own, sufficient evidence of useful retrieval. The geometric proximity of two vectors tells you they came from the same neighborhood in embedding space; it doesn't tell you they serve the same purpose.
Here are concrete next steps you can take right now:
- Run the adversarial test above on your own pipeline with 10 to 20 intent-mismatched pairs and see where it breaks.
- Add a cross-encoder reranker as a second-stage filter if you haven't already. It's a relatively low-effort, high-impact change.
- Review your chunk sizes and split any chunks that mix multiple distinct topics into separate, focused chunks.
- Build a small labeled evaluation set from real production queries and track MRR or P@5 instead of raw similarity scores.
- Experiment with HyDE on your most common query types and measure whether retrieval precision improves.
Cosine similarity is a useful tool. The mistake is treating it as the answer rather than one signal among several.