Fixing Cosine Similarity That Returns Misleading Matches in High Dimensions

June 11, 2026

You build a semantic search feature, embed your documents, and run cosine similarity. The top result looks plausible, so you ship it. Then a user points out that searching for "machine learning tutorial" returns a cooking blog post. The similarity score was 0.91. The system was completely confident, and completely wrong.

This is one of the quieter failures in applied machine learning. Cosine similarity is sound math, but high-dimensional spaces make its output unreliable in ways that aren't obvious until you're debugging in production.

  • Why cosine similarity degrades as dimensionality grows
  • What "concentration of measure" actually means for your search results
  • Practical fixes: dimensionality reduction, normalization strategies, and better similarity metrics
  • How to detect whether your current setup is already affected
  • When cosine similarity is still the right tool and when to walk away from it

What Cosine Similarity Actually Measures

Cosine similarity measures the angle between two vectors, not the distance between them. Two vectors pointing in the same direction get a score of 1.0, regardless of their magnitude. Two perpendicular vectors score 0. Two vectors pointing in opposite directions score -1.

This property is genuinely useful. A short document and a long document that discuss the same topic should score similarly. Magnitude noise from document length cancels out. That's the appeal.

import numpy as np

def cosine_similarity(a, b):
    dot = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Simple example in 3D
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # Same direction, different magnitude
print(cosine_similarity(a, b))  # 1.0 (up to floating-point rounding): perfect match

In low dimensions this works as expected. The trouble starts when you move into hundreds or thousands of dimensions, which is exactly where modern embedding models live.

The Curse of Dimensionality in Plain Terms

The phrase "curse of dimensionality" gets thrown around a lot. Here's the concrete version: for many data distributions, as the number of dimensions grows, the relative difference between the distances to the nearest and farthest points from any given query shrinks toward zero.

Think about it geometrically. In 2D, points spread out across a plane. In 3D, across a volume. In 1000D, almost all the volume of a hypersphere sits near its surface. Random vectors in high-dimensional space tend to be nearly orthogonal to each other. Their cosine similarities cluster tightly around zero.

import numpy as np

def simulate_similarity_distribution(dims, n_samples=5000):
    similarities = []
    for _ in range(n_samples):
        a = np.random.randn(dims)
        b = np.random.randn(dims)
        sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        similarities.append(sim)
    arr = np.array(similarities)
    return arr.mean(), arr.std()

for d in [10, 50, 200, 768, 1536]:
    mean, std = simulate_similarity_distribution(d)
    print(f"dims={d:5d}  mean={mean:.4f}  std={std:.4f}")

Run that and you'll see the standard deviation collapse dramatically as dimensions increase. At 768 dimensions (a common embedding size), the standard deviation is about 0.036, so nearly all random vector pairs score between roughly -0.1 and 0.1. Real embeddings are not random, but their scores are typically squeezed into a similarly narrow band, often shifted toward higher values, which makes it nearly impossible to distinguish a strong match from a weak one by score alone.
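The collapse follows a simple rule of thumb: for independent random vectors, the standard deviation of the cosine similarity is about 1/sqrt(d). A minimal check of that rule, assuming Gaussian vectors and using a hypothetical helper name `empirical_cosine_std`:

```python
import numpy as np

def empirical_cosine_std(dims, n_samples=5000, seed=0):
    """Std of cosine similarity between independent Gaussian vector pairs."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n_samples, dims))
    b = rng.standard_normal((n_samples, dims))
    # Row-wise cosine similarity between a[i] and b[i]
    sims = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    return float(sims.std())

for d in [10, 50, 200, 768]:
    predicted = 1 / np.sqrt(d)
    print(f"dims={d:4d}  empirical={empirical_cosine_std(d):.4f}  1/sqrt(d)={predicted:.4f}")
```

The two columns should track each other closely, which gives you a quick baseline for how wide a "pure noise" score band is at your embedding size.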

Why This Produces Misleading Results

The core problem is score compression. When all similarities cluster in a narrow range, a score of 0.91 doesn't actually mean much. It might represent your best match, but it might also be barely better than a random document that happened to land in a similar region of the vector space.

Two failure modes show up most often in practice:

False positives at scale

With a large corpus, the probability that some document lands near your query vector by chance alone grows significantly. With millions of vectors at high dimensionality, you'll get high-scoring matches that share no semantic relationship with the query. The index returns them confidently because the score looks high relative to the compressed range.

Indistinguishable ranks

Your actual top-10 results might have similarities of 0.88, 0.87, 0.87, 0.86... The differences are inside the noise floor. Re-ranking, filtering, or any logic that depends on score thresholds becomes unreliable. A cutoff of 0.85 that worked in testing fails silently in production when the distribution shifts.

Diagnosing the Problem in Your Own System

Before reaching for a fix, confirm you actually have this problem. Run a quick distribution check against your index.

import numpy as np

def diagnose_similarity_distribution(query_vector, corpus_vectors, sample_size=2000):
    """
    Sample pairwise similarities between the query and a subset of the corpus.
    Prints summary statistics to help you see how compressed the distribution is.
    """
    if len(corpus_vectors) > sample_size:
        indices = np.random.choice(len(corpus_vectors), sample_size, replace=False)
        sample = corpus_vectors[indices]
    else:
        sample = corpus_vectors

    q_norm = query_vector / np.linalg.norm(query_vector)
    norms = np.linalg.norm(sample, axis=1, keepdims=True)
    norms = np.where(norms == 0, 1e-9, norms)  # avoid division by zero
    sample_normed = sample / norms

    similarities = sample_normed @ q_norm

    print(f"Dimensions : {query_vector.shape[0]}")
    print(f"Sample size: {len(similarities)}")
    print(f"Min   : {similarities.min():.4f}")
    print(f"Max   : {similarities.max():.4f}")
    print(f"Mean  : {similarities.mean():.4f}")
    print(f"Std   : {similarities.std():.4f}")
    print(f"Range : {similarities.max() - similarities.min():.4f}")
    return similarities

If the range (max minus min) is less than about 0.2 across thousands of documents, you're in trouble. A healthy distribution for a well-separated corpus should show clear gaps between genuinely relevant matches and the bulk of unrelated documents.
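Range alone can hide the gap between relevant matches and the bulk. One way to put a number on that gap is to express the top scores in standard deviations of everything else. This is a rough sketch: `top_score_separation` is a hypothetical helper, and the toy scores are synthetic, not measurements from a real index.

```python
import numpy as np

def top_score_separation(similarities, top_k=10):
    """
    How far the weakest of the top_k scores sits above the remaining scores,
    measured in standard deviations of those remaining scores.
    """
    sims = np.sort(np.asarray(similarities, dtype=float))[::-1]
    top, rest = sims[:top_k], sims[top_k:]
    if rest.size == 0 or rest.std() == 0:
        return float("nan")
    return float((top.min() - rest.mean()) / rest.std())

# Synthetic illustration: 1990 unrelated documents plus 10 relevant ones
rng = np.random.default_rng(0)
bulk = rng.normal(0.30, 0.03, size=1990)
relevant = rng.normal(0.60, 0.02, size=10)
mixed = np.concatenate([bulk, relevant])

print(f"with relevant docs   : {top_score_separation(mixed):.1f} std")
print(f"unrelated docs only  : {top_score_separation(bulk):.1f} std")
```

You can pass the array returned by diagnose_similarity_distribution straight into this function. A small value suggests your top results are just the tail of the bulk rather than a separate group.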

Fix 1: Reduce Dimensionality Before Comparing

The most direct fix is to operate in a lower-dimensional space. PCA (Principal Component Analysis) is the classic approach: project your embeddings down to a space where the meaningful variance is preserved but the curse is less severe.

from sklearn.decomposition import PCA
import numpy as np

# Assume `corpus_vectors` is shape (N, 768)
pca = PCA(n_components=128, random_state=42)
corpus_reduced = pca.fit_transform(corpus_vectors)

# At query time, apply the same transform
query_reduced = pca.transform(query_vector.reshape(1, -1))

# Now compute cosine similarity in 128D instead of 768D
def cosine_similarity_matrix(query, corpus):
    query_norm = query / np.linalg.norm(query)
    norms = np.linalg.norm(corpus, axis=1, keepdims=True)
    corpus_norm = corpus / norms
    return corpus_norm @ query_norm.T

scores = cosine_similarity_matrix(query_reduced, corpus_reduced)

The tradeoff: you lose some information during projection. Keep enough components to explain 90–95% of the variance in your corpus; check pca.explained_variance_ratio_.cumsum() to find that threshold. Going too low discards signal along with noise.
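To find that threshold without guessing, you can compute the cumulative explained variance over all components and pick the first count that reaches your target. The sketch below uses plain NumPy SVD so it runs without scikit-learn; `components_for_variance` is a hypothetical helper, and the toy data is synthetic.

```python
import numpy as np

def components_for_variance(vectors, target=0.95):
    """Smallest number of principal components whose cumulative
    explained variance ratio reaches `target`."""
    centered = vectors - vectors.mean(axis=0)
    # Squared singular values of the centered data are proportional
    # to the variance captured by each principal component.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    cumulative = np.cumsum(variance) / variance.sum()
    count = int(np.searchsorted(cumulative, target)) + 1
    return min(count, len(cumulative))

# Toy corpus: 5 meaningful directions embedded in 64 dimensions, plus faint noise
rng = np.random.default_rng(0)
latent = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 64))
toy_corpus = latent + 0.01 * rng.standard_normal((500, 64))

k = components_for_variance(toy_corpus, target=0.95)
print(f"components needed for 95% variance: {k}")
```

On real embeddings the answer is usually far less tidy than in this toy case, so treat the number as a starting point and confirm it against retrieval quality.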

UMAP is an alternative worth knowing. It preserves local neighborhood structure better than PCA in some cases, but it's non-linear and stochastic, so projecting new query vectors at runtime is slower and needs more careful setup than a single matrix multiplication. PCA is simpler and usually sufficient as a first step.

Fix 2: Use a Better Metric for High Dimensions

Sometimes the right move is to stop using cosine similarity for ranking and use something better suited to the geometry of your embedding space.

Dot product instead of cosine

If your embedding model was trained with a dot-product objective (many bi-encoder models are), using raw dot product rather than normalized cosine can give better results. The magnitude carries information about confidence that normalization throws away.
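A tiny example shows how the two scores can disagree once magnitudes differ. The vectors here are made up for illustration:

```python
import numpy as np

query = np.array([1.0, 1.0])
docs = np.array([
    [0.5, 0.5],   # same direction as the query, small magnitude
    [3.0, 2.0],   # slightly off-direction, large magnitude
])

dot_scores = docs @ query
cos_scores = dot_scores / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

print("dot   :", dot_scores, "-> best index:", int(np.argmax(dot_scores)))
print("cosine:", cos_scores, "-> best index:", int(np.argmax(cos_scores)))
```

Dot product prefers the large-magnitude document, while cosine prefers the perfectly aligned one. Which ranking is right depends on how the model was trained, so check the model's documentation before switching.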

L2 distance

Euclidean distance can preserve more useful ranking signal than cosine similarity in some settings, but only when vector magnitudes vary and carry meaning. If all your vectors are unit-normalized (common after embedding model output), squared L2 distance equals 2 - 2 × cosine similarity, so the two rank identically. In that case the representation matters more than the metric.
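The relationship for unit vectors is easy to verify numerically: the squared L2 distance is 2 minus twice the cosine similarity. A quick check on synthetic vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.standard_normal((1000, 64))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)   # unit-normalize rows
q = rng.standard_normal(64)
q /= np.linalg.norm(q)

cos = corpus @ q
sq_l2 = np.sum((corpus - q) ** 2, axis=1)

top_by_cos = np.argsort(-cos)[:10]    # highest similarity first
top_by_l2 = np.argsort(sq_l2)[:10]    # smallest distance first

print("identity holds:", np.allclose(sq_l2, 2 - 2 * cos))
print("same top-10   :", np.array_equal(top_by_cos, top_by_l2))
```

If both lines print True for your own vectors, switching between cosine and L2 will not change your results.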

Learned metrics

If you have labeled pairs (query, relevant doc, irrelevant doc), training a small metric-learning layer on top of your embeddings can dramatically improve precision. This is a bigger investment, but it directly optimizes for your task rather than a generic geometric property.

Fix 3: Normalize Scores Relative to the Distribution

If you're stuck with cosine similarity and can't change the dimensionality, at minimum normalize your scores against the observed distribution. A raw score of 0.87 means nothing; a score that sits in the 99th percentile of your actual distribution means a lot more.

import numpy as np

def normalized_rank_score(query_vector, corpus_vectors, top_k=10):
    """
    Returns top_k indices with their min-max-normalized scores.
    Scores are scaled to this query's own score range; this is not a true percentile.
    """
    q_norm = query_vector / np.linalg.norm(query_vector)
    norms = np.linalg.norm(corpus_vectors, axis=1, keepdims=True)
    corpus_norm = corpus_vectors / norms
    raw_scores = corpus_norm @ q_norm

    # Normalize to [0, 1] relative to this query's distribution
    min_s, max_s = raw_scores.min(), raw_scores.max()
    if max_s - min_s < 1e-9:
        normalized = np.zeros_like(raw_scores)
    else:
        normalized = (raw_scores - min_s) / (max_s - min_s)

    top_indices = np.argsort(normalized)[::-1][:top_k]
    return [(idx, float(normalized[idx])) for idx in top_indices]

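This won't fix false positives from random vector proximity, but it makes scores easier to read relative to each query's own range. Keep in mind that min-max scaling always gives the top result a score of 1.0, so for thresholds that behave consistently across queries, a percentile computed against a background sample of scores is the safer choice.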
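A true percentile needs a reference set of scores to compare against. One way to sketch it, assuming you keep a background sample of similarities (for example, the query scored against a random subset of the corpus); `percentile_scores` is a hypothetical helper name:

```python
import numpy as np

def percentile_scores(raw_scores, background_scores):
    """
    For each raw score, return the fraction of background scores
    that are less than or equal to it (0.0 to 1.0).
    """
    background = np.sort(np.asarray(background_scores, dtype=float))
    # side="right" counts background values <= each raw score
    ranks = np.searchsorted(background, np.asarray(raw_scores, dtype=float), side="right")
    return ranks / len(background)

# Toy background: 100 evenly spaced scores standing in for sampled similarities
background = np.arange(100)
print(percentile_scores([-1, 49.5, 99], background))
```

With this in place, a filter such as "keep results at or above 0.99" means "better than 99% of the background sample", which stays meaningful even when the raw score range shifts.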

Fix 4: Approximate Nearest Neighbor Libraries

Libraries like FAISS, Annoy, and HNSWlib don't change the underlying metric math, but their indexing structures make it cheap to pull a larger candidate pool than an exact scan would allow at scale. Retrieving a larger set of candidates (say, top-200) and then re-ranking with a more expensive cross-encoder model is a common pattern that sidesteps the worst of the high-dimension problem.

The re-ranking step can use a model that looks at query and document together (not independently encoded), which captures interaction features that bi-encoder embeddings miss entirely. This two-stage retrieval plus re-ranking architecture is now the standard approach in production semantic search for exactly this reason.
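The shape of that pipeline can be sketched in a few lines. This is an illustration under stated assumptions: brute-force cosine similarity stands in for the ANN index, and `rerank_fn` is a hypothetical callable (document index in, relevance score out) standing in for a cross-encoder.

```python
import numpy as np

def retrieve_then_rerank(query_vector, corpus_vectors, rerank_fn,
                         n_candidates=200, top_k=10):
    """
    Stage 1: pick n_candidates by cosine similarity (stand-in for an ANN index).
    Stage 2: re-score only those candidates with rerank_fn(doc_index) -> float.
    Returns a list of (doc_index, rerank_score), best first.
    """
    q = query_vector / np.linalg.norm(query_vector)
    norms = np.linalg.norm(corpus_vectors, axis=1, keepdims=True)
    norms = np.where(norms == 0, 1e-9, norms)  # avoid division by zero
    scores = (corpus_vectors / norms) @ q

    n_candidates = max(1, min(n_candidates, len(scores)))
    # argpartition finds the top n_candidates without a full sort
    candidates = np.argpartition(-scores, n_candidates - 1)[:n_candidates]

    reranked = sorted(
        ((int(i), float(rerank_fn(int(i)))) for i in candidates),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return reranked[:top_k]
```

The expensive model only ever sees n_candidates documents per query, which is what keeps the second stage affordable.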

Common Pitfalls to Avoid

  • Setting absolute thresholds on raw scores. A cutoff like "only show results above 0.8" makes assumptions about the score distribution that break as your corpus or embedding model changes. Use percentile-based thresholds or dynamic calibration instead.
  • Mixing normalized and unnormalized vectors. If some vectors in your index are unit-normalized and others aren't, cosine similarity becomes inconsistent. Pick one convention and enforce it at ingestion time.
  • Assuming the embedding model's output space is isotropic. Many embedding models produce vectors that cluster in certain regions of the space (anisotropy). This makes raw cosine comparisons even less reliable. Some models offer post-hoc whitening transforms to address this; check the model's documentation.
  • Skipping evaluation on real queries. Synthetic tests with random vectors reveal the math but not your actual failure modes. Build a small labeled evaluation set from real user queries and measure precision at K, not just similarity scores.
  • Reducing dimensions too aggressively. Dropping from 768 to 16 dimensions to "make it faster" will destroy semantic information. Profile what dimensionality your downstream task actually needs before committing.
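The anisotropy pitfall above can be made concrete with a toy example. When every vector shares a large common component, all pairs look similar; subtracting the corpus mean is a simple first step before any full whitening transform. The data below is synthetic and `mean_pairwise_cosine` is a hypothetical helper.

```python
import numpy as np

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of rows."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(unit)
    # Exclude the diagonal (each vector compared with itself)
    return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))

rng = np.random.default_rng(0)
# Toy anisotropic embeddings: a large shared offset plus per-document variation
shared_offset = 3.0 * rng.standard_normal(256)
embeddings = shared_offset + rng.standard_normal((500, 256))

corpus_mean = embeddings.mean(axis=0)
centered = embeddings - corpus_mean  # subtract the same mean from queries too

before = mean_pairwise_cosine(embeddings)
after = mean_pairwise_cosine(centered)
print(f"mean pairwise cosine before centering: {before:.3f}")
print(f"mean pairwise cosine after centering : {after:.3f}")
```

If the "before" value on a sample of your own corpus is far above zero, a high raw score says little by itself, because unrelated documents already score high against each other.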

When Cosine Similarity Is Still Fine

Not every use case is broken. If you're working in genuinely low-dimensional spaces (under roughly 100 dimensions), cosine similarity behaves well and the distribution concerns don't apply at the same scale. If your corpus is small (a few thousand documents), random false positives are unlikely to surface. If you care only about the top-1 result and that result has a large natural separation from everything else, the score compression won't matter in practice.

The red flag is combining high-dimensional embeddings, a large corpus, and strict score thresholds. That's the combination that makes the failure mode visible to users.

Wrapping Up

Cosine similarity isn't broken; it's just being asked to do something geometry makes difficult in high-dimensional spaces. The match scores you're seeing are mathematically correct; they're just not as informative as they look. Here's what to do next:

  1. Run the distribution diagnostic above against your live index. Check whether your score range is compressed and whether your top results are clearly separated from the rest.
  2. If you're seeing compression, apply PCA to reduce embeddings to 128–256 dimensions and measure whether retrieval quality improves on your labeled evaluation set.
  3. Switch from absolute score thresholds to percentile-based thresholds or dynamic calibration tied to each query's score distribution.
  4. If retrieval quality matters enough to justify the infrastructure, implement a two-stage pipeline: ANN retrieval for candidates, then a cross-encoder re-ranker for final ranking.
  5. Add a small labeled evaluation set if you don't have one already. You can't measure improvements without ground truth, and debugging search quality by eye doesn't scale.
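For the evaluation set in step 5, precision at K is a reasonable first metric. A minimal sketch, assuming you have a ranked list of document ids per query and a set of ids labeled relevant; `precision_at_k` is a hypothetical helper name:

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k retrieved ids that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = list(ranked_ids)[:k]
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in top if doc_id in relevant)
    # Dividing by k (not len(top)) penalizes queries that return too few results
    return hits / k

ranked = [1, 2, 3, 4, 5]
labeled_relevant = {2, 4, 9}
print(precision_at_k(ranked, labeled_relevant, k=5))
print(precision_at_k(ranked, labeled_relevant, k=2))
```

Average this over all labeled queries before and after each change, so you compare retrieval quality rather than raw similarity scores.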
