Hybrid Search Pitfalls: Why Combining BM25 and Vectors Can Hurt Recall

Hybrid search has become one of the most popular retrieval strategies in modern AI systems.

Instead of choosing between keyword search and vector search, organizations increasingly combine both approaches.

The logic seems obvious:

BM25 Strengths
+
Vector Strengths
=
Better Search

This philosophy powers:

Enterprise search engines
AI assistants
RAG systems
Knowledge bases
Ecommerce search
Recommendation platforms

In theory, hybrid retrieval should increase both precision and recall.

However, many teams discover an unexpected problem:

Hybrid search can actually reduce recall.

Results that were previously retrievable disappear.

Relevant documents become harder to find.

Search quality declines.

The system becomes more complicated while producing worse outcomes.

This surprises many engineers because hybrid retrieval is often presented as an automatic improvement over standalone BM25 or vector search.

In reality, implementation details matter enormously.

In this guide, you'll learn why hybrid search sometimes hurts recall, the most common failure modes, and how production retrieval systems avoid these pitfalls.

What You Will Learn From This Article

After reading this guide, you'll understand:

How BM25 works.
How vector search works.
Why hybrid retrieval exists.
Common recall failures.
Fusion strategy mistakes.
Candidate selection problems.
Best practices for production systems.

Understanding BM25

BM25 is one of the most widely used keyword-ranking algorithms.

It scores documents based on:

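Query terms
Term frequency (how often a query term appears in the document)
Inverse document frequency (terms that are rare across the corpus count more)
Document length (long documents are normalized so they do not win on size alone)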

Example query:

Python asyncio deadlock

BM25 prioritizes documents containing those exact terms.

Benefits include:

Fast retrieval
Strong lexical matching
High precision for exact keywords
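
To make the scoring inputs above concrete, here is a minimal sketch of Okapi BM25 in plain Python. The parameter values (k1 = 1.5, b = 0.75) and the three toy documents are assumptions for illustration; production engines use tuned implementations with proper tokenization.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # no lexical match, no contribution
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents need more matches to score the same.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "python asyncio deadlock when awaiting a lock".split(),
    "resolve async task freezing in the event loop".split(),
    "python packaging guide".split(),
]
scores = bm25_scores("python asyncio deadlock".split(), docs)
```

The second document describes the same problem in different words and scores exactly zero, which is the lexical gap that vector search is meant to close.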

Understanding Vector Search

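Vector search encodes queries and documents as embeddings (dense numeric vectors) and retrieves the documents whose vectors lie closest to the query vector.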

Example:

"Fix asyncio deadlock"

and

"Resolve async task freezing"

may generate similar vectors.

Benefits:

Semantic understanding
Synonym handling
Concept matching
Better natural language retrieval

This often improves search quality.
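
The idea of "similar vectors" can be sketched with cosine similarity. The three-dimensional vectors below are hand-written stand-ins, not output from a real embedding model, which would produce hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-written toy vectors, not output from a real embedding model.
fix_asyncio_deadlock = [0.9, 0.1, 0.3]
resolve_async_freezing = [0.8, 0.2, 0.4]
python_packaging_guide = [0.1, 0.9, 0.2]

similar = cosine_similarity(fix_asyncio_deadlock, resolve_async_freezing)
different = cosine_similarity(fix_asyncio_deadlock, python_packaging_guide)
```

With these made-up values, the two paraphrases score far closer to each other than either does to the unrelated text, even though they share no keywords.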

Why Hybrid Search Exists

Each retrieval method has weaknesses.

BM25 struggles with:

Synonyms
Paraphrasing
Semantic intent

Vector search struggles with:

Exact identifiers
Error codes
Product names
Rare keywords

Hybrid search attempts to combine both.

Architecture:

Query
↓
BM25 Search + Vector Search (run in parallel)
↓
Fusion
↓
Results

This sounds ideal.

Yet problems frequently emerge.

What Is Recall?

Recall measures:

Relevant Results Found
÷
Total Relevant Results

High recall means:

Most Relevant Documents Retrieved

Low recall means:

Important Documents Missed

Many hybrid systems unintentionally reduce recall.
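
The recall ratio above translates directly into code. This is a minimal sketch; the document IDs and relevance labels are invented for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant document IDs that appear in the top k retrieved."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

# Hypothetical ranked results and ground-truth labels for one query.
retrieved = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d4", "d8", "d2"}

recall_3 = recall_at_k(retrieved, relevant, 3)  # 1 of 4 relevant documents found
recall_5 = recall_at_k(retrieved, relevant, 5)  # 2 of 4 relevant documents found
```

Note that d8 and d2 are never retrieved at all, so no value of k over this result list can push recall above 0.5.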

Pitfall #1

Candidate Pool Truncation

A common architecture:

Top 20 BM25
+
Top 20 Vectors

Developers assume:

20 + 20 = Better Coverage

Not necessarily.

Suppose:

Relevant Document

ranks:

BM25 Rank 21

and:

Vector Rank 25

The document never enters the fusion stage.

Recall drops immediately.
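
The scenario above can be reproduced in a few lines. The rankings are synthetic: a document named "target" is placed at BM25 rank 21 and vector rank 25, and every other ID is filler.

```python
def candidate_pool(bm25_ranking, vector_ranking, k):
    """Union of the top-k document IDs from each retriever."""
    return set(bm25_ranking[:k]) | set(vector_ranking[:k])

# Synthetic rankings: "target" sits at BM25 rank 21 and vector rank 25.
bm25_ranking = [f"b{i}" for i in range(1, 21)] + ["target"]
bm25_ranking += [f"b{i}" for i in range(22, 101)]
vector_ranking = [f"v{i}" for i in range(1, 25)] + ["target"]
vector_ranking += [f"v{i}" for i in range(26, 101)]

in_pool_20 = "target" in candidate_pool(bm25_ranking, vector_ranking, 20)
in_pool_100 = "target" in candidate_pool(bm25_ranking, vector_ranking, 100)
```

At k = 20 the document is missing from the pool, so no fusion method can rank it. At k = 100 it is present and fusion gets a chance to promote it.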

Pitfall #2

Aggressive Score Thresholds

Example:

Vector Score > 0.80

or:

BM25 Score > 5

This may remove valuable results.

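Many relevant documents live near threshold boundaries, and raw BM25 scores are not comparable across queries, so a cutoff that suits one query can wipe out the results of another.

Hard cutoffs often reduce recall dramatically.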
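
Here is a small sketch of the boundary effect, comparing a fixed cutoff of 0.80 with a plain top-k selection. The similarity scores and relevance labels are invented for illustration.

```python
# Hypothetical vector similarity scores and relevance labels for one query.
vector_scores = {"d1": 0.91, "d2": 0.83, "d3": 0.79, "d4": 0.78, "d5": 0.41}
relevant = {"d1", "d3", "d4"}

# Hard cutoff: keep only documents scoring above 0.80.
by_threshold = {doc for doc, score in vector_scores.items() if score > 0.80}

# Rank-based alternative: keep the four best documents regardless of score.
ranked = sorted(vector_scores, key=vector_scores.get, reverse=True)
by_top_k = set(ranked[:4])

found_by_threshold = len(by_threshold & relevant)
found_by_top_k = len(by_top_k & relevant)
```

Two of the three relevant documents sit just under the cutoff (0.79 and 0.78) and are dropped, while the top-k selection keeps them.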

Pitfall #3

Score Normalization Problems

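BM25 scores:

Unbounded (often 0–30+, depending on query length and corpus statistics)

Vector similarity:

Cosine similarity between -1 and 1 (in practice usually 0–1)

These scales are fundamentally different.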

Naive fusion:

BM25 + Cosine Similarity

often produces poor ranking behavior.

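Documents can disappear simply because the score distributions are incompatible: the larger BM25 values dominate the sum, so documents found only by vector search sink below the final cutoff.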
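
The effect is easy to see with made-up numbers. The sketch below sums raw scores, then sums min-max normalized scores. Min-max scaling is only one possible normalization and is sensitive to outliers, so treat it as an illustration rather than a recommendation.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the 0-1 range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def fuse_by_sum(bm25, vector):
    """Sum two score mappings; a document missing from one list gets 0 there."""
    docs = set(bm25) | set(vector)
    fused = {d: bm25.get(d, 0.0) + vector.get(d, 0.0) for d in docs}
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical raw scores for one query.
bm25 = {"a": 24.0, "b": 18.0, "c": 6.0}
vector = {"c": 0.92, "d": 0.90, "a": 0.60, "e": 0.52}

naive = fuse_by_sum(bm25, vector)
normalized = fuse_by_sum(min_max(bm25), min_max(vector))
```

In the naive ranking the order simply follows BM25, and "d", a strong vector-only match, lands behind every BM25 hit. After normalization it moves up a place and the top vector match "c" overtakes "b".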

Pitfall #4

Overweighting BM25

Example:

BM25 Weight = 90%

Vector Weight = 10%

Result:

Hybrid Search
≈
Keyword Search

Semantic matches become nearly irrelevant.

Many important documents are never surfaced.

Pitfall #5

Overweighting Vectors

The opposite problem:

BM25 Weight = 10%

Vector Weight = 90%

Now:

Product IDs
Error messages
Technical identifiers

become difficult to retrieve.

Recall drops for lexical searches.
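
Both weighting pitfalls can be sketched with a single weighted-sum function. The scores are hypothetical and assumed to be normalized to 0-1 already; the document names are invented.

```python
def weighted_fusion(bm25_norm, vector_norm, bm25_weight):
    """Rank documents by a weighted sum of two normalized score mappings."""
    docs = set(bm25_norm) | set(vector_norm)
    fused = {
        d: bm25_weight * bm25_norm.get(d, 0.0)
        + (1 - bm25_weight) * vector_norm.get(d, 0.0)
        for d in docs
    }
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical scores, already normalized to 0-1.
bm25_norm = {"sku-4411": 1.0, "returns-policy": 0.4}
vector_norm = {"refund-howto": 1.0, "returns-policy": 0.7}

keyword_heavy = weighted_fusion(bm25_norm, vector_norm, bm25_weight=0.9)
vector_heavy = weighted_fusion(bm25_norm, vector_norm, bm25_weight=0.1)
```

At 90% BM25 the best semantic match falls to last place; at 10% BM25 the exact identifier match does. With a short final result list, either one can drop out entirely.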

Pitfall #6

Intersection-Based Retrieval

Some systems require:

Document Appears In
BM25 Results
AND
Vector Results

This seems reasonable.

In practice:

Recall Collapses

Many relevant documents appear in only one retrieval system.

Intersection filtering often removes them.
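
A quick sketch with invented result sets shows how much an AND filter can cost compared with a union.

```python
# Hypothetical candidate sets and relevance labels for one query.
bm25_hits = {"d1", "d2", "d3", "d4"}
vector_hits = {"d3", "d4", "d5", "d6"}
relevant = {"d1", "d3", "d5", "d6"}

def recall(candidates, relevant):
    """Fraction of relevant documents present in the candidate set."""
    return len(candidates & relevant) / len(relevant)

intersection_recall = recall(bm25_hits & vector_hits, relevant)
union_recall = recall(bm25_hits | vector_hits, relevant)
```

Only d3 is both relevant and present in both lists, so the intersection keeps one relevant document out of four. The union keeps all four.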

Pitfall #7

Small Candidate Sets

Example:

Top 10 BM25

Top 10 Vectors

for a corpus containing:

10 Million Documents

Candidate diversity becomes insufficient.

Many relevant results never reach ranking stages.

Pitfall #8

Poor Chunking Strategy

In RAG systems:

Document
↓
Chunking
↓
Embedding

Poor chunk design harms both:

BM25
Vector search

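Example:

Huge Chunk

mixes several topics, so its embedding blurs them together and BM25 length normalization dilutes each term match.

Or:

Tiny Chunk

splits a passage so that the query terms and the context explaining them land in different chunks.

Both can reduce recall.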
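
One common middle ground is fixed-size chunks with some overlap, so that text near a boundary appears in two chunks. The sketch below splits on whitespace; the default sizes are placeholder assumptions, not recommendations, and real pipelines often split on sentences or headings instead.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into windows of chunk_size words that overlap by `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

# Tiny demonstration: ten words, windows of four, one word of overlap.
sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, chunk_size=4, overlap=1)
```

Here w3 and w6 each appear in two neighboring chunks, which is the overlap at work.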

Pitfall #9

Embedding Quality Problems

Not all embeddings are retrieval-focused.

Weak embeddings produce:

Poor Candidate Retrieval

Hybrid search cannot recover documents that never enter the candidate pool.

Garbage in.

Garbage out.

Pitfall #10

Query Intent Mismatch

Consider:

ERR_CONNECTION_RESET

BM25 excels.

Vector search may struggle.

Now consider:

Why does my website randomly disconnect?

Vector search excels.

BM25 struggles.

Different queries require different retrieval strengths.

Static fusion weights often hurt recall.
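
One way to move away from static weights is to choose them per query. The sketch below uses a deliberately crude, hypothetical heuristic: if any token looks like an identifier, lean on BM25. The function names and the 0.7 / 0.3 values are assumptions for illustration, and the heuristic will misfire on ordinary queries that happen to contain numbers.

```python
def looks_like_identifier(token):
    """Crude, hypothetical test for error codes, SKUs, and similar tokens."""
    has_marker = any(ch.isdigit() or ch == "_" for ch in token)
    is_acronym = len(token) >= 3 and token.isalpha() and token.isupper()
    return has_marker or is_acronym

def bm25_weight_for(query, lexical_weight=0.7, semantic_weight=0.3):
    """Pick the BM25 share of the fusion weight from the shape of the query."""
    if any(looks_like_identifier(token) for token in query.split()):
        return lexical_weight
    return semantic_weight

code_query_weight = bm25_weight_for("ERR_CONNECTION_RESET")
question_weight = bm25_weight_for("Why does my website randomly disconnect?")
```

The error-code query gets the lexical weighting and the natural-language question gets the semantic one. A production system would more likely learn this routing from labeled queries than hard-code it.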

Why RAG Systems Are Especially Vulnerable

RAG pipelines typically follow:

User Query
↓
Retriever
↓
Top Documents
↓
LLM

If retrieval misses relevant content:

LLM Never Sees It

No amount of prompt engineering can fix missing context.

Recall becomes critical.

Better Fusion Strategies

Instead of:

Score Averaging

consider:

Reciprocal Rank Fusion (RRF)

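How it works:

Each document scores 1 / (k + rank) for every result list it appears in, and the scores are summed across lists. The constant k (60 is a common default) damps the influence of the very top ranks. Only ranks are used, never raw scores.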

Benefits:

Robust ranking
Better recall
Less score normalization complexity

RRF is increasingly popular in production systems.
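
A minimal sketch of RRF in plain Python follows. It assumes each retriever returns an ordered list of document IDs, and it uses k = 60, a commonly used default. The rankings are invented.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs; each list adds 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from the two retrievers.
bm25_ranking = ["a", "b", "c"]
vector_ranking = ["c", "a", "d"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
# fused == ["a", "c", "b", "d"]
```

Documents found by both retrievers rise to the top, while documents found by only one ("b" and "d") still make the fused list. Because the function works on the union of the lists, it avoids the intersection problem by construction.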

Use Larger Candidate Pools

Instead of:

Top 10

consider:

Top 100

or:

Top 500

before fusion.

Benefits:

Improved coverage
Better recall
More diverse results

Add Re-Ranking

A common architecture:

BM25
+
Vectors
↓
Large Candidate Set
↓
Cross Encoder
↓
Final Ranking

This often produces the best balance of recall and precision.
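
The re-ranking stage can be sketched as a function that takes any scoring callable. The overlap_score function below is a toy stand-in for a cross-encoder, included only so the example runs without a model; in practice score_fn would wrap a trained model that reads the query and the document together.

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Re-order candidate texts by score_fn(query, text) and keep the best top_n."""
    ordered = sorted(candidates, key=lambda text: score_fn(query, text), reverse=True)
    return ordered[:top_n]

def overlap_score(query, text):
    """Toy stand-in for a cross-encoder: share of query tokens found in the text."""
    query_tokens = set(query.lower().split())
    return len(query_tokens & set(text.lower().split())) / len(query_tokens)

# Hypothetical fused candidate set.
candidates = [
    "Python packaging guide",
    "Debugging an asyncio deadlock in Python",
    "Asyncio deadlock basics",
]
top = rerank("python asyncio deadlock", candidates, overlap_score, top_n=2)
```

The design point is the split of responsibilities: the retrievers supply a wide candidate set for recall, and the expensive scorer runs only on that set to restore precision.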

Measure Recall Properly

Many teams optimize:

Click Through Rate

or:

Similarity Scores

Instead, evaluate:

Recall@K
MRR
NDCG
Precision@K

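Because they are computed against labeled relevance judgments, these metrics reveal retrieval problems that click-through rate and similarity scores hide, such as relevant documents that were never returned.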
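
Here is a minimal sketch of Precision@K, MRR, and NDCG with binary relevance labels. The result lists and labels are invented; a real evaluation would average over a set of labeled user queries.

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Share of the top k results that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(k, len(relevant))
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / ideal if ideal else 0.0

# Hypothetical results and labels.
retrieved = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d4", "d8", "d2"}

p5 = precision_at_k(retrieved, relevant, 5)
mrr = mean_reciprocal_rank([(retrieved, relevant), (["x"], {"x"})])
ndcg5 = ndcg_at_k(retrieved, relevant, 5)
```

Precision@K and MRR look only at what was returned, so they are best read alongside Recall@K, which is the metric that penalizes relevant documents missing from the list.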

Best Practices Checklist

When implementing hybrid retrieval:

✅ Use large candidate pools

✅ Avoid hard thresholds

✅ Normalize scores carefully

✅ Consider Reciprocal Rank Fusion

✅ Evaluate Recall@K

✅ Test real user queries

✅ Tune fusion weights

✅ Use retrieval-focused embeddings

✅ Re-rank final candidates

✅ Review failure cases regularly

Common Mistakes to Avoid

Avoid:

❌ Small candidate sets

❌ Intersection-only retrieval

❌ Aggressive filtering

❌ Blind score averaging

❌ Static weighting assumptions

❌ Ignoring recall metrics

❌ Evaluating only synthetic queries

Real-World Example

A support knowledge base uses:

Top 20 BM25
+
Top 20 Vector Results

with:

Intersection Filtering

Recall:

68%

After switching to:

Union
+
RRF
+
Top 200 Candidates

Recall improves to:

91%

The retrieval methods remain the same.

Only the candidate pool size and the fusion strategy change.

Summary

Hybrid search is often presented as the best of both worlds, combining the lexical precision of BM25 with the semantic understanding of vector retrieval. While this approach can significantly improve search quality, poorly implemented hybrid systems frequently reduce recall by eliminating relevant documents before ranking even begins.

The most common causes include small candidate pools, score normalization mistakes, aggressive thresholds, intersection-based filtering, poor chunking, weak embeddings, and improperly balanced fusion weights. These issues are especially damaging in RAG systems, where missing documents cannot be recovered later in the pipeline.

Successful hybrid retrieval systems focus on maximizing candidate coverage, using robust fusion strategies such as Reciprocal Rank Fusion, employing large retrieval pools, and measuring recall directly rather than relying solely on ranking scores. By treating retrieval as a recall-first problem, teams can avoid the pitfalls that cause hybrid search systems to perform worse than the individual methods they were designed to improve.
