Embedding Quantization Trade-offs: When Shrinking Vectors Kills Recall
You've built a vector search pipeline that works beautifully in staging. Then someone asks why the index is consuming 40 GB of RAM and suggests you "just quantize it." You do. Recall drops from 95% to 71%, and nobody notices until users start complaining that search returns irrelevant results.
Quantization is genuinely useful, but the trade-off between size and accuracy is rarely linear. A bad choice at this stage can quietly poison your retrieval quality in ways that are hard to debug after the fact.
This article covers:
- What quantization actually does to your embedding vectors
- The three main quantization schemes and where each one breaks down
- How to measure recall degradation before you ship
- Practical strategies for recovering accuracy when you need both speed and size
- When to skip quantization entirely
What Quantization Does to an Embedding
An embedding is a list of floating-point numbers, typically 768, 1536, or 3072 of them, stored as 32-bit floats. Each dimension captures some aspect of meaning learned during training. At full precision, a single 1536-dimensional vector takes about 6 KB (1536 × 4 bytes). Multiply that by ten million documents and you're looking at roughly 60 GB just for the raw vectors.
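To make that arithmetic reusable, here is a minimal sketch that computes raw vector storage for the precisions discussed in this article. The function names and the scheme labels are illustrative, and the numbers cover only the vectors themselves, not index overhead such as graph links or metadata.

```python
# Bytes needed per dimension under each representation.
BYTES_PER_DIM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "binary": 1 / 8}

def vector_bytes(dims: int, scheme: str = "fp32") -> float:
    """Raw storage for one vector, ignoring any index overhead."""
    return dims * BYTES_PER_DIM[scheme]

def corpus_gb(num_vectors: int, dims: int, scheme: str = "fp32") -> float:
    """Raw storage for a whole corpus, in decimal gigabytes (10**9 bytes)."""
    return num_vectors * vector_bytes(dims, scheme) / 1e9

for scheme in BYTES_PER_DIM:
    per_vector = vector_bytes(1536, scheme)
    total = corpus_gb(10_000_000, 1536, scheme)
    print(f"{scheme:>6}: {per_vector:6.0f} bytes/vector, {total:6.2f} GB for 10M vectors")
```

Plugging in your own vector count and dimensionality gives a quick upper bound on what each scheme can save before you measure anything else.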
Quantization reduces precision by mapping those 32-bit floats to a smaller representation: 16-bit floats, 8-bit integers, or even single bits. The idea is that nearby numbers in high-dimensional space are still nearby after rounding, which is true up to a point.
The problem is that embedding distances are computed from tiny differences across hundreds of dimensions. When you round each dimension, the rounding errors accumulate. Two vectors that were genuinely closer than a threshold can appear farther apart after quantization, causing your index to miss them entirely.
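You can watch the accumulation happen on synthetic data. The sketch below rounds every value to an evenly spaced grid and measures how far the dot-product score moves as a result. The fixed [-1, 1] range, the helper names, and the random unit vectors are all assumptions for illustration; real embeddings have structure that random data lacks, so treat the output as a demonstration of the mechanism, not a prediction for your model.

```python
import numpy as np

def uniform_quantize(x, bits, lo=-1.0, hi=1.0):
    """Round each value to the nearest of 2**bits evenly spaced levels in [lo, hi]."""
    step = (hi - lo) / (2**bits - 1)
    return np.round((np.clip(x, lo, hi) - lo) / step) * step + lo

def mean_score_error(vectors, query, bits):
    """Mean absolute change in dot-product score after rounding both sides."""
    exact = vectors @ query
    approx = uniform_quantize(vectors, bits) @ uniform_quantize(query, bits)
    return float(np.mean(np.abs(exact - approx)))

rng = np.random.default_rng(0)
vectors = rng.normal(size=(2000, 768))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # unit-normalize
query = vectors[0]

for bits in (16, 8, 4):
    print(bits, "bits:", mean_score_error(vectors[1:], query, bits))
```

Each dimension contributes only a small rounding error, but the score sums over all 768 of them, so the error in the final score grows as the grid gets coarser.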
The Three Quantization Schemes You'll Actually Encounter
FP16: Half-Precision Float
FP16 stores each dimension as a 16-bit float instead of 32-bit. You get a 2x memory reduction with minimal accuracy loss for most embedding models. The dynamic range of FP16 is narrower than FP32, but modern embedding models are trained to produce values in ranges that fit comfortably. This is almost always a safe first step.
Most vector databases (Qdrant, Weaviate, Milvus) support FP16 natively. You often don't even need to change your ingestion code: just flip a flag at collection creation time.
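If you want to see the cost of the cast before touching your database, a plain NumPy round trip is enough. This is a sketch on random unit vectors (an assumption for illustration); the helper name is hypothetical, and it does not exercise any database-specific option.

```python
import numpy as np

def fp16_roundtrip_error(vectors: np.ndarray) -> float:
    """Largest absolute per-value error introduced by a float32 -> float16 cast."""
    half = vectors.astype(np.float16)
    return float(np.max(np.abs(vectors - half.astype(np.float32))))

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 768)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # unit-normalize

half = vectors.astype(np.float16)
print("size ratio:", half.nbytes / vectors.nbytes)
print("max abs error:", fp16_roundtrip_error(vectors))
```

Running the same function over a sample of your real embeddings shows whether any values fall outside the range FP16 represents well.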
INT8: 8-bit Integer Quantization
INT8 represents each dimension as an 8-bit integer, giving you a 4x size reduction versus FP32. To do this, the quantizer needs to map the continuous float range of each dimension to 256 discrete buckets. That mapping introduces real error, and the quality depends heavily on how the mapping is calibrated.
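Calibration is where implementations differ. Scalar quantization maps each dimension to an integer using a min/max range, which can be one global range for the whole vector space or a separate range per dimension. A global range is simplest, but it wastes buckets on dimensions that only use a narrow slice of it. Product quantization (PQ) is often mentioned alongside scalar quantization, but it is a different compression scheme rather than a way of calibrating INT8: it splits the vector into sub-vectors and replaces each one with the index of its nearest centroid in a learned codebook, which tracks the actual data distribution more closely at the cost of a training step.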
For most general-purpose text embeddings, INT8 with scalar quantization gives acceptable recall: expect to lose somewhere between 2 and 8 percentage points on standard benchmarks. But the loss is highly model- and dataset-dependent, so always measure on your own data.
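For reference, here is a minimal sketch of per-dimension scalar quantization in NumPy. The function names are illustrative, and the codes are stored as unsigned bytes (0 to 255) for simplicity; a given vector database may use signed integers, quantile-based bounds, or a different rounding rule.

```python
import numpy as np

def calibrate(sample: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Per-dimension offset and bucket width from a calibration sample of shape (n, d)."""
    lo = sample.min(axis=0)
    hi = sample.max(axis=0)
    scale = np.maximum(hi - lo, 1e-12) / 255.0  # guard against constant dimensions
    return lo, scale

def quantize(vectors: np.ndarray, lo: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Map floats to 256 buckets; values outside the calibrated range are clipped."""
    codes = np.round((vectors - lo) / scale)
    return np.clip(codes, 0, 255).astype(np.uint8)

def dequantize(codes: np.ndarray, lo: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Approximate reconstruction of the original floats from the bucket codes."""
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64)).astype(np.float32)
lo, scale = calibrate(vectors)
codes = quantize(vectors, lo, scale)
print("size ratio:", codes.nbytes / vectors.nbytes)
print("max abs error:", float(np.max(np.abs(dequantize(codes, lo, scale) - vectors))))
```

For values inside the calibrated range, the reconstruction error per dimension is bounded by half a bucket width, which is why the width of that range matters so much.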
Binary Quantization
Binary quantization takes each dimension and collapses it to a single bit: positive values become 1, negative values become 0. A 1536-dimensional vector becomes 192 bytes instead of 6 KB, a 32x compression. Hamming distance replaces cosine similarity, and modern CPUs can compute it with POPCNT instructions in a handful of nanoseconds.
The catch is significant. Binary quantization works well only for embedding models that are explicitly trained or fine-tuned to be binary-friendly, meaning models where dimensions cluster reliably above or below zero. For general-purpose embeddings, binary quantization can crater recall by 20 to 40 points. OpenAI has published guidance indicating that some of their newer models tolerate binary quantization better than older ones, but you should verify this empirically rather than taking it on faith.
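The mechanics are simple enough to sketch in a few lines of NumPy. The example below packs sign bits eight to a byte and computes Hamming distance with a byte-level lookup table, which stands in for the hardware POPCNT instruction. The function names are illustrative and this is not how any particular database implements it.

```python
import numpy as np

def binarize(vectors: np.ndarray) -> np.ndarray:
    """Sign-quantize: 1 where a value is positive, 0 otherwise, packed 8 dims per byte."""
    return np.packbits(vectors > 0, axis=-1)

# Lookup table: number of set bits in every possible byte value.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def hamming(query_code: np.ndarray, codes: np.ndarray) -> np.ndarray:
    """Hamming distance from one packed code to each row of `codes`."""
    differing = np.bitwise_xor(codes, query_code)
    return POPCOUNT[differing].sum(axis=1, dtype=np.int64)

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 1536)).astype(np.float32)
codes = binarize(vectors)
print("bytes per vector:", codes.shape[1])
print("nearest by Hamming:", np.argsort(hamming(codes[0], codes))[:5])
```

Everything except the sign of each dimension is thrown away here, which is why models whose dimensions hover near zero lose so much recall.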
How Recall Degradation Actually Happens
It helps to understand the failure mode mechanically. Vector search works by finding approximate nearest neighbors (ANN). Your index builds a graph or tree structure over the quantized vectors, then traverses that structure at query time.
When quantization shifts a vector's position in space, two things go wrong. First, the index may route queries down the wrong branches of its search graph, never reaching the true nearest neighbors. Second, even when the right candidates are in the shortlist, their distances may be mis-ranked, so the wrong one floats to the top.
The result is a drop in recall@k: the fraction of true nearest neighbors that appear in your top-k results. A drop from 0.95 to 0.80 means one in five genuinely relevant results is being silently discarded before it ever reaches your re-ranker or application logic.
Measuring Recall Before You Commit
Never deploy a quantized index without measuring recall on a representative sample of your data. Here's a minimal evaluation loop you can run locally:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def recall_at_k(full_vectors, quantized_vectors, queries, quantize_vector, k=10):
    """
    Compare top-k results from full-precision vs quantized vectors.
    `quantize_vector` is your quantization function, applied to each query.
    Returns the mean recall across all queries.
    """
    recalls = []
    for q in queries:
        # Ground truth: top-k from full precision
        full_scores = cosine_similarity([q], full_vectors)[0]
        true_top_k = set(np.argsort(full_scores)[::-1][:k])

        # Quantized result: the query goes through the same quantization as the index
        q_quant = quantize_vector(q)
        quant_scores = cosine_similarity([q_quant], quantized_vectors)[0]
        quant_top_k = set(np.argsort(quant_scores)[::-1][:k])

        recall = len(true_top_k & quant_top_k) / k
        recalls.append(recall)
    return float(np.mean(recalls))
Run this against a sample of at least a few hundred queries drawn from real production traffic, not synthetic test data. Synthetic queries often don't reflect the long-tail distribution that causes the most recall failures.
A recall drop of under 3 to 4 points is usually acceptable for most retrieval-augmented generation (RAG) pipelines, especially when a re-ranker is downstream. A drop of more than 10 points is a signal to reconsider your approach.
Strategies for Recovering Recall
Rescoring with Full-Precision Vectors
The most effective pattern is to use quantized vectors only for the initial ANN pass, then re-score the candidate shortlist using full-precision vectors stored separately. This is sometimes called two-stage retrieval or quantized ANN + exact rescoring.
You retrieve, say, the top 100 candidates cheaply using the quantized index, then compute exact distances against the full-precision vectors for only those 100 candidates before returning the top 10 to the user. The full-precision pass is cheap because it operates on a tiny set, and the quantized pass is cheap because it scans compressed vectors. You get most of the memory benefit with most of the accuracy.
# Pseudocode: adapt to your vector DB client
from sklearn.metrics.pairwise import cosine_similarity

def two_stage_search(query_vector, quantized_index, full_vectors, k=10, overretrieve=100):
    # Stage 1: fast ANN over quantized index
    # (`search` stands in for whatever query call your client exposes)
    candidates = quantized_index.search(query_vector, top_k=overretrieve)

    # Stage 2: exact rescore over full-precision candidates
    # (assumes full_vectors is a NumPy array indexed by integer id)
    candidate_ids = [c.id for c in candidates]
    candidate_vecs = full_vectors[candidate_ids]
    scores = cosine_similarity([query_vector], candidate_vecs)[0]
    ranked = sorted(zip(candidate_ids, scores), key=lambda x: -x[1])
    return ranked[:k]
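If you want to see the effect without a vector database, the following self-contained sketch runs the same two-stage idea with brute-force NumPy: a Hamming shortlist over sign bits stands in for the quantized ANN pass, and an exact dot product over unit vectors does the rescoring. The synthetic data, the function names, and the use of binary codes for stage one are assumptions chosen to keep the example short.

```python
import numpy as np

def exact_top_k(query, vectors, k):
    """Ground truth: indices of the k highest dot products (inputs are unit-normalized)."""
    return np.argsort(-(vectors @ query), kind="stable")[:k]

def two_stage(query, vectors, codes, k=10, overretrieve=100):
    """Hamming shortlist over packed sign bits, then exact rescoring of the shortlist."""
    q_code = np.packbits(query > 0)
    dists = np.unpackbits(codes ^ q_code, axis=1).sum(axis=1)
    shortlist = np.argsort(dists, kind="stable")[:overretrieve]
    order = np.argsort(-(vectors[shortlist] @ query), kind="stable")[:k]
    return shortlist[order]

def mean_recall(queries, vectors, codes, k, overretrieve):
    """Mean recall@k of two_stage against the exact full-precision top-k."""
    hits = 0
    for q in queries:
        found = two_stage(q, vectors, codes, k, overretrieve)
        hits += len(set(found.tolist()) & set(exact_top_k(q, vectors, k).tolist()))
    return hits / (k * len(queries))

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5000, 256)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
# Queries are noisy copies of corpus vectors, re-normalized to unit length.
queries = vectors[:50] + 0.05 * rng.normal(size=(50, 256)).astype(np.float32)
queries /= np.linalg.norm(queries, axis=1, keepdims=True)
codes = np.packbits(vectors > 0, axis=1)

# overretrieve == k means no real rescoring: the result is the raw Hamming top-k.
for overretrieve in (10, 100, 1000):
    print(overretrieve, mean_recall(queries, vectors, codes, k=10, overretrieve=overretrieve))
```

The over-retrieval factor is the knob to tune: a larger shortlist can only keep or raise recall, and it costs more exact distance computations per query.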
Increasing the ef/nprobe Search Parameter
Most ANN indexes expose a parameter that controls how broadly the search explores the graph at query time. In HNSW-based indexes this is typically ef_search; in IVF-based indexes it's nprobe. Increasing this parameter makes the search slower but recovers recall, because it explores more candidate paths before deciding on the shortlist.
After quantizing, try doubling your current ef_search value and re-measure recall. You'll pay a latency penalty, but it's often smaller than the latency saved by the reduced memory footprint (fewer cache misses, faster distance computations).
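To see why widening the search recovers recall, here is a toy inverted-file index in NumPy with an nprobe sweep. It is a deliberately simplified sketch: the centroids are random data points, not trained with k-means, the names are illustrative, and nothing here reflects how a production IVF or HNSW index is implemented. The same intuition applies to ef_search in graph indexes, where the search visits more of the graph.

```python
import numpy as np

def build_ivf(vectors, n_lists, rng):
    """Toy inverted-file index: centroids are random data points, not trained with k-means."""
    centroids = vectors[rng.choice(len(vectors), size=n_lists, replace=False)]
    assignments = np.argmax(vectors @ centroids.T, axis=1)
    return centroids, assignments

def ivf_search(query, vectors, centroids, assignments, k, nprobe):
    """Score only the vectors stored in the nprobe lists whose centroids best match the query."""
    probed = np.argsort(-(centroids @ query), kind="stable")[:nprobe]
    candidates = np.flatnonzero(np.isin(assignments, probed))
    scores = vectors[candidates] @ query
    return candidates[np.argsort(-scores, kind="stable")[:k]]

def recall_for_nprobe(queries, vectors, centroids, assignments, k, nprobe):
    """Mean recall@k of the toy index against an exhaustive scan."""
    hits = 0
    for q in queries:
        truth = set(np.argsort(-(vectors @ q), kind="stable")[:k].tolist())
        found = ivf_search(q, vectors, centroids, assignments, k, nprobe)
        hits += len(truth & set(found.tolist()))
    return hits / (k * len(queries))

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5000, 64)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
queries = vectors[rng.choice(len(vectors), size=50, replace=False)]
centroids, assignments = build_ivf(vectors, n_lists=32, rng=rng)

recalls = {n: recall_for_nprobe(queries, vectors, centroids, assignments, 10, n)
           for n in (1, 2, 4, 8, 16, 32)}
print(recalls)
```

Probing all 32 lists is equivalent to an exhaustive scan, so the useful setting is the smallest nprobe that meets your recall target on real queries.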
Choosing a Quantization-Friendly Embedding Model
Some newer embedding models are trained with quantization awareness built in. The model's training objective explicitly encourages dimensions to be robust to integer rounding. If you're choosing an embedding model and know you'll need to quantize, look for published quantization benchmarks from the model provider. The Massive Text Embedding Benchmark (MTEB) leaderboard sometimes includes quantized variants, so check it before you commit to a model.
Common Pitfalls
Calibrating on the wrong data. INT8 quantization requires a calibration pass to set the min/max range. If you calibrate on a small or non-representative sample, the bucket boundaries will be skewed and dimensions that fall outside the calibration range get clipped to the extreme bucket. Always calibrate on a large, diverse sample of your actual corpus.
Quantizing your query vectors differently from your index vectors. This sounds obvious but it happens. If your index uses INT8 scalar quantization with a specific calibration set, your query vector at search time needs to go through the same transformation with the same calibration parameters. Mismatched quantization is a silent accuracy killer: the distances will be computed but they'll be meaningless.
Assuming the same quantization scheme works across embedding models. A scheme that works fine for one model's embedding space can perform poorly on another. Embedding dimensions are not standardized in scale, variance, or distribution. Always evaluate per model, not per technique.
Forgetting that quantization interacts with your index type. HNSW and IVF indexes behave differently under quantization. An HNSW graph built over quantized vectors may have degraded graph connectivity compared to one built over full-precision vectors. Some databases let you build the graph in full precision and then quantize the stored vectors, which gives better results than building directly on quantized data.
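The calibration pitfall above is easy to check before you quantize anything. This sketch measures how many corpus values fall outside the per-dimension range learned from a calibration sample, which is exactly the share that would be clipped to an extreme bucket. The function name and the synthetic Gaussian corpus are assumptions for illustration; substitute your own embeddings.

```python
import numpy as np

def clipped_fraction(calibration_sample, corpus):
    """Share of corpus values outside the per-dimension [min, max] seen during calibration."""
    lo = calibration_sample.min(axis=0)
    hi = calibration_sample.max(axis=0)
    return float(np.mean((corpus < lo) | (corpus > hi)))

rng = np.random.default_rng(0)
corpus = rng.normal(size=(20000, 128)).astype(np.float32)

for n in (50, 500, 5000):
    sample = corpus[rng.choice(len(corpus), size=n, replace=False)]
    print(n, clipped_fraction(sample, corpus))
```

A random sample is the friendly case. A calibration set drawn from a single document type or time window can clip far more than its size alone would suggest, so run the check with the sample you actually intend to use.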
When to Skip Quantization Entirely
Quantization is a tool for a specific problem: you have too many vectors to fit in memory at acceptable cost, and your application can tolerate a small recall trade-off. If either of those conditions doesn't hold, don't bother.
If your corpus is under a few million documents and your server has enough RAM, FP32 with a well-tuned HNSW index will give you better recall at low latency without the complexity of managing quantization parameters, calibration sets, and two-stage pipelines.
Similarly, if you're in a domain where missing a relevant result carries a real cost (legal document retrieval, medical literature search, compliance workflows), the recall drop from aggressive quantization may not be acceptable regardless of the memory savings. Know your recall floor before you start compressing.
Wrapping Up
Quantization is not a free lunch. The memory savings are real, but so is the accuracy hit if you choose the wrong scheme or skip measurement. Here are concrete next steps:
- Benchmark recall on your actual data before any production rollout. Use the recall@k pattern above with real query samples from production logs.
- Start with FP16. It's the least risky reduction and often enough to solve the memory problem without touching recall in a meaningful way.
- If you need INT8, implement two-stage retrieval (quantized ANN + full-precision rescore) as the default pattern, not an afterthought.
- Calibrate your quantization parameters on a large, representative slice of your corpus, not your dev sample.
- If you're picking a new embedding model, check whether the provider publishes quantization benchmarks and factor that into your selection.