Embedding Indexing Cost Is Where Your Money Actually Goes
In a hybrid pipeline at scale, embedding computation and the vector index dominate cost. Stale embeddings are a second, quieter bill your users pay in quality.
In one Milvus 2.0 benchmark, roughly 97% of total indexing time was spent on embedding generation, with only 3% on database insertion (Milvus, 2022). That ratio is the heart of the cost problem in a hybrid pipeline: the lexical side is three decades into its life, compact and well-understood, while the vector side pushes every document through a neural network and then fights to keep the resulting graph healthy. Once a corpus is large enough to matter, the cost structure looks less like a software system and more like a compute bill with a search engine attached. The chapters on operations start here.
Where the money actually goes
Indexing a hybrid system requires maintaining both an inverted index and a vector index over the same corpus, and the two structures want opposite things. Inverted indexes prefer frequent small writes and amortize I/O through LSM-style segment merges (O'Neil et al., 1996). Vector indexes, especially HNSW, prefer infrequent large batches, because incremental updates create unreachable points in the graph and degrade recall. A hybrid platform has to reconcile those cadences without letting either side starve.
Bulk loading is roughly two orders of magnitude faster than single-document ingestion on most vector databases. Weaviate reports a 100x or larger speedup from batching (Weaviate, 2025). Pinecone caps batch upserts at 1,000 records or 2 MB per request and reports a 5x speedup from asynchronous batching (Pinecone, 2025). Qdrant recommends tiered strategies by collection size (Qdrant, 2025). A team ingesting one document at a time through any of these systems is paying a two-order-of-magnitude tax.
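The batching discipline above can be sketched in a few lines. This is a generic, hypothetical sketch, not any vendor's client API: `upsert` stands in for whatever bulk-write call your vector database exposes, and the 1,000-record default mirrors the Pinecone cap cited above.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Sequence


def batched(items: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield successive chunks of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def bulk_upsert(records: Iterable[dict],
                upsert: Callable[[Sequence[dict]], None],
                batch_size: int = 1000) -> int:
    """Send records in large batches: one round-trip per batch instead of
    one per document, which is where the ~100x speedup comes from."""
    sent = 0
    for chunk in batched(records, batch_size):
        upsert(chunk)  # single network round-trip for the whole chunk
        sent += len(chunk)
    return sent
```

The point is not the helper itself but the shape: any ingestion path that calls `upsert` once per document is leaving the two-order-of-magnitude speedup on the table.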
Embedding computation is not a one-time cost
It is tempting to treat embedding as a setup cost and forget about it. Corpus churn settles the matter. The GDELT Project documented that embedding three billion images costs 180,000 to 300,000 USD one-time via commercial APIs, with incremental ingestion of roughly 972K new images per day costing 58 to 97 USD (GDELT, 2024). On the same corpus, real-time streaming updates under one Google Vertex AI pricing model reach 9.3 million USD per month, because streaming triggers full index rebuilds (GDELT, 2024). That is a gap of more than three orders of magnitude between batch and streaming at scale on identical data.
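The size of that gap falls straight out of the cited figures. A quick back-of-the-envelope check, using the GDELT numbers from the paragraph above:

```python
# Figures reported by GDELT (2024) for the same 3B-image corpus.
batch_daily_usd = (58, 97)          # incremental batch ingestion, per day
streaming_monthly_usd = 9_300_000   # streaming updates with full index rebuilds

# Batch cost over a 30-day month, low and high end.
batch_monthly_usd = tuple(d * 30 for d in batch_daily_usd)  # (1740, 2910)

# Even against the high end of the batch estimate, streaming costs ~3,200x more.
ratio = streaming_monthly_usd / batch_monthly_usd[1]
```

Roughly 3,000 to 5,000 dollars of batch cost per month versus 9.3 million for streaming: the gap is architectural, not a pricing detail.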
Commercial API pricing helps but does not change the shape of the problem. OpenAI's text-embedding-3-small is priced at 0.02 USD per million tokens, and the Batch API adds a 50% discount (OpenAI, 2025). Those unit costs are attractive until they are multiplied by a re-embedding event. Switching models means a full reindex that scales linearly with corpus size and can take days or weeks at scale. This is why the embedding model selection decision is so hard to reverse: a mid-course change is not a sprint, it is a project.
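The linearity is worth making concrete. A minimal cost estimator, assuming the text-embedding-3-small price and batch discount quoted above (the function name and corpus size are illustrative):

```python
def reembedding_cost_usd(corpus_tokens: int,
                         price_per_million: float = 0.02,
                         batch_discount: float = 0.5) -> float:
    """Cost of re-embedding an entire corpus after a model switch.

    Scales linearly with corpus size: there is no shortcut once the
    model changes, only the unit price and the discount."""
    return corpus_tokens / 1_000_000 * price_per_million * (1 - batch_discount)


# A 10-billion-token corpus at 0.02 USD per million tokens with the
# 50% batch discount: 10_000 * 0.02 * 0.5 = 100 USD.
```

The dollar figure for a small model is modest; the schedule is not. The tokens still have to flow through the model, and throughput limits, not unit price, are what turn a mid-course model change into a multi-week project.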
The silent cost: stale embeddings
The second bill is paid in quality rather than dollars. Without correction mechanisms, stale cached embeddings lose 4 or more points in recall across retrieval tasks and 10 or more points in retrieval-augmented language model tasks compared to methods that maintain fresh embeddings (Yadav, 2024). When the model stays pinned and the corpus evolves, old embeddings still reflect the state of documents when they were indexed, and the relevance drop is slow enough that no alarm fires.
Practitioners report concrete thresholds for catching this. In stable systems, cosine distance between successive embedding batches stays between 0.0001 and 0.005, and 85 to 95% of nearest neighbors persist week-to-week. In drifting systems, cosine distances reach 0.05 or higher, and 25 to 40% of nearest neighbors change silently (Dev Community, 2025). Those numbers are practitioner-reported rather than controlled benchmarks, but they are specific enough to hang an alert on.
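Those two signals, batch-to-batch cosine distance and neighbor persistence, are cheap to compute. A sketch of both metrics, with the practitioner-reported thresholds above wired in as alert constants (the function names and the centroid-based distance are illustrative choices, not a standard):

```python
import numpy as np

# Practitioner-reported bands from the paragraph above.
STABLE_DISTANCE_BAND = (0.0001, 0.005)
DRIFT_DISTANCE_ALERT = 0.05


def batch_drift(prev: np.ndarray, curr: np.ndarray) -> float:
    """Cosine distance between the centroids of two embedding batches."""
    a, b = prev.mean(axis=0), curr.mean(axis=0)
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - float(cos)


def neighbor_persistence(prev_nn: dict, curr_nn: dict) -> float:
    """Fraction of each query's top-k neighbor set that survived
    between two index snapshots, averaged over shared queries."""
    shared = prev_nn.keys() & curr_nn.keys()
    if not shared:
        return 1.0
    overlaps = [len(set(prev_nn[q]) & set(curr_nn[q])) / len(prev_nn[q])
                for q in shared]
    return sum(overlaps) / len(overlaps)
```

A weekly job that computes both numbers and alerts when distance exceeds 0.05 or persistence drops below roughly 85% turns a silent quality slide into a page.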
Which lever to pull first
At least five levers reduce the embedding bill: content-hash caching, incremental re-embedding, distillation into smaller student models, Matryoshka dimensionality reduction, and inference-level quantization. They do not compose cleanly. Caching helps most when corpus churn is low. Distillation helps most when query volume is high and latency budgets are tight. Matryoshka helps when storage and ANN search compute dominate the steady-state bill rather than the ingestion spike. Quantization interacts with index structure, so its payoff depends on whether the inverted-index side is also under pressure.
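Of the five levers, content-hash caching is the simplest to sketch. A minimal version, assuming an arbitrary `embed` callable and an in-memory dict as the cache (in practice the cache would live in a key-value store keyed the same way):

```python
import hashlib
from typing import Callable, Dict, List


def embed_with_cache(docs: Dict[str, str],
                     embed: Callable[[str], List[float]],
                     cache: Dict[str, List[float]],
                     model: str = "example-embedding-model") -> Dict[str, List[float]]:
    """Re-embed only documents whose (model, content) hash is not cached.

    The model name is part of the key, so a model rotation invalidates
    every entry at once: the full reindex becomes explicit and intentional
    rather than a silent mix of old and new vectors."""
    out = {}
    for doc_id, text in docs.items():
        key = hashlib.sha256(f"{model}\x00{text}".encode()).hexdigest()
        if key not in cache:
            cache[key] = embed(text)  # only pay for changed or new content
        out[doc_id] = cache[key]
    return out
```

The effectiveness of this lever is exactly the cache hit rate, which is why the paragraph above ties it to churn: at 5% weekly churn most lookups hit, at 40% most miss.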
The right order of adoption depends on two numbers: corpus churn rate and model-rotation cadence. A corpus that churns 5% per week with a model rotation every two years has a very different optimal sequence than one that churns 40% per week with quarterly model updates. The two-speed tension between the inverted index (near-real-time segment merges) and HNSW (large, rare batches) also has to be reconciled in that plan, and the reconciliation shape is not obvious from the component behaviors alone. Picking the first lever is the decision that determines whether the rest of the stack compounds savings or fights itself.
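The two regimes in the paragraph above can be compared with one line of arithmetic: expected embedding computations per document per year is churn-driven updates plus one full pass per model rotation. The formula is a simplification (it ignores partial reindexes and assumes churn touches distinct documents), but it shows how far apart the two regimes sit:

```python
def annual_embeds_per_doc(weekly_churn: float,
                          rotations_per_year: float) -> float:
    """Expected embedding computations per document per year:
    weekly churn over 52 weeks, plus a full re-embed at each rotation."""
    return weekly_churn * 52 + rotations_per_year


# Regime A: 5% weekly churn, one rotation every two years
#   0.05 * 52 + 0.5  = 3.1 embeds per document per year
# Regime B: 40% weekly churn, quarterly rotations
#   0.40 * 52 + 4.0  = 24.8 embeds per document per year
```

An eight-fold difference in steady-state embedding volume, with churn dominating in both regimes, is why Regime A should reach for caching first while Regime B needs incremental re-embedding before caching can pay for itself.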
Related chapter
Chapter 14: Indexing at Scale
Running both a lexical and a vector index on the same corpus is strictly harder than running either one alone: inverted indexes prefer a steady stream of small writes, while ANN structures prefer infrequent large rebuilds, and a hybrid system must satisfy both at the same time. The chapter covers the batch-versus-incremental trade-off, refresh strategies that keep the system queryable during updates, containing the cost of embedding recomputation, evolving schemas as models and fields change, and isolating tenants on shared infrastructure.
Laszlo Csontos
Author of Designing Hybrid Search Systems. Works on search and retrieval systems, and writes about the engineering trade-offs involved in combining keyword and vector search.