Back to Blog
Part V · Chapter 17 · cost · quantization · operations

Vector Cost Optimization: Matryoshka and Quantization Without the Hype

Stacked compression can cut vector index RAM by up to 192x, but the quality losses are non-additive. A validation workflow is the only way to find the Pareto point.

December 29, 2025 · 5 min read

OpenAI's text-embedding-3-large, shortened to 256 dimensions, outperforms the previous-generation text-embedding-ada-002 running at its full 1536 dimensions. That is a 6x reduction in vector size with a measurable quality gain, achieved by setting a single API parameter. The book's chapter on production hybrid search closes with cost optimization because, at current embedding quality, results like this are the rule, not the exception.

Why vectors are the line item

Lexical indexes are compact; decades of inverted-file engineering have compressed their storage and query cost. Vector indexes are not. For one million vectors at 768 dimensions stored as 32-bit floats, the raw data occupies approximately 3 GB. HNSW at a connectivity parameter M of 40 brings total memory consumption to roughly 4.8 GB, a 1.5 to 2 times multiplier over the raw vectors that is typical for graph-based ANN indexes (Zilliz, 2024). Scaled to 100 million vectors, storage alone enters several-hundred-gigabyte territory, and RAM is the most expensive tier of cloud infrastructure.
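The arithmetic behind those figures fits in a few lines. A back-of-the-envelope sketch, where the index multiplier is an assumption chosen to reproduce the chapter's 4.8 GB figure (it sits inside the typical 1.5 to 2x range for graph-based ANN indexes):

```python
GB = 1e9  # decimal gigabytes


def hnsw_footprint_gb(n_vectors, dim, bytes_per_dim=4, index_multiplier=1.5625):
    """Rough RAM estimate for raw vectors plus an HNSW graph on top.

    index_multiplier is an assumption: graph-based ANN indexes typically land
    at 1.5-2x the raw vector payload; 1.5625 matches the 4.8 GB figure above.
    """
    raw = n_vectors * dim * bytes_per_dim
    return raw / GB, raw * index_multiplier / GB


raw_gb, total_gb = hnsw_footprint_gb(1_000_000, 768)
# raw_gb is about 3.07 GB; total_gb is about 4.8 GB
```

The same function scaled to `n_vectors=100_000_000` lands in the several-hundred-gigabyte range, which is the scenario the rest of the post attacks.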

Two levers bend that curve: dimensionality reduction and quantization.

Matryoshka embeddings

Matryoshka Representation Learning trains embedding models so that the first k dimensions of any embedding form a valid representation at that reduced size (Kusupati et al., 2022). The text-embedding-3-large result noted above is the production signal: a 6x reduction from 1536 to 256 dimensions with no quality regression against the prior generation (OpenAI, 2024). Sentence Transformers documentation reports MRL-trained models preserving 98.37 percent of performance at 8.3 percent of the original embedding size (Sentence Transformers, 2024). Nomic Embed v1.5 retains approximately 90 percent of its full-dimensional MTEB performance at 64 dimensions, a 12x reduction from its native 768 (Nomic, 2024).
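Using an MRL-trained model at a reduced size amounts to slicing and re-normalizing. A minimal NumPy sketch (the encoder call itself is elided; any MRL-trained model producing row vectors works):

```python
import numpy as np


def truncate_matryoshka(embeddings, k):
    """Keep the first k dimensions of MRL-trained embeddings, then re-normalize.

    Assumes embeddings is an (n, d) array from an MRL-trained model; the
    L2 re-normalization keeps cosine similarity meaningful at the reduced size.
    """
    trunc = embeddings[:, :k]
    norms = np.linalg.norm(trunc, axis=1, keepdims=True)
    return trunc / np.clip(norms, 1e-12, None)


# stand-in for real model output: 4 embeddings at 1536 dimensions
full = np.random.default_rng(0).normal(size=(4, 1536)).astype(np.float32)
small = truncate_matryoshka(full, 256)
# small has shape (4, 256) and unit-norm rows
```

For a model not trained with MRL, this slice-and-renormalize trick degrades quality much faster, which is what motivates post-hoc approaches like Matryoshka-Adaptor.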

For teams stuck with a black-box API model that was not trained with MRL, the Matryoshka-Adaptor approach applies a post-hoc transformation. On BEIR, it delivers 2 to 12 times dimensionality reduction without compromising performance for both Google and OpenAI embeddings (Yoon et al., 2024).

Scalar and product quantization

Quantization reduces the precision of each stored dimension rather than the number of dimensions. Scalar int8 quantization yields 4x memory reduction with recall retention in the 97 to 99.9 percent range for general-purpose encoders, and enables SIMD-accelerated integer arithmetic that speeds up distance computation by 2.7 to 3.4 times (Thoresen, 2026). Binary quantization collapses each float to a single bit for 32x compression, but the quality trade-off is substantial and depends heavily on whether the source model was trained with quantization in mind. Product quantization takes a different route, dividing the vector into sub-vectors and replacing each with the index of its nearest centroid from a learned codebook (Jegou et al., 2011). The approach is well supported by FAISS, Vespa, Qdrant, Milvus, and Weaviate.
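A minimal min-max scalar quantizer makes the 4x figure concrete. This is a sketch, not the actual implementation in FAISS or Qdrant; production systems calibrate ranges more carefully, for example with percentile clipping to resist outliers:

```python
import numpy as np


def fit_int8_quantizer(vectors):
    # calibrate a per-dimension range on a sample of the corpus
    lo = vectors.min(axis=0)
    hi = vectors.max(axis=0)
    scale = (hi - lo) / 255.0
    scale[scale == 0] = 1.0  # guard against constant dimensions
    return lo, scale


def quantize_int8(vectors, lo, scale):
    q = np.round((vectors - lo) / scale)
    return np.clip(q, 0, 255).astype(np.uint8)  # 1 byte per dim vs 4 for float32


def dequantize(q, lo, scale):
    return q.astype(np.float32) * scale + lo


rng = np.random.default_rng(1)
x = rng.normal(size=(1000, 128)).astype(np.float32)
lo, scale = fit_int8_quantizer(x)
q = quantize_int8(x, lo, scale)
x_hat = dequantize(q, lo, scale)
# q is exactly 4x smaller than x; per-dimension error is bounded by scale / 2
```

The bounded reconstruction error is why recall retention stays high for well-behaved encoders: distances computed on `q` (or on `x_hat`) track the originals closely, while the integer representation unlocks SIMD distance kernels.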

A critical dependency: the recall retention figures above vary substantially depending on whether the embedding model was trained with quantization in mind. Voyage AI's voyage-3.5 at int8 and 2048 dimensions reports 83 percent lower vector database costs than a full-precision text-embedding-3-large baseline, while outperforming it on retrieval benchmarks (Voyage AI, 2025). Model selection and quantization strategy are coupled decisions, not independent ones.

Stacked compression and the quality trap

Compression techniques compose. The chapter's worked example takes 100 million vectors at 1536 dimensions in float32, a 576 GB raw footprint, through MRL to 256 dimensions (96 GB), int8 scalar quantization (24 GB), and binary quantization with int8 rescoring (3 GB in RAM, 24 GB on disk). The combined effect is a 192x reduction in RAM, with the secondary int8 index on disk preserving reranking quality for the candidates that survive the binary first pass.
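The arithmetic of the stack is worth writing down explicitly, because each stage multiplies the previous one. Using the chapter's rounded 576 GB starting point:

```python
raw_gb = 576.0  # 100M vectors x 1536 dims x float32, rounded as in the chapter

stages = {
    "mrl_256d": 6,  # 1536 -> 256 dimensions
    "int8": 4,      # float32 -> int8 per dimension
    "binary": 8,    # int8 -> 1 bit per dimension
}

footprint = raw_gb
for name, factor in stages.items():
    footprint /= factor
# footprint ends at 3.0 GB in RAM; the 24 GB int8 index stays on disk for rescoring

total_reduction = raw_gb / footprint  # 6 * 4 * 8 = 192x
```

Note that the ratios compose multiplicatively for cost, but, as the next section argues, nothing guarantees that quality retention composes the same way.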

Quality losses from stacking these techniques are not strictly additive, and this is where aggressive optimization tends to go wrong. MRL truncation discards information encoded in the removed dimensions. Scalar quantization introduces rounding error across all retained dimensions. Binary quantization collapses the information per dimension to a single bit. When applied in sequence, these errors interact. A dimension that carries marginal signal at full precision may carry critical signal in a reduced-dimensional space, and quantizing it to int8 or binary may eliminate it entirely.

A 6x dimensionality reduction at 98 percent independent quality retention, stacked with a 4x int8 quantization at 99 percent independent retention, does not guarantee 97 percent retention on the combination. The actual retention has to be measured empirically on a domain-representative evaluation set, against a stored baseline that is not regenerated between steps. Finding the Pareto point where further compression yields diminishing cost savings relative to quality loss requires a specific validation workflow.
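That measurement loop can be sketched in a few lines. Brute-force search stands in for the ANN index here, and a float16 round-trip is a placeholder compression step; in practice you would swap in each stage of the real MRL/int8/binary pipeline and re-measure against the same stored baseline:

```python
import numpy as np


def exact_topk(queries, corpus, k=10):
    # brute-force inner-product search; the baseline is computed once and stored
    scores = queries @ corpus.T
    return np.argsort(-scores, axis=1)[:, :k]


def recall_at_k(baseline_ids, candidate_ids, k=10):
    """Fraction of the baseline top-k recovered by the compressed index."""
    hits = sum(
        len(set(b[:k]) & set(c[:k]))
        for b, c in zip(baseline_ids, candidate_ids)
    )
    return hits / (k * len(baseline_ids))


rng = np.random.default_rng(2)
corpus = rng.normal(size=(5000, 256)).astype(np.float32)
queries = rng.normal(size=(50, 256)).astype(np.float32)

baseline = exact_topk(queries, corpus)  # full precision, stored, never regenerated

# placeholder compression: float16 round-trip instead of the real stacked pipeline
compressed = exact_topk(
    queries.astype(np.float16).astype(np.float32),
    corpus.astype(np.float16).astype(np.float32),
)
retention = recall_at_k(baseline, compressed)  # measured, not assumed
```

Running this after every added compression stage, on a domain-representative query set rather than synthetic vectors, is the workflow: stop stacking when the marginal cost saving no longer justifies the measured drop in retention.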

Related chapter

Chapter 17: Cost Optimization

Vectors dominate the economics of hybrid search: their memory footprint grows linearly with the number of documents, with embedding dimensionality, and with the overhead of the ANN index structure itself. The chapter decomposes spending into indexing compute, storage, query-time compute, and operational overhead, then works through the main levers for reducing the bill: lower-dimensional models, quantization schemes, tiered storage, and the architectural question of whether to build the platform or rent it.

Get notified when the book launches.


Laszlo Csontos

Author of Designing Hybrid Search Systems. Works on search and retrieval systems, and writes about the engineering trade-offs involved in combining keyword and vector search.