HNSW Parameter Tuning: M, efConstruction, efSearch Explained
The three HNSW knobs (M, efConstruction, efSearch) move your recall-latency curve more than most teams realize. Pick defaults deliberately, not because the library shipped them.
Two HNSW indexes built from the same vectors with the same parameters can differ by up to 17 percentage points in relative recall, just because documents arrived in different orders (Marqo, 2024). That single result, covered in the book's chapter on latency, throughput, and scaling, upends the way most teams reason about HNSW tuning: the parameters are not the only variable, and the graph you ship is not the only graph the parameters describe.
What the three parameters actually do
HNSW is a graph-based approximate nearest neighbor index. Queries traverse a multi-layer graph, descending from coarse to fine connectivity until they land near the query point. Three parameters shape that graph, and each has a well-measured effect.
M is the number of bidirectional edges each node keeps. Higher M makes the graph denser, improves recall at a fixed efSearch, and increases memory. The graph structure alone adds approximately M x 8 to 10 bytes per element on top of the stored vectors (hnswlib, 2024). On SIFT-128 with one million vectors, moving M from 2 to 512 increases total index memory from roughly 0.5 GB to 5 GB (Pinecone, 2024d). The common default across vector databases is M = 16 (Marqo, 2024); for high-dimensional embeddings in the 768 to 1024 range, M = 48 to 64 is the recommended range (hnswlib, 2024).
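That memory arithmetic is easy to sanity-check. The helper below is my own back-of-envelope sketch, not library code: it combines float32 vector storage with the roughly 8 bytes per unit of M that the hnswlib estimate above implies.

```python
# Rough HNSW memory estimate: stored vectors plus graph links.
# Assumes float32 vectors and ~8 bytes per unit of M (the low end
# of the 8-10 byte range cited above).

def hnsw_memory_gb(n_vectors, dim, M, bytes_per_link_unit=8):
    vector_bytes = n_vectors * dim * 4              # float32 storage
    graph_bytes = n_vectors * M * bytes_per_link_unit
    return (vector_bytes + graph_bytes) / 1e9

# SIFT-128, one million vectors, at the extremes from the text
for M in (2, 16, 512):
    print(f"M={M:3d}: {hnsw_memory_gb(1_000_000, 128, M):.2f} GB")
```

The numbers line up with the Pinecone figures quoted above: about 0.5 GB at M = 2 and between 4.6 and 5.6 GB at M = 512, depending on where in the 8 to 10 byte range the link overhead lands.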
efConstruction controls how hard the index works to find good neighbors while inserting each point. It does not affect query-time latency, only graph quality and build time. Under-configured efConstruction can cost up to 18% NDCG@10 (Marqo, 2024), which is a large retrieval quality hit for a parameter that is easy to get wrong and impossible to change without a full reindex.
efSearch is the query-time breadth of the search. It is the only one of the three that can be changed per query, and it is the parameter most teams actually tune in production.
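The mechanics behind M and efSearch can be shown in miniature. The sketch below is a deliberately simplified single-layer toy, not real HNSW: actual HNSW is multi-layer and prunes links with a selection heuristic, and hnswlib's implementation differs in many details. What it does reproduce faithfully is the beam-search loop that efSearch controls: keep the ef best results seen so far, and stop once the closest unexplored node is worse than the worst kept result.

```python
import heapq
import random

random.seed(0)
DIM, N, M, K = 8, 500, 8, 10

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

points = [tuple(random.random() for _ in range(DIM)) for _ in range(N)]

# Single-layer graph with bidirectional links: each node wired to its
# M exact nearest neighbours. (Real HNSW builds this approximately,
# which is what efConstruction governs.)
graph = [set() for _ in range(N)]
for i, p in enumerate(points):
    nn = sorted((j for j in range(N) if j != i),
                key=lambda j: dist(p, points[j]))[:M]
    for j in nn:
        graph[i].add(j)
        graph[j].add(i)

def search(q, ef, entry=0):
    """Best-first graph search with a result beam of width ef."""
    d0 = dist(q, points[entry])
    visited = {entry}
    frontier = [(d0, entry)]    # min-heap: closest unexplored node first
    results = [(-d0, entry)]    # max-heap: current ef best, worst on top
    while frontier:
        d, node = heapq.heappop(frontier)
        if d > -results[0][0]:
            break               # frontier is worse than the worst kept result
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(q, points[nb])
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(frontier, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > ef:
                    heapq.heappop(results)
    return sorted((-nd, i) for nd, i in results)[:K]

def recall_at_k(q, ef):
    truth = set(sorted(range(N), key=lambda j: dist(q, points[j]))[:K])
    found = {i for _, i in search(q, ef)}
    return len(found & truth) / K

queries = [tuple(random.random() for _ in range(DIM)) for _ in range(20)]
for ef in (10, 50, 200):
    avg = sum(recall_at_k(q, ef) for q in queries) / len(queries)
    print(f"efSearch={ef:3d}  recall@10={avg:.2f}")
```

Even on this toy, widening the beam buys recall by visiting more of the graph per query, which is exactly the recall-for-throughput trade the next section quantifies on real benchmarks.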
The shape of the efSearch curve
The recall-latency curve is not linear and not forgiving. On SIFT-128 using hnswlib with M = 16 and efConstruction = 500, efSearch = 50 yields 0.950 recall at 28,022 QPS; efSearch = 500 yields 1.000 recall at 4,116 QPS (Aumüller et al., 2020). That is a 6.8x drop in throughput to buy five percentage points of recall. On the same benchmark, efSearch = 10 runs at 69,663 QPS but only reaches 0.713 recall.
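The convexity is easy to see by working the three published points: the throughput given up per percentage point of recall gained rises sharply as you climb the curve.

```python
# The three (efSearch, recall, QPS) points from the hnswlib/SIFT-128
# benchmark cited above.
curve = [(10, 0.713, 69_663), (50, 0.950, 28_022), (500, 1.000, 4_116)]

for (ef1, r1, q1), (ef2, r2, q2) in zip(curve, curve[1:]):
    gained = (r2 - r1) * 100                # recall gain, percentage points
    qps_per_point = (q1 - q2) / gained      # throughput paid per point
    print(f"ef {ef1}->{ef2}: +{gained:.1f} pts recall, "
          f"{q1 / q2:.1f}x QPS drop, {qps_per_point:,.0f} QPS per point")
```

The step from efSearch = 10 to 50 costs roughly 1,800 QPS per recall point; the step from 50 to 500 costs roughly 4,800. Same index, same data, nearly triple the marginal price.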
The shape matters more than any single point on it. The cheap recall is at the bottom of the curve. The expensive recall is at the top. Picking a target of 1.000 instead of 0.985 is rarely a product decision; it is usually an artifact of nobody having measured the cost. Every 100 milliseconds of added search latency reduces daily searches by approximately 0.2% (Brutlag, 2009), and the latency difference between efSearch = 100 and efSearch = 500 at billion-scale fan-out can easily exceed that threshold.
Why defaults can mislead
Library defaults are tuned for generic benchmarks, usually SIFT-128 or GloVe-100. A modern embedding pipeline running 768 to 3072 dimensions is not SIFT-128. The M = 16 default that works on SIFT sits well below the 48 to 64 range recommended for high-dimensional embeddings. Running 1024-dimensional vectors through an HNSW index built with M = 16 is a choice, not a safe baseline.
The insertion-order result is the deeper warning. Two teams can run identical parameters, identical data, identical hardware, and produce indexes with 17% relative recall spread between them. That means a benchmark number reported without a reproducible build procedure is a point estimate with error bars the size of the effect most teams are trying to measure. It also means that reindexing the same corpus (after a model change, after a backfill, after a disaster recovery) can move your recall in ways that no parameter audit will catch.
What this leaves unresolved
HNSW parameters are one lever among several in a production retrieval stack, and they interact in ways a single-index benchmark does not capture. Vector retrieval is one stage inside a fixed latency budget that also has to accommodate query understanding, lexical retrieval, fusion, and reranking. Quantization changes the memory calculation, the distance computation cost, and the recall retention curve all at once, and the trade-offs differ between scalar, product, and binary schemes. Fan-out across shards amplifies tail latency non-linearly: a 1% per-shard slow rate at fan-out 100 produces a slow response on 63% of user requests. Hedged requests, adaptive replica selection, and partial-result policies address that tail, but only under specific utilization regimes. Chapter 15 works through each of these together, which is the only level at which HNSW tuning actually makes sense.
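The fan-out arithmetic behind that 63% figure takes two lines to verify. The hedging variant below makes an independence assumption of my own (a shard is slow only when both replicas are slow), which is the idealized case, not a guarantee of what any given system delivers.

```python
# Probability that at least one of n shards responds slowly, given an
# independent per-shard slow probability p.
def p_any_slow(p, n):
    return 1 - (1 - p) ** n

# 1% per-shard slow rate at fan-out 100: ~63% of requests hit a slow shard.
print(f"{p_any_slow(0.01, 100):.3f}")

# Idealized hedged requests: a shard is slow only if BOTH replicas are
# slow (independence assumed), so the per-shard rate drops to p**2.
print(f"{p_any_slow(0.01 ** 2, 100):.4f}")
```

Under that assumption the request-level slow rate falls from about 63% to about 1%, which is why tail-mitigation policies matter more than single-shard medians at high fan-out.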
Related chapter
Chapter 15: Latency, Throughput, and Scaling
Every 100 milliseconds of added search latency measurably reduces user engagement, and hybrid pipelines must fit multiple retrieval stages into a fixed latency budget. This chapter explains how to allocate that budget across stages, use caching effectively, scale horizontally, tune ANN index parameters, and manage the long tail of slow queries.
Laszlo Csontos
Author of Designing Hybrid Search Systems. Works on search and retrieval systems, and writes about the engineering trade-offs involved in combining keyword and vector search.