Part III · Chapter 9 · embeddings · fine-tuning

The Negatives You Train On Decide Your Embedding Model's Ceiling

Given fixed positives, the choice of negatives is the highest-leverage lever left in embedding fine-tuning, and getting it wrong quietly poisons retrieval quality.

February 23, 2026 · 5 min read

On the BioASQ biomedical retrieval task, a general-purpose dense retriever reaches an NDCG@10 of 0.232 against BM25's 0.465, a deficit of roughly 50 percent (Thakur et al., 2021). On Natural Questions, adding a single BM25 hard negative per query lifts top-20 retrieval accuracy from 68.3 percent to 78.4 percent, a 10-point jump from one change to negative sampling (Karpukhin et al., 2020). That gap and that lift are what this post is about. The book's chapter outline places hard negatives inside Chapter 9 because they are where the biggest remaining gains live once fine-tuning is on the table.

The domain gap is real

General dense retrievers trained on MS MARCO underperform on specialized corpora by wide margins. On financial question answering, generic RAG pipelines reached about 18 percent accuracy on hard questions while a fine-tuned retriever reached 55 percent (Nguyen et al., 2024). Domain vocabulary, acronyms, and technical terms rarely appear in general web text, and tokens like "immunoglobulin" fragment into meaningless subword pieces under a general-purpose tokenizer (Gu et al., 2021). Legal corpora, financial filings, and code behave similarly. The post on embedding model selection covers when to reach for fine-tuning; this one picks up once that decision is made.

Where the leverage actually sits

A contrastive loss pushes the query toward a positive example and away from negatives. Positive-pair quality is the single most impactful factor in embedding fine-tuning; no mining strategy rescues a training set whose positives are wrong, ambiguous, or drawn from the wrong distribution. That is the baseline assumption of this post.

Given fixed positives, the next most impactful factor is the quality of negatives. Random in-batch negatives are usually trivially separable because they come from unrelated topics, and when negatives are too easy the model learns coarse topic separation without developing the fine-grained discrimination production retrieval demands. Uninformative negatives also produce near-zero gradient norms and high gradient variance, slowing convergence and capping final quality (Xiong et al., 2021). Once the positives are set, negatives are where the highest-leverage gains remain.
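To make the gradient point concrete, here is a minimal NumPy sketch of an InfoNCE-style objective with in-batch negatives plus one mined hard negative per query. The function and variable names are illustrative, not any particular library's API, and real training code would run this on GPU tensors with a learned encoder in the loop.

```python
import numpy as np

def info_nce_loss(q, pos, hard_neg, temperature=0.05):
    """InfoNCE over a batch of L2-normalized embeddings, each shape (B, dim).

    For query i, positive i is the target; the other B-1 positives serve as
    in-batch negatives, and hard_neg contributes one mined negative per query.
    """
    sim_pos = q @ pos.T          # (B, B) query-to-positive similarities
    sim_neg = q @ hard_neg.T     # (B, B) query-to-hard-negative similarities
    logits = np.concatenate([sim_pos, sim_neg], axis=1) / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(q.shape[0])
    return -log_probs[idx, idx].mean()  # correct passage sits on the diagonal
```

A negative that is orthogonal to the query barely moves the logits, which is exactly the near-zero-gradient problem above; a negative close to the positive dominates the softmax denominator and produces a large, informative loss.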

Hard negatives are passages that are plausible but wrong: lexically similar, topically related, or high-ranking mistakes from the current retriever. A random negative for "treatment options for type 2 diabetes" might be a passage on car maintenance, which the model separates on topic alone. A harder one about "diagnosis of type 2 diabetes" shares most keywords but does not answer the question. Harder still, "treatment options for type 1 diabetes" forces the model to encode the numerical distinction. Each level teaches a more nuanced representation.

A progression of mining strategies

There is a progression of mining strategies with well-understood trade-offs along two axes: how hard the negatives are and how much compute each method requires. Each step yields more informative negatives at the cost of more infrastructure, and each also shifts the risk profile for a problem that tends to surprise teams on their first serious fine-tuning run.

Mining plugs into the broader recipe where in-batch negatives stop giving signal: positive pairs from click logs, synthetic generation, or human judgments; an InfoNCE-style contrastive loss; and an evaluation harness measuring real retrieval quality rather than training loss (see search quality metrics).
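As a sketch of where mining slots into that recipe, the toy function below ranks a corpus with whatever scorer the current stage provides (BM25, the in-training encoder, or a cross-encoder) and keeps the top-scoring passages that are not labeled positives. Every name here is illustrative; the token-overlap scorer is a crude stand-in for BM25, not a real implementation.

```python
def mine_hard_negatives(query, positive_ids, corpus, score_fn, k=5):
    """Rank `corpus` ({id: text}) against `query` with `score_fn`,
    drop known positives, and keep the top-k survivors as hard negatives."""
    ranked = sorted(corpus, key=lambda pid: score_fn(query, corpus[pid]),
                    reverse=True)
    return [pid for pid in ranked if pid not in positive_ids][:k]

def overlap_score(query, passage):
    """Stand-in lexical scorer: fraction of query tokens found in the passage."""
    q_tokens = set(query.lower().split())
    p_tokens = set(passage.lower().split())
    return len(q_tokens & p_tokens) / (len(q_tokens) or 1)

corpus = {
    "d1": "treatment options for type 2 diabetes",
    "d2": "diagnosis of type 2 diabetes",
    "d3": "routine car maintenance schedule",
}
negs = mine_hard_negatives("treatment options for type 2 diabetes",
                           positive_ids={"d1"}, corpus=corpus,
                           score_fn=overlap_score, k=1)
# The diagnosis passage outranks the car-maintenance one: lexically close
# but non-answering, the kind of negative in-batch sampling rarely surfaces.
```

Swapping `score_fn` from a lexical scorer to the current checkpoint's own similarity function is what turns this static mining step into the iterative, self-refreshing loop the harder strategies use.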

Consequences for the system

Fine-tuning is not free. It requires a labeled evaluation set, a training harness, periodic retraining as the corpus drifts, and the cost of hosting a non-standard model (see embedding drift monitoring on when retraining is actually needed). The right time to invest is not "when the leaderboard model looks weak on your favorite benchmark" but when a domain-representative evaluation shows a gap large enough to justify the engineering cost. When that time comes, positive-pair quality is the first place the budget goes and the mining strategy for negatives is the next.

The tension this leaves open

There is a trap hiding inside the phrase "harder negatives are better." In MS MARCO, roughly 70 percent of the passages most similar to a given query are actually relevant but unlabeled (Qu et al., 2021, as cited in Moreira et al., 2024). Sample negatives from the very top of the rankings and the training signal pushes the model away from documents that should be pulled closer. The fix is not to avoid hard negatives; it is to sample from a rank range hard enough to be informative but deep enough to dodge the false-negative cliff, and to layer in denoising when the budget allows. Which negatives, how hard, and how to avoid poisoning the model: that is what Chapter 9 answers.
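One way to operationalize that rank-range idea is sketched below. The window bounds and margin are placeholders to tune against your own evaluation, not published values, and the score-margin filter assumes a scorer whose values are positive; for similarity scores that can go negative, an absolute gap works better.

```python
import random

def sample_denoised_negatives(ranked, positive_ids, pos_score,
                              window=(30, 100), margin=0.95, k=4, rng=None):
    """ranked: list of (passage_id, score), best first, from the current retriever.

    Skip the very top of the ranking, where unlabeled relevant passages
    concentrate, then drop any candidate whose score comes within `margin`
    of the labeled positive's score: a cheap denoising heuristic.
    """
    rng = rng or random.Random(0)
    lo, hi = window
    pool = [pid for pid, score in ranked[lo:hi]
            if pid not in positive_ids and score < margin * pos_score]
    return rng.sample(pool, min(k, len(pool)))
```

A heavier-weight variant of the same filter uses a cross-encoder to re-score the candidates instead of trusting the retriever's own scores, trading compute for cleaner negatives.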

Related chapter

Chapter 9: Fine-Tuning Embeddings for Your Domain

When no off-the-shelf embedding model meets the quality bar on domain-specific evaluation, fine-tuning is the next step. This chapter walks through deciding whether the investment will pay off, assembling the dataset, choosing loss functions and training hyperparameters, building a hard-negative mining loop, and validating real gains on held-out domain evaluations.



Laszlo Csontos

Author of Designing Hybrid Search Systems. Works on search and retrieval systems, and writes about the engineering trade-offs involved in combining keyword and vector search.