
Fine-tuning an LLM for State-of-the-Art retrieval: Zeta Alpha's top-10 submission to the MTEB benchmark

We are excited to introduce Zeta-Alpha-E5-Mistral, our first open model, as a showcase of how to fine-tune LLMs to produce state-of-the-art embeddings, and we are proud that, at the time of submission (5 September 2024), our model landed in the top 10 of this globally competitive benchmark.

MTEB Leaderboard - Retrieval tasks - 12 September 2024.

While extremely large open-source LLMs make the most headlines (like Llama 3.1's 405B-parameter model), their smaller siblings, with fewer than 10B parameters, are quickly becoming one of the most popular and powerful ways to use LLMs in applications.


One of the problems researchers and users are tackling with these LLMs is how to create high-quality embeddings that can be used for tasks such as semantic retrieval, making large collections of documents searchable, especially within RAG pipelines.


The most common benchmark for such models, the MTEB leaderboard, measures performance on tasks such as clustering, classification, and, most importantly, retrieval. It shows that pre-trained LLMs can be fine-tuned to produce high-quality embeddings, with many of the best-performing models being based on 7B-parameter LLMs. 

We hope that openly sharing our data and recipes is helpful for others working in a similar direction.


Model selection

Looking at the MTEB benchmark, it is clear that 7B models can produce high-quality embeddings, with Mistral-based models being the most popular choice. We did not want to train a model completely from scratch for this release, so we decided to further fine-tune e5-mistral-7b-instruct, one of the most successful and widely used open embedding models, which is, in turn, based on Mistral-7B-v0.1, in order to improve its standing on MTEB.


While fine-tuning an already strong model is a good starting point, it also limits some of our choices. For instance, trying to fix the inconsistencies in the instructions used by the model is tricky, as the model already "knows" how to perform retrieval under these instructions (one example of such "inconsistencies" is the instructions for STS datasets, which end with a period, unlike others).
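To make the instruction setup concrete, here is a hedged sketch of the instruction-prefixed query format that E5-Mistral-style models use (the task descriptions below are illustrative examples, not our full instruction set):

```python
# Minimal sketch of the instruction-prefixed query format used by
# E5-Mistral-style embedding models (documents are embedded without an
# instruction). Task descriptions here are illustrative.

def format_query(task_description: str, query: str) -> str:
    return f"Instruct: {task_description}\nQuery: {query}"

retrieval_query = format_query(
    "Given a web search query, retrieve relevant passages that answer the query",
    "how do dense retrievers handle hard negatives?",
)

# STS-style instructions in the original model end with a period, unlike the
# retrieval ones -- one of the inconsistencies mentioned above.
sts_query = format_query(
    "Retrieve semantically similar text.",
    "A man is playing a guitar.",
)
```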


Another limitation is that some newer models, such as NV-Embed-V2, show that tweaks to the attention mechanism can significantly increase performance, and introducing architectural changes or entirely different loss functions into an existing model would require full retraining.


Training Dataset

One of the most critical parts of the process was deciding which datasets to use when training the model. While the original E5-Mistral was trained, at least in the first stage, with mostly synthetic data, we iterated on our training data mixture, using only "real" data. In the end, we settled on the following datasets for our training set:


| Dataset | # of samples |
|---|---|
| ArguAna | 4,065 |
| FEVER | 50,000 |
| FiQA | 14,166 |
| HotPotQA | 85,000 |
| MS MARCO (passage) | 200,000 |
| NFCorpus | 4,000 |
| NQ | 100,231 |
| SciFact | 919 |
| NLI | 20,000 |
| SQuAD | 87,417 |
| StackExchange | 100,000 |
| TriviaQA | 20,000 |
| SciRep | 43,000 |
| arXiv-s2s | 34,929 |
| arXiv-p2p | 34,929 |
| bioRxiv-s2s | 4,070 |
| bioRxiv-p2p | 4,070 |
| medRxiv-s2s | 1,160 |
| medRxiv-p2p | 1,160 |
| AmazonCounterfactual | 4,018 |
| AmazonReview | 20,000 |
| Banking77 | 9,926 |
| Emotion | 15,989 |
| MTOPIntent | 9,942 |
| ToxicConversations | 39,999 |
| TweetSentiment | 27,481 |
| IMDb | 14,999 |
| STS12 | 1,850 |
| STS22 | 414 |
| STSBenchmark | 2,777 |

When sampling from non-retrieval datasets, we used a stratified sampling strategy, so the class ratios in the sampled data remained consistent with the original dataset.
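As a concrete illustration, here is a minimal sketch of such stratified subsampling using scikit-learn; the dataset identifier and target size are placeholders chosen to mirror the Emotion row in the table above, not our exact pipeline:

```python
from datasets import load_dataset
from sklearn.model_selection import train_test_split

# Example classification dataset and target size; both are placeholders.
ds = load_dataset("dair-ai/emotion", split="train")
target_size = 15_989

indices = list(range(len(ds)))

# `stratify` keeps the class ratios of the subsample consistent with the
# full dataset.
kept_idx, _ = train_test_split(
    indices,
    train_size=target_size,
    stratify=ds["label"],
    random_state=42,
)
subset = ds.select(kept_idx)
```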


The NV-Retriever training mix inspired our selection of training data (with some minor changes to filtering and selection). Of note is that we did not use samples from BioASQ, PAQ, or GOOAQ. Instead, we included samples from the search subset of the SciRepEval collection. We removed any queries and documents that may overlap with MTEB's SciDocs test collection to avoid contamination.
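A rough sketch of the kind of overlap filter we mean is shown below; it assumes the standard BEIR layout of the SciDocs data on the HuggingFace Hub and a hypothetical train.jsonl with query/positive fields, so treat it as illustrative rather than our exact script:

```python
from datasets import load_dataset

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

# Build a blocklist of normalized SciDocs test texts (queries and documents).
scidocs_corpus = load_dataset("BeIR/scidocs", "corpus", split="corpus")
scidocs_queries = load_dataset("BeIR/scidocs", "queries", split="queries")
blocklist = {normalize(t) for t in scidocs_corpus["text"]}
blocklist |= {normalize(t) for t in scidocs_queries["text"]}

# "train.jsonl" with "query"/"positive" fields is a stand-in for the
# SciRepEval-derived training samples.
train_ds = load_dataset("json", data_files="train.jsonl", split="train")
decontaminated = train_ds.filter(
    lambda ex: normalize(ex["query"]) not in blocklist
    and normalize(ex["positive"]) not in blocklist
)
```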


Hard negatives

One of the most important steps when building a training set is how you sample your hard negatives. Again, we took inspiration from the work of the NV-Retriever team and used a similar TopK-PercPos strategy for sampling negatives for each query in the retrieval datasets. Instead of naively taking the top-k non-positive documents retrieved by a retriever as hard negatives, we only consider documents whose score is below 95% of the true positive document's score, and select the top-k among those. This sampling strategy avoids the problems of too-hard negatives and false negatives.


Due to the expensive annotation step, most retrieval training datasets available today have only a single positive document per query. In reality, however, most queries can be answered by more than one document. Take the TREC-COVID dataset, for example: some of its queries have over 100 documents marked as "relevant". Avoiding false negatives in the training set is therefore critical.


To sample these negative documents, we relied on an existing powerful embedding model, Snowflake's Arctic-embed-m-v1.5. While a larger model could yield even better hard negatives, the medium size of Snowflake's model allowed us to mine hard negatives without a large GPU budget, which was better spent training our own model.
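A hedged sketch of this TopK-PercPos-style mining with sentence-transformers and Arctic-embed is shown below; the toy corpus, field handling, and query prefix are illustrative rather than our exact pipeline:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m-v1.5")

# Arctic-embed models expect an instruction prefix on queries (per the model card).
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

corpus = ["first candidate document ...", "second candidate document ...", "..."]
doc_emb = model.encode(corpus, normalize_embeddings=True)

def mine_hard_negatives(query: str, positive: str, k: int = 5, perc: float = 0.95):
    q_emb = model.encode(QUERY_PREFIX + query, normalize_embeddings=True)
    pos_emb = model.encode(positive, normalize_embeddings=True)

    pos_score = float(q_emb @ pos_emb)
    scores = doc_emb @ q_emb

    # TopK-PercPos: discard candidates scoring at or above 95% of the
    # positive's score (likely false negatives), then keep the top-k of the rest.
    ranked = np.argsort(-scores)
    keep = [i for i in ranked
            if scores[i] < perc * pos_score and corpus[i] != positive]
    return [corpus[i] for i in keep[:k]]
```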


Evaluation Datasets

A key consideration when training any machine learning model is how fast we can iterate over hyperparameters and models. However, the size of the datasets in the MTEB collection, particularly the corpora of some of the retrieval datasets, can make this impractical: evaluating on the full MTEB collection (or even only on BEIR) can take days, especially with a large model like Zeta-Alpha-E5-Mistral. To address this, we created a smaller version of each BEIR dataset, which we call the NanoBEIR collection. As part of this release, we are making the NanoBEIR datasets available on the HuggingFace Hub.


For each dataset, the NanoBEIR collection consists of 50 queries randomly sampled from the full collection, plus up to 200 negative documents per query. To sample these negative documents, we used both BM25, as implemented in Pyserini, and another embedding model, Alibaba's gte-large-en-v1.5. This is similar to the approach used by Snowflake's Arctic team, in what they called an internal "Lite BEIR" dataset. We make our version publicly available in the hope that it will be a useful resource for reproducibility and faster experimentation. As an example, the table below shows the results of the base model (E5-Mistral) and Zeta-Alpha-E5-Mistral on the NanoBEIR datasets:


| Dataset | E5-Mistral | Zeta-Alpha-E5-Mistral |
|---|---|---|
| NanoArguAna | 59.9 | 65.8 (+5.9) |
| NanoClimateFEVER | 42.5 | 42.3 (-0.2) |
| NanoDBPedia | 71.8 | 72.8 (+1.0) |
| NanoFEVER | 94.9 | 96.2 (+1.3) |
| NanoFiQA | 60.3 | 61.0 (+0.7) |
| NanoHotPotQA | 85.6 | 89.9 (+4.3) |
| NanoMSMARCO | 66.1 | 70.1 (+4.0) |
| NanoNFCorpus | 33.0 | 39.4 (+6.4) |
| NanoNQ | 75.4 | 83.1 (+7.7) |
| NanoQuora | 94.1 | 95.8 (+1.7) |
| NanoSCIDOCS | 35.4 | 41.3 (+5.9) |
| NanoSciFact | 78.0 | 79.8 (+1.8) |
| NanoTouche-2020 | 52.5 | 54.0 (+1.5) |
| Average | 65.3 | 68.6 (+3.3) |
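For reference, the sketch below illustrates how such a Nano dataset can be assembled following the procedure described above; the run dictionaries and helper names are illustrative, not our exact script:

```python
import random

def build_nano_dataset(queries, qrels, bm25_top_docs, dense_top_docs,
                       num_queries=50, docs_per_query=200, seed=42):
    """Sample 50 queries and pool up to 200 candidate documents per query
    from a BM25 run and a dense-retriever run; judged positives are kept."""
    rng = random.Random(seed)
    nano_qids = rng.sample(sorted(queries), num_queries)

    nano_corpus_ids, nano_qrels = set(), {}
    for qid in nano_qids:
        # Merge the two runs, preserving rank order and removing duplicates.
        pooled = list(dict.fromkeys(bm25_top_docs[qid] + dense_top_docs[qid]))
        nano_corpus_ids.update(pooled[:docs_per_query])
        # Always keep the relevance-judged documents for this query.
        nano_qrels[qid] = qrels[qid]
        nano_corpus_ids.update(qrels[qid])
    return nano_qids, nano_corpus_ids, nano_qrels
```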

Training recipe


We trained Zeta-Alpha-E5-Mistral on 4xA100 (80GB) GPUs; the training process took about 80 hours. To increase the effective batch size, we used GradCache. In our experiments, in-batch negatives did not seem to help when continuing training from an existing checkpoint. We also experimented with alternating tasks between batches, as proposed by the SFR-Embedding-Mistral team, but did not notice any significant change in results, while it added complexity to the training script. We used the traditional InfoNCE loss with temperature scaling. Finally, we used early stopping, halting training after ten evaluation steps without improvement on the evaluation set. In the end, we trained Zeta-Alpha-E5-Mistral with the following hyperparameters:

| Parameter | Value |
|---|---|
| GradCache chunk size | 8 |
| Effective batch size | 1024 |
| Epochs | 1 |
| Maximum query length | 192 |
| Maximum document length | 512 |
| Loss temperature | 0.2 |
| Negatives per query | 5 |
| LoRA r | 8 |
| LoRA alpha | 32 |
| LoRA dropout | 0.1 |
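As a minimal sketch of the loss described above, assuming the temperature (0.2) and number of negatives per query (5) from the table, and omitting GradCache chunking, LoRA, and the encoder itself:

```python
import torch
import torch.nn.functional as F

def info_nce(q_emb: torch.Tensor,     # (B, d) query embeddings
             pos_emb: torch.Tensor,   # (B, d) positive document embeddings
             neg_emb: torch.Tensor,   # (B, 5, d) hard-negative embeddings
             temperature: float = 0.2) -> torch.Tensor:
    q = F.normalize(q_emb, dim=-1)
    pos = F.normalize(pos_emb, dim=-1)
    neg = F.normalize(neg_emb, dim=-1)

    pos_scores = (q * pos).sum(-1, keepdim=True)        # (B, 1)
    neg_scores = torch.einsum("bd,bnd->bn", q, neg)     # (B, 5)
    logits = torch.cat([pos_scores, neg_scores], dim=1) / temperature

    # The positive is always at index 0 of the logits.
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```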

We trained the model in BF16 with TF32 support enabled in PyTorch to speed up training, and used SDPA attention. Both Flash-Attention and SDPA gave minor speed gains, but SDPA was more stable in a few small-scale experiments. We also tried to speed up training with PyTorch's compiler. However, since the input shapes change constantly with text data (e.g., between queries and documents), it was not trivial to keep the compiler from recompiling the graph frequently. It should be possible to work around this with dynamic shapes, but we leave that investigation for future work.
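A hedged sketch of these precision and attention settings, assuming a standard HuggingFace Transformers setup (the exact flags we used may differ):

```python
import torch
from transformers import AutoModel

# Allow TF32 on matmuls and cuDNN kernels for faster training on A100s.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

model = AutoModel.from_pretrained(
    "intfloat/e5-mistral-7b-instruct",   # base checkpoint we fine-tuned
    torch_dtype=torch.bfloat16,          # BF16 weights
    attn_implementation="sdpa",          # PyTorch scaled dot-product attention
)
```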


Next steps

For now, we hope that Zeta-Alpha-E5-Mistral and NanoBEIR can be useful, and we look forward to releasing more high-quality embedding models soon.
