Sentence Embedding Fine-tuning for the French Language

La Javaness R&D
Feb 25, 2022


Context

Traditional search engine tools (for example, Elasticsearch) let us query documents using keywords without considering the semantics of the sentence. They help us find documents quickly, but sometimes the best-matching documents are not retrieved as expected. This poor performance is explained by the fact that such tools focus only on the keywords used in the query: they count keyword frequencies in a document, or the number of documents containing the keywords (TF-IDF-like approaches).

The emergence of sentence embeddings allows us to better find the documents we need by taking the semantic aspects of a sentence into consideration. Instead of just looking for keywords in a search query, a sentence-embedding model exposes implicit and explicit entities in the query, finds related entities, captures user intent, and delivers more relevant results. It represents knowledge in a way that allows meaningful results to be retrieved.

This article takes the form of a tutorial on training a model for semantic search purposes. This model is used in French-language search engines that have been deployed in some of our clients’ projects.

Our Sentence Embedding model is fine-tuned using Siamese BERT-Networks [1] and a pre-trained CamemBERT [2] via the sentence-transformers library [3].

The fine-tuning of the Sentence Embedding model is performed through two phases:

  • In the first phase, we train with Bi-encoders, which map each input independently to a dense vector space.
  • In the second phase, we perform training with data augmentation using a Cross-Encoder that is used to label a larger set of input pairs in order to increase the size of the training data for the Bi-encoder model.

Below is a visualisation of the two mentioned concepts: Bi-encoder and Cross-encoder.

  • The Bi-encoder takes a pair of sentences as input; each sentence is embedded independently by a BERT model.
  • The Cross-encoder takes the same pair of sentences as input and performs full attention over the pair.
Figure 1: Difference between Bi-encoder and Cross-encoder

Thanks to full attention over the pair of input sentences, the Cross-encoder usually achieves higher performance than the Bi-encoder (shown in Figure 2).

Figure 2: Spearman rank correlation (ρ) test scores for different STS Benchmark (English) training sizes [6].

Phase 1: Training the Bi-Encoder

Dataset:

The Sentence Embedding model is trained on the STS benchmark dataset, described in [4] and available in French in [5], which consists of 5,749 samples for training, 1,500 samples for dev (validation) and 1,379 for testing, as shown in Figures 3, 4 and 5.

Figure 3: Samples for training
Figure 4: Samples for dev
Figure 5: Samples for test

The dataset consists of pairs of sentences associated with a similarity score ranging from 0 to 5. These values are then rescaled to the [0, 1] interval.

Training

First, we install sentence-transformers:
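pip install sentence-transformers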

Then, we download the dataset:
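A minimal way to do this is via the Hugging Face datasets library, loading the French portion of stsb_multi_mt [5] (the split and column names below follow that dataset card):

from datasets import load_dataset

# French STS benchmark splits from the stsb_multi_mt dataset [5]
train_ds = load_dataset("stsb_multi_mt", name="fr", split="train")
dev_ds = load_dataset("stsb_multi_mt", name="fr", split="dev")
test_ds = load_dataset("stsb_multi_mt", name="fr", split="test")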

We then build a DataLoader for training from the dataset:
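A sketch of this step: each row is wrapped in an InputExample, with the similarity score rescaled from [0, 5] to [0, 1] (the batch size is an illustrative choice):

from torch.utils.data import DataLoader
from sentence_transformers import InputExample

train_samples = [
    InputExample(texts=[row["sentence1"], row["sentence2"]],
                 label=row["similarity_score"] / 5.0)  # rescale to [0, 1]
    for row in train_ds
]
# dev_samples and test_samples are built analogously from dev_ds and test_ds
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=16)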

Next, we define the loss function based on the cosine-similarity function:
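Conceptually, this loss regresses the cosine similarity of the two sentence embeddings onto the gold score. A plain PyTorch sketch of what sentence-transformers’ CosineSimilarityLoss computes internally:

import torch.nn.functional as F

def cosine_similarity_loss(emb1, emb2, gold_scores):
    # Predicted similarity: cosine between the two sentence embeddings
    pred = F.cosine_similarity(emb1, emb2)
    # Regress the prediction onto the gold score (mean squared error)
    return F.mse_loss(pred, gold_scores)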

The model’s architecture is based on the pre-trained CamemBERT-large, whose output dimension is 1024, and we set the max_seq_length parameter of the input sentences to 128* (sequences longer than 128 tokens are truncated, and shorter ones are padded).

*N.B.: the truncation/padding to 128 tokens is applied to the word pieces produced by the tokenizer of the model in use (in our case, the CamemBERT-large tokenizer).
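In sentence-transformers, this corresponds to stacking a Transformer module and a pooling module. A minimal sketch, assuming the camembert/camembert-large checkpoint and mean pooling (the sentence-transformers default):

from sentence_transformers import SentenceTransformer, models

# CamemBERT-large backbone; inputs truncated/padded to 128 word pieces
word_embedding_model = models.Transformer("camembert/camembert-large",
                                          max_seq_length=128)

# Mean pooling over token embeddings gives a 1024-dimensional sentence vector
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])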

The training loss and evaluation during training are then defined:
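A sketch using the library’s built-in loss and evaluator (dev_samples is the InputExample list built from the dev split, as above):

from sentence_transformers import losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

train_loss = losses.CosineSimilarityLoss(model=model)

# Correlation between embedding similarities and gold scores on the dev set
dev_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(
    dev_samples, name="sts-dev"
)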

We start the training process:
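A minimal call; epochs matches the article, while warmup_steps and output_path are illustrative choices:

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=dev_evaluator,
    epochs=10,
    warmup_steps=500,  # illustrative value
    output_path="output/bi-encoder-camembert-large",  # best model saved here
)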

We then evaluate the test set using the best saved model:
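For example, reloading the best checkpoint and running the same evaluator on the test split (test_samples is built like train_samples):

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

best_model = SentenceTransformer("output/bi-encoder-camembert-large")
test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(
    test_samples, name="sts-test"
)
test_evaluator(best_model, output_path="output/")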

The results obtained after training for 10 epochs are shown in Figure 6. The performance is measured using Pearson and Spearman correlations based on different vector distances (cosine, Euclidean…).

Figure 6: Similarity scores using the best model.

This Bi-encoder model achieves Pearson correlation scores of 87.6% and 83.6% on the dev and test sets respectively, which is quite an acceptable result.

Phase 2: Training with Data Augmentation Using a Cross-Encoder

First, we train a Cross-encoder model on the same dataset. Here full attention is applied to the input sentence pair (instead of to each sentence independently, as with the Bi-encoder), and the output is the corresponding similarity score (between 0 and 1).

Indeed, applying attention to the whole pair of sentences yields a richer representation, since both sentences are taken into account when computing the score.

However, the end-use of a Cross-encoder is impractical due to the very high computational cost. For instance:

Clustering 10,000 sentences has quadratic complexity with a Cross-encoder and would require about 65 hours of computation with BERT [1]. End-to-end information retrieval is also impossible with Cross-encoders, since they do not produce independent representations of the inputs that could be indexed.

In contrast, Bi-encoders encode each sentence independently and map it to a dense vector space, allowing efficient indexing and comparison. For example, the complexity of clustering 10,000 sentences is reduced from 65 hours to about 5 seconds [1]. Many real-world applications therefore depend on the quality of Bi-encoders.

Due to these practical constraints, we decided to exploit the benefits of the two models:

  • The final model that will be used to give embeddings (vectors) for comparison will be a Bi-encoder

But

  • That model will be trained on a richer, augmented dataset of sentence pairs and scores (we switch from a model-driven strategy to a data-driven strategy). This augmentation is performed using the Cross-encoder.

How do we do this? By following the 3 steps below:

Step 1: Training the Cross-encoder Model with STSbenchmark dataset

To prepare the dataset, we swap the positions of the sentences in each training pair and add the swapped pairs to the original training set (shown in Figure 7), while keeping the dev and test sets unchanged.

This simple strategy doubles the number of training samples used for training the Cross-encoder. We call this augmented dataset “Gold samples”.

Figure 7: Preparation of the dataset for training with Cross-encoder

The code below illustrates this operation:
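A possible implementation, reusing the train_samples list from Phase 1:

from sentence_transformers import InputExample

# Add a swapped copy of every training pair: (A, B) -> (B, A)
gold_samples = list(train_samples)
gold_samples += [
    InputExample(texts=[ex.texts[1], ex.texts[0]], label=ex.label)
    for ex in train_samples
]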

Next, we load the pre-trained Cross-encoder model from the sentence_transformers library:
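A minimal sketch; the camembert/camembert-large checkpoint with a single regression output is assumed:

from sentence_transformers.cross_encoder import CrossEncoder

# One output neuron (num_labels=1) to regress the similarity score
cross_encoder = CrossEncoder("camembert/camembert-large", num_labels=1)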

Then we define the evaluation on the dev set during training:
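For example, with the library’s correlation evaluator for Cross-encoders:

from sentence_transformers.cross_encoder.evaluation import CECorrelationEvaluator

# Correlation between the Cross-encoder's predictions and the gold dev scores
ce_dev_evaluator = CECorrelationEvaluator.from_input_examples(
    dev_samples, name="sts-dev"
)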

The Cross-encoder model is then trained:
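A sketch of the training call on the Gold samples (batch size and warmup_steps are illustrative):

from torch.utils.data import DataLoader

gold_dataloader = DataLoader(gold_samples, shuffle=True, batch_size=16)

cross_encoder.fit(
    train_dataloader=gold_dataloader,
    evaluator=ce_dev_evaluator,
    epochs=10,
    warmup_steps=500,  # illustrative value
    output_path="output/cross-encoder-camembert-large",
)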

After training for 10 epochs, the Cross-encoder model achieves a Pearson correlation of 90.2% on the dev set, significantly higher than the score of the Bi-encoder model (85.2%). This increase was expected (Figure 2), thanks to the Cross-encoder’s full attention applied jointly to the two input sentences.

Step 2: Build silver pairs and score them using the Cross-encoder

Using the Gold samples and the previously trained Cross-encoder, we generate pairs called Silver samples (or Silver pairs) following the procedure illustrated in Figure 8 and explained below:

Figure 8: Generation of sentences from Gold samples

We take only one column of sentences from the gold samples (for example, the first sentence of each pair) and remove duplicates. We obtain a dataset of unique sentences from the gold dataset (~10,000 sentences).

We then want to score the similarity of each sentence against the rest of the set. We could use the trained Cross-encoder to obtain high-quality similarity scores (more representative, since full attention attends to both sentences at once), but as said before this would be very costly: with N sentences, we would need N × (N−1) predictions, each with complexity O(L²) in the sequence length L.

What we do instead is the following (see the sketch after this list):

  • Use the Bi-encoder model trained in Phase 1 to encode the N sentences.
  • Compute plain cosine-similarity scores (N × (N−1) comparisons, but each one is a cheap vector operation, so this is fast).
  • For each sentence, retain only the top-k most similar sentences (thus obtaining N × k pairs of sentences).
  • These N × k samples are called the “Silver samples”.
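A sketch of this mining step, using util.semantic_search from sentence-transformers; top_k is an illustrative choice and model is the Phase 1 Bi-encoder:

from sentence_transformers import util

top_k = 3  # number of neighbours kept per sentence (illustrative)

# Unique sentences from the gold pairs (~10,000 sentences)
sentences = list({s for ex in gold_samples for s in ex.texts})

# Encode once with the Phase 1 Bi-encoder, then search each sentence
# against the whole set by cosine similarity
embeddings = model.encode(sentences, convert_to_tensor=True)
hits = util.semantic_search(embeddings, embeddings, top_k=top_k + 1)

silver_pairs = []
for i, sentence_hits in enumerate(hits):
    for hit in sentence_hits:
        j = hit["corpus_id"]
        if i != j:  # drop the trivial match of a sentence with itself
            silver_pairs.append([sentences[i], sentences[j]])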

We then use the Cross-encoder trained in Step 1 to label the silver pairs with similarity scores of higher quality.
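For example:

from sentence_transformers import InputExample

# Higher-quality labels from the Cross-encoder trained in Step 1
silver_scores = cross_encoder.predict(silver_pairs)

silver_samples = [
    InputExample(texts=pair, label=float(score))
    for pair, score in zip(silver_pairs, silver_scores)
]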

Step 3: Train a Bi-encoder model on the entire dataset, including Gold samples and Silver samples

After creating the sentence pairs in the Silver samples using the Phase 1 trained Bi-encoder model and labeling them using the Phase 2 — Step 1 trained Cross-encoder, we combine both the Gold samples and the Silver samples to train a new Bi-encoder model.

The train_loss and the dev-set evaluator are defined in the same way as in Phase 1:
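A sketch of the final training run, on a fresh Bi-encoder built exactly as in Phase 1 (CamemBERT-large + mean pooling); batch size and warmup_steps remain illustrative:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses

# A fresh Bi-encoder, built exactly as in Phase 1
word_embedding_model = models.Transformer("camembert/camembert-large",
                                          max_seq_length=128)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
)
final_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Gold + Silver samples together form the augmented training set
all_samples = gold_samples + silver_samples
train_dataloader = DataLoader(all_samples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model=final_model)

final_model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=dev_evaluator,
    epochs=10,
    warmup_steps=500,  # illustrative value
    output_path="output/augmented-bi-encoder-camembert-large",
)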

After training for 10 epochs, the final model achieves Pearson scores of 88.2% and 85.9% on the dev and test sets, respectively. Thus, using the data-augmentation trick enabled by combining the Cross-encoder and the Bi-encoder, we improved the Pearson correlation score by 2 to 3 points on the test set.

The trained sentence embedding model for the French language can be found here:

dangvantuan/sentence-camembert-large · Hugging Face. The tables below compare the Pearson and Spearman correlations of our model and distiluse-base-multilingual-cased on the dev set and the test set.

[Tables: Pearson and Spearman correlations on the dev set and the test set]

Usage:

The model can be used directly as follows:

  • Install sentence-transformers

pip install sentence-transformers

  • Load the model
  • Encode the sentences, as in the sketch below:
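For example (the sentences are illustrative):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("dangvantuan/sentence-camembert-large")

sentences = [
    "Un avion est en train de décoller.",
    "Un homme joue d'une grande flûte.",
]
embeddings = model.encode(sentences)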

Application

One application of sentence embedding models for the French language is the Detection and Normalization of Temporal Expressions in French Text [7]. They are also used in search engines for semantic search, by computing the cosine similarity between the embedded query and a pre-embedded document database.
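For instance, a minimal semantic search: the corpus is embedded once up front, and each query is compared to it by cosine similarity (the corpus and query below are illustrative):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("dangvantuan/sentence-camembert-large")

# Embed the document database once, up front
corpus = [
    "Le chat dort sur le canapé.",
    "Le gouvernement a annoncé de nouvelles mesures.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# At query time, embed the query and rank documents by cosine similarity
query_embedding = model.encode("Où dort le chat ?", convert_to_tensor=True)
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
print(corpus[scores.argmax().item()])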

References

[1] Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

[2] CamemBERT: a Tasty French Language Model

[3] https://www.sbert.net/

[4] https://ixa2.si.ehu.eus/stswiki/index.php/STSbenchmark

[5] stsb_multi_mt · Datasets at Hugging Face

[6] Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks

[7] Detection and Normalization of Temporal Expressions in French Text


Acknowledgement

Thanks to our colleagues Kevin PAYET, Al Houceine KILANI, Ismail EL HATIMI and Nhut DOAN NGUYEN for the article review.

About

Van-Tuan DANG is a lead data scientist at La Javaness. He joined the company in 2020 and is a key player in various R&D projects on NLP & Computer Vision.


La Javaness R&D

We help organizations to succeed in the new paradigm of “AI@scale” by using machine intelligence responsibly and efficiently: www.lajavaness.com