Detection and Normalization of Temporal Expressions in French Text — Part 3: A Machine Learning Model

La Javaness R&D
9 min read · Feb 16, 2022


This series describes how to detect and normalize temporal expressions (date expressions) in French text. This third article in the series explains how to train a machine learning model to detect and normalize such expressions.

4. The Machine Learning Model

In the previous section, we annotated the dataset and stored it in nhutljn-temporal-expression-annotated.tsv. In this section, we will train a model on this dataset so that it can recognize temporal expressions and normalize/convert them into our defined format.

4.1 Possible Modeling Approaches

We considered the following approaches:

  • Use a Sequence-to-Sequence (Seq2Seq) model to encode the raw text into a sequence of temporal expressions in our normalized format. E.g. Dans les cinq premiers mois de la campagne 2002–2003, les exportations vers les Philippines et l’Indonésie ont semblé diminuer par rapport à l’année précédente. -> REL DIR - FIRST M5; ABS DIR - FROM Y2002; ABS DIR - TO Y2003; REL IND - PREV Y1
  • Use a Named Entity Recognition (NER) model to locate the temporal expressions one by one, then apply a Sequence-to-Sequence model to each detected expression to encode it into the normalized format. E.g. NER model: Dans les cinq premiers mois de la campagne 2002–2003, les exportations vers les Philippines et l’Indonésie ont semblé diminuer par rapport à l’année précédente. -> Dans les cinq premiers mois, 2002-, -2003, l'année précédente. Then Seq2Seq model: Dans les cinq premiers mois -> REL DIR - FIRST M5; 2002- -> ABS DIR - FROM Y2002; -2003 -> ABS DIR - TO Y2003; l'année précédente -> REL IND - PREV Y1.

The second method seems theoretically more coherent, but it adds extra work since we need to train an NER model (if we do not have an appropriate pre-trained one). Luckily, some libraries like spaCy provide such pre-trained models. It also means the final performance will depend on the quality of the NER model.

For these reasons, we will proceed with the first approach: use only a Seq2Seq model.

Seq2Seq (encoder-decoder) models are available in Hugging Face Transformers, a powerful tool to tackle this problem, so we will use a pre-trained model from the transformers library and fine-tune it on our dataset. If we head to the Hugging Face model hub and filter on Seq2Seq tasks like “Summarization” or “Translation”, together with the language “fr”, we will see the supported models. During our client project, we tested t5-small, t5-base, t5-large and mbart-large-cc25. t5-base is suitable for illustration in this article, as it is light yet powerful enough.

4.2 Prepare the Dataset for Training

As t5-base was originally trained on machine translation (among other text-to-text tasks), we will transform our dataset as if it were a translation problem we want to tackle (source language: French; target language: normalized dates).

Prepare the environment

The code in this article is meant to be run in a Python 3 kernel of a Jupyter notebook.

Let’s first import the necessary packages and define the constants.
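
Since the original notebook cell is not reproduced here, the following is a minimal sketch of this setup; the paths, the task prefix and the exact import list are assumptions chosen to stay consistent with the rest of the article.

import numpy as np
import pandas as pd
from datasets import Dataset, load_metric
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

DATASET = "nhutljn-temporal-expression-annotated.tsv"  # path to the annotated file
MODEL_LM = "t5-base"                                   # pre-trained checkpoint to fine-tune
MODEL_CHECKPOINT = "../models/mt-public-t5-base"       # final location of the fine-tuned model
PREFIX = "translate French to normalized dates: "      # task prefix (an assumption)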

Load the dataset

We download our annotated data file nhutljn-temporal-expression-annotated.tsv and put it in an arbitrary location (defined with constant DATASET in the code block above).
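
The loading cell is not shown in this export; a possible sketch, assuming the TSV has two columns named text and label, is the following.

# Read the annotated TSV file (the column names "text" and "label" are assumptions)
df = pd.read_csv(DATASET, sep="\t")

# Mimic the translation-dataset structure used by the Seq2Seq examples:
# the French sentence plays the role of the source language ("fr"),
# the normalized expressions play the role of the target language ("en")
raw_datasets = Dataset.from_list(
    [{"translation": {"fr": row["text"], "en": row["label"]}} for _, row in df.iterrows()]
)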

Next, we split the dataset into train and validation parts and have a look at the result:

splitted_datasets = raw_datasets.train_test_split(test_size=0.2)
splitted_datasets

Output

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 748
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 187
    })
})

Let's have a look at some examples:

splitted_datasets["train"][0], splitted_datasets["test"][1]

Output

({'translation': {'en': 'NONE',
'fr': "Le Roi s'enfuit néanmoins avec la Bergère sur son automate géant, qui devait servir à l'animation de la cérémonie, mais l'Oiseau parvient à en prendre le contrôle après avoir assommé le machiniste. Il démolit alors le palais avec le robot, d'abord maladroitement puis de plus en plus méthodiquement. Pendant ce temps, le Ramoneur affronte le Roi au sommet de l'Automate. Acculé, le Roi tente de poignarder le Ramoneur dans le dos, mais l'Oiseau l'en empêche en le saisissant avec la main de la machine puis active une soufflerie qui propulse le Roi loin dans les airs."}},
{'translation': {'en': 'REL DIR - FROM PREV Y?',
'fr': 'La marque « Choix du Président » est très prisée sur ce marché depuis plusieurs années.'}})

Tokenization

Next, we tokenize the texts in the dataset so that they become sequences of token IDs. To do that, we load the t5-base tokenizer (the checkpoint name is defined in the constant MODEL_LM).

tokenizer = AutoTokenizer.from_pretrained(MODEL_LM)

The function preprocess_function below describes how to transform each example: we prepend the task prefix to the source text, then tokenize the source text and the labels (the normalized expressions).
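
A minimal sketch of such a function, assuming the task prefix defined earlier and illustrative maximum lengths, could look like this (text_target requires a recent transformers version; older versions use tokenizer.as_target_tokenizer()).

MAX_INPUT_LENGTH = 512    # assumption
MAX_TARGET_LENGTH = 128   # assumption

def preprocess_function(examples):
    # Prepend the task prefix to every source sentence
    inputs = [example["fr"] for example in examples["translation"]]
    inputs = [PREFIX + text for text in inputs]
    targets = [example["en"] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True)
    # Tokenize the normalized expressions as labels
    labels = tokenizer(text_target=targets, max_length=MAX_TARGET_LENGTH, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs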

We apply this function to splitted_datasets with map and obtain the tokenized version of the dataset.
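
A sketch of this step, dropping the original translation column so that only the tokenized features remain (which matches the output below):

tokenized_datasets = splitted_datasets.map(
    preprocess_function, batched=True, remove_columns=["translation"]
)
tokenized_datasets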

Output

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'labels'],
        num_rows: 748
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'labels'],
        num_rows: 187
    })
})

The dataset is now in the right shape for training.

4.3 Configure the Training Settings

The data is ready. Let’s set up the model by loading the pre-trained “t5-base” (defined within the MODEL_LM constant).

model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_LM)

Training Arguments

The training arguments are defined by initializing a Seq2SeqTrainingArguments object; a sketch is given after the list below.

Basically,

  • We want to store the checkpoints in a temporary folder t5-base-temporal-expression but only keep the three latest checkpoints (specified in save_total_limit).
  • We evaluate and save the checkpoints at the end of each epoch.
  • The attributes per_device_train_batch_size and per_device_eval_batch_size are chosen to make the best use of our GPU platform.
  • The attributes learning_rate and weight_decay are learning-specific hyperparameters, chosen based on experience.
  • We use the BLEU metric and specify its computation in the Metrics subsection below. The BLEU score varies between 0 and 1 (it is often reported as a percentage between 0 and 100); a perfect model would obtain the maximal score.
  • We will train for 50 epochs (num_train_epochs).
  • load_best_model_at_end: this is a strategy to keep the best model in terms of the highest BLEU score. When this parameter is True, transformers keeps track of the best model among the three kept checkpoints and replaces the latest model with the best one at the end of the training process.
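
A sketch of these arguments, with illustrative values for the hyperparameters the article does not give explicitly (batch sizes, learning rate, weight decay):

args = Seq2SeqTrainingArguments(
    output_dir="t5-base-temporal-expression",  # temporary checkpoint folder
    evaluation_strategy="epoch",               # evaluate at the end of each epoch
    save_strategy="epoch",                     # save a checkpoint at the end of each epoch
    save_total_limit=3,                        # keep only the three latest checkpoints
    learning_rate=2e-5,                        # illustrative value
    weight_decay=0.01,                         # illustrative value
    per_device_train_batch_size=16,            # illustrative value, depends on the GPU
    per_device_eval_batch_size=16,             # illustrative value
    num_train_epochs=50,
    predict_with_generate=True,                # generate text during evaluation to compute BLEU
    load_best_model_at_end=True,
    metric_for_best_model="bleu",
)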

Now we build the data collator that batches the examples and pads them for the model:

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

Metrics

We use the BLEU metric and define a function that computes it at the end of each epoch.
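
A sketch of this function, assuming the sacrebleu implementation loaded from the datasets library:

metric = load_metric("sacrebleu")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Labels are padded with -100; replace this value before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # sacrebleu expects a list of references per prediction
    result = metric.compute(predictions=decoded_preds,
                            references=[[label] for label in decoded_labels])
    return {"bleu": result["score"]}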

We combine everything and build the Trainer object that is well-known in transformers (here, its Seq2Seq variant).
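
A sketch of the corresponding cell, reusing the variable names from the previous sketches:

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)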

4.4 Training

Running

trainer.train()

Apart from warnings and technical logs, we obtain a progress table and the following final output:

Output

Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from t5-small-finetuned-fr-to-en/checkpoint-1872 (score: 71.3346).

TrainOutput(global_step=2400, training_loss=0.07770670791467031, metrics={'train_runtime': 887.8723, 'train_samples_per_second': 43.081, 'train_steps_per_second': 2.703, 'total_flos': 1.326014013353472e+16, 'train_loss': 0.07770670791467031, 'epoch': 50.0})

We observe:

  • Training goes well, and the BLEU score improves until it plateaus near its best value.
  • The best BLEU score evaluated on the validation set is not bad: 70–71%.
  • The best model is saved in the temporary folder t5-small-finetuned-fr-to-en.
  • The best model is loaded as said in the logs (Loading best model from t5-small-finetuned-fr-to-en/checkpoint-1872 (score: 71.3346).)

We should save the best model and tokenizer to their final location for later use.

model.save_pretrained(MODEL_CHECKPOINT)
tokenizer.save_pretrained(MODEL_CHECKPOINT)

Output

Configuration saved in ../models/mt-public-t5-base/config.json
Model weights saved in ../models/mt-public-t5-base/pytorch_model.bin
tokenizer config file saved in ../models/mt-public-t5-base/tokenizer_config.json
Special tokens file saved in ../models/mt-public-t5-base/special_tokens_map.json
Copy vocab file to ../models/mt-public-t5-base/spiece.model

('../models/mt-public-t5-base/tokenizer_config.json',
'../models/mt-public-t5-base/special_tokens_map.json',
'../models/mt-public-t5-base/spiece.model',
'../models/mt-public-t5-base/added_tokens.json',
'../models/mt-public-t5-base/tokenizer.json')

As a reminder, in transformers, the model consists of a config.json file for the hyper-parameters and a .bin file for the network’s weights, whereas the tokenizer consists of its own .json configuration files plus a spiece.model file, since T5 uses a SentencePiece-based tokenizer.

4.5 Test

Let's try some test cases.
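
The test cell is not reproduced in this export; the following is a minimal inference sketch, assuming the same task prefix as during preprocessing, with one sentence taken from the discussion below and one illustrative sentence.

import torch

test_sentences = [
    "Je voudrais un document pour le mois juillet.",
    "La croissance est repartie depuis 2003.",  # illustrative example
]

inputs = tokenizer([PREFIX + sentence for sentence in test_sentences],
                   return_tensors="pt", padding=True, truncation=True).to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_length=128)
for sentence, prediction in zip(
        test_sentences, tokenizer.batch_decode(output_ids, skip_special_tokens=True)):
    print(sentence, "->", prediction)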

Output

Does the result seem reasonable? There is one error in the example “Je voudrais un document pour le mois juillet.”, where the prediction is ABS DIR - M7 D1 whereas the ground truth should be ABS DIR - M7. Apart from that, the other test cases work correctly. This means more work is still needed to make the model perfect, but given the limited volume of training data (roughly 1,000 examples), the result is already promising.

Let’s evaluate on the validation set. (In an actual project, we should use an independent test set.)
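
The evaluation cell is also missing from this export; the following is a sketch of an exact-match accuracy computation over the validation split (generation parameters are assumptions), which also collects the mispredicted cases for the analysis below.

correct = 0
errors = []
for example in splitted_datasets["test"]:
    source = example["translation"]["fr"]
    label = example["translation"]["en"]
    inputs = tokenizer(PREFIX + source, return_tensors="pt",
                       truncation=True).to(model.device)
    output_ids = model.generate(**inputs, max_length=128)
    prediction = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
    if prediction == label:
        correct += 1
    else:
        errors.append((source, label, prediction))
print(f"Accuracy: {correct / len(splitted_datasets['test']):.4f}")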

Output

Accuracy: 0.8343

4.6 Analysis

We can look at the cases where the model makes mistakes.
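
For instance, reusing the errors list collected in the evaluation loop above:

for text, label, prediction in errors:
    print("Text:", text)
    print("Label:", label)
    print("Predictions:", prediction)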

Output

Text: La marque « Choix du Président » est très prisée sur ce marché depuis plusieurs années.
Label: REL DIR - FROM PREV Y?
Predictions: DUR IND - Y?
Text: Il s'agit notamment de restrictions sur le moment de se nourrir et de boire, l'endroit où on le fait et sur ce qui peut être mangé et bu pendant tout le mois.
Label: REL IND - CURRENT M0
Predictions: NONE
Text: En 1972, Lester Kinsolving, éditeur au San Francisco Examiner et ancien prêtre épiscopal, rédige une série d'articles à charge contre le Temple du Peuple. Le premier article sort le 17 septembre 72, le second paraît le lendemain, le troisième encore le jour suivant. Ce troisième article affirme que Tim Stoen ne devrait pas être si haut placé dans les institutions publiques locales et qu'il officie en tant que pasteur sans en avoir la permission de l'État. Jones envoie 150 fidèles du Temple du Peuple manifester devant les bureaux du San Francisco Examiner,. Un journaliste est harcelé par téléphone, au point de devoir se cacher trois jours dans un hôtel avec sa famille. Sur huit articles rédigés, Kinsolving n'en publie finalement que quatre, les autres présentant un risque de diffamation trop important d'après son journal.
Label: ABS DIR - Y1972; ABS DIR - Y1972 M9 D17; REL IND - NEXT D1; REL IND - NEXT D1
Predictions: ABS DIR - Y1972; ABS DIR - Y1972 M9 D17; ABS DIR - Y1972 M9 D3; REL IND - NEXT D3
Text: La cité date du Ier siècle av. J.-C., pendant l'occupation romaine en Gaule : les Romains s'installent dans la plaine de l'Isle et créent la ville de Vesunna, à l'emplacement de l'actuel quartier sud. Celle-ci était la capitale romaine de la cité des Pétrocores. La ville de Périgueux naît en 1240 de l'union de « la Cité » (l'antique Vesunna) et du « Puy-Saint-Front ». Depuis, elle reste le centre du Périgord, subdivision historique de l'Aquitaine, puis est la préfecture du département français de la Dordogne. Elle s'agrandit encore en 1813 avec l'ancienne commune de Saint-Martin.
Label: ABS DIR - S-1; ABS DIR - Y1240; ABS DIR - Y1813
Predictions: ABS DIR - S1; ABS DIR - Y1240; ABS DIR - Y1813
Text: o La croissance du PIB atteint 7 % ou plus depuis près de 10 ans.
Label: REL DIR - FROM APPROX PREV Y10
Predictions: REL DIR - FROM PREV Y10
...
Text: Durant le règne du roi nabatéen Arétas IV, d'environ 9 av. J.-C. à 40, le royaume connaît un important mouvement culturel. C'est à cette époque que la plupart des tombeaux et temples sont construits.
Label: ABS DIR - FROM APPROX Y-9; ABS DIR - TO Y40
Predictions: NONE
Text: Ces sociétés sont établies depuis des dizaines d’années sur le marché.
Label: REL DIR - FROM APPROX PREV Y10
Predictions: REL DIR - FROM PREV Y10
Text: La superficie consacrée à la production de petits fruits et de fruits de verger a changé (tableau 1) depuis vingt ans, avec l'aide de nouvelles techniques et de nouvelles pratiques de taille qui ont permis d'augmenter la densité de plantation, les rendements et donc la production.
Label: REL DIR - FROM PREV Y20
Predictions: ABS DIR - FROM PREV Y20
Text: Une année avant, près de 10 à 12 tonnes de boeuf australien ont été consommées par mois, mais la quantité est maintenant fixée à environ 20 tonnes par mois dans un marché qui consomme quelque 100 tonnes au total chaque mois.
Label: REL IND - PREV Y1; FREQ IND - M1; FREQ IND - M1; FREQ IND - M1
Predictions: REL IND - PREV Y1; FREQ IND - M1; REL DIR - M1; FREQ IND - M1
Text: La période de dédouanement peut varier d'une journée à un mois, tout dépendant de la nature du produit et de l'expérience de l'importateur.
Label: DUR IND - FROM D1; DUR IND - TO M1
Predictions: DUR IND - D1; DUR IND - M1
Text: Malgré deux récessions au cours des 10 dernières années (crise financière asiatique de 1997-1998 et ralentissement économique mondial en 2001-2002), la croissance économique est repartie à Hong Kong depuis 2003, avec une augmentation des exportations, du tourisme récepteur et des dépenses de consommation bénéfique pour le territoire.
Label: REL DIR - FROM PREV Y10; ABS DIR - FROM Y1997; ABS DIR - TO Y1998; ABS DIR - FROM Y2001; ABS DIR - TO Y2002; ABS DIR - FROM Y2003
Predictions: REL DIR - FROM Y10; ABS DIR - FROM Y1997; ABS DIR - TO Y1998; ABS DIR - FROM Y2001; ABS DIR - TO Y2002; ABS DIR - FROM Y2003
Text: Selon M. Tiu, la durée de conservation varie entre 8 mois et une année.
Label: DUR IND - FROM M8; DUR IND - TO Y1
Predictions: DUR IND - M8; DUR IND - Y1

We see that the problems of our model are:

  • It sometimes ignores the terms “le mois”, “la semaine”, “l’année”, whereas it should treat them as the current/referenced period, like “ce mois” or “cette semaine-là”.
  • It sometimes ignores the future/past terms “derniers”, “suivantes”.
  • It sometimes ignores the approximation terms (REL DIR - FROM APPROX PREV Y10 vs REL DIR - FROM PREV Y10).
  • It sometimes cannot distinguish absolute and relative dates.
  • It cannot recognize negative years/centuries (dates before the common era, e.g. “av. J.-C.”).

However, the global accuracy score of 83% seems promising, and these weaknesses can be mitigated by adding more training data. To illustrate the idea in this series, we are satisfied with the current model and can move on to the next step: application and demonstration. The improvements will be carried out in future client projects.


Acknowledgement

Thanks to our colleagues Al Houceine KILANI and Ismail EL HATIMI for the article review.

About

Nhut DOAN NGUYEN has been a data scientist at La Javaness since March 2021.
