Detection and Normalization of Temporal Expressions in French Text — Part 3: A Machine Learning Model
This series describes how to detect and normalize temporal expressions (date expressions) in French text. This article, the third one in the series, describes how to train a machine learning model to detect and normalize such expressions.
4. The Machine Learning Model
In the previous section, we annotated the dataset and stored it in nhutljn-temporal-expression-annotated.tsv. In this section, we will train a model with this dataset so that it will be able to recognize temporal expressions and normalize/convert them to our defined format.
4.1 Possible Modeling Approaches
We considered the following approaches:
- Use a Sequence-to-Sequence (Seq2Seq) model to encode the raw text directly into a sequence of temporal expressions in our normalized format. E.g. "Dans les cinq premiers mois de la campagne 2002–2003, les exportations vers les Philippines et l’Indonésie ont semblé diminuer par rapport à l’année précédente." -> REL DIR - FIRST M5; ABS DIR - FROM Y2002; ABS DIR - TO Y2003; REL IND - PREV Y1
- Use a Named Entity Recognition (NER) model to locate the temporal expressions one by one, then apply a Seq2Seq model to each detected expression to encode it into the normalized format. E.g. the NER model maps "Dans les cinq premiers mois de la campagne 2002–2003, les exportations vers les Philippines et l’Indonésie ont semblé diminuer par rapport à l’année précédente." to the spans "Dans les cinq premiers mois", "2002-", "-2003" and "l'année précédente"; the Seq2Seq model then maps "Dans les cinq premiers mois" -> REL DIR - FIRST M5; "2002-" -> ABS DIR - FROM Y2002; "-2003" -> ABS DIR - TO Y2003; "l'année précédente" -> REL IND - PREV Y1.
The second method seems more coherent in theory, but it requires extra work, since we would need to train a NER model unless an appropriate pre-trained one is available (some libraries like spaCy do provide pre-trained NER models). It also means the final performance would depend on the quality of the NER model.
For these reasons, we will proceed with the first approach: a single Seq2Seq model.
Huggingface Transformers, a powerful library for this kind of problem, provides many pre-trained Seq2Seq models, so we will use a pre-trained model from the transformers library and fine-tune it with our dataset. If we head to the Huggingface models hub and filter on Seq2Seq tasks such as "Summarization" or "Translation", together with the language "fr", we can see the supported models. During our client project, we tested t5-base, t5-small, t5-large and mbart-large-cc25. t5-base is suitable for illustration in this article, as it is light yet powerful enough.
4.2 Prepare the Dataset for Training
As T5 is pre-trained on machine translation (among other tasks), we will format our dataset as if we were tackling a translation problem (source language: French; target language: normalized dates).
Prepare the environment
The code in this article is meant to be run in a Python 3 kernel of a Jupyter notebook.
Let’s first import the necessary packages and define the constants.
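Below is a minimal sketch of these imports and constants; the prefix and the maximum lengths are assumptions for illustration and should be adapted to your own setup.
# Imports used throughout this section
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

# Constants (values are assumptions for illustration)
DATASET = "nhutljn-temporal-expression-annotated.tsv"  # path to the annotated file
MODEL_LM = "t5-base"                                   # pre-trained model to fine-tune
MODEL_CHECKPOINT = "../models/mt-public-t5-base"       # final location of the model
PREFIX = "translate French to normalized dates: "      # hypothetical task prefix
MAX_INPUT_LENGTH = 512                                 # hypothetical max source length
MAX_TARGET_LENGTH = 64                                 # hypothetical max target length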
Load the dataset
We download our annotated data file nhutljn-temporal-expression-annotated.tsv and put it in an arbitrary location (defined by the constant DATASET in the code block above).
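A possible way to load the file into a datasets object with the translation structure used below; the column names text and label are assumptions about the TSV layout.
import pandas as pd

# Read the annotated TSV file (column names are assumptions)
df = pd.read_csv(DATASET, sep="\t")

# Wrap each row in the {"translation": {"fr": ..., "en": ...}} structure used
# by translation-style fine-tuning: "fr" holds the French text and "en" holds
# the normalized temporal expressions.
raw_datasets = Dataset.from_dict({
    "translation": [
        {"fr": text, "en": label}
        for text, label in zip(df["text"], df["label"])
    ]
})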
Next, we split the dataset into train and validation parts and take a look at the result:
splitted_datasets = raw_datasets.train_test_split(test_size=0.2)
splitted_datasets
Output
DatasetDict({
train: Dataset({
features: ['translation'],
num_rows: 748
})
test: Dataset({
features: ['translation'],
num_rows: 187
})
})
Let's have a look at some examples:
splitted_datasets["train"][0], splitted_datasets["test"][1]
Output
({'translation': {'en': 'NONE',
'fr': "Le Roi s'enfuit néanmoins avec la Bergère sur son automate géant, qui devait servir à l'animation de la cérémonie, mais l'Oiseau parvient à en prendre le contrôle après avoir assommé le machiniste. Il démolit alors le palais avec le robot, d'abord maladroitement puis de plus en plus méthodiquement. Pendant ce temps, le Ramoneur affronte le Roi au sommet de l'Automate. Acculé, le Roi tente de poignarder le Ramoneur dans le dos, mais l'Oiseau l'en empêche en le saisissant avec la main de la machine puis active une soufflerie qui propulse le Roi loin dans les airs."}},
{'translation': {'en': 'REL DIR - FROM PREV Y?',
'fr': 'La marque « Choix du Président » est très prisée sur ce marché depuis plusieurs années.'}})
Tokenization
Next, we tokenize the texts in the dataset so they become tensors the model can consume. To do that, we load the t5-base tokenizer (defined within the constant MODEL_LM).
tokenizer = AutoTokenizer.from_pretrained(MODEL_LM)
The function preprocess_function below describes how to transform the input into tensors: we add the prefix to the source text, then tokenize the source text and the labels (the normalized expressions).
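A minimal sketch of such a function, assuming the PREFIX, MAX_INPUT_LENGTH and MAX_TARGET_LENGTH constants defined earlier:
def preprocess_function(examples):
    # Add the task prefix to the French source texts
    inputs = [PREFIX + ex["fr"] for ex in examples["translation"]]
    targets = [ex["en"] for ex in examples["translation"]]

    # Tokenize the source texts
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True)

    # Tokenize the labels (the normalized expressions)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=MAX_TARGET_LENGTH, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs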
We apply this function to splitted_datasets and get the tokenized version of the dataset.
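A sketch of the call, using the batched map from datasets; remove_columns drops the raw text so only the tensors remain:
tokenized_datasets = splitted_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=["translation"],  # keep only the fields needed for training
)
tokenized_datasets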
Output
DatasetDict({
train: Dataset({
features: ['attention_mask', 'input_ids', 'labels'],
num_rows: 748
})
test: Dataset({
features: ['attention_mask', 'input_ids', 'labels'],
num_rows: 187
})
})
The dataset is now in the right shape for training.
4.3 Configure the Training Settings
The data is ready. Let’s set up the model by loading the pre-trained “t5-base” (defined within the MODEL_LM constant).
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_LM)
Training Arguments
The training arguments are defined by initializing a Seq2SeqTrainingArguments object.
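A sketch of such a configuration follows; the batch sizes, learning rate and weight decay are plausible values assumed for illustration and should be tuned for your own platform.
args = Seq2SeqTrainingArguments(
    output_dir="t5-base-temporal-expression",  # temporary checkpoint folder
    evaluation_strategy="epoch",               # evaluate at the end of each epoch
    save_strategy="epoch",                     # save a checkpoint at each epoch
    save_total_limit=3,                        # keep only the three latest checkpoints
    num_train_epochs=50,
    per_device_train_batch_size=16,            # assumption: adapt to your GPU
    per_device_eval_batch_size=16,             # assumption: adapt to your GPU
    learning_rate=2e-5,                        # assumption based on experience
    weight_decay=0.01,                         # assumption based on experience
    predict_with_generate=True,                # generate sequences during evaluation
    load_best_model_at_end=True,               # reload the best checkpoint at the end
    metric_for_best_model="bleu",
)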
Basically:
- We store the checkpoints in a temporary folder t5-base-temporal-expression but only keep the three latest checkpoints (specified in save_total_limit).
- We evaluate and save a checkpoint at the end of each epoch.
- The attributes per_device_train_batch_size and per_device_eval_batch_size are chosen to make the best use of our GPU platform.
- The attributes learning_rate and weight_decay are learning-specific parameters, chosen based on experience.
- We use the BLEU metric and specify its computation in the Metrics code block below. The BLEU score varies between 0 and 1; a perfect model would have a score of 1.
- We train for 50 epochs (num_train_epochs).
- load_best_model_at_end is a strategy to keep the best model in terms of the highest BLEU score: when this param is True, transformers keeps track of the best model among the saved checkpoints and reloads it as the final model at the end of the training process.
Now we define the data collator, which batches and pads the tokenized examples for the model:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
Metrics
Using the BLEU metric, we define a function to compute it at the end of each epoch.
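A sketch of this computation, using the sacrebleu implementation (loaded here through the evaluate library; older versions of the stack used datasets.load_metric instead):
import numpy as np
import evaluate

metric = evaluate.load("sacrebleu")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]

    # Replace -100 (positions ignored by the loss) before decoding the labels
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # sacrebleu expects a list of references per prediction
    result = metric.compute(
        predictions=[pred.strip() for pred in decoded_preds],
        references=[[label.strip()] for label in decoded_labels],
    )
    return {"bleu": result["score"]}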
We combine everything and build the well-known Trainer object from transformers (here, its Seq2SeqTrainer variant).
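A sketch of the construction, reusing the objects defined above:
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)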
4.4 Training
We launch the training by running:
trainer.train()
Apart from warnings and technical logs, we obtain a per-epoch progress table and the following final summary:
Output
Training completed. Do not forget to share your model on huggingface.co/models =)
Loading best model from t5-small-finetuned-fr-to-en/checkpoint-1872 (score: 71.3346).
TrainOutput(global_step=2400, training_loss=0.07770670791467031, metrics={'train_runtime': 887.8723, 'train_samples_per_second': 43.081, 'train_steps_per_second': 2.703, 'total_flos': 1.326014013353472e+16, 'train_loss': 0.07770670791467031, 'epoch': 50.0})
We observe:
- Training goes well, and the BLEU score improves up to its best value over the epochs.
- The best BLEU score evaluated on the validation set is not bad: 70–71%.
- The best model is saved in the temporary folder t5-small-finetuned-fr-to-en.
- The best model is loaded at the end, as stated in the logs (Loading best model from t5-small-finetuned-fr-to-en/checkpoint-1872 (score: 71.3346).).
We should save the best model and tokenizer to their final location for later use.
model.save_pretrained(MODEL_CHECKPOINT)
tokenizer.save_pretrained(MODEL_CHECKPOINT)
Output
Configuration saved in ../models/mt-public-t5-base/config.json
Model weights saved in ../models/mt-public-t5-base/pytorch_model.bin
tokenizer config file saved in ../models/mt-public-t5-base/tokenizer_config.json
Special tokens file saved in ../models/mt-public-t5-base/special_tokens_map.json
Copy vocab file to ../models/mt-public-t5-base/spiece.model
('../models/mt-public-t5-base/tokenizer_config.json',
'../models/mt-public-t5-base/special_tokens_map.json',
'../models/mt-public-t5-base/spiece.model',
'../models/mt-public-t5-base/added_tokens.json',
'../models/mt-public-t5-base/tokenizer.json')
As a reminder, in transformers, a model consists of a config.json file for the hyper-parameters and a .bin file for the network’s weights, whereas the tokenizer consists of its own .json configuration files and a spiece.model file holding the SentencePiece vocabulary, together with a tokenizer.json file when the tokenizer is available as a fast tokenizer.
4.5 Test
Let's try some test cases.
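As an illustration, a hypothetical helper that normalizes a single sentence with the fine-tuned model, reusing the same PREFIX as during preprocessing:
def normalize(text):
    # Encode the sentence with the task prefix and generate the normalized form
    inputs = tokenizer(PREFIX + text, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_length=MAX_TARGET_LENGTH)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

normalize("Je voudrais un document pour le mois juillet.")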
Output
Does the result seem reasonable? There is one error in the example “Je voudrais un document pour le mois juillet.”, where the prediction is ABS DIR - M7 D1 whereas the ground truth should be ABS DIR - M7. Apart from that, the other test cases work correctly. That means more work is still needed to make the model perfect, but given the limited volume of data used for training (about 1,000 examples), the result is already promising.
Let’s evaluate the model on the validation set (in an actual project, we should use an independent test set).
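A sketch of an exact-match evaluation over the validation split, reusing the normalize helper above:
examples = splitted_datasets["test"]["translation"]
correct = sum(normalize(ex["fr"]) == ex["en"] for ex in examples)
print(f"Accuracy: {correct / len(examples):.4f}")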
Output
Accuracy: 0.8343
4.6 Analysis
We can look at the cases where the model makes mistakes.
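For instance, a possible way to print the mismatching cases, again with the normalize helper:
for ex in splitted_datasets["test"]["translation"]:
    pred = normalize(ex["fr"])
    if pred != ex["en"]:
        print(f"Text: {ex['fr']}")
        print(f"Label: {ex['en']}")
        print(f"Predictions: {pred}")
        print()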
Output
Text: La marque « Choix du Président » est très prisée sur ce marché depuis plusieurs années.
Label: REL DIR - FROM PREV Y?
Predictions: DUR IND - Y?

Text: Il s'agit notamment de restrictions sur le moment de se nourrir et de boire, l'endroit où on le fait et sur ce qui peut être mangé et bu pendant tout le mois.
Label: REL IND - CURRENT M0
Predictions: NONE

Text: En 1972, Lester Kinsolving, éditeur au San Francisco Examiner et ancien prêtre épiscopal, rédige une série d'articles à charge contre le Temple du Peuple. Le premier article sort le 17 septembre 72, le second paraît le lendemain, le troisième encore le jour suivant. Ce troisième article affirme que Tim Stoen ne devrait pas être si haut placé dans les institutions publiques locales et qu'il officie en tant que pasteur sans en avoir la permission de l'État. Jones envoie 150 fidèles du Temple du Peuple manifester devant les bureaux du San Francisco Examiner,. Un journaliste est harcelé par téléphone, au point de devoir se cacher trois jours dans un hôtel avec sa famille. Sur huit articles rédigés, Kinsolving n'en publie finalement que quatre, les autres présentant un risque de diffamation trop important d'après son journal.
Label: ABS DIR - Y1972; ABS DIR - Y1972 M9 D17; REL IND - NEXT D1; REL IND - NEXT D1
Predictions: ABS DIR - Y1972; ABS DIR - Y1972 M9 D17; ABS DIR - Y1972 M9 D3; REL IND - NEXT D3

Text: La cité date du Ier siècle av. J.-C., pendant l'occupation romaine en Gaule : les Romains s'installent dans la plaine de l'Isle et créent la ville de Vesunna, à l'emplacement de l'actuel quartier sud. Celle-ci était la capitale romaine de la cité des Pétrocores. La ville de Périgueux naît en 1240 de l'union de « la Cité » (l'antique Vesunna) et du « Puy-Saint-Front ». Depuis, elle reste le centre du Périgord, subdivision historique de l'Aquitaine, puis est la préfecture du département français de la Dordogne. Elle s'agrandit encore en 1813 avec l'ancienne commune de Saint-Martin.
Label: ABS DIR - S-1; ABS DIR - Y1240; ABS DIR - Y1813
Predictions: ABS DIR - S1; ABS DIR - Y1240; ABS DIR - Y1813

Text: o La croissance du PIB atteint 7 % ou plus depuis près de 10 ans.
Label: REL DIR - FROM APPROX PREV Y10
Predictions: REL DIR - FROM PREV Y10

...

Text: Durant le règne du roi nabatéen Arétas IV, d'environ 9 av. J.-C. à 40, le royaume connaît un important mouvement culturel. C'est à cette époque que la plupart des tombeaux et temples sont construits.
Label: ABS DIR - FROM APPROX Y-9; ABS DIR - TO Y40
Predictions: NONE

Text: Ces sociétés sont établies depuis des dizaines d’années sur le marché.
Label: REL DIR - FROM APPROX PREV Y10
Predictions: REL DIR - FROM PREV Y10

Text: La superficie consacrée à la production de petits fruits et de fruits de verger a changé (tableau 1) depuis vingt ans, avec l'aide de nouvelles techniques et de nouvelles pratiques de taille qui ont permis d'augmenter la densité de plantation, les rendements et donc la production.
Label: REL DIR - FROM PREV Y20
Predictions: ABS DIR - FROM PREV Y20

Text: Une année avant, près de 10 à 12 tonnes de boeuf australien ont été consommées par mois, mais la quantité est maintenant fixée à environ 20 tonnes par mois dans un marché qui consomme quelque 100 tonnes au total chaque mois.
Label: REL IND - PREV Y1; FREQ IND - M1; FREQ IND - M1; FREQ IND - M1
Predictions: REL IND - PREV Y1; FREQ IND - M1; REL DIR - M1; FREQ IND - M1

Text: La période de dédouanement peut varier d'une journée à un mois, tout dépendant de la nature du produit et de l'expérience de l'importateur.
Label: DUR IND - FROM D1; DUR IND - TO M1
Predictions: DUR IND - D1; DUR IND - M1

Text: Malgré deux récessions au cours des 10 dernières années (crise financière asiatique de 1997-1998 et ralentissement économique mondial en 2001-2002), la croissance économique est repartie à Hong Kong depuis 2003, avec une augmentation des exportations, du tourisme récepteur et des dépenses de consommation bénéfique pour le territoire.
Label: REL DIR - FROM PREV Y10; ABS DIR - FROM Y1997; ABS DIR - TO Y1998; ABS DIR - FROM Y2001; ABS DIR - TO Y2002; ABS DIR - FROM Y2003
Predictions: REL DIR - FROM Y10; ABS DIR - FROM Y1997; ABS DIR - TO Y1998; ABS DIR - FROM Y2001; ABS DIR - TO Y2002; ABS DIR - FROM Y2003

Text: Selon M. Tiu, la durée de conservation varie entre 8 mois et une année.
Label: DUR IND - FROM M8; DUR IND - TO Y1
Predictions: DUR IND - M8; DUR IND - Y1
We see that the main problems of our model are:
- It sometimes ignores the terms “le mois”, “la semaine”, “l’année”, whereas it should interpret them as the current/referenced period, like “ce mois” or “cette semaine-là”.
- It sometimes ignores the past/future terms such as “derniers” or “suivantes”.
- It sometimes ignores the approximation terms (REL DIR - FROM APPROX PREV Y10 vs REL DIR - FROM PREV Y10).
- It sometimes cannot distinguish absolute and relative dates.
- It cannot recognize negative years/centuries.
However, the global accuracy score of 83% is promising, and these weaknesses can be mitigated by adding more training data. To illustrate the idea in this series, we are satisfied with the current model and can move on to the next step: application and demonstration. The improvements will be carried out in future client projects.
References
- [1] Huggingface Transformers — Sequence-to-sequence models
- [2] Models on Transformers models hub: t5-base, t5-small, t5-large, mbart-large-cc25
Acknowledgement
Thanks to our colleagues Al Houceine KILANI and Ismail EL HATIMI for the article review.
About
Nhut DOAN NGUYEN has been a data scientist at La Javaness since March 2021.