Detection and Normalization of Temporal Expressions in French Text — Part 1: Build a Dataset

1. Introduction to the Series

In human languages, the same date or time can be expressed in many different ways. For instance, at the time of writing this article (Monday, 14 Feb 2022), the day 7 Feb 2022 can be referred to by any of the following expressions:

  • Monday of last week
  • One week ago
  • 7 days ago
  • Feb 7 2022
  • 02/07/2022
  • Feb 7th this year
  • etc.
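Once a reference date is fixed, all of the expressions above resolve to the same calendar date. A minimal standard-library sketch, using 2022-02-14 as the reference date from the example above:

```python
from datetime import date, timedelta

reference = date(2022, 2, 14)  # "today" in the example above

# "one week ago" and "7 days ago"
assert reference - timedelta(weeks=1) == date(2022, 2, 7)
assert reference - timedelta(days=7) == date(2022, 2, 7)

# "Monday of last week": back up to this week's Monday, then 7 more days
this_monday = reference - timedelta(days=reference.weekday())
assert this_monday - timedelta(days=7) == date(2022, 2, 7)
```

The hard part, of course, is mapping free-text French expressions onto such arithmetic; that is what the rest of the series addresses.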

1.1 Objective of the Series

In this series, we would like to combine the two problems mentioned above. We expect the machine to read a human text like:

Je voudrais aller à Paris demain. J'y suis déjà allé l'an dernier entre janvier 12 et janvier 15 mais je n'ai pas pu visiter grande choses. 3 jours n'était pas suffisant.
(In English: "I would like to go to Paris tomorrow. I already went there last year between January 12 and January 15, but I could not visit much. Three days was not enough.")

and extract its temporal expressions in a normalized form such as (this format is just for illustration; we will define the canonical format in a dedicated section of the article):

MOMENT: 2022-02-15
INTERVAL: 2021-01-01 TO 2021-12-31
INTERVAL: 2021-01-12 TO 2021-01-15

1.2 Use Cases

We can think of several use cases for the problems above, some of which we have covered in our company's client projects.

  • Indexing pages or parts of a book/document based on the time period they cover. E.g. given a history book about World War II whose content is not organized in chronological order, can we build an index of pages, paragraphs or even events based on the dates/times they mention?
  • Pre-annotation of time expressions for a relation-classification problem. E.g. in a large document, we would like to detect the moments/dates when any kind of event happened and link each event with its (absolute) time. It is more convenient to automatically annotate the temporal expressions and the events first, then carry on linking them together.

1.3 Roadmap

  • Creation of the dataset:
    Since we cannot use customer data to illustrate the article, and it is not easy to find a French corpus adapted to the purpose, we will start by explaining how to construct a toy dataset with a rich presence of temporal expressions.
  • Definition of a normalized format for time expressions:
    For illustration purposes, it is not necessary to use the full, complex format of the TimeML markup language. Instead, we introduce a lighter format which is enough to demonstrate the AI methods and the use-case application later.
    After this step, we will annotate the dataset so that each text is associated with the normalized form of its temporal expressions.
  • Fine-tuning of a Hugging Face Transformer model of the sequence-to-sequence (seq2seq) type to do the job.
  • Demonstration of a use case with an application.

2. Construction of the Dataset

2.1 How to construct a useful dataset if we don’t have one

In practice, given any data science problem, it is not easy to find an adequate dataset (one full of the elements of interest). For our problem, we could not find a dataset with a full variety of temporal expression formats. Luckily, temporal expressions appear a lot in daily language, so the idea is to look into one or several sufficiently large corpora and extract a subset that is rich in such expressions. Two approaches can help:

  1. Pre-trained models: use a pre-trained model or tool that tackles the same problem (the extraction of temporal expressions) to run over part of the corpus. Open-source tools such as dateparser and datefinder exist, although they do not handle all of the cases we intend to extract.
  2. Similarity search: select some sample queries (N weeks ago, last year, February 4th, etc.). Use a text-embedding model to encode the paragraphs/sentences of the corpus as vectors, then look for the paragraphs/sentences most similar to the sample queries. The notion of "most similar" is typically translated into a mathematical notion such as cosine similarity.
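For reference, the cosine similarity between two embedding vectors can be computed with nothing but the standard library. The vectors below are toy values, not real sentence embeddings:

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|); ranges from -1 to 1
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```

In practice, libraries compute this in batch on normalized vectors, but the definition is exactly this.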

2.2 Download and Read the Original Dataset

Let us begin with asi/wikitext_fr. The dataset is registered in the Hugging Face dataset hub, so we can easily download it using datasets.load_dataset.

No config specified, defaulting to: wikitext-35
train: Dataset({ features: ['paragraph'], num_rows: 376313 })
test: Dataset({ features: ['paragraph'], num_rows: 8449 })
validation: Dataset({ features: ['paragraph'], num_rows: 7753 })
{'paragraph': ['John Winston Ono Lennon [ d͡ʒɒn ˈlɛnən], né le 9 octobre 1940 à Liverpool et mort assassiné le 8 décembre 1980 à New York, est un auteur-compositeur-interprète, musicien et écrivain britannique.\n', 
"Il est le fondateur des Beatles, groupe musical anglais au succès planétaire depuis sa formation au début des années 1960. Au sein des Beatles, il forme avec Paul McCartney l'un des tandems d'auteurs-compositeurs les plus influents et prolifiques de l'histoire du rock, donnant naissance à plus de deux cents chansons.\n",
"Adolescent, influencé par ses idoles américaines du rock 'n' roll, il est emporté par la vague de musique skiffle qui sévit à Liverpool et fonde, au début de l'année 1957, le groupe des Quarrymen, qui évolue pour devenir, avec Paul McCartney, George Harrison et Ringo Starr, les Beatles. Des albums Please Please Me en 1963 à Let It Be en 1970, les Beatles deviennent un des plus grands phénomènes de l'histoire de l'industrie discographique, introduisant de nombreuses innovations musicales et mélangeant les genres et les influences avec une audace et une sophistication jusqu'alors inédites. Lennon occupe une place centrale dans cette réussite populaire, critique et commerciale, composant des œuvres majeures pour le groupe. Les dissensions entre les musiciens, en particulier entre Lennon et McCartney, mettent fin à l'aventure en 1970.\n",]}

2.3 The Sentence Embedding Model

To look for items similar to our queries, we need a text-embedding model as a backbone. We will use a model trained by my colleague Van-Tuan DANG (see how this was done in this article). The trained model has been uploaded to the Hugging Face model hub under the name dangvantuan/sentence-camembert-large.

(0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: CamembertModel
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})

Preparing to save the embeddings (the texts encoded as tensors)

To encode every item in the corpus, we use torch to convert the texts into a tensor (a stack of vectors of real numbers, one row per text). We can store this huge tensor as a torch file with the .pt extension. We then use the sentence-embedding model to select similar paragraphs.

EMBEDDED_WEIGHT_FILE = "../models/" # Replace it with your directory
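A simple cache-on-disk pattern for the embeddings could look like the sketch below; the helper name and path argument are hypothetical, not the article's actual code:

```python
import os
import torch

def get_or_compute_embeddings(texts, encode_fn, path):
    """Load embeddings from `path` if it exists; otherwise encode and save.

    `encode_fn` stands for the sentence-embedding model's encode method.
    """
    if os.path.exists(path):
        return torch.load(path)
    embeddings = encode_fn(texts)  # 2-D tensor: one row per text
    torch.save(embeddings, path)
    return embeddings
```

On the second run, the (slow) encoding step is skipped and the tensor is read straight from the .pt file.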

2.4 Encode the Texts

As the corpus contains lots of short texts of only a few characters, with no possibility of containing dates, we first focus on the long texts (say, 100 characters or more).
Also, as the corpus is large, we only treat a subsample of it (say, the first 200,000 items of the train split).
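The filtering itself is a one-liner; the thresholds below follow the text, and the two example strings are toy stand-ins for corpus["train"]["paragraph"]:

```python
MIN_CHARS = 100       # skip very short texts, unlikely to contain dates
MAX_ITEMS = 200_000   # only treat a subsample of the train split

# Toy stand-in for corpus["train"]["paragraph"]
paragraphs = [
    "Plan du site.",
    "John Lennon, né le 9 octobre 1940 à Liverpool et mort le 8 décembre "
    "1980 à New York, est un auteur-compositeur-interprète britannique.",
]

selected = [p for p in paragraphs[:MAX_ITEMS] if len(p) >= MIN_CHARS]
print(len(selected))  # only the long biography paragraph survives
```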

Embeddings file found: ../models/
Embeddings loaded.
CPU times: user 1.89 s, sys: 723 ms, total: 2.62 s
Wall time: 2.62 s
torch.Size([10285, 1024])

2.5 Search by Similarity

The following shows how the sample queries can be used as references to look for similar texts in the dataset.

results = search_by_query(QUERIES, sim_model, corpus_paragraphs, threshold=0.2)
Embeddings file found: ../models/
Embeddings loaded.
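The body of search_by_query is not shown in the article; here is a minimal sketch of what such a function might do, assuming the model exposes an encode method returning a 2-D tensor (as sentence-transformers models do with convert_to_tensor=True):

```python
import torch

def search_by_query(queries, model, corpus, threshold=0.2):
    """Return (text, score) pairs whose best cosine similarity
    to any of the queries exceeds `threshold`."""
    q = torch.nn.functional.normalize(model.encode(queries), dim=1)
    c = torch.nn.functional.normalize(model.encode(corpus), dim=1)
    scores = (q @ c.T).max(dim=0).values     # best query score per text
    keep = torch.nonzero(scores > threshold).squeeze(1)
    return [(corpus[i], scores[i].item()) for i in keep.tolist()]
```

Normalizing both sides first turns the matrix product into a full query-by-corpus table of cosine similarities, computed in one shot.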

2.6 Save the Selected Items

We can write the selected texts into a text file, which will be used in the next steps.
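A minimal sketch, assuming results is the list of (text, score) pairs returned by the search step; the output file name is hypothetical:

```python
OUTPUT_FILE = "selected-raw-text.txt"  # hypothetical name

# Toy stand-in for the (text, score) pairs returned by the search step
results = [("Je pars demain.", 0.41), ("Il est né le 9 octobre 1940.", 0.37)]

with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
    for text, score in results:
        f.write(text.strip() + "\n")  # one selected paragraph per line
```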


2.7 Repeat with Other Datasets

The texts of the first dataset seem to come from historical documents. We can diversify by adding other datasets. Let us look at the second one that we introduced at the beginning of section 2, a dataset used for machine translation:

corpus_2 = datasets.load_dataset("giga_fren")
[{'en': 'Changing Lives _BAR_ Changing Society _BAR_ How It Works _BAR_ Technology Drives Change Home _BAR_ Concepts _BAR_ Teachers _BAR_ Search _BAR_ Overview _BAR_ Credits _BAR_ HHCC Web _BAR_ Reference _BAR_ Feedback Virtual Museum of Canada Home Page', 'fr': 'Il a transformé notre vie _BAR_ Il a transformé la société _BAR_ Son fonctionnement _BAR_ La technologie, moteur du changement Accueil _BAR_ Concepts _BAR_ Enseignants _BAR_ Recherche _BAR_ Aperçu _BAR_ Collaborateurs _BAR_ Web HHCC _BAR_ Ressources _BAR_ Commentaires Musée virtuel du Canada'}, {'en': 'Site map', 'fr': 'Plan du site'}, {'en': 'Feedback', 'fr': 'Rétroaction'}]

We apply exactly the same processing to the French field ('fr') of this dataset: retrieve the texts, encode them as tensors, search for items similar to the queries, and finally store the results. Concatenating the items selected from the two datasets, we get something like this file: medium-temporal-expression-selected-raw-text.txt. Of course, to complete the dataset, we should try other approaches (keywords, pre-trained models, other similarity models) on more datasets.
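Extracting the French side of each translation pair is then straightforward; the record layout below mimics the samples printed above:

```python
# Toy records mimicking the giga_fren samples printed above
records = [
    {"en": "Site map", "fr": "Plan du site"},
    {"en": "Feedback", "fr": "Rétroaction"},
]

french_texts = [record["fr"] for record in records]
print(french_texts)
```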

2.8 Recap

In this section, we presented a text-similarity method, based on text embeddings, for extracting useful data from a large text corpus. The output file nhutljn-temporal-expression-selected-raw-text.txt will be used in the next section for annotation.



Thanks to our colleagues Al Houceine KILANI and Ismail EL HATIMI for the article review.


Nhut DOAN NGUYEN has been a data scientist at La Javaness since March 2021.


