Detection and Normalization of Temporal Expressions in French Text — Part 2: Label Format and Annotation Guide

La Javaness R&D
8 min read · Feb 16, 2022

This series describes how to detect and normalize temporal expressions (date expressions) in French text. This article, the second one in the series, describes how to annotate the dataset with a normalized format of temporal expressions.

3. Temporal Expression Normalization

3.1 TimeML

TimeML is a markup language for annotating events and temporal expressions. It was introduced in [2] as an outcome of TERQAS, a six-month workshop held in 2002. Its purpose is to convert temporal expressions into a standardized form that a machine can understand. This is more or less what we do in this series, but TimeML’s scope is much larger: its ambition is to cover all possibilities in human language. Moreover, TimeML also deals with events and the relations between times and events, so the full set of TimeML tags contains EVENT and LINK tags beside TIME. We begin by studying TimeML because it is one of the reliable approaches to temporal expression annotation.

The following is a “simple” example with the paragraph “In Washington today, the Federal Aviation Administration released air traffic control tapes from the night the TWA Flight eight hundred went down. There’s nothing new on why the plane exploded, but you cannot miss the moment.”, where we use TimeML to annotate the time expressions.

Intuitively, we see that temporal expressions and the verbs linked to them are annotated in the example.

Most of the time, at the end of each text, there will be a section for relation linking, such as:

which means the event of id e15 happened before e18. There can be relations between time and time, time and events, events and events.

The annotation guide [3] specifies all the instructions on how to label the temporal expressions and events in the text. In this guide, the main TimeML tags are

  • <EVENT> for events (typically verbs).
  • <TIMEX3> for temporal expressions.
  • <SIGNAL> for additional information on events (like “during” for emphasis, “if” for a conditional context, etc.).
  • <TLINK> for links between two temporal expressions or one event and one expression.
  • <SLINK> for links between two events.
  • <ALINK> for links between an aspectual event and an argument event.

The relation between time and events will be studied in a later series. For the moment, we focus on the TIMEX3 tag. The tag can contain the following attributes:

type ::= ’DATE’|’TIME’|’DURATION’
functionInDocument ::= ’CREATION_TIME’|’EXPIRATION_TIME’ |
’MODIFICATION_TIME’|’PUBLICATION_TIME’|
’RELEASE_TIME’|’RECEPTION_TIME’|’NONE’
temporalFunction ::= ’true’ | ’false’
value ::= CDATA
{value ::= duration | dateTime | time | date | gYearMonth |
gYear | gMonthDay | gDay | gMonth}
valueFromFunction ::= IDREF
{valueFromFunction ::= TemporalFunctionID
TemporalFunctionID ::= tf<integer>}
mod ::= ’BEFORE’|’AFTER’|’ON_OR_BEFORE’|’ON_OR_AFTER’|
’LESS_THAN’|’MORE_THAN’|’EQUAL_OR_LESS’|
’EQUAL_OR_MORE’|’START’|’MID’|’END’|’APPROX’
anchorTimeID ::= TimeID
anchorEventID ::= EventID

Some of the attributes (functionInDocument, valueFromFunction, TemporalFunctionID, anchorTimeID, anchorEventID) can be ignored because they relate not to the time notion itself but to the function and context of the expression in the document. The remaining attributes, type, mod and value, characterize the temporal expression, and our simpler format will be based on them.

It is worth mentioning that starting from TimeML in this section illustrates the approach “complex first, then simpler”. We begin with a standard solution, though complex and sophisticated, in order to extract the main points related to our simpler problem. That way, we make sure we do not forget any essential aspect when modelling, unless we intend to ignore it. Of course, we can proceed in the opposite order: start with something simple, then add more and more complexity until the modelling seems complete enough for our use case.

3.2 Our Normalized Temporal Expression Format

We define the following characteristics:

3.2.1 TYPE:

In TimeML, types can be DATE/TIME/DURATION. We won't consider time in our dataset (i.e. hours, minutes, seconds will be ignored).

We distinguish between

  • absolute date expressions like “08 Jan 2022”, “last June”, “the last 21st (day)” where we refer to a moment with absolute numbers;
  • relative date expressions like “3 days ago”, “since 3 days” where we need to specify a direction and a distance toward a referenced moment (for example, 3 days ago means the past direction, a distance = 3 days and the referenced moment = today);
  • relative date expressions like “last month”, “next week”, “last decade” where we have a direction but no distance to the referenced moment; instead, the expression refers to a standard time period (a month begins on the 1st and ends on the 28th to the 31st).

The second and third kinds will both be called relative dates, as we will use different terms for them in the attribute TENSE below.

We also introduce the notion of frequency (like "per week", "several times a year", "annual") beside the notion of duration.

3.2.2 VALUE:

We write the value as a number preceded by a unit, e.g. Y2002 for the year 2002, D1 for 1 day or for the 1st day (of a month). We use S, DC, Y, Q, M, W, D for a century, a decade, a year, a quarter, a month, a week and a day, respectively.
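As a minimal sketch of how a single VALUE token such as Y2002 or D1 could be read back into a unit and a number, consider the following. The names `UNITS`, `Value` and `parse_value` are hypothetical, and the sketch assumes one unit per token; compound values are not handled.

```python
import re
from typing import NamedTuple

# Unit codes listed in the text above; the English names are for readability only.
UNITS = {
    "S": "century",
    "DC": "decade",
    "Y": "year",
    "Q": "quarter",
    "M": "month",
    "W": "week",
    "D": "day",
}


class Value(NamedTuple):
    unit: str    # one of the keys of UNITS
    number: int  # e.g. 2002 for Y2002, 1 for D1


# "DC" is listed before "D" so that the two-letter code is tried first.
_VALUE_RE = re.compile(r"(DC|S|Y|Q|M|W|D)(\d+)")


def parse_value(token: str) -> Value:
    """Split a token like 'Y2002' into its unit code and its number."""
    match = _VALUE_RE.fullmatch(token)
    if match is None:
        raise ValueError(f"not a single-unit VALUE token: {token!r}")
    return Value(match.group(1), int(match.group(2)))
```

Note that the token alone does not say whether D1 means “1 day” or “the 1st day”; in this format that distinction has to come from the other attributes.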

3.2.3 TENSE:

By TENSE, we describe whether the temporal expression is in the past, the present or the future in comparison to the time of speaking.

3.2.4 ANCHOR

By ANCHOR, we specify if a preposition like “before”, “after”, “from” (“since”) or “to” (“until”) is added to the moment mentioned in VALUE.

3.2.5 PARTIAL

By PARTIAL, we specify if some phrases/words are added to the interval/moment mentioned in VALUE to limit it to the start, middle or end of the temporal expression.

3.2.6 APPROXIMATION

By APPROXIMATION, we specify if phrases/words like “less than”, “more than”, “approximately”, “more or less” etc. are added to the quantity/interval/moment mentioned within VALUE to make it an approximate value (instead of an exact value).

3.2.7 MODE

Finally, as we do not consider events in our modelling (while TimeML does), we simplify this notion by introducing a speech mode, direct or indirect. We use MODE to describe this speech mode: “Direct” means the temporal expression is linked to the moment of speaking; “Indirect” means the temporal expression is linked to an event that may happen at a moment other than the time of speaking. For instance, “yesterday” => DIRECT; “the day before” or “the previous day” => INDIRECT.

We come up with the following annotation guide, where each characteristic (attribute) should be written with one pre-defined term.

3.3 The Annotation Guide

Each temporal expression is represented as a tuple of the following attributes: TYPE, MODE, ORDER, ANCHOR, PARTIAL, APPROXIMATION, TENSE, VALUE. Their possible values are detailed below:

Please note that we ignore here time notions (hour, minute, second), millennia, as well as weekdays (Monday, Tuesday, etc.) since we didn’t build the dataset with these notions. In a more complete version, these expressions should be taken into account as well.

Also, one situation needs clarifying: should “before”/“after” be put in ANCHOR or in TENSE?

  • Before July => ANCHOR = "BEFORE" because in this case "before" is a preposition attached to the value "July".
  • 3 days before some event => TENSE = "PREV" because in this case "before" is attached to the event, not to the time.

The annotation for a temporal expression is the concatenation of these fields in that order. (If a field is empty, we simply skip it.) We add a hyphen “ - ” to separate TYPE and MODE from the other fields.
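The concatenation rule can be sketched as follows. This is only an illustration: the function name `format_annotation` and the dictionary-based input are assumptions, and the separator is the “ - ” seen in the article's own examples such as ABS DIR - FROM START Y2007.

```python
# Field order taken from the attribute tuple of the annotation guide.
HEAD_FIELDS = ("TYPE", "MODE")
TAIL_FIELDS = ("ORDER", "ANCHOR", "PARTIAL", "APPROXIMATION", "TENSE", "VALUE")


def format_annotation(fields: dict) -> str:
    """Concatenate the non-empty fields in guide order.

    TYPE and MODE come first, then a hyphen, then the remaining fields.
    Missing or empty fields are skipped.
    """
    head = " ".join(fields[name] for name in HEAD_FIELDS if fields.get(name))
    tail = " ".join(fields[name] for name in TAIL_FIELDS if fields.get(name))
    # Only emit the hyphen when both sides are present.
    return " - ".join(part for part in (head, tail) if part)


if __name__ == "__main__":
    print(format_annotation({
        "TYPE": "ABS", "MODE": "DIR",
        "ANCHOR": "FROM", "PARTIAL": "START", "VALUE": "Y2007",
    }))
```

Keeping the field order in two tuples makes the output independent of the order in which the dictionary was filled.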

The following image (captured from Numbers, the macOS equivalent of Excel) gives an example of converting some temporal expressions into the normalized format. The final output is written in the last column.

Some annotated examples for single temporal expressions.


The annotation for a whole paragraph is the concatenation of the annotations of all temporal expressions in that paragraph, in the same order, separated by “; ”.

Some examples of annotations for paragraphs (items)

3.4 Output

We annotated about 935 examples in our dataset built in section 2 of this series (this file). The resulting annotations are stored in the file nhutljn-temporal-expression-annotated.tsv, where text and labels are separated by \t.
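A file in this layout could be read back roughly as follows. This is a sketch under assumptions the article does not state: two columns per line (text first, then labels), no header row, UTF-8 encoding, and no quoting. The function name `read_annotated_rows` is hypothetical.

```python
import csv
from typing import Iterable, Iterator, List, Tuple


def read_annotated_rows(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (text, annotations) pairs from tab-separated lines.

    The label column is split on ';' so that each temporal expression
    of the paragraph gets its own annotation string.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        text, label = row[0], row[1]
        annotations = [part.strip() for part in label.split(";") if part.strip()]
        yield text, annotations


# Possible usage, assuming the file sits in the working directory:
# with open("nhutljn-temporal-expression-annotated.tsv", encoding="utf-8", newline="") as f:
#     for text, annotations in read_annotated_rows(f):
#         print(annotations, "<-", text)
```

Taking an iterable of lines rather than a path keeps the function easy to try on a few in-memory strings before pointing it at the real file.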

3.5 Is It a Good Format?

One question arises about the way we encode date expressions: is the format good enough to reflect the variety of time notions in human language? It is of course not as expressive as a sophisticated scheme like TimeML. However, we can consider it good enough if it suits our use cases and projects. For example, consider a customer who requests some documents (certificates, history) for a given period. The problem then becomes retrieving the start and end dates of the requested period. We can try to define rules that translate a normalized annotation like ABS DIR - FROM START Y2007 or REL DIR - FROM PREV M1 into two absolute dates for the start and end of the period, like FROM 2007/01/01 to 2022/01/31. We will call them post-processing rules. These rules can be formulated as an algorithm, though we can also use more user-friendly formats like tables or trees; either way, they are converted into code for implementation in the end. If it is feasible to define such rules without ambiguity, or, more formally, if there is a mapping from our format to the set of absolute dates, we can be confident that our modelling is good enough.

The Post-processing Rules


Our format has 8 attributes: TYPE, MODE, ORDER, ANCHOR, PARTIAL, APPROXIMATION, TENSE, VALUE. Some of them are less essential for translating time expressions into absolute dates:

  • APPROXIMATION: it makes the time more or less exact but does not change the main part of the time expression. It will say "around February 2001" but won't change "February 2001" into "March 2002".
  • ORDER and PARTIAL: they narrow the whole time notion down to a part of it. This part can be processed independently once the absolute dates have been deduced.
  • ANCHOR: it adds a direction (before, after) to the moment or changes it into an interval (from some moment = since some moment, etc.).
  • MODE: depending on whether we are in direct or indirect mode, we adapt the referenced date: today or the day of speech?

So finally, the most important parts are (TYPE, TENSE, VALUE). We can write our post-processing table as follows:

We did not fill in all the rules in the table, and it may not cover less frequent possibilities (something like “3 quarters 1 month before”). However, it is sufficient to show how the normalized format can be translated for such a use case.
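To give an idea of what such rules might look like once converted into code, here is a small sketch covering two cases only. It does not reproduce the article's table: the function names are hypothetical, the tense term NEXT is an assumed counterpart of PREV, and only year values (Y…) and day distances (D…) are handled.

```python
import re
from datetime import date, timedelta
from typing import Tuple


def absolute_year_range(value: str) -> Tuple[date, date]:
    """Map an absolute year value such as 'Y2007' to its first and last day."""
    match = re.fullmatch(r"Y(\d{4})", value)
    if match is None:
        raise ValueError(f"expected a value like 'Y2007', got {value!r}")
    year = int(match.group(1))
    return date(year, 1, 1), date(year, 12, 31)


def shift_by_days(tense: str, value: str, reference: date) -> date:
    """Map a relative day distance such as PREV D3 to an absolute date.

    'reference' is the referenced moment (for instance the day of speaking).
    NEXT is an assumed term for the future direction.
    """
    match = re.fullmatch(r"D(\d+)", value)
    if match is None:
        raise ValueError(f"expected a value like 'D3', got {value!r}")
    distance = timedelta(days=int(match.group(1)))
    if tense == "PREV":
        return reference - distance
    if tense == "NEXT":
        return reference + distance
    raise ValueError(f"unsupported tense: {tense!r}")


def since(start: date, reference: date) -> Tuple[date, date]:
    """Turn a FROM anchor into an interval ending at the referenced moment."""
    return start, reference
```

For instance, an annotation like ABS DIR - FROM START Y2007 could be handled by taking the first date returned by `absolute_year_range("Y2007")` and passing it to `since` together with the referenced date. Other units, PARTIAL and APPROXIMATION would each need rules of their own.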

3.6 Recap

That is it. In this article, we defined our simple normalized format for temporal expressions, inspired by TimeML. We annotated the dataset built in the previous steps to produce nhutljn-temporal-expression-annotated.tsv, the annotated dataset that will be used in the next stage: training an ML model.

References

Acknowledgement

Thanks to our colleagues Al Houceine KILANI and Ismail EL HATIMI for the article review.

About

Nhut DOAN NGUYEN has been a data scientist at La Javaness since March 2021.
