Comparing supervised learning algorithms for Spatial Nominal Entity recognition

Discourse may contain both named and nominal entities. Most common nouns or nominal mentions in natural language do not have a single, simple meaning but rather a number of related meanings. This form of ambiguity led to the development of a task in natural language processing known as Word Sense Disambiguation. Recognition and categorisation of named and nominal entities is an essential step for Word Sense Disambiguation methods. Up to now, named entity recognition and categorisation systems have mainly focused on the annotation, categorisation and identification of named entities. This paper focuses on the annotation and the identification of spatial nominal entities. We explore the combination of the transfer learning principle and supervised learning algorithms in order to build a system to detect spatial nominal entities. For this purpose, different supervised learning algorithms are evaluated with three different context sizes on two manually annotated datasets built from Wikipedia articles and hiking description texts. The studied algorithms have been selected for one or more of their specific properties potentially useful in solving our problem. The results of the first phase of experiments reveal that the selected algorithms have similar performances in terms of ability to detect spatial nominal entities. The study also confirms the importance of the size of the window used to describe the context when the word-embedding principle is used to represent the semantics of each word.


Introduction
A critical aspect of polysemy (i.e., the ability to have multiple meanings) is that the different meanings of a word can be conceptually closely related and yet belong to very distant semantic categories. In most cases, the surrounding context makes it possible to retain the appropriate meaning. For example, consider the word 'church'. It can refer to an organisation, as in sentence (1), where it serves to personify an assertion, or to a building, as in sentence (2), where it is used as a spatial reference point.
In this paper, we first introduce the concept of spatial nominal entity (SNoE) and we propose an approach for their recognition in unstructured texts. For this purpose, we trained several supervised learning algorithms to study their ability to detect whether a nominal entity identified in a sentence is used as a reference to a spatial object or not.
The remainder of this paper is structured as follows. In Section 2 we present an overview of tasks and methods from the NLP domain related to our work, such as named and nominal entity recognition and categorisation, word embedding and transfer learning. Section 3 is dedicated to the definition of the concept of SNoE and provides methodological details of our approach for SNoE recognition. In Section 4 we describe the dataset we have generated and manually tagged. This dataset is used to train and evaluate the studied algorithms, mainly to demonstrate the feasibility of our approach; we also present and discuss the experimental results of these algorithms. Finally, Section 5 concludes this paper.

Related work
Named entity recognition (NER) is considered the key task in the field of Word Sense Disambiguation (WSD). NER implementations are based on a wide variety of methods; their role is to recognise named entities in a sentence and classify them into various classes (e.g. Name of Location, Person, Organisation, Quantity, Time, Percentage, etc.). Despite the many implementations available, there is a great need to develop methods that refine the capabilities of NER, since existing tools have a limited scope. In particular, the vast majority, if not all, of these tools are unable to recognise nominal entities (i.e. entities without proper nouns). Regardless of these limitations, the literature shows that NER has been addressed by both machine learning and knowledge-based approaches.
Machine learning methods rely on training datasets that are labelled, usually by a human annotator who does not need to be an expert in linguistics, as in [6,7,8,9,10]. Knowledge-based approaches use hand-crafted syntactic and semantic rules developed by linguistic experts. They involve morpho-syntactic structures and specific resources (e.g., lexicons, gazetteers) [11,12,13,14]. In [15,16,17] both types of approaches have been combined to build hybrid methods where the input features of the machine learning algorithms are provided by knowledge-based systems. Once entities are recognised, the considered categories may vary. For example, some NER systems have an 'acronym' category while others do not but can categorise dates, etc. However, the 'location' category is always present.
For instance, the well-known Stanford NER [18] is based on feature extraction and Conditional Random Fields (CRF); the system categorises named entities into three classes ('person', 'organisation' and 'location').
NER systems dedicated to locations are known as 'geoparsers'. In general, geoparsers proceed in two sub-processes: geotagging and geocoding. Geotagging (i.e. recognition) consists in marking in texts all segments containing a named entity referring to a place (i.e. a place name). Geocoding (i.e. resolution) assigns a single pair of geographical coordinates to the named entity previously identified in the geotagging step. Karimzadeh et al. [19] have proposed a geoparser called Geotex. Its geotagging component is web-based and lets users choose from a list of 6 publicly available NER systems (Stanford NER, ANNIE, Illinois NER, MITIE, Apache OpenNLP, LingPipe). The Edinburgh Parser [20] is a major geoparser whose geotagging task is performed by a multi-rule-based geotagger. Moncla et al. [21] have proposed a system called Perdido, consisting of a rule-based method implemented with a cascade of transducers for the generic recognition of ESNE structures. The resolution is done within specific corpora composed exclusively of textual descriptions of pedestrian movements.
As stated in the introduction, words used to construct nominal entities are polysemous and the context is the main available information for identifying the intended meaning. A solution that currently seems very promising is Word Embeddings (WEs). WEs are continuous space language models built using Neural Networks (NN). The main idea behind WEs is to project a set of words of a vocabulary of size $N_v$ into a continuous vector space of a lower dimension $N_d$ (knowing that $N_d \ll N_v$). As a result, each word of the vocabulary is represented as a real-valued vector in a low-dimensional space and words with similar representations appear in similar contexts. WEs can be learned in an unsupervised way to capture distributional similarities between words of the vocabulary, and be fine-tuned in a supervised context. Several works such as [22,23,24,25,26,27] have used NN to learn distributed representations for words. These approaches differ in the type of the model and the data used to train the model.
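The idea that words appearing in similar contexts receive nearby vectors can be illustrated with a minimal sketch. The three-dimensional vectors below are invented for illustration only (a real table would hold $N_v$ words of dimension $N_d$), and the helper names `cosine` and `most_similar` are hypothetical.

```python
import math

# Toy embedding table: every vector here is invented for illustration only.
# A real table would map N_v vocabulary words to N_d-dimensional vectors.
EMBEDDINGS = {
    "lac":    [0.9, 0.1, 0.0],
    "étang":  [0.8, 0.2, 0.1],
    "église": [0.1, 0.9, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two equal-length real-valued vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, table=EMBEDDINGS):
    """Return the other word of the table whose vector is closest to `word`."""
    others = (w for w in table if w != word)
    return max(others, key=lambda w: cosine(table[word], table[w]))
```

With these toy values, `most_similar("lac")` picks "étang" (pond) rather than "église" (church), which is the kind of distributional proximity a trained model is expected to capture.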
The principle of producing WEs through neural networks was first introduced by Bengio [28]. More recently, Bojanowski et al. [29] have proposed FastText, a WE method that takes into consideration the internal structure of words by including character sequences in the learning process of word representation, which has proved to have a great impact when working with morphologically rich languages such as French or Finnish. WEs have opened a new direction for many NLP tasks based on NN, such as question answering [30,31], sentiment analysis [32,33,34], relation extraction and classification [35,36], NER [8] and mention detection [37].
In our context of implementing a WSD process, geoparsing and geotagging named entities and their spatial-based context is fundamental but not sufficient. It is therefore essential to apply the same kind of processing to nominal entities. To the best of our knowledge, none of the current state-of-the-art works attempts to identify SNoE, at least for the French language.
In the absence of a French annotated corpus of nominal entities, our methodology is based on the principle of transfer learning (TL). According to the proceedings of the NIPS-95 workshop entitled 'Learning to Learn' [38], TL was primarily motivated by "the need for lifelong machine-learning methods that retain and reuse previously learned knowledge". Moreover, the Information Processing Technology Office (IPTO) of the Defense Advanced Research Projects Agency (DARPA) published Broad Agency Announcement No. 05-29 in 2005, in which TL is defined as "the ability of a system to recognise and apply knowledge and skills learned in previous tasks to novel tasks". Following the principles of TL, we propose to use the FastText pre-trained WE model as input of different supervised learning algorithms. Then, we compare the results obtained on two manually labelled datasets.

Concept and definition
The SNoE is defined as a nominal phrase that refers to a physical object which is usually involved in a spatial-based context. A SNoE may be a common noun composed of a single token (village, hut, church) or of several tokens (boundary marker, tourist office, transformer substation). The concept of SNoE derives from the concept of nominal entity, which was defined in the Entity Discovery and Linking task as "A nominal mention consists of a common noun which refers to an entity in place of a name" and is classed into 5 different types ('person', 'location', 'organisation', 'facilities', 'geopolitical entity'). Hence, a SNoE is composed of at least one common noun (i.e. the pivot) involved in a spatial-based context. In our definition, the concept of SNoE covers:
-Physical static entities that have fixed geographical coordinates, such as 3a.
-Spatial objects with the property of being able to be in motion, as shown in example 3b.
-A group of physical objects forming a unique spatial reference point, such as 3c.
Consequently, this concept does not cover:
-Nominal phrases involving a common noun which may refer to a physical object but is associated with a proper name, such as 4a.
-Nominal phrases only used for their ability to evoke the object (abstract or physical) as a concept without a spatial reference, such as 4b.
-A spatial reference to a virtual object without physical existence, such as an ephemeral entity which exists only at a specific moment of a narrative, as illustrated in example 4c.
(3) a. Continuer la descente en sous-bois pour rejoindre le lac. 'Continue downhill in the undergrowth to reach the lake.'
b. Prendre le sentier qui passe sous le téléphérique. 'Take the path that passes under the cable car.'
(4) a. 'From the chalets de l'Échet, make a U-turn and walk back to the previous crossroads.'
b. Le chalet est un bâtiment rural des régions de montagne, dont le bois est le constituant essentiel. 'The chalet is a rural building of mountain regions, essentially built of wood.'
c. Pour la descente, revenir sur ses pas pour une bonne centaine de mètres de dénivelé pour rejoindre le carrefour de montée. 'For the descent, retrace your steps for a good hundred meters of drop to reach the crossroads of the ascent.'
Furthermore, as mentioned previously, words may be polysemous: they may have a different meaning depending on the syntactic or semantic context. In the scope of our problem, we distinguish two main categories covering three different senses for a given word:
1. The word is used to identify a physical object used as a landmark, as in sentence (2).
2. The word is used to identify a non-physical object, i.e. an abstract entity with no physical borders, which could be a reference to an organisation as in sentence (1); or the word is used to identify a physical object which is not used as a landmark, as in the following example 5a.

Methodology
In order to detect SNoE, we consider the development of a system based on a supervised machine learning approach. As shown in Figure 1, the process chain is divided into two main phases: the pre-processing phase and the learning phase.
Pre-processing
The pre-processing phase is an input preparation step for the learning phase. Three tasks are performed: 1) establishing a lexicon containing a varied list of terms that can constitute the pivot, 2) context setting, and 3) semantic representation of words.
Although it is recognised that the left context is generally more important in French than the right context, there are cases where the right-hand context helps to improve discrimination. We therefore extract n-grams from sentences, because both the right and the left context are useful to determine whether or not the pivot in the considered n-gram is used as a spatial entity. In order to obtain the n-grams from a given corpus, we have constructed a lexicon of terms that can refer to spatial entities. To build this lexicon, we manually extract a set of words used as SNoE from a set of French hiking description texts, such as lac, pont, église, avenue, and office du tourisme (respectively lake, bridge, church, avenue, tourist office). This lexicon is then used to extract n-grams from a sentence (see example 6), where each n-gram represents the context of the pivot. The extracted sentences are then manually annotated in order to build the different datasets used for training, testing and validation. Our hypothesis is that n-grams (with the size n yet to be defined) combined with the principle of TL are sufficient for the different algorithms under study to reach a reasonably good rate of correct decisions.
Once the n-grams are extracted, the next step of the pre-processing phase is to vectorise the inputs. In accordance with the principles of TL, each word $x_i$ of the n-gram $N$ is transformed into a vector $e_i$ of dimensionality $d_e$ by looking it up in the WE table of a pre-trained FastText model. As a result, the original n-gram can now be viewed as a matrix $X$ of size $n \times d_e$ whose $i$-th row is $e_i$.
Notice that the pivot may sometimes be at the beginning or at the end of a sentence, so that not enough words can be found before or after it. For these cases, the best alternative is to pad using white noise. The concept of white noise in WE could be related to the concept of neutral vector [39,40]. Unfortunately, the concept of neutral vector does not exist in WE. We work around this issue by padding with words randomly extracted from a French corpus.
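The context-setting and vectorisation steps can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: it assumes that n is odd and that the pivot sits in the centre of the window, and the names `extract_context`, `to_matrix` and `filler_words` are hypothetical.

```python
import random

def extract_context(tokens, pivot_index, n, filler_words, rng=random):
    """Return the n-gram (n odd) centred on tokens[pivot_index].

    Positions that fall outside the sentence are filled with words drawn at
    random from `filler_words`, mimicking the random-word padding described
    in the text.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd so that the pivot sits in the centre")
    half = n // 2
    window = []
    for i in range(pivot_index - half, pivot_index + half + 1):
        if 0 <= i < len(tokens):
            window.append(tokens[i])
        else:
            window.append(rng.choice(filler_words))
    return window

def to_matrix(ngram, embedding_table, dim):
    """Stack one vector per token into an n x dim matrix (list of rows).

    Unknown tokens get a zero vector here; this is a simplification of this
    sketch (FastText itself can build vectors from character n-grams).
    """
    return [embedding_table.get(tok, [0.0] * dim) for tok in ngram]
```

For the sentence fragment "rejoindre le lac" with the pivot "lac" and n = 5, the two missing right-hand positions are filled with random words, and `to_matrix` then yields the $n \times d_e$ input of the learning phase.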
Supervised learning algorithms
As shown in Figure 1, during the learning phase the matrix X is fed into the input layer of a supervised machine learning algorithm. As the experiments were designed, the algorithm must make a binary choice in order to decide whether the pivot word of the input matrix represents a spatial phrase or a non-spatial phrase. We have used two types of machine learning models: classical machine learning (ML) and deep neural networks (DNN). Five different algorithms were selected (two ML and three DNN) based on some of their characteristics that we considered potentially relevant to our problem and that are described below. Each ML algorithm is fine-tuned on a training dataset, then the best model is chosen according to the empirical results on a testing dataset (see Section 4.1).
For classical ML algorithms, we chose to evaluate the performances of the Support Vector Machine (SVM) and Random Forest (RF) algorithms on the task of SNoE recognition. These two algorithms have been commonly used for NLP and information retrieval tasks such as text classification [41,42,43]. The Support Vector Machine (SVM) is a vector-space-based machine learning method proposed by Cortes and Vapnik [44], where the goal is to find a decision boundary between two classes that is maximally far from any point in the training data (possibly discounting some points as outliers or noise).
SVM models are known to scale well with high-dimensional data, with a good capacity for generalisation and a limited risk of over-fitting. Additionally, SVM is efficient when the number of input dimensions is greater than the number of samples. Thus, SVM appears to be a good choice for our study and experiments using WEs.
Random Forest (RF) is a supervised machine learning algorithm introduced by Breiman et al. [45] that has been widely used for classification and regression tasks. The principle behind RF is to create a forest of n decision trees. Through a sampling process based on the bootstrapping principle [46], the algorithm creates n subsets of the learning dataset and each tree is trained on one of these subsets. In order to classify an example, a 'tree voting' operation is conducted, in which each tree predicts a class. Each vote is recorded and the forest chooses the class with the highest number of votes. In general, the greater the number of trees in the forest, the more robust the prediction and the higher the accuracy. The choice of the RF algorithm is motivated by the fact that the results of a trained RF model can be more interpretable than those of more complex models such as neural networks.
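The bootstrap sampling and tree-voting steps of RF can be sketched in a few lines. This is a schematic illustration only: the trees are represented as arbitrary callables rather than trained decision trees, and the function names are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(dataset, rng):
    """Draw len(dataset) examples with replacement (one bootstrap subset)."""
    return [rng.choice(dataset) for _ in dataset]

def majority_vote(votes):
    """Return the class predicted by the largest number of trees."""
    return Counter(votes).most_common(1)[0][0]

def forest_predict(trees, x):
    """`trees` is a list of callables, each mapping an input to a class label."""
    return majority_vote([tree(x) for tree in trees])
```

In practice, the scikit-learn implementation used in the experiments (`RandomForestClassifier`) performs these steps internally and exposes the number of trees through its `n_estimators` parameter.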
Deep learning models have recently led to significant and rapid progress in several NLP tasks such as NER, relation extraction and question answering. We have experimented with three DNN models: a Multilayer Perceptron with an Auto-Encoder, a Multilayer Perceptron with Principal Component Analysis, and a Gated Recurrent Unit.
Multilayer perceptron with an auto-encoder (MLP+AE) is a pipeline composed of an auto-encoder (encoding layer, decoding layer) and a deep multilayer perceptron (MLP). The main idea behind using an auto-encoder (AE) is dimensionality reduction. Our hypothesis is that, since FastText provides vectors (300 dimensions) pre-trained on Wikipedia and Common Crawl, it gives a generic representation of words. As a result, similar words (such as plurals) have independent embeddings, so the vector representation of a word contains a lot of redundant information. What if we could take out the redundancy and express the same information with a fraction of the numbers (compression)? An AE can be used for that purpose. The AE receives the input matrix $X$ representing a sentence and learns to encode it into a lower-dimensional representation $X'$. The AE starts out by compressing the data into a lower-dimensional representation $z$ (encoding step), and then converts it back into a reconstruction of the original input (decoding step). Once the AE has converged, the representation $z$ is a compressed version of the data that ideally retains most of the information. The encoded representation matrix $X'$ is fed into the deep MLP, which performs the prediction task.
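The encoding and decoding steps can be written compactly as follows. This is a generic formulation given as an assumption, not the exact architecture used here: it supposes a single encoding layer and a single decoding layer with activation functions $f$ and $g$, a code size $k$, and a squared reconstruction error, none of which are specified in the text.

```latex
\begin{aligned}
z &= f(W_{\text{enc}}\, x + b_{\text{enc}}), & z \in \mathbb{R}^{k},\ k < d_e \\
\hat{x} &= g(W_{\text{dec}}\, z + b_{\text{dec}}), & \hat{x} \in \mathbb{R}^{d_e} \\
\mathcal{L}(x, \hat{x}) &= \lVert x - \hat{x} \rVert_2^2
\end{aligned}
```

Training minimises $\mathcal{L}$ over the input vectors; only the encoder output $z$ is then passed on to the MLP.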
Multilayer Perceptron with a Principal Component Analysis (MLP+PCA) is a pipeline composed of a Principal Component Analysis (PCA) and a deep MLP. Like the AE, PCA is a method for data compression. The basic idea of PCA is to reduce the dimensionality of the inputs by transforming the elements of the input vector e into a new set of variables known as the principal components (PCs). The PCs are linear combinations of the original variables and are mutually uncorrelated, i.e., the correlation between any pair of PCs is 0. The projection of the input onto the leading eigenvectors of the covariance matrix gives the feature vector which is fed into the deep MLP. Both the MLP+PCA and MLP+AE models use a dimensionality reduction of the inputs; the hypothesis behind dimensionality reduction is that the model should learn from discriminating information. Indeed, the vector representation of the inputs using WEs covers the whole semantic spectrum of a given word. However, our context can be seen as a specialised language with a specific terminology, and we therefore assume that we only need a subset of the possible semantics, and hence a subset of the components of the vector representation.
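For reference, the standard PCA projection can be written as below. The notation ($m$ training vectors $e^{(j)}$, $k$ retained components, eigenvectors $u_i$) is introduced here for illustration and is an assumption; the number of components kept in the experiments is not stated in the text.

```latex
\begin{aligned}
\bar{e} &= \frac{1}{m}\sum_{j=1}^{m} e^{(j)}, \qquad
C = \frac{1}{m-1}\sum_{j=1}^{m} \bigl(e^{(j)} - \bar{e}\bigr)\bigl(e^{(j)} - \bar{e}\bigr)^{\top} \\
C\, u_i &= \lambda_i\, u_i, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_{d_e} \ge 0 \\
e' &= U_k^{\top}\,(e - \bar{e}), \qquad U_k = [u_1, \dots, u_k] \in \mathbb{R}^{d_e \times k}
\end{aligned}
```

The reduced vector $e' \in \mathbb{R}^{k}$ keeps the $k$ directions of largest variance and replaces $e$ as input of the MLP.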
Gated Recurrent Unit (GRU) was introduced by Cho et al. [47] as an improvement of the standard recurrent neural network (RNN) that addresses the vanishing gradient problem of RNNs by introducing the concepts of update gate and reset gate. As shown in Figure 2, the update gate $z_t$ helps the model to determine how much of the past information (from previous time steps) needs to be passed along to the future, while the reset gate $r_t$ is used to decide how much of the past information to forget. In other words, $z_t$ and $r_t$ are two vectors that decide which information should be passed to the output; the GRU can therefore be trained to keep information over the long term and to remove information that is irrelevant to the prediction. Like LSTM (Long Short-Term Memory) [48], the GRU is designed as a solution to the short-term memory problem. While LSTM has three gates (input, output, and forget gates), GRU uses only two (reset and update gates). The GRU network is therefore less complex than LSTM and is trained faster. In addition, the GRU unit controls the flow of information like the LSTM unit, but without having to use a memory unit: it exposes the full hidden content without any control. According to Yin et al. [49], GRU has shown better performance on certain smaller datasets for text classification tasks in NLP. For these reasons, we have decided to evaluate the performance of the GRU model in our case study. A further advantage of GRU is that it takes into account the sequentiality of the input, which matters greatly in text classification, where word order in a sentence is important information.
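The role of the two gates can be made explicit with the usual GRU update equations, written here in the convention of Cho et al. where $z_t$ weights the previous hidden state. The weight matrices $W$, $U$ and the bias terms $b$ are generic notation assumed for this sketch, and some libraries swap the roles of $z_t$ and $1 - z_t$ in the last line.

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh\bigl(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\bigr) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{aligned}
```

Here $x_t$ is the embedding of the $t$-th word of the n-gram, $h_t$ the hidden state, $\sigma$ the logistic sigmoid and $\odot$ the element-wise product.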

Experiments and evaluations
This section describes the experimental study conducted to demonstrate the feasibility of our approach, i.e., to examine the ability of each algorithm to detect SNoEs. We have conducted a series of experiments based on two manually annotated gold standard datasets (cf. Section 4.1). The validation dataset is dedicated to evaluating the categorisation performances of each system on the SNoE recognition task. The pivots used to compose the samples of the validation dataset are the same ones that were used for learning (training dataset), but in different contexts.
The second dataset (emergence dataset) is used to study the emergence performance of the different systems. Emergence assessment allows us to measure the ability to properly classify samples whose pivots do not occur in any sample of the training dataset. As our study is a classification task, we used the following evaluation metrics: Precision (P), Recall (R), Accuracy (ACC) and F1 score (F1), which is a combination of both recall and precision.
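These four metrics follow directly from the counts of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN), with F1 taken as the harmonic mean of precision and recall. A minimal sketch, with a hypothetical function name and a zero-denominator convention chosen for this example:

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy and F1 from the four confusion counts.

    A ratio with a zero denominator is reported as 0.0 (a convention chosen
    for this sketch).
    """
    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    accuracy = ratio(tp + tn, tp + fp + fn + tn)
    f1 = ratio(2 * precision * recall, precision + recall)
    return {"P": precision, "R": recall, "ACC": accuracy, "F1": f1}
```

Here a positive is a sample labelled SNoE, so recall measures the share of spatial uses of a pivot that a system actually finds.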

Datasets
As previously mentioned, there is no standard French dataset available to train and evaluate nominal entities recognition algorithms. Therefore, we have decided to build our own dataset. This implies two steps: 1) building the lexicon, 2) extracting and annotating a set of sentences containing at least a word from the lexicon.
As explained in Section 3.2, the lexicon is built by manually extracting all the words used as SNoEs from a set of 14 French hiking description texts. As a result, 141 words have been extracted and constitute the elements of the lexicon called Aléa. Starting from this lexicon, we have extracted a corpus of sentences containing at least one lexical entry from two kinds of sources: Wikipedia articles, retrieved using the OpenSearch API, and the hiking websites Visorando and Camptocamp. In total, 78,785 sentences were extracted: 25,821 from Visorando and Camptocamp and 52,964 from Wikipedia. A total of 956 sentences were randomly selected from the corpus, then manually labelled and distributed as shown in Table 1. As illustrated in the introduction and explained in Section 3.1, a sample is annotated as positive (SNoE) only if the pivot is used in its spatial meaning in the context of the sample; otherwise it is labelled as negative (non-SNoE). We dedicated 568 samples to the training dataset, which is used to adjust the model parameters (weights and biases in the case of neural networks), and 194 samples to the test dataset, which is used to fine-tune the hyper-parameters of the trained models. Finally, 194 samples form the validation dataset, which is used once a model is fully trained (using the training and test datasets) to evaluate competing models. This dataset is called 'C1'.
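The partition of the 956 labelled sentences can be sketched as a shuffled three-way split. This is only an illustration under the assumption of a plain random split with a fixed seed; the text does not state how samples were assigned to each part, and the function name is hypothetical.

```python
import random

def split_dataset(samples, sizes=(568, 194, 194), seed=0):
    """Shuffle `samples` and cut them into train, test and validation parts."""
    if sum(sizes) != len(samples):
        raise ValueError("sizes must add up to the number of samples")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train, n_test, _ = sizes
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, test, validation
```

Fixing the seed keeps the three parts identical across runs, so that every system is tuned and evaluated on the same samples.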
In order to study the ability of the different algorithms to detect new SNoE, we have extracted a set of sentences using a new lexicon of 15 new pivots extracted from the Geonto ontology [50]. These pivots do not correspond to any lexical entry of Aléa; therefore, no sentence containing any of these new pivots is present in the training, testing and validation datasets. A set of 93 sentences containing the new pivots was then manually labelled as SNoE or non-SNoE.
This dataset allows us to evaluate the ability of the systems to detect new pivots that have not been seen during training. We call this task the emergence of new pivots and refer to this dataset as 'C2'.

Resources
According to the TL principle, all the experiments below use a pre-trained set of WEs. This set has a dimension of $d_e = 300$ and was produced by a FastText model [51] trained on Common Crawl and Wikipedia using the CBOW method. It is publicly available. Furthermore, we have used the implementations of the deep learning algorithms (GRU, MLP+AE, MLP+PCA) provided by the Python library Keras. For the classical machine learning algorithms (SVM, RF), we have used the implementations provided by the Python library Scikit-learn. All the trained models supporting this publication are available in a GitHub repository.

Evaluation results
We have conducted a series of experiments on two datasets in order to evaluate the absolute categorisation performances of each system (using the dataset 'C1') and to study the emergence of new pivots (using the dataset 'C2'). We have evaluated the performance of each machine learning model (GRU, MLP+PCA, MLP+AE, SVM, RF) with three different context sizes (1 gram, 5 grams, 7 grams), which results in 15 systems.

Categorisation performances
We have conducted an experiment based on the validation dataset (from 'C1') in order to compare the performances of each algorithm. Table 2 shows the performances of the 15 models. A general observation is that almost all the tested models obtain better results when the value of n increases. This is consistent with the hypothesis that the context holds important information about the spatial semantics of a SNoE. An exception was found for the models based on the MLP+PCA architecture (systems 13, 14, 15), for which there was a slight decrease in performance from 1 gram to 7 grams.
More precisely, we notice that both MLP+AE 7 grams and GRU 7 grams slightly outperform the other algorithms. MLP+AE 7 grams obtained an accuracy of 79.38% and an F1-score of 83.19%, while GRU 7 grams obtained close results of 78.35% and 80.9% for accuracy and F1-score respectively. As the differences are rather small, this observation needs to be confirmed by a larger-scale experiment.
It can already be said that the chosen approach is viable. In particular, this makes it possible to consider the use of neural network algorithms despite the fact that only a small corpus is available.
Emergence performance
In order to study the emergence performance of the algorithms on the SNoE recognition task, we evaluate them using the 'C2' dataset. Table 3 shows the evaluation results of the emergence capacity of each system. As a reminder, the emergence capacity measures the ability of a system to recognise expressions (n-grams) whose pivot is used with a spatial meaning and is not part of the Aléa lexicon that was used to build both the learning and validation datasets.
The observation made on the validation results also holds for the emergence results: increasing the context size improves the global classification performance of each algorithm. The GRU 7 grams system obtained the best results, with 82.45% and 82.46% for accuracy and F1-score respectively, and outperforms all the other algorithms regarding the accuracy score. Nevertheless, none of the tested algorithms stands out with regard to the F1 score. It can be assumed that one way to improve the performance of most of the studied algorithms is to increase the size of the context. Another possibility that appears very promising would be to use contextualised WE models such as those produced by the BERT model proposed by Devlin et al. [52].

Conclusion
This paper presents a methodology comparing five supervised machine learning algorithms for the automatic identification of SNoE from raw texts. The approach uses a pre-trained WE model as input, according to the TL principle. The WEs used as input data for these algorithms come from the FastText model pre-trained on a huge corpus of generic texts in French. The FastText model was chosen because it produces better results than other equivalent WE models on so-called morphologically rich languages such as French. The experimental results demonstrate 1) the feasibility of our approach for the SNoE recognition task, and 2) the importance of the context for this kind of task. Thanks to the principle of transfer learning, we have been able to show that it is possible to test methodological and algorithmic choices by relying on small corpora. Nevertheless, the size of our corpus seems insufficient to obtain better performances. As a result, an extension of our dataset is already being developed. Given that new WE models seem to exceed the performance of the models we have used, we also plan to reproduce the same type of study, this time applying the TL principle with a BERT model pre-trained on a French corpus, like the one proposed by Le et al. [53].
According to the obtained results, none of the presented algorithms significantly outperforms the others. However, given the properties of each model presented in Section 3.2, the GRU system seems to have a greater potential when working with the whole sentence. For this reason, we intend to investigate this direction further. As future work, we aim to study the ability of the GRU to improve the performances on the SNoE recognition task, in particular by providing the whole sentence as input of the system (not only the n-grams) and thus fully using the ability of the GRU model to take into account the sequential aspect of the input data. Considering our WSD problem as a whole, it is necessary to be able to distinguish between sentences where the pivot is used to describe a static spatial situation and those where it is used to describe a motion (an itinerary). We will then also work on categorising the context of SNoE in order to detect spatial relationships and the different categories of verbs (e.g., displacement, description, perception) involved in the context.