Semantically-Enriched Search Engine for Geoportals: A Case Study with ArcGIS Online

Many geoportals such as ArcGIS Online are established with the goal of improving geospatial data reusability and achieving intelligent knowledge discovery. However, according to previous research, most existing geoportals adopt Lucene-based techniques for their core search functionality, which has a limited ability to capture the user's search intentions. To better understand a user's search intention, query expansion can be used to enrich the user's query by adding semantically similar terms. In the context of geoportals and geographic information retrieval, we advocate the idea of semantically enriching a user's query from both geospatial and thematic perspectives. In the geospatial aspect, we propose to enrich a query by using both place partonomy and distance decay. In the thematic aspect, concept expansion and embedding-based document similarity are used to infer the implicit information hidden in a user's query. This semantic query expansion framework is implemented as a semantically-enriched search engine using ArcGIS Online as a case study. A benchmark dataset is constructed to evaluate the proposed framework. Our evaluation results show that the proposed semantic query expansion framework is very effective in capturing a user's search intention and significantly outperforms a well-established baseline (Lucene's practical scoring function), with an increase of more than 3.0 in DCG@K (K = 3, 5, 10).


Introduction
The increasing growth of geospatial data poses a great challenge to data discovery, access, and maintenance (Jiang et al. 2018). In order to increase data reusability and facilitate geospatial knowledge discovery, many geoportals have been established to provide integrated access to geospatial resources (Hu et al. 2015a). Examples of geoportals include the DataOne Data Catalog, the U.S. Geological Survey Science Data Catalog, NASA Earthdata Search, ArcGIS Online, and so on.
The most important component of a geoportal is its search functionality, which is usually supported by geographic information retrieval (GIR) techniques. Generally speaking, information retrieval (IR) aims at finding relevant entries based on a user's query. The entries can be documents, websites, services, maps, and so on, depending on the application scenarios. As a subfield of IR, geographic information retrieval (Jones & Purves 2008) adds space (and time) as additional dimensions to the traditional information retrieval problems (Janowicz et al. 2011). In addition to traditional thematic similarity, spatial (and temporal) similarity is considered when the relevance score between a user's query q and an entry d is calculated.
Despite the success of GIR in academia, in practice the core search functionality of most existing geoportals is still based on Apache Lucene or Elasticsearch (Jiang et al. 2018). These Lucene-based engines use a term frequency-inverse document frequency (TF-IDF) approach to compute the similarity between a user's query and document entries, which is insufficient to fully capture a user's search intention. For example, a user who searches for natural disaster in California (Query $q_1$) is probably more interested in a document that describes the Kincade Fire that burned in Sonoma County on Oct. 23rd, 2019, since wildfires are a type of natural disaster and Sonoma County is a subdivision of California. However, if this document contains neither the term "natural disaster" nor "California", a Lucene-based model will assign a zero relevance score between this document and Query $q_1$, thus resulting in low recall. This highlights the necessity of understanding the user's search intentions both semantically and spatially in a (G)IR system.
According to Dominich (2008), IR can be formally defined as in Equation 1, where $m$ is the degree of relevance; $R$ is the relevance relationship; $D$ is a set of (document) entries; $q$ is the user's query; and $I$ and $\to$ are implicit and inferred information. The most challenging part of this equation is how to obtain the implicit and inferred information $I$, $\to$ based on user queries. Query expansion techniques, which add terms and conditions to a user query with the goal of improving the query-object relevance score (Vechtomova 2009), can be used to take the user's search intention into account semantically. Traditional query expansion focuses on semantically enriching a user's query from a thematic perspective.
In the context of geoportals (e.g., ArcGIS Online), we argue that a user's query should be expanded (or semantically enriched) from two perspectives: thematic and geospatial. In the thematic aspect, a query can be enriched by adding thematically similar concepts or terms. For example, for Query $q_1$, some highly related topics of "natural disaster" such as earthquake, wildfire, flood, and hurricane can be added to the original query. In a geoportal, extra attention should be paid to the geospatial aspect, and geospatially related terms can be added to the query. For example, for Query $q_1$, we can consider adding the names of the subdivisions of California to the query. Since this process relies on the place hierarchy, we call it platial query expansion. Moreover, the spatial scopes of the query and entries can also be used to compute the spatial similarity between them. After being enriched from these two perspectives, the new query is applied to the geoportal in the hope of improving the recall of the GIR system.
Note that the core idea of query expansion is to minimize the mismatch between a user query and candidate entries so that the recall of the IR system is improved. A similar idea can be applied when we calculate spatial similarities between a user's query and entries. Most traditional spatial similarity measures are based on topological relations between the spatial scopes of the user's query and an entry. For example, Jiang et al. (2018) defined the spatial similarity between a query $q$ and a document entry $d$, denoted as $Sim(q, d)$, based on their geographic scopes $Area(q)$ and $Area(d)$ as well as their intersection $Area(q \cap d)$ (see Equation 2).
According to Equation 2, if $Area(q \cap d) = 0$, then $Sim(q, d) = 0$. In other words, if the geographic footprints of $q$ and $d$ do not intersect, the spatial similarity score is zero. This may lead to a loss of valuable spatial proximity information in many scenarios. To give a concrete example, if a user searches for Weather in Los Angeles (Query $q_2$), a map $d_1$ about Temperature in Oxnard should be considered more relevant than, say, a map $d_2$ about Temperature in Southern Africa. However, since the geographic scopes of both Oxnard and Southern Africa do not intersect with the footprint of Los Angeles ($Area(q_2 \cap d_1) = 0$ and $Area(q_2 \cap d_2) = 0$), we will have $Sim(q_2, d_1) = 0$ and $Sim(q_2, d_2) = 0$ according to Equation 2, which does not match our intuition.
In other words, it might be better to use a distance decay function here instead and thereby minimize the mismatch between the current query $q_2$ and $d_1$. Inspired by this observation, we use a Gaussian kernel distance decay function to compute the spatial similarity between the geographic footprints of the query and the documents. Using a distance decay function to optimize the query-document relevance is also related to work on query relaxation in the context of geographic question answering (Mai et al. 2019).
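The zero-overlap problem can be made concrete with a minimal sketch. It only computes the intersection area of axis-aligned bounding boxes, used here as a stand-in for geographic footprints; the `BBox` type, the function name, and the approximate coordinates are assumptions made for this illustration and are not taken from the original system.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Axis-aligned bounding box in degrees (hypothetical helper type)."""
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float


def intersection_area(a: BBox, b: BBox) -> float:
    """Overlap area in squared degrees; 0.0 when the boxes are disjoint."""
    width = min(a.max_lon, b.max_lon) - max(a.min_lon, b.min_lon)
    height = min(a.max_lat, b.max_lat) - max(a.min_lat, b.min_lat)
    return max(width, 0.0) * max(height, 0.0)


# Rough, illustrative boxes (assumed values, not authoritative footprints).
los_angeles = BBox(-118.67, 33.70, -118.16, 34.34)
oxnard = BBox(-119.25, 34.12, -119.08, 34.26)
southern_africa = BBox(11.7, -34.8, 40.9, -15.6)

# Both candidates are disjoint from the query footprint, so any measure that
# is proportional to the intersection area cannot tell them apart.
overlap_near = intersection_area(los_angeles, oxnard)
overlap_far = intersection_area(los_angeles, southern_africa)
```

Under these assumed boxes, the nearby and the distant candidate receive the same overlap, which is the loss of proximity information described above.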
The research contributions of this work are as follows:
1. We propose a semantic query expansion framework for geoportals which enriches a user's query from both thematic and geospatial aspects.
2. We develop a semantically-enriched search engine prototype for ArcGIS Online by implementing the proposed query expansion framework.
3. We collect a benchmark dataset to evaluate the presented framework against a widely used baseline model, Lucene's practical scoring function. The evaluation results show that our semantic query expansion framework outperforms the baseline by a significant margin.
The remainder of this work is structured as follows. In Sec. 2, we discuss related work on geographic information retrieval. Next, we present our query expansion framework and describe each component of this system in Sec. 3. In particular, in Sec. 3.1 we discuss the reproducibility of our work and provide guidelines related to data sets and software that facilitate future research along this line. In Sec. 4, we introduce the benchmark dataset we collect to evaluate our GIR framework and then discuss the evaluation results. Finally, in Sec. 5 we conclude our work and discuss future research directions.

Related Work
The idea of query expansion is to reformulate a user's query by adding semantically related concepts (Azad & Deepak 2019) to minimize the query-object mismatch and increase the recall of an IR system. This typically comes at the expense of reduced precision. Generally speaking, query expansion techniques can be classified into two categories: global analysis and local analysis (Azad & Deepak 2019). In global analysis, the expansion terms are selected based on manually built knowledge bases, knowledge graphs, or large corpora. Finding semantically related terms based on word embeddings (Mikolov et al. 2013, Mai et al. 2018) or topic modeling (Hu et al. 2015b) is an example. Local analysis refers to query expansion methods that select expansion terms based on the documents retrieved for the user's initial query. Example models include relevance feedback (Rocchio 1971) and pseudo-relevance feedback (Buckley et al. 1995). In this work, we adopt the global analysis method and use word embeddings to select terms that are semantically related to the query terms.
Many query expansion techniques are not directly applicable to geospatial terms. For example, it is more reasonable to select geospatially related terms based on place hierarchies (e.g., from a digital gazetteer) rather than using word embedding models. This suggests a need for handling the geospatial aspect separately in a query expansion task. For instance, Huang et al. (2008) classified queries into two types, location sensitive and location non-sensitive, and then handled them by using different query expansion techniques.
In the field of geographic information retrieval, a few works aim at ranking documents based on both textual and spatial relevance, such as the multidimensional scattered ranking method proposed by Van Kreveld et al. (2005). Our work follows a similar research direction but also adds platial similarity to the ranking algorithm.
In addition to query expansion, another line of work for building a semantically-enriched search engine for geoportals is to enrich the metadata. For example, Hu et al. (2015a) converted the metadata of ArcGIS Online items into Linked Data and then enriched the metadata to enable semantic search. Similar to our idea, Hu et al. (2015a) also considered semantic enrichment in two aspects: thematic and geospatial. However, converting data into another format for semantic enrichment requires additional processing steps, storage, and maintenance to keep both data sources in sync. In this work, we focus on enabling semantic search by using query expansion techniques in which the underlying data storage (e.g., Elasticsearch, Apache Lucene) remains unchanged.

Method
In this section, we first describe the dataset and project setup in Section 3.1. Next, we describe our semantic query expansion framework in detail. The proposed framework is composed of two major components, a geospatial component and a thematic component, which focus on different aspects. Figure 1 shows the overall architecture of the proposed framework. We present each component below with the example query Chicago traffic (Query $q_3$).

Data and Software Availability
Developed by the Environmental Systems Research Institute (Esri), ArcGIS Online is one of the best-known web geoportals. It contains a collection of web maps, data layers, tools, services, and applications contributed by GIS users all over the world (Hu et al. 2015a). Elasticsearch, a widely used search and analytics engine, is utilized to store the metadata of these ArcGIS Online items and to support the portal's search functionality. The metadata of each ArcGIS Online item has different fields such as "id", "title", "snippet", "description", "type", "location" (point), "coordinates" (the bounding box), and so on. The core search functionality of ArcGIS Online is based on Lucene's query-document similarity functions, which are computed from term frequency and inverse document frequency (TF-IDF), such as Lucene's practical scoring function, Okapi BM25, and so on. Therefore, Lucene's practical scoring function is a natural baseline for our semantic query expansion framework.
In order to establish an evaluation dataset for our search engine prototype, we collect 53,404 items using the ArcGIS Online RESTful API. The collection contains (1) all items published by Esri or its related organizations before September 2017 and (2) all items published on ArcGIS Online between June and September in 2014 and 2017.
We use Elasticsearch to host all the retrieved ArcGIS Online items. The proposed semantic query expansion framework serves as a middle layer, as shown in Figure 1, that semantically enriches the current user query. The expanded query is sent to the established Elasticsearch index to retrieve relevant ArcGIS Online items. The motivation here is to enable semantic search functionality on top of a portal such as ArcGIS Online without changing the underlying layers, e.g., the data storage. In order to evaluate the proposed semantic query expansion framework and compare it with the baseline, namely Lucene's practical scoring function, we also conduct a human participant test to obtain query-document relevance scores through the Amazon Mechanical Turk sandbox (https://www.mturk.com/). A detailed description of this benchmark dataset can be found in Section 4.2. The data and source code are available at https://github.com/gengchenmai/arcgis-online-search-engine, including (1) the evaluation benchmark dataset and (2) the source code of our query expansion framework. The established database is hosted by Elasticsearch 5.4.0 (https://www.elastic.co/blog/elasticsearch-5-4-0-released) with a vector scoring plugin (https://github.com/MLnick/elasticsearch-vector-scoring) to enable word embedding computation.
Given a query such as Chicago traffic, we first need to split it into a geospatial aspect and a thematic aspect. A place name recognition service (e.g., DBpedia Spotlight, https://www.dbpedia-spotlight.org/) is used to recognize the toponyms appearing in the query (in this case the city of Chicago) and then link them to the corresponding entities (dbo:Chicago) in a knowledge graph such as Wikidata or DBpedia. The identified places are then handled by the geospatial query expansion component, and the rest of the query is sent to the thematic query expansion component.
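The query-splitting step can be sketched as follows. This is only an illustration under stated assumptions: a tiny in-memory dictionary (`TOY_GAZETTEER`, a hypothetical name) stands in for a real place name recognition service, and apart from dbo:Chicago, which appears in the text, the entity identifiers are made up for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical mini-gazetteer standing in for a place name recognition
# service; only dbo:Chicago comes from the text, the rest is illustrative.
TOY_GAZETTEER: Dict[str, str] = {
    "chicago": "dbo:Chicago",
    "california": "dbo:California",
    "los angeles": "dbo:Los_Angeles",
}


def split_query(query: str) -> Tuple[List[Tuple[str, str]], str]:
    """Split a query into (recognized places, remaining thematic text)."""
    tokens = query.lower().split()
    places: List[Tuple[str, str]] = []
    thematic: List[str] = []
    i = 0
    while i < len(tokens):
        for n in (2, 1):  # try the longest span first
            chunk = tokens[i:i + n]
            span = " ".join(chunk)
            if span in TOY_GAZETTEER:
                places.append((span, TOY_GAZETTEER[span]))
                i += len(chunk)
                break
        else:
            # No place name starts at this token: treat it as thematic text.
            thematic.append(tokens[i])
            i += 1
    return places, " ".join(thematic)


places, thematic_text = split_query("Chicago traffic")
```

In this sketch the place list would be passed to the geospatial component and the remaining text to the thematic component; a production system would rely on the recognition service for disambiguation rather than on exact string lookup.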

Geospatial Query Expansion Component
The geospatial query expansion component focuses on improving the platial and spatial similarity between a user's query and a candidate ArcGIS Online item.
In order to facilitate the following query expansion process, we first enrich the identified geographic entities with additional information such as geographic coordinates, place names, total area, and their GeoNames identifier (see Listing 1). We call this the GeoEnrichment step (see Figure 1).

Platial Component
The platial component focuses on finding similar geographic terms based on the place hierarchy. We use the GeoNames service to get the top K subdivisions of the identified places. For example, we can add Belmont Cragin and Englewood as expanded geographic terms to the expanded query of Query $q_3$. The platial similarity between a query $q$ and an ArcGIS Online item $d_o$, denoted as $Sim_{platial}(q, d_o)$, is defined in Equation 3. Here, $p_i$ refers to the $i$th identified place from $q$; $W_{geo}(p_i)$ is the relative importance of place $p_i$ among all the identified places, with $\sum_{p_i \in q} W_{geo}(p_i) = 1$; $Q_{platial}(p_i)$ refers to the set of expanded geographic terms; $W_{platial}(p_i, p_{ij})$ indicates the importance of $p_{ij} \in Q_{platial}(p_i) \cup \{p_i\}$ with respect to the corresponding place $p_i$; $W_f(f_k)$ indicates the weight of matching one specific metadata field $f_k$, since matching some fields such as "title" is much more important than matching other fields such as "description", with $\sum_{f_k \in d_o} W_f(f_k) = 1$; and $M(p_{ij}, f_k)$ indicates the number of matches of the expanded geographic term $p_{ij}$ in the current field $f_k$.
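One plausible reading of these definitions is a nested weighted sum over places, expanded terms, and metadata fields. The sketch below follows that reading; the nesting itself, the function name, the use of case-insensitive substring counting for $M$, and all weights are assumptions made for illustration.

```python
from typing import Dict


def platial_similarity(
    place_weights: Dict[str, float],          # W_geo(p_i), sums to 1
    expansions: Dict[str, Dict[str, float]],  # W_platial(p_i, p_ij), includes p_i itself
    field_weights: Dict[str, float],          # W_f(f_k), sums to 1
    item: Dict[str, str],                     # metadata fields of one item
) -> float:
    """Assumed form: sum of W_geo * W_platial * W_f * M over all combinations."""
    score = 0.0
    for place, w_geo in place_weights.items():
        for term, w_platial in expansions[place].items():
            for field, w_field in field_weights.items():
                # M(p_ij, f_k): number of case-insensitive substring matches.
                matches = item.get(field, "").lower().count(term.lower())
                score += w_geo * w_platial * w_field * matches
    return score


# Illustrative weights and a made-up item (not taken from the dataset).
score = platial_similarity(
    place_weights={"Chicago": 1.0},
    expansions={"Chicago": {"Chicago": 0.6, "Belmont Cragin": 0.2, "Englewood": 0.2}},
    field_weights={"title": 0.5, "snippet": 0.3, "description": 0.2},
    item={
        "title": "Englewood traffic counts",
        "snippet": "Daily vehicle counts",
        "description": "Traffic in Chicago and Englewood",
    },
)
```

The example item never mentions Chicago in its title, yet it still receives a positive score through the expanded term Englewood, which is the intended effect of platial expansion.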

Spatial Component
The spatial component measures the spatial similarity between a query $q$ and an item $d_o$. Frontiera et al. (2008) discussed different geometric approaches to assessing spatial similarity, most of which are computed based on the topological relationships between the geographic scopes of query $q$ and item $d_o$. An example of such similarity measures is the Jaccard similarity index (Jaccard 1912). Some spatial similarity indices that are not based on topological relations also exist, such as the Hausdorff distance.
In this work, we use a distance decay approach with Gaussian kernels. Each identified place has a Gaussian kernel which is placed at the center of its bounding box. The bandwidth of a kernel is determined based on the bounding box of the corresponding place. The intuition comes from Tobler's First Law of Geography: the relatedness between query $q$ and item $d_o$ decreases with their distance. The ArcGIS Geocoding API is used to obtain the bounding boxes of the identified places. The spatial similarity $Sim_{spatial}(q, d_o)$ is defined in Equation 4, where $Gauss(p_i, d_o)$ is the Gaussian score between identified place $p_i$ and item $d_o$. The impact of different spatial similarity measures on the performance of this semantic query expansion framework is left for future work.
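A Gaussian distance decay score of this kind could look like the sketch below. Several details are assumptions chosen for illustration, since the text does not fix them: the item is represented by a single point, distances are great-circle distances in kilometers, and the bandwidth is set to half of the bounding-box diagonal.

```python
import math
from typing import Tuple

BBoxT = Tuple[float, float, float, float]  # (min_lon, min_lat, max_lon, max_lat)


def haversine_km(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance between two lon/lat points in kilometers."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


def gauss_score(place_bbox: BBoxT, item_point: Tuple[float, float]) -> float:
    """Gaussian decay of the distance between the bbox center and an item point."""
    min_lon, min_lat, max_lon, max_lat = place_bbox
    center_lon = (min_lon + max_lon) / 2
    center_lat = (min_lat + max_lat) / 2
    # Assumed bandwidth: half of the bounding-box diagonal.
    sigma = haversine_km(min_lon, min_lat, max_lon, max_lat) / 2
    if sigma == 0.0:
        raise ValueError("degenerate bounding box")
    distance = haversine_km(center_lon, center_lat, item_point[0], item_point[1])
    return math.exp(-(distance ** 2) / (2 * sigma ** 2))


# Approximate, illustrative coordinates (assumed values).
la_bbox = (-118.67, 33.70, -118.16, 34.34)
near = gauss_score(la_bbox, (-119.18, 34.20))  # a point near Oxnard
far = gauss_score(la_bbox, (28.05, -26.20))    # a point in Southern Africa
```

Unlike an overlap-based measure, this score stays positive for the nearby item and decays towards zero for the distant one, so the two candidates are ranked differently even though neither intersects the query footprint.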

Thematic Query Expansion Component
As the name indicates, thematic query expansion focuses on minimizing the query-item mismatch from a thematic, i.e., topic-based, point of view. To achieve this, we adopt two approaches: concept expansion and embedding-based document similarity. We discuss each of them below. Before performing thematic query expansion, text preprocessing steps such as tokenization, word lemmatization, and stop word removal are applied to extract thematic concepts or terms from the user's query, such as natural and disaster in Query $q_1$ and traffic in Query $q_3$.
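The preprocessing steps can be illustrated with a small, dependency-free sketch. The stop word list and the lemma table are tiny hypothetical stand-ins for what a real NLP toolkit would provide, and the function name is an assumption.

```python
import re
from typing import Dict, List, Set

# Hypothetical, deliberately tiny resources for illustration only.
STOP_WORDS: Set[str] = {"a", "an", "the", "in", "of", "for", "and", "on", "at", "to"}
TOY_LEMMAS: Dict[str, str] = {"disasters": "disaster", "crimes": "crime", "roads": "road"}


def extract_thematic_terms(text: str) -> List[str]:
    """Tokenize, lemmatize with a lookup table, then drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    lemmas = [TOY_LEMMAS.get(token, token) for token in tokens]
    return [lemma for lemma in lemmas if lemma not in STOP_WORDS]


# The place name is assumed to have been removed by the splitting step already.
terms = extract_thematic_terms("Natural disasters in")
```

A real pipeline would use a proper lemmatizer and a full stop word list; the lookup table here only shows where each step sits in the sequence.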

Concept Expansion Component
The idea of concept expansion is to find terms that are thematically similar to the query terms and add them to the expanded query clause. This is a common way to do query expansion (Jiang et al. 2018, Hu et al. 2015b). Unlike previous work in GIR, which uses a semantic knowledge base (Jiang et al. 2018) or topic modeling (Hu et al. 2015b) to find thematically similar terms, we use a word embedding technique (Mikolov et al. 2013). A similar approach has been used in developing an academic search engine (Mai et al. 2018). Given the term traffic, a word embedding model finds thematically similar terms such as congestion, rail, train, roads, and so on.
Equation 5 shows the thematic similarity between $q$ and $d_o$ based on concept expansion, $Sim_{concept}(q, d_o)$. Here, $t_i$ indicates a thematic term in the user's query, such as traffic. $W_{thematic}(t_i)$ is the normalized weight of $t_i$ among all thematic query terms, with $\sum_{t_i \in q} W_{thematic}(t_i) = 1$. $T_{w2v}(t_i)$ indicates the set of terms that are thematically similar to $t_i$ based on a pretrained word embedding model such as GloVe (Pennington et al. 2014). Each expanded term $t_{ij}$ carries a normalized weight with respect to $t_i$ based on their cosine similarity. $M(t_{ij}, f_k)$ refers to the number of matches of the expanded thematic term $t_{ij}$ in the current field $f_k$.
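The selection and weighting of expanded terms can be sketched with toy vectors. The three-dimensional embeddings below are invented for illustration (a real system would load pretrained vectors), and including the original term in the weighted set with a raw weight of 1 is an assumption of this sketch.

```python
import math
from typing import Dict, List

# Made-up 3-dimensional vectors, only for illustration.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "traffic": [0.9, 0.1, 0.0],
    "congestion": [0.8, 0.2, 0.1],
    "roads": [0.7, 0.3, 0.0],
    "hurricane": [0.0, 0.1, 0.9],
}


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_term(term: str, embeddings: Dict[str, List[float]], k: int = 2) -> Dict[str, float]:
    """Return the term plus its top-k neighbours, with weights normalized to sum to 1."""
    sims = {
        word: cosine(embeddings[term], vec)
        for word, vec in embeddings.items()
        if word != term
    }
    top = sorted(sims, key=lambda word: sims[word], reverse=True)[:k]
    raw = {term: 1.0}
    raw.update({word: sims[word] for word in top})
    total = sum(raw.values())
    return {word: weight / total for word, weight in raw.items()}


weights = expand_term("traffic", TOY_EMBEDDINGS, k=2)
```

The resulting weights could then be combined with the field weights and match counts described above. With real embeddings, cosine similarities can be negative, so a similarity threshold may be needed before normalizing.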

Embedding-Based Document Similarity Component
Instead of explicitly matching the expanded thematic terms to ArcGIS Online items, the embedding-based document similarity compares query $q$ and item $d_o$ in the hidden word embedding space. Equation 6 shows how the similarity score is defined. $E_{query}(q) = \sum_{t_i \in q} Word2Vec(t_i)$ is the embedding of query $q$, which is computed by simply adding the word embeddings of all thematic terms in the query $q$. $E_{doc}(d_o)$ is the document embedding of $d_o$, which is computed as the TF-IDF weighted sum of the word embeddings of the terms in its title, snippet, and description.
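The two embeddings can be sketched as follows. The comparison by cosine similarity is an assumption of this sketch, as are the toy vectors, the IDF values, and all function names.

```python
import math
from collections import Counter
from typing import Dict, List, Mapping

Vector = List[float]

# Made-up vectors and IDF values, only for illustration.
EMB: Dict[str, Vector] = {
    "traffic": [0.9, 0.1, 0.0],
    "congestion": [0.8, 0.2, 0.1],
    "roads": [0.7, 0.3, 0.0],
    "hurricane": [0.0, 0.1, 0.9],
}
IDF: Dict[str, float] = {"congestion": 1.5, "roads": 1.0, "hurricane": 2.0}


def weighted_sum(weights: Mapping[str, float], emb: Dict[str, Vector]) -> Vector:
    """Sum of word vectors scaled by their weights; unknown words are skipped."""
    dim = len(next(iter(emb.values())))
    acc = [0.0] * dim
    for term, weight in weights.items():
        if term in emb:
            for i, x in enumerate(emb[term]):
                acc[i] += weight * x
    return acc


def query_embedding(terms: List[str], emb: Dict[str, Vector]) -> Vector:
    """E_query(q): plain sum of the word vectors of the thematic terms."""
    return weighted_sum(Counter(terms), emb)


def doc_embedding(tokens: List[str], emb: Dict[str, Vector], idf: Dict[str, float]) -> Vector:
    """E_doc(d_o): word vectors weighted by term frequency times IDF."""
    tf = Counter(tokens)
    return weighted_sum({t: n * idf.get(t, 0.0) for t, n in tf.items()}, emb)


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


q_vec = query_embedding(["traffic"], EMB)
sim_roads = cosine(q_vec, doc_embedding(["congestion", "roads", "roads"], EMB, IDF))
sim_storm = cosine(q_vec, doc_embedding(["hurricane"], EMB, IDF))
```

Neither toy document contains the word traffic, yet the road-related one ends up much closer to the query in the embedding space, which is the behavior this component is meant to provide.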

Expanded Query Construction
The overall similarity between a query $q$ and an ArcGIS Online item $d_o$ is a weighted sum of all four components: the platial (place-based) component, the spatial component, the concept expansion component, and the embedding-based document similarity component. $\lambda_{platial}$, $\lambda_{spatial}$, $\lambda_{concept}$, and $\lambda_{doc}$ are their corresponding weights.
$$Sim(q, d_o) = \lambda_{platial} \, Sim_{platial}(q, d_o) + \lambda_{spatial} \, Sim_{spatial}(q, d_o) + \lambda_{concept} \, Sim_{concept}(q, d_o) + \lambda_{doc} \, Sim_{doc}(q, d_o) \qquad (7)$$
In practice, each component can be written as a collection of function score query clauses in Elasticsearch. Figure 2 shows an example of the Elasticsearch query constructed by the proposed semantic query expansion framework for the given Chicago traffic query. Each component is highlighted. Executing this expanded query against the established Elasticsearch index gives us the final search result.
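The weighted combination and one possible shape of the resulting Elasticsearch request body can be sketched as follows. This is not the query shown in Figure 2: the helper names, the weights, and the choice of `match_phrase` filters are assumptions, and a filter with a weight only captures whether a term matches a field, not how often.

```python
from typing import Dict, List


def combine(scores: Dict[str, float], lambdas: Dict[str, float]) -> float:
    """Weighted sum of the four component scores, as in Equation 7."""
    return sum(lambdas[name] * scores[name] for name in lambdas)


def term_clauses(
    term_weights: Dict[str, float],
    field_weights: Dict[str, float],
    component_weight: float,
) -> List[dict]:
    """One function_score clause per (term, field) pair (hypothetical helper)."""
    clauses = []
    for term, w_term in term_weights.items():
        for field, w_field in field_weights.items():
            clauses.append({
                "filter": {"match_phrase": {field: term}},
                "weight": component_weight * w_term * w_field,
            })
    return clauses


def build_query(clauses: List[dict]) -> dict:
    """Wrap the clauses in a function_score query whose clause weights are summed."""
    return {
        "query": {
            "function_score": {
                "query": {"match_all": {}},
                "functions": clauses,
                "score_mode": "sum",      # add up the weights of matching clauses
                "boost_mode": "replace",  # ignore the score of the inner query
            }
        }
    }


# Illustrative numbers only.
overall = combine(
    scores={"platial": 0.26, "spatial": 0.22, "concept": 0.5, "doc": 0.9},
    lambdas={"platial": 0.25, "spatial": 0.25, "concept": 0.25, "doc": 0.25},
)
clauses = term_clauses(
    term_weights={"chicago": 0.6, "englewood": 0.4},
    field_weights={"title": 0.7, "description": 0.3},
    component_weight=0.25,
)
body = build_query(clauses)
```

Expressing each component as weighted clauses keeps the scoring inside the search engine, which matches the stated goal of leaving the underlying index unchanged.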

Semantically-Enriched Search Engine
Based on the semantic query expansion framework presented in Section 3, we develop a semantically-enriched search engine prototype for ArcGIS Online on top of the established Elasticsearch index. Figure 3 is a screenshot of the developed system, in which the radio buttons Semantic Search and Lucene correspond to our semantic query expansion based GIR model and the baseline, respectively. The baseline is an IR model based on Lucene's practical scoring function, which we call the Lucene baseline in the following. The web interface is available online. A mobile application is also developed based on AppStudio for ArcGIS (see Figure 4).

Evaluation
A collection of user search logs would be an ideal benchmark dataset for evaluating the presented framework as well as the Lucene baseline, as used by Jiang et al. (2018). As search logs are not available for the current project, we build our own evaluation dataset. The benchmark dataset construction process can be summarized as follows:
1. We collect a query set which consists of 20 queries. All queries can be seen in Table 1. The first 10 queries are obtained from Hu et al. (2015b), while we manually generate another 10 queries based on the topics and geographic coverage of the collected ArcGIS Online items.
2. For each query, we get the top 10 search results from our semantic query expansion model as well as from the Lucene baseline.
3. We create a survey form for each query and each model. Each survey form consists of one query and 10 randomly ordered ArcGIS Online items. Users are then asked to judge the relevance between the query and each item on an ordinal scale, with labels such as "Perfect" (4), "Good" (3), "Some Relevance" (2), "Fair" (1), and "Bad" (0). The numbers in parentheses are used as the corresponding relevance scores. An example survey form can be seen in Figure 5.
4. To host these surveys, a crowd-facing Web interface is developed and deployed on the Amazon Mechanical Turk sandbox environment.
5. Eight users from different departments of a US university completed these surveys.
In total, we have 40 survey forms, 20 for each GIR model, completed by 8 different assessors. The average relevance score among these 8 assessors' results is treated as the relevance score $rel$ between a query and an item in one form.
Discounted Cumulative Gain at top K rank (DCG@K) (Carterette & Jones 2008, Järvelin & Kekäläinen 2002) is a typical evaluation metric for information retrieval systems. DCG is the weighted sum of the "gains" of presenting specific items, where each weight is a discount factor determined by the position at which an item is ranked. For IR systems, DCG at top K rank is defined as shown in Equation 8, in which $rel_i$ indicates the relevance score between a query and an item (the gain) and $\frac{1}{\log_2 i}$ is the discount factor based on the current rank $i$.
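A minimal sketch of this metric is given below. Because the discount $\frac{1}{\log_2 i}$ is undefined at rank 1, the sketch assumes that the first item is left undiscounted, which is a common convention for this form of DCG; the function names and the example scores are hypothetical.

```python
import math
from typing import Sequence


def mean_relevance(assessor_scores: Sequence[int]) -> float:
    """Average the ordinal labels (0 to 4) given by all assessors for one item."""
    return sum(assessor_scores) / len(assessor_scores)


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG over the top-k ranked items with a 1 / log2(rank) discount.

    Assumption: rank 1 is not discounted, since log2(1) = 0.
    """
    total = 0.0
    for rank, rel in enumerate(relevances[:k], start=1):
        discount = 1.0 if rank == 1 else 1.0 / math.log2(rank)
        total += rel * discount
    return total


# Made-up relevance scores for a ranked list of three items.
ranked_relevances = [4.0, 3.0, 2.0]
dcg3 = dcg_at_k(ranked_relevances, k=3)
```

Under this convention the first two ranks carry full weight (since $\log_2 2 = 1$), and discounting only takes effect from the third position onward.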
We choose DCG@3, DCG@5, and DCG@10 as the evaluation metrics, and Table 1 shows the evaluation results of both our semantic search model and the Lucene baseline on each query. Some interesting observations can be made based on Table 1:
1. By comparing the average DCG scores, our semantic search model outperforms the Lucene baseline by a significant margin.
2. In 17 out of 20 queries, the semantic search model outperforms the Lucene baseline with ∆DCG@K > 3.
3. For two queries (Query 2 and Query 8), the semantic search model and the Lucene baseline yield relatively similar DCG scores (difference < 1).
4. The only query for which our semantic search model performs clearly worse is Query 10, Crimes in Tennessee. After examining the top 10 search results of the two models, we find that:
   a. All top 10 search results of the Lucene baseline are crime maps about other places such as New York and Miami, or worldwide crime reports. Basically, the Lucene baseline fetches these items based on thematic similarity.
   b. 9 out of 10 search results of the semantic search model are about other topics in Tennessee such as public health, energy, and banking, while one item is about crimes in neighboring states. Of these 9 items, 7 do not contain any place names in their title, snippet, or description but have spatial footprints close to the center of Tennessee. This implies that the semantic search model finds these items mostly based on spatial similarity.
   c. There is actually no correct answer about crime in Tennessee.
   d. However, based solely on these observations we cannot conclude that people pay more attention to thematic similarity than to spatial similarity. This bias may be caused by the design of the survey form, in which thematic similarity is relatively easy to judge while spatial similarity is rather difficult, as users need to click the link and go to the web map to see the geographic scope of an item.
   e. These observations raise an interesting question: how should an appropriate survey form be designed for evaluating GIR systems, in contrast to more general IR systems?

Conclusion
In this work, we present a semantic query expansion framework for geographic information retrieval systems. It enriches a user's query from both geospatial and thematic perspectives. Two components are developed for each perspective. Using ArcGIS Online as an example, we develop a semantically-enriched search engine prototype by following the proposed query expansion framework. We construct a benchmark dataset to evaluate the proposed GIR model as well as a widely used baseline model, Lucene's practical scoring function. The results demonstrate that our semantic query expansion model significantly outperforms the Lucene baseline, thereby highlighting the effectiveness of our proposed approach. As for future research, we want to improve the efficiency of the presented semantic query expansion framework. We also want to investigate other ways to measure spatial similarity such as Space2Vec (Mai et al. 2020). In addition, we are interested in evaluating the impact of different spatial similarity measures on the performance of GIR systems more generally. Moreover, we plan to investigate whether the added geospatial aspect of GIR affects how we evaluate the system.

UC Santa Barbara for evaluation data annotations: Jingyi Xiao, Ning Zhang, Haoxin Zhou, and Yao Xuan.