Spatial Data Lake for Smart Cities: From Design to Implementation

In this paper, we propose a methodology for designing a data lake dedicated to spatial data, together with an implementation of this specific framework. Inspired by previous proposals on general data lake design and based on the Geographic information — Metadata standard (ISO 19115), the contribution presented in this paper integrates, with the same philosophy, the spatial and thematic dimensions of heterogeneous data (remote sensing images, textual documents, sensor data, etc.). To support our proposal, the process has been implemented in a real data project in collaboration with Montpellier Métropole Méditerranée (3M), a metropolis in the South of France. This framework offers a uniform management of the spatial and thematic information embedded in the elements of the data lake.

not already defined [27]. To overcome this issue, the data lake [16] has emerged as a new approach to data management, with total or partial storage of the associated elements (data and metadata). Despite this new wave, there is still a lack of methodological proposals and success stories in the data lake design domain, considering that such an infrastructure requires more technical skills than design skills. In this paper, we address this topic, starting from large heterogeneous data with a strong spatial dimension and elaborating on how to design and implement this specific case study. Inspired by previous works on spatial data normalization [19] and data lake design [40,27,36], we develop a specific methodology and the associated implementation code, shared with the community. We demonstrate that data lake infrastructures are not only expert-dedicated but can be end-user oriented by offering a suitable query interface. This paper is organized as follows. In Section 2, we present the definitions and works related to data lake design and spatial information management. The proposed methodology is detailed in Section 3, followed by the implementation description in Section 4. Section 6 concludes with discussions and future works on the spatial data lake solution.

Related Work
Many data management methods have emerged with the advent of Big Data [28]. These are essentially NoSQL (Not Only SQL) databases [7], data warehouses [21,31] and data lakes [35,16].

Data warehouse and Data Lake
Data warehouses have been designed as an optimization of relational databases for querying, and are used to support decision making in organizations. Conceptual models of data warehouses are based on the following concepts: facts and measures, dimensions, hierarchies and members [22]. In fact, designing a data warehouse means defining a space of possible cross-tabulations that users will employ to explore the data. Data warehouses enable the easy exploration of a large dataset by users, but implementing a data warehouse implies the normalization of each entering data item from its various data sources (and the automation of this normalization via ETL (Extract Transform Load) processes). Despite some propositions, the integration of documents and satellite images into a data warehouse is not a simple task. In summary, the implementation of a data warehouse relies heavily on the normalization of data. Data lake definitions have been introduced in [11]; a detailed comparison with data warehouses is proposed in [27] and revisited in [36]. The data lake is a recent solution that has emerged to meet the needs related to the management and use of big data, for which data warehouses have shown their weaknesses, the main problem being the management of the heterogeneous nature of data. A data lake is a storage structure that allows storing massive data from different sources in their native format, without the need to perform processing beforehand [35,16]. According to [36], a data lake is a scalable system for the storage and analysis of data of all types, stored in their native format and essentially intended for data specialists such as statisticians, analysts and data scientists.
The main characteristics associated with data lakes are: metadata catalog to facilitate the access and to reinforce the quality of data, policies and governance tools, accessibility for users, management of evolving items, ingestion of any type of data, physical and logical organizations.
Being a new Big Data technology, the data lake is addressed by many studies. Thanks to its ability to handle large volumes of structured and unstructured data, an exploratory study was conducted to improve the understanding of the use of the data lake approach in an enterprise context [25]. In [15,29], new data lake architectures were designed in order to extract relevant information from heterogeneous data based on their sources. In [33], a Generic and Extensible Metadata Management System for data lakes (called GEMMS) was developed to automatically extract metadata from a wide variety of data sources. The authors of [16] designed a metadata management system, firstly to extract metadata from data sources, and secondly to enrich data sources with semantic information coming from the whole set of data and metadata. A wide number of metadata systems have been proposed by the community, but data lake data management still faces challenges that have to be overcome [30].
We define a data lake as a storage structure composed of datasets, exhibiting a subset of the previously cited characteristics, as detailed in Section 3.

Geographical Information
Several definitions are associated with the concept of territory, depending on the study domain. Among them, according to [38], a territory is the combination of three dimensions: geographical space, time and social relationships. The territory is thus defined as a complex system located in a specific geographical space that emerges from the co-evolution of a bundle of heterogeneous processes (anthropological-cultural, relational, cognitive and economic-productive) that characterize that space in a unique and unrepeatable way. Taking these existing definitions into account, we define the concept of territory as a set of physical and/or legal actors: physical in the sense that it is inhabited by one or more groups of people interacting with each other, and legal in the sense that it is composed of several political and economic organizations, etc., described by a set of geographic information, namely spatial, thematic and temporal entities that interact with each other. This information evolves in time and space.
In this work, we mainly focus on the geographic information produced and managed by the city. We thus base our proposal on the spatial data standard [19] to support the spatial dimension. To the best of our knowledge, there are no studies dealing specifically with the design of a spatial data lake. In the rest of this paper, we present a new methodology that provides users with a guideline for designing and implementing such a framework. The design is performed to store the data produced and used by the French metropolis Montpellier Métropole Méditerranée (3M). Our case study dictates some constraints: the spatial description of datasets and spatial analyses are essential (in particular, we need to store satellite images), the proposed system should be interoperable with other local, national and European systems, and the users need to explore the data lake in order to find relevant data, and eventually discover new ones.
The proposed system is composed of three main parts: the data section, the metadata section and the inter-metadata section. The data section is the core storage structure, based on Hadoop Distributed File System (HDFS) [37]. The metadata section is a data catalog [24], describing the data stored in the data lake. The inter-metadata section is a part of the metadata section, that enables the storage of richly described relationships between the data in the data lake.
HDFS is an efficient system to store big data, but cannot be used alone by our users. The users of a data lake need to explore data in order to find the most relevant data with respect to their query, and possibly discover new data and new knowledge. These functions (exploration, querying, discovery) are offered by the data catalog, which provides data lake users with relevant metadata through a user-friendly interface. The proposed conceptual model is an extension of the ISO 19115 norm [19]. This norm includes spatial representations and is the basis of several metadata profiles (INSPIRE, Dublin Core) used by public institutions.
In the following, we present our proposed extension. See Figure 1 for an overview of the proposed conceptual model. The white classes come from [19], and the yellow classes are our additions. In order to lighten the figures, we represent only the classes and not their attributes.
In the Data section (see Figure 2), we define a data lake as a set of resources. A resource can be a service (see ISO 19115) or a data series. A data series is composed of one (or more) datasets, that share a feature. A dataset is a collection of identifiable data. Three types of particular dataset are defined: document, vector and raster.
The Metadata section describes metadata records (see Figure 3). Each resource is associated with a metadata record. A metadata record is composed of: an identification (mandatory), which enables the recognition of each resource by users; a spatial representation (optional); a reference system information (optional), identifying the spatial, temporal and parametric reference system(s) used by the resource; an extent (optional), describing the spatial and/or temporal extent of the resource; a content description (optional); a lineage (optional), explaining how the resource has been obtained; and one or several associated resources.
Finally, the Inter-Metadata section (see Figure 4) records relationships between datasets and enables users to have a view of the data related to their initial query. Four types of relationships are proposed, based on [36]: parenthood, containment, similarity and thematic grouping.
Fig. 1. Overview of the proposed conceptual model (white classes are from [19]; yellow classes are added to the norm).
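As a reading aid, the three sections of the conceptual model can be sketched as a minimal object model. This is a sketch only: the class and attribute names below are illustrative simplifications, not the ISO 19115 identifiers.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class DatasetType(Enum):
    # the three particular dataset types defined in the Data section
    DOCUMENT = "document"
    VECTOR = "vector"
    RASTER = "raster"

class RelationshipType(Enum):
    # the four inter-metadata relationships, based on [36]
    PARENTHOOD = "parenthood"
    CONTAINMENT = "containment"
    SIMILARITY = "similarity"
    THEMATIC_GROUPING = "thematic grouping"

@dataclass
class MetadataRecord:
    identification: str                           # mandatory
    spatial_representation: Optional[str] = None  # optional
    reference_system: Optional[str] = None        # optional
    extent: Optional[str] = None                  # spatial and/or temporal
    content_description: Optional[str] = None     # optional
    lineage: Optional[str] = None                 # how the resource was obtained

@dataclass
class Dataset:
    name: str
    dtype: DatasetType
    metadata: MetadataRecord

@dataclass
class Relationship:
    source: Dataset
    target: Dataset
    rtype: RelationshipType

@dataclass
class DataLake:
    datasets: List[Dataset] = field(default_factory=list)
    relationships: List[Relationship] = field(default_factory=list)
```

The optional fields default to `None`, mirroring the fact that only the identification is mandatory in a metadata record.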

Implementation for a specific city use-case
In this section, we report the first implementation of the Spatial Data Lake provided for 3M.

Presentation of the infrastructure
As described in the previous section, the data lake is composed of two parts: Data and Metadata. The Data section is based on an HDFS cluster. Our implementation uses three

Fig. 2. Data section (white classes are from [19]; yellow classes are added to the norm).

Fig. 3. Metadata section (white classes are from [19]; yellow classes are added to the norm).
HDFS nodes. The first one is the name-node, which distributes or aggregates blocks of datasets across the two other data-nodes.
Concerning the Metadata section, HDFS provides neither an indexing system nor a search engine; they have to be built upon the data lake by the administrator [20]. Elasticsearch, based on Apache Lucene, meets those two needs [8,20]. Our metadata system implementation uses GeoNetwork, a catalog application to manage spatially referenced resources. This web application embeds Apache Lucene and implements the ISO 19115 conceptual model. GeoNetwork stores the mandatory and optional metadata described in the previous section, including the HDFS path inside the data lake. So when the user queries the search engine,

Fig. 4. Inter-Metadata section
GeoNetwork responds with a collection of metadata describing datasets and offers HDFS links to download the corresponding data.
Inserting and indexing datasets inside the Data Lake

As shown in Figure 5, dataset insertion occurs in five steps.

Discover and access datasets using the data lake

Users can discover and access the datasets using the GeoNetwork search engine. Queries can combine three dimensions:
1. semantic: based on keywords or on a full-text search on title, abstract and lineage,
2. spatial: drawing a bounding box on a map to filter by geographical extent,
3. temporal: filtering by year, month and day.
GeoNetwork returns a collection of metadata that describe the data. Users can browse these metadata and find HDFS links to download the corresponding dataset (Figure 6).

Deploy and populate the Data Lake
We have automated the deployment and the data ingestion of the data lake in two steps. First, the system infrastructure must be created, configured and initiated: the data and metadata zones are automatically deployed on four computers.

Fig. 5. Inserting and indexing datasets in the data lake
Secondly, a Python script must be started. Reading a CSV file, it finds links to data sources available on the Internet. Using this information, the script downloads the data, inserts them inside the data lake and references them through the GeoNetwork server.

Deploy a HDFS cluster and a GeoNetwork server
The full cluster is deployed and maintained using open-source projects. The HDFS cluster can be deployed and configured using a single Vagrant command line.
Then the administrator has to connect to the name-node in order to format the file system. Afterwards, GeoNetwork can be deployed using a similar Vagrant command line. After starting the two parts of the data lake (the metadata management system and the HDFS cluster), four virtual machines are started and set up: three for the HDFS cluster and one for the metadata management system. More technical information and instructions on how to deploy an HDFS cluster with GeoNetwork as a metadata management system can be found in the README.md inside the git repository of our project.
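The deployment sequence just described can be sketched as follows. The machine names and multi-machine layout are illustrative assumptions; the authoritative commands and Vagrantfile layout are given in the repository's README.md.

```shell
# Sketch of the deployment sequence (illustrative machine names; see the
# project README.md for the authoritative commands).

# 1. Bring up the HDFS cluster (one name-node, two data-nodes)
vagrant up

# 2. Format the distributed file system from the name-node (first boot only)
vagrant ssh namenode -c "hdfs namenode -format"

# 3. Deploy the GeoNetwork metadata-management VM with a similar command
vagrant up geonetwork
```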
If the default variables are used, the data lake file system can be browsed graphically by connecting to the web server of the name-node at http://10.0.0.10:9870. Other information, such as cluster health and log accessibility, is also available.
Populate the Data Lake

The populating step is implemented using two scripts, written in Python and R. Indeed, Python offers an excellent library to interact with HDFS, while R has interesting modules to manage ISO 19115 metadata. In order to reduce the complexity generated by the concomitant use of these two languages, the R script has been encapsulated inside the Python script; thus, the administrator only needs to run the Python script. As mentioned above, all the code files are available (see Section 4.4). Environment requirements (dependencies) can be found in the requirement.txt file inside the repository. Instructions on how to install and start the script are described in the README.md file.
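The encapsulation of the R script inside the Python script can be sketched with the standard library's `subprocess` module. The script and file names below are hypothetical; the real ones live in the project repository.

```python
import subprocess

def build_rscript_command(r_script: str, dataset_json: str) -> list:
    """Build the command line that runs the metadata-generation R script.

    `r_script` and `dataset_json` are hypothetical file names used for
    illustration only.
    """
    return ["Rscript", r_script, dataset_json]

def run_r_metadata_step(r_script: str, dataset_json: str) -> None:
    # check=True makes the Python wrapper fail loudly if the R step fails
    subprocess.run(build_rscript_command(r_script, dataset_json), check=True)
```

This design keeps a single entry point (the Python script) while delegating the ISO 19115 work to R, at the cost of requiring an `Rscript` executable on the administrator's machine.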
The main Python script works in five steps. First, it parses the information given by 'datasources.csv', such as data provider, dataset name and keywords. Secondly, the script browses the data provider's website in order to build a JSON file that contains web links to download the corresponding data. Then, these data files are downloaded and stored inside the data lake. Finally, the R script is executed in order to create ISO 19139 XML files (ISO 19139 is the standard XML implementation of ISO 19115), which are uploaded to GeoNetwork.
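The first step of the pipeline can be sketched with the standard library; the column names of 'datasources.csv' below are illustrative assumptions, the real schema being defined in the project repository. The later steps (browsing provider sites, downloading, HDFS insertion and ISO 19139 generation) need network access and a running cluster, so only the parsing step is shown executable here.

```python
import csv
import io

def parse_datasources(csv_text: str) -> list:
    """Step 1: parse the data-source inventory ('datasources.csv').

    Hypothetical columns: provider, dataset, keywords (';'-separated).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "provider": row["provider"].strip(),
            "dataset": row["dataset"].strip(),
            # keywords feed the semantic dimension of the GeoNetwork catalog
            "keywords": [k.strip() for k in row["keywords"].split(";") if k.strip()],
        }
        for row in reader
    ]
```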
In Figure 7, a screenshot of a metadata sheet of a dataset is presented. A set of data files is associated with a name and a link to download the data from the HDFS cluster.
Data can be easily obtained by following the given link. The namenode of the HDFS cluster offers a REST API that ensures the transfer of the data to the user.
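The REST API mentioned here is Hadoop's standard WebHDFS interface, so such a download link can be built as below; the host and path are examples only.

```python
def webhdfs_open_url(namenode: str, hdfs_path: str, port: int = 9870) -> str:
    """Build the WebHDFS URL that streams a stored file back to the user.

    The name-node answers such a request with an HTTP redirect to the
    data-node that actually serves the file content.
    """
    return "http://{}:{}/webhdfs/v1{}?op=OPEN".format(namenode, port, hdfs_path)

# e.g., for the name-node of our default deployment:
# webhdfs_open_url("10.0.0.10", "/data/mobility_2013.csv")
```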

Example of a user query
The user can build complex queries mixing the three dimensions: spatial, temporal and semantic. These queries are made through the GeoNetwork search engine. The three dimensions can be requested in a full-text query, such as in the following example: "3M mobility 2013". The search engine will propose all datasets whose temporal period includes or intersects the year 2013 and whose spatial extent includes or intersects the 3M spatial extent. Finally, only the datasets that have the keyword "mobility" in their metadata will be shown to the user.
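The three-dimensional filtering of this example can be illustrated with a small, standalone re-implementation; this is not GeoNetwork's actual engine, and the record fields and bounding-box values are invented for the sketch.

```python
def intersects(a, b):
    """True when two (min_lon, min_lat, max_lon, max_lat) boxes overlap."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def matches_query(record, keyword, year, bbox):
    """Check one metadata record against the semantic, temporal and
    spatial dimensions of a query (illustrative record schema)."""
    semantic = keyword.lower() in (k.lower() for k in record["keywords"])
    temporal = record["year_start"] <= year <= record["year_end"]
    spatial = intersects(record["bbox"], bbox)
    return semantic and temporal and spatial
```

For the query "3M mobility 2013", a record is kept only when all three predicates hold at once.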
The spatial dimension can also be expressed using a map. The user can draw a bounding box around a region to obtain a first filtering on the selected extent, and can then enrich the query by adding temporal and semantic constraints in full text.

Data and Software
Computational environment

The computational environment (HDFS cluster and GeoNetwork) is provided using four virtual machines. The implementation and deployment of these machines have been automated, and the scripts are available at https://github.com/aidmoit/ansible-deployment, with instructions included in the README.md file in the repository. The corresponding commit number is 65de950a336ee2828cdb19db976b7946649c439c and the repository is published under the GPL-3 license.
Software

All the software for retrieving data, ingesting them in the HDFS cluster and ingesting their descriptions in GeoNetwork is orchestrated by a Python script. Implementation resources are available through this repository: https://github.com/aidmoit/collect. The corresponding commit number is da9f63f9287a191d7e8fd24884a731bae02e1034 and the repository is published under the GPL-3 license. The scripts rely on two R packages: geometa [6] and geonapi [5].

Discussion
In this section, we discuss the main features (enumerated in [36]) provided by our proposal in comparison to existing solutions. Among these features, we notice Semantic Enrichment (SE), which offers semantic data related to the data in the lake, either by context description or by descriptive tags, as well as Link Generation (LG), Data Polymorphism (DP), Data Versioning (DV) and Usage Tracking (UT).

Table 1. State-of-the-art approaches and the features they provide (⧫: data lake implementation; ◊: metadata model; □: model or implementation assimilable to a data lake; ✓: feature is available):

SPAR [14]: ⧫ □ | ✓ ✓ ✓ ✓
Alrehamy and Walker [42]: ⧫ | ✓ ✓
Data wrangling [41]: ⧫ | ✓ ✓ ✓ ✓
Constance [16]: ⧫ | ✓ ✓
GEMMS [32]: ◊ | ✓
CLAMS [12]: ⧫ | ✓
Suriarachchi and Plale [40]: ⧫ | ✓ ✓
Singh, K. et al. [39]: ⧫ | ✓ ✓ ✓ ✓
Farrugia et al. [13]: ⧫ | ✓
GOODS [17]: ⧫

Table 1 shows the state-of-the-art approaches and the features they provide. Among seventeen (17) proposals, only one approach [34] proposes a data lake implementation associated with a metadata management system. Moreover, this approach, like the majority, has been set up for a very specific case study and does not allow, or hardly takes into account, complex data such as spatial data (satellite data). In terms of completeness with regard to the features mentioned above, approaches such as [4,18,17,34,36,41] each cover more than half of them. Unfortunately, these solutions are difficult to apply: their implementation is not clearly described, or they have been proposed but not yet implemented.
Our solution offers not only a data lake implementation but also an associated metadata management system, and clearly shows how the two concepts are integrated with each other, with fully open source code. It also covers all the features described in Table 1, while implementing both the data lake and the metadata management system. Moreover, our approach solves problems related to complex (e.g. spatial) data storage and takes into account any type of data, thanks to the ISO 19115 standard. It is also easily reproducible and compatible with international catalog systems. In terms of access, our system offers, through GeoNetwork, a user interface that allows any user to explore or retrieve data from the lake.
Our solution can be used by two types of users. First, 'general public' users: the system allows them to explore and retrieve the data available in the lake through the web interface offered by GeoNetwork, without needing data lake exploitation skills. Second, advanced users, who, in addition to exploring, can perform processing and analysis directly on the data lake using tools such as Apache Spark.

Conclusion and Perspectives
In this paper, we presented a new methodology for spatial data lake design. The main contributions are the introduction of the spatial dimension in the data lake design process, based on the Geographic Information Metadata standard, as well as an overall code process provided to the scientific community. We also showed that a data lake can be end-user oriented with a specific query interface. Future works are dedicated to better managing the evolution and the behavior of a territory. They mainly concern two objectives: 1. studying how heterogeneous data can be linked semantically for the analysis of complex spatio-temporal phenomena on a territory, and 2. defining original data mining techniques suited for the processing and analysis of this massive heterogeneous data.
Achieving these objectives will allow us to describe the relationships between the themes while taking into account the spatio-temporal aspect on the one hand, and on the other hand to show how these themes contribute to the description of the territory's evolution. In other words, the spatial data lake is becoming a fundamental element to reach Smart Territories.