On the Evaluation and Comparison of Region Interpolation Methods

This work presents a methodology for the comparison and evaluation of region interpolation methods proposed in the spatiotemporal databases literature. An evaluation performed using the methodology proposed is also presented. To the best of our knowledge this is the first time that such a methodology is discussed and presented in the spatiotemporal databases literature, and that region interpolation methods are evaluated and compared using the same dataset and using data from the evolution of real-world phenomena.


Introduction
Several methods have been proposed in the spatiotemporal databases literature to solve the region interpolation problem, i.e., to create moving regions from a set of observations. This allows the representation of the continuous evolution of a phenomenon in spatiotemporal databases, e.g., the evolution of a forest fire, an oil spill or biological cells. Region interpolation methods should be robust, correct and efficient, and should allow the development of operations for querying the evolution of deformable moving objects and the relationships that they establish with other objects. However, different methods can produce different results when used to represent the evolution of the same phenomenon. As a consequence, there is a need to compare and evaluate these methods. But how do we compare the output of two different methods? How do we measure the quality of the representation of the evolution of a phenomenon generated by a method? Which method is the best for a specific use case? Can an approximation error be defined between the representation and the actual evolution of a phenomenon? To the best of our knowledge, these questions have not been discussed in detail in the spatiotemporal databases literature.
The contributions of this work are the following: we present and discuss a methodology for the comparison and evaluation of region interpolation methods, and we present an evaluation performed using this methodology.
This paper is organized as follows. Section 2 presents a brief discussion on the state of the art of metrics and measures of similarity and a discussion about related work. Section 3 discusses and presents a methodology for the comparison and evaluation of region interpolation methods. Section 4 presents an evaluation performed using the methodology proposed, and Section 5 presents the conclusions and future work.
According to the authors, it can be used with polygons with a different number of vertices (although it is not clear how it can be used to compute the similarity between, e.g., a multi-polygon and a polygon with holes), it is not topology independent, and it seems to penalize missing estimated areas more than different generalizations of the boundary. The dissimilarity value can be underestimated if one of the polygons being compared has a much larger number of vertices. The authors also present a comparison between the Hausdorff, the Chamfer and the PoLiS distances when used to compute the similarity between building footprints. According to them, the Hausdorff distance can be thought of as a measure of highest dissimilarity and the normalized Chamfer distance as a measure of the overall average dissimilarity between two sets of points. Both distances are topology independent and sensitive to collinear points.

Methodologies to Compare Different Interpolations
In this work we are interested in methodologies to compare and evaluate the quality of two interpolations, in particular, interpolations of regions in 2D space.
In [3] the authors compare four different spatial interpolation methods for the generation of continuous surfaces from irregularly distributed data. The authors consider two different viewpoints, prediction and characterization, to evaluate the quality of the interpolation. Prediction considers that the best interpolation minimizes the prediction error at unknown points. Because the true value at an unsampled point is not known, the authors use a procedure in which the interpolation is generated ignoring known data points. Then the values of those data points are compared with the values generated by the interpolation. Characterization considers that the surface generated by the interpolation must globally look like the actual surface, which is not known. To overcome this situation, statistical characteristics obtained from the set of known data points are associated with the surface. Finally, indices are defined and used to measure the quality of the interpolation.
However, the methodology presented in [3] cannot be used as is in our context since (1) region interpolation methods do not interpolate points, they interpolate 2D geometries (regions), and (2) we want to evaluate the quality of the interpolation, not the quality of the surface generated as in the work cited.
Morphing techniques, used to interpolate a geometry between two known observations or states, are, in general, evaluated and compared qualitatively by observing their results. Region interpolation methods proposed in the spatiotemporal databases literature are, in general, evaluated w.r.t efficiency, robustness and genericity, but not w.r.t the quality of the interpolation they generate. To the best of our knowledge, there is no well-established methodology to evaluate and compare the quality of two interpolations.

Methodology for the Evaluation and Comparison of Region Interpolation Methods for Spatiotemporal Databases
Real-world phenomena can go through different changes, e.g., (1) they can merge and split (e.g., cell division), (2) change to a different state (e.g., ice melting), leading to phenomena of appearing and disappearing, and (3) develop holes and concavities. These changes can occur in a more or less dynamic way following more or less well-defined patterns, e.g., the shape of fluids and of a forest fire can change dramatically in a short period of time, but the shape of an iceberg can change rather slowly. Each phenomenon can evolve differently depending on the circumstances with which it interacts. This makes the evaluation of the representation of their evolution complex. Furthermore, the level of complexity of a geometry representing the continuous evolution of a real-world phenomenon can change significantly during interpolation, e.g., it may start as a complex multi-polygon and change to a simple polygon during interpolation.
Evaluation can be qualitative and quantitative. Qualitative evaluation can be a powerful tool, e.g., the evaluation of an interpolation through visualization. However, certain characteristics are hard to evaluate visually. For example, when observing the evolution of an iceberg visually it may not be possible to detect area oscillations that, depending on its size and scale, can be more or less significant. A quantitative measurement of the evolution of the area during interpolation can measure these phenomena precisely. In general, a quantitative evaluation is less subjective, gives additional information about the geometries generated during interpolation, and supports decision making, in particular decision making without the need for user intervention.
Assumption 1. A methodology for evaluation and comparison of region interpolation methods should consider these changes and complexity and include both a qualitative and a quantitative evaluation and comparison.
Assumption 2. The evolution of the characteristics of a phenomenon can be measured and quantified by studying (analyzing) the characteristics of the geometries generated during interpolation. These measurements can then be used for evaluating the quality of the interpolation w.r.t some known or expected values. For example, it seems to make sense to compare the similarity between the geometries generated during interpolation and a reference geometry or set of reference geometries.
Assumption 3. The study of the geometric similarity of the geometries generated during interpolation can give an idea about the characteristics (quality) of the interpolation method, e.g., do the in-between geometries change smoothly during interpolation or are there abrupt changes?

Methodology
Following the previous assumptions, we construct a methodology by answering the following questions: What to evaluate? What are the characteristics of interest that should be evaluated? Which metrics should be used? What is the meaning, the relevance, and the significance of the results obtained when using a specific metric or set of metrics? Our goal is to develop a methodology that is generic, i.e., not dataset, use case or application dependent, and that (1) is efficient, to allow the processing of a large number of geometries with arbitrary complexity that can be very different from each other, both topologically and geometrically, (2) does not require the existence of a correspondence between the geometries being compared, and (3) uses metrics, measures and methodologies used successfully in similar works.
To construct this methodology, we proceed as follows. We follow a strategy similar to [3], and we consider three perspectives for evaluation: the representation, the prediction and the characterization perspectives. The representation perspective evaluates the ability of a method to represent certain characteristics of interest that can occur during the evolution of real-world phenomena (this perspective is not presented in [3]). Table 1 presents the characteristics considered. The prediction perspective considers that a good interpolation minimizes the error at an unknown instant (not at a point as in [3]). The characterization perspective considers that the interpolation should globally represent an evolution similar to the actual evolution of the phenomenon (here we consider the interpolation, not a surface as in [3]).

Split, Merge
Ability to represent geometries that split or merge during interpolation. We define split as an evolution where n objects, faces or holes, form progressively from m objects, n > m. An object can only form from an object of the same type, and if k objects form from an object then k > 1. We define merge as an evolution where n objects form progressively from m objects, n < m. An object can only form from an object of the same type, and if k objects form from j objects then k < j, k ≥ 1.

Concavities, Holes
1. Ability to represent the evolution of holes and concavities during interpolation, i.e., concavities and holes that exist in both the source and target observations.
2. Ability to represent the transformation of a concavity into n holes, n ≥ 1, and a hole into m concavities, m ≥ 1.

Translation, Rotation, Deformation
Ability to represent translation, rotation, and deformation, in particular nonuniform deformation, during interpolation.

Appearing, Disappearing
Ability to represent the appearance and disappearance of faces (regions), holes, and concavities during interpolation. We consider appearing and disappearing as an evolution that occurs progressively not instantaneously. Split and merge, and appearing and disappearing are considered different phenomena, i.e., objects do not appear (disappear) from (to) an existing object.
Therefore, we propose a qualitative evaluation through visualization using a score defined in the set {0, 1, 2}, where 0 means that the method cannot represent the characteristic, 1 means that the method can represent the characteristic partially, and 2 means that the method can represent the characteristic. It is important to notice that in this perspective the quality of the representation is not evaluated.
Prediction. The geometry at an unknown (unsampled) instant is not known. We can, however, given a set of n observations of the evolution of a phenomenon, generate the interpolation between observations oi-1 and oi+1 and compare the result generated by the interpolation at instant t with the known observation oi taken at that instant. Then, the prediction performance can be computed using, for example, the mean absolute error (MAE). Prediction performance can be evaluated w.r.t the area, the perimeter, the position, the geometric similarity, the number of faces, the number of concavities and the number of holes of the geometry, and its rotation and orientation (direction).
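The hold-out procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the evaluation: `linear_vertex_interpolation` is a hypothetical stand-in for a region interpolation method (it assumes both polygons have the same number of vertices, listed in corresponding order), observations are assumed to be equally spaced in time so that the held-out instant is the midpoint, and the area is used as the measured characteristic.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]  # simple polygon, vertices in order, not closed


def shoelace_area(poly: Polygon) -> float:
    """Unsigned area of a simple polygon (shoelace formula)."""
    n = len(poly)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def linear_vertex_interpolation(src: Polygon, dst: Polygon, t: float) -> Polygon:
    """Hypothetical placeholder interpolator (assumption, not a method from the paper).

    Moves each source vertex linearly to the target vertex with the same index.
    """
    return [((1 - t) * xs + t * xd, (1 - t) * ys + t * yd)
            for (xs, ys), (xd, yd) in zip(src, dst)]


def prediction_mae(observations: List[Polygon],
                   interpolate: Callable[[Polygon, Polygon, float], Polygon],
                   measure: Callable[[Polygon], float]) -> float:
    """MAE of a characteristic over all held-out intermediate observations.

    For each i, the interpolation is built between o(i-1) and o(i+1) and
    evaluated at the midpoint (assumes equally spaced observations); the
    result is compared with the known observation o(i).
    """
    if len(observations) < 3:
        raise ValueError("at least three observations are required")
    errors = []
    for i in range(1, len(observations) - 1):
        predicted = interpolate(observations[i - 1], observations[i + 1], 0.5)
        errors.append(abs(measure(predicted) - measure(observations[i])))
    return sum(errors) / len(errors)
```

Any other characteristic (perimeter, number of holes, a distance to the held-out geometry) can be plugged in by replacing the `measure` callable.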
Characterization. We can assume that the actual evolution of a phenomenon is, in general, not known. However, we need some information to be used as a reference for comparison and evaluation. For example, Fig. 1 shows the evolution of a phenomenon generated by three region interpolation methods. Which one generates the best representation? This question can only be answered if we know or make assumptions about the actual evolution of the phenomenon. For example, (1) the evolution of the area or the perimeter during interpolation follows a specific pattern or function, (2) the geometry changes slowly or abruptly, (3) the geometry rotates a certain amount in a specific direction, (4) a certain characteristic or set of characteristics evolve in a certain way, e.g., a hole appears (disappears) during interpolation or a geometry splits into two geometries.
In this perspective we consider the evolution of the area, the perimeter, the position, rotation and direction, the number of faces and holes, and the evolution of the geometric similarity w.r.t a reference geometry or set of geometries during interpolation. This perspective includes a qualitative evaluation that is performed visually to help evaluate the quality of the representation.  The previous discussion should answer the first two questions, i.e., 'What to evaluate?', and 'What are the characteristics of interest that should be evaluated?'.

Which Metrics Should be Used?
We consider only rigid geometries in 2D space. We want to compute a full similarity between them, and we need to compute the similarity between two, potentially, very different shapes, e.g., between a polygon and a multi-polygon. In this work we do not consider the topology or a representation of the geometry to compute similarity. In the first case, a region interpolation method can impose the restriction that the topology of the geometry does not change during interpolation. Therefore, topological changes may not be appropriate to compare region interpolation methods. In the second case, we do not know how to construct an appropriate representation of a multi-polygon with an arbitrary complexity, and even if a representation could be constructed, e.g., by considering its polygons individually, we still do not know how to compare two such representations.
In the following, we define the properties of interest that a metric or measure should ideally have to be used in our scenario. Given three shapes, a, b and c, and the distance between two shapes, d(a, b), the metrics or measures to be used should have the following properties:
• Nonnegativity, d(a, b) ≥ 0. Identity, d(a, a) = 0. Symmetry, d(a, b) = d(b, a). Triangle inequality, d(a, b) + d(b, c) ≥ d(a, c).
• Invariant to translation, rotation, and scale. Because the shape of the geometries being compared is expected to have non-uniform affine transformations, the property of being invariant under affine transformations is not considered critical in our scenario 2.
• Can handle shapes with an arbitrary geometry, with a different number of vertices, e.g., to compute the similarity between a polygon and a multi-polygon.
• Can compute the similarity between a large number of shapes in a reasonable amount of time.
• Use geometric characteristics (information), e.g., about the boundary, to compare two geometries.
• Are generic, i.e., are not specific to a dataset or application and can compute the similarity without the need of any prior knowledge of other shapes or the existence of a correspondence between the geometries being compared.
Therefore, we select the Jaccard distance, the Hausdorff distance 3, and the Chamfer distance to compute the geometric similarity between two geometries. This is mostly motivated by the fact that these metrics can compute the similarity between geometries with an arbitrary complexity and a correspondence between them is not required. They compute the similarity from different perspectives. The Jaccard distance measures the dissimilarity w.r.t the area, the Hausdorff distance measures the highest dissimilarity, and the Chamfer distance measures the overall average dissimilarity.
2 It is important to notice that if the geometries are aligned as a preprocessing step to measure their similarity, the alignment considered can influence the measurement. Computing the best alignment between two polygons can be hard and time-consuming.
3 The Hausdorff distance can be reformulated to satisfy all properties of a metric, i.e., dH(a, b) = max(d(a, b), d(b, a)), where d(a, b) is the usual Hausdorff distance between a and b.
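To make the three perspectives concrete, the following sketch computes the selected distances on simplified inputs. It is an illustration under stated assumptions, not the code used in this work: the Hausdorff and Chamfer distances are computed on finite sets of points (e.g., points sampled on the boundaries), which only approximates the distances between continuous geometries; the Chamfer normalization shown (mean of the two average nearest-neighbour distances) is one common choice and is assumed here; and the Jaccard distance is computed on sets of grid cells covered by each geometry, as a stand-in for the area of the intersection and of the union.

```python
import math
from typing import Hashable, List, Set, Tuple

Point = Tuple[float, float]


def _nearest(p: Point, pts: List[Point]) -> float:
    """Distance from p to its nearest neighbour in pts."""
    return min(math.dist(p, q) for q in pts)


def directed_hausdorff(a: List[Point], b: List[Point]) -> float:
    """Largest nearest-neighbour distance from a point of a to the set b."""
    return max(_nearest(p, b) for p in a)


def hausdorff(a: List[Point], b: List[Point]) -> float:
    """Symmetric form: maximum of the two directed distances."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def chamfer(a: List[Point], b: List[Point]) -> float:
    """Average nearest-neighbour distance, symmetrized (assumed normalization)."""
    a_to_b = sum(_nearest(p, b) for p in a) / len(a)
    b_to_a = sum(_nearest(q, a) for q in b) / len(b)
    return (a_to_b + b_to_a) / 2.0


def jaccard_distance(cells_a: Set[Hashable], cells_b: Set[Hashable]) -> float:
    """1 - |A ∩ B| / |A ∪ B| on sets of covered grid cells (area proxy)."""
    union = cells_a | cells_b
    if not union:
        return 0.0  # two empty geometries are treated as identical
    return 1.0 - len(cells_a & cells_b) / len(union)
```

None of these functions needs a correspondence between the two inputs or the same number of points, which is the property that motivates the selection above. The nearest-neighbour search is quadratic in the number of points; a spatial index would be needed for large inputs.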

Datasets Used for Evaluation
Ideally, diverse datasets of real data should be used for evaluation, in particular: datasets with specific cases where a particular characteristic occurs during interpolation should be used for representation evaluation, datasets with a ground truth should be used for prediction evaluation, and different datasets should be used for characterization evaluation.

Experimental Results
In this section we compare and evaluate three methods, two region interpolation methods proposed in the spatiotemporal databases literature and a method that is being developed to solve the region interpolation problem as part of a PhD thesis, i.e., Librip [5], PySpatiotemporalGeom [9] and Morphrip, respectively. Currently, Morphrip can only handle simple polygons. The evaluation is performed using a dataset extracted from a sequence of satellite images tracking the evolution of two icebergs in the Antarctic [10], following the methodology discussed and presented in this work.
It is important to note that (1) this is a preliminary evaluation, (2) the main objective is to apply the methodology presented in this work in practice, and (3) the methods being evaluated provide different options that can have a significant impact on the result of the interpolation. For example, PySpatiotemporalGeom allows the user to provide an explicit matching between the components of the source and target geometries (observations), i.e., that component si in the source geometry corresponds to component tj in the target geometry (we use this option). Librip provides several matching strategies, including strategies defined by the user (we use the default option provided by the library).
In this study, we use the implementations provided by the authors. In some cases, the implementation provided (made available) is not a full implementation and/or has not been fully tested. As a consequence, we do not evaluate certain characteristics.
Finally, this section also serves the purpose of trying to answer the fourth question: What is the meaning, the relevance, and the significance of the results obtained when using a specific metric or set of metrics?

Representation Perspective Evaluation
We use synthetic data to perform the evaluation proposed in the representation perspective (the results are presented in Table 2). This is because we do not have real data where the various characteristics considered for evaluation occur. In Table 2, E., A., D., and # mean evolving, appearing, disappearing, and not evaluated, respectively. We give a score of 1 (see Section Methodology) to Morphrip when evaluating rotation because, although it can represent rotations, the method may consider that a rotation does not exist. We consider that the other two methods can represent rotations in the range [0º, [45º, 90º[] 4 (clockwise and counterclockwise) but we do not establish a precise range for each method. When evaluating E., A., and D. concavities we give a score of 1 to Librip because a concavity can unexpectedly evolve as a hole during interpolation, and a score of 1 to Morphrip because it can only handle simple polygons. When evaluating A. and D. faces we give a score of 1 to Librip because the existing face(s) disappear and then the new face(s) appear, i.e., it is not a continuous phenomenon (see Fig. 2 (bottom)). When evaluating Merge Faces we give a score of 2 to PySpatiotemporalGeom, but notice that we are using the option that allows the user to define an explicit matching between the components of the observations (see Fig. 2 (top)). PySpatiotemporalGeom does not interpolate holes, and faces appear instantaneously (see Fig. 2 (middle)).
Table 2 shows that PySpatiotemporalGeom and Librip are able to represent a larger number of characteristics, and that some characteristics are not represented by any of the methods. This could be an indication that there are still open problems for future investigation in the field. We also notice that a score in the set {0, 1, 2} may be too coarse because two methods may represent a characteristic partially, but differently.
It may also make sense to perform independent evaluations considering different types of geometries and characteristics that can occur during their evolution, e.g., perform an evaluation using only simple polygons or only polygons.

Prediction Perspective Evaluation
For the evaluation presented here we use observations taken from the evolution of an iceberg represented by a simple polygon with no holes that changes relatively smoothly and slowly, with a small rotation between consecutive observations (see Fig. 3). The prediction perspective evaluation was performed by generating moving regions between observations oi and oi+2 and then comparing observation oi+1 with the geometry generated by the method at the middle of the interpolation between the two observations. Table 3 presents the results of the evaluation using the Mean Absolute Error (MAE). We did not remove outliers. In the following, PySpatiotemporalGeom is abbreviated to PySptGeom to save space in the tables. When comparing PySpatiotemporalGeom and Librip, the area and the perimeter are the characteristics with the most significant differences. It is important to observe that these are global results that can be influenced by outliers, e.g., in the case of Librip the values observed for the perimeter and the Hausdorff distance are mostly affected by an outlier (the value of the first geometry 'predicted' by the method). Also, Librip obtained a small error w.r.t the expected number of holes. This is because Librip considers that the geometry develops a hole during interpolation.
Morphrip obtained the smallest error for the position, the area and the geometric similarity, i.e., for the Hausdorff and the Jaccard distances.

Characterization Perspective Evaluation
The characterization perspective evaluation was performed by creating moving regions between pairs of consecutive observations, i.e., between observations oi and oi+1. Then, we collected 100 observations (generated during interpolation) from each moving region and studied the evolution of several characteristics of the geometry during interpolation (see Fig. 4 to help clarify this). The geometric similarity is measured after the two geometries being compared have been aligned by using the iterative closest point (ICP) algorithm presented in [4] w.r.t the source geometry (any geometry can be used here). It is important to note that the alignment can have an impact on the measurement.
Table 4 presents the MAE of the area, the perimeter (P), the Hausdorff distance (HD), and the Jaccard distance (JD) computed using the results generated during interpolation and a model (function) assumed to represent the actual evolution of these characteristics. We assume a linear function as a model for all the characteristics considered, i.e., we assume that these characteristics evolve linearly during interpolation. PySpatiotemporalGeom seems to preserve the perimeter more than Librip, while Librip seems to preserve the area more and obtains better results for the Hausdorff and the Jaccard distances. In some cases, the two methods generate very similar results during interpolation. This explains why the maximum MAE values are equal for some characteristics, i.e., they occur during the evolution of a moving region where the two methods generate similar results. On average, Morphrip obtains the smallest values for the area and the geometric similarity, i.e., the Hausdorff and the Jaccard distances.
On average, the area and the perimeter deviations are relatively small and a deviation w.r.t the minimum known values was not observed. The maximum area deviation occurs between observations 41 and 67, and 47 and 56 for Librip and PySpatiotemporalGeom, respectively.
The Hausdorff maximum deviation occurs between observations 45 and 100 (the last observation). Its average maximum deviation is 2.86% and 4.36%, respectively. The Jaccard maximum deviation occurs between observations 37 and 100. The average maximum deviation is 2.33% and 2.61% for Librip and PySpatiotemporalGeom, respectively. The maximum values of the Hausdorff and the Jaccard maximum deviations (HMD and JMD) are similar for both methods because they were observed during the evolution of a moving region where the two methods generate similar results. By observing the interpolation generated by Librip and PySpatiotemporalGeom and the graph of the functions that approximate the values collected during interpolation (see Fig. 6 for an example) we observe the following. The evolution of the perimeter is similar in both methods and approximately linear. In this case, the difference in the results obtained is mostly explained by the fact that in some cases Librip generates holes during interpolation (see Fig. 7 (top)). When a small rotation exists Librip tends to represent the evolution of the area in a slightly quadratic way. As the rotation increases this representation tends to be clearly quadratic. PySpatiotemporalGeom represents the evolution of the area in a quadratic way in both situations. In general, the representation is non-monotonic. Except for the cases where the two methods generate similar results and where Librip generates holes during interpolation, the geometric similarity (measured using the Hausdorff and the Jaccard distances) evolves more smoothly in the case of Librip.
In the case of PySpatiotemporalGeom the Hausdorff distance tends to increase until a maximum is reached, usually approximately at the middle of the interpolation, and then starts to decrease (see Fig. 6). Morphrip obtained the smallest deviation and the geometric similarity evolves more smoothly during interpolation. Fig. 6 presents an approximation of the evolution of the Hausdorff distance during the interpolation between two observations when using the three methods.
Fig. 6. Evolution of the Hausdorff distance between a pair of consecutive observations when using Librip (black), PySpatiotemporalGeom (magenta), and Morphrip (red).
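The comparison against a linear model used in this section can be sketched as follows. This is a minimal illustration under assumptions that are not stated in the text: the linear model is taken to be the straight line joining the first and the last sampled values, the samples are equally spaced in time, and `geometry_at` is a hypothetical callable that returns the geometry generated by a method at a normalized instant t in [0, 1].

```python
from typing import Callable, List, TypeVar

G = TypeVar("G")  # geometry type returned by the (hypothetical) interpolator


def sample_measure(geometry_at: Callable[[float], G],
                   measure: Callable[[G], float],
                   count: int = 100) -> List[float]:
    """Evaluate a characteristic at `count` equally spaced instants in [0, 1]."""
    if count < 2:
        raise ValueError("at least two samples are required")
    return [measure(geometry_at(k / (count - 1))) for k in range(count)]


def linear_model_mae(samples: List[float]) -> float:
    """MAE between the samples and the line joining the first and last sample.

    Assumption: the reference (linear) model interpolates the values measured
    at the source and target instants.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("at least two samples are required")
    first, last = samples[0], samples[-1]
    total = 0.0
    for k, value in enumerate(samples):
        t = k / (n - 1)
        expected = (1 - t) * first + t * last
        total += abs(value - expected)
    return total / n
```

For example, a square whose side grows linearly from 1 to 2 has an area that evolves quadratically, so `linear_model_mae` reports a nonzero deviation for the area even though the side length itself matches the linear model exactly.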

Discussion
Morphrip seems to generate the most natural interpolation. Librip may generate holes that appear and disappear during interpolation even if the known observations used to create the moving region are simple polygons with no holes. PySpatiotemporalGeom seems to generate an interpolation where the area increases at the middle. The maximum area deviation observed was 10%. The results presented here should not be generalized and used to fully characterize the methods evaluated. It is important to note that the main goal of this work is to propose a methodology for evaluation and comparison of region interpolation methods. A more detailed comparison and evaluation of these methods requires appropriate datasets.
The area is a good metric for finding phenomena of shrinkage and enlargement during interpolation. The perimeter is difficult to interpret, e.g., the perimeter can increase in two different scenarios, one where the area of a geometry increases and the other where it decreases. However, it can give relevant information if the evolution of the perimeter is an important factor to consider. The Jaccard Index may give results that are easier to interpret than the Jaccard distance, but neither is a good measure of similarity. We need a scale to interpret the results of the Hausdorff distance and establish the degree of similarity between two geometries. It would be interesting to use more metrics to establish similarity.
Overall, we need a variety of datasets representing the evolution of phenomena with different characteristics, e.g., rigid, fluid, dynamic and biological, that have a large number of observations, provide a ground truth, and represent all the changes that can occur during the evolution of a phenomenon.
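The ambiguity of the perimeter mentioned above can be illustrated with a small worked example. The three polygons below are hypothetical shapes chosen for illustration, not data from the evaluation: starting from a 4 x 4 square, both an enlarged 5 x 5 square and the same 4 x 4 square with a 2 x 2 notch cut into one side have a larger perimeter, although the area increases in the first case and decreases in the second.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def area(poly: List[Point]) -> float:
    """Unsigned area of a simple polygon (shoelace formula)."""
    n = len(poly)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def perimeter(poly: List[Point]) -> float:
    """Sum of the edge lengths of a closed polygon."""
    n = len(poly)
    return sum(math.dist(poly[i], poly[(i + 1) % n]) for i in range(n))


# Hypothetical example shapes (illustration only).
square = [(0, 0), (4, 0), (4, 4), (0, 4)]
enlarged = [(0, 0), (5, 0), (5, 5), (0, 5)]
notched = [(0, 0), (4, 0), (4, 4), (3, 4), (3, 2), (1, 2), (1, 4), (0, 4)]

if __name__ == "__main__":
    for name, shape in [("square", square), ("enlarged", enlarged), ("notched", notched)]:
        print(name, "area =", area(shape), "perimeter =", perimeter(shape))
```

Both the enlarged and the notched shapes have a perimeter of 20 against 16 for the original square, while their areas are 25 and 12 against 16, so the perimeter alone does not distinguish growth from the development of a concavity.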

Conclusions and Future Work
We presented and discussed a methodology to compare and evaluate region interpolation methods developed in the context of spatiotemporal databases. We performed an evaluation using real data to show the methodology being used in practice. To the best of our knowledge, this is the first time that such a methodology is discussed in detail in the spatiotemporal databases literature and that region interpolation methods are compared and evaluated using real data. Future work includes (i) finding robust methods proposed in the literature to align two geometries before measuring their similarity, and to find the rotation and the orientation of a geometry, and (ii) performing a more detailed evaluation using more datasets. This is a first step that can make contributions to the establishment of a benchmark for the evaluation of region interpolation methods.
The results were obtained and collected using a Java application running on Windows 10. This application uses the region interpolation methods through a middle layer that can communicate with C++, Python and Secondo. The application and the data used in this paper can be found at: https://hfduarte@bitbucket.org/hfduarte/evaluation-of-region-interpolationmethods.git.
A video showing the application is available at: https://drive.google.com/open?id=1FzhOUTcEXhxHoEOfDcBfea_w0LFWs6E6.