Assessing shop completeness in OpenStreetMap for two federal states in Germany

The completeness of the number of OpenStreetMap (OSM) retail stores was estimated for two federal states of Germany at district level. An intrinsic measurement was applied that fits saturation models to the cumulative curve of the number of OSM retail stores over time. Even though the mean completeness of retail stores was estimated to be high in both states, the values within the states varied between 42 % and 100 %. The question therefore arises in which areas retail stores are well represented in OSM and whether economically weaker regions are possibly also digitally disadvantaged on the map. We investigated the influence of the urban-rural gradient as well as the influence of socioeconomic factors (gross domestic product, unemployment rate, proportion of academics) on the estimated completeness by means of a generalized linear model. Our results indicate that, on average, big cities with a low unemployment rate are better mapped with respect to retail stores.


Introduction
The ubiquity of smartphones has led to a continuous availability of geodata. In day-to-day life, especially at less familiar locations, shops and restaurants are some of the most frequently searched points of interest (POI). Having an up-to-date and complete collection of these POIs is of great interest to potential customers. Shop owners equally have a great interest in being well represented in these POI collections for advertisement and visibility reasons. While big players such as Apple, Google and Microsoft dominate the market, the Volunteered Geographic Information project OpenStreetMap (OSM) provides an established non-commercial, crowd-sourced alternative. OSM contains an enormous amount of diverse geodata that is continuously edited by more than 7.5 million volunteers (as of March 2021, OpenStreetMap contributors (2021)). The open nature of the OSM project additionally provides the potential to serve niches with specialized services, such as for shops without packaging (https://cartovrac.fr), cigarette vending machines (http://www.ubahnverleih.de/osm/zigaretten) or farm shops (https://farmshops.eu). According to https://taginfo.openstreetmap.org/keys/shop, 4.5 million objects with the key shop=* existed in OSM at the end of March 2021. Especially in Germany, where the active OSM community has created an extremely rich data set that is well known and widely used, OSM would be suitable as a flexible platform for online shop-searching and navigation apps. The advantage of OSM over commercial providers of geodata is the free availability under the Open Data Commons Open Database License, which allows unrestricted use in many commercial or non-commercial applications. A persistent question for the usability of OSM is the quality of its database. OSM roads are often mapped first and are now considered almost complete in many regions (Zielstra and Zipf, 2010).
Due to the high growth rates of OSM in recent years, a high completeness can also be assumed for other objects such as buildings or stores.
Even though extensive research has been carried out on OSM data quality, to the best of our knowledge no study exists on the completeness of OSM retail stores. Such an investigation is necessary not only to clarify for which areas OSM data is suitable for shop-finding platforms, but also to clarify the usability of OSM for researchers analyzing spatio-temporal patterns of the stationary retail sector. In addition, knowledge of the factors influencing shop completeness can be used to predict which parts of the physical world are digitally mapped or digitally lacking, and to counteract a digital disadvantage of retailers in particular regions that may already be economically weaker. Moreover, the socio-economic system of retailing is an interesting field of study in OSM, because it can be assumed that, in contrast to roads, local knowledge is necessary to locate, tag and add specific information (e.g. opening hours) to these locations.
For these reasons, this study examined the completeness of retail stores, a main quality criterion for online search platforms, using the case study of two economically different states in Germany, Baden-Württemberg and Saxony. We further investigated how the urban-rural gradient and socio-economic factors (gross domestic product, unemployment rate and proportion of academics) were associated with the completeness of OSM retail stores.

Methods and data
OSM quality analyses can be categorised into extrinsic or intrinsic approaches. Extrinsic approaches compare OSM with an external data set of presumably higher quality (see for example Zielstra and Zipf (2010) and Neis et al. (2012)). A major drawback of extrinsic approaches is the necessity for a compatible external data set, which may not always be available. For shops, for example, official statistics, if available at all, may only refer to a certain level of administration and a specific definition of 'retail' that cannot be directly transferred to the definition used in OSM. Therefore, we assess the fitness for purpose of an intrinsic completeness estimation using only OSM data itself (see Ballatore and Zipf (2015); Degrossi et al. (2017); Barron et al. (2014); Barrington-Leigh and Millard-Ball (2017) for some examples of intrinsic OSM data quality assessment).
The underlying idea of the intrinsic completeness analysis is that the number of OSM objects of a specific feature class added per time period decreases as the number of mapped objects converges towards the (unknown) true number of objects. The cumulative number of OSM objects would then saturate. Given sufficient mapping activity, it is possible to estimate the saturation level using a suitable function in the context of a non-linear regression approach. Baden-Württemberg and Saxony were particularly suitable for this intrinsic investigation, as no bulk data imports of retail stores into the OSM database have been recorded so far.
Table 1. Regional data and economic information of Baden-Württemberg and Saxony: number of administrative districts, GDP in 1000 euros per employed person (for 2016), unemployment rate as percentage of unemployed in the civilian labor force (for 2017), proportion of academics as employees subject to social security contributions at the place of residence with an academic qualification per 100 inhabitants of working age, total area and population density (2019).
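The saturation idea described above can be illustrated with a small sketch. The function names, the window-based slow-down heuristic and the numbers are hypothetical and only serve as an illustration; they are not taken from the study's source code.

```python
from itertools import accumulate
from statistics import mean


def cumulative_counts(additions):
    """Running total of mapped objects from per-period net additions."""
    return list(accumulate(additions))


def growth_is_slowing(additions, window):
    """Hypothetical heuristic: True if the mean number of additions in the
    last `window` periods is below the mean of the `window` periods before."""
    if len(additions) < 2 * window:
        raise ValueError("need at least two full windows of history")
    recent = mean(additions[-window:])
    earlier = mean(additions[-2 * window:-window])
    return recent < earlier


# Made-up per-period additions for one district (not real OSM data).
example_additions = [10, 20, 40, 40, 20, 10, 5, 2]
example_cumulative = cumulative_counts(example_additions)
example_slowing = growth_is_slowing(example_additions, window=4)
```

A history for which such a check fails gives no basis for estimating a saturation level, whatever curve is fitted afterwards.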

Experimental setup
Retail stores are defined as places where goods or services are sold to the final consumer (Bankim and Vaja, 2015). Our analysis was restricted to stationary retail stores that were tagged as "shop" or "amenity" and that were listed in the OSM Wiki (OSM Wiki (26.02.2021)). We included all key-value combinations that could not be clearly excluded from retail trade (for a detailed list of the used tags see the linked source code in section 2.2). These included combinations of retail trade and direct marketing, such as farm shops, or services, such as car repair shops.
The research area is characterized by contrasts, both economically and in the degree of urbanisation. Baden-Württemberg, located in South Germany, has been one of the economically strongest federal states of Germany. Saxony in East Germany has been economically weaker. Altogether, both states contain 57 administrative districts of rural as well as urban character (table 1).
We fitted various limited growth curves to the OSM history for each administrative district and estimated the completeness level via their saturation parameter. The curves used originate from two families. On the one hand, the family of sigmoid curves seems adequate for a three-phase mapping process, as also described by Barrington-Leigh and Millard-Ball (2017). On the other hand, curves of the non-logistic growth curve family tend to represent a mapping process without the initial phase of slow growth. In this analysis, we used the following functions: the three- and four-parameter logistic functions (equations 1 and 2), which belong to the sigmoid curve family, as well as the rectangular hyperbola (equation 3) and the asymptotic function (equation 4) of the non-logistic growth curve family, where:
Asmp = a numeric parameter representing the saturation to which the curve converges
Asmp_low = lower asymptote
t = time
t_mid = mid point of the logistic curve
scale = the steepness of the logistic curve
t_1/2 = time at which half the saturation level (50 % saturation) is attained
y_0 = parameter that specifies the value of y (here the count of OSM contributions) at the beginning of the period
rc = 'rate constant', parameter that determines the spread of the curve with time
lrc = log of the 'rate constant'
The reliability of estimating the number of retail stores in a district depends on the development of OSM contributions in several ways. First and foremost, the data history was checked for any decline in growth, which is a fundamental criterion for estimating a saturation level as a proxy for the number of retail stores.
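The four functional forms are not reproduced in this text. As a sketch, the standard parameterisations that match the symbol definitions above would read as follows; the exact form used in the study is an assumption here, as is the reading of t as time and of rc as the exponential of lrc.

```latex
\begin{align}
y(t) &= \frac{Asmp}{1 + \exp\left(\dfrac{t_{mid} - t}{scale}\right)} \tag{1} \\
y(t) &= Asmp_{low} + \frac{Asmp - Asmp_{low}}{1 + \exp\left(\dfrac{t_{mid} - t}{scale}\right)} \tag{2} \\
y(t) &= \frac{Asmp \cdot t}{t_{1/2} + t} \tag{3} \\
y(t) &= Asmp + (y_0 - Asmp)\,\exp(-rc \cdot t), \qquad rc = \exp(lrc) \tag{4}
\end{align}
```

In all four forms y(t) approaches Asmp for large t, which is why this parameter can serve as the estimate of the saturation level.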
Fitted models were filtered for unrealistic fits where the asymptote was estimated to be lower than the current number of OSM retail stores. To account also for the uncertainty of the models, we accepted fits whose asymptote was at most 2% lower than the actual latest amount. We chose the best fitting functional form of all accepted curves for each administrative district based on two criteria: i) the relative residual standard error and ii) the relative deviation of the slope between the historic development and the fitted curve during the last two years of the analysis period. If both criteria contradicted each other, the selection was made based on visual assessment. The completeness level was estimated as the quotient of the current number of retail stores and the asymptote of the estimated saturation curve.
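The filtering and selection rules above can be sketched as follows. This is a minimal illustration under stated assumptions: the `Fit` record, the function names and the example numbers are hypothetical, and the fit statistics are taken as given rather than computed from a real regression.

```python
from dataclasses import dataclass


@dataclass
class Fit:
    name: str             # functional form, e.g. "logistic3" (hypothetical label)
    asymptote: float      # estimated saturation level
    rel_rse: float        # relative residual standard error
    rel_slope_dev: float  # relative slope deviation over the last two years


def accepted_fits(fits, current_count, tolerance=0.02):
    """Keep fits whose asymptote is at most `tolerance` below the current count."""
    floor = (1.0 - tolerance) * current_count
    return [f for f in fits if f.asymptote >= floor]


def select_fit(fits, current_count):
    """Return (fit, status). The fit is None if nothing was accepted or if the
    two criteria disagree, in which case visual assessment is needed."""
    candidates = accepted_fits(fits, current_count)
    if not candidates:
        return None, "no accepted fit"
    best_rse = min(candidates, key=lambda f: f.rel_rse)
    best_slope = min(candidates, key=lambda f: abs(f.rel_slope_dev))
    if best_rse is best_slope:
        return best_rse, "selected"
    return None, "criteria disagree: visual assessment"


def completeness(current_count, asymptote):
    """Quotient of current count and asymptote. It can slightly exceed 1 for
    accepted fits whose asymptote lies just below the current count."""
    return current_count / asymptote


# Made-up fits for one district with 1000 mapped stores (not real results).
example_fits = [
    Fit("logistic3", 950.0, 0.05, 0.10),   # rejected: more than 2 % below 1000
    Fit("hyperbola", 1200.0, 0.03, 0.02),
    Fit("asymptotic", 1100.0, 0.04, 0.05),
]
chosen, status = select_fit(example_fits, current_count=1000)
```

Returning a status string instead of silently picking one curve keeps the cases that the text resolves by visual assessment explicit.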
Finally, we investigated the influence of factors on the completeness level based on a generalized linear model (GLM) with a negative binomial distribution and a log link. We used the estimated asymptote (the estimated number of retail stores) as the response variable and the logarithm of the observed number of shops as an offset. This standard procedure allowed us to model the relation between estimated and observed counts without mixing up distributional assumptions. Since completeness was inversely proportional to the asymptote, negative coefficients of the regression indicated a higher completeness level, whereas positive coefficients meant a lower completeness level.
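The sign convention can be made explicit with a short derivation. The notation is introduced here for illustration only: A_i denotes the estimated asymptote of district i, n_i the observed number of shops, x_i the predictor vector, β the coefficient vector and c_i the completeness.

```latex
\begin{align}
\log \mathbb{E}[A_i] &= \log n_i + \mathbf{x}_i^{\top}\boldsymbol{\beta} \\
\frac{\mathbb{E}[A_i]}{n_i} &= \exp\left(\mathbf{x}_i^{\top}\boldsymbol{\beta}\right) \\
c_i = \frac{n_i}{A_i} &\approx \exp\left(-\mathbf{x}_i^{\top}\boldsymbol{\beta}\right)
\end{align}
```

With the offset fixing the coefficient of log n_i at one, the linear predictor describes the ratio of asymptote to observed count, so a negative coefficient lowers this ratio and raises the completeness.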
We examined the influence of the urban-rural gradient by the district type of the administrative units defined by Bundesinstitut für Bau-, Stadt- und Raumforschung (2019). Type 1 are independent cities with at least 100,000 inhabitants. Type 2 are urban districts with a medium population density of at least 150 inhabitants/km². Type 3 are rural districts with a low population density of less than 150 inhabitants/km². In addition, we tested the effects of three socioeconomic factors as predictors: the gross domestic product (GDP) in 1000 euros per person in employment (for 2016), the unemployment rate as the percentage of unemployed in the civilian labor force (for 2017) and the proportion of academics as employees subject to social security contributions at the place of residence with an academic qualification per 100 inhabitants of working age (for 2017). We hypothesized that the completeness level would increase along the rural-urban gradient and with higher GDP, lower unemployment rate and higher share of academic employees. These hypotheses are based on Neis et al. (2013), who found a positive link between urban areas and OSM activity as well as between GDP and OSM activity. Further statistical information (population density, area) was queried via the regional database of Germany from the statistical offices of the federation and the federal states (Statistische Ämter des Bundes und der Länder, Regionaldatenbank Deutschland https://www.regionalstatistik.de/genesis/online/, data licence Germany - attribution - Version 2.0 www.govdata.de/dl-de/by-2-0).
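The three district types and their use as a categorical predictor can be sketched as below. The function names and example values are hypothetical, the thresholds are those stated above, and the treatment of independent cities below 100,000 inhabitants (falling through to the density rule) is an assumption, not part of the cited definition. Rural districts serve as the reference level, as in the reported GLM.

```python
def district_type(is_independent_city, population, density_per_km2):
    """Simplified district type: 1 = independent city, 2 = urban, 3 = rural."""
    if is_independent_city and population >= 100_000:
        return 1
    # Assumption: everything else is split by population density alone.
    if density_per_km2 >= 150:
        return 2
    return 3


def type_dummies(dtype):
    """Dummy coding with rural districts (type 3) as the reference level."""
    return {
        "independent_city": int(dtype == 1),
        "urban_district": int(dtype == 2),
    }


# Made-up example values (not taken from the regional database).
example_city = district_type(True, 550_000, 1700.0)
example_urban = district_type(False, 200_000, 200.0)
example_rural = district_type(False, 200_000, 100.0)
```

Under this coding both indicators are zero for a rural district, so the model intercept describes the rural reference group.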
All source code, preprocessed data and results can be found at https://github.com/GIScience/shop-completeness.

Results
Well-fitting saturation models were generated for 44 of the 57 districts in both federal states. For five districts the data history showed a steady high increase in the number of OSM retail stores and did not indicate any slowdown in growth rate, while eight regions produced low-quality saturation models (figure 1c) due to complex temporal patterns. These issues were caused by non-continuous mapping activities and occurred independently of influencing factors such as population density; the respective regions had to be excluded from the GLM. The non-linear fit of a sigmoid curve produced the best results for 28 districts (figure 1a), while a non-logistic curve showed the best fit for 16 districts (figure 1b). Ten of the thirteen districts for which no saturation level could be estimated were visually categorized as relatively far from saturation (table 2).
The mean completeness level of Baden-Württemberg was approximately 88 %, slightly higher than the mean value of Saxony of about 82 %. The completeness ranged from 42 % to almost 100 %. Even though the results showed heterogeneity in the completeness, the majority of 38 districts achieved at least 80 %.
Completeness was significantly higher in the independent cities than in the urban and rural districts (table 3).
The completeness level of the data decreased significantly with a higher unemployment rate. In total, the GLM explained 18 % of the deviance in the data.
Districts for which no suitable saturation model could be estimated were represented in all district types. However, districts that were clearly not saturated were mostly of type rural or urban district and had a GDP, a proportion of academics and an unemployment rate slightly below the average of each federal state. Fitting a binomial GLM with a logit link (a logistic regression) did not reveal any significant relationship between the four predictors and successful fitting of a saturation curve. However, when compared visually for each district category, unemployment was higher on average for those districts where the saturation level could not be estimated reliably.
Table 3. Coefficient estimates, standard errors and p-values of the GLM for the 44 districts with a reasonable fit of the asymptote. Coefficients and standard errors are provided at the link scale. The response was the asymptote: for a given observed count (included as an offset in the model), completeness goes up if the asymptote is lower. Negative coefficients therefore indicate a positive effect on completeness and vice versa. Rural districts were used as the reference level; the coefficient therefore represents the intercept. The θ parameter of the negative binomial distribution was estimated as 54.5 with a standard error of 12.3.

Discussion
In comparison to previous studies, e.g. on the completeness of OSM buildings in Germany (Törnros et al., 2015), the estimated completeness of retail stores in OSM was relatively high. It is in general more problematic to estimate the saturation level for incomplete districts than for complete districts. With this in mind, the mean completeness values tended to be overestimated, since districts with lower saturation were not considered. Saturation may also occur due to a decrease in mapping activity, resulting in a false intrinsic estimate of completeness. In our analysis, a sufficient number of active users was present in all districts, which provides reasonable support for the assumption that saturation did not occur due to a lack of user activity. Events such as bulk data imports or mapping parties affect the form of the data history and require fitting functions of corresponding forms. In our analysis, the data history of only 8 of the 57 districts showed one of these deviating forms, for which no suitable fit function was found. However, it might be suitable to include additional function types, such as multiple sigmoid forms as well as step forms, in other regions and for other OSM feature classes whose data history reflects such events.
The higher data completeness found for districts with a low unemployment rate was consistent with our hypothesis. The higher completeness level of independent cities, the district type with the highest population density, was similar to that reported by studies on the completeness of other OSM feature classes (Zielstra and Zipf, 2010; Mashhadi et al., 2015; Wang et al., 2020). However, the completeness for the category urban districts could not be distinguished from the completeness of rural districts. The hypothesis that completeness increases along the rural-urban gradient was therefore not supported by our data.
The relatively small sample size makes the results sensitive to outliers. Two rural districts could be identified as influential by means of the usual regression diagnostics: "Nordsachsen", with the lowest of all estimated completeness levels (42 %), a high leverage and a high Cook's distance, and "Görlitz", with a high leverage. If both districts were omitted simultaneously, the regression coefficient estimates would remain the same with slightly higher standard errors due to the reduced sample size. If only "Görlitz" were omitted, the regression coefficients would be of similar magnitude and sign, but if "Nordsachsen" were omitted, all coefficient estimates would become insignificant. Because districts with low completeness had to be excluded, as their saturation level could not be estimated reliably, and because those districts tend to have a higher unemployment rate and to belong to the rural or urban district types, our results might be too conservative. To confirm and clarify the effects of the factors on the completeness, further studies including a larger amount of data are necessary.
The major challenge of the intrinsic completeness estimation is the selection of the best fit among multiple options. Additionally, using different models to estimate the completeness makes comparison of results for different districts more challenging. In our analysis, curves of the non-logistic curve family tended to estimate a higher asymptote, and therefore a lower completeness level, than curves of the sigmoid family.
However, the diversity of OSM contribution histories seems to not allow a "one fits all" approach. We have started to extend our research in this direction to overcome this limitation.

Conclusion and outlook
The presented approach allowed a reliable completeness estimation and comparison of OSM data between regions with individual contribution histories. This study was applied to the use case of retail stores, but the approach may be transferred to e.g. roads or land use data by substituting the store count with road network length or land use area.
The estimated completeness level of more than 86 % on average indicated the high potential of OSM to be used as a database for platforms offering online shop-searching in densely mapped countries such as Germany. For a real-world application, further quality elements such as positional accuracy and, moreover, the completeness of the various attributes, such as opening hours, contact information and accessibility, would also need to be investigated. Future research should further study how the completeness differs between the various types of retail stores, such as supermarkets or clothing stores, to identify gaps regarding store types.
The results of the GLM suggest that especially big cities with low unemployment rates can be expected to have a higher completeness of retail stores and are therefore presumably fit for purpose. Furthermore, we expect OSM to catch up in disadvantaged areas soon. This expectation is based on the high estimated completeness level of retail stores compared to previous studies of other feature classes, which demonstrates the continued growth of OSM and the overall high activity of the OSM community.