Articles | Volume 7
https://doi.org/10.5194/agile-giss-7-29-2026
10 Jun 2026

Enhancing OpenStreetMap Building Footprints through nDSM-Based Geometric Segmentation for AI Training Data

Paul Kuper, Ruiqi Liu, Hanwen Deng, and Martin Breunig

Keywords: OpenStreetMap, nDSM, DBSCAN, Region Growing, Label Enhancement

Abstract. High-quality building footprint labels are a critical prerequisite for training AI-based segmentation models, yet reliable ground truth data are rarely available at scale. On the one hand, vegetation often prevents buildings from being delineated reliably from imagery alone; on the other hand, community-driven open data sources such as OpenStreetMap (OSM) frequently exhibit spatial inconsistencies and incompleteness. This study brings both data sources together: it investigates the potential of airborne LiDAR-derived Normalized Digital Surface Models (nDSM) to improve building extraction and refine OSM labels. Two automated strategies are implemented and compared: 1) a rule-based region growing algorithm and 2) a Density-Based Spatial Clustering of Applications with Noise (DBSCAN) pipeline that operates on a multi-dimensional feature space incorporating nDSM heights and local roughness. The resulting, more reliable building footprint labels are intended as training data for AI-based building segmentation. Both methods are evaluated against orthophoto-based ground truth data for Karlsruhe, Germany. Quantitative results show that the nDSM-based DBSCAN approach performs most robustly, achieving an F1-score of 0.94 and an Intersection-over-Union (IoU) of 0.89. This method systematically improves upon the raw OSM baseline by filtering vegetation and correcting geometric misalignments through multi-source constraints, specifically the Normalized Difference Vegetation Index (NDVI) and the overlap with OSM map data. Finally, conclusions are drawn and an outlook is given on AI-based building segmentation models that are trained on such labels and applied in scenarios where high-quality ground truth is unavailable.
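
For readers unfamiliar with the two reported metrics, the following minimal sketch shows how an F1-score and an IoU can be computed from a predicted and a reference building mask. It is an illustration only, not the authors' evaluation code: the function name `footprint_scores` and the toy 4x4 masks are assumptions introduced here, and the study may aggregate its scores differently (for example per building or per tile).

```python
import numpy as np


def footprint_scores(pred, truth):
    """Return (precision, recall, f1, iou) for two binary building masks.

    Hypothetical helper for illustration; pixels with value 1 are "building".
    """
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    if pred.shape != truth.shape:
        raise ValueError("masks must have the same shape")

    tp = int(np.sum(pred & truth))    # building in both masks
    fp = int(np.sum(pred & ~truth))   # predicted building, reference background
    fn = int(np.sum(~pred & truth))   # missed reference building pixels

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    union = tp + fp + fn
    f1 = 2 * tp / (2 * tp + fp + fn) if union else 0.0
    iou = tp / union if union else 0.0
    return precision, recall, f1, iou


# Toy masks (assumed data): the prediction adds one false-positive pixel.
truth = np.array([[0, 1, 1, 0],
                  [0, 1, 1, 0],
                  [0, 1, 1, 0],
                  [0, 0, 0, 0]])
pred = np.array([[0, 1, 1, 0],
                 [0, 1, 1, 0],
                 [0, 1, 1, 1],
                 [0, 0, 0, 0]])

precision, recall, f1, iou = footprint_scores(pred, truth)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} iou={iou:.3f}")
```

When both metrics are computed from the same pixel counts, they are linked by F1 = 2·IoU / (1 + IoU). Under that assumption, an IoU of 0.89 corresponds to an F1-score of about 0.94, which matches the pair of values reported above.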
