Tracking Hurricane Dorian in GDELT and Twitter

GDELT is a machine coded database of events that uses both foreign and domestic news feeds and contains over a quarter of a billion worldwide event records categorized into three hundred categories. This paper compares the spatial footprint of GDELT event mentions with those of event related geotagged tweets for Hurricane Dorian in the South-Eastern United States. Besides examining event related GDELT and Twitter data abundance, the study relates areas of elevated GDELT news and tweeting activities to the locations of the hurricane track over a six-day period, and statistically analyzes distances between daily GDELT event mentions and tweets, and the hurricane center on different days. It assesses the potential role of the geographic coverage of the cone in hurricane prediction maps on the level of event related news and tweeting activities. The study also discusses pros and cons of both data sources for event tracking with regards to data abundance, spatial and temporal resolution, and thematic accuracy.


Introduction
Social media platforms have revolutionized communication patterns between users and the propagation of news as these platforms facilitate the sharing of content and instant responses in quasi real time [1,2,3]. The widespread use of social media platforms, such as Twitter and Facebook, provides an additional source of news to its users, though social media still play a relatively limited role as a way of finding news compared to television, news media websites or print newspapers, even among 18 to 24 year olds [4]. GDELT (Global Database of Events, Language, and Tone) relies primarily on the second, more traditional type of information channels. It is a news repository based on monitoring news media in over 100 languages across the world that is updated every 15 minutes. It provides a rich archive of events since 1979 and includes scores about tones and emotions in the news besides location and time of the event. The GDELT project contains over 250 million geocoded events with an even higher number of event mentions and location references, along with a massive Global Knowledge Graph (GKG) that connects people, organizations, locations, themes and emotions underlying those events [5].
Social media platforms have been extensively used for the detection, monitoring and tracking of man-made or natural events (e.g., earthquakes, floods, or revolutions), and analyzing their impact on society and its responses [5,6,7,8]. Several studies also addressed the relationship between geographic and social space when analyzing community interaction in social media platforms [10]. Other studies evaluated the spatial and temporal accuracy of event representation in GDELT [11]. However, potential differences in spatial and temporal coverage of an event between social media platforms (especially Twitter) and GDELT are largely underexplored. This makes it difficult to decide which platform is most suitable in order to obtain the best overview over a specific event (e.g. its spatial or temporal extent) and its consequences for the local population. To address this issue, this study uses Hurricane Dorian, a category 5 hurricane, which hit the Bahamas and part of the Southeastern U.S. in September 2019, as a showcase.
The study has the following objectives: 1) compare the abundance and spatial resolution of data points (GDELT event mentions and tweets, respectively) related to this event for the Southeastern U.S. (Florida, Georgia, South Carolina, and North Carolina); 2) examine the proximity of daily GDELT event points and tweets to the hurricane eye throughout a six-day observation period; 3) explore the effect of the geographic coverage of the cone used in daily hurricane forecast maps on GDELT event mentions and tweeting activities. The relative sparsity of geo-tagged tweets, with only 1-2% of tweets being annotated with geographical coordinates, has long been recognized as a challenge for geo-applications and analyses using geo-tagged tweets [12,13]. In this context, this study will evaluate the potential of GDELT to represent a viable data alternative to tweets for the purpose of determining areas affected by Hurricane Dorian.

Literature Review
The enormous growth of data generated by social media results in an abundance of online information that facilitates event detection and related applications, such as emergency, traffic, and health management, rescue operations [14], tracking of refugee movements [13], delineation of areas affected by local events, such as floods [15], or compliance assessment with governmental regulations before and during natural disasters [16]. A wide range of methods for event detection and event extraction from social media and, more generally, from crowd-sourced geo-information data have been developed and classified over the years [14]. Whereas the use of tweets and some other prominent crowd-sourced data sources, such as Flickr [17] or OpenStreetMap (OSM) [18], for event detection and management has been frequently discussed in the literature, the role of GDELT for mapping and tracking of events is less well known in the geo-community. An earlier study addressing spatial aspects of the GDELT platform [19] analyzed 10,009 news media and 195,513 disasters appearing in the GKG between April 2013 and July 2014 in a hierarchical (mixed-effect) regression model, finding that disasters received more attention if they occurred in politically unstable countries, if many people were affected, if counts of kidnappings and killings were provided, and if disasters were aftershocks, radiation leaks, flooding, ice, or landslides. A validation of GDELT data for five event types occurring in Sudan or South Sudan between April 2013 and November 2013 found that, on average, 81.2 percent of the event codes in the database accurately reflect the nature of the articles, and that the geography of events in the database is correct 75.4 percent of the time [11].
A paper presenting a taxonomy of event extraction from Twitter [20] points out that the use of GDELT for the evaluation of event extraction from Twitter is limited by the coarse spatial level (city) and the lack of semantic elements in GDELT event messages.
Only a few studies analyze the spatial overlap between GDELT news coverage and that from crowd-sourced data. One study, for example, compared the spatial extent of the GDELT report density with the OSM edit density for the Moore Tornado in Oklahoma (May 2013) and Typhoon Haiyan (November 2013) [21]. Results revealed that the majority of OSM edits were located where the natural disaster occurred, while GDELT reports were predominantly located in the largest city near the disaster affected location.
Previous research showed that both social networks and news media rapidly increase their attention to natural disasters especially within the first 72 hours, as demonstrated for an earthquake in GDELT [7] or for a typhoon in Twitter [22], among other examples, such as hurricanes [23]. In the case of tweets, hashtags were found to be important semantic features since they help to identify the topic of a tweet and to estimate the topical cohesiveness of a set of tweets [24]. The frequency of disaster related tweets was found to be highest in the spatial proximity of a disaster [25].

Data and Methods
This research covers four U.S. states, namely Florida, Georgia, North Carolina, and South Carolina, and analyzes hurricane related GDELT event mentions and tweets within that study area shared between September 1 and September 6, 2019. Dorian became a hurricane on August 28 north of the Greater Antilles, made landfall on the Bahamas on September 1 as a Category 5 Hurricane, and began moving northwestward on September 3, parallel to the coast of Florida, after weakening considerably. On September 6 it made landfall on Cape Hatteras in North Carolina at Category 1 intensity, after which it headed northeast as a tropical storm before dissipating near Greenland on September 10.

Tweets and GDELT event mentions
Geo-tagged tweets posted between September 1 and September 6, 2019 were downloaded for the study area through the Twitter Streaming API using the Tweepy Python library and stored in a PostgreSQL database. Both tweets with precise coordinates and tweets geocoded with place type city were downloaded. In June 2019, Twitter disabled precise location sharing from Twitter apps for mobile devices, so that precise location is available only through the camera by sharing a photo with the tweet. Other tweets, when location sharing is on, share Twitter place (city level) position information by default, with an option of selecting a nearby point of interest (POI) on Twitter apps. For our research, most tweets used in the analysis came with Twitter place position information (i.e., city or POI). Although this information is spatially not as accurate as precise coordinate information, this was not viewed as a significant problem for the analysis given the geographic scope of the study, which extends over several states. Tweets from automated Twitter accounts (bots) were subsequently removed using the Botometer web service [26], where Python was used to call the Botometer API for Twitter users in the analyzed dataset. The JSON object for each tweet contains, among other fields, a list of all hashtags that are used in the tweet.
Hashtags provide a platform for the discussion of a specific topic and can be used to classify information [27]. Hashtag strings can be clicked to trigger a global search of tweets related to a topic of interest. To extract hurricane related tweets, tweets that contained the term "Dorian" or "Hurricane" in their hashtags or text were retrieved. This led to the retrieval of 8539 hurricane related tweets in the study area.
GDELT data can be mined in two ways. The first option is to run queries on Google BigQuery for a fee, based on different available pricing schemes. A limited amount of data can also be downloaded for free when a new account is opened. The second option is to download (for free) the three main tables of events, event mentions and the GKG that make up the GDELT database and which are updated every 15 minutes. The Events table contains event logs which are uniquely coded using a GLOBALEVENTID key, the Event Mentions table stores all the mentions that reference an event, and the GKG contains the themes to which the events cited in the Event Mentions table belong. These tables can then be integrated and analyzed in a database management system, such as PostgreSQL. For this study both methods were tested for the extraction of hurricane related GDELT event mentions, and they returned the same results. However, about 200 GKG files could not be loaded into the PostgreSQL database due to data encoding errors. In Google BigQuery, an SQL join between the three tables was run to retrieve all the event mentions about the hurricane. The SQL query is available for download on GitHub (see the Data and software availability section). The extraction started by identifying all GKG records falling under the "HURRICANE" theme. Based on this, the Event Mentions table was searched for HURRICANE related articles (mentions) during the study period, which finally led to searching the Events table for "HURRICANE" events.
GDELT entries in the Events table were filtered to USCITY location type, which corresponds to the size of a US city or landmark. This is the finest spatial resolution for GDELT entries available, besides more coarse location types, such as COUNTRY, USSTATE, WORLDCITY, and WORLDSTATE. Each location in the GKG comes with a latitude/longitude pair which represents the centroid of the location.
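The two filtering steps described in this subsection (keyword matching on tweets and restriction of GDELT events to the USCITY location type) can be illustrated with a minimal Python sketch. It is not the code used in the study. It assumes that tweets are available as dictionaries following the Twitter API v1.1 layout (`text`, `entities.hashtags[].text`), that GDELT event rows have already been parsed into dictionaries keyed by the codebook column name `ActionGeo_Type`, and that case-insensitive substring matching is an acceptable reading of "contained the term". The function names are hypothetical.

```python
# Illustrative sketch only: record layouts and matching rules are assumptions.
KEYWORDS = ("dorian", "hurricane")

# GDELT location type codes: 1=COUNTRY, 2=USSTATE, 3=USCITY,
# 4=WORLDCITY, 5=WORLDSTATE.
USCITY = 3


def is_hurricane_tweet(tweet):
    """Return True if a keyword occurs in the tweet text or in any hashtag."""
    text = tweet.get("text", "").lower()
    hashtags = tweet.get("entities", {}).get("hashtags", [])
    tags = [h.get("text", "").lower() for h in hashtags]
    return any(k in text or any(k in t for t in tags) for k in KEYWORDS)


def keep_uscity_events(events):
    """Keep only event rows whose action location is of type USCITY."""
    return [e for e in events if int(e["ActionGeo_Type"]) == USCITY]


if __name__ == "__main__":
    # Toy records, invented for illustration.
    tweets = [
        {"text": "Stay safe", "entities": {"hashtags": [{"text": "HurricaneDorian"}]}},
        {"text": "Nice beach day", "entities": {"hashtags": []}},
    ]
    events = [
        {"GLOBALEVENTID": "1", "ActionGeo_Type": "3"},
        {"GLOBALEVENTID": "2", "ActionGeo_Type": "2"},
    ]
    print([is_hurricane_tweet(t) for t in tweets])
    print([e["GLOBALEVENTID"] for e in keep_uscity_events(events)])
```

Substring matching also accepts compound hashtags such as "HurricaneDorian"; a stricter whole-word rule would retrieve fewer tweets.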
The hurricane track position for every 12 hours between September 1 and September 6, 2019 was retrieved from the National Oceanic and Atmospheric Administration (NOAA) website (https://www.nhc.noaa.gov/archive/2019/DORIAN_graphics.php). Time stamps of tweets, GDELT event mentions and hurricane tracking points are all given in UTC and were therefore not converted to local time.

Analysis methods
Related to objective 1, descriptive statistics and maps will provide an overview of the abundance of daily hurricane related GDELT event mentions and tweets. Objective 2 determines how closely GDELT event mentions and tweets follow the hurricane path over the analyzed time period. Two general approaches were applied for this, as follows.
Since the hurricane eye was travelling mostly over the ocean, and GDELT news and tweets are primarily posted from land, we could not directly compare the locations of the news sources with the location of the hurricane eye. Instead, for the first approach, median distances were computed between GDELT/tweet locations and the location of the hurricane eye for all combinations of days between September 1 and 6. The hypothesis with this analysis is that the GDELT/tweet locations generally follow the hurricane eye closely on land, so that the distance between them and the hurricane eye is shortest when the positions of the same day are compared. For example, GDELT news/tweets from September 6 are expected to be closest to the September 6 hurricane tracking point. In contrast, the distance can be expected to be longer for other date combinations, such as when measuring distances from GDELT news/tweets shared on September 6 to the hurricane eye position on September 4 or 5. This hypothesis will be statistically tested through a series of Wilcoxon signed-rank tests for GDELT and Twitter.
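The median distance computation for all combinations of posting day and hurricane day can be sketched as follows. This is a minimal illustration, not the workflow used in the study (which relied on ArcGIS Pro and R). Great-circle (haversine) distance on a sphere of radius 6371 km is an assumption, since the distance measure is not specified above, and the function names and toy coordinates are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt
from statistics import median

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def median_distance_table(posts_by_day, eye_by_day):
    """Median distance for every (posting day, hurricane day) combination.

    posts_by_day: {day: [(lat, lon), ...]} for tweets or GDELT mentions
    eye_by_day:   {day: (lat, lon)} hurricane eye position per day
    """
    table = {}
    for post_day, points in posts_by_day.items():
        for eye_day, (eye_lat, eye_lon) in eye_by_day.items():
            table[(post_day, eye_day)] = median(
                haversine_km(lat, lon, eye_lat, eye_lon) for lat, lon in points
            )
    return table


if __name__ == "__main__":
    # Toy coordinates, invented for illustration only.
    posts = {
        "Sep 2": [(25.8, -80.2), (26.1, -80.1), (26.7, -80.1)],
        "Sep 6": [(34.2, -77.9), (35.2, -75.7), (33.7, -78.9)],
    }
    eyes = {"Sep 2": (26.8, -78.4), "Sep 6": (35.2, -75.6)}
    for key, value in sorted(median_distance_table(posts, eyes).items()):
        print(key, round(value, 1))
```

Under the hypothesis stated above, the entry for a matching pair of days should be the smallest value in its column of the table.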
For the second approach, the number of GDELT event mentions and tweets per day are counted for all 413 counties in the four analyzed states. To reach a sufficient sample size only those counties with at least 30 tweets (n=48 counties) or at least 30 GDELT event mentions (n=23 counties), respectively, over the observed six-day period were retained for further analysis. Next, daily GDELT event mention and tweet numbers for these counties were corrected by a scaling factor to obtain the same total number of observations each day. Next, the mean number of GDELT event mentions and tweets was computed for each county across the six-day period, followed by the computation of the z-score for each day and county. The hypothesis related to this objective is that z-scores of GDELT news and tweet counts tend to be high for counties that are located near the hurricane eye on a given day, and that more distant counties have a lower z-score.
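The scaling and z-score steps of the second approach can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: daily counts are scaled so that every day has the same total (here the mean of the daily totals, which is one possible target), and the z-score of a county on a given day uses that county's mean and sample standard deviation across the days. The text above does not state whether a sample or population standard deviation was used, so that choice, like the function name, is hypothetical.

```python
from statistics import mean, stdev


def scaled_z_scores(counts):
    """Per-county daily z-scores after equalizing daily totals.

    counts: {county: [n_day1, ..., n_dayK]} with the same K for all counties
    returns {county: [z_day1, ..., z_dayK]}
    """
    n_days = len(next(iter(counts.values())))
    day_totals = [sum(series[d] for series in counts.values()) for d in range(n_days)]
    target = mean(day_totals)  # assumed common daily total
    factors = [target / total if total else 0.0 for total in day_totals]

    z_scores = {}
    for county, series in counts.items():
        scaled = [n * f for n, f in zip(series, factors)]
        mu = mean(scaled)
        sd = stdev(scaled)  # sample standard deviation (assumption)
        z_scores[county] = [(x - mu) / sd if sd else 0.0 for x in scaled]
    return z_scores


if __name__ == "__main__":
    # Toy counts for two hypothetical counties over three days.
    toy = {"County A": [10, 20, 30], "County B": [30, 20, 10]}
    for county, z in scaled_z_scores(toy).items():
        print(county, [round(v, 2) for v in z])
```

Because each county is standardized against its own six-day mean, a high z-score marks a day on which that county was unusually active relative to its own baseline rather than relative to other counties.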
For objective 3, the potential effect of the spatial coverage of the hurricane forecast cone on activity levels of GDELT news sharing and tweeting is examined. The hypothesis is that areas under the cone, which have a higher chance of being hit by the hurricane in a matter of days according to forecasts, receive relatively more attention in GDELT and Twitter than areas outside the cone. To test the hypothesis, two consecutive dates are picked (September 2 and 3) which show little hurricane movement in-between, but that show a significant change in the cone shape. In addition, a reference area close to the hurricane eye but outside the cone (on both dates) is chosen. The number of GDELT/tweet posts is counted for September 2 and 3 for the reference area and the cone area from September 2. If the hypothesis is true, one can expect a significant drop in GDELT/tweet counts in the September 2 cone area between September 2 and 3, relative to the drop in the reference area between both days.

Data and software availability
IDs of hurricane related tweets (in .txt file format) as well as records of GDELT news mentions (in .csv format) used in this study can be obtained from the data folder on https://github.com/InnocensiaO/Tracking-Hurricane-Dorian-in-GDELT-and-Twitter. Tweet IDs are shared as described in the Twitter API developer agreement and policy which gives permission for the distribution of this data for academic research purposes. The queries that were used to retrieve GDELT event mentions through Google Big Query can be found in the code folder. The hurricane cone graphics can be accessed at https://www.nhc.noaa.gov/archive/2019/DORIAN_graphics.php.
The computational environment used during this research included a Lenovo W530 laptop with a Windows 10 operating system, 24 GB of RAM and a 2 TB hard disk. ArcGIS Pro 2.4.3 was employed to carry out spatial analysis and map creation. Statistical charts were created in R [28]. The R code can be found at the GitHub link above.
Table 1 shows, for the four analyzed U.S. states, the daily counts of hurricane related GDELT event mentions and tweets between September 1 and September 6, 2019. A total of 3340 event mentions were retrieved from the GDELT master database in comparison to a much higher number of 8539 tweets. Tweets are only more abundant, however, if tweets of place type city are included, since only 14.9% of geo-tagged tweets obtained for this study had exact coordinates. Among the 1274 tweets with exact latitude/longitude coordinates, all except for four came with images. Only three out of the 8539 city level tweets came from the Twitter Web Client source; all others came from Twitter apps for mobile devices. In both the GDELT and Twitter datasets, daily record numbers are highest when the hurricane is close to the Miami Metropolitan Area (September 2 and 3), which can be expected due to the large population in that area. Numbers drop significantly once the hurricane reaches the northern fringes of the study area in North Carolina on September 6. The map illustrates that on that day GDELT events and tweets are primarily found along the Florida East coast but only occasionally in the more northern states Georgia, South Carolina, and North Carolina. Furthermore, tweets, due to their relative abundance, appear in more cities than GDELT events, which is most noticeable along the Florida East coast.

Distance to hurricane tracking points
Fig. 2 plots the median distances between GDELT event mentions (a) and tweets (b) for a given day (between September 1 and 6) and hurricane center points at a given day and time (provided in 12-hour increments). The latter is shown along the x-axis of each chart. The GDELT and Twitter sample size for each median computation can be obtained from Table 1. The line patterns reveal a drop in median distance during the first three days both in GDELT and Twitter, which can be expected as in these few days the hurricane eye moves towards the coast (compare Fig. 1). Similarly, median distances increase towards the end of the analyzed time period, when the hurricane eye shifts towards the fringe of the analyzed region. It is important to notice that the curves associated with different GDELT and tweet dates tend to have their lowest points at different hurricane dates. This means that the average positions of GDELT or Twitter posts from such a day are also closest to the hurricane eye. For example, the GDELT September 5 curve has the lowest distance value among all six curves for the September 5 hurricane eye position. This pattern is more pronounced for Twitter than for GDELT, which shows that Twitter more accurately follows the hurricane track. Exceptions to this are the distance curves for September 1 through 3, which nearly overlap in both figures. That is, the median distances for tweets between September 1 and 3 are similar since in the first 48 hours of the analysis the hurricane moved towards Southeast Florida perpendicular to the coastline, and not in a lateral direction along the coast.
As can be seen, between the September 1 and 6 hurricane dates (i.e. the left-most and right-most position in the charts), the order of GDELT/Twitter distance lines flips. In other words, for the hurricane eye location on September 1, GDELT/Twitter posting locations from around that day (September 1-3) are closest to it, whereas for the September 6 hurricane eye position GDELT/Twitter positions from September 6 are closest.
The boxplots in Fig. 3 provide a more detailed insight into the distribution of distances between GDELT event mentions (a, b) and tweets (c, d) posted on September 1 through 6, and hurricane eye positions on September 1 and 6. The plots show that distances between GDELT positions and the September 1 hurricane location gradually increase from September 2 on (a), with the opposite trend for the September 6 hurricane location (b). For tweets, a similar pattern can be observed in Fig. 3c and d. This illustrates again the influence of the hurricane eye position on the locations from which GDELT event mentions and tweets are shared. A series of Wilcoxon signed-rank tests was carried out to determine if differences between median distances on consecutive days in Fig. 3 were statistically significant or not. Each row in Table 2 corresponds to the comparison of two adjacent box plots in the different charts shown in Fig. 3. With the exception of some date pairs at the beginning and end of the study period, differences in median distances were significant, which means that the center of GDELT and tweet activities, respectively, shifted between days relative to the hurricane eye position on a given day (September 1 or 6). Tweets had more differences that were statistically significant, indicating that the tweets tracked the hurricane path better than GDELT did.
Figs. 4 and 5 show daily z-score maps for GDELT event mentions and tweets from September 1 through September 6. The position of the hurricane eye on the corresponding day is visualized as a red triangle. Resulting z-scores for each map were classified into five categories using Jenks natural breaks classification to facilitate an easier visual comparison between the different days. Class 1 covers the lowest range of z-scores and class 5 the highest. The maps show that z-scores for GDELT event mentions and tweets on a selected day tend to be highest in counties that are close to the eye of the hurricane. This pattern is more discernible for tweets (Fig. 5) than for GDELT event mentions (Fig. 4), suggesting that tweets are more suitable for event tracking. A possible explanation for this is the larger number of hurricane related tweets than GDELT event mentions (compare Table 1). The Pearson correlation between GDELT event mentions and tweets in 224 counties with at least one recorded point in either data source results in r = 0.78 (p < 0.001). This shows that both news sources share similar spatial patterns, but that they may still be able to complement each other since there is no perfect match. Fig. 6 shows a corresponding scatter plot of the count of GDELT event mentions versus tweets in these counties. One outlier in Miami-Dade County revealed an unexpectedly high number of GDELT contributions, possibly due to a high number of news stations in that area. Removing this outlier leads to an increased r = 0.83 (p < 0.001).
Table 3 reports corresponding counts of GDELT event mentions and tweets in these two areas, respectively, as well as the ratio of counts between the two areas. Fig. 7b and d map the same type of information for September 3, where, however, the green dots now show the locations of GDELT event mentions and tweets within the September 3 hurricane cone. What can be observed is that, although on September 3 the cone turned significantly away from the coast towards the ocean, the pattern of GDELT event mentions and tweets along the coast barely changes. This implies that the coverage of the cone does not noticeably affect news and tweeting activities. Instead, it appears that the distance to the hurricane eye is the driving factor of news and social media activities. If the hypothesis related to objective 3 were true, fewer GDELT mentions and tweets should be observed in the September 2 cone on September 3 than on September 2 relative to the reference area. However, that ratio increased from 1.19 to 1.31 for GDELT and from 1.23 to 1.89 for tweets, which means that people were still actively tweeting from the same areas on September 3 although the cone had already shifted away towards the east. Therefore, the hypothesis needs to be rejected since the cone area has no apparent effect on GDELT news mentions and tweeting activities.
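The ratio comparison behind this conclusion can be written out as a short worked example. The sketch below uses only the ratios reported above and assumes that each ratio is the count in the September 2 cone area divided by the count in the reference area. The function names are hypothetical, and the check is a simple direction-of-change comparison, not a significance test.

```python
def relative_change(ratio_before, ratio_after):
    """Relative change of the cone/reference count ratio between two days."""
    return (ratio_after - ratio_before) / ratio_before


def cone_hypothesis_supported(ratio_before, ratio_after):
    """The cone hypothesis predicts a falling ratio once the cone moves away."""
    return ratio_after < ratio_before


# Ratios for September 2 and September 3 as reported in the text.
REPORTED_RATIOS = {"GDELT": (1.19, 1.31), "Twitter": (1.23, 1.89)}

if __name__ == "__main__":
    for source, (sep2, sep3) in REPORTED_RATIOS.items():
        change = relative_change(sep2, sep3)
        print(source, f"{change:+.1%}", cone_hypothesis_supported(sep2, sep3))
```

Both ratios rise (by roughly 10% for GDELT and 54% for tweets), so the check returns False for both sources, in line with the rejection of the hypothesis.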

Discussion and Conclusions
GDELT event mentions and tweets are two examples of big data platforms with free data access that provide significant potential for tracking of natural disasters, such as hurricanes. The goal of this study was to compare the suitability of these sources for the tracking of Hurricane Dorian over a six-day period. The Twitter data count was higher than the GDELT event mention count by a factor of about 2.6 when using city level data; hence Twitter is more suitable than GDELT in this regard. Twitter also provides the option to use tweets with exact coordinates, which facilitates more accurate event mapping than GDELT does, however, at the cost of a lower sample size. Distance analysis between GDELT mentions/tweets and the hurricane center across the six days suggests that the average position of data points on both platforms generally follows the hurricane path on land, and that Twitter is more accurate in following the hurricane path than GDELT. A combination of both GDELT and Twitter will lead to an increased sample size and could therefore also improve spatial and temporal accuracy for event detection and monitoring, which will be explored as part of future work.
The Twitter streaming API provides tweets in quasi-real time, which makes it possible (though technically challenging due to the presence of bots, etc.) to monitor and map events in real time. For GDELT the update interval is 15 minutes. Depending on the type of analysis this may or may not be an issue. Extraction of tweets linked to a particular event poses some challenges, as this is typically done using hashtag filters or text mining, which may miss tweets that are not tagged accordingly [29]. This issue aside, tweets can, however, be used to study the spatiotemporal characteristics of any topic of interest (e.g. food, health, tourism, politics, travel). A challenge with extracting topical data from GDELT is that direct searching of keywords on GDELT event mentions is not possible through Google BigQuery. Instead, one needs to first pick a theme from a long list of pre-defined themes in the GKG table that the event of interest may fall under. One obvious advantage of GDELT over tweets is its historic dimension since its events date back to 1979, whereas access to older tweets is technically more challenging if not impossible.
Previous research has shown that during an event, areas closest to the event receive the most attention. This trend is generally confirmed through z-score maps in this study, more so for tweets than for GDELT event mentions. This suggests that tweets are a more reliable data source for the spatial analysis, e.g. tracking or regional impact assessment, of natural disasters. An expected association between the spatial coverage of a hurricane prediction cone and GDELT news/tweeting activities in these areas could not be confirmed. Future work will expand this analysis to other event types, such as protests, elections, sports (e.g. Olympic games) and other natural disasters, such as flooding.