Short-term traffic demand prediction using graph convolutional neural networks

Short-term traffic demand prediction is one of the crucial issues in intelligent transport systems and has attracted attention from the taxi industry and Mobility-on-Demand systems. Accurate predictions enable operators to dispatch their vehicles in advance, satisfying both drivers and passengers. This study aims to predict traffic demand over the entire city based on the graph convolutional neural network (GCNN). Specifically, we divide the study area into several non-overlapping sub-regions. Each sub-region is treated as a node, and a traffic demand graph is constructed. Then, we build three graph convolutional networks based on three different weighted adjacency matrices, which represent three graph structures. Furthermore, a data-driven graph convolutional neural network (DDGCNN) is developed, which can capture the correlation between pairs of sub-regions automatically. Finally, we compare our models with other prediction methods, including three GCNNs with a normal adjacency matrix, an existing data-driven graph convolutional neural network, historical average, and random forest. Results show that the weighted adjacency matrix improves the prediction performance compared with a normal adjacency matrix. In addition, we show that our DDGCNN outperforms the other predictors in three aspects, i.e., performance over the test set, performance over time, and performance over space.


Introduction
Mobility-on-Demand (MoD) systems, such as Uber and Didi, have gained great popularity all over the world. These systems are more convenient and flexible than traditional transport modes, such as subways and buses, because they provide point-to-point travel with better comfort. In addition, they promote the sharing economy and enlarge the transportation capacity of cities. One crucial operational challenge in MoD systems is vehicle imbalance caused by asymmetric demand. Vehicles might accumulate in some areas while being insufficient in others, which impedes the operation of the system. One approach to address this challenge is surge pricing, which has been adopted by many MoD companies to attract drivers to areas short of vehicles [1,13,21,22]. Another approach is through vehicle such as Uber, Lyft, and Didi, also have a strong motivation to obtain accurate forecasts of rider demand to improve their operational efficiency. Various short-term demand prediction methods have been proposed, which can be categorized into parametric and nonparametric methods [9]. Parametric methods use historical data to fit a predetermined function linking the past and the present; the most popular models are the Autoregressive Integrated Moving Average (ARIMA) models. Moreira-Matias et al. [11] proposed an ensemble framework consisting of three time-series analysis models, i.e., a time-varying Poisson model, a weighted time-varying Poisson model, and an ARIMA model. The predicted demand was a weighted ensemble of the predictions from the three models. Instead of being fixed, the ensemble weights were updated based on the prediction performance in previous time-steps. Davis et al. [2] shortlisted several time-series techniques to fit taxi data. In addition, they developed a multilevel clustering technique that can explore the correlation between adjacent subareas.
These time-series methods are easy to deploy, but they may not be able to capture the nonlinear relationships in traffic data.
Nonparametric methods, on the other hand, do not assume such a predetermined relationship, but rather attempt to identify historical data that are similar to the prediction instance. These methods can deal with nonlinear and non-stationary time series of traffic demand. Mukai and Yoden [12] predicted taxi demand with a neural network that took multiple features (e.g., demand in each region, day-of-week, and amount of precipitation) as input. Zhao et al. [24] implemented three predictors (i.e., a Markov predictor, a Lempel-Ziv-Welch predictor, and a neural network predictor) to predict traffic demand at the building block level. In addition, the authors used entropy theory to show that taxi demand is highly predictable. Jiang et al. [6] predicted the demand for car-hailing services by using the least squares support vector machine (LS-SVM).
Deep learning (DL), a particular type of neural network, has been shown to be a reliable methodology that enables researchers to model complex nonlinear relationships between dependent variables and features. Several DL models have been applied to predict the traffic demand, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) [19]. Among these models, CNNs exploit the local relation between grid cells, which have been widely adopted to capture spatial dependencies. Ke et al. [7] proposed a fusion convolutional long short-term memory network (FCL-Net), combining convolutional operations and LSTM layers to predict traffic demand. The explanatory variables include historical demands, travel time per unit travel distance, time-of-day, day-of-week, and weather conditions. To rank the importance of the explanatory variables, the authors employed a tailored spatially aggregated random forest. Yao et al. [20] proposed a Deep Multi-View Spatial-Temporal Network (DMVST-Net) framework to predict demand. In this framework, CNN is used to capture the features from the spatially nearby sub-regions.
However, CNN still has limitations. CNN is designed for data on regular grids, such as images, but the study area usually has an irregular shape, so data preprocessing is necessary to apply CNN in transportation [10]. A common approach is to pad the cells outside the study area with zeros, which nevertheless wastes storage. In addition, CNN only captures local dependencies, whereas far-away areas can also influence the traffic demand in a cell. The graph convolutional neural network (GCNN) is an extension of CNN that can deal with data on an irregular domain [8]. By using different graph structures, GCNN can capture the relationships between different sub-regions. Therefore, we adopt GCNN to predict traffic demand.

Methodology
In this section, we first give some critical definitions related to the problem. Then, we introduce the normal graph convolutional neural network and describe how to construct the weighted adjacency matrix. Finally, a data-driven graph convolutional neural network is developed.

Problem Definition
Traffic request.
A traffic request describes when and where passengers need a vehicle. It is expressed as a tuple (id, t, lng, lat), where id is the identification of an order, t is the time at which passengers need the vehicle, and lng and lat are the longitude and latitude of the location where passengers need the vehicle. The vehicle can be a car-sharing vehicle, a taxi, etc. In this study, without loss of generality, we use taxi data as an example in the experiment section.

Traffic demand.
Traffic demand represents the number of traffic requests in a sub-region during a specific time interval. The sub-regions of a study area are non-overlapping and are denoted as $(l_1, \dots, l_N)$. The time intervals are denoted as $(I_1, \dots, I_T)$. The traffic demand in sub-region $i$ at interval $t$ is denoted as $d_{i,t}$.
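The definition above can be illustrated with a minimal NumPy sketch that counts requests per sub-region and interval. It assumes each request has already been assigned a sub-region index; the function name `aggregate_demand`, its parameters, and the toy data are hypothetical, and the default interval of 900 s corresponds to the 15-minute aggregation used in the experiments.

```python
import numpy as np

def aggregate_demand(region_idx, timestamps, n_regions, t0, interval_s=900,
                     n_intervals=None):
    """Count requests per (sub-region, interval), i.e. build the matrix d[i, t].

    region_idx : sub-region index in [0, n_regions) of each request (assumed given)
    timestamps : request times in seconds
    t0         : start of the first interval in seconds
    """
    region_idx = np.asarray(region_idx)
    slot = ((np.asarray(timestamps, dtype=float) - t0) // interval_s).astype(int)
    if n_intervals is None:
        n_intervals = int(slot.max()) + 1
    demand = np.zeros((n_regions, n_intervals), dtype=int)
    valid = (slot >= 0) & (slot < n_intervals)
    # np.add.at accumulates correctly when the same (region, slot) pair repeats
    np.add.at(demand, (region_idx[valid], slot[valid]), 1)
    return demand

# Toy data: two requests in sub-region 0 during the first interval,
# one request in sub-region 1 during the second interval.
demand = aggregate_demand([0, 0, 1], [10.0, 100.0, 950.0], n_regions=2, t0=0.0)
```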

Short-term traffic demand prediction problem.
The demand prediction problem aims to predict the number of requests in a specific future period throughout the whole city, given the historical demands and some context features. In this formulation, $n$ is the length of the historical time series, $i$ is the index of the sub-region, $X_{context}$ represents the context features, and $T$ is the length of the time series to be predicted.
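Using only the symbols defined above, the prediction task can be sketched as a mapping from the last $n$ observed demands and the context features to the next $T$ demands. The function name $F$ and the hat notation for predicted values are assumptions introduced here for illustration; the exact form of the original equation may differ.

```latex
\left(\hat{d}_{i,t+1}, \dots, \hat{d}_{i,t+T}\right)
  = F\left(d_{i,t-n+1}, \dots, d_{i,t};\; X_{context}\right),
  \qquad i = 1, \dots, N
```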

Graph Convolutional Neural Network
A graph convolutional neural network (GCNN) is a type of neural network that performs convolutional operations on graphs. We divide the whole city into several sub-regions and model the traffic demand as a graph $G = (V, E)$, where each vertex $i$ represents a sub-region, characterized by a feature vector $x_i$ consisting of $d_0$ features. Each edge represents the connection between a pair of vertices. The graph structure is described by an $N \times N$ adjacency matrix $A$, whose elements $A_{ij}$ encode the degree of connection between the signals at two vertices. We aim to build a GCNN model to predict traffic demand for the next period. Suppose the GCNN model has several layers from the input to the output. For each layer $l$, the input is a feature matrix $H^{l}$ and the output is $H^{l+1}$; the output of the last layer contains the traffic demand in the next time period at each sub-region. Specifically, for a graph with adjacency matrix $A$, the layer-wise propagation rule can be defined as a nonlinear function:

$$H^{l+1} = f(H^{l}, A) \qquad (2)$$

In this way, the features of each node become more abstract after each layer. Various types of GCNNs can be obtained by choosing the propagation rule in Eq. (2). In this paper, we adopt a two-layer GCNN and the following layer-wise propagation rule from [8]:

$$H^{l+1} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{l} W^{l}\right)$$

where $\tilde{A} = A + I_N$ is the adjacency matrix of the graph with added self-connections; $I_N$ is the identity matrix; $\tilde{D}$ is a diagonal matrix with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$; $W^{l}$ is the trainable weight matrix; and $\sigma(\cdot)$ is an activation function, such as ReLU or Tanh.
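To make the propagation rule concrete, the following is a minimal NumPy sketch of a two-layer forward pass. It is an illustration under stated assumptions rather than the model used in the experiments: the toy graph, the layer sizes, the random weights, and the choice of ReLU in the hidden layer with a linear output layer are all hypothetical, and no training is shown.

```python
import numpy as np

def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2}, where D is the degree matrix of A + I.

    Assumes A has non-negative entries so that all degrees are positive.
    """
    A_tilde = A + np.eye(A.shape[0])          # add self-connections
    deg = A_tilde.sum(axis=1)                 # D_ii = sum_j A_tilde_ij
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def relu(x):
    return np.maximum(x, 0.0)

def gcnn_forward(A, X, W0, W1):
    """Two-layer forward pass: H1 = ReLU(A_hat X W0), output = A_hat H1 W1."""
    A_hat = normalize_adjacency(A)
    H1 = relu(A_hat @ X @ W0)
    return A_hat @ H1 @ W1                    # linear output layer (assumption)

rng = np.random.default_rng(0)
N, d0, hidden = 4, 8, 16                      # toy sizes, not from the paper
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)     # a 4-node chain graph
X = rng.random((N, d0))                       # node features
W0 = rng.normal(size=(d0, hidden))
W1 = rng.normal(size=(hidden, 1))
y_hat = gcnn_forward(A, X, W0, W1)            # one predicted value per node
```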

Normal adjacency matrix.
The adjacency matrix A represents the structure of the graph and needs to be defined in advance. Such a matrix can be defined according to different rules [10]. This section proposes three data matrices to quantify the correlations between sub-regions.

Spatial distance matrix.
According to the first law of geography [16], near things are more related than distant things. A simple way to encode the correlation between two sub-regions is their spherical distance (the shortest distance along the surface of the Earth). Elements of the spatial distance matrix (SD) are then defined as

$$SD_{ij} = \mathrm{dist}(l_i, l_j)$$

where $l_i$ and $l_j$ are the central locations of sub-region $i$ and sub-region $j$, and $\mathrm{dist}$ denotes the spherical distance. Each entry of the corresponding adjacency matrix is 1 if $SD_{ij}$ is within a pre-defined threshold and 0 otherwise.

Demand correlation matrix.
The demand correlation matrix (DC) captures the correlations between sub-regions by employing historical demands. Each element is the Pearson Correlation Coefficient (PCC) of the demand series of sub-region $i$ and sub-region $j$:

$$DC_{ij} = \frac{\mathrm{cov}(h_i, h_j)}{\sigma_{h_i}\,\sigma_{h_j}}$$

where $h_i$ and $h_j$ are the traffic demand series of sub-region $i$ and sub-region $j$. A binary adjacency matrix is formed such that $A_{ij} = 1$ if $DC_{ij}$ exceeds $\epsilon_{DC}$ and $A_{ij} = 0$ otherwise, where $\epsilon_{DC}$ is a pre-defined threshold.
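A minimal NumPy sketch of how the spatial distance and demand correlation matrices, and their thresholded binary versions, might be computed is given below. The haversine formula is used here as one standard way to obtain the spherical distance; the coordinates, demand series, threshold values, and function names are hypothetical, and whether the diagonal is kept as 1 is an assumption.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_matrix(lat, lng):
    """Pairwise great-circle distances (km) between sub-region centres."""
    lat = np.radians(np.asarray(lat, dtype=float))
    lng = np.radians(np.asarray(lng, dtype=float))
    dlat = lat[:, None] - lat[None, :]
    dlng = lng[:, None] - lng[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlng / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

def demand_correlation(H):
    """H has shape (N, T): one demand series per sub-region. Returns the PCC matrix."""
    return np.corrcoef(H)

# Toy centres (degrees) and demand series for three sub-regions
lat = [40.70, 40.71, 40.80]
lng = [-74.00, -74.00, -73.95]
H = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0],
              [4.0, 3.0, 2.0, 1.0]])

SD = haversine_matrix(lat, lng)
DC = demand_correlation(H)

sd_threshold_km = 2.0   # hypothetical threshold
dc_threshold = 0.5      # hypothetical threshold
A_sd = (SD <= sd_threshold_km).astype(int)   # near sub-regions are connected
A_dc = (DC >= dc_threshold).astype(int)      # strongly correlated sub-regions are connected
```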

Demand Euclidean matrix.
The demand Euclidean matrix (DE) also measures the relationship by using historical demands. Instead of using the PCC, we use the mean Euclidean distance to represent the absolute closeness between pairs of sub-regions.

Weighted adjacency matrix.
The three adjacency matrices above contain only zeros and ones. They indicate only whether a relationship exists between two sub-regions, so all related sub-regions have the same effect on the target sub-region. In the real world, however, a sub-region is related to other sub-regions with different weights. For example, for a given sub-region i, an adjacent sub-region j should have a more substantial influence than a farther sub-region. Therefore, we build three weighted adjacency matrices based on the above three rules. For the DC matrix, a higher value means a stronger relationship, so the weighted adjacency matrix is built directly from the DC values. For the SD and DE matrices, however, a small value represents a strong relationship, so we build the corresponding weighted adjacency matrices by taking the reciprocal. To avoid zero denominators, we first replace zero elements with the minimum value of the remaining non-zero elements and then take the element-wise reciprocal.
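The reciprocal construction for distance-like matrices can be sketched as follows. This is a minimal NumPy illustration that assumes a non-negative matrix with at least one non-zero entry; the function name `reciprocal_weights` and the toy values are hypothetical, and any further masking or normalisation of the weights is not covered.

```python
import numpy as np

def reciprocal_weights(M):
    """Turn a distance-like matrix into weights: smaller distance, larger weight.

    Zero entries are first replaced with the smallest non-zero entry so that
    the reciprocal is always defined.
    """
    M = np.asarray(M, dtype=float).copy()
    min_nonzero = M[M > 0].min()
    M[M == 0] = min_nonzero
    return 1.0 / M

# Toy distance matrix for three sub-regions (arbitrary units)
SD = np.array([[0.0, 2.0, 4.0],
               [2.0, 0.0, 1.0],
               [4.0, 1.0, 0.0]])
W_sd = reciprocal_weights(SD)
```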

GCNN with data-driven adjacency matrix
The hidden correlations between sub-regions are heterogeneous, so predefining the adjacency matrix A is not trivial, and it is difficult to encode these correlations using a single matrix such as the SD, DC, or DE matrix. An intuitive idea is to learn the filter from the data. Let $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$; the propagation rule then becomes

$$H^{l+1} = \sigma\left(\hat{A} H^{l} W^{l}\right) \qquad (10)$$

However, Eq. (10) places two restrictions on $\hat{A}$. 1) It treats $\hat{A}$ as a symmetric matrix, which means that the influence between a pair of sub-regions is mutual and equal. 2) $\hat{A}$ only has positive elements, which allows only positive influence between pairs of sub-regions. In this paper, we use a similar form,

$$H^{l+1} = \sigma\left(\check{A} H^{l} W^{l}\right)$$

where $\check{A}$ is an adjacency-like matrix that is learned from data in the same way as $W^{l}$. Each entry in $\check{A}$ can be negative, positive, or zero.
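The idea of learning an unconstrained adjacency-like matrix can be illustrated with a deliberately simplified NumPy sketch. It is not the model of this paper: it uses a single linear layer with no activation and no weight matrix, synthetic data generated from a hypothetical asymmetric, signed matrix `A_true`, and a hand-derived gradient of the mean squared error. In the actual model, the matrix would be trained jointly with the layer weights by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)
S, N = 200, 4                                  # samples and sub-regions (toy sizes)

# Hypothetical hidden influence matrix: asymmetric, with negative entries
A_true = np.array([[1.0, -0.5, 0.0, 0.0],
                   [0.3,  1.0, 0.0, 0.0],
                   [0.0,  0.0, 1.0, 0.8],
                   [0.0,  0.0, -0.2, 1.0]])

X = rng.random((S, N))                         # node signals, one row per sample
Y = X @ A_true.T                               # targets produced by the hidden matrix

A_check = np.zeros((N, N))                     # learnable matrix, no constraints
lr = 0.5
losses = []
for _ in range(500):
    residual = X @ A_check.T - Y               # prediction error, shape (S, N)
    losses.append(float(np.mean(residual ** 2)))
    grad = 2.0 * residual.T @ X / (S * N)      # d(MSE)/d(A_check)
    A_check -= lr * grad                       # plain gradient descent step
```

Because nothing forces `A_check` to be symmetric or non-negative, it is free to recover directed and negative influences, which is the property the data-driven formulation relies on.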

Experiment setup
The data utilized in this study are from TIL, covering June 2015 to June 2016 [15]. Each record includes pickup time, dropoff time, pickup longitude, pickup latitude, dropoff longitude, and dropoff latitude. Since over 80% of these records have origins and destinations within Manhattan island, we focus only on taxi data within Manhattan island for demand prediction. The study area is divided into a uniform grid, where each cell is a sub-region with an edge length of 500 meters. The pickups are aggregated into 15-minute time intervals. Furthermore, we consider only sub-regions with at least one request per day on average. After eliminating the regions that do not satisfy this limit, 269 regions were left.
In the experiment, the first 11 months of the data set (from June 2015 to April 2016) are used as the training set, the following month (May 2016) is used as the To consider the historical demands, we select the demands of the past eight intervals as features for each sub-region, which are normalized to [0,1]. In addition, the features include day of week and hour, which are categorical variables and are encoded using embedding methods. Based on section 3.2, we build six types of GCNN models based on known adjacency matrices, which are labeled GSdEa, GPdEa, GEdEa, GSdUa, GPdUa, and GEdUa, where Sd, Pd, and Ed denote the spatial distance, demand correlation, and demand Euclidean based adjacency matrices, respectively. Ea and Ua denote normal adjacency and weighted adjacency matrix, respectively. In addition, we build our data-driven GCNN according to section 3.3, which is denoted as DSu. We also select another data-driven GCNN for comparison, which is based on [10] and denoted as DSp.
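The construction of lagged demand features can be sketched in NumPy as follows. The window length of eight intervals follows the text; the function names, the global min-max scaling, and the toy demand matrix are assumptions made for illustration. The categorical day-of-week and hour features and their embeddings are omitted.

```python
import numpy as np

def make_windows(demand, n_lags=8):
    """Build lagged features from a demand matrix of shape (N, T).

    Returns X with shape (T - n_lags, N, n_lags) and targets y with shape
    (T - n_lags, N), where y[k] is the demand right after the window X[k].
    """
    N, T = demand.shape
    X = np.stack([demand[:, t - n_lags:t] for t in range(n_lags, T)])
    y = demand[:, n_lags:].T
    return X, y

def min_max_scale(x, lo, hi):
    """Scale values to [0, 1] given the minimum and maximum of the training data."""
    return (x - lo) / (hi - lo)

# Toy demand matrix: 2 sub-regions, 12 intervals
demand = np.arange(2 * 12, dtype=float).reshape(2, 12)
lo, hi = demand.min(), demand.max()   # in practice, take these from the training split only
X, y = make_windows(min_max_scale(demand, lo, hi))
```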

Performance Metrics
Here, we use the Symmetric Mean Absolute Percentage Error (sMAPE) and the Root Mean Square Error (RMSE) to examine the performance of GCNN and the comparison methods [19]. From a statistical perspective, RMSE measures the absolute difference between the predicted value and the real value, while sMAPE describes a percentage error. In both metrics, $d_i$ is the real demand, $\hat{d}_i$ is the predicted demand, and $M$ is the number of individual demand values. The constant $c$ in sMAPE is a small number (here, $c = 1$) that avoids a zero denominator. By using different sets of individual demands, we can obtain the prediction performance in various sub-regions and periods.
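The two metrics can be sketched in NumPy as below. RMSE follows its standard definition. Several variants of sMAPE exist, so the form used here, with the denominator $d_i + \hat{d}_i + c$ and no factor of 2, is an assumption that matches the description of $c$ above but may differ from the exact formula used in the experiments.

```python
import numpy as np

def rmse(d, d_hat):
    """Root mean square error between real and predicted demand."""
    d, d_hat = np.asarray(d, dtype=float), np.asarray(d_hat, dtype=float)
    return float(np.sqrt(np.mean((d - d_hat) ** 2)))

def smape(d, d_hat, c=1.0):
    """Symmetric MAPE with a constant c in the denominator (assumed variant).

    Assumes non-negative demands, so d + d_hat + c is always positive.
    """
    d, d_hat = np.asarray(d, dtype=float), np.asarray(d_hat, dtype=float)
    return float(np.mean(np.abs(d - d_hat) / (d + d_hat + c)))

# Toy example: one exact prediction of zero demand and one error of 2
real = [0.0, 2.0]
pred = [0.0, 4.0]
rmse_value = rmse(real, pred)     # sqrt((0 + 4) / 2)
smape_value = smape(real, pred)   # mean(0 / 1, 2 / 7)
```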

Performance between GCNN-based models
In this section, we compare different GCNN-based models, i.e., six GCNN models with a given adjacency matrix and two data-driven GCNN models. The results are listed in Table 1. The RMSE and sMAPE are calculated on the same test set. In the first block, the first three rows show the results of the GCNN models with normal adjacency matrices, followed by the corresponding GCNN models with weighted adjacency matrices. The second block gives the results for the two data-driven GCNN models. We can also draw several other conclusions. 1) The first block shows that GSdUa, GPdUa, and GEdUa get a lower RMSE than GSdEa, GPdEa, and GEdEa, respectively. In addition, although GPdEa and GPdUa get a similar sMAPE, GSdUa and GEdUa get a much better performance in terms of sMAPE. This means that weighted adjacency matrices can improve on the performance of normal adjacency matrices. 2) The improvement brought by the weighted adjacency matrix depends on how the relationship between sub-regions is defined. 3) Although a weighted adjacency matrix enables GCNN to get a better performance, the DSp model obtains the best performance under RMSE and sMAPE. In addition, we compare the GCNN models with two other prediction models, the Historical Average (HV) and Random Forest (RF).

Historical Average.
This approach predicts future demand as the average value of the same period and the same sub-regions during the past weeks. For example, if it is 8:00 am on Monday, the predicted value would be the average of demands at 8:00 am in the past 5 Mondays.
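A minimal NumPy sketch of this baseline is shown below. The function name, its parameters, and the toy data are hypothetical. With 15-minute intervals, one week corresponds to 7 × 24 × 4 = 672 intervals; the toy example uses a period of 3 only to keep the arithmetic visible.

```python
import numpy as np

INTERVALS_PER_WEEK = 7 * 24 * 4   # 672 intervals of 15 minutes

def historical_average(demand, t, period=INTERVALS_PER_WEEK, n_weeks=5):
    """Predict interval t as the mean of the same slot in the previous n_weeks.

    demand has shape (N, T). Weeks that fall before the start of the data are
    skipped, so fewer than n_weeks values may be averaged.
    """
    idx = [t - k * period for k in range(1, n_weeks + 1) if t - k * period >= 0]
    if not idx:
        raise ValueError("no history available for interval t")
    return demand[:, idx].mean(axis=1)

# Toy series for one sub-region with a period of 3 intervals
toy = np.array([[1.0, 0.0, 0.0, 3.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0]])
pred = historical_average(toy, t=9, period=3)   # averages intervals 6, 3, and 0
```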

Random Forest.
Random forest (RF) is a powerful ensemble learning model that has been widely used in many classification and regression tasks. The method constructs a collection of decision trees during training, and its output is the mean of the predictions of all the individual trees. Compared with other ensemble strategies, the particularity of random forests lies in the process by which the trees are built. Moreover, the obtained trees are not pruned. These models can also be used for estimating variable importance. They are reasonably fast to train and can be easily parallelized if more speed is required.
We first report the prediction performance over the whole test set. Then, we show the prediction performance over the entire city. Finally, we show the performance at all the sub-regions.

Prediction performance over the test set.
In this section, the three models are trained on the same training set, and the inference results are based on the same test set. Instead of showing only the performance on the whole test set, we evaluate the performance for different demand sizes, as reported in Figure 1. The x-axis is the quantile, meaning that only the elements higher than the given quantile are kept in the test set. The two y-axes are RMSE and sMAPE. For example, if x is 0.5, we obtain the partial test set in which all elements are higher than the 0.5 quantile. Then, the RMSE and sMAPE are calculated on this partial test set and the corresponding predicted values.
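The quantile-based evaluation described above can be sketched in NumPy as follows, shown here for RMSE only. The function name and toy values are hypothetical. The strict inequality mirrors the phrase "higher than the given quantile", so the smallest value is excluded even at a quantile of 0.

```python
import numpy as np

def rmse_above_quantile(d, d_hat, q):
    """RMSE restricted to test elements whose real demand exceeds the q-quantile."""
    d, d_hat = np.asarray(d, dtype=float), np.asarray(d_hat, dtype=float)
    threshold = np.quantile(d, q)
    mask = d > threshold
    if not mask.any():
        return float("nan")                   # nothing left above the threshold
    return float(np.sqrt(np.mean((d[mask] - d_hat[mask]) ** 2)))

# Toy test set: the 0.5 quantile of the real demand is 2.5,
# so only the elements 3 and 4 are kept.
real = [1.0, 2.0, 3.0, 4.0]
pred = [1.0, 2.0, 4.0, 6.0]
rmse_high = rmse_above_quantile(real, pred, q=0.5)   # sqrt((1 + 4) / 2)
```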
We can easily find that as real demand increases, the RMSE becomes larger while the sMAPE decreases. DSu always gets the best performance under RMSE (Figure 1a). In terms of sMAPE (see Figure 1b), although DSu does not get the best performance on the whole data set, it gives more accurate predictions when the demand is high. In practice, high-demand areas need much more attention in vehicle dispatching. Thus, predicting demand accurately in the high-demand set is more useful.

Prediction results over the whole day.
Figure 2 reports the prediction performance of the three predictors over the entire city. All methods share some common patterns in both metrics. For instance, RMSE reaches its minimum at about 4 am and its maximum at about 11 pm, and sMAPE reaches its largest value at about 3 am. In terms of RMSE, DSu gets the best performance at most times, but RF gets a better performance under sMAPE. One possible reason is that the test set contains more low demand values throughout the day.

Performance over the area.
Figure 3 reports the prediction performance over space. In Figure 3a, each cell represents the average daily demand in the test set, where red means high demand and green means low demand. The imbalance of traffic demand can be easily observed. The high-demand cells are mainly distributed in the southwestern part of Manhattan. In Figures 3b and 3c, each cell is labeled with the method that gets the best performance according to RMSE and sMAPE, respectively. The two sub-figures show that DSu gets the best performance in most sub-regions. Of the 269 sub-regions, DSu gets the lowest RMSE in 155 and the lowest sMAPE in 145. In addition, most of the cells where DSu performs best are located in the high-demand area. The historical average (HV) also gets the best performance in several sub-regions, but all of these cells are concentrated in the low-demand area.

Conclusions
In this study, we propose a data-driven graph convolutional neural network (DDGCNN) for short-term traffic demand prediction. In addition, three weighted adjacency matrices are provided for the normal graph convolutional neural network. By learning from historical demand patterns, the proposed models can predict traffic demand over the entire city.
To evaluate the proposed models, we compare them with another four graph convolutional neural networks and two typical prediction methods using NYC data. Several findings can be drawn from the results. 1) The weighted adjacency matrix enables the GCNN to get better performance than the normal adjacency matrix. The degree of improvement depends on how the matrices are constructed. 2) The proposed model outperforms the other predictors, especially for high demand. 3) Our DDGCNN gets the best performance at most times of day and in most of the sub-regions.
Although our methods perform well, they still have some limitations. The limitations and the possible improvements are summarized as follows. 1) In our method, we only consider historical demand information, which is not enough to describe all the influencing factors. Future research will add more information to the input, such as weather, land use, and social events. 2) GCNN is good at learning and utilizing spatial dependencies. Time factors are also essential influence factors for traffic demand prediction. In the next step, the recurrent neural network is another type of deep learning model to deal with time-series data, which is worthy of fusing into the