Article Open access Peer-reviewed

Short‐term traffic forecasting using self‐adjusting k‐nearest neighbours

2017; Institution of Engineering and Technology; Volume: 12; Issue: 1; Language: English

10.1049/iet-its.2016.0263

ISSN

1751-9578

Authors

Bin Sun, Wei Cheng, Prashant Goswami, Guohua Bai

Topic(s)

Time Series Analysis and Forecasting

Abstract

Short-term traffic forecasting is becoming more important in intelligent transportation systems. The k-nearest neighbour (kNN) method is widely used for short-term traffic forecasting. However, the self-adjustment of kNN parameters has been a problem due to dynamic traffic characteristics. This study proposes a fully automatic dynamic procedure kNN (DP-kNN) that makes the kNN parameters self-adjustable and robust without predefined models or training for the parameters. A real-world dataset with more than one year of traffic records is used to conduct experiments. The results show that, on average, DP-kNN achieves better accuracy than manually adjusted kNN and other benchmarking methods. This study also discusses the difference between holiday and workday traffic prediction, as well as the usage of neighbour distance measurements.

1 Introduction

Increasing road traffic causes more congestion and accidents, which draw growing attention from authorities and road users due to the severe loss of life and property [1]. The intelligent transportation system (ITS) is expected to provide efficient traffic management and automatic accident detection [2]. Efficient control of traffic flow on motorways or freeways can lead to shorter travel times, fewer pollutant emissions, and increased road safety [3].
Efficient traffic management and accident detection rely on reliable and accurate short-term traffic forecasting [4]. Since the transportation system and traffic flow result from the complex interplay of several aspects, including people, vehicles, roads, environment and information [5], predicting the short-term traffic situation for the ITS is a complex task, which has been an active research subject in the past few decades [6]. One frequently used method is k-nearest neighbour (kNN) based prediction [7], but the self-adjustment of kNN parameters has been a problem due to dynamic traffic characteristics. The calibration or training of kNN parameters may discourage traffic management centres from using the algorithm [8]. We aim to make kNN parameters fully self-adjustable for traffic prediction, to increase accuracy and to reduce human intervention. We choose kNN because of the substantial increase in data availability [6] and its flexibility in solving non-linear problems [9]. To obtain self-adjusting parameters, we develop the dynamic procedure kNN (DP-kNN). DP-kNN is inspired by parameter space searching combined with kNN robustness [10]. When kNN parameters change, the prediction accuracy also changes. To get more accurate results, ensemble methods can be used to aggregate several prediction results [11]. The results show that DP-kNN can improve prediction accuracy compared with other methods, especially during holidays.

2 Related work

Many studies have been conducted on short-term traffic forecasting. The existing methods can be divided into two categories: parametric and non-parametric methods [6, 12, 13]. Typical parametric methods are algorithms designed using the seasonal autoregressive integrated moving average (SARIMA) [14]. Traditional subjective methods to choose values for the SARIMA parameters include the usage of the autocorrelation function, partial autocorrelation function and extended autocorrelation function. Objective methods include the Akaike information criterion and the Bayesian information criterion, which can be used to automatise SARIMA [15]. Previous studies have used SARIMA as a benchmarking method but did not reach the same conclusion when comparing it with kNN [16-20]. Thus, we compare the automatised SARIMA with our proposed algorithm. For non-parametric methods, most work focuses on neural networks [21-24] and pattern searching [12, 19, 25, 26], besides other techniques, for example, the Bayesian hierarchical model [27]. Within the pattern searching subcategory, one of the most popular algorithms is kNN. Some researchers attempted to improve this method for traffic forecasting from different aspects, such as the spatial aspect [26], the toll data aspect [28] and the complicated mathematical model aspect [29], among others.

2.1 Adaptive kNN methods

One main parameter of kNN is the number of neighbours (k). If k is optimised, kNN can perform better [30]. There are many studies regarding the optimisation of k. Ling et al. used a gap in the distance sequence to select neighbours, where the gap implies the inherent boundary of a small region centred at the new query instance; the data within that region are believed to be densely gathered [31]. He et al. used an algorithm to adjust k from 1 to 10 according to the noise level (cache miss level) during queries [32], where more noise leads to a bigger k. They investigated kNN from the database performance aspect.
Zhang and Song trained artificial neural networks (ANN) to map from dataset characteristics to a suitable k value for classification [33]. Hong et al. optimised k by merging different values from similar sites and devices [34]. Dell et al. considered flow with dynamic characteristics during different times of the day [20]. Singh et al. considered time series with a piecewise linear nature and updated k continuously [35]. Some studies also tried to optimise the selection of nearest neighbours from the searching strategy aspect [36, 37]. Some research (including [16, 38]) considered that the search step length (h) should be large enough to represent a current state, where D is the number of input data dimensions, while other studies considered that the best performance can be achieved for certain values of h [39]. Some studies investigated both k and h. For example, Chang et al. [40] worked on both parameters, but the parameters were not optimised at the same time. Yoon and Chang [16] updated k periodically (daily, weekly, monthly) and concluded that k and h should be optimised simultaneously. Some researchers used a part of the data as an estimation dataset to evaluate the parameter combinations and used the best combination as the optimal settings [41]. This is part of their Pattern-kNN algorithm, which overcomes the memoryless issue in kNN regression. The Pattern-kNN algorithm is similar to our idea and can handle missing data. Thus, it is used as a benchmarking algorithm in this work, together with SARIMA. For different traffic data, it is necessary to choose suitable parameter values that differ from case to case and from time to time. Three typical methods were used in previous work to make kNN adaptive: pre-defined models (for instance, holiday versus workday, or free flow versus saturated flow), calibrated/trained models [9, 17], and manual trial-and-error [20]. Based on the literature review, we found that fully automatic kNN parameter self-adjustment is lacking. Thus, we introduce a fully automatic dynamic procedure to choose parameters and improve robustness without extra information beyond the traffic data itself.

2.2 Distance measurement

The distance measurement used during neighbour selection should be considered for multivariate data. Among many distance measurements, Euclidean distance [42] is a widely used, traditional and ordinary distance for finding nearest neighbours [40, 43-45]. It is easy to understand, easy to implement and fast to calculate [46], but it cannot represent a concept-related distance, as it does not consider the shape of the distribution (scatter) [47]. Mahalanobis distance (M-distance) [48] shows advantages when handling multivariate problems. Compared with conventional thresholds, Fig. 1 from [49] shows that M-distance is suitable for multivariate outlier detection and can provide better thresholds by considering the shape of the distribution. However, M-distance is rarely used in traffic data analysis. One possible reason is that few researchers consider more than one metric together, so the multivariate-oriented M-distance was not necessary in previous work. Clark proposed using a multivariate distance for this, employing all three fundamental traffic metrics (flow rate, speed, and occupancy) with a traditional Euclidean distance [3]. In this work, M-distance [48] based analysis is compared with Euclidean distance.
Fig. 1: Comparison of different distances. A, the dotted square is the threshold considering each metric separately. B, a threshold considering two metrics together. C, the M-distance threshold considers two metrics together as well as the shape of the distribution [49].

3 Methodology

In this section, we introduce the basic kNN algorithm and describe DP-kNN, whose parameters are self-adjusting. An overview of the algorithm logic is shown in Fig. 2. In summary, DP-kNN contains four stages. The pre-processing stage generates parameter options automatically according to the size of the history data. DP-kNN Level-1 evaluates all pairs of parameters using normal kNN. The second level evaluates combinations of the pairs and balances the weights between the flow rate measurement and the speed measurement. According to those evaluations, the final stage can use the optimal settings to conduct dynamic predictions. Level-1 and Level-2 need the parameters from pre-processing. Level-2 depends on the results from Level-1. The prediction stage needs to check the evaluations from both Level-1 and Level-2. Later in this section, we present an application of the M-distance measurement to multivariate difference traffic data.

Fig. 2: DP-kNN algorithm logic overview. Every newly added record (time point) triggers DP-kNN to go through the four stages. The pre-processing stage generates options for parameters, which are evaluated in the first and second levels. The prediction stage uses the evaluations to get the final predictions.

3.1 Features and metrics selection

Flow rate, speed, and density are the three fundamental metrics in traffic engineering [5]. It is possible to calculate one metric given the other two, so two metrics should be considered at the same time. Previous research prefers to consider only one of them within the time domain, usually the flow rate or the speed. We compare the usage of only the flow rate with the usage of both the flow rate and the speed within the time domain.

3.2 Basic kNN and problem settings

A basic kNN algorithm can be described as follows. Suppose that there is time series data until, but not including, the current time $t$. New data arrives at every time interval, which differs among systems and can be 1, 5 or 15 min and so on. The system interval is also the unit for the search step length (h) and the predict step length (m). If we consider one time interval as one basic time unit, the data is a time series with data on time points $1, 2, \ldots, t-1$. To predict the traffic, three steps are needed: constructing the state vector, selecting the nearest neighbours, and conducting the prediction according to the selected nearest neighbours' futures (yellow parts after the dashed squares in Fig. 3). Firstly, we select the latest data to construct the state vector or matrix. Here we use h as the search step length, which means the h data points backwards from the current time, i.e. from $t-h$ until $t-1$, are selected as a new query in the kNN algorithm. The step length is the number of intervals. The search step length is also known as a window in some research areas. To distinguish it from another definition used with a shifting search step length, we will use the phrase search step length in this work. If the traffic at time $t$ is to be predicted, the state vector of the flow rate is

(1) $R_t = [r_{t-h}, r_{t-h+1}, \ldots, r_{t-1}]$

Similarly, the state vector for the speed is

(2) $S_t = [s_{t-h}, s_{t-h+1}, \ldots, s_{t-1}]$

For the distance measurement which considers two variables, the state matrix is

(3) $X_t = \begin{bmatrix} r_{t-h} & r_{t-h+1} & \cdots & r_{t-1} \\ s_{t-h} & s_{t-h+1} & \cdots & s_{t-1} \end{bmatrix}$

Secondly, we compare the latest data (state vector/matrix) with the same time range for every previous day, say the time series from $t-h$ to $t-1$ of day $d-1$ (yesterday), from $t-h$ to $t-1$ of day $d-2$ (the day before yesterday), and so on. By comparison, we mean the calculation of the distance between the latest data and each previous day's data. The k previous days with the smallest distances are the k nearest neighbours. For instance, in Fig. 3, the shadowed parts of the time sequence are selected as nearest neighbours.

Fig. 3: Traffic prediction for time t using basic kNN. History data is available until and including the time point $t-1$. The search step length (lag) is h. Dashed white squares are the selected nearest neighbours.

Finally, the values at time point $t$ of all the nearest neighbours' corresponding days are averaged. The result is the prediction. The whole basic kNN algorithm is shown in Fig. 3. Although the basic kNN can be used, traffic flow has varied and non-stationary statistical characteristics, so kNN needs different (k, h) pairs under different situations. As shown later, manual adjustment can hardly give the best solution. DP-kNN provides a way to adjust k and h fully automatically and to ensemble their results for the system.
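To make the three steps concrete, below is a minimal sketch of the basic kNN procedure. It is illustrative only and not the authors' implementation: the array layout, the function name `basic_knn_predict` and the default parameter values are our assumptions, and it uses a plain Euclidean distance on the flow rate state vector.

```python
import numpy as np

def basic_knn_predict(flow, day, t, k=8, h=4):
    """Predict flow[day, t] with basic kNN.

    flow: 2-D array where flow[d, i] is the flow rate of day d at interval i.
    The query is the h intervals before t on the current day; the candidates
    are the same interval range on every previous day.
    """
    query = flow[day, t - h:t]                          # state vector, eq. (1)
    dists = [np.linalg.norm(query - flow[d, t - h:t])   # distance to each previous day
             for d in range(day)]
    nearest = np.argsort(dists)[:k]                     # k smallest distances
    return flow[nearest, t].mean()                      # average the neighbours' futures

# Tiny demo on synthetic data: 30 days of 288 five-minute intervals.
rng = np.random.default_rng(0)
flow = 100 + 20 * np.sin(np.linspace(0, 2 * np.pi, 288)) + rng.normal(0, 5, (30, 288))
print(basic_knn_predict(flow, day=29, t=100))
```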
3.3 DP-kNN pre-processing

When traffic data is provided to DP-kNN, or a new record is appended to the existing data, the pre-processing phase first generates the candidate values to be assigned to the DP-kNN parameters. The principles and procedures for generating the possible parameter values are as follows. Most parameters depend directly or indirectly on the number of history instances ($n$), so exponential growth is suitable. The number of neighbours starts at k = 2 and grows exponentially with power two (doubling each time) until k covers half of the potential neighbours. The search step length starts at h = 2 and grows exponentially with power two until h covers half of a day. Then the set of pairs $P$ is generated as all possible pairs of k and h, i.e. the Cartesian product of the two candidate sets. On the second level of DP-kNN, the number of pairs to choose ($k_2$) starts at 2 and grows exponentially with power two until $k_2$ covers half of the items in $P$. For the speed error weighting $w_s$, as its range is from 0 to 1 and it is not related to the number of history instances, an arithmetic sequence 0, 0.2, ..., 1 is used. These principles aim to cover a larger parameter space with fewer samples. Beyond the parameters, the windows used for the ensemble and aggregation of previous evaluations should cover the last 1 h. The reason for covering 1 h is as follows. Short-term prediction is to predict future traffic not more than 1 h ahead [50], so traffic data can be divided hourly and hypothesis testing can be used to check differences among hourly traffic datasets. Previous work has shown that, according to the p-values, 1 h before a time point is the maximum safe range for getting similar traffic [51].
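The generating principles above can be sketched as follows. This is a minimal illustration, assuming daily-indexed data; the names `doubling_options` and `generate_options` are ours, not the paper's.

```python
def doubling_options(start, limit):
    """Candidate values start, 2*start, 4*start, ... up to limit."""
    vals = []
    v = start
    while v <= limit:
        vals.append(v)
        v *= 2
    return vals

def generate_options(n_days, intervals_per_day):
    K = doubling_options(2, n_days // 2)             # k: up to half the potential neighbours
    H = doubling_options(2, intervals_per_day // 2)  # h: up to half a day
    P = [(k, h) for k in K for h in H]               # Cartesian product of k and h
    K2 = doubling_options(2, len(P) // 2)            # k2: number of pairs to aggregate
    W = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]               # speed error weighting w_s
    return K, H, P, K2, W

# e.g. roughly 400 days of history at a 5-min interval (288 records per day)
K, H, P, K2, W = generate_options(400, 288)
```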
3.4 DP-kNN level 1

The first level of DP-kNN evaluates different pairs of kNN parameters. Another kNN-like procedure (the second level) is used to decide which parameter pairs from the first level should be aggregated, and how. To predict the traffic at time $t$, we find several best (k, h) pairs according to the recent history and use the averaged results of those pairs to reduce the influence of variance. The content below explains how to predict the flow rate and speed at time $t$. We have two sets of available values of k and h: $K$, which contains $n_k$ values, and $H$, which contains $n_h$ values. We need a measurement of how well the (k, h) pairs work. Firstly, there is a set of the pairs

(4) $P = K \times H = \{p_1, p_2, \ldots, p_{n_p}\}$

where $n_p = n_k n_h$ and $P$ contains all pairs of k and h. For each $p_i$, we use the paired values to set up kNN and make a prediction. After getting the predicted data, we can measure the flow rate prediction error $e_r$ for each $p_i$ at time $t$

(5) $e_r(p_i, t) = f(\hat{r}(p_i, t), r_t)$

where $\hat{r}(p_i, t)$ is the flow rate at time $t$ predicted using $p_i$, and $f$ is the prediction error measurement function. The same measurement is used for the speed prediction error $e_s$

(6) $e_s(p_i, t) = f(\hat{s}(p_i, t), s_t)$

where $\hat{s}(p_i, t)$ is the speed at time $t$ predicted using $p_i$. A typical error measurement is the mean absolute error (MAE), which is used here. Another widely used measurement is the mean absolute percentage error, which is not always suitable because the flow rate can sometimes be zero, especially during nights. The measurement of the flow rate prediction error from time $t+1$ till time $t+q$ ($q$ instances) is

(7) $E_r = \frac{1}{q}\sum_{j=1}^{q} \lvert \hat{r}_{t+j} - r_{t+j} \rvert$

where $\hat{r}$ is the predicted flow rate and $r$ is the real flow rate, respectively. Then, we calculate the averaged measurements of each $p_i$ over the latest $n_q$ intervals for the flow rate and speed

(8) $E_r(p_i, t) = \frac{1}{n_q}\sum_{j=1}^{n_q} e_r(p_i, t-j)$

(9) $E_s(p_i, t) = \frac{1}{n_q}\sum_{j=1}^{n_q} e_s(p_i, t-j)$

where $n_q$ is the number of instances per hour. Now we construct a matrix using $p$, $E_r$ and $E_s$ as columns, giving the Level-1 DP-kNN evaluation result matrix (before normalisation)

(10) $M_1 = [\, p \;\; E_r \;\; E_s \,]$

To reduce the influence of random error, we will use several different pairs $p$ to make the prediction at time $t$ and calculate the average. However, the new challenge now is to determine how many pairs to choose (i.e. to choose the value of $k_2$) and to define the goodness metric of $p$. We first describe the normalisation for the flow rate and speed, and then introduce the weight $w_s$ to obtain a weighted measurement. To give a fair influence to both the flow rate and the speed, we need to unify the measurements (normalisation) in $M_1$

(11) $\bar{E}_r = E_r / u_r$

(12) $\bar{E}_s = E_s / u_s$

where $u$ is the unifying factor, $\bar{E}_r$ is the unified $E_r$, and so on. $\bar{M}_1 = [\, p \;\; \bar{E}_r \;\; \bar{E}_s \,]$ is the final Level-1 DP-kNN evaluation result matrix with normalisation. Now, two new parameters compose the second level of our DP-kNN algorithm: $k_2$ and $w_s$.
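A minimal sketch of the Level-1 evaluation is given below, assuming MAE as the error function $f$ and mean-normalisation as the unifying factor $u$ (the text does not fix a specific $u$, so that choice is ours). The helper `dp_knn_flow_prediction` mirrors the weighted sorting used later by Level-2 and the prediction stage, with $k_2$ and $w_s$ supplied directly rather than chosen by Level-2; all names are illustrative.

```python
import numpy as np

def mae(pred, real):
    """Mean absolute error, the f in eqs (5)-(6)."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(real))))

def level1_evaluate(predict_fn, pairs, flow, speed, day, t, n_q=12):
    """Score every (k, h) pair by its recent errors, eqs (8)-(12).

    predict_fn(series, day, t, k, h) -> prediction of series[day, t],
    e.g. the basic_knn_predict sketch above. Returns the normalised
    Level-1 evaluation matrix with rows (k, h, E_r_bar, E_s_bar).
    """
    rows = []
    for k, h in pairs:
        past = range(1, n_q + 1)                    # the latest n_q intervals
        e_r = mae([predict_fn(flow, day, t - j, k, h) for j in past],
                  [flow[day, t - j] for j in past])
        e_s = mae([predict_fn(speed, day, t - j, k, h) for j in past],
                  [speed[day, t - j] for j in past])
        rows.append((k, h, e_r, e_s))
    M1 = np.array(rows, dtype=float)
    M1[:, 2] /= M1[:, 2].mean()                     # unify flow errors, eq. (11)
    M1[:, 3] /= M1[:, 3].mean()                     # unify speed errors, eq. (12)
    return M1

def dp_knn_flow_prediction(M1, predict_fn, flow, day, t, k2, w_s):
    """Average the predictions of the k2 pairs with the best weighted error."""
    weighted = (1 - w_s) * M1[:, 2] + w_s * M1[:, 3]  # weighted measurement
    best = np.argsort(weighted)[:k2]                  # best k2 (k, h) pairs
    preds = [predict_fn(flow, day, t, int(k), int(h)) for k, h in M1[best, :2]]
    return float(np.mean(preds))
```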
3.5 DP-kNN level 2

For notation, suppose the possible values of $k_2$ and $w_s$ are in two sets: $K_2$, which contains $n_{k_2}$ values, and $W$, which contains $n_w$ values. We can then define a set containing all ($k_2$, $w_s$) pairs as

(13) $G = K_2 \times W = \{g_1, g_2, \ldots, g_{n_g}\}$

where $n_g = n_{k_2} n_w$ and $G$ contains all pairs of ($k_2$, $w_s$). Given a specific time point $t$, similarly to the Level-1 pairs $p_i$, we measure the performance of each $g_j$. Each pair $g_j$ has its own specific values of $k_2$ and $w_s$, which need to be distinguished from those of the other pairs. Thus, a specific pair and its elements can be noted as

(14) $g_j = (k_2^{(j)}, w_s^{(j)})$

After noting those elements, the evaluation of $g_j$ begins. For the time point $t$, we add the weighted measurement as a new column into $\bar{M}_1$, which becomes

(15) $\bar{M}_1' = [\, p \;\; \bar{E}_r \;\; \bar{E}_s \;\; (1 - w_s^{(j)})\bar{E}_r + w_s^{(j)}\bar{E}_s \,]$

The rows in $\bar{M}_1'$ are then sorted by the last column (i.e. the weighted measurement) in increasing order. Mathematically speaking, a row permutation is applied to $\bar{M}_1'$ so that the values in the last column are monotonically non-strictly increasing with the row index. For the sake of convenience, the symbolic representation of the matrix is not changed. After the permutation, each $p$ from the first row until row $k_2^{(j)}$ is used as the basic kNN parameters to predict the flow rate and speed. The results of those $p$ are averaged to get the predictions $\hat{r}^{(2)}(g_j, t)$ and $\hat{s}^{(2)}(g_j, t)$. Thus, the measurements of the DP-kNN Level-2 flow rate prediction error and speed prediction error of pair $g_j$ at time $t$ are

(16) $e_r^{(2)}(g_j, t) = f(\hat{r}^{(2)}(g_j, t), r_t)$

(17) $e_s^{(2)}(g_j, t) = f(\hat{s}^{(2)}(g_j, t), s_t)$

Now, we will choose the best $g$ (a pair of $k_2$ and $w_s$). Similar to the first level of DP-kNN, we need to calculate the average performance of each $g_j$ over the last $n_q$ intervals

(18) $E_r^{(2)}(g_j, t) = \frac{1}{n_q}\sum_{i=1}^{n_q} e_r^{(2)}(g_j, t-i)$

(19) $E_s^{(2)}(g_j, t) = \frac{1}{n_q}\sum_{i=1}^{n_q} e_s^{(2)}(g_j, t-i)$

We can consider $g$, $E_r^{(2)}$ and $E_s^{(2)}$ as columns to construct the Level-2 DP-kNN evaluation result matrix

(20) $M_2 = [\, g \;\; E_r^{(2)} \;\; E_s^{(2)} \,]$

This gives us the evaluation of all $g_j$. The Level-2 evaluation result ($M_2$) can then be used together with the Level-1 evaluation result ($\bar{M}_1$) for parameter selection during prediction.

3.6 DP-kNN prediction

To predict the flow rate at time $t$, the following steps are taken. $M_2$ is sorted by the flow rate prediction error column, i.e. a row permutation is applied to $M_2$ so that the values in that column are monotonically non-strictly increasing. The first row of the sorted $M_2$ indicates the least prediction error. The $g$ in the first row is selected as the best choice and noted as $g^*$; the corresponding $k_2$ and $w_s$ are the best choices and noted as $k_2^*$ and $w_s^*$. By using $w_s^*$ to weight the columns $\bar{E}_r$ and $\bar{E}_s$, a new column is added into $\bar{M}_1$ to become

(21) $\bar{M}_1^* = [\, p \;\; \bar{E}_r \;\; \bar{E}_s \;\; (1 - w_s^*)\bar{E}_r + w_s^*\bar{E}_s \,]$

Lastly, after sorting by the new column, each p (i.e. each (k, h) pair) from the first row until row $k_2^*$ is used to predict the flow rate at time $t$. By averaging the results, we get the final prediction of the flow rate. For predicting the speed at time $t$, a similar procedure is used. The only difference is that the row permutation is applied to $M_2$ so that the values in the speed prediction error column are monotonically non-strictly increasing.

3.7 Difference Mahalanobis distance

As there are many ways of applying the M-distance, here is a brief description of the M-distance and how it is used with difference data. Suppose $i$ ($1 \le i \le n$) is the instance index and $j$ ($1 \le j \le v$) is the variable index in dataset $X$ (with elements $x_{ij}$) that contains $n$ observations of $v$ variables. The covariance between the $j_1$-th variable and the $j_2$-th variable is

(22) $c_{j_1 j_2} = E[(x_{j_1} - \mu_{j_1})(x_{j_2} - \mu_{j_2})]$

where $\mu_{j_1}$ and $\mu_{j_2}$ are the variables' expected values, respectively. Thus, the covariance matrix of dataset $X$ can be expressed as a $v$-by-$v$ matrix $C$ with elements $c_{j_1 j_2}$. This is calculated from the history data. Finally, the M-distance to the centroid is the distance between instance $x_i$ and the centroid $\mu$

(23) $\mathrm{MD}(x_i) = \sqrt{(x_i - \mu)\, C^{-1} (x_i - \mu)^{\mathrm{T}}}$

Although the M-distance can also be used to measure the distance between any two points, it stands for the M-distance to the centroid in our work. A natural property of the traffic time sequence is its difference characteristic. This was noticed in [52], where the first derivative of the speed was also used, but in a different way. We can get one difference instance from two consecutive instances. That is, a difference instance is the result of subtracting from one data instance its previous neighbour in a consecutive time series. If we note an arbitrary instance with both the flow rate and speed as $x_t = (r_t, s_t)$, we can get the difference data as

(24) $\Delta x_t = x_t - x_{t-1}$

As there are originally $n$ instances in $X$, we can construct the difference dataset $\Delta X$, which has $n-1$ difference instances. Though the covariance matrix is calculated hourly, the algorithm does not need to know the exact time of day. Thus, the calculation is still automatic. One problem during the distance measurement is the curse of dimensionality, which is caused by the sparse sampling space due to big values of h [53]. To avoid this problem, we treat the time domain as one dimension instead of multiple dimensions by averaging the distance values over all search steps. Thus, the final distance of the state vector (or matrix) at time $t$, when considering an arbitrary day $d$, is

(25) $D_d(t) = \frac{1}{h}\sum_{j=1}^{h} \mathrm{MD}\!\left(x_{t-j},\, x_{t-j}^{(d)}\right)$

where MD is the M-distance. This equation is also applied to the other distance measurements.
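Below is one plausible reading of the difference M-distance as a sketch, assuming the covariance is estimated from the difference dataset and that per-step distances between corresponding instances are averaged as in eq. (25); the function name and array layout are ours, not the paper's.

```python
import numpy as np

def difference_m_distance(query, candidate, history):
    """Averaged M-distance between two state matrices on difference data.

    query, candidate: arrays of shape (h, 2) with h >= 2, rows are
    (flow rate, speed) instances of the search window. history: array of
    shape (n, 2) from which the covariance of the difference data is estimated.
    """
    diff_hist = np.diff(history, axis=0)        # difference dataset, eq. (24)
    cov_inv = np.linalg.inv(np.cov(diff_hist, rowvar=False))
    dq = np.diff(query, axis=0)                 # difference instances of the query
    dc = np.diff(candidate, axis=0)             # ... and of the candidate day
    # per-step M-distance between corresponding instances, averaged over the
    # search steps so the time domain stays one-dimensional, eq. (25)
    steps = [np.sqrt((a - b) @ cov_inv @ (a - b)) for a, b in zip(dq, dc)]
    return float(np.mean(steps))
```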
The whole DP-kNN algorithm is illustrated in the pseudo-code of Algorithm 1 (see Fig. 4), which can be considered a detailed description of Fig. 2. Note that the cache is a part of random access memory that stores many different matrices and values. The time point indicators (subscripts) should be checked carefully during the implementation. Recent measurement and evaluation results are temporarily cached so that DP-kNN does not need to recalculate them. Due to the usage of the cache, the pseudo-code for the implementation looks a little different from the mathematical description, but the logic is the same.

Fig. 4: Algorithm 1, DP-kNN (pseudo-code).

4 Experiments

4.1 Data collection

The data is collected by the traffic management centre of Yunnan province, China. The data from April 2013 to May 2014 of one monitoring device is cleaned using a tool from [51]. The road usually carries under-saturated flow except at holiday noons. A new data record is generated every 5 min, i.e. the interval is 5 min, and one predict step (m = 1) is equivalent to 5 min. Each record contains an identifier and some statistically measured values. Among those attributes, we select the total number of vehicles passing the monitoring point (flow rate) and the average speed of those vehicles (speed), as mentioned in Section 3.1.

4.2 Experimental design

During neighbour searching, any neighbour is discarded if it overlaps with the query instance's search step length, predict step length or any possible evaluation results. Thus, the error is the out-of-sample test error. Besides, if more than ten per cent of the data in the search or prediction steps is missing or influenced by events according to the authority's information, the results are ignored during analysis. As SARIMA processes and predicts one-dimensional data, to compare with DP-kNN, we use two types of Euclidean distance. The phrase Euclidean distance, in our case, is used for prediction using only the flow rate (the errors become absolute values), and E2d for prediction using both the flow rate and speed. As DP-kNN uses 1 h of data to find the parameter settings, the same data is used as an estimation set (as defined in [41]) for the Pattern-kNN. The experiment environment for the pre-processing stage is Matlab R2016b. DP-kNN Level-1 is conducted in CUDA version 8.0 (runtime version 7.5) [54] on a graphics processing unit (GPU), an Nvidia Titan X Pascal. The second level of DP-kNN and the experiment result analysis are done using the R programming language version 3.3.3 and RStudio version 1.0.143, running in Windows 10 on an Intel Core i7-6850K with 32 GB memory.

5 Results

We plot the results for holidays and workdays separately, for analysis purposes only, because they have different characteristics. As the results contain quite a lot of data, only part of the results is plotted. Only the flow rate prediction is shown, because the speed prediction results have similar trends and SARIMA has only the flow rate output. To plot two-dimensional figures, k = 8 and h = 4 are used while changing the other parameter. This can be considered as slicing the multi-dimensional result data. The patterns of the data beyond those plots do not change much unless otherwise mentioned.
The parameters in DP-kNN are not fixed but self-adjusted and aggregated, so its values are displayed as straight horizontal lines. According to the generating principles in the DP-kNN pre-processing, the potential values for the parameters are automatically generated: the candidate sets $K$ and $H$, their Cartesian product $P$, and the candidate sets $K_2$ and $W$. In addition to the parameters, $n_q = 12$ is used to cover 1 h, as the interval of the data is 5 min.

5.1 Influence of k and h

The plotted search step length h is 4, which means that the search time is 20 min. Fig. 5 shows the comparison between DP-kNN and manually tuned kNN on holidays and workdays. The distance is E2d; the other distances show similar patterns. On holidays, DP-kNN performs well and can touch the curve's minimum or go below it, depending on h. On workdays, DP-kNN is also near the minimum but cannot go below the curve's minimum. Compared with the holiday curve, the workday manual-tuning curve has its turning point at a higher k value.

Fig. 5: Influence of the number of nearest neighbours k on performance (MAE), when the search step length is h = 4 (20 min). The two horizontal lines are DP-kNN results from self-adjusted parameters. The horizontal lines are near the minimum of the manually adjusted kNN curves. In this plot, DP-kNN is below all the MAE values on the manually adjusted kNN's error curve on holidays. This means the manual adjustment of k cannot beat DP-kNN due to the very small h.

The influence of the search step length h is shown in Fig. 6. The patterns of the influence of h are similar to those of k, and the workday curve also has its turning point at a higher value.

Fig. 6: Influence of the search step length h on performance (error), with the number of nearest neighbours k = 8. One step (h = 1) is equivalent to 5 min. The two horizontal lines are DP-kNN results from self-adjusting parameters. The horizontal lines are near the minimum of the manually adjusted kNN curves.

Reference(s)