Peer-Reviewed Article

Novel training algorithms for long short‐term memory neural network

2019; Institution of Engineering and Technology; Volume: 13; Issue: 3; pp. 304–308; Language: English

10.1049/iet-spr.2018.5240

ISSN

1751-9683

Authors

Xiaodong Li, Changjun Yu (corresponding author, yuchangjun@hit.edu.cn), Fulin Su, Taifan Quan, Xuguang Yang (Department of Information and Communication Engineering, Harbin Institute of Technology, Harbin and Weihai, People's Republic of China)

Topic(s)

Target Tracking and Data Fusion in Sensor Networks

Abstract

Owing to the enormous potential of the long short-term memory (LSTM) neural network in various fields, several efficient training algorithms have recently been developed, including the extended Kalman filter (EKF)-based training algorithm and the particle filter (PF)-based training algorithm. However, if the system is highly non-linear, the linearisation employed in the EKF may cause instability. Moreover, the PF usually suffers from particle degeneracy, so the PF-based training algorithm may find only a poor local optimum. To solve these problems, an unscented Kalman filter (UKF)-based training algorithm is proposed. The UKF employs a deterministic sampling method; hence it involves no linearisation and does not suffer from the degeneracy problem.
Moreover, the computational complexity of the UKF is of the same order as that of the EKF. To further reduce the computational complexity, the authors propose a minimum norm UKF (MN-UKF) that obtains a good trade-off between performance and complexity. To the best of the authors' knowledge, this is the first reported solution to this problem. Simulations using both benchmark synthetic signals and a real-world signal illustrate the potential of the developed algorithms.

1 Introduction

Thanks to its inherent memory, the recurrent neural network (RNN) is better suited to the time series forecasting problem than the feedforward neural network (FNN). However, a traditional RNN can only model short-term dependences and usually suffers from exploding and vanishing gradients [1]. To deal with these problems, the long short-term memory (LSTM) neural network was developed [2]. By using special memory blocks, the LSTM can model time series with both short-term and long-term dependences. The LSTM has been widely employed in various fields, including residential load forecasting [3], traffic speed prediction [4], and scene labelling [5]. Here, we focus in particular on the time series forecasting problem.

Traditionally, the LSTM is trained using the stochastic gradient descent (SGD) algorithm. However, the SGD algorithm only utilises first-order derivative information, so it can be expected to be inefficient when the optimisation problem is ill-conditioned. To address this issue, several efficient algorithms have been proposed to train the LSTM, such as the extended Kalman filter (EKF)-based training algorithm [6] and the particle filter (PF)-based training algorithm [7, 8]. Since the EKF-based training algorithm utilises second-order derivative information, it achieves a faster convergence rate and better performance than common gradient descent algorithms [9, 10]. However, the EKF requires a linearisation of the observation equation and, consequently, the computation of a Jacobian matrix. If the dynamic system is highly non-linear, the EKF-based training algorithm may be unstable and perform poorly [11].

More recently, the PF-based training algorithm has been developed. The PF employs the sequential Monte Carlo methodology to solve the non-linear filtering problem and is suitable for non-linear dynamic systems with any non-linearity or distribution [12, 13]. Owing to the requirement of importance sampling, a suitable importance function must be designed before using a PF (see [14] for details). Traditionally, the prior importance function is employed for simplicity [7], but with this choice the PF may be inefficient and perform poorly [14]. In general, the PF suffers from particle degeneracy, so a resampling step must be employed; resampling, however, may cause a loss of diversity and thereby a failure to determine the parameters [15]. Furthermore, to achieve satisfactory performance, a large number of particles must be employed, leading to expensive computation.

In view of the limitations of the EKF-based and PF-based training algorithms, an unscented Kalman filter (UKF)-based training algorithm is proposed in this paper. Since the UKF employs a deterministic sampling methodology that handles the non-linear parameter estimation problem without any linearisation, it achieves better performance than the EKF [16]. Moreover, the UKF does not suffer from particle degeneracy, and its computational complexity is usually lower than that of the PF. Consequently, the UKF-based training algorithm can be expected to be a better choice for the LSTM. Since the computational complexity of the UKF may still be unacceptable for some specific applications, a minimum norm UKF (MN-UKF) is also proposed; users can choose a configuration of the MN-UKF that yields a good trade-off between performance and complexity for their application. Note that all of the training algorithms are suitable for both batch and online training; we focus on batch training in this paper. Simulations employing benchmark synthetic time series and real-world data demonstrate the advantages of the proposed algorithms.

2 LSTM network

Ever since the LSTM was proposed, several modified LSTMs have been developed. The LSTM employed in this paper is a widely used version [17]. Fig. 1 shows the architecture of a cell in the LSTM. The state dynamics equations of the LSTM can be written as

\[
\begin{aligned}
i_k &= \sigma(W_{xi} x_k + W_{hi} h_{k-1} + b_i)\\
f_k &= \sigma(W_{xf} x_k + W_{hf} h_{k-1} + b_f)\\
o_k &= \sigma(W_{xo} x_k + W_{ho} h_{k-1} + b_o)\\
c_k &= f_k \odot c_{k-1} + i_k \odot \tanh(W_{xc} x_k + W_{hc} h_{k-1} + b_c)\\
h_k &= o_k \odot \tanh(c_k)\\
y_k &= \varphi(W_{xy} x_k + W_{hy} h_k + b_y)
\end{aligned}
\tag{1}
\]

where \(i_k\), \(f_k\), and \(o_k\) are the input gate activation, forget gate activation, and output gate activation at time \(k\), respectively; \(x_k\) is the input vector of the network, \(h_k\) is the cell output vector, and \(y_k\) is the output vector of the network; \(\sigma(\cdot)\) is the standard logistic sigmoid function, \(\varphi(\cdot)\) is the output activation, and \(\odot\) is the element-wise product between two vectors. The weight matrices for the input vector are \(W_{xi}\), \(W_{xf}\), \(W_{xo}\), \(W_{xc}\), and \(W_{xy}\); the weight matrices for the cell output vector are \(W_{hi}\), \(W_{hf}\), \(W_{ho}\), \(W_{hc}\), and \(W_{hy}\); and \(b_i\), \(b_f\), \(b_o\), \(b_c\), and \(b_y\) are the bias vectors.

Fig. 1: Architecture of a cell in the LSTM.
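As an illustration of (1), the following NumPy sketch implements one forward step of the LSTM cell. This is not the authors' code: the variable names, the linear output layer, and the random initialisation are our assumptions for illustration only.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class LSTMCell:
    """Minimal forward pass for the LSTM dynamics in (1).

    Shapes: x is (n_x,), h and c are (n_h,), y is (n_y,). The weights are
    initialised randomly here; in the paper they are the adjustable
    parameters estimated by the training algorithms.
    """
    def __init__(self, n_x, n_h, n_y, seed=0):
        rng = np.random.default_rng(seed)
        def W(rows, cols):
            return 0.1 * rng.standard_normal((rows, cols))
        # input, forget, output gates and cell candidate
        self.Wxi, self.Whi, self.bi = W(n_h, n_x), W(n_h, n_h), np.zeros(n_h)
        self.Wxf, self.Whf, self.bf = W(n_h, n_x), W(n_h, n_h), np.zeros(n_h)
        self.Wxo, self.Who, self.bo = W(n_h, n_x), W(n_h, n_h), np.zeros(n_h)
        self.Wxc, self.Whc, self.bc = W(n_h, n_x), W(n_h, n_h), np.zeros(n_h)
        # network output layer (linear output assumed for prediction tasks)
        self.Wxy, self.Why, self.by = W(n_y, n_x), W(n_y, n_h), np.zeros(n_y)

    def step(self, x, h, c):
        i = sigmoid(self.Wxi @ x + self.Whi @ h + self.bi)   # input gate
        f = sigmoid(self.Wxf @ x + self.Whf @ h + self.bf)   # forget gate
        o = sigmoid(self.Wxo @ x + self.Who @ h + self.bo)   # output gate
        c_new = f * c + i * np.tanh(self.Wxc @ x + self.Whc @ h + self.bc)
        h_new = o * np.tanh(c_new)                           # cell output
        y = self.Wxy @ x + self.Why @ h_new + self.by        # network output
        return y, h_new, c_new
```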
3 Training algorithms for LSTM

To train the LSTM using the filters, a state space model must be introduced. By utilising the state space model, the filters can estimate the parameters of the LSTM recursively. In the following, we present the state space model used in this paper and develop the UKF-based and MN-UKF-based training algorithms.

3.1 State space model for LSTM training

To obtain a state space model, the state dynamics equations of the LSTM are rewritten as

\[
\begin{aligned}
w_k &= w_{k-1} + q_{k-1}\\
y_k &= g(w_k, x_k) + r_k
\end{aligned}
\tag{2}
\]

where \(w_k\) is a collection of the adjustable parameters, namely \(W_{xi}\), \(W_{xf}\), \(W_{xo}\), \(W_{xc}\), \(W_{xy}\), \(W_{hi}\), \(W_{hf}\), \(W_{ho}\), \(W_{hc}\), \(W_{hy}\), \(b_i\), \(b_f\), \(b_o\), \(b_c\), and \(b_y\); \(x_k\) and \(y_k\) are the input and output vectors of the LSTM, respectively; \(g(\cdot)\) denotes the non-linear dynamics of the LSTM; and \(q_k\) and \(r_k\) are the state noise and observation noise, with covariance matrices \(Q\) and \(R\), respectively.

3.2 UKF-based training algorithm

The training procedure of the UKF-based training algorithm is similar to that of the PF-based training algorithm [7]. The UKF-based training algorithm is summarised in Algorithm 1, where \((\sqrt{A})_p\) denotes the \(p\)-th column of the matrix square root of \(A\) (if \(A = BB^{\mathrm T}\), \((\sqrt{A})_p\) is the \(p\)-th column of \(B\)). The weights associated with the sigma points are

\[
W_0^{(m)} = \frac{\lambda}{L+\lambda}, \quad
W_0^{(c)} = \frac{\lambda}{L+\lambda} + (1-\alpha^2+\beta), \quad
W_p^{(m)} = W_p^{(c)} = \frac{1}{2(L+\lambda)}, \quad p = 1, \dots, 2L
\tag{3}
\]

where \(L\) is the length of the state vector \(w_k\), \(\lambda = \alpha^2(L+\kappa) - L\), \(\alpha\) is a small positive constant (e.g. \(10^{-3}\)), \(\kappa\) is usually set to 0, and \(\beta = 2\) is optimal for Gaussian distributions [10].

Algorithm 1: UKF-based training algorithm

Initialise with \(\hat{w}_0\) and \(P_0\). Then, for each training pair \((x_k, d_k)\):

1. Calculate sigma points: \(\chi_{0,k-1} = \hat{w}_{k-1}\) and \(\chi_{p,k-1} = \hat{w}_{k-1} \pm \big(\sqrt{(L+\lambda)(P_{k-1}+Q)}\big)_p\), \(p = 1, \dots, L\).
2. Compute predictions: \(Y_{p,k} = g(\chi_{p,k-1}, x_k)\) and \(\hat{y}_k = \sum_{p=0}^{2L} W_p^{(m)} Y_{p,k}\).
3. Measurement update:
\[
\begin{aligned}
P_{yy} &= \sum_{p=0}^{2L} W_p^{(c)} (Y_{p,k} - \hat{y}_k)(Y_{p,k} - \hat{y}_k)^{\mathrm T} + R\\
P_{wy} &= \sum_{p=0}^{2L} W_p^{(c)} (\chi_{p,k-1} - \hat{w}_{k-1})(Y_{p,k} - \hat{y}_k)^{\mathrm T}\\
K_k &= P_{wy} P_{yy}^{-1}, \qquad \hat{w}_k = \hat{w}_{k-1} + K_k (d_k - \hat{y}_k)\\
P_k &= P_{k-1} + Q - K_k P_{yy} K_k^{\mathrm T}
\end{aligned}
\]
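As a complement to Algorithm 1, here is a minimal NumPy sketch of one UKF training step for the random-walk parameter model (2), following the standard parameter-estimation form in [10]. The function name and the plain (non-square-root) formulation are our assumptions, not the authors' implementation; a production version would guard the Cholesky factorisation numerically.

```python
import numpy as np

def ukf_train_step(w, P, g, x, d, Q, R, alpha=1e-3, beta=2.0, kappa=0.0):
    """One UKF measurement update for the parameter model (2).

    w : (L,) current parameter estimate, P : (L, L) error covariance,
    g : callable g(w, x) -> (n_y,) network output for parameters w,
    d : (n_y,) desired output, Q / R : state / observation noise covariances.
    """
    L = w.size
    lam = alpha**2 * (L + kappa) - L
    # sigma points around the current estimate (step 1 of Algorithm 1)
    S = np.linalg.cholesky((L + lam) * (P + Q))
    chi = np.concatenate([w[None, :], w[None, :] + S.T, w[None, :] - S.T])
    # weights from (3)
    Wm = np.full(2 * L + 1, 0.5 / (L + lam)); Wm[0] = lam / (L + lam)
    Wc = Wm.copy(); Wc[0] += 1.0 - alpha**2 + beta
    # propagate the sigma points through the network (step 2)
    Y = np.array([g(c, x) for c in chi])          # (2L+1, n_y)
    y_hat = Wm @ Y
    # measurement update (step 3)
    dY = Y - y_hat
    dchi = chi - w
    Pyy = (Wc[:, None] * dY).T @ dY + R
    Pwy = (Wc[:, None] * dchi).T @ dY
    K = Pwy @ np.linalg.inv(Pyy)                  # Kalman gain
    w_new = w + K @ (d - y_hat)
    P_new = P + Q - K @ Pyy @ K.T
    return w_new, P_new
```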
3.3 MN-UKF-based training algorithm

Since the computational complexity of the UKF-based training algorithm may be unacceptable for some specific applications, a method is needed to obtain a good trade-off between performance and complexity. Based on (2) and the UKF-based training algorithm, the more sigma points the UKF has, the more computational resources the training algorithm consumes. Therefore, to decrease the computational complexity of the UKF-based training algorithm, the number of sigma points should be reduced. As the number of sigma points is \(2L+1\), which depends on the length \(L\) of the state vector \(w_k\), a fast and efficient method is to employ a matrix transformation that reduces the length of the state vector:

\[
z_k = H w_k
\tag{4}
\]

where \(H \in \mathbb{R}^{M \times L}\) and \(M < L\). The minimum norm solution of (4) is [18]

\[
w_k = H^{\mathrm T} (H H^{\mathrm T})^{-1} z_k
\tag{5}
\]

Note that a special \(H\) can be chosen to avoid the matrix inversion in (5). For example, let \(H\) consist of the first \(M\) rows of a random orthogonal matrix, so that \(H H^{\mathrm T} = I\). Then we get

\[
w_k = H^{\mathrm T} z_k
\tag{6}
\]

and the state space model can be written as

\[
\begin{aligned}
z_k &= z_{k-1} + \tilde{q}_{k-1}\\
y_k &= g(H^{\mathrm T} z_k, x_k) + r_k
\end{aligned}
\tag{7}
\]

where \(\tilde{q}_k\) is the state noise with covariance matrix \(\tilde{Q}\). Consequently, the MN-UKF-based training algorithm can be derived by applying the standard UKF to (4), (6), and (7), and users can choose a proper size \(M\) of \(z_k\) to obtain a good trade-off between performance and complexity. Note that the MN-UKF-based training algorithm reduces not only the number of sigma points but also the computational requirement of the error covariance matrix update, since \(P_k\) is now \(M \times M\). The computational complexities of the EKF-based and UKF-based training algorithms are of the same order in the number of parameters \(L\) [10], while that of the PF-based training algorithm additionally grows with the particle number [19]. Based on (4), the complexity of the MN-UKF-based training algorithm is governed by \(M < L\) rather than \(L\). Hence, among the four training algorithms, the MN-UKF-based training algorithm has the lowest computational complexity.
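A minimal sketch of the projection trick in (4)-(6): taking \(H\) as the first \(M\) rows of a random orthogonal matrix makes \(H H^{\mathrm T} = I\), so the minimum norm reconstruction collapses to \(w = H^{\mathrm T} z\) with no matrix inversion. The helper below and the usage comments are hypothetical (including the choice M = 100), assuming the ukf_train_step sketch above.

```python
import numpy as np

def random_projection(L, M, seed=0):
    """First M rows of a random orthogonal matrix, as suggested for (6).

    The QR factorisation of a Gaussian matrix yields an orthogonal Q,
    so the returned H satisfies H @ H.T == I_M (up to rounding).
    """
    rng = np.random.default_rng(seed)
    Qmat, _ = np.linalg.qr(rng.standard_normal((L, L)))
    return Qmat[:M, :]                       # H : (M, L)

# Hypothetical usage: run the UKF in the reduced space z (length M)
# and evaluate the LSTM with the reconstructed parameters w = H^T z.
# H = random_projection(L=290, M=100)
# g_reduced = lambda z, x: g(H.T @ z, x)
# z, Pz = ukf_train_step(z, Pz, g_reduced, x, d, Q_tilde, R)
```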
4 Simulations

Fig. 2: Testing data sets of the experiments: (a) Mackey–Glass time series, (b) Lorenz chaotic time series, (c) surface temperature time series.

Fig. 3: Learning curves of the training algorithms for the three time series: (a) Mackey–Glass, (b) Lorenz, (c) surface temperature.

Fig. 4: One-step-ahead prediction errors of the training algorithms for the three time series: (a) Mackey–Glass, (b) Lorenz, (c) surface temperature.

Table 1: One-step-ahead prediction performance comparison of the algorithms (the best result for each series is marked with an asterisk)

Time series           SGD       EKF       UKF        PF
Mackey–Glass          48.7233   60.6484   84.9315*   51.1922
Lorenz                35.5079   38.8211   42.6218*   37.9841
Surface temperature   27.1765   37.7266   48.4938*   30.1729

To demonstrate the potential of the proposed algorithms, we performed a series of experiments on various signals: a Mackey–Glass time series, a Lorenz chaotic time series, and an hourly surface temperature time series. The Mackey–Glass time series is generated by the delay differential equation

\[
\frac{\mathrm{d}x(t)}{\mathrm{d}t} = \frac{a\, x(t-\tau)}{1 + x^{n}(t-\tau)} - b\, x(t)
\tag{8}
\]

where the parameters \(a\), \(b\), \(n\), the delay \(\tau\), and the initial condition were chosen for the experiments. The Lorenz chaotic time series is generated by the three differential equations

\[
\dot{x} = \sigma (y - x), \qquad
\dot{y} = x(\rho - z) - y, \qquad
\dot{z} = xy - \beta z
\tag{9}
\]

with parameters \(\sigma\), \(\rho\), \(\beta\) and initial conditions \(x(0)\), \(y(0)\), \(z(0)\). The sampling times of the Mackey–Glass and Lorenz time series were 1 and 0.017 s, respectively, and one component of the Lorenz system was used as the prediction target. The temperature time series was collected in Xian, China, from January 1990 to March 1990. The experiments were performed in Matlab on a computer with an i3-6100 processor (3.7 GHz) and 16 GB RAM. The LSTM was composed of two input neurons, seven memory blocks, and one output neuron, and each memory block had one cell; therefore, there were 290 adjustable parameters (i.e. \(L = 290\)).

For the Mackey–Glass and Lorenz time series, the training and test sets contained 2000 and 300 samples, respectively; the learning rate of the SGD algorithm was set to 0.2 with a momentum of 0.99; the initial error covariance matrix was set to a scaled identity matrix \(I\); and the noise covariances \(Q\) and \(R\) were annealed to small final values during training. For the surface temperature time series, the training and test sets contained 1500 and 300 samples, respectively; the learning rate of the SGD algorithm was set to 0.15 with a momentum of 0.99; and \(Q\) was annealed from 0.1 and \(R\) from 0.2 to small final values.

To assess the performance quantitatively, the mean square error (MSE) and the prediction gain

\[
R_p = 10 \log_{10}\!\left(\frac{\hat{\sigma}_x^2}{\hat{\sigma}_e^2}\right)
\tag{10}
\]

were used [20], where \(\hat{\sigma}_x^2\) and \(\hat{\sigma}_e^2\) are the estimated variances of the input signal and the prediction error, respectively. All results were averaged over 100 trained networks with independent initialisation of the network parameters. Using the method in [7], we found that reasonable particle numbers are 1200, 1500, and 1600 for the Mackey–Glass, Lorenz, and surface temperature time series, respectively.

Fig. 2 shows the testing data sets of the experiments, and Fig. 3 shows the learning curves of the training algorithms for the three time series. As shown in Fig. 3, the UKF-based training algorithm achieves the lowest steady-state MSE. Moreover, it has the fastest convergence rate for the Lorenz chaotic time series, while the convergence rate of the EKF-based training algorithm is comparable to that of the UKF-based training algorithm for the Mackey–Glass and surface temperature time series. Fig. 4 shows the one-step-ahead prediction errors of the training algorithms, and Table 1 compares their one-step-ahead prediction performance. It can be seen that the UKF-based training algorithm achieves the best performance.

Fig. 5: Learning curves of the MN-UKF-based training algorithm for different \(M\): (a) Mackey–Glass, (b) Lorenz, (c) surface temperature.

To illustrate the performance of the MN-UKF-based training algorithm for different sizes \(M\) of \(z_k\), we performed a further series of experiments. Fig. 5 shows the learning curves of the MN-UKF-based training algorithm for different \(M\); notice that when \(M = L\), the MN-UKF is equivalent to the standard UKF. Although the MSE increases as \(M\) decreases, the convergence rates of the MN-UKF-based training algorithm are almost the same for different values of \(M\). Hence, a fast convergence rate can be achieved even with a small \(M\).

Fig. 6: One-step-ahead prediction performance of the MN-UKF-based training algorithm for different \(M\).

Fig. 6 shows the one-step-ahead prediction performance of the MN-UKF-based training algorithm for different \(M\). It can be seen that as \(M\) decreases, the performance does not degrade significantly.

Fig. 7: Training times of the MN-UKF-based training algorithm for different \(M\).

Fig. 7 shows the training times of the MN-UKF-based training algorithm for different \(M\). As shown in Fig. 7, the training time can be reduced significantly by using the MN-UKF-based training algorithm. Since the structures of the LSTMs employed in the experiments are fixed, the training times of the MN-UKF-based training algorithm for the Mackey–Glass and Lorenz time series are almost the same. Based on Table 1 and Fig. 6, even with a small \(M\), the MN-UKF-based training algorithm achieves better performance than the SGD algorithm, the EKF-based training algorithm, and the PF-based training algorithm.
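For reproducibility, a small sketch that generates a Mackey–Glass series by Euler integration of (8), using the 1 s sampling interval mentioned above, and evaluates the prediction gain (10). The default parameter values are the common benchmark settings for this system, not necessarily those used in the paper (which are not reproduced here).

```python
import numpy as np

def mackey_glass(n, a=0.2, b=0.1, c=10.0, tau=17, dt=1.0, x0=1.2):
    """Euler integration of the Mackey-Glass delay equation (8).

    Crude but adequate for a sketch; a finer step or a proper DDE
    solver gives a more accurate trajectory. The history before t=0
    is held constant at x0.
    """
    hist = int(tau / dt)
    x = np.full(n + hist, x0)
    for k in range(hist, n + hist - 1):
        x_tau = x[k - hist]
        x[k + 1] = x[k] + dt * (a * x_tau / (1.0 + x_tau**c) - b * x[k])
    return x[hist:]

def prediction_gain(x, e):
    """Prediction gain R_p = 10 log10(var(x) / var(e)) from (10)."""
    return 10.0 * np.log10(np.var(x) / np.var(e))

series = mackey_glass(2300)           # e.g. 2000 training + 300 test samples
e = series[1:] - series[:-1]          # trivial persistence predictor baseline
print(prediction_gain(series[1:], e))
```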
5 Conclusion

To overcome the limitations of the EKF-based and PF-based training algorithms, a UKF-based training algorithm has been proposed to train the LSTM for the first time in this paper. The SGD, EKF-based, PF-based, and UKF-based training algorithms have been compared in extensive experiments, and the UKF-based training algorithm obtains the best performance in all of them. Moreover, the computational complexity of the UKF is of the same order as that of the EKF. Consequently, the UKF-based training algorithm can serve as an efficient training algorithm for the LSTM. To further decrease the computational complexity, the MN-UKF has been proposed, with which a good trade-off between performance and complexity can be obtained. The experimental results confirm the advantages of the UKF-based and MN-UKF-based training algorithms for LSTM training. Note that the MN-UKF can also be used to train the FNN and other RNNs, such as the Elman neural network and the gated recurrent unit neural network.

6 Acknowledgments

This work was supported by the National Natural Science Foundation of China under grants no. 61571159 and no. 61571157 and the National Marine Technology Program for Public Welfare under grant no. 201505002. We would like to express our sincere thanks to the members of the School of Electronics and Information Engineering, Research Center, Harbin Institute of Technology, for technical support.

7 References

1. Zhou, G.B., Wu, J., Zhang, C.L., et al.: 'Minimal gated unit for recurrent neural networks', Int. J. Autom. Comput., 2016, 13, (3), pp. 226–234
2. Hochreiter, S., Schmidhuber, J.: 'Long short-term memory', Neural Comput., 1997, 9, (8), pp. 1735–1780
3. Kong, W., Dong, Z.Y., Jia, Y., et al.: 'Short-term residential load forecasting based on LSTM recurrent neural network', IEEE Trans. Smart Grid, 2017, 10, pp. 841–851
4. Ma, X., Tao, Z., Wang, Y., et al.: 'Long short-term memory neural network for traffic speed prediction using remote microwave sensor data', Transp. Res. C, Emerg. Technol., 2015, 54, pp. 187–197
5. Byeon, W., Breuel, T.M., Raue, F., et al.: 'Scene labeling with LSTM recurrent neural networks'. Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Boston, 2015, pp. 3547–3555
6. Perez-Ortiz, J.A., Gers, F.A., Eck, D., et al.: 'Kalman filters improve LSTM network performance in problems unsolvable by traditional recurrent nets', Neural Netw., 2003, 16, (2), pp. 241–250
7. Ergen, T., Kozat, S.S.: 'Efficient online learning algorithms based on LSTM neural networks', IEEE Trans. Neural Netw. Learn. Syst., 2017, 29, pp. 3772–3783
8. Ergen, T., Kozat, S.S.: 'Online training of LSTM networks in distributed systems for variable length data sequences', IEEE Trans. Neural Netw. Learn. Syst., 2017, 29, pp. 5159–5165
9. Williams, R.J.: 'Training recurrent networks using the extended Kalman filter'. Int. Joint Conf. on Neural Networks, Baltimore, 1992, vol. 4, pp. 241–246
10. Haykin, S.: 'Kalman filtering and neural networks' (Wiley, New York, 2001)
11. Wan, E.A., Van Der Merwe, R.: 'The unscented Kalman filter for nonlinear estimation'. Proc. IEEE Symp. on Adaptive Systems for Signal Processing, Communications, and Control, Lake Louise, AB, 2000, pp. 153–158
12. Gordon, N.J., Salmond, D.J., Smith, A.F.: 'Novel approach to nonlinear/non-Gaussian Bayesian state estimation', IEE Proc. F, Radar Signal Process., 1993, 140, (2), pp. 107–113
13. Djuric, P.M., Kotecha, J.H., Zhang, J., et al.: 'Particle filtering', IEEE Signal Process. Mag., 2003, 20, (5), pp. 19–38
14. Doucet, A., Godsill, S., Andrieu, C.: 'On sequential Monte Carlo sampling methods for Bayesian filtering', Stat. Comput., 2000, 10, (3), pp. 197–208
15. Arulampalam, M.S., Maskell, S., Gordon, N., et al.: 'A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking', IEEE Trans. Signal Process., 2002, 50, (2), pp. 174–188
16. Julier, S., Uhlmann, J., Durrant-Whyte, H.F.: 'A new method for the nonlinear transformation of means and covariances in filters and estimators', IEEE Trans. Autom. Control, 2000, 45, (3), pp. 477–482
17. Gers, F.A., Schmidhuber, J., Cummins, F.: 'Learning to forget: continual prediction with LSTM', Neural Comput., 2000, 12, (10), pp. 2451–2471
18. Hayes, M.H.: 'Statistical digital signal processing and modeling' (John Wiley & Sons, New York, 1996)
19. Gustafsson, F.: 'Particle filter theory and practice with positioning applications', IEEE Aerosp. Electron. Syst. Mag., 2010, 25, (7), pp. 53–82
20. Xia, Y., Jahanchahi, C., Mandic, D.P.: 'Quaternion-valued echo state networks', IEEE Trans. Neural Netw. Learn. Syst., 2015, 26, (7), pp. 663–673
