Article. Open access. Peer reviewed.

Short‐term load demand forecasting through rich features based on recurrent neural networks

2020; Institution of Engineering and Technology; Volume: 15; Issue: 5; Language: English

10.1049/gtd2.12069

ISSN

1751-8695

Authors

Dongbo Zhao, Qian Ge, Yuting Tian, Jia Cui, Boqi Xie, Tianqi Hong

Topic(s)

Image and Signal Denoising Methods

Abstract

IET Generation, Transmission & Distribution, Volume 15, Issue 5, pp. 927-937. Original Research Paper. Open Access.

Short-term load demand forecasting through rich features based on recurrent neural networks

Dongbo Zhao (Energy Systems Division, Argonne National Laboratory, Lemont, Illinois, USA); Qian Ge (Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, North Carolina, USA); Yuting Tian, corresponding author (Energy Systems Division, Argonne National Laboratory, Lemont, IL 60439, USA; email: tianyuti@msu.edu); Jia Cui (School of Electrical Engineering, Shenyang University of Technology, Shenyang, Liaoning, China); Boqi Xie (School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA); Tianqi Hong (Energy Systems Division, Argonne National Laboratory, Lemont, Illinois, USA; ORCID 0000-0003-1307-6808).

First published: 31 December 2020. https://doi.org/10.1049/gtd2.12069

Abstract

With the emerging penetration of renewables and dynamic loads, understanding grid-edge loading conditions becomes increasingly important. Load modelling research commonly comprises explicitly expressed load models and non-explicitly expressed techniques, among which artificial intelligence approaches have become the dominant path. This paper presents an artificial intelligence-based load modelling technique to enhance knowledge of current and future load information, considering geographical and weather dependencies.
This paper presents a recurrent neural network based sequence-to-sequence (Seq2Seq) model to forecast short-term power loads. In addition, a feature attention mechanism operating along both the channel and time directions is developed to improve the efficiency of feature learning. Experiments on three publicly available datasets demonstrate the accuracy and effectiveness of the proposed model.

1 INTRODUCTION

With the continuous development of the economy, society and the smart grid, power systems place increasingly high requirements on the accuracy of load forecasting [1]. Accurate short-term load prediction is an essential backbone of smart grid operation and facilitates economic dispatch, demand response programs, etc. [2]. However, load characteristics differ across regions, and the increasing penetration of distributed renewable energy brings further unpredictability and uncertainty to the load side of the grid. These factors pose greater challenges for power load forecasting [3]. Researchers have endeavoured to improve the accuracy and speed of forecasting methodologies applied in power systems over the past decades [4, 5].

The statistical model is one of the most widely used forecasting approaches; it aims to describe the relationship between the time series of forecasted values and actual historical load data [6]. Commonly used statistical methods include time series models, regression analysis models and Kalman filter models [7, 8]. Statistical models can address the problem of forecasting delay effectively, but their long-term forecasting accuracy is unsatisfactory. Artificial intelligence methods, another group of techniques including neural networks and support vector machines (SVMs), are extensively applied to load forecasting [9-11]. Neural networks are able to learn complicated non-linear relationships between input and output. They are simple and efficient techniques with good performance and have been widely applied in load and renewable energy forecasting [4]. For instance, the authors in [10] present several improved artificial neural network models that predict aggregated load demand with good performance. However, most neural network models require a large training set to prevent over-fitting, and the training process is often unstable [12]. SVMs work well with small training sets and can model non-linear relationships through the kernel trick, but choosing a good kernel function is non-trivial [13]. The authors in [14] propose a least-squares SVM for annual electric load forecasting.

Recently, deep learning approaches have attracted considerable attention and achieved remarkable results in the forecasting area. They have the capability of learning and generalising on large datasets and have produced several breakthroughs in computer vision and natural language processing [15, 16]. The core idea of deep learning is to model complicated relationships between input and output data by stacking several layers of non-linearities. When applied to load forecasting, instead of designing models manually, these powerful models are able to learn the hidden relationship between the prediction and historical data automatically, given enough training data. Recurrent neural networks (RNNs), one family of deep learning models, are designed to model the time dependency of data sequences and are therefore commonly applied to forecasting tasks [17].
However, vanilla RNNs suffer from vanishing gradients over many recurrent steps, which prevents them from modelling long-term time dependencies. In [18], the long short-term memory (LSTM) cell is proposed to avoid this through a gating mechanism. The authors in [11] present an LSTM RNN based framework to forecast the electricity use of individual residential customers, and [19] also applies an RNN with LSTM units to household load forecasting. In [20], gated recurrent units (GRUs) are proposed with a simpler architecture yet performance comparable to the LSTM. The authors in [21] adopt a gated RNN model for day-ahead load forecasting in commercial buildings.

In this paper, we propose a forecasting framework based on the sequence-to-sequence (Seq2Seq) RNN, which has not previously been adopted for load forecasting. The Seq2Seq model was originally proposed to address end-to-end sequence learning with neural networks [22]. It can handle input and output sequences of different lengths and provides good performance on machine translation tasks. In [23], an attention mechanism is proposed that lets the Seq2Seq model softly select a set of relevant inputs when predicting each output. This improves machine translation performance, especially for long sentences, and also provides interpretability of the model. Besides machine translation and speech recognition [24], Seq2Seq has been successfully applied to several other tasks, including image captioning [25], question answering [26], video representation learning [27] and human motion prediction [28].

Much research on feature selection has been carried out to reduce computational complexity and improve the performance of downstream tasks by removing redundant features [29]. In [30], a subset of weather and historical load features is extracted for load forecasting by ranking candidate features with a conditional mutual information approach. In [31], the optimal feature subset is selected based on the correlation between features and the load. In the proposed model, instead of selecting a fixed set of features for forecasting, different sets of features are selected to predict the power load at each time step through our feature attention mechanism.

The main contributions of this article can be summarised as follows:

- An RNN-based Seq2Seq model is proposed to predict short-term loads. The Seq2Seq model has rarely been applied in the load forecasting area, and this paper fills that knowledge gap. The method is suitable for load forecasting and is demonstrated to be accurate and efficient.
- A feature attention mechanism is introduced to learn rich feature representations. Unlike the original Seq2Seq, where the decoder is fed only the previous output, in our model the decoder at each output time also takes as input features computed from the raw input features through a feature attention layer. Specifically, a new set of features is computed from the raw features along both the channel and time directions before being fed into the decoder.

The rest of the paper is organised as follows. Section 2 presents the background knowledge of RNNs and GRUs. Section 3 describes the proposed Seq2Seq method for load forecasting. Section 4 discusses the feature attention mechanism applied in our model. Section 5 documents the solution procedure and the experiment setup. Case studies and results are presented in Section 6. Finally, Section 7 concludes this paper.
2 BACKGROUND

In this section, we briefly introduce the background knowledge of RNNs [32] and GRUs [20] used in our model; more details can be found in [32, 20].

2.1 Recurrent neural networks

An RNN is a neural network for modelling sequential data: it takes an input sequence of length T and generates T steps of outputs and hidden states. At each time step t, the RNN takes the current input $x_t$ and the previous hidden state $h_{t-1}$ to produce $h_t$ and the output $y_t$. It is therefore able to model the time dependencies of the data and to capture the temporal relationship between previous and current states. Traditionally, the RNN cell is defined as [17]:

$h_t = \tanh(W h_{t-1} + U x_t + b)$  (1)
$y_t = V h_t + c$  (2)

where $\tanh(\cdot)$ is the hyperbolic tangent function, and W, U and V are weight matrices with bias vectors b and c, usually learned with the backpropagation through time (BPTT) algorithm [33] during training.

2.2 Gated recurrent units

Vanilla RNNs can suffer from vanishing or exploding gradients and thus have difficulty learning long-term dependencies. Sophisticated recurrent cells, including the LSTM [18] and the GRU [20], have been proposed to avoid this issue by creating paths through time whose derivatives neither vanish nor explode [17]. The GRU is applied in our proposed model and is defined as follows [34]:

$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$  (3)
$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$  (4)
$\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}) + b_h)$  (5)
$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$  (6)
$y_t = h_t$  (7)

where $\sigma$ is the logistic sigmoid function, $\odot$ denotes element-wise multiplication, and $W_z$, $U_z$, $W_r$, $U_r$, $W$ and $U$ are weight matrices with bias vectors $b_z$, $b_r$ and $b_h$ to be learned. The GRU is chosen for our model because it outperformed the LSTM in our experiments. There is no rule that one cell is always better than the other; it depends on the dataset. A practical strategy is therefore to test both cells on a new dataset and choose the one with the best performance. When both cells perform similarly, the GRU is usually preferred because it is computationally more efficient.
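For readers who prefer code to equations, the following is a minimal NumPy sketch of a single GRU step following Equations (3)-(7). The parameter names mirror the notation above; the toy dimensions and random initialisation are purely illustrative and are not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU update following Equations (3)-(7)."""
    Wz, Uz, bz, Wr, Ur, br, W, U, bh = params
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev + bz)             # update gate, Eq. (3)
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev + br)             # reset gate, Eq. (4)
    h_tilde = np.tanh(W @ x_t + U @ (r_t * h_prev) + bh)   # candidate state, Eq. (5)
    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde             # new hidden state, Eq. (6)
    return h_t                                             # output y_t = h_t, Eq. (7)

# toy dimensions: 3 input features, 5 hidden units (illustrative only)
rng = np.random.default_rng(0)
n_in, n_h = 3, 5
params = (rng.normal(size=(n_h, n_in)), rng.normal(size=(n_h, n_h)), np.zeros(n_h),
          rng.normal(size=(n_h, n_in)), rng.normal(size=(n_h, n_h)), np.zeros(n_h),
          rng.normal(size=(n_h, n_in)), rng.normal(size=(n_h, n_h)), np.zeros(n_h))
h = np.zeros(n_h)
for x in rng.normal(size=(4, n_in)):   # a 4-step input sequence
    h = gru_step(x, h, params)
```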
3 LOAD FORECASTING USING RECURRENT NEURAL NETWORKS

A dynamic power load sequence can usually be considered a time series, so an RNN is a suitable tool for dynamic load forecasting. In our work, the Seq2Seq model is applied and feature engineering is also involved. In this section, we first introduce the general model of the power load. We then discuss how to use a Seq2Seq framework to learn the load model and forecast the load. Finally, we introduce our feature attention mechanism, which reduces the feature engineering effort.

3.1 Power and load model

Let $y_t$ and $y_t^{\text{true}}$ be the measured value and the true value of a power load at time t, respectively. Then $y_t$ can be modelled as

$y_t = y_t^{\text{true}} + N$  (8)

where $N \sim \mathcal{N}(0, \sigma^2)$ is Gaussian noise with zero mean and standard deviation $\sigma$ introduced by the measurement uncertainty. Let $x_t$ be a subset of features that affect the value of $y_t^{\text{true}}$, and let $f(x_t, \omega)$ be a load model describing $y_t^{\text{true}}$ for all t with parameters $\omega$. Then the likelihood of $y_t$ given features $x_t$ is

$p(y_t \mid x_t) = \mathcal{N}\!\left(y_t;\, f(x_t, \omega),\, \sigma_{\text{model}}^2\right)$  (9)

where $\sigma_{\text{model}}$ is a standard deviation with lower bound $\sigma$, since the load may not be perfectly modelled using the feature set $x_t$ alone. Assuming $\sigma_{\text{model}}$ is independent of $x_t$, our goal is to find a function $\hat{f}(x_t, \theta)$ that approximates the load model f given sufficient observations. In this paper, we use an RNN consisting of an encoder and a decoder as the function approximator, and the load model is fitted by maximum likelihood estimation based on Equation (9). Because of the time-dependence property of the load sequence, the feature set $x_t$ can include historical load data, weather temperature and any other external features that affect the load value at time t.

3.2 Learning the load model through a sequence-to-sequence model

Seq2Seq [22] is a popular RNN-based model widely used in natural language processing tasks [35-37]. It consists of an encoder RNN and a decoder RNN and can map an input sequence to a target sequence of a different length. The encoder first maps an arbitrary-length input sequence to a fixed-size vector (the last hidden state of the encoder RNN). The decoder is then initialised by this fixed-size vector and generates the target sequence. One advantage of this model for load forecasting is that the entire information of the input sequence is utilised to generate the target sequence. Assuming the load sequence of length T depends only on the preceding historical data of length T'+1 and the corresponding external features, the likelihood of the target load sequence given the features is [15]

$p(y_{t+1}, \ldots, y_{t+T} \mid x_{t+1}, \ldots, x_{t+T})$  (10)
$= p(y_{t+1}, \ldots, y_{t+T} \mid y_{t-T'}, \ldots, y_t, d_{t+1}, \ldots, d_{t+T})$  (11)
$= \prod_{i=1}^{T} p(y_{t+i} \mid v, y_{t+1}, \ldots, y_{t+i-1}, d_{t+i})$  (12)

where v is the fixed-size encoding of the input sequence, $d_t$ is the multi-channel input feature vector at time t, and each factor

$p(y_{t+i} \mid v, y_{t+1}, \ldots, y_{t+i-1}, d_{t+i})$  (13)

is a Gaussian distribution as discussed in Section 3.1. Unlike the original Seq2Seq, where the decoder is fed only the previous output, in our model the decoder at each output time t also takes the features $d_t$ as input. These features are computed from the original input features through a feature attention layer, which is discussed in detail in the next section. The load model can then be learned by maximising the log-likelihood of Equation (12) through a mean squared error (MSE) loss function:

$\text{MSE} = \frac{1}{T} \sum_{i=1}^{T} \left( y_{t+i} - \hat{y}_{t+i} \right)^2$  (14)

where $\hat{y}_{t+i}$ is computed from the decoder RNN output followed by a linear fully connected layer. It is observed in [22] that a reversed input order is preferred for language translation tasks. Inspired by this, the encoder is constructed as a bidirectional RNN [38] to retain both directions of the input sequence. Figure 1 illustrates the proposed model used for load forecasting. Note that the encoder is a bidirectional RNN and the decoder is a single-directional RNN.

Figure 1: The proposed network for load forecasting.
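The following PyTorch sketch illustrates the kind of encoder-decoder structure described in this section: a bidirectional GRU encoder summarising the historical load, a GRU decoder that receives the previous prediction together with the per-step decoder features $d_t$, and a linear output layer trained with the MSE loss of Equation (14). The class and argument names (Seq2SeqForecaster, feat_dim, hidden, etc.) and all sizes are assumptions for illustration; the paper's actual hyper-parameters are given in Section 5.2.1.

```python
import torch
import torch.nn as nn

class Seq2SeqForecaster(nn.Module):
    """Sketch of the encoder-decoder forecaster: a bidirectional GRU encoder
    summarises the historical load; a GRU decoder generates the next-day load
    from the previous prediction and the per-step decoder features d_t."""
    def __init__(self, feat_dim, hidden=64, layers=1):
        super().__init__()
        self.encoder = nn.GRU(1, hidden, layers, bidirectional=True, batch_first=True)
        self.decoder = nn.GRU(1 + feat_dim, hidden, layers, batch_first=True)
        self.bridge = nn.Linear(2 * hidden, hidden)    # merge the two encoder directions
        self.out = nn.Linear(hidden, 1)

    def forward(self, hist_load, dec_feats, horizon):
        # hist_load: (B, T_in, 1), dec_feats: (B, horizon, feat_dim)
        _, h_enc = self.encoder(hist_load)             # h_enc: (2*layers, B, hidden)
        fwd, bwd = h_enc[0::2], h_enc[1::2]
        h_dec = torch.tanh(self.bridge(torch.cat([fwd, bwd], dim=-1)))
        y_prev = hist_load[:, -1:, :]                  # seed with the last observed load
        outputs = []
        for i in range(horizon):
            step_in = torch.cat([y_prev, dec_feats[:, i:i + 1, :]], dim=-1)
            dec_out, h_dec = self.decoder(step_in, h_dec)
            y_prev = self.out(dec_out)                 # \hat{y}_{t+i}
            outputs.append(y_prev)
        return torch.cat(outputs, dim=1)               # (B, horizon, 1)

# the training objective corresponds to Equation (14)
model = Seq2SeqForecaster(feat_dim=4)
pred = model(torch.randn(8, 24, 1), torch.randn(8, 24, 4), horizon=24)
loss = nn.functional.mse_loss(pred, torch.randn(8, 24, 1))
```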
4 FEATURE ATTENTION MECHANISM

The process of feature engineering, which relies heavily on experts' domain knowledge, is often time-consuming and inflexible when new types of features are added. To reduce the feature engineering effort, we introduce a feature attention mechanism that learns rich feature representations automatically and efficiently. The only required feature engineering is to group features based on their dependency. An attention mechanism was proposed for Seq2Seq in [35] to force the model to learn to focus on a specific position of the encoder input sequence. Inspired by this, since all the features are fed into the decoder in our model, we introduce an attention mechanism that forces the model to learn which input features are important for the output at each output time t.

The input of our Seq2Seq model is a multi-channel feature sequence that includes the historical load data (the encoder input sequence) as well as external features such as weekday, temperature and any other features affecting the load values at each time step. Although the input sequence is already encoded in the final state of the encoder, we found that explicitly including it as an additional feature channel improves the load forecasting performance in our experiments, because the load demands at the same hour on two successive days are highly correlated. This can be interpreted as manually specifying the attentions that are learned in [35]. We then assume that at each output time t, the model should focus on a specific set of features over time and channels. For example, along the time dimension, the load at time t can depend on previous input features or even on future features. This is reasonable for load forecasting: future temperatures, for instance, can affect the load demand at a particular time of day because of human planning behaviour. The same holds for the channel dimension. Feature attention learning is applied only to numerical features; categorical and ordinal feature channels are fed into the decoder directly.

Specifically, let $D_{\text{num}}^{o} = [d_1^{o}, \ldots, d_M^{o}]$ be the original $T \times M$ numerical feature matrix with each column an input numerical feature sequence, and let $D_{\text{num}}^{a} = [d_1^{a}, \ldots, d_N^{a}]$ be a $T \times N$ learned numerical feature matrix. Feature attention learns a set of weights w such that

$d_n^{a}(t) = \sum_{t'=1}^{T} \sum_{m=1}^{M} w_m^{n, t t'}\, d_m^{o}(t'), \quad n = 1, \ldots, N,\ t = 1, \ldots, T$  (15)

$\text{s.t.} \quad \sum_{t'=1}^{T} \sum_{m=1}^{M} w_m^{n, t t'} = 1$  (16)

where $d_i^{o}(t)$ and $d_i^{a}(t)$ are the t-th elements of the i-th columns of $D_{\text{num}}^{o}$ and $D_{\text{num}}^{a}$, respectively. In the proposed approach, attention over channels and attention over time are learned separately to reduce the computational complexity.

4.1 Channel attention

Feature attention over channels is learned based on the feature dependency across channels. One example of channel-dependent features is the set of temperatures obtained from several nearby weather stations. Feature channels are first divided into K groups based on their dependency. Then, for each group, the dependency between channels is modelled by a weighted summation over all the features within the group. It is desirable to learn multiple sets of weights, since there may exist multiple useful combinations of features across channels. To reduce the computational complexity, we further assume that, within each group, features at different times share the same weights. Thus, for each group, the channel attention output can consist of multiple channels, each of which is a weighted average over the original feature channels.

Let $S_d = \{d_1^{o}, \ldots, d_M^{o}\}$ be the set of $T \times 1$ single-channel input numerical feature sequences, $\{k_i,\ i = 1, \ldots, K\}$ be a partition of the index set of $S_d$ based on the feature channel dependency, $|k_i|$ be the cardinality of the set $k_i$, and $n_c$ be the number of output channels per group. Then the feature attention over channels for each group is computed as

$d_{i,l}^{c} = \sum_{m=1,\, j \in k_i}^{|k_i|} w_m^{i,l}\, d_j^{o}, \quad i = 1, \ldots, K,\ l = 1, \ldots, n_c$  (17)

$\text{s.t.} \quad \sum_{m=1}^{|k_i|} w_m^{i,l} = 1$  (18)

where $d_{i,l}^{c}$ is the l-th output channel attention feature vector for the i-th feature group. The constraint is enforced by a softmax [32] over the weights. Note that for groups with only one feature channel, we simply use the original input feature as the output. The new numerical feature set then becomes $S_d^{c} = \{d_1^{c}, \ldots, d_L^{c}\}$, where $K \leq L \leq K n_c$.
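A minimal sketch of the channel attention of Equations (17)-(18), assuming a single group of mutually dependent channels: softmax-normalised weights, shared across time, produce $n_c$ weighted averages of the group's channels. The class name, sizes and initialisation are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of Equations (17)-(18) for one group of dependent feature channels
    (e.g. temperatures from nearby weather stations): learn n_c softmax-normalised
    weight vectors and output n_c weighted averages over the group's channels."""
    def __init__(self, group_size, n_c=1):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_c, group_size))

    def forward(self, group_feats):
        # group_feats: (B, T, group_size) -- the channels belonging to one group
        w = torch.softmax(self.logits, dim=-1)               # rows sum to 1, Eq. (18)
        return torch.einsum('btm,lm->btl', group_feats, w)   # (B, T, n_c), Eq. (17)

# example: three correlated temperature channels mapped to two attention channels
att = ChannelAttention(group_size=3, n_c=2)
out = att(torch.randn(8, 24, 3))
```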
4.2 Time attention

Feature attention over time is applied to each individual feature channel separately. Two structures are used to model it, and multiple output channels are used in both structures for the same reason as before.

First, a linear projection structure is used to model time attention weights that depend on the output time t; that is, at different output times t, the model should focus on input features with different patterns. Let $n_t$ be the number of output channels. The time attention for each feature channel $d_i^{c}$ is modelled as

$d_{i,l}^{t}(t) = \sum_{t'=1}^{T} w_{i,l}^{t, t'}\, d_i^{c}(t'), \quad i = 1, \ldots, L,\ l = 1, \ldots, n_t$  (19)

$\text{s.t.} \quad \sum_{t'=1}^{T} w_{i,l}^{t, t'} = 1$  (20)

where $d_i^{c}(t)$ and $d_{i,l}^{t}(t)$ are the t-th elements of the channel attention feature vector $d_i^{c}$ and of the corresponding l-th output time attention feature vector $d_{i,l}^{t}$, respectively. The constraint is enforced by a softmax [39] over the weights.

Second, a multi-scale weight-sharing structure inspired by [40] is used to model weights that are independent of the output time t; that is, at each time t, the model focuses on the same set of features centred at time t, which mimics sliding-window feature extraction. A convolutional layer is used to meet the requirement of weight sharing and to force the model to focus on the features around the current time t. The convolutional layer structure used in our experiments is illustrated in Figure 2; it is a modified inception layer [40] with three modifications: (i) one-dimensional filters are used for the one-dimensional sequence data; (ii) pooling is removed, because localisation is critical for attention learning and we want to keep the original sequence length; (iii) the 1×1 filters before the larger filters are removed, since dimension reduction is not necessary here (each input feature has a single channel), but the parallel 1×1 filters are kept to add more non-linearity.

Figure 2: The convolutional layer for multi-scale feature learning.

This structure captures multi-scale representations of features over time at a low computational cost. The filter sizes need not be the same as those shown in Figure 2; each filter size can have several output channels and can be tuned for specific tasks. A rectified linear unit [41] is used as the activation function for all the filters. The number of parameters of the inception structure is constant with respect to the input feature length T. For the linear projection structure, however, the number of parameters is $L T^2 n_t$, which may lead to over-fitting when T is large. In that case, one may remove the linear projection part and keep only the inception layer with more output channels and larger filter sizes, since the load value is less likely to be affected by features too far away in time. Let $n_I$ be the number of filters used in the inception layer; then the number of learned numerical feature channels becomes $N = (n_t + n_I) L$.
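A sketch of the multi-scale, weight-sharing branch described above, assuming parallel one-dimensional convolutions with the pooling and dimension-reducing 1×1 filters omitted, as stated. The kernel sizes and channel counts here are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class InceptionTimeAttention(nn.Module):
    """Sketch of the multi-scale weight-sharing time attention: parallel 1-D
    convolutions of different kernel sizes (padding keeps the sequence length T),
    ReLU activations, and channel concatenation."""
    def __init__(self, n_out=4, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(1, n_out, k, padding=k // 2) for k in kernel_sizes
        ])

    def forward(self, d_c):
        # d_c: (B, T) -- one channel-attention feature sequence
        x = d_c.unsqueeze(1)                               # (B, 1, T)
        feats = [torch.relu(branch(x)) for branch in self.branches]
        return torch.cat(feats, dim=1)                     # (B, n_I, T), n_I = n_out * len(kernel_sizes)

att = InceptionTimeAttention()
multi_scale = att(torch.randn(8, 24))                      # 24 hourly values -> (8, 12, 24)
```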
4.3 Including original features

To force the model also to focus on features at the current time, the original features are concatenated with the new features to form the final decoder input. The final feature matrix is thus

$D = [D_{\text{co}}^{o},\ D_{\text{num}}^{o},\ D_{\text{num}}^{a}]$  (21)

where $D_{\text{co}}^{o}$ is the matrix of the original categorical and ordinal features, with each column an individual feature channel. At each time step t, the input feature vector $d_t$ is the t-th row of matrix D. With this setting, features can easily be added or removed by adding or removing the corresponding channels from the original feature matrix before it is fed into the feature attention layer.

5 SOLUTION PROCEDURE AND EXPERIMENT SETUP

5.1 Solution procedure

As illustrated in Figure 1, the encoder takes a historical load sequence of arbitrary length as input. The decoder RNN is initialised by the final state of the encoder and takes a set of feature sequences as input. The features are first fed into a feature attention layer, which forces the decoder to focus on important features over time and channels at each output time t. The decoder then generates the load sequence based on the learned features and the previous outputs. Note that the input and output sequences can have different lengths. The load model is trained by finding a function $\hat{f}$ that minimises the distance between the true load and the model-predicted load on a training set.

5.2 Experiments

To demonstrate the efficiency of our model, we apply it to next-day hourly load forecasting and compare its performance with previous models designed specifically for load forecasting. Load forecasting is a suitable task for demonstrating the proposed feature attention learning, because load demand depends on historical usage and also heavily on external features such as holidays, public events and temperature [42]. In our experiments, three public datasets are used: hourly loads of the Polish power system for 2002-2004 (PL) [43]; the Global Energy Forecasting Competition 2012 dataset from the Kaggle competition (GEFCom2012) [44]; and hourly loads of the New England states for 2011-2015 (ISO-NE) [45]. Depending on data availability, historical load demand, weekday and temperature are used as input features. Historical load demand and temperature are treated as numerical features and scaled to zero mean and unit standard deviation; weekday is treated as an ordinal feature and scaled to the range [0, 1]. The three datasets correspond to three feature scenarios: (i) only historical demand data is provided; (ii) external features are channel dependent; (iii) all external features are channel independent.

5.2.1 Network architecture and training details

The choice of hyper-parameters and the model training details are briefly discussed here. The decoder and each direction of the encoder in the Seq2Seq model consist of two GRU layers with 80 hidden units. The number of output channels is chosen to be 4 for the inception filters and 1 for the other feature attentions. During training, the network takes the load demand of the current day and the corresponding features of the next day as input, and generates the estimated hourly load demand of the next day. The lengths of the input and output sequences are not necessarily the same; however, in our experiments no significant improvement was observed when a longer input sequence was provided, at the cost of higher computational complexity.
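As a rough illustration of the data preparation described in Section 5.2 and of the current-day/next-day pairing described above, the sketch below standardises the numerical features, scales the ordinal weekday feature to [0, 1], and pairs each day's load (encoder input) with the next day's features and load (decoder input and target); the previous-day load is kept as an extra decoder feature channel, as discussed in Section 4. All function and array names are hypothetical.

```python
import numpy as np

def build_day_pairs(hourly_load, weekday, temperature):
    """Illustrative data preparation: standardise numerical features, scale the
    ordinal weekday to [0, 1], and form (encoder input, decoder features, target)
    triples for successive days."""
    load = (hourly_load - hourly_load.mean()) / hourly_load.std()
    temp = (temperature - temperature.mean()) / temperature.std()
    wday = weekday / 6.0                                   # 0..6 -> [0, 1]

    load_d = load.reshape(-1, 24)                          # one row per day
    temp_d = temp.reshape(-1, 24)
    wday_d = wday.reshape(-1, 24)

    pairs = []
    for d in range(load_d.shape[0] - 1):
        enc_in = load_d[d]                                 # current-day load (encoder input)
        dec_feats = np.stack(                              # previous-day load + next-day features
            [load_d[d], temp_d[d + 1], wday_d[d + 1]], axis=-1)
        target = load_d[d + 1]                             # next-day load to predict
        pairs.append((enc_in, dec_feats, target))
    return pairs

# toy data: 30 days of hourly values
n = 30 * 24
pairs = build_day_pairs(np.random.rand(n) * 100,
                        np.repeat(np.arange(30) % 7, 24),
                        np.random.rand(n) * 30)
```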
We tested the proposed network using the demand of the previous week, of the same day in the previous week, and of the current day as the input feature, and found that the current setting provides the best performance. We further assume that the next-day temperatures can be predicted with high accuracy, so the true temperatures of the next day are used as a feature where applicable [46, 47]. The network is optimised by the RMSProp algorithm [48] through BPTT [33]. As discussed in Section 3.2, the model assumes that the load demand sequence to be predicted depends only on the historical input sequence of the encoder, not on earlier historical load. Thus, we can safely shuffle the data in each training epoch to reduce the risk of over-fitting. At the beginning of each epoch, a set of forecasting pairs containing the load demands of two successive days and the corresponding features is first constructed; the forecasting pairs are then shuffled to form the training set for the current epoch. The batch sizes used for PL, GEFCom2012 and ISO-NE are 16, 8 and 1, respectively. The learning rate starts at 0.001 and is divided by 10 every 50 epochs. The models are trained for up to 150 epochs. The number of layers and h
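A self-contained sketch of the training schedule described above: RMSProp with an initial learning rate of 0.001 divided by 10 every 50 epochs, forecasting pairs reshuffled every epoch, and an MSE objective. The tiny GRU regressor and the synthetic tensors stand in for the full Seq2Seq model and the real datasets purely to keep the example runnable; they are not the paper's implementation.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class TinyForecaster(nn.Module):
    """Stand-in model: a single GRU layer followed by a linear output layer."""
    def __init__(self, feat_dim=3, hidden=16):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, feats):
        h, _ = self.rnn(feats)
        return self.out(h)                      # (B, 24, 1) hourly predictions

# synthetic forecasting pairs: (next-day features, next-day load)
feats = torch.randn(64, 24, 3)
target = torch.randn(64, 24, 1)
loader = DataLoader(TensorDataset(feats, target), batch_size=16, shuffle=True)

model = TinyForecaster()
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)
criterion = nn.MSELoss()

for epoch in range(150):
    for x, y in loader:                         # shuffle=True reshuffles pairs each epoch
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()                         # gradients through time (BPTT)
        optimizer.step()
    scheduler.step()                            # learning rate divided by 10 every 50 epochs
```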

Reference(s)