Improved strategy for human action recognition; experiencing a cascaded design
2019; Institution of Engineering and Technology; Volume: 14; Issue: 5; Language: English
10.1049/iet-ipr.2018.5769
ISSN: 1751-9667
Authors: Muhammad Attique Khan, Tallha Akram, Muhammad Sharif, Nazeer Muhammad, Muhammad Younus Javed, Syed Rameez Naqvi
Topic(s): Gait Recognition and Analysis
IET Image Processing, Volume 14, Issue 5, pp. 818–829. Research Article. Free Access.

Muhammad Attique Khan (Department of Computer Science & Engineering, HITEC University, Museum Road, Taxila, Pakistan; Department of Computer Science, COMSATS University Islamabad, Wah Campus, Pakistan); Tallha Akram, corresponding author (tallha@ciitwah.edu.pk, ORCID 0000-0003-4578-3849; Department of Electrical Engineering, COMSATS University Islamabad, Wah Campus, Pakistan); Muhammad Sharif (Department of Computer Science, COMSATS University Islamabad, Wah Campus, Pakistan); Nazeer Muhammad (Department of Mathematics, COMSATS University Islamabad, Wah Campus, Pakistan); Muhammad Younus Javed (Department of Computer Science & Engineering, HITEC University, Museum Road, Taxila, Pakistan); Syed Rameez Naqvi (Department of Electrical Engineering, COMSATS University Islamabad, Wah Campus, Pakistan).

First published: 26 February 2020. https://doi.org/10.1049/iet-ipr.2018.5769. Citations: 11.

Abstract

Human motion analysis has received a lot of attention in the computer vision community during the last few years. This research domain is supported by a wide spectrum of applications including video surveillance, patient monitoring systems, and pedestrian detection, to name a few. In this study, an improved cascaded design for human motion analysis is presented; it consolidates four phases: (i) acquisition and preprocessing, (ii) frame segmentation, (iii) features extraction and dimensionality reduction, and (iv) classification.
The implemented architecture takes advantage of the CIE-Lab and National Television System Committee (NTSC) colour spaces, and also performs contrast stretching using the proposed red–green–blue* colour space enhancement technique. A parallel design utilising attention-based motion estimation and segmentation modules is also proposed in order to avoid the detection of false moving regions. In addition to these contributions, the proposed feature selection technique, called entropy-controlled principal components with weights minimisation, further improves the classification accuracy. The authors' claims are supported with a comparison between six state-of-the-art classifiers tested on five standard benchmark data sets, namely Weizmann, KTH, UIUC, MuHAVi, and WVU, where the results reveal improved correct classification rates of 96.55, 99.50, 99.40, 100, and 100%, respectively.

1 Introduction

Although human action recognition (HAR) has found numerous applications during the last few decades, including intelligent video surveillance, retrieval, and robotics, it is still considered a challenging problem by most researchers in computer vision (CV) [1-5]. Existing HAR methods generally follow a cascaded design comprising two phases: the first includes frame preprocessing and segmentation, features extraction, and features reduction. Accurate features extraction, which plays a vital role later in the second phase, recognition, may be performed by one of several available techniques, such as histogram of oriented gradients (HOG) [6], covariance features [7], local binary patterns (LBP) [8], HOG–LBP [9], and point features such as the scale-invariant feature transform (SIFT) [10]. The extracted features are then used to train classifiers for final classification in the second phase. Among many options, the support vector machine (SVM) and decision tree are considered state-of-the-art classifiers [11, 12].

1.1 Background

Several recently proposed HAR techniques are based on feature extraction and classification. In what follows, we summarise a few of the most significant works in turn. Nazir et al. [13] improved the performance of the bag of visual words (BoW) and introduced a bag of expressions (BoE) framework for HAR. In the presented approach, the major objective was to magnify the actual depth of the BoW model using scale invariance, occlusion handling, and view independence. Four data sets were utilised for experiments, including UCF-50, UCF11, KTH, and UCF Sports [14], and showed improved recognition performance. Xu et al. [15] introduced a two-stream dictionary learning-based method for HAR. The introduced approach incorporated three primary steps, namely interest patch descriptors, dictionary modelling, and classification by SVM, and proved to work efficiently on video sequences with cluttered backgrounds, similar actions, and intra-class variations. Four well-known data sets were utilised for validation of the approach, namely Weizmann, KTH, Olympic, and HMDB-51 [16], showing improved recognition performance. Zheng et al. [17] presented a sketch-based approach for HAR. The approach provided two primary properties, sketchability and objectness. Moreover, a faster region-based convolutional neural network (Faster R-CNN) was employed to detect humans in parallel, and then four types of sketch pooling techniques were presented to obtain a consistent design for HAR.
Two data sets, KTH and UCF101 [18], were used for evaluation of the presented approach and demonstrated improved performance. Huang et al. [19] introduced a discriminant feature-based approach for efficient HAR. A random forest out-of-bag evaluation-based approach was designed to extract the discriminative features for action classification. Extensive experiments were performed on two data sets, MSR Action 3D and Daily Activity 3D [20], and the authors claimed that the proposed method outperforms comparable existing approaches. Liu et al. [21] utilised the AdaBoost algorithm for the selection of the most densely coupled features from either 3D-SIFT or 3D-HOG. To classify the extracted features, the authors implemented the naïve-Bayes nearest-neighbour classifier, achieving accuracy of up to 99.4%. Vishwakarma et al. [22] described a HAR approach based on the hypothesis that rotational and translational information is present in every action or activity. They integrated structural and translational information that was later subjected to classification with the multi-class SVM (M-SVM) model. The main advantage of the presented method was the utilisation of both local and global features [23-26]. Rahman et al. [27] contemplated the problems with low-quality videos and implemented a new idea for HAR. They used shape, motion, and texture features within the standard bag-of-features framework to recognise actions. Some researchers have also utilised graph-based approaches for HAR. For example, Aoun et al. [28] proposed to represent videos with a spatio-temporal set of graphs. The system amalgamated the BoW strategy and the efficiency of graphs for structural representation of features. The results were tested on two challenging data sets, Hollywood 2 and UCF YouTube Action. The proposed method showed improved performance compared to comparable existing techniques. Hou et al. [29] introduced a convolutional neural network (CNN)-based HAR approach using skeleton data. The main advantage of using a CNN was its ability to learn directly from raw data. The authors verified their results using standard data sets and claimed improved performance. Similarly, Ji et al. [30] introduced a 3D deep CNN approach for HAR. In the introduced framework, features were extracted from both spatial and temporal dimensions using 3D convolution. The information from multiple channels was combined to build the final model, which was applied to action recognition in real-world environments. A concise summary of recent techniques is presented in Table 1.

Table 1. Summary of recent HAR techniques

Author | Year | HAR technique | Features | Selection technique | Gaps
[1] | 2017 | uniform segmentation and best features selection | HOG, Haralick, and LBP | Euclidean distance and entropy-controlled approach | high-dimensional features degrade recognition accuracy and computational time
[2] | 2018 | neural-network-based classification of fused features | shape and texture features | Pearson coefficient of skewness and principal component analysis | degrades recognition accuracy on complex and large data sets; principal features affect recognition accuracy
[31] | 2017 | hierarchical dynamic Bayesian network | deep CNN features | dictionary learning | improve the representation of global features for efficient HAR in low-resolution and complex video sequences
[32] | 2016 | kernel fusion for HAR | spatiotemporal features | — | HAR across multiple cameras
[13] | 2018 | BoE framework for HAR | spatiotemporal features | — | HAR in uncontrolled, complex, realistic scenarios
[17] | 2018 | ranking-based method for HAR | faster R-CNN | pooling representation-based selection | fusion of multiple features and selection of the most distinctive features for HAR

In this work, an improved HAR design is proposed for video sequences from controlled and uncontrolled environments. The proposed HAR design includes four core steps, from frame enhancement to action labelling. Initially, the videos are processed to improve the contrast of moving objects such as humans. Later, in the segmentation phase, humans are identified, followed by producing a bounding rectangle for the feature extraction step. HOG descriptors are extracted and later reduced by a novel dimensionality reduction technique called entropy-controlled principal components with weights minimisation (EPCAWM), which further improves the correct classification rate (CCR). We compare the results of our proposed approach with six state-of-the-art classifiers and achieve accuracies of 96.55% (Weizmann), 99.40% (UIUC), 99.50% (KTH), 100% (WVU), and 100% (MuHAVi). The results show that the proposed HAR method significantly outperforms the other techniques in terms of average CCR. As future work, we will focus on the dimensionality reduction technique for maximum reduction and improved performance. Also, the data sets will be extended and tested with deep learning methods for better performance.

1.2 Window of opportunity

To the best of our knowledge, most of the existing techniques still have to overcome several challenges and gaps that affect system accuracy and execution time. These challenges include background illumination, efficient features extraction, similarity between human actions, intra-class variation, and best features selection [33-35]. We sincerely believe that there is a need for a complete framework that addresses these issues, and this is what our proposed work is intended for.

1.3 Summary of the proposed framework and contributions

To address the aforementioned challenges, we propose an improved cascaded design for HAR with selective labels; sample results can be observed in Fig. 1. The framework primarily consolidates four phases: (i) acquisition and preprocessing, (ii) frame segmentation, (iii) features extraction and dimensionality reduction, and (iv) classification. The initial stage utilises top-hat and bottom-hat filtering along with RGB colour space enhancement, prior to the transformation into the CIE-Lab and National Television System Committee (NTSC) colour spaces. The second stage incorporates frame segmentation techniques that work in parallel. The selected channel is utilised for optical flow in conjunction with the saliency map to identify the motion of unique regions, whilst the NTSC frame undergoes segmentation based on expectation–maximisation (EM) and Otsu thresholding. Finally, the binary images, from both ends, are fused to obtain salient regions. While the third stage comprises features extraction and selection using the proposed methods, the final stage classifies regions based on the extracted features using M-SVM; a minimal end-to-end sketch of this cascade is given below.
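The sketch below outlines one plausible realisation of the four phases in Python with OpenCV and scikit-learn. The helper names, the Farnebäck optical flow, and all parameter values are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical skeleton of the four-phase cascade; helper names and
# parameters are illustrative stand-ins, not the paper's code.
import cv2
import numpy as np
from sklearn.svm import LinearSVC

def enhance_frame(bgr, kernel_size=15):
    """Phase (i): top-hat/bottom-hat based contrast enhancement (sketch)."""
    se = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    top = cv2.morphologyEx(gray, cv2.MORPH_TOPHAT, se)
    bottom = cv2.morphologyEx(gray, cv2.MORPH_BLACKHAT, se)
    return cv2.subtract(cv2.add(gray, top), bottom)

def motion_mask(prev_gray, gray):
    """Phase (ii): crude motion mask from dense optical flow (stand-in)."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=2)
    mag8 = (255 * mag / (mag.max() + 1e-6)).astype(np.uint8)
    _, mask = cv2.threshold(mag8, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return mask

def hog_features(gray_roi):
    """Phase (iii): HOG descriptor of the segmented region of interest."""
    hog = cv2.HOGDescriptor()  # default 64x128 detection window
    return hog.compute(cv2.resize(gray_roi, (64, 128))).ravel()

clf = LinearSVC()  # phase (iv): any off-the-shelf multi-class SVM
```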
The detailed flow, explained above, is depicted in Fig. 2. It is essential to mention that each phase of the proposed framework carries some degree of novelty and, therefore, takes the state of the art one step forward. The main contributions of the work are enumerated below:

- Introduction of a preprocessing block, prior to the segmentation step, which incorporates top-hat and bottom-hat filters along with the proposed RGB* colour space enhancement.
- A parallel architecture built on attention-based motion estimation and segmentation modules.
- Introduction of a frames fusion methodology combining the parallel segmentation modules.
- Features selection based on the proposed EPCAWM.

Fig. 1: Sample results of the proposed algorithm on the Weizmann data set. (a) Original frame, (b) Enhanced frame, (c) Region of interest (ROI), (d) Action labelling.

Fig. 2: System architecture of the proposed human detection and recognition algorithm.

The CCR of the proposed cascaded design is tested with six different classifiers, namely decision tree, linear discriminant analysis (LDA), weighted K-nearest neighbour (KNN), ensemble boosted tree, logistic regression, and M-SVM [11, 12]. Moreover, a detailed comparison between the proposed framework and a few existing works, using various data sets, is also presented.

1.4 Paper organisation

The paper is organised as follows: Section 1 presented the introduction of this work, including the background and the window of opportunity. Section 2 defines the problem statement. The proposed HAR approach is presented in Section 3, which includes preprocessing, frame segmentation, feature extraction, and selection. Section 4 explains the experimental results and, finally, Section 5 concludes the work.

2 Problem statement

Let V be a bounded video sequence and {f_1, f_2, …, f_n} its sequence of frames, where f_i(x, y) are the pixel values of the ith of n frames. The processed frames f̃_i are modified versions of f_i, obtained through a mapping Φ: f_i → f̃_i. The output frame is defined by the composition of operators in (1)–(3), where f_th is the top-hat frame, f_bh the bottom-hat frame, f_en the enhanced frame, f_of the optical flow, f_Lab the CIE-Lab transformation, f_YIQ the YIQ transformation frame, f_sal the saliency frame, f_EM the EM frame, f_bin the binarisation frame, f_fm the features-matching frame, and Ψ the features selection method. Further details are given in Section 3.

3 Proposed work

In this section, we explain our proposed cascaded design in brief, which comprises four distinct stages. In the initial stage, top-hat and bottom-hat filtering is applied, which identifies all distinct regions with respect to their surroundings. The fused frame is later subjected to colour space transformation from the enhanced RGB space to the CIE-Lab and NTSC spaces for improved results. We exploit the fact that salient information exists in one or more colour channels [36] and select the L channel for accurate results and fast computation. The second stage incorporates two parallel blocks that are conjoined to construct a fused binary frame. The L channel is the input to the optical flow in conjunction with the saliency map to identify attention-based motion. Similarly, the NTSC frames are segmented with the EM algorithm, followed by Otsu thresholding for binarisation. The number of EM clusters is kept to a minimum for fast computation and low visual complexity. Morphological operations are later applied on the fused binary frame to remove inutile fragments and to fill in minor gaps; a toy sketch of this fusion and cleanup step is given below.
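As a toy illustration of the fusion and cleanup step described above, the following sketch combines the two parallel binary masks and applies morphological refinement. The AND fusion policy, mask names, and area threshold are assumptions; OpenCV's connected-component filtering stands in for MATLAB's bwareaopen.

```python
# Hypothetical fusion of the two parallel segmentation outputs.
# Inputs are assumed to be binary uint8 masks (0/255) of equal size.
import cv2
import numpy as np

def fuse_masks(motion_mask, em_mask, min_area=200):
    """Combine attention-based motion mask with EM/Otsu mask, then clean up."""
    fused = cv2.bitwise_and(motion_mask, em_mask)         # keep agreed regions
    se = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    fused = cv2.morphologyEx(fused, cv2.MORPH_CLOSE, se)  # fill minor gaps
    fused = cv2.morphologyEx(fused, cv2.MORPH_OPEN, se)   # drop tiny fragments
    # Area filtering, analogous to MATLAB's bwareaopen:
    n, labels, stats, _ = cv2.connectedComponentsWithStats(fused)
    keep = np.zeros_like(fused)
    for i in range(1, n):
        if stats[i, cv2.CC_STAT_AREA] >= min_area:
            keep[labels == i] = 255
    return keep
```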
The proposed strategy copes with the curse of dimensionality by introducing a novel features selection method, EPCAWM, in its third phase. Since a classifier's performance degrades rapidly with a large number of features, the proposed selection strategy retains the most salient features while discarding trivial information. In the final phase, the selected features are utilised to train the classifier, which is chosen to be an M-SVM. The proposed architecture is shown in Fig. 2.

3.1 Preprocessing

In the preprocessing phase, the objective is to perform frame enhancement to highlight the regions with maximum information, prior to colour space transformation. The steps include top-hat and bottom-hat filtering, RGB enhancement, and, finally, transformation into the CIE-Lab and NTSC colour spaces. The fusion of the top-hat and bottom-hat frames is given by the relation

I_f = I + I_th − I_bh, (4)

where I_f is the filtered RGB frame, I_th and I_bh are the top-hat and bottom-hat responses, and I is the original RGB frame subjected to further enhancement, given by (5), where c ∈ {R, G, B} is an index over the red, green, and blue channels, with the general expressions in (6) and (7), in which R, G, and B are the red, green, and blue channels and R*, G*, and B* are their modified versions. The extended RGB colour space is finally transformed into CIE-Lab, from which we select the luminance (L) channel for further processing, owing to its high probability of containing pedestrians and its low computational cost. Finally, the NTSC transformation [37] is applied to the bottom-hat frame, which acts as the input for EM segmentation.

3.2 Frame segmentation

The frame segmentation block incorporates two parallel modules whose inputs come from the CIE-Lab and NTSC colour spaces. The output frame is a fused binary image computed from the attention-based motion estimation and segmentation modules. The crux of using two parallel blocks is to accurately identify the salient region(s). A visual comparison is provided in Fig. 3 to support our claim that two parallel blocks improve accuracy.

Fig. 3: Human detection results. (a) Original frame, (b) Enhanced frame, (c) Saliency frame, (d) EM segmentation frame, (e) Fused frame, (f) ROI detection.

3.2.1 Attention-based motion estimation

This block integrates optical flow with a saliency method to identify only the unique motions in the given frames. Optical flow [6] is used in this work for velocity estimation. In CV, optical flow finds the motion regions in video sequences from time t to time t + δt. It tracks the brightness variations across successive frames by providing information about the three-dimensional arrangement in the horizontal, vertical, and time directions, respectively. In videos, illumination changes can be flagged as motion by optical flow, so there is a risk of acquiring false motions. This problem is tackled using the proposed integrated method: only the most salient motions are considered, by computing a saliency map using existing techniques [38].

3.2.2 Binary segmentation

The binary segmentation block comprises EM segmentation and binarisation modules. The means and variances of the EM segmentation are initialised randomly, before the image binarisation step with Otsu thresholding. The number of clusters for EM is fixed at six for each frame so as to preserve visual information. The main idea is to bring the pixels' information into a specific range prior to Otsu thresholding; a rough sketch of this step is given below.
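A rough sketch of this binary segmentation block follows, with scikit-learn's GaussianMixture standing in for the EM step (six components, as in the text) and OpenCV's Otsu thresholding for binarisation; the cluster-mean quantisation detail is an illustrative assumption.

```python
# Illustrative EM segmentation + Otsu binarisation (sketch, not the paper's code).
# GaussianMixture stands in for the EM step; six components follow the text.
import cv2
import numpy as np
from sklearn.mixture import GaussianMixture

def em_otsu_binarise(gray, n_clusters=6):
    """gray: 2-D uint8 image. Returns a binary uint8 mask."""
    pixels = gray.reshape(-1, 1).astype(np.float64)
    gmm = GaussianMixture(n_components=n_clusters, covariance_type='full',
                          random_state=0).fit(pixels)
    # Replace each pixel by its cluster mean to compress intensities
    # into a few levels before thresholding.
    means = gmm.means_.ravel()
    quantised = means[gmm.predict(pixels)].reshape(gray.shape)
    quantised = cv2.normalize(quantised, None, 0, 255,
                              cv2.NORM_MINMAX).astype(np.uint8)
    _, binary = cv2.threshold(quantised, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```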
The EM segmentation optimises the standard mixture log-likelihood cost function

L(θ) = Σ_{i=1}^{N} log Σ_{j=1}^{6} π_j N(x_i | μ_j, σ_j²). (8)

3.2.3 Features matching

Features matching is based on pixel similarity between two consecutive frames, which corresponds to the fusion of identical features. In videos, illumination change causes variation in the intensity values, which is reflected in the feature vectors; as a result, fusion becomes complex. Let f_1, …, f_k be random features of the EM segmented frame and g_1, …, g_k be the random features of the saliency frame. We need to find the number of matching features X, so that the matching probability is maximum, for which the mean E(X) and variance Var(X) are required. To validate this, one needs to show that the mean E(X) = 1 and the variance Var(X) = 1.

Lemma 1.Suppose X = X_1 + X_2 + ⋯ + X_k, where X_i is a matching indicator for each feature i, given as

X_i = 1 if feature f_i matches g_i, and X_i = 0 otherwise. (9)

The obtained value is 1 if the matching percentage of vector f_i with g_i is 100.

Theorem 1.For each selected feature i ∈ {1, …, k}, let A_i be the condition that the ith feature is a match; then an image vector fulfils the condition of at least a single feature match out of k.

Proof.We use the inclusion–exclusion principle [39] to observe the unique matching feature. For the condition A_i, the ith feature is fixed and the other k − 1 features are free to permute. Thus

|A_i| = (k − 1)!. (10)

As a result, we may write

P(A_i) = (k − 1)!/k! = 1/k. (11)

Thus, the probability that the jth feature is matched is

P(A_j) = 1/k. (12)

Thus, the probability that both the ith and jth features are matched is

P(A_i ∩ A_j) = (k − 2)!/k! = 1/(k(k − 1)). (13)

It follows that, for all i, X_i has the Bernoulli distribution with chance of matching the unique feature

P(X_i = 1) = 1/k. (14)

From this discussion, the mean and variance for the ith feature can be observed as

E(X_i) = 1/k, (15)

Var(X_i) = (1/k)(1 − 1/k). (16)

From this observation, it is clear that E(X) = k · (1/k) = 1. However, for finding Var(X), one needs the covariance

Cov(X_i, X_j) = E(X_i X_j) − E(X_i)E(X_j) = 1/(k(k − 1)) − 1/k² = 1/(k²(k − 1)). (17)

Thus, the variance is

Var(X) = Σ_i Var(X_i) + Σ_{i≠j} Cov(X_i, X_j) = k(1/k)(1 − 1/k) + k(k − 1)/(k²(k − 1)) = 1. (18)

Following the above mathematical explanation, the matching frames are found based on their features' weights, as in (19), where the coefficient of the saliency and EM frame features is the normalised value of the similarity measure between them. □

3.2.4 Morphological operations

Finally, a set of morphological operators is used to refine the binary segmented image, including morphological open and close, and bwareaopen (MATLAB). The primary reason for selecting this set of operators is to fill small holes, as well as to remove the inutile fragments [36]. A brief description of segmented frames fusion is given in Algorithm 1.

Algorithm 1 (Segmented frames fusion). Input: the saliency frame (CIE-Lab data) and the EM frame (NTSC data). Initialise the fused frame; then, for each row (outer loop) and each column (inner loop), combine the corresponding binary pixels of the two input frames. Output: segmented fused frame.

3.3 Features extraction

In CV applications such as biometrics [40], medical imaging [41], agriculture [42], and video surveillance [2], features play a key role in describing an important region or object. Many feature extraction techniques have been reported in the literature; a few well-known ones are HOG (shape), LBP (texture), and SIFT (point features). The HOG feature is mostly utilised for the shape information of an object. In this work, the HOG feature is extracted from the segmented human frames. The following steps are performed for the computation of the HOG feature. In the first step, edge gradients are calculated in the horizontal and vertical directions using the kernels

K_x = [−1, 0, 1] and K_y = [−1, 0, 1]^T. (20)

Then, we find the magnitude and orientation of the gradients; the orientation gives the direction of the gradients. These are formulated as

G = √(G_x² + G_y²), θ = tan⁻¹(G_y/G_x), (21)

where G denotes the magnitude and θ denotes the direction of the gradients; a small sketch of this computation is given below.
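The gradient computation of (20) and (21) can be sketched as follows; SciPy's convolve1d and the unsigned 0–180° orientation convention are illustrative choices rather than requirements of the paper.

```python
# Illustrative gradient magnitude/orientation computation for HOG, as in
# (20)-(21); the [-1, 0, 1] derivative kernel mirrors the standard HOG choice.
import numpy as np
from scipy.ndimage import convolve1d

def gradient_mag_ori(gray):
    """gray: 2-D float array. Returns gradient magnitude and orientation."""
    gx = convolve1d(gray, [-1.0, 0.0, 1.0], axis=1)   # horizontal derivative
    gy = convolve1d(gray, [-1.0, 0.0, 1.0], axis=0)   # vertical derivative
    magnitude = np.hypot(gx, gy)                       # G = sqrt(Gx^2 + Gy^2)
    orientation = np.rad2deg(np.arctan2(gy, gx)) % 180  # unsigned, 0..180 deg
    return magnitude, orientation
```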
The gradient image discards much irrelevant information, such as constant background colour, while preserving the outlines. After that, the image is divided into small spatial regions called 'cells', whose size is fixed at 8 × 8 pixels. For each cell, a histogram of edge gradients with eight orientation bins is calculated; each cell comprises 192 pixel values (8 × 8 pixels over three colour channels). Thereafter, L2 normalisation is performed, formulated as

v̂ = v / √(‖v‖² + ε²), (22)

where v is the concatenated block vector and ε is a small constant. Hence, a block consists of four cell histograms that are concatenated to form a 32-element vector (4 × 8 orientations), calculated for each horizontal–vertical block position. The extracted HOG features are later optimised by the best features selection method, as shown in Fig. 4.

Fig. 4: Visual description of our proposed HAR methodology: a combination of the preprocessing and learning phases.

3.4 Features selection using EPCAWM

The proposed principal-component approach based on weights minimisation is used for features selection in HAR. The methodology comprises two fundamental steps. First, we convert the observed data into linearly uncorrelated data, using the entropy of the given mask to obtain the desired edge gradient. Second, we construct a sparse data set based on probabilistic weights. Let the N samples be represented by x_1, …, x_N, with mean vector x̄ = (1/N) Σ_i x_i and centralised values x̃_i = x_i − x̄. In this regard, the centralised matrix is given by

X̃ = [x̃_1, …, x̃_N]. (23)

For all x̃_i, the entropy vector of X̃ provides an improved d-dimensional real vector with the probability mass function in (24). Following this, the centralised data, in terms of the covariance matrix, is managed using the controlled entropy as in (25), where the goal of the principal component step is to find an orthonormal matrix P that decorrelates X̃, i.e. Y = P X̃ with diagonal covariance. Since the covariance matrix C is symmetric, it can be written as C = E Λ E^T, where E is an orthogonal matrix of size d × d and Λ is diagonal, with eigenvalues ordered as λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_d. Based on this symmetry, X̃ can be decorrelated by taking P = E^T, i.e. Y = E^T X̃ and Cov(Y) = Λ; through this expression, the principal components fully decorrelate X̃. Moreover, the energy of the original data is concentrated in a limited region of a given frame, while the energy of ambiguous information is spread evenly over the entire frame. To achieve a high level of sparsity for the given data, we consider a weight minimisation strategy for estimating the heteroscedastic non-zero entries [43]. We impose a minimisation weight to achieve the best possible sparsity on a cardinality basis. The ambiguous information of a frame can be minimised for the kth row as in (26), where the minimised weight coefficient is given by (27), in terms of the decorrelated covariance matrix of Y and the centralised covariance matrix of X̃. In smooth regions, the human-movement response at misleading edges of Y is much smaller than that of X̃, so the coefficient approaches zero. Therefore, most of the errors caused by misleading edges can be ignored. By applying the inverse transform to Y, one obtains the first 50 principal-component weight-minimised features, which are used for further classification.

3.5 Classification

A supervised learning model such as the SVM is widely used for the classification of objects in machine learning [44]. The approach constructs a set of hyperplanes in a high- or infinite-dimensional space to acquire efficient classification and regression results; a minimal multi-class sketch is given below.
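A minimal one-against-all multi-class SVM can be sketched with scikit-learn as follows; the toy data shapes (50 selected features, four action classes) and the linear kernel are placeholders, not the paper's configuration.

```python
# Hypothetical one-against-all multi-class SVM, matching the M-SVM strategy
# described next; OneVsRestClassifier is an illustrative stand-in.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 50))      # e.g. 50 EPCAWM-selected features per frame
y = rng.integers(0, 4, size=120)    # e.g. 4 action classes (toy labels)

# One SVM per class: class l is positive, all other classes negative.
clf = OneVsRestClassifier(SVC(kernel='linear', C=1.0)).fit(X, y)
print(clf.predict(X[:5]))
```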
The SVM was originally formulated for binary class problems, which alone cannot provide a legitimate solution for multi-class action recognition and classification; therefore, the M-SVM is adopted [45]. The implemented M-SVM follows the one-against-all method: it constructs K SVM models, where K is the number of classes. The lth SVM is trained with the samples of the lth class carrying positive labels and all remaining samples carrying negative labels. Given m training samples (x_1, y_1), …, (x_m, y_m), where x_i ∈ R^n and y_i ∈ {1, …, K} is the class of x_i, the lth SVM solves the following optimisation problem:

min_{w_l, b_l, ξ^l} (1/2)‖w_l‖² + C Σ_{i=1}^{m} ξ_i^l, (28)

subject to w_l^T φ(x_i) + b_l ≥ 1 − ξ_i^l if y_i = l, and w_l^T φ(x_i) + b_l ≤ −1 + ξ_i^l if y_i ≠ l, (29)

ξ_i^l ≥ 0, i = 1, …, m, (30)

where the y_i are class labels and the x_i are the objects used for training the classifier. The recognition results of the proposed algorithm using the M-SVM are shown in Fig. 5.

Fig. 5: Recognition results from the KTH data set. (a) Original frame, (b) Segmented frame, (c) Classified frame.

4 Experimental set-up

4.1 Selected data sets

4.1.1 Weizmann

This data set [3] is considered one of the most famous and useful data sets for HAR models under a controlled environment. It includes a total of 90 human video sequences covering ten human actions, which are listed in Table 2. The video sequences are performed by nine persons in a static environment at a resolution of 180 × 144 pixels.

Table 2. Description of selected action classes and their labels

Weizmann data set: walking (W), running (R), jumping (J), jumping jack (K), jumping …
WVU multi-view action data set: clapping (C), jogging (G), jumping jack (K), kick (I)
MuHAVi action data set: ClimbLadder (L), CrawlOnKnees (N), DrunkWalk (W), JumpOverFence (F)