3D Features for human action recognition with semi-supervised learning
2019; Institution of Engineering and Technology; Volume: 13; Issue: 6; Language: English
DOI: 10.1049/iet-ipr.2018.6045
ISSN: 1751-9667
Authors: Suraj Prakash Sahoo, Ulli Srinivasu, Samit Ari
Topic(s): Anomaly Detection Techniques and Applications
IET Image Processing, Volume 13, Issue 6, pp. 983-990. Research Article. First published: 12 April 2019. https://doi.org/10.1049/iet-ipr.2018.6045

Suraj Prakash Sahoo (corresponding author, surajprakashsahoo@gmail.com), Ulli Srinivasu, and Samit Ari; Department of Electronics and Communication Engineering, National Institute of Technology Rourkela, Odisha, 769008, India.

Abstract

Human action recognition (HAR) is a very challenging task because of intra-class variations and complex backgrounds. Here, a motion history image (MHI)-based interest point refinement is proposed to remove noisy interest points. Histogram of oriented gradient (HOG) and histogram of optical flow (HOF) techniques are extended from the spatial to the spatio-temporal domain to preserve temporal information. These local features are used to build the trees of a random forest. During tree building, semi-supervised learning is proposed for better splitting of the data points at each node. To recognise an action, the mutual information towards each trained class is estimated for all extracted interest points by passing them through the random forest. The proposed method is evaluated on the KTH, Weizmann, and UCF Sports standard datasets. The experimental results indicate that the proposed technique provides better performance than earlier reported techniques.

1 Introduction

One of the vital research areas in the field of computer vision and pattern recognition is human action recognition (HAR). Applications of HAR include security and monitoring systems such as video surveillance and observation of patients' actions in hospitals, video content analysis such as video retrieval and sports video analysis, human-computer interaction for automated systems, and visual effects and animation in movies and video games.
Many algorithms have been developed for HAR [1-13]; still, it remains a very active research area with many open challenges. It is challenging because of complex backgrounds, illumination variations, and similar body motion across intra-class actions. Within the same action class, different persons perform differently due to variations in speed, clothing, scale, occlusion, and viewpoint. Human action in video data is represented by spatio-temporal interest points (STIPs) [14]. Due to background noise and illumination variations, noisy STIPs are detected along with the actual action points. To recognise actions more accurately, these noisy points need to be eliminated. During extraction of the histogram of oriented gradient (HOG) [2, 14] around the interest points, the gradients of the video sub-volume are computed in the spatial dimensions, followed by histogram estimation. However, the gradient in the temporal dimension is more important for describing the temporal relationship of an action. In the unsupervised learning technique [15], the feature space is split by taking the maximum variance of the feature differences of random hypotheses. As the tree grows, the number of features reduces at subsequent child nodes, which makes the splitting more difficult. To handle the above-mentioned challenges, a novel algorithm for HAR is developed in this work with the following propositions: (a) filtering of interest points through the motion history image (MHI) to reduce noisy interest points, (b) extension of HOG and HOF from the spatial to the spatio-temporal domain to describe the temporal relationship of an action, and (c) modified semi-supervised learning in a random forest for classification. The MHI [16]-based region of interest (ROI) is extracted to localise where exactly the action takes place in the video. After extracting the region of interest, the detected STIPs are passed through the ROI so that the noisy points are eliminated. Action patterns are extracted in the three dimensions (3D) of the video with extended 3D gradients. By extending the gradient computation and the optical flow velocity into the temporal dimension, the histogram of 3D-oriented gradients (HOG3D) and the histogram of 3D optical flow (HOF3D) are computed. The action classification problem is solved by maximising the mutual information of the STIPs [2]. The mutual information is estimated using semi-supervised-learning random forest voting. The random forest is constructed by building M independent decision trees. Both unsupervised and supervised learning methods are used to build the decision trees; hence the name semi-supervised learning. In this technique, the feature space is split by unsupervised learning at the initial nodes of a tree. As the tree grows, the feature count reduces and variance-based splits become difficult; the learning technique therefore switches to supervised. Experiments are conducted on the publicly available standard KTH, Weizmann, and UCF Sports datasets. The performance of the proposed technique on these datasets is presented in the experiments section. The rest of the paper is organised as follows: Section 2 describes related work. Section 3 describes the proposed work in depth, including the problems with existing techniques and the propositions to handle them. Experimental results and discussion are presented in Section 4. Finally, Section 5 concludes the paper.
2 Related works

Early development of HAR started with view-based template matching [17], where the template is estimated using motion energy and motion history images. The template-matching technique is computationally inefficient, and the templates are sensitive to variations in pose, scale, and rotation. Y.M. Lui et al. [11] recognised various actions on different datasets by separating the action differences in tangent space. Ivan Laptev [14] extended the concept of the Harris 2D interest point to 3D STIP detection for video data. STIPs give a compact representation of video data. After extracting the STIPs, spatio-temporal descriptors are computed for action classification. View-invariant features [1] were proposed by K.P. Chou et al. for multi-view HAR. The Haar wavelet transform is used for feature extraction in [3]. STIP and 3DSURF features are fused and classified by a multi-class SVM in [6]. The works of [2, 3, 6] take no special care to differentiate actions like 'running' and 'jogging', which are distinguished mostly by speed. One of our earlier papers [5] used 3D spatio-temporal planes to extract the temporal information of action videos. Another type of feature, the spatial-temporal histogram of gradients (SPHOG) [12], was proposed by B. Lin et al. for HAR. A spatio-temporal feature named long-short term feature (LSTF) representation is used by Y. Huang et al. [18]. View-based key-pose-matching algorithms are presented in [19]: a series of 2D human poses is modelled from multiple viewpoints, and a graph-model-based ActionNet is built from the synthesised 2D poses. Modelling poses for different views is computationally complex because the pose modelling requires estimating many parameters. A 3D convolutional neural network (3DCNN) is developed for HAR in [20]. CNN-based techniques provide both feature extraction and classification in a single entity. 3DConvNet features are fused with traditional feature extraction techniques to provide a better feature vector in [4]. A novel dynamic neural network was proposed by M. Jung et al. [9] for HAR. The work of [13] leverages LSTM and attention networks along with a CNN to recognise human actions. In a 3DCNN, the initial convolutional layers extract features in both the spatial and temporal dimensions. Adjusting the temporal and spatial convolutional kernels so that the network recognises both fast-motion and slow-motion actions is a difficult task; as a result, recognition performance decreases for the 'running' class in [20]. Y. Yuan et al. [21] used spatial-optical data organisation along with sequential learning to recognise different actions, applying motion trajectories and optical flow to the whole RGB video. In the proposed work, by contrast, the histogram of optical flow (HOF) is calculated for each video sub-volume around each interest point, along with HOG3D.

3 Proposed framework

The block diagram of the proposed HAR paradigm is shown in Fig. 1. During the training phase, STIPs are extracted from the training video data. From the same video, an ROI is selected using the MHI to filter the STIPs and eliminate the noisy points. A video sub-volume is extracted around each STIP and described by HOG3D and HOF3D features; for both, the feature size is 144. A semi-supervised-learning-based random forest technique is developed to build the trees of the forest from the feature space.
During testing, each STIP is passed through the random trees of the forest for voting. Mutual information is calculated by taking the logarithm of the posterior probability, which is computed as the average of the voting scores. Finally, the decision is made by taking the maximum mutual information among all the classes.

Fig. 1: Block diagram of the proposed human action recognition algorithm

3.1 Refined spatio-temporal interest points through MHI

The 3D STIP was proposed by Ivan Laptev [14] for video data. STIPs represent the regions where significant changes are present in the spatial and temporal dimensions of video data. To detect STIPs, the Laplacian of the video is computed with the help of a Gaussian kernel with separate scales for the spatial and temporal dimensions. A multi-scale approach [22] for STIP detection removes the task of adaptive scale selection. The STIPs detected at multiple scales are shown in Fig. 2. The next problem with STIPs is that they may come from the foreground or the background. The background STIPs are erroneous: they are generated by background clutter, illumination changes, or camera shake. These erroneous points must be removed for a better representation of an action. In the proposed work, the problem is handled by using the MHI [23] along with the STIPs as a post-processing step.

Fig. 2: Detected STIPs at multiple scales for different actions: (a) hand waving and (b) running from the KTH dataset; (c) diving and (d) kicking from the UCF Sports dataset

The MHI is a view-based 2D image representation of the history of object motion in a video. The MHI preserves a history of the spatio-temporal variations at each pixel location, which then decay over time. In the MHI representation, recent motion pixels are brighter and earlier motion pixel values decrease with time. The MHI is computed as

$$H_\tau(x, y, t) = \begin{cases} \tau, & \text{if } \Psi(x, y, t) = 1 \\ \max\!\big(0,\; H_\tau(x, y, t-1) - 1\big), & \text{otherwise} \end{cases} \quad (1)$$

where $\Psi(x, y, t)$, which represents motion, is the binarised image sequence formed by simple frame differencing and thresholding, $\tau$ is the number of frames considered to build the MHI, $(x, y)$ are the spatial dimensions, and t is the temporal dimension of the video frame. From the MHI template, a rectangular region is extracted. Sometimes erroneous rectangles are generated due to noise; these smaller rectangles are removed by a predefined hard threshold chosen empirically. The rectangles extracted from different frames are not all of the same size. Therefore, the ROI is chosen as the rectangle that contains all the MHI rectangles. Fig. 3 shows the extracted ROI for the 'skip' action from the Weizmann dataset along with the filtered STIPs.

Fig. 3: Process of removing noisy STIPs with the help of the MHI-based ROI: (a) STIPs detected at multiple scales, (b) rectangles extracted from each frame by MHI, (c) extracted region of interest, (d) noise-free STIPs detected mostly in the region where the action happens. (The 'lifting' action class from the Weizmann dataset is used for demonstration purposes)
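For illustration, a minimal NumPy sketch of the MHI of (1) and the ROI-based STIP filtering could look as follows. This is not the authors' code (the paper's implementation is in MATLAB); the function names, the one-unit decay per frame, the difference threshold, and the use of a single enclosing bounding box are assumptions of this sketch. The paper additionally removes small erroneous per-frame rectangles with an empirical hard threshold, which the `min_side` check only approximates.

```python
import numpy as np

def motion_history_image(frames, tau=20, diff_thresh=30):
    """MHI of eq. (1): pixels flagged as moving are set to tau,
    all other pixels decay by one per frame, floored at zero.
    `frames` is a list of greyscale uint8 images."""
    mhi = np.zeros(frames[0].shape, dtype=np.float32)
    for prev, curr in zip(frames[:-1], frames[1:]):
        # Psi(x, y, t): binary motion mask by frame differencing + threshold
        psi = np.abs(curr.astype(np.int16) - prev.astype(np.int16)) > diff_thresh
        mhi = np.where(psi, float(tau), np.maximum(mhi - 1.0, 0.0))
    return mhi

def roi_from_mhi(mhi, min_side=10):
    """Rectangle enclosing all motion evidence in the MHI; tiny,
    noise-only boxes are rejected by an empirical threshold."""
    ys, xs = np.nonzero(mhi > 0)
    if xs.size == 0:
        return None
    x0, x1, y0, y1 = xs.min(), xs.max(), ys.min(), ys.max()
    if min(x1 - x0, y1 - y0) < min_side:
        return None
    return x0, y0, x1, y1

def filter_stips(stips, roi):
    """Keep only STIPs (x, y, t, sigma, tau) that fall inside the ROI."""
    x0, y0, x1, y1 = roi
    return [p for p in stips if x0 <= p[0] <= x1 and y0 <= p[1] <= y1]
```

Passing the detected multi-scale STIPs through `filter_stips` keeps only the points in the region where the action happens, as illustrated in Fig. 3.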
3.2 Extension of spatial features to spatio-temporal domain

After obtaining the refined STIPs, 3D video patches are extracted around each interest point. The dimensions of a video patch depend on the Gaussian scales at which the STIP is detected. Based on the Gaussian kernel scales, a spatial extension factor of 18 and a temporal extension factor of 9 are used to extract the video patch, as mentioned by I. Laptev [22]. The HOG and HOF features are extended from 2D to 3D as the proposed HOG3D and HOF3D [24].

HOG describes object shape and appearance. Therefore, it is used in action recognition to represent the appearance of an action by describing it through local intensity gradients. The greatest advantage of HOG is that it is robust to scale and rotation variations. HOG has performed well as a well-known feature extraction technique; however, in video-based action recognition applications, it fails to consider the important temporal information. This limits recognition when fast-motion and slow-motion actions are recognised simultaneously. Therefore, in this work, HOG is extended from the spatial to the spatio-temporal domain. The detailed procedure is shown in Fig. 4. Here, 3D gradients [25] are used so that temporal information is considered while computing the histogram. The improved Sobel kernels used to compute the 3D gradients are obtained by combining a 1D derivative kernel with 1D smoothing kernels along the remaining two axes:

$$\nabla_x = h' \otimes h \otimes h, \quad \nabla_y = h \otimes h' \otimes h, \quad \nabla_t = h \otimes h \otimes h', \qquad h = [1\ 2\ 1],\ \ h' = [-1\ 0\ 1] \quad (2)$$

Fig. 4: HOG3D feature extraction procedure from the video patches extracted around STIPs. The video patch is divided into sub-patches and a feature is extracted for each sub-patch independently. The sub-patch features are concatenated to form the final HOG3D feature

From the computed 3D gradients, azimuth and elevation angles are calculated by converting to the spherical coordinate system. The video patch is sub-divided into small cells by dividing the patch into a 3 × 3 spatial grid; the temporal dimension is divided into two temporal bins. This forms the 3 × 3 × 2 grid representation shown in Fig. 4. An eight-bin histogram is computed for each cell by dividing the azimuth and elevation angles into eight orientation directions. The histograms of the cells are concatenated to form the feature vector of length 144 (3 × 3 × 2 cells × 8 bins). The following equations give the spherical-coordinate gradient values:

$$\mathrm{Mag} = \sqrt{g_x^2 + g_y^2 + g_t^2}, \quad \phi = \tan^{-1}\!\frac{g_y}{g_x}, \quad \theta = \tan^{-1}\!\frac{g_t}{\sqrt{g_x^2 + g_y^2}} \quad (3)$$

where $(g_x, g_y)$ are the gradients in the spatial dimensions and $g_t$ is the gradient in the temporal dimension, Mag is the gradient magnitude, $\theta$ is the elevation angle, and $\phi$ is the azimuth angle. The histogram is calculated by updating each bin by a weight equal to the magnitude:

$$h(b) \leftarrow h(b) + \mathrm{Mag}(x, y, t) \quad (4)$$

Here, b is the oriented-bin index determined by quantising $(\phi, \theta)$, and (4) shows the update of that particular bin value.
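As a hedged illustration of Section 3.2, the sketch below computes a 144-dimensional HOG3D descriptor for one video patch. It substitutes NumPy's central-difference gradients for the improved Sobel kernels of (2), and it assumes one particular way of quantising the (azimuth, elevation) pair into eight orientation bins (four azimuth sectors times two elevation sectors); the paper's exact binning may differ. HOF3D would be built analogously, with optical-flow components in place of the gradients.

```python
import numpy as np

def hog3d(patch, cells=(3, 3, 2), n_bins=8):
    """HOG3D for one video patch of shape (Y, X, T): 3D gradients ->
    spherical angles (eq. (3)) -> per-cell magnitude-weighted 8-bin
    histograms (eq. (4)) -> concatenated 3*3*2*8 = 144-dim vector."""
    gy, gx, gt = np.gradient(patch.astype(np.float32))
    mag = np.sqrt(gx**2 + gy**2 + gt**2)
    azimuth = np.arctan2(gy, gx)                        # in [-pi, pi]
    elevation = np.arctan2(gt, np.sqrt(gx**2 + gy**2))  # in [-pi/2, pi/2]

    # Assumed quantisation: 4 azimuth sectors x 2 elevation sectors
    az_bin = np.minimum(((azimuth + np.pi) / (2 * np.pi) * 4).astype(int), 3)
    el_bin = (elevation > 0).astype(int)
    ori_bin = az_bin * 2 + el_bin                       # 8 orientation bins

    Y, X, T = patch.shape
    cy, cx, ct = cells
    feat = []
    for i in range(cy):
        for j in range(cx):
            for k in range(ct):
                ys = slice(i * Y // cy, (i + 1) * Y // cy)
                xs = slice(j * X // cx, (j + 1) * X // cx)
                ts = slice(k * T // ct, (k + 1) * T // ct)
                hist = np.zeros(n_bins)
                # eq. (4): each bin accumulates gradient magnitude
                np.add.at(hist, ori_bin[ys, xs, ts].ravel(),
                          mag[ys, xs, ts].ravel())
                feat.append(hist)
    return np.concatenate(feat)                         # length 144
```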
3.3 Semi-supervised learning in random forest classifier for HAR

Random forest [26] is a machine-learning technique that can be used for both regression and classification. Basically, the random forest is built from M independent decision trees. In general, a single decision tree is a weak classifier; combining or averaging the decision scores of all M independent trees makes it a strong classifier. Here, a semi-supervised learning technique is proposed to train the random forest. The unsupervised learning technique [15] is used at the root node of each decision tree and continued until a predefined depth of the tree is reached; the training procedure then switches to supervised learning. Hence the name semi-supervised learning, and it improves the splitting of the data while building the trees. The proposed semi-supervised technique is thus supervised up to some extent. The difference is that the splits at the initial nodes are carried out by the unsupervised technique, as labels are not important initially: large amounts of similar features can be distinguished by the less complex unsupervised technique. After a certain tree depth, the procedure switches to supervised learning, in which class information is used in the error calculation for splitting the feature dataset at the current node.

In unsupervised learning, the splitting of the feature data at tree nodes depends on the variance of the feature differences. In an unsupervised random forest, after a certain tree depth it becomes difficult to split the feature dataset because the variance becomes negligible. Supervised learning at the later nodes and unsupervised learning at the initial nodes can therefore be combined to work better. This analysis motivated the proposed semi-supervised-learning random forest. Algorithm 1 (see Fig. 5) describes the construction of the random trees based on the semi-supervised learning technique.

Fig. 5: Algorithm 1: building the random trees of the forest using the semi-supervised learning technique

Let the extracted feature space contain N STIPs, denoted $F = \{(f_i^{\mathrm{HOG3D}}, f_i^{\mathrm{HOF3D}})\}_{i=1}^{N}$, where $f^{\mathrm{HOG3D}}$ and $f^{\mathrm{HOF3D}}$ are the HOG3D and HOF3D features, respectively. The trees of the random forest are built by splitting the features at the current node into a left and a right child. A random number $r \in \{1, 2\}$ is generated: if r is 1, the HOG3D feature is selected, and if r is 2, the HOF3D feature is selected. Two more random numbers $(n_1, n_2)$ are then generated as feature dimension indices, and a feature difference [2] is evaluated as $d_i = f_i(n_1) - f_i(n_2)$, $i = 1, 2, \ldots, N$. After computing the feature differences, a threshold is computed to split the data, using two different learning procedures at two different depth levels of the tree. Since the feature count is very large at the initial nodes, the unsupervised process splits the features accurately by simply computing the variance of each hypothesis. The threshold is selected as the mean of the feature differences of the hypothesis having maximum variance. The mean and variance of the feature differences for hypothesis j are computed as

$$\mu_j = \frac{1}{N}\sum_{i=1}^{N} d_i^{(j)}, \qquad \sigma_j^2 = \frac{1}{N}\sum_{i=1}^{N}\left(d_i^{(j)} - \mu_j\right)^2, \quad j = 1, \ldots, k \quad (5)$$

where k is a constant chosen empirically as 200, meaning the hypothesis-generation procedure is repeated k times. The threshold is estimated as

$$\theta = \mu_{j^\ast}, \qquad j^\ast = \arg\max_{j}\ \sigma_j^2 \quad (6)$$

If $d_i < \theta$, the corresponding feature data goes to the left child of the current node; otherwise it moves to the right child. When the splitting reaches a certain predefined depth of the tree, the variance of the features becomes very small, creating a problem for unsupervised splitting. Hence, at that depth, the learning procedure is switched to supervised [2]. During supervised learning, class information is used to compute a binary misclassification error from the feature differences. The threshold is estimated as

$$\theta = \arg\min_{\theta}\ \min\!\left(E_{\mathrm{left}}^{l} + E_{\mathrm{right}}^{\bar{l}},\; E_{\mathrm{left}}^{\bar{l}} + E_{\mathrm{right}}^{l}\right) \quad (7)$$

where $E_{\mathrm{left}}^{l}$ and $E_{\mathrm{right}}^{l}$ are the misclassification errors of the left and right nodes when the node is labelled l, and $E_{\mathrm{left}}^{\bar{l}}$ and $E_{\mathrm{right}}^{\bar{l}}$ are the corresponding errors when the node is labelled not-l. The error function is computed as

$$E_{\mathrm{left}}^{l} = \sum_{i \in \mathrm{left}} \mathbb{1}(c_i \neq l) \quad (8)$$

Here, the indicator function $\mathbb{1}(\cdot)$ equals 1 when its argument is true and zero otherwise, and $c_i$ is the class label of the i-th feature point. The other error terms are estimated similarly. The tree-splitting procedure is stopped when the maximum depth of the tree is reached or the minimum number of features at the current node is reached. To compute the posterior probability $p(c \mid s)$, the information from all tree leaves containing the STIP s is combined. Suppose that for tree $T_m$ the STIP s is matched to a leaf with $P_m$ positive query STIP points and $N_m$ negative points [15], and M is the total number of trees; then $p(c \mid s)$ is computed as

$$p(c \mid s) = \frac{1}{M} \sum_{m=1}^{M} \frac{P_m}{P_m + N_m} \quad (9)$$
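The following sketch illustrates the two split rules of Algorithm 1 and the voting-based decision rule; it also previews the mutual-information criterion formalised in (10) of Section 3.3.1 below. It is a simplified NumPy rendering, not the authors' MATLAB code: `F` is an N × D matrix of descriptors, `labels` an integer array, and `forest` is assumed to be a list of callables mapping a STIP descriptor to per-class leaf-vote fractions (a per-class analogue of (9)).

```python
import numpy as np

rng = np.random.default_rng(0)

def random_hypothesis(F):
    """One binary test of [2]: the difference of two randomly chosen
    dimensions of the (randomly chosen) HOG3D or HOF3D descriptor."""
    n1, n2 = rng.integers(0, F.shape[1], size=2)
    return F[:, n1] - F[:, n2], (n1, n2)

def unsupervised_split(F, k=200):
    """Eqs. (5)-(6): of k random hypotheses, keep the one with maximum
    variance of the feature differences; the threshold is its mean."""
    d, test = max((random_hypothesis(F) for _ in range(k)),
                  key=lambda h: h[0].var())
    return d.mean(), test

def supervised_split(F, labels, l, k=200):
    """Eqs. (7)-(8): keep the hypothesis/threshold minimising the
    binary misclassification error for class l versus the rest."""
    best_err, best = np.inf, None
    for _ in range(k):
        d, test = random_hypothesis(F)
        theta = d.mean()                     # candidate threshold
        left, right = labels[d < theta], labels[d >= theta]
        # label left 'l' and right 'not l', or the reverse;
        # keep the cheaper assignment, as in eq. (7)
        err = min(np.sum(left != l) + np.sum(right == l),
                  np.sum(left == l) + np.sum(right != l))
        if err < best_err:
            best_err, best = err, (theta, test)
    return best

def classify_clip(stip_features, forest, priors):
    """Eqs. (9)-(10): average the leaf votes of the M trees into
    p(c|s), accumulate log p(c|s) - log p(c) over the clip's STIPs,
    and return the class with maximum mutual information."""
    priors = np.asarray(priors, dtype=np.float64)
    mi = np.zeros(len(priors))
    for s in stip_features:
        post = np.mean([tree(s) for tree in forest], axis=0)  # eq. (9)
        mi += np.log(post + 1e-12) - np.log(priors)           # eq. (10)
    return int(np.argmax(mi))
```

In this reading, a node calls `unsupervised_split` while its depth is below the switch depth and `supervised_split` afterwards; the switch depth is predefined, k = 200 is the paper's empirical choice, and Section 4.4 fixes the forest size at M = 50 trees.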
3.3.1 Estimation of mutual information

The action classification problem is solved by estimating the mutual information [2] of each STIP towards all the action classes present in the training database. Video data is represented by a set V of STIPs, each described by its feature vector. The action class set has l classes and is represented by $Cl = \{1, 2, \ldots, l\}$. The mutual information of a video clip with class c is then

$$MI(c; V) = \sum_{s \in V} \log p(c \mid s) \;-\; \sum_{s \in V} \log p(c) \quad (10)$$

In (10), the first term represents the mutual information due to the posterior probability and is estimated by random forest voting. The second term is the mutual information due to the prior probability, and it depends on the number of STIPs present in the particular class.

4 Experimental results and discussion

For the evaluation of the proposed method, the KTH, Weizmann, and UCF Sports datasets are used. The proposed method is implemented in MATLAB Version 9.0.0.341360, Release 2016a (MathWorks Inc.). The algorithms are executed on an Intel Core i5 personal computer with a 3.20 GHz clock, 6 GB RAM, and a Windows 10 platform.

4.1 Datasets

KTH dataset: The KTH dataset [27] is one of the standard datasets for action recognition. It contains six different actions: 'boxing', 'hand waving', 'hand clapping', 'running', 'jogging', and 'walking'. Each action is performed by 25 persons under four different scenarios: outdoor, indoor, different clothes, and normal clothes. The resolution of the videos is 120 × 160 and the frame rate is 25. The experimental setup for the KTH dataset is the same as in [2], i.e. 16 persons' actions form the training set and nine persons' actions form the testing set.

Weizmann dataset: The Weizmann dataset [28] contains 10 action classes performed by nine different persons. The action classes are 'bend', 'skip', 'jumping jack', 'jump forward on two legs', 'jump in place on two legs', 'gallop sideways', 'wave one hand', 'wave two hands', 'running', and 'walking'. The videos have a spatial resolution of 144 × 180 pixels at 25 frames per second and were acquired with a fixed camera.

UCF Sports dataset: The UCF Sports dataset [29] consists of 10 different sports actions gathered from broadcast television. The actions are 'diving', 'golf swing', 'kicking', 'lifting', 'riding horse', 'running', 'skate boarding', 'swinging bench', 'swinging side', and 'walking'. In total, 150 videos of resolution 480 × 720 are present in the dataset. Since the dataset contains few video clips, a leave-one-out cross-validation method (similar to [29]) is followed.

4.2 Effect of STIP with MHI on performance

The extracted STIPs contain noisy points, whose presence hampers performance. With the help of the MHI-based region of interest, the noisy points are eliminated in this work: the MHI gives the region where exactly the action is happening in a video. The effect of the MHI on filtering the important STIPs is studied in this section. For verification, the KTH dataset is used with supervised training. The results are shown in Table 1. When the MHI is used as the ROI, the performance improves with the HOG3D, HOF3D, and combined (HOG3D, HOF3D) features, and the combined feature provides a better result than either single feature type. The effects of feature type and learning type are studied further in the next sub-section.

Table 1. Performance of action recognition without MHI and with MHI

Feature            Without MHI (noisy STIPs), %   With MHI (noise-free STIPs), %
HOG3D              92.59                          93.05
HOF3D              90.66                          91.20
(HOG3D, HOF3D)     93.28                          94.40

4.3 Experiments on three-dimensional features with semi-supervised learning

The performance of the action recognition algorithm is evaluated on the different datasets with the HOG3D, HOF3D, and combined (HOG3D, HOF3D) feature descriptors.
Various learning techniques (unsupervised, supervised, and semi-supervised) with the random forest algorithm are studied, and the results are shown in Table 2. For the KTH dataset, the highest performance of 96.29% is achieved with the combination of (HOG3D, HOF3D) features and semi-supervised learning. In previous algorithms [2, 20], the 'running' and 'jogging' classes of the KTH dataset are misclassified: though running and jogging are similar actions, their speeds differ. The proposed algorithm, with extended 3D gradients and optical flow, handles fast and slow motion well; therefore, the two actions are classified exactly, with improved overall performance. The semi-supervised-learning-based random forest classifier achieved good performance compared to the other two techniques. The (HOG3D, HOF3D) feature descriptor achieved improvements of 0.69, 1.35, and 1.39% over the HOG3D feature descriptor for the unsupervised, supervised, and semi-supervised random forest classifiers, respectively.

Table 2. Performance comparison of action recognition with different features and learning methods for different datasets (accuracy, %)

Dataset      Features          Unsupervised   Supervised   Proposed semi-supervised
KTH          HOG3D             92.59          93.05        94.90
             HOF3D             91.66          91.20        93.50
             (HOG3D, HOF3D)    93.28          94.40        96.29
Weizmann     HOG3D             94.93          93.33        94.32
             HOF3D             92.22          94.44        95.06
             (HOG3D, HOF3D)    96.67          96.67        97.77
UCF Sports   HOG3D             80.79          82.68        86.97
             HOF3D             83.24          81.92        83.97
             (HOG3D, HOF3D)    84.16          85.15        90.29

Compared with the KTH dataset, the Weizmann dataset achieved a good accuracy of 97.77% with the (HOG3D, HOF3D) feature descriptor and the semi-supervised learning method, because the Weizmann dataset is simpler than KTH. For the Weizmann dataset, both the unsupervised and supervised learning methods provide the same accuracy of 96.67%. The HOF3D feature performs better than HOG3D with the supervised and semi-supervised learning methods. The 'running' action is misclassified as the 'side' (gallop sideways) action, and the 'wave one hand' action is misclassified as 'wave two hands'; otherwise, the action classes are properly classified.

The UCF Sports dataset contains realistic sports action videos. The number of video clips varies from class to class, and the dataset contains human actions as well as human interaction with sports items. Due to all these difficulties, the classification of actions is more complex. Leave-one-out cross-validation is used, where the overall performance is computed by averaging the per-fold accuracies. The UCF Sports dataset performed well with (HOG3D, HOF3D) features and the semi-supervised learning method, with an accuracy of 90.29%. The 'swing bench' action is misclassified as the 'swing side' action, as the two are similar; the misclassification is around 10%. HOG3D with unsupervised learning obtained 80.79%, the lowest accuracy on the UCF Sports dataset among the tested combinations, which shows that HOG3D performance drops when the actions exhibit large variations, especially camera motion. However, with semi-supervised learning, HOG3D obtained good accuracy. Semi-supervised learning gained 6.18, 0.73, and 6.13% accuracy over unsupervised learning for HOG3D, HOF3D, and (HOG3D, HOF3D), respectively. With all learning methods, the (HOG3D, HOF3D) feature descriptor achieved better performance than HOG3D or HOF3D alone.

4.4 Deciding the size of the random forest

One of the key parameters in the random forest is the number of trees.
The parameter is set by experimenting with different forest sizes on the different datasets and feature combinations. Fig. 6 shows the action classification accuracy with varying random forest size; all plots are for semi-supervised learning. The forest size varies from 20 to around 80 trees. The performance improves from 20 to 50 trees; after 50 trees, it either fluctuates or decreases. For the KTH and Weizmann datasets, the curves of classification accuracy versus number of trees are flatter, whereas for the UCF Sports dataset the variation is significant. As the performance for all combinations is found to be maximum when the number of trees is around 50, the forest size is set to 50 for all the experiments.

Fig. 6: Recognition accuracy versus forest size for deciding the optimum number of trees in the forest, for the (a) KTH, (b) Weizmann, and (c) UCF Sports datasets, with the HOG3D, HOF3D, and (HOG3D, HOF3D) features separately

4.5 Analysis on computational time

An analysis of computational time is required, as accuracy is not the only criterion for real-time execution. The computational time for each step of the proposed algorithm is shown in Table 3. The analysis is carried out for all three datasets used in the experiments. As shown in Table 3, the proposed method takes more time for interest point extraction, as it post-processes the extracted STIPs through the ROI. The post-processing step reduces the number of interest points from 591.2 to 96.3 for KTH, 76.8 to 26.3 for Weizmann, and 947.4 to 656.3 for UCF Sports. The interest point count is averaged for