Peer-reviewed article

Facial expression recognition techniques: a comprehensive survey

2019; Institution of Engineering and Technology; Volume: 13; Issue: 7; Language: English

10.1049/iet-ipr.2018.6647

ISSN

1751-9667

Authors

Saranya Rajan, Poongodi Chenniappan, Somasundaram Devaraj, Nirmala Madian

Topic(s)

Face recognition and analysis

Abstract

IET Image Processing, Volume 13, Issue 7, pp. 1031-1040. Review Article. First published: 07 May 2019. https://doi.org/10.1049/iet-ipr.2018.6647

Saranya Rajan (corresponding author, shreesaranya1987@gmail.com) and Poongodi Chenniappan: ECE, Bannari Amman Institute of Technology, Erode, TN, India. Somasundaram Devaraj and Nirmala Madian: ECE, Sri Shakti Institute of Engineering and Technology, Coimbatore, TN, India.

Abstract

Over the past decades, facial expression recognition (FER) has become an interesting research area and has achieved substantial progress in computer vision. FER detects the human emotional state and is related to biometric traits. Developing a machine-based human FER system is a challenging task. Various FER systems have been developed using algorithms based on facial muscle motion and skin deformation. In conventional FER systems, the developed algorithms work on constrained databases; in the unconstrained environment, the efficacy of existing algorithms is limited by issues arising during image acquisition. This study presents a detailed review of FER techniques, classifiers and the datasets used for analysing the efficacy of recognition techniques. Moreover, this survey will assist researchers in understanding the strategies and innovative methods that address the issues in real-time applications. Finally, the review presents the challenges encountered by FER systems along with future directions.

1 Introduction

Facial expression recognition (FER) has a high impact in the field of pattern recognition, and substantial effort has been made by researchers to develop FER systems for human-computer interaction applications. Facial expressions provide sensitive information cues for building an FER system and are considered the best means of recognising human emotions and intentions.
In 1971, Ekman and Friesen [1] defined six distinct expressions (happy, sad, anger, surprise, fear and disgust) as the basic emotions; each is associated with a unique facial expression that is readily recognised across different cultures. The psychologist Mehrabian [2] studied information communication between humans. The study revealed that 55% of information is conveyed by facial expression, 38% by supporting cues such as tone of voice and other sounds, and only 7% by spoken language. Currently, FER plays a central role in artificial intelligence and serves potential real-world applications in many areas: psychological studies [3], driver-fatigue monitoring, interactive game design, portable mobile applications that automatically insert emotions into chat, assistance systems for autistic people, facial nerve grading in the medical field [4], emotion detection systems used by disabled people to alert a caretaker, and socially intelligent robots with emotional intelligence [5].

Most research on FER systems follows the framework of pattern recognition [6]. It consists of three phases: face detection, facial feature extraction and expression classification. Each phase is a substantial research problem in its own right. In this survey, the phases of facial expression analysis are discussed along with the distinct algorithms used to classify the six basic expressions. Face detection is performed by algorithms such as the Haar classifier and adaptive skin colour algorithms. Gabor features, local binary patterns (LBPs), the active appearance model, principal component analysis and other algorithms are exploited for feature extraction. The classifiers used for expression classification include the support vector machine (SVM), neural networks and nearest neighbour. A skeletal sketch of this three-phase pipeline is given below.
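As a structural illustration of this three-phase framework (an outline only, not code from any surveyed system), the sketch below shows how the phases compose. The scikit-learn SVC classifier and integer expression labels are placeholder assumptions; the detection and feature steps are filled in by the concrete techniques discussed in Sections 2 and 3.

```python
# Skeleton of the three-phase FER framework: detection -> features -> classification.
# Illustrative outline only; Sections 2 and 3 survey concrete choices for each phase.
import numpy as np
from sklearn.svm import SVC

EXPRESSIONS = ["happy", "sad", "anger", "surprise", "fear", "disgust"]

def detect_face(image: np.ndarray) -> np.ndarray:
    """Phase 1: locate and crop the face region (e.g. Viola-Jones, skin colour)."""
    raise NotImplementedError("see Section 2 for concrete detectors")

def extract_features(face: np.ndarray) -> np.ndarray:
    """Phase 2: compute a compact descriptor (e.g. Gabor, LBP, geometric landmarks)."""
    raise NotImplementedError("see Section 3 for concrete extractors")

def classify_expression(image: np.ndarray, clf: SVC) -> str:
    """Phase 3: map the feature vector to one of the six basic expressions.

    Assumes clf was fitted on integer labels indexing EXPRESSIONS.
    """
    features = extract_features(detect_face(image))
    return EXPRESSIONS[int(clf.predict(features.reshape(1, -1))[0])]
```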
The essential step of FER is face detection. The efficiency of the classifier improves with effective facial feature extraction, and this is achieved only through a proper face detection method. Viola and Jones [7] suggested an AdaBoost classifier that extracts and classifies features quickly and accurately. Danti and Kadiyavar [8] presented a face detection method based on skin colour that is insensitive to face orientation and background. Researchers have published many distinct algorithms and methods to identify a face in static and dynamic image frames.

Following face detection, many researchers developed feature extraction techniques to analyse facial expression. Several attempts were made to extract detailed parametric facial feature vectors in the frontal view of both still images and image sequences. Tian et al. [9] developed an automated face analysis system to recognise subtle changes based on the facial action coding system (FACS). Automated action coding and facial expression detection remain challenging problems. Chu et al. [10] proposed a selective transfer machine to personalise a generic classifier and overcome challenges in video sequences, such as illumination and complex backgrounds. Instead of action coding, they proposed local and global descriptors for detecting distinct facial expressions, and generalisability was tested on established databases.

Extensive research has produced better FER systems in recent years, but system performance is still affected by various factors. As far as FER systems are concerned, existing methods handle only prototypic posed facial expressions captured under laboratory constraints [11]. In an unconstrained environment, existing methods often lead to a higher probability of misclassification due to spontaneous expressions. Recognising spontaneous expressions in real time is a challenging issue related to changes in illumination, head-pose variation, subtle facial deformations, ageing, occlusion by objects such as hair, glasses or a scarf, skin-colour variations and complex backgrounds. Owing to these challenges, even a well-performing classifier misclassifies many trained images. Correspondingly, the available benchmark databases are not naturally linked to the emotional state of the test image. For such reasons, Sebe et al. [12] created an authentic emotion database and developed a real-time automatic FER system.

This survey concentrates on FER systems in controlled and uncontrolled environments based on their performance traits. We discuss current approaches to face detection and feature extraction for FER and present real-time applications. The survey covers state-of-the-art methods for face detection, facial feature extraction and expression classification, together with techniques that address the issues surrounding FER systems; the study mainly focuses on feature extraction techniques. The paper is organised as follows: Section 2 describes face detection methods; Section 3 provides a detailed review of facial feature extraction techniques; various classifiers and frequently used datasets are discussed in Section 4; Section 5 deliberates the challenges related to FER systems; finally, in Section 6, we conclude the paper with promising future directions.

2 Face detection methods

Face detection is a significant phase of FER. A proficient automated system can be developed to recognise the face region in static images or video spontaneously. A face region is detected in the image sequence using facial features such as edges, skin colour, texture and facial muscle motion; these features easily distinguish the face region from the background. In this phase, the input image is segmented into two parts: the face region and the non-face region. Many face detection methods are available, such as the eigenspace method, adaptive skin colour and the Viola-Jones method, with algorithms built on the Haar classifier, AdaBoost and contour points. This survey considers the accurate identification of faces and performance under constrained and unconstrained environments. Table 1 outlines the face detection methods.

Table 1. Summary of face detection methods

1. Eigenspace method. Accuracy: high for face detection under variable pose conditions. Comments: allowing head motion in the horizontal direction makes the system robust.
2. Adaptive skin colour. Accuracy: good, as it identifies skin colour easily, but it fails under illumination changes. Comments: an adaptive gamma-correction method is used to overcome the illumination problem.
3. Haar classifier. Accuracy: high, obtained from Haar features. Comments: computational complexity is low owing to the minimal feature set.
4. AdaBoost classifier. Accuracy: high, because of the strong cascaded classifier; detects a single face. Comments: uses a trained model, so computational cost is reduced.
5. Contours. Accuracy: good, as it uses contour points. Comments: with few features, the computational cost is low.
2.1 Eigenspace method

Pentland et al. [13] described the eigenspace technique for locating faces under variable pose. Modular eigenspace descriptors are also used to recognise the face image with salient features. Later, Essa and Pentland [14] utilised the eigenspace method to locate the face in an arbitrary image sequence; 128 image samples were processed using principal component analysis (PCA). The eigenfaces define a subspace of the sample images called 'face space' [15]. To detect the presence of a face in a single image, the distance between the observed image and the face space was estimated using the projection coefficients and the signal energy. Correspondingly, a spatio-temporal filtering method was employed to detect the face in an image sequence. Thresholding was applied to the filtered image to produce binary motion images, which allows 'motion blobs' to be analysed over time; each motion blob represents a human head, giving the face location. Pentland et al. [13, 16] proposed a real-time approach that was successfully tested on a database of 7562 images of both sexes with occluding objects on the face, such as hair and spectacles.

2.2 Adaptive skin colour method

Skin colour is an effective feature for detecting faces [17, 18]. Depending on the colour dependency, one of the colour systems is preferred; the common colour systems are RGB, CMY, YIQ, YUV and YCbCr. Mostly, the YIQ and YUV colour systems are used, where the brightness component is removed during processing [19], leaving the chrominance information. In the YIQ colour model, the component I refers to hue and Q to saturation; I is computed with the standard RGB-to-YIQ transform as I = 0.596R - 0.274G - 0.322B, where I is the value of face skin colour in YIQ space and varies within the range 30-100. Simultaneously, in YUV space, the hue of face skin colour falls within a fixed range. To establish a primary face skin-colour model, the YIQ and YUV colour systems are synthesised: a pixel that satisfies both range conditions is classified as skin colour. Most research adopts skin colour for face detection based on a fixed-threshold scheme, which causes large errors under illumination and pose variation. An iterative thresholding algorithm [20] was proposed to acquire the actual face region satisfying the face geometric pattern; however, it is not suitable for real-time applications owing to its high computational cost. An adaptive skin colour filter was introduced [21] that adaptively adjusts the threshold values and uses a linear discriminant function to separate the skin region from a complex background; a gamma-correction method is used to compensate for illumination and pose variation. Zhao-yi et al. [19] proposed an adaptive skin colour and structure model for multi-pose colour images against complex backgrounds, which greatly improves accuracy and effectively discards the impact of the illumination level.
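The fixed-range skin test described above can be sketched as follows. This is an illustrative sketch, not code from the surveyed papers: it assumes the standard NTSC RGB-to-YIQ transform for the I channel and the 30-100 window quoted above; the YUV hue test is omitted because the source does not give its numeric bounds.

```python
import numpy as np

def skin_mask_yiq(rgb, i_lo=30.0, i_hi=100.0):
    """Boolean skin mask from the YIQ I channel of an RGB image.

    rgb: H x W x 3 uint8 array in RGB channel order (an assumed convention).
    Uses the standard NTSC transform I = 0.596R - 0.274G - 0.322B and the
    fixed window [i_lo, i_hi] quoted in Section 2.2. A full detector would
    also apply the YUV hue test and adapt the thresholds (see [19, 21]).
    """
    r = rgb[..., 0].astype(np.float64)
    g = rgb[..., 1].astype(np.float64)
    b = rgb[..., 2].astype(np.float64)
    i_chan = 0.596 * r - 0.274 * g - 0.322 * b
    return (i_chan >= i_lo) & (i_chan <= i_hi)
```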
2.3 Haar classifier method

The Haar classifier is considered a robust face detection method in real-time environments [6]. Haar features are used to detect face edges, lines, motion and skin colour. Haar features are connected black-and-white rectangular boxes, as shown in Fig. 1, used for feature extraction. They can be easily scaled and are evaluated at different positions across the image by comparing pixel intensities. A feature's value is the difference between the pixel sums of the black and white regions inside the rectangular box [22]. During training, the Haar classifier selects the features that contribute most to the face detection problem; this reduces computational cost and complexity in the testing phase and leads to high detection accuracy.

Fig. 1: Haar features

2.4 AdaBoost method

AdaBoost is an ensemble approach to face detection [23, 24] and is widely used owing to its improved accuracy and relatively low computational complexity. It is a popular face detection method with a low false positive rate; its major limitation is sensitivity to noisy data and outliers [23]. A set of image features is trained with several classifiers in cascade using AdaBoost to eliminate negative samples. In the cascade structure shown in Fig. 2, the output of the first classifier becomes the input to the next, progressively refining the face region. In this way a strong classifier is built, which helps reduce the number of features evaluated and leads to high detection accuracy. Kheirkhah and Tabatabaie proposed a hybrid, robust face detection system for colour and complex images [24]; the hybrid approach combines skin colour information with AdaBoost-based face detection and gives better accuracy with minimum execution time.

Fig. 2: Structure of the cascade classifier
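As a concrete illustration of the Haar-cascade pipeline in Sections 2.3 and 2.4, the sketch below uses OpenCV's pretrained frontal-face cascade, which was itself trained with AdaBoost. The detectMultiScale parameters shown are common defaults chosen for illustration, not values taken from the surveyed papers.

```python
import cv2

# Load OpenCV's pretrained Haar cascade (an AdaBoost-trained cascade, Section 2.4).
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(image_bgr):
    """Return (x, y, w, h) bounding boxes for faces detected in a BGR image."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)  # mild illumination normalisation
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                    minSize=(30, 30))

# Usage: faces = detect_faces(cv2.imread("group_photo.jpg"))
```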
2.5 Contours

Face detection based on contour points gives good accuracy [25, 26]. In an image sequence, the first frame is scanned pixel by pixel for skin colour, and the first matching pixel is taken as the first contour point of the head; the remaining contour points in the frame are computed in the same way. The pixel so chosen is called the seed point. The contour-following direction is initialised at the identified seed point, and the detection path can be clockwise or anti-clockwise. Face motion is identified when the contour points shift beyond a threshold between two successive frames in the sequence [25]. Aniruddha et al. used a contour-based procedure to detect and track the human face in video frames; logical operations and Gaussian filters are used to obtain a proper face contour, and the scalar and vector distances of a rectangular window drawn from the four corner points of two consecutive frames are calculated to detect and track the face across the image sequence [26].

3 Feature extraction techniques

After face detection, the next step in FER is feature extraction. The main aim of facial feature extraction is to obtain an effective and efficient representation of the facial components without any loss of face information. Geometric-based and appearance-based features are the two classes of feature extraction technique, distinguished by facial motion and the deformation of facial features. The input may be either a static image or an image sequence, and a suitable facial feature extraction algorithm is applied to extract local, global or hybrid features. The extracted features are considerably reduced in size before being given to the classifier, which makes the decision in identifying and recognising the facial expression significantly easier. Fig. 3 represents the FER process.

Fig. 3: Flowchart for FER

In this section, we give a generalised view of facial feature extraction methods and an extensive review of recent feature extraction techniques in FER.

3.1 Geometric-based method

Geometric-based algorithms focus on permanent features (eyes, eyebrows, forehead, nose and mouth), describing the shape and location of facial components using predefined geometric landmark positions. These facial components are extracted to form a feature vector that represents the face geometry. Expressions affect the relative shapes and positions of the facial features, so the underlying facial expression can be identified by measuring the displacement of significant facial components. When the input is an image sequence, FACS [27] is used, which differentiates facial movements through the analysis of facial actions. FACS contains various action units (AUs) related to specific muscle contractions. Tian et al. developed an automated face analysis system to analyse subtle changes in facial expression, which are then converted to AUs [7]. An expression may comprise a single AU or a combination of AUs; some basic upper- and lower-face AUs are shown in Fig. 4. For static image inputs, model-based approaches such as the active shape model (ASM) [28], the active appearance model (AAM) [29] and the scale invariant feature transform (SIFT) [30, 31] are used to extract facial features. The geometric-based method is well suited to real-time face images, where features can be identified and tracked easily, but it requires an accurate face detection technique.

Fig. 4: FACS AUs by Ekman and Friesen. The asterisk indicates that AUs 25, 26 and 27 are now coded according to criteria of intensity (25A-E), as are AUs 41, 42 and 43 [7]

3.2 Appearance-based method

Appearance-based algorithms focus on transient features (wrinkles, bulges, furrows), describing changes in face texture, intensity, histograms and pixel values. In this method, PCA, linear discriminant analysis (LDA), independent component analysis (ICA), Gabor wavelets and LBP are the algorithms used to extract feature descriptors. In recent years, Gabor wavelets and LBP have been used extensively. Gabor wavelets are a well-known representative feature for extracting texture information effectively. Zhang et al. [32] compared geometry-based and Gabor-based methods; the results show that the Gabor wavelet performs better and is considered the more powerful tool for feature extraction. Many studies favour a Gabor filter bank for detecting lines and edges over multiple scales and orientations [33]; it has good time-frequency localisation and multi-resolution characteristics [34]. The limitation of the filter is its high computational time, owing to the large size of the filtered vectors [35]. LBP [36, 37] is a non-parametric descriptor that aims to efficiently capture the local structure of images. Owing to its low computational cost and high invariance, LBP is widely used for feature extraction. In LBP [16, 22], the image is divided into sub-blocks, histograms are calculated for each block, and the block histograms are concatenated to obtain a global feature. Fig. 5 explains the LBP histogram technique; a sketch of this computation follows the figure.

Fig. 5: Calculation of the block LBP histogram [22]
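The block-histogram scheme of Fig. 5 can be sketched as follows, using scikit-image's LBP operator. The 8-neighbour, radius-1 configuration and the 4 x 4 block grid are illustrative assumptions, not parameters fixed by the survey.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def block_lbp_histogram(gray, grid=(4, 4), n_bins=256):
    """Concatenated per-block LBP histograms (Fig. 5): local texture -> global feature.

    gray: 2D uint8 grayscale face image.
    """
    # 8-neighbour, radius-1 LBP gives 8-bit codes in 0..255.
    lbp = local_binary_pattern(gray, P=8, R=1, method="default")
    hists = []
    for row in np.array_split(lbp, grid[0], axis=0):      # split into sub-blocks
        for block in np.array_split(row, grid[1], axis=1):
            h, _ = np.histogram(block, bins=n_bins, range=(0, n_bins), density=True)
            hists.append(h)
    return np.concatenate(hists)  # length = grid[0] * grid[1] * n_bins
```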
Further, a brief survey of recent feature extraction techniques for FER systems is given in Table 2.

Table 2. Review of recent techniques under the FER system

Author/year | Methodology | Facial features | Classifier | Dataset | Accuracy | Advantage/disadvantage
Duong/2018 | projective complex matrix factorisation under unsupervised learning | local features | nearest neighbour | CK+ and JAFFE | 97.51 and 82.10% | method applies to positive and negative data
Fatima/2018 | supervised descent method based on Euclidean distance of fiducial points | eyes, eyebrows, nose and mouth | neural network | CK+, Oulu-CASIA and JAFFE | 99, 84.7 and 93.8% | achieves a higher recognition rate in real time
Revina/2018 | enhanced modified decision-based unsymmetric trimmed median filter, local directional number pattern, dominant gradient local ternary pattern | dots, edges and local features | SVM | JAFFE and CK | 88.63% | more robust against noisy faces than against illumination
Ding/2017 | logarithm Laplace-double local binary pattern and Taylor feature pattern | global and local features | nearest neighbour | JAFFE and CK | 93.0 and 91.4% | satisfactory recognition under an uncontrolled environment
Arshid/2017 | MSBP | eyebrows, eyes, mouth, bulges and wrinkles | simple logistic classifier | wild dataset | 96 and 60% | resolves issues of illumination
Munir/2017 | fast Fourier transform with contrast-limited adaptive histogram equalisation and merged binary pattern code | eyebrows, eyes, mouth, bulges and wrinkles | SMO, KNN, simple logistic, MLP | SFEW | 96.2% holistic, 65.7% division-based | suits poor illumination
Hasani/2017 | modified Inception-ResNet layers and conditional fields for sequence labelling | spatial and temporal relations of labels | deep neural network | CK+, MMI and FERA | 93.04, 78.68 and 66.66% | improves the recognition rate
Khadija/2017 | IntraFace (IF) facial decomposition method | global feature with seven ROIs | multiclass SVM | CK and FEED | 94.1 and 87.5% | highest recognition rate
Holder/2017 | improved gradient local ternary pattern | eyes, nose and mouth | SVM | CK+ and JAFFE | 97.6 and 86.8% | robust against varying illumination and random noise
Qayyum/2017 | stationary wavelet transform | face image decomposed into subbands | feedforward neural network | JAFFE and CK+ | 98.83 and 96.61% | HCI and Kinect-based applications
Du/2017 | LBPs and supervised descent method | global and local features | M-CRT | JAFFE and CK+ | 89.45 and 90.72% | improves expression classification
Liu/2017 | LBP and HoG features with gamma correction | salient features | linear SVM | CK+ and JAFFE | 96.6 and 63.4% | avoids overfitting and low noise impact
Kumar/2016 | weighted-projection-based LBP | discriminative features | SVM | MUG, JAFFE and CK+ | 98.44, 98.51 and 97.50% | discriminative information improves recognition
Majumder/2016 | (1) geometric feature extraction, (2) regional LBP, (3) fusion of 1 and 2 | local features | SOM-based classifier | MMI and CK+ | 97.55 and 98.95% | more efficient and accurate
Kamarol/2016 | STTM | eyes and mouth | SVM | CK+, CASME II and AFEW | 95.37, 98.56 and 84.52% | captures subtle motions
Tang/2016 | DGFN, DFSN, and fusion of DGFN and DFSN (DFSN-I) | local and global features | ANN | Oulu-CASIA and CK+ | DGFN: 78 and 93.81%; DFSN: 86.88 and 98.10%; DFSN-I: 87.50 and 98.73% | achieves a higher recognition rate
Hsieh/2015 | directional gradient operators such as Gabor filters and Laplacian of Gaussian | frown, nose wrinkle, two nasolabial folds, two eyebrows and mouth | SVM | CK+ and online members | 94.7 and 93% | effectively represents the change in expression
Zhang/2015 | PHRNN and MSCNN | eyes, nose and mouth | neural network | CK+, Oulu-CASIA and MMI | 98.50, 86.25 and 81.18% | robust on dynamic images
Kumbhar/2012 | Gabor filter and PCA | local and global features | feedforward neural network | JAFFE | 70% | demonstrates FER in a practical application

Duong et al. [38] constructed a dimensionality-reduction method within an unsupervised learning framework called projective complex matrix factorisation (proCMF). The method is related to proNMF and the cosine dissimilarity metric, transforming real data into the complex domain. A projective matrix is found by solving an unconstrained complex optimisation problem, with the cost function minimised by gradient descent. The proposed method performs well compared with proNMF, is more robust at extracting discriminant facial features even under noise and outliers, and is potentially superior for FER; it outperforms other baseline methods and achieves a recognition rate of about 97.51%.

Fatima et al. [39] suggested an approach to enhance the accuracy of emotion recognition from facial expressions based on fiducial points. The Viola-Jones algorithm was used for detection, and 49 fiducial points were tracked using the supervised descent method (SDM) [40]; the obtained points represent face parts such as the eyebrows, eyes, nose and mouth. The method calculates the Euclidean distance between each pair of points, and the dynamic features are obtained from the ratio of these distances between the first and last frames, as sketched below. The most relevant features were selected using the CfsSubsetEval evaluator, and a neural-network classifier was used for expression recognition. The approach achieved accuracies of 99% on Cohn-Kanade (CK+), 84.7% on Oulu-CASIA VIS and 93.8% on the Japanese female facial expression (JAFFE) database, respectively.
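To make the geometric features in Fatima et al.'s approach [39] concrete, the sketch below computes pairwise Euclidean distances between tracked fiducial points and takes the first-to-last-frame ratio as the dynamic feature. The (N, 2) landmark layout is an assumed convention, and the SDM tracker itself is not shown.

```python
import numpy as np
from itertools import combinations

def pairwise_distances(landmarks):
    """Euclidean distance between every pair of fiducial points.

    landmarks: (N, 2) array of (x, y) positions, e.g. N = 49 points from an
    SDM-style tracker (assumed layout, for illustration only).
    """
    pairs = combinations(range(len(landmarks)), 2)
    return np.array([np.linalg.norm(landmarks[i] - landmarks[j])
                     for i, j in pairs])

def dynamic_features(first_frame_pts, last_frame_pts):
    """First-to-last-frame distance ratios, the dynamic features described in [39]."""
    return pairwise_distances(first_frame_pts) / pairwise_distances(last_frame_pts)
```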
Michael Revina and Sam Emmanuel [41] proposed the enhanced modified decision-based unsymmetric trimmed median filter (EMDBUTMF) to reduce noisy pixels in an image; the method is robust at eliminating salt-and-pepper noise. After noise removal, feature vectors were obtained using the LDN (local directional number) pattern and DGLTP (directional gradient local ternary pattern). The DGLTP computes the directional pattern of each pixel's neighbourhood and quantises it into three levels to encode the local texture; the resulting patterns are used as facial feature descriptors. An SVM, a supervised machine-learning classifier, was then used to map the labelled training data into a higher-dimensional feature space with an optimal separating hyperplane for expression classification. The approach achieved about 88% accuracy on the CK and JAFFE databases.

Ding et al. [42] proposed a method to detect and classify facial expressions from video. A 24-dimensional double local binary pattern (DLBP) was proposed to detect the peak frame in the image sequence, from which efficient facial features are extracted; Taylor's theorem is then used to expand the peak-frame feature pixels and extract discriminative features. To overcome illumination variation in real-time applications, the logarithm-Laplace domain was proposed. The resulting Taylor feature pattern (TFP) outperformed existing LBP-based feature extraction methods on the JAFFE and CK datasets and is suitable for real-time applications.

Arshid et al. [43] proposed a multi-stage binary pattern (MSBP) feature extraction technique to handle illumination in real-world scenarios using sign and gradient differences. Holistic and division-based variants were applied to existing methods and compared with the proposed technique on a wild dataset using different classifiers, such as the BF tree, bagging, naive Bayes, simple logistic and KNN. The results show that the MSBP method achieved a 96% accuracy rate with the holistic approach and 60% with the division-based approach.

Munir et al. [44] proposed merged binary pattern coding to extract local facial features using sign and gradient differences; the method is robust against illumination and pose variation. Before extraction, pre-processing was performed using the fast Fourier transform with CLAHE, as well as histogram equalisation, and the classifier's performance was improved using PCA for feature extraction. A real-world image dataset was tested with the existing and proposed methods; the results show that the proposed method outperforms them all with an accuracy of 96.5%, and the holistic approach performed better than the division-based approach.

Hasani and Mahoor [45] introduced a new framework of deep neural networks cascaded with a conditional random field model to increase the recognition accuracy on image sequences. Modified Inception-ResNet modules were proposed to extract the spatial relationships within an image, while the temporal relations between successive frames are captured and labelled by a linear-chain conditional random field. The method was evaluated on the CK+, MMI and FERA databases, mainly addressing subject-independent and cross-database validation cases.

Mahmud and Al Mamun [46] suggested FER based on an extreme learning machine. Face regions were detected using the Viola-Jones method, feature vectors were found using morphological operations and edge detection techniques, and the feature vectors were given as input to a feedforward neural network classifier to perform expression classification. The methodology was tested on the publicly available JAFFE database and achieved satisfactory accuracy.

Khadija et al. [47] proposed a novel facial decomposition for expression recognition. The IntraFace algorithm was used to detect seven regions of interest (ROIs) on the face with the aid of facial landmarks. Feature vectors are extracted using different local descriptors, such as LBP, CLBP, LTP and dynamic LTP. Then the extracted feature vectors are

Reference(s)