Open-access article, peer reviewed

RGB‐D face recognition using LBP with suitable feature dimension of depth image

2019; Institution of Engineering and Technology; Volume: 4; Issue: 3; Language: English

10.1049/iet-cps.2018.5045

ISSN

2398-3396

Authors

Hailay Berihu Abebe, Chih-Lyang Hwang

Topic(s)

Video Surveillance and Tracking Methods

Abstract

IET Cyber-Physical Systems: Theory & Applications, Volume 4, Issue 3, pp. 189-197. Special Issue: Social and Human Aspects of Cyber-Physical System. Open Access. First published: 25 February 2019. https://doi.org/10.1049/iet-cps.2018.5045. Citations: 5.

Hailay Berihu Abebe and Chih-Lyang Hwang (corresponding author, clhwang@mail.ntust.edu.tw), Department of Electrical Engineering, National Taiwan University of Science and Technology, Taipei 10607, Taiwan.

This study proposes a robust method for face recognition from images acquired by low-resolution red, green, and blue-depth (RGB-D) cameras, images that exhibit a wide range of variations in head pose, illumination, facial expression, and, in some cases, occlusion. The local binary pattern (LBP) of the RGB-D images, with a suitable feature dimension for the depth image, is employed to extract the facial features. On the basis of error-correcting output codes, these features are fed to multiclass support vector machines (MSVMs) for off-line training and validation, and then for online classification. The proposed method is called LBP-RGB-D-MSVM with the suitable feature dimension of the depth image. Its effectiveness is evaluated on four databases: Indraprastha Institute of Information Technology, Delhi (IIIT-D) RGB-D; visual analysis of people (VAP) RGB-D-T; EURECOM; and the authors' own. In addition, an extended database merging the first three is employed to compare the proposed method with some existing two-dimensional (2D) and 3D face recognition algorithms. The proposed method achieves satisfactory performance (a Rank-5 recognition rate as high as 99.10 ± 0.52% on the authors' database) with low computation (62 ms for feature extraction), which is desirable for real-time applications.

1 Introduction

Face recognition has been attracting considerable attention from researchers due to its wide variety of applications, such as home security, video surveillance, law enforcement, and identity management. Face recognition with two-dimensional (2D) images is a challenging problem, especially in the presence of covariates such as pose, illumination, expression, disguise, and plastic surgery [1, 2].
Hence, it is desirable to develop a face recognition method that is less susceptible to such distortions. Moreover, the recognition of facial expression in different colour spaces has been addressed [3, 4]. In [4], the CIELab and CIELuv colour spaces are utilised for micro-expression recognition, and a tensor perceptual colour framework for facial expression, based on the information contained in colour facial images, is used in [3]. The proposed method, however, which utilises local binary pattern (LBP) features of red, green, and blue-depth (RGB-D) images, uses the RGB colour space.

Finding suitable descriptors for the appearance of local facial regions is an open issue. Ideally, these descriptors should be easy to compute and have high extra-class variance (i.e. between different persons, in the case of face recognition) and low intra-class variance, which means that the descriptor should be robust with respect to ageing of the subjects, alternating illumination, pose change, facial expression, and occlusion. Different local image feature extractors, such as LBPs, histograms of oriented gradients (HOG), the scale-invariant feature transform (SIFT), speeded-up robust features, fully affine SIFT, and Gabor features, are widely employed in image matching, object detection, and face recognition [5-16]. The LBP operator [5] is one of the best-performing texture descriptors and has been widely used in various applications. It has proven to be highly discriminative, and its key advantages, namely its invariance to monotonic grey-level changes and its computational efficiency, make it suitable for demanding image analysis tasks. The idea of using LBP for face recognition is motivated by the fact that faces can be seen as a composition of micro-patterns that are well described by such an operator.

In recent years, low-cost sensors have been developed that provide pseudo-3D information in the form of RGB-D images. The RGB image provides the texture and appearance information, whereas the depth map represents the distance of each pixel from the sensor, characterising the geometry of the face in grey-scale values. Since the depth map returned by RGB-D sensors is not as precise as that of a 3D sensor, existing 3D face recognition approaches may not be directly applicable to RGB-D images. Nevertheless, RGB-D images have been used for several computer vision tasks, such as object tracking, face detection, gender recognition, face recognition, and visual imitation [2, 17-23]. To the best of our knowledge, there is no reported research on the recognition of faces using the LBP texture features of RGB and depth images together with error-correcting output code (ECOC)-based multiclass support vector machines (MSVMs).

The main contributions of this paper are: (i) face recognition of RGB-D images using LBP with a suitable feature dimension of the depth image achieves state-of-the-art performance, with a Rank-5 recognition rate as high as 99.1% on our database and a low feature-extraction time of 62 ms, making it well suited to real-time face recognition applications; (ii) these two advantages are confirmed by evaluation on four databases: IIIT-D RGB-D, VAP RGB-D-T, EURECOM, and National Taiwan University of Science and Technology, Intelligent Robot Laboratory (NTUST-IRL) RGB-D (ours).
One evaluation is the comparison on the individual databases with different feature dimensions of the depth image, and the other is the comparison with some existing 2D and 3D approaches using the extended database integrated from the first three databases above.

2 Related work

In this section, previous 2D and 3D face recognition methods utilising LBP features [5-7, 9, 10, 13, 16] are discussed. In [5], an efficient facial image representation based on LBP texture features is presented: LBP feature distributions are extracted from the face image divided into several regions, and an enhanced feature vector is obtained by concatenation. In [6], a high-order local pattern descriptor, the local derivative pattern (LDP), is proposed for face recognition; it is stated that the LDP can capture more detailed information than the LBP. For a given region, the LDP templates extract high-order local information by encoding various distinctive spatial relationships. In [7], a comprehensive survey of the LBP methodology is given: the application of the LBP approach to facial image analysis is extensively reviewed, and its successful extensions, which deal with various tasks of facial image analysis, are also discussed. To obtain a compact feature representation, a chi-squared transformation (CST) converts the LBP feature into one that better fits a Gaussian distribution [9]; asymmetric principal component analysis (APCA) is then applied to remove the unreliable dimensions in the CST feature space, and, by applying the proposed CST-APCA to spatial LBP, face recognition is achieved as a two-class classification. In [10], a method combining the pixel difference binary pattern (PDBP) and difference regression classification (DRC) is proposed for face recognition: pixel differences in the image domain are calculated using the PDBP, then binarised and converted to decimal values, while the DRC uses vector differences in the feature domain to estimate an optimal predictor matrix. In [13], a 3D face recognition method based on the fusion of shape and texture LBPs on a mesh is presented; it operates on the mesh surface, which preserves the full geometry, does not require normalisation, can accommodate partial matching, and allows the early-level fusion of texture and shape modalities. In [16], an enhanced face recognition method utilising local binary pattern histogram descriptors, a multi-K-neural network, and a back-propagation neural network (BPNN) is proposed: to train the BPNN with faster convergence and better accuracy, features based on the correlation between the training images are extracted to generate a correlated T-dataset. Since the correlation method utilised requires substantial computation time and large storage, feature reduction and compact face representation are required.

By combining the saliency model of [24] and the entropy of the RGB and depth images, along with the geometric features of the depth image, an RGB-D face recognition method using HOG feature descriptors with a random decision forest (RDF) classifier was proposed by Goswami et al. [1]. In contrast, this paper presents face recognition of RGB-D images using the concatenation of the LBP feature descriptors of the RGB and depth images; these feature vectors are fed to the MSVM classifier to train the corresponding weighting matrices. The proposed face recognition method is validated by test faces that are different from the training images.
3 Proposed face recognition method

The proposed face recognition method, outlined in Fig. 1, consists of four steps: (i) pre-processing, (ii) extraction of LBP features, (iii) concatenation of the RGB-D LBP feature vectors, and (iv) classification by a multiclass SVM. These steps are explained in the following sections.

Figure 1: Flowchart of the proposed face recognition method.

3.1 Pre-processing

To obtain the desired face regions for recognition, an automatic face detector such as the Viola-Jones method is used to crop the face region. Sample images from the VAP RGB-D-T and EURECOM databases, together with the corresponding cropped face images and depth images obtained with the Viola-Jones automatic face detector, are shown in Fig. 2. The Viola-Jones method, based on Haar-like features and the AdaBoost learning algorithm [25], is employed to detect the face area of a given image. It is an object detection algorithm providing a competitive detection rate in real time and was primarily designed for face detection. The features used by Viola and Jones are derived from pixels in rectangular areas imposed over the picture and exhibit high sensitivity to vertical and horizontal lines. After the desired face region is cropped, the face images are scaled to the same pixel size for further processing.

Figure 2: Pre-processing samples of RGB and depth images from the VAP RGB-D-T and EURECOM databases by the Viola-Jones method. (a) Before; (b) after.

In practice, the Viola-Jones detector has difficulty detecting faces that are occluded or exhibit large pose changes. Here, an effort is made to increase the number of cropped faces by using both the frontal-face and profile cascade classifiers available in OpenCV; a minimal sketch of this step is given below. The numbers of cropped faces from the EURECOM and VAP RGB-D-T databases are given in Table 1. The IIIT-D RGB-D database is provided with already-cropped face images, and the face images of our own database were cropped during capture.

Table 1. Total number of images and cropped face images from the EURECOM and VAP RGB-D-T databases

Database | Original images | Cropped face images
EURECOM | 936 | 798
VAP RGB-D-T | 15,300 | 12,520
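As a concrete illustration of this pre-processing step, the following minimal sketch (our illustration, not the authors' code) crops faces with OpenCV's stock frontal and profile Haar cascades; the 100-pixel output size and the detector parameters are illustrative assumptions.

    import cv2

    # Stock Haar cascades shipped with OpenCV (frontal and profile faces).
    frontal = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    profile = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_profileface.xml")

    def crop_face(rgb, depth, size=100):
        """Detect a face in the RGB image and crop the same region from the
        registered depth map. Returns None if no face is found."""
        gray = cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY)
        faces = frontal.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:  # fall back to the profile cascade for posed faces
            faces = profile.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            return None
        x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep the largest detection
        face_rgb = cv2.resize(rgb[y:y + h, x:x + w], (size, size))
        face_depth = cv2.resize(depth[y:y + h, x:x + w], (size, size))
        return face_rgb, face_depth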
3.2 Extraction of LBP features

LBP is considered among the most powerful texture feature extraction techniques [5, 26, 27]. Ojala et al. [26] proposed the LBP for texture feature extraction. The basic assumption was that texture has locally two complementary aspects: (i) pattern and (ii) strength [5]. In general, the basic idea of LBP is to summarise the local structure in an image by comparing each pixel with its neighbourhood: take a pixel as centre and threshold its neighbours against it. A neighbour is assigned 1 if its intensity is greater than or equal to that of the centre pixel, and 0 otherwise. In this way, a binary number is obtained for each pixel. With P surrounding pixels, 2^P combinations, called LBPs (LBP codes), are possible. The first basic LBP operator in [26] used a fixed 3 x 3 neighbourhood, as shown in Fig. 3 [5].

Figure 3: Example of the LBP operator.

To be able to deal with textures at different scales, the LBP operator was later extended to use neighbourhoods of different sizes [27]. Defining the local neighbourhood as a set of sampling points evenly spaced on a circle centred at the pixel to be labelled allows any radius and number of sampling points. When a sampling point does not fall in the centre of a pixel, bilinear interpolation is used. Representative examples of circular neighbourhoods can be seen in Fig. 4.

Figure 4: Circular (8, 1), (16, 2), and (24, 3) neighbourhoods.

Consider a grey-scale image I and let g_c denote the grey level of an arbitrary pixel (x_c, y_c), i.e. g_c = I(x_c, y_c). For an evenly spaced circular neighbourhood with P sampling points and radius R around the centre pixel (x_c, y_c), the grey value of the pixel at the p-th sampling point is given by

    g_p = I(x_p, y_p), \quad p = 0, 1, \ldots, P - 1    (1)

where x_p and y_p are the coordinate values of the sampling point; their positions can be calculated by

    x_p = x_c + R \cos(2 \pi p / P), \quad y_p = y_c - R \sin(2 \pi p / P)    (2)

It is assumed that the local texture of the image is characterised by the joint distribution of the grey values of the P + 1 pixels [5]

    T = t(g_c, g_0, g_1, \ldots, g_{P-1})    (3)

If the value of the centre pixel is subtracted from the values of the neighbours, the local texture can be represented, without losing information, as a joint distribution of the value of the centre pixel and the differences

    T = t(g_c, g_0 - g_c, \ldots, g_{P-1} - g_c)    (4)

The joint distribution is approximated by assuming the centre pixel to be statistically independent of the differences, which allows the distribution to be factorised:

    T \approx t(g_c) \, t(g_0 - g_c, \ldots, g_{P-1} - g_c)    (5)

where the first factor describes the overall luminance of the image and the second factor is the joint difference distribution. For analysing local textural patterns, the first term contains no useful information; in contrast, the second term includes much of the information about the textural characteristics. Hence, for simplicity, the original joint distribution in (3) is approximated as

    T \approx t(g_0 - g_c, \ldots, g_{P-1} - g_c)    (6)

The P-dimensional difference distribution records the occurrences of different texture patterns in the neighbourhood of each pixel. Although invariant against grey-scale shifts, the differences are affected by scaling. To achieve invariance with respect to any monotonic transformation of the grey scale, only the signs of the differences are considered:

    T \approx t(s(g_0 - g_c), \ldots, s(g_{P-1} - g_c))    (7)

where

    s(z) = 1 if z \ge 0, and s(z) = 0 if z < 0    (8)

The generic LBP operator is derived from this joint distribution. As in the case of the basic LBP, it is obtained by summing the thresholded differences weighted by powers of two. Hence, the operator is defined as

    LBP_{P,R}(x_c, y_c) = \sum_{p=0}^{P-1} s(g_p - g_c) \, 2^p    (9)

This indicates that the local grey-scale distribution, i.e. texture, is approximately described by a 2^P-bin discrete distribution of LBP codes. However, the operator is not defined for pixels within a distance R from the edges of the image; consequently, a small border area of the image is not used in constructing the feature vector.

3.2.1 Mappings of the LBP labels

In face recognition and similar applications, it is desirable to have features that are invariant or robust to rotations of the input image. Since the patterns are obtained by circular sampling around the centre pixel, rotation of the input image translates the neighbourhood to another pixel location and, within each neighbourhood, rotates the sampling points on the circle into a different orientation. In [27], uniform patterns, another important extension, are added to the original LBP operator. A uniformity measure U('pattern') is introduced, which corresponds to the number of bitwise transitions from 0 to 1 or vice versa when the bit pattern is considered circularly. An LBP is called uniform if its uniformity measure is at most 2. For example, in an eight-pixel neighbourhood, the patterns 11111111 (0 transitions), 00000111 (2 transitions), and 10011111 (2 transitions) are uniform, whereas the patterns 10011001 (4 transitions) and 01001011 (6 transitions) are not.
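To make the uniformity measure concrete, here is a small sketch (an illustration, not code from the paper) that counts circular bit transitions and tests whether an LBP code is uniform:

    def uniformity(code, P=8):
        """Number of circular 0/1 transitions in a P-bit LBP code."""
        bits = [(code >> p) & 1 for p in range(P)]
        return sum(bits[p] != bits[(p + 1) % P] for p in range(P))

    def is_uniform(code, P=8):
        return uniformity(code, P) <= 2

    # e.g. 0b00000111 -> 2 transitions (uniform); 0b10011001 -> 4 (non-uniform)
    assert is_uniform(0b00000111) and not is_uniform(0b10011001)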
In uniform LBP mapping, there is a separate output label for each uniform pattern and all non-uniform patterns are assigned to a single label. By this definition, for neighbourhoods of 8 sampling points there are 58 labels for the uniform patterns and one label for the non-uniform patterns; hence, the uniform mapping produces 59 output labels. Similarly, there are 243 output labels for neighbourhoods of 16 sampling points. In general, the mapping produces P(P - 1) + 3 different output labels for patterns of P bits.

The reasons for omitting the non-uniform patterns are two-fold [5]. The first is that most LBPs in natural images are uniform. For example, in experiments with texture images [27], uniform patterns account for a bit less than 90% of all patterns in the (8, 1) neighbourhood and for around 70% in the (16, 2) neighbourhood. In experiments with facial images [7, 27], it was found that 90.6% of the patterns in the (8, 1) neighbourhood and 85.2% of the patterns in the (8, 2) neighbourhood are uniform. The second reason is statistical robustness. In many applications [16, 22], uniform patterns produce better recognition results; moreover, they are more stable, i.e. less prone to noise. In summary, considering only uniform patterns reduces the number of LBP labels and makes the estimation of their distribution more reliable.

3.2.2 LBP histogram

The LBP histogram represents the texture information of a region: it measures the quantity of each LBP code within a sub-block (cell) of the image, and each bin of the histogram stands for one LBP label. The regular size of the histogram for P sampling points is 2^P bins. When only uniform LBPs are used, the number of bins is effectively reduced: for 8 sampling points there are 58 uniform LBPs out of the total 256 codes, and the 198 non-uniform codes are assigned to one extra bin, so the histogram for uniform LBP has 59 bins. Generally, the number of bins B in the uniform LBP histogram for P sampling points is given by

    B = P(P - 1) + 3    (10)

For a given W x H grey-scale input image with a selected image patch (cell) of w x h pixels for P sampling points, the number of cells K is given by

    K = (W / w) \times (H / h)    (11)

The feature vector for the face image is constructed by calculating the LBP code for every pixel with the chosen P and R. If an image is divided into K regions, then the histogram for region R_j is defined as

    H_{i,j} = \sum_{(x, y) \in R_j} I\{ f(x, y) = i \}    (12a)

where f(x, y) is the LBP label of pixel (x, y), i is the label of bin i, and

    I\{A\} = 1 if A is true, 0 otherwise    (12b)
    i = 0, 1, \ldots, B - 1    (12c)
    j = 1, 2, \ldots, K    (12d)

The feature vector is effectively a description of the face on three different levels of locality: the labels contain information about the patterns on a pixel level; the regions, over which the labels are summed, contain information on a small regional level; and the concatenated histograms give a global description of the face. For the given number of cells K, the concatenated histogram of the RGB face image is expressed as

    H_{RGB} = [H_1, H_2, \ldots, H_K]    (13)

where H_j is defined by (12a). Similarly, if L is the number of cells for the depth image, the concatenated feature vector for the depth image is given by

    H_D = [H_{D,1}, H_{D,2}, \ldots, H_{D,L}]    (14)

Owing to the concatenation of local histograms, the LBP feature vector captures a global description of the image.
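A minimal sketch of this per-cell histogram extraction follows, using scikit-image's built-in LBP (its 'nri_uniform' mapping yields exactly the 59 labels described above for P = 8). The 10 x 10 cell grid is an illustrative assumption, not a value fixed by the paper:

    import numpy as np
    from skimage.feature import local_binary_pattern

    def lbp_histogram(gray, P=8, R=1, grid=(10, 10)):
        """Concatenated uniform-LBP histogram over a grid of non-overlapping cells."""
        n_bins = P * (P - 1) + 3  # eq. (10): 59 bins for P = 8
        codes = local_binary_pattern(gray, P, R, method="nri_uniform")
        h, w = gray.shape
        ch, cw = h // grid[0], w // grid[1]
        hists = []
        for r in range(grid[0]):
            for c in range(grid[1]):
                cell = codes[r * ch:(r + 1) * ch, c * cw:(c + 1) * cw]
                hist, _ = np.histogram(cell, bins=n_bins, range=(0, n_bins))
                hists.append(hist)
        return np.concatenate(hists).astype(np.float32)  # eq. (13)/(14)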
Examples of RGB and depth images, together with their corresponding LBP images and the histograms of both the input images and the LBP images, are presented in Figs. 5a and b. As can be seen from the histogram plots, the LBP histograms differ markedly from those of the original input RGB and depth images, reflecting the LBP feature descriptors. From Fig. 5b, it can be observed that the depth image carries little information for texture extraction compared with the RGB image. Thus, for the extraction of LBP feature vectors from the depth image, a suitable cell size should be selected; a larger cell size is found to be optimal from statistical analysis for improved performance. The feature vector dimension for both the RGB and depth images is determined by the numbers of cells and bins. Here, the RGB face images and the corresponding depth images are resized to the same size, and the (8, 1) neighbourhood is found to be optimal from statistical analysis. Hence, with the chosen cell size, the RGB image is divided into 100 cells, giving a feature vector dimension of 59 x 100 = 5900 for the RGB images. On the other hand, a cell four times as large is used for the depth image, giving 25 cells and a feature vector dimension of 59 x 25 = 1475.

Figure 5: LBP images and their histograms. (a) RGB image; (b) depth image.

3.3 Concatenation of RGB-D LBP feature vectors

After extracting the LBP feature descriptors from the RGB and depth images, a mechanism to combine them must be devised. In biometric applications such as recognition and classification, different levels of information fusion are used. In [28], three possible levels of fusion are discussed: fusion at the feature-extraction level, fusion at the matching-score level, and fusion at the decision level. It has been reported that low-level fusion (data and feature) performs better than its higher-level counterparts (score and decision) [13]. In addition, feature-level fusion is commonly used due to its simplicity. Feature vectors can be combined by different methods to form a new feature vector; among the feature-level combination methodologies, concatenation is the most popular. Hence, in this work the feature descriptors are fused by concatenation, a feature-level method. The concatenation of the RGB and depth image feature descriptors from (13) and (14) is

    F = [H_{RGB}, H_D]    (15)

3.4 Classification by multiclass SVM

To establish the identity of a given probe face image, a multiclass classifier such as an SVM [29-34], nearest neighbour, or RDF can be used. Owing to its flexibility in the choice of the form of the threshold, its robustness toward a small number of data points, its unique solution, and its applicability to almost any classification problem given the right kernel, the MSVM is adopted for the classification of different faces. The SVM was initially designed for binary classification; however, real-world problems often require classification into more than two categories. Multiclass classification problems are commonly decomposed into a series of binary problems so that the standard SVM can be applied directly. Two representative ensemble schemes are the one-versus-all (OVA) and one-versus-one (OVO) approaches. Both OVA and OVO are special cases of error-correcting output codes (ECOCs) [33, 34], which decompose the multiclass problem into a predefined set of binary problems. According to Hsu and Lin [30], OVO is suitable for practical classification applications. For the classification of k classes, the OVO method constructs k(k - 1)/2 classifiers, each trained on data from two classes. In this paper, the OVO multiclass SVM classification (pairwise classification) method is used; a training sketch is given below.
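As an illustration of steps (iii) and (iv), the following sketch fuses the two descriptors by concatenation and trains an OVO multiclass SVM with scikit-learn. The paper specifies C = 2 and a Gaussian width of 10; mapping that width to gamma = 1/(2 * 10^2) is our assumption, as is the use of sklearn's SVC (which implements OVO internally):

    import numpy as np
    from sklearn.svm import SVC

    # X_rgb: (n_samples, 5900) RGB LBP descriptors; X_d: (n_samples, 1475) depth
    # descriptors; y: subject identity labels -- placeholders for the real data.
    def train_msvm(X_rgb, X_d, y):
        X = np.hstack([X_rgb, X_d])  # eq. (15): feature-level fusion
        clf = SVC(C=2.0, kernel="rbf",
                  gamma=1.0 / (2 * 10.0 ** 2),    # Gaussian width sigma = 10 (assumed mapping)
                  probability=True,               # probabilistic match score per class
                  decision_function_shape="ovo")  # one-versus-one pairwise classifiers
        clf.fit(X, y)
        return clf

    # Identification: rank the classes of a probe by descending match score, e.g.
    # scores = clf.predict_proba(np.hstack([probe_rgb, probe_d])[None, :])[0]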
For training data from the i-th and the j-th classes, the optimisation for the OVO multiclass SVM is formulated as follows [30, 35]:

    \min_{w^{ij}, b^{ij}, \xi^{ij}} \; \frac{1}{2} (w^{ij})^T w^{ij} + C \sum_t \xi_t^{ij}    (16a)

with the following constraints:

    (w^{ij})^T \phi(x_t) + b^{ij} \ge 1 - \xi_t^{ij}, if x_t is in class i
    (w^{ij})^T \phi(x_t) + b^{ij} \le -1 + \xi_t^{ij}, if x_t is in class j
    \xi_t^{ij} \ge 0    (16b)

Here, the training data x_t are mapped to a higher-dimensional space by the function \phi associated with the Gaussian kernel, and C is the penalty parameter, which controls the trade-off between the complexity of the machine and the number of non-separable points. The considered range is C = 0.001-1000; the selected value is C = 2, and the width of the Gaussian kernel is 10. Each pairwise classifier is trained using all data from class i as positive samples and all data from class j as negative samples, without considering the remaining data. To classify a new sample y, each of the base classifiers casts a vote for one of the two classes used in its training; the OVO method then applies the majority-voting scheme, labelling y with the class that receives the most votes. Whenever ties occur, they are usually broken in favour of the larger class [33]. To train the classifier, each feature descriptor is a data point and the subject identification number is the class label; therefore, the number of classes equals the number of subjects. The trained multiclass SVM classifier is then used for the identification of probe face images: a probe feature vector is fed to the trained classifier, which provides a probabilistic match score for each class, representing the probability with which the feature vector belongs to that class.

3.5 Proposed method

The proposed algorithm is described as follows (a compact sketch follows the steps):

Step 1: Pre-process the images to crop the desired face region using the automatic Viola-Jones face detector.
Step 2: Set the desired radius R for (2) and the number of sampling points P for (10).
Step 3: Divide the cropped RGB and depth images into non-overlapping cells for the LBP feature extraction (11).
Step 4: Compute the LBP operator per pixel for each cell of the RGB and depth images (9).
Step 5: Build the sub-LBP histogram of each cell (12).
Step 6: Concatenate the sub-LBP histograms into a long vector using (13) and (14) and normalise using different normalisation methods.
Step 7: Fuse the RGB and depth LBP features by concatenation (15).
Step 8: Feed the concatenated feature descriptors to the trained MSVM classifier (16).
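Pulling the steps together, a compact end-to-end sketch of the pipeline, built from the hypothetical helpers crop_face, lbp_histogram, and train_msvm defined above (the grey-scale conversion of the RGB crop and the coarser 5 x 5 depth grid are our assumptions, the latter matching the 1475-dimensional depth descriptor of Section 3.2.2):

    import cv2
    import numpy as np

    def extract_descriptor(rgb, depth):
        """Steps 1-7: crop, per-image uniform-LBP histograms, and RGB-D fusion."""
        cropped = crop_face(rgb, depth)  # Step 1 (Viola-Jones)
        if cropped is None:
            return None
        face_rgb, face_depth = cropped
        gray = cv2.cvtColor(face_rgb, cv2.COLOR_BGR2GRAY)
        h_rgb = lbp_histogram(gray, P=8, R=1, grid=(10, 10))    # Steps 2-6, eq. (13)
        h_d = lbp_histogram(face_depth, P=8, R=1, grid=(5, 5))  # coarser depth cells, eq. (14)
        return np.concatenate([h_rgb, h_d])  # Step 7, eq. (15)

    # Step 8: train on fused gallery descriptors (cf. train_msvm above, which
    # performs the same fusion internally) and score probes via predict_proba.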
4 Experimental results and discussion

The performance of the proposed approach is evaluated by two kinds of experiments. First, experiments are conducted on four individual databases to analyse the proposed approach. Then, an extended database, merged from three databases, is employed to evaluate the performance of the proposed method.

4.1 Databases and experimental settings

There are few existing RGB-D databases in the literature. Here, three publicly available databases are considered. The IIIT-D RGB-D database [1, 27] has 4605 images of 106 subjects captured in two different sessions, with variations in pose and expression and, in some cases, disguise due to eyeglasses, using a Kinect sensor and the OpenNI SDK (see Fig. 6a for some representative images). The VAP RGB-D-T facial database [36] contains 15,300 images of 51 individuals captured in three different sessions, with variations in illumination, pose, and facial expression (see Fig. 6b). The EURECOM database [37] has 936 images of 52 subjects captured in two different sessions, with variations in pose, illumination, expression, and occlusion (see Fig. 6c). In addition, our own database, named NTUST_IRL RGB-D, was created using an ASUS Xtion Pro Live camera and the OpenNI SDK; it consists of 2953 images of 45 subjects captured in our laboratory, with variations in pose and facial expression and occlusion in some images (see Fig. 6d). The images of all four databases share the same resolution. Two kinds of experiments are performed as follows.

Figure 6: Some representative RGB-D face images from the four databases. (a) IIIT-D RGB-D; (b) VAP RGB-D-T; (c) EURECOM; (d) ours (NTUST_IRL RGB-D).

The first experiment is performed on the four databases to validate the LBP-RGB-D-MSVM algorithm with different dimensions of the feature vector of the depth image; the numbers of classes/subjects k of the four databases are 106, 51, 52, and 45, respectively. In the second experiment, the IIIT-D RGB-D, VAP RGB-D-T, and EURECOM databases are merged to create an extended database of 189 individuals, i.e. k = 189, which is employed to compare the performance of the proposed approach with existing face recognition methods. The settings of these two experiments are as follows: (i) in the first experiment, the gallery for each database consists of four randomly selected RGB-D images per subject for the different runs conducted, and the probe images of each subject are used as given in the database; (ii) in the second experiment, a gallery of four images per individual is randomly selected from the extended database, whereas the number of probe images per individual is taken as given in each database [35]. Cumulative match characteristic (CMC) curves, which measure the performance of identification systems, are computed for each experiment; a sketch of this computation follows.
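The CMC / Rank-n computation itself is straightforward; the sketch below (our illustration, not the authors' evaluation code) counts a probe as a Rank-n hit when the true identity appears among the n highest match scores:

    import numpy as np

    def cmc_curve(scores, true_labels, classes, max_rank=5):
        """scores: (n_probes, n_classes) match scores, e.g. clf.predict_proba(X_probe);
        classes: class label of each score column, e.g. clf.classes_."""
        ranking = np.argsort(-scores, axis=1)  # best-matching class first
        hits = np.zeros(max_rank)
        for r, y in zip(ranking, true_labels):
            top = [classes[c] for c in r[:max_rank]]
            if y in top:
                hits[top.index(y):] += 1  # a rank-m hit counts for all n >= m
        return hits / len(true_labels)  # CMC: recognition rate versus rank

    # rank5 = cmc_curve(clf.predict_proba(X_probe), y_probe, clf.classes_)[4]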

Reference(s)