Peer-reviewed article

Unconstrained ear recognition using deep neural networks

2018; Institution of Engineering and Technology; Volume: 7; Issue: 3; Language: English

10.1049/iet-bmt.2017.0208

ISSN

2047-4946

Authors

Samuel Dodge, Jinane Mounsef, Lina J. Karam

Topic(s)

Speech and Audio Processing

Abstract

IET Biometrics, Volume 7, Issue 3, pp. 207-214. First published: 28 February 2018. https://doi.org/10.1049/iet-bmt.2017.0208

Abstract

The authors perform unconstrained ear recognition using transfer learning with deep neural networks (DNNs). First, they show how existing DNNs can be used as a feature extractor. The extracted features are used by a shallow classifier to perform ear recognition. Performance can be improved by augmenting the training dataset with small image transformations. Next, they compare the performance of the feature-extraction models with fine-tuned networks. However, because the datasets are limited in size, a fine-tuned network tends to over-fit. They propose a deep learning-based averaging ensemble to reduce the effect of over-fitting. Performance results are provided on the unconstrained AWE and CVLE ear recognition datasets, as well as a combined AWE + CVLE dataset. They show that their ensemble achieves the best recognition performance on these datasets compared with DNN feature-extraction based models and single fine-tuned models.

1 Introduction

Accurate biometrics play a critical role in personal authentication and in forensic and security applications. A useful biometric modality has several desirable characteristics: uniqueness, ease of data collection, and preservation of privacy, among others. Uniqueness ensures that the biometric can be used to uniquely identify a person. Ease of data collection enables the biometric to be used in large-scale surveillance applications. Privacy preservation is increasingly important as many subjects may not want their personal identity easily accessible. Several biometrics meet these requirements to various degrees: face, iris, fingerprint, and ear. Face as a biometric meets the uniqueness and ease-of-collection criteria, but does not protect privacy.
Iris as a biometric is unique and protects privacy, but may be difficult to collect. Fingerprints are unique and protect privacy, but also may be difficult to collect. This leaves us with the ear, which is perhaps less often used than the face but offers several unique advantages. Just like a face or a fingerprint, the ear has a unique structure that can be used to identify the subject. However, compared with faces, ear features are stable and are not affected by external factors, such as ageing and expression. This is because the ear shape matures early in life and later changes occur gradually [1]. Compared with fingerprint recognition, ear recognition does not require the expensive capture of prints and can be utilised in a visual surveillance application. Compared with iris recognition, ear recognition does not require subject cooperation. The main drawback of ear recognition is that the ear may be partially or fully occluded by hair, earrings, or other headwear. However, it should be noted that face recognition has similar problems with occlusions due to glasses or headwear. An additional benefit of ear recognition over face recognition is that there may be fewer privacy concerns when an image of an ear is captured and stored instead of an image of a face. Ears have more in common with fingerprints in that, although they have unique statistics that can be used to identify an individual, at a glance it is difficult for a human observer to recognise the identity using only the ear image.

Many approaches have been developed with the aim of improving ear detection and recognition capabilities for reliable deployment in surveillance and commercial applications [2–7]. These approaches follow a traditional pipeline of normalisation, feature extraction and classification. In these works, the main challenge remains a proper selection of feature descriptors that can be resilient to unconstrained conditions, such as illumination changes, occlusion and quality distortions. More recent works (e.g. [8, 9]) use deep neural networks (DNNs) to learn a classifier end-to-end instead of designing a feature-classifier pipeline. We explore the use of DNNs both as a feature extractor in the more traditional feature-classifier pipeline approach and as a complete end-to-end system. We note that features from pre-trained DNNs have been used in combination with shallow classifiers for a variety of computer vision tasks [10]. In this work, we show that features from pre-trained networks achieve a strong baseline for unconstrained ear recognition. Next, we show that the deep networks can be fine-tuned to achieve greater performance. Finally, we propose an averaging ensemble of fine-tuned networks to alleviate the over-fitting problem caused by small datasets.

The remainder of this paper is organised as follows. Section 2 discusses the related work on ear biometric recognition. Section 3 presents our methods for feature-based support vector machine (SVM) models, fine-tuned DNN models, and the averaging ensemble model. In Section 4, we describe the experimental setup and results. Finally, Section 5 concludes our work.

2 Previous work

Early ear recognition methods were structural methods based on physiological features such as shape, wrinkles, and ear points. The Iannarelli System of Ear Identification [11] was introduced in 1949 as one of the first systems to use the ear as a biometric modality for forensic science.
The system consists of taking a certain number of measurements around the ear for a unique ear characterisation. Much later, Moreno et al. [12] combined the results of several neural classifiers, which were trained on various ear geometrical features. Mu et al. [13] proposed an edge-based feature vector consisting of the ear's inner and outer structure and shape. Choras [14] computed the centroid of ear curves to form concentric circles. Using the points between concentric circles and ear contours, two feature vectors were proposed. Later, Choras and Choras [15] added two more geometric feature vectors using a representation of ear contours and a geometrical parametric method. Anwar et al. [6] proposed a method for ear recognition based on geometrical feature extraction, including shape, mean, centroid and the Euclidean distance between pixels. While these methods are simple to implement, they achieve limited performance due to the challenging extraction of the shape features, which sometimes requires manual measurements and graph matching techniques [16, 17].

Subspace learning methods, including principal components analysis (PCA), linear discriminant analysis (LDA) and force field [18], are also popular approaches to ear recognition. Kyong et al. [19] applied PCA to both face and ear recognition and achieved a significant improvement in performance when combining both biometrics. Hurley et al. [8] used force field feature extraction, which maps the ear to an energy field. The extracted features represent 'potential wells' and 'potential channels'. More recently, Hanmandlu and Mamta [3] used the local principal independent components (LPICs) as an extension of PCA to improve ear recognition performance. Zhang et al. [20] combined independent components analysis (ICA) with a radial basis function (RBF) to improve the performance of PCA. However, these subspace learning methods are not sufficiently resilient to image variations and thus perform poorly under unconstrained conditions.

Spectral approaches, which are based on extracting features from the spectral domain representation, use local orientation information for ear recognition. Fabate et al. [21] used a rotation-invariant descriptor, the generic Fourier descriptor, to represent ear features. Sana et al. [22] used a Haar wavelet transform to represent the texture of the ear image and calculated the matching scores using the Hamming distance. Yu et al. [23] used a Haar wavelet transform and uniform local binary patterns (ULBPs). They decomposed the ear using the Haar wavelet transform, then combined ULBPs with block-based and multi-resolution methods for texture feature extraction. They finally classified the features using the nearest neighbour classifier. Zhao and Mu [24] used a 2D wavelet transform to generate low-frequency images, then applied the orthogonal centroid algorithm [25] to extract the features. Kumar and Zhang [26] used log-Gabor wavelets for feature extraction and a Hamming distance for classification. Kisku et al. [27] used a Gaussian mixture model to develop an ear skin model. Tariq et al. [7] extracted features through Haar wavelets followed by ear identification using fast normalised cross correlation. Murukesh et al. [28] used a contourlet transform for feature extraction and Fisher's LDA for classification. Kumar and Chan [5] used the sparse representation of local grey-level orientations to efficiently recognise the ear's identity. Benzaoui et al.
[2] showed that binarised statistical image features in association with the KNN classifier yield good performance on constrained images. Jacob and Raju [4] investigated the combination of grey-level co-occurrence matrix, local binary pattern and Gabor filter features for efficient ear recognition. Despite their popularity, spectral methods, which rely on hand-crafted features, are problem-specific and cannot adapt easily to changing environments.

More recently, DNN-based models have achieved impressive performance in many problem domains. A DNN usually consists of layers of convolutional filters where the weights of the filters can be learned using a gradient descent based optimisation procedure. This layered approach, with the addition of large amounts of training data and GPU power, has been shown to yield accurate classification systems in many application domains. AlexNet [29] was the first DNN that achieved impressive performance on the large-scale ImageNet dataset [30]. AlexNet includes techniques such as dropout for regularisation and ReLU non-linearities that still see widespread use. The Visual Geometry Group's (VGG) models [31] extend the AlexNet framework by adding more layers between pooling stages. VGG networks can be trained efficiently because all of the convolutional layers use small filters. This can also help with over-fitting. More recently, ResNet architectures [32] build very deep networks by utilising skip connections instead of the traditional sequential architecture. Although ResNet can be much deeper than VGG, the model size is substantially smaller due to the use of global average pooling rather than fully connected layers.

Nevertheless, deep learning has only recently been utilised for ear recognition [8, 9, 33, 34]. One difficulty for ear recognition problems is the limited amount of labelled training data. Emersic et al. [8] overcame this by using data augmentation. For each training image, many similar training images were generated with slight translations, rotations, colour transforms and so on. This data augmentation allowed DNNs to be fine-tuned. To further combat over-fitting caused by limited data, the work proposed selective learning, where only a subset of the layers of the network was learned. AlexNet, VGG16, and SqueezeNet [33] were considered, with SqueezeNet yielding the best performance of 62% rank-1 accuracy. The authors evaluated their approach on an unconstrained ear dataset where they combined the AWE and CVLE datasets [35] in addition to 500 ear images of 50 subjects collected from the web, in order to have more data available to work with. Note that the authors did not consider the more recently introduced ResNet [32], which might achieve better performance. Galdamez and co-workers [9] built a custom neural network for recognising ears, instead of utilising existing pre-trained networks. The motivation for building a custom network is that it would be faster than the large pre-trained networks; however, it may achieve lower accuracy. Tian and Mu [36] also built a custom network with three convolutional layers and evaluated it on the constrained USTB ear database [37]. Omara et al. [34] utilised pre-trained features from the VGG-m model [38] to classify the USTB constrained ear images using a pairwise SVM classifier. Several new methods were recently presented at the unconstrained ear recognition challenge (UERC) [39]. The UERC introduced a new dataset for the challenge, based on the AWE dataset.
Surprisingly, the winning entry relied on a hand-crafted feature based on Chainlets [40]. Other entries attempted various methods of fine-tuning or training deep networks from scratch.

3 Ear recognition using transfer learning

We utilise existing DNNs pre-trained on the large ImageNet dataset [30] and adapt them for unconstrained ear recognition. The pre-trained feature representations provide a starting point for creating robust classifiers for unconstrained ear recognition. We consider two scenarios for incorporating pre-trained neural networks. First, we use DNNs to extract features that are used to train a shallow classifier. Next, we use the pre-trained DNNs as initialisation and perform additional fine-tuning of the network. We achieve the best performance using an ensemble of these fine-tuned networks.

3.1 Deep neural networks

We test five different DNN architectures: AlexNet [29], VGG16 [31], VGG19 [31], ResNet18 [32], and ResNet50 [32]. Table 1 presents a summary of the five DNN models' characteristics. These networks have been pre-trained on the ImageNet dataset [30], which includes over 1.2 million images for 1000 object classes.

Table 1. Main characteristics of considered DNN architectures: design year, number of parameters in millions (Mill.), number of convolutional (Conv.) layers and number of fully connected (FC) layers

Network    Year   Parameters (Mill.)   Conv. layers   FC layers
AlexNet    2012   60                   5              3
VGG16      2014   138                  13             3
VGG19      2014   144                  16             3
ResNet18   2015   11.7                 17             1
ResNet50   2015   25.6                 49             1

AlexNet [29] is a DNN architecture that won the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) for image classification. The model architecture, which has 60 million parameters and 500,000 neurons, consists of five convolutional layers and three fully connected layers with a final 1000-way softmax. VGG network architectures [31] are much deeper than AlexNet and won the 2014 ILSVRC for image localisation and classification. Compared with AlexNet, a single convolutional layer between pooling stages is replaced with multiple stacked convolutional layers, which are followed by three fully connected layers. The final layer is the softmax layer. The VGG-style networks, which include 133 million to 144 million parameters, use small filters to reduce the number of parameters and consequently reduce over-fitting. In this work, we use the VGG16 and VGG19 architectures, where 16 and 19 refer to the number of trainable layers. ResNet [32], which won the 2015 ILSVRC, made training very deep DNNs possible and less challenging. The network uses 'skip' connections between convolutional blocks in order to create much deeper neural networks. The skip connections help mitigate the vanishing gradient problem. The layers are formulated as learning residual functions with respect to the layer inputs, instead of learning simple feed-forward functions. Despite their large depth, ResNets have far fewer parameters, varying between 11.7 million (18 layers) and 60.2 million (152 layers). In this paper, we use the 18-layer ResNet18 and 50-layer ResNet50 models.

3.2 Extracting deep features

Features extracted from DNNs have been shown to achieve good performance in many different problem domains [10]. When a network is trained on a large, diverse dataset such as ImageNet, features extracted from network layers can be transferred to other problems, in this case, ear recognition.
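As a concrete illustration of this feature-extraction scenario, the following is a minimal PyTorch-style sketch that truncates a pre-trained ResNet18 after its third residual block (the layer found to work best for the ResNet models in the experiments below) and returns a flattened feature vector for one ear image. The framework choice, preprocessing constants and helper names are illustrative assumptions, not the authors' implementation; the resulting vectors would then be reduced with PCA and classified with the linear SVM described next.

```python
# Illustrative sketch: deep-feature extraction from a pre-trained ResNet18.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.eval()

# Keep everything up to and including the third residual stage, as suggested
# by the layer search described in the text for the ResNet models.
feature_extractor = torch.nn.Sequential(*list(model.children())[:-3])

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(image_path):
    """Return a flattened deep-feature vector for one ear image."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feats = feature_extractor(img)
    return feats.flatten(start_dim=1).squeeze(0).numpy()
```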
Similar to [34], we train a linear SVM using features extracted from the DNNs described in Section 3.1. However, different from [34], we are interested in the unconstrained ear recognition problem. Since this is more difficult than the constrained problem addressed in [34], we incorporate data augmentation techniques to improve the accuracy. In addition, we compare the performance using five different network architectures and show that the choice of architecture can significantly affect the resulting classification accuracy. Furthermore, features extracted from different layers of the same network can give different classification accuracies. We perform an exhaustive search over the layers and report results with the layer that gives the highest accuracy on the AWE and CVLE datasets [35]. We find that the best performance corresponds to the last convolutional layer for AlexNet and VGG16, the second-to-last convolutional layer for VGG19, and the last convolutional layer of the third residual block for the ResNets. We use the LibSVM library [41] to train a one-against-one multi-class linear SVM using the extracted features. The very high dimensionality of the extracted features makes SVM training computationally expensive, so we use PCA to reduce the dimensionality of the features while retaining 99% of the feature variance.

3.3 Fine-tuning

While the features from the fixed pre-trained networks can be useful for ear recognition, a more accurate classifier can be trained by fine-tuning the parameters of the neural network. Fine-tuning is essentially training the network for several more iterations on a new dataset. This process adapts the generic filters trained on the ImageNet dataset to the ear recognition problem. We use the same networks described in Section 3.2. For each network, we replace the last fully connected layer with a new fully connected layer with the number of units equal to the number of classes in the dataset. The parameters of the new fully connected layer are initialised by Glorot initialisation [42]. We train the network for 25 epochs using stochastic gradient descent. At around 25 epochs, all of the network architectures achieve near 100% accuracy on the training set, so no further improvement in training can be achieved. The learning rate of the last layer is set to 0.1 and the learning rate of all of the other pre-trained layers is set to 0.01. This is because the last layer is trained from scratch whereas the other layers are initialised with pre-trained weights.

Our fine-tuning approach is different from that of [8]. The method of [8] performs 'selective' learning where the early layers are fixed and later layers are fine-tuned. Our approach allows the early layers to adapt, but at a smaller learning rate than the last layer. This is also different from the 'full training' of [8] because the learning rates of the different layers are not all the same. We fine-tune the networks using the data augmentations explained in Section 4.3. However, even with data augmentation, the fine-tuned deep networks may over-fit the new training data. This is particularly a problem in ear recognition because the datasets are relatively small. We use an averaging ensemble, in addition to data augmentation, to reduce the effect of over-fitting. We test ensembles of five models, where the last layer of each ensemble member is initialised with different random values. The different initialisations yield different local minima after the network has been trained.
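The fine-tuning and ensemble setup just described can be sketched as follows: the last fully connected layer is replaced and Glorot-initialised with a different random seed for each of the five ensemble members, the pre-trained layers are assigned a learning rate of 0.01, and the new head uses 0.1. This is a PyTorch-style sketch under those stated hyper-parameters; the choice of ResNet18, the helper name and the exact optimiser construction are assumptions for illustration, not the authors' released code.

```python
# Illustrative sketch: fine-tuning setup with a re-initialised head,
# per-layer learning rates, and differently seeded ensemble members.
import torch
import torch.nn as nn
import torchvision.models as models

def build_finetune_model(num_classes, seed):
    torch.manual_seed(seed)                      # different seed per ensemble member
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    # Replace the final fully connected layer with a freshly initialised one.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    nn.init.xavier_uniform_(model.fc.weight)     # Glorot initialisation
    nn.init.zeros_(model.fc.bias)

    # Pre-trained layers learn at 0.01, the new head at 0.1 (as in the text).
    head_params = list(model.fc.parameters())
    backbone_params = [p for n, p in model.named_parameters()
                       if not n.startswith("fc.")]
    optimizer = torch.optim.SGD([
        {"params": backbone_params, "lr": 0.01},
        {"params": head_params, "lr": 0.1},
    ])
    return model, optimizer

# Five members with different random head initialisations form the ensemble.
ensemble = [build_finetune_model(num_classes=100, seed=s) for s in range(5)]
```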
To obtain a final output prediction during testing, we take the average of the soft-max outputs of the ensemble members. The final predicted label is the argmax of the averaged soft-max outputs. The full ensemble model can be seen in Fig. 1.

Figure 1. Structure of the ensemble models. The ensemble consists of n models. The parameters of the last fully connected layer of each model are initialised with different random values and each model is trained separately. During testing, the soft-max outputs of the constituent models are averaged to yield the final output prediction.

4 Experimental setup and results

4.1 Datasets

The experiments are performed on two publicly available unconstrained ear datasets: AWE and CVLE [35]. Both datasets consist of images of the ears of public figures captured 'in the wild' and collected by a web crawler. The images include realistic variations, such as contrast/illumination, occlusion, head rotation, gender, race, visual quality distortions and image resolution. These datasets are considered challenging for automatic ear recognition applications. The AWE dataset includes 1000 images of 100 persons (10 images/person), while the CVLE dataset includes 804 images of 16 persons (on average 50.25 images/person). For both datasets, the images come in a range of different sizes. All images are tightly cropped and do not include the face. Figs. 2 and 3 show sample images from both datasets.

Figure 2. Sample images from the AWE dataset. Each row corresponds to the images of one subject. The images include variations of head rotation, illumination, gender, race, occlusion, blurring and image resolution.

Figure 3. Sample images from the CVLE dataset. Each row corresponds to the images of one subject. The images include variations of head rotation, illumination, gender, occlusion and image resolution.

In addition, we combine the AWE and CVLE datasets to form a third dataset (AWE + CVLE). For this dataset, we use the same train and test splits as in the respective datasets.

4.2 Experimental protocols

We use the training/testing split provided in the AWE toolbox [35]. The training set consists of 60% of the images and the testing set includes the remaining 40%. For the CVLE dataset, we randomly split the dataset into 60% training images and 40% testing images. As in [8], we perform identification experiments with a closed-set experimental protocol, where our models should predict the class to which the input image belongs. There are 100 classes for AWE, 16 classes for CVLE and 116 classes for the combined dataset (AWE + CVLE). For performance evaluation, we use rank-1 and rank-5 recognition rates, as well as cumulative match-score curves (CMCs). The CMC is formed by computing the recognition rate using the top i predictions from the model, where i varies from 1 to m, and m is the number of classes. For the single fine-tuned models, we report average performance over five random seeds. These five models are the same models used in the averaging ensemble. All of the neural networks operate on a fixed-size input, so we resize the original images to the required dimensions before feeding them to the networks. We tried both bilinear and bicubic interpolation for resizing the images; both methods were found to produce similar results in terms of classification performance. We additionally subtract the mean of the ImageNet dataset from the input images.
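To make the test-time procedure and the evaluation metrics concrete, the NumPy sketch below averages the soft-max outputs of the ensemble members, as described in Section 3.3, and computes the rank-k recognition rates that form the CMC. The array shapes and function names are illustrative assumptions rather than the authors' evaluation code.

```python
# Illustrative sketch: ensemble averaging and rank-k / CMC evaluation.
import numpy as np

def ensemble_predict(softmax_outputs):
    """softmax_outputs: list of (num_samples, num_classes) arrays, one per member."""
    return np.mean(np.stack(softmax_outputs, axis=0), axis=0)  # averaged probabilities

def rank_k_accuracy(avg_probs, labels, k):
    """Fraction of samples whose true class is among the top-k predictions."""
    topk = np.argsort(-avg_probs, axis=1)[:, :k]
    hits = np.any(topk == labels[:, None], axis=1)
    return hits.mean()

def cmc_curve(avg_probs, labels):
    """Recognition rate for every rank from 1 to the number of classes."""
    m = avg_probs.shape[1]
    return [rank_k_accuracy(avg_probs, labels, k) for k in range(1, m + 1)]
```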
4.3 Data augmentation

The datasets are relatively small, so it is easy for models to over-fit and not generalise well on testing data. To alleviate this problem, we augment the dataset by a factor of 9 with several image transformations, as shown in Fig. 4. We select image transformations that introduce spatial as well as pixel-value variations. Noting that the head rotation in the source images is not constant, we augment with moderate rotations using nearest neighbour interpolation (−6°, −3°, +3°, +6°) to increase the classifier's robustness to rotation. Although the left and right ears are not necessarily the same, we apply horizontal flipping to expose the classifier to more variations of ear structures. Next, we consider three normalisation techniques to introduce pixel-value variations: histogram equalisation, adaptive histogram equalisation, and wavelet-based normalisation. We apply these three transformations to grey-scale versions of the source image. Grey-scale images force the classifier to use texture information rather than rely on colour information. The histogram equalisation methods spread image intensities in the spatial domain, while the wavelet-based illumination method enhances the contrast in the wavelet domain. For each grey-scale image, the grey-scale channel is replicated so that the resulting image has the same number of channels as the colour images. This is done so that we can feed both the grey-scale images and the colour images to the networks.

Figure 4. Data augmentation examples. Each row corresponds to a single source image of one subject. The third to sixth images are rotated versions with angles +3°, −3°, +6°, and −6° using nearest neighbour interpolation. The remaining images include the other four augmentation variations in addition to the original image.
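The nine-fold augmentation just described can be sketched as follows, assuming PIL and scikit-image as the image-processing libraries (a choice of ours, not stated in the text); the wavelet-based normalisation is left as a placeholder because its exact formulation is not given here.

```python
# Illustrative sketch: 9x augmentation (original + 4 rotations + flip + 3
# grey-scale normalisation variants replicated to three channels).
import numpy as np
from PIL import Image
from skimage import exposure

def augment(img: Image.Image):
    """Return the original image plus eight augmented variants."""
    variants = [img]
    # Moderate rotations with nearest-neighbour interpolation.
    for angle in (-6, -3, 3, 6):
        variants.append(img.rotate(angle, resample=Image.NEAREST))
    # Horizontal flip.
    variants.append(img.transpose(Image.FLIP_LEFT_RIGHT))
    # Grey-scale normalisation variants, replicated back to three channels.
    grey = np.asarray(img.convert("L"), dtype=np.float64) / 255.0
    hist_eq = exposure.equalize_hist(grey)
    adapt_eq = exposure.equalize_adapthist(grey)
    wavelet_norm = grey  # placeholder: wavelet-based normalisation not specified here
    for g in (hist_eq, adapt_eq, wavelet_norm):
        rgb = np.stack([g, g, g], axis=-1)  # replicate channel to match colour inputs
        variants.append(Image.fromarray((rgb * 255).astype(np.uint8)))
    return variants
```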
4.4 Feature extraction results

In the first series of experiments, we evaluate the performance of the five DNN architectures by assessing the representation ability of their respective deep features. We construct our experiments as described in Section 3.2. For these experiments, we analyse the effect of data augmentation on the deep networks' performance. Tables 2-4 show the rank-1 and rank-5 accuracies for the feature extraction-based models trained with the original and augmented versions of the AWE, CVLE, and combined AWE + CVLE datasets, respectively. Overall, the network features perform better on CVLE than on AWE due to the lower number of classes and the larger number of training images per class in the CVLE dataset. Data augmentation improves the accuracies by an average of 30% for AWE, 3% for CVLE, and 30% for their combination. The impact of data augmentation is larger on the AWE dataset than on the CVLE dataset due to the lack of sufficient training samples per class in the AWE dataset. Fig. 5 shows the CMCs for the ResNet18 features on the three considered datasets. The ResNet18 features show better performance when the training set is augmented, for all considered datasets. This trend is also seen for the features of all the other considered networks.

Table 2. Rank-1 and rank-5 accuracy (%) of models trained and tested on the AWE dataset (DF = deep features, DF-aug = deep features with augmentation, FT-1 = single fine-tuned model, FT-ens = fine-tuned ensemble; in the original article, bold values indicate the best performance for each metric)

           Rank-1                              Rank-5
Network    DF      DF-aug  FT-1    FT-ens     DF      DF-aug  FT-1    FT-ens
AlexNet    34.25   46.75   37.50   45.00      55.50   73.50   62.70   71.00
VGG16      31.25   49.25   50.70   66.00      53.75   70.25   74.65   81.50
VGG19      40.25   56.25   50.25   65.75      64.25   76.75   74.45   84.75
ResNet18   31.75   61.50   56.35   68.50      57.50   85.00   74.80   83.00
ResNet50   40.75   63.00   48.40   56.25      66.50   80.25   70.65   77.50

Table 3. Rank-1 and rank-5 accuracy (%) of models trained and tested on the CVLE dataset (column abbreviations as in Table 2)

           Rank-1                              Rank-5
Network    DF      DF-aug  FT-1    FT-ens     DF      DF-aug  FT-1    FT-ens
AlexNet    77.57   79.13   85.86   89.10      96.57   98.13   97.76   97.82
VGG16      79.13   81.93   90.16   93.15      95.02   97.82   99.37   99.38
VGG19      86.29   86.60   89.41   92.52      96.57   98.75   98.07   99.69
ResNet18   87.54   93.46   90.59   93.46      93.46   99.38   99.19   99.38
ResNet50   86.92   92.83   91.40   94.08      97.51   99.03   98.87   99.69

Table 4. Rank-1 and rank-5 accuracy (%) of models trained and tested on the AWE + CVLE dataset (column abbreviations as in Table 2)

           Rank-1                              Rank-5
Network    DF      DF-aug  FT-1    FT-ens     DF      DF-aug  FT-1    FT-ens
AlexNet    41.89   55.76   58.39   66.16      66.71   79.47   79.53   84.74
VGG16      43.00   54.37   68.99   77.95      64.91   78.64   86.29   90.29
VGG19      49.38   64.49   68.90   78.92      73.51   82.39   86.43   90.29
ResNet18   46.19   64.91   71.87   80.03      69.35   84.33   86.68   93.48
ResNet50   58.11   65.60   69.90   75.73      76.84   84.88   85.55   90.85

Figure 5. CMCs for the ResNet18 feature-based SVM models. The models perform better at all ranks when the training data is augmented. (a) AWE, (b) CVLE, (c) AWE + CVLE
