Dual branch convolutional neural network for copy move forgery detection
2020; Institution of Engineering and Technology; Volume: 15; Issue: 3; Language: English
10.1049/ipr2.12051
ISSN: 1751-9667
Authors: Nidhi Goel, Samarjeet Kaur, Ruchika Bala
Topic(s): Law in Society and Culture
Summary
IET Image Processing, Volume 15, Issue 3, p. 656-665. Original research paper. Open access.

Nidhi Goel (corresponding author), nidhi.iitr1@gmail.com, orcid.org/0000-0001-7089-7077, Department of Electronics and Communication Engineering, Indira Gandhi Delhi Technical University for Women, Delhi, India
Samarjeet Kaur, Department of Electronics and Communication Engineering, Bharati Vidyapeeth College of Engineering, Delhi, India
Ruchika Bala, Department of Information Technology, Indira Gandhi Delhi Technical University for Women, Delhi, India

First published: 24 December 2020. https://doi.org/10.1049/ipr2.12051. Citations: 9

Abstract

The advent of the digital era has seen a rise in cases of illegal copying, distribution and forging of images. Even the most secure data channels sometimes struggle to validate the integrity of images. Forgery of multimedia data is devastating in important applications such as defence and satellite imaging. Increased illegal tampering of images has paved the way for research in the area of digital forensics. Copy move forgery is one of the tampering techniques used to manipulate an image's content. A deep learning-based passive Copy Move Forgery Detection algorithm is proposed that uses a novel dual branch convolutional neural network to classify images as original or forged. The dual branch convolutional neural network extracts multi-scale features by employing different kernel sizes in each branch. Fusion of the extracted multi-scale features is then performed to achieve good accuracy, precision and recall scores.
Experimental analysis on the MICC-F2000 dataset has been performed under two different kernel size combinations. Extensive result analysis and comparative analysis demonstrate the efficacy of the proposed architecture over existing architectures in terms of performance scores, computation time, and complexity.

1 INTRODUCTION

Social networking has gained a lot of attention over the last decade. Its proliferation and ubiquitous presence in both professional and personal life have led to increased transfer of multimedia data, namely audio, video, images, and documents, over insecure wired and wireless networks. With the increased data transfer in day-to-day life, illegal operations have also increased. Owing to technological advancement, potential attackers are now better equipped with tools to illegally copy, copy-move, retouch, manipulate, or distribute digital data. To counter such operations, various techniques have been developed that encrypt or watermark the data as a line of defence, providing data confidentiality and copyright protection, respectively [1-3].

The development of image editing technologies has made image manipulation easier while making it difficult to distinguish between altered and natural images [4]. Image tampering methods primarily include image retouching, image morphing, resampling, splicing, copy move forgery, image generation, and colourisation [4-6]. These techniques, individually or in combination, are generally used to wrongfully alter image contents and spread misinformation. Image splicing refers to using cut-paste operations to generate a new image by merging portions of two or more images [7], whereas copy move forgery is an image manipulation technique in which portions of a picture are duplicated, that is, copied and pasted at some other location within the same image [8].
The duplicated region may undergo some manipulations, for example scaling and brightness change, before being pasted elsewhere. Image retouching involves small localised adjustments, generally followed by global adjustments such as contrast adjustment, brightness control and white balancing, while image inpainting restores an image by replacing damaged or missing content in accordance with the surrounding image content [4]. Similarly, colourisation usually takes grayscale images and colourises them with visually realistic colours, which can cause errors when specific objects or scenes are identified or detected [4].

Such widespread use of image tampering methods has led to the emergence of digital image forensics, which is essential to prevent or detect fraud and to resolve copyright disputes by establishing the integrity and authenticity of digital images [9, 10]. Forgery detection techniques are mainly categorised as active and passive methods [11]. The former rely on authentication information, such as a digital signature or a watermark, embedded within the image during creation or before it is shared publicly [12]. The latter, passive detection techniques, do not rely on embedded information; instead, they rely on image features to identify tampered images. Passive detection methods are more robust and have a wider range of applicability, as most images on social media do not carry embedded identity information.

Conventionally, passive image forgery detection methods have focussed on the detection of copy move forgery, image splicing and image retouching. Among these, copy move forgery is difficult to detect because many characteristics of the forged region, such as colour, texture and device properties, are the same as in the rest of the image. Further, the use of compression, blurring, rotation, noise, etc. makes the identification of copy-move forgery even more challenging [4].
The conventional algorithm for Copy Move Forgery Detection (CMFD) divides the suspicious image into blocks and computes block-based features using DCT, PCA, etc. The similarity between these block-based features helps to identify the tampered region [9, 13]. Another approach uses similarity between keypoint features rather than blocks. The keypoint features are calculated using SURF, SIFT, etc. [14, 15]. The former approach is effective, but it requires high computational resources and is limited under geometrical transformations. The latter approach is robust against geometrical transformations, but it does not perform satisfactorily when the tampered regions are smooth.

The past few years have witnessed a surge in the exploration of deep learning architectures for various image processing applications, including image forgery detection. Several convolutional neural network (CNN) based and transfer learning-based architectures have been proposed that learn complex contextual features to detect forgery but have poor pixel accuracy [16, 17]. Most CNN-based architectures combine segmentation with classification in order to both detect and locate the forgery [18, 19]. Although many such techniques have been proposed, they do not perform well as CMFD methods: they either give poor scores on the evaluation metrics or require high computational time and complexity to achieve better scores.

This paper proposes a novel deep learning architecture to solve the problem of CMFD in a fast and efficient manner. The proposed architecture is a dual branch CNN that uses a different kernel size in each branch to extract different features. These features are then concatenated, and the dominant features are extracted by the last global max-pool layer while keeping processing overhead minimal. The outlined experiments are conducted on the MICC-F2000 dataset under different parameter setups.
Thorough performance and comparative analysis indicates that the computation time for the proposed dual branch CNN-based architecture is very low. Also, the proposed architecture outperforms state-of-the-art techniques on various objective parameters.

The next section discusses the existing work in this field. Section 3 discusses the details of the dataset, whereas the proposed architecture, including the pre-processing part and the proposed dual branch CNN, is presented in Section 4. Sections 5 and 6 discuss the experimental and comparative analysis, respectively. The last section presents the conclusion of the proposed work.

2 RELATED WORK

This section presents the state-of-the-art CMFD techniques. Forgery detection techniques must be highly accurate and reliable. In addition, the algorithms must be fast, efficient, robust to a variety of attacks such as noise addition, rotation, and scaling, and must have low computational complexity [20]. These properties are generally considered while evaluating the efficacy of a CMFD technique. CMFD techniques are usually divided into two categories: block-based CMFD techniques and key point-based CMFD techniques.

In the block-based methods, the image is divided into overlapping or non-overlapping rectangular or circular patches, followed by extraction of certain features for each patch. Various pre-processing methods such as image transforms, colour space transformation, and dimensionality reduction are utilised for feature extraction [21]. The literature also reports several mathematical transforms that are applied before the feature extraction step in CMFD techniques. Characteristics such as image intensity and texture are also extracted and used to construct the final feature vector. The patches are then sorted using an appropriate algorithm, followed by a comparison to find the similarity of adjacent blocks [21]. This matching step is the most crucial, as it determines the presence of a duplicated region [22]. Alkawaz et al. [23] used the Discrete Cosine Transform (DCT) separately for feature extraction from each block. The coefficients generated are used as the features, followed by lexicographic sorting of the feature vectors [23]. Similarly, the Discrete Wavelet Transform (DWT) is another widely used operation, since it allows analysis of both time and frequency signals. Jaiprakash et al. [24] used both DCT and DWT to propose a novel low-dimensional feature model in which statistical moments from inter-block differences, pixel correlation and histograms are used for feature extraction. An ensemble classifier classifies the images as authentic or forged. An improved block-based CMFD method has also been proposed that detects geometric distortions in images by using Discrete Radial Harmonic Fourier Moments for feature extraction [25]. Despite providing good accuracy, block-based methods come at the cost of high computational complexity, since each block is processed and the overhead increases linearly with the image size. Hence, key point-based methods were also explored in many research works [26, 27].

Key point-based forgery detection techniques operate on the entire image at once. Mainly two key point descriptors, namely the Scale Invariant Feature Transform (SIFT) and Speeded Up Robust Features (SURF), have been found to provide good results. In one method, SURF is used for key point detection and GLCM is then applied at the key points to obtain co-occurrence matrices [28]. Each matrix is summed column-wise to obtain the feature descriptors. Wavelet decomposition followed by SURF key point extraction, with an SVM to distinguish forged images from authentic ones, has also been proposed [29]. Similarly, SIFT-based key points are extracted and compared for scale, followed by orientation adjustment, to identify possible forgery blocks in the detection of rotational copy move forgery [30].
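The block-based pipeline summarised above (overlapping blocks, one feature vector per block, lexicographic sorting, comparison of neighbouring entries) can be illustrated with a small sketch. It is only a toy under stated assumptions: raw pixel values stand in for the DCT or DWT features of the cited methods, matching is exact rather than similarity-based, and the function names and the tiny test image are hypothetical.

```python
# Toy block-based copy-move check on a small grayscale image (list of rows).
# Assumption: raw pixel values replace the DCT/DWT block features used by
# the cited methods; all names below are hypothetical.

def block_features(image, block):
    """Return (feature, (row, col)) for every overlapping block x block patch."""
    rows, cols = len(image), len(image[0])
    feats = []
    for r in range(rows - block + 1):
        for c in range(cols - block + 1):
            patch = tuple(image[r + dr][c + dc]
                          for dr in range(block) for dc in range(block))
            feats.append((patch, (r, c)))
    return feats


def find_duplicates(image, block=2, min_shift=2):
    """Sort features lexicographically, then compare neighbours in sorted order."""
    feats = sorted(block_features(image, block))
    matches = []
    for (f1, p1), (f2, p2) in zip(feats, feats[1:]):
        shift = abs(p1[0] - p2[0]) + abs(p1[1] - p2[1])
        # Identical features far enough apart suggest a duplicated region.
        if f1 == f2 and shift >= min_shift:
            matches.append((p1, p2))
    return matches


# A 4 x 6 image in which the 2 x 2 patch at (0, 0) was pasted at (2, 4).
image = [
    [9, 8, 1, 2, 3, 4],
    [7, 6, 5, 10, 11, 12],
    [13, 14, 15, 16, 9, 8],
    [17, 18, 19, 20, 7, 6],
]
matches = find_duplicates(image)
print(matches)
```

Sorting brings identical feature vectors next to each other, so only adjacent entries need to be compared instead of every pair of blocks. Real detectors compare robust features with a similarity threshold, which this sketch does not attempt.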
All the works discussed above rely on classical machine learning techniques and, in most cases, on manually engineered features. This is not conducive to scenarios where the scope of application is very broad and the variability in the input data cannot be predicted. Manually designed feature extraction methods are limited in the kind and extent of information that they can extract. For instance, Local Binary Pattern (LBP) based feature extraction will fail to extract meaningful information from the colour of the image, since it focuses only on the textural aspect. This has led to the recent spike in the use of deep learning-based architectures for solving such problems.

CNNs have found their way into nearly every image processing application. CNNs automate the feature extraction process by generating feature maps at every stage of the network. These feature maps extract features by performing the convolution operation over the whole image, with the weights learned while training on a set of images [31]. The kernels are capable of extracting features that may be missed by statistical transforms and other mathematical feature extraction techniques. This gives CNN-based architectures an edge over the traditional methods, especially in image processing problems [31].

The capability of CNNs for CMFD is explored by Abdalla et al. [32], wherein the proposed model was tested on a combination of datasets for both forgery detection and localisation. Features are extracted using a CNN model and later classified using a softmax decision function. Analysis indicated that the model was better able to detect active forgery than passive forgery [32]. A combination of classic and deep learning methods was also proposed, using a dense inception net architecture for learning feature correlations and thereby detecting the forgery [33].
The framework consists of (a) a Pyramid Feature Extractor (PFE) to extract multi-scale and multidimensional features, (b) a Feature Correlation Matching (FCM) module that looks for correlation within those dense features for forgery detection and (c) a Hierarchical Post-Processing (HPP) module. The FCM module helped the model to detect the forged regions in completely unseen snippets very efficiently [33]. Similarly, a CNN model has been proposed, evaluated and tested for CMFD performance on multiple datasets [34]. The model extracts hierarchical features of an input image, learns those features and uses the information contained in the learned feature maps to classify the image as forged or pristine.

The deep learning methods for CMFD present in the literature either do not give good accuracy, precision and recall scores, or need high computational complexity and time to achieve good scores. The existing techniques trade time complexity for good scores by using too many parameters in the model. To overcome both of these issues, the present paper proposes a dual branch CNN architecture. In the proposed architecture, a deep learning backbone enables deep feature extraction and a dual branch design makes it possible to extract multi-scale features, helping to attain better scores. The proposed architecture is efficient, lightweight and gives good prediction performance. The proposed architecture and its performance analysis are presented in the subsequent sections.

3 DATASET

Deep learning frameworks require a large dataset for training and testing of the model. Many such datasets are publicly available for the detection of copy move forgery attacks. The present work uses MICC-F2000 [35], which has a total of 2000 images (700 tampered and 1300 original images) from the Columbia photographic image repository [36]. The original image dimensions are 2048 × 1536, wherein the tampered region is constrained to occupy 1.12% of the pixels of an image [34, 35].
The forged class is obtained by applying 14 different attacks to each authentic image to generate the tampered images. The dataset is deliberately given a class imbalance to reproduce a practical scenario, where only a fraction of images will be tampered. Therefore, only some of the images are tampered, while the rest are present in their original form in the dataset. The forgeries were generated by selecting a rectangular patch from the image and copy-pasting it into the original image, either in its original form or after applying image transformations such as translation, scaling (both symmetric and asymmetric) and rotation. Combinations of these attacks were also used to generate the forged images. The dataset thus encompasses a variety of attacks that are widely used to forge images, which makes it suitable for evaluating the robustness of a CMFD algorithm. A few original images and their forged counterparts from the MICC-F2000 dataset are shown in Figure 1. The first column in every row shows the original image and the subsequent columns in that row show the forged counterparts.

FIGURE 1. Original image and forged images

4 PROPOSED FRAMEWORK

The main objective of CMFD is to distinguish between an original and a tampered image. To achieve this, the proposed framework is divided into two parts: the first part performs minimal pre-processing and on-the-fly operations, whereas the second part is the modified CNN architecture that extracts features from the pre-processed images and performs binary classification of images as original or tampered. The basic block diagram of the framework, including the proposed architecture, is shown in Figure 2.

FIGURE 2. Block diagram

The MICC-F2000 dataset comprises images of size 2048 × 1536. Images of this size increase the computational complexity and the model takes more time to converge.
Work on transforms, feature extraction and dimensionality reduction methods for CMFD has shown that image size is not the foremost factor affecting the quality of predictions. The collective characteristics of a pixel group are seen to have more significance than individual pixel characteristics. Thus, the images are reduced to a fixed size of 700 × 700 to make the computation feasible without affecting the image features or characteristics. The resized images are then standardised on-the-fly before being given as input to the proposed CNN-based architecture.

The proposed architecture is a dual branch CNN-based architecture, where both branches are connected to a common input. There are three convolution layers in each branch, with 16, 32 and 64 feature maps for the first, second and third layer, respectively. All the convolutional layers use ReLU activation and each convolutional layer is followed by a 2 × 2 max-pooling layer. To extract multi-scale features from the images, the convolutional layers in the two branches have different kernel sizes. Since experiments were conducted by varying the kernel sizes, in some cases one zero-padding layer was added to ensure a symmetric output. The output of the third convolution layer from both branches is passed through a concatenation layer. This generates a stack of multi-scale feature maps extracted from a common input. The concatenated output of this layer is fed to a global max-pooling layer, which retains only the maximum value per feature map. This layer acts as a flattening layer in the architecture and converts the two-dimensional input to a one-dimensional output. The resulting one-dimensional vector of length 128 is passed into the second last layer, which is a dense layer with 32 units. This vector of length 32 is fed to the last dense layer, which has a single unit. Sigmoid activation is used in both dense layers.
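The need for a zero-padding layer before concatenation can be made concrete with a little shape arithmetic. The sketch below is a hedged illustration, not the authors' code: it assumes stride-1 convolutions without padding ('valid'), 2 × 2 pooling with stride 2 that floors odd sizes, and the kernel sizes 3, 5 and 8 that the experiments use. None of these padding and stride details are stated in the paper, and the function name is hypothetical.

```python
# Spatial output size and receptive field of one branch made of three
# (convolution, 2x2 max-pool) stages.
# Assumptions (not stated in the paper): stride-1 'valid' convolutions and
# stride-2 pooling that floors odd sizes.

def branch_summary(input_size, kernel, stages=3, pool=2):
    """Return (output_size, receptive_field) along one spatial axis."""
    size = input_size
    rf, jump = 1, 1  # receptive field, and input-pixel step between outputs
    for _ in range(stages):
        size = size - kernel + 1      # valid convolution, stride 1
        rf += (kernel - 1) * jump
        size = size // pool           # max-pooling, stride equal to pool size
        rf += (pool - 1) * jump
        jump *= pool
    return size, rf


summary = {k: branch_summary(700, k) for k in (3, 5, 8)}
for k, (size, rf) in summary.items():
    print(f"kernel {k}x{k}: output {size}x{size}x64, receptive field {rf}x{rf}")

# Global max-pooling keeps one value per feature map, so concatenating the
# two 64-channel branches gives a vector of length 64 + 64.
vector_length = 64 + 64
```

Under these assumptions the branches of a pair end with different spatial sizes, so one of them has to be padded before the feature maps can be stacked, which is consistent with the zero-padding layer mentioned above. The receptive field also grows with the kernel size, which is the sense in which the two branches see the image at different scales.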
The last layer generates the class probability 'p' that the image is authentic. Hence, '1-p' is the probability of the image being forged. A decision threshold of 0.5 is used to classify an image as original or forged: an output probability greater than 0.5 indicates an original image, and otherwise a forged image. In binary labels, '1' denotes an original image and '0' denotes a forged image.

5 EXPERIMENTAL ANALYSIS

The proposed model architecture was implemented in Python using the Keras library. All the stated experiments were performed on an Intel Core i5 8th Gen processor with 24 GB of system RAM and an NVIDIA GeForce GTX 1050 Ti graphics card with 4 GB of memory. The dataset was randomly split into train, test and validation sets before model training. The validation set was provided as an input to the model at every epoch, which made it possible to monitor the model's performance on unseen data at every epoch. The final results were obtained on the test set.

The total number of epochs was set to 100. The training process was monitored for improvements in validation loss over overlapping intervals of 20 epochs. In the absence of any improvement over such an interval, training was stopped automatically. This is called early stopping, that is, training stops if the model is not improving. The validation loss was monitored because it gives a better idea of the model's predictions on unseen data. Accuracy is not monitored, since the same validation accuracy can correspond to different validation losses. The objective was to find the best version of the model. For training, the learning rate was set to 0.0001 and the batch size to 5. During testing, the batch size was set to 1. Adam was used as the optimiser with the binary cross-entropy loss function. For a thorough analysis, the data was divided in the ratio 85:15 between the training set and the combined test-validation set.
The 15% portion is further divided equally into testing and validation sets, with 7.5% of the total images in each set. The 2000 images in the dataset thus result in 1700 images in the training set, 150 images in the validation set and 150 images in the test set. Model performance was carefully monitored and evaluated for various parameters, including prediction accuracy.

The size of the kernel directly determines the receptive field of the network. Large kernels can overlook finer details and skip essential information; very small kernels, on the contrary, can provide too much information, which can sometimes be misleading. In the detection of copy move forgery attacks, many methods use block-based approaches [21, 37] and look for similarities between the feature vectors generated by each block. The block size in that scenario is analogous to the receptive field of the CNN. It is observed that in most of the block-based approaches, the size of the block is never too small. Two different combinations of kernel sizes were experimented with: (a) 3 × 3 and 5 × 5, that is, (3,5), and (b) 5 × 5 and 8 × 8, that is, (5,8). In the former combination the first branch uses kernel size 3 × 3 while the second branch uses 5 × 5. Similarly, in the latter combination the first branch uses kernel size 5 × 5 while the second branch uses 8 × 8. The performance of the proposed architecture for these combinations was carefully observed through metrics such as precision, recall, F1 score, sensitivity, specificity, True Positive Rate (TPR) and False Positive Rate (FPR).
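As a minimal sketch of how the metrics listed above follow from confusion-matrix counts, the snippet below treats forged as the positive class, which is the convention this section adopts, and uses the standard harmonic-mean form of the F1 score, 2PR/(P+R). The counts passed in are hypothetical and chosen only for illustration; they are not the paper's results, and the function name is an assumption.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the listed metrics from confusion-matrix counts.

    Forged is treated as the positive class, original as the negative class.
    """
    recall = tp / (tp + fn)           # sensitivity, true positive rate
    specificity = tn / (tn + fp)      # equals 1 - false positive rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "accuracy": accuracy,
        "sensitivity": recall,
        "specificity": specificity,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "fpr": 1 - specificity,
    }


# Hypothetical counts for illustration only (not taken from the paper).
metrics = classification_metrics(tp=8, fp=2, tn=9, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity, recall and TPR are the same quantity, and specificity is the complement of FPR, so the seven names reduce to five independent computations.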
The widely used parameters are defined as follows:

\[ \text{Sensitivity} = \text{TPR} = \frac{TP}{TP + FN} = \text{Recall} \]

\[ \text{Specificity} = 1 - \text{FPR} = \frac{TN}{TN + FP} \]

\[ \text{Precision} = \frac{TP}{TP + FP} \]

\[ \text{F1 score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]

where TP is True Positives, FP is False Positives, TN is True Negatives, FN is False Negatives, TPR is True Positive Rate and FPR is False Positive Rate.

The values were calculated by treating forged as the positive class and original as the negative class. Therefore, positive samples belong to the forged class, whereas negative samples belong to the original class. Table 1 summarises the obtained results. The obtained values indicate a good accuracy score of 0.96 for both combinations, a specificity of 0.93 and a precision of 0.89, with a perfect sensitivity and recall score of 1. The difference in the performance of these combinations can be seen in the mean ROC-AUC score and the mean precision-recall area under curve (AUC) score. The values obtained for both these scores indicate that the (5,8) combination performs better than the (3,5) combination.

TABLE 1. Parametric values for proposed architecture

Kernel                       3 × 3 and 5 × 5    5 × 5 and 8 × 8
Accuracy                     0.96               0.96
Sensitivity                  1                  1
Specificity                  0.93               0.93
Precision                    0.89               0.89
Recall                       1                  1
F1 score                     0.94               0.94
Mean ROC-AUC                 0.94               0.95
Mean Precision-Recall-AUC    0.87               0.895

The performance of the proposed architecture is also analysed using training-validation accuracy and loss plots. These plots are shown in Figure 3 and depict the training quality of the model. For a model that is neither under-fitting nor over-fitting, the training and validation curves closely follow each other. This signifies that with every step the model maintains its generalisability to unseen data.
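The early-stopping rule described in the experimental setup (stop when the validation loss has not improved for 20 epochs) determines where such training curves end. A minimal sketch of a patience rule is given below; it is an illustration under assumptions, with a hypothetical function name and a made-up loss curve, and it does not reproduce any particular library's callback.

```python
def early_stopping_epoch(val_losses, patience=20):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` epochs have passed without a new
    minimum of the validation loss.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # Patience never ran out: training used every available epoch.
    return len(val_losses), best_epoch


# Hypothetical validation-loss curve: improves for 30 epochs, then plateaus
# at a slightly worse value for the remaining 70 of the 100 epochs.
losses = [1.0 / e for e in range(1, 31)] + [0.05] * 70
stop_epoch, best_epoch = early_stopping_epoch(losses)
print(stop_epoch, best_epoch)
```

The model kept for evaluation would normally be the one from `best_epoch`, not from `stop_epoch`, which matches the stated objective of finding the best version of the model.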
FIGURE 3. Accuracy and loss plots: (a) training accuracy, (b) training loss, (c) validation accuracy, (d) validation loss

Four plots were obtained over the training epochs for both kernel combinations. Each graph has two curves corresponding to the two kernel size combinations. The performance measures were obtained for both the training and validation sets in order to gauge the learning and generalisation ability of the model at the same time. It can be seen that the accuracy curves rise and then saturate at a point of best performance. Similarly, the loss plots attain a minimum and then saturate, with occasional spikes. An important observation here is the epoch at which the model converges for the two kernel combinations. For the (5,8) combination, the model attains its point of best performance faster than for the (3,5) combination. This suggests that a larger filter size is more suitable for the current image size of 700 × 700.

Two kinds of diagnostic curves were plotted for both the forged and authentic classes corresponding to both the train test split ratios: receiver operating characteristic (ROC) curves and precision-recall curves. The ROC curve is a way to visualise the discrimination ability of a binary classifier as its decision threshold is varied. This curve plots the TPR versus the FPR as the decision threshold varies. ROC curves and the associated AUC depict the diagnostic ability of the classifier at different thresholds. Each of the obtained ROC plots (Figure 4) has two curves corresponding to the two combinations of kernel sizes. The AUC is consistently higher for both the forged and original classes with the kernel sizes (5,8). This may be due to the larger receptive area of the 8 × 8 kernel compared to the 3 × 3 kernel (the 5 × 5 kernel size being common to both combinations).
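How an ROC curve and its AUC arise from sweeping the decision threshold can be sketched in a few lines. The labels and scores below are hypothetical, label 1 marks whichever class is treated as positive, the score is the predicted probability of that class, and the function names are assumptions made for illustration.

```python
def roc_points(labels, scores):
    """(FPR, TPR) pairs obtained by sweeping the threshold over the scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve, by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical predictions for six images (not the paper's data).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores)
area = auc(points)
print(points)
print(area)
```

An AUC of 1 would mean every positive sample scores above every negative one, while 0.5 corresponds to chance. Swapping which class counts as positive (and using 1 - p as the score) gives the curve for the other class.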
FIGURE 4. ROC curves

Precision-recall curves are more suitable for representing the differences across classes when there is a class imbalance. In this case, the forged class is in the minority while the original class is in the majority. Therefore, precision-recall curves are also plotted in a similar manner. This curve plots precision and recall values as the decision threshold varies. The obtained precision-recall curves for the original and forged classes are shown in Figure 5. It can be observed that the (5,8) kernel combination evidently outperforms the (3,5) combination for the forged class. However, no such difference can be observed in the curve for the original class.

FIGURE 5. Precision-recall curves

The performance of the proposed architecture is also analysed using a bar graph that compares the minimum training and validation loss obtained for each kernel combination. It can be seen in Figure 6 that the minimum training loss for the (3,5) combination is slightly lower than that of (5,8). Despite its better performance on the training set, this model was not able to generalise well to the unseen validation set. The (5,8) combination obtains a lower loss on the validation set. It is always desirable to have a model that performs well on both seen and unseen data. For the proposed architecture, it is observed that the (5,8) combination is superior to (3,5), and this has been verified through various parametric values and graphs.

FIGURE 6. Comparison of binary cross-entropy loss

It can be inferred from here that the combination of (5,8) avoids noise by
Reference(s)