A survey on adversarial attacks and defences
2021; Institution of Engineering and Technology; Volume: 6; Issue: 1; Language: English
DOI: 10.1049/cit2.12028
ISSN: 2468-6557
Authors: Anirban Chakraborty, Manaar Alam, Vishal Dey, Anupam Chattopadhyay, Debdeep Mukhopadhyay
Topic(s): Advanced Malware Detection Techniques
CAAI Transactions on Intelligence Technology, Volume 6, Issue 1, pp. 25-45. Review, Open Access.

A survey on adversarial attacks and defences

Anirban Chakraborty (Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal, India); Manaar Alam, corresponding author (Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal, India; email: alam.manaar@iitkgp.ac.in); Vishal Dey (Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio, USA); Anupam Chattopadhyay (School of Computer Science and Engineering, Nanyang Technological University, Singapore); Debdeep Mukhopadhyay (Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal, India)

First published: 22 March 2021. https://doi.org/10.1049/cit2.12028

Abstract

Deep learning has evolved as a strong and efficient framework that can be applied to a broad spectrum of complex learning problems which were difficult to solve using the traditional machine learning techniques in the past. The advancement of deep learning has been so radical that today it can surpass human-level performance. As a consequence, deep learning is being extensively used in most of the recent day-to-day applications.
However, efficient deep learning systems can be jeopardised by crafted adversarial samples, which may be imperceptible to the human eye but can lead the model to misclassify its input. In recent times, different types of adversaries, characterised by their threat models, have leveraged these vulnerabilities to compromise deep learning systems in settings where the adversary has high incentives. Hence, it is extremely important to make deep learning algorithms robust against these adversaries. However, there are only a few strong countermeasures that can be used in all types of attack scenarios to design a robust deep learning system. Herein, the authors attempt to provide a detailed discussion on different types of adversarial attacks with various threat models and also elaborate on the efficiency and challenges of recent countermeasures against them.

1 INTRODUCTION

Deep learning is a branch of machine learning that enables computational models composed of multiple processing layers, with a high level of abstraction, to learn from experience and perceive the world in terms of a hierarchy of concepts. It uses the backpropagation algorithm to discover intricate structure in large datasets, computing the representation of the data in each layer from the representation in the previous layer [1]. Deep learning has proved remarkable in providing solutions to problems that could not be solved with conventional machine learning techniques. With the evolution of deep neural network (DNN) models and the availability of high-performance hardware to train complex models, deep learning has made remarkable progress in the traditional fields of image classification, speech recognition and language translation, along with more advanced areas such as analysing the potential of drug molecules [2], reconstructing brain circuits [3], analysing particle accelerator data [4, 5] and studying the effects of mutations in DNA [6]. Deep learning networks, with their unparalleled accuracy, have brought about a major revolution in artificial intelligence (AI)-based services on the Internet, including cloud-computing-based AI services from commercial players like Google [7] and Alibaba [8], and corresponding platform propositions from Intel [9] and Nvidia [10]. Deep-learning-based applications are used extensively in safety- and security-critical environments such as malware detection, self-driving cars, drones and robotics. With recent advancements in face-recognition systems, ATMs and mobile phones use biometric authentication as a security feature; voice-controllable systems (VCS) and automatic speech recognition (ASR) models made it possible to realise products like Amazon Alexa [11], Apple Siri [12] and Microsoft Cortana [13].

As DNNs have found their way from the lab to the real world, the security and integrity of these applications are a major concern. Adversaries can craftily manipulate legitimate inputs, in ways that may be imperceptible to the human eye, to force a trained model to produce incorrect outputs. Szegedy et al. [14] first showed that well-performing DNNs can also fall victim to adversarial attacks. Carlini et al. [15] and Zhang et al. [16] independently brought forward the vulnerabilities of ASR and VCS. Attacks on autonomous vehicles have been demonstrated by Kurakin et al. [17], where the adversary manipulated traffic signs to confuse the learning model. The paper by Goodfellow et al. [18] provides a detailed analysis, with supporting experiments, of adversarial training of linear models, while Papernot et al.
[19] addressed the aspect of generalization of adversarial examples. Abadi et al. [20] proposed a method to protect the privacy of training data by introducing the concept of distributed deep learning. Recently, in 2017, Hitaj et al. [21] exploited the real-time nature of collaborative learning models to train a generative adversarial network (GAN) and showed that the privacy of such collaborative systems can be jeopardised. Since the findings of Szegedy et al., a lot of attention has been drawn to adversarial learning and its consequences. A number of countermeasures have been proposed in recent years to mitigate the effects of adversarial attacks. Kurakin et al. [17] proposed the idea of adversarial training, which protects the learner by augmenting the training set with both original and perturbed data. Hinton et al. [22] introduced the concept of distillation, which was later used to propose a defence mechanism against adversarial attacks [23]. Samangouei et al. [24] proposed a mechanism that uses GANs as a countermeasure against adversarial perturbations. Although each of these defence mechanisms was found to be efficient against particular classes of attacks, none of them can be used as a one-stop solution for all kinds of attacks. Moreover, implementing these defence strategies can degrade the performance and efficiency of the concerned model.

1.1 Motivation and contribution

Deep learning applications are becoming increasingly important in our daily lives. However, a number of studies have shown that these applications are vulnerable to adversarial attacks. Akhtar et al. [25] presented a comprehensive outline and summary of adversarial attacks on DNNs, but restricted to the context of computer vision. There have been a handful of surveys on security evaluation related to particular machine learning applications [26-29]. Kumar et al. [30] provided a comprehensive survey of prior works by categorizing the attacks under four overlapping classes. The primary motivation of this paper is to summarize recent advances in different types of adversarial attacks and their countermeasures by analysing various threat models and attack scenarios. We follow an approach similar to prior surveys, but without restricting ourselves to specific applications, and in a more elaborate manner with practical examples.

1.1.1 Organization

Herein, we discuss recent advancements in adversarial attacks and present a detailed understanding of the attack models and methodologies. While our major focus is on attacks and defences on DNNs, we also present attack scenarios on support vector machines (SVMs), keeping in mind their extensive use in real-world applications. We first provide a taxonomy of related terms and keywords and categorize the threat models in Section 2. This section also explains adversarial capabilities and illustrates potential attack strategies in the training (e.g. poisoning attack) and testing (e.g. evasion attack) phases. We briefly discuss the basic notion of black-box and white-box attacks with relevant applications, and further classify black-box attacks based on how much information about the system is available to the adversary. Section 3 summarizes exploratory attacks, which aim to learn the algorithms and models of the machine learning systems under attack. Since the attack strategies in evasion and poisoning attacks often overlap, we combine the work focusing on both of them in Section 4.
In Section 5, we discuss some of the current defence strategies, and we conclude in Section 6.

2 TAXONOMY OF MACHINE LEARNING AND ADVERSARIAL MODEL

Before discussing the attack models and their countermeasures in detail, in this section we provide a qualitative taxonomy of the terms and keywords related to adversarial attacks and categorize the threat models.

2.1 Keywords and definitions

In this section, we summarize the approaches predominantly used to solve machine learning problems, with an emphasis on neural networks, and their respective applications.

Support vector machines: SVMs are supervised learning models mathematically formulated as optimal separating hyperplanes, extended to non-linearly separable data through kernel functions. They are widely used for classification, regression and outlier detection. Data are represented as points in space, and the objective is to build a maximum-margin hyperplane that splits the training examples into classes while maximizing the distance between the hyperplane and the closest training points.

Neural networks: An artificial neural network (ANN) is a framework built from a collection of perceptron-like units called neurons; the concept is inspired by biological neural networks. Each neuron maps a set of inputs to an output using an activation function. Learning adjusts the weights (and the activation behaviour) so that the network correctly determines the output, and the weights of a multi-layered feed-forward network are updated by the back-propagation algorithm. The neuron model was first introduced by McCulloch and Pitts, followed by Hebb's learning rule, eventually giving rise to the multi-layer feed-forward perceptron and the back-propagation algorithm. ANNs encompass supervised models (convolutional neural networks [CNNs], DNNs) and unsupervised models (self-organizing maps), together with their learning rules. The neural network models used ubiquitously are discussed below.

DNN: While a single-layer neural network, or perceptron, relies on feature engineering, a DNN enables feature learning directly from raw input data. Multiple hidden layers and their interconnections extract features from the unprocessed input and thus enhance performance by finding latent structures in unlabelled, unstructured data. A typical DNN architecture, graphically depicted in Figure 1, consists of multiple successive layers (at least two hidden layers) of neurons. Each processing layer can be viewed as learning a different, more abstract representation of the original multidimensional input distribution.

CNN: A CNN consists of one or more convolutional or sub-sampling layers followed by one or more fully connected layers; weight sharing in the convolutional layers reduces the number of parameters. The CNN architecture, shown in Figure 2, is designed to take advantage of two-dimensional (2D) input structures (e.g. images). A convolutional layer creates feature maps, and a process called pooling (also known as sub-sampling or down-sampling) reduces the dimensionality of the feature maps while retaining the most important information, making the model robust to small distortions. For example, to describe a large image, feature values in the original matrix can be aggregated over local regions (e.g. max-pooling) to form a lower-dimensional matrix. The final fully connected layer uses the feature representation formed by the previous layers to classify the data. CNNs are mainly used for feature extraction and thus also find application in the data pre-processing commonly used in image recognition tasks.
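To make the convolution, pooling and fully connected pattern concrete, the following is a minimal sketch of a small CNN for 28 × 28 grey-scale inputs in the spirit of the MNIST setting of Figure 2. It assumes PyTorch; the layer sizes and names are illustrative only and are not the architecture used in the figure.

```python
# Minimal sketch of the convolution -> pooling -> fully connected pattern
# described above, assuming PyTorch and 28x28 grey-scale inputs (MNIST-style).
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # convolution: builds feature maps
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pooling: down-samples, keeps salient values
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)  # fully connected layer classifies

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = SmallCNN()
logits = model(torch.randn(8, 1, 28, 28))  # a batch of 8 dummy images
print(logits.shape)                        # torch.Size([8, 10])
```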
2.2 Adversarial threat model

The security of any machine learning model is evaluated with respect to the adversary's goals and capabilities in accessing the model. In this section, we categorize the possible adversarial threat models according to the strength of the adversary. We first identify the attack surface [31] of various real-life applications built on machine learning models, to show how and where an adversary may try to compromise the normal operation of the system.

2.2.1 The attack surface

A system built on machine learning can be viewed as a generalized data-processing pipeline. At testing time, a primitive sequence of operations of the system is: (a) collection of input data from data repositories or sensors, (b) transfer of the data into the digital domain, (c) processing of the transformed data by a machine learning model to produce an output and, finally, (d) an action taken based on the output. For illustration, consider the generic pipeline of an automated vehicle system shown in Figure 3.

FIGURE 1: Deep neural network
FIGURE 2: Convolutional neural network for MNIST digit recognition
FIGURE 3: Generic pipeline of an automated vehicle system

The system collects sensor inputs (images from a camera), from which model features (a tensor of pixel values) are extracted and fed to the model. The model's output is then interpreted (e.g. the probability that the image shows a stop sign) and an appropriate action is taken (stopping the car). The attack surface, in this case, can be defined with respect to the data-processing steps. The objective of an adversary is then to manipulate either the data collection or the data processing in order to corrupt the target model, thus tampering with the original output. The main attack scenarios identified by the attack surface are as follows [29, 32]:

Evasion attack: This is the most common type of attack in the adversarial setting. The adversary tries to evade the system by adjusting malicious samples during the testing phase. This setting does not assume any influence over the training data.

Poisoning attack: This type of attack, known as contamination of the training data, is carried out during the training phase of the machine learning model. An adversary tries to inject skilfully crafted samples to poison the system in order to compromise the entire learning process (a simplified data-injection sketch is given after this list).

Exploratory attack: These attacks do not influence the training dataset. Given black-box access to the model, they try to gain as much knowledge as possible about the learning algorithm of the underlying system and the patterns in the training data.
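As a deliberately simplified illustration of the poisoning scenario above, the sketch below shows a data-injection attacker appending a handful of mislabelled points to a training set so that the learned decision boundary is dragged towards the adversary's target. The function and parameters are hypothetical and are not taken from the survey.

```python
# Hedged illustration (not from the survey) of data-injection poisoning: the
# adversary cannot read or modify the existing training set, but can append a
# few crafted, mislabelled points that bias the subsequently trained model.
import numpy as np

def inject_poison(X_train, y_train, n_poison=20, target_class=1, seed=None):
    rng = np.random.default_rng(seed)
    # Craft points that look like class 0 but carry the adversary's target label.
    centre = X_train[y_train == 0].mean(axis=0)
    X_poison = centre + 0.1 * rng.standard_normal((n_poison, X_train.shape[1]))
    y_poison = np.full(n_poison, target_class)
    # The poisoned set is what the victim unknowingly trains on.
    return np.vstack([X_train, X_poison]), np.concatenate([y_train, y_poison])
```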
The definition of a threat model depends on the information the adversary has at their disposal. Next, we discuss in detail the adversarial capabilities for the threat model.

2.2.2 The adversarial capabilities

The term adversarial capabilities refers to the amount of information available to an adversary about the system, including the attack vector used on the threat surface. For illustration, again consider the automated vehicle system of Figure 3, with the attack surface being the testing time (i.e. an evasion attack). An internal adversary is one who has access to the model architecture and can use it to distinguish between different images and traffic signs, whereas a weaker adversary is one who has access only to the dump of images fed to the model during testing time. Though both adversaries are working on the same attack surface, 'the former attacker is assumed to have much more information and is thus strictly a stronger adversary. We explore the range of attacker capabilities in machine learning systems as they relate to inference and training phases' [31].

Training phase capabilities

Most of the attacks in the training phase are accomplished by learning, influencing or corrupting the model through direct alteration of the dataset. Based on the adversarial capabilities, the attack strategies are broadly classified into the following categories:

Data injection: The adversary has access neither to the training data nor to the learning algorithm, but has the ability to add new data to the training set. The adversary can corrupt the target model by inserting adversarial samples into the training dataset.

Data modification: The adversary does not have access to the learning algorithm but has full access to the training data. The adversary poisons the training data directly by modifying the data before it is used for training the target model.

Logic corruption: The adversary is able to meddle with the learning algorithm. It is evidently very difficult to design a counter-strategy against adversaries who can alter the logic of the learning algorithm and thereby control the model itself.

Testing phase capabilities

Adversarial attacks at testing time do not tamper with the targeted model but rather force it to produce incorrect outputs. The effectiveness of such an attack is determined by the amount of knowledge about the model available to the adversary. These attacks can be categorized as white-box or black-box attacks. Before discussing them, we provide a formal definition of the training procedure for a machine learning model. Let us consider a target machine learning model f trained over input pairs (X, y) from the data distribution μ with a randomized training procedure train having randomness r (e.g. random weight initialization, dropout, etc.). The model parameters θ are learnt after the training procedure. More formally, we can write

θ ← train(f, X, y, r)

Now, let us understand the capabilities of white-box and black-box adversaries with respect to this definition. An overview of the different threat models is shown in Figure 4.

FIGURE 4: Overview of threat models in relevant articles [19, 21, 33-39]

White-box attacks

In a white-box attack on a machine learning model, the adversary has total knowledge of the model (f) used for classification (e.g. the type of neural network along with the number of layers). The attacker knows the algorithm (train) used in training (e.g. gradient-descent optimization) and can access the training data distribution (μ). The adversary also knows the parameters (θ) of the fully trained model architecture. The adversary uses this information to identify the feature space where the model is vulnerable, that is, where the model has a high error rate, and then exploits the model by altering an input using an adversarial example crafting method, which we discuss later. Access to the internal model weights in a white-box attack corresponds to a very strong adversary.
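A canonical example of such gradient-based white-box crafting is in the spirit of the fast gradient sign method (FGSM) of Goodfellow et al. [18]. Below is a minimal, hedged sketch assuming a differentiable PyTorch classifier `model` trained with cross-entropy; it illustrates the idea rather than reproducing any specific implementation from the surveyed papers.

```python
# Hedged sketch of white-box adversarial example crafting: the attacker uses
# full knowledge of the model to follow the gradient of the loss w.r.t. the input.
import torch
import torch.nn.functional as F

def fgsm_example(model, x, y, epsilon=0.03):
    """Perturb inputs x so that model(x) is more likely to mislabel them."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Step in the direction that increases the loss, keeping the change small.
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()

# Usage (assuming `model`, `images` and `labels` already exist):
# adv_images = fgsm_example(model, images, labels)
# print((model(adv_images).argmax(1) != labels).float().mean())  # fraction misclassified
```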
Black-box attacks

A black-box attack, on the contrary, assumes no knowledge about the model and uses information about the settings and prior inputs to exploit it. 'For example, in an oracle attack, the adversary explores a model by providing a series of carefully crafted inputs and observing outputs' [31]. Black-box attacks are further subdivided into the following categories:

Non-adaptive black-box attack: For a target model (f), a non-adaptive black-box adversary can only access the model's training data distribution (μ). The adversary then chooses a training procedure train′ for a model architecture f′ and trains a local model over samples from the data distribution μ to approximate the model learned by the target classifier. The adversary crafts adversarial examples on the local model f′ using white-box attack strategies and applies these crafted inputs to the target model to force mis-classifications.

Adaptive black-box attack: For a target model (f), an adaptive black-box adversary does not have any information regarding the training process, but can access the target model as an oracle. This strategy is analogous to a chosen-plaintext attack in cryptography. The adversary issues adaptive oracle queries to the target model to label a carefully selected dataset, that is, for any arbitrarily chosen x, the adversary obtains its label y by querying the target model f. The adversary then chooses a procedure train′ and a model architecture f′ to train a surrogate model over the tuples (x, y) obtained by querying the target model. The surrogate model is then used to produce adversarial samples with a white-box attack technique, forcing the target model to mis-classify the malicious data (a sketch of this query-then-transfer procedure is given at the end of this subsection).

Strict black-box attack: A strict black-box adversary may not have access to the data distribution μ but is able to collect input–output pairs (x, y) from the target classifier. However, he cannot change the inputs to observe the changes in output, as in the adaptive attack procedure. This strategy is analogous to the known-plaintext attack in cryptography and is most likely to succeed when a large set of input–output pairs is available.

The point to remember in the black-box attack framework is that the adversary tries to learn neither the randomness r used to train the target model nor the target model's parameters θ. The primary objective of a black-box adversary is to train a local model: with the data distribution μ in the case of a non-adaptive attack, and with a carefully selected dataset obtained by querying the target model in the case of an adaptive attack. Table 1 shows a brief distinction between black-box and white-box attacks.

TABLE 1. Distinction between black-box and white-box attacks

Adversary knowledge — black-box attack: restricted knowledge, capable only of observing the outputs for some probed inputs; white-box attack: in-depth knowledge of the underlying architecture and model parameters.

Attack strategy — black-box attack: based on a greedy search that builds an implicit approximation to the actual gradient by observing how the output changes in response to changes in the input; white-box attack: based on the gradient of the loss function with respect to the input data.

The adversarial threat model depends not only on the adversarial capabilities but also on the actions taken by the adversary. In the next subsection, we discuss the goals of an adversary in compromising the security of a machine learning system.
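Before moving on to adversarial goals, the adaptive black-box procedure above can be summarized as a query, label, train and transfer loop. The sketch below is a hedged illustration assuming PyTorch models for both the target oracle and the local surrogate; all names are ours, and the final crafting step would reuse a white-box method such as the FGSM sketch given earlier.

```python
# Hedged sketch of the adaptive black-box procedure: label data by querying the
# target as an oracle, fit a local surrogate f', then craft adversarial inputs
# on the surrogate with a white-box method and rely on transferability.
import torch
import torch.nn.functional as F

def adaptive_black_box(target_oracle, surrogate, queries, epochs=5, lr=1e-3):
    # 1. Adaptive oracle queries: the attacker only observes predicted labels.
    with torch.no_grad():
        labels = target_oracle(queries).argmax(dim=1)
    # 2. Train the surrogate f' on the (x, y) pairs obtained from the oracle.
    opt = torch.optim.Adam(surrogate.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(surrogate(queries), labels).backward()
        opt.step()
    # 3. Craft adversarial samples on the surrogate (white-box, e.g. the
    #    fgsm_example sketch above) and feed them to the target model.
    return surrogate
```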
2.2.3 Adversarial goals

An adversary attempts to provide an input x* to a classification system such that it produces an incorrect output. The adversary's objective is characterised by the kind of incorrectness it seeks to induce. Based on the impact on the integrity of the classifier's output, the adversarial goals can be broadly classified as follows:

Confidence reduction: The adversary tries to reduce the confidence of prediction of the target model. For example, a legitimate image of a 'stop' sign may be predicted with lower confidence, that is, with a smaller probability of belonging to its true class.

Mis-classification: The adversary tries to alter the output classification of an input example to some other class. For example, a legitimate image of a 'stop' sign will be predicted as belonging to any class different from the class of stop signs.

Targeted mis-classification: The adversary tries to craft inputs in such a way that the model outputs a particular target class. For example, any input image to the classification model will be predicted as belonging to the class of 'go' signs.

Source/target mis-classification: The adversary tries to have a particular input (source) classified as a predefined target class. For example, an input image of a 'stop' sign will be predicted as a 'go' sign by the classification model.

The taxonomy of the adversarial threat model for both evasion and poisoning attacks, with respect to adversarial capabilities and goals, is represented graphically in Figure 5. The horizontal axis of both subfigures represents the complexity of the adversarial goals in increasing order, and the vertical axis loosely represents the strength of the adversary in decreasing order. The diagonal axis represents the complexity of a successful attack as a function of the adversarial capabilities and goals.

FIGURE 5: Taxonomy of adversarial model for (a) evasion attacks and (b) poisoning attacks with respect to adversarial capabilities and goals
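Building on the white-box crafting sketch from Section 2.2.2, the following hedged illustration contrasts untargeted mis-classification with targeted mis-classification: the former ascends the loss of the true label, while the latter descends the loss of a chosen target label (e.g. 'go' instead of 'stop'). The function and parameter names are ours, not the survey's.

```python
# Hedged illustration of untargeted versus targeted mis-classification when
# crafting a single-step perturbation with a differentiable PyTorch classifier.
import torch
import torch.nn.functional as F

def craft(model, x, y_true, y_target=None, epsilon=0.03):
    x_adv = x.clone().detach().requires_grad_(True)
    logits = model(x_adv)
    if y_target is None:
        loss = F.cross_entropy(logits, y_true)    # untargeted: push away from the true class
        step = epsilon * torch.autograd.grad(loss, x_adv)[0].sign()
    else:
        loss = F.cross_entropy(logits, y_target)  # targeted: pull towards the target class
        step = -epsilon * torch.autograd.grad(loss, x_adv)[0].sign()
    return (x_adv + step).clamp(0.0, 1.0).detach()
```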
Some of the noteworthy attacks, along with their target applications, are shown in Table 2. The overall architecture of the attack threat model is categorized in Table 3. Further, in Table 4, we categorize those attacks under different threat models and discuss them in detail in the next section.

TABLE 2. Overview of attacks and applications (articles | attacks | applications)
Fredrikson et al. [36] | Model inversion | Biomedical imaging, biometric identification
Tramèr et al. [34] | Extraction of target machine learning models using APIs | Attacks extend to multiclass classification and neural networks
Ateniese et al. [40] | Meta-classifier to hack other classifiers | Speech recognition
Biggio et al. [41, 42] | Poisoning-based attacks: crafted training data | Support vector machines
Dalvi et al. [43]; Biggio et al. [44, 45] | Adversarial classification, pattern recognition | Email spam detection, fraud detection, intrusion detection, biometric identification
Papernot et al. [19, 35] | Adversarial sample crafting, adversarial sample transferability | Digit recognition, black-box attacks against classifiers
Hitaj et al. [21] | GAN under collaborative learning | Classification
Goodfellow et al. [46] | Generative adversarial network | Classifiers, malware detection
Shokri et al. [37] | Membership inference attack | Attack on classifiers trained on commercial 'ML as a service' platforms
Moosavi et al. [38]; Carlini et al. [47]; Li et al. [48] | Adversarial perturbations; sample generation; poisoning-based attack | Image classification; intrusion detection; collaborative filtering systems

TABLE 3. Classification of different attacks based on the attack threat model
Attack surface: evasion attacks; poisoning attacks; exploratory attacks.
Adversarial capabilities: training phase (data injection, data modification, logic corruption); testing phase (white-box attack; black-box attack: non-adaptive, adaptive, strict).
Adversarial goals: confidence reduction; mis-classification; targeted mis-classification; source/target mis-classification.

TABLE 4. Attack summary
Exploratory attacks: model inversion; membership inference attack; model extraction via APIs; information inference.
Evasion attacks: adversarial example generation; generative adversarial networks (GANs); GAN-based attack in collaborative learning; intrusion detection systems; adversarial classification.
Poisoning attacks: support vector machine poisoning; poisoning of collaborative filtering systems; anomaly detection systems.

3 EXPLORATORY ATTACKS

Exploratory attacks do not modify the training set but instead try to gain information about the learner's state by probing it. The adversarial examples are crafted in such a way that the learner passes them off as legitimate examples during the testing phase.

3.1 Model inversion attack

Fredrikson et al. introduced 'model inversion' (MI) in [49], where they used a linear regression model f for predicting drug dosage from patient information, medical history and genetic markers. Exploring the model as a white box, and given an instance of data X = (x1, x2, …, xn, y), they try to infer the genetic marker x1. The algorithm produces a 'least-biased maximum a posteriori (MAP) estimate' for x1 by iterating over all possible values of the nominal feature x1 and selecting the one that best yields the observed target value y, thus minimizing the adversary's mis-prediction rate (a minimal sketch of this candidate-iteration step is given at the end of this section). It has serious limitations; for example, it c
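The candidate-iteration step described above can be sketched as follows. This is a hedged illustration of the MAP idea only: the Gaussian output noise, the finite candidate set and the known prior are our assumptions, and all names are hypothetical rather than the authors' implementation.

```python
# Hedged sketch of the model-inversion idea [49]: with white-box access to a
# regression model f and all features except x1 known, iterate over candidate
# values of the nominal feature x1 and keep the MAP estimate, i.e. the value
# that best explains the observed output y under a prior p(x1).
import numpy as np

def invert_x1(f, known_features, y_observed, x1_candidates, prior, noise_std=1.0):
    scores = []
    for v, p in zip(x1_candidates, prior):
        y_pred = f(np.concatenate(([v], known_features)))
        # Gaussian likelihood of the observed output, weighted by the prior on x1.
        likelihood = np.exp(-((y_observed - y_pred) ** 2) / (2 * noise_std ** 2))
        scores.append(likelihood * p)
    return x1_candidates[int(np.argmax(scores))]
```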