There is growing concern about new security threats that arise as machine learning models become an important component of many critical applications. At the top of the list of threats are adversarial attacks, which use data samples that have been subtly modified to manipulate the behavior of the target machine learning model.
Adversarial machine learning has become a hot research area and the subject of talks and workshops at artificial intelligence conferences. Scientists regularly find new ways to attack and defend machine learning models.
A new technique developed by researchers at Carnegie Mellon University and the KAIST Cybersecurity Research Center employs unsupervised learning to address some of the challenges of current methods used to detect adversarial attacks. Presented at the Adversarial Machine Learning Workshop (AdvML) at the ACM Conference on Knowledge Discovery and Data Mining (KDD 2021), the new technique leverages machine learning explainability methods to find out which input data might have undergone adversarial perturbation.
Creating adversarial examples
Suppose an attacker wants to stage an adversarial attack that causes an image classifier to change the label of an image from “dog” to “cat.” The attacker starts with the unmodified image of a dog. When the target model processes this image, it returns a list of confidence scores, one for each of the classes it has been trained on. The class with the highest confidence score is the one the model assigns to the image.
The attacker then adds a small amount of random noise to the image and runs it through the model again. The modification results in a small change in the model’s output. By repeating the process, the attacker finds a direction that causes the confidence score of the original class to decrease and that of the target class to increase. Eventually, the machine learning model changes its output from one class to the other.
Adversarial attack algorithms usually have an epsilon parameter that limits the amount of change allowed to the original image. The epsilon parameter ensures that the adversarial perturbations remain imperceptible to human eyes.
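The noise-and-repeat loop and the epsilon bound described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the target model is a hypothetical toy linear classifier, the names `predict`, `top_class`, and `random_search_attack` are invented for this example, and the bound is taken to be a per-pixel (L-infinity) limit. Real attack algorithms are considerably more efficient than this random search.

```python
import math
import random


def softmax(scores):
    """Turn raw class scores into confidence scores that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, pixels):
    """Hypothetical target model: a linear classifier, one weight row per class."""
    return softmax([sum(w * p for w, p in zip(row, pixels)) for row in weights])


def top_class(confidences):
    """Index of the class with the highest confidence score."""
    return max(range(len(confidences)), key=confidences.__getitem__)


def random_search_attack(weights, pixels, source, target,
                         epsilon=0.1, step=0.02, max_queries=2000, seed=0):
    """Keep random nudges that move confidence from `source` to `target`."""
    rng = random.Random(seed)
    adversarial = list(pixels)
    conf = predict(weights, adversarial)
    best_gap = conf[target] - conf[source]
    for _ in range(max_queries):
        if top_class(conf) == target:
            break  # the model now outputs the target class
        candidate = []
        for original, current in zip(pixels, adversarial):
            value = current + rng.uniform(-step, step)
            # Epsilon bound (assumed per-pixel): stay within epsilon of the
            # original pixel and inside the valid [0, 1] intensity range.
            low = max(original - epsilon, 0.0)
            high = min(original + epsilon, 1.0)
            candidate.append(min(max(value, low), high))
        cand_conf = predict(weights, candidate)
        gap = cand_conf[target] - cand_conf[source]
        if gap > best_gap:  # keep only changes that help the attacker
            adversarial, best_gap, conf = candidate, gap, cand_conf
    return adversarial
```

A smaller `epsilon` makes the change harder to notice but also gives the attacker less room, so the search may run out of queries before the label flips.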
There are different ways to defend machine learning models against adversarial attacks. However, the most popular defense methods introduce considerable costs in computation, accuracy, or generalizability.
For example, some methods rely on supervised adversarial training. In such cases, the defender must generate a large batch of adversarial examples and fine-tune the target network to correctly classify the modified examples. This method incurs example-generation and training costs and, in some cases, degrades the performance of the target model on the original task. It is also not guaranteed to work against attack techniques it has not been trained on.
Other defense methods require defenders to train a separate machine learning model to detect specific types of adversarial attacks. This can help preserve the accuracy of the target model, but it is not guaranteed to work against unknown adversarial attack techniques.
Adversarial attacks and explainability in machine learning
In their research, the CMU and KAIST scientists found a link between adversarial attacks and explainability, another key challenge of machine learning. In many machine learning models, especially deep neural networks, decisions are difficult to trace because of the large number of parameters involved in the inference process.
This makes it difficult to use these algorithms in applications where the explanation of algorithmic decisions is a requirement.
To overcome this challenge, scientists have developed different methods that can help understand the decisions made by machine learning models. One family of popular explainability techniques produces saliency maps, in which each feature of the input data is scored based on its contribution to the final output.
For example, in an image classifier, a saliency map rates each pixel based on the contribution it makes to the output of the machine learning model.
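To make the per-pixel contribution map described above (commonly called a saliency map) concrete, here is a minimal sketch. It assumes a simple finite-difference sensitivity score, which is just one of many possible explainability techniques and not necessarily the one used in the paper. The classifier `toy_confidences` and the function `saliency_map` are hypothetical names made up for this example.

```python
import math


def toy_confidences(pixels, weights):
    """Hypothetical classifier: softmax over one linear score per class."""
    scores = [sum(w * p for w, p in zip(row, pixels)) for row in weights]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def saliency_map(predict_fn, pixels, cls, h=1e-4):
    """Score each pixel by how strongly a tiny bump changes the class confidence."""
    base = predict_fn(pixels)[cls]
    scores = []
    for i in range(len(pixels)):
        bumped = list(pixels)
        bumped[i] += h  # nudge one pixel, leave the rest untouched
        scores.append(abs(predict_fn(bumped)[cls] - base) / h)
    return scores
```

A pixel the model ignores gets a score of zero, while pixels with large weights for the chosen class get the highest scores.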
The intuition behind the new method developed by the Carnegie Mellon University researchers is that when an image is modified with adversarial perturbations, running it through an explainability algorithm will produce abnormal results.
“Our recent work started with a simple observation that adding a little noise to the inputs made a big difference in their explanations,” Gihyuk Ko, Ph.D. candidate at Carnegie Mellon and lead author of the paper, told TechTalks.
Unsupervised detection of adversarial examples
The technique developed by Ko and his colleagues detects adversarial examples based on their explanation maps.
The defense is built in several steps. First, an “inspector network” uses explainability techniques to generate saliency maps for the data examples used to train the original machine learning model.
The inspector then uses the saliency maps to train “reconstructor networks” that recreate the explanations of each decision made by the target model. There are as many reconstructor networks as there are output classes in the target model. For example, if the model is a classifier for handwritten digits, it needs ten reconstructor networks, one for each digit. Each reconstructor is an autoencoder network. It takes an image as input and produces its explanation map. For example, if the target network classifies an input image as a “4,” the image is run through the reconstructor network for the class “4,” which produces the saliency map for that input.
Since the reconstructor networks are trained on benign examples, their output will be highly unusual when they are given adversarial examples. This allows the inspector to detect and flag adversarially perturbed images.
The researchers’ experiments show that abnormal explanation maps are common across all adversarial attack techniques. Therefore, the main benefit of this method is that it is attack-agnostic and does not need to be trained on specific adversarial techniques.
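The overall flow of the steps above (one reconstructor per output class, trained only on benign explanation maps, with unusual results flagged) can be sketched as follows. This is a heavily simplified illustration, not the researchers’ implementation: a per-class mean map stands in for each trained autoencoder, and the anomaly score (distance from the benign reconstruction) and the mean-plus-k-standard-deviations threshold are assumptions. The class names `ClassReconstructor` and `Inspector` are hypothetical.

```python
import math
from statistics import mean, pstdev


def l2_distance(a, b):
    """Euclidean distance between two explanation maps of equal length."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


class ClassReconstructor:
    """Stand-in for one per-class autoencoder.

    Assumption: it "reconstructs" every map as the mean benign map,
    which is far cruder than a trained autoencoder.
    """

    def fit(self, benign_maps, k=3.0):
        count = len(benign_maps)
        self.center = [sum(column) / count for column in zip(*benign_maps)]
        errors = [l2_distance(m, self.center) for m in benign_maps]
        # Assumed rule: errors far above the benign range count as abnormal.
        self.threshold = mean(errors) + k * pstdev(errors)
        return self

    def error(self, explanation_map):
        return l2_distance(explanation_map, self.center)


class Inspector:
    """Holds one reconstructor per output class of the target model."""

    def __init__(self):
        self.reconstructors = {}

    def fit(self, benign_maps_by_class, k=3.0):
        for cls, maps in benign_maps_by_class.items():
            self.reconstructors[cls] = ClassReconstructor().fit(maps, k)
        return self

    def is_adversarial(self, predicted_class, explanation_map):
        # Route the input to the reconstructor of the class the target predicted.
        reconstructor = self.reconstructors[predicted_class]
        return reconstructor.error(explanation_map) > reconstructor.threshold
```

Because `fit` only ever sees benign maps, no adversarial examples have to be generated in advance, which mirrors the unsupervised setup described in the article.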
“Before our method, there were suggestions on using SHAP signatures to detect adversarial examples,” Ko said. Those approaches work by separating the SHAP signatures of normal examples from those of adversarial examples. “In contrast, our unsupervised method is computationally better, as no pre-generated adversarial examples are needed. Furthermore, our method can be generalized to unknown attacks (that is, attacks that were not previously trained).”
The scientists tested the method on MNIST, a data set of handwritten digits that is often used to test different machine learning techniques. According to their findings, the unsupervised detection method was able to detect various adversarial attacks with performance on par with or better than known methods.
“While MNIST is a fairly simple dataset for testing methods, we believe that our method will also be applicable to other complicated datasets,” Ko said, though he also acknowledged that obtaining saliency maps from complex deep learning models trained on real-world data sets is much more difficult.
In the future, the researchers will test the method on more complex data sets, such as CIFAR-10/100 and ImageNet, and on more complicated adversarial attacks.
“From the perspective of using model explanations to secure deep learning models, I think model explanations can play an important role in repairing vulnerable deep neural networks,” Ko said.
Ben Dickson is a software engineer and founder of TechTalks. He writes about technology, business, and politics.
This story originally appeared on Bdtechtalks.com. Copyright 2021