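This tutorial will raise your awareness of the security vulnerabilities of ML models and give some insight into the hot topic of adversarial machine learning. You may be surprised to find that adding imperceptible perturbations to an image can cause drastically different model performance. Since this is a tutorial, we will explore the topic through an example on an image classifier. Specifically, we will use one of the first and most popular attack methods, the Fast Gradient Sign Attack (FGSM), to fool an MNIST classifier.
One of the first and most popular adversarial attacks to date is the Fast Gradient Sign Attack (FGSM), described by Goodfellow et al. in Explaining and Harnessing Adversarial Examples. The attack is remarkably powerful and yet intuitive. It is designed to attack neural networks by exploiting the way they learn: gradients. Rather than adjusting the weights to minimize the loss, the attack adjusts the input data to increase the loss, using the same backpropagated gradients.
Before we jump into the code, let's look at the famous FGSM panda example and extract some notation.
In the figure, x is the original input image, correctly classified as a "panda", and y is the ground truth label for x.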
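The idea can be seen on a model small enough to differentiate by hand. The sketch below is not the tutorial's MNIST code: it assumes a hypothetical three-feature logistic-regression model, and the weights `w`, bias `b`, input `x`, label `y`, and step size `epsilon` are made up for illustration. It moves the input along the sign of the loss gradient and shows that the loss goes up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_input_grad(w, b, x, y):
    """Binary cross-entropy of a logistic model and its gradient w.r.t. the input x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = sigmoid(z)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    # For this model, dL/dx_i = (p - y) * w_i
    grad = [(p - y) * wi for wi in w]
    return loss, grad


def sign(v):
    # Returns 1, 0, or -1
    return (v > 0) - (v < 0)


# Hypothetical model and input, chosen only for illustration
w = [2.0, -1.0, 0.5]
b = 0.1
x = [0.6, 0.2, 0.9]
y = 1
epsilon = 0.1

clean_loss, grad = loss_and_input_grad(w, b, x, y)
# Step each input feature by epsilon in the direction that increases the loss
x_adv = [xi + epsilon * sign(g) for xi, g in zip(x, grad)]
adv_loss, _ = loss_and_input_grad(w, b, x_adv, y)

print("clean loss:", clean_loss)
print("adversarial loss:", adv_loss)
```

No feature moves by more than `epsilon`, yet the loss increases, which is the behaviour the attack relies on.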
- from __future__ import print_function
- import torch
- import torch.nn as nn
- import torch.nn.functional as F
- import torch.optim as optim
- from torchvision import datasets, transforms
- import numpy as np
- import matplotlib.pyplot as plt
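1) epsilons - the list of epsilon values to use for the run. It is important to keep 0 in the list because it represents the model's performance on the original test set. Intuitively, the larger the epsilon, the more noticeable the perturbation, but also the more effective the attack at degrading model accuracy. Since the data range here is [0,1], no epsilon value should exceed 1.
2) pretrained_model - the path to the pretrained MNIST model, which was trained with pytorch/examples/mnist. For simplicity, download the pretrained model instead of training one.
3) use_cuda - a boolean flag to use CUDA if desired and available. Note that a GPU with CUDA is not critical for this tutorial, since running on a CPU does not take much time.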
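To get a feel for the scale of these values: assuming the usual 8-bit MNIST pixels that `transforms.ToTensor()` rescales from [0, 255] to [0, 1], an epsilon on the [0, 1] scale corresponds to `epsilon * 255` gray levels. The helper name `to_gray_levels` is made up for this sketch.

```python
def to_gray_levels(epsilon, max_level=255):
    """Convert a perturbation size on the [0, 1] scale to 8-bit gray levels."""
    return epsilon * max_level


epsilons = [0, .05, .1, .15, .2, .25, .3]
for eps in epsilons:
    print("epsilon = {:.2f} -> about {:.1f} of 255 gray levels".format(eps, to_gray_levels(eps)))
```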
- epsilons = [0, .05, .1, .15, .2, .25, .3]
- pretrained_model = "data/lenet_mnist_model.pth"  # path to the pretrained model weights
- use_cuda = True
- # LeNet Model definition
- class Net(nn.Module):
-     def __init__(self):
-         super(Net, self).__init__()
-         self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
-         self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
-         self.conv2_drop = nn.Dropout2d()
-         self.fc1 = nn.Linear(320, 50)
-         self.fc2 = nn.Linear(50, 10)
-     def forward(self, x):
-         x = F.relu(F.max_pool2d(self.conv1(x), 2))
-         x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
-         x = x.view(-1, 320)
-         x = F.relu(self.fc1(x))
-         x = F.dropout(x, training=self.training)
-         x = self.fc2(x)
-         return F.log_softmax(x, dim=1)
- # MNIST Test dataset and dataloader declaration
- test_loader = torch.utils.data.DataLoader(
-     datasets.MNIST('../data', train=False, download=True, transform=transforms.Compose([
-         transforms.ToTensor(),
-     ])),
-     batch_size=1, shuffle=True)
- # Define what device we are using
- print("CUDA Available: ",torch.cuda.is_available())
- device = torch.device("cuda" if (use_cuda and torch.cuda.is_available()) else "cpu")
- # Initialize the network
- model = Net().to(device)
- # Load the pretrained model
- model.load_state_dict(torch.load(pretrained_model, map_location='cpu'))
- # Set the model in evaluation mode. In this case this is for the Dropout layers
- model.eval()
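Now we can define the function that creates adversarial examples by perturbing the original inputs. The fgsm_attack function takes three inputs: image is the original clean image, epsilon is the pixel-wise perturbation amount, and data_grad is the gradient of the loss with respect to the input image. The function then creates the perturbed image as:
perturbed_image = image + epsilon * sign(data_grad)
Finally, to maintain the original range of the data, the perturbed image is clipped to the range [0,1].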
- # FGSM attack code
- def fgsm_attack(image, epsilon, data_grad):
-     # Collect the element-wise sign of the data gradient
-     sign_data_grad = data_grad.sign()
-     # Create the perturbed image by adjusting each pixel of the input image
-     perturbed_image = image + epsilon*sign_data_grad
-     # Adding clipping to maintain [0,1] range
-     perturbed_image = torch.clamp(perturbed_image, 0, 1)
-     # Return the perturbed image
-     return perturbed_image
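To trace the arithmetic by hand, here is a list-based mirror of the function above. It is only a sketch: `fgsm_attack_lists` and the sample values are hypothetical, and plain Python lists stand in for tensors. It covers three cases worth noticing: a pixel pushed below 0 and clipped, a pixel pushed above 1 and clipped, and a pixel with zero gradient, which stays unchanged because the sign of 0 is 0 (as with `torch.sign`).

```python
def sign(v):
    # Returns 1, 0, or -1
    return (v > 0) - (v < 0)


def fgsm_attack_lists(image, epsilon, data_grad):
    """List-based stand-in for fgsm_attack: sign step, then clip to [0, 1]."""
    perturbed = [p + epsilon * sign(g) for p, g in zip(image, data_grad)]
    return [min(max(p, 0.0), 1.0) for p in perturbed]


# Made-up pixel values and gradients, for illustration only
image = [0.0, 0.5, 0.95, 0.3]
data_grad = [-0.2, 0.7, 0.01, 0.0]

perturbed = fgsm_attack_lists(image, 0.1, data_grad)
print(perturbed)
```

Note that the size of each gradient entry does not matter, only its sign: the entries 0.7 and 0.01 both produce the same step of `epsilon`.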
- def test( model, device, test_loader, epsilon ):
-     # Accuracy counter
-     correct = 0
-     adv_examples = []
-     # Loop over all examples in test set
-     for data, target in test_loader:
-         # Send the data and label to the device
-         data, target = data.to(device), target.to(device)
-         # Set requires_grad on the input tensor. Important for the attack: the gradient is computed w.r.t. the input image
-         data.requires_grad = True
-         # Forward pass the data through the model
-         output = model(data)
-         init_pred = output.max(1, keepdim=True)[1] # get the index of the max log-probability
-         # If the initial prediction is wrong, don't bother attacking, just move on
-         if init_pred.item() != target.item():
-             continue
-         # Calculate the loss
-         loss = F.nll_loss(output, target)
-         # Zero all existing gradients
-         model.zero_grad()
-         # Calculate gradients of model in backward pass
-         loss.backward()
-         # Collect the gradient of the loss w.r.t. the input data
-         data_grad = data.grad.data
-         # Call FGSM Attack
-         perturbed_data = fgsm_attack(data, epsilon, data_grad)
-         # Re-classify the perturbed image
-         output = model(perturbed_data)
-         # Check for success
-         final_pred = output.max(1, keepdim=True)[1] # get the index of the max log-probability
-         if final_pred.item() == target.item():
-             correct += 1
-             # Special case for saving 0 epsilon examples
-             if (epsilon == 0) and (len(adv_examples) < 5):
-                 adv_ex = perturbed_data.squeeze().detach().cpu().numpy()
-                 adv_examples.append( (init_pred.item(), final_pred.item(), adv_ex) )
-         else:
-             # Save some adv examples for visualization later
-             if len(adv_examples) < 5:
-                 adv_ex = perturbed_data.squeeze().detach().cpu().numpy()
-                 adv_examples.append( (init_pred.item(), final_pred.item(), adv_ex) )
-     # Calculate final accuracy for this epsilon (batch_size=1, so len(test_loader) is the number of test images)
-     final_acc = correct/float(len(test_loader))
-     print("Epsilon: {}\tTest Accuracy = {} / {} = {}".format(epsilon, correct, len(test_loader), final_acc))
-     # Return the accuracy and the saved adversarial examples
-     return final_acc, adv_examples
- accuracies = []
- examples = []
- # Run test for each epsilon
- for eps in epsilons:
-     acc, ex = test(model, device, test_loader, eps)
-     accuracies.append(acc)
-     examples.append(ex)
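Although the epsilon values are linearly spaced, the trend in the curve is not linear. For example, the accuracy at eps=0.05 is only about 4% lower than at eps=0, but the accuracy at eps=0.2 is 25% lower than at eps=0.15. Also note that between eps=0.25 and eps=0.3 the accuracy falls to chance level for a 10-class classifier (about 10%).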
- plt.figure(figsize=(5,5))
- plt.plot(epsilons, accuracies, "*-")
- plt.yticks(np.arange(0, 1.1, step=0.1))
- plt.xticks(np.arange(0, .35, step=0.05))
- plt.title("Accuracy vs Epsilon")
- plt.xlabel("Epsilon")
- plt.ylabel("Accuracy")
- plt.show()
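The step-by-step comparison made above can be computed directly from the `epsilons` and `accuracies` lists. The helpers below (`accuracy_drops` and `first_epsilon_at_chance`, both hypothetical names) are a small sketch. The demo numbers are placeholders made up to exercise the code, not results from this tutorial's run.

```python
def accuracy_drops(epsilons, accuracies):
    """Return (eps_prev, eps_next, drop) for each consecutive pair of epsilons."""
    return [
        (epsilons[i - 1], epsilons[i], accuracies[i - 1] - accuracies[i])
        for i in range(1, len(epsilons))
    ]


def first_epsilon_at_chance(epsilons, accuracies, num_classes=10):
    """Return the first epsilon whose accuracy is at or below chance, or None."""
    chance = 1.0 / num_classes
    for eps, acc in zip(epsilons, accuracies):
        if acc <= chance:
            return eps
    return None


# Made-up placeholder numbers, NOT results from this tutorial's run
demo_eps = [0.0, 0.1, 0.2, 0.3]
demo_acc = [0.98, 0.90, 0.50, 0.08]

for lo, hi, drop in accuracy_drops(demo_eps, demo_acc):
    print("eps {} -> {}: accuracy drops by {:.2f}".format(lo, hi, drop))
print("first epsilon at or below chance:", first_epsilon_at_chance(demo_eps, demo_acc))
```

Passing the real `epsilons` and `accuracies` from the run in place of the placeholder lists gives the actual per-step drops.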
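Each row of the plot shows a different epsilon value. The first row shows the eps=0 (no attack) examples, which are the original clean images with no perturbation. The title of each image shows "original classification -> adversarial classification". Notice that the perturbations start to become evident at eps=0.15 and are quite evident at eps=0.3. However, in all cases a human can still identify the correct class despite the added noise.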
- # Plot several examples of adversarial samples at each epsilon
- cnt = 0
- plt.figure(figsize=(8,10))
- for i in range(len(epsilons)):
-     for j in range(len(examples[i])):
-         cnt += 1
-         plt.subplot(len(epsilons),len(examples[0]),cnt)
-         plt.xticks([], [])
-         plt.yticks([], [])
-         if j == 0:
-             plt.ylabel("Eps: {}".format(epsilons[i]), fontsize=14)
-         orig,adv,ex = examples[i][j]
-         plt.title("{} -> {}".format(orig, adv))
-         plt.imshow(ex, cmap="gray")
- plt.tight_layout()
- plt.show()
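Hopefully this tutorial gives some insight into the topic of adversarial machine learning. There are many possible directions to go from here. This attack represents the very beginning of adversarial attack research, and there have since been many subsequent ideas for how to attack and defend ML models from an adversary. In fact, at NIPS 2017 there was an adversarial attack and defense competition, and many of the methods used in it are described in the paper Adversarial Attacks and Defences Competition. Work on defense also leads into the idea of making machine learning models more robust in general.
Another direction to go is adversarial attacks and defenses in other domains. Adversarial research is not limited to the image domain; see, for example, attacks on speech-to-text models.
Of course, the best way to learn more about adversarial machine learning is to get your hands dirty. First, try implementing a different attack from the NIPS 2017 competition and see how it differs from FGSM. Then, try to defend the model from your own attacks.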
