Paper and review comments: https://openreview.net/forum?id=V-RDBWYf0go
Under review as a conference paper at ICLR 2023
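Open-source code has been released, but the link is currently unreachable.
Trojan detection techniques are highly effective, but little work exists on making trojans evade detection. The paper proposes a new method that lets trojans evade general-purpose detectors by combining distribution matching, specificity, and randomization to eliminate the distinguishing features of trojaned networks. The resulting trojans are hard to detect, keep a high attack success rate (ASR), and are hard to reverse-engineer.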
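Trojan Attacks: adversarial perturbations, learnable triggers, and so on
Trojan Detection: reverse engineering (label, neuron), query-based methods, and so on
Evasive Trojans: many methods make trojan triggers stealthy, but few make the trojaned models themselves hard to detect. Existing work either only keeps clean accuracy (CA) from dropping (Gu et al., 2017; Chen et al., 2017; too simple), makes overly strong assumptions (Xu et al. (2021), black-box setting), is not general (Bagdasaryan & Shmatikov (2021); Hong et al. (2021)), or covers only one-layer networks (Goldwasser et al. (2022)). The most similar work is Sahabandu et al. (2022), which trains trojans and a meta-network detector in a min-max alternating fashion so that the trojans are hard to distinguish from clean networks.
In addition, earlier trojan insertion methods make a fairly weak assumption about specificity: a trojan counts as highly specific as long as it does not affect accuracy on clean examples. The authors extend the notion to cover unintended triggers.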
1. A new method that lets trojans evade general-purpose detection. It combines distribution matching, specificity, and randomization to eliminate the distinguishing features of trojaned networks.
Detector-agnostic loss: encourages trojaned networks to be indistinguishable from clean networks
It includes a distribution matching loss inspired by the Wasserstein distance, along with specificity and randomization losses.
White-box attack model: allows defenders full access to training sets of evasive trojans
Advantages: hard to detect, high ASR, hard to reverse-engineer (shown in target label prediction and trigger synthesis)
2. The first to systematically measure reverse-engineering on a large scale
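Neural Trojans: classification networks and all-to-one attacks
Criteria for a successful attack: high ASR, high CA, and high specificity
Threat Model: trojan detection is treated as an interaction between an attacker and a defender. The defender has access to a clean dataset and to trojaned networks, and knows the attacker's trojan distribution.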
The loss function is $L_{task} + L_{trojan} + L_{evasion}$ (the first two are both cross-entropy losses)
$L_{evasion}$ is further split into distribution matching, specificity, and randomization
A GAN-based loss would be too expensive
Instead, the paper uses a loss based on the primal form of the 1-Wasserstein distance
$\theta_f$ denotes the parameters of the trojaned network, $\theta_g$ the parameters of its clean initialization, and $f'(x)$ and $g'(x)$ the output unnormalized logits
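The equation for this loss did not survive in these notes, so the following is only a sketch of the idea, not the paper's formula. The primal form of the 1-Wasserstein distance is an infimum of the expected distance over all couplings of two distributions, so any fixed coupling gives an upper bound. Pairing each trojaned network with its own clean initialization is one such coupling. The particular distance below (mean absolute logit gap plus Euclidean parameter gap) and the function name `distribution_matching_loss` are assumptions made for illustration.

```python
import math


def distribution_matching_loss(logits_f, logits_g, theta_f, theta_g):
    """Hypothetical pairing-based surrogate for a 1-Wasserstein-style loss.

    logits_f, logits_g: unnormalized logits f'(x) and g'(x) for one input.
    theta_f, theta_g:   flattened parameters of the trojaned network and
                        of its clean initialization.
    The choice of distance is an assumption, not taken from the paper.
    """
    # Mean absolute difference between the two networks' logits.
    logit_gap = sum(abs(a - b) for a, b in zip(logits_f, logits_g)) / len(logits_f)
    # Euclidean distance between the two parameter vectors.
    param_gap = math.sqrt(sum((a - b) ** 2 for a, b in zip(theta_f, theta_g)))
    return logit_gap + param_gap
```

The loss is zero only when the trojaned network matches its clean counterpart in both outputs and parameters, which is the sense in which it pulls the two distributions together.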
Prior approach: cross-entropy to the clean label on negative examples (Nguyen & Tran, 2021)
Drawback: clean networks do not behave normally on examples with triggers inserted
Low specificity: unintended triggers can activate the trojan
If a trojan has low specificity, a defender who knows the trigger distribution can check whether known triggers cause anomalous behavior and thereby decide whether a trojan is present (the Spec method).
f is the trojaned network, g is its clean initialization (both evaluated on negative examples), and h is the trigger insertion function
The experiments insert incorrect triggers into 16 examples for blended attacks and 10 examples for patch attacks
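The specificity loss formula is also missing from these notes. A minimal sketch of the idea, assuming the loss is a cross-entropy that pulls the trojaned network's posterior on negative examples toward the posterior of its clean initialization (rather than toward the clean label, as in the prior approach). The function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


def specificity_loss(logits_f_neg, logits_g_neg):
    """Cross-entropy from g's posterior (target) to f's posterior.

    Both logit vectors are computed on the same negative example, i.e.
    an input with an incorrect trigger inserted. Assumed form, not the
    paper's exact equation.
    """
    log_p_f = log_softmax(logits_f_neg)
    p_g = [math.exp(v) for v in log_softmax(logits_g_neg)]
    return -sum(pg * lpf for pg, lpf in zip(p_g, log_p_f))
```

Using g's own output as the target means that f is asked to reproduce whatever a clean network does on an incorrect trigger, including its mistakes.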
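The distribution matching loss introduces emergent coordination in parameter space, so trojans become easy to find by running logistic regression on summary features of the network parameters (the Param method)
$\theta'_f$ and $\theta'_g$ are summary statistics of f and g (differentiable summary features, used because a randomization loss in raw parameter space is too restrictive), and r is a random direction fixed at the start of training.
This loss encourages the difference between $\theta_f$ and $\theta_g$ to not only be small, but to also point in a random direction
In other words, the final loss keeps the distributions close while pushing the difference between the new model's parameters and the original model's parameters to align with r as closely as possible.
r differs from one trained model to the next, so the parameter differences between models become larger.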
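To make the "point in a random direction" part concrete, here is a rough sketch. It assumes the loss is one minus the cosine similarity between the summary-feature difference and r. That exact form is an assumption, as is the name `randomization_loss`. Keeping the difference small would be handled by the distribution matching term, not by this function.

```python
import math


def randomization_loss(summary_f, summary_g, r):
    """1 - cos(summary_f - summary_g, r). Assumed form, for illustration.

    summary_f, summary_g: summary statistics of the trojaned network and
                          of its clean initialization.
    r:                    random direction fixed at the start of training.
    """
    diff = [a - b for a, b in zip(summary_f, summary_g)]
    dot = sum(d * ri for d, ri in zip(diff, r))
    norm_diff = math.sqrt(sum(d * d for d in diff))
    norm_r = math.sqrt(sum(ri * ri for ri in r))
    if norm_diff == 0.0 or norm_r == 0.0:
        # Direction is undefined; treat it as uncorrelated with r.
        return 1.0
    return 1.0 - dot / (norm_diff * norm_r)
```

The value is 0 when the difference points exactly along r, 1 when it is orthogonal, and 2 when it points the opposite way. Because every trojaned network draws its own r, the differences do not share a common direction that a linear detector could pick up.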
Threat model: a supply-chain attack where the attacker has full control over the resulting model.
Training datasets: MNIST, CIFAR-10, CIFAR-100, and GTSRB (downsampled to 32x32)
For each dataset, 250 clean networks and 250 networks for the trojan attack are produced
In total, more than 6k neural networks were trained for the evaluation
Two attacks: the Patch attack (Gu et al., 2017) and the Blended attack (Chen et al., 2017). Both are implemented through data poisoning.
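For reference, the two trigger types can be sketched on a single-channel image stored as a list of rows. This is the generic form of each attack (a patch overwrites a small region, a blend mixes a full-size pattern into the image). The patch size, position, pattern, and blend ratio used in the paper are not given in these notes, so all parameters below are placeholders.

```python
def insert_patch(image, patch, top, left):
    """Patch trigger: overwrite a rectangular region with the patch."""
    out = [row[:] for row in image]  # copy so the input stays untouched
    for i, patch_row in enumerate(patch):
        for j, value in enumerate(patch_row):
            out[top + i][left + j] = value
    return out


def insert_blend(image, pattern, alpha):
    """Blended trigger: x' = (1 - alpha) * x + alpha * pattern."""
    return [
        [(1 - alpha) * x + alpha * t for x, t in zip(img_row, pat_row)]
        for img_row, pat_row in zip(image, pattern)
    ]
```

In a data poisoning setup, a fraction of training images would get a trigger inserted this way and be relabeled to the target class.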
Network architectures and hyperparameters:
Trojan attack procedure:
Six detectors:
accuracy-based detector (Acc): checks only CA
specificity-based detector (Spec): probes the network with random triggers
computes the entropy of average posteriors on inputs with random triggers inserted. If a trojan has low specificity, these random triggers will activate it and reveal its presence
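The statistic behind Spec can be sketched as follows, given the posteriors (softmax outputs) of a network on inputs with random triggers inserted. How the entropy is turned into a detection score is an assumption here: low entropy is read as suspicious, because a low-specificity trojan would push all triggered inputs toward the target label.

```python
import math


def entropy_of_average_posterior(posteriors):
    """Entropy (in nats) of the class distribution averaged over inputs.

    posteriors: list of per-input probability vectors, one per input
                with a random trigger inserted.
    """
    n = len(posteriors)
    num_classes = len(posteriors[0])
    avg = [sum(p[c] for p in posteriors) / n for c in range(num_classes)]
    # Skip zero entries, using the convention 0 * log(0) = 0.
    return -sum(q * math.log(q) for q in avg if q > 0)
```

If every triggered input is classified as the same label, the average posterior is one-hot and the entropy is 0. If predictions stay spread across classes, the entropy approaches log(number of classes).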
Neural Cleanse (NC)
ABS
MNTD: trains a set of queries and a shallow classifier on the training set of clean and trojaned models (Xu et al., 2021)
Param detector: a logistic regressor with a single linear layer that takes summary statistics of the raw network parameters as its input
Max and Avg baselines: the maximum and average AUROC obtained by the six detectors on a given set of trojaned networks
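The two baselines are just aggregates over the per-detector AUROCs. A small sketch, where the function name and the dict layout are assumptions:

```python
def max_and_avg_auroc(auroc_by_detector):
    """Return (Max, Avg) over a mapping from detector name to AUROC."""
    values = list(auroc_by_detector.values())
    return max(values), sum(values) / len(values)
```

Max reflects the strongest detector against a given set of trojans, so it is the harder number for an attack to bring down. Avg summarizes behavior across all six.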
The attack performs quite well
Evaluation metric on the test set of clean and trojaned networks: AUROC
a threshold-independent metric that can be interpreted as the probability that a positive example has a higher detection score than a negative example (Fawcett, 2006), so 50% corresponds to random detection performance.
https://glassboxmedicine.com/2019/02/23/measuring-performance-auc-auroc/
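The probabilistic reading of AUROC can be computed directly by comparing every positive score with every negative score. This is a quadratic-time sketch meant to illustrate the definition. Ties are counted as half a win, which is the usual convention.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a positive example outscores a negative one.

    pos_scores: detection scores of trojaned networks (positives).
    neg_scores: detection scores of clean networks (negatives).
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half
    return wins / (len(pos_scores) * len(neg_scores))
```

A detector that ranks every trojaned network above every clean one scores 1.0, and one whose scores carry no information scores about 0.5.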
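Compared with standard trojans, evasive trojans are better at evading defenses
NC was actually designed to detect patch attacks, but it detects blended attacks extremely well.
Unexpected: reverse engineering was not taken into account in the design, yet the method also works well against it
metric: accuracy
Predicting the target label becomes harder:
The effect on NC is very strong: prediction goes from nearly perfect to nearly random
Predicting the trigger mask
metric: mean intersection over union (IoU) across trojaned networks
On NC and ABS, the IoU is very close to that of standard trojans, but on Param and MNTD the IoU drops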
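The trigger-mask metric can be sketched for binary masks stored as lists of rows. Returning 1.0 when both masks are empty is a convention chosen here, not something stated in the notes.

```python
def mask_iou(pred_mask, true_mask):
    """Intersection over union of two binary masks of equal shape."""
    intersection = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Two empty masks agree perfectly (convention chosen here).
    return intersection / union if union else 1.0


def mean_iou(mask_pairs):
    """Average IoU over (predicted, true) mask pairs, one per network."""
    return sum(mask_iou(p, t) for p, t in mask_pairs) / len(mask_pairs)
```

A reverse-engineered mask that covers the true trigger plus an equally large wrong region gets an IoU of 0.5.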
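My observations:
1. Only the patch and blended attacks are tested; it is unclear how the method performs on other attacks
2. Defenses such as fine-tuning and fine-pruning are not considered
Weaknesses raised by the reviewers and the authors' responses:
Insufficient novelty, weak evaluation
On the attack side, many earlier methods already build evasive trojans for supply-chain attacks, but the authors do not compare against any of them in the evaluation and do not release source code. The baselines they do compare against, the Patch attack (Gu et al., 2017) and Blended attack (Chen et al., 2017), have repeatedly been shown to be easy to detect, and both are poisoning attacks (not supply chain), which are harder to carry out than supply-chain attacks
On the defense side, the evaluated defenses are not state of the art either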
the use of 1-Wasserstein distance in the loss function is not elaborated. How the author approximates the infimum also has no further clarification.
Why choose the output logits? What happens if the output of the penultimate layer is also considered?
the “distribution” can be defined in the latent representation space or feature spaces a network produces while forwarding an input.
If all distributions are matched, can the backdoor simply be removed by fine-tuning?
the experimental results in Table 2 show that the standard Trojan performs even better than the evasion Trojan under the Param detector, which the author does not explain