Ilyas et al. report a surprising result: a model trained on adversarial examples is effective on clean data. They suggest this transfer is driven by adversarial examples containing genuinely useful non-robust cues. But an alternative mechanism for the transfer could be a kind of “robust feature leakage,” where the model picks up on faint robust cues in the attacks.
We show that at least 23.5% (out of 88%) of the accuracy can be explained by robust features in $\hat{\mathcal{D}}_{\text{rand}}$. This is a weak lower bound, established by a linear model, and does not preclude the possibility of further leakage. On the other hand, we find no evidence of leakage in $\hat{\mathcal{D}}_{\text{det}}$.
Lower Bounding Leakage
Our technique for quantifying leakage consists of two steps:
- First, we construct features $f_i(x) = w_i^T x$ that are provably robust, in a sense we will soon specify.
- Next, we train a linear classifier (as per Ilyas et al., Equation 3) on the datasets $\hat{\mathcal{D}}_{\text{det}}$ and $\hat{\mathcal{D}}_{\text{rand}}$ (defined in Ilyas et al., Table 1) using these robust features only.
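The two-step procedure can be sketched as follows. This is an illustrative toy version, not the authors' actual code: the weight matrix, shapes, and the one-vs-all least-squares classifier (standing in for the logistic classifier of Ilyas et al., Equation 3) are all assumptions.

```python
import numpy as np

def extract_features(W, X):
    """Project inputs onto the linear features f_i(x) = w_i^T x.

    W: (10, d) weight matrix of a (hypothetical) robust linear model.
    X: (n, d) flattened inputs.
    Returns an (n, 10) feature matrix.
    """
    return X @ W.T

# Illustrative shapes for CIFAR-10: d = 3*32*32 = 3072, 10 classes.
rng = np.random.default_rng(0)
W = rng.normal(size=(10, 3072))          # stand-in for robust model weights
X_train = rng.normal(size=(100, 3072))   # stand-in for training inputs
y_train = rng.integers(0, 10, size=100)  # stand-in for assigned labels

# Step 1: keep only the 10 robust features; the raw pixels are discarded.
F_train = extract_features(W, X_train)

# Step 2: fit a linear classifier on those features alone (one-vs-all
# least squares as a minimal stand-in for the paper's linear classifier).
Y = np.eye(10)[y_train]                  # one-hot labels
C, *_ = np.linalg.lstsq(F_train, Y, rcond=None)
preds = (F_train @ C).argmax(axis=1)
```

Any accuracy this classifier attains on clean test data can only come through the 10 robust features, which is what makes it a lower bound on leakage.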
Since Ilyas et al. only specify robustness in the two-class case, we propose two possible specifications for what constitutes a robust feature in the multiclass setting:
- For at least one of the classes, the feature is $\gamma$-robustly useful with $\gamma = 0$, and the set of valid perturbations equal to an $L_2$ norm ball with radius 0.25.
- The feature comes from a robust model for which at least 80% of points in the test set have predictions that remain static in a neighborhood of radius 0.25 in the $L_2$ norm.
We find features that satisfy both specifications by using the 10 linear features of a robust linear model trained on CIFAR-10. Because the features are linear, the above two conditions can be certified analytically. We leave it to the reader to inspect the weights corresponding to these features manually.
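The second specification can be certified in closed form because the model is linear: a perturbation with $\|\delta\|_2 \le \epsilon$ changes the logit gap between classes $i$ and $j$ by at most $\epsilon \|w_i - w_j\|_2$, so any point whose gap exceeds that bound provably keeps its prediction. A minimal sketch of such a check (the helper name and shapes are ours, not the authors'):

```python
import numpy as np

def certified_fraction(W, b, X, eps=0.25):
    """Fraction of inputs whose argmax prediction provably cannot change
    within an L2 ball of radius eps. Valid only for linear models.

    The gap between the predicted class i and any other class j shifts by
    at most eps * ||w_i - w_j||_2 under any ||delta||_2 <= eps, so a gap
    larger than that bound certifies the prediction at this point.
    """
    logits = X @ W.T + b                      # (n, num_classes)
    top = logits.argmax(axis=1)
    certified = np.ones(len(X), dtype=bool)
    for j in range(W.shape[0]):
        gap = logits[np.arange(len(X)), top] - logits[:, j]
        bound = eps * np.linalg.norm(W[top] - W[j], axis=1)
        certified &= (j == top) | (gap > bound)
    return certified.mean()
```

Specification 2 then amounts to checking that this fraction is at least 0.8 on the test set at radius 0.25.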
Training a linear model on the above robust features on $\hat{\mathcal{D}}_{\text{rand}}$ and testing on the CIFAR test set yields an accuracy of 23.5% (out of 88%). Doing the same on $\hat{\mathcal{D}}_{\text{det}}$ yields an accuracy of 6.81% (out of 44%).
These contrasting results suggest that the two experiments should be interpreted differently. The transfer results for $\hat{\mathcal{D}}_{\text{rand}}$ in Table 1 of Ilyas et al. should be approached with caution: a non-trivial portion of the accuracy can be attributed to robust features. Note that this bound is weak: it could possibly be improved by using nonlinear features, e.g. from a robust deep neural network.
The results for $\hat{\mathcal{D}}_{\text{det}}$ in Table 1 of Ilyas et al., however, are on stronger footing. We find no evidence of feature leakage (in fact, we find negative leakage, an influx!). We thus conclude that it is plausible the majority of the accuracy is driven by non-robust features, exactly the thesis of Ilyas et al.
Response: This comment raises a valid concern which was in fact one of the primary reasons for designing the $\hat{\mathcal{D}}_{\text{det}}$ dataset. In particular, recall the construction of the $\hat{\mathcal{D}}_{\text{rand}}$ dataset: assign each input a random target label and do PGD towards that label. Note that unlike the $\hat{\mathcal{D}}_{\text{det}}$ dataset (in which the target class is deterministically chosen), the $\hat{\mathcal{D}}_{\text{rand}}$ dataset allows for robust features to actually have a (small) positive correlation with the label.
To see how this can happen, consider the following simple setting: we have a single feature $f(x)$ that is $1$ for cats and $-1$ for dogs. If $\epsilon = 0.1$, then $f(x)$ is certainly a robust feature. However, randomly assigning labels (as in the dataset $\hat{\mathcal{D}}_{\text{rand}}$) would make this feature uncorrelated with the assigned label, i.e., we would have $\mathbb{E}[f(x) \cdot y] = 0$. Performing a targeted attack might in this case induce some correlation with the assigned label, as we could have $\mathbb{E}[f(x + \eta \cdot \nabla f(x)) \cdot y] > \mathbb{E}[f(x) \cdot y] = 0$, allowing a model to learn to correctly classify new inputs.
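This toy setting can be simulated directly (a sketch under our own assumptions: a one-dimensional feature and a single $\epsilon$-sized attack step in the label direction). With random labels the feature-label correlation is near zero; the targeted step shifts it up by exactly $\epsilon$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 100_000, 0.1

f = rng.choice([1.0, -1.0], size=n)   # feature f(x): +1 for cats, -1 for dogs
y = rng.choice([1.0, -1.0], size=n)   # randomly assigned target labels

# With random labels, the robust feature is uncorrelated with them.
corr_before = np.mean(f * y)          # close to 0

# A targeted attack steps the input toward the assigned label; for this
# 1-D feature that moves f(x) by eps in the direction of y.
f_attacked = f + eps * y
corr_after = np.mean(f_attacked * y)  # corr_before + eps, i.e. positive
```

Since $y^2 = 1$, the attacked correlation is exactly the original correlation plus $\epsilon$, which is the (small) positive correlation the response describes.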
In other words, starting from a dataset with no features, one can encode robust features within small perturbations. In contrast, in the $\hat{\mathcal{D}}_{\text{det}}$ dataset, the robust features are correlated with the original label (since the labels are permuted), and since they are robust, they cannot be flipped to correlate with the newly assigned (wrong) label. Still, the $\hat{\mathcal{D}}_{\text{rand}}$ dataset enables us to show that (a) PGD-based adversarial examples actually alter features in the data and (b) models can learn from human-meaningless/mislabeled training data. The $\hat{\mathcal{D}}_{\text{det}}$ dataset, on the other hand, illustrates that the non-robust features are actually sufficient for generalization and can be preferred over robust ones in natural settings.
The experiment put forth in the comment is a clever way of showing that such leakage is indeed possible. However, we want to stress (as the comment itself does) that robust feature leakage does not have an impact on our main thesis: the $\hat{\mathcal{D}}_{\text{det}}$ dataset explicitly controls for robust feature leakage (and in fact allows us to quantify the models’ preference for robust features vs non-robust features; see Appendix D.6 in the paper).
Acknowledgments
Shan Carter (started the project), Preetum (technical discussion), Chris Olah (technical discussion), Ria (technical discussion), Aditiya (feedback)
References
- Adversarial examples are not bugs, they are features
Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B. and Madry, A., 2019. arXiv preprint arXiv:1905.02175.
Updates and Corrections
If you see mistakes or want to suggest changes, please create an issue on GitHub.
Reuse
Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.
Citation
For attribution in academic contexts, please cite this work as
Goh, "A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features': Robust Feature Leakage", Distill, 2019.
BibTeX citation
@article{goh2019a,
author = {Goh, Gabriel},
title = {A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features': Robust Feature Leakage},
journal = {Distill},
year = {2019},
note = {https://distill.pub/2019/advex-bugs-discussion/response-2},
doi = {10.23915/distill.00019.2}
}