{"title": "Adversarial Examples Are Not Bugs, They Are Features", "book": "Advances in Neural Information Processing Systems", "page_first": 125, "page_last": 136, "abstract": "Adversarial examples have attracted significant attention in machine learning, but the reasons for their existence and pervasiveness remain unclear. We demonstrate that adversarial examples can be directly attributed to the presence of non-robust features: features (derived from patterns in the data distribution) that are highly predictive, yet brittle and (thus) incomprehensible to humans. After capturing these features within a theoretical framework, we establish their widespread existence in standard datasets. Finally, we present a simple setting where we can rigorously tie the phenomena we observe in practice to a {\\em misalignment} between the (human-specified) notion of robustness and the inherent geometry of the data.", "full_text": "Adversarial Examples are not Bugs, they are Features\n\nAndrew Ilyas\u2217\n\nMIT\n\nailyas@mit.edu\n\nShibani Santurkar\u2217\n\nMIT\n\nshibani@mit.edu\n\ntsipras@mit.edu\n\nDimitris Tsipras\u2217\n\nMIT\n\nLogan Engstrom\u2217\n\nMIT\n\nengstrom@mit.edu\n\nBrandon Tran\n\nMIT\n\nbtran115@mit.edu\n\nAleksander M \u02dbadry\n\nMIT\n\nmadry@mit.edu\n\nAbstract\n\nAdversarial examples have attracted signi\ufb01cant attention in machine learning, but\nthe reasons for their existence and pervasiveness remain unclear. We demonstrate\nthat adversarial examples can be directly attributed to the presence of non-robust\nfeatures: features (derived from patterns in the data distribution) that are highly\npredictive, yet brittle and (thus) incomprehensible to humans. After capturing\nthese features within a theoretical framework, we establish their widespread ex-\nistence in standard datasets. 
Finally, we present a simple setting where we can\nrigorously tie the phenomena we observe in practice to a misalignment between\nthe (human-speci\ufb01ed) notion of robustness and the inherent geometry of the data.\n\n1\n\nIntroduction\n\nThe pervasive brittleness of deep neural networks [Sze+14; Eng+19b; HD19; Ath+18] has attracted\nsigni\ufb01cant attention in recent years. Particularly worrisome is the phenomenon of adversarial ex-\namples [Big+13; Sze+14], imperceptibly perturbed natural inputs that induce erroneous predictions\nin state-of-the-art classi\ufb01ers. Previous work has proposed a variety of explanations for this phe-\nnomenon, ranging from theoretical models [Sch+18; BPR18] to arguments based on concentration\nof measure in high-dimensions [Gil+18; MDM18; Sha+19a]. These theories, however, are often\nunable to fully capture behaviors we observe in practice (we discuss this further in Section 5).\nMore broadly, previous work in the \ufb01eld tends to view adversarial examples as aberrations arising\neither from the high dimensional nature of the input space or statistical \ufb02uctuations in the training\ndata [GSS15; Gil+18]. From this point of view, it is natural to treat adversarial robustness as a goal\nthat can be disentangled and pursued independently from maximizing accuracy [Mad+18; SHS19;\nSug+19], either through improved standard regularization methods [TG16] or pre/post-processing\nof network inputs/outputs [Ues+18; CW17a; He+17].\nIn this work, we propose a new perspective on the phenomenon of adversarial examples. In con-\ntrast to the previous models, we cast adversarial vulnerability as a fundamental consequence of the\ndominant supervised learning paradigm. Speci\ufb01cally, we claim that:\n\nAdversarial vulnerability is a direct result of sensitivity to well-generalizing features in the data.\n\nRecall that we usually train classi\ufb01ers to solely maximize (distributional) accuracy. 
Consequently, classifiers tend to use any available signal to do so, even those that look incomprehensible to humans. After all, the presence of “a tail” or “ears” is no more natural to a classifier than any other equally predictive feature. In fact, we find that standard ML datasets do admit highly predictive yet imperceptible features. We posit that our models learn to rely on these “non-robust” features, leading to adversarial perturbations that exploit this dependence.2\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nOur hypothesis also suggests an explanation for adversarial transferability: the phenomenon that perturbations computed for one model often transfer to other, independently trained models. Since any two models are likely to learn similar non-robust features, perturbations that manipulate such features will apply to both. Finally, this perspective establishes adversarial vulnerability as a human-centric phenomenon, since, from the standard supervised learning point of view, non-robust features can be as important as robust ones. It also suggests that approaches aiming to enhance the interpretability of a given model by enforcing “priors” for its explanation [MV15; OMS17; Smi+17] actually hide features that are “meaningful” and predictive to standard models. As such, producing human-meaningful explanations that remain faithful to underlying models cannot be pursued independently from the training of the models themselves.\nTo corroborate our theory, we show that it is possible to disentangle robust from non-robust features in standard image classification datasets. Specifically, given a training dataset, we construct:\n\n1. A “robustified” version for robust classification (Figure 1a)3. We are able to effectively remove non-robust features from a dataset. 
Concretely, we create a training set (semanti-\ncally similar to the original) on which standard training yields good robust accuracy on the\noriginal, unmodi\ufb01ed test set. This \ufb01nding establishes that adversarial vulnerability is not\nnecessarily tied to the standard training framework, but is also a property of the dataset.\n\n2. A \u201cnon-robust\u201d version for standard classi\ufb01cation (Figure 1b)2. We are also able to\nconstruct a training dataset for which the inputs are nearly identical to the originals, but\nall appear incorrectly labeled.\nIn fact, the inputs in the new training set are associated\nto their labels only through small adversarial perturbations (and hence utilize only non-\nrobust features). Despite the lack of any predictive human-visible information, training on\nthis dataset yields good accuracy on the original, unmodi\ufb01ed test set. This demonstrates\nthat adversarial perturbations can arise from \ufb02ipping features in the data that are useful for\nclassi\ufb01cation of correct inputs (hence not being purely aberrations).\n\nFinally, we present a concrete classi\ufb01cation task where the connection between adversarial examples\nand non-robust features can be studied rigorously. This task consists of separating Gaussian distri-\nbutions, and is loosely based on the model presented in Tsipras et al. [Tsi+19], while expanding\nupon it in a few ways. First, adversarial vulnerability in our setting can be precisely quanti\ufb01ed as a\ndifference between the intrinsic data geometry and that of the adversary\u2019s perturbation set. Second,\nrobust training yields a classi\ufb01er which utilizes a geometry corresponding to a combination of these\ntwo. 
Lastly, the gradients of standard models can be signi\ufb01cantly misaligned with the inter-class\ndirection, capturing a phenomenon that has been observed in practice [Tsi+19].\n\n2 The Robust Features Model\n\nWe begin by developing a framework, loosely based on the setting of Tsipras et al. [Tsi+19], that\nenables us to rigorously refer to \u201crobust\u201d and \u201cnon-robust\u201d features. In particular, we present a set of\nde\ufb01nitions which allow us to formally describe our setup, theoretical results, and empirical evidence.\n\nSetup. We study binary classi\ufb01cation, where input-label pairs (x, y) \u2208 X \u00d7 {\u00b11} are sampled\nfrom a distribution D; the goal is to learn a classi\ufb01er C : X \u2192 {\u00b11} predicting y given x.\nWe de\ufb01ne a feature to be a function mapping from the input space X to real numbers, with the\nset of all features thus being F = {f : X \u2192 R}. For convenience, we assume that the features\nin F are shifted/scaled to be mean-zero and unit-variance (i.e., so that E(x,y)\u223cD[f (x)] = 0 and\nE(x,y)\u223cD[f (x)2] = 1), making the following de\ufb01nitions scale-invariant. Note that this de\ufb01nition\ncaptures what we abstractly think of as features (e.g., a function capturing how \u201cfurry\u201d an image is).\n\n2It is worth emphasizing that while our \ufb01ndings demonstrate that adversarial vulnerability does arise from\nnon-robust features, they do not preclude the possibility of adversarial vulnerability also arising from other\nphenomena [Nak19a]. 
Nevertheless, the mere existence of useful non-robust features suffices to establish that without explicitly preventing models from utilizing these features, adversarial vulnerability will persist.\n\n3The corresponding datasets for CIFAR-10 are publicly available at http://git.io/adv-datasets.\n\nFigure 1: A conceptual diagram of the experiments of Section 3: (a) we disentangle features into robust and non-robust (Section 3.1), (b) we construct a dataset which appears mislabeled to humans (via adversarial examples) but results in good accuracy on the original test set (Section 3.2).\n\nUseful, robust, and non-robust features. We now define the key concepts required for formulating our framework. To this end, we categorize features in the following manner:\n\n• ρ-useful features: For a given distribution D, we call a feature f ρ-useful (ρ > 0) if it is correlated with the true label in expectation, that is if\n\nE_{(x,y)∼D}[y · f(x)] ≥ ρ. (1)\n\nWe then define ρ_D(f) as the largest ρ for which feature f is ρ-useful under distribution D. (Note that if a feature f is negatively correlated with the label, then −f is useful instead.) Crucially, a linear classifier trained on ρ-useful features can attain non-trivial performance.\n\n• γ-robustly useful features: Suppose we have a ρ-useful feature f (ρ_D(f) > 0). We refer to f as a robust feature (formally a γ-robustly useful feature for γ > 0) if, under adversarial perturbation (for some specified set of valid perturbations ∆), f remains γ-useful. 
Formally, if we have that\n\nE_{(x,y)∼D}[ inf_{δ∈∆(x)} y · f(x + δ) ] ≥ γ. (2)\n\n• Useful, non-robust features: A useful, non-robust feature is a feature which is ρ-useful for some ρ bounded away from zero, but is not a γ-robust feature for any γ ≥ 0. These features help with classification in the standard setting, but may hinder accuracy in the adversarial setting, as the correlation with the label can be flipped.\n\nClassification. In our framework, a classifier C = (F, w, b) is comprised of a set of features F ⊆ 𝓕, a weight vector w, and a scalar bias b. For an input x, the classifier predicts the label y as\n\nC(x) = sgn( b + Σ_{f∈F} w_f · f(x) ).\n\nFor convenience, we denote the set of features learned by a classifier C as F_C.\n\nStandard Training. Training a classifier is performed by minimizing a loss function L_θ(x, y) over input-label pairs (x, y) from the training set (via empirical risk minimization (ERM)) that decreases with the correlation between the weighted combination of the features and the label. When minimizing classification loss, no distinction exists between robust and non-robust features: the only distinguishing factor of a feature is its ρ-usefulness. Furthermore, the classifier will utilize any ρ-useful feature in F to decrease the loss of the classifier.\n\nRobust training. In the presence of an adversary, any useful but non-robust features can be made anti-correlated with the true label, leading to adversarial vulnerability. 
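To make the definitions above concrete before turning to training, the following NumPy sketch (the toy distribution and all names are our own illustration, not the paper's setup) estimates the empirical ρ-usefulness E_{(x,y)∼D}[y · f(x)] of normalized features and assembles a linear classifier of the form C(x) = sgn(b + Σ_f w_f · f(x)):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary task: y drawn u.a.r. from {±1}; one predictive feature, one pure noise feature.
n = 20_000
y = rng.choice([-1.0, 1.0], size=n)
feats = {
    "furry": 0.8 * y + rng.normal(0.0, 0.6, size=n),  # correlated with the label
    "noise": rng.normal(0.0, 1.0, size=n),            # uncorrelated with the label
}

def normalize(f):
    """Shift/scale a feature to mean zero and unit variance, as assumed in Section 2."""
    return (f - f.mean()) / f.std()

def rho_usefulness(f, y):
    """Empirical estimate of E_{(x,y)~D}[y · f(x)] for a normalized feature."""
    return float(np.mean(y * normalize(f)))

rho = {name: rho_usefulness(f, y) for name, f in feats.items()}

# A classifier C = (F, w, b): here we simply weight each feature by its usefulness.
b = 0.0
scores = b + sum(rho[name] * normalize(f) for name, f in feats.items())
accuracy = float(np.mean(np.sign(scores) == y))
```

In this toy run the "furry" feature comes out ρ-useful (ρ ≈ 0.8) while the noise feature does not, and the resulting linear classifier attains non-trivial accuracy, as the framework predicts.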
Therefore, ERM is no longer sufficient to train classifiers that are robust, and we need to explicitly account for the effect of the adversary on the classifier. To do so, we use an adversarial loss function that can discern between robust and non-robust features [Mad+18]:\n\nE_{(x,y)∼D}[ max_{δ∈∆(x)} L_θ(x + δ, y) ], (3)\n\nfor an appropriately defined set of perturbations ∆. Since the adversary can exploit non-robust features to degrade classification accuracy, minimizing this adversarial loss [GSS15; Mad+18] can be viewed as explicitly preventing the classifier from relying on non-robust features.\n\nRemark. We want to note that even though this framework enables us to describe and predict the outcome of our experiments, it does not capture the notion of non-robust features exactly as we intuitively might think of them. For instance, in principle, our theoretical framework would allow for useful non-robust features to arise as combinations of useful robust features and useless non-robust features [Goh19b]. These types of constructions, however, are actually precluded by our experimental results (for instance, the classifiers trained in Section 3 would not generalize). This shows that our experimental findings capture a stronger, more fine-grained statement than our formal definitions are able to express. 
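For a linear model, the inner maximization in (3) can be solved in closed form, which makes the robust loss easy to compute exactly: against an ℓ∞-bounded adversary, the worst-case perturbation of a logistic classifier w is δ* = −ε · y · sign(w). A self-contained NumPy sketch (toy data and all names are our own, not the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(1)

def logistic_loss(x, y, w):
    """Pointwise L_theta(x, y) = log(1 + exp(-y · w^T x)) for a fixed linear classifier w."""
    return np.log1p(np.exp(-y * (x @ w)))

# Toy Gaussian data around ±w/‖w‖ and a fixed linear classifier.
d, n, eps = 10, 1_000, 0.1
w = rng.normal(size=d)
y = rng.choice([-1.0, 1.0], size=n)
x = y[:, None] * w / np.linalg.norm(w) + 0.5 * rng.normal(size=(n, d))

# Inner maximization of (3) over Delta(x) = {delta : ‖delta‖_inf <= eps}: for a
# linear model the maximizer is delta* = -eps · y · sign(w), which lowers the
# margin y · w^T x by exactly eps · ‖w‖_1 for every sample.
delta_star = -eps * y[:, None] * np.sign(w)[None, :]

std_loss = float(logistic_loss(x, y, w).mean())
adv_loss = float(logistic_loss(x + delta_star, y, w).mean())

# Any other in-budget perturbation does no better for the adversary.
delta_rand = eps * rng.choice([-1.0, 1.0], size=(n, d))
rand_loss = float(logistic_loss(x + delta_rand, y, w).mean())
```

The adversarial loss strictly exceeds the standard loss whenever w ≠ 0, which is the sense in which minimizing (3) penalizes reliance on features the adversary can flip.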
We view bridging this gap as an interesting direction for future work.\n\n3 Finding Robust (and Non-Robust) Features\n\nThe central premise of our proposed framework is that there exist both robust and non-robust features\nthat constitute useful signals for standard classi\ufb01cation. We now provide evidence in support of this\nhypothesis by disentangling these two sets of features (see conceptual description in Figure 1).\nOn one hand, we will construct a \u201crobusti\ufb01ed\u201d dataset, consisting of samples that primarily contain\nrobust features. Using such a dataset, we are able to train robust classi\ufb01ers (with respect to the\nstandard test set) using standard (i.e., non-robust) training. This demonstrates that robustness can\narise by removing certain features from the dataset (as, overall, the new dataset contains less infor-\nmation about the original training set). Moreover, it provides evidence that adversarial vulnerability\nis caused by non-robust features and is not inherently tied to the standard training framework.\nOn the other hand, we will construct datasets where the input-label association is based purely on\nnon-robust features (and thus the resulting dataset appears completely mislabeled to humans). We\nshow that this dataset suf\ufb01ces to train a classi\ufb01er with good performance on the standard test set. This\nindicates that natural models use non-robust features to make predictions, even in the presence of\nrobust features. These features alone are suf\ufb01cient for non-trivial generalization to natural images,\nindicating that they are indeed predictive, rather than artifacts of \ufb01nite-sample over\ufb01tting.\n\n3.1 Disentangling robust and non-robust features\n\nRecall that the features a classi\ufb01er learns to rely on are based purely on how useful these features\nare for (standard) generalization. 
Thus, under our conceptual framework, if we can ensure that only robust features are useful, standard training should result in a robust classifier. Unfortunately, we cannot directly manipulate the features of very complex, high-dimensional datasets. Instead, we will leverage a robust model and modify our dataset to contain only the features relevant to that model.\nConceptually, given a robust (i.e., adversarially trained [Mad+18]) model C, we aim to construct a distribution D̂_R for which features used by C are as useful as they were on the original distribution D while ensuring that the rest of the features are not useful. In terms of our formal framework:\n\nE_{(x,y)∼D̂_R}[f(x) · y] = { E_{(x,y)∼D}[f(x) · y] if f ∈ F_C; 0 otherwise }, (4)\n\nwhere F_C again represents the set of features utilized by C.\nWe will construct a training set for D̂_R via a one-to-one mapping x ↦ x_r from the original training set for D. In the case of a deep neural network, F_C corresponds to exactly the set of activations in the penultimate layer (since these correspond to inputs to a linear classifier). To ensure that features used by the model are equally useful under both training sets, we (approximately) enforce all features in F_C to have similar values for both x and x_r through the following optimization:\n\nmin_{x_r} ‖g(x_r) − g(x)‖₂, (5)\n\nFigure 2: (a): Random samples from our variants of the CIFAR-10 [Kri09] training set: the original training set; the robust training set D̂_R, restricted to features used by a robust model; and the non-robust training set D̂_NR, restricted to features relevant to a standard model (labels appear incorrect to humans). 
(b): Standard and robust accuracy on the CIFAR-10 test set (D) for models trained with: (i) standard training (on D); (ii) standard training on D̂_NR; (iii) adversarial training (on D); and (iv) standard training on D̂_R. Models trained on D̂_R and D̂_NR reflect the original models used to create them: notably, standard training on D̂_R yields nontrivial robust accuracy. Results for Restricted-ImageNet [Tsi+19] are in D.8 Figure 12.\n\nwhere x is the original input and g is the mapping from x to the representation layer. We optimize this objective using (normalized) gradient descent (see details in Appendix C).\nSince we don't have access to features outside F_C, there is no way to ensure that the expectation in (4) is zero for all f ∉ F_C. To approximate this condition, we choose the starting point of gradient descent for the optimization in (5) to be an input x₀ which is drawn from D independently of the label of x (we also explore sampling x₀ from noise in Appendix D.1). This choice ensures that any feature present in that input will not be useful since they are not correlated with the label in expectation over x₀. The underlying assumption here is that, when performing the optimization in (5), features that are not being directly optimized (i.e., features outside F_C) are not affected. We provide pseudocode for the construction in Figure 5 (Appendix C).\nGiven the new training set for D̂_R (a few random samples are visualized in Figure 2a), we train a classifier using standard (non-robust) training. We then test this classifier on the original test set (i.e. D). The results (Figure 2b) indicate that the classifier learned using the new dataset attains good accuracy in both standard and adversarial settings (see additional evaluation in Appendix D.2.) 
4.\n\nAs a control, we repeat this methodology using a standard (non-robust) model for C in our construction of the dataset. Sample images from the resulting “non-robust dataset” D̂_NR are shown in Figure 2a—they tend to resemble more the source image of the optimization x₀ than the target image x. We find that training on this dataset leads to good standard accuracy, yet yields almost no robustness (Figure 2b). We also verify that this procedure is not simply a matter of encoding the weights of the original model—we get the same results for both D̂_R and D̂_NR if we train with different architectures than that of the original models.\nOverall, our findings corroborate the hypothesis that adversarial examples can arise from (non-robust) features of the data itself. By filtering out non-robust features from the dataset (e.g. by restricting the set of available features to those used by a robust model), one can train a significantly more robust model using standard training.\n\n3.2 Non-robust features suffice for standard classification\n\nThe results of the previous section show that by restricting the dataset to only contain features that are used by a robust model, standard training results in classifiers that are significantly more robust.\n\n4In an attempt to explain the gap in accuracy between the model trained on D̂_R and the original robust classifier C, we test distributional shift, by reporting results on the “robustified” test set in Appendix D.3.\n\nThis suggests that when training on the standard dataset, non-robust features take on a large role in the resulting learned 
classifier. Here we will show that this is not merely incidental. In particular, we demonstrate that non-robust features alone suffice for standard generalization—i.e., a model trained solely on non-robust features can generalize to the standard test set.\nTo show this, we construct a dataset where the only features that are useful for classification are non-robust features (or in terms of our formal model from Section 2, all features f that are ρ-useful are non-robust). To accomplish this, we modify each input-label pair (x, y) as follows. We select a target class t either (a) uniformly at random (hence features become uncorrelated with the labels) or (b) deterministically according to the source class (e.g. permuting the labels). Then, we add a small adversarial perturbation to x to cause it to be classified as t by a standard model:\n\nx_adv = arg min_{‖x′−x‖≤ε} L_C(x′, t), (6)\n\nwhere L_C is the loss under a standard (non-robust) classifier C and ε is a small constant. The resulting inputs are indistinguishable from the originals (Appendix D Figure 9)—to a human observer, it thus appears that the label t assigned to the modified input is simply incorrect. The resulting input-label pairs (x_adv, t) make up the new training set (pseudocode in Appendix C Figure 6).\nNow, since ‖x_adv − x‖ is small, by definition the robust features of x_adv are still correlated with class y (and not t) in expectation over the dataset. 
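A minimal NumPy sketch of the construction in (6), assuming a linear softmax classifier as the standard model C (the model, step size, and budget are illustrative stand-ins, not the paper's actual setup): projected gradient descent minimizes the loss toward the target class t inside an ℓ₂-ball around x.

```python
import numpy as np

def targeted_loss(x_in, t, W):
    """Cross-entropy of a linear softmax model toward target class t (stable logsumexp)."""
    logits = W @ x_in
    m = logits.max()
    return float(np.log(np.exp(logits - m).sum()) + m - logits[t])

def perturb_towards(x, t, W, eps=0.5, step=0.02, iters=200):
    """Approximate x_adv = argmin_{‖x'-x‖_2 <= eps} L_C(x', t) by projected gradient descent."""
    x_adv = x.copy()
    num_classes = W.shape[0]
    for _ in range(iters):
        logits = W @ x_adv
        p = np.exp(logits - logits.max())
        p /= p.sum()
        grad = (p - np.eye(num_classes)[t]) @ W   # gradient of the targeted loss wrt x
        x_adv = x_adv - step * grad               # descend toward class t
        delta = x_adv - x                         # project back onto the eps-ball
        norm = np.linalg.norm(delta)
        if norm > eps:
            x_adv = x + delta * (eps / norm)
    return x_adv

rng = np.random.default_rng(2)
W = rng.normal(size=(3, 8))    # a stand-in "standard classifier": 3 classes, 8 features
x = rng.normal(size=8)
t = 2                          # target class; the pair (x_adv, t) enters the new training set
x_adv = perturb_towards(x, t, W)
loss_before = targeted_loss(x, t, W)
loss_after = targeted_loss(x_adv, t, W)
```

The returned x_adv stays within the ε-ball while its loss toward t decreases, which is all the relabeled dataset requires.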
After all, humans still recognize the original class. On the other hand, since every x_adv is strongly classified as t by a standard classifier, it must be that some of the non-robust features are now strongly correlated with t (in expectation).\nIn the case where t is chosen at random, the robust features are originally uncorrelated with the label t (in expectation), and after the perturbation can be only slightly correlated (hence being significantly less useful for classification than before)5. Formally, we aim to construct a dataset D̂_rand where\n\nE_{(x,y)∼D̂_rand}[y · f(x)] { > 0 if f non-robustly useful under D; ≃ 0 otherwise }. (7)\n\nIn contrast, when t is chosen deterministically based on y, the robust features actually point away from the assigned label t. In particular, all of the inputs labeled with class t exhibit non-robust features correlated with t, but robust features correlated with the original class y. Thus, robust features on the original training set provide significant predictive power on the training set, but will actually hurt generalization on the standard test set. Formally, our goal is to construct D̂_det such that\n\nE_{(x,y)∼D̂_det}[y · f(x)] { > 0 if f non-robustly useful under D; < 0 if f robustly useful under D; ∈ R otherwise (f not useful under D) }. (8)\n\nWe find that standard training on these datasets actually generalizes to the original test set, as shown in Table 1. This indicates that non-robust features are indeed useful for classification in the standard setting. Remarkably, even training on D̂_det (where all the robust features are correlated with the wrong class) results in a well-generalizing classifier. 
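The two target-selection rules behind D̂_rand and D̂_det amount to a few lines; in this sketch (our own notation, with the deterministic rule instantiated as a fixed cyclic permutation of the label set):

```python
import numpy as np

rng = np.random.default_rng(3)
num_classes, n = 10, 1_000
y = rng.integers(0, num_classes, size=n)      # original labels

# D_rand: target t drawn uniformly at random, independent of y,
# so robust features end up (roughly) uncorrelated with the new label.
t_rand = rng.integers(0, num_classes, size=n)

# D_det: target chosen deterministically from the source class via a fixed
# permutation (here t = y + 1 mod C), so robust features point away from t.
t_det = (y + 1) % num_classes
```

Each x is then perturbed toward its target as in (6) and relabeled as t; the random rule leaves t agreeing with y about 1/C of the time, while the permutation rule never does.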
This indicates that non-robust features can be picked up by models during standard training, even in the presence of predictive robust features6.\n\n3.3 Transferability can arise from non-robust features\n\nOne of the most intriguing properties of adversarial examples is that they transfer across models with different architectures and independently sampled training sets [Sze+14; PMG16; CRP19]. Here, we show that this phenomenon can in fact be viewed as a natural consequence of the existence of non-robust features. Recall that, according to our main thesis, adversarial examples can arise as a result of perturbing well-generalizing, yet brittle features. Given that such features are inherent to the data distribution, different classifiers trained on independent samples from that distribution are likely to utilize similar non-robust features. Consequently, perturbations constructed by exploiting non-robust features learned by one classifier will transfer to other classifiers utilizing similar features.\n\n5Goh [Goh19a] provides an approach to quantifying this “robust feature leakage” and finds that one can obtain a (small) amount of test accuracy by leveraging robust feature leakage on D̂_rand.\n\n6Additional results and analysis are in App. D.5, D.6, and D.7.\n\nIn order to illustrate and corroborate this hypothesis, we train five different architectures on the dataset generated in Section 3.2 (adversarial examples with deterministic labels) for a standard ResNet-50 [He+16]. Our hypothesis would suggest that architectures which learn better from this training set (in terms of performance on the standard test set) are more likely to learn similar non-robust features to the original classifier. Indeed, we find that the test accuracy of each architecture is predictive of how often adversarial examples transfer from the original model to standard classifiers with that architecture (Figure 3). 
In a similar vein, Nakkiran [Nak19a] constructs a set of adversarial perturbations that is explicitly non-transferable and finds that these perturbations cannot be used to learn a good classifier. These findings thus corroborate our hypothesis that adversarial transferability arises when models learn similar brittle features of the underlying dataset.\n\nSource Dataset | CIFAR-10 | ImageNet_R\nD | 95.3% | 96.6%\nD̂_rand | 63.3% | 87.9%\nD̂_det | 43.7% | 64.4%\n\nTable 1: Test accuracy (on D) of classifiers trained on the D, D̂_rand, and D̂_det training sets created using a standard (non-robust) model. For both D̂_rand and D̂_det, only non-robust features correspond to useful features on both the train set and D. These datasets are constructed using adversarial perturbations of x towards a class t (random for D̂_rand and deterministic for D̂_det); the resulting images are relabeled as t.\n\nFigure 3: Transfer rate of adversarial examples from a ResNet-50 to different architectures alongside test set performance of these architectures when trained on the dataset generated in Section 3.2. Architectures more susceptible to transfer attacks also performed better on the standard test set, supporting our hypothesis that adversarial transferability arises from using similar non-robust features.\n\n4 A Theoretical Framework for Studying (Non)-Robust Features\n\nThe experiments from the previous section demonstrate that the conceptual framework of robust and non-robust features is strongly predictive of the empirical behavior of state-of-the-art models on real-world datasets. In order to further strengthen our understanding of the phenomenon, we instantiate the framework in a concrete setting that allows us to theoretically study various properties of the corresponding model. Our model is similar to that of Tsipras et al. 
[Tsi+19] in the sense that it\ncontains a dichotomy between robust and non-robust features, extending upon it in a few ways: a)\nthe adversarial vulnerability can be explicitly expressed as a difference between the inherent data\nmetric and the (cid:96)2 metric, b) robust learning corresponds exactly to learning a combination of these\ntwo metrics, c) the gradients of robust models align better with the adversary\u2019s metric.\n\nSetup. We study a simple problem of maximum likelihood classi\ufb01cation between two Gaussian\ndistributions. In particular, given samples (x, y) sampled from D according to\n\n(9)\n\n(10)\n\nour goal is to learn parameters \u0398 = (\u00b5, \u03a3) such that\n\nu.a.r.\n\ny\n\n\u223c {\u22121, +1},\n\nx \u223c N (y \u00b7 \u00b5\u2217, \u03a3\u2217),\n\n\u0398 = arg min\n\u00b5,\u03a3\n\nE(x,y)\u223cD [(cid:96)(x; y \u00b7 \u00b5, \u03a3)] ,\n\nwhere (cid:96)(x; \u00b5, \u03a3) represents the Gaussian negative log-likelihood (NLL) function. Intuitively, we\n\ufb01nd the parameters \u00b5, \u03a3 which maximize the likelihood of the sampled data under the given model.\nClassi\ufb01cation can be accomplished via likelihood test: given an unlabeled sample x, we predict y as\n\ny = arg max\n\ny\n\n(cid:96)(x; y \u00b7 \u00b5, \u03a3) = sign(cid:0)x(cid:62)\u03a3\u22121\u00b5(cid:1) .\n\n7\n\n253035404550Test accuracy (%; trained on Dy+1)60708090100Transfer success rate (%)VGG-16Inception-v3ResNet-18DenseNetResNet-50\f\u0398r = arg min\n\u00b5,\u03a3\n\nE(x,y)\u223cD(cid:20) max\n\n(cid:107)\u03b4(cid:107)2\u2264\u03b5\n\n(cid:96)(x + \u03b4; y \u00b7 \u00b5, \u03a3)(cid:21) ,\n\n(11)\n\nIn turn, the robust analogue of this problem arises from replacing (cid:96)(x; y \u00b7 \u00b5, \u03a3) with the NLL under\nadversarial perturbation. 
The resulting robust parameters \u0398r can be written as\n\nA detailed analysis appears in Appendix E\u2014here we present a high-level overview of the results.\n\n(1) Vulnerability from metric misalignment (non-robust features). Note that in this model, one\ncan rigorously refer to an inner product (and thus a metric) induced by the features. In particular,\none can view the learned parameters of a Gaussian \u0398 = (\u00b5, \u03a3) as de\ufb01ning an inner product over the\ninput space given by (cid:104)x, y(cid:105)\u0398 = (x\u2212\u00b5)(cid:62)\u03a3\u22121(y\u2212\u00b5). This in turn induces the Mahalanobis distance,\nwhich represents how a change in the input affects the features of the classi\ufb01er. This metric is not\nnecessarily aligned with the metric in which the adversary is constrained, the (cid:96)2-norm. Actually, we\nshow that adversarial vulnerability arises exactly as a misalignment of these two metrics.\nTheorem 1 (Adversarial vulnerability from misalignment). Consider an adversary whose pertur-\nbation is determined by the \u201cLagrangian penalty\u201d form of (11), i.e.\n(cid:96)(x + \u03b4; y \u00b7 \u00b5, \u03a3) \u2212 C \u00b7 (cid:107)\u03b4(cid:107)2,\n\nmax\n\nwhere C \u2265\n\u03c3min(\u03a3\u2217) is a constant trading off NLL minimization and the adversarial constraint (the\nbound on C ensures the problem is concave). 
Then, the adversarial loss L_adv incurred by (μ, Σ) is

\[ \mathcal{L}_{adv}(\Theta) - \mathcal{L}(\Theta) = \operatorname{tr}\left[ \left( I + (C \cdot \Sigma^* - I)^{-1} \right)^2 \right] - d, \]

and, for a fixed tr(Σ*) = k, the above is minimized by Σ* = (k/d)·I.

In fact, note that such a misalignment corresponds precisely to the existence of non-robust features: "small" changes in the adversary's metric along certain directions can cause large changes under the notion of distance established by the parameters (illustrated in Figure 4).

(2) Robust Learning. The (non-robust) maximum likelihood estimate is Θ = Θ*, and thus the vulnerability for the standard MLE depends entirely on the data distribution. The following theorem characterizes the behaviour of the learned parameters in the robust problem (we study a slight relaxation of (11) that becomes exact exponentially fast as d → ∞; see Appendix E.3.3). In fact, we can prove (Section E.3.4) that performing (sub)gradient descent on the inner maximization (known as adversarial training [GSS15; Mad+18]) yields exactly Θr. We find that as the perturbation budget ε increases, the metric induced by the classifier mixes ℓ2 and the metric induced by the data features.

Theorem 2 (Robustly Learned Parameters). Just as in the non-robust case, μr = μ*, i.e. the true mean is learned. For the robust covariance Σr, there exists an ε0 > 0, such that for any ε ∈ [0, ε0),

\[ \Sigma_r = \frac{1}{2}\Sigma^* + \frac{1}{\lambda} \cdot I + \sqrt{\frac{1}{\lambda} \cdot \Sigma^* + \frac{1}{4}\Sigma^{*2}}, \quad \text{where} \quad \Omega\!\left( \frac{1 + \varepsilon^{1/2}}{\varepsilon^{1/2} + \varepsilon^{3/2}} \right) \le \lambda \le O\!\left( \frac{1 + \varepsilon^{1/2}}{\varepsilon^{1/2}} \right). \]

The effect of robust optimization under an ℓ2-constrained adversary is visualized in Figure 4.
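To make the vulnerability of the standard MLE in this model concrete, here is a minimal NumPy sketch (our illustration, not the authors' code; the values of mu_star, Sigma_star, and eps are invented for the example). It samples from the model of (9), fits the closed-form MLE of (10), classifies via sign(x⊤Σ⁻¹μ), and shows that the worst-case ℓ2 attack direction Σ⁻¹μ is nearly orthogonal to μ, so a small ℓ2 budget destroys accuracy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative ground truth (hypothetical values): the second coordinate is a
# highly predictive but low-variance, hence "non-robust", feature.
mu_star = np.array([3.0, 0.2])
Sigma_star = np.diag([1.0, 0.0025])

# Sample (x, y): y uniform in {-1, +1}, x ~ N(y * mu_star, Sigma_star)  -- eq. (9)
n = 10_000
y = rng.choice([-1, 1], size=n)
x = y[:, None] * mu_star + rng.multivariate_normal(np.zeros(2), Sigma_star, size=n)

# Maximum likelihood estimate of (mu, Sigma)  -- eq. (10), closed form here
mu_hat = (y[:, None] * x).mean(axis=0)
resid = x - y[:, None] * mu_hat
Sigma_hat = resid.T @ resid / n

# Likelihood-test classifier: y_pred = sign(x^T Sigma^-1 mu)
w = np.linalg.solve(Sigma_hat, mu_hat)
clean_acc = np.mean(np.sign(x @ w) == y)

# Worst-case l2 perturbation for a linear classifier: move each point by eps
# against its label along w / ||w||.
eps = 0.5
delta = -eps * y[:, None] * (w / np.linalg.norm(w))
adv_acc = np.mean(np.sign((x + delta) @ w) == y)

# The attack direction Sigma^-1 mu is nearly orthogonal to mu: small l2 steps
# are huge in the Mahalanobis metric induced by the learned parameters.
cos_align = (w @ mu_star) / (np.linalg.norm(w) * np.linalg.norm(mu_star))

print(f"clean accuracy: {clean_acc:.3f}")
print(f"adversarial accuracy (eps={eps}): {adv_acc:.3f}")
print(f"cos(Sigma^-1 mu, mu): {cos_align:.3f}")
```

With this anisotropic Σ*, clean accuracy is near-perfect while the eps = 0.5 attack (small relative to ‖μ*‖ ≈ 3) drives accuracy toward zero, mirroring the metric-misalignment mechanism of Theorem 1.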
As ε grows, the learned covariance becomes more aligned with the identity. In particular, we can see that the classifier learns to be less sensitive in certain directions, despite their usefulness for classification.

(3) Gradient Interpretability. Tsipras et al. [Tsi+19] observe that gradients of robust models tend to look more semantically meaningful. It turns out that under our model, this behaviour arises as a natural consequence of Theorem 2. In particular, we show that the resulting robustly learned parameters cause the gradient of the linear classifier and the vector connecting the means of the two distributions to better align (in a worst-case sense) under the ℓ2 inner product.

Theorem 3 (Gradient alignment). Let f(x) and fr(x) be monotonic classifiers based on the linear separator induced by standard and ℓ2-robust maximum likelihood classification, respectively. The maximum angle formed between the gradient of the classifier (w.r.t. the input) and the vector connecting the classes can be smaller for the robust model:

\[ \min_{\mu} \frac{\langle \mu, \nabla_x f_r(x) \rangle}{\|\mu\| \cdot \|\nabla_x f_r(x)\|} \;>\; \min_{\mu} \frac{\langle \mu, \nabla_x f(x) \rangle}{\|\mu\| \cdot \|\nabla_x f(x)\|}. \]

Figure 4 illustrates this phenomenon in the two-dimensional case, where ℓ2-robustness causes the gradient direction to become increasingly aligned with the vector between the means (μ).

Figure 4: An empirical demonstration of the effect illustrated by Theorem 2: as the adversarial perturbation budget ε is increased, the learned mean μ remains constant, but the learned covariance "blends" with the identity matrix, effectively adding uncertainty onto the non-robust feature.

Discussion.
Our analysis suggests that rather than offering quantitative classi\ufb01cation bene\ufb01ts, a\nnatural way to view the role of robust optimization is as enforcing a prior over the features learned\nby the classi\ufb01er. In particular, training with an (cid:96)2-bounded adversary prevents the classi\ufb01er from\nrelying heavily on features which induce a metric dissimilar to the (cid:96)2 metric. The strength of the\nadversary then allows for a trade-off between the enforced prior, and the data-dependent features.\n\nRobustness and accuracy. Note that in the setting described so far, robustness can be at odds\nwith accuracy since robust training prevents us from learning the most accurate classi\ufb01er (a similar\nconclusion is drawn in [Tsi+19]). However, we note that there are very similar settings where non-\nrobust features manifest themselves in the same way, yet a classi\ufb01er with perfect robustness and\naccuracy is still attainable. Concretely, consider the distributions pictured in Figure 14 in Appendix\nD.10.\nIt is straightforward to show that while there are many perfectly accurate classi\ufb01ers, any\nstandard loss function will learn an accurate yet non-robust classi\ufb01er. Only when robust training is\nemployed does the classi\ufb01er learn a perfectly accurate and perfectly robust decision boundary.\n\n5 Related Work\n\nSeveral models for explaining adversarial examples have been proposed in prior work, utilizing\nideas ranging from \ufb01nite-sample over\ufb01tting to high-dimensional statistical phenomena [Gil+18;\nFFF18; For+19; TG16; Sha+19a; MDM18; Sha+19b; GSS15; BPR18]. The key differentiating\naspect of our model is that adversarial perturbations arise as well-generalizing, yet brittle, features,\nrather than statistical anomalies. 
In particular, adversarial vulnerability does not stem from using\na speci\ufb01c model class or a speci\ufb01c training method, since standard training on the \u201crobusti\ufb01ed\u201d\ndata distribution of Section 3.1 leads to robust models. At the same time, as shown in Section 3.2,\nthese non-robust features are suf\ufb01cient to learn a good standard classi\ufb01er. We discuss the connection\nbetween our model and others in detail in Appendix A and additional related work in Appendix B.\n\n6 Conclusion\n\nIn this work, we cast the phenomenon of adversarial examples as a natural consequence of the\npresence of highly predictive but non-robust features in standard ML datasets. We provide support\nfor this hypothesis by explicitly disentangling robust and non-robust features in standard datasets,\nas well as showing that non-robust features alone are suf\ufb01cient for good generalization. Finally,\nwe study these phenomena in more detail in a theoretical setting where we can rigorously study\nadversarial vulnerability, robust training, and gradient alignment.\nOur \ufb01ndings prompt us to view adversarial examples as a fundamentally human phenomenon. In\nparticular, we should not be surprised that classi\ufb01ers exploit highly predictive features that happen\nto be non-robust under a human-selected notion of similarity, given such features exist in real-world\ndatasets. In the same manner, from the perspective of interpretability, as long as models rely on these\nnon-robust features, we cannot expect to have model explanations that are both human-meaningful\nand faithful to the models themselves. 
Overall, attaining models that are robust and interpretable will require explicitly encoding human priors into the training process.

Acknowledgements

We thank Preetum Nakkiran for suggesting the experiment of Appendix D.9 (i.e. replicating Figure 3 with targeted attacks). We are also grateful to the authors of Engstrom et al. [Eng+19a] (Chris Olah, Dan Hendrycks, Justin Gilmer, Reiichiro Nakano, Preetum Nakkiran, Gabriel Goh, Eric Wallace) for their insights and efforts replicating, extending, and discussing our experimental results.

Work supported in part by the NSF grants CCF-1553428, CCF-1563880, CNS-1413920, CNS-1815221, IIS-1447786, IIS-1607189, the Microsoft Corporation, the Intel Corporation, the MIT-IBM Watson AI Lab research grant, and an Analog Devices Fellowship.

References

[ACW18] Anish Athalye, Nicholas Carlini, and David A. Wagner. "Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples". In: International Conference on Machine Learning (ICML). 2018.

[Ath+18] Anish Athalye et al. "Synthesizing Robust Adversarial Examples". In: International Conference on Machine Learning (ICML). 2018.

[BCN06] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. "Model compression". In: International Conference on Knowledge Discovery and Data Mining (KDD). 2006.

[Big+13] Battista Biggio et al. "Evasion attacks against machine learning at test time".
In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML-PKDD). 2013.

[BPR18] Sébastien Bubeck, Eric Price, and Ilya Razenshteyn. "Adversarial examples from computational constraints". In: arXiv preprint arXiv:1805.10204. 2018.

[Car+19] Nicholas Carlini et al. "On Evaluating Adversarial Robustness". In: arXiv preprint arXiv:1902.06705. 2019.

[CRK19] Jeremy M. Cohen, Elan Rosenfeld, and J. Zico Kolter. "Certified adversarial robustness via randomized smoothing". In: arXiv preprint arXiv:1902.02918. 2019.

[CRP19] Zachary Charles, Harrison Rosenberg, and Dimitris Papailiopoulos. "A Geometric Perspective on the Transferability of Adversarial Directions". In: International Conference on Artificial Intelligence and Statistics (AISTATS). 2019.

[CW17a] Nicholas Carlini and David Wagner. "Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods". In: Workshop on Artificial Intelligence and Security (AISec). 2017.

[CW17b] Nicholas Carlini and David Wagner. "Towards evaluating the robustness of neural networks". In: Symposium on Security and Privacy (SP). 2017.

[Dan67] John M. Danskin. The Theory of Max-Min and its Application to Weapons Allocation Problems. 1967.

[Das+19] Constantinos Daskalakis et al. "Efficient Statistics, in High Dimensions, from Truncated Samples". In: Foundations of Computer Science (FOCS). 2019.

[Din+19] Gavin Weiguang Ding et al. "On the Sensitivity of Adversarial Robustness to Input Data Distributions". In: International Conference on Learning Representations (ICLR). 2019.

[Eng+19a] Logan Engstrom et al. "A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features'". In: Distill (2019). https://distill.pub/2019/advex-bugs-discussion. DOI: 10.23915/distill.00019.

[Eng+19b] Logan Engstrom et al. "A Rotation and a Translation Suffice: Fooling CNNs with Simple Transformations". In: International Conference on Machine Learning (ICML). 2019.

[FFF18] Alhussein Fawzi, Hamza Fawzi, and Omar Fawzi. "Adversarial vulnerability for any classifier". In: Advances in Neural Information Processing Systems (NeurIPS). 2018.

[FMF16] Alhussein Fawzi, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. "Robustness of classifiers: from adversarial to random noise". In: Advances in Neural Information Processing Systems (NeurIPS). 2016.

[For+19] Nic Ford et al. "Adversarial Examples Are a Natural Consequence of Test Error in Noise". In: arXiv preprint arXiv:1901.10513. 2019.

[Fur+18] Tommaso Furlanello et al. "Born-Again Neural Networks". In: International Conference on Machine Learning (ICML). 2018.

[Gei+19] Robert Geirhos et al. "ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness". In: International Conference on Learning Representations (ICLR). 2019.

[Gil+18] Justin Gilmer et al. "Adversarial spheres". In: Workshop of the International Conference on Learning Representations (ICLR). 2018.

[Goh19a] Gabriel Goh. "A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features': Robust Feature Leakage". In: Distill (2019). https://distill.pub/2019/advex-bugs-discussion/response-2. DOI: 10.23915/distill.00019.2.

[Goh19b] Gabriel Goh. "A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features': Two Examples of Useful, Non-Robust Features". In: Distill (2019). https://distill.pub/2019/advex-bugs-discussion/response-3. DOI: 10.23915/distill.00019.3.

[GSS15] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. "Explaining and Harnessing Adversarial Examples". In: International Conference on Learning Representations (ICLR). 2015.

[HD19] Dan Hendrycks and Thomas G. Dietterich. "Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations". In: International Conference on Learning Representations (ICLR). 2019.

[He+16] Kaiming He et al. "Deep Residual Learning for Image Recognition". In: Conference on Computer Vision and Pattern Recognition (CVPR). 2016.

[He+17] Warren He et al. "Adversarial example defense: Ensembles of weak defenses are not strong". In: USENIX Workshop on Offensive Technologies (WOOT). 2017.

[HVD14] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. "Distilling the Knowledge in a Neural Network". In: Neural Information Processing Systems (NeurIPS) Deep Learning Workshop. 2014.

[JLT18] Saumya Jetley, Nicholas Lord, and Philip Torr. "With friends like these, who needs adversaries?" In: Advances in Neural Information Processing Systems (NeurIPS). 2018.

[Kri09] Alex Krizhevsky. "Learning Multiple Layers of Features from Tiny Images". Technical report. 2009.

[KSJ19] Beomsu Kim, Junghoon Seo, and Taegyun Jeon. "Bridging Adversarial Robustness and Gradient Interpretability". In: International Conference on Learning Representations Workshop on Safe Machine Learning (ICLR SafeML). 2019.

[Lec+19] Mathias Lecuyer et al. "Certified robustness to adversarial examples with differential privacy". In: Symposium on Security and Privacy (SP). 2019.

[Liu+17] Yanpei Liu et al. "Delving into Transferable Adversarial Examples and Black-box Attacks". In: International Conference on Learning Representations (ICLR). 2017.

[LM00] Beatrice Laurent and Pascal Massart. "Adaptive estimation of a quadratic functional by model selection". In: Annals of Statistics. 2000.

[Mad+18] Aleksander Madry et al. "Towards deep learning models resistant to adversarial attacks". In: International Conference on Learning Representations (ICLR). 2018.

[MDM18] Saeed Mahloujifar, Dimitrios I. Diochnos, and Mohammad Mahmoody. "The curse of concentration in robust learning: Evasion and poisoning attacks from concentration of measure". In: AAAI Conference on Artificial Intelligence (AAAI). 2018.

[Moo+17] Seyed-Mohsen Moosavi-Dezfooli et al. "Universal adversarial perturbations". In: Conference on Computer Vision and Pattern Recognition (CVPR). 2017.

[MV15] Aravindh Mahendran and Andrea Vedaldi. "Understanding deep image representations by inverting them". In: Conference on Computer Vision and Pattern Recognition (CVPR). 2015.

[Nak19a] Preetum Nakkiran. "A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features': Adversarial Examples are Just Bugs, Too". In: Distill (2019). https://distill.pub/2019/advex-bugs-discussion/response-5. DOI: 10.23915/distill.00019.5.

[Nak19b] Preetum Nakkiran. "Adversarial robustness may be at odds with simplicity". In: arXiv preprint arXiv:1901.00532. 2019.

[OMS17] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. "Feature Visualization". In: Distill. 2017.

[Pap+17] Nicolas Papernot et al. "Practical black-box attacks against machine learning". In: Asia Conference on Computer and Communications Security. 2017.

[PMG16] Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. "Transferability in Machine Learning: from Phenomena to Black-box Attacks using Adversarial Samples". In: arXiv preprint arXiv:1605.07277. 2016.

[Rec+19] Benjamin Recht et al. "Do CIFAR-10 Classifiers Generalize to CIFAR-10?" In: International Conference on Machine Learning (ICML). 2019.

[RSL18] Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. "Certified defenses against adversarial examples". In: International Conference on Learning Representations (ICLR). 2018.

[Rus+15] Olga Russakovsky et al. "ImageNet Large Scale Visual Recognition Challenge". In: International Journal of Computer Vision (IJCV). 2015.

[Sch+18] Ludwig Schmidt et al. "Adversarially Robust Generalization Requires More Data". In: Advances in Neural Information Processing Systems (NeurIPS). 2018.

[Sha+19a] Ali Shafahi et al. "Are adversarial examples inevitable?" In: International Conference on Learning Representations (ICLR). 2019.

[Sha+19b] Adi Shamir et al. "A Simple Explanation for the Existence of Adversarial Examples with Small Hamming Distance". In: arXiv preprint arXiv:1901.10861. 2019.

[SHS19] David Stutz, Matthias Hein, and Bernt Schiele. "Disentangling Adversarial Robustness and Generalization". In: Computer Vision and Pattern Recognition (CVPR). 2019.

[Smi+17] D. Smilkov et al. "SmoothGrad: removing noise by adding noise". In: ICML Workshop on Visualization for Deep Learning. 2017.

[Sug+19] Arun Sai Suggala et al. "Revisiting Adversarial Risk". In: Conference on Artificial Intelligence and Statistics (AISTATS). 2019.

[Sze+14] Christian Szegedy et al. "Intriguing properties of neural networks". In: International Conference on Learning Representations (ICLR). 2014.

[TG16] Thomas Tanay and Lewis Griffin. "A Boundary Tilting Perspective on the Phenomenon of Adversarial Examples". In: arXiv preprint arXiv:1608.07690. 2016.

[Tra+17] Florian Tramer et al. "The Space of Transferable Adversarial Examples". In: arXiv preprint arXiv:1704.03453. 2017.

[Tsi+19] Dimitris Tsipras et al. "Robustness May Be at Odds with Accuracy". In: International Conference on Learning Representations (ICLR). 2019.

[Ues+18] Jonathan Uesato et al. "Adversarial Risk and the Dangers of Evaluating Against Weak Attacks". In: International Conference on Machine Learning (ICML). 2018.

[Wan+18] Tongzhou Wang et al. "Dataset Distillation". In: arXiv preprint arXiv:1811.10959. 2018.

[WK18] Eric Wong and J. Zico Kolter. "Provable defenses against adversarial examples via the convex outer adversarial polytope". In: International Conference on Machine Learning (ICML). 2018.

[Xia+19] Kai Y. Xiao et al. "Training for Faster Adversarial Robustness Verification via Inducing ReLU Stability". In: International Conference on Learning Representations (ICLR). 2019.

[Zou+18] Haosheng Zou et al. "Geometric Universality of Adversarial Examples in Deep Learning". In: Geometry in Machine Learning ICML Workshop (GIML). 2018.