{"title": "Learning with Noisy Labels", "book": "Advances in Neural Information Processing Systems", "page_first": 1196, "page_last": 1204, "abstract": "In this paper, we theoretically study the problem of binary classification in the presence of random classification noise --- the learner, instead of seeing the true labels, sees labels that have independently been flipped with some small probability. Moreover, random label noise is \\emph{class-conditional} --- the flip probability depends on the class. We provide two approaches to suitably modify any given surrogate loss function. First, we provide a simple unbiased estimator of any loss, and obtain performance bounds for empirical risk minimization in the presence of iid data with noisy labels. If the loss function satisfies a simple symmetry condition, we show that the method leads to an efficient algorithm for empirical minimization. Second, by leveraging a reduction of risk minimization under noisy labels to classification with weighted 0-1 loss, we suggest the use of a simple weighted surrogate loss, for which we are able to obtain strong empirical risk bounds. This approach has a very remarkable consequence --- methods used in practice such as biased SVM and weighted logistic regression are provably noise-tolerant. On a synthetic non-separable dataset, our methods achieve over 88\\% accuracy even when 40\\% of the labels are corrupted, and are competitive with respect to recently proposed methods for dealing with label noise in several benchmark datasets.", "full_text": "Learning with Noisy Labels\n\nDepartment of Computer Science, University of Texas, Austin.\n{naga86,inderjit,pradeepr}@cs.utexas.edu\n\nAmbuj Tewari\n\nNagarajan Natarajan\n\nInderjit S. 
Dhillon\n\nPradeep Ravikumar\n\nDepartment of Statistics, University of Michigan, Ann Arbor.\n\ntewaria@umich.edu\n\nAbstract\n\nIn this paper, we theoretically study the problem of binary classi\ufb01cation in the\npresence of random classi\ufb01cation noise \u2014 the learner, instead of seeing the true la-\nbels, sees labels that have independently been \ufb02ipped with some small probability.\nMoreover, random label noise is class-conditional \u2014 the \ufb02ip probability depends\non the class. We provide two approaches to suitably modify any given surrogate\nloss function. First, we provide a simple unbiased estimator of any loss, and ob-\ntain performance bounds for empirical risk minimization in the presence of iid\ndata with noisy labels. If the loss function satis\ufb01es a simple symmetry condition,\nwe show that the method leads to an ef\ufb01cient algorithm for empirical minimiza-\ntion. Second, by leveraging a reduction of risk minimization under noisy labels\nto classi\ufb01cation with weighted 0-1 loss, we suggest the use of a simple weighted\nsurrogate loss, for which we are able to obtain strong empirical risk bounds. This\napproach has a very remarkable consequence \u2014 methods used in practice such\nas biased SVM and weighted logistic regression are provably noise-tolerant. On\na synthetic non-separable dataset, our methods achieve over 88% accuracy even\nwhen 40% of the labels are corrupted, and are competitive with respect to recently\nproposed methods for dealing with label noise in several benchmark datasets.\n\n1 Introduction\nDesigning supervised learning algorithms that can learn from data sets with noisy labels is a problem\nof great practical importance. 
Here, by noisy labels, we refer to the setting where an adversary has\ndeliberately corrupted the labels [Biggio et al., 2011], which otherwise arise from some \u201cclean\u201d\ndistribution; learning from only positive and unlabeled data [Elkan and Noto, 2008] can also be cast\nin this setting. Given the importance of learning from such noisy labels, a great deal of practical\nwork has been done on the problem (see, for instance, the survey article by Nettleton et al. [2010]).\nThe theoretical machine learning community has also investigated the problem of learning from\nnoisy labels. Soon after the introduction of the noise-free PAC model, Angluin and Laird [1988]\nproposed the random classi\ufb01cation noise (RCN) model where each label is \ufb02ipped independently\nwith some probability \u03c1 \u2208 [0, 1/2). It is known [Aslam and Decatur, 1996, Cesa-Bianchi et al.,\n1999] that \ufb01niteness of the VC dimension characterizes learnability in the RCN model. Similarly, in\nthe online mistake bound model, the parameter that characterizes learnability without noise \u2014 the\nLittestone dimension \u2014 continues to characterize learnability even in the presence of random label\nnoise [Ben-David et al., 2009]. These results are for the so-called \u201c0-1\u201d loss. Learning with convex\nlosses has been addressed only under limiting assumptions like separability or uniform noise rates\n[Manwani and Sastry, 2013].\n\nIn this paper, we consider risk minimization in the presence of class-conditional random label noise\n(abbreviated CCN). The data consists of iid samples from an underlying \u201cclean\u201d distribution D.\nThe learning algorithm sees samples drawn from a noisy version D\u03c1 of D \u2014 where the noise rates\ndepend on the class label. To the best of our knowledge, general results in this setting have not been\nobtained before. 
To this end, we develop two methods for suitably modifying any given surrogate\nloss function \u2113, and show that minimizing the sample average of the modi\ufb01ed proxy loss function\n\n1\n\n\f\u02dc\u2113 leads to provable risk bounds where the risk is calculated using the original loss \u2113 on the clean\ndistribution.\n\nIn our \ufb01rst approach, the modi\ufb01ed or proxy loss is an unbiased estimate of the loss function. The\nidea of using unbiased estimators is well-known in stochastic optimization [Nemirovski et al., 2009],\nand regret bounds can be obtained for learning with noisy labels in an online learning setting (See\nAppendix B). Nonetheless, we bring out some important aspects of using unbiased estimators of\nloss functions for empirical risk minimization under CCN. In particular, we give a simple symmetry\ncondition on the loss (enjoyed, for instance, by the Huber, logistic, and squared losses) to ensure that\nthe proxy loss is also convex. Hinge loss does not satisfy the symmetry condition, and thus leads\nto a non-convex problem. We nonetheless provide a convex surrogate, leveraging the fact that the\nnon-convex hinge problem is \u201cclose\u201d to a convex problem (Theorem 6).\n\nOur second approach is based on the fundamental observation that the minimizer of the risk (i.e.\nprobability of misclassi\ufb01cation) under the noisy distribution differs from that of the clean distribu-\ntion only in where it thresholds \u03b7(x) = P (Y = 1|x) to decide the label. In order to correct for the\nthreshold, we then propose a simple weighted loss function, where the weights are label-dependent,\nas the proxy loss function. Our analysis builds on the notion of consistency of weighted loss func-\ntions studied by Scott [2012]. This approach leads to a very remarkable result that appropriately\nweighted losses like biased SVMs studied by Liu et al. 
[2003] are robust to CCN.\n\nThe main results and the contributions of the paper are summarized below:\n\n1. To the best of our knowledge, we are the \ufb01rst to provide guarantees for risk minimization under\nrandom label noise in the general setting of convex surrogates, without any assumptions on the\ntrue distribution.\n\n2. We provide two different approaches to suitably modifying any given surrogate loss function,\nthat surprisingly lead to very similar risk bounds (Theorems 3 and 11). These general results\ninclude some existing results for random classi\ufb01cation noise as special cases.\n\n3. We resolve an elusive theoretical gap in the understanding of practical methods like biased SVM\n\nand weighted logistic regression \u2014 they are provably noise-tolerant (Theorem 11).\n\n4. Our proxy losses are easy to compute \u2014 both the methods yield ef\ufb01cient algorithms.\n5. Experiments on benchmark datasets show that the methods are robust even at high noise rates.\n\nThe outline of the paper is as follows. We introduce the problem setting and terminology in Section\n2. In Section 3, we give our \ufb01rst main result concerning the method of unbiased estimators. In\nSection 4, we give our second and third main results for certain weighted loss functions. We present\nexperimental results on synthetic and benchmark data sets in Section 5.\n\n1.1 Related Work\nStarting from the work of Bylander [1994], many noise tolerant versions of the perceptron algorithm\nhave been developed. This includes the passive-aggressive family of algorithms [Crammer et al.,\n2006], con\ufb01dence weighted learning [Dredze et al., 2008], AROW [Crammer et al., 2009] and the\nNHERD algorithm [Crammer and Lee, 2010]. The survey article by Khardon and Wachman [2007]\nprovides an overview of some of this literature. A Bayesian approach to the problem of noisy labels\nis taken by Graepel and Herbrich [2000] and Lawrence and Sch\u00a8olkopf [2001]. 
As Adaboost is very\nsensitive to label noise, random label noise has also been considered in the context of boosting. Long\nand Servedio [2010] prove that any method based on a convex potential is inherently ill-suited to\nrandom label noise. Freund [2009] proposes a boosting algorithm based on a non-convex potential\nthat is empirically seen to be robust against random label noise.\n\nStempfel and Ralaivola [2009] proposed the minimization of an unbiased proxy for the case of\nthe hinge loss. However the hinge loss leads to a non-convex problem. Therefore, they proposed\nheuristic minimization approaches for which no theoretical guarantees were provided (We address\nthe issue in Section 3.1). Cesa-Bianchi et al. [2011] focus on the online learning algorithms where\nthey only need unbiased estimates of the gradient of the loss to provide guarantees for learning with\nnoisy data. However, they consider a much harder noise model where instances as well as labels\nare noisy. Because of the harder noise model, they necessarily require multiple noisy copies per\nclean example and the unbiased estimation schemes also become fairly complicated. In particular,\ntheir techniques break down for non-smooth losses such as the hinge loss. In contrast, we show\nthat unbiased estimation is always possible in the more benign random classi\ufb01cation noise setting.\nManwani and Sastry [2013] consider whether empirical risk minimization of the loss itself on the\n\n2\n\n\fnoisy data is a good idea when the goal is to obtain small risk under the clean distribution. But\nit holds promise only for 0-1 and squared losses. Therefore, if empirical risk minimization over\nnoisy samples has to work, we necessarily have to change the loss used to calculate the empirical\nrisk. More recently, Scott et al. [2013] study the problem of classi\ufb01cation under class-conditional\nnoise model. 
However, they approach the problem from a different set of assumptions \u2014 the noise\nrates are not known, and the true distribution satis\ufb01es a certain \u201cmutual irreducibility\u201d property.\nFurthermore, they do not give any ef\ufb01cient algorithm for the problem.\n\n2 Problem Setup and Background\nLet D be the underlying true distribution generating (X, Y ) \u2208 X \u00d7 {\u00b11} pairs from which n iid\nsamples (X1, Y1), . . . , (Xn, Yn) are drawn. After injecting random classi\ufb01cation noise (indepen-\ndently for each i) into these samples, corrupted samples (X1, \u02dcY1), . . . , (Xn, \u02dcYn) are obtained. The\nclass-conditional random noise model (CCN, for short) is given by:\n\nP ( \u02dcY = \u22121|Y = +1) = \u03c1+1, P ( \u02dcY = +1|Y = \u22121) = \u03c1\u22121, and \u03c1+1 + \u03c1\u22121 < 1\n\nThe corrupted samples are what the learning algorithm sees. We will assume that the noise rates\n\u03c1+1 and \u03c1\u22121 are known1 to the learner. Let the distribution of (X, \u02dcY ) be D\u03c1. Instances are denoted\nby x \u2208 X \u2286 Rd. Noisy labels are denoted by \u02dcy.\nLet f : X \u2192 R be some real-valued decision function. The risk of f w.r.t. the 0-1 loss is given by\nRD(f ) = E(X,Y )\u223cD(cid:2)1{sign(f (X))6=Y }(cid:3). The optimal decision function (called Bayes optimal) that\nminimizes RD over all real-valued decision functions is given by f \u22c6(x) = sign(\u03b7(x) \u2212 1/2) where\n\u03b7(x) = P (Y = 1|x). We denote by R\u2217 the corresponding Bayes risk under the clean distribution\nD, i.e. R\u2217 = RD(f\u2217). Let \u2113(t, y) denote a loss function where t \u2208 R is a real-valued prediction and\ny \u2208 {\u00b11} is a label. Let \u02dc\u2113(t, \u02dcy) denote a suitably modi\ufb01ed \u2113 for use with noisy labels (obtained using\nmethods in Sections 3 and 4). It is helpful to summarize the three important quantities associated\nwith a decision function f :\n\n1. 
Empirical ℓ̃-risk on the observed sample: R̂ℓ̃(f) := (1/n) Σ_{i=1}^n ℓ̃(f(Xi), Ỹi).

2. As n grows, we expect R̂ℓ̃(f) to be close to the ℓ̃-risk under the noisy distribution Dρ: Rℓ̃,Dρ(f) := E_{(X,Ỹ)∼Dρ}[ℓ̃(f(X), Ỹ)].

3. ℓ-risk under the "clean" distribution D: Rℓ,D(f) := E_{(X,Y)∼D}[ℓ(f(X), Y)].

Typically, ℓ is a convex function that is calibrated with respect to an underlying loss function such as the 0-1 loss. ℓ is said to be classification-calibrated [Bartlett et al., 2006] if and only if there exists a convex, invertible, nondecreasing transformation ψℓ (with ψℓ(0) = 0) such that ψℓ(RD(f) − R*) ≤ Rℓ,D(f) − min_f Rℓ,D(f). The interpretation is that we can control the excess 0-1 risk by controlling the excess ℓ-risk.

If f is not quantified in a minimization, then it is implicit that the minimization is over all measurable functions. Though most of our results apply to a general function class F, we instantiate F to be the set of hyperplanes of bounded L2 norm, W = {w ∈ R^d : ‖w‖2 ≤ W2}, for certain specific results. Proofs are provided in Appendix A.

3 Method of Unbiased Estimators

Let F : X → R be a fixed class of real-valued decision functions, over which the empirical risk is minimized. The method of unbiased estimators uses the noise rates to construct an unbiased estimator ℓ̃(t, ỹ) for the loss ℓ(t, y). However, in the experiments we will tune the noise rate parameters through cross-validation. The following key lemma tells us how to construct unbiased estimators of the loss from noisy labels.

Lemma 1. Let ℓ(t, y) be any bounded loss function. Then, if we define

ℓ̃(t, y) := [(1 − ρ−y) ℓ(t, y) − ρy ℓ(t, −y)] / (1 − ρ+1 − ρ−1),

we have, for any t, y: Eỹ[ℓ̃(t, ỹ)] = ℓ(t, y).

1 This is not necessary in practice. See Section 5.

We can try to learn a good predictor in the presence of label noise by minimizing the sample average:

f̂ ← argmin_{f∈F} R̂ℓ̃(f).

By unbiasedness of ℓ̃ (Lemma 1), we know that, for any fixed f ∈ F, the above sample average converges to Rℓ,D(f) even though the former is computed using noisy labels whereas the latter depends on the true labels. The following result gives a performance guarantee for this procedure in terms of the Rademacher complexity of the function class F. The main idea in the proof is to use the contraction principle for Rademacher complexity to get rid of the dependence on the proxy loss ℓ̃. The price to pay for this is Lρ, the Lipschitz constant of ℓ̃.

Lemma 2. Let ℓ(t, y) be L-Lipschitz in t (for every y). Then, with probability at least 1 − δ,

max_{f∈F} |R̂ℓ̃(f) − Rℓ̃,Dρ(f)| ≤ 2 Lρ R(F) + √(log(1/δ)/(2n)),

where R(F) := E_{Xi,εi}[sup_{f∈F} (1/n) Σ_{i=1}^n εi f(Xi)] is the Rademacher complexity of the function class F and Lρ ≤ 2L/(1 − ρ+1 − ρ−1) is the Lipschitz constant of ℓ̃. Note that the εi's are iid Rademacher (symmetric Bernoulli) random variables.

The above lemma immediately leads to a performance bound for f̂ with respect to the clean distribution D. Our first main result is stated in the theorem below.

Theorem 3 (Main Result 1). With probability at least 1 − δ,

Rℓ,D(f̂) ≤ min_{f∈F} Rℓ,D(f) + 4 Lρ R(F) + 2√(log(1/δ)/(2n)).

Furthermore, if ℓ is classification-calibrated, there exists a nondecreasing function ζℓ with ζℓ(0) = 0 such that

RD(f̂) − R* ≤ ζℓ( min_{f∈F} Rℓ,D(f) − min_f Rℓ,D(f) + 4 Lρ R(F) + 2√(log(1/δ)/(2n)) ).

The term on the right hand side involves both approximation error (that is small if F is large) and estimation error (that is small if F is small). However, by appropriately increasing the richness of the class F with sample size, we can ensure that the misclassification probability of f̂ approaches the Bayes risk of the true distribution. This is despite the fact that the method of unbiased estimators computes the empirical minimizer f̂ on a sample from the noisy distribution. Getting the optimal empirical minimizer f̂ is efficient if ℓ̃ is convex. Next, we address the issue of convexity of ℓ̃.

3.1 Convex losses and their estimators

Note that the loss ℓ̃ may not be convex even if we start with a convex ℓ. An example is provided by the familiar hinge loss ℓhin(t, y) = [1 − yt]+. Stempfel and Ralaivola [2009] showed that ℓ̃hin is not convex in general (of course, when ρ+1 = ρ−1 = 0, it is convex). Below we provide a simple condition to ensure convexity of ℓ̃.

Lemma 4. 
Suppose ℓ(t, y) is convex and twice differentiable almost everywhere in t (for every y) and also satisfies the symmetry property

∀t ∈ R, ℓ''(t, y) = ℓ''(t, −y).

Then ℓ̃(t, y) is also convex in t.

Examples satisfying the conditions of the lemma above are the squared loss ℓsq(t, y) = (t − y)^2, the logistic loss ℓlog(t, y) = log(1 + exp(−ty)) and the Huber loss:

ℓHub(t, y) = −4yt if yt < −1; (t − y)^2 if −1 ≤ yt ≤ 1; 0 if yt > 1.

Consider the case where ℓ̃ turns out to be non-convex when ℓ is convex, as in ℓ̃hin. In the online learning setting (where the adversary chooses a sequence of examples, and the prediction of a learner at round i is based on the history of i − 1 examples with independently flipped labels), we could use a stochastic mirror descent type algorithm [Nemirovski et al., 2009] to arrive at risk bounds (see Appendix B) similar to Theorem 3. There, we only need the expected loss to be convex, and therefore ℓhin does not present a problem. At first blush, it may appear that we do not have much hope of obtaining f̂ in the iid setting efficiently. However, Lemma 2 provides a clue.

We will now focus on the function class W of hyperplanes. Even though R̂ℓ̃(w) is non-convex, it is uniformly close to Rℓ̃,Dρ(w). Since Rℓ̃,Dρ(w) = Rℓ,D(w), this shows that R̂ℓ̃(w) is uniformly close to a convex function over w ∈ W. The following result shows that we can therefore approximately minimize F(w) = R̂ℓ̃(w) by minimizing the biconjugate F**. Recall that the (Fenchel) biconjugate F** is the largest convex function that minorizes F.

Lemma 5. Let F : W → R be a non-convex function defined on the function class W such that it is ε-close to a convex function G : W → R:

∀w ∈ W, |F(w) − G(w)| ≤ ε.

Then any minimizer of F** is a 2ε-approximate (global) minimizer of F.

Now, the following theorem establishes bounds for the case when ℓ̃ is non-convex, via the solution obtained by minimizing the convex function F**.

Theorem 6. Let ℓ be a loss, such as the hinge loss, for which ℓ̃ is non-convex. Let W = {w : ‖w‖2 ≤ W2}, let ‖Xi‖2 ≤ X2 almost surely, and let ŵapprox be any (exact) minimizer of the convex problem

min_{w∈W} F**(w),

where F**(w) is the (Fenchel) biconjugate of the function F(w) = R̂ℓ̃(w). Then, with probability at least 1 − δ, ŵapprox is a 2ε-minimizer of R̂ℓ̃(·), where

ε = 2 Lρ X2 W2 / √n + √(log(1/δ)/(2n)).

Therefore, with probability at least 1 − δ,

Rℓ,D(ŵapprox) ≤ min_{w∈W} Rℓ,D(w) + 4ε.

Numerical or symbolic computation of the biconjugate of a multidimensional function is difficult in general, but can be done in special cases. It will be interesting to see if techniques from Computational Convex Analysis [Lucet, 2010] can be used to efficiently compute the biconjugate above.

4 Method of label-dependent costs

We develop the method of label-dependent costs from two key observations. First, the Bayes classifier for the noisy distribution, denoted f̃*, for the case ρ+1 ≠ ρ−1, simply uses a threshold different from 1/2. Second, f̃* is the minimizer of a "label-dependent 0-1 loss" on the noisy distribution. 
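The first of these observations is easy to sanity-check numerically. Under CCN, the noisy class-probability is η̃(x) = (1 − ρ+1)η(x) + ρ−1(1 − η(x)), a standard consequence of the noise model, so thresholding η̃ at 1/2 is the same as thresholding η at the shifted point (1/2 − ρ−1)/(1 − ρ+1 − ρ−1). A minimal sketch (the helper names below are ours, not the paper's):

```python
def noisy_posterior(eta, rho_p, rho_m):
    # P(Ytilde = 1 | x) = (1 - rho_{+1}) * eta(x) + rho_{-1} * (1 - eta(x))
    return (1.0 - rho_p) * eta + rho_m * (1.0 - eta)

def shifted_threshold(rho_p, rho_m):
    # threshold on eta implied by thresholding the noisy posterior at 1/2
    return (0.5 - rho_m) / (1.0 - rho_p - rho_m)

# thresholding eta_tilde at 1/2 == thresholding eta at the shifted point
for rho_p, rho_m in [(0.3, 0.1), (0.2, 0.0), (0.25, 0.15)]:
    for i in range(101):
        eta = i / 100.0
        assert (noisy_posterior(eta, rho_p, rho_m) > 0.5) == \
               (eta > shifted_threshold(rho_p, rho_m))
```

Since η̃ is an affine, strictly increasing function of η (slope 1 − ρ+1 − ρ−1 > 0), the two thresholding rules are equivalent pointwise, which is what the loop checks.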
The framework we develop here generalizes known results for the uniform noise rate setting ρ+1 = ρ−1 and offers a more fundamental insight into the problem. The first observation is formalized in the lemma below.

Lemma 7. Denote P(Y = 1|X) by η(X) and P(Ỹ = 1|X) by η̃(X). The Bayes classifier under the noisy distribution, f̃* = argmin_f E_{(X,Ỹ)∼Dρ}[1{sign(f(X)) ≠ Ỹ}], is given by

f̃*(x) = sign(η̃(x) − 1/2) = sign( η(x) − (1/2 − ρ−1)/(1 − ρ+1 − ρ−1) ).

Interestingly, this "noisy" Bayes classifier can also be obtained as the minimizer of a weighted 0-1 loss, which, as we will show, allows us to "correct" for the threshold under the noisy distribution. Let us first introduce the notion of "label-dependent" costs for binary classification. We can write the 0-1 loss as a label-dependent loss as follows:

1{sign(f(X)) ≠ Y} = 1{Y=1} 1{f(X) ≤ 0} + 1{Y=−1} 1{f(X) > 0}.

We realize that the classical 0-1 loss is unweighted. Now, we could consider an α-weighted version of the 0-1 loss:

Uα(t, y) = (1 − α) 1{y=1} 1{t ≤ 0} + α 1{y=−1} 1{t > 0},

where α ∈ (0, 1). In fact we see that minimization w.r.t. the 0-1 loss is equivalent to that w.r.t. U_{1/2}(f(X), Y). It is not a coincidence that the Bayes optimal f* has a threshold of 1/2. The following lemma [Scott, 2012] shows that in fact, for any α-weighted 0-1 loss, the minimizer thresholds η(x) at α.

Lemma 8 (α-weighted Bayes optimal [Scott, 2012]). Define the Uα-risk under distribution D as

Rα,D(f) = E_{(X,Y)∼D}[Uα(f(X), Y)].

Then f*α(x) = sign(η(x) − α) minimizes the Uα-risk.

Now consider the risk of f w.r.t. the α-weighted 0-1 loss under the noisy distribution Dρ:

Rα,Dρ(f) = E_{(X,Ỹ)∼Dρ}[Uα(f(X), Ỹ)].

At this juncture, we are interested in the following question: Does there exist an α ∈ (0, 1) such that the minimizer of the Uα-risk under the noisy distribution Dρ has the same sign as that of the Bayes optimal f*? We now present our second main result in the following theorem, which makes a stronger statement — the Uα-risk under the noisy distribution Dρ is linearly related to the 0-1 risk under the clean distribution D. The corollary of the theorem answers the question in the affirmative.

Theorem 9 (Main Result 2). For the choices

α* = (1 − ρ+1 + ρ−1)/2 and Aρ = (1 − ρ+1 − ρ−1)/2,

there exists a constant BX that is independent of f such that, for all functions f,

Rα*,Dρ(f) = Aρ RD(f) + BX.

Corollary 10. The α*-weighted Bayes optimal classifier under the noisy distribution coincides with that of the 0-1 loss under the clean distribution:

argmin_f Rα*,Dρ(f) = argmin_f RD(f) = sign(η(x) − 1/2).

4.1 Proposed Proxy Surrogate Losses

Consider any surrogate loss function ℓ, and the following decomposition:

ℓ(t, y) = 1{y=1} ℓ1(t) + 1{y=−1} ℓ−1(t),

where ℓ1 and ℓ−1 are the partial losses of ℓ. Analogous to the 0-1 loss case, we can define an α-weighted loss function (Eqn. (1)) and the corresponding α-weighted ℓ-risk. Can we hope to minimize an α-weighted ℓ-risk with respect to the noisy distribution Dρ and yet bound the excess 0-1 risk with respect to the clean distribution D? Indeed, the α* specified in Theorem 9 is precisely what we need. 
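The linear relation in Theorem 9 can be verified by brute force on a small discrete distribution; the three-point marginal, the η values, and the helper names below are illustrative assumptions, not the paper's code:

```python
from itertools import product

rho_p, rho_m = 0.3, 0.1
alpha = (1.0 - rho_p + rho_m) / 2.0   # alpha* from Theorem 9
A_rho = (1.0 - rho_p - rho_m) / 2.0   # A_rho from Theorem 9

px  = [0.2, 0.5, 0.3]   # toy marginal over three points x
eta = [0.1, 0.6, 0.9]   # eta(x) = P(Y = 1 | x)

def clean_01_risk(signs):
    # 0-1 risk of f under the clean distribution D
    return sum(p * (e if s <= 0 else 1.0 - e)
               for p, e, s in zip(px, eta, signs))

def noisy_weighted_risk(signs):
    # U_{alpha*}-risk of f under the noisy distribution D_rho
    r = 0.0
    for p, e, s in zip(px, eta, signs):
        et = (1.0 - rho_p) * e + rho_m * (1.0 - e)   # noisy posterior
        r += p * ((1.0 - alpha) * et if s <= 0 else alpha * (1.0 - et))
    return r

# R_{alpha*,D_rho}(f) - A_rho * R_D(f) is the same constant B_X for every f
diffs = [noisy_weighted_risk(f) - A_rho * clean_01_risk(f)
         for f in product([-1, +1], repeat=3)]
assert max(diffs) - min(diffs) < 1e-9
```

Enumerating all 2^3 sign patterns exhausts every classifier on the three-point support, so the check covers "for all functions f" exactly.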
We are ready to state our third main result, which relies on a generalized notion of classification calibration for α-weighted losses [Scott, 2012]:

Theorem 11 (Main Result 3). Consider the empirical risk minimization problem with noisy labels:

f̂α = argmin_{f∈F} (1/n) Σ_{i=1}^n ℓα(f(Xi), Ỹi).

Define ℓα as an α-weighted margin loss function of the form:

ℓα(t, y) = (1 − α) 1{y=1} ℓ(t) + α 1{y=−1} ℓ(−t),   (1)

where ℓ : R → [0, ∞) is a convex loss function with Lipschitz constant L such that it is classification-calibrated (i.e. ℓ'(0) < 0). Then, for the choices α* and Aρ in Theorem 9, there exists a nondecreasing function ζℓα* with ζℓα*(0) = 0, such that the following bound holds with probability at least 1 − δ:

RD(f̂α*) − R* ≤ Aρ^{−1} ζℓα*( min_{f∈F} Rα*,Dρ(f) − min_f Rα*,Dρ(f) + 4 L R(F) + 2√(log(1/δ)/(2n)) ).

Aside from bounding the excess 0-1 risk under the clean distribution, the importance of the above theorem lies in the fact that it prescribes an efficient algorithm for empirical minimization with noisy labels: ℓα is convex if ℓ is convex. Thus for any surrogate loss function, including ℓhin, f̂α* can be efficiently computed using the method of label-dependent costs. Note that the choice of α* above is quite intuitive. For instance, when ρ−1 ≪ ρ+1 (this occurs in settings such as Liu et al. [2003] where there are only positive and unlabeled examples), α* < 1 − α*, and therefore mistakes on positives are penalized more than those on negatives. This makes intuitive sense, since an observed negative may well have been a positive, but the other way around is unlikely. In practice we do not need to know α*, i.e. the noise rates ρ+1 and ρ−1: the optimization problem involves just one parameter that can be tuned by cross-validation (See Section 5).

5 Experiments

We show the robustness of the proposed algorithms to increasing rates of label noise on synthetic and real-world datasets. We compare the performance of the two proposed methods with state-of-the-art methods for dealing with random classification noise. We divide each dataset (randomly) into 3 training and test sets. We use a cross-validation set to tune the parameters specific to the algorithms. Accuracy of a classification algorithm is defined as the fraction of examples in the test set classified correctly with respect to the clean distribution. For given noise rates ρ+1 and ρ−1, labels of the training data are flipped accordingly and average accuracy over 3 train-test splits is computed2. For evaluation, we choose a representative algorithm based on each of the two proposed methods — ℓ̃log for the method of unbiased estimators and the widely-used C-SVM [Liu et al., 2003] method (which applies different costs on positives and negatives) for the method of label-dependent costs.

5.1 Synthetic data

First, we use the synthetic 2D linearly separable dataset shown in Figure 1(a). We observe from experiments that our methods achieve over 90% accuracy even when ρ+1 = ρ−1 = 0.4. Figure 1 shows the performance of ℓ̃log on the dataset for different noise rates. Next, we use a 2D UCI benchmark non-separable dataset ('banana'). The dataset and classification results using C-SVM (in fact, for uniform noise rates, α* = 1/2, so it is just the regular SVM) are shown in Figure 2. 
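A minimal sketch of the label-dependent-costs method in this spirit, with an α*-weighted logistic loss standing in for C-SVM; the 1-D data, the SGD loop, and all names below are our illustrative assumptions, not the paper's experimental code:

```python
import math
import random

def weighted_logistic_step(w, b, x, y_noisy, alpha, lr):
    # ell_alpha(t, y): weight (1 - alpha) on noisy positives and alpha on
    # noisy negatives, with ell(t) = log(1 + exp(-y t))
    t = w * x + b
    weight = (1.0 - alpha) if y_noisy == 1 else alpha
    g = -y_noisy / (1.0 + math.exp(y_noisy * t)) * weight  # d(loss)/dt
    return w - lr * g * x, b - lr * g

rng = random.Random(0)
rho_p, rho_m = 0.3, 0.1
alpha = (1.0 - rho_p + rho_m) / 2.0   # alpha* from Theorem 9

# clean rule y = sign(x); labels flipped with class-conditional rates
data = []
for _ in range(2000):
    x = rng.uniform(-1.0, 1.0)
    y = 1 if x > 0 else -1
    flip = rho_p if y == 1 else rho_m
    data.append((x, -y if rng.random() < flip else y))

w, b = 0.0, 0.0
for _ in range(50):                   # plain SGD on the noisy sample
    for x, y in data:
        w, b = weighted_logistic_step(w, b, x, y, alpha, lr=0.1)

# accuracy measured against the *clean* labels
acc = sum((w * x + b > 0) == (x > 0) for x, _ in data) / len(data)
assert acc > 0.8
```

Only the noisy labels enter the training loop; α* (or, in practice, a cross-validated weight) carries all the information about the noise rates.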
The results for higher noise rates are impressive, as observed from Figures 2(d) and 2(e). The 'banana' dataset has been used in previous research on classification with noisy labels. In particular, the Random Projection classifier [Stempfel and Ralaivola, 2007], which learns a kernel perceptron in the presence of noisy labels, achieves about 84% accuracy at ρ+1 = ρ−1 = 0.3 as observed from our experiments (as well as shown by Stempfel and Ralaivola [2007]), and the random hyperplane sampling method [Stempfel et al., 2007] gets about the same accuracy at (ρ+1, ρ−1) = (0.2, 0.4) (as reported by Stempfel et al. [2007]). Contrast these with C-SVM, which achieves about 90% accuracy at ρ+1 = ρ−1 = 0.2 and over 88% accuracy at ρ+1 = ρ−1 = 0.4.

Figure 1: Classification of linearly separable synthetic data set using ℓ̃log. The noise-free data is shown in the leftmost panel. Plots (b) and (c) show training data corrupted with noise rates (ρ+1 = ρ−1 = ρ) 0.2 and 0.4 respectively. Plots (d) and (e) show the corresponding classification results. The algorithm achieves 98.5% accuracy even at 0.4 noise rate per class. (Best viewed in color.)

Figure 2: Classification of 'banana' data set using C-SVM. The noise-free data is shown in (a). Plots (b) and (c) show training data corrupted with noise rates (ρ+1 = ρ−1 = ρ) 0.2 and 0.4 respectively. Note that for ρ+1 = ρ−1, α* = 1/2 (i.e. C-SVM reduces to regular SVM). Plots (d) and (e) show the corresponding classification results (accuracies are 90.6% and 88.5% respectively). Even when 40% of the labels are corrupted (ρ+1 = ρ−1 = 0.4), the algorithm recovers the class structures, as observed from plot (e). Note that the accuracy of the method at ρ = 0 is 90.8%.

5.2 Comparison with state-of-the-art methods on UCI benchmark

We compare our methods with three state-of-the-art methods for dealing with random classification noise: the Random Projection (RP) classifier [Stempfel and Ralaivola, 2007], NHERD

2 Note that training and cross-validation are done on the noisy training data in our setting. 
To account for randomness in the flips simulating a given noise rate, we repeat each experiment 3 times (independent corruptions of the data set for the same setting of ρ+1 and ρ−1) and report the mean accuracy over the trials.

DATASET (d, n+, n−)    Noise rates               ℓ̃log   C-SVM   PAM    NHERD   RP
Breast cancer          ρ+1 = ρ−1 = 0.2           70.12  67.85   69.34  64.90   69.38
(9, 77, 186)           ρ+1 = 0.3, ρ−1 = 0.1      70.07  67.81   67.79  65.68   66.28
                       ρ+1 = ρ−1 = 0.4           67.79  67.79   67.05  56.50   54.19
Diabetes               ρ+1 = ρ−1 = 0.2           76.04  66.41   69.53  73.18   75.00
(8, 268, 500)          ρ+1 = 0.3, ρ−1 = 0.1      75.52  66.41   65.89  74.74   67.71
                       ρ+1 = ρ−1 = 0.4           65.89  65.89   65.36  71.09   62.76
Thyroid                ρ+1 = ρ−1 = 0.2           87.80  94.31   96.22  78.49   84.02
(5, 65, 150)           ρ+1 = 0.3, ρ−1 = 0.1      80.34  92.46   86.85  87.78   83.12
                       ρ+1 = ρ−1 = 0.4           83.10  66.32   70.98  85.95   57.96
German                 ρ+1 = ρ−1 = 0.2           71.80  68.40   63.80  67.80   62.80
(20, 300, 700)         ρ+1 = 0.3, ρ−1 = 0.1      71.40  68.40   67.80  67.80   67.40
                       ρ+1 = ρ−1 = 0.4           67.19  68.40   67.80  54.80   59.79
Heart                  ρ+1 = ρ−1 = 0.2           82.96  61.48   69.63  82.96   72.84
(13, 120, 150)         ρ+1 = 0.3, ρ−1 = 0.1      84.44  57.04   62.22  81.48   79.26
                       ρ+1 = ρ−1 = 0.4           57.04  54.81   53.33  52.59   68.15
Image                  ρ+1 = ρ−1 = 0.2           82.45  91.95   92.90  77.76   65.29
(18, 1188, 898)        ρ+1 = 0.3, ρ−1 = 0.1      82.55  89.26   89.55  79.39   70.66
                       ρ+1 = ρ−1 = 0.4           63.47  63.47   73.15  69.61   64.72

Table 1: Comparative study of classification algorithms on UCI benchmark datasets. Entries within 1% from the best in each row are in bold.
All the methods except the NHERD variants (which are not kernelizable) use a Gaussian kernel with width 1. All method-specific parameters are estimated through cross-validation. The proposed methods (ℓ̃log and C-SVM) are competitive across all the datasets. We show the best-performing NHERD variant ('project' or 'exact') in each case.

[Crammer and Lee, 2010] ('project' and 'exact' variants³), and the perceptron algorithm with margin (PAM), which was shown to be robust to label noise by Khardon and Wachman [2007]. We use the standard UCI classification datasets, preprocessed and made available by Gunnar Rätsch (http://theoval.cmp.uea.ac.uk/matlab). For kernelized algorithms, we use a Gaussian kernel with its width set to the best width obtained by tuning a traditional SVM on the noise-free data. For ℓ̃log, we use the ρ+1 and ρ−1 that give the best accuracy in cross-validation. For C-SVM, we fix one of the weights to 1 and tune the other. Table 1 shows the performance of the methods for different settings of noise rates. C-SVM is competitive in 4 out of 6 datasets (Breast cancer, Thyroid, German, and Image), while relatively poor in the other two. On the other hand, ℓ̃log is competitive in all the datasets, and performs the best more often. When about 20% of labels are corrupted, the uniform (ρ+1 = ρ−1 = 0.2) and non-uniform (ρ+1 = 0.3, ρ−1 = 0.1) cases have similar accuracies in all the datasets, for both C-SVM and ℓ̃log. Overall, we observe that the proposed methods are competitive and are able to tolerate moderate to high amounts of label noise in the data. Finally, in domains where noise rates are approximately known, our methods can benefit from the knowledge of noise rates.
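The corrected loss ℓ̃log used here is the unbiased-estimator construction applied to the logistic loss: given noise rates ρ+1 and ρ−1, any loss ℓ is replaced by ℓ̃(t, y) = ((1 − ρ_{−y}) ℓ(t, y) − ρ_y ℓ(t, −y)) / (1 − ρ+1 − ρ−1), whose expectation over the noisy label equals the clean-label loss. A minimal Python sketch (function names are ours, not from any released implementation) that checks this unbiasedness numerically:

```python
import math

def logistic_loss(t, y):
    # Standard logistic loss on the margin y * t.
    return math.log1p(math.exp(-y * t))

def unbiased_loss(loss, t, y, rho_p, rho_m):
    """Noise-corrected loss: an unbiased estimate of loss(t, true_y) when the
    observed label y has been flipped with class-conditional probabilities
    rho_p (for class +1) and rho_m (for class -1)."""
    rho_y = rho_p if y == 1 else rho_m    # flip probability of class y
    rho_ny = rho_m if y == 1 else rho_p   # flip probability of class -y
    return ((1 - rho_ny) * loss(t, y) - rho_y * loss(t, -y)) / (1 - rho_p - rho_m)

# Unbiasedness check: with true label y, the noisy label equals y with
# probability 1 - rho_y and -y with probability rho_y, so the expected
# corrected loss should recover the clean logistic loss exactly.
t, rho_p, rho_m = 0.7, 0.3, 0.1
for y in (+1, -1):
    rho_y = rho_p if y == 1 else rho_m
    expected = ((1 - rho_y) * unbiased_loss(logistic_loss, t, y, rho_p, rho_m)
                + rho_y * unbiased_loss(logistic_loss, t, -y, rho_p, rho_m))
    assert abs(expected - logistic_loss(t, y)) < 1e-12
```

Empirical risk minimization then simply averages unbiased_loss over the noisy sample; note that the corrected loss can be negative on individual examples even though the underlying loss is nonnegative.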
Our analysis shows that the methods are fairly robust to misspecification of noise rates (see Appendix C for results).

6 Conclusions and Future Work
We addressed the problem of risk minimization in the presence of random classification noise, and obtained general results in this setting using the methods of unbiased estimators and weighted loss functions. We have given efficient algorithms for both methods, with provable guarantees for learning under label noise. The proposed algorithms are easy to implement; their classification performance is impressive even at high noise rates and competitive with state-of-the-art methods on benchmark data. The algorithms already give a new family of methods that can be applied to the positive-unlabeled learning problem [Elkan and Noto, 2008], but the implications of the methods for this setting should be carefully analysed. We could also consider harder noise models, such as label noise that depends on the example, and "nasty label noise", where the labels to flip are chosen adversarially.

7 Acknowledgments
This research was supported by DOD Army grant W911NF-10-1-0529 to ID; PR acknowledges the support of ARO via W911NF-12-1-0390 and NSF via IIS-1149803, IIS-1320894.

³A family of methods proposed by Crammer and coworkers [Crammer et al., 2006, 2009, Dredze et al., 2008] could also be compared against, but Crammer and Lee [2010] show that the two NHERD variants perform best.

References
D. Angluin and P. Laird. Learning from noisy examples. Mach. Learn., 2(4):343–370, 1988.
Javed A. Aslam and Scott E. Decatur. On the sample complexity of noise-tolerant learning. Inf. Process. Lett., 57(4):189–195, 1996.
Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
Shai Ben-David, Dávid Pál, and Shai Shalev-Shwartz.
Agnostic online learning. In Proceedings of the 22nd Conference on Learning Theory, 2009.
Battista Biggio, Blaine Nelson, and Pavel Laskov. Support vector machines under adversarial label noise. Journal of Machine Learning Research - Proceedings Track, 20:97–112, 2011.
Tom Bylander. Learning linear threshold functions in the presence of classification noise. In Proc. of the 7th COLT, pages 340–347, NY, USA, 1994. ACM.
Nicolò Cesa-Bianchi, Eli Dichterman, Paul Fischer, Eli Shamir, and Hans Ulrich Simon. Sample-efficient strategies for learning in the presence of noise. J. ACM, 46(5):684–719, 1999.
Nicolò Cesa-Bianchi, Shai Shalev-Shwartz, and Ohad Shamir. Online learning of noisy data. IEEE Transactions on Information Theory, 57(12):7907–7931, 2011.
K. Crammer and D. Lee. Learning via Gaussian herding. In Advances in NIPS 23, pages 451–459, 2010.
Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online passive-aggressive algorithms. J. Mach. Learn. Res., 7:551–585, 2006.
Koby Crammer, Alex Kulesza, and Mark Dredze. Adaptive regularization of weight vectors. In Advances in NIPS 22, pages 414–422, 2009.
Mark Dredze, Koby Crammer, and Fernando Pereira. Confidence-weighted linear classification. In Proceedings of the Twenty-Fifth ICML, pages 264–271, 2008.
C. Elkan and K. Noto. Learning classifiers from only positive and unlabeled data. In Proc. of the 14th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pages 213–220, 2008.
Yoav Freund. A more robust boosting algorithm, 2009. Preprint arXiv:0905.2138 [stat.ML], available at http://arxiv.org/abs/0905.2138.
T. Graepel and R. Herbrich. The kernel Gibbs sampler. In Advances in NIPS 13, pages 514–520, 2000.
Roni Khardon and Gabriel Wachman. Noise tolerant variants of the perceptron algorithm. J. Mach.
Learn. Res., 8:227–248, 2007.
Neil D. Lawrence and Bernhard Schölkopf. Estimating a kernel Fisher discriminant in the presence of label noise. In Proceedings of the Eighteenth ICML, pages 306–313, 2001.
Bing Liu, Yang Dai, Xiaoli Li, Wee Sun Lee, and Philip S. Yu. Building text classifiers using positive and unlabeled examples. In ICDM 2003, pages 179–186. IEEE, 2003.
Philip M. Long and Rocco A. Servedio. Random classification noise defeats all convex potential boosters. Mach. Learn., 78(3):287–304, 2010.
Yves Lucet. What shape is your conjugate? A survey of computational convex analysis and its applications. SIAM Rev., 52(3):505–542, August 2010. ISSN 0036-1445.
Naresh Manwani and P. S. Sastry. Noise tolerance under risk minimization. To appear in IEEE Trans. Syst. Man and Cybern. Part B, 2013. URL: http://arxiv.org/abs/1109.5231.
A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. on Opt., 19(4):1574–1609, 2009.
David F. Nettleton, A. Orriols-Puig, and A. Fornells. A study of the effect of different types of noise on the precision of supervised learning techniques. Artif. Intell. Rev., 33(4):275–306, 2010.
Clayton Scott. Calibrated asymmetric surrogate losses. Electronic J. of Stat., 6:958–992, 2012.
Clayton Scott, Gilles Blanchard, and Gregory Handy. Classification with asymmetric label noise: Consistency and maximal denoising. To appear in COLT, 2013.
G. Stempfel and L. Ralaivola. Learning kernel perceptrons on noisy data using random projections. In Algorithmic Learning Theory, pages 328–342. Springer, 2007.
G. Stempfel, L. Ralaivola, and F. Denis. Learning from noisy data using hyperplane sampling and sample averages. 2007.
Guillaume Stempfel and Liva Ralaivola. Learning SVMs from sloppily labeled data. In Proc. of the 19th Intl. Conf.
on Artificial Neural Networks: Part I, pages 884–893. Springer-Verlag, 2009.
Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth ICML, pages 928–936, 2003.