{"title": "A Minimax Approach to Supervised Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 4240, "page_last": 4248, "abstract": "Given a task of predicting Y from X, a loss function L, and a set of probability distributions Gamma on (X,Y), what is the optimal decision rule minimizing the worst-case expected loss over Gamma? In this paper, we address this question by introducing a generalization of the maximum entropy principle. Applying this principle to sets of distributions with marginal on X constrained to be the empirical marginal, we provide a minimax interpretation of the maximum likelihood problem over generalized linear models as well as some popular regularization schemes. For quadratic and logarithmic loss functions we revisit well-known linear and logistic regression models. Moreover, for the 0-1 loss we derive a classifier which we call the minimax SVM. The minimax SVM minimizes the worst-case expected 0-1 loss over the proposed Gamma by solving a tractable optimization problem. We perform several numerical experiments to show the power of the minimax SVM in outperforming the SVM.", "full_text": "A Minimax Approach to Supervised Learning\n\nFarzan Farnia\u2217\n\nfarnia@stanford.edu\n\nDavid Tse\u2217\n\ndntse@stanford.edu\n\nAbstract\n\nGiven a task of predicting Y from X, a loss function L, and a set of probability\ndistributions \u0393 on (X, Y ), what is the optimal decision rule minimizing the worst-\ncase expected loss over \u0393? In this paper, we address this question by introducing\na generalization of the maximum entropy principle. Applying this principle to\nsets of distributions with marginal on X constrained to be the empirical marginal,\nwe provide a minimax interpretation of the maximum likelihood problem over\ngeneralized linear models as well as some popular regularization schemes. For\nquadratic and logarithmic loss functions we revisit well-known linear and logistic\nregression models. 
Moreover, for the 0-1 loss we derive a classi\ufb01er which we\ncall the minimax SVM. The minimax SVM minimizes the worst-case expected\n0-1 loss over the proposed \u0393 by solving a tractable optimization problem. We\nperform several numerical experiments to show the power of the minimax SVM in\noutperforming the SVM.\n\n1\n\nIntroduction\n\nSupervised learning, the task of inferring a function that predicts a target Y from a feature vector\nX = (X1, . . . , Xd) by using n labeled training samples {(x1, y1), . . . , (xn, yn)}, has been a problem\nof central interest in machine learning. Given the underlying distribution \u02dcPX,Y , the optimal prediction\nrules have long been studied and formulated in the statistics literature. However, the advent of\nhigh-dimensional problems raised this important question: What would be a good prediction rule when we\ndo not have enough samples to estimate the underlying distribution?\nTo understand the dif\ufb01culty of learning in high-dimensional settings, consider a genome-based\nclassi\ufb01cation task where we seek to predict a binary trait of interest Y from an observation of\n3,000,000 SNPs, each of which can be considered a discrete variable Xi \u2208 {0, 1, 2}. Hence, to\nestimate the underlying distribution we need O(3^{3,000,000}) samples.\nWith no possibility of estimating the underlying \u02dcP in such problems, several approaches have been\nproposed to deal with high-dimensional settings. The standard approach in statistical learning\ntheory is empirical risk minimization (ERM) [1]. ERM learns the prediction rule by minimizing an\napproximated loss under the empirical distribution of samples. 
However, to avoid over\ufb01tting, ERM\nrestricts the set of allowable decision rules to a class of functions with limited complexity measured\nthrough its VC-dimension.\nThis paper focuses on a complementary approach to ERM where one can learn the prediction rule\nthrough minimizing a decision rule\u2019s worst-case loss over a larger set of distributions \u0393( \u02c6P ) centered\nat the empirical distribution \u02c6P . In other words, instead of restricting the class of decision rules, we\nconsider and evaluate all possible decision rules, but based on a more stringent criterion that they will\nhave to perform well over all distributions in \u0393( \u02c6P ). As seen in Figure 1, this minimax approach can\nbe broken into three main steps:\n\n1. We compute the empirical distribution \u02c6P from the data,\n\n\u2217Department of Electrical Engineering, Stanford University, Stanford, CA 94305.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fFigure 1: Minimax Approach\n\nFigure 2: Minimax-hinge Loss\n\n2. We form a distribution set \u0393( \u02c6P ) based on \u02c6P ,\n3. We learn a prediction rule \u03c8\u2217 that minimizes the worst-case expected loss over \u0393( \u02c6P ).\n\nSome special cases of this minimax approach, which are based on learning a prediction rule from\nlow-order marginal/moments, have been addressed in the literature: [2] solves a robust minimax\nclassi\ufb01cation problem for continuous settings with \ufb01xed \ufb01rst and second-order moments; [3] develops\na classi\ufb01cation approach by minimizing the worst-case hinge loss subject to \ufb01xed low-order marginals;\n[4] \ufb01ts a model minimizing the maximal correlation under \ufb01xed pairwise marginals to design a robust\nclassi\ufb01cation scheme. 
In this paper, we develop a general minimax approach for supervised learning\nproblems with arbitrary loss functions.\nTo formulate Step 3 in Figure 1, given a general loss function L and a set of distributions \u0393( \u02c6P ), we\ngeneralize the problem formulation discussed in [3] to\n\nargmin_{\u03c8\u2208\u03a8} max_{P\u2208\u0393( \u02c6P )} E[ L(Y, \u03c8(X)) ]. (1)\n\nHere, \u03a8 is the space of all decision rules. Notice the difference from the ERM problem, where \u03a8 was\nrestricted to smaller function classes while \u0393( \u02c6P ) = { \u02c6P}.\nIf we have to predict Y with no access to X, (1) reduces to the formulation studied in [5]. There,\nthe authors propose to use the principle of maximum entropy [6], for a generalized de\ufb01nition of\nentropy, to \ufb01nd the optimal prediction rule minimizing the worst-case expected loss. By the principle\nof maximum entropy, we should predict based on a distribution in \u0393( \u02c6P ) that maximizes the entropy\nfunction.\nHow can we use the principle of maximum entropy to solve (1) when we observe X as well? A\nnatural idea is to apply the maximum entropy principle to the conditional PY |X=x instead of the\nmarginal PY . This idea motivates a generalized version of the principle of maximum entropy, which\nwe call the principle of maximum conditional entropy. In fact, this principle breaks Step 3 into two\nsmaller steps:\n\n3a. We search for P \u2217, the distribution maximizing the conditional entropy over \u0393( \u02c6P ),\n3b. We \ufb01nd \u03c8\u2217, the optimal decision rule for P \u2217.\n\nAlthough the principle of maximum conditional entropy characterizes the solution to (1), computing\nthe maximizing distribution is hard in general. In [7], the authors propose a conditional version of the\nprinciple of maximum entropy, for the speci\ufb01c case of Shannon entropy, and draw the principle\u2019s\nconnection to (1). 
They call it the principle of minimum mutual information, by which one should\npredict based on the distribution minimizing the mutual information between X and Y . However, they\ndevelop their theory targeting a broad class of distribution sets, which results in a convex problem\nwhose number of variables is exponential in the dimension of the problem.\nTo overcome this issue, we propose a speci\ufb01c structure for the distribution set by matching the\nmarginal PX of all the joint distributions PX,Y in \u0393( \u02c6P ) to the empirical marginal \u02c6PX while matching\nonly the cross-moments between X and Y with those of the empirical distribution \u02c6PX,Y. We show\nthat this choice of \u0393( \u02c6P ) has two key advantages: 1) the minimax decision rule \u03c8\u2217 can be computed\nef\ufb01ciently; 2) the minimax generalization error can be controlled by allowing a level of uncertainty in\nthe matching of the cross-moments, which can be viewed as regularization in the minimax framework.\nOur solution is achieved through convex duality. For some loss functions, the dual problem turns\nout to be equivalent to the maximum likelihood problem for generalized linear models. For example,\nunder quadratic and logarithmic loss functions this minimax approach revisits the linear and logistic\nregression models, respectively.\nOn the other hand, for the 0-1 loss, the minimax approach leads to a new randomized linear classi\ufb01er\nwhich we call the minimax SVM. The minimax SVM minimizes the worst-case expected 0-1 loss\nover \u0393( \u02c6P ) by solving a tractable optimization problem. In contrast, the classic ERM formulation of\nminimizing the 0-1 loss over linear classi\ufb01ers is well known to be NP-hard [8]. Interestingly, the dual\nproblem for the 0-1 loss minimax problem also corresponds to an ERM problem for linear classi\ufb01ers,\nbut with a loss function different from the 0-1 loss. 
This loss function, which we call the minimax-hinge\nloss, is also different from the classic hinge loss (Figure 2). We emphasize that while the hinge loss is\nan ad hoc surrogate loss function chosen to convexify the 0-1 loss ERM problem, the minimax-hinge\nloss emerges from the minimax formulation. We also perform several numerical experiments to\ndemonstrate the power of the minimax SVM in outperforming the standard SVM, which minimizes\nthe surrogate hinge loss.\n\n2 Principle of Maximum Conditional Entropy\n\nIn this section, we provide a conditional version of the key de\ufb01nitions and results developed in [5].\nWe propose the principle of maximum conditional entropy to break Step 3 into Steps 3a and 3b in Figure 1.\nWe also de\ufb01ne and characterize Bayes decision rules for different loss functions to address Step 3b.\n\n2.1 Decision Problems, Bayes Decision Rules, Conditional Entropy\nConsider a decision problem. Here the decision maker observes X \u2208 X from which she predicts a\nrandom target variable Y \u2208 Y using an action a \u2208 A. Let PX,Y = (PX , PY |X ) be the underlying\ndistribution for the random pair (X, Y ). Given a loss function L : Y \u00d7 A \u2192 [0,\u221e], L(y, a) indicates\nthe loss suffered by the decision maker by deciding action a when Y = y. The decision maker uses\na decision rule \u03c8 : X \u2192 A to select an action a = \u03c8(x) from A based on an observation x \u2208 X .\nWe will in general allow the decision rules to be random, i.e. \u03c8 is random. The main purpose of\nextending to the space of randomized decision rules is to form a convex set of decision rules. Later, in\nTheorem 1, this convexity is used to prove a saddle-point theorem.\nWe call a (randomized) decision rule \u03c8Bayes a Bayes decision rule if for all decision rules \u03c8 and for\nall x \u2208 X :\n\nE[L(Y, \u03c8Bayes(X))|X = x] \u2264 E[L(Y, \u03c8(X))|X = x].\n\nIt should be noted that \u03c8Bayes depends only on PY |X, i.e. 
it remains a Bayes decision rule under a\ndifferent PX. The (unconditional) entropy of Y is de\ufb01ned as [5]\n\nH(Y ) := inf_{a\u2208A} E[L(Y, a)]. (2)\n\nSimilarly, we can de\ufb01ne the conditional entropy of Y given X = x as\n\nH(Y |X = x) := inf_{\u03c8} E[L(Y, \u03c8(X))|X = x], (3)\n\nand the conditional entropy of Y given X as\n\nH(Y |X) := \u2211_{x} PX (x) H(Y |X = x) = inf_{\u03c8} E[L(Y, \u03c8(X))]. (4)\n\nNote that H(Y |X = x) and H(Y |X) are both concave in PY |X. Applying Jensen\u2019s inequality, this\nconcavity implies that\n\nH(Y |X) \u2264 H(Y ),\n\nwhich motivates the following de\ufb01nition for the information that X carries about Y ,\n\nI(X; Y ) := H(Y ) \u2212 H(Y |X), (5)\n\ni.e. the reduction of expected loss in predicting Y by observing X. In [9], the author de\ufb01nes\nthe same concept, which he calls a coherent dependence measure. It can be seen that I(X; Y ) =\nEPX [ D(PY |X , PY ) ], where D is the divergence measure corresponding to the loss L, de\ufb01ned for\nany two probability distributions PY , QY with Bayes actions aP , aQ as [5]\n\nD(PY , QY ) := EP [L(Y, aQ)] \u2212 EP [L(Y, aP )] = EP [L(Y, aQ)] \u2212 HP (Y ). (6)\n\n2.2 Examples\n\n2.2.1 Logarithmic loss\nFor an outcome y \u2208 Y and distribution QY , de\ufb01ne the logarithmic loss as Llog(y, QY ) = \u2212 log QY (y).\nIt can be seen that Hlog(Y ), Hlog(Y |X), and Ilog(X; Y ) are the well-known unconditional Shannon entropy,\nconditional Shannon entropy, and mutual information [10]. Also, the Bayes decision rule for a distribution PX,Y\nis given by \u03c8Bayes(x) = PY |X (\u00b7|x).\n2.2.2 0-1 loss\nThe 0-1 loss function is de\ufb01ned for any y, \u02c6y \u2208 Y as L0-1(y, \u02c6y) = I(\u02c6y \u2260 y). Then, we can show\n\nH0-1(Y ) = 1 \u2212 max_{y\u2208Y} PY (y), H0-1(Y |X) = 1 \u2212 \u2211_{x\u2208X} max_{y\u2208Y} PX,Y (x, y).\n\nThe Bayes decision rule for a distribution PX,Y is the well-known maximum a posteriori (MAP) rule,\ni.e. 
\u03c8Bayes(x) = argmaxy\u2208Y PY |X (y|x).\n2.2.3 Quadratic loss\nThe quadratic loss function is de\ufb01ned as L2(y, \u02c6y) = (y \u2212 \u02c6y)2. It can be seen\n\nH2(Y ) = Var(Y ), H2(Y |X) = E [Var(Y |X)],\n\nI2(X; Y ) = Var (E[Y |X]) .\n\nThe Bayes decision rule for any PX,Y is the well-known minimum mean-square error (MMSE)\nestimator that is \u03c8Bayes(x) = E[Y |X = x].\n2.3 Principle of Maximum Conditional Entropy & Robust Bayes decision rules\n\nGiven a distribution set \u0393, consider the following minimax problem to \ufb01nd a decision rule minimizing\nthe worst-case expected loss over \u0393\n\nargmin\n\n\u03c8\u2208\u03a8\n\nmax\nP\u2208\u0393\n\nEP [L(Y, \u03c8(X))],\n\n(7)\n\nwhere \u03a8 is the space of all randomized mappings from X to A and EP denotes the expected value\nover distribution P . We call any solution \u03c8\u2217 to the above problem a robust Bayes decision rule\nagainst \u0393. The following results motivate a generalization of the maximum entropy principle to \ufb01nd a\nrobust Bayes decision rule. Refer to the supplementary material for the proofs.\nTheorem 1.A. (Weak Version) Suppose \u0393 is convex and closed, and let L be a bounded loss function.\nAssume X ,Y are \ufb01nite and that the risk set S = { [L(y, a)]y\u2208Y : a \u2208 A} is closed. Then there\nexists a robust Bayes decision rule \u03c8\u2217 against \u0393, which is a Bayes decision rule for a distribution P \u2217\nthat maximizes the conditional entropy H(Y |X) over \u0393.\nTheorem 1.B. (Strong Version) Suppose \u0393 is convex and that under any P \u2208 \u0393 there exists a Bayes\ndecision rule. We also assume the continuity in Bayes decision rules for distributions in \u0393 (See the\nsupplementary material for the exact condition). 
Then, if P \u2217 maximizes H(Y |X) over \u0393, any Bayes\ndecision rule for P \u2217 is a robust Bayes decision rule against \u0393.\n\nPrinciple of Maximum Conditional Entropy: Given a set of distributions \u0393, predict Y based on a\ndistribution in \u0393 that maximizes the conditional entropy of Y given X, i.e.\n\nH(Y |X)\n\nargmax\n\nP\u2208\u0393\n\n(8)\n\nNote that while the weak version of Theorem 1 guarantees only the existence of a saddle point for\n(7), the strong version further guarantees that any Bayes decision rule of the maximizing distribution\nresults in a robust Bayes decision rule. However, the continuity in Bayes decision rules does not hold\nfor the discontinuous 0-1 loss, which requires considering the weak version of Theorem 1 to address\nthis issue.\n\n4\n\n\f3 Prediction via Maximum Conditional Entropy Principle\n\nConsider a prediction task with target variable Y and feature vector X = (X1, . . . , Xd). We do not\nrequire the variables to be discrete. As discussed earlier, the maximum conditional entropy principle\nreduces (7) to (8), which formulate steps 3 and 3a in Figure 1, respectively. However, a general\nformulation of (8) in terms of the joint distribution PX,Y leads to an exponential computational\ncomplexity in the feature dimension d.\nThe key question is therefore under what structures of \u0393( \u02c6P ) in Step 2 we can solve (8) ef\ufb01ciently. In\nthis section, we propose a speci\ufb01c structure for \u0393( \u02c6P ), under which we provide an ef\ufb01cient solution\nto Steps 3a and 3b in Figure 1. 
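The 0-1 entropy and conditional entropy defined in Section 2.2.2, and the information measure (5), can be checked on a tiny joint distribution. The snippet below is our own illustrative sketch, not part of the paper; the toy distribution and function names are hypothetical:

```python
# Toy illustration (ours, not the authors' code) of the 0-1 entropy from
# Section 2.2.2: H(Y) = 1 - max_y P_Y(y) and
# H(Y|X) = 1 - sum_x max_y P_{X,Y}(x, y), with I(X;Y) = H(Y) - H(Y|X).

P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}  # a joint P_{X,Y}

def h01_marginal(P):
    # H_{0-1}(Y): 0-1 entropy of the marginal P_Y (one blind MAP guess)
    py = {}
    for (x, y), p in P.items():
        py[y] = py.get(y, 0.0) + p
    return 1.0 - max(py.values())

def h01_conditional(P):
    # H_{0-1}(Y|X): one MAP guess per observed x
    best = {}
    for (x, y), p in P.items():
        best[x] = max(best.get(x, 0.0), p)
    return 1.0 - sum(best.values())

hy = h01_marginal(P)      # 1 - 0.5 = 0.5
hyx = h01_conditional(P)  # 1 - (0.4 + 0.4) = 0.2
assert hyx <= hy          # observing X never hurts: H(Y|X) <= H(Y)
print(hy - hyx)           # I(X;Y): the drop in expected 0-1 risk (here 0.3)
```

The assertion mirrors the Jensen-inequality fact H(Y|X) <= H(Y) stated in Section 2.1.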
In addition, we prove a bound on the excess worst-case risk for the\nproposed \u0393( \u02c6P ).\nTo describe this structure, consider a set of distributions \u0393(Q) centered around a given distribution\nQX,Y , where, for a given norm \u2016 \u00b7 \u2016 and a t \u00d7 1 mapping vector \u03b8(Y ),\n\n\u0393(Q) = { PX,Y : PX = QX , \u2200 1 \u2264 i \u2264 t : \u2016 EP [\u03b8i(Y )X] \u2212 EQ [\u03b8i(Y )X] \u2016 \u2264 \u03b5i }. (9)\n\nHere \u03b8 encodes Y with the t-dimensional \u03b8(Y ), and \u03b8i(Y ) denotes the ith entry of \u03b8(Y ). The \ufb01rst\nconstraint in the de\ufb01nition of \u0393(Q) requires all distributions in \u0393(Q) to share the same marginal on\nX as Q; the second imposes constraints on the cross-moments between X and Y , allowing for some\nuncertainty in estimation. When applied to the supervised learning problem, we will choose Q to be\nthe empirical distribution \u02c6P and select \u03b8 appropriately based on the loss function L. However, for\nnow we will consider the problem of solving (8) over \u0393(Q) for general Q and \u03b8.\nTo that end, we use a technique similar to that of Fenchel\u2019s duality theorem, also used in [11, 12, 13]\nto address divergence minimization problems. However, we consider a different version of the convex\nconjugate for \u2212H, which is de\ufb01ned with respect to \u03b8. Considering PY as the set of all probability\ndistributions for the variable Y , we de\ufb01ne F\u03b8 : Rt \u2192 R as the convex conjugate of \u2212H(Y ) with\nrespect to the mapping \u03b8,\n\nF\u03b8(z) := max_{P\u2208PY} { H(Y ) + E[\u03b8(Y )]T z }. (10)\n\nTheorem 2. De\ufb01ne \u0393(Q) and F\u03b8 as given by (9) and (10). Then the following duality holds:\n\nmax_{P\u2208\u0393(Q)} H(Y |X) = min_{A\u2208Rt\u00d7d} EQ [ F\u03b8(AX) \u2212 \u03b8(Y )T AX ] + \u2211_{i=1}^{t} \u03b5i \u2016Ai\u2016\u2217, (11)\n\nwhere \u2016Ai\u2016\u2217 denotes the dual norm of \u2016 \u00b7 \u2016 applied to the ith row of A. Furthermore, for the optimal P \u2217 and A\u2217,\n\nEP \u2217 [ \u03b8(Y )| X = x ] = \u2207F\u03b8 (A\u2217x). (12)\n\nProof. Refer to the supplementary material for the proof.\n\nWhen applying Theorem 2 to a supervised learning problem with a speci\ufb01c loss function, \u03b8 will be\nchosen such that EP \u2217 [ \u03b8(Y )| X = x ] provides suf\ufb01cient information to compute the Bayes decision\nrule \u03c8\u2217 for P \u2217. This enables the direct computation of \u03c8\u2217, i.e. Step 3 of Figure 1, without the need\nto explicitly compute P \u2217 itself. For the loss functions discussed in Subsection 2.2, we choose the\nidentity \u03b8(Y ) = Y for the quadratic loss and the one-hot encoding \u03b8(Y ) = [ I(Y = i) ]_{i=1}^{t} for the\nlogarithmic and 0-1 loss functions. Later in this section, we will discuss how this theorem applies to\nthese loss functions.\n\n3.1 Generalization Bounds for the Worst-case Risk\n\nBy establishing the objective\u2019s Lipschitzness and boundedness through appropriate assumptions, we\ncan bound the rate of uniform convergence for the problem in the RHS of (11) [14]. Here we consider\nthe uniform convergence of the empirical averages, when Q = \u02c6Pn is the empirical distribution of n\nsamples drawn i.i.d. from the underlying distribution \u02dcP , to their expectations when Q = \u02dcP .\nIn the supplementary material, we prove the following theorem, which bounds the excess worst-case\nrisk. Here \u02c6\u03c8n and \u02dc\u03c8 denote the robust Bayes decision rules against \u0393( \u02c6Pn) and \u0393( \u02dcP ), respectively.\n\nFigure 3: Duality of Maximum Conditional Entropy/Maximum Likelihood in GLMs\n\nAs explained earlier, by the maximum conditional entropy principle we can learn \u02c6\u03c8n by solving the\nRHS of (11) for the empirical distribution of samples and then applying (12).\nTheorem 3. 
Consider a loss function L with the entropy function H and suppose \u03b8(Y ) includes\nonly one element, i.e. t = 1. Let M = max_{P\u2208PY} H(Y ) be the maximum entropy value over PY .\nAlso, take \u2016 \u00b7 \u2016/\u2016 \u00b7 \u2016\u2217 to be the \u2113p/\u2113q pair where 1/p + 1/q = 1, 1 \u2264 q \u2264 2. Given that \u2016X\u20162 \u2264 B and\n|\u03b8(Y )| \u2264 L, for any \u03b4 > 0, with probability at least 1 \u2212 \u03b4,\n\nmax_{P\u2208\u0393( \u02dcP )} E[L(Y, \u02c6\u03c8n(X))] \u2212 max_{P\u2208\u0393( \u02dcP )} E[L(Y, \u02dc\u03c8(X))] \u2264 ( 4BLM / (\u03b5\u221an) ) ( 1 + \u221a( 9 log(4/\u03b4) / 8 ) ). (13)\n\nTheorem 3 states that though we learn the prediction rule \u02c6\u03c8n by solving the maximum conditional entropy\nproblem for the empirical case, we can bound the excess \u0393-based worst-case risk. This result justi\ufb01es\nthe speci\ufb01c constraint of \ufb01xing the marginal PX across the proposed \u0393(Q) and explains the role of\nthe uncertainty parameter \u03b5 in bounding the excess worst-case risk.\n\n3.2 A Minimax Interpretation of Generalized Linear Models\n\nWe make the key observation that if F\u03b8 is the log-partition function of an exponential-family distribution,\nthe problem in the RHS of (11), when \u03b5i = 0 for all i\u2019s, is equivalent to minimizing the negative\nlog-likelihood for \ufb01tting a generalized linear model [15] given by\n\n\u2022 An exponential-family distribution p(y|\u03b7) = h(y) exp( \u03b7T \u03b8(y) \u2212 F\u03b8(\u03b7) ) with the log-partition\nfunction F\u03b8 and the suf\ufb01cient statistic \u03b8(Y ),\n\u2022 A linear predictor, \u03b7(X) = AX,\n\u2022 A mean function, E[ \u03b8(Y )|X = x ] = \u2207F\u03b8(\u03b7(x)).\n\nTherefore, Theorem 2 reveals a duality between the maximum conditional entropy problem over\n\u0393(Q) and the regularized maximum likelihood problem for the speci\ufb01ed generalized 
linear model.\nAs a geometric interpretation of this duality, by solving the regularized maximum likelihood problem\nin the RHS of (11), we in fact minimize a regularized KL-divergence\n\nargmin_{PY |X\u2208SF} EQX [ DKL( QY |X || PY |X ) ] + \u2211_{i=1}^{t} \u03b5i \u2016Ai(PY |X)\u2016\u2217, (14)\n\nwhere SF = { PY |X(y|x) = h(y) exp( \u03b8(y)T Ax \u2212 F\u03b8(Ax) ) | A \u2208 Rt\u00d7d } is the set of all exponential-family\nconditional distributions for the speci\ufb01ed generalized linear model. This can be viewed as\nprojecting Q onto (QX , SF ) (see Figure 3).\nFurthermore, for a label-invariant entropy H(Y ), the Bayes act for the uniform distribution UY leads\nto the same expected loss under any distribution on Y . Based on the divergence D\u2019s de\ufb01nition in\n(6), maximizing H(Y |X) over \u0393(Q) in the LHS of (11) is therefore equivalent to the following\ndivergence minimization problem\n\nargmin_{PY |X : (QX ,PY |X )\u2208\u0393(Q)} EQX [ D( PY |X , UY |X ) ]. (15)\n\nHere UY |X denotes the uniform conditional distribution over Y given any x \u2208 X . This can be\ninterpreted as projecting the joint distribution (QX , UY |X ) onto \u0393(Q) (see Figure 3). Then, the\nduality shown in Theorem 2 implies the following corollary.\nCorollary 1. Given a label-invariant H, the solution to (14) also minimizes (15), i.e. (14) \u2286 (15).\n3.3 Examples\n\n3.3.1 Logarithmic Loss: Logistic Regression\nTo gain suf\ufb01cient information for the Bayes decision rule under the logarithmic loss, for Y \u2208 Y =\n{1, . . . , t + 1}, let \u03b8(Y ) be the one-hot encoding of Y , i.e. \u03b8i(Y ) = I(Y = i) for 1 \u2264 i \u2264 t. Here,\nwe exclude i = t + 1 since I(Y = t + 1) = 1 \u2212 \u2211_{i=1}^{t} I(Y = i). Then\n\nF\u03b8(z) = log( 1 + \u2211_{j=1}^{t} exp(zj) ), \u2200 1 \u2264 i \u2264 t : ( \u2207F\u03b8(z) )i = exp(zi) / ( 1 + \u2211_{j=1}^{t} exp(zj) ), (16)\n\nwhich is the logistic regression model [16]. Also, the RHS of (11) will be the regularized maximum\nlikelihood problem for logistic regression. This particular result is well studied in the literature and\nstraightforward using the duality shown in [17].\n\n3.3.2 0-1 Loss: Minimax SVM\n\nTo get suf\ufb01cient information for the Bayes decision rule under the 0-1 loss, we again consider the\none-hot encoding \u03b8 described for the logarithmic loss. We show in the supplementary material that if\n\u02dcz = (z, 0) and \u02dcz(i) denotes the ith largest element of \u02dcz,\n\nF\u03b8(z) = max_{1\u2264k\u2264t+1} ( k \u2212 1 + \u2211_{j=1}^{k} \u02dcz(j) ) / k. (17)\n\nIn particular, if Y \u2208 Y = {\u22121, 1} is binary, the dual problem (11) for learning the optimal linear\npredictor \u03b1\u2217 given n samples (xi, yi)_{i=1}^{n} will be\n\nmin_{\u03b1} (1/n) \u2211_{i=1}^{n} max{ 0 , (1 \u2212 yi\u03b1T xi)/2 , \u2212yi\u03b1T xi } + \u03b5\u2016\u03b1\u2016\u2217. (18)\n\nThe \ufb01rst term is the empirical risk of a linear classi\ufb01er over the minimax-hinge loss max{0, (1 \u2212 z)/2, \u2212z},\nas shown in Figure 2. In contrast, the standard SVM is formulated using the hinge loss max{0, 1 \u2212 z}:\n\nmin_{\u03b1} (1/n) \u2211_{i=1}^{n} max{ 0 , 1 \u2212 yi\u03b1T xi } + \u03b5\u2016\u03b1\u2016\u2217. (19)\n\nWe therefore call this classi\ufb01cation approach the minimax SVM. 
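The minimax-hinge loss max{0, (1 - z)/2, -z} of (18) and the classic hinge loss max{0, 1 - z} of (19) can be sketched directly as functions of the margin. This snippet is our own illustration (the function names are ours), not the authors' code:

```python
# Illustrative sketch (ours, not from the paper): the two surrogate losses
# compared in Figure 2, as functions of the margin z = y * alpha.x.

def hinge(z):
    # classic hinge loss: max{0, 1 - z}
    return max(0.0, 1.0 - z)

def minimax_hinge(z):
    # minimax-hinge loss from (18): max{0, (1 - z)/2, -z}
    return max(0.0, (1.0 - z) / 2.0, -z)

# Both losses vanish for margins z >= 1; on [-1, 1] the minimax-hinge grows
# at half the hinge slope, and for z <= -1 it equals -z (slope -1).
for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(z, hinge(z), minimax_hinge(z))
```

For instance, at z = 0 the hinge loss is 1.0 while the minimax-hinge loss is 0.5, reflecting the flatter penalty visible in Figure 2.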
However, unlike the standard SVM,\nthe minimax SVM extends naturally to multi-class classi\ufb01cation.\nUsing Theorem 1.A2, we prove that for the 0-1 loss the robust Bayes decision rule exists and is randomized\nin general: given the optimal linear predictor and \u02dcz = (A\u2217x, 0), it randomly predicts a label according\nto the following \u02dcz-based distribution on labels,\n\n\u2200 1 \u2264 i \u2264 t + 1 : p\u03c3(i) = \u02dcz(i) + ( 1 \u2212 \u2211_{j=1}^{kmax} \u02dcz(j) ) / kmax if \u03c3(i) \u2264 kmax, and p\u03c3(i) = 0 otherwise. (20)\n\nHere \u03c3 is the permutation sorting \u02dcz in the ascending order, i.e. \u02dcz\u03c3(i) = \u02dcz(i), and kmax is the largest\nindex k satisfying \u2211_{i=1}^{k} [ \u02dcz(i) \u2212 \u02dcz(k) ] < 1. For example, in the binary case discussed, the minimax\nSVM \ufb01rst solves (18) to \ufb01nd the optimal \u03b1\u2217 and then predicts label y = 1 vs. label y = \u22121 with\nprobability min{ 1 , max{ 0 , (1 + xT \u03b1\u2217)/2 } }.\n\n2We show that given the speci\ufb01c structure of \u0393(Q), Theorem 1.A holds whether X is \ufb01nite or in\ufb01nite.\n\nTable 1: Methods Performance (error in %)\n\nDataset      mmSVM  SVM  DCC  MPM  TAN  DRC\nadult        17     22   18   22   17   17\ncredit       12     16   14   13   17   13\nkr-vs-kp     4      3    10   5    7    5\npromoters    5      9    5    6    44   6\nvotes        3      5    3    4    8    3\nhepatitis    17     20   19   18   17   17\n\n3.3.3 Quadratic Loss: Linear Regression\n\nBased on the Bayes decision rule for the quadratic loss, we choose \u03b8(Y ) = Y . To derive F\u03b8, note that\nif we let PY in (10) include all possible distributions, the maximized entropy (the variance for the quadratic\nloss) and thus the value of F\u03b8 would be in\ufb01nite. Therefore, given a parameter \u03c1, we restrict the\nsecond moment of distributions in PY = { PY : E[Y 2] \u2264 \u03c12 } and then apply (10). 
We show in the\nsupplementary material that an adjusted version of Theorem 2 holds after this change, and\n\nF\u03b8(z) \u2212 \u03c12 = z2/4 if |z/2| \u2264 \u03c1, and \u03c1(|z| \u2212 \u03c1) if |z/2| > \u03c1, (21)\n\nwhich is the Huber function [18]. Given the samples of a supervised learning task, if we choose the\nparameter \u03c1 large enough, i.e. we solve the RHS of (11) with F\u03b8(z) replaced by z2/4 and set \u03c1\ngreater than maxi |A\u2217xi|, we can equivalently take F\u03b8(z) = z2/4 + \u03c12. Then, by (12) we derive the\nlinear regression model, and the RHS of (11) is equivalent to\n\u2013 Least squares when \u03b5 = 0.\n\u2013 Lasso [19, 20] when \u2016 \u00b7 \u2016/\u2016 \u00b7 \u2016\u2217 is the \u2113\u221e/\u21131 pair.\n\u2013 Ridge regression [21] when \u2016 \u00b7 \u2016 is the \u21132-norm.\n\u2013 (Overlapping) group lasso [22, 23] with the \u21131,p penalty when \u0393GL(Q) is de\ufb01ned, given subsets\nI1, . . . , Ik of {1, . . . , d} and 1/p + 1/q = 1, as\n\n\u0393GL(Q) = { PX,Y : PX = QX , \u2200 1 \u2264 j \u2264 k : \u2016 EP [Y XIj ] \u2212 EQ [Y XIj ] \u2016q \u2264 \u03b5j }. (22)\n\n4 Numerical Experiments\n\nWe evaluated the performance of the minimax SVM on six binary classi\ufb01cation datasets from the\nUCI repository, compared to \ufb01ve benchmarks: Support Vector Machines (SVM) [24], Discrete\nChebyshev Classi\ufb01ers (DCC) [3], Minimax Probabilistic Machine (MPM) [2], Tree Augmented\nNaive Bayes (TAN) [25], and Discrete R\u00e9nyi Classi\ufb01ers (DRC) [4]. The results are summarized in\nTable 1, where the numbers indicate the percentage of error in the classi\ufb01cation task.\nWe implemented the minimax SVM by applying subgradient descent to (18) with the regularizer\n\u03bb\u2016\u03b1\u2016\u2082\u00b2. 
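The training step just described, subgradient descent on the minimax-hinge objective (18) with a squared-l2 regularizer, can be sketched as follows. This is our own illustrative reimplementation, not the authors' code; the toy data, step size, and iteration count are arbitrary choices for the demo:

```python
import numpy as np

# Illustrative sketch (ours, not the authors' implementation) of subgradient
# descent on the regularized minimax-hinge objective:
#   (1/n) * sum_i max{0, (1 - y_i a.x_i)/2, -y_i a.x_i} + lam * ||a||^2

def subgrad_step(alpha, X, y, lam, lr):
    # One subgradient-descent step on the regularized minimax-hinge risk.
    n = len(y)
    g = 2.0 * lam * alpha              # gradient of lam * ||alpha||^2
    for i in range(n):
        z = y[i] * X[i].dot(alpha)     # margin of sample i
        if z <= -1.0:
            dfdz = -1.0                # the -z piece of the loss is active
        elif z < 1.0:
            dfdz = -0.5                # the (1 - z)/2 piece is active
        else:
            dfdz = 0.0                 # zero-loss region
        g += dfdz * y[i] * X[i] / n
    return alpha - lr * g

def train(X, y, lam=0.01, lr=0.1, iters=500):
    alpha = np.zeros(X.shape[1])
    for _ in range(iters):
        alpha = subgrad_step(alpha, X, y, lam, lr)
    return alpha

# Toy separable data; the learned linear rule recovers the correct signs.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = train(X, y)
assert (np.sign(X.dot(alpha)) == y).all()
```

In the paper's binary rule, the label y = 1 would then be predicted with probability min{1, max{0, (1 + xT alpha)/2}}; the deterministic sign rule above is a simplification for the demo.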
We determined the parameters by cross-validation, where we used a randomly selected 70%\nof the training set for training and the remaining 30% for testing. We tested the values in {2^{\u221210}, . . . , 2^{10}}.\nUsing the tuned parameters, we trained the algorithm on the full training set and then evaluated the\nerror rate over the test set. We performed this procedure in 1000 Monte Carlo runs, each training on\n70% of the data points and testing on the remaining 30%, and averaged the results.\nAs seen in the table, the minimax SVM results in the best performance for \ufb01ve of the six datasets.\nTo compare these methods in high-dimensional problems, we ran an experiment over synthetic\ndata with n = 200 samples and d = 10000 features. We generated features by i.i.d. Bernoulli\nvariables with P (Xi = 1) = 0.7, and considered y = sign(\u03b3T x + z) where z \u223c N (0, 1). Using the above\nprocedure, we measured error rates of 19.3% for the mmSVM, 19.5% for the SVM, and 19.6% for the\nDRC, which indicates that the mmSVM can outperform the SVM and the DRC in high-dimensional settings as\nwell. Also, the average training time for the mmSVM was 0.085 seconds, faster than the average of\n0.105 seconds for the SVM (using Matlab\u2019s SVM command).\nAcknowledgments: We are grateful to Stanford University for providing a Stanford Graduate Fellowship,\nand to the Center for Science of Information (CSoI), an NSF Science and Technology Center under\ngrant agreement CCF-0939370, for the support during this research.\n\nReferences\n\n[1] Vladimir Vapnik. The nature of statistical learning theory. Springer Science & Business Media, 2013.\n\n[2] Gert RG Lanckriet, Laurent El Ghaoui, Chiranjib Bhattacharyya, and Michael I Jordan. A robust minimax\napproach to classi\ufb01cation. The Journal of Machine Learning Research, 3:555\u2013582, 2003.\n\n[3] Elad Eban, Elad Mezuman, and Amir Globerson. Discrete Chebyshev classi\ufb01ers. 
In Proceedings of the\n\n31st International Conference on Machine Learning (ICML-14), pages 1233\u20131241, 2014.\n\n[4] Meisam Razaviyayn, Farzan Farnia, and David Tse. Discrete r\u00e9nyi classi\ufb01ers. In Advances in Neural\n\nInformation Processing Systems 28, pages 3258\u20133266, 2015.\n\n[5] Peter D. Gr\u00fcnwald and Philip Dawid. Game theory, maximum entropy, minimum discrepancy and robust\n\nbayesian decision theory. The Annals of Statistics, 32(4):1367\u20131433, 2004.\n\n[6] Edwin T Jaynes. Information theory and statistical mechanics. Physical review, 106(4):620, 1957.\n\n[7] Amir Globerson and Naftali Tishby. The minimum information principle for discriminative learning. In\n\nProceedings of the 20th conference on Uncertainty in arti\ufb01cial intelligence, pages 193\u2013200, 2004.\n\n[8] Vitaly Feldman, Venkatesan Guruswami, Prasad Raghavendra, and Yi Wu. Agnostic learning of monomials\n\nby halfspaces is hard. SIAM Journal on Computing, 41(6):1558\u20131590, 2012.\n\n[9] Philip Dawid. Coherent measures of discrepancy, uncertainty and dependence, with applications to\nbayesian predictive experimental design. Technical Report 139, University College London, 1998.\nhttp://www.ucl.ac.uk/Stats/research/abs94.html.\n\n[10] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.\n\n[11] Yasemin Altun and Alexander Smola. Unifying divergence minimisation and statistical inference via\n\nconvex duality. In Learning Theory: Conference on Learning Theory COLT 2006, Proceedings, 2006.\n\n[12] Miroslav Dud\u00edk, Steven J Phillips, and Robert E Schapire. Maximum entropy density estimation with\ngeneralized regularization and an application to species distribution modeling. Journal of Machine Learning\nResearch, 8(6):1217\u20131260, 2007.\n\n[13] Ayse Erkan and Yasemin Altun. Semi-supervised learning via generalized maximum entropy. In AISTATS,\n\npages 209\u2013216, 2010.\n\n[14] Peter L Bartlett and Shahar Mendelson. 
Rademacher and gaussian complexities: Risk bounds and structural\n\nresults. Journal of Machine Learning Research, 3(Nov):463\u2013482, 2002.\n\n[15] Peter McCullagh and John A Nelder. Generalized linear models, volume 37. CRC press, 1989.\n\n[16] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1.\n\nSpringer, 2001.\n\n[17] Adam L Berger, Vincent J Della Pietra, and Stephen A Della Pietra. A maximum entropy approach to\n\nnatural language processing. Computational linguistics, 22(1):39\u201371, 1996.\n\n[18] Peter J Huber. Robust Statistics. Wiley, 1981.\n\n[19] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society.\n\nSeries B (Methodological), pages 267\u2013288, 1996.\n\n[20] Scott Shaobing Chen, David L. Donoho, and Michael A. Saunders. Atomic decomposition by basis pursuit.\n\nSIAM Journal on Scienti\ufb01c Computing, 20(1):33\u201361, 1998.\n\n[21] Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal problems.\n\nTechnometrics, 12(1):55\u201367, 1970.\n\n[22] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of\n\nthe Royal Statistical Society: Series B (Statistical Methodology), 68(1):49\u201367, 2006.\n\n[23] Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. Group lasso with overlap and graph lasso. In\n\nProceedings of the 26th annual international conference on machine learning, pages 433\u2013440, 2009.\n\n[24] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273\u2013297, 1995.\n\n[25] CK Chow and CN Liu. Approximating discrete probability distributions with dependence trees. 
Information\n\nTheory, IEEE Transactions on, 14(3):462\u2013467, 1968.\n\n9\n\n\f", "award": [], "sourceid": 2103, "authors": [{"given_name": "Farzan", "family_name": "Farnia", "institution": "Stanford University"}, {"given_name": "David", "family_name": "Tse", "institution": "Stanford University"}]}