{"title": "Convex Calibrated Surrogates for Low-Rank Loss Matrices with Applications to Subset Ranking Losses", "book": "Advances in Neural Information Processing Systems", "page_first": 1475, "page_last": 1483, "abstract": "The design of convex, calibrated surrogate losses, whose minimization entails consistency with respect to a desired target loss, is an important concept to have emerged in the theory of machine learning in recent years. We give an explicit construction of a convex least-squares type surrogate loss that can be designed to be calibrated for any multiclass learning problem for which the target loss matrix has a low-rank structure; the surrogate loss operates on a surrogate target space of dimension at most the rank of the target loss. We use this result to design convex calibrated surrogates for a variety of subset ranking problems, with target losses including the precision@q, expected rank utility, mean average precision, and pairwise disagreement.", "full_text": "Convex Calibrated Surrogates for Low-Rank Loss\nMatrices with Applications to Subset Ranking Losses\n\nHarish G. Ramaswamy\n\nShivani Agarwal\n\nComputer Science & Automation\n\nComputer Science & Automation\n\nIndian Institute of Science\n\nharish gurup@csa.iisc.ernet.in\n\nIndian Institute of Science\nshivani@csa.iisc.ernet.in\n\nAmbuj Tewari\n\nStatistics and EECS\n\nUniversity of Michigan\n\ntewaria@umich.edu\n\nAbstract\n\nThe design of convex, calibrated surrogate losses, whose minimization entails\nconsistency with respect to a desired target loss, is an important concept to have\nemerged in the theory of machine learning in recent years. We give an explicit\nconstruction of a convex least-squares type surrogate loss that can be designed to\nbe calibrated for any multiclass learning problem for which the target loss matrix\nhas a low-rank structure; the surrogate loss operates on a surrogate target space\nof dimension at most the rank of the target loss. 
We use this result to design convex calibrated surrogates for a variety of subset ranking problems, with target losses including the precision@q, expected rank utility, mean average precision, and pairwise disagreement.

1 Introduction

There has been much interest in recent years in understanding consistency properties of learning algorithms – particularly algorithms that minimize a surrogate loss – for a variety of finite-output learning problems, including binary classification, multiclass classification, multi-label classification, subset ranking, and others [1–17]. For algorithms minimizing a surrogate loss, the question of consistency reduces to the question of calibration of the surrogate loss with respect to the target loss of interest [5–7, 16]; in general, one is interested in convex surrogates that can be minimized efficiently. In particular, the existence (and lack thereof) of convex calibrated surrogates for various subset ranking problems, with target losses including for example the discounted cumulative gain (DCG), mean average precision (MAP), mean reciprocal rank (MRR), and pairwise disagreement (PD), has received significant attention recently [9, 11–13, 15–17].

In this paper, we develop a general result which allows us to give an explicit convex, calibrated surrogate defined on a low-dimensional surrogate space for any finite-output learning problem for which the loss matrix has low rank. Recently, Ramaswamy and Agarwal [16] showed the existence of such surrogates, but their result involved an unwieldy surrogate space and, moreover, did not give an explicit, usable construction for the mapping needed to transform predictions in the surrogate space back to the original prediction space. Working in the same general setting as theirs, we give an explicit construction that leads to a simple least-squares type surrogate. 
We then apply this result to obtain several new results related to subset ranking. Specifically, we first obtain calibrated, score-based surrogates for the Precision@q loss, which includes the winner-take-all (WTA) loss as a special case, and the expected rank utility (ERU) loss; to the best of our knowledge, consistency with respect to these losses has not been studied previously in the literature. When there are r documents to be ranked for each query, the score-based surrogates operate on an r-dimensional surrogate space. We then turn to the MAP and PD losses, which are both widely used in practice, and for which it has been shown that no convex score-based surrogate can be calibrated for all probability distributions [11, 15, 16]. For the PD loss, Duchi et al. [11] gave certain low-noise conditions on the probability distribution under which a convex, calibrated score-based surrogate could be designed; we are unaware of such a result for the MAP loss. A straightforward application of our low-rank result to these losses yields convex calibrated surrogates defined on O(r²)-dimensional surrogate spaces, but in both cases, the mapping needed to transform back to predictions in the original space involves solving a computationally hard problem. Inspired by these surrogates, we then give a convex score-based surrogate with an efficient mapping that is calibrated with respect to MAP under certain conditions on the probability distribution; this is the first such result for the MAP loss that we are aware of. We also give a family of convex score-based surrogates calibrated with the PD loss under certain noise conditions, generalizing the surrogate and conditions of Duchi et al. 
[11]. Finally, we give an efficient mapping for the O(r²)-dimensional surrogate for the PD loss, and show that this leads to a convex surrogate calibrated with the PD loss under a more general condition, i.e. over a larger set of probability distributions, than those associated with the score-based surrogates.

Paper outline. We start with some preliminaries and background in Section 2. Section 3 gives our primary result, namely an explicit convex surrogate calibrated for low-rank loss matrices, defined on a surrogate space of dimension at most the rank of the matrix. Sections 4–7 then give applications of this result to the Precision@q, ERU, MAP, and PD losses, respectively. All proofs not included in the main text can be found in the appendix.

2 Preliminaries and Background

Setup. We work in the same general setting as that of Ramaswamy and Agarwal [16]. There is an instance space X, a finite set of class labels Y = [n] = {1, . . . , n}, and a finite set of target labels (possible predictions) T = [k] = {1, . . . , k}. Given training examples (X1, Y1), . . . , (Xm, Ym) drawn i.i.d. from a distribution D on X × Y, the goal is to learn a prediction model h : X → T. Often, T = Y, but this is not always the case (for example, in the subset ranking problems we consider, the labels in Y are typically relevance vectors or preference graphs over a set of r documents, while the target labels in T are permutations over the r documents). The performance of a prediction model h : X → T is measured via a loss function ℓ : Y × T → R+ (where R+ = [0, ∞)); here ℓ(y, t) denotes the loss incurred on predicting t ∈ T when the label is y ∈ Y. 
Specifically, the goal is to learn a model h with low expected loss or ℓ-error er^ℓ_D[h] = E_{(X,Y)∼D}[ℓ(Y, h(X))]; ideally, one wants the ℓ-error of the learned model to be close to the optimal ℓ-error er^{ℓ,*}_D = inf_{h:X→T} er^ℓ_D[h]. An algorithm which when given a random training sample as above produces a (random) model h_m : X → T is said to be consistent w.r.t. ℓ if the ℓ-error of the learned model h_m converges in probability to the optimal: er^ℓ_D[h_m] →P er^{ℓ,*}_D.¹

Typically, minimizing the discrete ℓ-error directly is computationally difficult; therefore one uses instead a surrogate loss function ψ : Y × R^d → R̄+ (where R̄+ = [0, ∞]), defined on the continuous surrogate target space R^d for some d ∈ Z+ instead of the discrete target space T, and learns a model f : X → R^d by minimizing (approximately, based on the training sample) the ψ-error er^ψ_D[f] = E_{(X,Y)∼D}[ψ(Y, f(X))]. Predictions on new instances x ∈ X are then made by applying the learned model f and mapping back to predictions in the target space T via some mapping pred : R^d → T, giving h(x) = pred(f(x)). Under suitable conditions, algorithms that approximately minimize the ψ-error based on a training sample are known to be consistent with respect to ψ, i.e. to converge in probability to the optimal ψ-error er^{ψ,*}_D = inf_{f:X→R^d} er^ψ_D[f]. A desirable property of ψ is that it be calibrated w.r.t. ℓ, in which case consistency w.r.t. ψ also guarantees consistency w.r.t. ℓ; we give a formal definition of calibration and a statement of this result below.

In what follows, we will denote by Δ_n the probability simplex in R^n: Δ_n = {p ∈ R^n_+ : Σ_i p_i = 1}. For z ∈ R, let (z)+ = max(z, 0). 
We will find it convenient to view the loss function ℓ : Y × T → R+ as an n × k matrix with elements ℓ_yt = ℓ(y, t) for y ∈ [n], t ∈ [k], and column vectors ℓ_t = (ℓ_1t, . . . , ℓ_nt)⊤ ∈ R^n_+ for t ∈ [k]. We will also represent the surrogate loss ψ : Y × R^d → R̄+ as a vector function ψ : R^d → R̄^n_+ with ψ_y(u) = ψ(y, u) for y ∈ [n], u ∈ R^d, and ψ(u) = (ψ_1(u), . . . , ψ_n(u))⊤ ∈ R̄^n_+ for u ∈ R^d.

Definition 1 (Calibration). Let ℓ : Y × T → R+ and let P ⊆ Δ_n. A surrogate loss ψ : Y × R^d → R̄+ is said to be calibrated w.r.t. ℓ over P if there exists a function pred : R^d → T such that

    ∀p ∈ P :   inf_{u ∈ R^d : pred(u) ∉ argmin_t p⊤ℓ_t} p⊤ψ(u)  >  inf_{u ∈ R^d} p⊤ψ(u) .

¹Here →P denotes convergence in probability: X_m →P a if ∀ε > 0, P(|X_m − a| ≥ ε) → 0 as m → ∞.

In this case we also say (ψ, pred) is (ℓ, P)-calibrated, or if P = Δ_n, simply ℓ-calibrated.

Theorem 2 ([6, 7, 16]). Let ℓ : Y × T → R+ and ψ : Y × R^d → R̄+. Then ψ is calibrated w.r.t. ℓ over Δ_n iff ∃ a function pred : R^d → T such that for all distributions D on X × Y and all sequences of random (vector) functions f_m : X → R^d (depending on (X1, Y1), . . . , (Xm, Ym)),

    er^ψ_D[f_m] →P er^{ψ,*}_D   implies   er^ℓ_D[pred ∘ f_m] →P er^{ℓ,*}_D .

For any instance x ∈ X, let p(x) ∈ Δ_n denote the conditional label probability vector at x, given by p(x) = (p_1(x), . . . , p_n(x))⊤ where p_y(x) = P(Y = y | X = x). Then one can extend the above result to show that for P ⊂ Δ_n, ψ is calibrated w.r.t. 
ℓ over P iff ∃ a function pred : R^d → T such that the above implication holds for all distributions D on X × Y for which p(x) ∈ P ∀x ∈ X.

Subset ranking. Subset ranking problems arise frequently in information retrieval applications. In a subset ranking problem, each instance in X consists of a query together with a set of say r documents to be ranked. The label space Y varies from problem to problem: in some cases, labels consist of binary or multi-level relevance judgements for the r documents, in which case Y = {0, 1}^r or Y = {0, 1, . . . , s}^r for some appropriate s ∈ Z+; in other cases, labels consist of pairwise preference graphs over the r documents, represented as (possibly weighted) directed acyclic graphs (DAGs) over r nodes. Given examples of such instance-label pairs, the goal is to learn a model to rank documents for new queries/instances; in most cases, the desired ranking takes the form of a permutation over the r documents, so that T = S_r (where S_r denotes the group of permutations on r objects). As noted earlier, various loss functions are used in practice, and there has been much interest in understanding questions of consistency and calibration for these losses in recent years [9–15, 17]. The focus so far has mostly been on designing r-dimensional surrogates, which operate on a surrogate target space of dimension d = r; these are also termed 'score-based' surrogates, since the resulting algorithms can be viewed as learning one real-valued score function for each of the r documents, and in this case the pred mapping usually consists of simply sorting the documents according to these scores. 
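This sort-based pred mapping is a one-liner in practice. A minimal sketch in Python (the function name is ours; we follow the paper's convention that σ(i) is the 1-based position of document i):

```python
import numpy as np

def sort_pred(scores):
    """Score-based prediction: rank documents by decreasing score.

    Returns sigma as an array where sigma[i] is the (1-based) position
    of document i in the ranking.
    """
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)            # document indices, best first
    sigma = np.empty(len(scores), dtype=int)
    sigma[order] = np.arange(1, len(scores) + 1)
    return sigma
```

Ties are broken arbitrarily by `argsort`, which is consistent with the fact that any permutation respecting the score order is acceptable here.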
Below we will apply our result on calibrated surrogates for low-rank loss matrices to obtain new calibrated surrogates – both r-dimensional, score-based surrogates and, in some cases, higher-dimensional surrogates – for several subset ranking losses.

3 Calibrated Surrogates for Low Rank Loss Matrices

The following is the primary result of our paper. The result gives an explicit construction for a convex, calibrated, least-squares type surrogate loss defined on a low-dimensional surrogate space for any target loss matrix that has a low-rank structure.

Theorem 3. Let ℓ : Y × T → R+ be a loss function such that there exist d ∈ Z+, vectors α_1, . . . , α_n ∈ R^d, β_1, . . . , β_k ∈ R^d and c ∈ R such that

    ℓ(y, t) = Σ_{i=1}^d α_{yi} β_{ti} + c .

Let ψ*_ℓ : Y × R^d → R̄+ be defined as

    ψ*_ℓ(y, u) = Σ_{i=1}^d (u_i − α_{yi})² ,

and let pred*_ℓ : R^d → T be defined as

    pred*_ℓ(u) ∈ argmin_{t ∈ [k]} u⊤β_t .

Then (ψ*_ℓ, pred*_ℓ) is ℓ-calibrated.

Proof. Let p ∈ Δ_n. Define u^p ∈ R^d as u^p_i = Σ_{y=1}^n p_y α_{yi} ∀i ∈ [d]. Now for any u ∈ R^d, we have

    p⊤ψ*_ℓ(u) = Σ_{y=1}^n p_y Σ_{i=1}^d (u_i − α_{yi})² .

Minimizing this over u ∈ R^d yields that u^p is the unique minimizer of p⊤ψ*_ℓ(u). Also, for any t ∈ [k], we have

    p⊤ℓ_t = Σ_{y=1}^n p_y (Σ_{i=1}^d α_{yi} β_{ti} + c) = (u^p)⊤β_t + c .

Now, for each t ∈ [k], define

    regret^ℓ_p(t) = p⊤ℓ_t − min_{t′ ∈ [k]} p⊤ℓ_{t′} = (u^p)⊤β_t − min_{t′ ∈ [k]} (u^p)⊤β_{t′} .

Clearly, by definition of pred*_ℓ, we have regret^ℓ_p(pred*_ℓ(u^p)) = 0. Also, if regret^ℓ_p(t) = 0 for all t ∈ [k], then trivially pred*_ℓ(u) ∈ argmin_t p⊤ℓ_t ∀u ∈ R^d (and there is nothing to prove in this case). Therefore assume ∃t ∈ [k] : regret^ℓ_p(t) > 0, and let

    ε = min_{t ∈ [k] : regret^ℓ_p(t) > 0} regret^ℓ_p(t) .

Then we have

    inf_{u ∈ R^d : pred*_ℓ(u) ∉ argmin_t p⊤ℓ_t} p⊤ψ*_ℓ(u)
        = inf_{u ∈ R^d : regret^ℓ_p(pred*_ℓ(u)) ≥ ε} p⊤ψ*_ℓ(u)
        = inf_{u ∈ R^d : regret^ℓ_p(pred*_ℓ(u)) ≥ regret^ℓ_p(pred*_ℓ(u^p)) + ε} p⊤ψ*_ℓ(u) .

Now, we claim that the mapping u ↦ regret^ℓ_p(pred*_ℓ(u)) is continuous at u = u^p. To see this, suppose the sequence {u_m} converges to u^p. Then we have

    regret^ℓ_p(pred*_ℓ(u_m)) = (u^p)⊤β_{pred*_ℓ(u_m)} − min_{t′ ∈ [k]} (u^p)⊤β_{t′}
        = (u^p − u_m)⊤β_{pred*_ℓ(u_m)} + u_m⊤β_{pred*_ℓ(u_m)} − min_{t′ ∈ [k]} (u^p)⊤β_{t′}
        = (u^p − u_m)⊤β_{pred*_ℓ(u_m)} + min_{t′ ∈ [k]} u_m⊤β_{t′} − min_{t′ ∈ [k]} (u^p)⊤β_{t′} .

The last equality holds by definition of pred*_ℓ. It is easy to see that the term on the right goes to zero as u_m converges to u^p. Thus regret^ℓ_p(pred*_ℓ(u_m)) converges to regret^ℓ_p(pred*_ℓ(u^p)) = 0, yielding continuity at u^p. In particular, this implies ∃δ > 0 such that

    ‖u − u^p‖ < δ  ⟹  regret^ℓ_p(pred*_ℓ(u)) − regret^ℓ_p(pred*_ℓ(u^p)) < ε .

This gives

    inf_{u ∈ R^d : regret^ℓ_p(pred*_ℓ(u)) ≥ regret^ℓ_p(pred*_ℓ(u^p)) + ε} p⊤ψ*_ℓ(u)
        ≥ inf_{u ∈ R^d : ‖u − u^p‖ ≥ δ} p⊤ψ*_ℓ(u)
        > inf_{u ∈ R^d} p⊤ψ*_ℓ(u) ,

where the last inequality holds since p⊤ψ*_ℓ(u) is a strictly convex function of u and u^p is its unique minimizer. The above sequence of inequalities gives us that

    inf_{u ∈ R^d : pred*_ℓ(u) ∉ argmin_t p⊤ℓ_t} p⊤ψ*_ℓ(u) > inf_{u ∈ R^d} p⊤ψ*_ℓ(u) .

Since this holds for all p ∈ Δ_n, we have that (ψ*_ℓ, pred*_ℓ) is ℓ-calibrated.

We note that Ramaswamy and Agarwal [16] showed a similar least-squares type surrogate calibrated for any loss ℓ : Y × T → R+; indeed our proof technique above draws inspiration from the proof technique there. However, the surrogate they gave was defined on a surrogate space of dimension n − 1, where n is the number of class labels in Y. For many practical problems, this is an intractably large number.
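The construction in Theorem 3 is short to implement. A minimal numerical sketch, using a small hypothetical rank-2 factorization of a loss matrix (all names and values below are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical example: a loss matrix with an explicit rank-d factorization
# L[y, t] = alpha[y] . beta[t] + c, as assumed in Theorem 3 (n=3, k=2, d=2).
alpha = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
beta = np.array([[0.0, 1.0], [1.0, 0.0]])
c = 0.0
L = alpha @ beta.T + c  # the n x k loss matrix

def surrogate(y, u):
    """Least-squares surrogate psi*_l(y, u) = sum_i (u_i - alpha_{y,i})^2."""
    return float(np.sum((u - alpha[y]) ** 2))

def pred(u):
    """pred*_l(u): a minimizer of u . beta_t over t in [k]."""
    return int(np.argmin(beta @ u))

# For any p in the simplex, the surrogate risk is minimized at
# u_p = sum_y p_y alpha_y, and pred(u_p) then minimizes p^T L,
# i.e. it is a Bayes-optimal prediction.
p = np.array([0.2, 0.7, 0.1])
u_p = p @ alpha
assert pred(u_p) in np.flatnonzero(p @ L == (p @ L).min())
```

The assertion mirrors the first step of the proof: the population minimizer of the surrogate is u^p, and applying pred*_ℓ to it recovers a minimizer of the expected target loss.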
For example, as noted above, in the subset ranking problems we consider, the number of class labels is typically exponential in r, the number of documents associated with each query. On the other hand, as we will see below, many subset ranking losses have a low-rank structure, with rank linear or quadratic in r, allowing us to use the above result to design convex calibrated surrogates on an O(r)- or O(r²)-dimensional space. Ramaswamy and Agarwal also gave another result in which they showed that any loss matrix of rank d has a d-dimensional convex calibrated surrogate; however, the surrogate there was defined such that it took values < ∞ only on an awkward subset of R^d (not the full space R^d) that would be difficult to construct in practice, and moreover, their result did not yield an explicit construction for the pred mapping required to use a calibrated surrogate in practice. Our result above combines the benefits of both these previous results, allowing explicit construction of low-dimensional least-squares type surrogates for any low-rank loss matrix. The following sections illustrate several applications of this result.

4 Calibrated Surrogates for Precision@q

The Precision@q is a popular performance measure for subset ranking problems in information retrieval. As noted above, in a subset ranking problem, each instance in X consists of a query together with a set of r documents to be ranked. Consider a setting with binary relevance judgement labels, so that Y = {0, 1}^r with n = 2^r. The prediction space is T = S_r (the group of permutations on r objects) with k = r!. 
For y ∈ {0, 1}^r and σ ∈ S_r, where σ(i) denotes the position of document i under σ, the Precision@q loss for any integer q ∈ [r] can be written as follows:

    ℓ_P@q(y, σ) = 1 − (1/q) Σ_{i=1}^q y_{σ⁻¹(i)} = 1 − (1/q) Σ_{i=1}^r y_i · 1(σ(i) ≤ q) .

Therefore, by Theorem 3, for the r-dimensional surrogate ψ*_P@q : {0, 1}^r × R^r → R̄+ and pred*_P@q : R^r → S_r defined as

    ψ*_P@q(y, u) = Σ_{i=1}^r (u_i − y_i)²
    pred*_P@q(u) ∈ argmax_{σ ∈ S_r} Σ_{i=1}^r u_i · 1(σ(i) ≤ q) ,

we have that (ψ*_P@q, pred*_P@q) is ℓ_P@q-calibrated. It can easily be seen that for any u ∈ R^r, any permutation σ which places the top q documents sorted in decreasing order of scores u_i in the top q positions achieves the maximum in pred*_P@q(u); thus pred*_P@q(u) can be implemented efficiently using a standard sorting or selection algorithm. Note that the popular winner-take-all (WTA) loss, which assigns a loss of 0 if the top-ranked item is relevant (i.e. if y_{σ⁻¹(1)} = 1) and 1 otherwise, is simply a special case of the above loss with q = 1; therefore the above construction also yields a calibrated surrogate for the WTA loss. To our knowledge, this is the first example of convex, calibrated surrogates for the Precision@q and WTA losses.

5 Calibrated Surrogates for Expected Rank Utility

The expected rank utility (ERU) is a popular subset ranking performance measure used in recommender systems displaying short ranked lists [18]. In this case the labels consist of multi-level relevance judgements (such as 0 to 5 stars), so that Y = {0, 1, . . . , s}^r for some appropriate s ∈ Z+ with n = (s + 1)^r. The prediction space again is T = S_r with k = r!. For y ∈ {0, 1, . . . , s}^r and σ ∈ S_r, where σ(i) denotes the position of document i under σ, the ERU loss is defined as

    ℓ_ERU(y, σ) = z − Σ_{i=1}^r max(y_i − v, 0) · 2^{(1−σ(i))/(w−1)} ,

where z is a constant to ensure the positivity of the loss, v ∈ [s] is a constant that indicates a neutral score, and w ∈ R is a constant indicating the viewing half-life. Thus, by Theorem 3, for the r-dimensional surrogate ψ*_ERU : {0, 1, . . . , s}^r × R^r → R̄+ and pred*_ERU : R^r → S_r defined as

    ψ*_ERU(y, u) = Σ_{i=1}^r (u_i − max(y_i − v, 0))²
    pred*_ERU(u) ∈ argmax_{σ ∈ S_r} Σ_{i=1}^r u_i · 2^{(1−σ(i))/(w−1)} ,

we have that (ψ*_ERU, pred*_ERU) is ℓ_ERU-calibrated. It can easily be seen that for any u ∈ R^r, any permutation σ satisfying the condition

    u_i > u_j  ⟹  σ(i) < σ(j)

achieves the maximum in pred*_ERU(u), and therefore pred*_ERU(u) can be implemented efficiently by simply sorting the r documents in decreasing order of scores u_i. As for Precision@q, to our knowledge, this is the first example of a convex, calibrated surrogate for the ERU loss.

6 Calibrated Surrogates for Mean Average Precision

The mean average precision (MAP) is a widely used ranking performance measure in information retrieval and related applications [15, 19]. As with the Precision@q loss, Y = {0, 1}^r and T = S_r. For y ∈ {0, 1}^r and σ ∈ S_r, where σ(i) denotes the position of document i under σ, the MAP loss is defined as follows:

    ℓ_MAP(y, σ) = 1 − (1 / |{γ : y_γ = 1}|) Σ_{i : y_i = 1} (1/σ(i)) Σ_{j=1}^{σ(i)} y_{σ⁻¹(j)} .

It was recently shown that there cannot exist any r-dimensional convex, calibrated surrogates for the MAP loss [15].
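The MAP loss just defined is straightforward to evaluate directly from a binary relevance vector and a permutation; a minimal sketch (function name ours):

```python
import numpy as np

def map_loss(y, sigma):
    """l_MAP(y, sigma): one minus the average, over relevant documents i,
    of the precision at position sigma(i).

    y is a binary relevance vector with at least one relevant document
    (the loss is undefined otherwise), and sigma[i] is the 1-based
    position of document i under the permutation.
    """
    y = np.asarray(y)
    sigma = np.asarray(sigma)
    rel_pos = np.sort(sigma[y == 1])  # positions of the relevant documents
    # The k-th smallest relevant position p_k has exactly k relevant
    # documents at positions <= p_k, so precision at p_k is k / p_k.
    prec = np.arange(1, len(rel_pos) + 1) / rel_pos
    return 1.0 - prec.mean()
```

For example, with y = (1, 0, 1) and the identity permutation, the precisions at the two relevant positions are 1/1 and 2/3, giving a loss of 1/6.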
We now re-write the MAP loss above in a manner that allows us to show the existence of an O(r²)-dimensional convex, calibrated surrogate. In particular, we can write

    ℓ_MAP(y, σ) = 1 − (1 / Σ_{γ=1}^r y_γ) Σ_{i=1}^r Σ_{j=1}^i (y_{σ⁻¹(i)} y_{σ⁻¹(j)}) / i
                = 1 − (1 / Σ_{γ=1}^r y_γ) Σ_{i=1}^r Σ_{j=1}^i (y_i y_j) / max(σ(i), σ(j)) .

Thus, by Theorem 3, for the r(r+1)/2-dimensional surrogate ψ*_MAP : {0, 1}^r × R^{r(r+1)/2} → R̄+ and pred*_MAP : R^{r(r+1)/2} → S_r defined as

    ψ*_MAP(y, u) = Σ_{i=1}^r Σ_{j=1}^i (u_ij − y_i y_j / Σ_{γ=1}^r y_γ)²
    pred*_MAP(u) ∈ argmax_{σ ∈ S_r} Σ_{i=1}^r Σ_{j=1}^i u_ij · (1 / max(σ(i), σ(j))) ,

we have that (ψ*_MAP, pred*_MAP) is ℓ_MAP-calibrated.

Note however that the optimization problem associated with computing pred*_MAP(u) above can be written as a quadratic assignment problem (QAP), and most QAPs are known to be NP-hard. We conjecture that the QAP associated with the mapping pred*_MAP above is also NP-hard. Therefore, while the surrogate loss ψ*_MAP is calibrated for ℓ_MAP and can be minimized efficiently over a training sample to learn a model f : X → R^{r(r+1)/2}, for large r, evaluating the mapping required to transform predictions in R^{r(r+1)/2} back to predictions in S_r is likely to be computationally infeasible.
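While pred*_MAP is hard to compute, the surrogate ψ*_MAP itself is cheap to evaluate. A minimal sketch, storing the r(r+1)/2 entries u_ij (i ≥ j) in the lower triangle of an r × r array (representation and names ours):

```python
import numpy as np

def psi_map(y, U):
    """Least-squares MAP surrogate:
    psi*_MAP(y, U) = sum_{i >= j} (u_ij - y_i y_j / sum(y))^2.

    y is a binary relevance vector with at least one relevant document;
    U is an r x r array whose lower triangle (including the diagonal)
    holds the surrogate prediction u.
    """
    y = np.asarray(y, dtype=float)
    r = len(y)
    target = np.outer(y, y) / y.sum()   # target value for each u_ij
    i, j = np.tril_indices(r)           # lower-triangle index pairs, i >= j
    return float(np.sum((U[i, j] - target[i, j]) ** 2))
```

By construction the surrogate vanishes exactly when the lower triangle of U matches the normalized products y_i y_j / Σ_γ y_γ.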
Below we describe an alternate mapping in place of pred*_MAP which can be computed efficiently, and show that under certain conditions on the probability distribution, the surrogate ψ*_MAP together with this mapping is still calibrated for ℓ_MAP.

Specifically, define pred_MAP : R^{r(r+1)/2} → S_r as follows:

    pred_MAP(u) ∈ {σ ∈ S_r : u_ii > u_jj ⟹ σ(i) < σ(j)} .

Clearly, pred_MAP(u) can be implemented efficiently by simply sorting the 'diagonal' elements u_ii for i ∈ [r]. Also, let Δ_Y denote the probability simplex over Y, and for each p ∈ Δ_Y, define u^p ∈ R^{r(r+1)/2} as follows:

    u^p_ij = E_{Y∼p}[Y_i Y_j / Σ_{γ=1}^r Y_γ] = Σ_{y∈Y} p_y (y_i y_j / Σ_{γ=1}^r y_γ)   ∀i, j ∈ [r] : i ≥ j .

Now define P_reinforce ⊂ Δ_Y as follows:

    P_reinforce = {p ∈ Δ_Y : u^p_ii ≥ u^p_jj ⟹ u^p_ii ≥ u^p_jj + Σ_{γ ∈ [r]\{i,j}} (u^p_jγ − u^p_iγ)+} ,

where we set u^p_ij = u^p_ji for i < j. Then we have the following result:

Theorem 4. (ψ*_MAP, pred_MAP) is (ℓ_MAP, P_reinforce)-calibrated.

The ideal predictor pred*_MAP uses the entire u matrix, but the predictor pred_MAP uses only the diagonal elements. The noise conditions P_reinforce can be viewed as basically enforcing that the diagonal elements dominate and themselves enforce a clear ordering.

In fact, since the mapping pred_MAP depends on only the diagonal elements of u, we can equivalently define an r-dimensional surrogate that is calibrated w.r.t. ℓ_MAP over P_reinforce. Specifically, we have the following immediate corollary:

Corollary 5. 
Let ψ̃_MAP : {0, 1}^r × R^r → R̄+ and p̃red_MAP : R^r → S_r be defined as

    ψ̃_MAP(y, ũ) = Σ_{i=1}^r (ũ_i − y_i / Σ_{γ=1}^r y_γ)²
    p̃red_MAP(ũ) ∈ {σ ∈ S_r : ũ_i > ũ_j ⟹ σ(i) < σ(j)} .

Then (ψ̃_MAP, p̃red_MAP) is (ℓ_MAP, P_reinforce)-calibrated.

Looking at the form of ψ̃_MAP and p̃red_MAP, we can see that the function s : Y → R^r defined as s_i(y) = y_i / (Σ_{γ=1}^r y_γ) is a 'standardization function' for the MAP loss over P_reinforce, and therefore it follows that any 'order-preserving surrogate' with this standardization function is also calibrated with the MAP loss over P_reinforce [13]. To our knowledge, this is the first example of conditions on the probability distribution under which a convex calibrated (and moreover, score-based) surrogate can be designed for the MAP loss.

7 Calibrated Surrogates for Pairwise Disagreement

The pairwise disagreement (PD) loss is a natural and widely used loss in subset ranking [11, 17]. The label space Y here consists of a finite number of (possibly weighted) directed acyclic graphs (DAGs) over r nodes; we can represent each such label as a vector y ∈ R^{r(r−1)}_+ where at least one of y_ij and y_ji is 0 for each i ≠ j, with y_ij > 0 indicating a preference for document i over document j and y_ij denoting the weight of the preference. The prediction space as usual is T = S_r with k = r!. For y ∈ Y and σ ∈ S_r, where σ(i) denotes the position of document i under σ, the PD loss is defined as follows:

    ℓ_PD(y, σ) = Σ_{i=1}^r Σ_{j≠i} y_ij · 1(σ(i) > σ(j)) .

It was recently shown that there cannot exist any r-dimensional convex, calibrated surrogates for the PD loss [15, 16].
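The PD loss can be computed directly from the preference weights; a minimal sketch, representing a label as an r × r matrix with Y[i, j] = y_ij (representation and names ours):

```python
import numpy as np

def pd_loss(Y, sigma):
    """l_PD(y, sigma) = sum_{i != j} y_ij * 1(sigma(i) > sigma(j)),
    where Y[i, j] >= 0 is the weight of the preference for document i
    over document j, and sigma[i] is the 1-based position of document i.
    """
    Y = np.asarray(Y, dtype=float)
    sigma = np.asarray(sigma)
    # after[i, j] is True when document i is ranked below document j,
    # i.e. the preference (i, j) is violated by sigma.
    after = sigma[:, None] > sigma[None, :]
    return float(np.sum(Y * after))
```

For instance, with a unit preference for document 0 over document 1, a permutation placing document 1 first incurs loss 1, and the reverse order incurs loss 0.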
By Theorem 3, for the r(r−1)-dimensional surrogate ψ*_PD : Y × R^{r(r−1)} → R̄+ and pred*_PD : R^{r(r−1)} → S_r defined as

    ψ*_PD(y, u) = Σ_{i=1}^r Σ_{j≠i} (u_ij − y_ij)²                                  (1)
    pred*_PD(u) ∈ argmin_{σ ∈ S_r} Σ_{i=1}^r Σ_{j≠i} u_ij · 1(σ(i) > σ(j)) ,

we immediately have that (ψ*_PD, pred*_PD) is ℓ_PD-calibrated (in fact the loss matrix ℓ_PD has rank at most r(r−1)/2, allowing for an r(r−1)/2-dimensional surrogate; we use r(r−1) dimensions for convenience). In this case, the optimization problem associated with computing pred*_PD(u) above is a minimum weighted feedback arc set (MWFAS) problem, which is known to be NP-hard. Therefore, as with the MAP loss, while the surrogate loss ψ*_PD is calibrated for ℓ_PD and can be minimized efficiently over a training sample to learn a model f : X → R^{r(r−1)}, for large r, evaluating the mapping required to transform predictions in R^{r(r−1)} back to predictions in S_r is likely to be computationally infeasible.

Below we give two sets of results. In Section 7.1, we give a family of score-based (r-dimensional) surrogates that are calibrated with the PD loss under different conditions on the probability distribution; these surrogates and conditions generalize those of Duchi et al. [11]. In Section 7.2, we give a different condition on the probability distribution under which we can actually avoid 'difficult' graphs being passed to pred*_PD. This condition is more general (i.e. 
encompasses a larger set of probability distributions) than those associated with the score-based surrogates; this gives a new (non-score-based, r(r−1)-dimensional) surrogate with an efficiently computable pred mapping that is calibrated with the PD loss over a larger set of probability distributions than previous surrogates for this loss.

7.1 Family of r-Dimensional Surrogates Calibrated with ℓ_PD Under Noise Conditions

The following gives a family of score-based surrogates, parameterized by functions f : Y → R^r, that are calibrated with the PD loss under different conditions on the probability distribution:

Theorem 6. Let f : Y → R^r be any function that maps DAGs y ∈ Y to score vectors f(y) ∈ R^r. Let ψ_f : Y × R^r → R̄+, pred : R^r → S_r and P_f ⊂ Δ_Y be defined as

    ψ_f(y, u) = Σ_{i=1}^r (u_i − f_i(y))²
    pred(u) ∈ {σ ∈ S_r : u_i > u_j ⟹ σ(i) < σ(j)}
    P_f = {p ∈ Δ_Y : E_{Y∼p}[Y_ij] > E_{Y∼p}[Y_ji] ⟹ E_{Y∼p}[f_i(Y)] > E_{Y∼p}[f_j(Y)]} .

Then (ψ_f, pred) is (ℓ_PD, P_f)-calibrated.

The noise conditions P_f state that the expected value of the function f must decide the 'right' ordering. We note that the surrogate given by Duchi et al. [11] can be written in our notation as

    ψ_DMJ(y, u) = Σ_{i=1}^r Σ_{j≠i} y_ij (u_j − u_i) + ν Σ_{i=1}^r λ(u_i) ,

where λ is a strictly convex and 1-coercive function and ν > 0. Taking λ(z) = z² and ν = 1/2 gives a special case of the family of score-based surrogates in Theorem 6 above, obtained by taking f as

    f_i(y) = Σ_{j≠i} (y_ij − y_ji) .

Indeed, the set of noise conditions under which the surrogate ψ_DMJ is shown to be calibrated with the PD loss in Duchi et al. [11] is exactly the set P_f above with this choice of f. 
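This special case — λ(z) = z² with f_i(y) = Σ_{j≠i}(y_ij − y_ji) — is a one-liner to implement; a minimal sketch, again with the matrix representation Y[i, j] = y_ij (names ours):

```python
import numpy as np

def f_dmj(Y):
    """Score function f_i(y) = sum_{j != i} (y_ij - y_ji): net preference
    weight in favor of document i. With lambda(z) = z^2 and nu = 1/2 this
    choice of f recovers the Duchi et al. surrogate as a special case of
    the family in Theorem 6.
    """
    Y = np.asarray(Y, dtype=float)
    return Y.sum(axis=1) - Y.sum(axis=0)  # out-weight minus in-weight per node

def psi_f(Y, u):
    """psi_f(y, u) = ||u - f(y)||^2, the least-squares member of the family."""
    return float(np.sum((np.asarray(u, dtype=float) - f_dmj(Y)) ** 2))
```

Prediction then just sorts u in decreasing order, as in the `pred` of Theorem 6.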
We also note that f can be viewed as a 'standardization function' [13] for the PD loss over P_f.

7.2 An O(r²)-Dimensional Surrogate Calibrated with ℓ_PD Under More General Conditions

Consider now the r(r−1)-dimensional surrogate ψ*_PD : Y × R^{r(r−1)} → R̄+ defined in Eq. (1). We noted that the corresponding mapping pred*_PD involves an NP-hard optimization problem. Here we give an alternate mapping pred_PD : R^{r(r−1)} → S_r that can be computed efficiently, and show that under certain conditions on the probability distribution, the surrogate ψ*_PD together with this mapping pred_PD is calibrated for ℓ_PD. The mapping pred_PD is described by Algorithm 1 below:

Algorithm 1 pred_PD
(Input: u ∈ R^{r(r−1)}; Output: Permutation σ ∈ S_r)
Construct a directed graph over [r] with edge (i, j) having weight (u_ij − u_ji)+. If this graph is acyclic, return any topological sorted order. If the graph has cycles, sort the edges in ascending order by weight and delete them one by one (smallest weight first) until the graph becomes acyclic; return any topological sorted order of the resulting acyclic graph.

For each p ∈ Δ_Y, define E_p = {(i, j) ∈ [r] × [r] : E_{Y∼p}[Y_ij] > E_{Y∼p}[Y_ji]}, and define

    P_DAG = {p ∈ Δ_Y : ([r], E_p) is a DAG} .

Then we have the following result:
Theorem 7.
(\u03c8\u2217PD, predPD) is (\ufffdPD,PDAG)-calibrated.\nIt is easy to see that PDAG \ufffd Pf \u2200f (where Pf is as de\ufb01ned in Theorem 6), so that the above result\nyields a low-dimensional, convex surrogate with an ef\ufb01ciently computable pred mapping that is\ncalibrated for the PD loss under a broader set of conditions than the previous surrogates.\n\n8 Conclusion\nCalibration of surrogate losses is an important property in designing consistent learning algorithms.\nWe have given an explicit method for constructing calibrated surrogates for any learning problem\nwith a low-rank loss structure, and have used this to obtain several new results for subset ranking,\nincluding new calibrated surrogates for the Precision@q, ERU, MAP and PD losses.\n\nAcknowledgments\n\nThe authors thank the anonymous reviewers, Aadirupa Saha and Shiv Ganesh for their comments. HGR ac-\nknowledges a Tata Consultancy Services (TCS) PhD fellowship and the Indo-US Virtual Institute for Math-\nematical and Statistical Sciences (VIMSS). SA thanks the Department of Science & Technology (DST) and\nIndo-US Science & Technology Forum (IUSSTF) for their support. AT gratefully acknowledges the support of\nNSF under grant IIS-1319810.\n\n8\n\n\fReferences\n[1] G\u00b4abor Lugosi and Nicolas Vayatis. On the Bayes-risk consistency of regularized boosting\n\nmethods. Annals of Statistics, 32(1):30\u201355, 2004.\n\n[2] Wenxin Jiang. Process consistency for AdaBoost. Annals of Statistics, 32(1):13\u201329, 2004.\n[3] Tong Zhang. Statistical behavior and consistency of classi\ufb01cation methods based on convex\n\nrisk minimization. Annals of Statistics, 32(1):56\u2013134, 2004.\n\n[4] Ingo Steinwart. Consistency of support vector machines and other regularized kernel classi-\n\n\ufb01ers. IEEE Transactions on Information Theory, 51(1):128\u2013142, 2005.\n\n[5] Peter L. Bartlett, Michael Jordan, and Jon McAuliffe. 
Convexity, classification and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
[6] Tong Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5:1225–1251, 2004.
[7] Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8:1007–1025, 2007.
[8] Ingo Steinwart. How to compare different loss functions and their risks. Constructive Approximation, 26:225–287, 2007.
[9] David Cossock and Tong Zhang. Statistical analysis of Bayes optimal subset ranking. IEEE Transactions on Information Theory, 54(11):5140–5154, 2008.
[10] Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to learning to rank: Theory and algorithm. In International Conference on Machine Learning, 2008.
[11] John Duchi, Lester Mackey, and Michael Jordan. On the consistency of ranking algorithms. In International Conference on Machine Learning, 2010.
[12] Pradeep Ravikumar, Ambuj Tewari, and Eunho Yang. On NDCG consistency of listwise ranking methods. In International Conference on Artificial Intelligence and Statistics, 2011.
[13] David Buffoni, Clément Calauzènes, Patrick Gallinari, and Nicolas Usunier. Learning scoring functions with order-preserving losses and standardized supervision. In International Conference on Machine Learning, 2011.
[14] Wei Gao and Zhi-Hua Zhou. On the consistency of multi-label learning. In Conference on Learning Theory, 2011.
[15] Clément Calauzènes, Nicolas Usunier, and Patrick Gallinari. On the (non-)existence of convex, calibrated surrogate losses for ranking. In Advances in Neural Information Processing Systems 25, pages 197–205, 2012.
[16] Harish G. Ramaswamy and Shivani Agarwal.
Classification calibration dimension for general multiclass losses. In Advances in Neural Information Processing Systems 25, pages 2087–2095, 2012.
[17] Yanyan Lan, Jiafeng Guo, Xueqi Cheng, and Tie-Yan Liu. Statistical consistency of ranking methods in a rank-differentiable probability space. In Advances in Neural Information Processing Systems 25, pages 1241–1249, 2012.
[18] Quoc V. Le and Alex Smola. Direct optimization of ranking measures. arXiv:0704.3359, 2007.
[19] Yisong Yue, Thomas Finley, Filip Radlinski, and Thorsten Joachims. A support vector method for optimizing average precision. In Proceedings of the 30th ACM SIGIR International Conference on Research and Development in Information Retrieval, 2007.