{"title": "Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 1896, "page_last": 1904, "abstract": "All the existing multi-task local learning methods are defined on homogeneous neighborhood which consists of all data points from only one task. In this paper, different from existing methods, we propose local learning methods for multi-task classification and regression problems based on heterogeneous neighborhood which is defined on data points from all tasks. Specifically, we extend the k-nearest-neighbor classifier by formulating the decision function for each data point as a weighted voting among the neighbors from all tasks where the weights are task-specific. By defining a regularizer to enforce the task-specific weight matrix to approach a symmetric one, a regularized objective function is proposed and an efficient coordinate descent method is developed to solve it. For regression problems, we extend the kernel regression to multi-task setting in a similar way to the classification case. Experiments on some toy data and real-world datasets demonstrate the effectiveness of our proposed methods.", "full_text": "Heterogeneous-Neighborhood-based Multi-Task\n\nLocal Learning Algorithms\n\nDepartment of Computer Science, Hong Kong Baptist University\n\nyuzhang@comp.hkbu.edu.hk\n\nYu Zhang\n\nAbstract\n\nAll the existing multi-task local learning methods are de\ufb01ned on homogeneous\nneighborhood which consists of all data points from only one task. In this paper,\ndifferent from existing methods, we propose local learning methods for multi-\ntask classi\ufb01cation and regression problems based on heterogeneous neighborhood\nwhich is de\ufb01ned on data points from all tasks. 
Specifically, we extend the k-nearest-neighbor classifier by formulating the decision function for each data point as a weighted voting among the neighbors from all tasks, where the weights are task-specific. By defining a regularizer to enforce the task-specific weight matrix to approach a symmetric one, a regularized objective function is proposed and an efficient coordinate descent method is developed to solve it. For regression problems, we extend kernel regression to the multi-task setting in a similar way to the classification case. Experiments on some toy data and real-world datasets demonstrate the effectiveness of our proposed methods.

1 Introduction

For single-task learning, besides global learning methods there are local learning methods [7], e.g., the k-nearest-neighbor (KNN) classifier and kernel regression. Different from global learning methods, local learning methods make use of the locality structure in different regions of the feature space and are complementary to global learning algorithms. In many applications, single-task local learning methods have shown performance comparable to their global counterparts. Moreover, besides classification and regression problems, local learning methods have also been applied to other learning problems, e.g., clustering [18] and dimensionality reduction [19]. When the number of labeled data points is not very large, however, the performance of local learning methods is limited due to sparse local density [14]. In this case, we can leverage useful information from other related tasks to help improve the performance, which matches the philosophy of multi-task learning [8, 4, 16]. Multi-task learning utilizes supervised information from some related tasks to improve the performance of the task at hand, and during the past decades many advanced methods have been proposed for multi-task learning, e.g., [17, 3, 9, 1, 2, 6, 12, 20, 14, 13]. Among those methods, [17, 14] are
two representative multi-task local learning methods. Even though both methods in [17, 14] use KNN as the base learner for each task, Thrun and O'Sullivan [17] focus on learning the cluster structure among different tasks while Parameswaran and Weinberger [14] learn different distance metrics for different tasks. The KNN classifiers used in both methods are defined on a homogeneous neighborhood, which is the set of nearest data points from the same task the query point belongs to. In some situations, it is better to use a heterogeneous neighborhood, which is defined as the set of nearest data points from all tasks. For example, suppose we have two similar tasks marked with two colors as shown in Figure 1. For a test data point marked with '?' from one task, we obtain an estimation with low confidence, or even a wrong one, based on the homogeneous neighborhood. However, if we can use the data points from both tasks to define the neighborhood (i.e., a heterogeneous neighborhood), we can obtain a more confident estimation.

In this paper, we propose novel local learning models for multi-task learning based on the heterogeneous neighborhood. For multi-task classification problems, we extend the KNN classifier by formulating the decision function on each data point as a weighted voting of its neighbors from all tasks, where the weights are task-specific. Since multi-task learning usually considers that the contribution of one task to another equals that in the reverse direction, we define a regularizer to enforce the task-specific weight matrix to approach a symmetric matrix, and based on this regularizer a regularized objective function is proposed. We develop an efficient coordinate descent method to solve it. Moreover, we also propose a local method for multi-task regression problems.
Specifically, we extend the kernel regression method to the multi-task setting in a similar way to the classification case. Experiments on some toy data and real-world datasets demonstrate the effectiveness of our proposed methods.

Figure 1: Data points with one color (i.e., black or red) are from the same task and those with one type of marker (i.e., '+' or '-') are from the same class. A test data point is represented by '?'.

2 A Multi-Task Local Classifier based on Heterogeneous Neighborhood

In this section, we propose a local classifier for multi-task learning by generalizing the KNN algorithm, which is one of the most widely used local classifiers for single-task learning.

Suppose we are given m learning tasks {T_i}_{i=1}^m. The training set consists of n triples (x_i, y_i, t_i), with the ith data point x_i ∈ R^D, its label y_i ∈ {−1, 1} and its task indicator t_i ∈ {1, . . . , m}. So each task is a binary classification problem with n_i = |{j | t_j = i}| data points belonging to the ith task T_i.

For the ith data point x_i, we use N_k(i) to denote the set of the indices of its k nearest neighbors. If N_k(i) is a homogeneous neighborhood which only contains data points from the task that x_i belongs to, we can use d(x_i) = sgn(Σ_{j∈N_k(i)} s(i, j) y_j) to make a decision for x_i, where sgn(·) denotes the sign function and s(i, j) denotes a similarity function between x_i and x_j.
Here, by defining N_k(i) as a heterogeneous neighborhood which contains data points from all tasks, we cannot directly utilize this decision function; instead we introduce a weighted decision function using task-specific weights as

d(x_i) = sgn( Σ_{j∈N_k(i)} w_{t_i,t_j} s(i, j) y_j ),

where w_{qr} represents the contribution of the rth task T_r to the qth task T_q when T_r has some data points that are neighbors of a data point from T_q. Of course, the contribution from one task to itself should be positive and also the largest, i.e., w_{ii} ≥ 0 and −w_{ii} ≤ w_{ij} ≤ w_{ii} for j ≠ i. When w_{qr} (q ≠ r) approaches w_{qq}, it means T_r is very similar to T_q in local regions. At the other extreme, where w_{qr} (q ≠ r) approaches −w_{qq}, if we flip the labels of the data points in T_r, then T_r has a positive contribution −w_{qr} to T_q, which indicates that T_r is negatively correlated with T_q. Moreover, when w_{qr}/w_{qq} (q ≠ r) is close to 0, which implies there is no contribution from T_r to T_q, T_r is likely to be unrelated to T_q. So the utilization of {w_{qr}} can model three task relationships: positive task correlation, negative task correlation and task unrelatedness, as in [6, 20].

We define the estimation function as f(x_i) = Σ_{j∈N_k(i)} w_{t_i,t_j} s(i, j) y_j. Then, similar to the support vector machine (SVM), we use the hinge loss l(y, y') = max(0, 1 − y y') to measure the empirical performance on the training data. Moreover, recall that w_{qr} represents the contribution of T_r to T_q and w_{rq} is the contribution of T_q to T_r. Since multi-task learning usually considers that the contribution of T_r to T_q almost equals that of T_q to T_r, we expect w_{qr} to be close to w_{rq}.
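As a concrete illustration, the weighted voting rule above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation; the heat-kernel similarity, the function names and the toy data below are our own assumptions for the example.

```python
import math

def similarity(a, b, sigma=1.0):
    # Heat-kernel similarity between two points (an assumed choice here;
    # the same form is used later in the paper's experiments).
    d2 = sum((u - v) ** 2 for u, v in zip(a, b))
    return math.exp(-d2 / (2 * sigma ** 2))

def mt_knn_decision(x, query_task, neighbors, W):
    """Weighted vote over a heterogeneous neighborhood.

    neighbors: list of (x_j, y_j, t_j) triples (the k nearest points, any task)
    W: task-weight matrix; W[q][r] is the contribution of task r to task q.
    Returns +1 or -1.
    """
    score = sum(W[query_task][t_j] * similarity(x, x_j) * y_j
                for (x_j, y_j, t_j) in neighbors)
    return 1 if score >= 0 else -1

# Toy example: two positively correlated tasks, so off-diagonal weights
# are large and a neighbor borrowed from the other task helps the vote.
W = [[1.0, 0.8],
     [0.8, 1.0]]
neighbors = [((0.0, 0.1), +1, 0),   # same-task neighbor
             ((0.1, 0.0), +1, 1),   # neighbor borrowed from the other task
             ((0.9, 0.9), -1, 1)]
print(mt_knn_decision((0.0, 0.0), 0, neighbors, W))  # prints 1
```

Setting the off-diagonal entries of W near zero would make the borrowed neighbors irrelevant, recovering a per-task KNN vote.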
To encode this prior information into our model, we can either formulate it as a hard constraint w_{qr} = w_{rq}, or as a soft regularizer, i.e., minimizing (w_{qr} − w_{rq})² to enforce w_{qr} ≈ w_{rq}; the soft version is preferred here. Combining all the above considerations, we construct an objective function for our proposed method MT-KNN as

min_W Σ_{i=1}^n l(y_i, f(x_i)) + (λ1/4) ||W − W^T||_F² + (λ2/2) ||W||_F²
s.t. w_{qq} ≥ 0, w_{qq} ≥ w_{qr} ≥ −w_{qq} (q ≠ r),    (1)

where W is an m×m matrix with w_{qr} as its (q, r)th element and ||·||_F denotes the Frobenius norm of a matrix. The first term in the objective function of problem (1) measures the training loss, the second one enforces W to be close to a symmetric matrix, which implies w_{qr} ≈ w_{rq}, and the last one penalizes the complexity of W. The regularization parameters λ1 and λ2 balance the trade-off between these three terms.

2.1 Optimization Procedure

In this section, we discuss how to solve problem (1). We first rewrite f(x_i) as f(x_i) = Σ_{j=1}^m w_{t_i,j} (Σ_{l∈N_k^j(i)} s(i, l) y_l) = w_{t_i} x̂_i, where N_k^j(i) denotes the set of the indices of x_i's nearest neighbors from the jth task within N_k(i), w_{t_i} = (w_{t_i,1}, . . . , w_{t_i,m}) is the t_i-th row of W, and x̂_i is an m×1 vector whose jth element is Σ_{l∈N_k^j(i)} s(i, l) y_l. Then we can reformulate problem (1) as

min_W Σ_{i=1}^m Σ_{j∈T_i} l(y_j, w_i x̂_j) + (λ1/4) ||W − W^T||_F² + (λ2/2) ||W||_F²
s.t. w_{qq} ≥ 0, w_{qq} ≥ w_{qr} ≥ −w_{qq} (q ≠ r).    (2)

To solve problem (2), we use a coordinate descent method, also known in the literature as alternating optimization. By adopting the hinge loss in problem (2), the optimization problem for w_{ik} (k ≠ i) is formulated as

min_{w_{ik}} (λ/2) w_{ik}² − β_{ik} w_{ik} + Σ_{j∈T_i} max(0, a^j_{ik} w_{ik} + b^j_{ik})
s.t. c_{ik} ≤ w_{ik} ≤ e_{ik},    (3)

where λ = λ1 + λ2, β_{ik} = λ1 w_{ki}, x̂_{jk} is the kth element of x̂_j, a^j_{ik} = −y_j x̂_{jk}, b^j_{ik} = 1 − y_j Σ_{t≠k} w_{it} x̂_{jt}, c_{ik} = −w_{ii}, and e_{ik} = w_{ii}. If the objective function of problem (3) only had the first two terms, it would be a univariate quadratic programming (QP) problem with a linear inequality constraint, leading to an analytical solution. Moreover, similar to SVM, we can introduce slack variables for the third term in the objective function of problem (3), which turns it into a QP problem with n_i + 1 variables and 2n_i + 1 linear constraints. We can use off-the-shelf software to solve this problem in polynomial time. However, the whole optimization procedure may not be very efficient, since we need to solve problem (3), and hence call a QP solver, many times.
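One can instead exploit the piecewise-linear structure of the hinge terms to avoid a QP solver altogether. The following is a simple, hedged Python sketch of that idea for a subproblem of the form (3): it sorts the breakpoints −b_j/a_j and solves a closed-form univariate quadratic on each interval. Unlike the more efficient single-scan procedure developed below (Table 1), this sketch recomputes the active set per interval; the function name is illustrative.

```python
def solve_piecewise_qp(lam, beta, terms, c, e):
    """Minimize lam/2*w^2 - beta*w + sum_j max(0, a_j*w + b_j) over [c, e].

    terms: list of (a_j, b_j) pairs; assumes c < e. Returns the minimizer.
    """
    # Breakpoints where some hinge term switches on/off inside [c, e].
    pts = sorted({c, e} | {-b / a for a, b in terms if a != 0 and c < -b / a < e})
    best_w, best_obj = None, float("inf")
    for left, right in zip(pts, pts[1:]):
        mid = 0.5 * (left + right)
        # Hinge terms that are active (positive) throughout this interval.
        act = [(a, b) for a, b in terms if a * mid + b > 0]
        sum_a = sum(a for a, _ in act)
        c1 = sum_a - beta               # linear coefficient on this interval
        c0 = sum(b for _, b in act)     # constant term on this interval
        # Closed-form minimizer of lam/2*w^2 + c1*w + c0, clipped to the interval.
        w = min(right, max(left, (beta - sum_a) / lam))
        obj = 0.5 * lam * w * w + c1 * w + c0
        if obj < best_obj:
            best_w, best_obj = w, obj
    return best_w
```

A dense grid search over [c, e] agrees with this solver up to discretization error, which makes a convenient sanity check.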
Here we utilize the piecewise linear structure of the last term in the objective function of problem (3) and propose a more efficient solution. We assume all a^j_{ik} are non-zero; otherwise we can discard the corresponding terms without affecting the solution, since their losses are constants. We define six index sets as

C1 = {j | a^j_{ik} > 0, −b^j_{ik}/a^j_{ik} < c_{ik}},  C2 = {j | a^j_{ik} > 0, c_{ik} ≤ −b^j_{ik}/a^j_{ik} ≤ e_{ik}},  C3 = {j | a^j_{ik} > 0, −b^j_{ik}/a^j_{ik} > e_{ik}},
C4 = {j | a^j_{ik} < 0, −b^j_{ik}/a^j_{ik} < c_{ik}},  C5 = {j | a^j_{ik} < 0, c_{ik} ≤ −b^j_{ik}/a^j_{ik} ≤ e_{ik}},  C6 = {j | a^j_{ik} < 0, −b^j_{ik}/a^j_{ik} > e_{ik}}.

It is easy to show that when j ∈ C1 ∪ C6, where the operator ∪ denotes the union of sets, a^j_{ik} w + b^j_{ik} > 0 holds for all w ∈ [c_{ik}, e_{ik}], corresponding to the set of data points with non-zero loss. Oppositely, when j ∈ C3 ∪ C4, the corresponding losses are zero, since a^j_{ik} w + b^j_{ik} ≤ 0 holds for all w ∈ [c_{ik}, e_{ik}]. The variation lies in the data points with indices j ∈ C2 ∪ C5. We sort the sequence {−b^j_{ik}/a^j_{ik} | j ∈ C2} and record it in a vector u of length d_u with u_1 ≤ . . . ≤ u_{d_u}. Moreover, we keep an index mapping q^u with its rth element q^u_r defined as q^u_r = j if u_r = −b^j_{ik}/a^j_{ik}. Similarly, for the sequence {−b^j_{ik}/a^j_{ik} | j ∈ C5}, we define a sorted vector v of length d_v and the corresponding index mapping q^v. By using the merge-sort algorithm, we merge u and v into a sorted vector s, and then add c_{ik} and e_{ik} to s as the minimum and maximum elements if they are not already contained in it. Obviously, in the range [s_l, s_{l+1}], where s_l is the lth element of s and d_s is the length of s, problem (3) becomes a univariate QP problem which has an analytical solution. So we can compute local minima in successive regions [s_l, s_{l+1}] (l = 1, . . . , d_s − 1) and obtain the global minimum over [c_{ik}, e_{ik}] by comparing all local optima. The key operation is to compute the coefficients of the quadratic function over each region [s_l, s_{l+1}], and we devise an algorithm in Table 1 which only needs to scan s once, leading to an efficient solution for problem (3).

The first step of the algorithm in Table 1 needs O(n_i) time to construct the six sets C1 to C6. In step 2, we sort two sequences to obtain u and v in O(d_u ln d_u + d_v ln d_v) time and merge them to get s in O(d_u + d_v) time. It then costs O(n_i) to calculate the coefficients c_0 and c_1 by scanning C1, C2 and C6 in steps 4 and 5. From step 6 to step 13, we scan the vector s once, which costs O(d_u + d_v) time. The overall complexity of the algorithm in Table 1 is O(d_u ln d_u + d_v ln d_v + n_i), which is at most O(n_i ln n_i) since d_u + d_v ≤ n_i.

For w_{ii}, the optimization problem is formulated as
min_{w_{ii}} (λ2/2) w_{ii}² + Σ_{j∈T_i} max(0, a^j_i w_{ii} + b^j_i)   s.t. w_{ii} ≥ c_i,    (4)

where a^j_i = −y_j x̂_{ji}, b^j_i = 1 − y_j Σ_{t≠i} w_{it} x̂_{jt}, c_i = max(0, max_{j≠i} |w_{ij}|), and |·| denotes the absolute value of a scalar.

Table 1: Algorithm for problem (3)
01: Construct the six sets C1, C2, C3, C4, C5 and C6;
02: Construct u, q^u, v, q^v and s;
03: Insert c_{ik} and e_{ik} into s if needed;
04: c_0 := Σ_{j∈C1∪C2∪C6} b^j_{ik};
05: c_1 := Σ_{j∈C1∪C2∪C6} a^j_{ik} − β_{ik};
06: w := s_{d_s};
07: o := c_0 + c_1 w + λw²/2;
    for l = d_s − 1 to 1
08:   if s_{l+1} = u_r for some r: c_0 := c_0 − b^{q^u_r}_{ik}; c_1 := c_1 − a^{q^u_r}_{ik}; end if
09:   if s_{l+1} = v_r for some r: c_0 := c_0 + b^{q^v_r}_{ik}; c_1 := c_1 + a^{q^v_r}_{ik}; end if
10:   w_0 := min(s_{l+1}, max(s_l, −c_1/λ));
11:   o_0 := c_0 + c_1 w_0 + λw_0²/2;
12:   if o_0 < o: w := w_0; o := o_0; end if
13:   l := l − 1;
    end for

The main difference between problems (3) and (4) is that there is a box constraint on w_{ik} in problem (3), whereas in problem (4) w_{ii} is only lower-bounded. We define e_i = max_j {−b^j_i/a^j_i} over all a^j_i ≠ 0. For w_{ii} ∈ [e_i, +∞), the objective function of problem (4) can be reformulated as (λ2/2) w_{ii}² + Σ_{j∈S} (a^j_i w_{ii} + b^j_i), where S = {j | a^j_i > 0}, and the minimum over [e_i, +∞) is attained at w^{(1)}_{ii} = max{e_i, −Σ_{j∈S} a^j_i / λ2}. Then we can use the algorithm in Table 1 to find the minimizer w^{(2)}_{ii} in the interval [c_i, e_i] for problem (4). Finally, we choose the optimal solution to problem (4) from {w^{(1)}_{ii}, w^{(2)}_{ii}} by comparing the corresponding values of the objective function.

Since the complexity of solving both problems (3) and (4) is O(n_i ln n_i), the complexity of one update of the whole matrix W is O(m Σ_{i=1}^m n_i ln n_i). Usually the coordinate descent algorithm converges within a small number of iterations, and hence the whole algorithm to solve problem (2), and thus (1), is very efficient.

We can use other loss functions in problem (2) instead of the hinge loss, e.g., the square loss l(s, t) = (s − t)², as in the least squares SVM [10]. It is easy to show that problem (3) then has the analytical solution

w_{ik} = min( e_{ik}, max( c_{ik}, (β_{ik} − 2 Σ_{j∈T_i} a^j_{ik} b^j_{ik}) / (λ + 2 Σ_{j∈T_i} (a^j_{ik})²) ) ),

and the solution to problem (4) can be computed as

w_{ii} = max( c_i, (−2 Σ_{j∈T_i} a^j_i b^j_i) / (λ2 + 2 Σ_{j∈T_i} (a^j_i)²) ).

The computational complexity of the whole algorithm to solve problem (2) with the square loss is O(mn).

3 A Multi-Task Local Regressor based on Heterogeneous Neighborhood

In this section, we consider the situation where each task is a regression problem with each label y_i ∈ R. Similar to the classification case in the previous section, one candidate for a multi-task local regressor is a generalization of kernel regression, the counterpart of the KNN classifier for regression problems, with the estimation function

f(x_i) = ( Σ_{j∈N_k(i)} w_{t_i,t_j} s(i, j) y_j ) / ( Σ_{j∈N_k(i)} w_{t_i,t_j} s(i, j) ),    (5)

where w_{qr} again represents the contribution of T_r to T_q. Since the denominator of f(x_i) is a linear combination of the elements in one row of W with data-dependent combination coefficients, if we use a formulation similar to problem (1) with the square loss, we need to solve a complex and non-convex fractional programming problem.
For computational considerations, we resort to another way to construct the multi-task local regressor.

Recall that the estimation function for the classification case is formulated as f(x_i) = Σ_{j=1}^m w_{t_i,j} (Σ_{l∈N_k^j(i)} s(i, l) y_l). The expression in the brackets can be viewed as a prediction for x_i based on its neighbors in the jth task. Inspired by this observation, we can construct a prediction ŷ^i_j for x_i based on its neighbors from the jth task by utilizing any regressor, e.g., kernel regression or support vector regression. Due to the local nature of our proposed method, we choose kernel regression, which is itself a local regression method, as a good candidate, and hence

ŷ^i_j = ( Σ_{l∈N_k^j(i)} s(i, l) y_l ) / ( Σ_{l∈N_k^j(i)} s(i, l) ).

When j equals t_i, which means we use neighboring data points from the task that x_i belongs to, we can use this prediction with confidence. However, if j ≠ t_i, we cannot totally trust the prediction and need to weight it by a confidence w_{t_i,j}. Then, by using the square loss, we formulate an optimization problem to obtain the estimation function f(x_i) based on {ŷ^i_j} as

f(x_i) = arg min_y Σ_{j=1}^m w_{t_i,j} (y − ŷ^i_j)² = ( Σ_{j=1}^m w_{t_i,j} ŷ^i_j ) / ( Σ_{j=1}^m w_{t_i,j} ).    (6)

Compared with the regression function obtained by directly extending kernel regression to multi-task learning in Eq. (5), the denominator of our proposed regressor in Eq. (6) only involves a row sum of W, which makes the optimization problem easier to solve, as we will see later. Since the scale of w_{ij} does not affect the value of the estimation function in Eq. (6), we constrain the rows of W to sum to 1, i.e., Σ_{j=1}^m w_{ij} = 1 for i = 1, . . . , m.
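The two-stage construction above (per-task kernel-regression predictions, then a task-weighted average as in Eq. (6)) can be sketched as follows. This is an illustrative sketch with assumed inputs and function names, not the authors' code.

```python
def kernel_regression(query_sims, labels):
    """Per-task prediction: similarity-weighted average of neighbor labels."""
    num = sum(s * y for s, y in zip(query_sims, labels))
    den = sum(query_sims)
    return num / den

def mt_local_regress(per_task_neighbors, w_row):
    """Eq.-(6)-style estimate: weighted average of per-task predictions.

    per_task_neighbors: list over tasks of (sims, labels) for the query point.
    w_row: the row of W for the query point's task (constrained to sum to 1).
    """
    y_hat = [kernel_regression(s, y) for s, y in per_task_neighbors]
    num = sum(w * yh for w, yh in zip(w_row, y_hat))
    den = sum(w_row)
    return num / den

# Two tasks: the query's own task predicts 1.0, the other task predicts 2.0,
# and the row weights blend the two predictions.
per_task = [([1.0, 1.0], [1.0, 1.0]), ([1.0], [2.0])]
print(mt_local_regress(per_task, [0.75, 0.25]))  # -> 1.25
```

Note that because the row sums to 1, the denominator in Eq. (6) is a constant, which is exactly what makes the resulting learning problem tractable compared with Eq. (5).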
Moreover, the estimation ŷ^i_{t_i}, obtained from data of the same task as x_i, is more trustworthy than the estimations based on other tasks, which suggests that w_{ii} should be the largest element in the ith row. Combined with the row-sum constraint, this implies w_{ii} ≥ (1/m) Σ_k w_{ik} = 1/m > 0. To capture negative task correlations, w_{ij} (i ≠ j) is only required to be a real scalar with w_{ij} ≥ −w_{ii}. Combining the above considerations, we formulate an optimization problem as

min_W Σ_{i=1}^m Σ_{j∈T_i} (w_i ŷ_j − y_j)² + (λ1/4) ||W − W^T||_F² + (λ2/2) ||W||_F²
s.t. W1 = 1, w_{ii} ≥ w_{ij} ≥ −w_{ii},    (7)

where 1 denotes a vector of all ones of appropriate size and ŷ_j = (ŷ^j_1, . . . , ŷ^j_m)^T. In the following section, we discuss how to optimize problem (7).

3.1 Optimization Procedure

Due to the linear equality constraints in problem (7), we cannot apply a coordinate descent method that updates the variables one by one as for problem (2). However, similar to the SMO algorithm [15] for SVM, we can update two variables in one row of W at a time to keep the linear equality constraints valid. We update each row one by one, and the optimization problem with respect to w_i is formulated as

min_{w_i} (1/2) w_i A w_i^T + w_i b^T   s.t. Σ_{j=1}^m w_{ij} = 1, −w_{ii} ≤ w_{ij} ≤ w_{ii} ∀ j ≠ i,    (8)

where A = 2 Σ_{j∈T_i} ŷ_j ŷ_j^T + λ1 I^i_m + λ2 I_m, I_m is the m×m identity matrix, I^i_m is a copy of I_m with its (i, i)th element set to 0, b = −2 Σ_{j∈T_i} y_j ŷ_j^T − λ1 c_i^T, and c_i is the ith column of W with its ith element set to 0. We define the Lagrangian as

J = (1/2) w_i A w_i^T + w_i b^T − α (Σ_{j=1}^m w_{ij} − 1) − Σ_{j≠i} (w_{ii} − w_{ij}) β_j − Σ_{j≠i} (w_{ii} + w_{ij}) γ_j.

The Karush-Kuhn-Tucker (KKT) optimality conditions are

∂J/∂w_{ij} = w_i a_j + b_j − α + β_j − γ_j = 0  for j ≠ i,    (9)
∂J/∂w_{ii} = w_i a_i + b_i − α − Σ_{k≠i} (β_k + γ_k) = 0,    (10)
β_j ≥ 0, (w_{ii} − w_{ij}) β_j = 0  ∀ j ≠ i,    (11)
γ_j ≥ 0, (w_{ii} + w_{ij}) γ_j = 0  ∀ j ≠ i,    (12)

where a_j is the jth column of A and b_j is the jth element of b. It is easy to show that β_j γ_j = 0 for all j ≠ i. When w_{ij} = w_{ii}, Eq. (12) gives γ_j = 0, and further w_i a_j + b_j = α − β_j ≤ α according to Eq. (9). When w_{ij} = −w_{ii}, Eq. (11) gives β_j = 0, and then w_i a_j + b_j = α + γ_j ≥ α. For w_{ij} between these two extremes (i.e., −w_{ii} < w_{ij} < w_{ii}), β_j = γ_j = 0 according to Eqs. (11) and (12), which implies w_i a_j + b_j = α. Moreover, Eq. (10) implies that w_i a_i + b_i = α + Σ_{k≠i} (β_k + γ_k) ≥ α. We define the sets S1 = {j | w_{ij} = w_{ii}, j ≠ i}, S2 = {j | −w_{ii} < w_{ij} < w_{ii}}, S3 = {j | w_{ij} = −w_{ii}}, and S4 = {i}. Then a feasible w_i is a stationary point of problem (8) if and only if max_{j∈S1∪S2} {w_i a_j + b_j} ≤ min_{k∈S2∪S3∪S4} {w_i a_k + b_k}.
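To make the update pattern concrete, here is a minimal Python sketch of one such two-variable step on a row, following the violating-pair selection and the closed-form step size described next. It is deliberately simplified relative to the paper's procedure: it only moves pairs with j, k ≠ i, so w_ii stays fixed and both coordinates live in the box [−w_ii, w_ii]. Function and variable names are our own.

```python
def smo_row_step(w, A, b, i):
    """One SMO-style step on row w (length m) of problem (8): pick a violating
    pair (j, k), set w[j] += t and w[k] -= t so that sum(w) stays fixed.

    Simplified sketch: coordinate i itself is never moved (j, k != i).
    Returns True if a pair was updated, False if no improving pair exists.
    """
    m = len(w)
    # Gradient of 0.5*w*A*w^T + w*b^T (A is symmetric by construction).
    g = [sum(w[s] * A[s][j] for s in range(m)) + b[j] for j in range(m)]
    # j may decrease (w[j] > -w[i]); k may increase (w[k] < w[i]).
    dec = [j for j in range(m) if j != i and w[j] > -w[i]]
    inc = [k for k in range(m) if k != i and w[k] < w[i]]
    if not dec or not inc:
        return False
    j = max(dec, key=lambda l: g[l])
    k = min(inc, key=lambda l: g[l])
    if g[j] <= g[k] + 1e-12:
        return False  # no violating pair among the coordinates considered
    denom = A[j][j] + A[k][k] - 2 * A[j][k]
    t = (g[k] - g[j]) / denom               # unconstrained step (negative here)
    t = max(t, -w[i] - w[j], w[k] - w[i])   # clip to keep both inside the box
    w[j] += t
    w[k] -= t
    return True
```

Each accepted step keeps the row sum unchanged and strictly decreases the quadratic objective, so repeating it until no violating pair remains converges to a stationary point of the simplified subproblem.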
If there exists a pair of indices (j, k), where j ∈ S1 ∪ S2 and k ∈ S2 ∪ S3 ∪ S4, satisfying w_i a_j + b_j > w_i a_k + b_k, then {j, k} is called a violating pair. If the current estimate w_i is not an optimal solution, there must exist some violating pairs. Our SMO algorithm updates a violating pair at each step by choosing the most violating pair {j, k}, with j = arg max_{l∈S1∪S2} {w_i a_l + b_l} and k = arg min_{l∈S2∪S3∪S4} {w_i a_l + b_l}. We define the update rule for w_{ij} and w_{ik} as w̃_{ij} = w_{ij} + t and w̃_{ik} = w_{ik} − t. Noting that j cannot be i, t must satisfy the following constraints for the updated solution to remain feasible:

when k = i:  t − w_{ik} ≤ w_{ij} + t ≤ w_{ik} − t,  and  t − w_{ik} ≤ w_{il} ≤ w_{ik} − t  ∀ l ≠ j, l ≠ k;
when k ≠ i:  −w_{ii} ≤ w_{ij} + t ≤ w_{ii},  and  −w_{ii} ≤ w_{ik} − t ≤ w_{ii}.

When k = i, this yields the constraint t ≤ e ≡ min( (w_{ik} − w_{ij})/2, min_{l≠j, l≠k} (w_{ik} − |w_{il}|) ); otherwise t satisfies c ≤ t ≤ e, where c = max(w_{ik} − w_{ii}, −w_{ij} − w_{ii}) and e = min(w_{ii} − w_{ij}, w_{ii} + w_{ik}). The optimization problem for t can then be unified as

min_t ((a_{jj} + a_{kk} − 2a_{jk})/2) t² + (w_i a_j + b_j − w_i a_k − b_k) t   s.t. c ≤ t ≤ e,

where for the case k = i, c is set to −∞. This problem has the analytical solution

t = min( e, max( c, (w_i a_k + b_k − w_i a_j − b_j) / (a_{jj} + a_{kk} − 2a_{jk}) ) ).

We update each row of W one by one until convergence.

4 Experiments

In this section, we test the empirical performance of our proposed methods on some toy data and real-world problems.

4.1 Toy Problems

We first use one UCI dataset, the diabetes data, to analyze the learned matrix W. The diabetes data consist of 768 data points from two classes. We randomly select p percent of the data points to form the training sets of two learning tasks. The regularization parameters λ1 and λ2 are fixed to 1 and the number of nearest neighbors is set to 5. For both p = 20 and p = 40, all entries of the mean of the estimated W over 10 trials are close to 0.1 (e.g., the first diagonal entries are 0.1014 for p = 20 and 0.1025 for p = 40). This shows that w_{ij} (j ≠ i) is very close to w_{ii} for i = 1, 2, which implies our method can find that these two tasks are positively correlated, matching our expectation since the two tasks are drawn from the same distribution.

For the second experiment, we again randomly select p percent of the data points to form the training sets of two learning tasks, but we flip the labels of one task so that the two tasks should be negatively correlated. In the W's learned for p = 20 and p = 40, the diagonal entries are close to 0.1 while the off-diagonal entries are close to −0.1, i.e., w_{ij} (j ≠ i) is very close to −w_{ii} for i = 1, 2, which is what we expect.

As the third problem, we construct two learning tasks as in the first one but flip 50% of the class labels in each class of the two tasks. Here the two tasks can be viewed as unrelated tasks since the label assignment is random.
The estimated W's for p = 20 and p = 40 have diagonal entries (0.1575, 0.1281, 0.1015 and 0.1077 across the two settings) much larger in magnitude than the off-diagonal entries w_{ij} (i ≠ j), whose magnitudes are all below 0.04. From the structure of these estimates, we can see that the two tasks are more likely to be unrelated, matching our expectation. In summary, our method can learn positive correlation, negative correlation and task unrelatedness on these toy problems.

4.2 Experiments on Classification Problems

Two multi-task classification problems are used in our experiments. The first is a handwritten letter classification application consisting of seven tasks, each of which is to distinguish two letters: c/e, g/y, m/n, a/g, a/o, f/t and h/n. Each class in each task has about 1000 data points with 128 features corresponding to the pixel values of handwritten letter images. The second is the USPS digit classification problem, which consists of nine binary classification tasks, each of which is to classify two digits. Each class in each task contains about 1000 data points with 255 features.

Table 2: Comparison of classification errors of different methods on the two classification problems in the form of mean±std.

Method          | Letter          | USPS
KNN             | 0.0775±0.0053   | 0.0445±0.0131
mtLMNN          | 0.0511±0.0053   | 0.0141±0.0038
MTFL            | 0.0505±0.0038   | 0.0140±0.0025
MT-KNN(hinge)   | 0.0466±0.0023   | 0.0114±0.0013
MT-KNN(square)  | 0.0494±0.0028   | 0.0124±0.0014

Here the similarity function we use is a heat kernel s(i, j) = exp{−||x_i − x_j||²/(2σ²)}, where σ is set to the mean pairwise Euclidean distance among the training data.
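The similarity setup just described can be sketched as below; the helper name and the toy points are our own, but the σ rule (mean pairwise Euclidean distance) follows the text.

```python
import math
from itertools import combinations

def heat_kernel_similarities(points):
    """Pairwise heat-kernel similarities, with sigma set to the mean
    pairwise Euclidean distance as described above."""
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    pairs = list(combinations(range(len(points)), 2))
    sigma = sum(dist(points[i], points[j]) for i, j in pairs) / len(pairs)
    s = {(i, j): math.exp(-dist(points[i], points[j]) ** 2 / (2 * sigma ** 2))
         for i, j in pairs}
    return sigma, s

sigma, s = heat_kernel_similarities([(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)])
```

Closer pairs receive similarities nearer to 1, and the data-driven σ keeps the kernel scale comparable across datasets with different feature magnitudes.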
We use 5-fold cross validation to determine the optimal λ1 and λ2, whose candidate values are chosen from n × {0.01, 0.1, 0.5, 1, 5, 10, 100}, and the optimal number of nearest neighbors from {5, 10, 15, 20}. The classification error is used as the performance measure. We compare our method, denoted MT-KNN, with the KNN classifier, which is a single-task learning method; the multi-task large margin nearest neighbor (mtLMNN) method [14]¹, a multi-task local learning method based on the homogeneous neighborhood; and the multi-task feature learning (MTFL) method [2], a global method for multi-task learning. By utilizing the hinge and square losses, we also consider two variants of our MT-KNN method. To mimic the real-world situation where training data are usually limited, we randomly select 20% of the whole dataset as training data and use the rest as the test set. The random selection is repeated 10 times and we record the results in Table 2. From the results, we can see that our MT-KNN method outperforms the KNN, mtLMNN and MTFL methods, which demonstrates that the introduction of the heterogeneous neighborhood helps improve the performance. Between the two loss functions, MT-KNN with the hinge loss is better than MT-KNN with the square loss, owing to the greater robustness of the hinge loss.

For these two problems, we also compare our proposed coordinate descent method described in Table 1 with off-the-shelf solvers such as the CVX solver [11] with respect to running time. The platform used to run the experiments is a desktop with an Intel i7 CPU at 2.7GHz and 8GB RAM, and we use Matlab 2009b for implementation and experiments.
We record the average running time over 100 trials in Figure 2. From the results, we can see that on the classification problems above, our proposed coordinate descent method is much faster than the CVX solver, which demonstrates its efficiency.

Figure 2: Comparison of the average running time over 100 trials between our proposed coordinate descent methods and the CVX solver on the classification and regression problems.

4.3 Experiments on Regression Problems

Here we study a multi-task regression problem: learning the inverse dynamics of a seven-degree-of-freedom SARCOS anthropomorphic robot arm.² The objective is to predict seven joint torques based on 21 input features, corresponding to seven joint positions, seven joint velocities and seven joint accelerations. So each task corresponds to the prediction of one torque and can be formulated as a regression problem. Each task has 2000 data points. The similarity function used here is also the heat kernel, and 5-fold cross validation is used to determine the hyperparameters, i.e., λ1, λ2 and k. The performance measure is the normalized mean squared error (nMSE), i.e., the mean squared error on the test data divided by the variance of the ground truth. As shown in Figure 3, we compare our method, denoted MT-KR, with single-task kernel regression (KR) and the multi-task feature learning (MTFL) method under different sizes of the training set. Compared with the KR and MTFL methods, our method achieves better performance over the different training-set sizes. Moreover, we compare our proposed coordinate descent method introduced in Section 3.1 with the CVX solver and record the results in the last two columns of Figure 2.

¹http://www.cse.wustl.edu/~kilian/code/files/mtLMNN.zip
²http://www.gaussianprocess.org/gpml/data/
We \ufb01nd the running time of our proposed method is\nmuch smaller than that of the CVX solver which demonstrates that the proposed coordinate descent\nmethod can speed up the computation of our MT-KR method.\n\nFigure 3: Comparison of different methods on the robot arm application when varying the size of\nthe training set.\n\n4.4 Sensitivity Analysis\n\nHere we test the sensitivity of the performance\nwith respect to the number of nearest neighbors.\nBy changing the number of nearest neighbors\nfrom 5 to 40 at an interval of 5, we record the\nmean of the performance of our method over 10\ntrials in Figure 4. From the results, we can see\nour method is not very sensitive to the number\nof nearest neighbors, which makes the setting\nof k not very dif\ufb01cult.\n\n5 Conclusion\n\nFigure 4: Sensitivity analysis of the performance\nof our method with respect to the number of near-\nest neighbors at different data sets.\n\nIn this paper, we develop local learning meth-\nods for multi-task classi\ufb01cation and regression\nproblems. Based on an assumption that all task\npairs contributes to each other almost equally,\nwe propose regularized objective functions and develop ef\ufb01cient coordinate descent methods to\nsolve them. Up to here, each task in our studies is a binary classi\ufb01cation problem. In some applica-\ntions, there may be more than two classes in each task. So we are interested in an extension of our\nmethod to multi-task multi-class problems. Currently the task-speci\ufb01c weights are shared by all data\npoints from one task. 
One interesting research direction is to investigate a localized variant where different data points have different task-specific weights based on their locality structure.

Acknowledgment

Yu Zhang is supported by the HKBU 'Start Up Grant for New Academics'.

References
[1] R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853, 2005.
[2] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 41–48, Vancouver, British Columbia, Canada, 2006.
[3] B. Bakker and T. Heskes. Task clustering and gating for Bayesian multitask learning. Journal of Machine Learning Research, 4:83–99, 2003.
[4] J. Baxter. A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning, 28(1):7–39, 1997.
[5] J. C. Bezdek and R. J. Hathaway. Convergence of alternating optimization. Neural, Parallel & Scientific Computations, 11(4):351–368, 2003.
[6] E. Bonilla, K. M. A. Chai, and C. Williams. Multi-task Gaussian process prediction. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 153–160, Vancouver, British Columbia, Canada, 2007.
[7] L. Bottou and V. Vapnik. Local learning algorithms. Neural Computation, 4(6):888–900, 1992.
[8] R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
[9] T. Evgeniou and M. Pontil. Regularized multi-task learning.
In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 109–117, Seattle, Washington, USA, 2004.
[10] T. V. Gestel, J. A. K. Suykens, B. Baesens, S. Viaene, J. Vanthienen, G. Dedene, B. De Moor, and J. Vandewalle. Benchmarking least squares support vector machine classifiers. Machine Learning, 54(1):5–32, 2004.
[11] M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming, 2011.
[12] L. Jacob, F. Bach, and J.-P. Vert. Clustered multi-task learning: a convex formulation. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 745–752, Vancouver, British Columbia, Canada, 2008.
[13] A. Kumar and H. Daumé III. Learning task grouping and overlap in multi-task learning. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012.
[14] S. Parameswaran and K. Weinberger. Large margin multi-task metric learning. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1867–1875, 2010.
[15] J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods: Support Vector Learning. MIT Press, 1998.
[16] S. Thrun. Is learning the n-th thing any easier than learning the first? In D. S. Touretzky, M. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 640–646, Denver, CO, 1995.
[17] S. Thrun and J. O'Sullivan. Discovering structure in multiple learning tasks: The TC algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 489–497, Bari, Italy, 1996.
[18] M. Wu and B. Schölkopf.
A local learning approach for clustering. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 1529–1536, Vancouver, British Columbia, Canada, 2006.
[19] M. Wu, K. Yu, S. Yu, and B. Schölkopf. Local learning projections. In Proceedings of the Twenty-Fourth International Conference on Machine Learning, pages 1039–1046, Corvallis, Oregon, USA, 2007.
[20] Y. Zhang and D.-Y. Yeung. A convex formulation for learning task relationships in multi-task learning. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, pages 733–742, Catalina Island, California, 2010.