{"title": "Manifold-based Similarity Adaptation for Label Propagation", "book": "Advances in Neural Information Processing Systems", "page_first": 1547, "page_last": 1555, "abstract": "Label propagation is one of the state-of-the-art methods for semi-supervised learning, which estimates labels by propagating label information through a graph. Label propagation assumes that data points (nodes) connected in a graph should have similar labels. Consequently, the label estimation heavily depends on edge weights in a graph which represent similarity of each node pair. We propose a method for a graph to capture the manifold structure of input features using edge weights parameterized by a similarity function. In this approach, edge weights represent both similarity and local reconstruction weight simultaneously, both being reasonable for label propagation. For further justification, we provide analytical considerations including an interpretation as a cross-validation of a propagation model in the feature space, and an error analysis based on a low dimensional manifold model. Experimental results demonstrated the effectiveness of our approach both in synthetic and real datasets.", "full_text": "Manifold-based Similarity Adaptation\n\nfor Label Propagation\n\nMasayuki Karasuyama and Hiroshi Mamitsuka\n\nBioionformatics Center, Institute for Chemical Research, Kyoto University, Japan\n\n{karasuyama,mami}@kuicr.kyoto-u.ac.jp\n\nAbstract\n\nLabel propagation is one of the state-of-the-art methods for semi-supervised learn-\ning, which estimates labels by propagating label information through a graph.\nLabel propagation assumes that data points (nodes) connected in a graph should\nhave similar labels. Consequently, the label estimation heavily depends on edge\nweights in a graph which represent similarity of each node pair. We propose a\nmethod for a graph to capture the manifold structure of input features using edge\nweights parameterized by a similarity function. 
In this approach, edge weights represent both similarity and local reconstruction weight simultaneously, both being reasonable for label propagation. For further justification, we provide analytical considerations including an interpretation as a cross-validation of a propagation model in the feature space, and an error analysis based on a low dimensional manifold model. Experimental results demonstrated the effectiveness of our approach both in synthetic and real datasets.

1 Introduction

Graph-based learning algorithms have received considerable attention in the machine learning community. For example, label propagation (e.g., [1, 2]) is widely accepted as a state-of-the-art approach for semi-supervised learning, in which node labels are estimated through the input graph structure. A common important property of these graph-based approaches is that the manifold structure of the input data can be captured by the graph. Their practical performance advantage has been demonstrated in various application areas.

On the other hand, it is well-known that the accuracy of the graph-based methods highly depends on the quality of the input graph (e.g., [1, 3–5]), which is typically generated from a set of numerical input vectors (i.e., feature vectors). A general framework of graph-based learning can be represented as the following three-step procedure:

Step 1: Generating graph edges from given data, where nodes of the generated graph correspond to the instances of input data.
Step 2: Giving weights to the graph edges.
Step 3: Estimating node labels based on the generated graph, which is often represented as an adjacency matrix.

In this paper, we focus on the second step in the three-step procedure: estimating edge weights for the subsequent label estimation. Optimizing edge weights is difficult in semi-supervised learning, because there are only a small number of labeled instances.
Also this problem is important because edge weights heavily affect the final prediction accuracy of graph-based methods, while in reality rather simple heuristic strategies have been employed.

There are two standard approaches for estimating edge weights: similarity function based- and locally linear embedding (LLE) [6] based-approaches. Each of these two approaches has its own disadvantage. The similarity based approaches use similarity functions, such as the Gaussian kernel, while most similarity functions have tuning parameters (such as the width parameter of the Gaussian kernel) that are in general difficult to tune. On the other hand, in LLE, the true underlying manifold can be approximated by a graph which minimizes a local reconstruction error. LLE is more sophisticated than the similarity-based approach, and LLE based graphs have been applied to semi-supervised learning [5, 7–9]. However, LLE is noise-sensitive [10]. In addition, to avoid a kind of degeneracy problem [11], LLE has to have additional tuning parameters.

Our approach is a similarity-based method, yet it also captures the manifold structure of the input data; we refer to our approach as adaptive edge weighting (AEW). In AEW, graph edges are determined in a data-adaptive manner in terms of both similarity and manifold structure. The objective function in AEW is based on local reconstruction, by which estimated weights capture the manifold structure, where each edge is parameterized as a similarity function of each node pair. Consequently, in spite of its simplicity, AEW has the following three advantages:

• Compared to LLE based approaches, our formulation alleviates the problem of over-fitting due to the parameterization of weights.
In our experiments, we observed that AEW is robust against noise of input data using a synthetic dataset, and we also show the performance advantage of AEW on eight real-world datasets.

• Similarity based representation of edge weights is reasonable for label propagation because transitions of labels are determined by those weights, while edge weights obtained by LLE approaches may not represent node similarity.

• AEW does not have additional tuning parameters such as regularization parameters. Although the number of edges in a graph cannot be determined by AEW, we show that the performance of AEW is robust against the number of edges compared to standard heuristics and an LLE based approach.

We provide further justifications for our approach based on the ideas of feature propagation and local linear approximation. Our objective function can be seen as a cross-validation error of a propagation model for feature vectors, which we call feature propagation. This allows us to interpret that AEW optimizes graph weights through cross-validation (for prediction) in the feature vector space instead of the label space, assuming that input feature vectors and given labels share the same local structure. Another interpretation is provided through local linear approximation, by which we can analyze the error of local reconstruction in the output (label) space under the assumption of a low dimensional manifold model.

2 Graph-based Semi-supervised Learning

In this paper we use label propagation, which is one of the state-of-the-art graph-based learning algorithms, as the method in the third step of the three-step procedure. Suppose that we have n feature vectors X = {x_1, ..., x_n}, where x_i ∈ R^p. An undirected graph G is generated from X, where each node (or vertex) corresponds to each data point x_i.
The graph G can be represented by the adjacency matrix W ∈ R^{n×n}, where the (i, j)-element W_{ij} is the weight of the edge between x_i and x_j. The key idea of graph-based algorithms is the so-called manifold assumption, in which instances connected by large weights W_{ij} on a graph have similar labels (meaning that labels change smoothly on the graph).

For the adjacency matrix W, the following weighted k-nearest neighbor (k-NN) graph is commonly used in graph-based learning algorithms [1]:

$$W_{ij} = \begin{cases} \exp\left( -\sum_{d=1}^{p} \frac{(x_{id} - x_{jd})^2}{\sigma_d^2} \right), & j \in N_i \text{ or } i \in N_j, \\ 0, & \text{otherwise}, \end{cases} \qquad (1)$$

where x_{id} is the d-th element of x_i, N_i is the set of indices of the k-NN of x_i, and {σ_d}_{d=1}^p is a set of parameters. [1] shows this weighting can also be interpreted as the solution of the heat equation on the graph.

From this adjacency matrix, the graph Laplacian can be defined by

$$L = D - W,$$

where D is a diagonal matrix with the diagonal entries D_{ii} = \sum_j W_{ij}. Instead of L, normalized variants of the Laplacian such as L = I - D^{-1} W or L = I - D^{-1/2} W D^{-1/2} are also used, where I ∈ R^{n×n} is the identity matrix.

Among several label propagation algorithms, we mainly use the formulation by [1], which is the standard formulation of graph-based semi-supervised learning. Suppose that the first ℓ data points in X are labeled by Y = {y_1, ..., y_ℓ}, where y_i ∈ {1, ..., c} and c is the number of classes. The goal of label propagation is to predict the labels of the unlabeled nodes {x_{ℓ+1}, ..., x_n}. The scoring matrix F gives an estimation of the label of x_i by argmax_j F_{ij}. Label propagation can be defined as estimating F in such a way that the score F changes smoothly on a given graph while it predicts the given labeled points.
The following is the standard formulation of label propagation, called the harmonic Gaussian field (HGF) model [1]:

$$\min_{F} \; \mathrm{trace}\left( F^\top L F \right) \quad \text{subject to} \quad F_{ij} = Y_{ij}, \; \text{for } i = 1, \ldots, \ell,$$

where Y is the label matrix with Y_{ij} = 1 if x_i is labeled as y_i = j, and Y_{ij} = 0 otherwise. In this formulation, the scores for labeled nodes are fixed as constants. This formulation can be reduced to linear systems, which can be solved efficiently, especially when the Laplacian L has some sparse structure.

3 Basic Framework of Proposed Approach

The performance of label propagation heavily depends on the quality of the input graph. Our proposed approach, adaptive edge weighting (AEW), optimizes edge weights for the graph-based learning algorithms. We note that AEW is for the second step of the three-step procedure and has nothing to do with the first and third steps, meaning that any methods in the first and third steps can be combined with AEW. In this paper we consider an input graph generated as a k-NN graph (the first step is based on k-NN), while we note that AEW can be applied to any type of graph.

First of all, graph edges should satisfy the following conditions:

• Capturing the manifold structure of the input space.
• Representing similarity between two nodes.

These two conditions are closely related to the manifold assumption of graph-based learning algorithms, in which labels vary smoothly along the input manifold. Since the manifold structure of the input data is unknown beforehand, the graph is used to approximate the manifold (the first condition). Subsequent predictions are performed in such a way that labels change smoothly according to the similarity structure provided by the graph (the second condition).
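For concreteness, the weighted k-NN graph (1), the Laplacian L = D − W, and the HGF label estimate can be sketched numerically as follows. This is a minimal dense-matrix illustration with our own function names; a real implementation would exploit the sparsity of the k-NN graph.

```python
import numpy as np

def knn_gaussian_graph(X, k=3, sigma=1.0):
    # Weighted k-NN graph with Gaussian weights, cf. Eq. (1) with a single shared sigma.
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]   # k nearest neighbors, skipping the point itself
        W[i, nbrs] = np.exp(-d2[i, nbrs] / sigma ** 2)
    return np.maximum(W, W.T)               # edge if j in N_i or i in N_j

def hgf_predict(W, y_labeled, n_classes):
    # Harmonic Gaussian field: scores of labeled nodes are fixed, the remaining
    # scores solve the linear system L_uu F_u = W_ul Y_l (labeled nodes come first).
    l = len(y_labeled)
    L = np.diag(W.sum(axis=1)) - W          # unnormalized graph Laplacian L = D - W
    Y = np.eye(n_classes)[np.asarray(y_labeled)]  # one-hot labels for labeled nodes
    F_u = np.linalg.solve(L[l:, l:], W[l:, :l] @ Y)
    return F_u.argmax(axis=1)               # label estimate for each unlabeled node
```

With two well-separated clusters and one labeled point per class, the unlabeled points inherit the label of their own cluster, as the manifold assumption suggests.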
Our algorithm simultaneously pursues these two important aspects of the graph for the graph-based learning algorithms.

We define W_{ij} as a similarity function of two nodes (1), using the Gaussian kernel in this paper (note: other similarity functions can also be used). We estimate σ_d so that the graph represents the manifold structure of the input data, through the following optimization problem:

$$\min_{\{\sigma_d\}_{d=1}^p} \; \sum_{i=1}^{n} \left\| x_i - \frac{1}{D_{ii}} \sum_{j \sim i} W_{ij} x_j \right\|_2^2, \qquad (2)$$

where j ∼ i means that j is connected to i. This minimizes the reconstruction error of local linear approximation, which captures the input manifold structure, in terms of the parameters of the similarity function. We describe the motivation and analytical properties of the objective function in Section 4, and its advantages over existing approaches, including well-known locally linear embedding (LLE) [6] based methods, in Section 5.

To optimize (2), we can use any gradient-based algorithm such as steepest descent or conjugate gradient (in the later experiments, we used the steepest descent method). Due to the non-convexity of the objective function, we cannot guarantee that solutions converge to the global optimum, which means that the solutions depend on the initial σ_d. In our experiments, we employed the well-known median heuristics (e.g., [12]) for setting the initial values of σ_d (Section 6). Another possible strategy is to use a number of different initial values for σ_d, which needs a high computational cost. The gradient can be computed efficiently, due to the sparsity of the adjacency matrix. Since the number of edges of a k-NN graph is O(nk), the derivative of the adjacency matrix W can be calculated in O(nkp). Then the entire derivative of the objective function can be calculated in O(nkp^2).
Note that k often takes a small value such as k = 10.

4 Analytical Considerations

In Section 3, we defined our approach as the minimization of the local reconstruction error of input features. We describe several interesting properties and interpretations of this definition.

4.1 Derivation from Feature Propagation

First, we show that our objective function can be interpreted as a cross-validation error of the HGF model for the feature vector x on the graph. Let us divide a set of node indices {1, ..., n} into a training set T and a validation set V. Suppose that we try to predict x in the validation set {x_i}_{i∈V} from the given training set {x_i}_{i∈T} and the adjacency matrix W. For this prediction problem, we consider the HGF model for x:

$$\min_{\hat{X}} \; \mathrm{trace}\left( \hat{X}^\top L \hat{X} \right) \quad \text{subject to} \quad \hat{x}_{ij} = x_{ij}, \; \text{for } i \in T,$$

where $X = (x_1, x_2, \ldots, x_n)^\top$, $\hat{X} = (\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_n)^\top$, and $x_{ij}$ and $\hat{x}_{ij}$ indicate the (i, j)-th entries of $X$ and $\hat{X}$, respectively. In this formulation, $\hat{x}_i$ corresponds to a prediction for $x_i$. Note that only the $\hat{x}_i$ in the validation set V are regarded as free variables in the optimization problem, because the other $\{\hat{x}_i\}_{i \in T}$ are fixed at the observed values by the constraint. This can be interpreted as propagating $\{x_i\}_{i \in T}$ to predict $\{x_i\}_{i \in V}$. We call this process feature propagation.

When we employ leave-one-out as the cross-validation of the feature propagation model, we obtain

$$\sum_{i=1}^{n} \| x_i - \hat{x}_{-i} \|_2^2, \qquad (3)$$

where $\hat{x}_{-i}$ is the prediction for $x_i$ with T = {1, ..., i-1, i+1, ..., n} and V = {i}. Due to the local averaging property of HGF [1], we see that $\hat{x}_{-i} = \sum_j W_{ij} x_j / D_{ii}$, and then (3) is equivalent to our objective function (2).
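This leave-one-out reconstruction error is simple to evaluate directly. Below is a minimal numerical sketch under our own naming (a dense matrix is used for brevity):

```python
import numpy as np

def reconstruction_error(X, W):
    # Objective (2): sum_i || x_i - (1/D_ii) sum_{j~i} W_ij x_j ||_2^2.
    # By the local averaging property of HGF, the inner weighted average is exactly
    # the leave-one-out prediction \hat{x}_{-i} of the feature propagation model.
    D = W.sum(axis=1)
    X_hat = (W @ X) / D[:, None]   # local weighted average of each node's neighbors
    return ((X - X_hat) ** 2).sum()
```

For instance, a point sitting exactly at the weighted average of its neighbors contributes zero to this error, while end points of a chain typically contribute a positive amount.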
From this equivalence, AEW can be interpreted as the parameter optimization of the graph weights of the HGF model for feature vectors through leave-one-out cross-validation. This also means that our framework estimates labels using the adjacency matrix W optimized in the feature space instead of the output (label) space. Thus, if input features and labels share the same adjacency matrix (i.e., share the same local structure), the minimization of the objective function (2) should estimate the adjacency matrix which accurately propagates the labels of graph nodes.

4.2 Local Linear Approximation

The feature propagation model provides the interpretation of our approach as the optimization of the adjacency matrix under the assumption that x and y can be reconstructed by the same adjacency matrix. We here justify this assumption in a more formal way from the viewpoint of local reconstruction with a lower dimensional manifold model.

As shown in [1], HGF can be regarded as a local reconstruction method, which means the prediction can be represented as weighted local averages:

$$F_{ik} = \frac{\sum_j W_{ij} F_{jk}}{D_{ii}} \quad \text{for } i = \ell + 1, \ldots, n.$$

We show the relationship between the local reconstruction error in the feature space described by our objective function (2) and that in the output space. For simplicity we consider the vector form of the score function f ∈ R^n, which can be considered as a special case of the score matrix F; the discussion here can be applied to F. The same analysis can be approximately applied to other graph learning methods such as local and global consistency [2], because it has a similar local averaging form as the above, though we omit the details here.

We assume the following manifold model for the input feature space, in which x is generated from a corresponding lower dimensional variable τ ∈ R^q: x = g(τ) + ε_x, where g : R^q → R^p is a smooth function, and ε_x ∈ R^p represents noise.
In this model, y is also represented by some function form of τ: y = h(τ) + ε_y, where h : R^q → R is a smooth function, and ε_y ∈ R represents noise (for simplicity, we consider a continuous output rather than discrete labels). For this model, the following theorem shows the relationship between the reconstruction error of the feature vector x and that of the output y:

Theorem 1. Suppose x_i can be approximated by its neighbors as follows:

$$x_i = \frac{1}{D_{ii}} \sum_{j \sim i} W_{ij} x_j + e_i, \qquad (4)$$

where e_i ∈ R^p represents an approximation error. Then, the same adjacency matrix reconstructs the output y_i ∈ R with the following error:

$$y_i - \frac{1}{D_{ii}} \sum_{j \sim i} W_{ij} y_j = J e_i + O(\delta \tau_i) + O(\varepsilon_x + \varepsilon_y), \qquad (5)$$

where $J = \frac{\partial h(\tau_i)}{\partial \tau^\top} \left( \frac{\partial g(\tau_i)}{\partial \tau^\top} \right)^{+}$, with the superscript + indicating the pseudoinverse, and $\delta \tau_i = \max_j \left( \| \tau_i - \tau_j \|_2^2 \right)$.

See our supplementary material for the proof of this theorem. From (5), we can see that the reconstruction error of y_i consists of three terms. The first term includes the reconstruction error for x_i, which is represented by e_i, and the second term is the distance between τ_i and {τ_j}_{j∼i}. These two terms have a kind of trade-off relationship, because we can reduce e_i if we use many data points x_j, but then δτ_i would increase. The third term is the intrinsic noise, which we cannot directly control.
In spite of its importance, this simple relationship has not been focused on in the context of graph estimation for semi-supervised learning, in which an LLE based objective function has been used without clear justification [5, 7–9].

A simple approach to exploit this theorem would be a regularization formulation, which can be a minimization of a combination of the reconstruction error for x and a penalization term for distances between data points connected by edges. Regularized LLE [5, 8, 13, 14] can be interpreted as one realization of such an approach. However, in semi-supervised learning, selecting appropriate values of the regularization parameter is difficult. We therefore optimize edge weights through the parameters of the similarity function, especially the bandwidth parameter σ of the Gaussian similarity function. In this approach, a very large bandwidth (giving large weights to distant data points) may cause a large reconstruction error, while an extremely small bandwidth causes the problem of not giving enough weight to reconstruct.

For the symmetric normalized graph Laplacian, we cannot apply Theorem 1 to our algorithm; see the supplementary material for a modified version of Theorem 1 for the normalized Laplacian. In the experiments, we also report results for the normalized Laplacian and show that our approach can improve prediction accuracy as in the case of the unnormalized Laplacian.

5 Related Topics

LLE [6] can also estimate graph edges based on a similar objective function, in which W is directly optimized as a real valued matrix. This manner has been used in many methods for graph-based semi-supervised learning and clustering [5, 7–9], but LLE is very noise-sensitive [10], and the resulting weights W_{ij} cannot necessarily represent similarity between the corresponding nodes (i, j).
For example, for two nearly identical points x_{j1} and x_{j2}, both connecting to x_i, it is not guaranteed that W_{ij1} and W_{ij2} have similar values. To solve this problem, a regularization term can be introduced [11], while it is not easy to optimize the regularization parameter for this term. On the other hand, we optimize the parameters of the similarity (kernel) function. This parameterized form of edge weights can alleviate the over-fitting problem. Moreover, obviously, the optimized weights still represent the node similarity.

Although several model selection approaches (such as cross-validation and marginal likelihood maximization) have been applied to optimizing graph edge weights by regarding them as usual hyper-parameters in supervised learning [3, 4, 15], most of them need labeled instances and become unreliable in cases with few labels. Another approach is optimizing some criterion designed specifically for graph-based algorithms (e.g., [1, 16]). These criteria often have degenerate (trivial) solutions, which heuristics are used to prevent, but the validity of those heuristics is not clear. Compared to these approaches, our approach is more general and flexible with respect to problem settings, because AEW is independent of the number of classes, the number of labels, and the subsequent label estimation algorithm. In addition, model selection based approaches are basically for the third step in the three-step procedure, so AEW can be combined with such methods, in the sense that the graph optimized by AEW can be used as the input graph of these methods.

Besides k-NN, there have been several methods generating a graph (edges) from feature vectors (e.g., [9, 17]). Our approach can also be applied to those graphs, because AEW only optimizes the weights of edges. In our experiments, we used the edges of the k-NN graph as the initial graph of AEW.
We then observed that AEW is not sensitive to the choice of k, compared with usual k-NN graphs. This is because the Gaussian similarity value becomes small if x_i and x_j are not close to each other, so as to minimize the reconstruction error (2). In other words, redundant weights can be reduced drastically, because in the Gaussian kernel, weights decay exponentially with the squared distance.

Metric learning is another approach to adapting similarity, while metric learning is not for graphs. A standard method for incorporating graph information into metric learning is to use some graph-based regularization, in which graph weights must be determined beforehand. For example, in [18], a graph is generated by LLE, whose disadvantages we have already described. Another approach is [19], which estimates a distance metric so that the k-NN graph in terms of the obtained metric reproduces a given graph. This approach is however not for semi-supervised learning, and it is unclear if it works in semi-supervised settings. Overall, metric learning has been developed in a context different from our setting, and it has not been justified that metric learning can be applied to label propagation.

6 Experiments

We evaluated the performance of our approach using synthetic and real-world datasets. We investigated the performance of AEW using the harmonic Gaussian field (HGF) model. For comparison, we used linear neighborhood propagation (LNP) [5], which generates a graph using an LLE based objective function. LNP can have two regularization parameters, one of which is for the LLE process (the first and second steps in the three-step procedure), and the other is for the label estimation process (the third step in the three-step procedure). For the parameter in the LLE process, we used the heuristics suggested by [11], and for the label propagation process, we chose the best parameter value in terms of the test accuracy.
HGF does not have such hyper-parameters. All results were averaged over 30 runs with randomly sampled data points.

6.1 Synthetic datasets

We here use the two datasets in Figure 1, which have the same form, but Figure 1 (b) has several noisy data points which may become bridge points (points connecting different classes [5]). In both cases, the number of classes is 4 and each class has 100 data points (thus, n = 400).

Table 1 shows the error rates for the unlabeled nodes of HGF and LNP under 0-1 loss. For HGF, we used the median heuristics to choose the parameter σ_d in the similarity function (1), meaning that a common σ (= σ_1 = ... = σ_p) is set as the median distance between all connected pairs of x_i. The symmetric normalized version of the graph Laplacian was used. The optimization of AEW started from the median σ_d. The results by AEW are shown in the column 'AEW + HGF' of Table 1. The number of labeled nodes was 10 in each class (ℓ = 40, i.e., 10% of the entire datasets), and the number of neighbors in the graphs was set as k = 10 or 20.

In Table 1, we see that HGF with AEW achieved better prediction accuracies than the median heuristics and LNP in all cases. Moreover, for both datasets (a) and (b), AEW was most robust against the change of the number of neighbors k. This is because σ_d is automatically adjusted in such a way that the local reconstruction error is minimized and then weights for connections between

Figure 1: Synthetic datasets.

Table 1: Test error comparison for synthetic datasets.
The best methods according to a t-test with significance level of 5% are highlighted with boldface.

data  k    HGF           AEW + HGF     LNP
(a)   10   .057 (.039)   .020 (.027)   .039 (.026)
(a)   20   .261 (.048)   .020 (.028)   .103 (.042)
(b)   10   .119 (.054)   .073 (.035)   .103 (.038)
(b)   20   .280 (.051)   .077 (.035)   .148 (.047)

Figure 2: Resulting graphs for the synthetic dataset of Figure 1 (a) (k = 20): (a) k-NN, (b) AEW, (c) LNP.

different manifolds are reduced. Although LNP also minimizes the local reconstruction error, LNP may connect data points far from each other if this reduces the reconstruction error.

Figure 2 shows the graphs generated by (a) k-NN, (b) AEW, and (c) LNP, under k = 20 for the dataset of Figure 1 (a). In Figure 2, the k-NN graph connects many nodes in different classes, while AEW favorably eliminates those undesirable edges. LNP also has fewer edges between different classes compared to k-NN, but it still connects different classes.
AEW reveals the class structure more clearly, which can lead to better prediction performance of subsequent learning algorithms.

Table 2: List of datasets.

dataset   n     p     # classes
COIL      500   256   10
USPS      1000  256   10
MNIST     1000  784   10
ORL       360   644   40
Vowel     792   10    11
Yale      250   1200  5
optdigit  1000  256   10
UMIST     518   644   20

6.2 Real-world datasets

We examined the performance of our approach on the eight popular datasets shown in Table 2, namely COIL (COIL-20) [20], USPS (a preprocessed version from [21]), MNIST [22], ORL [23], Vowel [24], Yale (Yale Face Database B) [25], optdigit [24], and UMIST [26].

We evaluated two variants of the HGF model. In what follows, 'HGF' indicates HGF using the unnormalized graph Laplacian L = D - W, and 'N-HGF' indicates HGF using the symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}. For both variants, the median heuristics was used to set σ_d. To adapt to differences of local scale, we here use the local scaling kernel [27] as the similarity function. Figure 3 shows the test error for unlabeled nodes. In this figure, two dashed lines with different markers are by HGF and N-HGF, while two solid lines with the same markers are by HGF with AEW. The performance difference within the variants of HGF was not large compared to the effect of AEW, particularly in COIL, ORL, Vowel, Yale, and UMIST. We can rather see that AEW substantially improved the prediction accuracy of HGF in most cases. LNP is shown by the solid line without any markers. LNP outperformed HGF (without AEW, shown as the dashed lines) in COIL, ORL, Vowel, Yale and UMIST, while HGF with AEW (at least one of the three variants) achieved better performance than LNP in all these datasets except for Yale (in Yale, LNP and HGF with AEW attained a similar accuracy).

Overall, AEW-N-HGF had the best prediction accuracy, where typical examples were USPS and MNIST.
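The unnormalized and normalized HGF variants compared here differ only in how node degrees enter the Laplacian. A minimal sketch of the three forms from Section 2, under our own naming:

```python
import numpy as np

def laplacian_variants(W):
    # The three graph Laplacians from Section 2:
    #   L     = D - W                     (unnormalized, used by 'HGF')
    #   L_rw  = I - D^{-1} W              (random-walk normalization)
    #   L_sym = I - D^{-1/2} W D^{-1/2}   (symmetric normalization, used by 'N-HGF')
    d = W.sum(axis=1)
    I = np.eye(len(d))
    L = np.diag(d) - W
    L_rw = I - W / d[:, None]
    L_sym = I - W / np.sqrt(np.outer(d, d))
    return L, L_rw, L_sym
```

All three annihilate the constant vector in the appropriate sense (rows of L and L_rw sum to zero), and L_sym remains symmetric whenever W is, which is what makes it convenient for spectral analysis.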
Although Theorem 1 exactly holds only for AEW-HGF, we can see that AEW-N-HGF, in which the degrees of the graph nodes are scaled by the normalized Laplacian, had highly stable performance.

We further examined the effect of k. Figure 4 shows the test error for k = 20 and 10, using N-HGF, AEW-N-HGF, and LNP for the COIL dataset. The number of labeled instances is the middle value on the horizontal axis of Figure 3 (a) (5 in each class). We can see that the test error of AEW is not sensitive to k. Performance of N-HGF with k = 20 was worse than that with k = 10. On the other hand, AEW-N-HGF with k = 20 had a similar performance to that with k = 10.

Figure 3: Performance comparison on real-world datasets: (a) COIL, (b) USPS, (c) MNIST, (d) ORL, (e) Vowel, (f) Yale, (g) optdigit, (h) UMIST. Horizontal axes show # labeled instances in each class; vertical axes show test error rate. HGFs with AEW are by solid lines with markers, HGFs with the median heuristics are by dashed lines with the same markers, and LNP is by a solid line without any markers. For N-HGF and AEW-N-HGF, 'N' indicates normalized Laplacian.

7 Conclusions

We have proposed the adaptive edge weighting (AEW) method for graph-based semi-supervised learning. AEW is based on the local reconstruction with the constraint that each edge represents the similarity of each pair of nodes. Due to this constraint, AEW has numerous advantages over LLE based approaches. For example, the noise sensitivity of LLE can be alleviated by the parameterized form of the edge weights, and the similarity form of the edge weights is very reasonable for graph-based methods. We also provide several interesting properties of AEW, by which our objective function can be motivated analytically. We examined the performance of AEW by using two synthetic and eight real benchmark datasets.
Experimental results demonstrated that AEW can improve the performance of the harmonic Gaussian field (HGF) model substantially, and we also saw that AEW outperformed LLE-based approaches on all real datasets except one.

Figure 4: Comparison of test error rates for k = 10 and 20 (COIL, ℓ = 50). The two boxplots of each method correspond to k = 10 on the left (with a smaller width) and k = 20 on the right (with a larger width).

References

[1] X. Zhu, Z. Ghahramani, and J. D. Lafferty, "Semi-supervised learning using Gaussian fields and harmonic functions," in Proc. of the 20th ICML (T. Fawcett and N. Mishra, eds.), pp. 912–919, AAAI Press, 2003.

[2] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, "Learning with local and global consistency," in Advances in NIPS 16 (S. Thrun, L. Saul, and B. Schölkopf, eds.), MIT Press, 2004.

[3] A. Kapoor, Y. A. Qi, H. Ahn, and R. Picard, "Hyperparameter and kernel learning for graph based semi-supervised classification," in Advances in NIPS 18 (Y. Weiss, B. Schölkopf, and J. Platt, eds.), pp. 627–634, MIT Press, 2006.

[4] X. Zhang and W. S. Lee, "Hyperparameter learning for graph based semi-supervised learning algorithms," in Advances in NIPS 19 (B. Schölkopf, J. Platt, and T. Hoffman, eds.), pp. 1585–1592, MIT Press, 2007.

[5] F. Wang and C. Zhang, "Label propagation through linear neighborhoods," IEEE TKDE, vol. 20, pp. 55–67, 2008.

[6] S. Roweis and L. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.

[7] S. I. Daitch, J. A. Kelner, and D. A. Spielman, "Fitting a graph to vector data," in Proc. of the 26th ICML, (New York, NY, USA), pp. 201–208, ACM, 2009.

[8] H. Cheng, Z.
Liu, and J. Yang, "Sparsity induced similarity measure for label propagation," in IEEE 12th ICCV, pp. 317–324, IEEE, 2009.

[9] W. Liu, J. He, and S.-F. Chang, "Large graph construction for scalable semi-supervised learning," in Proc. of the 27th ICML, pp. 679–686, Omnipress, 2010.

[10] J. Chen and Y. Liu, "Locally linear embedding: a survey," Artificial Intelligence Review, vol. 36, pp. 29–48, 2011.

[11] L. K. Saul and S. T. Roweis, "Think globally, fit locally: unsupervised learning of low dimensional manifolds," JMLR, vol. 4, pp. 119–155, Dec. 2003.

[12] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola, "A kernel method for the two-sample-problem," in Advances in NIPS 19 (B. Schölkopf, J. C. Platt, and T. Hoffman, eds.), pp. 513–520, MIT Press, 2007.

[13] E. Elhamifar and R. Vidal, "Sparse manifold clustering and embedding," in Advances in NIPS 24 (J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, eds.), pp. 55–63, 2011.

[14] D. Kong, C. H. Ding, H. Huang, and F. Nie, "An iterative locally linear embedding algorithm," in Proc. of the 29th ICML (J. Langford and J. Pineau, eds.), pp. 1647–1654, Omnipress, 2012.

[15] X. Zhu, J. Kandola, Z. Ghahramani, and J. Lafferty, "Nonparametric transforms of graph kernels for semi-supervised learning," in Advances in NIPS 17 (L. K. Saul, Y. Weiss, and L. Bottou, eds.), pp. 1641–1648, MIT Press, 2005.

[16] F. R. Bach and M. I. Jordan, "Learning spectral clustering," in Advances in NIPS 16 (S. Thrun, L. K. Saul, and B. Schölkopf, eds.), 2004.

[17] T. Jebara, J. Wang, and S.-F. Chang, "Graph construction and b-matching for semi-supervised learning," in Proc. of the 26th ICML (A. P. Danyluk, L. Bottou, and M. L. Littman, eds.), pp. 441–448, ACM, 2009.

[18] M. S.
Baghshah and S. B. Shouraki, "Metric learning for semi-supervised clustering using pairwise constraints and the geometrical structure of data," Intelligent Data Analysis, vol. 13, no. 6, pp. 887–899, 2009.

[19] B. Shaw, B. Huang, and T. Jebara, "Learning a distance metric from a network," in Advances in NIPS 24 (J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, eds.), pp. 1899–1907, 2011.

[20] S. A. Nene, S. K. Nayar, and H. Murase, "Columbia object image library," tech. rep., CUCS-005-96, 1996.

[21] T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer-Verlag, 2001.

[22] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[23] F. Samaria and A. Harter, "Parameterisation of a stochastic model for human face identification," in Proceedings of the Second IEEE Workshop on Applications of Computer Vision, pp. 138–142, 1994.

[24] A. Asuncion and D. J. Newman, "UCI machine learning repository." http://www.ics.uci.edu/~mlearn/MLRepository.html, 2007.

[25] A. Georghiades, P. Belhumeur, and D. Kriegman, "From few to many: Illumination cone models for face recognition under variable lighting and pose," IEEE TPAMI, vol. 23, no. 6, pp. 643–660, 2001.

[26] D. B. Graham and N. M. Allinson, "Characterizing virtual eigensignatures for general purpose face recognition," in Face Recognition: From Theory to Applications; NATO ASI Series F, Computer and Systems Sciences (H. Wechsler, P. J. Phillips, V. Bruce, F. Fogelman-Soulie, and T. S. Huang, eds.), vol. 163, pp. 446–456, 1998.

[27] L. Zelnik-Manor and P.
Perona, "Self-tuning spectral clustering," in Advances in NIPS 17, pp. 1601–1608, MIT Press, 2004.