{"title": "Prediction on a Graph with a Perceptron", "book": "Advances in Neural Information Processing Systems", "page_first": 577, "page_last": 584, "abstract": null, "full_text": "Prediction on a Graph with a Perceptron\n\nMark Herbster, Massimiliano Pontil\nDepartment of Computer Science, University College London\nGower Street, London WC1E 6BT, England, UK\n{m.herbster, m.pontil}@cs.ucl.ac.uk\n\nAbstract\nWe study the problem of online prediction of a noisy labeling of a graph with the perceptron. We address both label noise and concept noise. Graph learning is framed as an instance of prediction on a finite set. To treat label noise we show that the hinge loss bounds derived by Gentile [1] for online perceptron learning can be transformed to relative mistake bounds with an optimal leading constant when applied to prediction on a finite set. These bounds depend crucially on the norm of the learned concept. Often the norm of a concept can vary dramatically with only small perturbations in a labeling. We analyze a simple transformation that stabilizes the norm under perturbations. We derive an upper bound that depends only on natural properties of the graph (the graph diameter and the cut size of a partitioning of the graph), which are only indirectly dependent on the size of the graph. The impossibility of such bounds for the graph geodesic nearest neighbors algorithm will be demonstrated.\n\n1 Introduction\nWe study the problem of robust online learning over a graph. Consider the following game for predicting the labeling of a graph. Nature presents a vertex $v_{i_1}$; the learner predicts the label of the vertex $\\hat{y}_1 \\in \\{-1, 1\\}$; nature presents a label $y_1$; nature presents a vertex $v_{i_2}$; the learner predicts $\\hat{y}_2$; and so forth. The learner's goal is to minimize the total number of mistakes $|\\{t : \\hat{y}_t \\neq y_t\\}|$. If nature is adversarial, the learner will always mispredict; but if nature is regular or simple, there is hope that a learner may make only a few mispredictions. 
Thus, a methodological goal is to give learners whose total mispredictions can be bounded relative to the \"complexity\" of nature's labeling. In this paper, we consider the cut size as a measure of the complexity of a graph's labeling, where the size of the cut is the number of edges between disagreeing labels. We will give bounds which depend on the cut size and the diameter of the graph and thus do not directly depend on the size of the graph. The problem of learning a labeling of a graph is a natural problem in the online learning setting, as well as a foundational technique for a variety of semi-supervised learning methods [2, 3, 4, 5, 6]. For example, in the online setting, consider a system which serves advertisements on web pages. The web pages may be identified with the vertices of a graph and the edges with links between pages. The online prediction problem is then that, at a given time $t$, the system may receive a request to serve an advertisement on a particular web page. For simplicity, we assume that there are two alternatives to be served: either advertisement \"A\" or advertisement \"B\". The system then interprets the feedback as the label and then may use this information in responding to the next request to predict an advertisement for a requested web page.\n\nInput: $\\{(\\mathbf{v}_{i_t}, y_t)\\}_{t=1}^{\\ell} \\subseteq V_{\\mathbf{M}} \\times \\{-1, 1\\}$.\nInitialization: $\\mathbf{w}_1 = \\mathbf{0}$; $\\mathcal{M}_A = \\emptyset$.\nfor $t = 1, \\ldots, \\ell$ do\n  Predict: receive $\\mathbf{v}_{i_t}$; $\\hat{y}_t = \\mathrm{sign}(\\mathbf{e}_{i_t}^\\top \\mathbf{w}_t)$\n  Update: receive $y_t$\n  if $\\hat{y}_t = y_t$ then\n    $\\mathbf{w}_{t+1} = \\mathbf{w}_t$\n  else\n    $\\mathbf{w}_{t+1} = \\mathbf{w}_t + y_t \\mathbf{v}_{i_t}$\n    $\\mathcal{M}_A = \\mathcal{M}_A \\cup \\{t\\}$\nend\nFigure 1: Perceptron on set $V_{\\mathbf{M}}$.\n\nFigure 2: Barbell\n\nFigure 3: Barbell with concept noise\n\nFigure 4: Flower\n\nFigure 5: Octopus\n\n1.1 Related work\n\nThere is a well-developed literature regarding learning on the graph. The early work of Blum and Chawla [2] presented an algorithm which explicitly finds min-cuts of the label set. Bounds have been proven previously with smooth loss functions [6, 7] in a batch setting. 
Kernels on graph labelings were introduced in [3, 5]. This current work builds upon our work in [8]. There it was shown that, given a fixed labeling of a graph, the number of mistakes made by an algorithm similar to the kernel perceptron [9], with a kernel that was the pseudoinverse of the graph Laplacian, could be bounded by the quantity [8, Theorems 3.2, 4.1, and 4.2]\n\n$4 \\Phi_G(\\mathbf{u}) D_G \\, \\mathrm{bal}(\\mathbf{u})$. (1)\n\nHere $\\mathbf{u} \\in \\{-1, 1\\}^n$ is a binary vector defining the labeling of the graph, $\\Phi_G(\\mathbf{u})$ is the cut size (see footnote 1) defined as $\\Phi_G(\\mathbf{u}) := |\\{(i, j) \\in E(G) : u_i \\neq u_j\\}|$, that is, the number of edges between positive and negative labels, $D_G$ is the diameter of the graph and $\\mathrm{bal}(\\mathbf{u}) := (1 - \\frac{1}{n} |\\sum_i u_i|)^{-2}$ measures the label balance. This bound is interesting in that the mistakes of the algorithm could be bounded in terms of simple properties of a labeled graph. However, there are a variety of shortcomings in this result. First, we observe that the bound above assumed a fixed labeling of the graph. In practice, the online data sequence could contain multiple labels for a single vertex; this is the problem of label noise. Second, for an unbalanced set of labels the bound is vacuous, for example, if $\\mathbf{u} = (1, 1, \\ldots, 1, -1) \\in \\mathbb{R}^n$ then $\\mathrm{bal}(\\mathbf{u}) = n^2/4$. Third, consider the prototypical easy instance for the algorithm of two dense clusters connected by a few edges, for instance, two $m$-cliques connected by a single edge (a barbell graph, see Figure 2). If each clique is labeled with distinct labels then we have that $4 \\Phi_G(\\mathbf{u}) D_G \\, \\mathrm{bal}(\\mathbf{u}) = 4 \\times 1 \\times 3 \\times 1 = 12$, which is independent of $m$. Now suppose that, say, the first clique contains one vertex which is labeled as the second clique (see Figure 3). Previously $\\Phi_G(\\mathbf{u}) = 1$, but now $\\Phi_G(\\mathbf{u}) = m$ and the bound is vacuous. This is the problem of concept noise; in this example, a $\\Theta(1)$ perturbation of the labeling increases the bound multiplicatively by $\\Theta(m)$. 
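\n\nAs a rough worked check of the concept-noise example (a sketch only, writing $\\Phi_G(\\mathbf{u})$ for the cut size, reading the balance term as $\\mathrm{bal}(\\mathbf{u}) = (1 - \\frac{1}{n} |\\sum_i u_i|)^{-2}$, and assuming the relabeled vertex is not an endpoint of the bridging edge): the relabeled vertex now disagrees with its $m - 1$ clique neighbors, and the bridging edge is still cut, so $\\Phi_G(\\mathbf{u}) = (m - 1) + 1 = m$. With $n = 2m$ and $|\\sum_i u_i| = 2$ we get $\\mathrm{bal}(\\mathbf{u}) = (1 - \\frac{1}{m})^{-2} = \\frac{m^2}{(m-1)^2}$, so the quantity (1) becomes $4 \\times m \\times 3 \\times \\frac{m^2}{(m-1)^2} = \\frac{12 m^3}{(m-1)^2}$, which grows linearly in $m$, compared with 12 for the noise-free labeling.\n\n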
\n\n1.2 Overview\n\nA first aim of this paper is to improve upon the bounds in [8], in particular to address the three problems of label balance, label noise, and concept noise discussed above. For this purpose, we apply the well-known kernel perceptron [9] to the problem of online learning on the graph. We discuss the background material for this problem in section 2, where we also show that the bounds of [1] can be specialized to relative mistake bounds when applied to, for example, prediction on the graph. A second important aim of this paper is to interpret the mistake bounds in terms of high-level graph properties. Hence, in section 3, we refine a diameter-based bound of [8, Theorem 4.2] to a sharper bound based on the \"resistance distance\" [10] on a weighted graph, which we then closely match with a lower bound. In section 4, we introduce a kernel which is a simple augmentation of the pseudoinverse of the graph Laplacian and then prove a theorem on the performance of the perceptron with this kernel which solves the three problems above. We conclude in section 5 with a discussion comparing the mistake bounds for prediction on the graph with the halving algorithm [11] and the $k$-nearest neighbors algorithm.\n\n2 Preliminaries\nIn this section, we describe our setup for Hilbert spaces on finite sets and its specialization to the graph case. We then recall a result of Gentile [1] on prediction with the perceptron and discuss a special case in which relative 0-1 loss (mistake) bounds are obtainable.\n\nFootnote 1: Later in the paper we extend the definition of cut size to weighted graphs.\n\n2.1 Hilbert spaces of functions defined on a finite set\n\nWe denote matrices by capital bold letters and vectors by small bold case letters. So $\\mathbf{M}$ denotes the $n \\times n$ matrix $(M_{ij})_{i,j=1}^n$ and $\\mathbf{w}$ the $n$-dimensional vector $(w_i)_{i=1}^n$. The identity matrix is denoted by $\\mathbf{I}$. 
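We also let $\\mathbf{0}$ and $\\mathbf{1}$ be the $n$-dimensional vectors all of whose components equal zero and one respectively, and $\\mathbf{e}_i$ the $i$-th coordinate vector of $\\mathbb{R}^n$. Let $\\mathbb{N}$ be the set of natural numbers and $\\mathbb{N}_\\ell := \\{1, \\ldots, \\ell\\}$. We denote a generic Hilbert space by $\\mathcal{H}$. We identify $V := \\mathbb{N}_n$ as the indices of a set of $n$ objects, e.g. the vertices of a graph. A vector $\\mathbf{w} \\in \\mathbb{R}^n$ can alternatively be seen as a function $f : V \\to \\mathbb{R}$ such that $f(i) = w_i$, $i \\in V$. However, for simplicity we will use the notation $\\mathbf{w}$ to denote both a vector in $\\mathbb{R}^n$ and the above function. A symmetric positive semidefinite matrix $\\mathbf{M}$ induces a semi-inner product on $\\mathbb{R}^n$ which is defined as $\\langle \\mathbf{u}, \\mathbf{w} \\rangle_{\\mathbf{M}} := \\mathbf{u}^\\top \\mathbf{M} \\mathbf{w}$, where \"$\\top$\" denotes transposition. The reproducing kernel [12] associated with the above semi-inner product is $\\mathbf{K} = \\mathbf{M}^+$, where \"$+$\" denotes the pseudoinverse. We also define the coordinate spanning set\n\n$V_{\\mathbf{M}} := \\{\\mathbf{v}_i := \\mathbf{M}^+ \\mathbf{e}_i : i = 1, \\ldots, n\\}$ (2)\n\nand let $\\mathcal{H}(\\mathbf{M}) := \\mathrm{span}(V_{\\mathbf{M}})$. The restriction of the semi-inner product $\\langle \\cdot, \\cdot \\rangle_{\\mathbf{M}}$ to $\\mathcal{H}(\\mathbf{M})$ is an inner product on $\\mathcal{H}(\\mathbf{M})$. The set $V_{\\mathbf{M}}$ acts as \"coordinates\" for $\\mathcal{H}(\\mathbf{M})$, that is, if $\\mathbf{w} \\in \\mathcal{H}(\\mathbf{M})$ we have\n\n$w_i = \\mathbf{e}_i^\\top \\mathbf{M}^+ \\mathbf{M} \\mathbf{w} = \\mathbf{v}_i^\\top \\mathbf{M} \\mathbf{w} = \\langle \\mathbf{v}_i, \\mathbf{w} \\rangle_{\\mathbf{M}}$, (3)\n\nalthough the vectors $\\{\\mathbf{v}_1, \\ldots, \\mathbf{v}_n\\}$ are not necessarily normalized and are linearly independent only if $\\mathbf{M}$ is positive definite. We note that equation (3) is simply the reproducing kernel property [12] for the kernel $\\mathbf{M}^+$. When $V$ indexes the vertices of an undirected graph $G$, a natural norm to use is that induced by the graph Laplacian. We explain this in detail now. Let $\\mathbf{A}$ be the $n \\times n$ symmetric weight matrix of the graph such that $A_{ij} \\geq 0$, and define the edge set $E(G) := \\{(i, j) : 0 < A_{ij}, i < j\\}$. The distance matrix $\\Delta$ associated with $G$ is the per-element inverse of the weight matrix, that is, $\\Delta_{ij} = A_{ij}^{-1}$ ($\\Delta$ may have $+\\infty$ as a matrix element). The graph Laplacian $\\mathbf{G}$ is the $n \\times n$ matrix defined as $\\mathbf{G} := \\mathbf{D} - \\mathbf{A}$, where $\\mathbf{D} = \\mathrm{diag}(d_1, \\ldots, d_n)$ and $d_i$ is the weighted degree of vertex $i$, $d_i = \\sum_{j=1}^n A_{ij}$. 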
The Laplacian is positive semidefinite and induces the semi-norm\n\n$\\|\\mathbf{w}\\|_{\\mathbf{G}}^2 := \\mathbf{w}^\\top \\mathbf{G} \\mathbf{w} = \\sum_{(i,j) \\in E(G)} A_{ij} (w_i - w_j)^2$. (4)\n\nWhen the graph is connected, it follows from equation (4) that the null space of $\\mathbf{G}$ is spanned by the constant vector $\\mathbf{1}$ only. In this paper, we always assume that the graph $G$ is connected. Where it is not ambiguous, we use the notation $G$ to denote both the graph $G$ and the graph Laplacian.\n\n2.2 Online prediction of functions on a finite set with the perceptron\n\nGentile [1] bounded the performance of the perceptron algorithm on nonseparable data with the linear hinge loss. Here, we apply his result to study the problem of prediction on a finite set with the perceptron (see Figure 1). In this case, the inputs are the coordinates in the set $V_{\\mathbf{M}} \\subset \\mathcal{H}(\\mathbf{M})$ defined above. We additionally assume that the matrix $\\mathbf{M}$ is positive definite (not just positive semidefinite as in the previous subsection). This assumption, along with the fact that the inputs are coordinates, enables us to upper bound the hinge loss and hence we may give a relative mistake bound in terms of the complete set of base classifiers $\\{-1, 1\\}^n$.\n\nTheorem 2.1. Let $\\mathbf{M}$ be a symmetric positive definite matrix. If $\\{(\\mathbf{v}_{i_t}, y_t)\\}_{t=1}^{\\ell} \\subseteq V_{\\mathbf{M}} \\times \\{-1, 1\\}$ is a sequence of examples, $\\mathcal{M}_A$ denotes the set of trials in which the perceptron algorithm predicted incorrectly and $X = \\max_{t \\in \\mathcal{M}_A} \\|\\mathbf{v}_{i_t}\\|_{\\mathbf{M}}$, then the cumulative number of mistakes $|\\mathcal{M}_A|$ of the algorithm is bounded by\n\n$|\\mathcal{M}_A| \\leq 2 |\\mathcal{M}_A \\cap \\mathcal{M}_{\\mathbf{u}}| + \\frac{\\|\\mathbf{u}\\|_{\\mathbf{M}}^2 X^2}{2} + \\sqrt{2 |\\mathcal{M}_A \\cap \\mathcal{M}_{\\mathbf{u}}| \\, \\|\\mathbf{u}\\|_{\\mathbf{M}}^2 X^2 + \\frac{\\|\\mathbf{u}\\|_{\\mathbf{M}}^4 X^4}{4}}$ (5)\n\nfor all $\\mathbf{u} \\in \\{-1, 1\\}^n$, where $\\mathcal{M}_{\\mathbf{u}} = \\{t \\in \\mathbb{N}_\\ell : u_{i_t} \\neq y_t\\}$. In particular, if $|\\mathcal{M}_{\\mathbf{u}}| = 0$ then $|\\mathcal{M}_A| \\leq \\|\\mathbf{u}\\|_{\\mathbf{M}}^2 X^2$.\n\nProof. This bound follows directly from [1, Theorem 8] with $p = 2$, $\\gamma = 1$, and $\\mathbf{w}_1 = \\mathbf{0}$. Since $\\mathbf{M}$ is assumed to be symmetric positive definite, it follows that $\\{-1, 1\\}^n \\subset \\mathcal{H}(\\mathbf{M})$. Thus, the hinge loss $L_{\\mathbf{u},t} := \\max(0, 1 - y_t \\langle \\mathbf{u}, \\mathbf{v}_{i_t} \\rangle_{\\mathbf{M}})$ of any classifier $\\mathbf{u} \\in \\{-1, 1\\}^n$ with any example $(\\mathbf{v}_{i_t}, y_t)$ is either 0 or 2, since $|\\langle \\mathbf{u}, \\mathbf{v}_{i_t} \\rangle_{\\mathbf{M}}| = 1$ by equation (3). 
This allows us to bound the hinge loss term of [1, Theorem 8] directly with mistakes. We emphasize that our hypothesis on $\\mathbf{M}$ does not imply linear separability since multiple instances of an input vector in the training sequence may have distinct target labels. Moreover, we note that, for deterministic prediction, the constant 2 in the first term of the right hand side of equation (5) is optimal for an online algorithm as a mistake may be forced on every trial.\n\n3 Interpretation of the space $\\mathcal{H}(\\mathbf{G})$\nThe bound for prediction on a finite set in equation (5) involves two quantities, namely the squared norm of a classifier $\\mathbf{u} \\in \\{-1, 1\\}^n$ and the maximum of the squared norms of the coordinates $\\mathbf{v} \\in V_{\\mathbf{M}}$. In the case of prediction on the graph, recall from equation (4) that $\\|\\mathbf{u}\\|_{\\mathbf{G}}^2 := \\mathbf{u}^\\top \\mathbf{G} \\mathbf{u} = \\sum_{(i,j) \\in E(G)} A_{ij} (u_i - u_j)^2$. Therefore, we may identify this semi-norm with the weighted cut size\n\n$\\Phi_G(\\mathbf{u}) := \\frac{1}{4} \\|\\mathbf{u}\\|_{\\mathbf{G}}^2$ (6)\n\nof the labeling induced when $\\mathbf{u} \\in \\{-1, 1\\}^n$. In particular, with boolean weighted edges ($A_{ij} \\in \\{0, 1\\}$) the cut simply counts the number of edges spanning disagreeing labels. The norm $\\|\\mathbf{v} - \\mathbf{w}\\|_{\\mathbf{G}}$ is a metric distance for $\\mathbf{v}, \\mathbf{w} \\in \\mathrm{span}(V_{\\mathbf{G}})$; however, surprisingly, the square of the norm $\\|\\mathbf{v}_p - \\mathbf{v}_q\\|_{\\mathbf{G}}^2$ when restricted to graph coordinates $\\mathbf{v}_p, \\mathbf{v}_q \\in V_{\\mathbf{G}}$ is also a metric, known as the resistance distance [10],\n\n$r_G(p, q) := (\\mathbf{e}_p - \\mathbf{e}_q)^\\top \\mathbf{G}^+ (\\mathbf{e}_p - \\mathbf{e}_q) = \\|\\mathbf{v}_p - \\mathbf{v}_q\\|_{\\mathbf{G}}^2$. (7)\n\nIt is interesting to note that the resistance distance between vertex $p$ and vertex $q$ is the effective resistance between vertices $p$ and $q$, where the graph is the circuit and edge $(i, j)$ is a resistor with the resistance $\\Delta_{ij} = A_{ij}^{-1}$. As we shall see, our bounds in section 4 depend on $\\|\\mathbf{v}_p\\|_{\\mathbf{G}}^2 = \\|\\mathbf{v}_p - \\mathbf{0}\\|_{\\mathbf{G}}^2$. Formally, this is not an effective resistance between vertex $p$ and another vertex \"0\". Informally, however, the vector $\\mathbf{0}$ is the center of the graph, as $\\mathbf{0} = \\frac{\\sum_{\\mathbf{v} \\in V_{\\mathbf{G}}} \\mathbf{v}}{|V_{\\mathbf{G}}|}$, since $\\mathbf{1}$ is in the null space of $\\mathbf{G}$. In the following, we further characterize $\\|\\mathbf{v}_p\\|_{\\mathbf{G}}^2$. 
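First, we observe qualitatively that the more interconnected the graph $G$, the smaller the term $\\|\\mathbf{v}_p\\|_{\\mathbf{G}}^2$ (Corollary 3.1). Second, in Theorem 3.2 we quantitatively upper bound $\\|\\mathbf{v}_p\\|_{\\mathbf{G}}^2$ by the average (over $q$) of the effective resistance between vertex $p$ and each vertex $q$ in the graph (including $q = p$), which in turn may be upper bounded by the eccentricity of $p$. We proceed with the following useful lemma and theorem, as a basis for our later results.\n\nLemma 3.1. Let $\\mathbf{x} \\in \\mathcal{H}$; then $\\|\\mathbf{x}\\|^{-2} = \\min_{\\mathbf{w} \\in \\mathcal{H}} \\{\\|\\mathbf{w}\\|^2 : \\langle \\mathbf{w}, \\mathbf{x} \\rangle = 1\\}$.\n\nThe proof is straightforward and we do not elaborate on the details.\n\nTheorem 3.1. If $\\mathbf{M}$ and $\\mathbf{M}'$ are symmetric positive semidefinite matrices with $\\mathrm{span}(V_{\\mathbf{M}}) = \\mathrm{span}(V_{\\mathbf{M}'})$ and, for every $\\mathbf{w} \\in \\mathrm{span}(V_{\\mathbf{M}})$, $\\|\\mathbf{w}\\|_{\\mathbf{M}'}^2 \\leq \\|\\mathbf{w}\\|_{\\mathbf{M}}^2$, then\n\n$\\|\\sum_{i=1}^n a_i \\mathbf{v}_i\\|_{\\mathbf{M}}^2 \\leq \\|\\sum_{i=1}^n a_i \\mathbf{v}'_i\\|_{\\mathbf{M}'}^2$,\n\nwhere $\\mathbf{v}_i \\in V_{\\mathbf{M}}$, $\\mathbf{v}'_i \\in V_{\\mathbf{M}'}$ and $\\mathbf{a} \\in \\mathbb{R}^n$.\n\nProof. Let $\\mathbf{x} = \\sum_{i=1}^n a_i \\mathbf{v}_i$ and $\\mathbf{x}' = \\sum_{i=1}^n a_i \\mathbf{v}'_i$; then\n\n$\\|\\mathbf{x}'\\|_{\\mathbf{M}'}^{-2} \\leq \\| \\mathbf{x} / \\|\\mathbf{x}\\|_{\\mathbf{M}}^2 \\|_{\\mathbf{M}'}^2 = \\|\\mathbf{x}\\|_{\\mathbf{M}'}^2 / \\|\\mathbf{x}\\|_{\\mathbf{M}}^4 \\leq \\|\\mathbf{x}\\|_{\\mathbf{M}}^2 / \\|\\mathbf{x}\\|_{\\mathbf{M}}^4 = \\|\\mathbf{x}\\|_{\\mathbf{M}}^{-2}$,\n\nwhere the first inequality follows since $\\langle \\mathbf{x} / \\|\\mathbf{x}\\|_{\\mathbf{M}}^2, \\mathbf{x}' \\rangle_{\\mathbf{M}'} = 1$, hence $\\mathbf{x} / \\|\\mathbf{x}\\|_{\\mathbf{M}}^2$ is a feasible solution to the minimization problem in the right hand side of Lemma 3.1, while the second one follows immediately from the assumption that $\\|\\mathbf{w}\\|_{\\mathbf{M}'}^2 \\leq \\|\\mathbf{w}\\|_{\\mathbf{M}}^2$.\n\nAs a corollary to the above theorem we have the following when $\\mathbf{M}$ is a graph Laplacian.\n\nCorollary 3.1. Given connected graphs $G$ and $G'$ with distance matrices $\\Delta$ and $\\Delta'$ such that $\\Delta_{ij} \\leq \\Delta'_{ij}$, then for all $p, q \\in V$ we have that $\\|\\mathbf{v}_p\\|_{\\mathbf{G}}^2 \\leq \\|\\mathbf{v}'_p\\|_{\\mathbf{G}'}^2$ and $r_G(p, q) \\leq r_{G'}(p, q)$.\n\nThe first inequality in the above corollary demonstrates that $\\|\\mathbf{v}_p\\|_{\\mathbf{G}}^2$ is nonincreasing in a graph that is strictly more connected. The second inequality is the well-known Rayleigh's monotonicity law, which states that if any resistance in a circuit is decreased then the effective resistance between any two points cannot increase. 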
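\n\nTo illustrate the corollary on the smallest possible case (a sketch with a hypothetical weight $a$, not an example taken from the analysis above), consider the graph with two vertices joined by one edge of weight $a > 0$, so that the Laplacian is $\\mathbf{G} = a (\\mathbf{e}_1 - \\mathbf{e}_2)(\\mathbf{e}_1 - \\mathbf{e}_2)^\\top$ and its pseudoinverse is $\\mathbf{G}^+ = \\frac{1}{4a} (\\mathbf{e}_1 - \\mathbf{e}_2)(\\mathbf{e}_1 - \\mathbf{e}_2)^\\top$. Then the squared norm of the first coordinate vector is $\\mathbf{e}_1^\\top \\mathbf{G}^+ \\mathbf{e}_1 = \\frac{1}{4a}$ and the resistance distance is $r_G(1, 2) = (\\mathbf{e}_1 - \\mathbf{e}_2)^\\top \\mathbf{G}^+ (\\mathbf{e}_1 - \\mathbf{e}_2) = \\frac{4}{4a} = \\frac{1}{a}$, which is exactly the resistance of the single edge. Replacing $a$ by a smaller weight $a' \\leq a$ (that is, a larger distance $1/a'$) gives $\\frac{1}{4a} \\leq \\frac{1}{4a'}$ and $\\frac{1}{a} \\leq \\frac{1}{a'}$, in agreement with both inequalities of the corollary.\n\n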
We define the geodesic distance between vertices $p, q \\in V$ to be $d_G(p, q) := \\min |P(p, q)|$, where the minimum is taken with respect to all paths $P(p, q)$ from $p$ to $q$, with the path length defined as $|P(p, q)| := \\sum_{(i,j) \\in E(P(p,q))} \\Delta_{ij}$. The eccentricity $d_G(p)$ of a vertex $p \\in V$ is the geodesic distance on the graph between $p$ and the furthest vertex on the graph to $p$, that is, $d_G(p) = \\max_{q \\in V} d_G(p, q) \\leq D_G$, and $D_G$ is the (geodesic) diameter of the graph, $D_G := \\max_{p \\in V} d_G(p)$. A graph $G$ is connected when $D_G < \\infty$. A tree is an $n$-vertex connected graph with $n - 1$ edges. The following lemma, a well known result (see e.g. [10]), establishes that the resistance distance can be equated with the geodesic distance when the graph is a tree.\n\nLemma 3.2. If the graph $T$ is a tree with graph Laplacian $\\mathbf{T}$ then $r_T(p, q) = d_T(p, q)$.\n\nThe next theorem provides a quantitative relationship between $\\|\\mathbf{v}_p\\|_{\\mathbf{G}}^2$ and two measures of the connectivity of vertex $p$, namely its eccentricity and the mean of the effective resistances between vertex $p$ and each vertex on the graph.\n\nTheorem 3.2. If $G$ is a connected graph then\n\n$\\|\\mathbf{v}_p\\|_{\\mathbf{G}}^2 \\leq \\frac{1}{n} \\sum_{q=1}^n r_G(p, q) \\leq d_G(p)$. (8)\n\nProof. Recall that $r_G(p, q) = \\|\\mathbf{v}_p - \\mathbf{v}_q\\|_{\\mathbf{G}}^2$ (see equation (7)) and use $\\sum_{q=1}^n \\mathbf{v}_q = \\mathbf{0}$ to obtain that $\\frac{1}{n} \\sum_{q=1}^n \\|\\mathbf{v}_p - \\mathbf{v}_q\\|_{\\mathbf{G}}^2 = \\mathbf{v}_p^\\top \\mathbf{G} \\mathbf{v}_p + \\frac{1}{n} \\sum_{q=1}^n \\mathbf{v}_q^\\top \\mathbf{G} \\mathbf{v}_q$, which implies the left inequality in (8). Next, by Corollary 3.1, if $\\mathbf{T}$ is the Laplacian of a tree $T \\subseteq G$ then $r_G(p, q) \\leq r_T(p, q)$ for $p, q \\in V$. Therefore, from Lemma 3.2 we conclude that $r_G(p, q) \\leq d_T(p, q)$. Moreover, since $T \\subseteq G$ can be any tree, we have that $r_G(p, q) \\leq \\min_T d_T(p, q)$ where the minimum is over all trees $T \\subseteq G$. Since the geodesic path from $p$ to $q$ is necessarily contained in some tree $T \\subseteq G$ it follows that $\\min_T d_T(p, q) = d_G(p, q)$ and, so, $r_G(p, q) \\leq d_G(p, q)$. Now the theorem follows by maximizing $d_G(p, q)$ over $q$ and the definition of $d_G(p)$. 
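\n\nAs a small sanity check of inequality (8) (a sketch on a hypothetical example, the unweighted path on three vertices 1 - 2 - 3, writing $\\|\\mathbf{v}_p\\|_{\\mathbf{G}}^2$ for the $p$-th diagonal entry of $\\mathbf{G}^+$): by Lemma 3.2 the resistance distances equal the geodesic ones, $r_G(1, 2) = r_G(2, 3) = 1$ and $r_G(1, 3) = 2$. The Laplacian of this path has eigenvalues 0, 1 and 3, so $\\sum_q \\|\\mathbf{v}_q\\|_{\\mathbf{G}}^2 = \\mathrm{tr}(\\mathbf{G}^+) = 1 + \\frac{1}{3} = \\frac{4}{3}$. The identity used in the proof then gives $\\|\\mathbf{v}_1\\|_{\\mathbf{G}}^2 = \\frac{1}{3}(0 + 1 + 2) - \\frac{1}{3} \\cdot \\frac{4}{3} = \\frac{5}{9}$ and $\\|\\mathbf{v}_2\\|_{\\mathbf{G}}^2 = \\frac{1}{3}(1 + 0 + 1) - \\frac{4}{9} = \\frac{2}{9}$. Hence for $p = 1$ the chain (8) reads $\\frac{5}{9} \\leq 1 \\leq 2$, and for $p = 2$ it reads $\\frac{2}{9} \\leq \\frac{2}{3} \\leq 1$.\n\n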
We identify the resistance diameter of a graph $G$ as $R_G := \\max_{p,q \\in V} r_G(p, q)$; thus, from the previous theorem, we may also conclude that\n\n$\\max_{p \\in V} \\|\\mathbf{v}_p\\|_{\\mathbf{G}}^2 \\leq R_G \\leq D_G$. (9)\n\nWe complete this section by showing that there exists a family of graphs for which the above inequality is nearly tight. Specifically, we consider the \"flower graph\" (see Figure 4) obtained by connecting the first vertex of a chain with $p - 1$ vertices to the root vertex of an $m$-ary tree of depth one. We index the vertices of this graph so that vertices 1 to $p$ correspond to \"stem vertices\" and vertices $p + 1$ to $p + m$ to \"petals\". Clearly, this graph has diameter equal to $p$, hence our upper bound above establishes that $\\|\\mathbf{v}_1\\|_{\\mathbf{G}}^2 \\leq p$. We now argue that as $m$ grows this bound is almost tight. From Lemma 3.1 we have that $\\|\\mathbf{v}_1\\|_{\\mathbf{G}}^{-2} = \\min_{\\mathbf{w} \\in \\mathcal{H}(\\mathbf{G})} \\{\\|\\mathbf{w}\\|_{\\mathbf{G}}^2 : \\langle \\mathbf{w}, \\mathbf{v}_1 \\rangle_{\\mathbf{G}} = 1\\}$. We note that by symmetry, the solution $\\hat{\\mathbf{w}} = (\\hat{w}_i : i \\in \\mathbb{N}_{p+m})$ to the problem above satisfies $\\hat{w}_i = z$ if $i \\geq p + 1$ since $\\hat{\\mathbf{w}}$ must take the same value on the petal vertices. Consequently, it follows that\n\n$\\|\\mathbf{v}_1\\|_{\\mathbf{G}}^{-2} = \\min \\{ m (z - w_p)^2 + \\sum_{i=1}^{p-1} (w_i - w_{i+1})^2 : w_1 = 1, \\sum_{i=1}^p w_i + m z = 0 \\}$.\n\nWe upper bound this minimum by choosing $w_i = \\frac{p-i}{p-1}$ for $1 \\leq i \\leq p$. Thus, $w_1 = 1$ as required, $w_p = 0$ and we compute $z$ by the constraint set of the above minimization problem as $z = -\\frac{p}{2m}$. A direct computation gives $\\|\\mathbf{v}_1\\|_{\\mathbf{G}}^{-2} \\leq \\frac{1}{p-1} + \\frac{p^2}{4m}$, from which, using a first order Taylor expansion, it follows that $\\|\\mathbf{v}_1\\|_{\\mathbf{G}}^2 \\geq (p - 1) - \\frac{(p-1)^2 p^2}{4m}$. Therefore, as $m \\to \\infty$ the upper bound on $\\|\\mathbf{v}_1\\|_{\\mathbf{G}}^2$ (equation (8)) for the flower graph is matched by a lower bound with a gap of 1.\n\n4 Prediction on the graph\nWe define the following symmetric positive definite graph kernel,\n\n$\\mathbf{K}_c^b := \\mathbf{G}^+ + b \\mathbf{1} \\mathbf{1}^\\top + c \\mathbf{I}$, $(0 < b, 0 \\leq c)$, (10)\n\nwhere $\\mathbf{G}_c^b = (\\mathbf{K}_c^b)^{-1}$ is the matrix of the associated Hilbert space $\\mathcal{H}(\\mathbf{G}_c^b)$. In Lemma 4.1 below we prove the needed properties of $\\mathcal{H}(\\mathbf{G}_c^b)$ as a necessary step for the bound in Theorem 4.2. 
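As we shall see, these properties moderate the consequences of label imbalance and concept noise. To prove Lemma 4.1, we use the following theorem which is a special case of [12, Thm I, I.6].\n\nTheorem 4.1. If $\\mathbf{M}_1$ and $\\mathbf{M}_2$ are $n \\times n$ symmetric positive semidefinite matrices, and we set $\\mathbf{M} := (\\mathbf{M}_1^+ + \\mathbf{M}_2^+)^+$ then $\\|\\mathbf{w}\\|_{\\mathbf{M}}^2 = \\inf \\{\\|\\mathbf{w}_1\\|_{\\mathbf{M}_1}^2 + \\|\\mathbf{w}_2\\|_{\\mathbf{M}_2}^2 : \\mathbf{w}_i \\in \\mathcal{H}(\\mathbf{M}_i), \\mathbf{w}_1 + \\mathbf{w}_2 = \\mathbf{w}\\}$ for every $\\mathbf{w} \\in \\mathcal{H}(\\mathbf{M})$.\n\nNext, we define $\\phi_{\\mathbf{u}} \\in [0, 1]$ as a measure of the balance of a labeling $\\mathbf{u} \\in \\{-1, 1\\}^n$ as $\\phi_{\\mathbf{u}} := (\\frac{1}{n} \\sum_{i=1}^n u_i)^2$. Note that for a perfectly balanced labeling $\\phi_{\\mathbf{u}} = 0$, while $\\phi_{\\mathbf{u}} = 1$ for a perfectly unbalanced one.\n\nLemma 4.1. Given a vertex $p$ with associated coordinates $\\mathbf{v}_p \\in V_{\\mathbf{G}}$ and $\\mathbf{v}'_p \\in V_{\\mathbf{G}_c^b}$ we have that\n\n$\\|\\mathbf{v}'_p\\|_{\\mathbf{G}_c^b}^2 = \\|\\mathbf{v}_p\\|_{\\mathbf{G}}^2 + b + c$. (11)\n\nMoreover, if $\\mathbf{u}, \\mathbf{u}' \\in \\{-1, 1\\}^n$ and $k := |\\{i : u'_i \\neq u_i\\}|$ we have that\n\n$\\|\\mathbf{u}\\|_{\\mathbf{G}_c^b}^2 \\leq \\|\\mathbf{u}'\\|_{\\mathbf{G}}^2 + \\frac{\\phi_{\\mathbf{u}'}}{b} + \\frac{4k}{c}$. (12)\n\nProof. To prove equation (11) we recall equation (3) and note that\n\n$\\|\\mathbf{v}'_p\\|_{\\mathbf{G}_c^b}^2 = \\langle \\mathbf{v}'_p, \\mathbf{v}_p + b \\mathbf{1} + c \\mathbf{e}_p \\rangle_{\\mathbf{G}_c^b} = \\langle \\mathbf{v}'_p, \\mathbf{v}_p \\rangle_{\\mathbf{G}_c^b} + \\langle \\mathbf{v}'_p, b \\mathbf{1} + c \\mathbf{e}_p \\rangle_{\\mathbf{G}_c^b} = \\|\\mathbf{v}_p\\|_{\\mathbf{G}}^2 + b + c$.\n\nTo prove inequality (12) we proceed in two steps. First, we show that\n\n$\\|\\mathbf{u}\\|_{\\mathbf{G}_0^b}^2 = \\|\\mathbf{u}\\|_{\\mathbf{G}}^2 + \\frac{\\phi_{\\mathbf{u}}}{b}$. (13)\n\nIndeed, we can uniquely decompose $\\mathbf{u}$ as the sum of a vector in $\\mathcal{H}(\\mathbf{G})$ and one in $\\mathcal{H}(\\frac{\\mathbf{1} \\mathbf{1}^\\top}{b n^2})$ as $\\mathbf{u} = (\\mathbf{u} - \\bar{u} \\mathbf{1}) + \\bar{u} \\mathbf{1}$, where $\\bar{u} := \\frac{1}{n} \\sum_{i=1}^n u_i$. Therefore, by Theorem 4.1 we conclude that $\\|\\mathbf{u}\\|_{\\mathbf{G}_0^b}^2 = \\|\\mathbf{u} - \\bar{u} \\mathbf{1}\\|_{\\mathbf{G}}^2 + \\|\\bar{u} \\mathbf{1}\\|_{\\mathbf{1} \\mathbf{1}^\\top / (b n^2)}^2 = \\|\\mathbf{u}\\|_{\\mathbf{G}}^2 + \\frac{\\phi_{\\mathbf{u}}}{b}$, where $\\|\\mathbf{u} - \\bar{u} \\mathbf{1}\\|_{\\mathbf{G}} = \\|\\mathbf{u}\\|_{\\mathbf{G}}$ since $\\mathbf{1} \\perp \\mathcal{H}(\\mathbf{G})$. Second, we show, for any symmetric positive definite matrix $\\mathbf{M}$, $\\mathbf{u}, \\mathbf{u}' \\in \\{-1, 1\\}^n$ and $c > 0$, that\n\n$\\|\\mathbf{u}\\|_{\\mathbf{M}_c}^2 \\leq \\|\\mathbf{u}'\\|_{\\mathbf{M}}^2 + \\frac{4k}{c}$, (14)\n\nwhere $\\mathbf{M}_c := (\\mathbf{M}^{-1} + c \\mathbf{I})^{-1}$ and $k := |\\{i : u'_i \\neq u_i\\}|$. To this end, we decompose $\\mathbf{u}$ as a sum of two elements of $\\mathcal{H}(\\mathbf{M})$ and $\\mathcal{H}(\\frac{1}{c} \\mathbf{I})$ as $\\mathbf{u} = \\mathbf{u}' + (\\mathbf{u} - \\mathbf{u}')$ and observe that $\\|\\mathbf{u} - \\mathbf{u}'\\|_{\\frac{1}{c} \\mathbf{I}}^2 = \\frac{4k}{c}$. By Theorem 4.1 it then follows that $\\|\\mathbf{u}\\|_{\\mathbf{M}_c}^2 \\leq \\|\\mathbf{u}'\\|_{\\mathbf{M}}^2 + \\|\\mathbf{u} - \\mathbf{u}'\\|_{\\frac{1}{c} \\mathbf{I}}^2 = \\|\\mathbf{u}'\\|_{\\mathbf{M}}^2 + \\frac{4k}{c}$. Now inequality (12) follows by combining equations (13) and (14) with $\\mathbf{M} = \\mathbf{G}_0^b$.\n\nWe can now state our relative mistake bound for online prediction on the graph.\n\nTheorem 4.2. 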
Let $G$ be a connected graph. If $\\{(\\mathbf{v}'_{i_t}, y_t)\\}_{t=1}^{\\ell} \\subseteq V_{\\mathbf{G}_c^b} \\times \\{-1, 1\\}$ is a sequence of examples and $\\mathcal{M}_A$ denotes the set of trials in which the perceptron algorithm predicted incorrectly, then the cumulative number of mistakes $|\\mathcal{M}_A|$ of the algorithm is bounded by\n\n$|\\mathcal{M}_A| \\leq 2 |\\mathcal{M}_A \\cap \\mathcal{M}_{\\mathbf{u}}| + \\frac{Z}{2} + \\sqrt{2 |\\mathcal{M}_A \\cap \\mathcal{M}_{\\mathbf{u}}| Z + \\frac{Z^2}{4}}$, (15)\n\nfor all $\\mathbf{u}, \\mathbf{u}' \\in \\{-1, 1\\}^n$, where $k = |\\{i : u'_i \\neq u_i\\}|$, $\\phi_{\\mathbf{u}'} = (\\frac{1}{n} \\sum_{i=1}^n u'_i)^2$, $\\mathcal{M}_{\\mathbf{u}} = \\{t \\in \\mathbb{N}_\\ell : u_{i_t} \\neq y_t\\}$, and\n\n$Z = (4 \\Phi_G(\\mathbf{u}') + \\frac{\\phi_{\\mathbf{u}'}}{b} + \\frac{4k}{c}) (R_G + b + c)$.\n\nIn particular, if $b = 1$, $c = 0$, $k = 0$ and $|\\mathcal{M}_{\\mathbf{u}}| = 0$ then\n\n$|\\mathcal{M}_A| \\leq (4 \\Phi_G(\\mathbf{u}) + \\phi_{\\mathbf{u}}) (R_G + 1)$. (16)\n\nProof. The proof follows by Theorem 2.1 with $\\mathbf{M} = \\mathbf{G}_c^b$, then bounding $\\|\\mathbf{u}\\|_{\\mathbf{G}_c^b}^2$ and $\\|\\mathbf{v}'_{i_t}\\|_{\\mathbf{G}_c^b}^2$ via Lemma 4.1, and then using $\\max_{t \\in \\mathcal{M}_A} \\|\\mathbf{v}_{i_t}\\|_{\\mathbf{G}}^2 \\leq R_G$ by equation (9).\n\nThe upper bound of the theorem is more resilient to label imbalance, concept noise, and label noise than the bound in [8, Theorems 3.2, 4.1, and 4.2] (see equation (1)). For example, given the noisy barbell graph in Figure 3 but with $k \\ll n$ noisy vertices the bound (1) is $O(kn)$ while the bound (15) with $b = 1$, $c = 1$, and $|\\mathcal{M}_{\\mathbf{u}}| = 0$ is $O(k)$. A similar argument may be given for label imbalance. In the bound above, for easy interpretability, one may upper bound the resistance diameter $R_G$ by the geodesic diameter $D_G$. However, the resistance diameter makes for a sharper bound in a number of natural situations. For example, now consider (a thick barbell) two $m$-cliques (one labeled \"+1\", one \"-1\") with $\\ell$ edges ($\\ell < m$) between the cliques. We observe that between any two vertices there are at least $\\ell$ edge-disjoint paths of length no more than five; therefore the resistance diameter is at most $5/\\ell$ by the \"resistors-in-parallel\" rule while the geodesic diameter is 3. Thus, for \"thick barbells\" if we use the geodesic diameter we have a mistake bound of $16\\ell$ (substituting $\\phi_{\\mathbf{u}} = 0$ and $R_G \\leq 3$ into (16)) while surprisingly with the resistance diameter the bound (substituting b = 41 , c = 0, $|\\mathcal{M}_{\\mathbf{u}}| = 0$, $\\phi_{\\mathbf{u}} = 0$, and $R_G \\leq 5/\\ell$ into (15)) is independent of $\\ell$ and is 20. 
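\n\nFor completeness, the step from (15) to (16) can be spelled out as follows (a short derivation under the stated assumptions, writing $\\Phi_G(\\mathbf{u})$ for the cut size and $\\phi_{\\mathbf{u}} = (\\frac{1}{n} \\sum_i u_i)^2$ for the balance term, and reading $\\frac{4k}{c}$ as zero when $k = 0$ and $c = 0$). With $|\\mathcal{M}_{\\mathbf{u}}| = 0$ the first term of (15) and the first term under the square root vanish, so $|\\mathcal{M}_A| \\leq \\frac{Z}{2} + \\sqrt{\\frac{Z^2}{4}} = Z$. With $k = 0$ we have $\\mathbf{u}' = \\mathbf{u}$, and setting $b = 1$, $c = 0$ gives $Z = (4 \\Phi_G(\\mathbf{u}) + \\phi_{\\mathbf{u}})(R_G + 1)$, which is (16). For instance, for the noise-free barbell of Figure 2 (cut size 1, balanced labels so that $\\phi_{\\mathbf{u}} = 0$, and $R_G \\leq D_G = 3$) this gives at most $4 \\times 4 = 16$ mistakes.\n\n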
\n\n5 Discussion\nIn this paper, we have provided a bound on the performance of the perceptron on the graph in terms of structural properties of the graph and its labeling which are only indirectly dependent on the number of vertices in the graph; in particular, they depend on the cut size and the diameter. In the following, we compare the perceptron with two other approaches. First, we compare the perceptron with the graph kernel $\\mathbf{K}_0^1$ to the conceptually simpler $k$-nearest neighbors algorithm with either the graph geodesic distance or the resistance distance. In particular, we prove the impossibility of bounding the performance of $k$-nearest neighbors only in terms of the diameter and the cut size. Specifically, we give a parameterized family of graphs for which the number of mistakes of the perceptron is upper bounded by a fixed constant independent of the graph size while $k$-nearest neighbors provably incurs mistakes linearly in the graph size. Second, we compare the perceptron with the graph kernel $\\mathbf{K}_0^1$ with a simple application of the classical halving algorithm [11]. Here, we conclude that the upper bound for the perceptron is better for graphs with a small diameter while the halving algorithm's upper bound is better for graphs with a large diameter. In the following, for simplicity we limit our discussion to binary-weighted graphs and noise-free data (see equation (16)), and we upper bound the resistance diameter $R_G$ with the geodesic diameter $D_G$ (see equation (9)).\n\n5.1 K-nearest neighbors on the graph\n\nWe consider the $k$-nearest neighbors algorithms on the graph with both the resistance distance (see equation (7)) and the graph geodesic distance. The geodesic distance between two vertices is the length of the shortest path between the two vertices (recall the discussion in section 3). In the following, we use the term \"distance\" to refer simultaneously to both distances. Now, consider the family $O_{\\ell,m,p}$ of octopus graphs. 
An octopus graph (see Figure 5) consists of a \"head\" which is an $\\ell$-clique ($C^{(\\ell)}$) with vertices denoted by $c_1, \\ldots, c_\\ell$, and a set of $m$ \"tentacles\" ($\\{T_i\\}_{i=1}^m$), where each tentacle is a line graph of length $p$. The vertices of tentacle $i$ are denoted by $\\{t_{i,0}, t_{i,1}, \\ldots, t_{i,p}\\}$; the $t_{i,0}$ are all identified as one vertex $r$ which acts as the root of the $m$ tentacles. There is an edge (the body) connecting root $r$ to the vertex $c_1$ on the head. Thus, this graph has diameter $D = \\max(p + 2, 2p)$ and there are $\\ell + mp + 1$ vertices in total; an octopus $O_{m,p}$ is balanced if $\\ell = mp + 1$. Note that the distance of every vertex in the head to every other vertex in the graph is no more than $p + 2$, and every tentacle \"tip\" $t_{i,p}$ is distance $2p$ to other tips $t_{j,p}$, $j \\neq i$. We now argue that $k$-nearest neighbors may incur mistakes linear in the number of tentacles. To this end, choose $p \\geq 3$ and suppose we have the following online data sequence $\\{(c_1, +1), (t_{1,p}, -1), (c_2, +1), (t_{2,p}, -1), \\ldots, (c_m, +1), (t_{m,p}, -1)\\}$. Note that $k$-nearest neighbors will make a mistake on every instance $(t_{i,p}, -1)$ and so, even assuming that it predicts correctly on $(c_1, +1)$, it will always make $m$ mistakes. We now contrast this result with the performance of the perceptron with the graph kernel $\\mathbf{K}_0^1$ (see equation (10)). By equation (16), the number of mistakes will be upper bounded by $10p + 5$ because there is a cut of size 1 and the diameter is $2p$. Thus, for balanced octopi $O_{m,p}$ with $p \\geq 3$, as $m$ grows the number of mistakes of the kernel perceptron will be bounded by a fixed constant, whereas distance $k$-nearest neighbors will incur mistakes linearly in $m$.\n\n5.2 Halving algorithm\n\nWe now compare the performance of our algorithm to the classical halving algorithm [11]. The halving algorithm operates by predicting on each trial as the majority of the classifiers in the concept class which have been consistent over the trial sequence. 
Hence, the number of mistakes of the halving algorithm is upper bounded by the logarithm of the cardinality of the concept class. Let $\\mathcal{K}_G^k = \\{\\mathbf{u} \\in \\{-1, 1\\}^n : \\Phi_G(\\mathbf{u}) = k\\}$ be the set of all classifiers with a cut size equal to $k$ on a fixed graph $G$. The cardinality of $\\mathcal{K}_G^k$ is upper bounded by $\\binom{n(n-1)}{k}$ since any classifier (cut) in $\\mathcal{K}_G^k$ can be uniquely identified by a choice of $k$ edges and 1 bit which determines the sign of the vertices in the same partition (however we overcount as not every set of edges determines a classifier). The number of mistakes of the halving algorithm is upper bounded by $O(k \\log \\frac{n}{k})$. For example, on a line graph with a cut size of 1 the halving algorithm has an upper bound of $\\log n$ while the upper bound for the number of mistakes of the perceptron as given in equation (16) is $5n + 5$. Although the halving algorithm has a sharper bound on such large diameter graphs as the line graph, it unfortunately has a logarithmic dependence on $n$. This contrasts with the bound of the perceptron which is essentially independent of $n$. Thus, the bound for the halving algorithm is roughly sharper on graphs with a diameter $\\omega(\\log \\frac{n}{k})$, while the perceptron bound is roughly sharper on graphs with a diameter $o(\\log \\frac{n}{k})$. We emphasize that this analysis of upper bounds is quite rough and sharper bounds for both algorithms could be obtained, for example, by including a term representing the minimal possible cut, that is, the minimum number of edges necessary to disconnect a graph. For the halving algorithm this would enable a better bound on the cardinality of $\\mathcal{K}_G^k$ (see [13]), while for the perceptron, the larger the connectivity of the graph, the weaker the diameter upper bound in Theorem 3.2 (see for example the discussion of \"thick barbells\" at the end of section 4).\n\nAcknowledgments\nWe wish to thank the anonymous reviewers for their useful comments. 
This work was supported by EPSRC Grant GR/T18707/01 and by the IST Programme of the European Community, under the PASCAL Network of Excellence IST-2002-506778.\n\nReferences\n[1] C. Gentile. The robustness of the p-norm algorithms. Machine Learning, 53(3):265-299, 2003.\n[2] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In ICML 2002, pages 19-26. Morgan Kaufmann, San Francisco, CA, 2002.\n[3] R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input spaces. In ICML 2002, pages 315-322. Morgan Kaufmann, San Francisco, CA, 2002.\n[4] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using gaussian fields and harmonic functions. In ICML 2003, pages 912-919, 2003.\n[5] A. Smola and R. I. Kondor. Kernels and regularization on graphs. In COLT 2003, pages 144-158, 2003.\n[6] M. Belkin, I. Matveeva, and P. Niyogi. Regularization and semi-supervised learning on large graphs. In COLT 2004, pages 624-638, Banff, Alberta, 2004. Springer.\n[7] T. Zhang and R. Ando. Analysis of spectral kernel design based semi-supervised learning. In Y. Weiss, B. Schölkopf, and J. Platt, editors, NIPS 18, pages 1601-1608. MIT Press, Cambridge, MA, 2006.\n[8] M. Herbster, M. Pontil, and L. Wainer. Online learning over graphs. In ICML 2005, pages 305-312, New York, NY, USA, 2005. ACM Press.\n[9] Y. Freund and R. E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277-296, 1999.\n[10] D. Klein and M. Randić. Resistance distance. Journal of Mathematical Chemistry, 12(1):81-95, 1993.\n[11] J. M. Barzdin and R. V. Frievald. On the prediction of general recursive functions. Soviet Math. Doklady, 13:1224-1228, 1972.\n[12] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337-404, 1950.\n[13] D. Karger and C. Stein. A new approach to the minimum cut problem. 
JACM, 43(4):601-640, 1996.\n", "award": [], "sourceid": 3100, "authors": [{"given_name": "Mark", "family_name": "Herbster", "institution": null}, {"given_name": "Massimiliano", "family_name": "Pontil", "institution": null}]}