{"title": "Distributed $k$-means and $k$-median Clustering on General Topologies", "book": "Advances in Neural Information Processing Systems", "page_first": 1995, "page_last": 2003, "abstract": "This paper provides new algorithms for distributed clustering for two popular center-based objectives, $k$-median and $k$-means. These algorithms have provable guarantees and improve communication complexity over existing approaches. Following a classic approach in clustering by \\cite{har2004coresets}, we reduce the problem of finding a clustering with low cost to the problem of finding a `coreset' of small size. We provide a distributed method for constructing a global coreset which improves over the previous methods by reducing the communication complexity, and which works over general communication topologies. We provide experimental evidence for this approach on both synthetic and real data sets.", "full_text": "Distributed k-Means and k-Median Clustering on\n\nGeneral Topologies\n\nMaria Florina Balcan, Steven Ehrlich, Yingyu Liang\n\nSchool of Computer Science\n\nGeorgia Institute of Technology\n\n{ninamf,sehrlich}@cc.gatech.edu,yliang39@gatech.edu\n\nAtlanta, GA 30332\n\nAbstract\n\nThis paper provides new algorithms for distributed clustering for two popular\ncenter-based objectives, k-median and k-means. These algorithms have provable\nguarantees and improve communication complexity over existing approaches.\nFollowing a classic approach in clustering by [13], we reduce the problem of\n\ufb01nding a clustering with low cost to the problem of \ufb01nding a coreset of small\nsize. We provide a distributed method for constructing a global coreset which\nimproves over the previous methods by reducing the communication complexity,\nand which works over general communication topologies. 
Experimental results\non large scale data sets show that this approach outperforms other coreset-based\ndistributed clustering algorithms.\n\nIntroduction\n\n1\nMost classic clustering algorithms are designed for the centralized setting, but in recent years data\nhas become distributed over different locations, such as distributed databases [21, 5], images and\nvideos over networks [20], surveillance [11] and sensor networks [4, 12]. In many of these appli-\ncations the data is inherently distributed because, as in sensor networks, it is collected at different\nsites. As a consequence it has become crucial to develop clustering algorithms which are effective\nin the distributed setting.\nSeveral algorithms for distributed clustering have been proposed and empirically tested. Some of\nthese algorithms [10, 22, 7] are direct adaptations of centralized algorithms which rely on statistics\nthat are easy to compute in a distributed manner. Other algorithms [14, 17] generate summaries of\nlocal data and transmit them to a central coordinator which then performs the clustering algorithm.\nNo theoretical guarantees are provided for the clustering quality in these algorithms, and they do\nnot try to minimize the communication cost. Additionally, most of these algorithms assume that\nthe distributed nodes can communicate with all other sites or that there is a central coordinator that\ncommunicates with all other sites.\nIn this paper, we study the problem of distributed clustering where the data is distributed across\nnodes whose communication is restricted to the edges of an arbitrary graph. We provide algorithms\nwith small communication cost and provable guarantees on the clustering quality. 
Our technique for reducing communication in general graphs is based on the construction of a small set of points which act as a proxy for the entire data set.\nAn \u03b5-coreset is a weighted set of points whose cost on any set of centers is approximately the cost of the original data on those same centers up to accuracy \u03b5. Thus an approximate solution for the coreset is also an approximate solution for the original data. Coresets have previously been studied in the centralized setting ([13, 8]) but have also recently been used for distributed clustering as in [23] and as implied by [9]. In this work, we propose a distributed algorithm for k-means and k-median, by which each node constructs a local portion of a global coreset. Communicating the approximate cost of a global solution to each node is enough for the local construction, leading to low communication cost overall. The nodes then share the local portions of the coreset, which can be done efficiently in general graphs using a message passing approach.\n\nFigure 1: (a) Zhang et al. [23]: each node computes a coreset on the weighted pointset for its own data and its subtrees\u2019 coresets. (b) Our construction: local constant approximation solutions are computed, and the costs of these solutions are used to coordinate the construction of a local portion on each node.\n\nMore precisely, in Section 3, we propose a distributed coreset construction algorithm based on local approximate solutions. Each node computes an approximate solution for its local data, and then constructs the local portion of a coreset using only its local data and the total cost of each node\u2019s approximation. For \u03b5 constant, this builds a coreset of size $\\tilde{O}(kd+nk)$ for k-median and k-means when the data lies in d dimensions and is distributed over n sites. 
If there is a central coordinator among the n sites, then clustering can be performed on the coordinator by collecting the local portions of the coreset with a communication cost equal to the coreset size $\\tilde{O}(kd + nk)$. For distributed clustering over general connected topologies, we propose an algorithm based on the distributed coreset construction and a message-passing approach, whose communication cost improves over previous coreset-based algorithms. We provide a detailed comparison below.\nExperimental results on large scale data sets show that our algorithm performs well in practice. For a fixed amount of communication, our algorithm outperforms other coreset construction algorithms.\nComparison to Other Coreset Algorithms: Since coresets summarize local information they are a natural tool to use when trying to reduce communication complexity. If each node constructs an \u03b5-coreset on its local data, then the union of these coresets is clearly an \u03b5-coreset for the entire data set. Unfortunately the size of the coreset in this approach increases greatly with the number of nodes.\nAnother approach is the one presented in [23]. Its main idea is to approximate the union of local coresets with another coreset. They assume nodes communicate over a rooted tree, with each node passing its coreset to its parent. Because the approximation factor of the constructed coreset depends on the quality of its component coresets, the accuracy a coreset needs (and thus the overall communication complexity) scales with the height of this tree. Although it is possible to find a spanning tree in any communication network, when the graph has large diameter every tree has large height. In particular many natural networks such as grid networks have a large diameter ($\\Omega(\\sqrt{n})$ for grids) which greatly increases the size of the local coresets. We show that it is possible to construct a global coreset with low communication overhead. 
This is done by distributing the coreset construction procedure rather than combining local coresets. The communication needed to construct this coreset is negligible: just a single value from each data set representing the approximate cost of their local optimal clustering. Since the sampled global \u03b5-coreset is the same size as any local \u03b5-coreset, this leads to an improvement of the communication cost over the other approaches. See Figure 1 for an illustration. The constructed coreset is smaller by a factor of n in general graphs, and is independent of the communication topology. This method excels in sparse networks with large diameters, where the previous approach in [23] requires coresets that are quadratic in the size of the diameter for k-median and quartic for k-means; see Section 4 for details. [9] also merge coresets using coreset construction, but they do so in a model of parallel computation and ignore communication costs.\nBalcan et al. [3] and Daum\u00e9 et al. [6] consider communication complexity questions arising when doing classification in distributed settings. In concurrent and independent work, Kannan and Vempala [15] study several optimization problems in distributed settings, including k-means clustering under an interesting separability assumption.\n\n2 Preliminaries\nLet d(p, q) denote the Euclidean distance between any two points $p, q \\in \\mathbb{R}^d$. The goal of k-means clustering is to find a set of k centers $\\mathbf{x} = \\{x_1, x_2, \\ldots, x_k\\}$ which minimize the k-means cost of data set $P \\subseteq \\mathbb{R}^d$. Here the k-means cost is defined as $\\mathrm{cost}(P, \\mathbf{x}) = \\sum_{p \\in P} d(p, \\mathbf{x})^2$ where $d(p, \\mathbf{x}) = \\min_{x \\in \\mathbf{x}} d(p, x)$. If P is a weighted data set with a weighting function w, then the k-means cost is defined as $\\sum_{p \\in P} w(p) d(p, \\mathbf{x})^2$. Similarly, the k-median cost is defined as $\\sum_{p \\in P} d(p, \\mathbf{x})$. 
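
The two cost functions defined above can be written out directly. The following is a minimal sketch, not code from the paper: the names dist_to_centers and clustering_cost, the objective argument, and the toy data are all hypothetical choices made for illustration.

```python
import math

def dist_to_centers(p, centers):
    # d(p, x): Euclidean distance from p to its nearest center in x.
    return min(math.dist(p, c) for c in centers)

def clustering_cost(points, centers, weights=None, objective='k-means'):
    # Weighted cost(P, x); unweighted data uses w(p) = 1 for every point.
    if weights is None:
        weights = [1.0] * len(points)
    # k-means squares the distance; any other value is treated as k-median.
    power = 2 if objective == 'k-means' else 1
    return sum(w * dist_to_centers(p, centers) ** power
               for p, w in zip(points, weights))

# Toy data (illustrative only): three points on a line, two centers.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
centers = [(0.0, 0.0), (10.0, 0.0)]
kmeans_cost = clustering_cost(points, centers)
kmedian_cost = clustering_cost(points, centers, objective='k-median')
```

Only the middle point is away from a center (at distance 2), so on this toy input the k-means cost is the squared distance and the k-median cost is the distance itself.
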
Both k-means and k-median cost functions are known to be NP-hard to minimize (see for example [2]). For both objectives, there exist several readily available polynomial-time algorithms that achieve constant approximation solutions (see for example [16, 18]).\nIn distributed clustering, we consider a set of n nodes $V = \\{v_i, 1 \\le i \\le n\\}$ which communicate on an undirected connected graph G = (V, E) with m = |E| edges. More precisely, an edge $(v_i, v_j) \\in E$ indicates that $v_i$ and $v_j$ can communicate with each other. Here we measure the communication cost in number of points transmitted, and assume for simplicity that there is no latency in the communication. On each node $v_i$, there is a local data set $P_i$, and the global data set is $P = \\bigcup_{i=1}^n P_i$. The goal is to find a set of k centers $\\mathbf{x}$ which optimize $\\mathrm{cost}(P, \\mathbf{x})$ while keeping the computation efficient and the communication cost as low as possible. Our focus is to reduce the communication cost while preserving theoretical guarantees for approximating clustering cost.\nCoresets: For the distributed clustering task, a natural approach to avoid broadcasting raw data is to generate a local summary of the relevant information. If each site computes a summary for their own data set and then communicates this to a central coordinator, a solution can be computed from a much smaller amount of data, drastically reducing the communication.\nIn the centralized setting, the idea of summarization with respect to the clustering task is captured by the concept of coresets [13, 8]. A coreset is a set of weighted points whose cost approximates the cost of the original data for any set of k centers. The formal definition of coresets is:\nDefinition 1 (coreset). An \u03b5-coreset for a set of points P with respect to a center-based cost function is a set of points S and a set of weights $w : S \\to \\mathbb{R}$ such that for any set of centers $\\mathbf{x}$, we have $(1 - \\epsilon) \\mathrm{cost}(P, \\mathbf{x}) \\le \\sum_{p \\in S} w(p) \\mathrm{cost}(p, \\mathbf{x}) \\le (1 + \\epsilon) \\mathrm{cost}(P, \\mathbf{x})$.\nIn the centralized setting, many coreset construction algorithms have been proposed for k-median, k-means and some other cost functions. For example, for points in $\\mathbb{R}^d$, algorithms in [8] construct coresets of size $\\tilde{O}(kd/\\epsilon^4)$ for k-means and coresets of size $\\tilde{O}(kd/\\epsilon^2)$ for k-median. In the distributed setting, it is natural to ask whether there exists an algorithm that constructs a small coreset for the entire point set but still has low communication cost. Note that the union of coresets for multiple data sets is a coreset for the union of the data sets. The immediate construction of combining the local coresets from each node would produce a global coreset whose size was larger by a factor of n, greatly increasing the communication complexity. We present a distributed algorithm which constructs a global coreset the same size as the centralized construction and only needs a single value\u00b9 communicated to each node. This serves as the basis for our distributed clustering algorithm.\n\n3 Distributed Coreset Construction\nHere we design a distributed coreset construction algorithm for k-means and k-median. The underlying technique can be extended to other additive clustering objectives such as k-line median.\nTo gain some intuition on the distributed coreset construction algorithm, we briefly review the construction algorithm in [8] in the centralized setting. The coreset is constructed by computing a constant approximation solution for the entire data set, and then sampling points proportional to their contributions to the cost of this solution. 
Intuitively, the points close to the nearest centers can be approximately represented by the centers while points far away cannot be well represented. Thus, points should be sampled with probability proportional to their contributions to the cost. Directly adapting the algorithm to the distributed setting would require computing a constant approximation solution for the entire data set. We show that a global coreset can be constructed in a distributed fashion by estimating the weight of the entire data set with the sum of local approximations. With this approach, it suffices for nodes to communicate the total costs of their local solutions.\n\nFootnote 1: The value that is communicated is the sum of the costs of approximations to the local optimal clustering. This is guaranteed to be no more than a constant factor times larger than the optimal cost.\n\nAlgorithm 1 Communication aware distributed coreset construction\nInput: Local datasets $\\{P_i, 1 \\le i \\le n\\}$, parameter t (number of points to be sampled).\nRound 1: on each node $v_i \\in V$\n\u2022 Compute a constant approximation $B_i$ for $P_i$. Communicate $\\mathrm{cost}(P_i, B_i)$ to all other nodes.\nRound 2: on each node $v_i \\in V$\n\u2022 Set $t_i = \\frac{t \\cdot \\mathrm{cost}(P_i, B_i)}{\\sum_{j=1}^n \\mathrm{cost}(P_j, B_j)}$ and $m_p = \\mathrm{cost}(p, B_i), \\forall p \\in P_i$.\n\u2022 Pick a non-uniform random sample $S_i$ of $t_i$ points from $P_i$, where for every $q \\in S_i$ and $p \\in P_i$, we have q = p with probability $m_p / \\sum_{z \\in P_i} m_z$. Let $w_q = \\frac{\\sum_i \\sum_{z \\in P_i} m_z}{t m_q}$ for each $q \\in S_i$.\n\u2022 For $\\forall b \\in B_i$, let $P_b = \\{p \\in P_i : d(p, b) = d(p, B_i)\\}$, $w_b = |P_b| - \\sum_{q \\in P_b \\cap S} w_q$.\nOutput: Distributed coreset: points $S_i \\cup B_i$ with weights $\\{w_q : q \\in S_i \\cup B_i\\}$, $1 \\le i \\le n$.\n\nTheorem 1. For distributed k-means and k-median clustering on a graph, there exists an algorithm such that with probability at least $1 - \\delta$, the union of its output on all nodes is an \u03b5-coreset for $P = \\bigcup_{i=1}^n P_i$. The size of the coreset is $O(\\frac{1}{\\epsilon^4}(kd + \\log\\frac{1}{\\delta}) + nk \\log\\frac{nk}{\\delta})$ for k-means, and $O(\\frac{1}{\\epsilon^2}(kd + \\log\\frac{1}{\\delta}) + nk)$ for k-median. The total communication cost is O(mn).\nAs described below, the distributed coreset construction can be achieved by using Algorithm 1 with appropriate t, namely $O(\\frac{1}{\\epsilon^4}(kd + \\log\\frac{1}{\\delta}) + nk \\log\\frac{nk}{\\delta})$ for k-means and $O(\\frac{1}{\\epsilon^2}(kd + \\log\\frac{1}{\\delta}))$ for k-median. Due to space limitation, we describe a proof sketch highlighting the intuition and provide the details in the supplementary material.\nProof Sketch of Theorem 1: The analysis relies on the definition of the pseudo-dimension of a function space and a sampling lemma.\nDefinition 2 ([19, 8]). Let F be a finite set of functions from a set P to $\\mathbb{R}_{\\ge 0}$. For $f \\in F$, let $B(f, r) = \\{p : f(p) \\le r\\}$. The dimension of the function space dim(F, P) is the smallest integer d such that for any $G \\subseteq P$, $|\\{G \\cap B(f, r) : f \\in F, r \\ge 0\\}| \\le |G|^d$.\nSuppose we draw a sample S according to $\\{m_p : p \\in P\\}$, namely for each $q \\in S$ and $p \\in P$, q = p with probability $\\frac{m_p}{\\sum_{z \\in P} m_z}$. Set the weights of the points as $w_p = \\frac{\\sum_{z \\in P} m_z}{m_p |S|}$ for $p \\in P$. Then for any $f \\in F$, the expectation of the weighted cost of S equals the cost of the original data P, since $E[\\sum_{q \\in S} w_q f(q)] = \\sum_{q \\in S} E[w_q f(q)] = \\sum_{q \\in S} \\sum_{p \\in P} \\Pr[q = p] w_p f(p) = \\sum_{p \\in P} f(p)$.\nIf the sample size is large enough, then we also have concentration for any $f \\in F$. The lemma is implicit in [8] and we include the proof in the supplementary material.\nLemma 1. Fix a set F of functions $f : P \\to \\mathbb{R}_{\\ge 0}$. Let S be a sample drawn i.i.d. from P according to $\\{m_p \\in \\mathbb{R}_{\\ge 0} : p \\in P\\}$: for each $q \\in S$ and $p \\in P$, q = p with probability $\\frac{m_p}{\\sum_{z \\in P} m_z}$. Let $w_p = \\frac{\\sum_{z \\in P} m_z}{m_p |S|}$ for $p \\in P$. For a sufficiently large c, if $|S| \\ge \\frac{c}{\\epsilon^2}(\\dim(F, P) + \\log\\frac{1}{\\delta})$, then with probability at least $1 - \\delta$, $\\forall f \\in F$: $|\\sum_{p \\in P} f(p) - \\sum_{q \\in S} w_q f(q)| \\le \\epsilon (\\sum_{p \\in P} m_p)(\\max_{p \\in P} \\frac{f(p)}{m_p})$.\nTo get a small bound on the difference between $\\sum_{p \\in P} f(p)$ and $\\sum_{q \\in S} w_q f(q)$, we need to choose $m_p$ such that $\\sum_{p \\in P} m_p$ is small and $\\max_{p \\in P} \\frac{f(p)}{m_p}$ is bounded. More precisely, if we choose $m_p = \\max_{f \\in F} f(p)$, then the difference is bounded by $\\epsilon \\sum_{p \\in P} m_p$.\nWe first consider the centralized setting and review how [8] applied the lemma to construct a coreset for k-median as in Definition 1. A natural approach is to apply this lemma directly to the cost $f_{\\mathbf{x}}(p) := \\mathrm{cost}(p, \\mathbf{x})$. The problem is that a suitable upper bound $m_p$ is not available for $\\mathrm{cost}(p, \\mathbf{x})$. However, we can still apply the lemma to a different set of functions defined as follows. Let $b_p$ denote the closest center to p in the approximation solution. 
Aiming to approximate the error $\\sum_p [\\mathrm{cost}(p, \\mathbf{x}) - \\mathrm{cost}(b_p, \\mathbf{x})]$ rather than to approximate $\\sum_p \\mathrm{cost}(p, \\mathbf{x})$ directly, we define $f_{\\mathbf{x}}(p) := \\mathrm{cost}(p, \\mathbf{x}) - \\mathrm{cost}(b_p, \\mathbf{x}) + \\mathrm{cost}(p, b_p)$, where $\\mathrm{cost}(p, b_p)$ is added so that $f_{\\mathbf{x}}(p) \\ge 0$. Since $0 \\le f_{\\mathbf{x}}(p) \\le 2 \\mathrm{cost}(p, b_p)$, we can apply the lemma with $m_p = 2 \\mathrm{cost}(p, b_p)$. It bounds the difference $|\\sum_{p \\in P} f_{\\mathbf{x}}(p) - \\sum_{q \\in S} w_q f_{\\mathbf{x}}(q)|$ by $2\\epsilon \\sum_{p \\in P} \\mathrm{cost}(p, b_p)$, so we have an O(\u03b5)-approximation.\nNote that $\\sum_{p \\in P} f_{\\mathbf{x}}(p) - \\sum_{q \\in S} w_q f_{\\mathbf{x}}(q)$ does not equal $\\sum_{p \\in P} \\mathrm{cost}(p, \\mathbf{x}) - \\sum_{q \\in S} w_q \\mathrm{cost}(q, \\mathbf{x})$. However, it equals the difference between $\\sum_{p \\in P} \\mathrm{cost}(p, \\mathbf{x})$ and a weighted cost of the sampled points and the centers in the approximation solution. To get a coreset as in Definition 1, we need to add the centers of the approximation solution with specific weights to the coreset. Then when the sample is sufficiently large, the union of the sampled points and the centers is an \u03b5-coreset.\nOur key contribution in this paper is to show that in the distributed setting, it suffices to choose $b_p$ from the local approximation solution for the local dataset containing p, rather than from an approximation solution for the global dataset. Furthermore, the sampling and the weighting of the coreset points can be done in a local manner. In the following, we provide a formal verification of our discussion above. We have the following lemma for k-median with $F = \\{f_{\\mathbf{x}} : f_{\\mathbf{x}}(p) = d(p, \\mathbf{x}) - d(b_p, \\mathbf{x}) + d(p, b_p), \\mathbf{x} \\in (\\mathbb{R}^d)^k\\}$.\nLemma 2. For k-median, the output of Algorithm 1 is an \u03b5-coreset with probability at least $1 - \\delta$, if $t \\ge \\frac{c}{\\epsilon^2}(\\dim(F, P) + \\log\\frac{1}{\\delta})$ for a sufficiently large constant c.\nProof Sketch of Lemma 2: We want to show that for any set of centers $\\mathbf{x}$ the true cost for using these centers is well approximated by the cost on the weighted coreset. Note that our coreset has two types of points: sampled points $q \\in S = \\bigcup_{i=1}^n S_i$ with weight $w_q := \\frac{\\sum_{z \\in P} m_z}{m_q |S|}$ and local solution centers $b \\in B = \\bigcup_{i=1}^n B_i$ with weight $w_b := |P_b| - \\sum_{q \\in S \\cap P_b} w_q$. We use $b_p$ to represent the nearest center to p in the local approximation solution. We use $P_b$ to represent the set of points which have b as their closest center in the local approximation solution.\nAs mentioned above, we construct $f_{\\mathbf{x}}(p)$ to be the difference between the cost of p and the cost of $b_p$ so that Lemma 1 can be applied. Note that the centers are weighted such that $\\sum_{b \\in B} w_b d(b, \\mathbf{x}) = \\sum_{b \\in B} |P_b| d(b, \\mathbf{x}) - \\sum_{b \\in B} \\sum_{q \\in S \\cap P_b} w_q d(b, \\mathbf{x}) = \\sum_{p \\in P} d(b_p, \\mathbf{x}) - \\sum_{q \\in S} w_q d(b_q, \\mathbf{x})$. Taken together with the fact that $\\sum_{p \\in P} m_p = \\sum_{q \\in S} w_q m_q$, we can show that $|\\sum_{p \\in P} d(p, \\mathbf{x}) - \\sum_{q \\in S \\cup B} w_q d(q, \\mathbf{x})| = |\\sum_{p \\in P} f_{\\mathbf{x}}(p) - \\sum_{q \\in S} w_q f_{\\mathbf{x}}(q)|$. Note that $0 \\le f_{\\mathbf{x}}(p) \\le 2 d(p, b_p)$ by triangle inequality, and S is sufficiently large and chosen according to weights $m_p = d(p, b_p)$, so the conditions of Lemma 1 are met. Thus we can conclude that $|\\sum_{p \\in P} d(p, \\mathbf{x}) - \\sum_{q \\in S \\cup B} w_q d(q, \\mathbf{x})| \\le O(\\epsilon) \\sum_{p \\in P} d(p, \\mathbf{x})$, as desired.\nIn [8] it is shown that dim(F, P) = O(kd). Therefore, by Lemma 2, when $|S| \\ge O(\\frac{1}{\\epsilon^2}(kd + \\log\\frac{1}{\\delta}))$, the weighted cost of $S \\cup B$ approximates the k-median cost of P for any set of centers, then $(S \\cup B, w)$ becomes an \u03b5-coreset for P. 
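
The two rounds of Algorithm 1 can be sketched for a single node as follows. This is an illustration under stated assumptions rather than the authors' implementation: the local centers B_i are passed in by the caller instead of being computed by a constant-approximation algorithm, t_i is rounded to the nearest integer (the text does not say how fractional values are handled), and the function names, the seeded generator, and the toy data are hypothetical.

```python
import math
import random

def dist_to_centers(p, centers):
    # d(p, B_i): distance from p to its nearest local center.
    return min(math.dist(p, c) for c in centers)

def nearest_index(p, centers):
    # Index of the center b with d(p, b) = d(p, B_i); ties go to the first one.
    return min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))

def local_cost(P_i, B_i, power=1):
    # Round 1: the single value cost(P_i, B_i) that node i shares.
    # power=1 gives the k-median cost, power=2 the k-means cost.
    return sum(dist_to_centers(p, B_i) ** power for p in P_i)

def local_portion(P_i, B_i, total_cost, t, power=1, rng=None):
    # Round 2: returns (point, weight) pairs for the sample S_i and centers B_i.
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    m = [dist_to_centers(p, B_i) ** power for p in P_i]
    # Assumption: t_i is rounded to the nearest integer.
    t_i = round(t * sum(m) / total_cost) if total_cost > 0 else 0
    # Points with m_p = 0 have sampling probability zero, so leave them out.
    candidates = [j for j in range(len(P_i)) if m[j] > 0]
    sampled = []
    if t_i > 0:
        sampled = rng.choices(candidates, weights=[m[j] for j in candidates], k=t_i)
    # w_q = (sum of m_z over all nodes) / (t * m_q)
    portion = [(P_i[j], total_cost / (t * m[j])) for j in sampled]
    # w_b = |P_b| minus the total weight of the sampled points assigned to b
    center_weights = [0.0] * len(B_i)
    for p in P_i:
        center_weights[nearest_index(p, B_i)] += 1.0
    for q, w_q in portion:
        center_weights[nearest_index(q, B_i)] -= w_q
    portion.extend(zip(B_i, center_weights))
    return portion

# Toy run with two nodes (illustrative data only).
P = [[(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)],
     [(10.0, 10.0), (10.0, 12.0), (20.0, 20.0)]]
B = [[(0.0, 0.0), (5.0, 5.0)], [(10.0, 10.0), (20.0, 20.0)]]
total = sum(local_cost(P_i, B_i) for P_i, B_i in zip(P, B))
portions = [local_portion(P_i, B_i, total, t=10) for P_i, B_i in zip(P, B)]
```

By construction of w_b, the weights in each node's portion add up to the number of local points, which gives a quick sanity check on an implementation. In this sketch the only quantity a node needs from the others is total_cost.
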
The total communication cost is bounded by O(mn), since even in the most general case that every node only knows its neighbors, we can broadcast the local costs with O(mn) communication (see Algorithm 3).\nProof Sketch for k-means: Similar methods prove that for k-means when $t = O(\\frac{1}{\\epsilon^4}(kd + \\log\\frac{1}{\\delta}) + nk \\log\\frac{nk}{\\delta})$, the algorithm constructs an \u03b5-coreset with probability at least $1 - \\delta$. The key difference is that triangle inequality does not apply directly to the k-means cost, and so the error $|\\mathrm{cost}(p, \\mathbf{x}) - \\mathrm{cost}(b_p, \\mathbf{x})|$ and thus $f_{\\mathbf{x}}(p)$ are not bounded. The main change to the analysis is that we divide the points into two categories: good points whose costs approximately satisfy the triangle inequality (up to a factor of 1/\u03b5) and bad points. The good points for a fixed set of centers $\\mathbf{x}$ are defined as $G(\\mathbf{x}) = \\{p \\in P : |\\mathrm{cost}(p, \\mathbf{x}) - \\mathrm{cost}(b_p, \\mathbf{x})| \\le \\Delta_p\\}$ where the upper bound is $\\Delta_p = \\mathrm{cost}(p, b_p)/\\epsilon$, and the analysis follows as in Lemma 2. For bad points we can show that the difference in cost must still be small, namely $O(\\epsilon \\min\\{\\mathrm{cost}(p, \\mathbf{x}), \\mathrm{cost}(b_p, \\mathbf{x})\\})$.\nMore formally, let $f_{\\mathbf{x}}(p) = \\mathrm{cost}(p, \\mathbf{x}) - \\mathrm{cost}(b_p, \\mathbf{x}) + \\Delta_p$, and let $g_{\\mathbf{x}}(p)$ be $f_{\\mathbf{x}}(p)$ if $p \\in G(\\mathbf{x})$ and 0 otherwise. Then $\\sum_{p \\in P} \\mathrm{cost}(p, \\mathbf{x}) - \\sum_{q \\in S \\cup B} w_q \\mathrm{cost}(q, \\mathbf{x})$ is decomposed into three terms:\n$\\underbrace{\\sum_{p \\in P} g_{\\mathbf{x}}(p) - \\sum_{q \\in S} w_q g_{\\mathbf{x}}(q)}_{(A)} + \\underbrace{\\sum_{p \\in P \\setminus G(\\mathbf{x})} f_{\\mathbf{x}}(p)}_{(B)} - \\underbrace{\\sum_{q \\in S \\setminus G(\\mathbf{x})} w_q f_{\\mathbf{x}}(q)}_{(C)}$\nLemma 1 bounds (A) by $O(\\epsilon) \\mathrm{cost}(P, \\mathbf{x})$, but we need an accuracy of $\\epsilon^2$ to compensate for the 1/\u03b5 factor in the upper bound of $f_{\\mathbf{x}}(p)$. This leads to an $O(1/\\epsilon^4)$ factor in the sample complexity.\nFor (B) and (C), $|\\mathrm{cost}(p, \\mathbf{x}) - \\mathrm{cost}(b_p, \\mathbf{x})| > \\Delta_p$ since $p \\notin G(\\mathbf{x})$. This can be used to show that p and $b_p$ are close to each other and far away from $\\mathbf{x}$, and thus $|\\mathrm{cost}(p, \\mathbf{x}) - \\mathrm{cost}(b_p, \\mathbf{x})|$ is O(\u03b5) smaller than $\\mathrm{cost}(p, \\mathbf{x})$ and $\\mathrm{cost}(b_p, \\mathbf{x})$. This fact bounds (B) by $O(\\epsilon) \\mathrm{cost}(P, \\mathbf{x})$. It also bounds (C), noting that $E[\\sum_{q \\in P_b \\cap S} w_q] = |P_b|$, and thus $\\sum_{q \\in P_b \\cap S} w_q \\le 2|P_b|$ when $t \\ge O(nk \\log\\frac{nk}{\\delta})$.\nThe proof is completed by bounding the function space dimension by O(kd) as in [8].\n\n4 Effect of Network Topology on Communication Cost\nIf there is a central coordinator in the communication graph, then we can run the distributed coreset construction algorithm and send the local portions of the coreset to the coordinator, which can perform the clustering task. The total communication cost is just the size of the coreset.\nIn this section, we consider distributed clustering over arbitrary connected topologies. We propose to use a message passing approach for collecting information for coreset construction and sharing the local portions of the coreset. The details are presented in Algorithms 2 and 3.\n\nAlgorithm 2 Distributed clustering on a graph\nInput: $\\{P_i, 1 \\le i \\le n\\}$: local datasets; $\\{N_i, 1 \\le i \\le n\\}$: the neighbors of $v_i$; $A_\\alpha$: an \u03b1-approximation algorithm for weighted clustering instances.\nRound 1: on each node $v_i$\n\u2022 Construct its local portion $D_i$ of an \u03b5/2-coreset by Algorithm 1, using Message-Passing for communicating the local costs.\nRound 2: on each node $v_i$\n\u2022 Call Message-Passing($D_i$, $N_i$). Compute $\\mathbf{x} = A_\\alpha(\\bigcup_j D_j)$.\nOutput: $\\mathbf{x}$\n\nAlgorithm 3 Message-Passing($I_i$, $N_i$)\nInput: $I_i$ is the message, $N_i$ are the neighbors.\n\u2022 Let $R_i$ denote the information received. Initialize $R_i = \\{I_i\\}$, and send $I_i$ to $N_i$.\n\u2022 While $R_i \\ne \\{I_j, 1 \\le j \\le n\\}$: If receive message $I_j \\notin R_i$, then let $R_i = R_i \\cup \\{I_j\\}$ and send $I_j$ to $N_i$.\n\nSince each piece of the coreset is shared at most twice across any particular edge in message passing, we have\nTheorem 2. Given an \u03b1-approximation algorithm for weighted k-means (k-median respectively) as a subroutine, there exists an algorithm that with probability at least $1 - \\delta$ outputs a (1 + \u03b5)\u03b1-approximation solution for distributed k-means (k-median respectively). The communication cost is $O(m(\\frac{1}{\\epsilon^4}(kd + \\log\\frac{1}{\\delta}) + nk \\log\\frac{nk}{\\delta}))$ for k-means, and $O(m(\\frac{1}{\\epsilon^2}(kd + \\log\\frac{1}{\\delta}) + nk))$ for k-median.\nIn contrast, an approach where each node constructs an \u03b5-coreset for k-means and sends it to the other nodes incurs communication cost of $\\tilde{O}(\\frac{mnkd}{\\epsilon^4})$. Our algorithm significantly reduces this.\nOur algorithm can also be applied on a rooted tree: we can send the coreset portions to the root which then applies an approximation algorithm. Since each portion is transmitted at most h times,\nTheorem 3. Given an \u03b1-approximation algorithm for weighted k-means (k-median respectively) as a subroutine, there exists an algorithm that with probability at least $1 - \\delta$ outputs a (1 + \u03b5)\u03b1-approximation solution for distributed k-means (k-median respectively) clustering on a rooted tree of height h. The total communication cost is $O(h(\\frac{1}{\\epsilon^4}(kd + \\log\\frac{1}{\\delta}) + nk \\log\\frac{nk}{\\delta}))$ for k-means, and $O(h(\\frac{1}{\\epsilon^2}(kd + \\log\\frac{1}{\\delta}) + nk))$ for k-median.\nOur approach improves the cost of $\\tilde{O}(\\frac{nh^4kd}{\\epsilon^4})$ for k-means and the cost of $\\tilde{O}(\\frac{nh^2kd}{\\epsilon^2})$ for k-median in [23]\u00b2. The algorithm in [23] builds on each node a coreset for the union of coresets from its children, and thus needs O(\u03b5/h) accuracy to prevent the accumulation of errors. Since the coreset construction subroutine has quadratic dependence on 1/\u03b5 for k-median (quartic for k-means), the algorithm then has quadratic dependence on h (quartic for k-means). Our algorithm does not build coresets on top of coresets, resulting in a better dependence on the height of the tree h.\n\nFootnote 2: Their algorithm used coreset construction as a subroutine. The construction algorithm they used builds coreset of size $\\tilde{O}(\\frac{nkh}{\\epsilon^d} \\log |P|)$. Throughout this paper, when we compare to [23] we assume they use the coreset construction technique of [8] to reduce their coreset size and communication cost.\n\nIn a general graph, any rooted tree will have its height h at least as large as half the diameter. For sensors in a grid network, this implies $h = \\Omega(\\sqrt{n})$. 
In this case, our algorithm gains a significant improvement over existing algorithms.\n\n5 Experiments\n\nHere we evaluate the effectiveness of our algorithm and compare it to other distributed coreset algorithms. We present the k-means cost of the solution by our algorithm with varying communication cost, and compare to those of other algorithms when they use the same amount of communication.\nData sets: We present results on YearPredictionMSD (515345 points in $\\mathbb{R}^{90}$, k = 50). Similar results are observed on five other datasets, which are presented in the supplementary material.\nExperimental Methodology: We first generate a communication graph connecting local sites, and then partition the data into local data sets. The algorithms were evaluated on Erd\u0151s-R\u00e9nyi random graphs with p = 0.3, grid graphs, and graphs generated by the preferential attachment mechanism [1]. We used 100 sites for YearPredictionMSD.\nThe data is then distributed over the local sites. There are four partition methods: uniform, similarity-based, weighted, and degree-based. In all methods, each example is distributed to the local sites with probability proportional to the site\u2019s weight. In uniform partition, the sites have equal weights; in similarity-based partition, each site has an associated data point randomly selected from the global data and the weight is the similarity to the associated point; in weighted partition, the weights are chosen from |N(0, 1)|; in degree-based, the weights are the sites\u2019 degrees.\nTo measure the quality of the coreset generated, we run Lloyd\u2019s algorithm on the coreset and the global data respectively to get two solutions, and compute the ratio between the costs of the two solutions over the global data. The average ratio over 30 runs is then reported. We compare our algorithm with COMBINE, the method of combining coresets from local data sets, and with the algorithm of [23] (Zhang et al.). When running the algorithm of Zhang et al., we restrict the network to a spanning tree by picking a root uniformly at random and performing a breadth first search.\nResults: Figure 2 shows the results over different network topologies and partition methods. We observe that the algorithms perform well with much smaller coreset sizes than predicted by the theoretical bounds. For example, to get 1.1 cost ratio, the coreset size and thus the communication needed is only 0.1% to 1% of the theoretical bound.\nIn the uniform partition, our algorithm performs nearly the same as COMBINE. This is not surprising since our algorithm reduces to the COMBINE algorithm when each local site has the same cost and the two algorithms use the same amount of communication. In this case, since in our algorithm the sizes of the local samples are proportional to the costs of the local solutions, it samples the same number of points from each local data set. This is equivalent to the COMBINE algorithm with the same amount of communication. In the similarity-based partition, similar results are observed as it also leads to balanced local costs. However, when the local sites have significantly different costs (as in the weighted and degree-based partitions), our algorithm outperforms COMBINE. As observed in Figure 2, the costs of our solutions consistently improve over those of COMBINE by 2% to 5%. Our algorithm then saves 10% to 20% communication cost to achieve the same approximation ratio.\nFigure 3 shows the results over the spanning trees of the graphs. Our algorithm performs much better than the algorithm of Zhang et al., achieving about 20% improvement in cost. 
This is because\ntheir algorithm needs larger coresets to prevent the accumulation of errors when constructing\ncoresets from component coresets, and thus needs higher communication cost to achieve the same\napproximation ratio.\n\nAcknowledgements This work was supported by ONR grant N00014-09-1-0751, AFOSR grant\nFA9550-09-1-0538, and by a Google Research Award. We thank Le Song for generously allowing\nus to use his computer cluster.\n\nFigure 2: k-means cost (normalized by baseline) vs. communication cost over graphs. The panel titles\nindicate the network topology and partition method: (a) random graph, uniform; (b) random graph, similarity-based; (c) random graph, weighted; (d) grid graph, similarity-based; (e) grid graph, weighted; (f) preferential graph, degree-based.\n\nFigure 3: k-means cost (normalized by baseline) vs. communication cost over the spanning trees of\nthe graphs. The panel titles indicate the network topology and partition method: (a) random graph, uniform; (b) random graph, similarity-based; (c) random graph, weighted; (d) grid graph, similarity-based; (e) grid graph, weighted; (f) preferential graph, degree-based.\n\nReferences\n[1] R. Albert and A.-L. Barab\u00e1si. Statistical mechanics of complex networks. 
Reviews of Modern Physics, 2002.\n\n[2] P. Awasthi and M. Balcan. Center based clustering: A foundational perspective. Survey Chapter in Handbook of Cluster Analysis (Manuscript), 2013.\n\n[3] M.-F. Balcan, A. Blum, S. Fine, and Y. Mansour. Distributed learning, communication complexity and privacy. In Proceedings of the Conference on Learning Theory, 2012.\n\n[4] J. Considine, F. Li, G. Kollios, and J. Byers. Approximate aggregation techniques for sensor databases. In Proceedings of the International Conference on Data Engineering, 2004.\n\n[5] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, et al. Spanner: Google\u2019s globally-distributed database. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, 2012.\n\n[6] H. Daum\u00e9 III, J. M. Phillips, A. Saha, and S. Venkatasubramanian. Ef\ufb01cient protocols for distributed classi\ufb01cation and optimization. In Algorithmic Learning Theory, pages 154\u2013168. Springer, 2012.\n\n[7] S. Dutta, C. Gianella, and H. Kargupta. K-means clustering over peer-to-peer networks. In Proceedings of the International Workshop on High Performance and Distributed Mining, 2005.\n\n[8] D. Feldman and M. Langberg. 
A uni\ufb01ed framework for approximating and clustering data. In Proceedings of the Annual ACM Symposium on Theory of Computing, 2011.\n\n[9] D. Feldman, A. Sugaya, and D. Rus. An effective coreset compression algorithm for large scale sensor networks. In Proceedings of the International Conference on Information Processing in Sensor Networks, 2012.\n\n[10] G. Forman and B. Zhang. Distributed data clustering can be ef\ufb01cient and exact. ACM SIGKDD Explorations Newsletter, 2000.\n\n[11] S. Greenhill and S. Venkatesh. Distributed query processing for mobile surveillance. In Proceedings of the International Conference on Multimedia, 2007.\n\n[12] M. Greenwald and S. Khanna. Power-conserving computation of order-statistics over sensor networks. In Proceedings of the ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2004.\n\n[13] S. Har-Peled and S. Mazumdar. On coresets for k-means and k-median clustering. In Proceedings of the Annual ACM Symposium on Theory of Computing, 2004.\n\n[14] E. Januzaj, H. Kriegel, and M. Pfei\ufb02e. Towards effective and ef\ufb01cient distributed clustering. In Workshop on Clustering Large Data Sets in the IEEE International Conference on Data Mining, 2003.\n\n[15] R. Kannan and S. Vempala. Nimble algorithms for cloud computing. arXiv preprint arXiv:1304.3162, 2013.\n\n[16] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. A local search approximation algorithm for k-means clustering. In Proceedings of the Annual Symposium on Computational Geometry, 2002.\n\n[17] H. Kargupta, W. Huang, K. Sivakumar, and E. Johnson. Distributed clustering using collective principal component analysis. Knowledge and Information Systems, 2001.\n\n[18] S. Li and O. Svensson. Approximating k-median via pseudo-approximation. In Proceedings of the Annual ACM Symposium on Theory of Computing, 2013.\n\n[19] Y. Li, P. M. Long, and A. Srinivasan. 
Improved bounds on the sample complexity of learning. In Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms, 2000.\n\n[20] S. Mitra, M. Agrawal, A. Yadav, N. Carlsson, D. Eager, and A. Mahanti. Characterizing web-based video sharing workloads. ACM Transactions on the Web, 2011.\n\n[21] C. Olston, J. Jiang, and J. Widom. Adaptive \ufb01lters for continuous queries over distributed data streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2003.\n\n[22] D. Tasoulis and M. Vrahatis. Unsupervised distributed clustering. In Proceedings of the International Conference on Parallel and Distributed Computing and Networks, 2004.\n\n[23] Q. Zhang, J. Liu, and W. Wang. Approximate clustering on distributed data streams. In Proceedings of the IEEE International Conference on Data Engineering, 2008.\n", "award": [], "sourceid": 1011, "authors": [{"given_name": "Maria-Florina", "family_name": "Balcan", "institution": "Georgia Tech"}, {"given_name": "Steven", "family_name": "Ehrlich", "institution": "Georgia Tech"}, {"given_name": "Yingyu", "family_name": "Liang", "institution": "Georgia Tech"}]}