{"title": "Fast Distributed Submodular Cover: Public-Private Data Summarization", "book": "Advances in Neural Information Processing Systems", "page_first": 3594, "page_last": 3602, "abstract": "In this paper, we introduce the public-private framework of data summarization motivated by privacy concerns in personalized recommender systems and online social services. Such systems have usually access to massive data generated by a large pool of users. A major fraction of the data is public and is visible to (and can be used for) all users. However, each user can also contribute some private data that should not be shared with other users to ensure her privacy. The goal is to provide a succinct summary of massive dataset, ideally as small as possible, from which customized summaries can be built for each user, i.e. it can contain elements from the public data (for diversity) and users' private data (for personalization). To formalize the above challenge, we assume that the scoring function according to which a user evaluates the utility of her summary satisfies submodularity, a widely used notion in data summarization applications. Thus, we model the data summarization targeted to each user as an instance of a submodular cover problem. However, when the data is massive it is infeasible to use the centralized greedy algorithm to find a customized summary even for a single user. Moreover, for a large pool of users, it is too time consuming to find such summaries separately. Instead, we develop a fast distributed algorithm for submodular cover, FASTCOVER, that provides a succinct summary in one shot and for all users. We show that the solution provided by FASTCOVER is competitive with that of the centralized algorithm with the number of rounds that is exponentially smaller than state of the art results. 
Moreover, we have implemented FASTCOVER with Spark to demonstrate its practical performance on a number of concrete applications, including personalized location recommendation, personalized movie recommendation, and dominating set on tens of millions of data points and varying number of users.", "full_text": "Fast Distributed Submodular Cover:\nPublic-Private Data Summarization\n\nBaharan Mirzasoleiman\n\nETH Zurich\n\nMorteza Zadimoghaddam\n\nGoogle Research\n\nAmin Karbasi\nYale University\n\nAbstract\n\nIn this paper, we introduce the public-private framework of data summarization\nmotivated by privacy concerns in personalized recommender systems and online\nsocial services. Such systems usually have access to massive data generated by a\nlarge pool of users. A major fraction of the data is public and is visible to (and\ncan be used for) all users. However, each user can also contribute some private\ndata that should not be shared with other users to ensure her privacy. The goal is to\nprovide a succinct summary of a massive dataset, ideally as small as possible, from\nwhich customized summaries can be built for each user, i.e. it can contain elements\nfrom the public data (for diversity) and users\u2019 private data (for personalization).\nTo formalize the above challenge, we assume that the scoring function according\nto which a user evaluates the utility of her summary satisfies submodularity, a\nwidely used notion in data summarization applications. Thus, we model the data\nsummarization targeted to each user as an instance of a submodular cover problem.\nHowever, when the data is massive it is infeasible to use the centralized greedy\nalgorithm to find a customized summary even for a single user. Moreover, for a\nlarge pool of users, it is too time consuming to find such summaries separately.\nInstead, we develop a fast distributed algorithm for submodular cover, FASTCOVER,\nthat provides a succinct summary in one shot and for all users. 
We show that\nthe solution provided by FASTCOVER is competitive with that of the centralized\nalgorithm with the number of rounds that is exponentially smaller than state of the\nart results. Moreover, we have implemented FASTCOVER with Spark to demonstrate\nits practical performance on a number of concrete applications, including\npersonalized location recommendation, personalized movie recommendation, and\ndominating set on tens of millions of data points and varying number of users.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n1 Introduction\n\nData summarization, a central challenge in machine learning, is the task of finding a representative\nsubset of manageable size out of a large dataset. It has found numerous applications, including image\nsummarization [1], recommender systems [2], scene summarization [3], clustering [4, 5], active set\nselection in non-parametric learning [6], and document and corpus summarization [7, 8], to name a\nfew. A general recipe to obtain a faithful summary is to define a utility/scoring function that measures\ncoverage and diversity of the selected subset [1]. In many applications, the utility functions\nused for summarization exhibit submodularity, a natural diminishing returns property. In words,\nsubmodularity implies that the added value of any given element from the dataset decreases as we\ninclude more data points in the summary. Thus, the data summarization problem can be naturally\nreduced to that of a submodular cover problem where the objective is to find the smallest subset\nwhose utility achieves a desired fraction of the utility provided by the entire dataset.\nIt is known that the classical greedy algorithm yields a logarithmic factor approximation to the\noptimum summary [9]. It starts with an empty set, and at each iteration adds an element with the\nmaximum added value to the summary selected so far. 
It is also known that improving upon the\nlogarithmic approximation ratio is NP-hard [10]. Even though the greedy algorithm produces a\nnear-optimal solution, it is highly impractical for massive datasets, as sequentially selecting elements\non a single machine is heavily constrained in terms of speed and memory. Hence, in order to solve the\nsubmodular cover problem at scale, we need to make use of MapReduce-style parallel computation\nmodels [11, 12]. The greedy algorithm, due to its sequential nature, is poorly suited for parallelization.\nIn this paper, we propose a fast distributed algorithm, FASTCOVER, that enables us to solve the more\ngeneral problem of covering multiple submodular functions in one run of the algorithm. It relies\non three important ingredients: 1) a reduction from multiple submodular cover problems into a\nsingle instance of a submodular cover problem [13, 14], 2) a randomized filtration mechanism to select\nelements with high utility, and 3) a set of carefully chosen threshold functions used for the filtration\nmechanism. FASTCOVER also provides a natural trade-off between the number of MapReduce rounds\nand the size of the returned summary. It effectively lets us choose between compact summaries (i.e.,\nsmaller solution size) while running more MapReduce rounds or larger summaries while running\nfewer MapReduce rounds.\nThis setting is motivated by privacy concerns in many modern applications, including personalized\nrecommender systems, online social services, and the data collected by apps on mobile platforms\n[15, 16]. In such applications, users have some control over their own data and can mark some part\nof it private (in a slightly more general case, we can assume that users can make part of their data\nprivate to specific groups and public to others). 
As a result, the dataset consists of public data, shared\namong all users, and disjoint sets of private data accessible to the owners only.\nWe call this more general framework for data summarization public-private data summarization,\nwhere the private data of one user should not be included in another user\u2019s summary (see also [15]).\nThis model naturally reduces to solving one instance of the submodular cover problem for each\nuser, as their view of the dataset and the specific utility function specifying users\u2019 preferences differ\nacross users. When the number of users is small, one can solve the public-private data summarization\nseparately for each user, using the greedy algorithm (for datasets of small size) or the recently\nproposed distributed algorithm DISCOVER [12] (for datasets of moderate size). However, when there\nare many users or the dataset is massive, none of the prior work truly scales.\nWe report performance of FASTCOVER using Spark on concrete applications of the public-private data\nsummarization, including personalized movie recommendation on a dataset containing 2 million\nratings by more than 100K users for 1000 movies, personalized location recommendation based\non 20 users and their collected GPS locations, and finding the dominating set on a social network\ncontaining more than 65 million nodes and 1.8 billion edges. For small to moderate sized datasets, we\ncompare our results with previous work, namely, the classical greedy algorithm and DISCOVER [12]. For\ntruly large-scale experiments, where the data is big and/or there are many users involved (e.g., movie\nrecommendation), we cannot run DISCOVER as the number of MapReduce rounds in addition to their\ncommunication costs is prohibitive. In our experiments, we consistently observe that FASTCOVER\nprovides solutions of size similar to the greedy algorithm (and very often even smaller) with the\nnumber of rounds that is orders of magnitude smaller than DISCOVER. 
This makes FASTCOVER\nthe first distributed algorithm that solves the public-private data summarization fast and at scale.\n\n2 Problem Statement: Public-Private Data Summarization\n\nIn this section, we formally define the public-private model of data summarization1. Here, we\nconsider a potentially large dataset (sometimes called universe of items) V of size n and a set of\nusers U. The dataset consists of public data VP and disjoint subsets of private data Vu for each user\nu \u2208 U. The public-private aspect of data summarization is realized in two dimensions. First, each\nuser u \u2208 U has her own utility function fu(S) according to which she scores the value of a subset\nS \u2286 V. Throughout this paper we assume that fu(\u00b7) is integer-valued2, non-negative, and monotone\nsubmodular. More formally, submodularity means that\n\nfu(A \u222a {e}) \u2212 fu(A) \u2265 fu(B \u222a {e}) \u2212 fu(B) \u2200A \u2286 B \u2282 V and \u2200e \u2208 V \\ B.\n\n1All the results are applicable to submodular cover as a special case where there is only public data.\n2For the submodular cover problem it is a standard assumption that the function is integer-valued for the\ntheoretical results to hold. In applications where this assumption is not satisfied, either we can appropriately\ndiscretize and rescale the function, or instead of achieving the desired utility Q, try to reach (1 \u2212 \u03b4)Q, for some\n0 < \u03b4 < 1. In the latter case, we can simply replace Q with Q/\u03b4 in the theorems to get the correct bounds.\n\nMonotonicity implies that for any A \u2286 V and e \u2208 V we have \u2206fu(e|A) \u2250 fu(A \u222a {e}) \u2212 fu(A) \u2265 0.\nThe term \u2206fu(e|A) is called the marginal gain (or added value) of e to the set A. Whenever it is\nclear from the context we drop fu from \u2206fu(e|A). 
Without loss of generality, we normalize all\nusers\u2019 functions so that they achieve the same maximum value, i.e., fu(V) = fv(V) for all u, v \u2208 U.\nSecond, and in contrast to public data that is shared among all users, the private data of a user cannot\nbe shared with others. Thus, a user u \u2208 U can only evaluate the public and her own private part of a\nsummary S, i.e., S \u2229 (VP \u222a Vu). In other words, if the summary S contains private data of a user\nv \u2260 u, the user u cannot access or evaluate v\u2019s private part of S, i.e., S \u2229 Vv. In public-private\ndata summarization, we would like to find the smallest subset S \u2286 V such that all users reach a\ndesired utility Q \u2264 fu(V) = fu(VP \u222a Vu) simultaneously, i.e.,\n\nOPT = arg min_{S\u2286V} |S|, such that fu(S \u2229 (VP \u222a Vu)) \u2265 Q \u2200u \u2208 U.    (1)\n\nA naive way to solve the above problem is to find a separate summary for each user and then return\nthe union of all summaries as S. A more clever way is to realize that problem (1) is in fact equivalent\nto the following problem [13, 14]\n\nOPT = arg min_{S\u2286V} |S|, such that f(S) \u2250 \u2211_{u\u2208U} min{fu(S \u2229 (VP \u222a Vu)), Q} \u2265 Q \u00d7 |U|.    (2)\n\nNote that the surrogate function f(\u00b7) is also monotone submodular, as a thresholded monotone submodular\nfunction remains submodular and a sum of submodular functions is submodular. Thus, finding a set S that\nprovides each user with utility Q is equivalent\nto finding a set S with f(S) \u2265 L \u2250 Q \u00d7 |U|. This reduction lets us focus on developing a fast\ndistributed solution for solving a single submodular cover problem. Our method FASTCOVER is\nexplained in detail in Section 4.\n\nRelated Work: When the data is small, we can use the centralized greedy algorithm to solve\nproblem (2) (and equivalently problem (1)). 
The greedy algorithm sequentially picks elements and\nreturns a solution of size (1 + ln M)|OPT| \u2248 ln(L)|OPT|, where M = max_{e\u2208V} f(e). As elaborated\nearlier, when the data is large, one cannot run this greedy algorithm as it requires centralized access to\nthe full dataset. This is why scalable solutions for the submodular cover problem have recently gained\na lot of interest. In particular, for the set cover problem (a special case of submodular cover problem)\nthere have been efficient MapReduce-based implementations proposed in the literature [17, 18, 19].\nThere have also been recent studies on the streaming set cover problem [20]. Perhaps the closest work\nto our efforts is [12] where the authors proposed a distributed algorithm for the submodular cover\nproblem called DISCOVER. Their method relies on the reduction of the submodular cover problem to\nmultiple instances of the distributed constrained submodular maximization problem [6, 21]. For any\nfixed 0 < \u03b1 \u2264 1, DISCOVER returns a solution of size \u23082\u03b1k + 72 log(L)|OPT| \u221a(min(m, \u03b1|OPT|))\u2309\nin \u2308log(\u03b1|OPT|) + 36 \u221a(min(m, \u03b1|OPT|)) log(L)/\u03b1 + 1\u2309 rounds, where m denotes the number\nof machines. Even though DISCOVER scales better than the greedy algorithm, the solution it\nreturns is usually much larger. Moreover, the dependency of the number of MapReduce rounds on\n\u221a(min(m, \u03b1|OPT|)) is far from desirable. Note that as we increase the number of machines, the\nnumber of rounds may increase (rather than decreasing). Instead, in this paper we propose a fast\ndistributed algorithm, FASTCOVER, that truly scales to massive data and produces a solution that is\ncompetitive with that of the greedy algorithm. 
More specifically, for any \u03b5 > 0, FASTCOVER returns a\nsolution of size at most \u2308ln(L)|OPT|/(1 \u2212 \u03b5)\u2309 with at most \u2308log_{3/2}(n/(m|OPT|)) log(M)/\u03b5 + log(L)\u2309\nrounds, where M = max_{e\u2208V} f(e). Thus, in terms of speed, FASTCOVER improves exponentially\nupon DISCOVER while providing a smaller solution. Moreover, in our work, the number of rounds\ndecreases as the number of machines increases, in sharp contrast to [12].\n\n3 Applications of Public-Private Data Summarization\n\nIn this section, we discuss 3 concrete applications where parts of data are private and the remaining\nparts are public. All objective functions are non-negative, monotone, and submodular.\n\n3.1 Personalized Movie Recommendation\n\nConsider a movie recommender system that allows users to anonymously and privately rate movies.\nThe system can use this information to recognize users\u2019 preferences using existing matrix completion\ntechniques [22]. A good set of recommended movies should meet two criteria: 1) be correlated with\nthe user\u2019s preferences, and 2) be diverse and contain globally popular movies. To this end, we define the\nfollowing sum-coverage function to score the quality of the selected movies S for a user u:\n\nfu(S) = \u03b1u \u2211_{i\u2208S, j\u2208Vu} si,j + (1 \u2212 \u03b1u) \u2211_{i\u2208S, j\u2208VP \\S} si,j,    (3)\n\nwhere Vu is the list of highly ranked movies by user u (i.e., private information), VP is the set of\nall movies in the database3, and si,j measures the similarity between movie i and j. The similarity\ncan be easily calculated using the inner product between the corresponding feature vectors of any\ntwo movies i and j. The term \u2211_{i\u2208S, j\u2208Vu} si,j measures the similarity between the recommended\nset S and the user\u2019s preferences. The second term \u2211_{i\u2208S, j\u2208VP \\S} si,j encourages diversity. Finally,\nthe parameter 0 \u2264 \u03b1u \u2264 1 provides the user the freedom to specify how much she cares about\npersonalization versus diversity, i.e., \u03b1u = 1 indicates that all the recommended movies should be\nvery similar to the movies she highly ranked and \u03b1u = 0 means that she prefers to receive a set of\nglobally popular movies among all users, irrespective of her own private ratings. Note that in this\napplication, the universe of items (i.e., movies) is public. What is private is the users\u2019 ratings through\nwhich we identify the set of highly ranked movies by each user Vu. The effect of private data is\nexpressed in users\u2019 utility functions. The objective is to find the smallest set S \u2286 V of movies, from\nwhich we can build recommendations for all users in a way that all reach a certain utility.\n\n3.2 Personalized Location Recommendation\n\nNowadays, many mobile apps collect geolocation data of their users. To comply with privacy concerns,\nsome let their customers have control over their data, i.e., users can mark some part of their data\nprivate and disallow the app from sharing it with other users. In the personalized location recommendation,\na user is interested in identifying a set of locations that are correlated with the places she visited and\npopular places everyone else visited. Note that as nearby locations are likely to be similar, it is\ntypical to define a kernel matrix K capturing the similarity between data points. A commonly used\nkernel in practice is the squared exponential kernel K(ei, ej) = exp(\u2212||ei \u2212 ej||_2^2 / h^2). To define the\ninformation gain of a set of locations indexed by S, it is natural to use f(S) = log det(I + \u03c3K_{S,S}).\nThe information gain objective captures the diversity and is used in many ML applications, e.g., active\nset selection for nonparametric learning [6], sensor placement [13], determinantal point processes,\namong many others. 
Then, the personalized location recommendation can be modeled by\n\nfu(S) = \u03b1u f(S \u2229 Vu) + (1 \u2212 \u03b1u) f(S \u2229 VP),    (4)\n\nwhere Vu is the set of locations that user u does not want to share with others and VP is the collection\nof all publicly disclosed locations. Again, the parameter \u03b1u lets the user indicate to what extent she\nis willing to receive recommendations based on her private information. The objective is to find\nthe smallest set of locations to recommend to all users such that each reaches a desired threshold.\nNote that private data is usually small and private functions are fast to compute. Thus, the function\nevaluation is mainly affected by the amount of public data. Moreover, for many objectives, e.g.,\ninformation gain, each machine can evaluate fu(S) by using its own portion of the private data.\n\n3.3 Dominating Set in Social Networks\n\nProbably the easiest way to define the influence of a subset of users on other members of a social\nnetwork is by the dominating set problem. Here, we assume that there is a graph G = (V, E) where\nV and E indicate the set of nodes and edges, respectively. Let N(S) denote the neighbors of S. Then,\nwe define the coverage size of S by f(S) = |N(S) \u222a S|. The goal is to find the smallest subset S such\nthat the coverage size is at least some fraction of |V|. This is a trivial instance of public-private data\nsummarization as all the data is public and there is a single utility function. We use the dominating\nset problem to run a large-scale application for which DISCOVER terminates in a reasonable amount\nof time and its performance can be compared to our algorithm FASTCOVER.\n\n3Two private lists may point to similar movies, but for now we treat the items on each list as unique entities.\n\n4 FASTCOVER for Fast Distributed Submodular Cover\n\nIn this section, we explain in detail our fast distributed Algorithm FASTCOVER shown in Alg. 1. 
It\nreceives a universe of items V and an integer-valued, non-negative, monotone submodular function\nf : 2^V \u2192 R+. The objective is to find the smallest set S that achieves a value L \u2264 f(V).\nFASTCOVER starts with S = \u2205, and keeps adding those items x \u2208 V to S whose marginal values\n\u2206(x|S) are at least some threshold \u03c4. In the beginning, \u03c4 is set to a conservative initial value\nM \u2250 max_{x\u2208V} f(x). When there are no more items with a marginal value of at least \u03c4, FASTCOVER lowers \u03c4\nby a factor of (1 \u2212 \u03b5), and iterates anew through the elements. Thus, \u03c4 ranges over \u03c4_0 = M, \u03c4_1 =\n(1 \u2212 \u03b5)M, \u00b7\u00b7\u00b7 , \u03c4_\u2113 = (1 \u2212 \u03b5)^\u2113 M, \u00b7\u00b7\u00b7 . FASTCOVER terminates when f(S) \u2265 L. The parameter \u03b5\ndetermines the size of the final solution. When \u03b5 is small, we expect to find better solutions (i.e.,\nsmaller in size) while having to spend more rounds.\nOne of the key ideas behind FASTCOVER is that finding elements with marginal values of at least \u03c4 = \u03c4_\u2113 can\nbe done in a distributed manner. Effectively, FASTCOVER partitions V into m sets T1, . . . , Tm, one\nfor each cluster node/machine. A naive distributed implementation is the following. For a given set\nS (whose elements are communicated to all machines) each machine i finds all of its items x \u2208 Ti\nwhose marginal values \u2206(x|S) are at least \u03c4 and sends them all to a central machine (note that\nS is fixed on each machine). Then, this central machine sequentially augments S with elements\nwhose marginal values are at least \u03c4 (here S changes by each insertion). The new elements of S\nare communicated back to all machines and they run the same procedure, this time with a smaller\nthreshold \u03c4(1 \u2212 \u03b5). 
The main problem with this approach is that there might be many items on each\nmachine that satisfy the chosen threshold \u03c4 at each round (i.e., many more than |OPT|). A flood of\nsuch items from m machines overwhelms the central machine. Instead, what FASTCOVER does is to\nforce each machine to randomly pick only k items from its potentially big set of candidates (i.e.,\nTHRESHOLDSAMPLE algorithm shown in Alg. 2). The value k is carefully chosen (line 7). This way\nthe number of items the central machine processes is never more than O(m|OPT|).\n\nAlgorithm 1: FASTCOVER\n1 Input: V, \u03b5, L, and m\n2 Output: S \u2286 V where f(S) \u2265 L\n3 Find a balanced partition {Ti}_{i=1}^m of V;\n4 S \u2190 \u2205;\n5 \u03c4 \u2190 max_{x\u2208V} f(x);\n6 while \u03c4 \u2265 1 do\n7     k \u2190 \u2308(L \u2212 f(S))/\u03c4\u2309;\n8     forall the 1 \u2264 i \u2264 m do\n9         <Si, Fulli> \u2190 ThresholdSample(i, \u03c4, k, S);\n10    forall the x \u2208 \u222a_{i=1}^m Si do\n11        if f({x} \u222a S) \u2212 f(S) \u2265 \u03c4 then\n12            S \u2190 S \u222a {x};\n13            if f(S) \u2265 L then Break;\n14    if \u2200i : Fulli = False then\n15        if \u03c4 > 1 then \u03c4 \u2190 max{1, (1 \u2212 \u03b5)\u03c4};\n16        else Break;\n17 Return S;\n\nAlgorithm 2: THRESHOLDSAMPLE\n1 Input: Index i, \u03c4, k, and S\n2 Output: Si \u2282 Ti with |Si| \u2264 k\n3 Si \u2190 \u2205;\n4 forall the x \u2208 Ti do\n5     if f(S \u222a {x}) \u2212 f(S) \u2265 \u03c4 then\n6         Si \u2190 Si \u222a {x};\n7 if |Si| \u2264 k then\n8     Return <Si, False>;\n9 else\n10    Si \u2190 k random items of Si;\n11    Return <Si, True>;\n\nTheorem 4.1. 
FASTCOVER terminates with at most log_{3/2}(n/(|OPT|m)) (1 + log(M)/\u03b5) + log_2(L)\nrounds (with high probability) and a solution of size at most |OPT| ln(L)/(1 \u2212 \u03b5).\nAlthough FASTCOVER is distributed and unlike centralized algorithms does not enjoy the benefits of\naccessing all items together, its solution size is truly competitive with the greedy algorithm and is\nonly away by a factor of 1/(1 \u2212 \u03b5). Moreover, its number of rounds is logarithmic in n and L. This\nis in sharp contrast with the previously best known algorithm, DISCOVER [12], where the number of\nrounds scales with \u221a(min(m, |OPT|))4. Thus, FASTCOVER not only improves exponentially over\nDISCOVER in terms of speed but also its number of rounds decreases as the number of available\nmachines m increases. Even though FASTCOVER is a simple distributed algorithm, its performance\nanalysis is technical and is deferred to the supplementary materials. Below, we provide the main\nideas behind the proof of Theorem 4.1.\n\n4Note that \u221a(min(m, |OPT|)) can be as large as n^{1/6} when |OPT| = n^{1/3} and the memory limit of each\nmachine is n^{2/3} which results in m \u2265 n^{1/3}.\n\nProof sketch. We say that an item has a high value if its marginal value to S is at least \u03c4. We define\nan epoch to be the rounds during which \u03c4 does not change. In the last round of each epoch, all\nhigh value items are sent to the central machine (i.e., the set \u222a_{i=1}^m Si) because Fulli is false for all\nmachines. We also add every high value item to S in lines 11 \u2212 12. So, at the end of each epoch,\nmarginal values of all items to S are less than \u03c4. Since we reduce \u03c4 by a factor of (1 \u2212 \u03b5), we can\nalways say that \u03c4 \u2265 (1 \u2212 \u03b5) max_{x\u2208V} \u2206(x|S) which means we are only adding items that have almost\nthe highest marginal values. 
By the classic analysis of the greedy algorithm for submodular maximization,\nwe can conclude that every item we add has an added value that is at least (1 \u2212 \u03b5)(L \u2212 f(S))/|OPT|.\nTherefore, after adding |OPT| ln(L)/(1 \u2212 \u03b5) items, f(S) becomes at least L.\nTo upper bound rounds, we divide the rounds into two groups. In a good round, the algorithm adds\nat least k/2 items to S. The rest are bad rounds. In a good round, we add at least k/2 \u2265 (L \u2212 f(S))/(2\u03c4)\nitems, and each of them increases the value of S by at least \u03c4. Therefore in a good round, we see at least\n(L \u2212 f(S))/2 increase in value of S. In other words, the gap L \u2212 f(S) is reduced by a factor of at\nleast 2 in each good round. Since f only takes integer values, once L \u2212 f(S) becomes less than 1,\nwe know that f(S) \u2265 L. Therefore, there cannot be more than log_2 L good rounds. Every time we\nupdate \u03c4 (start of an epoch), we decrease it by a factor of 1 \u2212 \u03b5 (except maybe the last round for\nwhich \u03c4 = 1). Therefore, there are at most 1 + log_{1/(1\u2212\u03b5)}(M) \u2264 1 + log(M)/log(1/(1 \u2212 \u03b5)) \u2264 1 + log(M)/\u03b5\nepochs.\nIn a bad round, a machine with more than k high value items sends k of those to the central machine,\nand at most k/2 of them are selected. In other words, the addition of these items to S in this bad\nround caused more than half of high value items of each machine to become of low value (marginal\nvalues less than \u03c4). Since there are n/m items in each machine, and Fulli becomes False once there\nare at most k high value items in the machine, we conclude that in expectation there should not be\nmore than log_2(n/(km)) bad rounds in each epoch. Summarizing the upper bounds yields the bound\non total number of rounds. Finer analysis leads to the high probability claim.\n\n5 Experiments\n\nIn this section, we evaluate the performance of FASTCOVER on the three applications that we\ndescribed in Section 3: personalized movie recommendation, personalized location recommendation,\nand dominating set on social networks. To validate our theoretical results and demonstrate the\neffectiveness of FASTCOVER, we compare the performance of our algorithm against DISCOVER and\nthe centralized greedy algorithm (when possible).\nOur experimental infrastructure was a cluster of 16 quad-core machines with 20GB of memory\neach, running Spark. The cluster was configured with one master node responsible for resource\nmanagement, and the remaining 15 machines working as executors. We set the number of reducers\nto m = 60. To run FASTCOVER on Spark, we first distributed the data uniformly at random to\nthe machines, and performed a map/reduce task to find the highest marginal gain \u03c4 = M. Each\nmachine then carries out a set of map/reduce tasks in sequence, where each map/reduce stage\nfilters out elements with a specific threshold \u03c4 on the whole dataset. We then tune the parameter \u03c4,\ncommunicate back the results to the machines and perform another round of map/reduce calculation.\nWe continue performing map/reduce tasks until we get to the desired value L.\n\n5.1 Personalized Location Recommendation with Spark\n\nOur location recommendation experiment involves applying FASTCOVER to the information gain\nutility function, described in Eq. (4). Our dataset consists of 3,056 GPS measurements from 20 users\nin the form of (latitude, longitude, altitude) collected during bike tours around Zurich [23]. The size\nof each path is between 50 and 500 GPS coordinates. 
For each pair of points i and j we used the\ncorresponding GPS coordinates to calculate their distance in meters d(i, j) and then formed a squared\nexponential kernel Ki,j = exp(\u2212d(i, j)^2/h^2) with h = 1500. For each user, we marked 20% of her\ndata private (data points are chosen consecutively) selected from each path taken by the biker. The\nparameter \u03b1u is set randomly for each user u.\nFigures 1a, 1b, 1c compare the performance of FASTCOVER to the benchmarks for building a\nrecommendation set that covers 60%, 80%, and 90% of the maximum utility of each user. We\nconsidered running DISCOVER with different values of parameter \u03b1 that trades off\nthe size of the solution against the number of rounds of the algorithm. It can be seen that by avoiding the\ndoubling steps of DISCOVER, our algorithm FASTCOVER is able to return a significantly smaller\nsolution than that of DISCOVER in considerably fewer rounds. Interestingly, for small values\nof \u03b5, FASTCOVER returns a solution that is even smaller than the centralized greedy algorithm.\n\n5.2 Personalized Movie Recommendation with Spark\n\nOur personalized public-private recommendation experiment involves FASTCOVER applied to a set\nof 1,313 movies, and 20,000,263 users\u2019 ratings from 138,493 users of the MovieLens database [24].\nAll selected users rated at least 20 movies. Each movie is associated with a 25 dimensional feature\nvector calculated from users\u2019 ratings. We use the inner product of the non-normalized feature vectors\nto compute the similarity si,j between movies i and j [25]. Our final objective function consists of\n138,493 coverage functions -one per user- and a global sum-coverage function defined on the whole\npool of movies (see Eq. (3)). 
Each function is normalized by its maximum value to make sure that all\nfunctions have the same scale.\nFig 1d, 1e, 1f show the ratio of the size of the solutions obtained by FASTCOVER to that of the greedy\nalgorithm. The figures demonstrate the results for 10%, 20%, and 30% covers for all the 138,493\nusers\u2019 utility functions. The parameter \u03b1u is set to 0.7 for all users. We scaled down the number of\niterations by a factor of 0.01, so that the corresponding bars can be shown in the same figures. Again,\nFASTCOVER was able to find a considerably smaller solution than the centralized greedy. Here, we\ncouldn\u2019t run DISCOVER because of its prohibitive running time on Spark.\nFig 1g shows the size of the solution set obtained by FASTCOVER for building recommendations\nfrom a set of 1000 movies for 1000 users vs. the size of the merged solutions found by finding\nrecommendations separately for each user. It can be seen that FASTCOVER was able to find a much\nsmaller solution by covering all the functions at the same time.\n\n5.3 Large Scale Dominating Set with Spark\n\nIn order to be able to compare the performance of our algorithm with DISCOVER more precisely,\nwe applied FASTCOVER to the Friendster network, which consists of 65,608,366 nodes and 1,806,067,135\nedges [26]. This dataset was used in [12] to evaluate the performance of DISCOVER.\nFig. 1j, 1k, 1l show the performance of FASTCOVER for obtaining covers for 30%, 40%, 50%\nof the whole graph, compared to the centralized greedy solution. Again, the size of the solution\nobtained by FASTCOVER is smaller than the greedy algorithm for small values of \u03b5. Note that\nrunning the centralized greedy is impractical if the dataset cannot fit into the memory of a single\nmachine. Fig. 1h compares the solution set size and the number of rounds for FASTCOVER and\nDISCOVER with different values of \u03b5 and \u03b1. 
The points in the bottom left correspond to the solutions\nobtained by FASTCOVER, confirming its superior performance. We further measured the actual\nrunning time of both algorithms on a smaller instance of the same graph with 14,043,721 nodes. We\ntuned \u03b5 and \u03b1 to get solutions of approximately equal size for both algorithms. Fig. 1i shows the\nspeedup of FASTCOVER over DISCOVER. It can be observed that as the coverage value\nL increases, FASTCOVER shows an exponential speedup over DISCOVER.\n\n6 Conclusion\nIn this paper, we introduced the public-private model of data summarization motivated by privacy\nconcerns of recommender systems. We also developed a fast distributed algorithm, FASTCOVER,\nthat provides a succinct summary for all users without violating their privacy. We showed that\nFASTCOVER returns a solution that is competitive with that of the best centralized, polynomial-time\nalgorithm (i.e., the greedy solution). We also showed that FASTCOVER runs exponentially faster than\nthe previously proposed distributed algorithms. The superior practical performance of FASTCOVER\nagainst all the benchmarks was demonstrated through a large set of experiments, including movie\nrecommendation, location recommendation, and dominating set (all implemented with Spark).\nOur theoretical results, combined with the practical performance of FASTCOVER, make it the only\nexisting distributed algorithm for the submodular cover problem that truly scales to massive data.\n\nAcknowledgment: This research was supported by a Google Faculty Research Award and a DARPA\nYoung Faculty Award (D16AP00046).\n\n(a) Location data (60%), (b) Location data (80%), (c) Location data (90%), (d) Movies (10%), (e) Movies (20%), (f) Movies (30%), (g) Movie (1K), (h) Friendster (50%), (i) Friendster (14M), (j) Friendster (30%), (k) Friendster (40%), (l) Friendster (50%)\n\nFigure 1: Performance of FASTCOVER vs. other baselines. 
a), b), c) solution set size vs. number of rounds for\npersonalized location recommendation on a set of 3,056 GPS measurements, for covering 60%, 80%, 90% of the\nmaximum utility of each user. d), e), f) same measures for personalized movie recommendation on a set of 1000\nmovies, 138,493 users and 20,000,263 ratings, for covering 10%, 20%, 30% of the maximum utility of each user.\ng) solution set size vs. coverage for simultaneously covering all users vs. covering users one by one and taking\nthe union. The recommendation is on a set of 1000 movies for 1000 users. h) solution set size vs. the number of\nrounds for FASTCOVER and DISCOVER for covering 50% of the Friendster network with 65,608,366 vertices. i)\nExponential speedup of FASTCOVER over DISCOVER on a subgraph of 14M nodes. j), k), l) solution set size vs.\nthe number of rounds for covering 30%, 40%, 50% of the Friendster network.\n\nReferences\n[1] Sebastian Tschiatschek, Rishabh Iyer, Haochen Wei, and Jeff Bilmes. Learning Mixtures of Submodular Functions for Image Collection Summarization. In NIPS, 2014.\n\n[2] Khalid El-Arini and Carlos Guestrin. Beyond keyword search: discovering relevant scientific literature. In KDD, 2011.\n\n[3] Ian Simon, Noah Snavely, and Steven M Seitz. Scene summarization for online image collections. In ICCV, 2007.\n\n[4] Delbert Dueck and Brendan J Frey. Non-metric affinity propagation for unsupervised image categorization. In ICCV, 2007.\n\n[5] Ryan Gomes and Andreas Krause. Budgeted nonparametric learning from data streams. In ICML, 2010.\n\n[6] Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause. Distributed submodular maximization: Identifying representative elements in massive data. In NIPS, 2013.\n\n[7] Hui Lin and Jeff Bilmes. A class of submodular functions for document summarization. In North American chapter of the Assoc. for Comp. Linguistics/Human Lang. Tech., 2011.\n\n[8] Ruben Sipos, Adith Swaminathan, Pannaga Shivaswamy, and Thorsten Joachims. Temporal corpus summarization using submodular word coverage. In CIKM, 2012.\n\n[9] Laurence A. Wolsey. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 1982.\n\n[10] Uriel Feige. A threshold of ln n for approximating set cover. Journal of the ACM, 1998.\n\n[11] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, 2004.\n\n[12] Baharan Mirzasoleiman, Amin Karbasi, Ashwinkumar Badanidiyuru, and Andreas Krause. Distributed submodular cover: Succinctly summarizing massive data. 
In NIPS, 2015.\n\n[13] Andreas Krause, Brendan McMahan, Carlos Guestrin, and Anupam Gupta. Robust submodular observation selection. JMLR, 2008.\n\n[14] Rishabh K Iyer and Jeff A Bilmes. Submodular optimization with submodular cover and submodular knapsack constraints. In NIPS, 2013.\n\n[15] Flavio Chierichetti, Alessandro Epasto, Ravi Kumar, Silvio Lattanzi, and Vahab Mirrokni. Efficient algorithms for public-private social networks. In KDD, 2015.\n\n[16] Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, and Amin Karbasi. Fast constrained submodular maximization: Personalized data summarization. In ICML, 2016.\n\n[17] Bonnie Berger, John Rompel, and Peter W Shor. Efficient NC algorithms for set cover with applications to learning and geometry. Journal of Computer and System Sciences, 1994.\n\n[18] Guy E. Blelloch, Richard Peng, and Kanat Tangwongsan. Linear-work greedy parallel approximate set cover and variants. In SPAA, 2011.\n\n[19] Stergios Stergiou and Kostas Tsioutsiouliklis. Set cover at web scale. In SIGKDD, 2015.\n\n[20] Erik D Demaine, Piotr Indyk, Sepideh Mahabadi, and Ali Vakilian. On streaming and communication complexity of the set cover problem. In Distributed Computing. 2014.\n\n[21] Ravi Kumar, Benjamin Moseley, Sergei Vassilvitskii, and Andrea Vattani. Fast greedy algorithms in mapreduce and streaming. TOPC, 2015.\n\n[22] Emmanuel J Cand\u00e8s and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational mathematics, 2009.\n\n[23] https://refind.com/fphilipe/topics/open-data.\n\n[24] Grouplens. movielens 20m dataset. http://grouplens.org/datasets/movielens/20m/.\n\n[25] Erik M Lindgren, Shanshan Wu, and Alexandros G Dimakis. Sparse and greedy: Sparsifying submodular facility location problems. NIPS, 2015.\n\n[26] Jaewon Yang and Jure Leskovec. 
Defining and evaluating network communities based on ground-truth. Knowledge and Information Systems, 2015.\n", "award": [], "sourceid": 1792, "authors": [{"given_name": "Baharan", "family_name": "Mirzasoleiman", "institution": "ETH Zurich"}, {"given_name": "Morteza", "family_name": "Zadimoghaddam", "institution": "Google Research"}, {"given_name": "Amin", "family_name": "Karbasi", "institution": "Yale"}]}