{"title": "Adaptive Anonymity via $b$-Matching", "book": "Advances in Neural Information Processing Systems", "page_first": 3192, "page_last": 3200, "abstract": "The adaptive anonymity problem is formalized where each individual shares their data along with an integer value to indicate their personal level of desired privacy. This problem leads to a generalization of $k$-anonymity to the $b$-matching setting. Novel algorithms and theory are provided to implement this type of anonymity. The relaxation achieves better utility, admits theoretical privacy guarantees that are as strong, and, most importantly, accommodates a variable level of anonymity for each individual. Empirical results confirm improved utility on benchmark and social data-sets.", "full_text": "Adaptive Anonymity via b-Matching\n\nKrzysztof Choromanski\nColumbia University\nkmc2178@columbia.edu\n\nTony Jebara\n\nColumbia University\n\ntj2008@columbia.edu\n\nKui Tang\n\nColumbia University\n\nkt2384@columbia.edu\n\nAbstract\n\nThe adaptive anonymity problem is formalized where each individual shares their\ndata along with an integer value to indicate their personal level of desired privacy.\nThis problem leads to a generalization of k-anonymity to the b-matching setting.\nNovel algorithms and theory are provided to implement this type of anonymity.\nThe relaxation achieves better utility, admits theoretical privacy guarantees that\nare as strong, and, most importantly, accommodates a variable level of anonymity\nfor each individual. Empirical results con\ufb01rm improved utility on benchmark and\nsocial data-sets.\n\n1 Introduction\nIn many situations, individuals wish to share their personal data for machine learning applications\nand other exploration purposes. If the data contains sensitive information, it is necessary to protect\nit with privacy guarantees while maintaining some notion of data utility [18, 2, 24]. There are\nvarious de\ufb01nitions of privacy. 
These include k-anonymity [19], l-diversity [16], t-closeness [14] and differential privacy¹ [3, 22]. All these privacy guarantees fundamentally treat each contributed datum about an individual equally. However, the acceptable anonymity and comfort-level of each individual in a population can vary widely. This article explores the adaptive anonymity setting and shows how to generalize the k-anonymity framework to handle it. Other related approaches have been previously explored [20, 21, 15, 5, 6, 23], yet herein we contribute novel efficient algorithms and formalize precise privacy guarantees. Note also that there are various definitions of utility. This article focuses on the use of suppression since it is well-formalized. Therein, we hide certain values in the data-set by replacing them with a ∗ symbol (fewer ∗ symbols indicate higher utility). The overall goal is to maximize utility while preserving each individual's level of desired privacy.
This article is organized as follows. § 2 formalizes the adaptive anonymity problem and shows how k-anonymity does not handle it. This leads to a relaxation of k-anonymity into symmetric and asymmetric bipartite regular compatibility graphs. § 3 provides algorithms for maximizing utility under these relaxed privacy criteria. § 4 provides theorems to ensure the privacy of these relaxed criteria for uniform anonymity as well as for adaptive anonymity. § 5 shows experiments on benchmark and social data-sets. Detailed proofs are provided in the Supplement.
2 Adaptive anonymity and necessary relaxations to k-anonymity
The adaptive anonymity problem considers a data-set X ∈ Z^{n×d} consisting of n ∈ N observations {x1, . . . , xn}, each of which is a d-dimensional discrete vector; in other words, xi ∈ Z^d. Each user 
Each user\ni contributes an observation vector xi which contains discrete attributes pertaining to that user2.\nFurthermore, each user i provides an adaptive anonymity parameter \u03b4i \u2208 N they desire to keep\nwhen the database is released. Given such a data-set and anonymity parameters, we wish to output\nan obfuscated data-set denoted by Y \u2208{ Z \u222a \u2217} n\u00d7d which consists of vectors {y1, . . . , yn} where\n1Differential privacy often requires specifying the data application (e.g. logistic regression) in advance [4].\n2For instance, a vector can contain a user\u2019s gender, race, height, weight, age, income bracket and so on.\n\n1\n\n\fyi(k) \u2208{ xi(k),\u2217}. The star symbol \u2217 indicates that the k\u2019th attribute has been masked in the i\u2019th\nuser-record. We say that vector xi is compatible with vector yj if xi(k) = yj(k) for all elements of\nyj(k) $= \u2217. The goal of this article is to create a Y which contains a minimal number of \u2217 symbols\nsuch that each entry yi of Y is compatible with at least \u03b4i entries of X and vice-versa.\nThe most pervasive method for anonymity in the released data is the k-anonymity method [19, 1].\nHowever, it is actually more constraining than the above desiderata. If all users have the same value\n\u03b4i = k, then k-anonymity suppresses data in the database such that, for each user\u2019s data vector in the\nreleased (or anonymized) database, there are at least k \u2212 1 identical copies in the released database.\nThe existence of copies is used by k-anonymity to justify some protection to attack.\nWe will show that the idea of k \u2212 1 copies can be understood as forming a compatibility graph be-\ntween the original database and the released database which is composed of several fully-connected\nk-cliques. However, rather than guaranteeing copies or cliques, the anonymity problem can be\nrelaxed into a k-regular compatibility to achieve nearly identical resilience to attack. 
More interestingly, this relaxation will naturally allow users to select different δi anonymity values or degrees in the compatibility graph and allow them to achieve their desired personal protection level.
Why can't k-anonymity handle heterogeneous anonymity levels δi? Consider the case where the population contains many liberal users with very low anonymity levels yet one single paranoid user (user i) wants to have maximal anonymity with δi = n. In the k-anonymity framework, that user will require n − 1 identical copies of his data in the released database. Thus, a single paranoid user will destroy all the information of the database, which will merely contain completely redundant vectors. We will propose a b-matching relaxation to k-anonymity which prevents this degeneracy since it does not merely handle compatibility queries by creating copies in the released data.
While k-anonymity is not the only criterion for privacy, there are situations in which it is sufficient, as illustrated by the following scenario. First assume the data-set X is associated with a set of identities (or usernames) and Y is associated with a set of keys. A key may be the user's password or some secret information (such as their DNA sequence). Represent the usernames and keys using integers x1, . . . , xn and y1, . . . , yn, respectively. Username xi ∈ Z is associated with entry xi and key yj ∈ Z is associated with entry yj. Furthermore, assume that these usernames and keys are diverse, unique and independent of their corresponding attributes. These x and y values are known as the sensitive attributes and the entries of X and Y are the non-sensitive attributes [16]. We aim to release an obfuscated database Y and its keys with the possibility that an adversary may have access to all or a subset of X and the identities.
The goal is to ensure that the success of an attack (using a username-key pair) is low. 
In other words, the attack succeeds with probability no larger than 1/δi for a user who specified δi ∈ N. Thus, the attack we seek to protect against is the use of the data to match usernames to keys (rather than attacks in which additional non-sensitive attributes about a user are discovered). In the uniform δi setting, k-anonymity guarantees that a single one-time attack using a single username-key pair succeeds with probability at most 1/k. In the extreme case, it is easy to see that replacing all of Y with ∗ symbols will result in an attack success probability of 1/n if the adversary attempts a single random attack-pair (username and key). Meanwhile, releasing a database Y = X with keys could allow the adversary to succeed with an initial attack with probability 1.
We first assume that all degrees δi are constant and set to δ and discuss how the proposed b-matching privacy output subtly differs from standard k-anonymity [19]. First, define quasi-identifiers as sets of attributes, like gender and age, that can be linked with external data to uniquely identify an individual in the population. The k-anonymity criterion says that a data-set such as Y is protected against linking attacks that exploit quasi-identifiers if every element is indistinguishable from at least k − 1 other elements with respect to every set of quasi-identifier attributes. We will instead use a compatibility graph G to more precisely characterize how elements are indistinguishable in the data-sets and which entries of Y are compatible with entries in the original data-set X. The graph places edges between entries of X which are compatible with entries of Y. Clearly, G is an undirected bipartite graph containing two equal-sized partitions (or color-classes) of nodes A and B, each of cardinality n, where A = {a1, . . . , an} and B = {b1, . . . , bn}. 
Each element of A is associated with an entry of X and each element of B is associated with an entry of Y. An edge e = (i, j) ∈ G adjacent to a node in A and a node in B indicates that the entries xi and yj are compatible; the absence of an edge indicates that they are not.

For δi = δ, b-matching produces δ-regular bipartite graphs G while k-anonymity produces δ-regular clique-bipartite graphs³, defined as follows.
Definition 2.1 Let G(A, B) be a bipartite graph with color classes A, B where A = {a1, ..., an}, B = {b1, ..., bn}. We call a k-regular bipartite graph G(A, B) a clique-bipartite graph if it is a union of pairwise disjoint and nonadjacent complete k-regular bipartite graphs.
Denote by G^{n,δ}_b the family of δ-regular bipartite graphs with n nodes. Similarly, denote by G^{n,δ}_k the family of δ-regular clique-bipartite graphs. We will also denote by G^{n,δ}_s the family of symmetric δ-regular graphs, using the following definition of symmetry.
Definition 2.2 Let G(A, B) be a bipartite graph with color classes A, B where A = {a1, ..., an}, B = {b1, ..., bn}. We say that G(A, B) is symmetric if the existence of an edge (ai, bj) in G(A, B) implies the existence of an edge (aj, bi), where 1 ≤ i, j ≤ n.
For values of n that are not trivially small, it is easy to see that the graph families satisfy G^{n,δ}_k ⊆ G^{n,δ}_s ⊆ G^{n,δ}_b. This holds since symmetric δ-regular graphs are δ-regular with the additional symmetry constraint, while clique-bipartite graphs are δ-regular graphs constrained to be clique-bipartite, and the latter property automatically yields symmetry.
This article introduces the graph families G^{n,δ}_b and G^{n,δ}_s to enforce privacy since these are relaxations of the family G^{n,δ}_k previously explored in k-anonymity research. These relaxations will achieve better utility in the released database. Furthermore, they will allow us to permit adaptive anonymity levels across the users in the database. We will drop the superscripts n and δ whenever the meaning is clear from the context. Additional properties of these graph families will be formalized in § 4 but we first informally illustrate how they are useful in achieving data privacy.

username   data (X)    anonymized (Y)   key
alice      1 0 0 0     * 0 0 0          ggacta
bob        0 0 0 0     * 0 0 0          tacaga
carol      0 0 1 1     * 0 1 1          ctagag
dave       1 0 1 1     * 0 1 1          tatgaa
eve        1 1 0 0     * 1 * *          caacgc
fred       0 1 1 1     * 1 * *          tgttga

Figure 1: Traditional k-anonymity (in Gk) for n = 6, d = 4, δ = 2 achieves #(∗) = 10. Left to right: usernames with data (x, X), compatibility graph (G) and anonymized data with keys (Y, y).

username   data (X)    anonymized (Y)   key
alice      1 0 0 0     * 0 0 0          ggacta
bob        0 0 0 0     * * 0 0          tacaga
carol      0 0 1 1     * 0 1 1          ctagag
dave       1 0 1 1     * * 1 1          tatgaa
eve        1 1 0 0     1 * 0 0          caacgc
fred       0 1 1 1     0 * 1 1          tgttga

Figure 2: The b-matching anonymity (in Gb) for n = 6, d = 4, δ = 2 achieves #(∗) = 8. Left to right: usernames with data (x, X), compatibility graph (G) and anonymized data with keys (Y, y).

In figure 1, we see an example of k-anonymity with a graph from Gk. Here each entry of the anonymized data-set Y appears k = 2 times (or δ = 2). The compatibility graph shows 3 fully connected cliques since each of the k copies in Y has identical entries. By brute force exploration we find that the minimum number of stars to achieve this type of anonymity is #(∗) = 10. Moreover, since this problem is NP-hard [17], efficient algorithms rarely achieve this best-possible utility (minimal number of stars).
Next, consider figure 2 where we have achieved superior utility by only introducing #(∗) = 8 stars to form Y. The compatibility graph is at least δ = 2-regular. It was possible to find a smaller number of stars since δ-regular bipartite graphs are a relaxation of k-clique graphs as shown in figure 1. Another possibility (not shown in the figures) is a symmetric version of figure 2 where nodes on the left hand side and nodes on the right hand side have a symmetric connectivity. Such an intermediate solution (since Gk ⊂ Gs ⊂ Gb) should potentially achieve #(∗) between 8 and 10.
It is easy to see why all graphs have to have a minimum degree of at least δ (i.e. must contain a δ-regular graph). If one of the nodes has a degree of 1, then the adversary will know the key (or the username) for that node with certainty. If each node has degree δ or larger, then the adversary will have probability at most 1/δ of choosing the correct key (or username) for any random victim.
We next describe algorithms which accept X and integers δ1, . . . , δn and output Y such that each entry i in Y is compatible with at least δi entries in X and vice-versa. These algorithms operate by finding a graph in Gb or Gs and achieve similar protection as k-anonymity (which finds a graph in the most restrictive family Gk and therefore requires more stars).

³Traditional k-anonymity releases an obfuscated database of n rows where there are k copies of each row. So, each copy has the same neighborhood. Similarly, the entries of the original database all have to be connected to the same k copies in the obfuscated database. This induces a so-called bipartite clique-connectivity.
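The toy example of figures 1 and 2 can be checked mechanically. Below is a minimal sketch in plain Python; the row vectors are transcribed from the figures, and `compatible`, `stars` and `min_degree` are hypothetical helper names, not part of the paper:

```python
STAR = "*"

def compatible(x, y):
    """x matches y on every attribute that y leaves unmasked."""
    return all(yk == STAR or xk == yk for xk, yk in zip(x, y))

def stars(Y):
    """Utility loss: total number of suppressed entries."""
    return sum(row.count(STAR) for row in Y)

def min_degree(X, Y):
    """Smallest compatibility degree on either side of the bipartite graph."""
    left = min(sum(compatible(x, y) for y in Y) for x in X)
    right = min(sum(compatible(x, y) for x in X) for y in Y)
    return min(left, right)

# Rows for alice, bob, carol, dave, eve, fred.
X   = [[1,0,0,0],[0,0,0,0],[0,0,1,1],[1,0,1,1],[1,1,0,0],[0,1,1,1]]
Y_k = [[STAR,0,0,0],[STAR,0,0,0],[STAR,0,1,1],[STAR,0,1,1],
       [STAR,1,STAR,STAR],[STAR,1,STAR,STAR]]        # k-anonymity, fig. 1
Y_b = [[STAR,0,0,0],[STAR,STAR,0,0],[STAR,0,1,1],[STAR,STAR,1,1],
       [1,STAR,0,0],[0,STAR,1,1]]                    # b-matching, fig. 2

assert stars(Y_k) == 10 and min_degree(X, Y_k) == 2
assert stars(Y_b) == 8 and min_degree(X, Y_b) == 2
```

Both releases give every user compatibility degree at least δ = 2, but the b-matching release does so with two fewer suppressions.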
We provide a theoretical analysis of the topology of G in these two new families to show resilience to single and sustained attacks from an all-powerful adversary.
3 Approximation algorithms
While the k-anonymity suppression problem is known to be NP-hard, a polynomial time method with an approximation guarantee is the forest algorithm [1], which has an approximation ratio of 3k − 3. In practice, though, the forest algorithm is slow and achieves poor utility compared to clustering methods [10]. We provide an algorithm for the b-matching anonymity problem with approximation ratio δ and runtime O(δm√n), where n is the number of users in the data, δ is the largest anonymity level in {δ1, . . . , δn} and m is the number of edges to explore (in the worst case with no prior knowledge, we have m = O(n²) edges between all possible users). One algorithm solves for minimum weight bipartite b-matchings, which is easy to implement using linear programming, max-flow methods or belief propagation in the bipartite case [9, 11]. The other algorithm uses a general non-bipartite solver which involves Blossom structures and requires O(δmn log(n)) time [8, 9, 13]. Fortunately, minimum weight general matching has recently been shown to require only O(mε⁻¹ log ε⁻¹) time to achieve a (1 − ε) approximation [7].
First, we define two quantities of interest. Given a graph G with adjacency matrix G ∈ B^{n×n} and a data-set X, the Hamming error is defined as h(G) = Σ_i Σ_j G_ij Σ_k 1[X_ik ≠ X_jk]. The number of stars to achieve G is s(G) = nd − Σ_i Σ_k Π_j (1 − G_ij 1[X_ik ≠ X_jk]).
Recall Gb is the family of regular bipartite graphs. Let min_{G∈Gb} s(G) be the minimum number of stars (or suppressions) that one can place in Y while keeping the entries in Y compatible with at least δ entries in X and vice-versa. 
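The two quantities h(G) and s(G) translate directly into code; a small sketch in plain Python, with G a 0/1 adjacency matrix and 1[·] the indicator (the function names are hypothetical):

```python
def hamming_error(G, X):
    # h(G) = sum_{i,j} G_ij * |{k : X_ik != X_jk}|
    n, d = len(X), len(X[0])
    return sum(G[i][j] * sum(X[i][k] != X[j][k] for k in range(d))
               for i in range(n) for j in range(n))

def num_stars(G, X):
    # s(G) = n*d - sum_{i,k} prod_j (1 - G_ij * 1[X_ik != X_jk]):
    # entry (i,k) survives only if no selected neighbor j disagrees on k.
    n, d = len(X), len(X[0])
    kept = sum(all(not (G[i][j] and X[i][k] != X[j][k]) for j in range(n))
               for i in range(n) for k in range(d))
    return n * d - kept
```

For a two-user example X = [[1,0],[0,0]] with the single cross edge in both directions, both quantities equal 2: the one disagreeing attribute is counted once per edge in h, and it forces a star in each user's row.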
We propose the following polynomial time algorithm which, in its first iteration, minimizes h(G) over the family Gb and then iteratively minimizes a variational upper bound [12] on s(G) using a weighted version of the Hamming distance.

Algorithm 1 variational bipartite b-matching
Input X ∈ Z^{n×d}, δi ∈ N for i ∈ {1, . . . , n}, ε > 0 and initialize W ∈ R^{n×d} to the all 1s matrix
While not converged {
  Set Ĝ = argmin_{G∈B^{n×n}} Σ_ij G_ij Σ_k W_ik 1[X_ik ≠ X_jk]  s.t.  Σ_j G_ij = Σ_j G_ji ≥ δi
  For all i and k set W_ik = exp( Σ_j Ĝ_ij 1[X_ik ≠ X_jk] ln(ε/(1+ε)) )
  For all i and k set Y_ik = ∗ if Ĝ_ij = 1 and X_jk ≠ X_ik for any j
}
Choose random permutation M as matrix M ∈ B^{n×n} and output Y_public = MY

We can further restrict the b-matching solver such that the graph G is symmetric with respect to both the original data X and the obfuscated data Y. To do so, we require that G is a symmetric matrix. This will produce a graph G ∈ Gs. In such a situation, the value of Ĝ is recovered by a general unipartite b-matching algorithm rather than a bipartite b-matching program. Thus, the set of possible output solutions is strictly smaller (the bipartite formulation relaxes the symmetric one).

Algorithm 2 variational symmetric b-matching
Input X ∈ Z^{n×d}, δi ∈ N for i ∈ {1, . . . , n}, ε > 0 and initialize W ∈ R^{n×d} to the all 1s matrix
While not converged {
  Set Ĝ = argmin_{G∈B^{n×n}} Σ_ij G_ij Σ_k W_ik 1[X_ik ≠ X_jk]  s.t.  Σ_j G_ij ≥ δi, G_ij = G_ji
  For all i and k set W_ik = exp( Σ_j Ĝ_ij 1[X_ik ≠ X_jk] ln(ε/(1+ε)) )
  For all i and k set Y_ik = ∗ if Ĝ_ij = 1 and X_jk ≠ X_ik for any j
}
Choose random permutation M as matrix M ∈ B^{n×n} and output Y_public = MY

Theorem 1 For δi ≤ δ, iteration #1 of algorithm 1 finds Ĝ such that s(Ĝ) ≤ δ min_{G∈Gb} s(G).
Theorem 2 Each iteration of algorithm 1 monotonically decreases s(Ĝ).
Theorems 1 and 2 apply to both algorithms. Both algorithms⁴ manipulate a bipartite regular graph G(A, B) containing the true matching {(a1, b1), . . . , (an, bn)}. However, they ultimately release the data-set Y_public after randomly shuffling Y according to some matching or permutation M which hides the true matching. The random permutation or matching M can be represented as a matrix M ∈ B^{n×n} or as a function σ : {1, . . . , n} → {1, . . . , n}. We now discuss how an adversary can attack privacy by recovering this matching or parts of it.
4 Privacy guarantees
We now characterize the anonymity provided by a compatibility graph G ∈ Gb (or G ∈ Gs) under several attack models. The goal of the adversary is to correctly match people to as many records as possible. In other words, the adversary wishes to find the random matching M used in the algorithms (or parts of M) to connect the entries of X to the entries of Y_public (assuming the adversary has stolen X and Y_public or portions of them). More precisely, we have a bipartite graph G(A, B) with color classes A, B, each of size n. Class A corresponds to n usernames and class B to n keys. Each username in A is matched to its key in B through some unknown matching M.
We consider the model where the graph G(A, B) is δ-regular, where δ ∈ N is a parameter chosen by the publisher. 
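This attack model is easy to simulate. The sketch below (plain Python; `one_time_attack_success` is a hypothetical helper, not part of the paper) represents G by adjacency lists over key indices, lets an adversary pick a victim uniformly and guess a key uniformly among the victim's compatible keys, and estimates the success rate, which for a δ-regular graph containing the true matching approaches 1/δ:

```python
import random

def one_time_attack_success(neighbors, sigma, trials=20000, seed=0):
    """neighbors[i] lists the keys compatible with username i; sigma[i] is
    the true (hidden) key of username i.  Returns the empirical rate at
    which a uniform one-time guess recovers the true key."""
    rng = random.Random(seed)
    n = len(neighbors)
    hits = 0
    for _ in range(trials):
        i = rng.randrange(n)                      # victim chosen uniformly
        hits += rng.choice(neighbors[i]) == sigma[i]
    return hits / trials

# A 2-regular compatibility graph on 4 users (an 8-cycle); the true
# matching sigma is the identity, and each true edge is one of 2 options.
neighbors = [[0, 1], [1, 2], [2, 3], [3, 0]]
rate = one_time_attack_success(neighbors, sigma=list(range(4)))
```

With δ = 2 the estimated rate hovers near 1/2, while a degree-1 node would be attacked with certainty, which is exactly why a minimum degree of δ is required.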
This publisher-chosen parameter is especially important if we are interested in guaranteeing different levels of privacy for different users and allowing δ to vary with the user's index i.
Sometimes it is the case that the adversary has some additional information and at the very beginning knows some complete records that belong to some people. In graph-theoretic terms, the adversary thus knows parts of the hidden matching M in advance. Alternatively, the adversary may have come across such additional information through a sustained attack where previous attempts revealed the presence or absence of an edge. We are interested in analyzing how this extra knowledge can help him further reveal other edges of the matching. We aim to show that, for some range of the parameters of the bipartite graphs, this additional knowledge does not help him much. We will compare the resilience to attack relative to the resilience of k-anonymity. We say that a person v is k-anonymous if his or her real data record can be confused with at least k − 1 records from different people. We first discuss the case of single attacks and then discuss sustained attacks.
4.1 One-Time Attack Guarantees
Assume first that the adversary has no extra information about the matching and performs a one-time attack. Then, lemma 4.1 holds, which is a direct implication of lemma 4.2.
Lemma 4.1 If G(A, B) is an arbitrary δ-regular graph and the adversary does not know any edges of the matching he is looking for then every person is δ-anonymous.

⁴It is straightforward to put a different weight on certain suppressions over others if the utility of the data is not uniform for each entry or bit. This is done by using an n × d weight matrix in the optimization. It is also straightforward to handle missing data by allowing initial stars in X before anonymizing.

Lemma 4.2 Let G(A, B) be a δ-regular bipartite graph. 
Then for every edge e of G(A, B) there exists a perfect matching in G(A, B) that uses e.
The result does not assume any structure in the graph beyond its δ-regularity. Thus, for a single attack, b-matching anonymity (symmetric or asymmetric) is equivalent to k-anonymity when b = k.
Corollary 4.1 Assume the bipartite graph G(A, B) is either δ-regular, symmetric δ-regular or clique-bipartite and δ-regular. An adversary attacking G once succeeds with probability ≤ 1/δ.
4.2 Sustained Attack on k-Cliques
Now consider the situation of sustained attacks or attacks with prior information. Here, the adversary may know c ∈ N edges in M a priori by whatever means (previous attacks or through side information). We begin by analyzing the resilience of k-anonymity where G is a clique-structured graph. In the clique-bipartite graph, even if the adversary knows some edges of the matching (but not too many) then there still is hope of good anonymity for all people. The anonymity of every person decreases from δ to at least (δ − c). So, for example, if the adversary knows in advance δ/2 edges of the matching then we get the same type of anonymity for every person as for the model with two times smaller degree in which the adversary has no extra knowledge. So we will be able to show the following:
Lemma 4.3 If G(A, B) is a clique-bipartite δ-regular graph and the adversary knows in advance c edges of the matching then every person is (δ − c)-anonymous.
The above is simply a consequence of the following lemma.
Lemma 4.4 Assume that G(A, B) is a clique-bipartite δ-regular graph. Denote by M some perfect matching in G(A, B). Let C be some subset of the edges of M and let c = |C|. Fix some vertex v ∈ A not matched in C. 
Then there are at least (δ − c) edges adjacent to v such that, for each of these edges e, there exists some perfect matching M_e in G(A, B) that uses both e and C.
Corollary 4.2 Assume graph G(A, B) is clique-bipartite and δ-regular. Assume that the adversary knows in advance c edges of the matching. The adversary selects uniformly at random a vertex the privacy of which he wants to break from the set of vertices he does not know in advance. Then he succeeds with probability at most 1/(δ − c).
We next show that b-matchings achieve comparable resilience under sustained attack.
4.3 Sustained attack on asymmetric bipartite b-matching
We now consider the case where we do not have a graph G(A, B) which is clique-bipartite but rather is only δ-regular and potentially asymmetric (as returned by algorithm 1).
Theorem 4.1 Let G(A, B) be a δ-regular bipartite graph with color classes A and B. Assume that |A| = |B| = n. Denote by M some perfect matching in G(A, B). Let C be some subset of the edges of M and let c = |C|. Take some ξ ≥ c. Denote n′ = n − c. Fix any function φ : N → R satisfying ∀δ : ξ√(2δ) + 1/4 < φ(δ) < δ. Then for all but at most

  η = ( 2cδ²n′ / ( ξ φ³(δ) (1 + √(1 − 2ξ²δ/φ²(δ))) ) ) · ( 1/(ξ − c) + cδ/φ(δ) + 1/(2ξδ) )

vertices v ∈ A not matched in C, the following holds: the size of the set of edges e adjacent to v and having the additional property that there exists some perfect matching M_v in G(A, B) that uses e and edges from C is at least (δ − c − φ(δ)).
Essentially, theorem 4.1 says that, if the adversary knows in advance c edges of the matching, all but at most a small number η of people are (δ − c − φ(δ))-anonymous for every φ satisfying c√(2δ) + 1/4 < φ(δ) < δ. For example, take φ(δ) := θδ for θ ∈ (0, 1). Fix ξ = c and assume that the adversary knows in advance at most δ^{1/4} edges of the matching. Then, using the formula from theorem 4.1, we obtain that (for n large enough) all but at most 4n′/(δ^{1/4}θ³) of the people that the adversary does not know in advance are ((1 − θ)δ − δ^{1/4})-anonymous. So if δ is large enough then all but approximately a small fraction 4/(δ^{1/4}θ³) of all people not known in advance are almost (1 − θ)δ-anonymous.
Again take φ(δ) := θδ where θ ∈ (0, 1). Take ξ = 2c. Next assume that 1 ≤ c ≤ min(δ^{1/4}, δ(1 − θ − θ²)). Assume that the adversary selects uniformly at random a person to attack. Our goal is to find an upper bound on the probability he succeeds. Then, using theorem 4.1, we can conclude that all but at most Fn′ people whose records are not known in advance are ((1 − θ)δ − c)-anonymous for F = 33c²/(θ²δ). The probability of success is at most F + (1 − F) · 1/((1 − θ)δ − c). Using the expression for F that we have and our assumptions, we can conclude that the probability we are looking for is at most 34c²/(θ²δ). Therefore we have:
Theorem 4.2 Assume graph G(A, B) is δ-regular and the adversary knows in advance c edges of the matching, where c satisfies 1 ≤ c ≤ min(δ^{1/4}, δ(1 − θ − θ²)). The adversary selects uniformly at random a vertex the privacy of which he wants to break from those that he does not know in advance. Then he succeeds with probability at most 34c²/(θ²δ).
4.4 Sustained attack on symmetric b-matching with adaptive anonymity
We now consider the case where the graph is not only δ-regular but also symmetric as defined in definition 2.2 and as recovered by algorithm 2. Furthermore, we consider the case where we have varying values of δi for each node since some users want higher privacy than others. It turns out that if the corresponding bipartite graph is symmetric (we define this term below) we can conclude that each user is (δi − c)-anonymous, where δi is the degree of the vertex associated with the user in the bipartite matching graph. So we get results completely analogous to those for the much simpler models described before. We will use a slightly more elaborate definition of symmetric⁵, however, since this graph has one of its partitions permuted by a random matching (the last step in both algorithms before releasing the data).
Definition 4.1 Let G(A, B) be a bipartite graph with color classes A, B and matching M = {(a1, b1), ..., (an, bn)}, where A = {a1, ..., an}, B = {b1, ..., bn}. We say that G(A, B) is symmetric with respect to M if the existence of an edge (ai, bj) in G(A, B) implies the existence of an edge (aj, bi), where 1 ≤ i, j ≤ n.
From now on, the matching M with respect to which G(A, B) is symmetric is a canonical matching of G(A, B). Assume that G(A, B) is symmetric with respect to its canonical matching M (it does not need to be a clique-bipartite graph). In such a case, we will prove that, if the adversary knows in advance c edges of the matching, then every person from the class A of degree δi is (δi − c)-anonymous. So we obtain the same type of anonymity as in a clique-bipartite graph (see: lemma 4.3).
Lemma 4.5 Assume that G(A, B) is a bipartite graph, symmetric with respect to its canonical matching M. 
Assume furthermore that the adversary knows in advance c edges of the matching. Then every person that he does not know in advance is (δi − c)-anonymous, where δi is the degree of the related vertex of the bipartite graph.
As a corollary, we obtain the same privacy guarantees in the symmetric case as in the k-cliques case.
Corollary 4.3 Assume bipartite graph G(A, B) is symmetric with respect to its canonical matching M. Assume that the adversary knows in advance c edges of the matching. The adversary selects uniformly at random a vertex the privacy of which he wants to break from the set of vertices he does not know in advance. Then he succeeds with probability at most 1/(δi − c), where δi is the degree of the vertex of the matching graph associated with the user.

⁵A symmetric graph G(A, B) may not remain symmetric according to definition 2.2 if nodes in B are shuffled by a permutation M. However, it will still be symmetric with respect to M according to definition 4.1.

In summary, the symmetric case is as resilient to sustained attack as the clique-bipartite case, the usual one underlying k-anonymity, if we set δi = δ = k everywhere. The adversary succeeds with probability at most 1/(δi − c). However, the asymmetric case is potentially weaker and the adversary can succeed with probability at most 34c²/(θ²δ). 
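These sustained-attack guarantees can be sanity-checked by brute force on small graphs: enumerate the perfect matchings consistent with the c edges the adversary already knows and count, for each remaining user, how many distinct keys stay possible. A minimal sketch in plain Python (exponential in n, so toy sizes only; `candidate_keys` is a hypothetical helper name):

```python
from itertools import permutations

def candidate_keys(G, known):
    """For each vertex of A not fixed by `known` (a dict i -> j of matching
    edges the adversary has), count the distinct partners it takes over all
    perfect matchings of G that contain those edges.  Brute force."""
    n = len(G)
    options = {i: set() for i in range(n) if i not in known}
    for perm in permutations(range(n)):  # perm[i] = key matched to username i
        if all(G[i][perm[i]] for i in range(n)) and \
           all(perm[i] == j for i, j in known.items()):
            for i in options:
                options[i].add(perm[i])
    return {i: len(s) for i, s in options.items()}

# A 3-regular bipartite graph, symmetric with respect to the identity
# matching (the adjacency matrix below is symmetric).
G = [[1, 1, 0, 1],
     [1, 1, 1, 0],
     [0, 1, 1, 1],
     [1, 0, 1, 1]]
counts = candidate_keys(G, known={0: 0})  # adversary knows one edge (c = 1)
```

Here every unknown user retains at least δi − c = 3 − 1 = 2 candidate keys, in line with lemma 4.5.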
Interestingly, in the symmetric case with variable δi degrees, however, we can provide guarantees that are just as good without forcing all individuals to agree on a common level of anonymity.

[Figure 3: eight panels plotting utility against anonymity for b-matching, b-symmetric and k-anonymity on each data-set.]

Figure 3: Utility (1 − #(∗)/nd) versus anonymity on (a) Bupa (n = 344, d = 7), Wine (n = 178, d = 14), Heart (n = 186, d = 23), Ecoli (n = 336, d = 8), Hepatitis (n = 154, d = 20) and Forest Fires (n = 517, d = 44) data-sets and (b) CalTech University Facebook (n = 768, d = 101) and Reed University Facebook (n = 962, d = 101) data-sets.

5 Experiments
We compared algorithms 1 and 2 against an agglomerative clustering competitor 
(optimized to minimize stars), which is known to outperform the forest method [10]. Agglomerative clustering starts with singleton clusters and keeps unifying the two closest clusters, i.e. those incurring the smallest increase in stars, until every cluster grows to a size of at least k. Both algorithms release data with suppressions to achieve a desired constant anonymity level δ. For our algorithms, we swept values of ε in {2^{-1}, 2^{-2}, . . . , 2^{-10}} from largest to smallest and chose the solution that produced the fewest stars. Furthermore, we warm-started the symmetric algorithm with the star-pattern solution of the asymmetric algorithm to make it converge more quickly. We first explored six standard data-sets from the UCI repository (http://archive.ics.uci.edu/ml/) in the uniform anonymity setting. Figure 3(a) summarizes the results, where utility is plotted against δ. Fewer stars imply greater utility and larger δ implies higher anonymity. We discretized each numerical dimension in a data-set into a binary attribute by thresholding at the median, and mapped categorical values in the data-sets into a binary code (potentially increasing the dimensionality). Algorithm 1 achieved significantly better utility for any given fixed constant anonymity level δ, while Algorithm 2 achieved a slight improvement.
We next explored Facebook social data experiments where each user has a different level of desired anonymity and has 7 discrete profile attributes which were binarized into d = 101 dimensions. We used the number of friends f_i a user has to compute their desired anonymity level (which decreases as the number of friends increases). We set F = max_{i=1,...,n} ⌈log f_i⌉ and, for each value of δ in the plot, we set user i's privacy level to δ_i = δ − (F − ⌈log f_i⌉). Figure 3(b) summarizes the results, where utility is plotted against δ.
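As a rough illustration (not the authors' implementation), the suppression mechanism, the utility measure from Figure 3, and the friend-count-based anonymity levels described above can be sketched in Python; the function names and the choice of base-2 logarithm are our assumptions:

```python
import math

def suppress(cluster):
    """k-anonymity-style suppression: within a cluster of records,
    replace every attribute on which the members disagree with '*'."""
    d = len(cluster[0])
    return [['*' if len({r[j] for r in cluster}) > 1 else r[j]
             for j in range(d)] for r in cluster]

def utility(released):
    """Utility = 1 - #(stars)/(n*d), as plotted in Figure 3."""
    n, d = len(released), len(released[0])
    stars = sum(1 for row in released for v in row if v == '*')
    return 1.0 - stars / (n * d)

def anonymity_levels(friend_counts, delta):
    """Per-user anonymity levels: users with more friends request less
    anonymity. F = max_i ceil(log f_i); delta_i = delta - (F - ceil(log f_i))."""
    logs = [math.ceil(math.log2(f)) for f in friend_counts]
    F = max(logs)
    return [delta - (F - l) for l in logs]
```

With this convention, the user with the most friends receives δ_i = δ and users with fewer friends receive correspondingly smaller levels.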
Since the k-anonymity agglomerative clustering method requires a constant δ for all users, we set k = max_i δ_i in order to have a privacy guarantee. Algorithms 1 and 2 consistently achieved significantly better utility in the adaptive anonymity setting while also achieving the desired level of privacy protection.
6 Discussion
We described the adaptive anonymity problem where data is obfuscated to respect each individual user's privacy settings. We proposed a relaxation of k-anonymity which is straightforward to implement algorithmically. It yields similar privacy protection while offering greater utility and the ability to handle heterogeneous anonymity levels for each user.

References
[1] G. Aggarwal, T. Feder, K. Kenthapadi, R. Motwani, R. Panigrahy, D. Thomas, and A. Zhu. Approximation algorithms for k-anonymity. Journal of Privacy Technology, 2005.
[2] M. Allman and V. Paxson. Issues and etiquette concerning use of shared measurement data. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, 2007.
[3] M. Bugliesi, B. Preneel, V. Sassone, I. Wegener, and C. Dwork. Lecture Notes in Computer Science - Automata, Languages and Programming, chapter Differential Privacy. Springer Berlin / Heidelberg, 2006.
[4] K. Chaudhuri, C. Monteleoni, and A.D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, (12):1069–1109, 2011.
[5] G. Cormode, D. Srivastava, S. Bhagat, and B. Krishnamurthy. Class-based graph anonymization for social network data. In PVLDB, volume 2, pages 766–777, 2009.
[6] G. Cormode, D. Srivastava, T. Yu, and Q. Zhang. Anonymizing bipartite graph data using safe groupings. VLDB J., 19(1):115–139, 2010.
[7] R. Duan and S. Pettie. Approximating maximum weight matching in near-linear time. In Proceedings of the 51st Symposium on Foundations of Computer Science, 2010.
[8] J. Edmonds.
Paths, trees and flowers. Canadian Journal of Mathematics, 17, 1965.
[9] H.N. Gabow. An efficient reduction technique for degree-constrained subgraph and bidirected network flow problems. In Proceedings of the Fifteenth Annual ACM Symposium on Theory of Computing, 1983.
[10] A. Gionis, A. Mazza, and T. Tassa. k-anonymization revisited. In ICDE, 2008.
[11] B. Huang and T. Jebara. Fast b-matching via sufficient selection belief propagation. In Artificial Intelligence and Statistics, 2011.
[12] M.I. Jordan, Z. Ghahramani, T. Jaakkola, and L.K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
[13] V.N. Kolmogorov. Blossom V: A new implementation of a minimum cost perfect matching algorithm. Mathematical Programming Computation, 1(1):43–67, 2009.
[14] N. Li, T. Li, and S. Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. In ICDE, 2007.
[15] S. Lodha and D. Thomas. Probabilistic anonymity. In PinKDD, 2007.
[16] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data (TKDD), 1, 2007.
[17] A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In PODS, 2004.
[18] P. Samarati and L. Sweeney. Generalizing data to provide anonymity when disclosing information. In PODS, 1998.
[19] L. Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5):571–588, 2002.
[20] Y. Tao and X. Xiao. Personalized privacy preservation. In SIGMOD Conference, 2006.
[21] Y. Tao and X. Xiao. Personalized privacy preservation. In Privacy-Preserving Data Mining, 2008.
[22] O. Williams and F. McSherry. Probabilistic inference and differential privacy. In NIPS, 2010.
[23] M. Xue, P. Karras, C. Rassi, J. Vaidya, and K.-L. Tan. Anonymizing set-valued data by nonreciprocal recoding. In KDD, 2012.
[24] E. Zheleva and L. Getoor. Preserving the privacy of sensitive relationships in graph data. In KDD, 2007.