{"title": "Embed and Project: Discrete Sampling with Universal Hashing", "book": "Advances in Neural Information Processing Systems", "page_first": 2085, "page_last": 2093, "abstract": "We consider the problem of sampling from a probability distribution defined over a high-dimensional discrete set, specified for instance by a graphical model. We propose a sampling algorithm, called PAWS, based on embedding the set into a higher-dimensional space which is then randomly projected using universal hash functions to a lower-dimensional subspace and explored using combinatorial search methods. Our scheme can leverage fast combinatorial optimization tools as a blackbox and, unlike MCMC methods, samples produced are guaranteed to be within an (arbitrarily small) constant factor of the true probability distribution. We demonstrate that by using state-of-the-art combinatorial search tools, PAWS can efficiently sample from Ising grids with strong interactions and from software verification instances, while MCMC and variational methods fail in both cases.", "full_text": "Embed and Project:\n\nDiscrete Sampling with Universal Hashing\n\nStefano Ermon, Carla P. Gomes\n\nDept. of Computer Science\n\nCornell University\n\nIthaca NY 14853, U.S.A.\n\nAshish Sabharwal\n\nBart Selman\n\nIBM Watson Research Ctr.\n\nDept. of Computer Science\n\nYorktown Heights\nNY 10598, U.S.A.\n\nCornell University\n\nIthaca NY 14853, U.S.A.\n\nAbstract\n\nWe consider the problem of sampling from a probability distribution de\ufb01ned over\na high-dimensional discrete set, speci\ufb01ed for instance by a graphical model. We\npropose a sampling algorithm, called PAWS, based on embedding the set into\na higher-dimensional space which is then randomly projected using universal\nhash functions to a lower-dimensional subspace and explored using combinatorial\nsearch methods. 
Our scheme can leverage fast combinatorial optimization tools as a blackbox and, unlike MCMC methods, samples produced are guaranteed to be within an (arbitrarily small) constant factor of the true probability distribution. We demonstrate that by using state-of-the-art combinatorial search tools, PAWS can efficiently sample from Ising grids with strong interactions and from software verification instances, while MCMC and variational methods fail in both cases.

1 Introduction

Sampling techniques are one of the most widely used approaches to approximate probabilistic reasoning for high-dimensional probability distributions where exact inference is intractable. In fact, many statistics of interest can be estimated from sample averages based on a sufficiently large number of samples. Since this can be used to approximate #P-complete inference problems, sampling is also believed to be computationally hard in the worst case [1, 2].

Sampling from a succinctly specified combinatorial space is believed to be much harder than searching the space. Intuitively, not only do we need to be able to find areas of interest (e.g., modes of the underlying distribution) but also to balance their relative importance. Typically, this is achieved using Markov Chain Monte Carlo (MCMC) methods. MCMC techniques are a specialized form of local search that only allows moves that maintain detailed balance, thus guaranteeing the right occupation probability once the chain has mixed. However, in the context of hard combinatorial spaces with complex internal structure, mixing times are often exponential. An alternative is to use complete or systematic search techniques such as Branch and Bound for integer programming, DPLL for satisfiability testing, and constraint and answer-set programming (CP & ASP), which are preferred in many application areas, and have witnessed tremendous success in the past few decades.
It is therefore a natural question whether one can construct sampling techniques based on these more powerful complete search methods rather than local search.

Prior work in cryptography by Bellare et al. [3] showed that it is possible to uniformly sample witnesses of an NP language leveraging universal hash functions and using only a small number of queries to an NP-oracle. This is significant because samples can be used to approximate #P-complete (counting) problems [2], a complexity class believed to be much harder than NP. Practical algorithms based on these ideas were later developed [4–6] to near-uniformly sample solutions of propositional satisfiability instances, using a SAT solver as an NP-oracle. However, unlike SAT, most models used in Machine Learning, physics, and statistics are weighted (represented, e.g., as graphical models) and cannot be handled using these techniques.

We fill this gap by extending this approach, based on hashing-based projections and NP-oracle queries, to the weighted sampling case. Our algorithm, called PAWS, uses a form of approximation by quantization [7] and an embedding technique inspired by slice sampling [8], before applying projections. This parallels recent work [9] that extended similar ideas from unweighted counting to the weighted counting world, addressing the problem of discrete integration. Although in theory one could use that technique to produce samples by estimating ratios of discrete integrals [1, 2], the general sampling-by-counting reduction requires a large number of such estimates (proportional to the number of variables) for each sample. Further, the accuracy guarantees on the sampling probability quickly become loose when taking ratios of estimates.
In contrast, PAWS is a more direct and practical sampling approach, providing better accuracy guarantees while requiring a much smaller number of NP-oracle queries per sample.

Answering NP-oracle queries, of course, requires exponential time in the worst case, in accordance with the hardness of sampling. We rely on the fact, however, that combinatorial search tools are often extremely fast in practice, and any complete solver can be used as a black box in our sampling scheme. Another key advantage is that when combinatorial search succeeds, our analysis provides a certificate that, with high probability, any samples produced will be distributed within an (arbitrarily small) constant factor of the desired probability distribution. In contrast, with MCMC methods it is generally hard to assess whether the chain has mixed. We empirically demonstrate that PAWS outperforms MCMC as well as variational methods on hard synthetic Ising Models and on a real-world test case generation problem for software verification.

2 Setup and Problem Definition

We are given a probability distribution p over a (high-dimensional) discrete set X, where the probability of each item x ∈ X is proportional to a weight function w : X → R+, with R+ being the set of non-negative real numbers. Specifically, given x ∈ X, its probability p(x) is given by

p(x) = w(x)/Z,    Z = Σ_{x∈X} w(x)

where Z is a normalization constant known as the partition function. We assume w is specified compactly, e.g., as the product of factors or in a conjunctive normal form. As our driving example, we consider the case of undirected discrete graphical models [10] with n = |V| random variables {x_i, i ∈ V} where each x_i takes values in a finite set X_i.
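For intuition, the quantities just defined can be computed exactly by brute-force enumeration when n is tiny; a minimal Python sketch (the toy weight function and helper names are our own illustration, not from the paper):

```python
import itertools
import random

def partition_function(w, n):
    """Z = sum of w(x) over all x in {0,1}^n; tractable only for small n."""
    return sum(w(x) for x in itertools.product((0, 1), repeat=n))

def exact_sample(w, n, rng=random):
    """Draw x with probability p(x) = w(x) / Z by inverse-CDF enumeration."""
    u = rng.random() * partition_function(w, n)
    acc = 0.0
    for x in itertools.product((0, 1), repeat=n):
        acc += w(x)
        if u <= acc:
            return x
    return x  # guard against floating-point round-off

# Toy weight function favoring configurations with many ones:
# Z = sum_k C(3, k) * 2^k = (1 + 2)^3 = 27.
w = lambda x: 2.0 ** sum(x)
```

PAWS targets exactly this distribution, but replaces the exponential enumeration with hashed combinatorial search.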
We consider a factor graph representation for a joint probability distribution over elements (or configurations) x ∈ X = X_1 × ··· × X_n:

p(x) = w(x)/Z = (1/Z) Π_{α∈I} ψ_α({x}_α).    (1)

This is a compact representation for p(x) based on the weight function w(x) = Π_{α∈I} ψ_α({x}_α), defined as the product of potentials or factors ψ_α : {x}_α ↦ R+, where I is an index set and {x}_α ⊆ V is the subset of variables factor ψ_α depends on. For simplicity of exposition, without loss of generality, we will focus on the case of binary variables, where X = {0,1}^n.

We consider the fundamental problem of (approximately) sampling from p(x), i.e., designing a randomized algorithm that takes w as input and outputs elements x ∈ X according to the probability distribution p. This is a hard computational problem in the worst case. In fact, it is more general than NP-complete decision problems (e.g., sampling solutions of a satisfiability instance specified as a factor graph entails finding at least one solution, or deciding there is none). Further, samples can be used to approximate #P-complete problems [2], such as estimating a marginal probability.

3 Sampling by Embed, Project, and Search

Conceptually, our sampling strategy has three steps, described in Sections 3.1, 3.2, and 3.3, respectively. (1) From the input distribution p we construct a new distribution p′ that is "close" to p but more discrete. Specifically, p′ is based on a new weight function w′ that takes values only in a discrete set of geometrically increasing weights. (2) From p′, we define a uniform probability distribution p′′ over a carefully constructed higher-dimensional embedding of X = {0,1}^n.
The previous discretization step allows us to specify p′′ in a compact form, and sampling from p′′ can be seen to be precisely equivalent to sampling from p′. (3) Finally, we indirectly sample from the desired distribution p by sampling uniformly from p′′, by randomly projecting the embedding onto a lower-dimensional subspace using universal hash functions and then searching for feasible states.

The first and third steps involve a bounded loss of accuracy, which we can trade off with computational efficiency by setting hyper-parameters of the algorithm. A key advantage is that our technique reduces the weighted sampling problem to that of solving one MAP query (i.e., finding the most likely state) and a polynomial number of feasibility queries (i.e., finding any state with non-zero probability) for the original graphical model augmented (through an embedding) with additional variables and carefully constructed factors. In practice, we use a combinatorial optimization package, which requires exponential time in the worst case (consistent with the hardness of sampling) but is often fast in practice. Our analysis shows that whenever the underlying combinatorial search and optimization queries succeed, the samples produced are guaranteed, with high probability, to be coming from an approximately accurate distribution.

3.1 Weight Discretization

We use a geometric discretization of the weights into "buckets", i.e., a uniform discretization of the log-probability. As we will see, Θ(n) buckets are sufficient to preserve accuracy.

Definition 1. Let M = max_x w(x), r > 1, ε > 0, and ℓ = ⌈log_r(2^n/ε)⌉. Partition the configurations into the following weight-based disjoint buckets: B_i = {x | w(x) ∈ (M/r^{i+1}, M/r^i]}, i = 0, …, ℓ−1, and B_ℓ = {x | w(x) ∈ [0, M/r^ℓ]}. The discretized weight function w′ : {0,1}^n → R+ is defined as follows: w′(x) = M/r^{i+1} if x ∈ B_i for i < ℓ, and w′(x) = 0 if x ∈ B_ℓ. The corresponding discretized probability distribution is p′(x) = w′(x)/Z′, where Z′ is the normalization constant.

Lemma 1. Let ρ = r²/(1−ε). For all x ∈ ∪_{i=0}^{ℓ−1} B_i, p(x) and p′(x) are within a factor of ρ of each other. Furthermore, Σ_{x∈B_ℓ} p(x) ≤ ε.

Proof. Since w maps to non-negative values, we have Z ≥ M. Further,

Σ_{x∈B_ℓ} p(x) = (1/Z) Σ_{x∈B_ℓ} w(x) ≤ (1/Z) |B_ℓ| M/r^ℓ ≤ |B_ℓ| εM/(2^n Z) ≤ εM/Z ≤ ε.

This proves the second part of the claim. For the first part, note that by construction, Z′ ≤ Z and

Z′ = Σ_{i=0}^{ℓ} Σ_{x∈B_i} w′(x) ≥ Σ_{i=0}^{ℓ−1} Σ_{x∈B_i} (1/r) w(x) = (1/r) (Z − Σ_{x∈B_ℓ} w(x)) ≥ (1/r)(1−ε)Z.

Thus Z and Z′ are within a factor of r/(1−ε) of each other. For all x such that x ∉ B_ℓ, recalling that r > 1 > 1−ε and that w(x)/r ≤ w′(x) ≤ w(x), we have

(1/ρ) p(x) ≤ w(x)/(rZ) ≤ w(x)/(rZ′) ≤ w′(x)/Z′ = p′(x) ≤ w(x)/Z′ ≤ (r/(1−ε)) · w(x)/Z ≤ ρ p(x).

This finishes the proof that p(x) and p′(x) are within a factor of ρ of each other.

Remark 1.
If the weights w defined by the original graphical model are represented in finite precision (e.g., there are 2^64 possible weights in double precision floating point), for every b ≥ 1 there is a possibly large but finite value of ℓ (such that M/r^ℓ is smaller than the smallest representable weight) such that B_ℓ is empty and the discretization error ε is effectively zero.

3.2 Embed: From Weighted to Uniform Sampling

We now show how to reduce the problem of sampling from the discrete distribution p′ (weighted sampling) to the problem of uniformly sampling, without loss of accuracy, from a higher-dimensional discrete set into which X = {0,1}^n is embedded. This is inspired by slice sampling [8], and can be intuitively understood as its discrete counterpart where we uniformly sample points (x, y) from a discrete representation of the area under the (y vs. x) probability density function of p′.

Definition 2. Let w : X → R+, M = max_x w(x), and r = 2^b/(2^b − 1). Then the embedding S(w, ℓ, b) of X in X × {0,1}^{(ℓ−1)b} is defined as:

S(w, ℓ, b) = {(x, y_1^1, …, y_1^b, …, y_{ℓ−1}^1, …, y_{ℓ−1}^b) | w(x) ≤ M/r^i ⇒ ∨_{k=1}^b y_i^k, 1 ≤ i ≤ ℓ−1; w(x) > M/r^ℓ},

where ∨_{k=1}^b y_i^k may alternatively be thought of as the linear constraint Σ_{k=1}^b y_i^k ≥ 1. Further, let p′′ denote a uniform probability distribution over S(w, ℓ, b) and n′ = n + (ℓ−1)b.

Given a compact representation of w within a combinatorial search or optimization framework, the set S(w, ℓ, b) can often be easily encoded using the disjunctive constraints on the y variables.

Lemma 2. Let (x, y) = (x, y_1^1, y_1^2, …, y_1^b, y_2^1, …, y_{ℓ−1}^b) be a sample from p′′, i.e., a uniformly sampled element from S(w, ℓ, b). Then x is distributed according to p′.

Informally, given x ∈ B_i and x′ ∈ B_{i+1} with i+1 ≤ ℓ−1, there are precisely r = 2^b/(2^b − 1) times more valid configurations (x, y) than (x′, y′). Thus x is sampled r times more often than x′. A formal proof may be found in the Appendix.

3.3 Project and Search: Uniform Sampling with Hash Functions and an NP-oracle

In principle, using the technique of Bellare et al. [3] and n′-wise independent hash functions we can sample purely uniformly from S(w, ℓ, b) using an NP oracle to answer feasibility queries. However, such hash functions involve constructions that are difficult to implement and reason about in existing combinatorial search methods. Instead, we use a more practical algorithm based on pairwise independent hash functions that can be implemented using parity constraints (modular arithmetic) and still provides accuracy guarantees. The approach is similar to [5], but we include an algorithmic way to estimate the number of parity constraints to be used. We also use the pivot technique from [6] but extend that work in two ways: we introduce a parameter α (similar to [5]) that allows us to trade off uniformity against runtime, and we also provide upper bounds on the sampling probabilities.

We refer to our algorithm as PArity-basedWeightedSampler (PAWS) and provide its pseudocode as Algorithm 1. The idea is to project by randomly constraining the configuration space using a family of universal hash functions, search for up to P "surviving" configurations, and then, if fewer than P survive, perform rejection sampling to choose one of them.
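At a scale where exhaustive enumeration is still possible, this project-and-search step can be sketched directly; here a brute-force list stands in for the NP-oracle, and the function and variable names are our own illustration, not the paper's implementation:

```python
import itertools
import random

def random_parity_hash(k, nbits, rng):
    """h_{A,c}(z) = Az + c mod 2 with uniformly random A in {0,1}^{k x nbits}, c in {0,1}^k."""
    A = [[rng.randrange(2) for _ in range(nbits)] for _ in range(k)]
    c = [rng.randrange(2) for _ in range(k)]
    return lambda z: tuple(
        (sum(a * zi for a, zi in zip(row, z)) + ci) % 2 for row, ci in zip(A, c)
    )

def project_and_search(S, k, P, rng):
    """Keep elements hashed to the all-zero bucket; if fewer than P survive,
    rejection-sample one of them, otherwise report failure (None)."""
    h = random_parity_hash(k, len(S[0]), rng)
    surviving = [z for z in S if all(b == 0 for b in h(z))]
    if 0 < len(surviving) < P:
        p = rng.randrange(P)       # uniform pivot for rejection sampling
        if p < len(surviving):
            return surviving[p]
    return None                    # failure: rerun with a fresh A, c

# Toy run: 3-bit configurations, k = 2 parity constraints, pivot P = 4;
# on average |S| * 2^-k = 2 configurations survive the projection.
S = list(itertools.product((0, 1), repeat=3))
out = project_and_search(S, 2, 4, random.Random(1))
```

In PAWS itself, the survivors are of course not enumerated from a list but found by a combinatorial solver on the original model augmented with the parity constraints.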
The number k of constraints or factors (encoding a randomly chosen hash function) to add is determined first; this is where we depart from both Gomes et al. [5], who do not provide a way to compute k, and Chakraborty et al. [6], who do not fix k or provide upper bounds. Then we repeatedly add k such constraints, check whether fewer than P configurations survive, and if so output one configuration chosen using rejection sampling. Intuitively, we need the hashed space to contain no more than P solutions because that is a base case where we know how to produce uniform samples via enumeration. k is a guess (accurate with high probability) of the number of constraints that is likely to reduce (by hashing) the original problem to a situation where enumeration is feasible. If too many or too few configurations survive, the algorithm fails and is run again. The small failure probability, accounting for a potentially poor choice of random hash functions, can be bounded irrespective of the underlying graphical model.

A combinatorial optimization procedure is used once in order to determine the maximum weight M through MAP inference. M is used in the discretization step. Subsequently, several feasibility queries are issued to the underlying combinatorial search procedure in order to, e.g., count the number of surviving configurations and produce one as a sample.

We briefly review the construction and properties of universal hash functions [11, 12].

Definition 3. H = {h : {0,1}^n → {0,1}^m} is a family of pairwise independent hash functions if the following two conditions hold when a function H is chosen uniformly at random from H: 1) ∀x ∈ {0,1}^n, the random variable H(x) is uniformly distributed in {0,1}^m; 2) ∀x_1, x_2 ∈ {0,1}^n, x_1 ≠ x_2, the random variables H(x_1) and H(x_2) are independent.

Proposition 1. Let A ∈ {0,1}^{m×n}, c ∈ {0,1}^m.
The family H = {h_{A,c}(x) : {0,1}^n → {0,1}^m}, where h_{A,c}(x) = Ax + c mod 2, is a family of pairwise independent hash functions. Further, H is also known to be a family of three-wise independent hash functions [5].

Algorithm 1 Algorithm PAWS for sampling configurations σ according to w

1: procedure COMPUTEK(n′, δ, P, S)
2:   T ← 24 ⌈ln(n′/δ)⌉; k ← −1; count ← 0
3:   repeat
4:     k ← k + 1; count ← 0
5:     for t = 1, …, T do
6:       Sample hash function h^k_{A,c} : {0,1}^{n′} → {0,1}^k
7:       Let S^{k,t} ≜ {(x, y) ∈ S, h^k_{A,c}(x, y) = 0}   /* search for ≥ P different elements */
8:       if |S^{k,t}| < P then
9:         count ← count + 1
10:    end for
11:  until count ≥ ⌈T/2⌉ or k = n′
12:  return k
13: end procedure

14: procedure PAWS(w : {0,1}^n → R+, ℓ, b, δ, P, α)
15:   M ← max_x w(x)   /* compute with one MAP inference query on w */
16:   S ← S(w, ℓ, b); n′ ← n + b(ℓ − 1)   /* as in Definition 2 */
17:   i ← COMPUTEK(n′, δ, P, S) + α
18:   Sample hash fn. h^i_{A,c} : {0,1}^{n′} → {0,1}^i, i.e., uniformly choose A ∈ {0,1}^{i×n′}, c ∈ {0,1}^i
19:   Let S^i ≜ {(x, y) ∈ S, h^i_{A,c}(x, y) = 0}
20:   Check if |S^i| ≥ P by searching for at least P different elements
21:   if |S^i| ≥ P or |S^i| = 0 then
22:     return ⊥   /* failure */
23:   else
24:     Fix an arbitrary ordering of S^i
25:     Uniformly sample p from {0, 1, …, P − 1}   /* for rejection sampling */
26:     if p ≤ |S^i| then
27:       Select the p-th element (x, y) of S^i; return x
28:     else
29:       return ⊥   /* failure */
30: end procedure

Lemma 3 (see Appendix for a proof) shows that the subroutine COMPUTEK in Algorithm 1 outputs with high probability a value close to log(|S(w, ℓ, b)|/P). The idea is similar to an unweighted version of the WISH algorithm [9] but with tighter guarantees and using more feasibility queries.

Lemma 3. Let S = S(w, ℓ, b) ⊆ {0,1}^{n′}, δ > 0, and γ > 0. Further, let P ≥ min{2, 2^{γ+2}/(2^γ − 1)²}, Z = |S|, k*_P = log(Z/P), and let k be the output of procedure COMPUTEK(n′, δ, P, S). Then, P[k*_P − γ ≤ k ≤ k*_P + 1 + γ] ≥ 1 − δ and COMPUTEK uses O(n′ ln(n′/δ)) feasibility queries.

Lemma 4. Let S = S(w, ℓ, b) ⊆ {0,1}^{n′}, δ > 0, P ≥ 2, and γ = log((P + 2√(P+1) + 2)/P). For any α ∈ Z, α > γ, let c(α, P) = 1 − 2^{γ−α}/(1 − 1/P − 2^{γ−α})². Then with probability at least 1 − δ the following holds: PAWS(w, ℓ, b, δ, P, α) outputs a sample with probability at least c(α, P) 2^{−(γ+α+1)} P/(P − 1) and, conditioned on outputting a sample, every element (x, y) ∈ S(w, ℓ, b) is selected (Line 27) with probability p′_s(x, y) within a constant factor c(α, P) of the uniform probability p′′(x, y) = 1/|S|.

Proof Sketch. For lack of space, we defer details to the Appendix. Briefly, the probability P[σ ∈ S^i] that σ = (x, y) survives is 2^{−i} by the properties of the hash functions in Definition 3, and the probability of being selected by rejection sampling is 1/(P − 1).
Conditioned on σ surviving, the mean and variance of the size of the surviving set |S^i| are independent of σ because of 3-wise independence. When k*_P − γ ≤ k ≤ k*_P + 1 + γ and i = k + α, α > γ, on average |S^i| < P and the size is concentrated around the mean. Using Chebyshev's inequality, one can upper bound by 1 − c(α, P) the probability P[|S^i| ≥ P | σ ∈ S^i] that the algorithm fails because |S^i| is too large. Note that the bound is independent of σ and lets us bound the probability p_s(σ) that σ is output:

c(α, P) · 2^{−i}/(P − 1) = (1 − 2^{γ−α}/(1 − 1/P − 2^{γ−α})²) · 2^{−i}/(P − 1) ≤ p_s(σ) ≤ 2^{−i}/(P − 1).    (2)

From i = k + α ≤ k*_P + 1 + γ + α and summing the lower bound on p_s(σ) over all σ, we obtain the desired lower bound on the success probability. Note that given σ, σ′, p_s(σ) and p_s(σ′) are within a constant factor c(α, P) of each other from (2). Therefore, the probabilities p′_s(σ) (for various σ) that σ is output conditioned on outputting a sample are also within a constant factor of each other. From the normalization Σ_σ p′_s(σ) = 1, one gets the desired result that p′_s(x, y) is within a constant factor c(α, P) of the uniform probability p′′(x, y) = 1/|S|.

3.4 Main Results: Sampling with Accuracy Guarantees

Combining pieces from the previous three sections, we have the following main result:

Theorem 1. Let w : {0,1}^n → R+, ε > 0, b ≥ 1, δ > 0, and P ≥ 2. Fix α ∈ Z as in Lemma 4, r = 2^b/(2^b − 1), ℓ = ⌈log_r(2^n/ε)⌉, ρ = r²/(1 − ε), bucket B_ℓ as in Definition 1, and κ = 1/c(α, P). Then Σ_{x∈B_ℓ} p(x) ≤ ε and, with probability at least (1 − δ) c(α, P) 2^{−(γ+α+1)} P/(P − 1), PAWS(w, ℓ, b, δ, P, α) succeeds and outputs a sample σ from {0,1}^n \ B_ℓ. Upon success, each σ ∈ {0,1}^n \ B_ℓ is output with probability p′_s(σ) within a constant factor ρκ of the desired probability p(σ) ∝ w(σ).

Proof. Success probability follows from Lemma 4. For x ∈ {0,1}^n \ B_ℓ, combining Lemmas 1, 2, 4 we obtain

(1/(ρκ)) p(x) ≤ (1/κ) p′(x) = Σ_{y:(x,y)∈S(w,ℓ,b)} (1/κ) p′′(x, y) ≤ Σ_{y:(x,y)∈S(w,ℓ,b)} p′_s(x, y) = p′_s(x)
≤ Σ_{y:(x,y)∈S(w,ℓ,b)} κ p′′(x, y) = κ p′(x) ≤ ρκ p(x)

where the first inequality accounts for the discretization error from p(x) to p′(x) (Lemma 1), the equality follows from Lemma 2, and the sampling error between p′′ and p′_s is bounded by Lemma 4. The rest is proved in Lemmas 1, 2.

Remark 2. By appropriately setting the hyper-parameters b and ℓ we can make the discretization errors ρ and ε arbitrarily small. Although this does not change the number of required feasibility queries, it can significantly increase the runtime of combinatorial search because of the increased search space size |S(w, ℓ, b)|. Practically, one should set these parameters as large as possible, while ensuring combinatorial searches can be completed within the available time budget.
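The tradeoff in this remark is easy to quantify: a quick computation of r, ℓ, ρ, and the embedded dimension n′ = n + (ℓ−1)b from the formulas above (the helper name is our own):

```python
import math

def discretization_params(n, b, eps):
    """r = 2^b / (2^b - 1); l = ceil(log_r(2^n / eps)); rho = r^2 / (1 - eps);
    embedded dimension n' = n + (l - 1) * b."""
    r = 2 ** b / (2 ** b - 1)
    l = math.ceil(math.log(2 ** n / eps) / math.log(r))
    rho = r ** 2 / (1 - eps)
    return r, l, rho, n + (l - 1) * b

# For n = 10 variables and eps = 0.1:
# b = 1 gives r = 2 with only 14 buckets but a coarse distortion rho ~ 4.4;
# b = 4 gives rho ~ 1.26 at the cost of a much larger l (and search space n').
```

So increasing b drives the distortion ρ toward 1/(1−ε) while the number of buckets ℓ, and with it the embedded search space, grows roughly like 2^b.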
Increasing parameter P improves the accuracy as well, but also increases the number of feasibility queries issued, which is proportional to P (but does not affect the structure of the search space). Similarly, by increasing α we can make κ arbitrarily close to 1. However, the probability of success of the algorithm decreases exponentially as α is increased. We will demonstrate in the next section that a practical tradeoff between computational complexity and accuracy can be achieved for reasonably sized problems of interest.

Corollary 2. Let w, b, ε, ℓ, δ, P, α, and B_ℓ be as in Theorem 1, and let p′_s(σ) be the output distribution of PAWS(w, ℓ, b, δ, P, α). Let φ : {0,1}^n → R and η_φ = max_{x∈B_ℓ} |φ(x)| ≤ ||φ||_∞. Then,

(1/(ρκ)) E_{p′_s}[φ] − ε η_φ ≤ E_p[φ] ≤ ρκ E_{p′_s}[φ] + ε η_φ

where E_{p′_s}[φ] can be approximated with a sample average using samples produced by PAWS.

4 Experiments

We evaluate PAWS on synthetic Ising Models and on a real-world test case generation problem for software verification. All experiments used Intel Xeon 5670 3GHz machines with 48GB RAM.

4.1 Ising Grid Models

We first consider the marginal computation task for synthetic grid-structured Ising models with random interactions (attractive and mixed).
Specifically, the corresponding graphical model has n binary variables x_i, i = 1, …, n, with single-node potentials ψ_i(x_i) = exp(f_i x_i) and pairwise interactions ψ_ij(x_i, x_j) = exp(w_ij x_i x_j), where f_i ∈_R [−f, f] and w_ij ∈_R [−w, w] in the mixed case, while w_ij ∈_R [0, w] in the attractive case.

[Figure 1: Estimated marginals vs. true marginals on 8 × 8 Ising Grid models, for Gibbs, Belief Propagation, WISH, and PAWS with b ∈ {1, 2}. (a) Mixed (w = 4.0, f = 0.6); (b) Attractive (w = 3.0, f = 0.45). Closeness to the 45 degree line indicates accuracy. PAWS is run with b ∈ {1, 2}, P = 4, α = 1, and ℓ = 25 (mixed case) or ℓ = 40 (attractive case).]

Our implementation of PAWS uses the open source solver ToulBar2 [13] to compute M = max_x w(x) and as an oracle to check the existence of at least P different solutions. We augmented ToulBar2 with the IBM ILOG CPLEX CP Optimizer 12.3 [14] based on techniques borrowed from [15] to efficiently reason about parity constraints (the hash functions) using Gauss-Jordan elimination. We run the subroutine COMPUTEK in Algorithm 1 only once at the beginning, and then generate all the samples with the same value of i (Line 17). The comparison is with Gibbs sampling, Belief Propagation, and the recent WISH algorithm [9].
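These grid models are straightforward to generate; a sketch of the benchmark's weight function, with our own data layout (dictionaries keyed by grid coordinates, not the paper's code):

```python
import math
import random

def random_ising_grid(m, f, w, attractive, rng):
    """Fields f_i drawn from U[-f, f]; couplings w_ij from U[0, w] (attractive)
    or U[-w, w] (mixed), on an m x m grid with nearest-neighbor edges."""
    field = {(i, j): rng.uniform(-f, f) for i in range(m) for j in range(m)}
    low = 0.0 if attractive else -w
    coup = {((i, j), (i + di, j + dj)): rng.uniform(low, w)
            for i in range(m) for j in range(m)
            for di, dj in ((1, 0), (0, 1))
            if i + di < m and j + dj < m}
    return field, coup

def ising_weight(x, field, coup):
    """w(x) = prod_i exp(f_i x_i) * prod_ij exp(w_ij x_i x_j), with x_v in {0, 1}."""
    energy = sum(fv * x[v] for v, fv in field.items())
    energy += sum(wuv * x[u] * x[v] for (u, v), wuv in coup.items())
    return math.exp(energy)
```

For the all-zeros configuration every term in the exponent vanishes, so w(0) = 1 regardless of the random draw; large |w_ij| values are what make these instances hard for local-search samplers.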
Ground truth is obtained using the Junction Tree method [16].

In Figure 1, we show a scatter plot of the estimated vs. true marginal probabilities for two Ising grids with mixed and attractive interactions, respectively, representative of the general behavior in the large-weights regime. Each sampling method is run for 10 minutes. Marginals computed with Gibbs sampling (run for about 10^8 iterations) are clearly very inaccurate (far from the 45 degree line), an indication that the Markov Chain had not mixed as an effect of the relatively large weights that tend to create barriers between modes which are hard to traverse. In contrast, samples from PAWS provide much more accurate marginals, in part because it does not rely on local search and hence is not directly affected by the energy landscape (with respect to the Hamming metric). Further, we see that we can improve the accuracy by increasing the hyper-parameter b. These results highlight the practical value of having accuracy guarantees on the quality of the samples after finite amounts of time vs. MCMC-style guarantees that hold only after a potentially exponential mixing time.

Belief Propagation can be seen from Figure 1 to be quite inaccurate in this large-weights regime. Finally, we also compare to the recent WISH algorithm [9], which uses similar hash-based techniques to estimate the partition function of graphical models. Since producing samples with the general sampling-by-counting reduction [1, 2] or estimating each marginal as the ratio of two partition functions (with and without a variable clamped) would be too expensive (requiring n + 1 calls to WISH), we heuristically run it once and use the solutions of the optimization instances it solves in the inner loop as samples.
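For reference, the Gibbs baseline in these comparisons amounts to repeatedly resampling each variable from its conditional; a minimal single-site sweep for a binary pairwise model (the field/coupling dictionary format is our own illustration):

```python
import math
import random

def gibbs_sweep(x, field, coup, rng):
    """One pass of single-site Gibbs updates. For potentials exp(f_v * x_v) and
    exp(w_uv * x_u * x_v), the conditional is
    p(x_v = 1 | rest) = sigmoid(f_v + sum over neighbors u of w_uv * x_u)."""
    for v in field:
        s = field[v]
        for (a, b), wab in coup.items():
            if a == v:
                s += wab * x[b]
            elif b == v:
                s += wab * x[a]
        x[v] = 1 if rng.random() < 1.0 / (1.0 + math.exp(-s)) else 0
    return x
```

With large |w_uv| these conditionals become nearly deterministic, which is exactly the barrier-to-mixing effect visible in the Gibbs scatter of Figure 1.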
We see in Figure 1 that while samples produced by WISH can sometimes produce fairly accurate marginal estimates, these estimates can also be far from the true value because of an inherent bias introduced by the arg max operator.

4.2 Test Case Generation for Software Verification

Hardware and software verification tools are becoming increasingly important in industrial system design. For example, IBM estimates $100 million savings over the past 10 years from hardware verification tools alone [17]. Given that complete formal verification is often infeasible, the paradigm of choice has become that of randomly generating "interesting" test cases to stress the code or chip with the hope of uncovering bugs. Typically, a model based on hard constraints is used to specify consistent input/output pairs, or valid program execution traces. In addition, in some systems, domain knowledge can be specified by experts in the form of soft constraints, for instance to introduce a preference for test cases where operands are zero and bugs are more likely [17].

Table 2a:

Instance   | Vars | Factors | Time (s) | MSE (×10⁻⁵)
bench1039  | 785  | 330     | 1710     | 5.76
bench431   | 410  | 173     | 34.97    | 4.35
bench115   | 458  | 189     | 52.75    | 20.74
bench97    | 401  | 170     | 67.03    | 45.57
bench590   | 527  | 244     | 593.71   | 8.11
bench105   | 524  | 243     | 842.35   | 8.56

[Figure 2: Experiments on the software verification benchmark. (a) Marginals: runtime and mean squared error; (b) true vs. observed sampling frequencies (theoretical frequency vs. sample frequency per solution).]

For our experiments, we focus on software (SW) verification, using an industrial benchmark [18] produced by Microsoft's SAGE system [19, 20].
Each instance defines a uniform probability distribution over certain valid traces of a computer program. We modify this benchmark by introducing soft constraints defining a weighted distribution over valid traces, indicating that traces meeting certain criteria should be sampled more often. Specifically, following Naveh et al. [17], we introduce a preference towards traces where certain registers are zero. The weight is chosen to be a power of two, so that there is no loss of accuracy due to discretization using the previous construction with b = 1. These instances are very difficult for MCMC methods because of the presence of very large regions of zero probability that cannot be traversed and thus can break the ergodicity assumption. Indeed, we observed that Gibbs sampling often fails to find a non-zero probability state, and when it finds one it gets stuck there, because there might not be a non-zero probability path from one feasible state to another. In contrast, our sampling strategy is not affected and does not require any ergodicity assumption. Table 2a summarizes the results obtained using the propositional satisfiability (SAT) solver CryptoMiniSAT [21] as the feasibility query oracle for PAWS. CryptoMiniSAT has built-in support for parity constraints Ax = c mod 2. We report the time to collect 1000 samples and the Mean Squared Error (MSE) of the marginals estimated using these samples. We report results only on the subset of instances where we could enumerate all feasible states using the exact model counter Relsat [22] in order to obtain ground truth marginals for MSE computation. We see that PAWS scales to fairly large instances with hundreds of variables and gives accurate estimates of the marginals. Figure 2b shows the theoretical vs.
observed sampling frequencies (based on 50000 samples) for a small instance with 810 feasible states (execution traces), where we see that the output distribution p′_s is indeed very close to the target distribution p.

5 Conclusions

We introduced a new approach, called PAWS, to the fundamental problem of sampling from a discrete probability distribution specified, up to a normalization constant, by a weight function, e.g., by a discrete graphical model. While traditional sampling methods are based on the MCMC paradigm and hence on some form of local search, PAWS can leverage more advanced combinatorial search and optimization tools as a black box. A significant advantage over MCMC methods is that PAWS comes with a strong accuracy guarantee: whenever combinatorial search succeeds, our analysis provides a certificate that, with high probability, the samples are produced from an approximately correct distribution. In contrast, accuracy guarantees for MCMC methods hold only in the limit, with unknown and potentially exponential mixing times. Further, the hyper-parameters of PAWS can be tuned to trade off runtime with accuracy. Our experiments demonstrate that PAWS outperforms competing sampling methods on challenging domains for MCMC.

References

[1] N.N. Madras. Lectures on Monte Carlo Methods. American Mathematical Society, 2002. ISBN 0821829785.

[2] M. Jerrum and A. Sinclair. The Markov chain Monte Carlo method: an approach to approximate counting and integration. Approximation Algorithms for NP-hard Problems, pages 482–520, 1997.

[3] Mihir Bellare, Oded Goldreich, and Erez Petrank. Uniform generation of NP-witnesses using an NP-oracle. Information and Computation, 163(2):510–526, 2000.

[4] Stefano Ermon, Carla P. Gomes, and Bart Selman. Uniform solution sampling using a constraint solver as an oracle. In UAI, pages 255–264, 2012.

[5] C.P. Gomes, A. Sabharwal, and B. Selman.
Near-uniform sampling of combinatorial spaces using XOR constraints. In NIPS-2006, pages 481–488, 2006.

[6] S. Chakraborty, K. Meel, and M. Vardi. A scalable and nearly uniform generator of SAT witnesses. In CAV-2013, 2013.

[7] Vibhav Gogate and Pedro Domingos. Approximation by quantization. In UAI, pages 247–255, 2011.

[8] Radford M. Neal. Slice sampling. Annals of Statistics, pages 705–741, 2003.

[9] Stefano Ermon, Carla Gomes, Ashish Sabharwal, and Bart Selman. Taming the curse of dimensionality: Discrete integration by hashing and optimization. In ICML, 2013.

[10] M.J. Wainwright and M.I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.

[11] S. Vadhan. Pseudorandomness. Foundations and Trends in Theoretical Computer Science, 2011.

[12] O. Goldreich. Randomized methods in computation. Lecture Notes, 2011.

[13] D. Allouche, S. de Givry, and T. Schiex. Toulbar2, an open source exact cost function network solver. Technical report, INRIA, 2010.

[14] IBM ILOG. IBM ILOG CPLEX Optimization Studio 12.3, 2011.

[15] Carla P. Gomes, Willem Jan van Hoeve, Ashish Sabharwal, and Bart Selman. Counting CSP solutions using generalized XOR constraints. In AAAI, 2007.

[16] Steffen L. Lauritzen and David J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B (Methodological), pages 157–224, 1988.

[17] Yehuda Naveh, Michal Rimon, Itai Jaeger, Yoav Katz, Michael Vinov, Eitan Marcus, and Gil Shurek. Constraint-based random stimuli generation for hardware verification. AI Magazine, 28(3):13, 2007.

[18] Clark Barrett, Aaron Stump, and Cesare Tinelli. The Satisfiability Modulo Theories Library (SMT-LIB).
www.SMT-LIB.org, 2010.

[19] Patrice Godefroid, Michael Y. Levin, David Molnar, et al. Automated whitebox fuzz testing. In NDSS, 2008.

[20] Patrice Godefroid, Michael Y. Levin, and David Molnar. SAGE: Whitebox fuzzing for security testing. Queue, 10(1):20:20–20:27, January 2012. ISSN 1542-7730.

[21] M. Soos, K. Nohl, and C. Castelluccia. Extending SAT solvers to cryptographic problems. In SAT-2009. Springer, 2009.

[22] Robert J. Bayardo and Joseph Daniel Pehoushek. Counting models using connected components. In AAAI-2000, pages 157–162, 2000.