{"title": "Sensor Selection in High-Dimensional Gaussian Trees with Nuisances", "book": "Advances in Neural Information Processing Systems", "page_first": 2211, "page_last": 2219, "abstract": "We consider the sensor selection problem on multivariate Gaussian distributions where only a \\emph{subset} of latent variables is of inferential interest.  For pairs of vertices connected by a unique path in the graph, we show that there exist decompositions of nonlocal mutual information into local information measures that can be computed efficiently from the output of message passing algorithms.  We integrate these decompositions into a computationally efficient greedy selector where the computational expense of quantification can be distributed across nodes in the network.  Experimental results demonstrate the comparative efficiency of our algorithms for sensor selection in high-dimensional distributions. We additionally derive an online-computable performance bound based on augmentations of the relevant latent variable set that, when such a valid augmentation exists, is applicable for \\emph{any} distribution with nuisances.", "full_text": "Sensor Selection in High-Dimensional\n\nGaussian Trees with Nuisances\n\nDaniel Levine\n\nMIT LIDS\n\ndlevine@mit.edu\n\nJonathan P. How\n\nMIT LIDS\n\njhow@mit.edu\n\nAbstract\n\nWe consider the sensor selection problem on multivariate Gaussian distributions\nwhere only a subset of latent variables is of inferential interest. For pairs of ver-\ntices connected by a unique path in the graph, we show that there exist decom-\npositions of nonlocal mutual information into local information measures that can\nbe computed ef\ufb01ciently from the output of message passing algorithms. We inte-\ngrate these decompositions into a computationally ef\ufb01cient greedy selector where\nthe computational expense of quanti\ufb01cation can be distributed across nodes in the\nnetwork. 
Experimental results demonstrate the comparative ef\ufb01ciency of our al-\ngorithms for sensor selection in high-dimensional distributions. We additionally\nderive an online-computable performance bound based on augmentations of the\nrelevant latent variable set that, when such a valid augmentation exists, is applica-\nble for any distribution with nuisances.\n\n1\n\nIntroduction\n\nThis paper addresses the problem of focused active inference: selecting a subset of observable ran-\ndom variables that is maximally informative with respect to a speci\ufb01ed subset of latent random\nvariables. The subset selection problem is motivated by the desire to reduce the overall cost of\ninference while providing greater inferential accuracy. For example, in the context of sensor net-\nworks, control of the data acquisition process can lead to lower energy expenses in terms of sensing,\ncomputation, and communication [1, 2].\nIn many inferential problems, the objective is to reduce uncertainty in only a subset of the unknown\nquantities, which are related to each other and to observations through a joint probability distribution\nthat includes auxiliary variables called nuisances. On their own, nuisances are not of any extrinsic\nimportance to the uncertainty reduction task and merely serve as intermediaries when describing\nstatistical relationships, as encoded with the joint distribution, between variables. The structure in\nthe joint can be represented parsimoniously with a probabilistic graphical model, often leading to\nef\ufb01cient inference algorithms [3, 4, 5]. However, marginalization of nuisance variables is potentially\nexpensive and can mar the very sparsity of the graphical model that permitted ef\ufb01cient inference.\nTherefore, we seek methods for selecting informative subsets of observations in graphical models\nthat retain nuisance variables.\nTwo primary issues arise from the inclusion of nuisance variables in the problem. 
Observation\nrandom variables and relevant latent variables may be nonadjacent in the graphical model due to\nthe interposition of nuisances between them, requiring the development of information measures\nthat extend beyond adjacency (alternatively, locality) in the graph. More generally, the absence of\ncertain conditional independencies, particularly between observations conditioned on the relevant\nlatent variable set, means that one cannot directly apply the performance bounds associated with\nsubmodularity [6, 7, 8].\n\n1\n\n\fIn an effort to pave the way for analyzing focused active inference on the class of general distribu-\ntions, this paper speci\ufb01cally examines multivariate Gaussian distributions \u2013 which exhibit a number\nof properties amenable to analysis \u2013 and later specializes to Gaussian trees. This paper presents\na decomposition of pairwise nonlocal mutual information (MI) measures on Gaussian graphs that\npermits ef\ufb01cient information valuation, e.g., to be used in a greedy selection. Both the valuation\nand subsequent selection may be distributed over nodes in the network, which can be of bene\ufb01t for\nhigh-dimensional distributions and/or large-scale distributed sensor networks. It is also shown how\nan augmentation to the relevant set can lead to an online-computable performance bound for general\ndistributions with nuisances.\nThe nonlocal MI decomposition extensively exploits properties of Gaussian distributions, Markov\nrandom \ufb01elds, and Gaussian belief propagation (GaBP), which are reviewed in Section 2. The formal\nproblem statement of focused active inference is stated in Section 3, along with an example that\ncontrasts focused and unfocused selection. Section 4 presents pairwise nonlocal MI decompositions\nfor scalar and vectoral Gaussian Markov random \ufb01elds. 
Section 5 shows how to integrate pairwise\nnonlocal MI into a distributed greedy selection algorithm for the focused active inference problem;\nthis algorithm is benchmarked in Section 6. A performance bound applicable to any focused selector\nis presented in Section 7.\n\n2 Preliminaries\n\n2.1 Markov Random Fields (MRFs)\nLet G = (V,E) be a Markov random \ufb01eld (MRF) with vertex set V and edge set E. Let u and\nv be vertices of the graph G. A u-v path is a \ufb01nite sequence of adjacent vertices, starting with\nvertex u and terminating at vertex v, that does not repeat any vertex. Let PG(u, v) denote the set\nof all paths between distinct u and v in G. If |PG(u, v)| > 0, then u and v are graph connected. If\n|PG(u, v)| = 1, then there is a unique path between u and v, and denote the sole element of PG(u, v)\nby \u00afPu:v.\nIf |PG(u, v)| = 1 for all u, v \u2208 V, then G is a tree. If |PG(u, v)| \u2264 1 for all u, v \u2208 V, then G is a\nforest, i.e., a disjoint union of trees. A chain is a simple tree with diameter equal to the number of\nnodes. A chain is said to be embedded in graph G if the nodes in the chain comprise a unique path\nin G.\nFor MRFs, the global Markov property relates connectivity in the graph to implied conditional\nindependencies. If D \u2286 V, then GD = (D,ED) is the subgraph induced by D, with ED = E \u2229\n(D \u00d7 D). For disjoint subsets A, B, C \u2282 V, let G\\B be the subgraph induced by V \\ B. The global\nMarkov property holds that xA \u22a5\u22a5 xC | xB iff |PG\\B (i, j)| = 0 for all i \u2208 A and j \u2208 C.\n\n2.2 Gaussian Distributions in Information Form\nConsider a random vector x distributed according to a multivariate Gaussian distribution N (\u00b5, \u039b)\nwith mean \u00b5 and (symmetric, positive de\ufb01nite) covariance \u039b > 0. 
One could equivalently consider\nthe information form x \u223c N \u22121(h, J) with precision matrix J = \u039b\u22121 > 0 and potential vector\nh = J\u00b5, for which $p_x(x) \\propto \\exp\\{-\\tfrac{1}{2} x^T J x + h^T x\\}$.\nOne can marginalize out or condition on a subset of random variables by considering a partition of\nx into two subvectors, x1 and x2, such that\n\n$$x = \\begin{pmatrix} x_1 \\\\ x_2 \\end{pmatrix} \\sim \\mathcal{N}^{-1}\\left( \\begin{pmatrix} h_1 \\\\ h_2 \\end{pmatrix}, \\begin{bmatrix} J_{11} & J_{12} \\\\ J_{12}^T & J_{22} \\end{bmatrix} \\right).$$\n\nIn the information form, the marginal distribution over x1 is $p_{x_1}(\\cdot) = \\mathcal{N}^{-1}(\\cdot; h'_1, J'_1)$, where\n$h'_1 = h_1 - J_{12} J_{22}^{-1} h_2$ and $J'_1 = J_{11} - J_{12} J_{22}^{-1} J_{12}^T$, the latter being the Schur complement of\nJ22. Conditioning on a particular realization x2 of the random subvector x2 induces the conditional\ndistribution $p_{x_1|x_2}(x_1|x_2) = \\mathcal{N}^{-1}(x_1; h'_{1|2}, J_{11})$, where $h'_{1|2} = h_1 - J_{12} x_2$, and J11 is exactly the\nupper-left block submatrix of J. (Note that the conditional precision matrix is independent of the\nvalue of the realized x2.)\n\nIf x \u223c N \u22121(h, J), where $h \\in \\mathbb{R}^n$ and $J \\in \\mathbb{R}^{n \\times n}$, then the (differential) entropy of x is [9]\n\n$$H(x) = -\\frac{1}{2} \\log\\left( (2\\pi e)^{-n} \\cdot \\det(J) \\right). \\qquad (1)$$\n\nLikewise, for nonempty A \u2286 {1, . . . , n}, and (possibly empty) B \u2286 {1, . . . , n} \\ A, let $J'_{A|B}$ be the\nprecision matrix parameterizing $p_{x_A|x_B}$. 
The conditional entropy of $x_A \\in \\mathbb{R}^d$ given $x_B$ is\n\n$$H(x_A|x_B) = -\\frac{1}{2} \\log\\left( (2\\pi e)^{-d} \\cdot \\det(J'_{A|B}) \\right). \\qquad (2)$$\n\nThe mutual information between $x_A$ and $x_B$ is\n\n$$I(x_A; x_B) = H(x_A) + H(x_B) - H(x_A, x_B) = \\frac{1}{2} \\log\\left( \\frac{\\det(J'_{\\{A,B\\}})}{\\det(J'_A) \\det(J'_B)} \\right), \\qquad (3)$$\n\nwhich generally requires O(n^3) operations to compute via Schur complement.\n\n2.3 Gaussian MRFs (GMRFs)\nIf x \u223c N \u22121(h, J), the conditional independence structure of px(\u00b7) can be represented with a Gaussian\nMRF (GMRF) G = (V,E), where E is determined by the sparsity pattern of J and the pairwise\nMarkov property: {i, j} \u2208 E iff $J_{ij} \\neq 0$.\nIn a scalar GMRF, V indexes scalar components of x.\nIn a vectoral GMRF, V indexes disjoint\nsubvectors of x, each of potentially different dimension. The block submatrix Jii can be thought of\nas specifying the sparsity pattern of the scalar micro-network within the vectoral macro-node i \u2208 V.\n\n2.4 Gaussian Belief Propagation (GaBP)\n\nIf x can be partitioned into n subvectors of dimension at most d, and the resulting graph is tree-shaped,\nthen all marginal precision matrices $J'_i$, i \u2208 V, can be computed by Gaussian belief propagation\n(GaBP) [10] in O(n \u00b7 d^3). For such trees, one can also compute all edge marginal precision\nmatrices $J'_{\\{i,j\\}}$, {i, j} \u2208 E, with the same asymptotic complexity of O(n \u00b7 d^3).\nIn light of (3), pairwise MI quantities between adjacent nodes i and j may be expressed as\n\n$$I(x_i; x_j) = H(x_i) + H(x_j) - H(x_i, x_j) = -\\frac{1}{2} \\ln\\det(J'_i) - \\frac{1}{2} \\ln\\det(J'_j) + \\frac{1}{2} \\ln\\det(J'_{\\{i,j\\}}), \\quad \\{i, j\\} \\in E, \\qquad (4)$$\n\ni.e., purely in terms of node and edge marginal precision matrices. 
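As a small sanity check of (4) in the scalar case, the following sketch recovers the pairwise MI of two jointly Gaussian scalars from their node and edge marginal precisions and compares it with the familiar closed form in terms of the correlation coefficient. The function names and the numeric values are hypothetical and chosen only for illustration; in the setting of this paper the marginal precisions would come from GaBP rather than from a covariance matrix.

```python
import math


def marginal_precisions_from_covariance(var_i, var_j, cov_ij):
    # Hypothetical helper: build the node and edge marginal precisions of a
    # scalar pair by inverting the 2x2 marginal covariance by hand.
    det_cov = var_i * var_j - cov_ij ** 2
    j_edge = [[var_j / det_cov, -cov_ij / det_cov],
              [-cov_ij / det_cov, var_i / det_cov]]
    return 1.0 / var_i, 1.0 / var_j, j_edge


def pairwise_mi_from_precisions(j_i, j_j, j_edge):
    # Equation (4) for scalar nodes: node terms enter with a minus sign,
    # the edge marginal precision determinant with a plus sign.
    det_edge = j_edge[0][0] * j_edge[1][1] - j_edge[0][1] * j_edge[1][0]
    return -0.5 * math.log(j_i) - 0.5 * math.log(j_j) + 0.5 * math.log(det_edge)


rho = 0.6  # illustrative correlation between two unit-variance scalars
j_i, j_j, j_edge = marginal_precisions_from_covariance(1.0, 1.0, rho)
mi = pairwise_mi_from_precisions(j_i, j_j, j_edge)
closed_form = -0.5 * math.log(1.0 - rho ** 2)
print(mi, closed_form)
```

Both printed values should agree up to floating-point error, since for scalars (4) reduces to minus one half of the log of one minus the squared correlation.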
Thus, GaBP provides a way of\ncomputing all local pairwise MI quantities in O(n \u00b7 d^3).\nNote that Gaussian trees comprise an important class of distributions that subsumes Gaussian hidden\nMarkov models (HMMs), and GaBP on trees is a generalization of the Kalman filtering/smoothing\nalgorithms that operate on HMMs. Moreover, the graphical inference community appears to best\nunderstand the convergence of message passing algorithms for continuous distributions on subclasses\nof multivariate Gaussians (e.g., tree-shaped [10], walk-summable [11], and feedback-separable [12]\nmodels, among others).\n\n3 Problem Statement\nLet px(\u00b7) = N \u22121(\u00b7; h, J) be represented by GMRF G = (V,E), and consider a partition of V into\nthe subsets of latent nodes U and observable nodes S, with R \u2286 U denoting the subset of relevant\nlatent variables (i.e., those to be inferred). Given a cost function $c : 2^S \\to \\mathbb{R}_{\\geq 0}$ over subsets of\nobservations, and a budget $\\beta \\in \\mathbb{R}_{\\geq 0}$, the focused active inference problem is\n\n$$\\begin{aligned} \\underset{A \\subseteq S}{\\text{maximize}} \\quad & I(x_R; x_A) \\\\ \\text{s.t.} \\quad & c(A) \\leq \\beta. \\end{aligned} \\qquad (5)$$\n\nThe focused active inference problem in (5) is distinguished from the unfocused active inference\nproblem\n\n$$\\begin{aligned} \\underset{A \\subseteq S}{\\text{maximize}} \\quad & I(x_U; x_A) \\\\ \\text{s.t.} \\quad & c(A) \\leq \\beta, \\end{aligned} \\qquad (6)$$\n\nwhich considers the entirety of the latent state U \u2287 R to be of interest. Both problems are known to\nbe NP-hard [13, 14].\nBy the chain rule and nonnegativity of MI, $I(x_U; x_A) = I(x_R; x_A) + I(x_{U \\setminus R}; x_A \\mid x_R) \\geq I(x_R; x_A)$,\nfor any A \u2286 S. Therefore, maximizing unfocused MI does not imply maximizing\nfocused MI. Focused active inference must be posed as a separate problem to avoid the situation\nwhere the observation selector becomes fixated on inferring nuisance variables as a result of\n$I(x_{U \\setminus R}; x_A \\mid x_R)$ being included implicitly in the valuation. 
In fact, an unfocused selector can\nperform arbitrarily poorly with respect to a focused metric, as the following example illustrates.\n\nExample 1. Consider a scalar GMRF over a four-node chain (Figure 1a), whereby J13 = J14 =\nJ24 = 0 by the pairwise Markov property, with R = {2}, S = {1, 4}, c(A) = |A| (i.e., unit-cost\nobservations), and \u03b2 = 1. The optimal unfocused decision rule $A^*_{(UF)} = \\operatorname{argmax}_{a \\in \\{1,4\\}} I(x_2, x_3; x_a)$\ncan be shown, by conditional independence and positive definiteness of J, to reduce to\n\n$$|J_{34}| \\; \\underset{A^*_{(UF)} = \\{1\\}}{\\overset{A^*_{(UF)} = \\{4\\}}{\\gtrless}} \\; |J_{12}|,$$\n\nindependent of J23, which parameterizes the edge potential between nodes 2 and 3. Conversely, the\noptimal focused decision rule $A^*_{(F)} = \\operatorname{argmax}_{a \\in \\{1,4\\}} I(x_2; x_a)$ can be shown to be\n\n$$|J_{23}| \\cdot \\mathbb{1}_{\\{J_{34}^2 - J_{12}^2 J_{34}^2 - J_{12}^2 \\geq 0\\}} \\; \\underset{A^*_{(F)} = \\{1\\}}{\\overset{A^*_{(F)} = \\{4\\}}{\\gtrless}} \\; \\sqrt{\\frac{(1 - J_{34}^2) J_{12}^2}{J_{34}^2}},$$\n\nwhere $\\mathbb{1}_{\\{\\cdot\\}}$ is the indicator function, which evaluates to 1 when its argument is true and 0 otherwise.\nThe loss associated with optimizing the \u201cwrong\u201d information measure is demonstrated in Figure 1b.\nThe reason for this loss is that as |J23| \u2192 0+, the information that node 3 can convey about node 2\nalso approaches zero, although the unfocused decision rule is oblivious to this fact.\n\n[Figure 1a: chain graph over x1, x2, x3, x4.]\n\n[Figure 1b: plot titled Score vs. |J23| (with $J_{12}^2 = 0.3$, $J_{34}^2 = 0.5$); x-axis: |J23|; y-axis: I(xR; xA) [nats]; curves: Unfocused Policy, Focused Policy.]\n\n(a) Graphical model.\n\n(b) Policy comparison.\n\nFigure 1: (a) Graphical model for the four-node chain example. (b) Unfocused vs. focused policy\ncomparison. There exists a range of values for |J23| such that the unfocused and focused policies\ncoincide; however, as |J23| \u2192 0+, the unfocused policy approaches complete performance loss with\nrespect to the focused measure.\n\n
(a) Unique path with sidegraphs.\n\n(b) Vectoral graph with thin edges.\n\nFigure 2: (a) Example of a nontree graph G with a unique path $\\bar{P}_{1:k}$ between nodes 1 and k. The\n\u201csidegraph\u201d attached to each node $i \\in \\bar{P}_{1:k}$ is labeled as $\\tilde{G}_i$. (b) Example of a vectoral graph with\nthin edges, with internal (scalar) structure depicted.\n\n4 Nonlocal MI Decomposition\n\nFor GMRFs with n nodes indexing d-dimensional random subvectors, I(xR; xA) can be computed\nexactly in O((nd)^3) via Schur complements/inversions on the precision matrix J. However, certain\ngraph structures permit the computation via belief propagation of all local pairwise MI terms\nI(xi; xj), for adjacent nodes i, j \u2208 V, in O(n \u00b7 d^3) \u2013 a substantial savings for large networks. This\nsection describes a transformation of nonlocal MI between uniquely path-connected nodes that permits\na decomposition into the sum of transformed local MI quantities, i.e., those relating adjacent\nnodes in the graph. Furthermore, the local MI terms can be transformed in constant time, yielding\nan O(n \u00b7 d^3) algorithm for computing any pairwise nonlocal MI quantity coinciding with a unique path.\nDefinition 1 (Warped MI). For disjoint subsets A, B, C \u2286 V, the warped mutual information\nmeasure $W : 2^V \\times 2^V \\times 2^V \\to (-\\infty, 0]$ is defined such that\n\n$$W(A; B|C) \\triangleq \\frac{1}{2} \\log\\left( 1 - \\exp\\{-2 I(x_A; x_B | x_C)\\} \\right).$$\n\nFor convenience, let $W(i; j|C) \\triangleq W(\\{i\\}; \\{j\\}|C)$ for i, j \u2208 V.\nRemark 2. For i, j \u2208 V indexing scalar nodes, the warped MI of Definition 1 reduces to $W(i; j) = \\log |\\rho_{ij}|$,\nwhere $\\rho_{ij} \\in [-1, 1]$ is the correlation coefficient between scalar r.v.s xi and xj. The\nmeasure $\\log |\\rho_{ij}|$ has long been known to the graphical model learning community as an \u201cadditive\ntree distance\u201d [15, 16], and our decomposition for vectoral graphs is a novel application for sensor\nselection problems. 
To the best of the authors\u2019 knowledge, the only other distribution class with\nestablished additive distances is that of tree-shaped symmetric discrete distributions [16], which require a\nvery limiting parameterization of the potential functions defined over edges in the factorization of\nthe joint distribution.\nProposition 3 (Scalar Nonlocal MI Decomposition). For any GMRF G = (V,E) where V indexes\nscalar random variables, if |PG(u, v)| = 1 for distinct vertices u, v \u2208 V, then for any C \u2286 V \\ {u, v},\nI(xu; xv|xC) can be decomposed as\n\n$$W(u; v|C) = \\sum_{\\{i,j\\} \\in \\bar{E}_{u:v}} W(i; j|C), \\qquad (7)$$\n\nwhere $\\bar{E}_{u:v}$ is the set of edges joining consecutive nodes of $\\bar{P}_{u:v}$, the unique path between u and v\nand sole element of PG(u, v).\n(Proofs of this and subsequent propositions can be found in the supplementary material.)\nRemark 4. Proposition 3 requires only that the path between vertices u and v be unique. If G is\na tree, this is obviously satisfied. However, the result holds on any graph for which: the subgraph\ninduced by $\\bar{P}_{u:v}$ is a chain; and every $i \\in \\bar{P}_{u:v}$ separates $N(i) \\setminus \\bar{P}_{u:v}$ from $\\bar{P}_{u:v} \\setminus \\{i\\}$, where\n$N(i) \\triangleq \\{j : \\{i, j\\} \\in E\\}$ is the neighbor set of i. See Figure 2a for an example of a nontree graph\nwith a unique path.\nDefinition 5 (Thin Edges). An edge {i, j} \u2208 E of GMRF G = (V,E; J) is thin if the corresponding\nsubmatrix Jij has exactly one nonzero scalar component. (See Figure 2b.)\nFor vectoral problems, each node may contain a subnetwork of arbitrarily connected scalar random\nvariables (see Figure 2b). Under the assumption of thin edges (Definition 5), a unique path between\nnodes u and v must enter interstitial nodes through one scalar r.v. and leave through one scalar r.v. 
Therefore, let $\\zeta_i(u, v|C) \\in (-\\infty, 0]$ denote the warped MI between the enter and exit r.v.s of\ninterstitial vectoral node i on $\\bar{P}_{u:v}$, with conditioning set C \u2286 V \\ {u, v} (see footnote 1). Note that $\\zeta_i(u, v|C)$ can\nbe computed online in O(d^3) via local marginalization given $J'_{i|C}$, which is an output of GaBP.\nProposition 6 (Vectoral Nonlocal MI Decomposition). For any GMRF G = (V,E) where V indexes\nrandom vectors of dimension at most d and the edges in E are thin, if |PG(u, v)| = 1 for distinct\nvertices u, v \u2208 V, then for any C \u2286 V \\ {u, v}, I(xu; xv|xC) can be decomposed as\n\n$$W(u; v|C) = \\sum_{\\{i,j\\} \\in \\bar{E}_{u:v}} W(i; j|C) + \\sum_{i \\in \\bar{P}_{u:v} \\setminus \\{u,v\\}} \\zeta_i(u, v|C). \\qquad (8)$$\n\n5 (Distributed) Focused Greedy Selection\n\nThe nonlocal MI decompositions of Section 4 can be used to efficiently solve the focused greedy\nselection problem, which at each iteration, given the subset A \u2282 S of previously selected observable\nrandom variables, is\n\n$$\\operatorname{argmax}_{\\{y \\in S \\setminus A \\,:\\, c(y) \\leq \\beta - c(A)\\}} I(x_R; x_y \\mid x_A).$$\n\nTo proceed, first consider the singleton case R = {r} for r \u2208 U. Running GaBP on the graph\nG conditioned on A and subsequently computing all terms W (i; j|A), \u2200{i, j} \u2208 E incurs a\ncomputational cost of O(n \u00b7 d^3). Once GaBP has converged, node r authors an \u201cr-message\u201d with the\nvalue 0. Each neighbor i \u2208 N (r) receives that message with value modified by W (r; i|A); there\nis no \u03b6 term because there are no interstitial nodes between r and its neighbors. Subsequently,\neach i \u2208 N (r) messages its neighbors j \u2208 N (i) \\ {r}, modifying the value of its r-message by\n$W(i; j|A) + \\zeta_i(r, j|A)$, the latter term being computed online in O(d^3) from $J'_{i|A}$, itself an output\nof GaBP (see footnote 2). Then j messages N (j) \\ {i}, and so on down to the leaves of the tree. 
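The r-message dissemination just described can be sketched for the scalar case, where every zeta term vanishes and each edge weight is the warped MI log|rho| of Remark 2. This is only an illustrative serial sketch under those assumptions: the names warp, unwarp and disseminate_r_messages, the three-node chain, and the correlation values are all hypothetical, and the edge weights are supplied directly rather than derived from GaBP output.

```python
import math
from collections import deque


def warp(mi):
    # Definition 1: map an MI value in (0, inf) to a warped value in (-inf, 0).
    return 0.5 * math.log(1.0 - math.exp(-2.0 * mi))


def unwarp(w):
    # Inverse of warp; only valid for w < 0 (do not call it on r itself).
    return -0.5 * math.log(1.0 - math.exp(2.0 * w))


def disseminate_r_messages(neighbors, edge_w, r):
    # Node r authors a message with value 0; each hop adds the warped MI of
    # the traversed edge, so a node y ends up holding W(r; y | A).
    scores = {r: 0.0}
    queue = deque([r])
    while queue:
        i = queue.popleft()
        for j in neighbors[i]:
            if j not in scores:
                scores[j] = scores[i] + edge_w[frozenset((i, j))]
                queue.append(j)
    return scores


# Illustrative scalar chain 1 - 2 - 3 with made-up edge correlations.
rho12, rho23 = 0.8, 0.5
neighbors = {1: [2], 2: [1, 3], 3: [2]}
edge_w = {
    frozenset((1, 2)): math.log(abs(rho12)),
    frozenset((2, 3)): math.log(abs(rho23)),
}
w = disseminate_r_messages(neighbors, edge_w, 1)
mi_13 = unwarp(w[3])
print(mi_13)
```

On this chain the accumulated weight at node 3 equals log(0.8 * 0.5), so the unwarped score matches the MI implied by an end-to-end correlation of 0.4, which is what the additive decomposition of Proposition 3 predicts.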
Since there are at\nmost n\u22121 edges in a forest, the total cost of dissemination is still O(n \u00b7 d^3), after which all nodes y in\nthe same component as r will have received an r-message whose value on arrival is W (r; y|A), from\nwhich I(xr; xy|xA) can be computed in constant time. Thus, for |R| = 1, all scores I(xR; xy|xA)\nfor y \u2208 S \\ A can collectively be computed at each iteration of the greedy algorithm in O(n \u00b7 d^3).\nNow consider |R| > 1. Let $R = (r_1, \\ldots, r_{|R|})$ be an ordering of the elements of R, and let $R_k$\nbe the first k elements of R. Then, by the chain rule of mutual information,\n\n$$I(x_R; x_y \\mid x_A) = \\sum_{k=1}^{|R|} I(x_{r_k}; x_y \\mid x_{A \\cup R_{k-1}}), \\quad y \\in S \\setminus A,$$\n\nwhere each term in the sum is a pairwise (potentially\nnonlocal) MI evaluation. The implication is that one can run |R| separate instances of GaBP, each\nusing a different conditioning set $A \\cup R_{k-1}$, to compute \u201cnode and edge weights\u201d (W and \u03b6 terms)\nfor the r-message passing scheme outlined above. The chain rule suggests one should then sum the\nunwarped r-scores of these |R| instances to yield the scores I(xR; xy|xA) for y \u2208 S \\ A. The total\ncost of a greedy update is then $O(|R| \\cdot n d^3)$.\n\nOne of the benefits of the focused greedy selection algorithm is its amenability to parallelization.\nAll quantities needed to form the W and \u03b6 terms are derived from GaBP, which is parallelizable and\nguaranteed to converge on trees in at most diam(G) iterations [10]. Parallelization reallocates the\nexpense of quantification across networked computational resources, often leading to faster solution\ntimes and enabling larger problem instantiations than are otherwise permissible. 
However, full parallelization,\nwherein each node i \u2208 V is viewed as a separate computing resource, incurs a multiplicative\noverhead of O(diam(G)) due to each i having to send |N (i)| messages diam(G) times, yielding local\ncommunication costs of $O(\\mathrm{diam}(G) \\, |N(i)| \\cdot d^3)$ and overall complexity of $O(\\mathrm{diam}(G) \\cdot |R| \\cdot n d^3)$.\nThis overhead can be alleviated by instead assigning to every computational resource a connected\nsubgraph of G.\n\nFootnote 1: As node i may have additional neighbors that are not on the u-v path, using the notation $\\zeta_i(u, v|C)$ is\na convenient way to implicitly specify the enter/exit scalar r.v.s associated with the path. Any unique path\nsubsuming u-v, or any unique path subsumed in u-v for which i is interstitial, will have equivalent $\\zeta_i$ terms.\nFootnote 2: If i is in the conditioning set, its outgoing message can be set to be \u2212\u221e, so that the nodes it blocks\nfrom reaching r see an apparent information score of 0. Alternatively, i could simply choose not to transmit\nr-messages to its neighbors.\n\nIt should also be noted that if the quantification is instead performed using serial BP \u2013 which can be\nconceptualized as choosing an arbitrary root, collecting messages from the leaves up to the root, and\ndisseminating messages back down again \u2013 a factor of 2 savings can be achieved for $R_2, \\ldots, R_{|R|}$\nby noting that in moving between instances k and k + 1, only $r_k$ is added to the conditioning set.\nTherefore, by reassigning $r_k$ as the root for the BP instance associated with $r_{k+1}$ (i.e., $A \\cup R_k$ as\nthe conditioning set), only the second half of the message passing schedule (disseminating messages\nfrom the root to the leaves) is necessary. 
We subsequently refer to this trick as \u201ccaching.\u201d\n\n6 Experiments\n\nTo benchmark the runtime performance of the algorithm in Section 5, we implemented its serial\nGaBP variant in Java, with and without the caching trick described above.\nWe compare our algorithm with greedy selectors that use matrix inversion (with cubic complexity)\nto compute nonlocal mutual information measures. Let $S_{\\text{feas}} := \\{y \\in S \\setminus A : c(y) \\leq \\beta - c(A)\\}$.\nAt each iteration of the greedy selector, the blocked inversion-based quantifier first computes\n$J'_{R \\cup S_{\\text{feas}}|A}$ (entailing a block marginalization of nuisances), from which $J'_{R|A}$ and $J'_{R|A \\cup y}$, $\\forall y \\in S_{\\text{feas}}$,\nare computed. Then $I(x_R; x_y \\mid x_A)$, $\\forall y \\in S_{\\text{feas}}$, are computed via a variant of (3). The naive\ninversion-based quantifier computes $I(x_R; x_y \\mid x_A)$, $\\forall y \\in S_{\\text{feas}}$, \u201cfrom scratch\u201d by using separate\nSchur complements of J submatrices and not storing intermediate results. The inversion-based\nquantifiers were implemented in Java using the Colt sparse matrix libraries [17].\n\nFigure 3: Performance of GaBP-based and inversion-based quantifiers used in greedy selectors.\nFor each n, the mean of the runtimes over 20 random scalar problem instances is displayed. Our\nBP-Quant algorithm of Section 5 empirically has approximately linear complexity; caching reduces\nthe mean runtime by a factor of approximately 2.\n\nFigure 3 shows the comparative mean runtime performance of each of the quantifiers for scalar\nnetworks of size n, where the mean is taken over the 20 problem instances proposed for each value\nof n. Each problem instance consists of a randomly generated, symmetric, positive-definite, tree-shaped\nprecision matrix J, along with a randomly labeled S (such that, arbitrarily, |S| = 0.3|V|)\nand R (such that |R| = 5), as well as randomly selected budget and heterogeneous costs defined\nover S. 
Note that all selectors return the same greedy selection; we are concerned with how the\ndecompositions proposed in this paper aid in the computational performance. In the figure, it is\nclear that the GaBP-based quantification algorithms of Section 5 vastly outperform both inversion-based\nmethods; even for relatively small n, the solution times for the inversion-based methods became\nprohibitively long. Conversely, the behavior of the BP-based quantifiers empirically confirms the\nasymptotic O(n) complexity of our method for scalar networks.\n\n[Figure 3 plot: Greedy Selection Total Runtimes for Quantification Algorithms; x-axis: n (Network Size); y-axis: Mean Runtime [ms]; series: BP-Quant-Cache, BP-Quant-NoCache, Inv-Quant-Block, Inv-Quant-Naive.]\n\n7 Performance Bounds\nDue to the presence of nuisances in the model, even if the subgraph induced by S is completely\ndisconnected, it is not always the case that the nodes in S are conditionally independent when\nconditioned on only the relevant latent set R. Lack of conditional independence means one cannot\nguarantee submodularity of the information measure, as per [6]. Our approach will be to augment\nR such that submodularity is guaranteed and relate the performance bound to this augmented set.\nLet $\\hat{R}$ be any subset such that $R \\subset \\hat{R} \\subseteq U$ and such that nodes in S are conditionally independent\nconditioned on $\\hat{R}$. 
Then, by Corollary 4 of [6], $I(x_{\\hat{R}}; x_A)$ is submodular and nondecreasing on S.\nAdditionally, for the case of unit-cost observations (i.e., c(A) = |A| for all A \u2286 S), a greedily\nselected subset $A^g_\\beta(\\hat{R})$ of cardinality \u03b2 satisfies the performance bound\n\n$$\\begin{aligned} I(\\hat{R}; A^g_\\beta(\\hat{R})) &\\geq \\left(1 - \\frac{1}{e}\\right) \\max_{\\{A \\subseteq S : |A| \\leq \\beta\\}} I(\\hat{R}; A) && (9) \\\\ &= \\left(1 - \\frac{1}{e}\\right) \\max_{\\{A \\subseteq S : |A| \\leq \\beta\\}} \\left[ I(R; A) + I(\\hat{R} \\setminus R; A \\mid R) \\right] && (10) \\\\ &\\geq \\left(1 - \\frac{1}{e}\\right) \\max_{\\{A \\subseteq S : |A| \\leq \\beta\\}} I(R; A), && (11) \\end{aligned}$$\n\nwhere (9) is due to [6], (10) to the chain rule of MI, and (11) to the nonnegativity of MI. The\nfollowing proposition follows immediately from (11).\nProposition 7. For any set $\\hat{R}$ such that $R \\subset \\hat{R} \\subseteq U$ and nodes in S are conditionally independent\nconditioned on $\\hat{R}$, provided $I(\\hat{R}; A^g_\\beta(\\hat{R})) > 0$, an online-computable performance bound for any\n$\\bar{A} \\subseteq S$ in the original focused problem with relevant set R and unit-cost observations is\n\n$$I(R; \\bar{A}) \\geq \\underbrace{\\left[ \\frac{I(R; \\bar{A})}{I(\\hat{R}; A^g_\\beta(\\hat{R}))} \\right]}_{\\triangleq \\, \\delta_R(\\bar{A}, \\hat{R})} \\left(1 - \\frac{1}{e}\\right) \\max_{\\{A \\subseteq S : |A| \\leq \\beta\\}} I(R; A). \\qquad (12)$$\n\nProposition 7 can be used at runtime to determine what percentage $\\delta_R(\\bar{A}, \\hat{R})$ of the optimal objective\nis guaranteed, for any focused selector, despite the lack of conditional independence of S\nconditioned on R. In order to compute the bound, a greedy heuristic running on a separate, surrogate\nproblem with $\\hat{R}$ as the relevant set is required. 
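The arithmetic behind (12) is simple enough to sketch directly. Assuming one already has the focused score of a candidate selection and the score achieved by the greedy heuristic on the surrogate problem, the ratio delta and the resulting guaranteed fraction delta * (1 - 1/e) of the focused optimum follow immediately. The function name and the two input values below are hypothetical and purely illustrative; they are not results from the paper.

```python
import math


def focused_bound_fraction(mi_focused, mi_surrogate_greedy):
    # mi_focused: I(R; A_bar) for the selection being evaluated.
    # mi_surrogate_greedy: I(R_hat; A_g) from greedy on the surrogate problem.
    # Proposition 7 requires the surrogate score to be strictly positive.
    if mi_surrogate_greedy <= 0.0:
        raise ValueError('surrogate greedy score must be positive')
    delta = mi_focused / mi_surrogate_greedy
    guaranteed = delta * (1.0 - 1.0 / math.e)
    return delta, guaranteed


# Made-up scores in nats, for illustration only.
delta, guaranteed = focused_bound_fraction(0.9, 1.5)
print(delta, guaranteed)
```

Here guaranteed is the factor multiplying the optimal focused objective on the right-hand side of (12), so a selection whose focused score is close to the surrogate greedy score inherits nearly the full (1 - 1/e) guarantee.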
Finding an \u02c6R \u2283 R providing the tightest bound\nis an area of future research.\n\n8 Conclusion\n\nIn this paper, we have considered the sensor selection problem on multivariate Gaussian distributions\nthat, in order to preserve a parsimonious representation, contain nuisances. For pairs of nodes con-\nnected in the graph by a unique path, there exist decompositions of nonlocal mutual information into\nlocal MI measures that can be computed ef\ufb01ciently from the output of message passing algorithms.\nFor tree-shaped models, we have presented a greedy selector where the computational expense of\nquanti\ufb01cation can be distributed across nodes in the network. Despite de\ufb01ciency in conditional in-\ndependence of observations, we have derived an online-computable performance bound based on\nan augmentation of the relevant set. Future work will consider extensions of the MI decomposition\nto graphs with nonunique paths and/or non-Gaussian distributions, as well as extend the analysis of\naugmented relevant sets to derive tighter performance bounds.\n\nAcknowledgments\n\nThe authors thank John W. Fisher III, Myung Jin Choi, and Matthew Johnson for helpful discussions\nduring the preparation of this paper. This work was supported by DARPA Mathematics of Sensing,\nExploitation and Execution (MSEE).\n\n8\n\n\fReferences\n[1] C. M. Kreucher, A. O. Hero, and K. D. Kastella. An information-based approach to sensor\nmanagement in large dynamic networks. Proc. IEEE, Special Issue on Modeling, Identi\ufb01cia-\ntion, & Control of Large-Scale Dynamical Systems, 95(5):978\u2013999, May 2007.\n\n[2] H.-L. Choi and J. P. How. Continuous trajectory planning of mobile sensors for informative\n\nforecasting. Automatica, 46(8):1266\u20131275, 2010.\n\n[3] V. Chandrasekaran, N. Srebro, and P. Harsha. Complexity of inference in graphical models. In\n\nProc. Uncertainty in Arti\ufb01cial Intelligence, 2008.\n\n[4] D. Koller and N. Friedman. 
Probabilistic Graphical Models: Principles and Techniques. MIT\n\nPress, 2009.\n\n[5] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algo-\n\nrithm. IEEE Transactions on Information Theory, 47(2):498\u2013519, Feb 2001.\n\n[6] A. Krause and C. Guestrin. Near-optimal nonmyopic value of information in graphical models.\n\nIn Proc. Uncertainty in Arti\ufb01cial Intelligence (UAI), 2005.\n\n[7] G. Nemhauser, L. Wolsey, and M. Fisher. An analysis of approximations for maximizing\n\nsubmodular set functions. Mathematical Programming, 14:489\u2013498, 1978.\n\n[8] J. L. Williams, J. W. Fisher III, and A. S. Willsky. Performance guarantees for information\nIn M. Meila and X. Shen, editors, Proc. Eleventh Int. Conf. on\n\ntheoretic active inference.\nArti\ufb01cial Intelligence and Statistics, pages 616\u2013623, 2007.\n\n[9] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 2nd ed. edition, 2006.\n[10] Y. Weiss and W. T. Freeman. Correctness of belief propagation in Gaussian graphical models\n\nof arbitrary topology. Neural Computation, 13(10):2173\u20132200, 2001.\n\n[11] D. M. Malioutov, J. K. Johnson, and A. S. Willsky. Walk-sums and belief propagation in\n\nGaussian graphical models. Journal of Machine Learning Research, 7:2031\u20132064, 2006.\n\n[12] Y. Liu, V. Chandrasekaran, A. Anandkumar, and A. S. Willsky. Feedback message passing for\ninference in gaussian graphical models. IEEE Transactions on Signal Processing, 60(8):4135\u2013\n4150, Aug 2012.\n\n[13] C. Ko, J. Lee, and M. Queyranne. An exact algorithm for maximum entropy sampling. Oper-\n\nations Research, 43:684\u2013691, 1995.\n\n[14] A. Krause and C. Guestrin. Optimal value of information in graphical models. Journal of\n\nArti\ufb01cial Intelligence Research, 35:557\u2013591, 2009.\n\n[15] P. L. Erd\u02ddos, M. A. Steel, L. A. Sz\u00b4ekely, and T. J. Warnow. A few logs suf\ufb01ce to build (almost)\n\nall trees: Part ii. 
Theoretical Computer Science, 221:77\u2013118, 1999.\n\n[16] M. J. Choi, V. Y. F. Tan, A. Anandkumar, and A. S. Willsky. Learning latent tree graphical\n\nmodels. Journal of Machine Learning Research, 12:1771\u20131812, May 2011.\n\n[17] CERN - European Organization for Nuclear Research. Colt, 1999.\n\n9\n\n\f", "award": [], "sourceid": 1076, "authors": [{"given_name": "Daniel", "family_name": "Levine", "institution": "MIT"}, {"given_name": "Jonathan", "family_name": "How", "institution": "MIT"}]}