{"title": "A* Lasso for Learning a Sparse Bayesian Network Structure for Continuous Variables", "book": "Advances in Neural Information Processing Systems", "page_first": 2418, "page_last": 2426, "abstract": "We address the problem of learning a sparse Bayesian network  structure for continuous variables in a high-dimensional space. The constraint that the estimated Bayesian network structure  must be a directed acyclic graph (DAG) makes the problem challenging because of the huge search space of network structures. Most previous methods were based on a two-stage approach that prunes the search space in the first stage and then searches for a network structure that satisfies the DAG constraint in the second stage. Although this approach is effective in a low-dimensional setting, it is difficult to ensure that the correct network structure is not pruned in the first stage in a high-dimensional setting.  In this paper, we propose a single-stage method, called A* lasso, that recovers the optimal  sparse Bayesian network structure by solving a single optimization problem with A* search algorithm that uses lasso in its scoring system. Our approach substantially improves the computational efficiency of the well-known exact methods  based on dynamic programming. We also present a heuristic scheme that further improves the efficiency of A* lasso without significantly compromising the quality of solutions and   demonstrate this on benchmark Bayesian networks and real data.", "full_text": "A* Lasso for Learning a Sparse Bayesian Network\n\nStructure for Continuous Variables\n\nJing Xiang\n\nMachine Learning Department\nCarnegie Mellon University\n\nPittsburgh, PA 15213\njingx@cs.cmu.edu\n\nSeyoung Kim\n\nLane Center for Computational Biology\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nsssykim@cs.cmu.edu\n\nAbstract\n\nWe address the problem of learning a sparse Bayesian network structure for con-\ntinuous variables in a high-dimensional space. 
The constraint that the estimated\nBayesian network structure must be a directed acyclic graph (DAG) makes the\nproblem challenging because of the huge search space of network structures. Most\nprevious methods were based on a two-stage approach that prunes the search\nspace in the \ufb01rst stage and then searches for a network structure satisfying the\nDAG constraint in the second stage. Although this approach is effective in a low-\ndimensional setting, it is dif\ufb01cult to ensure that the correct network structure is not\npruned in the \ufb01rst stage in a high-dimensional setting. In this paper, we propose\na single-stage method, called A* lasso, that recovers the optimal sparse Bayesian\nnetwork structure by solving a single optimization problem with A* search algo-\nrithm that uses lasso in its scoring system. Our approach substantially improves\nthe computational ef\ufb01ciency of the well-known exact methods based on dynamic\nprogramming. We also present a heuristic scheme that further improves the ef-\n\ufb01ciency of A* lasso without signi\ufb01cantly compromising the quality of solutions.\nWe demonstrate our approach on data simulated from benchmark Bayesian net-\nworks and real data.\n\n1\n\nIntroduction\n\nBayesian networks have been popular tools for representing the probability distribution over a large\nnumber of variables. However, learning a Bayesian network structure from data has been known\nto be an NP-hard problem [1] because of the constraint that the network structure has to be a di-\nrected acyclic graph (DAG). Many of the exact methods that have been developed for recovering the\noptimal structure are computationally expensive and require exponential computation time [15, 7].\nApproximate methods based on heuristic search are more computationally ef\ufb01cient, but they recover\na suboptimal structure. 
In this paper, we address the problem of learning a Bayesian network structure\nfor continuous variables in a high-dimensional space and propose an algorithm that recovers the\nexact solution with less computation time than the previous exact algorithms, and with the flexibility\nof further reducing computation time without a significant decrease in accuracy.\nMany of the existing algorithms are based on scoring each candidate graph and finding a graph with\nthe best score, where the score decomposes for each variable given its parents in a DAG. Although\nmethods may differ in the scoring method that they use (e.g., MDL [9], BIC [14], and BDe [4]),\nmost of these algorithms, whether exact methods or heuristic search techniques, have a two-stage\nlearning process. In Stage 1, candidate parent sets for each node are identified while ignoring the\nDAG constraint. Then, Stage 2 employs various algorithms to search for the best-scoring network\nstructure that satisfies the DAG constraint by limiting the search space to the candidate parent sets\nfrom Stage 1. For Stage 1, methods such as sparse candidate [2], max-min parents children [17], and\ntotal conditioning [11] algorithms have been previously proposed. For Stage 2, exact methods based\non dynamic programming [7, 15] and A* search algorithm [19] as well as inexact methods such as\nheuristic search technique [17] and linear programming formulation [6] have been developed. These\napproaches have been developed primarily for discrete variables, and regardless of whether exact or\ninexact methods are used in Stage 2, Stage 1 involved exponential computation time and space.\nFor continuous variables, L1-regularized Markov blanket (L1MB) [13] was proposed as a two-stage\nmethod that uses lasso to select candidate parents for each variable in Stage 1 and performs heuristic\nsearch for DAG structure and variable ordering in Stage 2. 
Although a two-stage approach can\nreduce the search space by pruning candidate parent sets in Stage 1, Huang et al. [5] observed that\napplying lasso in Stage 1 as in L1MB is likely to miss the true parents in a high-dimensional setting,\nthereby limiting the quality of the solution in Stage 2. They proposed the sparse Bayesian network\n(SBN) algorithm that formulates the problem of Bayesian network structure learning as a single-\nstage optimization problem and transforms it into a lasso-type optimization to obtain an approximate\nsolution. Then, they applied a heuristic search to re\ufb01ne the solution as a post-processing step.\nIn this paper, we propose a new algorithm, called A* lasso, for learning a sparse Bayesian net-\nwork structure with continuous variables in high-dimensional space. Our method is a single-stage\nalgorithm that \ufb01nds the optimal network structure with a sparse set of parents while ensuring the\nDAG constraint is satis\ufb01ed. We \ufb01rst show that a lasso-based scoring method can be incorporated\nwithin dynamic programming (DP). While previous approaches based on DP required identifying\nthe exponential number of candidate parent sets and their scores for each variable in Stage 1 before\napplying DP in Stage 2 [7, 15], our approach effectively combines the score computation in Stage\n1 within Stage 2 via lasso optimization. Then, we present A* lasso which signi\ufb01cantly prunes the\nsearch space of DP by incorporating the A* search algorithm [12], while guaranteeing the optimality\nof the solution. Since in practice, A* search can still be expensive compared to heuristic methods,\nwe explore heuristic schemes that further limit the search space of A* lasso. 
We demonstrate in\nour experiments that this heuristic approach can substantially improve the computation time without\nsignificantly compromising the quality of the solution, especially on large Bayesian networks.\n\n2 Background on Bayesian Network Structure Learning\n\nA Bayesian network is a probabilistic graphical model defined over a DAG G with a set of p = |V |\nnodes V = {v1, . . . , vp}, where each node vj is associated with a random variable Xj [8].\nThe probability model associated with G in a Bayesian network factorizes as\n$p(X_1, \\ldots, X_p) = \\prod_{j=1}^{p} p(X_j \\mid \\mathrm{Pa}(X_j))$, where p(Xj|Pa(Xj)) is the conditional probability distribution for Xj given\nits parents Pa(Xj) with directed edges from each node in Pa(Xj) to Xj in G. We assume continuous\nrandom variables and use a linear regression model for the conditional probability distribution of\neach node, $X_j = \\mathrm{Pa}(X_j)'\\beta_j + \\epsilon$, where \u03b2j = {\u03b2jk\u2019s for Xk \u2208 Pa(Xj)} is the vector of unknown\nparameters to be estimated from data and $\\epsilon$ is the noise distributed as N (0, 1).\nGiven a dataset X = [x1, . . . , xp], where xj is a vector of n observations for random variable Xj,\nour goal is to estimate the graph structure G and the parameters \u03b2j\u2019s jointly. We formulate this\nproblem as that of obtaining a sparse estimate of \u03b2j\u2019s, under the constraint that the overall graph\nstructure G should not contain directed cycles. Then, the nonzero elements of \u03b2j\u2019s indicate the\npresence of edges in G. We obtain an estimate of Bayesian network structure and parameters by\nminimizing the negative log likelihood of data with sparsity enforcing L1 penalty as follows:\n\n$$\\min_{\\beta_1, \\ldots, \\beta_p} \\sum_{j=1}^{p} \\| x_j - x_{-j}'\\beta_j \\|_2^2 + \\lambda \\sum_{j=1}^{p} \\| \\beta_j \\|_1 \\quad \\text{s.t. } G \\in \\mathrm{DAG}, \\tag{1}$$\n\nwhere x\u2212j represents all columns of X excluding xj, assuming all other variables are candidate\nparents of node vj. Given the estimate of \u03b2j\u2019s, the set of parents for node vj can be found as the\nsupport of \u03b2j, $S(\\beta_j) = \\{v_i \\mid \\beta_{ji} \\neq 0\\}$. The \u03bb is the regularization parameter that determines the\namount of sparsity in \u03b2j\u2019s and can be determined by cross-validation. We notice that if the acyclicity\nconstraint is ignored, Equation (1) decomposes into individual lasso estimations for each node:\n\n$$\\mathrm{LassoScore}(v_j \\mid V \\setminus v_j) = \\min_{\\beta_j} \\| x_j - x_{-j}'\\beta_j \\|_2^2 + \\lambda \\| \\beta_j \\|_1,$$\n\nwhere V \\vj represents the set of all nodes in V excluding vj. The above lasso optimization problem\ncan be solved efficiently with the shooting algorithm [3]. However, the main challenge in optimizing\nEquation (1) arises from ensuring that the \u03b2j\u2019s satisfy the DAG constraint.\n\n3 A* Lasso for Bayesian Network Structure Learning\n\n3.1 Dynamic Programming with Lasso\n\nThe problem of learning a Bayesian network structure that satisfies\nthe constraint of no directed cycles can be cast as that of learning an\noptimal ordering of variables [8]. Once the optimal variable ordering\nis given, the constraint of no directed cycles can be trivially enforced\nby constraining the parents of each variable in the local conditional\nprobability distribution to be a subset of the nodes that precede the\ngiven node in the ordering. We let $\\Pi^V = [\\pi^V_1, \\ldots, \\pi^V_{|V|}]$ denote an\nordering of the nodes in V , where $\\pi^V_j$ indicates the node v \u2208 V in\nthe jth position of the ordering, and $\\Pi^V_{\\prec v_j}$ denote the set of nodes in\nV that precede node vj in ordering $\\Pi^V$.\n\nFigure 1: Search space of variable ordering for three variables V = {v1, v2, v3}.\n\nAlgorithms based on DP have been developed to learn the optimal\nvariable ordering for Bayesian networks [16]. These approaches are\nbased on the observation that the score of the optimal ordering of the\nfull set of nodes V can be decomposed into (a) the optimal score for the first node in the ordering,\ngiven a choice of the first node and (b) the score of the optimal ordering of the nodes excluding the\nfirst node. The optimal variable ordering can be constructed by recursively applying this decomposition\nto select the first node in the ordering and to find the optimal ordering of the set of remaining\nnodes U \u2282 V . This recursion is given as follows, with an initial call of the recursion with U = V :\n\n$$\\mathrm{OptScore}(U) = \\min_{v_j \\in U} \\big[ \\mathrm{OptScore}(U \\setminus v_j) + \\mathrm{BestScore}(v_j \\mid V \\setminus U) \\big] \\tag{2}$$\n$$\\pi^U_1 = \\operatorname*{argmin}_{v_j \\in U} \\big[ \\mathrm{OptScore}(U \\setminus v_j) + \\mathrm{BestScore}(v_j \\mid V \\setminus U) \\big], \\tag{3}$$\n\nwhere BestScore(vj|V \\U ) is the optimal score of vj under the optimal choice of parents from V \\U.\nIn order to obtain BestScore(vj|V \\U ) in Equations (2) and (3), for the case of discrete variables,\nmany previous approaches enumerated all possible subsets of V as candidate sets of parents for node\nvj to precompute BestScore(vj|V \\U ) in Stage 1 before applying DP in Stage 2 [7, 15]. While this\napproach may perform well in a low-dimensional setting, in a high-dimensional setting, a two-stage\nmethod is likely to miss the true parent sets in Stage 1, which in turn affects the performance of Stage\n2 [5]. 
In this paper, we consider the high-dimensional setting and present a single-stage method that\napplies lasso to obtain BestScore(vj|V \\U ) within DP as follows:\n\n$$\\mathrm{BestScore}(v_j \\mid V \\setminus U) = \\mathrm{LassoScore}(v_j \\mid V \\setminus U) = \\min_{\\beta_j,\\, S(\\beta_j) \\subseteq V \\setminus U} \\| x_j - x_{-j}'\\beta_j \\|_2^2 + \\lambda \\| \\beta_j \\|_1 .$$\n\nThe constraint S(\u03b2j) \u2286 V \\U in the above lasso optimization can be trivially maintained by setting\nthe \u03b2jk for vk \u2208 U to 0 and optimizing only for the other \u03b2jk\u2019s. When applying the recursion in\nEquations (2) and (3), DP takes advantage of the overlapping subproblems to prune the search space\nof orderings, since the problem of computing OptScore(U) for U \u2286 V can appear as a subproblem\nof scoring orderings of any larger subsets of V that contain U.\nThe problem of finding the optimal variable ordering can be viewed as that of finding the shortest\npath from the start state to the goal state in a search space given as a subset lattice. The search\nspace consists of a set of states, each of which is associated with one of the $2^{|V|}$ possible subsets\nof nodes in V . The start state is the empty set {} and the goal state is the set of all variables V . A\nvalid move in this search space is defined from a state for subset Qs to another state for subset Qs',\nonly if Qs' contains one additional node to Qs. Each move to the next state corresponds to adding a\nnode at the end of the ordering of the nodes in the previous state. The cost of such a move is given\nby BestScore(v|Qs), where v = Qs'\\Qs. Each path from the start state to the goal state gives one\npossible ordering of nodes. Figure 1 illustrates the search space, where each state is associated with\na Qs. 
DP finds the shortest path from the start state to the goal state that corresponds to the optimal\nvariable ordering by considering all possible paths in this search space and visiting all $2^{|V|}$ states.\n\n3.2 A* Lasso for Pruning Search Space\n\nAs discussed in the previous section, DP considers all $2^{|V|}$ states in the subset lattice to find the\noptimal variable ordering. Thus, it is not sufficiently efficient to be practical for problems with\nmore than 20 nodes. On the other hand, a greedy algorithm is computationally efficient because\nit explores a single variable ordering by greedily selecting the most promising next state based on\nBestScore(v|Qs), but it returns a suboptimal solution. In this paper, we propose A* lasso that\nincorporates the A* search algorithm [12] to construct the optimal variable ordering in the search\nspace of the subset lattice. We show that this strategy can significantly prune the search space\ncompared to DP, while maintaining the optimality of the solution.\nWhen selecting the next move in the process of constructing a path in the search space, instead of\ngreedily selecting the move, A* search also accounts for the estimate of the future cost given by a\nheuristic function h(Qs) that will be incurred to reach the goal state from the candidate next state.\nAlthough the exact future cost is not known until A* search constructs the full path by reaching\nthe goal state, a reasonable estimate of the future cost can be obtained by ignoring the directed\nacyclicity constraint. It is well-known that A* search is guaranteed to find the shortest path if the\nheuristic function h(Qs) is admissible [12], meaning that h(Qs) never overestimates the\ntrue cost of reaching the goal state. 
Below, we describe an admissible heuristic for A* lasso.\nWhile exploring the search space, A* search algorithm assigns a score f (Qs) to each state and\nits corresponding subset Qs of variables for which the ordering has been determined. A* search\nalgorithm computes this score f (Qs) as the sum of the cost g(Qs) that has been incurred so far to\nreach the current state from the start state and an estimate of the cost h(Qs) that will be incurred to\nreach the goal state from the current state:\n\n$$f(Q_s) = g(Q_s) + h(Q_s). \\tag{4}$$\n\nMore specifically, given the ordering $\\Pi^{Q_s}$ of variables in Qs that has been constructed along the\npath from the start state to the state for Qs, the cost that has been incurred so far is defined as\n\n$$g(Q_s) = \\sum_{v_j \\in Q_s} \\mathrm{LassoScore}(v_j \\mid \\Pi^{Q_s}_{\\prec v_j}) \\tag{5}$$\n\nand the heuristic function for the estimate of the future cost to reach the goal state is defined as:\n\n$$h(Q_s) = \\sum_{v_j \\in V \\setminus Q_s} \\mathrm{LassoScore}(v_j \\mid V \\setminus v_j) \\tag{6}$$\n\nNote that the heuristic function is admissible, or an underestimate of the true cost, since the constraint\nof no directed cycles is ignored and each variable in V \\Qs is free to choose any variables in\nV as its parents, which lowers the lasso objective value.\nWhen the search space is a graph where multiple paths can reach the same state, we can further\nimprove efficiency if the heuristic function has the property of consistency in addition to admissibility.\nA consistent heuristic always satisfies h(Qs) \u2264 h(Qs') + LassoScore(vk|Qs), where\nLassoScore(vk|Qs) is the cost of moving from state Qs to state Qs' with {vk} = Qs'\\Qs. Consistency\nensures that the first path found by A* search to reach the given state is always the shortest\npath to that state [12]. This allows us to prune the search when we reach the same state via a different\npath later in the search. 
The following proposition states that our heuristic function is consistent.\nProposition 1 The heuristic in Equation (6) is consistent.\nProof For any successor state Qs' of Qs, let vk = Qs'\\Qs.\n\n$$h(Q_s) = \\sum_{v_j \\in V \\setminus Q_s} \\mathrm{LassoScore}(v_j \\mid V \\setminus v_j)$$\n$$= \\sum_{v_j \\in V \\setminus Q_s,\\, v_j \\neq v_k} \\mathrm{LassoScore}(v_j \\mid V \\setminus v_j) + \\mathrm{LassoScore}(v_k \\mid V \\setminus v_k)$$\n$$\\leq h(Q_{s'}) + \\mathrm{LassoScore}(v_k \\mid Q_s),$$\n\nwhere LassoScore(vk|Qs) is the true cost of moving from state Qs to Qs'. The inequality\nabove holds because vk has fewer parents to choose from in LassoScore(vk|Qs) than in\nLassoScore(vk|V \\vk). Thus, our heuristic in Equation (6) is consistent.\n\nAlgorithm 1: A* lasso for learning Bayesian network structure\nInput: X, V , \u03bb\nOutput: Optimal variable ordering \u03a0V\nInitialize OPEN to an empty queue;\nInitialize CLOSED to an empty set;\nCompute LassoScore(vj|V \\vj) for all vj \u2208 V ;\nOPEN.insert((Qs = {}, f (Qs) = h({}), g(Qs) = 0, \u03a0Qs = [ ]));\nwhile true do\n    (Qs, f (Qs), g(Qs), \u03a0Qs ) \u2190 OPEN.pop();\n    if h(Qs) = 0 then\n        Return \u03a0V \u2190 \u03a0Qs;\n    end\n    foreach v \u2208 V \\Qs do\n        Qs' \u2190 Qs \u222a {v};\n        if Qs' \u2209 CLOSED then\n            Compute LassoScore(v|Qs) with lasso shooting algorithm;\n            g(Qs') \u2190 g(Qs) + LassoScore(v|Qs);\n            h(Qs') \u2190 h(Qs) \u2212 LassoScore(v|V \\v);\n            f (Qs') \u2190 g(Qs') + h(Qs');\n            \u03a0Qs' \u2190 [\u03a0Qs, v];\n            OPEN.insert(L = (Qs', f (Qs'), g(Qs'), \u03a0Qs'));\n            CLOSED \u2190 CLOSED \u222a {Qs'};\n        end\n    end\nend\n\nGiven a consistent heuristic, many paths that go through the same state can be pruned by maintaining\nan OPEN list and a CLOSED list during A* search. In practice, the OPEN list can be implemented\nwith a priority queue and the CLOSED list can be implemented with a hash table. 
The OPEN list is\na priority queue that maintains all the intermediate results (Qs, f (Qs), g(Qs), \u03a0Qs )\u2019s for a partial\nconstruction of the variable ordering up to Qs at the frontier of the search, sorted according to the\nscore f (Qs). During search, A* lasso pops from the OPEN list the partial construction of ordering\nwith the lowest score f (Qs), visits the successor states by adding another node to the ordering \u03a0Qs,\nand queues the results onto the OPEN list. Any state that has been popped by A* lasso is placed\nin the CLOSED list. The states that have been placed in the CLOSED list are not considered again,\neven if A* search reaches these states through different paths later in the search.\nThe full algorithm for A* lasso is given in Algorithm 1. As in DP with lasso, A* lasso is a single-stage\nalgorithm that solves lasso within A* search. Every time A* lasso moves from state Qs to\nthe next state Qs' in the search space, $\\mathrm{LassoScore}(v_j \\mid \\Pi^{Q_s}_{\\prec v_j})$ for {vj} = Qs'\\Qs is computed with\nthe shooting algorithm and added to g(Qs) to obtain g(Qs'). The heuristic score h(Qs') can be\nprecomputed as LassoScore(vj|V \\vj) for all vj \u2208 V for a simple look-up during A* search.\n\n3.3 Heuristic Schemes for A* Lasso to Improve Scalability\n\nAlthough A* lasso substantially prunes the search space compared to DP, it is not sufficiently efficient\nfor large graphs, because it still considers a large number of states in the exponentially large\nsearch space. One simple strategy for further pruning the search space would be to limit the size of\nthe priority queue in the OPEN list, forcing A* lasso to discard less promising intermediate results\nfirst. In this case, limiting the queue size to one is equivalent to a greedy algorithm with a scoring\nfunction in Equation (4). 
In our experiments, we found that such a naive strategy substantially reduced\nthe quality of solutions because the best-scoring intermediate results tend to be the results at\nthe early stage of the exploration. They are at the shallow part of the search space near the start state\nbecause the admissible heuristic underestimates the true cost.\nInstead, given a limited queue size, we propose to distribute the intermediate results to be discarded\nacross different depths/layers of the search space. For example, given the depth of the search space\n|V |, if we need to discard k intermediate results, we discard k/|V | intermediate results at each depth.\nIn our experiments, we found that this heuristic scheme substantially improves the computation time\nof A* lasso with a small reduction in the quality of the solution. We also considered other strategies\nsuch as inflating heuristics [10] and pruning edges in preprocessing with lasso, but such strategies\nsubstantially reduced the quality of solutions.\n\nTable 1: Comparison of computation time of different methods\n\nDataset (Nodes) | DP | A* lasso | A* Qlimit 1000 | A* Qlimit 200 | A* Qlimit 100 | A* Qlimit 5 | L1MB | SBN\nDsep (6) | 0.20 (64) | 0.14 (15) | \u2013 (\u2013) | \u2013 (\u2013) | \u2013 (\u2013) | 0.17 (11) | 2.65 | 8.76\nAsia (8) | 1.07 (256) | 0.26 (34) | \u2013 (\u2013) | \u2013 (\u2013) | \u2013 (\u2013) | 0.22 (12) | 2.79 | 8.9\nBowling (9) | 2.42 (512) | 0.48 (94) | \u2013 (\u2013) | \u2013 (\u2013) | \u2013 (\u2013) | 0.23 (13) | 2.85 | 8.75\nInversetree (11) | 8.44 (2048) | 1.68 (410) | \u2013 (\u2013) | 1.8 (423) | 1.16 (248) | 0.2 (16) | 3.03 | 8.56\nRain (14) | 1216 (1.60e4) | 76.64 (2938) | 64.38 (1811) | 13.97 (461) | 7.88 (270) | 1.67 (17) | 12.26 | 10.19\nCloud (16) | 1.6e4 (6.6e4) | 137.36 (2660) | 108.39 (1945) | 26.16 (526) | 9.92 (244) | 2.14 (19) | 4.72 | 14.56\nFunnel (18) | 4.2e4 (2.6e5) | 1527.0 (2.3e4) | 88.87 (2310) | 25.19 (513) | 11.53 (248) | 2.73 (21) | 4.76 | 10.08\nGalaxy (20) | 1.3e5 (1.0e6) | 2.40e4 (8.2e4) | 110.05 (3093) | 27.59 (642) | 12.02 (323) | 3.03 (23) | 6.59 | 11.0\nFactor (27) | \u2013 (\u2013) | \u2013 (\u2013) | 1389.7 (3912) | 125.91 (801) | 59.92 (397) | 3.96 (30) | 9.04 | 13.91\nInsurance (27) | \u2013 (\u2013) | \u2013 (\u2013) | 2874.2 (3448) | 442.65 (720) | 202.9 (395) | 16.31 (33) | 10.96 | 29.45\nWater (32) | \u2013 (\u2013) | \u2013 (\u2013) | 2397.0 (3442) | 301.67 (687) | 130.71 (343) | 12.14 (38) | 32.73 | 14.96\nMildew (35) | \u2013 (\u2013) | \u2013 (\u2013) | 3928.8 (3737) | 802.76 (715) | 339.04 (368) | 29.3 (36) | 15.25 | 116.33\nAlarm (37) | \u2013 (\u2013) | \u2013 (\u2013) | 2732.3 (3426) | 384.87 (738) | 158.0 (378) | 12.42 (42) | 7.91 | 39.78\nBarley (48) | \u2013 (\u2013) | \u2013 (\u2013) | 10766.0 (4072) | 1869.4 (807) | 913.46 (430) | 109.14 (52) | 23.25 | 483.33\nHailfinder (56) | \u2013 (\u2013) | \u2013 (\u2013) | 9752.0 (3939) | 2580.5 (816) | 1058.3 (390) | 112.61 (57) | 44.36 | 826.41\n\nTable 2: A* lasso computation time under different edge strengths \u03b2j\u2019s\n\nDataset (Nodes) | (1.2,1.5) | (1,1.2) | (0.8,1)\nDsep (6) | 0.14 (15) | 0.14 (16) | 0.17 (30)\nAsia (8) | 0.26 (34) | 0.23 (37) | 0.29 (59)\nBowling (9) | 0.48 (94) | 0.49 (103) | 0.54 (128)\nInversetree (11) | 1.68 (410) | 2.09 (561) | 2.25 (620)\nRain (14) | 76.64 (2938) | 66.93 (2959) | 97.26 (4069)\nCloud (16) | 137.36 (2660) | 229.12 (7805) | 227.43 (8858)\nFunnel (18) | 1526.7 (22930) | 2060.2 (33271) | 3744.4 (40644)\nGalaxy (20) | 24040 (82132) | 66710 (168492) | 256490 (220821)\n\n4 Experiments\n\n4.1 Simulation Study\n\nWe perform simulation studies in order to evaluate the accuracy of the estimated structures and\nmeasure the computation time of our method. We created several small networks under 20 nodes and\nobtained the structure of several benchmark networks between 27 and 56 nodes from the Bayesian\nNetwork Repository (the left-most column in Table 1). In addition, we used the tiling technique [18]\nto generate two networks of approximately 300 nodes so that we could evaluate our method on\nlarger graphs. 
Given the Bayesian network structures, we set the parameters \u03b2j for each conditional\nprobability distribution of node vj such that \u03b2jk \u223c \u00b1Uniform[l, u] for predetermined values of l\nand u if node vk is a parent of node vj and \u03b2jk = 0 otherwise. We then generated data from each\nBayesian network by forward sampling with noise \u03f5 \u223c N (0, 1) in the regression model, given the\ntrue variable ordering. All data were mean-centered.\nWe compare our method to several other methods including DP with lasso for an exact method,\nL1MB for heuristic search, and SBN for an optimization-based approximate method. We downloaded\nthe software implementations of L1MB and SBN from the authors\u2019 websites. For L1MB,\nwe increased the number of evaluations in Stage 2 heuristic search from the authors\u2019 recommended 2500 to 10 000\nfor all networks except the two larger networks of around 300 nodes (Alarm 2 and Hailfinder\n2), where we used two different settings of 50 000 and 100 000 evaluations. We also evaluated A*\nlasso with the heuristic scheme with the queue sizes of 5, 100, 200, and 1000.\nDP, A* lasso, and A* lasso with a limited queue size require a selection of the regularization parameter\n\u03bb with cross-validation. In order to determine the optimal value for \u03bb, for different values\nof \u03bb, we trained a model on a training set, performed an ordinary least squares re-estimation of the\nnon-zero elements of \u03b2j to remove the bias introduced by the L1 penalty, and computed prediction\nerrors on the validation set. Then, we selected the value of \u03bb that gives the smallest prediction error\nas the optimal \u03bb. 
We used a training set of 200 samples for relatively small networks with under\n60 nodes and a training set of 500 samples for the two large networks with around 300 nodes. We\nused a validation set of 500 samples. For L1MB and SBN, we used a similar strategy to select the\nregularization parameters, while mainly following the strategy suggested by the authors and in their\nsoftware implementation.\n\nFigure 2: Precision/recall curves for the recovery of skeletons of benchmark Bayesian networks.\n\nFigure 3: Precision/recall curves for the recovery of v-structures of benchmark Bayesian networks.\n\nWe present the computation time for the different methods in Table 1. For DP, A* lasso, and A* lasso\nwith limited queue sizes, we also record the number of states visited in the search space in parentheses\nin Table 1. All methods were implemented in Matlab and were run on computers with 2.4\nGHz processors. We used a dataset generated from a true model with \u03b2jk \u223c \u00b1Uniform[1.2, 1.5].\nIt can be seen from Table 1 that DP considers all $2^{|V|}$ possible states in the search space, which grows\nexponentially with the number of nodes. It is clear that A* lasso visits significantly fewer states\nthan DP, visiting about 10% of the number of states in DP for the funnel and galaxy networks. We\nwere unable to obtain the computation time for A* lasso and DP for some of the larger graphs in\nTable 1 as they required significantly more time. Limiting the size of the queue in A* lasso reduces\nboth the computation time and the number of states visited. For smaller graphs, we do not report the\ncomputation time for A* lasso with limited queue size, since it is identical to the full A* lasso. 
We\nnotice that the computation time for A* lasso with a small queue of 5 or 100 is comparable to that\nof L1MB and SBN.\nIn general, we found that the extent of pruning of the search space by A* lasso compared to DP\ndepends on the strengths of edges (\u03b2j values) in the true model. We applied DP and A* lasso to\ndatasets of 200 samples generated from each of the networks under each of the three settings for the\ntrue edge strengths, \u00b1Uniform[1.2, 1.5], \u00b1Uniform[1, 1.2], and \u00b1Uniform[0.8, 1]. As can be\nseen from the computation time and the number of states visited by DP and A* lasso in Table 2, as\nthe strengths of edges increase, the number of states visited by A* lasso and the computation time\ntend to decrease. The results in Table 2 indicate that the efficiency of A* lasso is affected by the\nsignal-to-noise ratio.\n\n[Figures 2 and 3: one precision (y-axis) versus recall (x-axis) panel per network: Factors, Alarm, Barley, Hailfinder, Insurance, Mildew, Water, Alarm 2, and Hailfinder 2. Curves: L1MB, SBN, and A* lasso with queue limits 100, 200, and 1000; for Alarm 2 and Hailfinder 2, L1MB with 5e4 and 1e5 evaluations, SBN, and A* lasso with queue limits 5 and 100.]\n\nFigure 4: Prediction errors for benchmark Bayesian networks. The x-axis labels indicate different benchmark Bayesian networks for 1: Factors, 2: Alarm, 3: Barley, 4: Hailfinder, 5: Insurance, 6: Mildew, 7: Water, 8: Alarm 2, and 9: Hailfinder 2.\n\nIn order to evaluate the accuracy of the Bayesian network structures\nrecovered by each method, we make use of the fact that two\nBayesian network structures are indistinguishable if they belong to\nthe same equivalence class, where an equivalence class is defined\nas the set of networks with the same skeleton and v-structures. The\nskeleton of a Bayesian network is defined as the edge connectivities\nignoring edge directions and a v-structure is defined as the\nlocal graph structure over three variables, with two variables pointing\nto the other variable (i.e., A \u2192 B \u2190 C). We evaluated the\nperformance of the different methods by comparing the estimated\nnetwork structure with the true network structure in terms of skeleton\nand v-structures and computing the precision and recall.\nThe precision/recall curves for the skeleton and v-structures of\nthe models estimated by the different methods are shown in Figures\n2 and 3, respectively. Each curve was obtained as an average\nover the results from 30 different datasets for the two large graphs\n(Alarm 2 and Hailfinder 2) and from 50 different datasets for all\nthe other Bayesian networks. 
All data were simulated under the setting \u03b2jk \u223c \u00b1Uniform[0.4, 0.7]. For the benchmark Bayesian networks, we used A* lasso with different queue sizes, including 100, 200, and 1000, whereas for the two large networks (Alarm 2 and Hailfinder 2) that require more computation time, we used A* lasso with queue sizes of 5 and 100. As can be seen in Figures 2 and 3, all methods perform relatively well on identifying the true skeletons, but find it significantly more challenging to recover the true v-structures. We find that although increasing the size of queues in A* lasso generally improves the performance, even with smaller queue sizes, A* lasso outperforms L1MB and SBN in most of the networks. While A* lasso with a limited queue size performs consistently well on smaller networks, it significantly outperforms the other methods on the larger graphs such as Alarm 2 and Hailfinder 2, even with a queue size of 5 and even when the number of evaluations for L1MB has been increased to 50 000 and 100 000. This demonstrates that while limiting the queue size in A* lasso will not guarantee the optimality of the solution, it still reduces the computation time of A* lasso dramatically without substantially compromising the quality of the solution. In addition, we compare the performance of the different methods in terms of prediction errors on independent test datasets in Figure 4. We find that the prediction errors of A* lasso are consistently lower even with a limited queue size.\n4.2 Analysis of S&P Stock Data\nWe applied the methods to the daily stock price data of the S&P 500 companies to learn a Bayesian network that models the dependencies in prices among different stocks. 
We obtained the stock prices of 125 companies over 1500 time points between Jan 3, 2007 and Dec 17, 2012. We estimated a Bayesian network using the first 1000 time points with the different methods, and then computed prediction errors on the last 500 time points. For L1MB, we used two settings for the number of evaluations, 50 000 and 100 000. We applied A* lasso with different queue limits of 5, 100, and 200. The prediction accuracies for the various methods are shown in Figure 5. Our method obtains lower prediction errors than the other methods, even with the smaller queue sizes.\n\nFigure 5: Prediction errors for S&P stock price data.\n\n5 Conclusions\nIn this paper, we considered the problem of learning a Bayesian network structure and proposed A* lasso, which guarantees the optimality of the solution while reducing the computation time of the well-known exact methods based on DP. We proposed a simple heuristic scheme that further improves the computation time but does not significantly reduce the quality of the solution.\n\nAcknowledgments\n\nThis material is based upon work supported by an NSF CAREER Award No. MCB-1149885, Sloan Research Fellowship, and Okawa Foundation Research Grant to SK and by an NSERC PGS-D to JX.\n\nReferences\n[1] David Maxwell Chickering. Learning Bayesian networks is NP-complete. In Learning from data, pages 121\u2013130. Springer, 1996.\n\n[2] Nir Friedman, Iftach Nachman, and Dana Pe'er. Learning Bayesian network structure from massive datasets: the \u201cSparse Candidate\u201d algorithm. In Proceedings of the Fifteenth conference on Uncertainty in Artificial Intelligence, pages 206\u2013215. Morgan Kaufmann Publishers Inc., 1999.\n\n[3] Wenjiang J Fu. 
Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397\u2013416, 1998.\n\n[4] David Heckerman, Dan Geiger, and David M Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197\u2013243, 1995.\n\n[5] Shuai Huang, Jing Li, Jieping Ye, Adam Fleisher, Kewei Chen, Teresa Wu, and Eric Reiman. A sparse structure learning algorithm for Gaussian Bayesian network identification from high-dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(6):1328\u20131342, 2013.\n\n[6] Tommi Jaakkola, David Sontag, Amir Globerson, and Marina Meila. Learning Bayesian network structure using LP relaxations. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.\n\n[7] Mikko Koivisto and Kismat Sood. Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, 5:549\u2013573, 2004.\n\n[8] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT Press, 2009.\n\n[9] Wai Lam and Fahiem Bacchus. Learning Bayesian belief networks: An approach based on the MDL principle. Computational Intelligence, 10(3):269\u2013293, 1994.\n\n[10] Maxim Likhachev, Geoff Gordon, and Sebastian Thrun. ARA*: Anytime A* with provable bounds on sub-optimality. Advances in Neural Information Processing Systems (NIPS), 16, 2003.\n\n[11] Jean-Philippe Pellet and Andr\u00e9 Elisseeff. Using Markov blankets for causal structure learning. The Journal of Machine Learning Research, 9:1295\u20131342, 2008.\n\n[12] Stuart Jonathan Russell, Peter Norvig, John F Canny, Jitendra M Malik, and Douglas D Edwards. Artificial intelligence: a modern approach, volume 74. Prentice Hall Englewood Cliffs, 1995.\n\n[13] Mark Schmidt, Alexandru Niculescu-Mizil, and Kevin Murphy. 
Learning graphical model structure using L1-regularization paths. In Proceedings of the National Conference on Artificial Intelligence, volume 22, page 1278, 2007.\n\n[14] Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461\u2013464, 1978.\n\n[15] Ajit Singh and Andrew Moore. Finding optimal Bayesian networks by dynamic programming. Technical Report 05-106, School of Computer Science, Carnegie Mellon University, 2005.\n\n[16] Marc Teyssier and Daphne Koller. Ordering-based search: A simple and effective algorithm for learning Bayesian networks. In Proceedings of the Twentieth conference on Uncertainty in Artificial Intelligence, pages 584\u2013590, 2005.\n\n[17] Ioannis Tsamardinos, Laura E Brown, and Constantin F Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1):31\u201378, 2006.\n\n[18] Ioannis Tsamardinos, Alexander Statnikov, Laura E Brown, and Constantin F Aliferis. Generating realistic large Bayesian networks by tiling. In the Nineteenth International FLAIRS conference, pages 592\u2013597, 2006.\n\n[19] Changhe Yuan, Brandon Malone, and Xiaojian Wu. Learning optimal Bayesian networks using A* search. In Proceedings of the Twenty-Second international joint conference on Artificial Intelligence, pages 2186\u20132191. AAAI Press, 2011.\n", "award": [], "sourceid": 1142, "authors": [{"given_name": "Jing", "family_name": "Xiang", "institution": "CMU"}, {"given_name": "Seyoung", "family_name": "Kim", "institution": "CMU"}]}