{"title": "Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 100, "page_last": 108, "abstract": "This paper presents four major results towards solving decentralized partially observable Markov decision problems (DecPOMDPs) culminating in an algorithm that outperforms all existing algorithms on all but one standard infinite-horizon benchmark problems. (1) We give an integer program that solves collaborative Bayesian games (CBGs). The program is notable because its linear relaxation is very often integral. (2) We show that a DecPOMDP with bounded belief can be converted to a POMDP (albeit with actions exponential in the number of beliefs). These actions correspond to strategies of a CBG. (3) We present a method to transform any DecPOMDP into a DecPOMDP with bounded beliefs (the number of beliefs is a free parameter) using optimal (not lossless) belief compression. (4) We show that the combination of these results opens the door for new classes of DecPOMDP algorithms based on previous POMDP algorithms. We choose one such algorithm, point-based valued iteration, and modify it to produce the first tractable value iteration method for DecPOMDPs which outperforms existing algorithms.", "full_text": "Point Based Value Iteration with Optimal Belief\n\nCompression for Dec-POMDPs\n\nLiam MacDermed\nCollege of Computing\n\nGeorgia Institute of Technology\n\nAtlanta, GA 30332\n\nliam@cc.gatech.edu\n\nCharles L. Isbell\n\nCollege of Computing\n\nGeorgia Institute of Technology\n\nAtlanta, GA 30332\n\nisbell@cc.gatech.edu\n\nAbstract\n\nWe present four major results towards solving decentralized partially observable\nMarkov decision problems (DecPOMDPs) culminating in an algorithm that out-\nperforms all existing algorithms on all but one standard in\ufb01nite-horizon bench-\nmark problems. (1) We give an integer program that solves collaborative Bayesian\ngames (CBGs). 
The program is notable because its linear relaxation is very often integral. (2) We show that a DecPOMDP with bounded belief can be converted to a POMDP (albeit with actions exponential in the number of beliefs). These actions correspond to strategies of a CBG. (3) We present a method to transform any DecPOMDP into a DecPOMDP with bounded beliefs (the number of beliefs is a free parameter) using optimal (not lossless) belief compression. (4) We show that the combination of these results opens the door for new classes of DecPOMDP algorithms based on previous POMDP algorithms. We choose one such algorithm, point-based value iteration, and modify it to produce the first tractable value iteration method for DecPOMDPs that outperforms existing algorithms.

1 Introduction

Decentralized partially observable Markov decision processes (DecPOMDPs) are a popular model for cooperative multi-agent decision problems; however, they are NEXP-complete to solve [15]. Unlike single-agent POMDPs, DecPOMDPs suffer from a doubly-exponential curse of history [16]. Not only do agents have to reason about the observations they see, but also about the possible observations of other agents. This causes agents to view their world as non-Markovian: even if an agent returns to the same underlying state of the world, the dynamics of the world may appear to change because other agents hold different beliefs and take different actions. Also, for POMDPs, a sufficient belief space is the set of probability distributions over possible states. In the case of DecPOMDPs an agent must reason about the beliefs of other agents (who are recursively reasoning about beliefs as well), leading to nested beliefs that can make it impossible to losslessly reduce an agent's knowledge to less than its full observation history.

This lack of a compact belief-space has prevented value-based dynamic programming methods from being used to solve DecPOMDPs. 
While value methods have been quite successful at solving POMDPs, all current DecPOMDP approaches are policy-based methods, where policies are sequentially improved and evaluated at each iteration. Even using policy methods, the curse of history remains a serious problem, and current methods deal with it in a number of different ways. The approach of [5] simply removes beliefs with low probability. Some use heuristics to prune (or never explore) particular belief regions [19, 17, 12, 11]. Other approaches merge beliefs together (i.e., belief compression) [5, 6]. This can sometimes be done losslessly [13], but such methods have limited applicability and still usually result in an exponential explosion of beliefs. There have also been approaches that attempt to operate directly on the infinitely nested belief structure [4], but these are approximations of unknown accuracy (if we stop at the nth nested belief, the (n+1)th could dramatically change the outcome). All of these approaches have achieved reasonable empirical results in a few limited domains but ultimately scale and generalize poorly.

Our solution to the curse of history is simple: assume that it doesn't exist, or more precisely, that the number of possible beliefs at any point in time is bounded. While simple, the consequences of this assumption turn out to be quite powerful: a bounded-belief DecPOMDP can be converted into an equivalent POMDP. This conversion is accomplished by viewing the problem as a sequence of cooperative Bayesian Games (CBGs). While this view is well established, our use of it is novel. We give an efficient method for solving these CBGs and show that any DecPOMDP can be accurately approximated by a DecPOMDP with bounded beliefs. These results enable us to utilize existing POMDP algorithms, which we explore by modifying the PERSEUS algorithm [18]. 
Our resulting algorithm is the first true value-iteration algorithm for DecPOMDPs (where no policy information need be retained from iteration to iteration) and outperforms existing algorithms.

2 DecPOMDPs as a sequence of cooperative Bayesian Games

Many current approaches for solving DecPOMDPs view the decision problem faced by agents as a sequence of CBGs [5]. This view arises from first noting that a complete policy must prescribe an action for every belief-state of an agent. This can take the form of a strategy (a mapping from belief to action) for each time-step; therefore at each time-step agents must choose strategies such that the expectation over their joint-actions and beliefs maximizes the sum of their immediate reward combined with the utility of their continuation policy. This decision problem is equivalent to a CBG.

Formally, we define a DecPOMDP as the tuple ⟨N, A, S, O, P, R, s(0)⟩ where: N is the set of n players. A = ∏_{i=1}^n A_i is the set of joint-actions. S is the set of states. O = ∏_{i=1}^n O_i is the set of joint-observations. P : S × A → ∆(S × O) is the probability transition function, with P(s′, o⃗ | s, a⃗) being the probability of ending up in state s′ with observations o⃗ after taking joint-action a⃗ in state s. R : S × A → R is the shared reward function. And s(0) ∈ ∆(S) is the initial state distribution.

The CBG consists of a common-knowledge distribution over all possible joint-beliefs along with a reward for each joint-action/belief. Naively, each observation history corresponds to a belief. If beliefs are more compact (i.e., through belief compression), then multiple histories can correspond to the same belief. 
The joint-belief distribution is commonly known because it depends only on the initial state and the players' policies, which are both commonly known (due to common planning or rationality). These beliefs are the types of the Bayesian game. We can compute the current common-knowledge distribution (without belief compression) recursively: the probability of joint-type θ^{t+1} = ⟨o⃗, θ^t⟩ at time t + 1 is given by τ^{t+1}(⟨o⃗, θ^t⟩) = Σ_{s^{t+1}} Pr[s^{t+1}, ⟨o⃗, θ^t⟩], where:

    Pr[s^{t+1}, ⟨o⃗, θ^t⟩] = Σ_{θ^t} Σ_{s^t} P(s^{t+1}, o⃗ | s^t, π_{θ^t}) Pr[s^t, θ^t]    (1)

The actions of the Bayesian game are the same as in the DecPOMDP. The rewards of the Bayesian game are ideally the immediate reward R = Σ_{s^t} Σ_{θ^t} R(s^t, π_{θ^t}) Pr[s^t, θ^t] along with the utility of the best continuation policy. However, knowing the best utility is tantamount to solving the problem; instead, an estimate can be used. Current approaches estimate the value of the DecPOMDP as if it were an MDP [19], a POMDP [17], or a problem with delayed communication [12]. In each case, the solution to the Bayesian game is used as a heuristic to guide policy search.

3 An Integer Program for Solving Collaborative Bayesian Games

Many DecPOMDP algorithms use the idea that a DecPOMDP can be viewed as a sequence of CBGs to divide policy optimization into smaller sub-problems; however, solving CBGs themselves is NP-complete [15]. Previous approaches have solved Bayesian games by enumerating all strategies, by iterated best response (which is only locally optimal), or by branch and bound search [10]. Here we present a novel integer linear program that solves for an optimal pure-strategy Bayes-Nash equilibrium (which always exists for games with common payoffs). 
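Before turning to the program itself, the common-knowledge update of equation (1) is straightforward to implement. The sketch below is illustrative only (state, type, and observation names and all probabilities are invented, not from the paper): it propagates a distribution over (state, joint-type) pairs one step forward under a fixed strategy profile π that maps each joint-type to a joint-action.

```python
from collections import defaultdict

def update_common_knowledge(prior, policy, P):
    """One step of equation (1): from Pr[s^t, theta^t] to Pr[s^{t+1}, <o, theta^t>].

    prior:  dict (s, theta) -> probability
    policy: dict theta -> joint-action (the strategy profile pi_theta)
    P:      dict (s, joint-action) -> dict (s', joint-obs) -> probability
    """
    posterior = defaultdict(float)
    for (s, theta), p in prior.items():
        a = policy[theta]
        for (s2, o), q in P[(s, a)].items():
            # the new joint-type is the pair <observation, previous type>
            posterior[(s2, (o, theta))] += q * p
    return dict(posterior)

# Tiny invented example: one state, one type, a coin-flip observation.
prior = {("s0", "t0"): 1.0}
policy = {"t0": "a0"}
P = {("s0", "a0"): {("s0", "heads"): 0.5, ("s0", "tails"): 0.5}}
print(update_common_knowledge(prior, policy, P))
# -> {('s0', ('heads', 't0')): 0.5, ('s0', ('tails', 't0')): 0.5}
```

Each resulting joint-type ⟨o⃗, θ^t⟩ becomes a type of the next stage's Bayesian game, exactly as described above.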
While integer programming is still NP-complete, our formulation has a huge advantage: the linear relaxation is itself a correlated communication equilibrium [7], and it is empirically very often integral (above 98% of the time in our experiments in section 7). This allows us to solve our Bayesian games optimally and very efficiently.

Our integer linear program for a Bayesian game ⟨N, A, Θ, τ, R⟩ optimizes over Boolean variables x_{a⃗,θ}, one for each joint-action for each joint-type. Each variable represents the probability of joint-action a⃗ being taken if the agents' types are θ. Constraints must be imposed on these variables to ensure that they form a proper probability distribution and that, from each agent's perspective, its action is conditionally independent of other agents' types. These restrictions can be expressed by the following linear constraints:

    for each θ ∈ Θ:  Σ_{a⃗∈A} x_{a⃗,θ} = 1,  and for each θ ∈ Θ, a⃗ ∈ A:  x_{a⃗,θ} ≥ 0;
    for each agent i, joint-type θ, and partial joint-action of the other agents a⃗_{−i}:  Σ_{a_i∈A_i} x_{a⃗,θ} = x_{⟨a⃗_{−i},θ_{−i}⟩}    (2)

In order to make the description of the conditional independence constraints more concise, we use the additional variables x_{⟨a⃗_{−i},θ_{−i}⟩}; these can be immediately substituted out. They represent the posterior probability that agent i, after becoming type θ_i, assigns to the other agents taking actions a⃗_{−i} when they have types θ_{−i}. These constraints enforce that an agent's posterior probabilities are unaffected by the other agents' observations. 
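For intuition about what this program computes, a tiny common-payoff Bayesian game can be solved by brute force over pure strategy profiles; the integer program above replaces exactly this exponential enumeration with a polynomial-size formulation. A minimal sketch with an invented two-agent game (all names and payoffs are illustrative, not from the paper):

```python
import itertools

def solve_cbg_brute_force(types, actions, tau, R):
    """Optimal pure-strategy profile for a 2-agent common-payoff Bayesian game.

    types:   (types_1, types_2); actions: (actions_1, actions_2)
    tau:     dict joint-type -> probability (common-knowledge prior)
    R:       dict (joint-type, joint-action) -> common reward
    Returns (best_value, best_profile), where each agent's strategy is a
    tuple giving one action per type.
    """
    t1, t2 = types
    a1, a2 = actions
    best_value, best_profile = float("-inf"), None
    # one strategy per agent = one action chosen for each of its types
    for s1 in itertools.product(a1, repeat=len(t1)):
        for s2 in itertools.product(a2, repeat=len(t2)):
            value = sum(
                tau[(x, y)] * R[((x, y), (s1[i], s2[j]))]
                for i, x in enumerate(t1)
                for j, y in enumerate(t2)
            )
            if value > best_value:
                best_value, best_profile = value, (s1, s2)
    return best_value, best_profile

# Invented coordination game: the agents are rewarded for choosing the
# same action, regardless of agent 1's private type.
types = (("L", "R"), ("*",))          # agent 2 has a single uninformative type
actions = (("l", "r"), ("l", "r"))
tau = {("L", "*"): 0.5, ("R", "*"): 0.5}
R = {((x, "*"), (u, v)): (1.0 if u == v else 0.0)
     for x in ("L", "R") for u in ("l", "r") for v in ("l", "r")}
value, profile = solve_cbg_brute_force(types, actions, tau, R)
print(value, profile)   # -> 1.0 (('l', 'l'), ('l',))
```

An integral solution of the ILP corresponds to exactly one such profile (x_{a⃗,θ} = 1 on the chosen joint-action for each joint-type), which is why its optimum is a pure-strategy BNE.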
Any feasible assignment of the variables x_{a⃗,θ} represents a valid agent-normal-form correlated equilibrium (ANFCE) strategy for the agents, and any integral solution is a valid pure-strategy BNE. To find the optimal solution for a game with distribution over types τ ∈ ∆(Θ) and rewards R : Θ × A → R^n we solve the integer program: maximize Σ_{θ,a⃗} τ_θ R(θ, a⃗) x_{a⃗,θ} over variables x_{a⃗,θ} ∈ {0, 1} subject to constraints (2).

An ANFCE generalizes Bayes-Nash equilibria: a pure-strategy ANFCE is a BNE. We can view an ANFCE as having a mediator to which each agent reports its type and from which it receives an action recommendation. An ANFCE is then a probability distribution across joint types/actions such that agents want neither to lie to the mediator nor to deviate from the mediator's recommendation. More importantly, they cannot deduce any information about other agents' types from the mediator's recommendation. We cannot use an ANFCE directly, because it requires communication; however, a deterministic (i.e., integral) ANFCE requires no communication and is a BNE.

4 Bounded Belief DecPOMDPs

Here we show that we can convert a bounded belief DecPOMDP (BB-DecPOMDP) into an equivalent POMDP (that we call the belief-POMDP). A BB-DecPOMDP is a DecPOMDP where each agent i has a fixed upper bound |Θ_i| on the number of beliefs at each time-step. The belief-POMDP's states are factored, containing each agent's belief along with the DecPOMDP's state. The POMDP's actions are joint-strategies. Recently, Dibangoye et al. [3] showed that a finite horizon DecPOMDP can be converted into a finite horizon POMDP where a probability distribution over histories is a sufficient statistic that can be used as the POMDP's state. 
We extend this result to infinite horizon problems when beliefs are bounded (note that a finite horizon problem always has bounded belief). The main insight here is that we do not have to remember histories, only a distribution over belief-labels (without any a priori connection to the belief itself) as a sufficient statistic. As such, the same POMDP states can be used for all time-steps, enabling infinite horizon problems to be solved.

In order to create the belief-POMDP we first transform observations so that they correspond one-to-one with beliefs for each agent. This can be achieved naively by folding the previous belief into the new observation, so that each agent receives a [previous-belief, observation] pair; however, because an agent has at most |Θ_i| beliefs, we can partition these histories into at most |Θ_i| information-equivalent groups. Each group corresponds to a distinct belief, and instead of the [previous-belief, observation] pair we only need to provide the new belief's label.

Second, we factor our state to include each agent's observation (now a belief-label) along with the original underlying state. This transformation increases the state space polynomially. Third, recall that a belief is the sum of information that an agent uses to make decisions. If agents know each other's policies (e.g., by constructing a distributed policy together), then our modified state (which includes beliefs for each agent) fully determines the dynamics of the system. States now appear Markovian again. Therefore, a probability distribution across states is once again a sufficient plan-time statistic (as proven in [3] and [9]). 
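The key cost of this conversion is its action space: a belief-POMDP action assigns one primitive action to every belief label of every agent, so there are ∏_i |A_i|^{|Θ_i|} joint strategies. A small illustrative sketch (all names invented; this merely enumerates the mappings described in the text):

```python
import itertools

def belief_pomdp_actions(agent_actions, agent_beliefs):
    """Enumerate belief-POMDP actions: one joint strategy is a mapping from
    each agent's belief label to one of that agent's primitive actions."""
    per_agent = []
    for acts, beliefs in zip(agent_actions, agent_beliefs):
        # one strategy per agent: a dict belief-label -> action
        per_agent.append([dict(zip(beliefs, choice))
                          for choice in itertools.product(acts, repeat=len(beliefs))])
    return [tuple(joint) for joint in itertools.product(*per_agent)]

# Two agents, 2 primitive actions each, 3 belief labels each:
acts = (("a0", "a1"), ("a0", "a1"))
beliefs = ((0, 1, 2), (0, 1, 2))
joint_strategies = belief_pomdp_actions(acts, beliefs)
print(len(joint_strategies))   # -> 64, i.e. 2**3 * 2**3 belief-POMDP actions
```

This exponential blowup is exactly why, later in the paper, action maximization is replaced by the integer program of section 3 rather than by enumeration.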
This distribution exactly corresponds to the Bayesian prior (after receiving the belief-observation) of the common-knowledge distribution of the current Bayesian game being played, as given by equation (1).

Finally, it is important to note that beliefs do not directly affect rewards or transitions. They therefore have no meaning beyond the prior distribution they induce, so we can freely relabel and reorder beliefs without changing the decision problem. This allows belief-observations in one time-step to use the same observation labels in the next time-step, even if the beliefs are different (in which case the distribution will be different). We can use this fact to fold our mapping from histories to belief-labels into the belief-POMDP's transition function.

We now formally define the belief-POMDP ⟨A′, S′, O′, P′, R′, s′(0)⟩ converted from BB-DecPOMDP ⟨N, A, S, O, P, R, s(0)⟩ (with belief labels Θ_i for each agent). The belief-POMDP has factored states ⟨ω, θ_1, · · · , θ_n⟩ ∈ S′, where ω ∈ S is the underlying state and θ_i ∈ Θ_i is agent i's belief. O′ = {} (no observations). A′ = ∏_{i=1}^n ∏_{j=1}^{|Θ_i|} A_i is the set of actions (one action for each agent for each belief); equivalently, A′ = {Θ → A}. P′(s′|s, a) = Σ_{[θ,o]=θ′} P(ω′, o | ω, ⟨a_{θ_1}, · · · , a_{θ_n}⟩) (a sum over equivalent joint-beliefs), where a_{θ_i} is the action agent i would take if holding belief θ_i. R′(s, a) = R(ω, ⟨a_{θ_1}, · · · , a_{θ_n}⟩). And s′(0)_ω = s(0) is the initial state distribution, with each agent having the same belief.

Actions in this belief-POMDP are pure strategies for each agent, specifying what each agent should do for every belief it might have; in other words, a mapping from observation to action. The action space thus has size ∏_i |A_i|^{|Θ_i|}, which is exponentially more actions than the number of joint-actions in the BB-DecPOMDP. Both the transition and reward functions use the modified joint-action ⟨a_{θ_1}, · · · , a_{θ_n}⟩, which is the action that would be taken once agents see their beliefs and follow action a ∈ A′. This makes the single agent in the belief-POMDP act like a centralized mediator playing the sequence of Bayesian games induced by the BB-DecPOMDP. At every time-step this centralized mediator must give a strategy to each agent (a solution to the current Bayesian game). The mediator only knows what is commonly known and thus receives no observations.

This belief-POMDP is decision equivalent to the BB-DecPOMDP. The two models induce the same sequence of CBGs; therefore, there is a natural one-to-one mapping between policies of the two models that yields identical utilities. We show this constructively by providing the mapping:

Lemma 4.1. 
Given BB-DecPOMDP ⟨N, A, O, Θ, S, P, R, s(0)⟩ with policy π : ∆(S) × O → A and belief-POMDP ⟨A′, S′, O′, P′, R′, s′(0)⟩ as defined above with policy π′ : ∆(S′) → {O′ → A′}, if π(s(t))_{i,o} = π′_i(s(t), o), then V_π(s(0)) = V′_{π′}(s′(0)) (the expected utilities of the two policies are equal).

We have shown that BB-DecPOMDPs can be turned into POMDPs, but this does not mean that we can easily solve these POMDPs using existing methods. The action-space of the belief-POMDP is exponential with respect to the number of observations of the BB-DecPOMDP. Most existing POMDP algorithms assume that actions can be enumerated efficiently, which isn't possible beyond the simplest belief-POMDPs. One notable exception is an extension to PERSEUS that randomly samples actions [18]. This approach works for some domains; however, for decentralized problems often only one particular strategy proves effective, making a randomized approach less useful. Luckily, the optimal strategy is the solution to a CBG, which we have already shown how to solve. We can then use existing POMDP algorithms and replace action maximization with an integer linear program. We show how to do this for PERSEUS below. First, we make this result more useful by giving a method to convert a DecPOMDP into a BB-DecPOMDP.

5 Optimal belief compression

We present here a novel and optimal belief compression method that transforms any DecPOMDP into a BB-DecPOMDP. The idea is to let agents themselves decide how they want to merge their beliefs, and to add this decision directly into the problem's structure. This pushes the onus of belief compression onto the BB-DecPOMDP solver instead of an explicit approximation method. 
We give agents the ability to optimally compress their own beliefs by interleaving each normal time-step (where we fully expand each belief) with a compression time-step (where the agents must explicitly decide how to best merge beliefs). We call these phases belief expansion and belief compression, respectively.

The first phase acts like the original DecPOMDP without any belief compression: the observation given to each agent is its previous belief along with the DecPOMDP's observation. No information is lost during this phase; each observation for each agent-type (agents holding the same belief are the same type) results in a distinct belief. This belief expansion occurs with the same transitions and rewards as the original DecPOMDP.

The dynamics of the second phase are unrelated to the DecPOMDP. Instead, an agent's actions are decisions about how to compress its belief. In this phase, each agent-type must choose its next belief, but it has only a fixed number of beliefs to choose from (the number of beliefs t_i is a free parameter). All agent-types that choose the same belief will be unable to distinguish themselves in the next time-step; the belief label in the next time-step will equal the action index taken in the belief compression phase. All rewards are zero. 
This second phase can be seen as a purely mental phase and does not affect the environment beyond changes to beliefs, although as a technical matter we convert our discount factor to its square root to account for these new interleaved states.

Given a DecPOMDP ⟨N, A, S, O, P, R, s(0)⟩ (with states ω ∈ S and observations σ ∈ O), we formally define the BB-DecPOMDP approximation model ⟨N′, A′, O′, S′, P′, R′, s′(0)⟩ with belief set size parameters t_1, · · · , t_n as:

• N′ = N and A′_i = {a_1, · · · , a_max(|A_i|, t_i)}
• O′_i = {O_i, ∅} × {1, 2, · · · , t_i} with factored observation o_i = ⟨σ_i, θ_i⟩ ∈ O′_i
• S′ = S × O′ with factored state s = ⟨ω, σ_1, θ_1, · · · , σ_n, θ_n⟩ ∈ S′
• P′(s′|s, a) = P(ω′, ⟨σ′_1, · · · , σ′_n⟩ | ω, a) if ∀i : σ_i = ∅, σ′_i ≠ ∅ and θ′_i = θ_i;  1 if ∀i : σ_i ≠ ∅, σ′_i = ∅ and θ′_i = a_i, ω′ = ω;  0 otherwise
• R′(s, a) = R(ω, a) if ∀i : σ_i = ∅;  0 otherwise
• s′(0) = ⟨s(0), ∅_1, 1, · · · , ∅_n, 1⟩ is the initial state distribution

We have constructed the BB-DecPOMDP such that at each time-step agents receive two observation factors: an observation factor σ and their belief factor θ (i.e., type). The observation factor is ∅ at the expansion phase and is the most recent observation, as given by the DecPOMDP, when starting the compression phase. 
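Two details of this construction can be sketched in a few lines: the per-phase discount becomes √γ, so that an expansion step followed by a compression step discounts exactly like one original step, and the compression phase deterministically merges all agent-types that choose the same label. The sketch below is illustrative (labels and probabilities invented, not from the paper):

```python
import math

def interleaved_discount(gamma):
    """Per-phase discount so that expansion + compression = one original step."""
    return math.sqrt(gamma)

def compress_beliefs(dist, choices):
    """Compression phase: each agent-type deterministically adopts the belief
    label named by its chosen action; types picking the same label merge.

    dist:    dict old-belief-label -> probability
    choices: dict old-belief-label -> new label in {1, ..., t_i}
    """
    merged = {}
    for theta, p in dist.items():
        label = choices[theta]
        merged[label] = merged.get(label, 0.0) + p
    return merged

g = interleaved_discount(0.9)
print(round(g * g, 10))                            # two phases discount like 0.9
print(compress_beliefs({"h1": 0.3, "h2": 0.2, "h3": 0.5},
                       {"h1": 1, "h2": 1, "h3": 2}))   # -> {1: 0.5, 2: 0.5}
```

Here types "h1" and "h2" become indistinguishable after compression, which is exactly the information loss the solver is free to optimize.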
The observation factor therefore distinguishes which phase the model is currently in. Agents either all have σ = ∅ or none of them do; the probability of transitioning to a state where some agents have the empty-set observation while others do not is always zero. Note that transitions during the compression phase are deterministic (probability one) and the underlying state ω does not change. The action set sizes may differ between the two phases, but we can easily get around this problem by mapping any action outside of the designated actions to an equivalent one inside the designated set. The new BB-DecPOMDP's state size is |S′| = |S|(|O| + 1)^n t^n.

6 Point-based value iteration for BB-DecPOMDPs

We have now shown that a DecPOMDP can be approximated by a BB-DecPOMDP (using optimal belief compression) and that this BB-DecPOMDP can be converted into a belief-POMDP where selecting an optimal action is equivalent to solving a collaborative Bayesian game (CBG). We have given a (relatively) efficient integer linear program for solving these CBGs. The combination of these three results opens the door for new classes of DecPOMDP algorithms based on previous POMDP algorithms. The only difference between existing POMDP algorithms and one tailored for BB-DecPOMDPs is that instead of maximizing over actions (which are exponential in the belief-POMDP), we must solve a stage-game CBG equivalent to the stage decision problem.

Here, we develop an algorithm for DecPOMDPs based on the PERSEUS algorithm [18] for POMDPs, a specific version of point-based value iteration (PBVI) [16]. Our value function representation is a standard convex and piecewise-linear value-vector representation over the belief simplex. This is the same representation that PERSEUS and most other value-based POMDP algorithms use. 
It consists of a set of hyperplanes Γ = {α_1, α_2, · · · , α_m} where α_i ∈ R^{|S|}. These hyperplanes each represent the value of a particular policy across beliefs; the value function is the maximum over all hyperplanes. For a belief b ∈ R^{|S|}, its value as given by the value function Γ is V_Γ(b) = max_{α∈Γ} α · b. Such a representation acts as both a value function and an implicit policy. While each α-vector corresponds to the value achieved by following an unspecified policy, we can reconstruct that policy by computing the best one-step strategy, computing the successor state, and repeating the process.

The high-level outline of our point-based algorithm is the same as PERSEUS. First, we sample common-knowledge beliefs and collect them into a belief set B. This is done by taking random actions from a given starting belief and recording the resulting belief states. We then start with a poor approximation of the value function and improve it over successive iterations by performing a one-step backup for each belief b ∈ B. Each backup produces a policy, which yields a value-vector to improve the value function. PERSEUS improves standard PBVI by skipping, during each iteration, beliefs already improved by another backup; this reduces the number of backups needed. In order to operate on belief-POMDPs, we replace PERSEUS' backup operation with one that uses our integer program.

In order to back up a particular belief point, we must maximize the utility of a strategy x. The utility is computed using the immediate reward combined with our value function's current estimate of a chosen continuation policy that has value vector α. Thus, a resulting belief b′ will achieve the estimated value Σ_s b′(s)α(s). The resulting belief b′ after taking action a⃗ from belief b is b′(s′) = Σ_s b(s)P(s′|s, a⃗). Putting these together, along with the probabilities x_{a⃗,s} of taking action a⃗ in state s, we get the value of a strategy x from belief b followed by continuation utility α:

    V_{x,α}(b) = Σ_{s∈S} b(s) Σ_{a⃗∈A} Σ_{s′∈S} P(s′|s, a⃗) α(s′) x_{a⃗,s}    (3)

This is the quantity that we wish to maximize; combining it with constraints (2) gives an integer linear program that returns the best action for each agent for each observation (a strategy) given a continuation policy α. To find the best strategy/continuation-policy pair, we perform this search over all continuation vectors in Γ:

    Maximize: equation (3)
    Over: x ∈ {0, 1}^{|S||A|}, α ∈ Γ
    Subject to: inequalities (2)    (4)

Each integer program has |S||A| variables, but for underlying state factors (nature's type) there are only |Θ||A| linearly independent constraints, one for each unobserved factor and joint-action. Therefore the number of unobserved states does not increase the number of free variables; taking this into account, the number of free variables is O(|O||A|). The optimization problem given above requires searching over all α ∈ Γ to find the best continuation value-vector. We could solve this problem as one large linear program with a different set of variables for each α; however, each set of variables would be independent, so it is faster to solve them as separate individual problems.

We initialize our starting value function Γ_0 to have a single low conservative value vector (such as ⟨R_min/γ, · · · , R_min/γ⟩). Every iteration then attempts to improve the value function at each belief in our belief set B. 
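The representation and the backup objective can be sketched directly. In the illustrative code below (function names, states, and all numbers are invented), `value_at` is the piecewise-linear maximum V_Γ(b) = max_α α · b, and `strategy_value` evaluates equation (3) for a fixed deterministic strategy x; as in the displayed form of equation (3), the immediate-reward term is not shown.

```python
def value_at(Gamma, b):
    """V_Gamma(b) = max over alpha in Gamma of alpha . b (convex, piecewise-linear)."""
    return max(sum(a_s * b_s for a_s, b_s in zip(alpha, b)) for alpha in Gamma)

def strategy_value(b, x, P, alpha, states, joint_actions):
    """Equation (3): value of strategy x from belief b with continuation vector alpha.

    b[s]: belief; x[(a, s)]: probability the strategy plays joint-action a in
    state s; P[(s2, s, a)]: transition probability; alpha[s2]: continuation value.
    """
    return sum(b[s] * x[(a, s)] * P[(s2, s, a)] * alpha[s2]
               for s in states for a in joint_actions for s2 in states)

# Invented two-state example.
states, joint_actions = (0, 1), ("a",)
Gamma = [(1.0, 0.0), (0.0, 1.0)]                   # two alpha-vectors
b = (0.5, 0.5)
x = {("a", 0): 1.0, ("a", 1): 1.0}                 # always play joint-action "a"
P = {(s2, s, "a"): 1.0 if s2 == s else 0.0 for s in states for s2 in states}
alpha = (2.0, 4.0)
print(value_at(Gamma, b))                                      # -> 0.5
print(strategy_value(b, x, P, alpha, states, joint_actions))   # -> 3.0
```

In the full algorithm, x is not fixed but is the variable of the integer program (4), and the search additionally ranges over every α ∈ Γ.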
A random common-knowledge belief b ∈ B is selected, and we compute an improved policy for that belief by performing a one-step backup. This backup involves finding the best immediate strategy-profile (an action for each observation of each agent) at belief b along with the best continuation policy from Γ_t. We then compute the value of the resulting strategy plus continuation policy (which is itself a policy) and insert this new α-vector into Γ_{t+1}. Any belief that is improved by α (including b) is removed from B. We then select a new common-knowledge belief and iterate until every belief in B has been improved. We give this as Algorithm 1.

Algorithm 1  The modified point-based value iteration for DecPOMDPs
Inputs: DecPOMDP M, discount γ, belief bounds |Θ_i|, stopping criterion εΓ
Output: value function Γ
1:  ⟨N, A, O, S, P, R, s(0)⟩ ⇐ BB-DecPOMDP approximation of M as described in section 5
2:  B_∀ ⇐ sampling of states using a random walk from s(0)
3:  Γ′ ⇐ {⟨R_min/γ, · · · , R_min/γ⟩}
4:  repeat
5:    B ⇐ B_∀; Γ ⇐ Γ′; Γ′ ⇐ ∅
6:    while B ≠ ∅ do
7:      b ⇐ Rand(b ∈ B)
8:      α ⇐ Γ(b)
9:      α′ ⇐ optimal point of integer program (4)
10:     if α′(b) > α(b) then
11:       α ⇐ α′
12:     Γ′ ⇐ Γ′ ∪ α
13:     for all b ∈ B do
14:       if α(b) > Γ(b) then
15:         B ⇐ B \ b
16: until Γ′ − Γ < εΓ
17: return Γ

This algorithm will iteratively improve the value function at all beliefs. The algorithm stops when the value function improves by less than the stopping criterion εΓ; therefore, at every iteration at least one of the beliefs must improve by at least εΓ. Because the value function at every belief is bounded above by R_max/γ, we can guarantee that the algorithm will take fewer than (|B|R_max)/(γ · εΓ) iterations.

Algorithm 1 returns a value function; ultimately we want a policy. Using the value function, a policy can be constructed in a greedy manner for each player, using a procedure very similar to greedy policy construction in the fully observable case. At every time-step the actors in the world can dynamically compute their next action without needing to plan their entire policy.

7 Experiments

We tested our algorithm on six well-known benchmark problems [1, 14]: DecTiger, Broadcast, Grid-small, Cooperative Box Pushing, Recycling Robots, and Wireless Network. On all of these problems we met or exceeded the current best solution. This is particularly impressive considering that some of the compared algorithms were designed to take advantage of specific problem structure, while our algorithm is general. We also attempted to solve the Mars Rovers problem, but its belief-POMDP transition model was too large for our 8GB memory limit.

We implemented our PBVI for BB-DecPOMDP algorithm in Java, using the GNU Linear Programming Kit to solve our integer programs. We ran the algorithm on all six benchmark problems, using the dynamic belief compression approximation scheme to convert each of the DecPOMDP problems into BB-DecPOMDPs. For each problem we produced BB-DecPOMDPs with one, two, three, four, and five dynamic beliefs (the value of t_i).

We used the following fixed parameters while running the PBVI algorithm: εΓ = 0.0005; 3,000 sampled belief points to a maximum depth of 36. All of the problems except Wireless were solved using a discount factor γ of 0.9 for the original problem and √0.9 for our dynamic approximation (recall that an agent visits two states for every one of the original problem). 
Wireless\nhas a discount factor of 0.99. To compensate for this low discount factor in this domain, we sampled\n30,000 beliefs to a depth of 360. Our empirical evaluations were run on a six-core 3.20GHz Phenom\nprocessor with 8GB of memory. We terminated the algorithm and used the current value if it ran\nlonger than a day (only Box Pushing and Wireless took longer than \ufb01ve hours). The \ufb01nal value\nreported is the value of the computed decentralized policy on the original DecPOMDP run to a\nhorizon which pushed the utility error below the reported precision.\nOur algorithms performed very well on all benchmark problems (table 7). Surprisingly, most of\nthe benchmark problems only require two approximating beliefs in order to beat the previously best\n\n\u221a\n\n7\n\n\fDec-Tiger\nBroadcast\nRecycling\nGrid small\nBox Pushing\nWireless\n\n|S|\n2\n4\n4\n16\n100\n64\n\n|Ai|\n2\n2\n3\n5\n4\n2\n\n|Oi|\n2\n2\n2\n2\n5\n6\n\nPrevious Best\n\nUtility\n\n13.4486 [14]\n\n9.1 [1]\n\n31.92865 [1]\n\n6.89 [14]\n149.854 [2]\n-175.40 [8]\n\n1-Belief\n\nUtility\n-20.000\n9.2710\n26.3158\n5.2716\n127.1572\n-208.0437\n\n|\u0393|\n2\n36\n8\n168\n258\n99\n\n2-Beliefs\n\nUtility\n4.6161\n9.2710\n31.9291\n6.8423\n223.8674\n-167.1025\n\n|\u0393|\n187\n44\n13\n206\n357\n374\n\n4-Beliefs\n\n5-Beliefs\n\n3-Beliefs\n\nUtility\n13.4486\n9.2710\n31.9291\n6.9826\n224.1387\n\n|\u0393|\n231\n75\n37\n276\n305\n\nDec-Tiger\nBroadcast\nRecycling\nGrid small\nBox Pushing\n\nUtility\n13.4486\n9.2710\n31.9291\n6.9896\n\n-\n\n|\u0393|\n801\n33\n498\n358\n-\n\nUtility\n13.4486\n9.2710\n31.9291\n6.9958\n\n-\n\n|\u0393|\n809\n123\n850\n693\n-\n\nTable 1: Utility achieved by our PBVI-BB-DecPOMDP algorithm compared to the previously best\nknown policies on a series of standard benchmarks. Higher is better. Our algorithm beats all previous\nresults except on Dec-Tiger where we believe an optimal policy has already been found.\n\nknown solution. Only Dec-Tiger needs three beliefs. 
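For concreteness, the outer loop of Algorithm 1 can be sketched in a few lines. This is a minimal sketch, not the Java implementation used in these experiments: `backup` stands in for the integer-program backup of program (4), α-vectors are plain tuples scored by dot products, and the belief-removal test uses a Perseus-style ≥ comparison so the inner sweep always makes progress.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def value(Gamma, b):
    """Value of belief b under vector set Gamma: max over alpha of alpha . b."""
    return max(dot(alpha, b) for alpha in Gamma)

def pbvi(beliefs, backup, alpha0, eps, max_iter=1000):
    """Sketch of Algorithm 1's outer loop.

    beliefs: the sampled common-knowledge beliefs (B_forall)
    backup:  (b, Gamma) -> alpha-vector; abstracts the one-step CBG backup
             that the paper solves with integer program (4)
    alpha0:  pessimistic initial vector (e.g. all R_min / gamma)
    """
    Gamma_new = [alpha0]
    for _ in range(max_iter):
        B = list(beliefs)                 # B <- B_forall
        Gamma, Gamma_new = Gamma_new, []  # Gamma <- Gamma'; Gamma' <- empty
        while B:
            b = random.choice(B)                         # b <- Rand(b in B)
            alpha = max(Gamma, key=lambda a: dot(a, b))  # current best at b
            alpha_new = backup(b, Gamma)                 # one-step backup at b
            if dot(alpha_new, b) > dot(alpha, b):
                alpha = alpha_new
            # drop every belief alpha already serves (including b itself);
            # keeping only beliefs alpha has not yet matched guarantees progress
            B = [bb for bb in B if dot(alpha, bb) < value(Gamma, bb)]
            Gamma_new.append(alpha)
        if all(value(Gamma_new, bb) - value(Gamma, bb) < eps for bb in beliefs):
            break
    return Gamma_new
```

On a toy problem with a simple Bellman-style `backup`, this loop converges to the fixed point in a few dozen sweeps; the real algorithm differs in that each backup optimizes over joint strategy-profiles of a CBG rather than single actions.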
None of the problems benefited substantially from using four or five beliefs; only Grid small continued to improve slightly when given more beliefs. This lack of improvement with extra beliefs is strong evidence that our BB-DecPOMDP approximation is quite powerful and that the policies found are near optimal. It also suggests that these problems do not have terribly complicated optimal policies, and that new benchmark problems requiring a richer belief set should be proposed.

The belief-POMDP state-space size is the primary bottleneck of our algorithm. Recall that this state-space is factored, making its size O(|S||O|^n |t|^n). This number can easily become intractably large for problems with a moderate number of states and observations, such as the Mars Rovers problem. Taking advantage of sparsity can mitigate this problem (our implementation uses sparse vectors); however, value-vectors tend to be dense, so sparsity is only a partial solution. A large state-space also requires a greater number of belief samples to adequately cover and represent the value function; with more states it becomes increasingly likely that a random walk will fail to traverse a desirable region of the state-space. This problem is not nearly as bad as it would be for a normal POMDP, because much of the belief-space is unreachable and a belief-POMDP's value function has a great deal of symmetry due to the label invariance of beliefs (a relabeling of beliefs still has the same utility).

8 Conclusion

This paper presented three relatively independent contributions towards solving DecPOMDPs. First, we introduced an efficient integer program for solving collaborative Bayesian games. Other approaches require solving CBGs as a sub-problem, and this could directly improve those algorithms. Second, we showed how a DecPOMDP with bounded belief can be converted into a POMDP.
Almost all methods bound the beliefs in some way (through belief compression or finite horizons), and viewing these problems as POMDPs with large action spaces could precipitate new approaches. Third, we showed how to achieve optimal belief compression by allowing agents themselves to decide how best to merge beliefs. This allows any DecPOMDP to be converted into a BB-DecPOMDP. Finally, these independent contributions can be combined to permit existing POMDP algorithms (here we chose PERSEUS) to be used to solve DecPOMDPs. We showed that this approach is a significant improvement over existing infinite-horizon algorithms. We believe this opens the door towards a large and fruitful line of research into modifying and adapting existing value-based POMDP algorithms to the specific difficulties of belief-POMDPs.

References

[1] C. Amato, D. S. Bernstein, and S. Zilberstein. Optimizing memory-bounded controllers for decentralized POMDPs. In Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, pages 1–8, Vancouver, British Columbia, 2007.

[2] C. Amato and S. Zilberstein. Achieving goals in decentralized POMDPs. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, pages 593–600, 2009.

[3] J. S. Dibangoye, C. Amato, O. Buffet, F. Charpillet, S. Nicol, T. Iwamura, O. Buffet, I. Chades, M. Tagorti, B. Scherrer, et al. Optimally solving Dec-POMDPs as continuous-state MDPs. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, 2013.

[4] P. Doshi and P. Gmytrasiewicz. Monte Carlo sampling methods for approximating interactive POMDPs. Journal of Artificial Intelligence Research, 34(1):297–337, 2009.

[5] R. Emery-Montemerlo, G. Gordon, J. Schneider, and S. Thrun. Approximate solutions for partially observable stochastic games with common payoffs.
In Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 1, pages 136–143. IEEE Computer Society, 2004.

[6] R. Emery-Montemerlo, G. Gordon, J. Schneider, and S. Thrun. Game theoretic control for robot teams. In Proceedings of the 2005 IEEE International Conference on Robotics and Automation (ICRA 2005), pages 1163–1169. IEEE, 2005.

[7] F. Forges. Correlated equilibrium in games with incomplete information revisited. Theory and Decision, 61(4):329–344, 2006.

[8] A. Kumar and S. Zilberstein. Anytime planning for decentralized POMDPs using expectation maximization. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, 2012.

[9] F. A. Oliehoek. Sufficient plan-time statistics for decentralized POMDPs. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, 2013.

[10] F. A. Oliehoek, M. T. Spaan, J. S. Dibangoye, and C. Amato. Heuristic search for identical payoff Bayesian games. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems, 2010.

[11] F. A. Oliehoek, M. T. Spaan, and N. Vlassis. Optimal and approximate Q-value functions for decentralized POMDPs. Journal of Artificial Intelligence Research, 32(1):289–353, 2008.

[12] F. A. Oliehoek and N. Vlassis. Q-value functions for decentralized POMDPs. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems, page 220. ACM, 2007.

[13] F. A. Oliehoek, S. Whiteson, and M. T. Spaan. Lossless clustering of histories in decentralized POMDPs. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems, pages 577–584, 2009.

[14] J. Pajarinen and J. Peltonen. Periodic finite state controllers for efficient POMDP and Dec-POMDP planning. In Proceedings of the 25th Annual Conference on Neural Information Processing Systems, 2011.

[15] C. H. Papadimitriou and J. Tsitsiklis. On the complexity of designing distributed protocols. Information and Control, 53(3):211–218, 1982.

[16] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: an anytime algorithm for POMDPs. In Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI '03), pages 1025–1030, 2003.

[17] M. Roth, R. Simmons, and M. Veloso. Reasoning about joint beliefs for execution-time communication decisions. In Proceedings of the Fourth International Joint Conference on Autonomous Agents and Multiagent Systems, pages 786–793. ACM, 2005.

[18] M. T. Spaan and N. Vlassis. Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research, 24(1):195–220, 2005.

[19] D. Szer, F. Charpillet, S. Zilberstein, et al. MAA*: A heuristic search algorithm for solving decentralized POMDPs. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI 2005), 2005.