{"title": "Regret based Robust Solutions for Uncertain Markov Decision Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 881, "page_last": 889, "abstract": "In this paper, we seek robust policies for uncertain Markov Decision Processes (MDPs). Most robust optimization approaches for these problems have focussed on the computation of {\\em maximin} policies which maximize the value corresponding to the worst realization of the uncertainty. Recent work has proposed {\\em minimax} regret as a suitable alternative to the {\\em maximin} objective for robust optimization.  However, existing algorithms for handling {\\em minimax} regret are restricted to models with uncertainty over rewards only.  We provide algorithms that employ sampling to improve across multiple dimensions: (a) Handle uncertainties over both transition and reward models; (b) Dependence of model uncertainties across state, action pairs and decision epochs; (c) Scalability and quality bounds. Finally, to demonstrate the empirical effectiveness of our sampling approaches, we provide comparisons against benchmark algorithms on two domains from literature. We also provide a Sample Average Approximation (SAA) analysis to compute a posteriori error bounds.", "full_text": "Regret based Robust Solutions for\n\nUncertain Markov Decision Processes\n\nAsrar Ahmed\n\nSingapore Management University\n\nmasrara@smu.edu.sg\n\nPradeep Varakantham\n\nSingapore Management University\n\npradeepv@smu.edu.sg\n\nYossiri Adulyasak\n\nMassachusetts Institute of Technology\n\nyossiri@smart.mit.edu\n\nPatrick Jaillet\n\nMassachusetts Institute of Technology\n\njaillet@mit.edu\n\nAbstract\n\nIn this paper, we seek robust policies for uncertain Markov Decision Processes (MDPs). Most\nrobust optimization approaches for these problems have focussed on the computation of maximin\npolicies which maximize the value corresponding to the worst realization of the uncertainty. 
Recent\nwork has proposed minimax regret as a suitable alternative to the maximin objective for robust op-\ntimization. However, existing algorithms for handling minimax regret are restricted to models with\nuncertainty over rewards only. We provide algorithms that employ sampling to improve across mul-\ntiple dimensions: (a) Handle uncertainties over both transition and reward models; (b) Dependence\nof model uncertainties across state, action pairs and decision epochs; (c) Scalability and quality\nbounds. Finally, to demonstrate the empirical effectiveness of our sampling approaches, we pro-\nvide comparisons against benchmark algorithms on two domains from literature. We also provide a\nSample Average Approximation (SAA) analysis to compute a posteriori error bounds.\n\nIntroduction\n\nMotivated by the dif\ufb01culty in exact speci\ufb01cation of reward and transition models, researchers have\nproposed the uncertain Markov Decision Process (MDP) model and robustness objectives in solving\nthese models. Given the uncertainty over the reward and transition models, a robust solution can\ntypically provide some guarantees on the worst case performance. Most of the research in comput-\ning robust solutions has assumed a maximin objective, where one computes a policy that maximizes\nthe value corresponding to the worst case realization [8, 4, 3, 1, 7]. This line of work has devel-\noped scalable algorithms by exploiting independence of uncertainties across states and convexity of\nuncertainty sets. Recently, techniques have been proposed to deal with dependence of uncertain-\nties [15, 6].\nRegan et al. [11] and Xu et al. [16] have proposed minimax regret criterion [13] as a suitable alterna-\ntive to maximin objective for uncertain MDPs. 
We also focus on this minimax notion of robustness and provide a new myopic variant of regret called Cumulative Expected Regret (CER) that allows for development of scalable algorithms.\nDue to the complexity of computing optimal minimax regret policies [16], existing algorithms [12] are restricted to handling uncertainty only in reward models, with uncertainties that are independent across states. Recent research has shown that sampling-based techniques [5, 9] are not only efficient but also provide a priori (Chernoff-Hoeffding bounds) and a posteriori [14] quality bounds for planning under uncertainty.\nIn this paper, we also employ sampling-based approaches to address restrictions of existing approaches for obtaining regret-based solutions for uncertain MDPs. More specifically, we make the following contributions: (a) An approximate Mixed Integer Linear Programming (MILP) formulation with error bounds for computing minimum regret solutions for uncertain MDPs, where the uncertainties across states are dependent. We further provide enhancements and error bounds to improve applicability. (b) We introduce a new myopic concept of regret, referred to as Cumulative Expected Regret (CER), that is intuitive and that allows for development of scalable approaches. (c) Finally, we perform a Sample Average Approximation (SAA) analysis to provide experimental bounds for our approaches on benchmark problems from literature.\n\nPreliminaries\n\nWe now formally define the two regret criteria that will be employed in this paper. In the definitions below, we assume an underlying MDP, $M = \\langle S, A, T, R, H \\rangle$, where a policy is represented as $\\vec{\\pi}^t = \\{\\pi^t, \\pi^{t+1}, \\ldots, \\pi^{H-1}\\}$, the optimal policy as $\\vec{\\pi}^*$ and the optimal expected value as $v^0(\\vec{\\pi}^*)$. The maximum reward in any state $s$ is denoted as $R^*(s) = \\max_a R(s, a)$. 
Throughout the paper, we use $\\alpha(s)$ to denote the starting state distribution in state $s$ and $\\gamma$ to represent the discount factor.\nDefinition 1 Regret for any policy $\\vec{\\pi}^0$ is denoted by $reg(\\vec{\\pi}^0)$ and is defined as:\n\n$$reg(\\vec{\\pi}^0) = v^0(\\vec{\\pi}^*) - v^0(\\vec{\\pi}^0), \\quad \\text{where} \\quad v^0(\\vec{\\pi}^0) = \\sum_s \\alpha(s) \\cdot v^0(s, \\vec{\\pi}^0),$$\n\n$$v^t(s, \\vec{\\pi}^t) = \\sum_a \\pi^t(s, a) \\cdot \\Big[ R(s, a) + \\gamma \\sum_{s'} T(s, a, s') \\cdot v^{t+1}(s', \\vec{\\pi}^{t+1}) \\Big]$$\n\nExtending the definitions of simple and cumulative regret in stochastic multi-armed bandit problems [2], we now define a new variant of regret called Cumulative Expected Regret (CER).\nDefinition 2 CER for policy $\\vec{\\pi}^0$ is denoted by $creg(\\vec{\\pi}^0)$ and is defined as:\n\n$$creg(\\vec{\\pi}^0) = \\sum_s \\alpha(s) \\cdot creg^0(s, \\vec{\\pi}^0), \\quad \\text{where}$$\n\n$$creg^t(s, \\vec{\\pi}^t) = \\sum_a \\pi^t(s, a) \\cdot \\Big[ R^*(s) - R(s, a) + \\gamma \\sum_{s'} T(s, a, s') \\cdot creg^{t+1}(s', \\vec{\\pi}^{t+1}) \\Big] \\qquad (1)$$\n\nThe following properties highlight the dependencies between regret and CER.\nProposition 1 For a policy $\\vec{\\pi}^0$:\n\n$$0 \\le reg(\\vec{\\pi}^0) - creg(\\vec{\\pi}^0) \\le \\Big[ \\max_s R^*(s) - \\min_s R^*(s) \\Big] \\cdot \\frac{1 - \\gamma^H}{1 - \\gamma}$$\n\nProof Sketch$^1$ By rewriting Equation (1) as $creg(\\vec{\\pi}^0) = v^{0,\\#}(\\vec{\\pi}^0) - v^0(\\vec{\\pi}^0)$, we provide the proof.\nCorollary 1 If $\\forall s, s' \\in S : R^*(s) = R^*(s')$, then $\\forall \\vec{\\pi}^0 : creg(\\vec{\\pi}^0) = reg(\\vec{\\pi}^0)$.\nProof. Substituting $\\max_s R^*(s) = \\min_s R^*(s)$ in the result of Proposition 1, we have $creg(\\vec{\\pi}^0) = reg(\\vec{\\pi}^0)$. 
$\\blacksquare$\n\nUncertain MDP\nA finite horizon uncertain MDP is defined as the tuple $\\langle S, A, \\mathbf{T}, \\mathbf{R}, H \\rangle$. $S$ denotes the set of states and $A$ denotes the set of actions. $\\mathbf{T} = \\Delta^\\tau(\\mathcal{T})$ denotes a distribution over the set of transition functions $\\mathcal{T}$, where $T^t_k(s, a, s')$ denotes the probability of transitioning from state $s \\in S$ to state $s' \\in S$ on taking action $a \\in A$ at time step $t$ according to the $k$th element in $\\mathcal{T}$. Similarly, $\\mathbf{R} = \\Delta^\\rho(\\mathcal{R})$ denotes the distribution over the set of reward functions $\\mathcal{R}$, where $R^t_k(s, a, s')$ is the reinforcement obtained on taking action $a$ in state $s$ and transitioning to state $s'$ at time $t$ according to the $k$th element in $\\mathcal{R}$. Both the $\\mathcal{T}$ and $\\mathcal{R}$ sets can have infinitely many elements. Finally, $H$ is the time horizon.\nIn the above representation, every element of $\\mathcal{T}$ and $\\mathcal{R}$ represents uncertainty over the entire horizon and hence this representation captures dependence in uncertainty distributions across states. We now provide a formal definition for the independence of uncertainty distributions that is equivalent to the rectangularity property introduced in Iyengar et al. [4].\n\n1 Detailed proof provided in supplement under Proposition 1.\n\nDefinition 3 An uncertainty distribution $\\Delta^\\tau$ over the set of transition functions $\\mathcal{T}$ is independent over state-action pairs at various decision epochs if\n\n$$\\Delta^\\tau(\\mathcal{T}) = \\times_{s \\in S, a \\in A, t \\le H} \\Delta^{\\tau,t}_{s,a}(\\mathcal{T}^t_{s,a}), \\quad \\text{i.e.} \\quad \\forall k, \\; Pr_{\\Delta^\\tau}(T_k) = \\prod_{s,a,t} Pr_{\\Delta^{\\tau,t}_{s,a}}(T^t_{s,a})$$\n\nwhere $\\mathcal{T} = \\times_{s,a,t} \\mathcal{T}^t_{s,a}$, $\\mathcal{T}^t_{s,a}$ is the set of transition functions for $s, a, t$; $\\Delta^{\\tau,t}_{s,a}$ is the distribution over the set $\\mathcal{T}^t_{s,a}$ and $Pr_{\\Delta^\\tau}(T_k)$ is the probability of the transition function $T_k$ given the distribution $\\Delta^\\tau$.\n\nWe can provide a similar definition for the independence of uncertainty distributions over the reward functions.\nIn the following definitions, we include transition, $T$, and reward, $R$, models as subscripts to indicate value ($v$), regret ($reg$) and CER ($creg$) functions corresponding to a specific MDP. Existing works on computation of maximin policies have the following objective:\n\n$$\\pi^{maximin} = \\arg\\max_{\\vec{\\pi}^0} \\min_{T \\in \\mathcal{T}, R \\in \\mathcal{R}} \\sum_s \\alpha(s) \\cdot v^0_{T,R}(s, \\vec{\\pi}^0)$$\n\nOur goal is to compute policies that minimize the maximum regret or cumulative regret over possible models of transitional and reward uncertainty.\n\n$$\\pi^{reg} = \\arg\\min_{\\vec{\\pi}^0} \\max_{T \\in \\mathcal{T}, R \\in \\mathcal{R}} reg_{T,R}(\\vec{\\pi}^0); \\quad \\pi^{creg} = \\arg\\min_{\\vec{\\pi}^0} \\max_{T \\in \\mathcal{T}, R \\in \\mathcal{R}} creg_{T,R}(\\vec{\\pi}^0)$$\n\nRegret Minimizing Solution\n\nWe will first consider the more general case of dependent uncertainty distributions. Our approach to obtaining regret minimizing solution relies on sampling the uncertainty distributions over the transition and reward models. We formulate the regret minimization problem over the sample set as an optimization problem and then approximate it as a Mixed Integer Linear Program (MILP).\nWe now describe the representation of a sample and the definition of optimal expected value for a sample, a key component in the computation of regret. Since there are dependencies amongst uncertainties, we can only sample from $\\Delta^\\tau$, $\\Delta^\\rho$ and not from $\\Delta^{\\tau,t}_{s,a}$, $\\Delta^{\\rho,t}_{s,a}$. 
Thus, a sample is:\n\n$$\\xi_q = \\{ \\langle T^0_q, R^0_q \\rangle, \\langle T^1_q, R^1_q \\rangle, \\cdots, \\langle T^{H-1}_q, R^{H-1}_q \\rangle \\}$$\n\nwhere $T^t_q$ and $R^t_q$ refer to the transition and reward model respectively at time step $t$ in sample $q$. Let $\\vec{\\pi}^t$ represent the policy for each time step from $t$ to $H - 1$ and the set of samples be denoted by $\\xi$. Intuitively, that corresponds to $|\\xi|$ number of discrete MDPs and our goal is to compute one policy that minimizes the regret over all the $|\\xi|$ MDPs, i.e.\n\n$$\\pi^{reg} = \\arg\\min_{\\vec{\\pi}^0} \\max_{\\xi_q \\in \\xi} \\sum_s \\alpha(s) \\cdot [v^*_{\\xi_q}(s) - v^0_{\\xi_q}(s, \\vec{\\pi}^0)]$$\n\nwhere $v^*_{\\xi_q}$ and $v^0_{\\xi_q}(s, \\vec{\\pi}^0)$ denote the optimal expected value and expected value for policy $\\vec{\\pi}^0$ respectively of the sample $\\xi_q$.\nLet $\\vec{\\pi}^0$ be any policy corresponding to the sample $\\xi_q$, then the expected value is defined as follows:\n\n$$v^t_{\\xi_q}(s, \\vec{\\pi}^t) = \\sum_a \\pi^t(s, a) \\cdot v^t_{\\xi_q}(s, a, \\vec{\\pi}^t), \\quad \\text{where} \\quad v^t_{\\xi_q}(s, a, \\vec{\\pi}^t) = R^t_q(s, a) + \\gamma \\sum_{s'} v^{t+1}_{\\xi_q}(s', \\vec{\\pi}^{t+1}) \\cdot T^t_q(s, a, s')$$\n\nThe optimization problem for computing the regret minimizing policy corresponding to sample set $\\xi$ is then defined as follows:\n\n$$\\min_{\\vec{\\pi}^0} \\; reg(\\vec{\\pi}^0)$$\n\n$$\\text{s.t.} \\quad reg(\\vec{\\pi}^0) \\ge v^*_{\\xi_q} - \\sum_s \\alpha(s) \\cdot v^0_{\\xi_q}(s, \\vec{\\pi}^0) \\qquad \\forall \\xi_q \\qquad (2)$$\n\n$$v^t_{\\xi_q}(s, \\vec{\\pi}^t) = \\sum_a \\pi^t(s, a) \\cdot v^t_{\\xi_q}(s, a, \\vec{\\pi}^t) \\qquad \\forall s, \\xi_q, t \\qquad (3)$$\n\n$$v^t_{\\xi_q}(s, a, \\vec{\\pi}^t) = R^t_q(s, a) + \\gamma \\sum_{s'} v^{t+1}_{\\xi_q}(s', \\vec{\\pi}^{t+1}) \\cdot T^t_q(s, a, s') \\qquad \\forall s, a, \\xi_q, t \\qquad (4)$$\n\nThe value function expression in Equation (3) is a product of two variables, $\\pi^t(s, a)$ and $v^t_{\\xi_q}(s, a, \\vec{\\pi}^t)$, which hampers scalability significantly. We now linearize these nonlinear terms.\n\nMixed Integer Linear Program\n\nThe optimal policy for minimizing maximum regret in the general case is randomized. However, to account for domains which only allow for deterministic policies, we provide linearization separately for the two cases of deterministic and randomized policies.\nDeterministic Policy: In case of deterministic policies, we replace Equation (3) with the following equivalent integer linear constraints:\n\n$$v^t_{\\xi_q}(s, \\vec{\\pi}^t) \\le v^t_{\\xi_q}(s, a, \\vec{\\pi}^t); \\quad v^t_{\\xi_q}(s, \\vec{\\pi}^t) \\le \\pi^t(s, a) \\cdot M;$$\n\n$$v^t_{\\xi_q}(s, \\vec{\\pi}^t) \\ge v^t_{\\xi_q}(s, a, \\vec{\\pi}^t) - (1 - \\pi^t(s, a)) \\cdot M \\qquad \\forall s, a, \\xi_q, t \\qquad (5)$$\n\n$M$ is a large positive constant that is an upper bound on $v^t_{\\xi_q}(s, a, \\vec{\\pi}^t)$. Equivalence to the product terms in Equation (3) can be verified by considering all values of $\\pi^t(s, a)$.\nRandomized Policy: When $\\vec{\\pi}^0$ is a randomized policy, we have a product of two continuous variables. We provide a mixed integer linear approximation to address the product terms above. Let,\n\n$$A^t_{\\xi_q}(s, a, \\vec{\\pi}^t) = \\frac{v^t_{\\xi_q}(s, a, \\vec{\\pi}^t) + \\pi^t(s, a)}{2}; \\quad B^t_{\\xi_q}(s, a, \\vec{\\pi}^t) = \\frac{v^t_{\\xi_q}(s, a, \\vec{\\pi}^t) - \\pi^t(s, a)}{2}$$\n\nEquation (3) can then be rewritten as:\n\n$$v^t_{\\xi_q}(s, \\vec{\\pi}^t) = \\sum_a [A^t_{\\xi_q}(s, a, \\vec{\\pi}^t)^2 - B^t_{\\xi_q}(s, a, \\vec{\\pi}^t)^2] \\qquad (6)$$\n\nAs discussed in the next subsection on \u201cPruning dominated actions\u201d, we can compute upper and lower bounds for $v^t_{\\xi_q}(s, a, \\vec{\\pi}^t)$ and hence for $A^t_{\\xi_q}(s, a, \\vec{\\pi}^t)$ and $B^t_{\\xi_q}(s, a, \\vec{\\pi}^t)$. We approximate the squared terms by using piecewise linear components that provide an upper bound on the squared terms. We employ a standard method from literature of dividing the variable range into multiple break points. More specifically, we divide the overall range of $A^t_{\\xi_q}(s, a, \\vec{\\pi}^t)$ (or $B^t_{\\xi_q}(s, a, \\vec{\\pi}^t)$), say $[br_0, br_r]$, into $r$ intervals by using $r + 1$ points, namely $\\langle br_0, br_1, \\ldots, br_r \\rangle$. We associate a linear variable, $\\lambda^t_{\\xi_q}(s, a, w)$, with each break point $w$ and then approximate $A^t_{\\xi_q}(s, a, \\vec{\\pi}^t)^2$ (and $B^t_{\\xi_q}(s, a, \\vec{\\pi}^t)^2$) as follows:\n\n$$A^t_{\\xi_q}(s, a, \\vec{\\pi}^t) = \\sum_w \\lambda^t_{\\xi_q}(s, a, w) \\cdot br_w, \\qquad \\forall s, a, \\xi_q, t \\qquad (7)$$\n\n$$A^t_{\\xi_q}(s, a, \\vec{\\pi}^t)^2 = \\sum_w \\lambda^t_{\\xi_q}(s, a, w) \\cdot (br_w)^2, \\qquad \\forall s, a, \\xi_q, t \\qquad (8)$$\n\n$$\\sum_w \\lambda^t_{\\xi_q}(s, a, w) = 1, \\qquad \\forall s, a, \\xi_q, t \\qquad (9)$$\n\n$$SOS2^{s,a,t}_{\\xi_q}(\\{\\lambda^t_{\\xi_q}(s, a, w)\\}_{w \\le r}), \\qquad \\forall s, a, \\xi_q, t$$\n\nwhere SOS2 is a construct which is associated with a set of variables of which at most two variables can be non-zero and if two variables are non-zero they must be adjacent. Since any number in the range lies between at most two adjacent points, we have the above constructs for the $\\lambda^t_{\\xi_q}(s, a, w)$ variables. We implement the above adjacency constraints on $\\lambda^t_{\\xi_q}(s, a, w)$ using the CPLEX Special Ordered Sets (SOS) type 2$^2$.\n\nProposition 2 Let $[c, d]$ denote the range of values for $A^t_{\\xi_q}(s, a, \\vec{\\pi}^t)$ and assume we have $r + 1$ points that divide $A^t_{\\xi_q}(s, a, \\vec{\\pi}^t)^2$ into $r$ equal intervals of size $\\epsilon = \\frac{d^2 - c^2}{r}$; then the approximation error $\\delta < \\frac{\\epsilon}{4}$.\n\nProof: Let the $r + 1$ points be $br_0, \\ldots, br_r$. By definition, we have $(br_w)^2 = (br_{w-1})^2 + \\epsilon$. Because of the convexity of the $x^2$ function, the maximum approximation error in any interval $[br_{w-1}, br_w]$ occurs at its mid-point$^3$. 
Hence, approximation error $\\delta$ is given by:\n\n$$\\delta \\le \\frac{(br_w)^2 + (br_{w-1})^2}{2} - \\left[ \\frac{br_w + br_{w-1}}{2} \\right]^2 = \\frac{\\epsilon + 2 \\cdot br_{w-1} \\cdot (br_{w-1} - br_w)}{4} < \\frac{\\epsilon}{4} \\qquad \\blacksquare$$\n\n2 Using CPLEX SOS-2 considerably improves runtime compared to a binary variables formulation.\n3 Proposition and proof provided in supplement as footnote 3\n\nProposition 3 Let $\\hat{v}^t_{\\xi_q}(s, \\vec{\\pi}^t)$ denote the approximation of $v^t_{\\xi_q}(s, \\vec{\\pi}^t)$. Then\n\n$$v^t_{\\xi_q}(s, \\vec{\\pi}^t) - \\frac{|A| \\cdot \\epsilon \\cdot (1 - \\gamma^{H-1})}{4 \\cdot (1 - \\gamma)} \\le \\hat{v}^t_{\\xi_q}(s, \\vec{\\pi}^t) \\le v^t_{\\xi_q}(s, \\vec{\\pi}^t) + \\frac{|A| \\cdot \\epsilon \\cdot (1 - \\gamma^{H-1})}{4 \\cdot (1 - \\gamma)}$$\n\nProof Sketch$^4$: We use the approximation error provided in Proposition 2 and propagate it through the value function update. $\\blacksquare$\nCorollary 2 The positive and negative errors in regret are bounded by $\\frac{|A| \\cdot \\epsilon \\cdot (1 - \\gamma^{H-1})}{4 \\cdot (1 - \\gamma)}$.\nProof. From Equation (2) and Proposition 3, we have the proof. $\\blacksquare$\nSince the break points are fixed beforehand, we can find tighter bounds (refer to Proof of Proposition 2). Also, we can further improve on the performance (both run-time and solution quality) of the MILP by pruning out dominated actions and adopting clever sampling strategies as discussed in the next subsections.\n\nPruning dominated actions\n\nWe now introduce a pruning approach$^5$ to remove actions that will never be assigned a positive probability in a regret minimization strategy. 
For every state-action pair at each time step, we define a minimum and maximum value function as follows:\n\n$$v^{t,min}_{\\xi_q}(s, a) = R^t_q(s, a) + \\gamma \\sum_{s'} T^t_q(s, a, s') \\cdot v^{t+1,min}_{\\xi_q}(s'); \\quad v^{t,min}_{\\xi_q}(s) = \\min_a \\{ v^{t,min}_{\\xi_q}(s, a) \\}$$\n\n$$v^{t,max}_{\\xi_q}(s, a) = R^t_q(s, a) + \\gamma \\sum_{s'} T^t_q(s, a, s') \\cdot v^{t+1,max}_{\\xi_q}(s'); \\quad v^{t,max}_{\\xi_q}(s) = \\max_a \\{ v^{t,max}_{\\xi_q}(s, a) \\}$$\n\nAn action $a'$ is pruned if there exists the same action $a$ over all samples $\\xi_q$, such that\n\n$$v^{t,min}_{\\xi_q}(s, a) \\ge v^{t,max}_{\\xi_q}(s, a') \\qquad \\exists a, \\forall \\xi_q$$\n\nThe above pruning step follows from the observation that an action whose best case payoff is less than the worst case payoff of another action $a$ cannot be part of the regret optimal strategy, since we could switch from $a'$ to $a$ without increasing the regret value. It should be noted that an action that is not optimal for any of the samples cannot be pruned.\n\nGreedy sampling\n\nThe scalability of the MILP formulation above is constrained by the number of samples $Q$. So, instead of generating only the fixed set of $Q$ samples from the uncertainty distribution over models, we generate more than $Q$ samples and then pick a set of size $Q$ so that samples are \u201cas far apart\u201d as possible. The key intuition in selecting the samples is to consider distance among samples as being equivalent to entropy in the optimal policies for the MDPs in the samples. For each decision epoch $t$, each state $s$ and action $a$, we define $Pr^{s,a,t}_{\\xi}(\\pi^{*t}_{\\xi}(s, a) = 1)$ to be the probability that $a$ is the optimal action in state $s$ at time $t$. Similarly, we define $Pr^{s,a,t}_{\\xi}(\\pi^{*t}_{\\xi}(s, a) = 0)$:\n\n$$Pr^{s,a,t}_{\\xi}(\\pi^{*t}_{\\xi}(s, a) = 1) = \\frac{\\sum_{\\xi_q} \\pi^{*t}_{\\xi_q}(s, a)}{Q}; \\quad Pr^{s,a,t}_{\\xi}(\\pi^{*t}_{\\xi}(s, a) = 0) = \\frac{\\sum_{\\xi_q} \\big( 1 - \\pi^{*t}_{\\xi_q}(s, a) \\big)}{Q}$$\n\nLet the total entropy of sample set $\\xi$ ($|\\xi| = Q$) be represented as $\\Delta S(\\xi)$, then\n\n$$\\Delta S(\\xi) = - \\sum_{t,s,a} \\sum_{z \\in \\{0,1\\}} Pr^{s,a,t}_{\\xi}(\\pi^{*t}_{\\xi}(s, a) = z) \\cdot \\ln \\big( Pr^{s,a,t}_{\\xi}(\\pi^{*t}_{\\xi}(s, a) = z) \\big)$$\n\nWe use a greedy strategy to select the $Q$ samples, i.e. we iteratively add samples that maximize entropy of the sample set in that iteration.\nIt is possible to provide bounds on the number of samples required for a given error using the methods suggested by Shapiro et al. [14]. However, these bounds are conservative and, as we show in the experimental results section, typically we only require a small number of samples.\n\n4 Detailed proof in supplement under Proposition 3\n5 Pseudo code provided in the supplement under \u201cPruning dominated actions\u201d section.\n\nCER Minimizing Solution\n\nThe MILP based approach mentioned in the previous section can easily be adapted to minimize the maximum cumulative regret over all samples when uncertainties across states are dependent:\n\n$$\\min_{\\vec{\\pi}^0} \\; creg(\\vec{\\pi}^0)$$\n\n$$\\text{s.t.} \\quad creg(\\vec{\\pi}^0) \\ge \\sum_s \\alpha(s) \\cdot creg^0_{\\xi_q}(s, \\vec{\\pi}^0), \\qquad \\forall \\xi_q$$\n\n$$creg^t_{\\xi_q}(s, \\vec{\\pi}^t) = \\sum_a \\pi^t(s, a) \\cdot creg^t_{\\xi_q}(s, a, \\vec{\\pi}^t), \\qquad \\forall s, t, \\xi_q \\qquad (10)$$\n\n$$creg^t_{\\xi_q}(s, a, \\vec{\\pi}^t) = R^{*,t}_q(s) - R^t_q(s, a) + \\gamma \\sum_{s'} T^t_q(s, a, s') \\cdot creg^{t+1}_{\\xi_q}(s', \\vec{\\pi}^{t+1}), \\qquad \\forall s, a, t, \\xi_q \\qquad (11)$$\n\nwhere the product term $\\pi^t(s, a) \\cdot creg^t_{\\xi_q}(s, a, \\vec{\\pi}^t)$ is approximated as described earlier.\nWhile we were unable to exploit the independence of uncertainty distributions across states with minimax regret, we are able to exploit the independence with minimax CER. In fact, a key advantage of the CER robustness concept in the context of independent uncertainties is that it has the optimal substructure over time steps and hence a Dynamic Programming (DP) algorithm can be used to solve it.\nIn the case of independent uncertainties, samples at each time step can be drawn independently and we now introduce a formal notation to account for samples drawn at each time step. Let $\\xi^t$ denote the set of samples at time step $t$, then $\\xi = \\times_{t \\le H-1} \\xi^t$. Further, we use $\\vec{\\xi}^t$ to indicate cross product of samples from $t$ to $H - 1$, i.e. $\\vec{\\xi}^t = \\times_{t \\le e \\le H-1} \\xi^e$. Thus, $\\vec{\\xi}^0 = \\xi$. To indicate the entire horizon samples corresponding to a sample $p$ from time step $t$, we have $\\vec{\\xi}^t_p = \\xi^t_p \\times \\vec{\\xi}^{t+1}$.\nFor notational compactness, we use $\\Delta R^{t-1}_p(s, a) = R^{*,t-1}_p(s) - R^{t-1}_p(s, a)$. Because of independence in uncertainties across time steps, for a sample set $\\vec{\\xi}^{t-1}_p = \\xi^{t-1}_p \\times \\vec{\\xi}^t$, we have the following:\n\n$$\\max_{\\vec{\\xi}^{t-1}_p} creg^{t-1}_{\\vec{\\xi}^{t-1}_p}(s, \\vec{\\pi}^{t-1}) = \\max_{\\xi^{t-1}_p \\times \\vec{\\xi}^t} \\sum_a \\pi^{t-1}(s, a) \\Big[ \\Delta R^{t-1}_p(s, a) + \\gamma \\sum_{s'} T^t_p(s, a, s') \\cdot creg^t_{\\vec{\\xi}^t}(s', \\vec{\\pi}^t) \\Big]$$\n\n$$= \\max_{\\xi^{t-1}_p} \\sum_a \\pi^{t-1}(s, a) \\Big[ \\Delta R^{t-1}_p(s, a) + \\gamma \\sum_{s'} T^t_p(s, a, s') \\cdot \\max_{\\vec{\\xi}^t_q \\in \\vec{\\xi}^t} creg^t_{\\vec{\\xi}^t_q}(s', \\vec{\\pi}^t) \\Big] \\qquad (12)$$\n\nProposition 4 At time step $t - 1$, the CER corresponding to any policy $\\pi^{t-1}$ will have least regret if it includes the CER minimizing policy from $t$. Formally, if $\\vec{\\pi}^{*,t}$ represents the CER minimizing policy from $t$ and $\\vec{\\pi}^t$ represents any arbitrary policy, then:\n\n$$\\forall s : \\max_{\\vec{\\xi}^{t-1}_p \\in \\vec{\\xi}^{t-1}} creg^{t-1}_{\\vec{\\xi}^{t-1}_p} \\big( s, \\langle \\pi^{t-1}, \\vec{\\pi}^{*,t} \\rangle \\big) \\le \\max_{\\vec{\\xi}^{t-1}_p \\in \\vec{\\xi}^{t-1}} creg^{t-1}_{\\vec{\\xi}^{t-1}_p} \\big( s, \\langle \\pi^{t-1}, \\vec{\\pi}^t \\rangle \\big) \\qquad (13)$$\n\n$$\\text{if, } \\forall s : \\max_{\\vec{\\xi}^t_q \\in \\vec{\\xi}^t} creg^t_{\\vec{\\xi}^t_q}(s, \\vec{\\pi}^{*,t}) \\le \\max_{\\vec{\\xi}^t_q \\in \\vec{\\xi}^t} creg^t_{\\vec{\\xi}^t_q}(s, \\vec{\\pi}^t) \\qquad (14)$$\n\nProof Sketch$^6$ We prove this by using Equations (14) and (12) in LHS of Equation (13). $\\blacksquare$\nIt is easy to show that minimizing CER also has an optimal substructure:\n\n$$\\min_{\\vec{\\pi}^0} \\max_{\\vec{\\xi}^0_p} \\sum_s \\alpha(s) \\cdot creg^0_{\\vec{\\xi}^0_p}(s, \\vec{\\pi}^0) \\implies \\min_{\\vec{\\pi}^0} \\sum_s \\alpha(s) \\cdot \\Big[ \\max_{\\vec{\\xi}^0_p} creg^0_{\\vec{\\xi}^0_p}(s, \\vec{\\pi}^0) \\Big] \\qquad (15)$$\n\nIn Proposition 4 (extending the reasoning to $t = 1$), we have already shown that $\\max_{\\vec{\\xi}^0_p} creg^0_{\\vec{\\xi}^0_p}(s, \\vec{\\pi}^0)$ has an optimal substructure. Thus, Equation (15) can also exploit the optimal substructure.\nThe MINIMIZECER function below provides the pseudo code for a DP algorithm that exploits this structure. At each stage $t$, we calculate the $creg$ for each state-action pair corresponding to each of the samples at that stage, i.e. $\\xi^t$ (lines 6-9).\n\n6 Detailed proof in supplement under Proposition 4.\n\nOnce these are computed, we obtain the maximum $creg$ and the policy corresponding to it (line 10) using the GETCER() function. 
In the next iteration, $creg$ computed at $t$ is then used in the computation of $creg$ at $t - 1$ using the same update step (lines 6-9).\n\nMINIMIZECER()\n1: for all $t \\le H - 1$ do\n2:     $\\xi^t \\leftarrow$ GENSAMPLES(T, R)\n3: for all $s \\in S$ do\n4:     $creg^H(s) \\leftarrow 0$\n5: while $t \\ge 0$ do\n6:     for all $s \\in S$ do\n7:         for all $\\xi^t_q \\in \\xi^t, a \\in A$ do\n8:             $creg^t_{\\xi^t_q}(s, a) \\leftarrow \\Delta R^t_q(s, a) +$\n9:                 $\\gamma \\sum_{s'} T^t_q(s, a, s') \\cdot creg^{t+1}(s')$\n10:         $\\langle \\pi^t, creg^t(s) \\rangle \\leftarrow$ GETCER($s, \\{creg^t_{\\xi^t_q}(s, a)\\}$)\n11:     $t \\leftarrow t - 1$\nreturn $(\\vec{creg}^0, \\vec{\\pi}^0)$\n\nGETCER($s, \\{creg^t_{\\xi^t_q}(s, a)\\}$)\n\n$$\\min_{\\pi} \\; creg^t(s)$$\n\n$$creg^t(s) \\ge \\sum_a \\pi^t(s, a) \\cdot creg^t_{\\xi^t_q}(s, a), \\quad \\forall \\xi^t_q$$\n\n$$\\sum_a \\pi^t(s, a) = 1$$\n\n$$0 \\le \\pi^t(s, a) \\le 1, \\quad \\forall a$$\n\nIt can be noted that MINIMIZECER() makes only $H \\cdot |S|$ calls to the LP in GETCER() function, each of which has only $|A|$ continuous variables and at most $[1 + \\max_t |\\xi^t|]$ number of constraints. Thus, the overall complexity of MINIMIZECER() is polynomial in the number of samples given fixed values of other attributes.\nLet $creg^{*,H-1}(s, a)$ denote the optimal cumulative regret at time step $H - 1$ for taking action $a$ in state $s$ and $creg^{*,H-1}_{\\xi}(s, a)$ denote the optimal cumulative regret over the sample set $\\xi$. Let indicator random variable $X$ be defined as follows:\n\n$$X = \\begin{cases} 1 & \\text{if } creg^{*,H-1}(s, a) - creg^{*,H-1}_{\\xi}(s, a) \\le \\lambda \\\\\\\\ 0 & \\text{otherwise} \\end{cases}$$\n\nBy using Chernoff and Hoeffding bounds on $X$, it is possible to provide bounds on deviation from mean and on the number of samples at $H - 1$. 
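As a rough illustration of the kind of count such a bound yields (a sketch under stated assumptions, not the analysis from the paper or its supplement): if the indicator X can be treated as n i.i.d. draws with values in [0, 1], the two-sided Hoeffding inequality bounds the probability that the sample mean deviates from the true mean by more than epsilon by 2 * exp(-2 * n * epsilon^2), so n >= ln(2 / delta) / (2 * epsilon^2) draws suffice for confidence 1 - delta. The function name hoeffding_sample_size and the parameters epsilon and delta are hypothetical names introduced here for illustration.

```python
import math


def hoeffding_sample_size(epsilon, delta):
    # Smallest integer n with 2 * exp(-2 * n * epsilon**2) <= delta, i.e. the
    # two-sided Hoeffding bound for the mean of n i.i.d. values in [0, 1].
    # epsilon: allowed deviation of the sample mean (hypothetical parameter)
    # delta:   allowed failure probability (hypothetical parameter)
    if not (0.0 < epsilon < 1.0 and 0.0 < delta < 1.0):
        raise ValueError('epsilon and delta must lie strictly between 0 and 1')
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == '__main__':
    # Halving epsilon roughly quadruples the required number of draws.
    for eps in (0.2, 0.1, 0.05):
        print(eps, hoeffding_sample_size(eps, 0.05))
```

Under these assumptions, epsilon = 0.1 and delta = 0.05 give 185 draws for the stage at H - 1.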
This can then be propagated to H \u2212 2 and so on.\nHowever, these bounds can be very loose and they do not exploit the properties of creg functions.\nBounds developed on spacings of order statistics can help exploit the properties of creg functions.\nWe will leave this for future work.\n\nExperimental Results\n\nIn this section, we provide performance comparison of various algorithms introduced in previous\nsections over two domains. MILP-Regret refers to the randomized policy variant of the MILP ap-\nproximation algorithm for solving uncertain MDPs with dependent uncertainties. Similar one for\nminimizing CER is referred to as MILP-CER. We refer to the dynamic programming algorithm\nfor minimizing CER in the independent uncertainty case as DP-CER and \ufb01nally, we refer to the\nmaximin value algorithm as \u201cMaximin\u201d. All the algorithms \ufb01nished within 15 minutes on all the\nproblems. DP-CER was much faster than other algorithms and \ufb01nished within a minute on the\nlargest problems.\nWe provide the following results in this section:\n\n(1) Performance comparison of Greedy sampling and Random sampling strategies in the context of\n\nMILP-Regret as we increase the number of samples.\n\n(2) SAA analysis of the results obtained using MILP-Regret.\n(3) Comparison of MILP-Regret and MILP-CER policies with respect to simulated regret.\n(4) Comparison of DP-CER and Maximin.\n\nThe \ufb01rst three comparisons correspond to the dependent uncertainties case and the results are based\non a path planning problem that is motivated by disaster rescue and is a modi\ufb01cation of the one\nemployed in Bagnell et al. [1]. On top of normal transitional uncertainty, we have uncertainty over\ntransition and reward models due to random obstacles and random reward cells. Furthermore, these\nuncertainties are dependent on each other due to patterns in terrains. Each sample of the various\nuncertainties represents an individual map and can be modelled as an MDP. 
We experimented with a grid world of size 4x4 while varying numbers of obstacles, reward cells, horizon and the number of break points employed (3-6).\n\nFigure 1: In (a), (b) we have a $4 \\times 4$ grid, $H = 5$. In (c), the maximum inventory size $X = 50$, $H = 20$, $|\\xi^t| = 50$. The normal distribution mean $\\mu = \\{0.3, 0.4, 0.5\\} \\cdot X$ and $\\sigma \\le \\frac{\\min\\{\\mu, X - \\mu\\}}{3}$.\n\nIn Figure 1a, we show the effect of using greedy sampling strategy on the MILP-Regret policy. On X-axis, we represent the number of samples used for computation of policy (learning set). The test set from which the samples were selected consisted of 250 samples. We then obtained the policies using MILP-Regret corresponding to the sample sets (referred to as learning set) generated by using the two sampling strategies. On Y-axis, we show the percentage difference between simulated regret values on test and learning sample sets. We observe that for a fixed difference, the number of samples required by greedy is significantly lower in comparison to random. Furthermore, the variance in difference is also much lower for greedy. A key result from this graph is that even with just 15 samples, the difference with actual regret is less than 10%.\nFigure 1b shows that even the gap obtained using SAA analysis$^7$ is near zero (< 0.1) with 15 samples. We have shown the gap and variance on the gap over three different settings of uncertainty labeled 1, 2 and 3. Setting 3 has the highest uncertainty over the models and Setting 1 has the least uncertainty. The variance over the gap was higher for higher uncertainty settings.\nWhile MILP-CER obtained a simulated regret value (over 250 samples) within the bound provided in Proposition 1, we were unable to find any correlation in the simulated regret values of MILP-Regret and MILP-CER policies as the samples were increased. 
We have not yet ascertained the reason\nfor this lack of correlation in performance.\nIn the last result, shown in Figure 1c, we employ the well-known single-product finite-horizon stochastic\ninventory control problem [10]. We compare DP-CER against the widely used benchmark algorithm\non this domain, Maximin. The demand values at each decision epoch were taken from a\nnormal distribution. We considered three different settings of mean and variance of the demand.\nAs expected, the DP-CER approach provides much higher values than Maximin, and the difference\nbetween the two decreased as the cost-to-revenue ratio increased. We obtained similar results when\nthe demands were taken from other distributions (uniform and bi-modal).\n\nConclusions\n\nWe have introduced scalable sampling-based mechanisms for optimizing regret and a new variant of\nregret called CER in uncertain MDPs with dependent and independent uncertainties across states.\nWe have provided a variety of theoretical results that indicate the connection between regret and\nCER, quality bounds on regret in the case of MILP-Regret, optimal substructure in optimizing CER for\nthe independent uncertainty case and run-time performance for MinimizeCER. In the future, we hope to\nbetter understand the correlation between regret and CER, while also understanding the properties\nof CER policies.\nAcknowledgement This research was supported in part by the National Research Foundation Singapore\nthrough the Singapore-MIT Alliance for Research and Technology's Future Urban Mobility\nresearch programme. 
The last author was also supported by ONR grant N00014-12-1-0999.\n\n7 We have provided the method for performing SAA analysis in the supplement.\n\n[Figure 1 panels: (a) Sampling Strategies (MILP-Regret), Difference vs. Samples, for Random and Greedy; (b) SAA Analysis, Gap vs. Samples, for settings 1, 2 and 3; (c) DP-CER Analysis, Performance vs. Cost-to-revenue ratio, for DP-CER and Maximin at 0.3, 0.5 and 0.7.]\n\nReferences\n[1] J. Andrew Bagnell, Andrew Y. Ng, and Jeff G. Schneider. Solving uncertain Markov decision processes. Technical report, Carnegie Mellon University, 2001.\n[2] S\u00e9bastien Bubeck, R\u00e9mi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. In Proceedings of the 20th International Conference on Algorithmic Learning Theory, 2009.\n[3] Robert Givan, Sonia Leach, and Thomas Dean. Bounded-parameter Markov decision processes. Artificial Intelligence, 122, 2000.\n[4] G. Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30, 2004.\n[5] Michael Kearns, Yishay Mansour, and Andrew Y. Ng. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning, 49, 2002.\n[6] Shie Mannor, Ofir Mebel, and Huan Xu. Lightning does not strike twice: Robust MDPs with coupled uncertainty. In International Conference on Machine Learning (ICML), 2012.\n[7] Andrew Mastin and Patrick Jaillet. Loss bounds for uncertain transition probabilities in Markov decision processes. In IEEE Annual Conference on Decision and Control (CDC), 2012.\n[8] Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 2005.\n[9] Joelle Pineau, Geoffrey J. Gordon, and Sebastian Thrun. Point-based value iteration: An anytime algorithm for POMDPs. 
In International Joint Conference on Artificial Intelligence, 2003.\n[10] Martin Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, 1994.\n[11] Kevin Regan and Craig Boutilier. Regret-based reward elicitation for Markov decision processes. In Uncertainty in Artificial Intelligence, 2009.\n[12] Kevin Regan and Craig Boutilier. Robust policy computation in reward-uncertain MDPs using nondominated policies. In National Conference on Artificial Intelligence (AAAI), 2010.\n[13] Leonard Savage. The Foundations of Statistics. Wiley, 1954.\n[14] A. Shapiro. Monte Carlo sampling methods. In Stochastic Programming, volume 10 of Handbooks in Operations Research and Management Science. Elsevier, 2003.\n[15] Wolfram Wiesemann, Daniel Kuhn, and Ber\u00e7 Rustem. Robust Markov decision processes. Mathematics of Operations Research, 38(1), 2013.\n[16] Huan Xu and Shie Mannor. Parametric regret in uncertain Markov decision processes. In IEEE Conference on Decision and Control, CDC, 2009.\n", "award": [], "sourceid": 483, "authors": [{"given_name": "Asrar", "family_name": "Ahmed", "institution": "Singapore Management University"}, {"given_name": "Pradeep", "family_name": "Varakantham", "institution": "Singapore Management University"}, {"given_name": "Yossiri", "family_name": "Adulyasak", "institution": "Singapore-MIT Alliance for Research and Technology"}, {"given_name": "Patrick", "family_name": "Jaillet", "institution": "MIT"}]}