{"title": "Optimization, Learning, and Games with Predictable Sequences", "book": "Advances in Neural Information Processing Systems", "page_first": 3066, "page_last": 3074, "abstract": "We provide several applications of Optimistic Mirror Descent, an online learning algorithm based on the idea of predictable sequences. First, we recover the Mirror-Prox algorithm, prove an extension to Holder-smooth functions, and apply the results to saddle-point type problems. Second, we prove that a version of Optimistic Mirror Descent (which has a close relation to the Exponential Weights algorithm) can be used by two strongly-uncoupled players in a finite zero-sum matrix game to converge to the minimax equilibrium at the rate of O(log T / T). This addresses a question of Daskalakis et al, 2011. Further, we consider a partial information version of the problem. We then apply the results to approximate convex programming  and show a simple algorithm for the approximate Max-Flow problem.", "full_text": "Optimization, Learning, and Games with Predictable Sequences\n\nAlexander Rakhlin\n\nUniversity of Pennsylvania\n\nKarthik Sridharan\n\nUniversity of Pennsylvania\n\nAbstract\n\nWe provide several applications of Optimistic Mirror Descent, an online learning algorithm based on the idea of predictable sequences. First, we recover the Mirror Prox algorithm for offline optimization, prove an extension to H\u00f6lder-smooth functions, and apply the results to saddle-point type problems. Next, we prove that a version of Optimistic Mirror Descent (which has a close relation to the Exponential Weights algorithm) can be used by two strongly-uncoupled players in a finite zero-sum matrix game to converge to the minimax equilibrium at the rate of $O((\\log T)/T)$. This addresses a question of Daskalakis et al [6]. Further, we consider a partial information version of the problem. 
We then apply the results to convex programming and exhibit a simple algorithm for the approximate Max Flow problem.\n\n1 Introduction\n\nRecently, no-regret algorithms have received increasing attention in a variety of communities, including theoretical computer science, optimization, and game theory [3, 1]. The wide applicability of these algorithms is arguably due to the black-box regret guarantees that hold for arbitrary sequences. However, such regret guarantees can be loose if the sequence being encountered is not \u201cworst-case\u201d. The reduction in \u201carbitrariness\u201d of the sequence can arise from the particular structure of the problem at hand, and should be exploited. For instance, in some applications of online methods, the sequence comes from an additional computation done by the learner, thus being far from arbitrary.\n\nOne way to formally capture the partially benign nature of data is through a notion of predictable sequences [11]. We exhibit applications of this idea in several domains. First, we show that the Mirror Prox method [9], designed for optimizing non-smooth structured saddle-point problems, can be viewed as an instance of the predictable sequence approach. Predictability in this case is due precisely to smoothness of the inner optimization part and the saddle-point structure of the problem. We extend the results to H\u00f6lder-smooth functions, interpolating between the case of well-predictable gradients and \u201cunpredictable\u201d gradients.\n\nSecond, we address the question raised in [6] about existence of \u201csimple\u201d algorithms that converge at the rate of $\\tilde{O}(T^{-1})$ when employed in an uncoupled manner by players in a zero-sum finite matrix game, yet maintain the usual $O(T^{-1/2})$ rate against arbitrary sequences. We give a positive answer and exhibit a fully adaptive algorithm that does not require the prior knowledge of whether the other player is collaborating. 
Here, the additional predictability comes from the fact that both players attempt to converge to the minimax value. We also tackle a partial information version of the problem where the player has only access to the real-valued payoff of the mixed actions played by the two players on each round rather than the entire vector.\n\nOur third application is to convex programming: optimization of a linear function subject to convex constraints. This problem often arises in theoretical computer science, and we show that the idea of predictable sequences can be used here too. We provide a simple algorithm for $\\epsilon$-approximate Max Flow for a graph with $d$ edges with time complexity $\\tilde{O}(d^{3/2}/\\epsilon)$, a performance previously obtained through a relatively involved procedure [8].\n\n2 Online Learning with Predictable Gradient Sequences\n\nLet us describe the online convex optimization (OCO) problem and the basic algorithm studied in [4, 11]. Let $\\mathcal{F}$ be a convex set of moves of the learner. On round $t = 1, \\ldots, T$, the learner makes a prediction $f_t \\in \\mathcal{F}$ and observes a convex function $G_t$ on $\\mathcal{F}$. The objective is to keep regret $\\frac{1}{T}\\sum_{t=1}^T G_t(f_t) - G_t(f^*)$ small for any $f^* \\in \\mathcal{F}$. Let $\\mathcal{R}$ be a 1-strongly convex function w.r.t. some norm $\\|\\cdot\\|$ on $\\mathcal{F}$, and let $g_0 = \\operatorname*{argmin}_{g \\in \\mathcal{F}} \\mathcal{R}(g)$. Suppose that at the beginning of every round $t$, the learner has access to $M_t$, a vector computable based on the past observations or side information. In this paper we study the Optimistic Mirror Descent algorithm, defined by the interleaved sequence\n\n$$f_t = \\operatorname*{argmin}_{f \\in \\mathcal{F}} \\; \\eta_t \\langle f, M_t \\rangle + D_{\\mathcal{R}}(f, g_{t-1}), \\qquad g_t = \\operatorname*{argmin}_{g \\in \\mathcal{F}} \\; \\eta_t \\langle g, \\nabla G_t(f_t) \\rangle + D_{\\mathcal{R}}(g, g_{t-1}) \\tag{1}$$\n\nwhere $D_{\\mathcal{R}}$ is the Bregman Divergence with respect to $\\mathcal{R}$ and $\\{\\eta_t\\}$ is a sequence of step sizes that can be chosen adaptively based on the sequence observed so far. The method adheres to the OCO protocol since $M_t$ is available at the beginning of round $t$, and $\\nabla G_t(f_t)$ becomes available after the prediction $f_t$ is made. The sequence $\\{f_t\\}$ will be called primary, while $\\{g_t\\}$ \u2013 secondary. This method was proposed in [4] for $M_t = \\nabla G_{t-1}(f_{t-1})$, and the following lemma is a straightforward extension of the result in [11] for general $M_t$:\n\nLemma 1. Let $\\mathcal{F}$ be a convex set in a Banach space $\\mathcal{B}$. Let $\\mathcal{R} : \\mathcal{B} \\to \\mathbb{R}$ be a 1-strongly convex function on $\\mathcal{F}$ with respect to some norm $\\|\\cdot\\|$, and let $\\|\\cdot\\|_*$ denote the dual norm. For any fixed step-size $\\eta$, the Optimistic Mirror Descent Algorithm yields, for any $f^* \\in \\mathcal{F}$,\n\n$$\\sum_{t=1}^T G_t(f_t) - G_t(f^*) \\le \\sum_{t=1}^T \\langle f_t - f^*, \\nabla_t \\rangle \\le \\eta^{-1} R^2 + \\sum_{t=1}^T \\|\\nabla_t - M_t\\|_* \\|g_t - f_t\\| - \\frac{1}{2\\eta} \\sum_{t=1}^T \\left( \\|g_t - f_t\\|^2 + \\|g_{t-1} - f_t\\|^2 \\right) \\tag{2}$$\n\nwhere $R \\ge 0$ is such that $D_{\\mathcal{R}}(f^*, g_0) \\le R^2$ and $\\nabla_t = \\nabla G_t(f_t)$.\n\nWhen applying the lemma, we will often use the simple fact that\n\n$$\\|\\nabla_t - M_t\\|_* \\|g_t - f_t\\| = \\inf_{\\rho > 0} \\left\\{ \\frac{\\rho}{2} \\|\\nabla_t - M_t\\|_*^2 + \\frac{1}{2\\rho} \\|g_t - f_t\\|^2 \\right\\}. \\tag{3}$$\n\nIn particular, by setting $\\rho = \\eta$, we obtain the (unnormalized) regret bound of $\\eta^{-1} R^2 + (\\eta/2) \\sum_{t=1}^T \\|\\nabla_t - M_t\\|_*^2$, which is $R \\sqrt{2 \\sum_{t=1}^T \\|\\nabla_t - M_t\\|_*^2}$ by choosing $\\eta$ optimally. Since this choice is not known ahead of time, one may either employ the doubling trick, or choose the step size adaptively:\n\nCorollary 2. Consider step size\n\n$$\\eta_t = R_{\\max} \\min\\left\\{ \\left( \\sqrt{\\sum_{i=1}^{t-1} \\|\\nabla_i - M_i\\|_*^2} + \\sqrt{\\sum_{i=1}^{t-2} \\|\\nabla_i - M_i\\|_*^2} \\right)^{-1}, \\; 1 \\right\\}$$\n\nwith $R_{\\max}^2 = \\sup_{f, g \\in \\mathcal{F}} D_{\\mathcal{R}}(f, g)$. Then regret of the Optimistic Mirror Descent algorithm is upper bounded by $3.5 R_{\\max} \\left( \\sqrt{\\sum_{t=1}^T \\|\\nabla_t - M_t\\|_*^2} + 1 \\right) / T$.\n\nThese results indicate that tighter regret bounds are possible if one can guess the next gradient $\\nabla_t$ by computing $M_t$. One such case arises in offline optimization of a smooth function, whereby the previous gradient turns out to be a good proxy for the next one. More precisely, suppose we aim to optimize a function $G(f)$ whose gradients are Lipschitz continuous: $\\|\\nabla G(f) - \\nabla G(g)\\|_* \\le H \\|f - g\\|$ for some $H > 0$. In this optimization setting, no guessing of $M_t$ is needed: we may simply query the oracle for the gradient and set $M_t = \\nabla G(g_{t-1})$. The Optimistic Mirror Descent then becomes\n\n$$f_t = \\operatorname*{argmin}_{f \\in \\mathcal{F}} \\; \\eta_t \\langle f, \\nabla G(g_{t-1}) \\rangle + D_{\\mathcal{R}}(f, g_{t-1}), \\qquad g_t = \\operatorname*{argmin}_{g \\in \\mathcal{F}} \\; \\eta_t \\langle g, \\nabla G(f_t) \\rangle + D_{\\mathcal{R}}(g, g_{t-1})$$\n\nwhich can be recognized as the Mirror Prox method, due to Nemirovski [9]. By smoothness, $\\|\\nabla G(f_t) - M_t\\|_* = \\|\\nabla G(f_t) - \\nabla G(g_{t-1})\\|_* \\le H \\|f_t - g_{t-1}\\|$. Lemma 1 with Eq. (3) and $\\rho = \\eta = 1/H$ immediately yields a bound\n\n$$\\sum_{t=1}^T G(f_t) - G(f^*) \\le H R^2,$$\n\nwhich implies that the average $\\bar{f}_T = \\frac{1}{T} \\sum_{t=1}^T f_t$ satisfies $G(\\bar{f}_T) - G(f^*) \\le H R^2 / T$, a known bound for Mirror Prox. We now extend this result to arbitrary $\\alpha$-H\u00f6lder smooth functions, that is convex functions $G$ such that $\\|\\nabla G(f) - \\nabla G(g)\\|_* \\le H \\|f - g\\|^{\\alpha}$ for all $f, g \\in \\mathcal{F}$.\n\nLemma 3. Let $\\mathcal{F}$ be a convex set in a Banach space $\\mathcal{B}$ and let $\\mathcal{R} : \\mathcal{B} \\to \\mathbb{R}$ be a 1-strongly convex function on $\\mathcal{F}$ with respect to some norm $\\|\\cdot\\|$. Let $G$ be a convex $\\alpha$-H\u00f6lder smooth function with constant $H > 0$ and $\\alpha \\in [0, 1]$. Then the average $\\bar{f}_T = \\frac{1}{T} \\sum_{t=1}^T f_t$ of the trajectory given by Optimistic Mirror Descent Algorithm enjoys\n\n$$G(\\bar{f}_T) - \\inf_{f \\in \\mathcal{F}} G(f) \\le \\frac{8 H R^{1+\\alpha}}{T^{\\frac{1+\\alpha}{2}}}$$\n\nwhere $R \\ge 0$ is such that $\\sup_{f \\in \\mathcal{F}} D_{\\mathcal{R}}(f, g_0) \\le R$.\n\nThis result provides a smooth interpolation between the $T^{-1/2}$ rate at $\\alpha = 0$ (that is, no predictability of the gradient is possible) and the $T^{-1}$ rate when the smoothness structure allows for a dramatic speed up with a very simple modification of the original Mirror Descent.\n\n3 Structured Optimization\n\nIn this section we consider the structured optimization problem\n\n$$\\operatorname*{argmin}_{f \\in \\mathcal{F}} G(f)$$\n\nwhere $G(f)$ is of the form $G(f) = \\sup_{x \\in \\mathcal{X}} \\phi(f, x)$ with $\\phi(\\cdot, x)$ convex for every $x \\in \\mathcal{X}$ and $\\phi(f, \\cdot)$ concave for every $f \\in \\mathcal{F}$. Both $\\mathcal{F}$ and $\\mathcal{X}$ are assumed to be convex sets. While $G$ itself need not be smooth, it has been recognized that the structure can be exploited to improve rates of optimization if the function $\\phi$ is smooth [10]. From the point of view of online learning, we will see that the optimization problem of the saddle point type can be solved by playing two online convex optimization algorithms against each other (henceforth called Players I and II).\n\nSpecifically, assume that Player I produces a sequence $f_1$, . . . 
, $f_T$ by using a regret-minimization algorithm, such that\n\n$$\\frac{1}{T} \\sum_{t=1}^T \\phi(f_t, x_t) - \\inf_{f \\in \\mathcal{F}} \\frac{1}{T} \\sum_{t=1}^T \\phi(f, x_t) \\le \\mathrm{Rate}^1(x_1, \\ldots, x_T) \\tag{4}$$\n\nand Player II produces $x_1, \\ldots, x_T$ with\n\n$$\\frac{1}{T} \\sum_{t=1}^T (-\\phi(f_t, x_t)) - \\inf_{x \\in \\mathcal{X}} \\frac{1}{T} \\sum_{t=1}^T (-\\phi(f_t, x)) \\le \\mathrm{Rate}^2(f_1, \\ldots, f_T). \\tag{5}$$\n\nBy a standard argument (see e.g. [7]),\n\n$$\\inf_{f} \\frac{1}{T} \\sum_{t=1}^T \\phi(f, x_t) \\le \\inf_{f} \\phi(f, \\bar{x}_T) \\le \\sup_{x} \\inf_{f} \\phi(f, x) \\le \\inf_{f} \\sup_{x} \\phi(f, x) \\le \\sup_{x} \\phi(\\bar{f}_T, x) \\le \\sup_{x} \\frac{1}{T} \\sum_{t=1}^T \\phi(f_t, x)$$\n\nwhere $\\bar{f}_T = \\frac{1}{T} \\sum_{t=1}^T f_t$ and $\\bar{x}_T = \\frac{1}{T} \\sum_{t=1}^T x_t$. By adding (4) and (5), we have\n\n$$\\sup_{x \\in \\mathcal{X}} \\frac{1}{T} \\sum_{t=1}^T \\phi(f_t, x) - \\inf_{f \\in \\mathcal{F}} \\frac{1}{T} \\sum_{t=1}^T \\phi(f, x_t) \\le \\mathrm{Rate}^1(x_1, \\ldots, x_T) + \\mathrm{Rate}^2(f_1, \\ldots, f_T) \\tag{6}$$\n\nwhich sandwiches the previous sequence of inequalities up to the sum of regret rates and implies near-optimality of $\\bar{f}_T$ and $\\bar{x}_T$.\n\nLemma 4. Suppose both players employ the Optimistic Mirror Descent algorithm with, respectively, predictable sequences $M^1_t$ and $M^2_t$, 1-strongly convex functions $\\mathcal{R}_1$ on $\\mathcal{F}$ (w.r.t. $\\|\\cdot\\|_{\\mathcal{F}}$) and $\\mathcal{R}_2$ on $\\mathcal{X}$ (w.r.t. $\\|\\cdot\\|_{\\mathcal{X}}$), and fixed learning rates $\\eta$ and $\\eta'$. Let $\\{f_t\\}$ and $\\{x_t\\}$ denote the primary sequences of the players while let $\\{g_t\\}$, $\\{y_t\\}$ denote the secondary. Then for any $\\alpha, \\beta > 0$,\n\n$$\\sup_{x \\in \\mathcal{X}} \\phi(\\bar{f}_T, x) - \\inf_{f \\in \\mathcal{F}} \\sup_{x \\in \\mathcal{X}} \\phi(f, x) \\tag{7}$$\n\n$$\\le \\frac{R_1^2}{\\eta} + \\frac{\\alpha}{2} \\sum_{t=1}^T \\|\\nabla_f \\phi(f_t, x_t) - M^1_t\\|_{\\mathcal{F}^*}^2 + \\frac{1}{2\\alpha} \\sum_{t=1}^T \\|g_t - f_t\\|_{\\mathcal{F}}^2 - \\frac{1}{2\\eta} \\sum_{t=1}^T \\left( \\|g_t - f_t\\|_{\\mathcal{F}}^2 + \\|g_{t-1} - f_t\\|_{\\mathcal{F}}^2 \\right)$$\n\n$$+ \\frac{R_2^2}{\\eta'} + \\frac{\\beta}{2} \\sum_{t=1}^T \\|\\nabla_x \\phi(f_t, x_t) - M^2_t\\|_{\\mathcal{X}^*}^2 + \\frac{1}{2\\beta} \\sum_{t=1}^T \\|y_t - x_t\\|_{\\mathcal{X}}^2 - \\frac{1}{2\\eta'} \\sum_{t=1}^T \\left( \\|y_t - x_t\\|_{\\mathcal{X}}^2 + \\|y_{t-1} - x_t\\|_{\\mathcal{X}}^2 \\right)$$\n\nwhere $R_1$ and $R_2$ are such that $D_{\\mathcal{R}_1}(f^*, g_0) \\le R_1^2$ and $D_{\\mathcal{R}_2}(x^*, y_0) \\le R_2^2$, and $\\bar{f}_T = \\frac{1}{T} \\sum_{t=1}^T f_t$.\n\nThe proof of Lemma 4 is immediate from Lemma 1. We obtain the following corollary:\n\nCorollary 5. Suppose $\\phi : \\mathcal{F} \\times \\mathcal{X} \\to \\mathbb{R}$ is H\u00f6lder smooth in the following sense:\n\n$$\\|\\nabla_f \\phi(f, x) - \\nabla_f \\phi(g, x)\\|_{\\mathcal{F}^*} \\le H_1 \\|f - g\\|_{\\mathcal{F}}^{\\alpha}, \\qquad \\|\\nabla_f \\phi(f, x) - \\nabla_f \\phi(f, y)\\|_{\\mathcal{F}^*} \\le H_2 \\|x - y\\|_{\\mathcal{X}}^{\\alpha'},$$\n\n$$\\|\\nabla_x \\phi(f, x) - \\nabla_x \\phi(g, x)\\|_{\\mathcal{X}^*} \\le H_4 \\|f - g\\|_{\\mathcal{F}}^{\\beta}, \\qquad \\|\\nabla_x \\phi(f, x) - \\nabla_x \\phi(f, y)\\|_{\\mathcal{X}^*} \\le H_3 \\|x - y\\|_{\\mathcal{X}}^{\\beta'}.$$\n\nLet $\\gamma = \\min\\{\\alpha, \\alpha', \\beta, \\beta'\\}$, $H = \\max\\{H_1, H_2, H_3, H_4\\}$. Suppose both players employ Optimistic Mirror Descent with $M^1_t = \\nabla_f \\phi(g_{t-1}, y_{t-1})$ and $M^2_t = \\nabla_x \\phi(g_{t-1}, y_{t-1})$, where $\\{g_t\\}$ and $\\{y_t\\}$ are the secondary sequences updated by the two algorithms, and with step sizes $\\eta = \\eta' = (R_1^2 + R_2^2)^{\\frac{1-\\gamma}{2}} (2H)^{-1} \\left( \\frac{T}{2} \\right)^{\\frac{\\gamma - 1}{2}}$. Then\n\n$$\\sup_{x \\in \\mathcal{X}} \\phi(\\bar{f}_T, x) - \\inf_{f \\in \\mathcal{F}} \\sup_{x \\in \\mathcal{X}} \\phi(f, x) \\le \\frac{4 H (R_1^2 + R_2^2)^{\\frac{1+\\gamma}{2}}}{T^{\\frac{1+\\gamma}{2}}} \\tag{8}$$\n\nAs revealed in the proof of this corollary, the negative terms in (7), that come from an upper bound on regret of Player I, in fact contribute to cancellations with positive terms in regret of Player II, and vice versa. Such a coupling of the upper bounds on regret of the two players can be seen as leading to faster rates under the appropriate assumptions, and this idea will be exploited to a great extent in the proofs of the next section.\n\n4 Zero-sum Game and Uncoupled Dynamics\n\nThe notions of a zero-sum matrix game and a minimax equilibrium are arguably the most basic and important notions of game theory. The tight connection between linear programming and minimax equilibrium suggests that there might be simple dynamics that can lead the two players of the game to eventually converge to the equilibrium value. Existence of such simple or natural dynamics is of interest in behavioral economics, where one asks whether agents can discover static solution concepts of the game iteratively and without extensive communication.\n\nMore formally, let $A \\in [-1, 1]^{n \\times m}$ be a matrix with bounded entries. The two players aim to find a pair of near-optimal mixed strategies $(\\bar{f}, \\bar{x}) \\in \\Delta_n \\times \\Delta_m$ such that $\\bar{f}^{\\top} A \\bar{x}$ is close to the minimax value $\\min_{f \\in \\Delta_n} \\max_{x \\in \\Delta_m} f^{\\top} A x$, where $\\Delta_n$ is the probability simplex over $n$ actions. Of course, this is a particular form of the saddle point problem considered in the previous section, with $\\phi(f, x) = f^{\\top} A x$. It is well-known (and follows immediately from (6)) that the players can compute near-optimal strategies by simply playing no-regret algorithms [7]. More precisely, on round $t$, the players I and II \u201cpredict\u201d the mixed strategies $f_t$ and $x_t$ and observe $A x_t$ and $f_t^{\\top} A$, respectively.\n\nWhile black-box regret minimization algorithms, such as Exponential Weights, immediately yield $O(T^{-1/2})$ convergence rates, Daskalakis et al. [6] asked whether faster methods exist. To make the problem well-posed, it is required that the two players are strongly uncoupled: neither $A$ nor the number of available actions of the opponent is known to either player, no \u201cfunny bit arithmetic\u201d is allowed, and memory storage of each player allows only for constant number of payoff vectors. The authors of [6] exhibited a near-optimal algorithm that, if used by both players, yields a pair of mixed strategies that constitutes an $O\\left( \\frac{\\log(m+n) \\left( \\log T + (\\log(m+n))^{3/2} \\right)}{T} \\right)$-approximate minimax equilibrium. Furthermore, the method has a regret bound of the same order as Exponential Weights when faced with an arbitrary sequence. The algorithm in [6] is an application of the excessive gap technique of Nesterov, and requires careful choreography and interleaving of rounds between the two non-communicating players. The authors, therefore, asked whether a simple algorithm (e.g. a modification of Exponential Weights) can in fact achieve the same result. We answer this in the affirmative. While a direct application of Mirror Prox does not yield the result (and also does not provide strong decoupling), below we show that a modification of Optimistic Mirror Descent achieves the goal. Furthermore, by choosing the step size adaptively, the same method guarantees the typical $O(T^{-1/2})$ regret if not faced with a compliant player, thus ensuring robustness.\n\nIn Section 4.1, we analyze the \u201cfirst-order information\u201d version of the problem, as described above: upon playing the respective mixed strategies $f_t$ and $x_t$ on round $t$, Player I observes $A x_t$ and Player II observes $f_t^{\\top} A$. Then, in Section 4.2, we consider an interesting extension to partial information, whereby the players submit their moves $f_t$, $x_t$ but only observe the real value $f_t^{\\top} A x_t$. Recall that in both cases the matrix $A$ is not known to the players.\n\n4.1 First-Order Information\n\nConsider the following simple algorithm. Initialize $f_0 = g'_0 \\in \\Delta_n$ and $x_0 = y'_0 \\in \\Delta_m$ to be uniform distributions, set $\\beta = 1/T^2$ and proceed as follows:\n\nOn round $t$, Player I performs\n\nPlay $f_t$ and observe $A x_t$\nUpdate $g_t(i) \\propto g'_{t-1}(i) \\exp\\{-\\eta_t [A x_t]_i\\}$, $g'_t = (1 - \\beta) g_t + (\\beta/n) \\mathbf{1}_n$\n$f_{t+1}(i) \\propto g'_t(i) \\exp\\{-\\eta_{t+1} [A x_t]_i\\}$\n\nwhile simultaneously Player II performs\n\nPlay $x_t$ and observe $f_t^{\\top} A$\nUpdate $y_t(i) \\propto y'_{t-1}(i) \\exp\\{-\\eta'_t [f_t^{\\top} A]_i\\}$, $y'_t = (1 - \\beta) y_t + (\\beta/m) \\mathbf{1}_m$\n$x_{t+1}(i) \\propto y'_t(i) \\exp\\{-\\eta'_{t+1} [f_t^{\\top} A]_i\\}$\n\nHere, $\\mathbf{1}_n \\in \\mathbb{R}^n$ is a vector of all ones and both $[b]_i$ and $b(i)$ refer to the $i$-th coordinate of a vector $b$. Other than the \u201cmixing in\u201d of the uniform distribution, the algorithm for both players is simply the Optimistic Mirror Descent with the (negative) entropy function. In fact, the step of mixing in the uniform distribution is only needed when some coordinate of $g_t$ (resp., $y_t$) is smaller than $1/(nT^2)$. Furthermore, this step is also not needed if none of the players deviate from the prescribed method. In such a case, the resulting algorithm is simply the constant step-size Exponential Weights $f_t(i) \\propto \\exp\\{-\\eta \\sum_{s=1}^{t-2} [A x_{s-1}]_i + 2\\eta [A x_{t-1}]_i\\}$, but with a factor 2 in front of the latest loss vector!\n\nProposition 6. Let $A \\in [-1, 1]^{n \\times m}$, $\\mathcal{F} = \\Delta_n$, $\\mathcal{X} = \\Delta_m$. If both players use the above algorithm with, respectively, $M^1_t = A x_{t-1}$ and $M^2_t = f_{t-1}^{\\top} A$, and the adaptive step sizes\n\n$$\\eta_t = \\min\\left\\{ \\log(nT) \\left( \\sqrt{\\sum_{i=1}^{t-1} \\|A x_i - A x_{i-1}\\|_*^2} + \\sqrt{\\sum_{i=1}^{t-2} \\|A x_i - A x_{i-1}\\|_*^2} \\right)^{-1}, \\; \\frac{1}{11} \\right\\}$$\n\nand\n\n$$\\eta'_t = \\min\\left\\{ \\log(mT) \\left( \\sqrt{\\sum_{i=1}^{t-1} \\|f_i^{\\top} A - f_{i-1}^{\\top} A\\|_*^2} + \\sqrt{\\sum_{i=1}^{t-2} \\|f_i^{\\top} A - f_{i-1}^{\\top} A\\|_*^2} \\right)^{-1}, \\; \\frac{1}{11} \\right\\}$$\n\nrespectively, then the pair $(\\bar{f}_T, \\bar{x}_T)$ is an $O\\left( \\frac{\\log m + \\log n + \\log T}{T} \\right)$-approximate minimax equilibrium. Furthermore, if only one player (say, Player I) follows the above algorithm, her regret against any sequence $x_1, \\ldots, x_T$ of plays is\n\n$$O\\left( \\frac{\\log(nT)}{T} \\left( \\sqrt{\\sum_{t=1}^T \\|A x_t - A x_{t-1}\\|_*^2} + 1 \\right) \\right). \\tag{9}$$\n\nIn particular, this implies the worst-case regret of $O\\left( \\frac{\\log(nT)}{\\sqrt{T}} \\right)$ in the general setting of online linear optimization.\n\nWe remark that (9) can give intermediate rates for regret in the case that the second player deviates from the prescribed strategy but produces \u201cstable\u201d moves. For instance, if the second player employs a mirror descent algorithm (or Follow the Regularized Leader / Exponential Weights method) with step size $\\eta$, one can typically show stability $\\|x_t - x_{t-1}\\| = O(\\eta)$. In this case, (9) yields the rate $O\\left( \\frac{\\eta \\log T}{\\sqrt{T}} \\right)$ for the first player. A typical setting of $\\eta \\propto T^{-1/2}$ for the second player still ensures the $O(\\log T / T)$ regret for the first player.\n\nLet us finish with a technical remark. The reason for the extra step of \u201cmixing in\u201d the uniform distribution stems from the goal of having an adaptive and robust method that still attains $O(T^{-1/2})$ regret if the other player deviates from using the algorithm. If one is only interested in the dynamics when both players cooperate, this step is not necessary, and in this case the extraneous $\\log T$ factor disappears from the above bound, leading to the $O\\left( \\frac{\\log n + \\log m}{T} \\right)$ convergence. On the technical side, the need for the extra step is the following. The adaptive step size result of Corollary 2 involves the term $R_{\\max}^2 \\ge \\sup_{g} D_{\\mathcal{R}_1}(f^*, g)$ which is potentially infinite for the negative entropy function $\\mathcal{R}_1$. It is possible that the doubling trick or the analysis of Auer et al. [2] (who encountered the same problem for the Exponential Weights algorithm) can remove the extra $\\log T$ factor while still preserving the regret minimization property. We also remark that $R_{\\max}$ is small when $\\mathcal{R}_1$ is instead the $p$-norm; hence, the use of this regularizer avoids the extraneous logarithmic in $T$ factor while still preserving the logarithmic dependence on $n$ and $m$. However, projection onto the simplex under the $p$-norm is not as elegant as the Exponential Weights update.\n\n4.2 Partial Information\n\nWe now turn to the partial (or, zero-th order) information model. Recall that the matrix $A$ is not known to the players, yet we are interested in finding $\\epsilon$-optimal minimax strategies. On each round, the two players choose mixed strategies $f_t \\in \\Delta_n$ and $x_t \\in \\Delta_m$, respectively, and observe $f_t^{\\top} A x_t$. Now the question is, how many such observations do we need to get to an $\\epsilon$-optimal minimax strategy? Can this be done while still ensuring the usual no-regret rate?\n\nThe specific setting we consider below requires that on each round $t$, the two players play four times, and that these four plays are $\\delta$-close to each other (that is, $\\|f^i_t - f^j_t\\|_1 \\le \\delta$ for $i, j \\in \\{1, \\ldots, 4\\}$). Interestingly, up to logarithmic factors, the fast rate of the previous section is possible even in this scenario, but we do require the knowledge of the number of actions of the opposing player (or, an upper bound on this number). We leave it as an open problem the question of whether one can attain the $1/T$-type rate with only one play per round.\n\nPlayer I\n\n$u_1, \\ldots, u_{n-1}$: orthonormal basis of $\\Delta_n$\nInitialize $g_1, f_1 = \\frac{1}{n} \\mathbf{1}_n$; Draw $i_0 \\sim \\mathrm{Unif}([n-1])$\nAt time $t = 1$ to $T$\nPlay $f_t$\nDraw $i_t \\sim \\mathrm{Unif}([n-1])$\nObserve:\n$r^+_t = (f_t + \\delta u_{i_{t-1}})^{\\top} A x_t$\n$r^-_t = (f_t - \\delta u_{i_{t-1}})^{\\top} A x_t$\n$\\bar{r}^+_t = (f_t + \\delta u_{i_t})^{\\top} A x_t$\n$\\bar{r}^-_t = (f_t - \\delta u_{i_t})^{\\top} A x_t$\nBuild estimates:\n$\\hat{a}_t = \\frac{n}{2\\delta} (r^+_t - r^-_t) u_{i_{t-1}}$\n$\\bar{a}_t = \\frac{n}{2\\delta} (\\bar{r}^+_t - \\bar{r}^-_t) u_{i_t}$\nUpdate:\n$g_t(i) \\propto g'_{t-1}(i) \\exp\\{-\\eta_t \\hat{a}_t(i)\\}$\n$g'_t = (1 - \\beta) g_t + (\\beta/n) \\mathbf{1}$\n$f_{t+1}(i) \\propto g'_t(i) \\exp\\{-\\eta_{t+1} \\bar{a}_t(i)\\}$\nEnd\n\nPlayer II\n\n$v_1, \\ldots, v_{m-1}$: orthonormal basis of $\\Delta_m$\nInitialize $y_1, x_1 = \\frac{1}{m} \\mathbf{1}_m$; Draw $j_0 \\sim \\mathrm{Unif}([m-1])$\nAt time $t = 1$ to $T$\nPlay $x_t$\nDraw $j_t \\sim \\mathrm{Unif}([m-1])$\nObserve:\n$s^+_t = -f_t^{\\top} A (x_t + \\delta v_{j_{t-1}})$\n$s^-_t = -f_t^{\\top} A (x_t - \\delta v_{j_{t-1}})$\n$\\bar{s}^+_t = -f_t^{\\top} A (x_t + \\delta v_{j_t})$\n$\\bar{s}^-_t = -f_t^{\\top} A (x_t - \\delta v_{j_t})$\nBuild estimates:\n$\\hat{b}_t = \\frac{m}{2\\delta} (s^+_t - s^-_t) v_{j_{t-1}}$\n$\\bar{b}_t = \\frac{m}{2\\delta} (\\bar{s}^+_t - \\bar{s}^-_t) v_{j_t}$\nUpdate:\n$y_t(i) \\propto y'_{t-1}(i) \\exp\\{-\\eta'_t \\hat{b}_t(i)\\}$\n$y'_t = (1 - \\beta) y_t + (\\beta/m) \\mathbf{1}$\n$x_{t+1}(i) \\propto y'_t(i) \\exp\\{-\\eta'_{t+1} \\bar{b}_t(i)\\}$\nEnd\n\nLemma 7. Let $A \\in [-1, 1]^{n \\times m}$, $\\mathcal{F} = \\Delta_n$, $\\mathcal{X} = \\Delta_m$, let $\\delta$ be small enough (e.g. exponentially small in $m, n, T$), and let $\\beta = 1/T^2$. If both players use the above algorithms with the adaptive step sizes\n\n$$\\eta_t = \\min\\left\\{ \\sqrt{\\log(nT)} \\, \\frac{\\sqrt{\\sum_{i=1}^{t-1} \\|\\hat{a}_i - \\bar{a}_{i-1}\\|_*^2} - \\sqrt{\\sum_{i=1}^{t-2} \\|\\hat{a}_i - \\bar{a}_{i-1}\\|_*^2}}{\\|\\hat{a}_{t-1} - \\bar{a}_{t-2}\\|_*^2}, \\; \\frac{1}{28 m \\sqrt{\\log(mT)}} \\right\\}$$\n\nand\n\n$$\\eta'_t = \\min\\left\\{ \\sqrt{\\log(mT)} \\, \\frac{\\sqrt{\\sum_{i=1}^{t-1} \\|\\hat{b}_i - \\bar{b}_{i-1}\\|_*^2} - \\sqrt{\\sum_{i=1}^{t-2} \\|\\hat{b}_i - \\bar{b}_{i-1}\\|_*^2}}{\\|\\hat{b}_{t-1} - \\bar{b}_{t-2}\\|_*^2}, \\; \\frac{1}{28 n \\sqrt{\\log(nT)}} \\right\\}$$\n\nrespectively, then the pair $(\\bar{f}_T, \\bar{x}_T)$ is an\n\n$$O\\left( \\frac{m \\log(nT) \\sqrt{\\log(mT)} + n \\log(mT) \\sqrt{\\log(nT)}}{T} \\right)$$\n\n-approximate minimax equilibrium. Furthermore, if only one player (say, Player I) follows the above algorithm, her regret against any sequence $x_1, \\ldots, x_T$ of plays is bounded by\n\n$$O\\left( \\frac{m \\sqrt{\\log(mT) \\log(nT)} + n \\sqrt{\\log(nT) \\sum_{t=1}^T \\|x_t - x_{t-1}\\|^2}}{T} \\right)$$\n\nWe leave it as an open problem to find an algorithm that attains the $1/T$-type rate when both players only observe the value $e_i^{\\top} A e_j = A_{i,j}$ upon drawing pure actions $i, j$ from their respective mixed strategies $f_t, x_t$. We hypothesize a rate better than $T^{-1/2}$ is not possible in this scenario.\n\n5 Approximate Smooth Convex Programming\n\nIn this section we show how one can use the structured optimization results from Section 3 for approximately solving convex programming problems. 
Specifically, consider the optimization problem\n\n$$\\operatorname*{argmax}_{f \\in \\mathcal{G}} \\; c^{\\top} f \\quad \\text{s.t.} \\quad \\forall i \\in [d], \\; G_i(f) \\le 1 \\tag{10}$$\n\nwhere $\\mathcal{G}$ is a convex set and each $G_i$ is an $H$-smooth convex function. Let the optimal value of the above optimization problem be given by $F^* > 0$, and without loss of generality assume $F^*$ is known (one typically performs binary search if it is not known). Define the sets $\\mathcal{F} = \\{f : f \\in \\mathcal{G}, \\; c^{\\top} f = F^*\\}$ and $\\mathcal{X} = \\Delta_d$. The convex programming problem in (10) can now be reformulated as\n\n$$\\operatorname*{argmin}_{f \\in \\mathcal{F}} \\max_{i \\in [d]} G_i(f) = \\operatorname*{argmin}_{f \\in \\mathcal{F}} \\sup_{x \\in \\mathcal{X}} \\sum_{i=1}^d x(i) G_i(f). \\tag{11}$$\n\nThis problem is in the saddle-point form, as studied earlier in the paper. We may think of the first player as aiming to minimize the above expression over $\\mathcal{F}$, while the second player maximizes over a mixture of constraints with the aim of violating at least one of them.\n\nLemma 8. Fix $\\gamma, \\epsilon > 0$. Assume there exists $f_0 \\in \\mathcal{G}$ such that $c^{\\top} f_0 \\ge 0$ and for every $i \\in [d]$, $G_i(f_0) \\le 1 - \\gamma$. Suppose each $G_i$ is 1-Lipschitz over $\\mathcal{F}$. Consider the solution\n\n$$\\hat{f}_T = (1 - \\alpha) \\bar{f}_T + \\alpha f_0$$\n\nwhere $\\alpha = \\frac{\\epsilon}{\\epsilon + \\gamma}$ and $\\bar{f}_T = \\frac{1}{T} \\sum_{t=1}^T f_t \\in \\mathcal{F}$ is the average of the trajectory of the procedure in Lemma 4 for the optimization problem (11). Let $\\mathcal{R}_1(\\cdot) = \\frac{1}{2} \\|\\cdot\\|_2^2$ and $\\mathcal{R}_2$ be the entropy function. Further let $B$ be a known constant such that $B \\ge \\|f^* - g_0\\|_2$ where $g_0 \\in \\mathcal{F}$ is some initialization and $f^* \\in \\mathcal{F}$ is the (unknown) solution to the optimization problem. Set $\\eta = \\operatorname*{argmin}_{\\eta \\le H^{-1}} \\left\\{ \\frac{B^2}{\\eta} + \\frac{\\eta \\log d}{1 - \\eta H} \\right\\}$, $\\eta' = \\frac{1}{\\eta} - H$, $M^1_t = \\sum_{i=1}^d y_{t-1}(i) \\nabla G_i(g_{t-1})$ and $M^2_t = (G_1(g_{t-1}), \\ldots, G_d(g_{t-1}))$. Let the number of iterations $T$ be such that\n\n$$T > \\frac{1}{\\epsilon} \\inf_{\\eta \\le H^{-1}} \\left\\{ \\frac{B^2}{\\eta} + \\frac{\\eta \\log d}{1 - \\eta H} \\right\\}$$\n\nWe then have that $\\hat{f}_T \\in \\mathcal{G}$ satisfies all $d$ constraints and is $\\frac{\\epsilon}{\\gamma}$-approximate, that is\n\n$$c^{\\top} \\hat{f}_T \\ge \\left( 1 - \\frac{\\epsilon}{\\gamma} \\right) F^*.$$\n\nLemma 8 tells us that using the predictable sequences approach for the two players, one can obtain an $\\frac{\\epsilon}{\\gamma}$-approximate solution to the smooth convex programming problem in number of iterations at most order $1/\\epsilon$. If $T_1$ (resp. $T_2$) is the time complexity for single update of the predictable sequence algorithm of Player I (resp. Player II), then time complexity of the overall procedure is $O\\left( \\frac{T_1 + T_2}{\\epsilon} \\right)$.\n\n5.1 Application to Max-Flow\n\nWe now apply the above result to the problem of finding Max Flow between a source and a sink in a network, such that the capacity constraint on each edge is satisfied. For simplicity, consider a network where each edge has capacity 1 (the method can be easily extended to the case of varying capacity). Suppose the number of edges $d$ in the network is the same order as number of vertices in the network. The Max Flow problem can be seen as an instance of a convex (linear) programming problem, and we apply the proposed algorithm for structured optimization to obtain an approximate solution.\n\nFor the Max Flow problem, the sets $\\mathcal{G}$ and $\\mathcal{F}$ are given by sets of linear equalities. Further, if we use Euclidean norm squared as regularizer for the flow player, then projection step can be performed in $O(d)$ time using conjugate gradient method. This is because we are simply minimizing Euclidean norm squared subject to equality constraints which is well conditioned. Hence $T_1 = O(d)$. Similarly, the Exponential Weights update has time complexity $O(d)$ as there are order $d$ constraints, and so overall time complexity to produce $\\epsilon$ approximate solution is given by $O(nd)$, where $n$ is the number of iterations of the proposed procedure.\n\nOnce again, we shall assume that we know the value of the maximum flow $F^*$ (for, otherwise, we can use binary search to obtain it).\n\nCorollary 9. Applying the procedure for smooth convex programming from Lemma 8 to the Max Flow problem with $f_0 = 0 \\in \\mathcal{G}$ the 0 flow, the time complexity to compute an $\\epsilon$-approximate Max Flow is bounded by\n\n$$O\\left( \\frac{d^{3/2} \\sqrt{\\log d}}{\\epsilon} \\right).$$\n\nThis time complexity matches the known result from [8], but with a much simpler procedure (gradient descent for the flow player and Exponential Weights for the constraints). It would be interesting to see whether the techniques presented here can be used to improve the dependence on $d$ to $d^{4/3}$ or better while maintaining the $1/\\epsilon$ dependence. While the result of [5] has the improved $d^{4/3}$ dependence, the complexity in terms of $\\epsilon$ is much worse.\n\n6 Discussion\n\nWe close this paper with a discussion. As we showed, the notion of using extra information about the sequence is a powerful tool with applications in optimization, convex programming, game theory, to name a few. All the applications considered in this paper, however, used some notion of smoothness for constructing the predictable process $M_t$. An interesting direction of further research is to isolate more general conditions under which the next gradient is predictable, perhaps even when the functions are not smooth in any sense. 
For instance one could use techniques from bundle methods to further restrict the set of possible gradients the function being optimized can have at various points in the feasible set. This could then be used to solve for the right predictable sequence to use so as to optimize the bounds. Using this notion of selecting predictable sequences one can hope to derive adaptive optimization procedures that in practice can provide rapid convergence.\n\nAcknowledgements: We thank Vianney Perchet for insightful discussions. We gratefully acknowledge the support of NSF under grants CAREER DMS-0954737 and CCF-1116928, as well as Dean\u2019s Research Fund.\n\nReferences\n\n[1] S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: A meta-algorithm and applications. Theory of Computing, 8(1):121\u2013164, 2012.\n\n[2] P. Auer, N. Cesa-Bianchi, and C. Gentile. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64(1):48\u201375, 2002.\n\n[3] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.\n\n[4] C.-K. Chiang, T. Yang, C.-J. Lee, M. Mahdavi, C.-J. Lu, R. Jin, and S. Zhu. Online optimization with gradual variations. In COLT, 2012.\n\n[5] P. Christiano, J. A. Kelner, A. Madry, D. A. Spielman, and S.-H. Teng. Electrical flows, Laplacian systems, and faster approximation of maximum flow in undirected graphs. In Proceedings of the 43rd annual ACM symposium on Theory of computing, pages 273\u2013282. ACM, 2011.\n\n[6] C. Daskalakis, A. Deckelbaum, and A. Kim. Near-optimal no-regret algorithms for zero-sum games. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, pages 235\u2013254. SIAM, 2011.\n\n[7] Y. Freund and R. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1):79\u2013103, 1999.\n\n[8] A. Goldberg and S. Rao. 
Beyond the flow decomposition barrier. Journal of the ACM (JACM), 45(5):783\u2013797, 1998.\n\n[9] A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229\u2013251, 2004.\n\n[10] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127\u2013152, 2005.\n\n[11] A. Rakhlin and K. Sridharan. Online learning with predictable sequences. In Proceedings of the 26th Annual Conference on Learning Theory (COLT), 2013.", "award": [], "sourceid": 1400, "authors": [{"given_name": "Sasha", "family_name": "Rakhlin", "institution": "University of Pennsylvania"}, {"given_name": "Karthik", "family_name": "Sridharan", "institution": "University of Pennsylvania"}]}