{"title": "Local Minimax Complexity of Stochastic Convex Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 3423, "page_last": 3431, "abstract": "We extend the traditional worst-case, minimax analysis of stochastic convex optimization by introducing a localized form of minimax complexity for individual functions.  Our main result gives function-specific lower and upper bounds on the number of stochastic subgradient evaluations needed to optimize either the function or its ``hardest local alternative'' to a given numerical precision.  The bounds are expressed in terms of a localized and computational analogue of the modulus of continuity that is central to statistical minimax analysis. We show how the computational modulus of continuity can be explicitly calculated in concrete cases, and relates to the curvature of the function at the optimum.  We also prove a superefficiency result that demonstrates it is a meaningful benchmark, acting as a computational analogue of the Fisher information in statistical estimation. The nature and practical implications of the results are demonstrated in simulations.", "full_text": "Local Minimax Complexity of\nStochastic Convex Optimization\n\nYuancheng Zhu\n\nWharton Statistics Department\n\nUniversity of Pennsylvania\n\nSabyasachi Chatterjee\nDepartment of Statistics\nUniversity of Chicago\n\nJohn Duchi\n\nDepartment of Statistics\n\nDepartment of Electrical Engineering\n\nStanford University\n\nJohn Lafferty\n\nDepartment of Statistics\n\nDepartment of Computer Science\n\nUniversity of Chicago\n\nAbstract\n\nWe extend the traditional worst-case, minimax analysis of stochastic convex op-\ntimization by introducing a localized form of minimax complexity for individual\nfunctions. 
Our main result gives function-specific lower and upper bounds on the number of stochastic subgradient evaluations needed to optimize either the function or its \u201chardest local alternative\u201d to a given numerical precision. The bounds are expressed in terms of a localized and computational analogue of the modulus of continuity that is central to statistical minimax analysis. We show how the computational modulus of continuity can be explicitly calculated in concrete cases, and how it relates to the curvature of the function at the optimum. We also prove a superefficiency result that demonstrates it is a meaningful benchmark, acting as a computational analogue of the Fisher information in statistical estimation. The nature and practical implications of the results are demonstrated in simulations.\n\n1 Introduction\n\nThe traditional analysis of algorithms is based on a worst-case, minimax formulation. One studies the running time, measured in terms of the smallest number of arithmetic operations required by any algorithm to solve any instance in the family of problems under consideration. Classical worst-case complexity theory focuses on discrete problems. In the setting of convex optimization, where the problem instances require numerical rather than combinatorial optimization, Nemirovsky and Yudin [12] developed an approach to minimax analysis based on a first-order oracle model of computation. In this model, an algorithm to minimize a convex function can make queries to a first-order \u201coracle,\u201d and the complexity is defined as the smallest error achievable using a specified number of queries. Specifically, the oracle is queried with an input point $x \\in \\mathcal{C}$ from a convex domain $\\mathcal{C}$, and returns an unbiased estimate of a subgradient vector to the function $f$ at $x$. After $T$ calls to the oracle, an algorithm $A$ returns a value $\\hat{x}_A \\in \\mathcal{C}$, which is a random variable due to the stochastic nature of the oracle, and possibly also due to randomness in the algorithm. The Nemirovsky-Yudin analysis reveals that, in the worst case, the number of calls to the oracle required to drive the expected error $\\mathbb{E}(f(\\hat{x}_A) - \\inf_{x \\in \\mathcal{C}} f(x))$ below $\\epsilon$ scales as $T = O(1/\\epsilon)$ for the class of strongly convex functions, and as $T = O(1/\\epsilon^2)$ for the class of Lipschitz convex functions.\n\nIn practice, one naturally finds that some functions are easier to optimize than others. Intuitively, if the function is \u201csteep\u201d near the optimum, then the subgradient may carry a great deal of information, and a stochastic gradient descent algorithm may converge relatively quickly. A minimax approach to analyzing the running time cannot take this into account for a particular function, as it treats the worst-case behavior of the algorithm over all functions. It would be of considerable interest to be able to assess the complexity of solving an individual convex optimization problem. Doing so requires a break from traditional worst-case thinking.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nIn this paper we revisit the traditional view of the complexity of convex optimization from the point of view of a type of localized minimax complexity. In local minimax, our objective is to quantify the intrinsic difficulty of optimizing a specific convex function $f$. With the target $f$ fixed, we take an alternative function $g$ within the same function class $\\mathcal{F}$, and evaluate how the maximum expected error decays with the number of calls to the oracle, for an optimal algorithm designed to optimize either $f$ or $g$. The local minimax complexity $R_T(f;\\mathcal{F})$ is defined by taking the least favorable alternative $g$:\n\n$$R_T(f;\\mathcal{F}) = \\sup_{g \\in \\mathcal{F}} \\inf_{A \\in \\mathcal{A}_T} \\max_{h \\in \\{f,g\\}} \\mathrm{error}(A, h) \\tag{1}$$\n\nwhere $\\mathrm{error}(A, h)$ is some measure of error for the algorithm applied to function $h$. 
Note that here the algorithm $A$ is allowed to depend on the function $f$ and the selected worst-case $g$. In contrast, the traditional global worst-case performance of the best algorithm, as defined by the minimax complexity $R_T(\\mathcal{F})$ of Nemirovsky and Yudin, is\n\n$$R_T(\\mathcal{F}) = \\inf_{A \\in \\mathcal{A}_T} \\sup_{g \\in \\mathcal{F}} \\mathrm{error}(A, g). \\tag{2}$$\n\nThe local minimax complexity can be thought of as the difficulty of optimizing the hardest alternative to the target function. Intuitively, a difficult alternative is a function $g$ for which querying the oracle with $g$ gives results similar to querying with $f$, but for which the value of $x \\in \\mathcal{C}$ that minimizes $g$ is far from the value that minimizes $f$.\n\nOur analysis ties this function-specific notion of complexity to a localized and computational analogue of the modulus of continuity that is central to statistical minimax analysis [5, 6]. We show that the local minimax complexity gives a meaningful benchmark for quantifying the difficulty of optimizing a specific function by proving a superefficiency result; in particular, outperforming this benchmark at some function must lead to a larger error at some other function. Furthermore, we propose an adaptive algorithm in the one-dimensional case that is based on binary search, and show that this algorithm automatically achieves the local minimax complexity, up to a logarithmic factor. Our study of the algorithmic complexity of convex optimization is motivated by the work of Cai and Low [2], who propose an analogous definition in the setting of statistical estimation of a one-dimensional convex function. The present work can thus be seen as exposing a close connection between statistical estimation and numerical optimization of convex functions. 
In particular, our results imply that the local minimax complexity can be viewed as a computational analogue of Fisher information in classical statistical estimation.\n\nIn the following section we establish our notation, and give a technical overview of our main results, which characterize the local minimax complexity in terms of the computational modulus of continuity. In Section 2.2, we demonstrate the phenomenon of superefficiency of the local minimax complexity. In Section 3 we present the algorithm that adapts to the benchmark, together with an analysis of its theoretical properties. We also present simulations of the algorithm and comparisons to traditional stochastic gradient descent. Finally, we conclude with a brief review of related work and a discussion of future research directions suggested by our results.\n\n2 Local minimax complexity\n\nIn this section, we first establish notation and define a modulus of continuity for a convex function $f$. We then state our main result, which links the local minimax complexity to this modulus of continuity.\n\nLet $\\mathcal{F}$ be the collection of Lipschitz convex functions defined on a compact convex set $\\mathcal{C} \\subset \\mathbb{R}^d$. Given a function $f \\in \\mathcal{F}$, our goal is to find a minimum point, $x^*_f \\in \\arg\\min_{x \\in \\mathcal{C}} f(x)$. However, our knowledge about $f$ can only be gained through a first-order oracle. The oracle, upon being queried with $x \\in \\mathcal{C}$, returns $f'(x) + \\xi$, where $f'(x)$ is a subgradient of $f$ at $x$ and $\\xi \\sim N(0, \\sigma^2 I_d)$. When the oracle is queried with a non-differentiable point $x$ of $f$, instead of allowing the oracle to return an arbitrary subgradient at $x$, we assume that it has a deterministic mechanism for producing $f'(x)$. That is, when we query the oracle with $x$ twice, it should return two random vectors with the same mean $f'(x)$. 
Such an oracle can be realized, for example, by taking $f'(x) = \\arg\\min_{z \\in \\partial f(x)} \\|z\\|$. Here and throughout the paper, $\\|\\cdot\\|$ denotes the Euclidean norm.\n\n[Figure 1 omitted; plot labels: $f(x)$, $g(x)$, $f'(x)$, $g'(x)$, $\\epsilon$, $\\omega(\\epsilon; f)$, flat set.]\n\nFigure 1: Illustration of the flat set and the modulus of continuity. Both the function $f$ (left) and its derivative $f'$ (right) are shown (black curves), along with one of the many possible alternatives, $g$ and its derivative $g'$ (solid gray curves), that achieve the sup in the definition of $\\omega_f(\\epsilon)$. The flat set contains all the points for which $|f'(x)| < \\epsilon$, and $\\omega_f(\\epsilon)$ is the larger half width of the flat set.\n\nConsider optimization algorithms that make a total of $T$ queries to this first-order oracle, and let $\\mathcal{A}_T$ be the collection of all such algorithms. For $A \\in \\mathcal{A}_T$, denote by $\\hat{x}_A$ the output of the algorithm. We write $\\mathrm{err}(x, f)$ for a measure of error for using $x$ as the estimate of the minimum point of $f \\in \\mathcal{F}$. In this notation, the usual minimax complexity is defined as\n\n$$R_T(\\mathcal{F}) = \\inf_{A \\in \\mathcal{A}_T} \\sup_{f \\in \\mathcal{F}} \\mathbb{E}_f \\mathrm{err}(\\hat{x}_A, f). \\tag{3}$$\n\nNote that the algorithm $A$ queries the oracle at up to $T$ points $x_t \\in \\mathcal{C}$ selected sequentially, and the output $\\hat{x}_A$ is thus a function of the entire sequence of random vectors $v_t \\sim N(f'(x_t), \\sigma^2 I_d)$ returned by the oracle. The expectation $\\mathbb{E}_f$ denotes the average with respect to this randomness (and any additional randomness injected by the algorithm itself). The minimax risk $R_T(\\mathcal{F})$ characterizes the hardness of the entire class $\\mathcal{F}$. To quantify the difficulty of optimizing an individual function $f$, we consider the following local minimax complexity, comparing $f$ to its hardest local alternative\n\n$$R_T(f;\\mathcal{F}) = \\sup_{g \\in \\mathcal{F}} \\inf_{A \\in \\mathcal{A}_T} \\max_{h \\in \\{f,g\\}} \\mathbb{E}_h \\mathrm{err}(\\hat{x}_A, h). \\tag{4}$$\n\nWe now proceed to define a computational modulus of continuity that characterizes the local minimax complexity. Let $\\mathcal{X}^*_f = \\arg\\min_{x \\in \\mathcal{C}} f(x)$ be the set of minimum points of function $f$. We consider $\\mathrm{err}(x, f) = \\inf_{y \\in \\mathcal{X}^*_f} \\|x - y\\|$ as our measure of error. Define $d(f, g) = \\inf_{x \\in \\mathcal{X}^*_f, y \\in \\mathcal{X}^*_g} \\|x - y\\|$ for $f, g \\in \\mathcal{F}$. It is easy to see that $\\mathrm{err}(x, f)$ and $d(f, g)$ satisfy the exclusion inequality\n\n$$\\mathrm{err}(x, f) < \\frac{1}{2} d(f, g) \\quad \\text{implies} \\quad \\mathrm{err}(x, g) \\ge \\frac{1}{2} d(f, g). \\tag{5}$$\n\nNext we define\n\n$$\\kappa(f, g) = \\sup_{x \\in \\mathcal{C}} \\|f'(x) - g'(x)\\| \\tag{6}$$\n\nwhere $f'(x)$ is the unique subgradient of $f$ that is returned as the mean by the oracle when queried with $x$. For example, if we take $f'(x) = \\arg\\min_{z \\in \\partial f(x)} \\|z\\|$, we have\n\n$$\\kappa(f, g) = \\sup_{x \\in \\mathcal{C}} \\|\\mathrm{Proj}_{\\partial f(x)}(0) - \\mathrm{Proj}_{\\partial g(x)}(0)\\| \\tag{7}$$\n\nwhere $\\mathrm{Proj}_B(z)$ is the projection of $z$ to the set $B$. Thus, $d(f, g)$ measures the dissimilarity between two functions in terms of the distance between their minimizers, whereas $\\kappa(f, g)$ measures the dissimilarity by the largest separation between their subgradients at any given point.\n\nGiven $d$ and $\\kappa$, we define the modulus of continuity of $d$ with respect to $\\kappa$ at the function $f$ by\n\n$$\\omega_f(\\epsilon) = \\sup\\{d(f, g) : g \\in \\mathcal{F}, \\kappa(f, g) \\le \\epsilon\\}. \\tag{8}$$\n\nWe now show how to calculate the modulus for some specific functions.\n\nExample 1. Suppose that $f$ is a convex function on a one-dimensional interval $\\mathcal{C} \\subset \\mathbb{R}$. If we take $f'(x) = \\arg\\min_{z \\in \\partial f(x)} \\|z\\|$, then\n\n$$\\omega_f(\\epsilon) = \\sup\\left\\{ \\inf_{x \\in \\mathcal{X}^*_f} |x - y| : y \\in \\mathcal{C}, |f'(y)| < \\epsilon \\right\\}. \\tag{9}$$\n\nThe proof of this claim is given in the appendix. This result essentially says that the modulus of continuity measures the size (in fact, the larger half-width) of the \u201cflat set\u201d where the magnitude of the subderivative is smaller than $\\epsilon$. 
See Figure 1 for an illustration. Thus, for the class of symmetric functions $f(x) = \\frac{1}{k}|x|^k$ over $\\mathcal{C} = [-1, 1]$, with $k > 1$,\n\n$$\\omega_f(\\epsilon) = \\epsilon^{\\frac{1}{k-1}}. \\tag{10}$$\n\nFor the asymmetric case $f(x) = \\frac{1}{k_l}|x|^{k_l} I(-1 \\le x \\le 0) + \\frac{1}{k_r}|x|^{k_r} I(0 < x \\le 1)$ with $k_l, k_r > 1$,\n\n$$\\omega_f(\\epsilon) = \\epsilon^{\\frac{1}{k_l \\vee k_r - 1}}. \\tag{11}$$\n\nThat is, the size of the flat set depends on the flatter side of the function.\n\n2.1 Local minimax is characterized by the modulus\n\nWe now state our main result linking the local minimax complexity to the modulus of continuity. We say that the modulus of continuity has polynomial growth if there exist $\\alpha > 0$ and $\\epsilon_0$ such that for any $c \\ge 1$ and $\\epsilon \\le \\epsilon_0/c$\n\n$$\\omega_f(c\\epsilon) \\le c^{\\alpha} \\omega_f(\\epsilon). \\tag{12}$$\n\nOur main result below shows that the modulus of continuity characterizes the local minimax complexity of optimization of a particular convex function, in a manner similar to how the modulus of continuity quantifies the (local) minimax risk in a statistical estimation setting [2, 5, 6], relating the objective to a geometric property of the function.\n\nTheorem 1. Suppose that $f \\in \\mathcal{F}$ and that $\\omega_f(\\epsilon)$ has polynomial growth. Then there exist constants $C_1$ and $C_2$, independent of $T$, and $T_0 > 0$ such that for all $T > T_0$\n\n$$C_1 \\omega_f\\left(\\frac{\\sigma}{\\sqrt{T}}\\right) \\le R_T(f;\\mathcal{F}) \\le C_2 \\omega_f\\left(\\frac{\\sigma}{\\sqrt{T}}\\right). \\tag{13}$$\n\nRemark 1. We use the error metric $\\mathrm{err}(x, f) = \\inf_{y \\in \\mathcal{X}^*_f} \\|x - y\\|$ here. For a given pair $(\\mathrm{err}, d)$ that satisfies the exclusion inequality (5), our proof technique applies to yield the corresponding lower bound. For example, we could use $\\mathrm{err}(x, f) = \\inf_{y \\in \\mathcal{X}^*_f} |v^T(x - y)|$ for some vector $v$. This error metric would be suitable when we wish to estimate $v^T x^*_f$, for example, the first coordinate of $x^*_f$. Another natural choice of error metric is $\\mathrm{err}(x, f) = f(x) - \\inf_{x \\in \\mathcal{C}} f(x)$, with a corresponding distance $d(f, g) = \\inf_{x \\in \\mathcal{C}} |f(x) - \\inf_x f(x) + g(x) - \\inf_x g(x)|$. 
For this case, while the proof of the lower bound stays exactly the same, further work is required for the upper bound, which is beyond the scope of this paper.\n\nRemark 2. The results can be extended to oracles with more general noise models. In particular, the lower bounds will still hold with more general noise distributions, as long as Gaussian noise is a subclass. Indeed, in proving lower bounds, assuming Gaussianity only makes solving the optimization problem easier. Our algorithm and upper bound analysis will go through for all sub-Gaussian noise oracles. For ease of presentation, we focus on the Gaussian noise model in the current paper.\n\nRemark 3. Although the theorem gives an upper bound for the local minimax complexity, this does not guarantee the existence of an algorithm that achieves the local complexity for any function. Therefore, it is important to design an algorithm that adapts to this benchmark for each individual function. We solve this problem in the one-dimensional case in Section 3.\n\nThe proof of this theorem is given in the appendix. We now illustrate the result with examples that verify the intuition that different functions should have different degrees of difficulty for stochastic convex optimization.\n\nExample 2. For the function $f(x) = \\frac{1}{k}|x|^k$ with $x \\in [-1, 1]$ for $k > 1$, we have $R_T(f;\\mathcal{F}) = O\\big(T^{-\\frac{1}{2(k-1)}}\\big)$. This agrees with the minimax risk complexity for the class of Lipschitz convex functions that satisfy $f(x) - f(x^*_f) \\ge \\frac{\\lambda}{2}\\|x - x^*_f\\|^k$ [14]. In particular, when $k = 2$, we recover the strongly convex case, where the (global) minimax complexity is $O(1/\\sqrt{T})$ with respect to the error $\\mathrm{err}(x, f) = \\inf_{y \\in \\mathcal{X}^*_f} \\|x - y\\|$. We see a faster rate of convergence for $k < 2$. As $k \\to \\infty$, we also see that the error fails to decrease as $T$ gets large. This corresponds to the worst case for any Lipschitz convex function. 
In the asymmetric setting with $f(x) = \\frac{1}{k_l}|x|^{k_l} I(-1 \\le x \\le 0) + \\frac{1}{k_r}|x|^{k_r} I(0 < x \\le 1)$ with $k_l, k_r > 1$, we have $R_T(f;\\mathcal{F}) = O\\big(T^{-\\frac{1}{2(k_l \\vee k_r - 1)}}\\big)$.\n\nThe following example illustrates that the local minimax complexity and modulus of continuity are consistent with known behavior of stochastic gradient descent for strongly convex functions.\n\nExample 3. In this example we consider the error $\\mathrm{err}(x, f) = \\inf_{y \\in \\mathcal{X}^*_f} |v^T(x - y)|$ for some vector $v$, and let $f$ be an arbitrary convex function satisfying $\\nabla^2 f(x^*_f) \\succ 0$ with Hessian continuous around $x^*_f$. Thus the optimizer $x^*_f$ is unique. If we define $g_w(x) = f(x) - w^T \\nabla^2 f(x^*_f) x$, then $g_w(x)$ is a convex function with unique minimizer and\n\n$$\\kappa(f, g_w) = \\sup_x \\left\\| \\nabla f(x) - \\left( \\nabla f(x) - \\nabla^2 f(x^*_f) w \\right) \\right\\| = \\left\\| \\nabla^2 f(x^*_f) w \\right\\|. \\tag{14}$$\n\nThus, defining $\\delta(w) = x^*_f - x^*_{g_w}$,\n\n$$\\omega_f\\left(\\frac{\\sigma}{\\sqrt{T}}\\right) \\ge \\sup_w \\left\\{ |v^T \\delta(w)| : \\left\\| \\nabla^2 f(x^*_f) w \\right\\| \\le \\frac{\\sigma}{\\sqrt{T}} \\right\\} = \\sup_{\\|u\\| \\le 1} \\left| v^T \\delta\\left( \\frac{\\sigma}{\\sqrt{T}} \\nabla^2 f(x^*_f)^{-1} u \\right) \\right|. \\tag{15}$$\n\nBy the convexity of $g_w$, we know that $x^*_{g_w}$ satisfies $\\nabla f(x^*_{g_w}) - \\nabla^2 f(x^*_f) w = 0$, and therefore by the implicit function theorem, $x^*_{g_w} = x^*_f + w + o(\\|w\\|)$ as $w \\to 0$. Thus,\n\n$$\\omega_f\\left(\\frac{\\sigma}{\\sqrt{T}}\\right) \\ge \\frac{\\sigma}{\\sqrt{T}} \\left\\| \\nabla^2 f(x^*_f)^{-1} v \\right\\| + o\\left(\\frac{\\sigma}{\\sqrt{T}}\\right) \\quad \\text{as } T \\to \\infty. \\tag{16}$$\n\nIn particular, we have the local minimax lower bound\n\n$$\\liminf_{T \\to \\infty} \\frac{\\sqrt{T}}{\\sigma} R_T(f;\\mathcal{F}) \\ge C_1 \\left\\| \\nabla^2 f(x^*_f)^{-1} v \\right\\| \\tag{17}$$\n\nwhere $C_1$ is the same constant appearing in Theorem 1. This shows that the local minimax complexity captures the function-specific dependence on the constant in the strongly convex case. Stochastic gradient descent with averaging is known to adapt to this strong convexity constant [16, 13, 10]. Note that lower bounds of similar forms on the minimax complexity have been obtained in [11].\n\n2.2 Superefficiency\n\nHaving characterized the local minimax complexity in terms of a computational modulus of continuity, we would now like to show that there are consequences to outperforming it at some function. This will strengthen the case that the local minimax complexity serves as a meaningful benchmark to quantify the difficulty of optimizing any particular convex function.\n\nSuppose that $f$ is any one-dimensional function such that $\\mathcal{X}^*_f = [x_l, x_r]$, which has an asymptotic expansion around $\\{x_l, x_r\\}$ of the form\n\n$$f(x_l - \\delta) = f(x_l) + \\lambda_l \\delta^{k_l} + o(\\delta^{k_l}) \\quad \\text{and} \\quad f(x_r + \\delta) = f(x_r) + \\lambda_r \\delta^{k_r} + o(\\delta^{k_r}) \\tag{18}$$\n\nfor $\\delta > 0$, some powers $k_l, k_r > 1$, and constants $\\lambda_l, \\lambda_r > 0$. The following result shows that if any algorithm significantly outperforms the local modulus of continuity on such a function, then it underperforms the modulus on a nearby function.\n\nProposition 1. Let $f$ be any convex function satisfying the asymptotic expansion (18) around its optimum. Suppose that $A \\in \\mathcal{A}_T$ is any algorithm that satisfies\n\n$$\\mathbb{E}_f \\mathrm{err}(\\hat{x}_A, f) \\le \\sqrt{\\mathbb{E}_f \\mathrm{err}(\\hat{x}_A, f)^2} \\le \\delta_T \\omega_f\\left(\\frac{\\sigma}{\\sqrt{T}}\\right), \\tag{19}$$\n\nwhere $\\delta_T < C_1$. Define $g_{-1}(x) = f(x) - \\epsilon_T x$ and $g_1(x) = f(x) + \\epsilon_T x$, where $\\epsilon_T$ is given by $\\epsilon_T = \\sqrt{2 \\log(C_1/\\delta_T)/T}$. Then for some $g \\in \\{g_{-1}, g_1\\}$, there exists $T_0$ such that $T \\ge T_0$ implies\n\n$$\\mathbb{E}_g \\mathrm{err}(\\hat{x}_A, g) \\ge C \\omega_g\\left(\\sqrt{\\frac{2 \\log(C_1/\\delta_T)}{T}}\\right) \\tag{20}$$\n\nfor some constant $C$ that only depends on $k = k_l \\vee k_r$.\n\nA proof of this result is given in the appendix, where it is derived as a consequence of a more general statement. 
We remark that while condition (19) involves the root mean squared error $\\sqrt{\\mathbb{E}_f \\mathrm{err}(\\hat{x}_A, f)^2}$, we expect that the result holds with only the weaker inequality on the absolute error $\\mathbb{E}_f \\mathrm{err}(\\hat{x}_A, f)$.\n\nIt follows from this proposition that if an algorithm $A$ significantly outperforms the local minimax complexity in the sense that (19) holds for some sequence $\\delta_T \\to 0$ with $\\liminf_T e^T \\delta_T = \\infty$, then there exists a sequence of convex functions $g_T$ with $\\kappa(f, g_T) \\to 0$, such that\n\n$$\\liminf_{T \\to \\infty} \\frac{\\mathbb{E}_{g_T} \\mathrm{err}(\\hat{x}_A, g_T)}{\\omega_{g_T}\\left(\\sqrt{2 \\log(C_1/\\delta_T)/T}\\right)} > 0. \\tag{21}$$\n\nThis is analogous to the phenomenon of superefficiency in classical parametric estimation problems, where outperforming the asymptotically optimal rate given by the Fisher information implies worse performance at some other point in the parameter space. In this sense, $\\omega_f$ can be viewed as a computational analogue of Fisher information in the setting of convex optimization. We note that superefficiency has also been studied in nonparametric settings [1], and a similar result was shown by Cai and Low [2] for local minimax estimation of convex functions.\n\n3 An adaptive optimization algorithm\n\nIn this section, we show that a simple stochastic binary search algorithm achieves the local minimax complexity in the one-dimensional case.\n\nThe general idea of the algorithm is as follows. Suppose that we are given a budget of $T$ queries to the oracle. We divide this budget into $T_0 = \\lfloor T/E \\rfloor$ queries over each of $E = \\lfloor r \\log T \\rfloor$ many rounds, where $r > 0$ is a constant to be specified later. In each round, we query the oracle $T_0$ times for the derivative at the mid-point of the current interval. Estimating the derivative by averaging over the queries, we proceed to the left half of the interval if the estimated sign is positive, and to the right half of the interval if the estimated sign is negative. 
The details are given in Algorithm 1.\n\nAlgorithm 1 Sign testing binary search\n\nInput: $T$, $r$.\nInitialize: $(a_0, b_0)$, $E = \\lfloor r \\log T \\rfloor$, $T_0 = \\lfloor T/E \\rfloor$.\nfor $e = 1, \\ldots, E$ do\n  Query $x_e = (a_e + b_e)/2$ for $T_0$ times to get $Z^{(e)}_t$ for $t = 1, \\ldots, T_0$.\n  Calculate the average $\\bar{Z}^{(e)}_{T_0} = \\frac{1}{T_0} \\sum_{t=1}^{T_0} Z^{(e)}_t$.\n  If $\\bar{Z}^{(e)}_{T_0} > 0$, set $(a_{e+1}, b_{e+1}) = (a_e, x_e)$.\n  If $\\bar{Z}^{(e)}_{T_0} \\le 0$, set $(a_{e+1}, b_{e+1}) = (x_e, b_e)$.\nend for\nOutput: $x_E$.\n\nWe will show that this algorithm adapts to the local minimax complexity up to a logarithmic factor. First, the following result shows that the algorithm gets us close to the \u201cflat set\u201d of the function.\n\nProposition 2. For $\\delta \\in (0, 1)$, let $C_\\delta = \\sqrt{2 \\log(E/\\delta)}$. Define\n\n$$\\mathcal{I}_\\delta = \\left\\{ y \\in \\mathrm{dom}(f) : |f'(y)| < \\frac{\\sigma C_\\delta}{\\sqrt{T_0}} \\right\\}. \\tag{22}$$\n\nSuppose that $(a_0, b_0) \\cap \\mathcal{I}_\\delta \\ne \\emptyset$. Then\n\n$$\\mathrm{dist}(x_E, \\mathcal{I}_\\delta) \\le 2^{-E}(b_0 - a_0) \\tag{23}$$\n\nwith probability at least $1 - \\delta$.\n\nThis proposition tells us that after $E$ rounds of bisection, we are at most a distance $2^{-E}(b_0 - a_0)$ from the flat set $\\mathcal{I}_\\delta$. In terms of the distance to the minimum point, we have\n\n$$\\inf_{x \\in \\mathcal{X}^*_f} |x_E - x| \\le 2^{-E}(b_0 - a_0) + \\sup\\left\\{ \\inf_{x \\in \\mathcal{X}^*_f} |x - y| : y \\in \\mathcal{I}_\\delta \\right\\}. \\tag{24}$$\n\n[Figure 2 omitted; six log-log panels of risk against $t$ (100 to 10000). Top row: $k = 1.5$, $k = 2$, $k = 3$. Bottom row: $k_l = 1.5, k_r = 2$; $k_l = 1.5, k_r = 3$; $k_l = 2, k_r = 3$. Legend: binary search; SGD, $\\eta(t) = 1/t$; SGD, $\\eta(t) = 1/\\sqrt{t}$; theoretic.]\n\nFigure 2: Simulation results: Averaged risk versus number of queries $T$. The black curves correspond to the risk of the stochastic binary search algorithm. The red and blue curves are for the stochastic gradient descent methods, red for stepsize $1/t$ and blue for $1/\\sqrt{t}$. The dashed gray lines indicate the optimal convergence rate. Note that the plots are on a log-log scale. The plots on the top panels are for the symmetric cases $f(x) = \\frac{1}{k}|x - x^*|^k$; the lower plots are for the asymmetric cases.\n\nIf the modulus of continuity satisfies the polynomial growth condition, we then obtain the following.\n\nCorollary 1. Let $\\alpha_0 > 0$. Suppose $\\omega_f$ satisfies the polynomial growth condition (12) with constant $\\alpha \\le \\alpha_0$. Let $r = \\frac{1}{2}\\alpha_0$. 
Then with probability at least $1 - \\delta$ and for large enough $T$,\n\n$$\\inf_{x \\in \\mathcal{X}^*_f} |x_E - x| \\le \\tilde{C} \\omega_f\\left(\\frac{\\sigma}{\\sqrt{T}}\\right) \\tag{25}$$\n\nwhere the term $\\tilde{C}$ hides a dependence on $\\log T$ and $\\log(1/\\delta)$.\n\nThe proofs of these results are given in the appendix.\n\n3.1 Simulations showing adaptation to the benchmark\n\nWe now demonstrate the performance of the stochastic binary search algorithm, making a comparison to stochastic gradient descent. For the stochastic gradient descent algorithm, we perform $T$ steps of update\n\n$$x_{t+1} = x_t - \\eta(t) \\cdot \\hat{g}(x_t) \\tag{26}$$\n\nwhere $\\hat{g}(x_t)$ is the noisy derivative returned by the oracle at $x_t$ and $\\eta(t)$ is a stepsize function, chosen as either $\\eta(t) = \\frac{1}{t}$ or $\\eta(t) = \\frac{1}{\\sqrt{t}}$. We first consider the following setup with symmetric functions $f$:\n\n1. The function to optimize is $f_k(x) = \\frac{1}{k}|x - x^*|^k$ for $k = \\frac{3}{2}$, 2 or 3.\n2. The minimum point $x^* \\sim \\mathrm{Unif}(-1, 1)$ is selected uniformly at random over the interval.\n3. The oracle returns the derivative at the query point with additive $N(0, \\sigma^2)$ noise, $\\sigma = 0.1$.\n4. The optimization algorithms know a priori that the minimum point is inside the interval $(-2, 2)$. Therefore, the binary search starts with interval $(-2, 2)$ and the stochastic gradient descent starts at $x_0 \\sim \\mathrm{Unif}(-2, 2)$ and projects the query points to the interval $(-2, 2)$.\n5. We carry out the simulation for values of $T$ on a logarithmic grid between 100 and 10,000. For each setup, we average the error $|\\hat{x} - x^*|$ over 1,000 runs.\n\nThe simulation results are shown in the top 3 panels of Figure 2. Several properties predicted by our theory are apparent from the simulations. First, the risk curves for the stochastic binary search algorithm parallel the gray curves. This indicates that the optimal rate of convergence is achieved. Thus, the stochastic binary search adapts to the curvature of different functions and yields the optimal local minimax complexity, as given by our benchmark. 
Second, stochastic gradient descent with stepsize $1/t$ achieves the optimal rate when $k = 2$, but not when $k = 3$; with stepsize $1/\\sqrt{t}$, SGD gets close to the optimal rate when $k = 3$, but not when $k = 2$. Neither leads to the faster rate when $k = \\frac{3}{2}$. This is as expected, since the stepsize needs to be adapted to the curvature at the optimum in order to achieve the optimal rate.\n\nNext, we consider a set of asymmetric functions. Using the same setup as in the symmetric case, we consider the functions of the form $f(x) = \\frac{1}{k_l}|x - x^*|^{k_l} I(x - x^* \\le 0) + \\frac{1}{k_r}|x - x^*|^{k_r} I(x - x^* > 0)$, for exponent pairs $(k_l, k_r)$ chosen to be $(\\frac{3}{2}, 2)$, $(\\frac{3}{2}, 3)$ and $(2, 3)$. The simulation results are shown in the bottom three panels of Figure 2. We observe that the stochastic binary search once again achieves the optimal rate, which is determined by the flatter side of the function, that is, the larger of $k_l$ and $k_r$.\n\n4 Related work and future directions\n\nIn related recent work, Ramdas and Singh [14] study minimax complexity for the class of Lipschitz convex functions that satisfy $f(x) - f(x^*_f) \\ge \\frac{\\lambda}{2}\\|x - x^*_f\\|^k$. They show that the minimax complexity under the function value error is of the order $T^{-\\frac{k}{2(k-1)}}$. Juditski and Nesterov [8] also consider minimax complexity for the class of $k$-uniformly convex functions for $k > 2$. They give an adaptive algorithm based on stochastic gradient descent that achieves the minimax complexity up to a logarithmic factor. Connections with active learning are developed in [15], with related ideas appearing in [3]. Adaptivity in this line of work corresponds to the standard notion in statistical estimation, which seeks to adapt to a large subclass of a parameter space. 
In contrast, the results in the current paper quantify the difficulty of stochastic convex optimization at a much finer scale, as the benchmark is determined by the specific function to be optimized.\n\nThe stochastic binary search algorithm presented in Section 3, despite being adaptive, has a few drawbacks. It requires the modulus of continuity of the function to satisfy polynomial growth, with a parameter $\\alpha$ bounded away from 0. This rules out cases such as $f(x) = |x|$, which should have an error that decays exponentially in $T$; it is of interest to handle this case as well. It would also be of interest to construct adaptive optimization procedures tuned to a fixed numerical precision. Such procedures should have different running times depending on the hardness of the problem. Progress on both problems has been made, and will be reported elsewhere.\n\nAnother challenge is to remove the logarithmic factors appearing in the binary search algorithm developed in Section 3. In one dimension, stochastic convex optimization is intimately related to a noisy root finding problem for a monotone function taking values in $[-a, a]$ for some $a > 0$. Karp and Kleinberg [9] study optimal algorithms for such root finding problems in a discrete setting. A binary search algorithm that allows backtracking is proposed, which saves log factors in the running time. It would be interesting to study the use of such techniques in our setting.\n\nOther areas that warrant study involve the dependence on dimension. The scaling with dimension of the local minimax complexity and modulus of continuity is not fully revealed by the current analysis. Moreover, the superefficiency result and the adaptive algorithm presented here are only for the one-dimensional case. 
We note that a form of adaptive stochastic gradient algorithm for the class of uniformly convex functions in general, fixed dimension is developed in [8].\n\nFinally, a more open-ended direction is to consider larger classes of stochastic optimization problems. For instance, minimax results are known for functions of the form $f(x) := \\mathbb{E} F(x; \\xi)$ where $\\xi$ is a random variable and $x \\mapsto F(x; \\xi)$ is convex for any $\\xi$, when $f$ is twice continuously differentiable around the minimum point with positive definite Hessian. However, the role of the local geometry is not well understood. It would be interesting to further develop the local complexity techniques introduced in the current paper, to gain insight into the geometric structure of more general stochastic optimization problems.\n\nAcknowledgments\n\nResearch supported in part by ONR grant 11896509 and NSF grant DMS-1513594. The authors thank Tony Cai, Praneeth Netrapalli, Rob Nowak, Aaron Sidford, and Steve Wright for insightful discussions and valuable comments on this work.\n\nReferences\n\n[1] Lawrence Brown and Mark Low. A constrained risk inequality with applications to nonparametric functional estimation. Annals of Statistics, 24(6):2524\u20132535, 1996.\n\n[2] Tony Cai and Mark Low. A framework for estimation of convex functions. Statistica Sinica, pages 423\u2013456, 2015.\n\n[3] Rui Castro and Robert Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339\u20132353, 2008.\n\n[4] David Donoho. Statistical estimation and optimal recovery. The Annals of Statistics, pages 238\u2013270, 1994.\n\n[5] David Donoho and Richard Liu. Geometrizing rates of convergence, I. Technical report, University of California, Berkeley, 1987. Department of Statistics, Technical Report 137.\n\n[6] David Donoho and Richard Liu. Geometrizing rates of convergence, II. 
Annals of Statistics, 19:633\u2013667, 1991.\n\n[7] Jean-Baptiste Hiriart-Urruty and Claude Lemar\u00e9chal. Convex Analysis and Minimization Algorithms I & II. Springer, New York, 1993.\n\n[8] Anatoli Juditski and Yuri Nesterov. Deterministic and stochastic primal-dual subgradient methods for minimizing uniformly convex functions. Stochastic Systems, 4(1):44\u201380, 2014.\n\n[9] Richard M Karp and Robert Kleinberg. Noisy binary search and its applications. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 881\u2013890. Society for Industrial and Applied Mathematics, 2007.\n\n[10] Eric Moulines and Francis Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 451\u2013459, 2011.\n\n[11] Aleksandr Nazin. Informational inequalities in gradient stochastic optimization optimal feasible algorithms. Automation and Remote Control, 50(4):531\u2013540, 1989.\n\n[12] Arkadi Nemirovsky and David Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley & Sons, 1983.\n\n[13] Boris Polyak and Anatoli Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838\u2013855, 1992.\n\n[14] Aaditya Ramdas and Aarti Singh. Optimal rates for stochastic convex optimization under Tsybakov noise condition. In Proceedings of The 30th International Conference on Machine Learning, pages 365\u2013373, 2013.\n\n[15] Aaditya Ramdas and Aarti Singh. Algorithmic connections between active learning and stochastic convex optimization. arXiv:1505.04214, 2015.\n\n[16] David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. 
Technical report, Report 781, Cornell University Operations Research and Industrial Engineering, 1988.\n\n[17] Alexandre Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.\n", "award": [], "sourceid": 1702, "authors": [{"given_name": "sabyasachi", "family_name": "chatterjee", "institution": "University of Chicago"}, {"given_name": "John", "family_name": "Duchi", "institution": "Stanford"}, {"given_name": "John", "family_name": "Lafferty", "institution": "University of Chicago"}, {"given_name": "Yuancheng", "family_name": "Zhu", "institution": "University of Chicago"}]}