{"title": "The Physical Systems Behind Optimization Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 4372, "page_last": 4381, "abstract": "We use differential equations based approaches to provide some {\\it \\textbf{physics}} insights into analyzing the dynamics of popular optimization algorithms in machine learning. In particular, we study gradient descent, proximal gradient descent, coordinate gradient descent, proximal coordinate gradient, and Newton's methods as well as their Nesterov's accelerated variants in a unified framework motivated by a natural connection of optimization algorithms to physical systems. Our analysis is applicable to more general algorithms and optimization problems {\\it \\textbf{beyond}} convexity and strong convexity, e.g. Polyak-\\L ojasiewicz and error bound conditions (possibly nonconvex).", "full_text": "The Physical Systems Behind Optimization\n\nAlgorithms\n\nLin F. Yang \u2217\n\nPrinceton University\n\nRaman Arora,\n\nJohns Hopkins University\n\nVladimir Braverman\n\nJohns Hopkins University\n\nlin.yang@princeton.edu\n\narora@cs.jhu.edu\n\nvova@cs.jhu.edu\n\nTuo Zhao\u2020\n\nGeorgia Institute of Technology\n\ntourzhao@gatech.edu\n\nAbstract\n\nWe use differential equations based approaches to provide some physics insights\ninto analyzing the dynamics of popular optimization algorithms in machine learning.\nIn particular, we study gradient descent, proximal gradient descent, coordinate\ngradient descent, proximal coordinate gradient, and Newton\u2019s methods as well\nas their Nesterov\u2019s accelerated variants in a uni\ufb01ed framework motivated by a\nnatural connection of optimization algorithms to physical systems. Our analysis is\napplicable to more general algorithms and optimization problems beyond convexity\nand strong convexity, e.g. 
Polyak-Łojasiewicz and error bound conditions (possibly nonconvex).

*Work was done while the author was at Johns Hopkins University. This work is partially supported by the National Science Foundation under grant numbers 1546482, 1447639, 1650041 and 1652257, the ONR Award N00014-18-1-2364, the Israel Science Foundation grant #897/13, a Minerva Foundation grant, and DARPA award W911NF1820267.
†Corresponding author.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

1 Introduction

Many machine learning problems can be cast into an optimization problem of the following form:

    x* = argmin_{x ∈ X} f(x),    (1.1)

where X ⊆ R^d and f : X → R is a continuously differentiable function. For simplicity, we assume that f is convex or approximately convex (more on this later). Perhaps the earliest algorithm for solving (1.1) is the vanilla gradient descent (VGD) algorithm, which dates back to Euler and Lagrange. VGD is simple, intuitive, and easy to implement in practice. For large-scale problems, it is usually more scalable than more sophisticated algorithms (e.g. Newton's method).

Existing state-of-the-art analysis shows that VGD achieves an O(1/k) convergence rate for smooth convex functions and a linear convergence rate for strongly convex functions, where k is the number of iterations [11]. Recently, a class of Nesterov's accelerated gradient (NAG) algorithms has gained popularity in the statistical signal processing and machine learning communities. These algorithms combine the vanilla gradient descent algorithm with an additional momentum term at each iteration. Such a modification, though simple, has a profound impact: the NAG algorithms attain faster convergence than VGD. Specifically, NAG achieves O(1/k²) convergence for smooth convex functions, and linear convergence with a better constant term for strongly convex functions [11].

Another closely related class of algorithms is randomized coordinate gradient descent (RCGD) algorithms. These algorithms conduct a gradient descent-type step in each iteration, but only with respect to a single coordinate. RCGD has convergence rates similar to VGD, but a smaller overall computational complexity, since the per-iteration computational cost of RCGD is much smaller than that of VGD [10, 7]. More recently, [5, 2] applied Nesterov's acceleration to RCGD and proposed accelerated randomized coordinate gradient (ARCG) algorithms. Accordingly, they established similar accelerated convergence rates for ARCG.

Another line of research focuses on relaxing the convexity and strong convexity conditions to alternative regularity conditions, including the restricted secant inequality, error bound, Polyak-Łojasiewicz, and quadratic growth conditions. These conditions have been shown to hold for many optimization problems in machine learning, and faster convergence rates have been established under them (e.g. [8, 6, 9, 20, 3, 4]).

Although various theoretical results have been established, the algorithmic proofs of convergence under these regularity conditions rely heavily on algebraic tricks that are sometimes arguably mysterious to understand. To this end, a popular recent trend in the analysis of optimization algorithms has been to study gradient descent as a discretization of gradient flow; these approaches often provide a clear interpretation for the continuous approximation of the algorithmic systems [16, 17]. In [16], the authors propose a framework for studying discrete algorithmic systems under the limit of infinitesimal time step.
They show that the Nesterov's accelerated gradient (NAG) algorithm can be described by an ordinary differential equation (ODE) in the limit as the time step tends to zero. In [17], the authors study a more general family of ODEs that essentially correspond to accelerated gradient algorithms. All these analyses, however, lack a natural interpretation in terms of the physical systems behind the optimization algorithms. Therefore, they do not clearly explain why the momentum leads to acceleration. Moreover, these analyses only consider general convex conditions and gradient descent-type algorithms, and are not applicable to either the aforementioned relaxed conditions or coordinate-gradient-type algorithms (due to the randomized coordinate selection).

Our Contribution (I): We provide novel physics-based insights into the differential equation approaches for optimization. In particular, we connect the optimization algorithms to natural physical systems through differential equations. This allows us to establish a unified theory for understanding optimization algorithms. Specifically, we consider the VGD, NAG, RCGD, and ARCG algorithms. All of these algorithms are associated with damped oscillator systems with different particle masses and damping coefficients. For example, VGD corresponds to a massless particle system, while NAG corresponds to a massive particle system. A damped oscillator system has a natural dissipation of its mechanical energy, and the decay rate of the mechanical energy in the system is connected to the convergence rate of the algorithm. Our results match the convergence rates of all algorithms considered here to those known in the existing literature. We show that for a massless system, the convergence rate depends only on the gradient (force field) and the smoothness of the function, whereas a massive particle system has an energy decay rate proportional to the ratio between the damping coefficient and the mass.
We further show that optimal algorithms such as NAG correspond to an oscillator system near critical damping. It is known in the physics literature that a critically damped system undergoes the fastest energy dissipation. We believe that this view can potentially help us design novel optimization algorithms in a more intuitive manner. As pointed out by the anonymous reviewers, some of the intuitions we provide are also presented in [13]; however, we give a more detailed analysis in this paper.

Our Contribution (II): We provide new analysis for more general optimization problems beyond general convexity and strong convexity, as well as for more general algorithms. Specifically, we provide several concrete examples: (1) VGD achieves linear convergence under the Polyak-Łojasiewicz (PL) condition (possibly nonconvex), which matches the state-of-the-art result in [4]; (2) NAG achieves accelerated linear convergence (with a better constant term) under both the general convexity and quadratic growth conditions, which matches the state-of-the-art result in [19]; (3) Coordinate-gradient-type algorithms share the same ODE approximation with gradient-type algorithms, and our analysis involves a more refined infinitesimal analysis; (4) Newton's algorithm achieves linear convergence under the strong convexity and self-concordance conditions. See Table 1 for a summary. Due to space limitations, we present the extension to the nonsmooth composite optimization problem in the Appendix.

Table 1: Our contribution compared with [16, 17].

Recently, an independent work considered a framework similar to ours for analyzing first-order optimization algorithms [18]; while the focus there is on bridging the gap between discrete algorithmic analysis and continuous approximation, we focus on understanding the physical systems behind the optimization algorithms.
Both perspectives are essential and complementary to each other.

                     [15]/[16]/Ours
                   VGD      NAG      RCGD     ARCG     Newton
General Convex     --/--/R  R/R/R    --/--/R  --/--/R  --/R/--
Strongly Convex    --/--/R  --/--/R  --/--/R  --/--/R  --/--/R
Proximal Variants  --/--/R  R/--/R   --/--/R  --/--/R  --/--/R
PL Condition       --/--/R  --/--/R  --/--/R  --/--/R  --/--/--
Physical Systems   --/--/R  --/--/R  --/--/R  --/--/R  --/--/R

Before we proceed, we first introduce assumptions on the objective f.

Assumption 1.1 (L-smooth). There exists a constant L > 0 such that for any x, y ∈ R^d, we have ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖.

Assumption 1.2 (μ-strongly convex). There exists a constant μ > 0 such that for any x, y ∈ R^d, we have f(x) ≥ f(y) + ⟨∇f(y), x − y⟩ + (μ/2)‖x − y‖².

Assumption 1.3 (Lmax-coordinate-smooth). There exists a constant Lmax such that for any x, y ∈ R^d, we have |∇_j f(x) − ∇_j f(x_{\j}, y_j)| ≤ Lmax|x_j − y_j| for all j = 1, ..., d.

The Lmax-coordinate-smooth condition has been shown to be satisfied by many machine learning problems, such as ridge regression and logistic regression. For convenience, we define κ = L/μ and κmax = Lmax/μ. Note that we also have Lmax ≤ L ≤ d·Lmax and κmax ≤ κ ≤ d·κmax.

2 From Optimization Algorithms to ODE

We develop a unified representation for the continuous approximations of the aforementioned optimization algorithms. Our analysis is inspired by [16], where the NAG algorithm for general convex functions is approximated by an ordinary differential equation in the limit of infinitesimal time step. We start with VGD and NAG, and later show that RCGD and ARCG can also be approximated by the same ODE.
For self-containedness, we present a brief review of the popular optimization algorithms (VGD, NAG, RCGD, ARCG, and Newton) in Appendix A.

2.1 A Unified Framework for Continuous Approximation Analysis

By considering an infinitesimal step size, we rewrite VGD and NAG in the following generic form:

    x^(k) = y^(k−1) − η∇f(y^(k−1))  and  y^(k) = x^(k) + α(x^(k) − x^(k−1)).    (2.1)

For VGD, α = 0; for NAG, α = (√(1/(μη)) − 1)/(√(1/(μη)) + 1) when f is strongly convex, and α = (k − 1)/(k + 2) when f is general convex. We then rewrite (2.1) as

    (x^(k+1) − x^(k)) − α(x^(k) − x^(k−1)) + η∇f(x^(k) + α(x^(k) − x^(k−1))) = 0.    (2.2)

When considering the continuous-time limit of the above equation, it is not immediately clear how the continuous time is related to the iteration index k. We thus let h denote the time scaling factor and study the possible choices of h later on. With this, we define a continuous time variable

    t = kh  with  X(t) = x^(⌈t/h⌉) = x^(k),    (2.3)

where k is the iteration index, and X(t) from t = 0 to t = ∞ is a trajectory characterizing the dynamics of the algorithm. Throughout the paper, we may omit (t) if it is clear from the context. Note that our definition in (2.3) is very different from that of [16], where t is defined as t = k√η, i.e., fixing h = √η. There are several advantages of the new definition: (1) It leads to a unified analysis for both VGD and NAG.
Specifically, if we followed the same notion as [16], we would need to redefine t = kη for VGD, which is different from t = k√η for NAG; (2) The new definition is more flexible, and leads to a unified analysis for both gradient-type (VGD and NAG) and coordinate-gradient-type algorithms (RCGD and ARCG), regardless of their different step sizes, e.g. η = 1/L for VGD and NAG, and η = 1/Lmax for RCGD and ARCG; (3) The new definition coincides with that of [16] only when h = √η. We will show later, however, that h ≍ √η is a natural requirement of a massive particle system rather than an artificial choice of h.

We then proceed to derive the differential equation for (2.2). By the Taylor expansions

    x^(k+1) − x^(k) = Ẋ(t)h + (1/2)Ẍ(t)h² + o(h),
    x^(k) − x^(k−1) = Ẋ(t)h − (1/2)Ẍ(t)h² + o(h),
    η∇f(x^(k) + α(x^(k) − x^(k−1))) = η∇f(X(t)) + O(ηh),

where Ẋ(t) = dX(t)/dt and Ẍ(t) = d²X(t)/dt², we can rewrite (2.2) as

    ((1 + α)h²/(2η)) Ẍ(t) + ((1 − α)h/η) Ẋ(t) + ∇f(X(t)) + O(h) = 0.    (2.4)

Taking the limit h → 0, we rewrite (2.4) in a more convenient form,

    m Ẍ(t) + c Ẋ(t) + ∇f(X(t)) = 0.    (2.5)

Here (2.5) describes exactly a damped oscillator system in d dimensions with

    m := (1 + α)h²/(2η)  as the particle mass,
    c := (1 − α)h/η      as the damping coefficient,
    f(x)                 as the potential field.

Let us now consider how to choose h for different settings. The basic principle is that both m and c stay finite in the limit h, η → 0; in other words, the physical system is valid.
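As a concrete illustration (our own Python sketch, not code from the paper), the generic update (2.1) can be implemented directly; the quadratic test function and all parameter values below are our choices:

```python
import numpy as np

def generic_descent(grad, x0, eta, alpha_fn, n_iter=1000):
    """Generic iteration (2.1): a gradient step at the extrapolated
    point y, followed by a momentum update with coefficient alpha."""
    x_prev = x0.copy()
    x = x0.copy()
    for k in range(1, n_iter + 1):
        y = x + alpha_fn(k) * (x - x_prev)   # momentum extrapolation
        x_prev, x = x, y - eta * grad(y)     # gradient step at y
    return x

# Toy strongly convex quadratic f(x) = 0.5 * (L * x1^2 + mu * x2^2).
L, mu = 100.0, 1.0
grad = lambda x: np.array([L, mu]) * x
x0 = np.array([1.0, 1.0])
eta = 1.0 / L

# alpha = 0 recovers VGD; the strongly convex NAG momentum is constant.
x_vgd = generic_descent(grad, x0, eta, lambda k: 0.0)
q = 1.0 / np.sqrt(mu * eta)
x_nag = generic_descent(grad, x0, eta, lambda k: (q - 1.0) / (q + 1.0))
print(np.linalg.norm(x_vgd), np.linalg.norm(x_nag))  # both near the minimizer 0
```

Passing `alpha_fn = lambda k: (k - 1) / (k + 2)` instead gives the general convex variant of NAG from the text.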
Taking VGD as an example, we have α = 0. In this case, the only valid setting is h = Θ(η), under which m → 0 and c → c₀ for some constant c₀. We call such a particle system massless. For NAG, it can also be verified that only h = Θ(√η) results in a valid physical system, and it is massive (0 < m < ∞, 0 ≤ c < ∞). Therefore, we obtain a unified framework for choosing the correct time scaling factor h.

2.2 A Physical System: Damped Harmonic Oscillator

In classical mechanics, the harmonic oscillator is one of the first mechanical systems admitting an exact solution. This system consists of a massive particle and a restoring force; a typical example is a massive particle connected to a massless spring.

The spring always tends to stay at the equilibrium position. When it is stretched or compressed, a force acts on the object that stretches or compresses it, always pointing toward the equilibrium position. The energy stored in the spring is

    V(X) := (1/2)KX²,

where X denotes the displacement of the spring, and K is the Hooke's constant of the spring. In the physics literature, V(X) is called the potential energy.

When one end of the spring is attached to a fixed point, and the other end is attached to a freely moving particle with mass m, we obtain a harmonic oscillator, as illustrated in Figure 1.
If there is no friction on the particle, by Newton's law we write the differential equation describing the system:

    m Ẍ + KX = 0,

where Ẍ := d²X/dt² is the acceleration of the particle. If we compress the spring and release it at point x₀, the system starts oscillating, i.e., at time t the position of the particle is X(t) = x₀ cos(ωt), where ω = √(K/m) is the oscillation frequency.

Such a system has two physical properties: (1) The total energy

    E(t) := V(X(t)) + K(X(t)) = V(x₀)

is always a constant, where K(X) := (1/2)mẊ² is the kinetic energy of the system; this is also called energy conservation in physics. (2) The system never stops.

Figure 1: An illustration of the harmonic oscillators: a massive particle connects to a massless spring. (Top) Undamped harmonic oscillator; (Bottom) Damped harmonic oscillator.

The harmonic oscillator is closely related to optimization algorithms. As we will show later, all the aforementioned optimization algorithms simply simulate a system where a particle is falling inside a given potential. From the perspective of optimization, the equilibrium is essentially the minimizer of the quadratic potential function V(x) = (1/2)Kx². The desired property of the system is to stop the particle at the minimizer. However, a simple harmonic oscillator is not sufficient and does not correspond to a convergent algorithm, since the system never stops: the particle at the equilibrium has the largest kinetic energy, and the inertia of the massive particle drives it away from the equilibrium.

One natural way to stop the particle at the equilibrium is to add damping to the system, which dissipates the mechanical energy, just like in real-world mechanics. A simple damping is a force proportional to the negative velocity of the particle (e.g.
submerging the system in a viscous fluid), defined as

    F_f = −c Ẋ,

where c is the viscous damping coefficient. Suppose the potential energy of the system is f(x); then the differential equation of the system is

    m Ẍ + c Ẋ + ∇f(X) = 0.

For the quadratic potential, i.e., f(x) = (K/2)‖x − x*‖², the energy exhibits exponential decay,

    E(t) ∝ exp(−ct/(2m)),    (2.6)

for an underdamped or nearly critically damped system (e.g. c² ≲ 4mK). For an overdamped system (i.e. c² > 4mK), the energy decay is

    E(t) ∝ exp(−(1/2)(c/m − √(c²/m² − 4K/m)) t).

For extremely overdamped cases, i.e., c² ≫ 4mK, we have c/m − √(c²/m² − 4K/m) → 2K/c. This decay does not depend on the particle mass: the system behaves as if the particle had no mass. In the language of optimization, the corresponding algorithm has linear convergence. Note that the convergence rate depends only on the ratio c/m, and not on K, when the system is underdamped or critically damped. The fastest convergence rate is obtained when the system is critically damped, c² = 4mK.

2.3 Sufficient Conditions for Convergence

For notational simplicity, we assume that x* = 0 is a global minimum of f with f(x*) = 0. The potential energy of the particle system is simply defined as V(t) := V(X(t)) := f(X(t)). If an algorithm converges to the optimum, a sufficient condition is that the corresponding potential energy V decreases over time. The decreasing rate determines the convergence rate of the corresponding algorithm.

Theorem 2.1.
Let γ(t) > 0 be a nondecreasing function of t and Γ(t) ≥ 0 be a nonnegative function. Suppose that γ(t) and Γ(t) satisfy

    d(γ(t)(V(t) + Γ(t)))/dt ≤ 0  and  lim_{t→0⁺} γ(t)(V(t) + Γ(t)) < ∞.

Then the convergence rate of the algorithm is characterized by 1/γ(t).

Proof. By d(γ(t)(V(t) + Γ(t)))/dt ≤ 0, we have

    γ(t)(V(t) + Γ(t)) ≤ γ(0⁺)(f(X(0⁺)) + Γ(0⁺)).

This further implies f(X) ≤ V(t) + Γ(t) ≤ γ(0⁺)(f(X(0⁺)) + Γ(0⁺))/γ(t).

In words, γ(t)(V(t) + Γ(t)) serves as a Lyapunov function of the system. We say that an algorithm is (1/γ)-convergent if the potential energy decay rate is O(1/γ). For example, γ(t) = e^{at} corresponds to linear convergence, and γ(t) = at corresponds to sublinear convergence, where a is a constant independent of t. In the following section, we apply Theorem 2.1 to different problems by choosing different γ's and Γ's.

3 Convergence Rate in Continuous Time

We derive the convergence rates of different algorithms for different families of objective functions. Given our proposed framework, we only need to find γ and Γ to characterize the energy decay.

3.1 Convergence Analysis of VGD

We study the convergence of VGD for two classes of functions: (1) general convex functions — [11] has shown that VGD achieves O(L/k) convergence for general convex functions; (2) a class of functions satisfying the Polyak-Łojasiewicz (PŁ) condition, which is defined as follows [14, 4].

Assumption 3.1.
We say that f satisfies the μ-PŁ condition if there exists a constant μ such that for any x ∈ R^d, we have

    0 < f(x)/‖∇f(x)‖² ≤ 1/(2μ).

[4] has shown that the PŁ condition is the weakest among the following conditions: strong convexity (SC), essential strong convexity (ESC), weak strong convexity (WSC), the restricted secant inequality (RSI), and the error bound (EB). Thus, the convergence analysis under the PŁ condition naturally extends to all of the above conditions. Please refer to [4] for more detailed definitions and analyses, as well as various examples in machine learning satisfying this condition.

3.1.1 Sublinear Convergence for General Convex Functions

By choosing Γ(t) = c‖X‖²/(2t) and γ(t) = t, we have

    d(γ(t)(V(t) + Γ(t)))/dt = f(X(t)) + t⟨∇f(X(t)), Ẋ(t)⟩ + ⟨X(t), cẊ(t)⟩
                            = f(X(t)) − ⟨∇f(X(t)), X(t)⟩ − (t/c)‖∇f(X(t))‖² ≤ 0,

where the last inequality follows from the convexity of f. Thus, Theorem 2.1 implies

    f(X(t)) ≤ c‖x₀‖²/(2t).    (3.1)

Plugging t = kh and c = h/η into (3.1) and setting η = 1/L, we match the convergence rate in [11]:

    f(x^(k)) ≤ c‖x₀‖²/(2kh) = L‖x₀‖²/(2k).    (3.2)

3.1.2 Linear Convergence Under the Polyak-Łojasiewicz Condition

Equation (2.5) implies Ẋ = −(1/c)∇f(X(t)). By choosing Γ(t) = 0 and γ(t) = exp(2μt/c), we obtain

    d(γ(t)(V(t) + Γ(t)))/dt = γ(t) ((2μ/c) f(X(t)) + ⟨∇f(X(t)), Ẋ(t)⟩)
                            = γ(t) ((2μ/c) f(X(t)) − (1/c)‖∇f(X(t))‖²).

By the μ-PŁ condition, 0 < f(X(t))/‖∇f(X(t))‖² ≤ 1/(2μ) for some constant μ and any t, so we have

    d(γ(t)(V(t) + Γ(t)))/dt ≤ 0.

By Theorem 2.1, for some constant C′ depending on x₀, we obtain

    f(X(t)) ≤ C′ exp(−2μt/c),    (3.3)

which matches the behavior of an extremely overdamped harmonic oscillator. Plugging t = kh and c = h/η into (3.3) and setting η = 1/L, we match the convergence rate in [4]:

    f(x^(k)) ≤ C exp(−(2μ/L)k)    (3.4)

for some constant C depending on x^(0).

3.2 Convergence Analysis of NAG

We study the convergence of NAG for a class of convex functions satisfying the Polyak-Łojasiewicz (PŁ) condition. The convergence of NAG for general convex functions has been studied in [16], and is therefore omitted. [11] has shown that NAG achieves linear convergence for strongly convex functions. Our analysis shows that the strong convexity can be relaxed, as it is for VGD. In contrast to VGD, however, NAG requires f to be convex.

For an L-smooth convex function satisfying the μ-PŁ condition, the particle mass and damping coefficient are

    m = h²/η  and  c = 2√μ·h/√η = 2√(mμ).

By [4], under convexity, PŁ is equivalent to quadratic growth (QG).
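As a quick numerical check (a sketch with our own toy quadratic and constants, not an experiment from the paper), VGD on a strongly convex quadratic, which satisfies the μ-PŁ condition with μ equal to its smallest eigenvalue, obeys the linear rate predicted by (3.4):

```python
import numpy as np

# Quadratic f(x) = 0.5 * (L * x1^2 + mu * x2^2): L-smooth and mu-PL.
L, mu = 10.0, 2.0
f = lambda x: 0.5 * (L * x[0] ** 2 + mu * x[1] ** 2)
grad = lambda x: np.array([L, mu]) * x

eta = 1.0 / L
x = np.array([1.0, 1.0])
vals = [f(x)]
for k in range(200):
    x = x - eta * grad(x)
    vals.append(f(x))

# (3.4) predicts f(x_k) <= C * exp(-2*mu*k / L); here the slow mode
# decays like (1 - mu/L)^(2k), and (1 - mu/L)^2 <= exp(-2*mu/L).
bound = vals[0] * np.exp(-2.0 * mu * np.arange(201) / L)
print(all(v <= b + 1e-12 for v, b in zip(vals, bound)))  # prints True
```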
Formally, we assume that f satisfies the following condition.

Assumption 3.2 (μ-QG). We say that f satisfies the μ-QG condition if there exists a constant μ such that for any x ∈ R^d, we have f(x) − f(x*) ≥ (μ/2)‖x − x*‖².

We then proceed with the proof for NAG. We first define two parameters, λ and σ. Let

    γ(t) = exp(λct)  and  Γ(t) = (m/2)‖Ẋ + σcX‖².

Given properly chosen λ and σ, we show that the required condition in Theorem 2.1 is satisfied. Recall that our proposed physical system has kinetic energy (m/2)‖Ẋ(t)‖². In contrast to an undamped system, NAG takes an effective velocity Ẋ + σcX in the viscous fluid. By simple manipulation,

    d(V(t) + Γ(t))/dt = ⟨∇f(X), Ẋ⟩ + m⟨Ẋ + σcX, Ẍ + σcẊ⟩.

We then observe

    exp(−λct) · d(γ(t)(V(t) + Γ(t)))/dt
      = λc f(X) + (λcm/2)‖Ẋ + σcX‖² + d(V(t) + Γ(t))/dt
      ≤ λc(1 + mσ²c²/μ) f(X) + ⟨Ẋ, (λcm/2 + mσc)Ẋ + ∇f(X) + mẌ⟩
        + ⟨X, (λσmc² + mσ²c²)Ẋ + mσcẌ⟩.    (3.5)

Since c² = 4mμ, we argue that if positive σ and λ satisfy

    m(λ + σ) = 1  and  λ(1 + mσ²c²/μ) ≤ σ,

then we guarantee d(γ(t)(V(t) + Γ(t)))/dt ≤ 0. Indeed, we obtain

    ⟨Ẋ, (λcm/2 + mσc)Ẋ + ∇f(X) + mẌ⟩ = −(λmc/2)‖Ẋ‖² ≤ 0

and

    ⟨X, (λσmc² + mσ²c²)Ẋ + mσcẌ⟩ = −σc⟨X, ∇f(X)⟩.

By the convexity of f, we have

    λc(1 + mσ²c²/μ) f(X) − σc⟨X, ∇f(X)⟩ ≤ σc f(X) − σc⟨X, ∇f(X)⟩ ≤ 0.

To make (3.5) hold, it is sufficient to set σ = 4/(5m) and λ = 1/(5m). By Theorem 2.1, we obtain

    f(X(t)) ≤ C″ exp(−ct/(5m))    (3.6)

for some constant C″ depending on x^(0). Plugging t = hk, m = h²/η, c = 2√(mμ), and η = 1/L into (3.6), we have

    f(x^(k)) ≤ C″ exp(−(2/5)√(μ/L) k).    (3.7)

Comparing with VGD, NAG improves the constant term in the convergence rate for convex functions satisfying the PŁ condition from L/μ to √(L/μ). This matches the algorithmic proofs of [11] for strongly convex functions, and of [19] for convex functions satisfying the QG condition.

3.3 Convergence Analysis of RCGD and ARCG

Our proposed framework also justifies the convergence analysis of the RCGD and ARCG algorithms. We will show that the trajectory of the RCGD algorithm converges weakly to that of the VGD algorithm, and thus our analysis for VGD directly applies. Conditioning on x^(k−1), the updating formula for RCGD is

    x_i^(k) = x_i^(k−1) − η∇_i f(x^(k−1))  and  x_{\i}^(k) = x_{\i}^(k−1),    (3.8)

where η is the step size and i is randomly selected from {1, 2, ..., d} with equal probability.
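For concreteness, the update (3.8) can be sketched in a few lines (our own minimal implementation; the separable quadratic test problem and all constants are our choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def rcgd(grad, x0, eta, n_iter):
    """RCGD (3.8): update one uniformly random coordinate per iteration."""
    x = x0.copy()
    d = len(x)
    for _ in range(n_iter):
        i = rng.integers(d)          # coordinate chosen with probability 1/d
        x[i] -= eta * grad(x)[i]     # gradient step along coordinate i only
    return x

# Separable quadratic f(x) = 0.5 * sum(diag * x^2); Lmax = max diagonal entry.
diag = np.array([4.0, 1.0, 2.0])
grad = lambda x: diag * x
x = rcgd(grad, np.ones(3), eta=1.0 / diag.max(), n_iter=3000)
print(np.linalg.norm(x))  # near 0: RCGD converges with step size 1/Lmax
```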
Fixing a coordinate i, we compute the expectation and variance of its update:

    E(x_i^(k) − x_i^(k−1) | x^(k−1)) = −(η/d) ∇_i f(x^(k−1))  and
    Var(x_i^(k) − x_i^(k−1) | x^(k−1)) = (η²(d − 1)/d²) ‖∇_i f(x^(k−1))‖².

We define the infinitesimal time scaling factor h ≤ η as in Section 2.1 and denote X̃^h(t) := x^(⌊t/h⌋). We prove that for each i ∈ [d], X̃_i^h(t) converges weakly to a deterministic function X_i(t) as η → 0. Specifically, we rewrite (3.8) as

    X̃^h(t + h) − X̃^h(t) = −η ∇_i f(X̃^h(t)).

Taking the limit η → 0 at a fixed time t, we have |X_i(t + h) − X_i(t)| = O(η) and

    (1/η) E(X̃^h(t + h) − X̃^h(t) | X̃^h(t)) = −(1/d) ∇f(X̃^h(t)) + O(h).

Since ‖∇f(X̃^h(t))‖² is bounded at time t, we have (1/η) Var(X̃^h(t + h) − X̃^h(t) | X̃^h(t)) = O(h). Using an infinitesimal generator argument in [1], we conclude that X̃^h(t) converges weakly to X(t) as h → 0, where X(t) satisfies

    Ẋ(t) + (1/d) ∇f(X(t)) = 0  and  X(0) = x^(0).

Since η ≤ 1/Lmax, by (3.4) we have

    f(x^(k)) ≤ C₁ exp(−(2μ/(d·Lmax)) k)    (3.9)

for some constant C₁ depending on x^(0).
The analysis for general convex functions follows similarly. One can easily match the convergence rate, as in (3.2): f(x^(k)) ≤ c‖x₀‖²/(2kh) = d·Lmax‖x₀‖²/(2k).

Repeating the above argument for ARCG, we obtain that the trajectory X̃^h(t) converges weakly to X(t), where X(t) satisfies

    m Ẍ(t) + c Ẋ(t) + ∇f(X(t)) = 0.

For general convex functions, we have m = h²/η′ and c = 3m/t, where η′ = η/d. By the analysis of [16], we have f(x^(k)) ≤ C₂d/k² for some constant C₂ depending on x^(0) and Lmax. For convex functions satisfying the μ-QG condition, m = h²/η′ and c = 2√(mμ). By (3.7), we obtain f(x^(k)) ≤ C₃ exp(−(2/5)√(μ/(d·Lmax)) k) for some constant C₃ depending on x^(0).

3.4 Convergence Analysis for Newton

Newton's algorithm is a second-order algorithm. Although it is different from both VGD and NAG, we can fit it into our proposed framework by choosing η = 1/L and replacing the gradient with L[∇²f(X)]⁻¹∇f(X). We consider only the case where f is μ-strongly convex, L-smooth, and ν-self-concordant. By (2.5), if h/η does not vanish in the limit h → 0, we obtain a similar equation,

    C Ẋ + ∇f(X) = 0,

where C = h∇²f(X) is the viscosity tensor of the system. In such a system, the function f not only determines the gradient field, but also determines a viscosity tensor field. The particle system behaves as if submerged in an anisotropic fluid that exhibits different viscosity along different directions. We release the particle at a point x₀ that is sufficiently close to the minimizer 0, i.e. ‖x₀ − 0‖ ≤ ζ for some parameter ζ determined by ν, μ, and L.
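As a side note (our own sketch with an arbitrary toy objective, not code from the paper), one Euler step of C Ẋ + ∇f(X) = 0 with C = h∇²f(X) and step size h is exactly the Newton update x ← x − [∇²f(x)]⁻¹∇f(x), independent of h:

```python
import numpy as np

def newton_flow(grad, hess, x0, n_iter=20):
    """Euler discretization of h * hess(X) @ X' + grad(X) = 0 with step h:
    the h factors cancel, leaving the plain Newton iteration."""
    x = x0.copy()
    for _ in range(n_iter):
        x = x - np.linalg.solve(hess(x), grad(x))
    return x

# Toy strongly convex objective f(x) = 0.5 * x^T A x + 0.25 * ||x||^4.
A = np.array([[300.0, 1.0], [1.0, 50.0]])
grad = lambda x: A @ x + (x @ x) * x
hess = lambda x: A + (x @ x) * np.eye(2) + 2.0 * np.outer(x, x)

# Released close to the minimizer 0, the iterates converge rapidly.
x = newton_flow(grad, hess, np.array([0.1, 0.1]))
print(np.linalg.norm(x))  # near machine precision
```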
Now we consider the decay of the potential energy V(X) := f(X). By Theorem 2.1 with γ(t) = exp(t/(2h)) and Γ(t) = 0, we have

  d(γ(t)f(X))/dt = exp(t/(2h)) · [ (1/(2h)) f(X) − (1/h) ⟨∇f(X), (∇²f(X))⁻¹ ∇f(X)⟩ ].

By simple calculus, we have ∇f(X) = ∫₀¹ ∇²f((1 − t)X) dt · X. By the self-concordance condition, we have

  (1 − νt‖X‖_X)² ∇²f(X) ⪯ ∇²f((1 − t)X) ⪯ (1 − νt‖X‖_X)⁻² ∇²f(X),

where ‖v‖_X² = v^⊤ ∇²f(X) v ∈ [μ‖v‖₂², L‖v‖₂²]. Let β = νζL ≤ 1/2. By integration and the convexity of f, we have

  (1 − β) ∇²f(X) ⪯ ∫₀¹ ∇²f((1 − t)X) dt ⪯ (1/(1 − β)) ∇²f(X)

and

  (1/2) f(X) − ⟨∇f(X), (∇²f(X))⁻¹ ∇f(X)⟩ ≤ (1/2) f(X) − (1/2) ⟨∇f(X), X⟩ ≤ 0.

Note that our proposed ODE framework only proves local linear convergence for Newton's method under the strongly convex, smooth, and self-concordant conditions. The convergence rate contains an absolute constant, which does not depend on μ and L. This partially justifies the superior local convergence performance of Newton's algorithm on ill-conditioned problems with very small μ and very large L. Existing literature, however, has proved local quadratic convergence of Newton's algorithm, which is better than our ODE-type analysis.
This is mainly because the discrete algorithmic analysis takes advantage of "large" step sizes, whereas the ODE only characterizes "small" step sizes, and therefore fails to capture the quadratic convergence.

4 Numerical Simulations

We present an illustration of our theoretical analysis in Figure 2. We consider a strongly convex quadratic program

  f(x) = (1/2) x^⊤ H x,  where H = [300 1; 1 50].

Obviously, f(x) is strongly convex and x* = [0, 0]^⊤ is its minimizer. We choose η = 10⁻⁴ for VGD and NAG, and η = 2 × 10⁻⁴ for RCGD and ARCG. The trajectories of VGD and NAG are obtained by MATLAB's default ODE solver.

Figure 2: The algorithmic iterates and trajectories of a simple quadratic program.

5 Discussions

We now give a more detailed interpretation of our proposed system from a physics perspective:

Consequence of Particle Mass — As shown in Section 2, a massless particle system (mass m = 0) describes the simple gradient descent algorithm. By Newton's law, a zero-mass particle can achieve infinite acceleration and has an infinitesimal response time to any force acting on it. Thus, the particle is "locked" onto the force field (the gradient field) of the potential f: the velocity of the particle is always proportional to the restoring force acting on it. The convergence rate of the algorithm is determined only by the function f and the damping coefficient. The mechanical energy is stored in the force field (the potential energy) rather than in the kinetic energy. For a massive particle system, by contrast, the mechanical energy is also partially stored in the kinetic energy of the particle.
Therefore, even when the force field is not strong enough, the particle keeps a high speed.

Damping and Convergence Rate — For a quadratic potential V(x) = (μ/2)‖x‖², the system has an exponential energy decay, where the exponent depends on the mass m, the damping coefficient c, and the properties of the function (e.g., the PŁ coefficient). As discussed in Section 2, the decay rate is fastest when the system is critically damped, i.e., c² = 4mμ. For either an underdamped or an overdamped system, the decay rate is slower. For a potential function f satisfying convexity and the μ-PŁ condition, NAG corresponds to a nearly critically damped system, whereas VGD corresponds to an extremely overdamped system, i.e., c² ≫ 4mμ. Moreover, we can achieve different acceleration rates by choosing different m/c ratios for NAG, i.e., α = (1 − (μη)^s)/(1 + (μη)^s) for some absolute constant s > 0. However, s = 1/2 achieves the fastest convergence rate, since it corresponds exactly to critical damping: c² = 4mμ.

Connecting the PŁ Condition to Hooke's Law — The μ-PŁ and convexity conditions together naturally mimic the properties of a quadratic potential V, i.e., a damped harmonic oscillator. Specifically, the μ-PŁ condition

  (1/(2μ)) ‖∇V(x)‖² ≥ V(x)

guarantees that the force field is strong enough, since the left-hand side of the above inequality is exactly the potential energy of a spring under Hooke's law. Moreover, the convexity condition V(x) ≤ ⟨∇V(x), x⟩ guarantees that the force field has a large component pointing at the equilibrium point (acting as a restoring force). As indicated in [4], PŁ is a much weaker condition than strong convexity. Some functions that satisfy the PŁ condition locally are not even convex, e.g., matrix factorization.
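As an illustration of this point, the sketch below (assuming the standard one-dimensional nonconvex PŁ example f(x) = x² + 3 sin²(x) from [4]; the starting point, step size, and iteration count are illustrative choices, not values from the paper) runs plain gradient descent and checks that the objective still decays to the global minimum value 0:

```python
import math

def gd_on_pl_example(x0=3.0, step=1.0 / 8.0, iters=5000):
    """Gradient descent on f(x) = x^2 + 3 sin^2(x), a nonconvex function that
    satisfies the Polyak-Lojasiewicz condition (see [4]).  f is 8-smooth since
    f''(x) = 2 + 6 cos(2x) <= 8, so step = 1/8 guarantees monotone descent."""
    f = lambda x: x * x + 3.0 * math.sin(x) ** 2
    grad = lambda x: 2.0 * x + 3.0 * math.sin(2.0 * x)
    x, history = x0, [f(x0)]
    for _ in range(iters):
        x -= step * grad(x)
        history.append(f(x))
    return history
```

Although f is nonconvex (f″(x) dips to −4), the PŁ inequality keeps the "spring" force strong enough for the objective to decay geometrically to zero, exactly as the Hooke's-law reading predicts.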
The connection between the PŁ condition and Hooke's law indicates that strong convexity is not the fundamental characterization of linear convergence. If another condition encodes a form of Hooke's law, it should imply linear convergence as well.

References

[1] Stewart N Ethier and Thomas G Kurtz. Markov processes: characterization and convergence, volume 282. John Wiley & Sons, 2009.

[2] Olivier Fercoq and Peter Richtárik. Accelerated, parallel, and proximal coordinate descent. SIAM Journal on Optimization, 25(4):1997–2023, 2015.

[3] Pinghua Gong and Jieping Ye. Linear convergence of variance-reduced stochastic gradient without strong convexity. arXiv preprint arXiv:1406.1102, 2014.

[4] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.

[5] Qihang Lin, Zhaosong Lu, and Lin Xiao. An accelerated proximal coordinate gradient method. In Advances in Neural Information Processing Systems, pages 3059–3067, 2014.

[6] Ji Liu and Stephen J Wright. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. SIAM Journal on Optimization, 25(1):351–376, 2015.

[7] Zhaosong Lu and Lin Xiao. On the complexity analysis of randomized block-coordinate descent methods. Mathematical Programming, 152(1-2):615–642, 2015.

[8] Zhi-Quan Luo and Paul Tseng. Error bounds and convergence analysis of feasible descent methods: a general approach. Annals of Operations Research, 46(1):157–178, 1993.

[9] I Necoara, Yu Nesterov, and F Glineur. Linear convergence of first order methods for non-strongly convex optimization.
arXiv preprint arXiv:1504.06298, 2015.

[10] Yu Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

[11] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.

[12] Jorge Nocedal and Stephen Wright. Numerical optimization. Springer Science & Business Media, 2006.

[13] Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

[14] Boris Teodorovich Polyak. Gradient methods for minimizing functionals. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 3(4):643–653, 1963.

[15] Ralph Tyrell Rockafellar. Convex analysis. Princeton University Press, 2015.

[16] Weijie Su, Stephen Boyd, and Emmanuel Candes. A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. In Advances in Neural Information Processing Systems, pages 2510–2518, 2014.

[17] Andre Wibisono, Ashia C Wilson, and Michael I Jordan. A variational perspective on accelerated methods in optimization. arXiv preprint arXiv:1603.04245, 2016.

[18] Ashia C Wilson, Benjamin Recht, and Michael I Jordan. A Lyapunov analysis of momentum methods in optimization. arXiv preprint arXiv:1611.02635, 2016.

[19] Hui Zhang. New analysis of linear convergence of gradient-type methods via unifying error bound conditions. arXiv preprint arXiv:1606.00269, 2016.

[20] Hui Zhang and Wotao Yin. Gradient methods for convex minimization: better rates under weaker conditions.
arXiv preprint arXiv:1303.4645, 2013.