{"title": "Dimension-Free Bounds for Low-Precision Training", "book": "Advances in Neural Information Processing Systems", "page_first": 11733, "page_last": 11761, "abstract": "Low-precision training is a promising way of decreasing the time and energy cost of training machine learning models.\nPrevious work has analyzed low-precision training algorithms, such as low-precision stochastic gradient descent, and derived theoretical bounds on their convergence rates.\nThese bounds tend to depend on the dimension of the model $d$ in that the number of bits needed to achieve a particular error bound increases as $d$ increases.\nIn this paper, we derive new bounds for low-precision training algorithms that do not contain the dimension $d$ , which lets us better understand what affects the convergence of these algorithms as parameters scale.\nOur methods also generalize naturally to let us prove new convergence bounds on low-precision training with other quantization schemes, such as low-precision floating-point computation and logarithmic quantization.", "full_text": "Dimension-Free Bounds for Low-Precision Training\n\nZheng Li\n\nIIIS, Tsinghua University\nlzlz19971997@gmail.com\n\nChristopher De Sa\nCornell University\n\ncdesa@cs.cornell.edu\n\nAbstract\n\nLow-precision training is a promising way of decreasing the time and energy cost\nof training machine learning models. Previous work has analyzed low-precision\ntraining algorithms, such as low-precision stochastic gradient descent, and derived\ntheoretical bounds on their convergence rates. These bounds tend to depend on the\ndimension of the model d in that the number of bits needed to achieve a particular\nerror bound increases as d increases. In this paper, we derive new bounds for\nlow-precision training algorithms that do not contain the dimension d , which lets\nus better understand what affects the convergence of these algorithms as parameters\nscale. 
Our methods also generalize naturally to let us prove new convergence bounds on low-precision training with other quantization schemes, such as low-precision floating-point computation and logarithmic quantization.

1 Introduction

As machine learning models continue to scale to target larger problems on bigger data, the task of training these models quickly and efficiently becomes an ever-more-important problem. One promising technique for doing this is low-precision computation, which replaces the 32-bit or 64-bit floating-point numbers that are usually used in ML computations with smaller numbers, often 8-bit or 16-bit fixed-point numbers. Low-precision computation is a broadly applicable technique that has received a lot of attention, especially for deep learning, and specialized hardware accelerators have been developed to support it [2, 3, 14].

A major application for low-precision computation is the training of ML models using empirical risk minimization. This training is usually done using stochastic gradient descent (SGD), and most research in low-precision training has focused on low-precision versions of SGD. While most of this work is empirical [4-7, 11, 12, 15, 16, 18, 20, 22, 23], significant research has also been done in the theoretical analysis of low-precision training. This theoretical work has succeeded in proving bounds on the convergence rate of low-precision SGD and related low-precision methods in various settings, including for convex [8, 21] and non-convex objectives [1, 9, 17]. One common characteristic of these results is that the bounds tend to depend on the dimension d of the model being learned (equivalently, d is the number of parameters).
For example, [17] gives the convergence bound

E[f(w̄_T) − f(w*)] ≤ (1 + log(T + 1))σ_max² / (2µT) + √d · σ_max δ / 2,    (1)

where the objective f is strongly convex with parameter µ, low-precision SGD outputs w̄_T after T iterations, w* is the true global minimizer of the objective, σ_max² is an upper bound on the second moment of the stochastic gradient samples, E[‖∇f̃(w)‖₂²] ≤ σ_max², and δ is the quantization step, the difference between adjacent numbers in the low-precision format. Notice that, as T → ∞, this bound shows convergence down to a level of error that increases with the dimension d. Equivalently, in order to achieve the same level of error as d increases, we would need to use more bits of quantization to make δ smaller. Similar dimension-dependent results, where either the error or the number of bits needed increases with d, can also be seen in other work on low-precision training algorithms [1, 8, 21]. This dependence on d is unsatisfying because the motivation for low-precision training is to tackle large-scale problems on big data, where d can range up to 10^8 or more for commonly used models [19]. For example, to compensate for a factor of d = 10^8 in (1), we could add bits to decrease the quantization step δ by a factor of √d, but this would require adding log₂(10^4) ≈ 13 bits, which is significant compared to the 8 or 16 bits that are commonly used in low-precision training.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In this paper, we address this problem by proving bounds on the convergence of LP-SGD [17] that do not contain the dimension d in the expression.
Our main technique for doing so is a tight dimension-free bound on the expected quantization error of the low-precision stochastic gradients in terms of the ℓ1-norm. Our results are summarized in Table 1, and we make the following contributions:

• We describe conditions under which we can prove a dimension-free bound on the convergence of SGD with fixed-point, quantized iterates on both convex and non-convex problems.

• We study non-linear quantization schemes, in which the representable low-precision numbers are distributed non-uniformly. We prove dimension-free convergence bounds for SGD using logarithmic quantization [16], and we show that using logarithmic quantization can reduce the number of bits needed for LP-SGD to provably converge.

• We study quantization using low-precision floating-point numbers, and we present theoretical analysis that suggests how to assign a given number of bits to exponent and mantissa to optimize the accuracy of training algorithms. We validate our results experimentally.

2 Related Work

Motivated by the practical implications of faster machine learning, much work has been done on low-precision training. This work can be roughly divided into two groups. The first focuses on training deep models with low-precision weights, to be later used for faster inference. For some applications, methods of this type have achieved good results with very low-precision models: for example, binarized [5, 12, 18] and ternary networks [23] have been observed to be effective (although, as is usual for deep learning, they lack theoretical convergence results).
However, these approaches are still typically trained with full-precision iterates: the goal is faster inference, not faster training (although faster training is often achieved as a bonus side-effect).

A second line of work on low-precision training, which is applied to both DNN training and non-deep-learning tasks, focuses on making various aspects of SGD low-precision, while still trying to solve the same optimization problem as the full-precision version. The most common way to do this is to make the iterates of SGD (the w_t in the SGD update step w_{t+1} = w_t − α_t ∇f_t(w_t)) stored and computed in low-precision arithmetic [4, 8, 9, 11, 17]. This is the setting we will focus on most in this paper, because it has substantial theoretical prior work which exhibits the dimension-dependence we set out to study [1, 8, 17, 21]. The only paper we found with a bound that was not dimension-dependent was De Sa et al. [9], but in that paper the authors required that the gradient samples be 1-sparse (have only one nonzero entry), which is not a realistic assumption for most ML training tasks. In addition to quantizing the iterates, other work has studied quantizing the training set [21] and the numbers used to communicate among parallel workers [1]. We expect that our results on dimension-free bounds will be complementary to these existing theoretical approaches, and we hope that they can help to explain the success of the exciting empirical work in this area.

3 Dimension-Free Bounds for SGD

In this section, we analyze the performance of stochastic gradient descent (SGD) using low-precision training. Though there are numerous variants of this algorithm, SGD remains the de facto algorithm used most for machine learning. We will start by describing SGD and how it can be made low-precision.
Suppose we are trying to solve the problem

minimize: f(w) = (1/n) Σ_{i=1}^n f̃_i(w)    over: w ∈ R^d.    (2)

SGD solves this problem iteratively by repeatedly running the update step

w_{t+1} = w_t − α ∇f̃_{i_t}(w_t),    (3)

where α is the step size¹ or learning rate, and i_t is the index of a component function chosen randomly and uniformly at each iteration from {1, . . . , n}. To make this algorithm low-precision, we quantize the iterates (the vectors w_t) and store them in a low-precision format. The standard format to use lets us represent numbers in a set

dom(δ, b) = {−δ · 2^{b−1}, −δ · (2^{b−1} − 1), · · · , −δ, 0, δ, 2δ, · · · , δ · (2^{b−1} − 1)},

with δ > 0 being the quantization gap, the distance between adjacent representable numbers, and b ∈ N being the number of bits we use [8]. Usually, δ is a power of 2, and this scheme is called fixed-point arithmetic. It is straightforward to encode numbers in this set as b-bit signed integers, by just multiplying or dividing by δ to convert to or from the encoded format; we can even do many arithmetic computations on these numbers directly as integers. This is sometimes called linear quantization because the representable points are distributed uniformly throughout their range. However, as the gradient samples will produce numbers outside this set during iteration, we need some way to map these numbers to the set of numbers that we can represent. The standard way to do this is with a quantization function Q(x) : R → dom(δ, b). While many quantization functions have been proposed, the one typically used in theoretical analysis (which we will continue to use here) is randomized rounding.
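As a concrete illustration, the fixed-point set dom(δ, b) and its integer encoding can be sketched in a few lines (the function names here are ours, not from the paper):

```python
# Sketch of the b-bit fixed-point format dom(delta, b) described above,
# together with the signed-integer encode/decode it admits.

def fixed_point_domain(delta, b):
    """All representable values: {-delta*2^(b-1), ..., -delta, 0, delta, ..., delta*(2^(b-1)-1)}."""
    return [delta * k for k in range(-(2 ** (b - 1)), 2 ** (b - 1))]

def encode(x, delta):
    """Store a representable value as its b-bit signed-integer code."""
    return round(x / delta)

def decode(k, delta):
    """Recover the real value from its integer code."""
    return k * delta

# Example: 4 bits with quantization gap 0.25 cover [-2.0, 1.75] in steps of 0.25.
dom = fixed_point_domain(0.25, 4)
```

Since every point is an integer multiple of δ, arithmetic on the codes is ordinary integer arithmetic, which is what makes this format cheap in hardware.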
Randomized rounding, also known as unbiased rounding or stochastic rounding, rounds up or down at random such that E[Q(x)] = x whenever x is within the range of representable numbers (i.e., when −δ · 2^{b−1} ≤ x ≤ δ · (2^{b−1} − 1)). When x is outside that range, we quantize it to the closest representable point. When we apply Q to a vector argument, it quantizes each of its components independently.

Using this quantization function, we can write the update step for low-precision SGD (LP-SGD), which is a simple quantization of (3):

w_{t+1} = Q(w_t − α ∇f̃_{i_t}(w_t)).    (4)

As mentioned before, one common feature of prior bounds on the convergence of LP-SGD is that they depend on the number of dimensions d, whereas bounds on full-precision SGD under the same conditions don't. This difference is due to the fact that, when we quantize a number w, quantization increases its variance by E[(Q(w) − w)²] ≤ δ²/4. Observe that this inequality is tight, since it holds as an equality when w is in the middle of two quantization points, e.g. w = δ/2, as illustrated in Figure 1(a). When quantizing a vector w ∈ R^d, the squared error can be increased by

E[‖Q(w) − w‖₂²] = Σ_{k=1}^d E[(Q(w_k) − w_k)²] ≤ δ²d/4,    (5)

and this bound is again tight. This variance inequality is the source of the d term in analyses of LP-SGD, and the tightness of the bound leads to the natural belief that the d term is inherent, and that low-precision results are inevitably dimension-dependent.

However, we propose that if we can instead bound the variance in (5) with some property of the problem itself that is not inherently dependent on d, we can achieve a result that is dimension-free. One way to do this is to look at the variance graphically.
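A minimal sketch of randomized rounding and the LP-SGD step (4), assuming the saturating out-of-range behavior described above (our own helper names, not the authors' code):

```python
import random

def stochastic_round(x, delta, b):
    """Unbiased randomized rounding onto dom(delta, b): E[Q(x)] = x whenever x
    is in range; values outside the range saturate to the nearest endpoint."""
    lo, hi = -delta * 2 ** (b - 1), delta * (2 ** (b - 1) - 1)
    x = min(hi, max(lo, x))          # clip to the representable range
    k = x / delta
    k_floor = int(k // 1)            # grid index just below x
    p_up = k - k_floor               # round up with probability = fractional part
    k_round = k_floor + (1 if random.random() < p_up else 0)
    return min(hi, max(lo, k_round * delta))

def lp_sgd_step(w, grad, alpha, delta, b):
    """One LP-SGD update (4): quantize each coordinate of w - alpha * grad."""
    return [stochastic_round(wi - alpha * gi, delta, b) for wi, gi in zip(w, grad)]
```

Averaging many independent roundings of the same input recovers that input, which is the unbiasedness property E[Q(x)] = x that the analysis relies on.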
Figure 1(a) plots the quantization error as a function of w, along with the bound in (5). Notice that the squared error looks like a series of parabolas, and the bound in (5) is tight at the top of those parabolas, but loose elsewhere. Instead, suppose we want to do the opposite and produce a bound that is tight when the error is zero (at points in dom(δ, b)). To do this, we observe that E[(Q(w) − w)²] ≤ δ|w − z| for any z ∈ dom(δ, b). This bound is also tight when z is adjacent to w, and we plot it in Figure 1(a) as well. The natural vector analog of this is

E[‖Q(w) − w‖₂²] ≤ Σ_{k=1}^d δ|w_k − z_k| = δ ‖w − z‖₁,    ∀z ∈ dom(δ, b)^d,    (6)

where ‖·‖₁ denotes the ℓ1-norm. This is a dimension-free bound we can use to replace (5) to bound the convergence of LP-SGD and other algorithms. However, this replacement is nontrivial, as our bound is now non-constant: it depends on w, which is a variable updated each iteration. Also, in order to bound this new ℓ1-norm term, we will need some new assumptions about the problem.

¹Usually in SGD the step size is decreased over time, but here for simplicity we consider a constant learning-rate schedule.

Figure 1: The actual quantization variance E[(Q(w) − w)²] and the tight upper bounds that we introduce, in one dimension; we plot each bound taking the minimum over all possible z. (a) Linear quantization error and two possible tight upper bounds. (b) Variance and bound for logarithmic quantization. (c) Variance and bound for floating-point quantization.
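To see numerically how the two per-coordinate bounds compare, note that for unbiased rounding the exact variance is p(1 − p)δ² when w sits a fraction p of the way between grid points. A small sketch of our own (not from the paper):

```python
def rounding_variance(w, delta):
    """Exact variance of unbiased rounding of w onto the grid delta * Z:
    if w lies a fraction p of the way between grid points, Var = p*(1-p)*delta**2."""
    p = w / delta - (w / delta) // 1
    return p * (1 - p) * delta ** 2

delta = 0.25
midpoint_var = rounding_variance(0.375, delta)   # w halfway between 0.25 and 0.5
near_grid_var = rounding_variance(0.26, delta)   # w just above the grid point 0.25
parabola_bound = delta ** 2 / 4                  # per-coordinate version of (5)
l1_bound = delta * abs(0.26 - 0.25)              # per-coordinate version of (6), z = 0.25
```

The δ²/4 bound is only attained at midpoints; near a grid point the δ|w − z| bound is over six times smaller in this example, which is exactly the slack that the ℓ1-norm argument exploits.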
Next, we will state these assumptions, along with the standard assumptions used in the analysis of SGD for both convex and non-convex objectives, and then we will use them to present our dimension-free bound on the convergence of SGD.

Assumption 1. All the loss functions f̃_i are differentiable, and their gradients are L-Lipschitz continuous in the sense of the 2-norm, that is,

∀i ∈ {1, 2, · · · , n}, ∀x, y ∈ R^d:  ‖∇f̃_i(x) − ∇f̃_i(y)‖₂ ≤ L ‖x − y‖₂.

Assumption 2. All the gradients of the loss functions f̃_i are L₁-Lipschitz continuous in the sense of the 1-norm relative to the 2-norm, that is,

∀i ∈ {1, 2, · · · , n}, ∀x, y ∈ R^d:  ‖∇f̃_i(x) − ∇f̃_i(y)‖₁ ≤ L₁ ‖x − y‖₂.

These two assumptions simply express Lipschitz continuity in different norms. Assumption 1 is a standard assumption in the analysis of SGD on convex objectives, and has been applied in the low-precision case as well in prior work [8]. Assumption 2 is analogous to Assumption 1, except we are bounding the ℓ1-norm instead of the ℓ2-norm. This holds naturally (with a reasonable value of L₁) for many problems, in particular problems for which the gradient samples are sparse.

Assumption 3. The total loss function f is µ-strongly convex for some µ > 0:

∀w, v:  f(w) − f(v) − (µ/2) ‖w − v‖₂² ≥ (w − v)ᵀ ∇f(v).

This is a standard assumption that bounds the curvature of the loss function f, and is satisfied for many classes of convex objectives.
When an objective is strongly convex and Lipschitz continuous, it is standard to say it has condition number κ = L/µ, and here we extend this to say it has L₁ condition number κ₁ = L₁/µ. For our analysis of the non-convex case, we do not use this assumption.

Assumption 4. If the objective is convex, we assume that the gradient of each loss function is bounded by some constant near the optimal point w* in the sense of the ℓ1 and ℓ2 norms, that is,

E[‖∇f̃_i(w*)‖₂²] ≤ σ²,  E[‖∇f̃_i(w*)‖₁] ≤ σ₁.

If the objective is non-convex, there is not necessarily a single optimal point, so we just assume each loss function has a global bound on its gradient: for any w,

E[‖∇f̃_i(w)‖₂²] ≤ σ²,  E[‖∇f̃_i(w)‖₁] ≤ σ₁.

This assumption constrains the gradient of each loss function at the optimal point. We know ∇f(w*) = (1/n) Σ_i ∇f̃_i(w*) = 0, so it is intuitive that each ∇f̃_i(w*) can be bounded by some value.

Table 1: Summary of our dimension-free results compared with prior work.
The values report the number of bits needed, according to the theoretical bounds, for the LP-SGD [17] algorithm to achieve an expected objective gap f(w) − f(w*) of ε in the convex case, and an expected squared gradient norm of ε in the non-convex case, when we let the step size α → 0 and the epoch length T → ∞. Here we let R denote the radius of the range of numbers representable in the low-precision format and assume ‖w*‖₂ = Θ(R). The rest of the parameters can be found in the assumptions introduced later.

OBJECTIVE CLASS                                 CONVEX                              NON-CONVEX
NUMBER OF BITS NEEDED FOR                       E[f(w) − f(w*)] ≤ ε                 E[‖∇f(w̄)‖₂²] ≤ ε
PRIOR DIMENSION-DEPENDENT BOUND                 log₂ O(R σ_max √d / ε)              —
OUR DIMENSION-FREE BOUND                        log₂ O(R σ₁ / ε)                    log₂ O(L R σ₁ / ε)
DIMENSION-FREE WITH LOGARITHMIC QUANTIZATION    log₂ O((Rσ/ε) · log(1 + σ₁/σ))      log₂ O((LR/√ε) · log(1 + σ₁/√ε))

In the non-convex case, however, we need a global bound on the gradient instead of just a bound at the optimum. This is a natural assumption to make, and it has been used in much other work in this area. Note that this assumption only needs to hold in expectation over all f̃_i.

For the non-convex case, we also need the following additional assumption.

Assumption 5. The variance of the gradient of each loss function is bounded by some constant σ₀²:

∀w:  Var(∇f̃_i(w)) = E[‖∇f̃_i(w) − ∇f(w)‖₂²] ≤ σ₀².

With these assumptions, we prove the following theorems for low-precision SGD.

Theorem 1. Suppose that we run LP-SGD on an objective that satisfies Assumptions 1-4, with step size α < 1/(2κ²µ).
After T LP-SGD update steps (4), select w̄_T uniformly at random from {w₀, w₁, . . . , w_{T−1}}. Then the expected objective gap of w̄_T is bounded by

E[f(w̄_T) − f(w*)] ≤ ‖w₀ − w*‖₂² / (2αT) + (ασ² + δσ₁)/2 + δ²κ₁²µ/4.

Theorem 2. Suppose that we run LP-SGD on an objective that is non-convex and satisfies Assumptions 1, 4, and 5, with constant step size α. After T LP-SGD update steps, select w̄_T uniformly at random from {w₀, w₁, . . . , w_{T−1}}. Then the expected squared gradient norm of w̄_T is bounded by

E[‖∇f(w̄_T)‖₂²] ≤ (2 / (2α − α²L)) · (f(w₀) − f*)/T + (ασ₀²L + Lδσ₁) / (2 − αL).

The first theorem bounds the expected gap between the objective value we get after T iterations and the optimal value for a convex objective, and the second bounds the expected squared gradient after T iterations, where f* is the global minimum of the objective f. By choosing an appropriate step size we can achieve convergence at a 1/T rate, while the limit we converge to depends only on dimension-free factors. Meanwhile, as mentioned in the first section, previous work gives a dimension-dependent bound (1) for this problem, which also converges at a 1/T rate.² Therefore our result guarantees a dimension-free convergence limit without weakening the convergence rate.

It is important to note that, because the dimension-dependent bound in (5) was tight, we should not expect our new result to improve upon the previous theory in all cases. In the worst case, κ₁ = √d · κ and similarly σ₁ = √d · σ; this follows from the fact that for vectors in R^d, the norms are related by the inequality ‖x‖₁ ≤ √d · ‖x‖₂.
Substituting this into our result produces a dimension-dependent bound again. This illustrates the importance of introducing the new parameters κ₁ and σ₁ and requiring that they be bounded; if we could not express our bound in terms of these parameters, the best we could do here is recover a dimension-dependent bound. This means that we achieve the standard result of a dimension-dependent bound in the worst case, but in other cases our results are strictly better.

²Previous work (1) used a decaying step size, while ours uses a constant step size to achieve a better result.

Figure 2: (a) Convergence of full-precision (fp) SGD and LP-SGD, showing the training loss gap f(w) − f(w*) for different model sizes and precisions; (b)(c) plots of the asymptotic loss gap from (a) as a function of model size d and σ₁. (b) The size of the noise ball is not significantly affected by model size d. (c) The size of the noise ball does depend on σ₁, especially when precision is low.

Experiments. Next, we validate our theoretical results experimentally on convex problems. To do this, we analyzed how the size of the noise floor of convergence of SGD and LP-SGD varies as the dimension is changed for a class of synthetic problems. Importantly, we needed to pick a class of problems for which the parameters L, L₁, µ, σ, and σ₁ did not change as we changed the dimension d. To do this, we chose a class of synthetic linear regression models with loss components sampled independently and identically as

f̃_i(w) = (1/2)(x̃ᵀw − ỹ)²,

where x̃ is a sparse vector sampled to have s nonzero entries, each of which is sampled uniformly from {−1, 1}, and ỹ is sampled from N(x̃ᵀw*, β²) for some variance parameter β.
Importantly, the nonzero entries of x̃ were chosen non-uniformly, such that Pr[x̃_i ≠ 0] = p_i for some probabilities p_i which decrease as i increases; this lets us ensure that µ remains constant as d is increased. For simplicity, we sampled a fresh loss component of this form at each SGD iteration, which is sometimes called the online setting. It is straightforward to derive that for this problem

µ = p_d,  L = s,  L₁ = s√s,  σ² = β²s,  σ₁ = √(2s/π) · σ.

We set α = 0.01, β = 0.2, p₁ = 0.9, p_d = 0.001, and s = 16; we chose each entry of w* uniformly from [−1/2, 1/2], and we set δ such that the low-precision numbers would range from −1 to 1. We set these parameters so that our model satisfies Assumptions 1 to 4 while maintaining the smallest values of the parameters L, L₁, σ, and σ₁; choosing a different set of parameters would lead to a looser theoretical bound. Figure 2(a) shows the convergence of SGD and LP-SGD as the dimension d is changed, for both 8-bit and 6-bit quantization. Notice that while changing d has an effect on the initial convergence rate for both SGD and LP-SGD, it has no effect on the noise ball size, the eventual loss gap that the algorithm converges to. Figure 2(b) measures this noise ball size more explicitly as the dimension is changed: it reports the loss gap averaged across the second half of the iterates. Notice that as the dimension d is changed, the average loss gap is almost unchanged, even for very low-precision methods for which the precision does significantly affect the size of the noise ball. This validates our dimension-free bounds, and shows that they can describe the actual dependence on d in at least one case.

Figure 2(c) validates our results in the opposite way: it looks at how this gap changes as our new parameters σ₁ and L₁ change while d, µ, and σ are kept fixed.
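The synthetic model above can be sampled directly; this sketch uses Efraimidis-Spirakis weighted sampling for the support (our own choice of sampler, since the paper does not specify one) and the squared-error loss component:

```python
import random

def sample_loss_component(w_star, probs, s, beta):
    """Draw one synthetic example: a sparse vector x with s nonzero entries in
    {-1, +1}, support chosen with inclusion weights probs, and y ~ N(x^T w_star, beta^2)."""
    d = len(w_star)
    keys = [random.random() ** (1.0 / probs[i]) for i in range(d)]   # weighted sampling keys
    support = sorted(range(d), key=keys.__getitem__, reverse=True)[:s]
    x = [0.0] * d
    for i in support:
        x[i] = random.choice((-1.0, 1.0))
    y = sum(xi * wi for xi, wi in zip(x, w_star)) + random.gauss(0.0, beta)
    return x, y

def loss_and_grad(w, x, y):
    """f_i(w) = (x^T w - y)^2 / 2 and its gradient (x^T w - y) * x."""
    r = sum(xi * wi for xi, wi in zip(x, w)) - y
    return 0.5 * r * r, [r * xi for xi in x]
```

Feeding these samples into the LP-SGD update reproduces the online setting described above, with one fresh loss component per iteration.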
To do this, we fixed d = 1024 and changed s across a range, setting β = 0.8/√s, which keeps σ² constant as s is changed: this has the effect of changing σ₁ (and, as a side effect, L₁ and L). We can see from Figure 2(c) that changing σ₁ in this way has a much greater effect on LP-SGD than on SGD. This validates our theoretical results, and suggests that σ₁ and L₁ can effectively determine the effect of low-precision compute on SGD.

4 Non-linear Quantization

Up till now, most theoretical work in the area of low-precision machine learning has been on linear quantization, where the distance between adjacent quantization points is a constant value δ. Another option is non-linear quantization (NLQ), in which we quantize to a set of points that are non-uniformly distributed. This approach has been shown to be effective for accelerating deep learning in some settings [16]. In general, we can quantize to a set of points

D = {−q_n, · · · , −q₁, q₀, q₁, · · · , q_{n−1}},

and, just like with linear quantization, we can still use a quantization function Q(w) with randomized rounding that rounds up or down to a number in D in such a way that E[Q(w)] = w for w ∈ [−q_n, q_{n−1}].
When we consider the quantization variance here, the natural dimension-dependent bound would be

E[‖Q(w) − w‖₂²] ≤ (d/4) · max_i (q_i − q_{i−1})².

This is still a tight bound, since it holds with equality for a number in the middle of the two adjacent quantization points that are farthest apart. However, when applied in the analysis of LP-SGD, this bound induces poor performance and often under-represents the actual result.

Here we discuss a specific NLQ method and use it to introduce a tight bound on the quantization variance. This method has been previously studied as logarithmic quantization or µ-law quantization, and is defined recursively by

q₀ = 0,  q_{i+1} − q_i = δ + ζ q_i,    (7)

where δ > 0 and ζ > 0 are fixed parameters. Note that this includes linear quantization as a special case by setting ζ = 0. It turns out that we can prove a tight dimension-free bound on the quantization variance of this scheme. First, we introduce the following definition.

Definition 1. An unbiased quantization function Q satisfies the dimension-free variance bound with parameters δ, ζ, and η if for all w ∈ [−q_n, q_{n−1}]^d and all z ∈ D^d,

E[‖Q(w) − w‖₂²] ≤ δ ‖w − z‖₁ + ζ ‖z‖₂ · ‖w − z‖₂ + η ‖w − z‖₂².

We can prove that our logarithmic quantization scheme satisfies this bound.

Lemma 1. The logarithmic quantization scheme (7) satisfies the dimension-free variance bound with parameters δ, ζ, and η = ζ² / (4(ζ + 1)) < ζ/4.

Notice that this bound becomes identical to the linear quantization bound (6) when ζ = 0, so this result is a strict generalization of our results from the linear quantization case.
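The recursion (7) is easy to instantiate; a short sketch of our own that generates the nonnegative points and evaluates the η of Lemma 1:

```python
def log_quant_points(delta, zeta, n):
    """Nonnegative points of the logarithmic quantization scheme (7):
    q_0 = 0 and q_{i+1} = q_i + delta + zeta * q_i, so the gaps grow geometrically."""
    q = [0.0]
    for _ in range(n):
        q.append(q[-1] + delta + zeta * q[-1])
    return q

zeta = 0.5
eta = zeta ** 2 / (4 * (zeta + 1))   # Lemma 1's eta, always below zeta / 4
pts = log_quant_points(1.0, zeta, 4)
```

With ζ = 0 the recursion collapses to the uniform grid of linear quantization, matching the special case noted above.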
With this setup, we can apply NLQ to the low-precision training algorithms we have studied earlier in this paper.

Theorem 3. Suppose that we run LP-SGD on a convex objective that satisfies Assumptions 1-4, using a quantization scheme that satisfies the dimension-free variance bound of Definition 1. If ζ < 1/κ, then

E[f(w̄_T) − f(w*)] ≤ ‖w₀ − w*‖₂² / (2αT) + ((1 + η)ασ² + δσ₁ + ζσ‖w*‖₂)/2 + (δL₁ + ζL‖w*‖₂ + ζσ)² / (4µ).

For non-convex objectives, we need to assume a bound on the iterates we deal with:

Assumption 6. The scale of the iterates is bounded by some constant R₀, i.e., ∀t, ‖w_t‖₂ ≤ R₀.

Theorem 4. Suppose that we run LP-SGD on a non-convex objective that satisfies Assumptions 1 and 4-6, with constant step size α < 1/(2(η + 1)L), using a quantization scheme that satisfies the dimension-free variance bound of Definition 1. Then

E[‖∇f(w̄_T)‖₂²] ≤ 2(f(w₀) − f*)/(αT) + ασ₀²L + Lδσ₁ + (1/2)(LζR₀)².

If we fix the representable range R (the largest-magnitude value representable in the low-precision format) and choose our quantization parameters optimally, we get the result that the number of bits we need to achieve objective gap or expected squared gradient ε is log₂ O((Rσ/ε) · log(1 + σ₁/σ)) and log₂ O((LR/√ε) · log(1 + σ₁/√ε)), respectively (as shown in Table 1). These bounds are notable because, even in the worst case where we do not have a bound on σ₁ and must use σ₁ ≤ √d · σ, which recovers the dimension term, the bounds still manage to "hide" it within a log term.
This greatly decreases the effect of the dimension, and suggests that NLQ may be a promising technique to use for low-precision training at scale. Also note that, although the first bound holds only when $\zeta < \frac{1}{\kappa} = \frac{\mu}{L}$, which to some extent limits how quickly the strides of logarithmic quantization can grow, the bound $\frac{\mu}{L}$ is independent of $\sigma$ and $\sigma_1$; thus this effect of "pushing" $\sigma_1$ into a log term is independent of the setting of $\zeta$.

Floating point. Next, we look at another type of non-linear quantization that is of great practical use: floating-point quantization (FPQ). Here, the quantization points are simply floating-point numbers with some fixed number of exponential bits $b_e$ and mantissa bits $b_m$. Floating-point numbers are represented in the form
$$(-1)^{\text{sign bit}} \cdot 2^{\text{exponent} - \text{bias}} \cdot (1.m_1 m_2 m_3 \ldots m_{b_m}), \qquad (8)$$
where "exponent" is a $b_e$-bit unsigned number, the $m_i$ are the $b_m$ bits of the mantissa, and "bias" is a term that sets the range of the representable numbers by determining the range of the exponent. In standard floating-point numbers, the exponent ranges over $[-2^{b_e-1}+2,\ 2^{b_e-1}-1]$, which corresponds to a bias of $2^{b_e-1} - 1$. To make our results more general, we also consider a non-standard bias by defining a scaling factor $s = 2^{-(\text{bias} - \text{standard bias})}$; the standard bias setting corresponds to $s = 1$. We also consider the case of denormal floating-point numbers, which address underflow by replacing the leading 1 in (8) with a 0 for the smallest exponent value. Under these conditions, we can prove that floating-point quantization satisfies the bound in Definition 1.

Lemma 2.
The FPQ scheme using randomized rounding satisfies the dimension-free variance bound with parameters $\delta_{\text{normal}}$, $\zeta$, and $\eta$ for normal FPQ, and $\delta_{\text{denormal}}$, $\zeta$, and $\eta$ for denormal FPQ, where
$$\delta_{\text{normal}} = \frac{4s}{2^{2^{b_e-1}}}, \qquad \delta_{\text{denormal}} = \frac{8s}{2^{2^{b_e-1}+b_m}}, \qquad \zeta = 2^{-b_m}, \qquad \eta = \frac{\zeta^2}{4(\zeta+1)}.$$

This bound can be immediately combined with Theorem 3 to produce dimension-free bounds on the convergence rate of low-precision floating-point SGD. If we are given a fixed number of total bits $b = b_e + b_m$, we can minimize this upper bound on the objective gap or the expected gradient to try to predict the best way to allocate our bits between the exponent and the mantissa.

Theorem 5. When using normal FPQ for a convex objective, given $b$ total bits, the optimal number of exponential bits $b_e$ such that the asymptotic upper bound on the objective gap given by Theorem 3 is minimized lies in the interval between
$$\log_2\left[2\log_2\left(\frac{2(\ln 2)\,s\,\sigma_1}{\sigma\|w^*\|_2}\right) + 2b\right] \quad \text{and} \quad \log_2\left[2\log_2\left(\frac{2(\ln 2)\,s\,L_1}{L\|w^*\|_2 + \sigma}\right) + 2b\right].$$

Theorem 6. When using denormal FPQ for a convex objective, given $b$ total bits, the optimal number of exponential bits $b_e$ such that the asymptotic upper bound on the objective gap, as $T \to \infty$ and $\alpha \to 0$, given by Theorem 3 is minimized lies in the interval between
$$1 - \frac{2}{\ln 2}\log_2\left[W\!\left(\frac{e\,\sigma\|w^*\|_2}{8s\sigma_1}\right)\right] \quad \text{and} \quad 1 - \frac{2}{\ln 2}\log_2\left[W\!\left(\frac{e\left(L\|w^*\|_2 + \sigma\right)}{8sL_1}\right)\right],$$
where $e$ denotes the base of the natural logarithm and $W$ stands for the Lambert W function.
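The normal-FPQ point set and the parameters of Lemma 2 can be sanity-checked with a short enumeration. The sketch below is ours; the helper name and the small bit widths are illustrative assumptions.

```python
def fpq_points(be, bm, s=1.0):
    """Nonnegative normal-FPQ quantization points: s * (1 + x/2^bm) * 2^y, with the
    standard exponent range y in [-2^(be-1)+2, 2^(be-1)-1], plus zero."""
    nm = 2 ** bm
    lo, hi = -(2 ** (be - 1)) + 2, 2 ** (be - 1) - 1
    pts = [0.0]
    for y in range(lo, hi + 1):
        for x in range(nm):
            pts.append(s * (1 + x / nm) * 2.0 ** y)
    return sorted(pts)

pts = fpq_points(be=4, bm=2)
# the relative step between adjacent nonzero points never exceeds zeta = 2^(-bm) ...
assert max((b - a) / a for a, b in zip(pts[1:], pts[2:])) <= 2.0 ** -2 + 1e-12
# ... and the smallest positive point equals delta_normal = 4s / 2^(2^(be-1))
assert abs(pts[1] - 4.0 / 2 ** (2 ** 3)) < 1e-15
```

The multiplicative gap structure is exactly what makes FPQ a special case of the Definition 1 bound: the $\delta$ term captures the fixed resolution near zero, while $\zeta$ captures the relative resolution elsewhere.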
In cases where neither of these two values exists, the noise ball size is increasing in $b_e$, so $b_e = 2$ would be the optimal setting, which is equivalent to linear quantization.

Theorem 7. When using normal FPQ for a non-convex objective, given $b$ total bits, the optimal number of exponential bits $b_e$ such that the asymptotic upper bound on the gradient, as $T \to \infty$ and $\alpha \to 0$, given by Theorem 4 is minimized is
$$\frac{2}{\ln 2}\log_2\left[W\!\left(\frac{(\ln 2)^2\, s\, \sigma_1\, 2^{2b}}{L R_0^2}\right)\right].$$

These theorems give us an idea of where the optimal setting of $b_e$ lies such that the theoretical asymptotic error or the expected gradient is minimized. When using normal FPQ, this optimal assignment of $b_e$ is $O(\log(b))$, and for denormal FPQ the result is independent of $b$. Also, we found that for denormal FPQ used with non-convex objectives, the optimal setting of $b_e$ is the solution to a transcendental equation, which may not exist. This suggests that once the total number of bits grows past a threshold, we should assign most or all of the extra bits to the mantissa.

Experiments For FPQ, we ran experiments on two different data sets. First, we ran LP-SGD on the same synthetic data set that we used for linear regression. Here we used normal FPQ with 20 bits in total, and we get the result in Figure 3(a). In this diagram, we plotted the empirical noise ball size, its theoretical upper bound, and the optimal interval for $b_e$ as Theorem 5 predicts.

Figure 3: Plots of noise ball size vs. $b_e$ when running SGD with 16-bit FPQ: (a) on the synthetic data set; (b) on MNIST.
Note the use of two y-axes in Figure 3(b) to make the series fit in one figure.

As the figure shows, our theorem accurately predicts the optimal setting of exponential bits, which is 5 in this case, to minimize both the theoretical upper bound and the actual empirical result of the noise ball size, despite the theoretical upper bound being loose.

Second, we ran LP-SGD on the MNIST dataset [10]. To set up the experiment, we normalized the MNIST data to be in $[0, 1]$ by dividing by 255, then subtracted out the mean of each feature. We ran multiclass logistic regression using an L2 regularization constant of $10^{-4}$ and a step size of $\alpha = 10^{-4}$, running for 500 total epochs (passes through the dataset) to be sure we converged. For this task, our (measured) problem parameters were $L = 37.41$, $L_1 = 685.27$, $\sigma = 2.38$, $\sigma_1 = 29.11$, and $d = 784$. In Figure 3(b), we plotted the observed loss gap, averaged across the last ten epochs, for LP-SGD using various 16-bit floating-point formats. We also plot our theoretical bound on the loss gap, and the predicted optimal number of exponential bits to use based on that bound. Our results show that even though our bound is very loose for this task, it still predicts the right number of bits to use with reasonable accuracy. This experiment also validates the use of IEEE standard half-precision floating-point numbers, which have 5 exponential bits, for this sort of task.

5 Conclusion

In this paper, we present dimension-free bounds on the convergence of SGD when applied to low-precision training. We point out the conditions under which such bounds hold, for both convex and non-convex objectives. We further extend our results to non-linear methods of quantization: logarithmic quantization and floating-point quantization.
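As a rough reproduction of this kind of prediction, one can sweep $b_e$ against the asymptotic non-convex bound of Theorem 4 with the normal-FPQ parameters of Lemma 2. This is a sketch: $L$ and $\sigma_1$ are the measured MNIST constants reported above, but the iterate bound $R_0$ here is a hypothetical value, not a measured one, so the resulting minimizer is only illustrative.

```python
def noise_ball_bound(be, b, L, sigma1, R0, s=1.0):
    """alpha -> 0 limit of the Theorem 4 bound, L*delta*sigma1 + 0.5*(L*zeta*R0)^2,
    with the normal-FPQ parameters delta = 4s/2^(2^(be-1)) and zeta = 2^-(b-be)."""
    delta = 4.0 * s * 2.0 ** (-(2 ** (be - 1)))   # negative exponent: underflows safely
    zeta = 2.0 ** -(b - be)
    return L * delta * sigma1 + 0.5 * (L * zeta * R0) ** 2

# L and sigma1 as measured for MNIST above; R0 = 1.0 is a hypothetical iterate bound
L, sigma1, R0, b = 37.41, 29.11, 1.0, 16
best_be = min(range(2, b), key=lambda be: noise_ball_bound(be, b, L, sigma1, R0))
print(best_be)
```

The sweep exhibits the same qualitative shape as Figure 3: too few exponential bits inflates the $\delta$ (range/underflow) term, too many inflates the $\zeta$ (mantissa) term, with a minimum in between.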
We analyze the performance of SGD under logarithmic quantization and demonstrate that NLQ is a promising method for reducing the number of bits required in low-precision training. We also present ways in which our theory can be used to suggest how to allocate bits between exponent and mantissa when FPQ is used. We hope that our work will encourage further investigation of non-linear quantization techniques.

References

[1] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1707–1718, 2017.

[2] Doug Burger. Microsoft unveils Project Brainwave for real-time AI. Microsoft Research, Microsoft, 22, 2017.

[3] Adrian M Caulfield, Eric S Chung, Andrew Putnam, Hari Angepat, Daniel Firestone, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, et al. Configurable clouds. IEEE Micro, 37(3):52–61, 2017.

[4] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Training deep neural networks with low precision multiplications. arXiv preprint arXiv:1412.7024, 2014.

[5] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pages 3123–3131, 2015.

[6] Dipankar Das, Naveen Mellempudi, Dheevatsa Mudigere, Dhiraj Kalamkar, Sasikanth Avancha, Kunal Banerjee, Srinivas Sridharan, Karthik Vaidyanathan, Bharat Kaul, Evangelos Georganas, et al.
Mixed precision training of convolutional neural networks using integer operations. arXiv preprint arXiv:1802.00930, 2018.

[7] Christopher De Sa, Matthew Feldman, Christopher Ré, and Kunle Olukotun. Understanding and optimizing asynchronous low-precision stochastic gradient descent. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 561–574. ACM, 2017.

[8] Christopher De Sa, Megan Leszczynski, Jian Zhang, Alana Marzoev, Christopher R Aberger, Kunle Olukotun, and Christopher Ré. High-accuracy low-precision training. arXiv preprint arXiv:1803.03383, 2018.

[9] Christopher M De Sa, Ce Zhang, Kunle Olukotun, and Christopher Ré. Taming the wild: A unified analysis of Hogwild-style algorithms. In Advances in Neural Information Processing Systems, pages 2674–2682, 2015.

[10] Li Deng. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.

[11] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning, pages 1737–1746, 2015.

[12] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems, pages 4107–4115, 2016.

[13] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In International Conference on Neural Information Processing Systems, pages 315–323, 2013.

[14] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 1–12.
ACM, 2017.

[15] Urs Köster, Tristan Webb, Xin Wang, Marcel Nassar, Arjun K Bansal, William Constable, Oguz Elibol, Scott Gray, Stewart Hall, Luke Hornof, et al. Flexpoint: An adaptive numerical format for efficient training of deep neural networks. In Advances in Neural Information Processing Systems, pages 1742–1752, 2017.

[16] Edward H Lee, Daisuke Miyashita, Elaina Chai, Boris Murmann, and S Simon Wong. LogNet: Energy-efficient neural networks using logarithmic computation. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pages 5900–5904. IEEE, 2017.

[17] Hao Li, Soham De, Zheng Xu, Christoph Studer, Hanan Samet, and Tom Goldstein. Training quantized nets: A deeper understanding. In Advances in Neural Information Processing Systems, pages 5813–5823, 2017.

[18] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.

[19] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[20] Shuang Wu, Guoqi Li, Feng Chen, and Luping Shi. Training and inference with integers in deep neural networks. arXiv preprint arXiv:1802.04680, 2018.

[21] Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang. ZipML: Training linear models with end-to-end low precision, and a little bit of deep learning. In International Conference on Machine Learning, pages 4035–4043, 2017.

[22] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.

[23] Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization.
arXiv preprint arXiv:1612.01064, 2016.

A Algorithm

In our work, we presented dimension-free bounds on the performance of low-precision SGD; here we present the algorithm in detail.

Algorithm 1 LP-SGD: Low-Precision Stochastic Gradient Descent
given: $n$ loss functions $\tilde f_i$, number of epochs $T$, step size $\alpha$, and initial iterate $w_0$.
given: low-precision quantization function $Q$.
for $t = 0$ to $T - 1$ do
    sample $i_t$ uniformly from $\{1, 2, \cdots, n\}$
    quantize $w_{t+1} \leftarrow Q\!\left(w_t - \alpha \nabla \tilde f_{i_t}(w_t)\right)$
end for
return $w_T$

B Proof for results in Table 3

As mentioned in the caption of Table 3, here we only consider the convergence limit; that is, we assume $\alpha \to 0$ and $T \to \infty$, and we compute the minimum number of bits $b$ we would require in order for the limit to be less than some small positive $\varepsilon$. Meanwhile, we denote the radius of the representable range by $R$ and we assume $R = \|w^*\|_2$ without loss of generality, as this is the worst case for all our bounds that depend on $\|w^*\|_2$. Then in linear quantization, we have
$$q_{2^{b-1}-1} = \delta \cdot \left(2^{b-1} - 1\right) \ge R,$$
and in non-linear quantization, we need
$$q_{2^{b-1}-1} = \frac{\delta}{\zeta}\left((1+\zeta)^{2^{b-1}-1} - 1\right) \ge R. \qquad (9)$$
In the following proofs we will take these two inequalities with equality.

First, we consider convex objectives.

B.1 LP-SGD in previous work

In previous work (Li et al. [17]), we have
$$f(\bar w_T) - f(w^*) \le \frac{(1 + \log(T+1))G^2}{2\mu T} + \frac{\sqrt{d}\,G\delta}{2};$$
here we re-denote $G$ as $\sigma_{\max}$ for concordance with our results; $\sigma_{\max}^2$ is an upper bound on the second moment of the stochastic gradient samples, $\mathbf{E}\big[\|\nabla \tilde f(w)\|_2^2\big] \le \sigma_{\max}^2$.
Substituting $\delta = \frac{R}{2^{b-1}-1}$, setting the limit (as $\alpha \to 0$ and $T \to \infty$) to be $\le \varepsilon$, and noticing that $2^{b-1} - 1 > 2^{b-2}$, we have:
$$\frac{\sqrt{d}\,\sigma_{\max} R}{2\left(2^{b-1}-1\right)} = O(\varepsilon) \;\Rightarrow\; b \le \log_2\left(\frac{\sqrt{d}\,\sigma_{\max} R}{\varepsilon}\right) + 1 = \log_2 O\left(\frac{\sqrt{d}\,\sigma_{\max} R}{\varepsilon}\right).$$

B.2 LP-SGD in our work

In Theorem 1, we know that
$$\mathbf{E}\left[f(\bar w_T) - f(w^*)\right] \le \frac{1}{2\alpha T}\|w_0 - w^*\|_2^2 + \frac{\alpha\sigma^2 + \delta\sigma_1}{2} + \frac{\delta^2\kappa_1^2\mu}{4}.$$
Setting the limit (as $\alpha \to 0$ and $T \to \infty$) to be $\le \varepsilon$, we need
$$\frac{\delta\sigma_1}{2} = O(\varepsilon), \qquad \frac{\delta^2\kappa_1^2\mu}{4} = O(\varepsilon).$$
Then for sufficiently small $\varepsilon$ — more explicitly, $\varepsilon$ that satisfies $\frac{L_1^2}{\mu\sigma_1^2}\,O(\varepsilon) < 1$ — setting
$$\delta = O\left(\frac{\varepsilon}{\sigma_1}\right)$$
will satisfy the requirements, and we will get
$$\frac{R}{2^{b-1}-1} = \delta = O\left(\frac{\varepsilon}{\sigma_1}\right) \;\Rightarrow\; b = \log_2 O\left(\frac{\sigma_1 R}{\varepsilon}\right).$$
This is the expression that we wanted. Notice that even if we did not invoke small $\varepsilon$ in the above big-O analysis, we could set
$$\delta = O\left(\min\left(\frac{\varepsilon}{\sigma_1},\ \frac{\sqrt{\varepsilon\mu}}{L_1}\right)\right),$$
in which case our number of bits would look like
$$b = \log_2 O\left(\max\left(\frac{\sigma_1 R}{\varepsilon},\ \frac{R L_1}{\sqrt{\varepsilon\mu}}\right)\right),$$
which shows explicitly that we have replaced the dimension factor with parameters of the loss functions.

B.3 LP-SGD in our work using NLQ

In Theorem 3, we know that, if $\zeta < \frac{1}{\kappa}$, then
$$\mathbf{E}\left[f(\bar w_T) - f(w^*)\right] \le \frac{1}{2\alpha T}\|w_0 - w^*\|_2^2 + \frac{(1+\eta)\alpha\sigma^2 + \delta\sigma_1 + \zeta\sigma\|w^*\|_2}{2} + \frac{\left(\delta L_1 + \zeta L\|w^*\|_2 + \zeta\sigma\right)^2}{4\mu}.$$
Setting the limit (as $\alpha \to 0$ and $T \to \infty$) to be $\le \varepsilon$ and replacing $\|w^*\|_2$ with $R$, we get
$$\frac{\delta\sigma_1 + \zeta\sigma R}{2} + \frac{\left(\delta L_1 + \zeta L R + \zeta\sigma\right)^2}{4\mu} = O(\varepsilon).$$
So, in addition to our requirement that $\zeta \le \kappa^{-1}$, it suffices to have
$$\delta\sigma_1 = O(\varepsilon), \qquad \zeta\sigma R = O(\varepsilon), \qquad \frac{\delta^2 L_1^2}{\mu} = O(\varepsilon), \qquad \frac{\zeta^2(LR+\sigma)^2}{\mu} = O(\varepsilon).$$
If we set
$$\delta = \frac{O(\varepsilon)}{\sigma_1}, \qquad \zeta = \frac{O(\varepsilon)}{\sigma R},$$
then all our other requirements will be satisfied for sufficiently small $\varepsilon$. Specifically, we need $\varepsilon$ to be small enough that
$$\frac{\kappa}{\sigma R}\,O(\varepsilon) \le 1, \qquad \frac{L_1^2}{\sigma_1^2\mu}\,O(\varepsilon) \le 1, \qquad \frac{(LR+\sigma)^2}{\sigma^2 R^2\mu}\,O(\varepsilon) \le 1.$$
As is standard in big-O analysis, we assume that $\varepsilon$ is small enough that these requirements are satisfied, in which case our assignment of $\delta$ and $\zeta$, combined with the results of Theorem 3, is sufficient to ensure an objective gap of $\varepsilon$.
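This parameter choice can be checked numerically: with $\delta = \varepsilon/\sigma_1$ and $\zeta = \varepsilon/(\sigma R)$, the $T \to \infty$, $\alpha \to 0$ limit of the Theorem 3 bound scales linearly in $\varepsilon$. The sketch below uses hypothetical problem constants, for illustration only.

```python
def nlq_noise_ball(delta, zeta, sigma, sigma1, R, L, L1, mu):
    """T -> inf, alpha -> 0 limit of the Theorem 3 bound for NLQ."""
    return ((delta * sigma1 + zeta * sigma * R) / 2
            + (delta * L1 + zeta * L * R + zeta * sigma) ** 2 / (4 * mu))

# hypothetical problem constants, for illustration only
sigma, sigma1, R, L, L1, mu = 1.0, 30.0, 10.0, 5.0, 50.0, 0.5
for eps in (1e-2, 1e-3, 1e-4):
    gap = nlq_noise_ball(eps / sigma1, eps / (sigma * R), sigma, sigma1, R, L, L1, mu)
    assert gap <= 3 * eps          # the limiting gap is O(eps) once eps is small
```

The quadratic term shrinks like $\varepsilon^2$ under this assignment, so for small $\varepsilon$ the linear $\delta\sigma_1$ and $\zeta\sigma R$ terms dominate, exactly as the sufficient conditions above require.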
Next, starting from (9), the number of bits we need for non-linear quantization must satisfy
$$(1+\zeta)^{2^{b-1}-1} - 1 \ge \frac{\zeta R}{\delta},$$
which happens only when
$$\left(2^{b-1} - 1\right)\log(1+\zeta) \ge \log\left(1 + \frac{\zeta R}{\delta}\right).$$
Since we know that $0 \le \zeta < 1$, it follows that $\log(1+\zeta) \ge \zeta/2$. So in order for the above to be true, it suffices to have
$$\left(2^{b-1} - 1\right)\cdot\frac{\zeta}{2} \ge \log\left(1 + \frac{\zeta R}{\delta}\right).$$
Since $2^{b-1} - 1 > 2^{b-2}$, it follows that it suffices to have
$$\frac{2^b\cdot\zeta}{8} \ge \log\left(1 + \frac{\zeta R}{\delta}\right).$$
And this will be true if
$$b = \log_2 O\left(\frac{1}{\zeta}\log\left(1 + \frac{\zeta R}{\delta}\right)\right).$$
Finally, using our assignment of $\delta$ and $\zeta$ gives us
$$b = \log_2 O\left(\frac{\sigma R}{\varepsilon}\log\left(1 + \frac{\sigma_1}{\sigma}\right)\right).$$
This is the expression that we wanted. Notice that even if we did not invoke small $\varepsilon$ in the above big-O analysis, we would still get a rate in which all of our $\ell_1$-dependent terms are inside the double logarithm, because none of the requirements above that constrain $\zeta$ are $\ell_1$-dependent.
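The chain of sufficient conditions above can also be checked directly: the bit count implied by the sufficient condition $2^b \ge \frac{8}{\zeta}\log\left(1 + \frac{\zeta R}{\delta}\right)$ always dominates the exact minimum from (9). This is a small sketch with arbitrary illustrative constants.

```python
import math

def min_bits_nlq(delta, zeta, R):
    """Smallest b with (delta/zeta)*((1+zeta)^(2^(b-1)-1) - 1) >= R, i.e. the
    smallest bit width whose logarithmic grid (9) covers the range [0, R]."""
    b = 1
    while (delta / zeta) * ((1 + zeta) ** (2 ** (b - 1) - 1) - 1) < R:
        b += 1
    return b

delta, zeta, R = 1e-3, 1e-2, 10.0            # arbitrary illustrative values
b_exact = min_bits_nlq(delta, zeta, R)
# the sufficient condition derived above: 2^b >= (8/zeta)*log(1 + zeta*R/delta)
b_sufficient = math.ceil(math.log2((8 / zeta) * math.log(1 + zeta * R / delta)))
assert b_exact <= b_sufficient
```

The slack between `b_exact` and `b_sufficient` comes only from the two coarse lower bounds $\log(1+\zeta) \ge \zeta/2$ and $2^{b-1}-1 > 2^{b-2}$, so it is at most a few bits.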
To be explicit, to do this we would set $\delta$ and $\zeta$ to be
$$\delta = O\left(\min\left(\frac{\varepsilon}{\sigma_1},\ \frac{\sqrt{\varepsilon\mu}}{L_1}\right)\right), \qquad \zeta = O\left(\min\left(\frac{\varepsilon}{\sigma R},\ \frac{\sqrt{\varepsilon\mu}}{LR+\sigma},\ \frac{1}{\kappa}\right)\right).$$
Then our number of bits would look like
$$b = \log_2 O\left(\max\left(\frac{\sigma R}{\varepsilon},\ \frac{LR+\sigma}{\sqrt{\varepsilon\mu}},\ \kappa\right)\cdot\log\left(1 + \frac{\zeta R}{\delta}\right)\right),$$
which shows explicitly that any $\ell_1$-dependent terms are inside the double logarithm.

Next, we consider non-convex objectives.

B.4 LP-SGD in our work for non-convex objectives

In Theorem 2, we know that
$$\mathbf{E}\left[\|\nabla f(\bar w_T)\|_2^2\right] \le \frac{2\left(f(w_0) - f(w^*)\right)}{\left(2\alpha - \alpha^2 L\right)T} + \frac{\alpha\sigma_0^2 L + L\delta\sigma_1}{2 - \alpha L}.$$
Setting the limit (as $\alpha \to 0$ and $T \to \infty$) to be $\le \varepsilon$, we need
$$\frac{L\delta\sigma_1}{2} = O(\varepsilon);$$
thus
$$\delta = O\left(\frac{\varepsilon}{L\sigma_1}\right)$$
will satisfy the requirement, and we will get
$$\frac{R}{2^{b-1}-1} = \delta = O\left(\frac{\varepsilon}{L\sigma_1}\right) \;\Rightarrow\; b = \log_2 O\left(\frac{LR\sigma_1}{\varepsilon}\right).$$

B.5 LP-SGD in our work using NLQ for non-convex objectives

In Theorem 4, we know that, if $\alpha < \frac{1}{2(\eta+1)L}$, then
$$\mathbf{E}\left[\|\nabla f(\bar w_T)\|_2^2\right] \le \frac{2\left(f(w_0) - f(w^*)\right)}{\alpha T} + \alpha\sigma_0^2 L + L\delta\sigma_1 + \frac{1}{2}(L\zeta R_0)^2.$$
Setting the limit (as $\alpha \to 0$ and $T \to \infty$) to be $\le \varepsilon$ and replacing $R_0$ with $R$, we need
$$L\delta\sigma_1 + \frac{1}{2}(L\zeta R)^2 = O(\varepsilon);$$
it suffices to have
$$L\delta\sigma_1 = O(\varepsilon), \qquad (L\zeta R)^2 = O(\varepsilon),$$
so we let
$$\delta = \frac{O(\varepsilon)}{L\sigma_1}, \qquad \zeta = \frac{\sqrt{O(\varepsilon)}}{LR};$$
then all our other requirements will be satisfied. Next, starting from (9), the number of bits we need for non-linear quantization must satisfy
$$(1+\zeta)^{2^{b-1}-1} - 1 \ge \frac{\zeta R}{\delta},$$
which happens only when
$$\left(2^{b-1} - 1\right)\log(1+\zeta) \ge \log\left(1 + \frac{\zeta R}{\delta}\right).$$
Since we know that $0 \le \zeta < 1$, it follows that $\log(1+\zeta) \ge \zeta/2$. So in order for the above to be true, it suffices to have
$$\left(2^{b-1} - 1\right)\cdot\frac{\zeta}{2} \ge \log\left(1 + \frac{\zeta R}{\delta}\right).$$
Since $2^{b-1} - 1 > 2^{b-2}$, it follows that it suffices to have
$$\frac{2^b\cdot\zeta}{8} \ge \log\left(1 + \frac{\zeta R}{\delta}\right).$$
And this will be true if
$$b = \log_2 O\left(\frac{1}{\zeta}\log\left(1 + \frac{\zeta R}{\delta}\right)\right).$$
Finally, using our assignment of $\delta$ and $\zeta$ gives us
$$b = \log_2 O\left(\frac{LR}{\sqrt{\varepsilon}}\log\left(1 + \frac{\sigma_1}{\sqrt{\varepsilon}}\right)\right).$$

C Proof for theorems

Before we prove the main theorems presented in the paper, we will prove the following lemmas that will be useful later, as well as the lemmas we presented before. The proof of Lemma 1 can be extracted from the proof of Lemma 5, which we will show later.

Proof of Lemma 2. Here we consider the positive case first; then symmetrically the negative case also holds.
First, for normal FPQ, the set of quantization points is
$$D = \{0\} \cup \left\{ s\cdot\left(1 + \frac{x}{n_m}\right)\cdot 2^y \;\middle|\; x = 0, 1, \cdots, n_m - 1,\ y = -\frac{n_e}{2} + 2, \cdots, \frac{n_e}{2} - 1 \right\}$$
(where $n_m = 2^{b_m}$ and $n_e = 2^{b_e}$), and we set the parameters for the non-linear quantization bound to be
$$\delta = s\cdot 2^{-\frac{n_e}{2}+2} = \frac{4s}{(\sqrt{2})^{n_e}}, \qquad \zeta = \frac{1}{n_m}, \qquad \eta = \frac{\zeta^2}{4(1+\zeta)} = \frac{1}{4 n_m (n_m + 1)}.$$
For any $w$ within the representable range, we can assume it is in $[q_i, q_{i+1})$; then
$$\mathbf{E}\left[(Q(w) - w)^2\right] = \frac{q_{i+1} - w}{q_{i+1} - q_i}\cdot(w - q_i)^2 + \frac{w - q_i}{q_{i+1} - q_i}\cdot(q_{i+1} - w)^2 = (w - q_i)(q_{i+1} - w).$$
So now we only need to prove that
$$\forall v \in D, \quad (w - q_i)(q_{i+1} - w) \le \delta\cdot|w - v| + \zeta\cdot|v|\cdot|w - v| + \eta\cdot|w - v|^2.$$
First, we consider the special case where $q_i = 0$. In this case, $q_{i+1} = s\cdot 1\cdot 2^{-\frac{n_e}{2}+2} = \delta$. If $v = 0$, it is obvious that
$$\mathrm{LHS} = (w - q_i)(q_{i+1} - w) = w(\delta - w) \le \delta w \le \mathrm{RHS},$$
and similarly for $v = \delta$,
$$\mathrm{LHS} = (w - q_i)(q_{i+1} - w) = w(\delta - w) \le \delta(\delta - w) \le \mathrm{RHS},$$
and for $v > \delta$,
$$\mathrm{RHS} \ge \delta(v - w) \ge \delta(\delta - w) \ge w(\delta - w) = \mathrm{LHS}.$$
Next, we consider the case where $q_i \neq 0$.
In this case, we can assume $q_i = s\cdot\left(1 + \frac{x}{n_m}\right)\cdot 2^y$, so that
$$q_{i+1} - q_i = \frac{s\cdot 2^y}{n_m} \le \frac{1}{n_m}\, q_i = \zeta q_i.$$
If $v \ge q_{i+1}$, denote $y = q_{i+1} - w$; then
$$\mathrm{LHS} = (w - q_i)(q_{i+1} - w) = y\cdot(q_{i+1} - q_i - y) \le y\cdot(\zeta q_i - y),$$
$$\mathrm{RHS} \ge \zeta\cdot q_{i+1}\cdot(q_{i+1} - w) = \zeta q_{i+1} y \ge \zeta q_i y \ge \mathrm{LHS}.$$
Secondly, if $0 \le v \le q_i$, denote $y = w - q_i$; then
$$\mathrm{LHS} = (w - q_i)(q_{i+1} - w) = y\cdot(q_{i+1} - q_i - y) \le y\cdot(\zeta q_i - y),$$
$$\mathrm{RHS} = \delta\cdot(w - v) + \zeta\cdot v\cdot(w - v) + \eta\cdot(w - v)^2 = -(\zeta - \eta)\cdot v^2 + (\zeta w - 2\eta w - \delta)\cdot v + \delta w + \eta w^2.$$
Observe that $\zeta - \eta > 0$, so the right-hand side is a concave function of $v$; thus over $[0, q_i]$ it achieves its minimum at either $v = 0$ or $v = q_i$. At $v = q_i$:
$$\mathrm{RHS} = \delta y + \zeta q_i y + \eta y^2 \ge \zeta q_i y \ge \mathrm{LHS},$$
and at $v = 0$, since $q_{i+1} \le (1+\zeta) q_i$ and $q_i \le w$,
$$\begin{aligned}
\mathrm{RHS} - \mathrm{LHS} &= \delta\cdot w + \zeta\cdot 0\cdot w + \eta\cdot w^2 - (w - q_i)(q_{i+1} - w) \\
&= (1+\eta)w^2 + (\delta - q_i - q_{i+1})w + q_i q_{i+1} \\
&\ge (1+\eta)w^2 - (q_i + q_{i+1})\cdot w + q_i q_{i+1} \\
&= (1+\eta)w^2 - \left[(2+\zeta)q_i w + \left(q_{i+1} - (1+\zeta)q_i\right)w\right] + \left[(1+\zeta)q_i^2 + \left(q_{i+1} - (1+\zeta)q_i\right)q_i\right] \\
&= (1+\eta)w^2 - (2+\zeta)q_i\cdot w + (1+\zeta)q_i^2 + \left(q_{i+1} - (1+\zeta)q_i\right)(q_i - w) \\
&\ge (1+\eta)w^2 - (2+\zeta)q_i\cdot w + (1+\zeta)q_i^2,
\end{aligned}$$
which is a parabola in $w$ with positive leading coefficient. Recall that $\eta = \frac{\zeta^2}{4(\zeta+1)} = \frac{(\zeta+2)^2}{4(\zeta+1)} - 1$; thus the discriminant is $(2+\zeta)^2 q_i^2 - 4(1+\eta)(1+\zeta)q_i^2 = 0$, and therefore $\mathrm{RHS} - \mathrm{LHS} \ge 0$.

Now we extend this conclusion to the case where $v \le 0$.
In this case,
$$\mathrm{RHS} = \delta\cdot(w - v) + \zeta\cdot(-v)\cdot(w - v) + \eta\cdot(w - v)^2.$$
Since $w$, $\zeta$, $\delta$, and $\eta$ are all positive, this is a decreasing function of $v$ on $v \le 0$; thus it achieves its minimum at $v = 0$, which is the case we have already proven.

So far, we have proven the lemma in the cases $w \ge 0, v \ge 0$ and $w \ge 0, v \le 0$; symmetrically it holds for $w \le 0, v \le 0$ and $w \le 0, v \ge 0$, which indicates that we can extend $D$ to be a set containing both positive and negative numbers.

In the denormal FPQ case, the set of quantization points is
$$D = \left\{ s\cdot\frac{x}{n_m}\cdot 2^{-\frac{n_e}{2}+3} \;\middle|\; x = 0, 1, \cdots, n_m - 1 \right\} \cup \left\{ s\cdot\left(1 + \frac{x}{n_m}\right)\cdot 2^y \;\middle|\; x = 0, 1, \cdots, n_m - 1,\ y = -\frac{n_e}{2} + 3, \cdots, \frac{n_e}{2} - 1 \right\},$$
and we set the parameters for the non-linear quantization bound to be
$$\delta = s\cdot\frac{1}{n_m}\cdot 2^{-\frac{n_e}{2}+3} = \frac{8s}{n_m(\sqrt{2})^{n_e}}, \qquad \zeta = \frac{1}{n_m}, \qquad \eta = \frac{\zeta^2}{4(1+\zeta)} = \frac{1}{4 n_m (n_m + 1)}.$$
The proof for this case follows the exact same structure as the normal FPQ case.

Lemma 3. Under the condition of linear quantization when using the low-precision representation $(\delta, b)$, for any $w, v \in \mathbb{R}^d$ where $Q_{(\delta,b)}(w) = w$,
$$\mathbf{E}\left[\left\|Q_{(\delta,b)}(w + v) - w^*\right\|_2^2\right] \le \|(w + v) - w^*\|_2^2 + \delta\|v\|_1,$$
where $Q$ is the linear quantization function.

Proof of Lemma 3.
(This proof follows the same structure as the proof of Lemma 1 in [8].) First, observe that this lemma holds if it holds in each dimension, so we only need to prove that for any $w, v \in \mathbb{R}$ where $Q_{(\delta,b)}(w) = w$, i.e. $w \in \mathrm{dom}(\delta, b)$,
$$\mathbf{E}\left[\left(Q_{(\delta,b)}(w + v) - w^*\right)^2\right] \le (w + v - w^*)^2 + \delta|v|;$$
then we can sum up over all the dimensions to get the result.

Now we consider the problem in two situations. First, if $w + v$ is within the range representable by $(\delta, b)$, then $\mathbf{E}\left[Q_{(\delta,b)}(w + v)\right] = w + v$. In this case,
$$\begin{aligned}
\mathbf{E}\left[\left(Q_{(\delta,b)}(w + v) - w^*\right)^2\right]
&= \mathbf{E}\left[\left[\left(Q_{(\delta,b)}(w + v) - (w + v)\right) + \left((w + v) - w^*\right)\right]^2\right] \\
&= \mathbf{E}\left[\left(Q_{(\delta,b)}(w + v) - (w + v)\right)^2\right] + 2\,\mathbf{E}\left[Q_{(\delta,b)}(w + v) - (w + v)\right]\left[(w + v) - w^*\right] + \left[(w + v) - w^*\right]^2 \\
&= \left[(w + v) - w^*\right]^2 + \mathbf{E}\left[\left(Q_{(\delta,b)}(w + v) - (w + v)\right)^2\right],
\end{aligned}$$
where the cross term vanishes by unbiasedness. Since $(w + v)$ is within the representable range, $\mathbf{E}\left[\left(Q_{(\delta,b)}(w + v) - (w + v)\right)^2\right]$ is equivalent to $\mathbf{E}\left[\left(Q_{(\delta,\infty)}(v) + w - (w + v)\right)^2\right]$, which equals $\mathbf{E}\left[\left(Q_{(\delta,\infty)}(v) - v\right)^2\right]$ since $Q_{(\delta,b)}(w) = w$.

Now we only need to prove that $\mathbf{E}\left[\left(Q_{(\delta,\infty)}(v) - v\right)^2\right] \le \delta|v|$. Observe that this trivially holds for $v = 0$, and is symmetric for positive and negative $v$. Without loss of generality we assume $v > 0$, and let $z$ be the rounded-down quantization of $v$, so that $z \ge 0$. Then $Q_{(\delta,\infty)}(v)$ will round to $z + \delta$ (the rounded-up quantization of $v$) with probability $\frac{v - z}{\delta}$, and it will round to $z$ with probability $\frac{z + \delta - v}{\delta}$.
This quantization is unbiased because\n\nz + \u03b4 \u2212 v\n\nvz \u2212 z2 + v\u03b4 \u2212 z\u03b4\n\nz2 + z\u03b4 \u2212 vz\n\nz =\n\n+\n\n= v.\n\n\u03b4\n\nE(cid:2)Q(\u03b4,\u221e)(w)(cid:3) =\n\nv \u2212 z\n\n\u03b4\nThus, its variance will be\n\n(z + \u03b4) +\n\nE(cid:2)(Q(\u03b4,\u221e)(v) \u2212 v)2(cid:3) =\n\n\u03b4\nv \u2212 z\n\n\u03b4\n\n(z + \u03b4 \u2212 v)2 +\n\n= (v \u2212 z)(z + \u03b4 \u2212 v)\n= \u03b4(v \u2212 z) \u2212 (v \u2212 z)2\n\u2264 \u03b4(v \u2212 z) \u2264 \u03b4v.\n\n\u03b4\nz + \u03b4 \u2212 v\n\n(cid:18) z + \u03b4 \u2212 v\n\n\u03b4\n\n\u03b4\n\n(z \u2212 v)2\nv \u2212 z\n\n+\n\n\u03b4\n\n\u03b4\n\n(cid:19)\n\ntherefore\n\nE(cid:2)(Q(\u03b4,b)(w + v) \u2212 w\u2217)2(cid:3) \u2264 (w + v \u2212 w\u2217)2 + \u03b4|v|\n\nIn the other case, when w + v is on the exterior of the representable region, the quantization function\nQ(\u03b4,b) just maps it to the nearest representable value. Since w\u2217 is in the interior of the representable\nregion, this operation will make w + v closer to w\u2217. Thus,\n\nand so it will certainly be the case that\n\n(Q(\u03b4,b)(w + v) \u2212 w\u2217)2 \u2264 (w + v \u2212 w\u2217)2,\n\nE(cid:2)(Q(\u03b4,b)(w + v) \u2212 w\u2217)2(cid:3) \u2264 (w + v \u2212 w\u2217)2 + \u03b4|v|.\n(cid:104)(cid:13)(cid:13)Q(\u03b4,b)(w + v) \u2212 w\u2217(cid:13)(cid:13)2\n\n(cid:105) \u2264 (cid:107)(w + v) \u2212 w\u2217(cid:107)2\n\n2 + \u03b4 (cid:107)v(cid:107)1 .\n\n2\n\nE\n\nNow that we\u2019ve proven the inequality for one dimension, we can sum up all d dimensions and get\n\n17\n\n\fFor completeness, we also re-state the proof of following lemma 4, which was presented as equation\n(8) in [13], and here we present the proof for this lemma used in [8].\nLemma 4. Under the standard condition of Lipschitz continuity, if i is sampled uniformly at random\nfrom {1, . . . 
, N}, then for any w,\n\n(cid:104)(cid:107)\u2207fi(w) \u2212 \u2207fi(w\u2217)(cid:107)2\n\n(cid:105) \u2264 2L (f (w) \u2212 f (w\u2217)) .\n\nE\n\n2\n\nProof of Lemma 4. For any i, de\ufb01ne\n\ngi(w) = fi(w) \u2212 fi(w\u2217) \u2212 (w \u2212 w\u2217)T\u2207fi(w\u2217).\n\nClearly, if i is sampled randomly as in the lemma statement, E [gi(w)] = f (w). But also, w\u2217 must\nbe the minimizer of gi, so for any w\ngi(w\u2217) \u2264 min\n\u2264 min\n\ngi(w \u2212 \u03b7\u2207gi(w))\ngi(w) \u2212 \u03b7 (cid:107)\u2207gi(w)(cid:107)2\n\n(cid:107)\u2207gi(w)(cid:107)2\n\n(cid:18)\n\n(cid:19)\n\n\u03b72L\n\n2\n\n\u03b7\n\n\u03b7\n\n2 +\n\n2\n\n= gi(w) \u2212 1\n2L\n\n(cid:107)\u2207gi(w)(cid:107)2\n2 .\n\nwhere the second inequality follows from the Lipschitz continuity property. Re-writing this in terms\nof fi and averaging over all the i now proves the lemma statement.\nLemma 5. Under the condition of logarithmic quantization, for any w, v \u2208 Rd where v \u2208 Dd,\n\n(cid:104)(cid:107)Q(w) \u2212 w\u2217(cid:107)2\n\n(cid:105) \u2264 (cid:107)w \u2212 w\u2217(cid:107)2\n\n2\n\nE\n\n2 + \u03b4 (cid:107)w \u2212 v(cid:107)1 + \u03b6 (cid:107)v(cid:107)2 (cid:107)w \u2212 v(cid:107)2 + \u03b7 (cid:107)w \u2212 v(cid:107)2\n\n2\n\nwhere Q is the non-linear quantization function.\n\nNote that the proof this lemma naturally extends to lemma 1, thus we omitted the proof for lemma 1\nand just present the proof for lemma 5.\n\nProof of Lemma 5. Here we only consider the positive case \ufb01rst, where\n\nD = {q0, q1,\u00b7\u00b7\u00b7 , qn\u22121}\n\nwith [0, qn\u22121] being the representable range of D. 
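The unbiased stochastic rounding at the heart of these lemmas (round $v$ down to the grid point $z$ with probability $(z+\delta-v)/\delta$, up to $z+\delta$ with probability $(v-z)/\delta$) can be sanity-checked numerically. Below is a minimal sketch, not code from the paper; the grid spacing $\delta = 0.25$, the input $v = 0.6$, and the sample count are arbitrary choices:

```python
import random

def stochastic_round(v, delta):
    """Unbiased stochastic rounding to the grid {0, delta, 2*delta, ...}:
    round up with probability proportional to proximity to the upper point."""
    z = (v // delta) * delta              # rounded-down grid point
    p_up = (v - z) / delta                # P[round up to z + delta]
    return z + delta if random.random() < p_up else z

random.seed(0)
delta, v, n = 0.25, 0.6, 200_000          # arbitrary illustrative values
samples = [stochastic_round(v, delta) for _ in range(n)]
mean = sum(samples) / n
var = sum((s - v) ** 2 for s in samples) / n

z = (v // delta) * delta
print(abs(mean - v) < 1e-2)                           # unbiased: E[Q(v)] = v
print(abs(var - (v - z) * (z + delta - v)) < 1e-3)    # variance = (v-z)(z+delta-v)
print(var <= delta * abs(v))                          # the bound E[(Q(v)-v)^2] <= delta*|v|
```

All three checks print `True` (within the Monte Carlo tolerances chosen above), matching the unbiasedness and variance computations in the proof of Lemma 3.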
As for the negative case, we will show later that it holds symmetrically.

Observe that this lemma holds if it holds for each dimension, so we only need to prove that for any $w, v \in \mathbb{R}$ where $v \in D$,
\[
\mathbb{E}\left[(Q(w) - w^*)^2\right] \le |w - w^*|^2 + \delta|w - v| + \zeta|v|\,|w - v| + \eta|w - v|^2;
\]
then we can sum up all the dimensions and use the Cauchy–Schwarz inequality to get the result.

Now we consider the problem in two situations.

First, if $w$ is outside the representable range, the quantization function $Q$ just maps it to the nearest representable value. Since $w^*$ is in the interior of the representable range, this operation will make $w$ closer to $w^*$. Thus,
\[
(Q(w) - w^*)^2 \le (w - w^*)^2,
\]
and so it will certainly be the case that
\[
\mathbb{E}\left[(Q(w) - w^*)^2\right] \le |w - w^*|^2 + \delta|w - v| + \zeta|v|\,|w - v| + \eta|w - v|^2.
\]
Second, if $w$ is within the representable range, then $\mathbb{E}[Q(w)] = w$.
In this case,
\begin{align*}
\mathbb{E}\left[(Q(w) - w^*)^2\right]
&= \mathbb{E}\left[\big((Q(w) - w) + (w - w^*)\big)^2\right] \\
&= \mathbb{E}\left[(Q(w) - w)^2\right] + 2\,\mathbb{E}\left[Q(w) - w\right](w - w^*) + (w - w^*)^2 \\
&= (w - w^*)^2 + \mathbb{E}\left[(Q(w) - w)^2\right],
\end{align*}
where the cross term vanishes since $\mathbb{E}[Q(w)] = w$. Since $w$ is within the representable range, we can assume it lies in $[q_i, q_{i+1})$; then
\[
\mathbb{E}\left[(Q(w) - w)^2\right]
= \frac{q_{i+1} - w}{q_{i+1} - q_i}(w - q_i)^2 + \frac{w - q_i}{q_{i+1} - q_i}(q_{i+1} - w)^2
= (w - q_i)(q_{i+1} - w).
\]
So now we only need to prove that
\[
(w - q_i)(q_{i+1} - w) \le \delta|w - v| + \zeta|v|\,|w - v| + \eta|w - v|^2.
\]
Note that $v \in D$, so either $v \ge q_{i+1}$ or $v \le q_i$.

Firstly, if $v \ge q_{i+1}$, denote $y = q_{i+1} - w$; then, using the grid spacing $q_{i+1} - q_i = \delta + \zeta q_i$,
\begin{align*}
\mathrm{LHS} &= (w - q_i)(q_{i+1} - w) = y\,(q_{i+1} - q_i - y) = y\,(\delta + \zeta q_i - y), \\
\mathrm{RHS} &= \delta(v - w) + \zeta v (v - w) + \eta (v - w)^2 \\
&\ge \delta(q_{i+1} - w) + \zeta q_{i+1}(q_{i+1} - w) + \eta(q_{i+1} - w)^2 \\
&= \delta y + \zeta q_{i+1} y + \eta y^2 \\
&\ge \delta y + \zeta q_i y - y^2 = \mathrm{LHS}.
\end{align*}
Secondly, if $0 \le v \le q_i$, denote $y = w - q_i$; then
\begin{align*}
\mathrm{LHS} &= (w - q_i)(q_{i+1} - w) = y\,(q_{i+1} - q_i - y) = y\,(\delta + \zeta q_i - y), \\
\mathrm{RHS} &= \delta(w - v) + \zeta v (w - v) + \eta (w - v)^2 \\
&= -(\zeta - \eta)\, v^2 + (-\delta + \zeta w - 2\eta w)\, v + \delta w + \eta w^2.
\end{align*}
Observe that $\zeta - \eta > 0$, so the right-hand side is a concave function of $v$; thus over $[0, q_i]$ it achieves its minimum at either $v = 0$ or $v = q_i$.
At $v = q_i$ (so $w - v = y$):
\[
\mathrm{RHS} = \delta y + \zeta q_i y + \eta y^2 \ge \delta y + \zeta q_i y - y^2 = \mathrm{LHS},
\]
and at $v = 0$:
\begin{align*}
\mathrm{RHS} - \mathrm{LHS} &= \delta w + \zeta \cdot 0 \cdot w + \eta w^2 - (w - q_i)(q_{i+1} - w) \\
&= (1 + \eta) w^2 + (\delta - q_i - q_{i+1})\, w + q_i q_{i+1} \\
&= (1 + \eta) w^2 - (2 + \zeta) q_i w + q_i q_{i+1} \\
&\ge (1 + \eta) w^2 - (2 + \zeta) q_i w + (1 + \zeta) q_i^2,
\end{align*}
which is an upward-opening parabola in $w$. Recall that $\eta = \frac{\zeta^2}{4(\zeta+1)} = \frac{(\zeta+2)^2}{4(\zeta+1)} - 1$; thus the discriminant is
\[
(2 + \zeta)^2 q_i^2 - 4(1 + \eta)(1 + \zeta) q_i^2 = 0,
\]
and therefore $\mathrm{RHS} - \mathrm{LHS} \ge 0$.

Now we extend this conclusion to the case where $v \le 0$. In this case,
\[
\mathrm{RHS} = \delta(w - v) + \zeta(-v)(w - v) + \eta(w - v)^2;
\]
since $w, \zeta, \delta, \eta$ are all positive, this is a decreasing function of $v$ for $v \le 0$, so it achieves its minimum at $v = 0$, which is the case we have already proven.

So far, we've proven the lemma in the cases $w \ge 0, v \ge 0$ and $w \ge 0, v \le 0$; symmetrically it holds for $w \le 0, v \le 0$ and $w \le 0, v \ge 0$. This indicates that we can extend $D$ to be a set containing both positive and negative numbers, resetting $D$ to be
\[
D = \{-q_n, \cdots, -q_1, q_0, q_1, \cdots, q_{n-1}\},
\]
where
\[
q_0 = 0, \qquad q_{i+1} - q_i = \delta + \zeta q_i.
\]
Now we have proven all the lemmas we need. Next, we make some small modifications to the assumptions (weakening them) so that our theorems are shown in a more general sense. For Assumption 2, we change it to:

Assumption 7.
All the gradients of the loss functions $f_i$ are $L_1$-Lipschitz continuous in the sense of the 1-norm to $p$-norm, that is,
\[
\forall i \in \{1, 2, \cdots, n\},\; \forall x, y, \qquad \left\|\nabla f_i(x) - \nabla f_i(y)\right\|_1 \le L_1 \left\|x - y\right\|_p.
\]
While in the body of the paper and in our experiments we choose $p = 2$ for simplicity, here we are going to prove that a generalization of Theorem 1 holds for all real numbers $p$. We also need a similar generalization of Assumption 3.

Assumption 8. The average of the loss functions $f = \frac{1}{n}\sum_i f_i$ is $\mu_1$-strongly convex near the optimal point in the sense of the $p$-norm, that is,
\[
\forall w, \qquad \frac{\mu_1}{2}\left\|w - w^*\right\|_p^2 \le f(w) - f(w^*),
\]
with $p$ being any real number.

This assumption is essentially the same as the assumption for strong convexity that we stated before, since in practice we would choose $p = 2$, in which case $\mu_1$ and $\mu$ coincide. But here we are actually presenting our result in a stronger sense, in that we can choose any real number $p$ and the proof goes through unchanged.

Now we are ready to prove the theorems. Note that the result of the following proof contains $\mu_1$ since we are proving a more general version of our theorems; substituting it with $\mu$ will lead to the same result that we stated before.

Proof of Theorem 1.
In low-precision SGD, we have
\[
u_{t+1} = w_t - \alpha \nabla \tilde{f}_t(w_t), \qquad w_{t+1} = Q(u_{t+1}).
\]
By Lemma 3, we know that
\begin{align*}
\mathbb{E}\left[\left\|w_{t+1} - w^*\right\|_2^2\right]
&= \mathbb{E}\left[\left\|Q(w_t - \alpha \nabla \tilde{f}_t(w_t)) - w^*\right\|_2^2\right] \\
&\le \mathbb{E}\left[\left\|w_t - \alpha \nabla \tilde{f}_t(w_t) - w^*\right\|_2^2\right] + \delta\,\mathbb{E}\left[\left\|\alpha \nabla \tilde{f}_t(w_t)\right\|_1\right] \\
&= \mathbb{E}\left[\left\|w_t - w^*\right\|_2^2\right] - 2\alpha\,\mathbb{E}\left[(w_t - w^*)^T \nabla \tilde{f}_t(w_t)\right] + \alpha^2\,\mathbb{E}\left[\left\|\nabla \tilde{f}_t(w_t)\right\|_2^2\right] + \alpha\delta\,\mathbb{E}\left[\left\|\nabla \tilde{f}_t(w_t)\right\|_1\right] \\
&\le \mathbb{E}\left[\left\|w_t - w^*\right\|_2^2\right] - 2\alpha\,\mathbb{E}\left[(f(w_t) - f(w^*)) + \frac{\mu}{2}\left\|w_t - w^*\right\|_2^2\right] + \alpha^2\,\mathbb{E}\left[\left\|\nabla \tilde{f}_t(w_t)\right\|_2^2\right] + \alpha\delta\,\mathbb{E}\left[\left\|\nabla \tilde{f}_t(w_t)\right\|_1\right] \\
&= (1 - \alpha\mu)\,\mathbb{E}\left[\left\|w_t - w^*\right\|_2^2\right] - 2\alpha\,\mathbb{E}\left[f(w_t) - f(w^*)\right] + \alpha^2\,\mathbb{E}\left[\left\|\nabla \tilde{f}_t(w_t)\right\|_2^2\right] + \alpha\delta\,\mathbb{E}\left[\left\|\nabla \tilde{f}_t(w_t)\right\|_1\right],
\end{align*}
where the second inequality holds due to the strong convexity assumption.
According to the assumptions we had, we have
\begin{align*}
\mathbb{E}\left[\left\|\nabla f_i(w)\right\|_2^2\right]
&= \mathbb{E}\left[\left\|\nabla f_i(w) - \nabla f_i(w^*) + \nabla f_i(w^*)\right\|_2^2\right] \\
&= \mathbb{E}\left[\left\|\nabla f_i(w) - \nabla f_i(w^*)\right\|_2^2 + 2\left(\nabla f_i(w) - \nabla f_i(w^*)\right)^T \nabla f_i(w^*) + \left\|\nabla f_i(w^*)\right\|_2^2\right] \\
&= \mathbb{E}\left[\left\|\nabla f_i(w) - \nabla f_i(w^*)\right\|_2^2\right] + \sigma^2 \\
&\le L^2 \cdot \mathbb{E}\left[\left\|w - w^*\right\|_2^2\right] + \sigma^2
\end{align*}
and
\begin{align*}
\mathbb{E}\left[\left\|\nabla f_i(w)\right\|_1\right]
&= \mathbb{E}\left[\left\|\nabla f_i(w) - \nabla f_i(w^*) + \nabla f_i(w^*)\right\|_1\right] \\
&\le \mathbb{E}\left[\left\|\nabla f_i(w) - \nabla f_i(w^*)\right\|_1 + \left\|\nabla f_i(w^*)\right\|_1\right] \\
&\le L_1 \cdot \mathbb{E}\left[\left\|w - w^*\right\|_2\right] + \sigma_1,
\end{align*}
where the last inequality holds due to Assumption 7 (the generalization of Assumption 2), taking $p = 2$.
Applying this result to the previous formula, we will have
\begin{align*}
\mathbb{E}\left[\left\|w_{t+1} - w^*\right\|_2^2\right]
&\le (1 - \alpha\mu)\,\mathbb{E}\left[\left\|w_t - w^*\right\|_2^2\right] + \alpha^2\,\mathbb{E}\left[\left\|\nabla \tilde{f}_t(w_t)\right\|_2^2\right] + \alpha\delta\,\mathbb{E}\left[\left\|\nabla \tilde{f}_t(w_t)\right\|_1\right] - 2\alpha\,\mathbb{E}\left[f(w_t) - f(w^*)\right] \\
&\le (1 - \alpha\mu + \alpha^2 L^2)\,\mathbb{E}\left[\left\|w_t - w^*\right\|_2^2\right] + \alpha\delta L_1\,\mathbb{E}\left[\left\|w_t - w^*\right\|_2\right] - 2\alpha\,\mathbb{E}\left[f(w_t) - f(w^*)\right] + \alpha^2\sigma^2 + \alpha\delta\sigma_1.
\end{align*}
Here we introduce a positive constant $C$ that we'll set later; by the basic inequality $ab \le C a^2 + \frac{b^2}{4C}$ we get
\[
\alpha\delta L_1\,\mathbb{E}\left[\left\|w_t - w^*\right\|_2\right]
\le C\,\mathbb{E}\left[\left\|w_t - w^*\right\|_2\right]^2 + \frac{\alpha^2\delta^2 L_1^2}{4C}
\le C\,\mathbb{E}\left[\left\|w_t - w^*\right\|_2^2\right] + \frac{\alpha^2\delta^2 L_1^2}{4C},
\]
thus
\[
\mathbb{E}\left[\left\|w_{t+1} - w^*\right\|_2^2\right]
\le (1 - \alpha\mu + \alpha^2 L^2 + C)\,\mathbb{E}\left[\left\|w_t - w^*\right\|_2^2\right] - 2\alpha\,\mathbb{E}\left[f(w_t) - f(w^*)\right] + \alpha^2\sigma^2 + \alpha\delta\sigma_1 + \frac{\alpha^2\delta^2 L_1^2}{4C}.
\]
On setting $C$ to be $\alpha\mu - \alpha^2 L^2$, we will have
\[
2\alpha\,\mathbb{E}\left[f(w_t) - f(w^*)\right]
\le \mathbb{E}\left[\left\|w_t - w^*\right\|_2^2\right] - \mathbb{E}\left[\left\|w_{t+1} - w^*\right\|_2^2\right] + \alpha^2\sigma^2 + \alpha\delta\sigma_1 + \frac{\alpha^2\delta^2 L_1^2}{4(\alpha\mu - \alpha^2 L^2)};
\]
since we can set $\alpha$ to be small enough that $\alpha L^2 \le \frac{\mu}{2}$, the result becomes
\[
2\alpha\,\mathbb{E}\left[f(w_t) - f(w^*)\right]
\le \mathbb{E}\left[\left\|w_t - w^*\right\|_2^2\right] - \mathbb{E}\left[\left\|w_{t+1} - w^*\right\|_2^2\right] + \alpha^2\sigma^2 + \alpha\delta\sigma_1 + \frac{\alpha\delta^2 L_1^2}{2\mu}.
\]
Now we sum up this inequality from $t = 0$ to $t = T - 1$ and divide by $2\alpha T$; then we get
\begin{align*}
\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\left[f(w_t) - f(w^*)\right]
&\le \frac{\left\|w_0 - w^*\right\|_2^2 - \mathbb{E}\left[\left\|w_T - w^*\right\|_2^2\right]}{2\alpha T} + \frac{\alpha\sigma^2 + \delta\sigma_1}{2} + \frac{\delta^2 L_1^2}{4\mu} \\
&\le \frac{\left\|w_0 - w^*\right\|_2^2}{2\alpha T} + \frac{\alpha\sigma^2 + \delta\sigma_1}{2} + \frac{\delta^2 \kappa_1^2 \mu}{4}
\end{align*}
(with $\kappa_1 = L_1/\mu$), and since we sample $\bar{w}_T$ uniformly from $(w_0, w_1, \cdots, w_{T-1})$, we get
\[
\mathbb{E}\left[f(\bar{w}_T) - f(w^*)\right] \le \frac{1}{2\alpha T}\left\|w_0 - w^*\right\|_2^2 + \frac{\alpha\sigma^2 + \delta\sigma_1}{2} + \frac{\delta^2\kappa_1^2\mu}{4}.
\]

Proof of Theorem 2.
In low-precision SGD,
\[
u_{t+1} = w_t - \alpha_t \nabla f_{i_t}(w_t), \qquad w_{t+1} = Q(u_{t+1}),
\]
thus
\begin{align*}
f(w_{t+1}) &= f(Q(u_{t+1})) = f\big(u_{t+1} + (Q(u_{t+1}) - u_{t+1})\big) \\
&= f(u_{t+1}) + (Q(u_{t+1}) - u_{t+1})^T \nabla f(u_{t+1}) + \frac{1}{2}(Q(u_{t+1}) - u_{t+1})^T \left[\nabla^2 f(\xi_t)\right] (Q(u_{t+1}) - u_{t+1}) \\
&\le f(u_{t+1}) + (Q(u_{t+1}) - u_{t+1})^T \nabla f(u_{t+1}) + \frac{L}{2}\left\|Q(u_{t+1}) - u_{t+1}\right\|_2^2.
\end{align*}
Since $\mathbb{E}[Q(u_{t+1})] = u_{t+1}$, by taking the expectation we get
\[
\mathbb{E}\left[f(w_{t+1})\right] \le \mathbb{E}\left[f(u_{t+1})\right] + 0 + \frac{L}{2}\,\mathbb{E}\left[\left\|Q(u_{t+1}) - u_{t+1}\right\|_2^2\right]
= \mathbb{E}\left[f(w_t - \alpha_t \nabla f_{i_t}(w_t))\right] + \frac{L}{2}\,\mathbb{E}\left[\left\|Q(u_{t+1}) - u_{t+1}\right\|_2^2\right],
\]
among which
\begin{align*}
f(w_t - \alpha_t \nabla f_{i_t}(w_t))
&= f(w_t) - \alpha_t \nabla f(w_t)^T \nabla f_{i_t}(w_t) + \frac{\alpha_t^2}{2} \nabla f_{i_t}(w_t)^T \left[\nabla^2 f(\xi'_t)\right] \nabla f_{i_t}(w_t) \\
&\le f(w_t) - \alpha_t \nabla f(w_t)^T \nabla f_{i_t}(w_t) + \frac{\alpha_t^2 L}{2}\left\|\nabla f_{i_t}(w_t)\right\|_2^2 \\
&= f(w_t) - \alpha_t \nabla f(w_t)^T \nabla f_{i_t}(w_t) + \frac{\alpha_t^2 L}{2}\left\|\nabla f(w_t) + (\nabla f_{i_t}(w_t) - \nabla f(w_t))\right\|_2^2.
\end{align*}
Since $\mathbb{E}[f_{i_t}] = f$, by taking the expectation of the previous terms we have
\begin{align*}
\mathbb{E}\left[f(w_t - \alpha_t \nabla f_{i_t}(w_t))\right]
&= \mathbb{E}\left[f(w_t)\right] - \alpha_t\,\mathbb{E}\left[\left\|\nabla f(w_t)\right\|_2^2\right] + \frac{\alpha_t^2 L}{2}\,\mathbb{E}\left[\left\|\nabla f(w_t)\right\|_2^2 + \left\|\nabla f_{i_t}(w_t) - \nabla f(w_t)\right\|_2^2\right] \\
&= \mathbb{E}\left[f(w_t)\right] - \left(\alpha_t - \frac{\alpha_t^2 L}{2}\right)\mathbb{E}\left[\left\|\nabla f(w_t)\right\|_2^2\right] + \frac{\alpha_t^2 L}{2}\,\mathbb{E}\left[\left\|\nabla f_{i_t}(w_t) - \nabla f(w_t)\right\|_2^2\right].
\end{align*}
Meanwhile, according to the previous lemma and Assumption 4, we know that
\[
\frac{1}{2} L\,\mathbb{E}\left[\left\|Q(u_{t+1}) - u_{t+1}\right\|_2^2\right]
\le \frac{1}{2} L\delta\,\mathbb{E}\left[\left\|u_{t+1} - w_t\right\|_1\right]
= \frac{1}{2} L\delta\,\mathbb{E}\left[\left\|\alpha_t \nabla f_{i_t}(w_t)\right\|_1\right]
\le \frac{1}{2}\alpha_t L\delta\sigma_1.
\]
Now, according to the assumption on the variance of $\nabla f_{i_t}(w_t)$, substituting these results into the previous expression gives
\begin{align*}
\mathbb{E}\left[f(w_{t+1})\right]
&\le \mathbb{E}\left[f(w_t - \alpha_t \nabla f_{i_t}(w_t))\right] + \frac{1}{2}\alpha_t L\delta\sigma_1 \\
&\le \mathbb{E}\left[f(w_t)\right] - \left(\alpha_t - \frac{\alpha_t^2 L}{2}\right)\mathbb{E}\left[\left\|\nabla f(w_t)\right\|_2^2\right] + \frac{\alpha_t^2 \sigma_0^2 L + \alpha_t L\delta\sigma_1}{2}.
\end{align*}
Then we can sum the result from $t = 0$ to $T - 1$, and we have
\[
\sum_{t=0}^{T-1} \left(\alpha_t - \frac{\alpha_t^2 L}{2}\right)\mathbb{E}\left[\left\|\nabla f(w_t)\right\|_2^2\right]
\le f(w_0) - \mathbb{E}\left[f(w_T)\right] + \sum_{t=0}^{T-1} \frac{\alpha_t^2 \sigma_0^2 L + \alpha_t L\delta\sigma_1}{2}.
\]
If we set $\alpha_t = \alpha$ and select $\bar{w}_T$ uniformly at random from $\{w_0, w_1, \dots, w_{T-1}\}$, we get
\begin{align*}
\mathbb{E}\left[\left\|\nabla f(\bar{w}_T)\right\|_2^2\right]
= \frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\left[\left\|\nabla f(w_t)\right\|_2^2\right]
&\le \frac{2}{2\alpha - \alpha^2 L}\left\{\frac{f(w_0) - \mathbb{E}\left[f(w_T)\right]}{T} + \frac{\alpha^2 \sigma_0^2 L + \alpha L\delta\sigma_1}{2}\right\} \\
&\le \frac{2}{2\alpha - \alpha^2 L}\cdot\frac{f(w_0) - f^*}{T} + \frac{\alpha^2\sigma_0^2 L + \alpha L\delta\sigma_1}{2\alpha - \alpha^2 L} \\
&= \frac{2}{2\alpha - \alpha^2 L}\cdot\frac{f(w_0) - f^*}{T} + \frac{\alpha\sigma_0^2 L + L\delta\sigma_1}{2 - \alpha L}.
\end{align*}

Proof of Theorem 3. In low-precision SGD, we have
\[
u_{t+1} = w_t - \alpha \nabla \tilde{f}_t(w_t), \qquad w_{t+1} = Q(u_{t+1}).
\]
By Lemma 5 (with $v = w_t$), we know that
\begin{align*}
\mathbb{E}\left[\left\|w_{t+1} - w^*\right\|_2^2\right]
&= \mathbb{E}\left[\left\|Q(w_t - \alpha \nabla \tilde{f}_t(w_t)) - w^*\right\|_2^2\right] \\
&\le \mathbb{E}\left[\left\|w_t - \alpha \nabla \tilde{f}_t(w_t) - w^*\right\|_2^2\right] + \delta\,\mathbb{E}\left[\left\|\alpha \nabla \tilde{f}_t(w_t)\right\|_1\right] + \zeta\,\mathbb{E}\left[\left\|w_t\right\|_2 \left\|\alpha \nabla \tilde{f}_t(w_t)\right\|_2\right] + \eta\,\mathbb{E}\left[\left\|\alpha \nabla \tilde{f}_t(w_t)\right\|_2^2\right] \\
&\le \mathbb{E}\left[\left\|w_t - \alpha \nabla \tilde{f}_t(w_t) - w^*\right\|_2^2\right] + \alpha\delta\,\mathbb{E}\left[\left\|\nabla \tilde{f}_t(w_t)\right\|_1\right] + \alpha\zeta\,\mathbb{E}\left[\left(\left\|w_t - w^*\right\|_2 + \left\|w^*\right\|_2\right)\left\|\nabla \tilde{f}_t(w_t)\right\|_2\right] + \eta\alpha^2\,\mathbb{E}\left[\left\|\nabla \tilde{f}_t(w_t)\right\|_2^2\right] \\
&= \mathbb{E}\left[\left\|w_t - w^*\right\|_2^2\right] - 2\alpha\,\mathbb{E}\left[(w_t - w^*)^T \nabla \tilde{f}_t(w_t)\right] + (1 + \eta)\alpha^2\,\mathbb{E}\left[\left\|\nabla \tilde{f}_t(w_t)\right\|_2^2\right] \\
&\qquad + \alpha\delta\,\mathbb{E}\left[\left\|\nabla \tilde{f}_t(w_t)\right\|_1\right] + \alpha\zeta\,\mathbb{E}\left[\left(\left\|w_t - w^*\right\|_2 + \left\|w^*\right\|_2\right)\left\|\nabla \tilde{f}_t(w_t)\right\|_2\right] \\
&\le \mathbb{E}\left[\left\|w_t - w^*\right\|_2^2\right] - 2\alpha\,\mathbb{E}\left[(f(w_t) - f(w^*)) + \frac{\mu}{2}\left\|w_t - w^*\right\|_2^2\right] + (1 + \eta)\alpha^2\,\mathbb{E}\left[\left\|\nabla \tilde{f}_t(w_t)\right\|_2^2\right] \\
&\qquad + \alpha\delta\,\mathbb{E}\left[\left\|\nabla \tilde{f}_t(w_t)\right\|_1\right] + \alpha\zeta\,\mathbb{E}\left[\left(\left\|w_t - w^*\right\|_2 + \left\|w^*\right\|_2\right)\left\|\nabla \tilde{f}_t(w_t)\right\|_2\right] \\
&= (1 - \alpha\mu)\,\mathbb{E}\left[\left\|w_t - w^*\right\|_2^2\right] - 2\alpha\,\mathbb{E}\left[f(w_t) - f(w^*)\right] + (1 + \eta)\alpha^2\,\mathbb{E}\left[\left\|\nabla \tilde{f}_t(w_t)\right\|_2^2\right] \\
&\qquad + \alpha\delta\,\mathbb{E}\left[\left\|\nabla \tilde{f}_t(w_t)\right\|_1\right] + \alpha\zeta\,\mathbb{E}\left[\left(\left\|w_t - w^*\right\|_2 + \left\|w^*\right\|_2\right)\left\|\nabla \tilde{f}_t(w_t)\right\|_2\right],
\end{align*}
where the second inequality uses the triangle inequality $\left\|w_t\right\|_2 \le \left\|w_t - w^*\right\|_2 + \left\|w^*\right\|_2$ and the third inequality holds due to the strong convexity assumption.
According to the assumptions we had, we have
\begin{align*}
\mathbb{E}\left[\left\|\nabla f_i(w)\right\|_2^2\right]
&= \mathbb{E}\left[\left\|\nabla f_i(w) - \nabla f_i(w^*)\right\|_2^2 + 2\left(\nabla f_i(w) - \nabla f_i(w^*)\right)^T \nabla f_i(w^*) + \left\|\nabla f_i(w^*)\right\|_2^2\right] \\
&= \mathbb{E}\left[\left\|\nabla f_i(w) - \nabla f_i(w^*)\right\|_2^2\right] + \sigma^2
\le L^2 \cdot \mathbb{E}\left[\left\|w - w^*\right\|_2^2\right] + \sigma^2,
\end{align*}
\begin{align*}
\left\|\nabla f_i(w)\right\|_2 &= \left\|\nabla f_i(w) - \nabla f_i(w^*) + \nabla f_i(w^*)\right\|_2
\le \left\|\nabla f_i(w) - \nabla f_i(w^*)\right\|_2 + \left\|\nabla f_i(w^*)\right\|_2
\le L \cdot \left\|w - w^*\right\|_2 + \sigma,
\end{align*}
and
\begin{align*}
\left\|\nabla f_i(w)\right\|_1 &= \left\|\nabla f_i(w) - \nabla f_i(w^*) + \nabla f_i(w^*)\right\|_1
\le \left\|\nabla f_i(w) - \nabla f_i(w^*)\right\|_1 + \left\|\nabla f_i(w^*)\right\|_1
\le L_1 \cdot \left\|w - w^*\right\|_2 + \sigma_1,
\end{align*}
where the last inequality holds due to Assumption 7 (the generalization of Assumption 2), taking $p = 2$.
Applying these results to the previous formula and denoting $\eta' = 1 + \eta$, we will have
\begin{align*}
\mathbb{E}\left[\left\|w_{t+1} - w^*\right\|_2^2\right]
&\le (1 - \alpha\mu)\,\mathbb{E}\left[\left\|w_t - w^*\right\|_2^2\right] + \eta'\alpha^2\,\mathbb{E}\left[\left\|\nabla \tilde{f}_t(w_t)\right\|_2^2\right] + \alpha\delta\,\mathbb{E}\left[\left\|\nabla \tilde{f}_t(w_t)\right\|_1\right] \\
&\qquad + \alpha\zeta\,\mathbb{E}\left[\left(\left\|w_t - w^*\right\|_2 + \left\|w^*\right\|_2\right)\left\|\nabla \tilde{f}_t(w_t)\right\|_2\right] - 2\alpha\,\mathbb{E}\left[f(w_t) - f(w^*)\right] \\
&\le (1 - \alpha\mu + \eta'\alpha^2 L^2)\,\mathbb{E}\left[\left\|w_t - w^*\right\|_2^2\right] + \alpha\delta L_1\,\mathbb{E}\left[\left\|w_t - w^*\right\|_2\right] + \eta'\alpha^2\sigma^2 + \alpha\delta\sigma_1 \\
&\qquad + \alpha\zeta\,\mathbb{E}\left[\left(\left\|w_t - w^*\right\|_2 + \left\|w^*\right\|_2\right)\left(L\left\|w_t - w^*\right\|_2 + \sigma\right)\right] - 2\alpha\,\mathbb{E}\left[f(w_t) - f(w^*)\right] \\
&= (1 - \alpha\mu + \alpha\zeta L + \eta'\alpha^2 L^2)\,\mathbb{E}\left[\left\|w_t - w^*\right\|_2^2\right] + \left(\alpha\delta L_1 + \alpha\zeta L\left\|w^*\right\|_2 + \alpha\zeta\sigma\right)\mathbb{E}\left[\left\|w_t - w^*\right\|_2\right] \\
&\qquad + \eta'\alpha^2\sigma^2 + \alpha\delta\sigma_1 + \alpha\zeta\sigma\left\|w^*\right\|_2 - 2\alpha\,\mathbb{E}\left[f(w_t) - f(w^*)\right].
\end{align*}
Here we introduce a positive constant $C$ that we'll set later; by the basic inequality we get
\begin{align*}
\left(\alpha\delta L_1 + \alpha\zeta L\left\|w^*\right\|_2 + \alpha\zeta\sigma\right)\mathbb{E}\left[\left\|w_t - w^*\right\|_2\right]
&\le C\,\mathbb{E}\left[\left\|w_t - w^*\right\|_2\right]^2 + \frac{\left(\alpha\delta L_1 + \alpha\zeta L\left\|w^*\right\|_2 + \alpha\zeta\sigma\right)^2}{4C} \\
&\le C\,\mathbb{E}\left[\left\|w_t - w^*\right\|_2^2\right] + \frac{\alpha^2\left(\delta L_1 + \zeta L\left\|w^*\right\|_2 + \zeta\sigma\right)^2}{4C},
\end{align*}
thus
\begin{align*}
\mathbb{E}\left[\left\|w_{t+1} - w^*\right\|_2^2\right]
&\le (1 - \alpha\mu + \alpha\zeta L + \eta'\alpha^2 L^2 + C)\,\mathbb{E}\left[\left\|w_t - w^*\right\|_2^2\right] - 2\alpha\,\mathbb{E}\left[f(w_t) - f(w^*)\right] \\
&\qquad + \eta'\alpha^2\sigma^2 + \alpha\delta\sigma_1 + \alpha\zeta\sigma\left\|w^*\right\|_2 + \frac{\alpha^2\left(\delta L_1 + \zeta L\left\|w^*\right\|_2 + \zeta\sigma\right)^2}{4C}.
\end{align*}
On setting $C$ to be $\alpha\mu - \alpha\zeta L - \eta'\alpha^2 L^2$, we will have
\[
2\alpha\,\mathbb{E}\left[f(w_t) - f(w^*)\right]
\le \mathbb{E}\left[\left\|w_t - w^*\right\|_2^2\right] - \mathbb{E}\left[\left\|w_{t+1} - w^*\right\|_2^2\right] + \eta'\alpha^2\sigma^2 + \alpha\delta\sigma_1 + \alpha\zeta\sigma\left\|w^*\right\|_2 + \frac{\alpha^2\left(\delta L_1 + \zeta L\left\|w^*\right\|_2 + \zeta\sigma\right)^2}{4(\alpha\mu - \alpha\zeta L - \eta'\alpha^2 L^2)};
\]
since we can set $\alpha$ to be small enough that $\alpha\mu - \alpha\zeta L - \eta'\alpha^2 L^2 \ge \frac{1}{2}\alpha\mu$, the result becomes
\[
2\alpha\,\mathbb{E}\left[f(w_t) - f(w^*)\right]
\le \mathbb{E}\left[\left\|w_t - w^*\right\|_2^2\right] - \mathbb{E}\left[\left\|w_{t+1} - w^*\right\|_2^2\right] + \eta'\alpha^2\sigma^2 + \alpha\delta\sigma_1 + \alpha\zeta\sigma\left\|w^*\right\|_2 + \frac{\alpha\left(\delta L_1 + \zeta L\left\|w^*\right\|_2 + \zeta\sigma\right)^2}{2\mu}.
\]
Now we sum up this inequality from $t = 0$ to $t = T - 1$ and divide by $2\alpha T$; then we get
\begin{align*}
\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\left[f(w_t) - f(w^*)\right]
&\le \frac{\left\|w_0 - w^*\right\|_2^2 - \mathbb{E}\left[\left\|w_T - w^*\right\|_2^2\right]}{2\alpha T} + \frac{\eta'\alpha\sigma^2 + \delta\sigma_1 + \zeta\sigma\left\|w^*\right\|_2}{2} + \frac{\left(\delta L_1 + \zeta L\left\|w^*\right\|_2 + \zeta\sigma\right)^2}{4\mu} \\
&\le \frac{\left\|w_0 - w^*\right\|_2^2}{2\alpha T} + \frac{\eta'\alpha\sigma^2 + \delta\sigma_1 + \zeta\sigma\left\|w^*\right\|_2}{2} + \frac{\left(\delta L_1 + \zeta L\left\|w^*\right\|_2 + \zeta\sigma\right)^2}{4\mu},
\end{align*}
and since we sample $\bar{w}_T$ uniformly from $(w_0, w_1, \cdots, w_{T-1})$, we get
\[
\mathbb{E}\left[f(\bar{w}_T) - f(w^*)\right]
\le \frac{1}{2\alpha T}\left\|w_0 - w^*\right\|_2^2 + \frac{\eta'\alpha\sigma^2 + \delta\sigma_1 + \zeta\sigma\left\|w^*\right\|_2}{2} + \frac{\left(\delta L_1 + \zeta L\left\|w^*\right\|_2 + \zeta\sigma\right)^2}{4\mu}.
\]

Proof of Theorem 4. In the case of non-linear quantization,
\begin{align*}
\mathbb{E}\left[\left\|Q(u_{t+1}) - u_{t+1}\right\|_2^2\right]
&\le \delta\,\mathbb{E}\left[\left\|u_{t+1} - w_t\right\|_1\right] + \zeta\,\mathbb{E}\left[\left\|w_t\right\|_2 \left\|u_{t+1} - w_t\right\|_2\right] + \eta\,\mathbb{E}\left[\left\|u_{t+1} - w_t\right\|_2^2\right] \\
&\le \delta\,\mathbb{E}\left[\left\|\alpha_t \nabla f_{i_t}(w_t)\right\|_1\right] + \zeta\,\mathbb{E}\left[R_0 \left\|\alpha_t \nabla f_{i_t}(w_t)\right\|_2\right] + \eta\,\mathbb{E}\left[\left\|\alpha_t \nabla f_{i_t}(w_t)\right\|_2^2\right] \\
&\le \alpha_t\delta\sigma_1 + \alpha_t\zeta R_0\,\mathbb{E}\left[\left\|\nabla f_{i_t}(w_t)\right\|_2\right] + \alpha_t^2\eta\,\mathbb{E}\left[\left\|\nabla f_{i_t}(w_t)\right\|_2^2\right].
\end{align*}
Similar to the analysis in the convex case, we introduce a positive constant $C$ that can be decided later, and we have
\[
\alpha_t\zeta R_0 \left\|\nabla f_{i_t}(w_t)\right\|_2
\le C\alpha_t^2 \left\|\nabla f_{i_t}(w_t)\right\|_2^2 + \frac{(\zeta R_0)^2}{4C},
\]
thus
\[
\mathbb{E}\left[\left\|Q(u_{t+1}) - u_{t+1}\right\|_2^2\right]
\le \alpha_t^2(\eta + C)\,\mathbb{E}\left[\left\|\nabla f_{i_t}(w_t)\right\|_2^2\right] + \alpha_t\delta\sigma_1 + \frac{(\zeta R_0)^2}{4C}.
\]
Substituting into the previous results and denoting $\eta' = \eta + 1$, we have
\begin{align*}
\mathbb{E}\left[f(w_{t+1})\right]
&\le \mathbb{E}\left[f(w_t)\right] - \left(\alpha_t - \frac{\alpha_t^2 L}{2}\right)\mathbb{E}\left[\left\|\nabla f(w_t)\right\|_2^2\right] + \frac{\alpha_t^2\sigma_0^2 L}{2} + \frac{1}{2} L\,\mathbb{E}\left[\left\|Q(u_{t+1}) - u_{t+1}\right\|_2^2\right] \\
&\le \mathbb{E}\left[f(w_t)\right] - \left(\alpha_t - \frac{(\eta' + C)\alpha_t^2 L}{2}\right)\mathbb{E}\left[\left\|\nabla f(w_t)\right\|_2^2\right] + \frac{\alpha_t^2\sigma_0^2 L + \alpha_t L\delta\sigma_1}{2} + \frac{L(\zeta R_0)^2}{8C}.
\end{align*}
Next, we sum the result from $t = 0$ to $T - 1$, set $\alpha_t = \alpha$, and select $\bar{w}_T$ uniformly at random from $\{w_0, w_1, \dots, w_{T-1}\}$; then we get
\begin{align*}
\mathbb{E}\left[\left\|\nabla f(\bar{w}_T)\right\|_2^2\right]
= \frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\left[\left\|\nabla f(w_t)\right\|_2^2\right]
&\le \frac{2}{2\alpha - (\eta' + C)\alpha^2 L}\left\{\frac{f(w_0) - \mathbb{E}\left[f(w_T)\right]}{T} + \frac{\alpha^2\sigma_0^2 L + \alpha L\delta\sigma_1}{2} + \frac{L(\zeta R_0)^2}{8C}\right\} \\
&\le \frac{2}{2\alpha - (\eta' + C)\alpha^2 L}\cdot\frac{f(w_0) - f^*}{T} + \frac{\alpha^2\sigma_0^2 L + \alpha L\delta\sigma_1}{2\alpha - (\eta' + C)\alpha^2 L} + \frac{L(\zeta R_0)^2}{4C(2\alpha - (\eta' + C)\alpha^2 L)} \\
&= \frac{2}{2\alpha - (\eta' + C)\alpha^2 L}\cdot\frac{f(w_0) - f^*}{T} + \frac{\alpha\sigma_0^2 L + L\delta\sigma_1}{2 - (\eta' + C)\alpha L} + \frac{L(\zeta R_0)^2}{4C(2\alpha - (\eta' + C)\alpha^2 L)}.
\end{align*}
One possible setting of $C$ is $C = \frac{1}{\alpha L} - \eta'$; then we get
\[
\mathbb{E}\left[\left\|\nabla f(\bar{w}_T)\right\|_2^2\right]
\le \frac{f(w_0) - f^*}{\frac{1}{2}\alpha T} + \alpha\sigma_0^2 L + L\delta\sigma_1 + \frac{L(\zeta R_0)^2}{4\alpha\,\frac{1 - \alpha\eta' L}{\alpha L}}
= \frac{f(w_0) - f^*}{\frac{1}{2}\alpha T} + \alpha\sigma_0^2 L + L\delta\sigma_1 + \frac{L^2(\zeta R_0)^2}{4(1 - (\eta + 1)\alpha L)},
\]
and if we set $\alpha$ small enough that $\alpha < \frac{1}{2(\eta + 1)L}$, then we get
\[
\mathbb{E}\left[\left\|\nabla f(\bar{w}_T)\right\|_2^2\right]
\le \frac{2(f(w_0) - f^*)}{\alpha T} + \alpha\sigma_0^2 L + L\delta\sigma_1 + \frac{1}{2}(L\zeta R_0)^2.
\]
This is the result we stated in the theorem.

Next we justify the choice of the constant $C$ made in the proof. Alternatively, consider the optimal setting of $C$, the one that minimizes the size of the noise ball (as $T$ approaches infinity):
\begin{align*}
C &= \operatorname*{argmin}_C \left\{\frac{\alpha\sigma_0^2 L + L\delta\sigma_1}{2 - (\eta' + C)\alpha L} + \frac{L(\zeta R_0)^2}{4C(2\alpha - (\eta' + C)\alpha^2 L)}\right\} \\
&= \operatorname*{argmin}_C \left\{\frac{\alpha\sigma_0^2 L + L\delta\sigma_1}{(2 - \eta'\alpha L) - \alpha L C} + \frac{L(\zeta R_0)^2 / 4\alpha}{C\left[(2 - \eta'\alpha L) - \alpha L C\right]}\right\}.
\end{align*}
Writing $x$ for $C$, denote
\[
S = \frac{C_1}{A - Bx} + \frac{C_2}{x(A - Bx)},
\]
where
\[
A = 2 - \eta'\alpha L, \qquad B = \alpha L, \qquad C_1 = \alpha\sigma_0^2 L + \delta\sigma_1 L, \qquad C_2 = L(\zeta R_0)^2 / 4\alpha.
\]
Then the optimal setting of $C$ can be solved from
\[
\frac{\partial S}{\partial x} = \frac{B C_1}{(A - Bx)^2} + \frac{C_2(2Bx - A)}{x^2(A - Bx)^2} = 0.
\]
The solution to this equation is
\[
x = \frac{-B C_2 + \sqrt{(B C_2)^2 + A B C_1 C_2}}{B C_1}
= \frac{C_2}{C_1}\left(\sqrt{1 + \frac{A C_1}{B C_2}} - 1\right)
\approx \frac{C_2}{C_1}\cdot\frac{A C_1}{2 B C_2}
= \frac{A}{2B}
= \frac{1}{\alpha L} - \frac{\eta'}{2}.
\]
The approximation in the last step is based on the fact that the term inside the square root,
\[
\frac{A C_1}{B C_2} = \frac{4(2 - \eta'\alpha L)(\alpha\sigma_0^2 + \delta\sigma_1)}{L(\zeta R_0)^2},
\]
is a small number. Meanwhile, since $\alpha$ is small, $\frac{1}{\alpha L}$ is large, and $\eta' = 1 + \eta$ is small, we can see that our choice of $C = \frac{1}{\alpha L} - \eta'$ is close enough to the optimal setting $\frac{1}{\alpha L} - \frac{\eta'}{2}$, and is thus a reasonable setting.

Proof of Theorem 5. In the normal FPQ case, the set of quantization points is
\[
D = \{0\} \cup \left\{ s\left(1 + \frac{x}{n_m}\right) 2^y \;\middle|\; x = 0, 1, \cdots, n_m - 1;\; y = -\frac{n_e}{2} + 2, \cdots, \frac{n_e}{2} - 1 \right\};
\]
then the parameters for the non-linear quantization bound can be computed as
\[
\delta = s \cdot 2^{-\frac{n_e}{2} + 2} = \frac{4s}{\left(\sqrt{2}\right)^{n_e}}, \qquad
\zeta = \frac{1}{n_m}, \qquad
\eta = \frac{\zeta^2}{4(1 + \zeta)} = \frac{1}{4 n_m (n_m + 1)}.
\]
For NLQ-SGD, the noise ball size according to Theorem 3 is
\[
\frac{\delta\sigma_1 + \zeta\sigma\left\|w^*\right\|_2}{2} + \frac{\left(\delta L_1 + \zeta L\left\|w^*\right\|_2 + \zeta\sigma\right)^2}{4\mu}.
\]
Denote this as $\frac{1}{2}A + \frac{1}{4\mu}B^2$.
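The parameters just computed can be checked mechanically against the grid itself. Below is a minimal sketch (not code from the paper; $s$, $n_m$, $n_e$ are arbitrary illustrative values) that builds the positive part of the normal FPQ grid and verifies that every gap obeys the non-linear quantization spacing $q_{i+1} - q_i \le \delta + \zeta q_i$ with $\delta = 4s/(\sqrt{2})^{n_e}$ and $\zeta = 1/n_m$:

```python
import math

def fpq_points(s, nm, ne):
    """Positive quantization points of the normal FPQ grid:
    {0} together with s * (1 + x/nm) * 2**y for the stated ranges of x and y."""
    pts = {0.0}
    for y in range(-ne // 2 + 2, ne // 2):
        for x in range(nm):
            pts.add(s * (1 + x / nm) * 2.0 ** y)
    return sorted(pts)

s, nm, ne = 1.0, 4, 8                       # arbitrary illustrative sizes
delta = 4 * s / math.sqrt(2) ** ne          # = s * 2**(-ne/2 + 2)
zeta = 1 / nm
eta = zeta ** 2 / (4 * (1 + zeta))          # = 1 / (4 * nm * (nm + 1))

pts = fpq_points(s, nm, ne)
# every gap obeys q_{i+1} - q_i <= delta + zeta * q_i (up to float round-off)
ok = all(b - a <= delta + zeta * a + 1e-12 for a, b in zip(pts, pts[1:]))
print(ok, round(delta, 6), round(eta, 6))   # True 0.25 0.0125
```

The uniform gap $\delta$ appears between $0$ and the smallest normal point, while within each binade the gap $s\,2^y/n_m$ is covered by the relative term $\zeta q_i$.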
When $b$ is large, $\delta, \zeta, \eta$ are small, and the dominating term of the noise ball is
\[
A = \delta \sigma_1 + \zeta \sigma \|w^*\|_2 = \frac{4 s \sigma_1}{\left(\sqrt{2}\right)^{n_e}} + \sigma \|w^*\|_2 \cdot \frac{n_e}{C}
\]
(recall $n_e = 2^{b_e}$, $n_m = 2^{b_m}$, $C = 2^b = n_e n_m$, so $\zeta = 1/n_m = n_e/C$). Setting the derivative with respect to $n_e$ to zero,
\[
\frac{\partial A}{\partial n_e} = -\frac{2 (\ln 2) s \sigma_1}{\left(\sqrt{2}\right)^{n_e}} + \frac{\sigma \|w^*\|_2}{C} = 0, \qquad \left(\sqrt{2}\right)^{n_e} = \frac{2 (\ln 2) s \sigma_1 C}{\sigma \|w^*\|_2},
\]
we get:
\[
n_e = 2b + 2 \log_2 \left( \frac{2 (\ln 2) s \sigma_1}{\sigma \|w^*\|_2} \right), \qquad b_e = \log_2 \left[ 2b + 2 \log_2 \left( \frac{2 (\ln 2) s \sigma_1}{\sigma \|w^*\|_2} \right) \right]
\]
And when $b$ is small, $\delta, \zeta, \eta$ are large and the dominating term of the noise ball is
\[
B = \delta L_1 + \zeta L \|w^*\|_2 + \zeta \sigma = \frac{4 s L_1}{\left(\sqrt{2}\right)^{n_e}} + (L \|w^*\|_2 + \sigma) \cdot \frac{n_e}{C}
\]
Setting the derivative with respect to $n_e$ to zero,
\[
\frac{\partial B}{\partial n_e} = -\frac{2 (\ln 2) s L_1}{\left(\sqrt{2}\right)^{n_e}} + \frac{L \|w^*\|_2 + \sigma}{C} = 0, \qquad \left(\sqrt{2}\right)^{n_e} = \frac{2 (\ln 2) s L_1 C}{L \|w^*\|_2 + \sigma},
\]
we get:
\[
n_e = 2b + 2 \log_2 \left( \frac{2 (\ln 2) s L_1}{L \|w^*\|_2 + \sigma} \right), \qquad b_e = \log_2 \left[ 2b + 2 \log_2 \left( \frac{2 (\ln 2) s L_1}{L \|w^*\|_2 + \sigma} \right) \right]
\]
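The closed-form setting for the large-$b$ case can be verified numerically by grid-searching $A(n_e)$ directly; `s`, `sigma1`, and `sigma_w` below are hypothetical values chosen only for illustration:

```python
import math

s, sigma1 = 1.0, 2.0
sigma_w = 0.5        # stands in for sigma * ||w*||_2
b = 16
C = 2.0 ** b

def A(ne):
    # Dominating noise-ball term for large b (normal FPQ).
    return 4 * s * sigma1 / (2 ** 0.5) ** ne + sigma_w * ne / C

# Closed-form optimum: ne = 2*log2(2*ln(2)*s*sigma1*C / (sigma*||w*||_2)).
ne_star = 2 * math.log2(2 * math.log(2) * s * sigma1 * C / sigma_w)

grid = [i / 100 for i in range(1, 6400)]
ne_grid = min(grid, key=A)
assert abs(ne_star - ne_grid) < 0.02  # grid search matches the closed form
print("optimal n_e:", round(ne_star, 2))
```

Since $A(n_e)$ is convex (a decaying exponential plus a linear term), the stationary point found by the derivative condition is indeed the global minimizer.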
For $b$ such that neither term dominates, the noise ball size is
\[
\frac{1}{2} A + \frac{1}{4\mu} B^2 = \frac{\delta \sigma_1 + \zeta \sigma \|w^*\|_2}{2} + \frac{(\delta L_1 + \zeta L \|w^*\|_2 + \zeta \sigma)^2}{4\mu},
\]
and its derivative with respect to $n_e$ is
\[
\frac{\partial}{\partial n_e} \left( \frac{1}{2} A + \frac{1}{4\mu} B^2 \right) = \frac{1}{2} \frac{\partial A}{\partial n_e} + \frac{B}{2\mu} \frac{\partial B}{\partial n_e}.
\]
Both $\partial A / \partial n_e$ and $\partial B / \partial n_e$ are increasing functions, and we know that
\[
\left. \frac{\partial A}{\partial n_e} \right|_{n_e = 2 \log_2 \left( \frac{2 (\ln 2) s \sigma_1 C}{\sigma \|w^*\|_2} \right)} = 0, \qquad \left. \frac{\partial B}{\partial n_e} \right|_{n_e = 2 \log_2 \left( \frac{2 (\ln 2) s L_1 C}{L \|w^*\|_2 + \sigma} \right)} = 0,
\]
so the solution of $\frac{\partial}{\partial n_e} \left( \frac{1}{2} A + \frac{1}{4\mu} B^2 \right) = 0$ lies in the interval between $2 \log_2 \left( \frac{2 (\ln 2) s \sigma_1 C}{\sigma \|w^*\|_2} \right)$ and $2 \log_2 \left( \frac{2 (\ln 2) s L_1 C}{L \|w^*\|_2 + \sigma} \right)$.

Proof of Theorem 6. In the denormal FPQ case, the set of quantization points is:
\[
D = \left\{ s \cdot \frac{x}{n_m} \cdot 2^{-\frac{n_e}{2} + 3} \;\middle|\; x = 0, 1, \cdots, n_m - 1 \right\} \cup \left\{ s \cdot \left(1 + \frac{x}{n_m}\right) \cdot 2^y \;\middle|\; x = 0, 1, \cdots, n_m - 1, \; y = -\frac{n_e}{2} + 3, \cdots, \frac{n_e}{2} - 1 \right\}
\]
Then the parameters for the nonlinear quantization bound are:
\[
\delta = s \cdot \frac{1}{n_m} \cdot 2^{-\frac{n_e}{2} + 3} = \frac{8s}{C} \cdot \frac{n_e}{\left(\sqrt{2}\right)^{n_e}}, \qquad \zeta = \frac{1}{n_m} = \frac{n_e}{C}, \qquad \eta = \frac{\zeta^2}{4(1 + \zeta)} = \frac{1}{4 n_m (n_m + 1)}
\]
For NLQ-SGD, the size of the noise ball according to Theorem 3 is:
\[
\frac{\delta \sigma_1 + \zeta \sigma \|w^*\|_2}{2} + \frac{(\delta L_1 + \zeta L \|w^*\|_2 + \zeta \sigma)^2}{4\mu}
\]
Denote this as $\frac{1}{2} A + \frac{1}{4\mu} B^2$.
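The denormal parameters can likewise be sanity-checked on a small instance of this grid: through the denormal range the points are uniformly spaced with gap $\delta$, and above it the relative gap is at most $1/n_m = \zeta$. A minimal sketch with illustrative sizes:

```python
s = 1.0
ne, nm = 8, 4
delta = (s / nm) * 2.0 ** (-ne // 2 + 3)  # = 8*s*ne / (C*(sqrt(2))**ne), C = ne*nm

# The denormal FPQ quantization points from the proof of Theorem 6.
denormals = [s * (x / nm) * 2.0 ** (-ne // 2 + 3) for x in range(nm)]
normals = [s * (1 + x / nm) * 2.0 ** y
           for x in range(nm)
           for y in range(-ne // 2 + 3, ne // 2)]
D = sorted(set(denormals + normals))

gaps = [hi - lo for lo, hi in zip(D, D[1:])]
# Uniform spacing delta through the denormal range ...
assert all(abs(g - delta) < 1e-12 for g in gaps[:nm])
# ... and relative spacing at most 1/nm above it.
assert all(g <= D[i] / nm + 1e-12 for i, g in enumerate(gaps[nm:], start=nm))
print("delta =", delta, "| number of grid points:", len(D))
```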
When $b$ is large, $\delta, \zeta, \eta$ are small and the dominating term of the noise ball is
\[
A = \delta \sigma_1 + \zeta \sigma \|w^*\|_2 = \frac{8 s \sigma_1}{C} \cdot \frac{n_e}{\left(\sqrt{2}\right)^{n_e}} + \sigma \|w^*\|_2 \cdot \frac{n_e}{C}
\]
Setting the derivative with respect to $n_e$ to zero:
\[
\frac{\partial A}{\partial n_e} = \frac{8 s \sigma_1}{C} \cdot \frac{1 - (\ln \sqrt{2}) n_e}{\left(\sqrt{2}\right)^{n_e}} + \frac{\sigma \|w^*\|_2}{C} = 0
\]
Denote $V(x) = x \cdot e^x$ and let the Lambert W function be $W(y) = V^{-1}(y)$, $y \ge -\frac{1}{e}$. Using the identity $(1 - u)e^{-u} = V(1 - u)/e$, the condition becomes
\[
\frac{\partial A}{\partial n_e} = \frac{8 s \sigma_1}{e C} V\!\left(1 - (\ln \sqrt{2}) n_e\right) + \frac{\sigma \|w^*\|_2}{C} = 0,
\]
thus we have
\[
V\!\left(1 - (\ln \sqrt{2}) n_e\right) = -\frac{e \sigma \|w^*\|_2}{8 s \sigma_1}, \qquad n_e = \frac{2}{\ln 2}\left[ 1 - W\!\left( -\frac{e \sigma \|w^*\|_2}{8 s \sigma_1} \right) \right], \qquad b_e = \log_2 n_e,
\]
where, for the solution with large $n_e$, $W$ denotes the branch with $W \le -1$.

And when $b$ is small, $\delta, \zeta, \eta$ are large and the dominating term of the noise ball is
\[
B = \delta L_1 + \zeta L \|w^*\|_2 + \zeta \sigma = \frac{8 s L_1}{C} \cdot \frac{n_e}{\left(\sqrt{2}\right)^{n_e}} + (L \|w^*\|_2 + \sigma) \cdot \frac{n_e}{C}
\]
Setting the derivative with respect to $n_e$ to zero:
\[
\frac{\partial B}{\partial n_e} = \frac{8 s L_1}{e C} V\!\left(1 - (\ln \sqrt{2}) n_e\right) + \frac{L \|w^*\|_2 + \sigma}{C} = 0,
\]
thus
\[
n_e = \frac{2}{\ln 2}\left[ 1 - W\!\left( -\frac{e (L \|w^*\|_2 + \sigma)}{8 s L_1} \right) \right], \qquad b_e = \log_2 n_e.
\]
For $b$ such that neither term dominates, the noise ball size is
\[
\frac{1}{2} A + \frac{1}{4\mu} B^2 = \frac{\delta \sigma_1 + \zeta \sigma \|w^*\|_2}{2} + \frac{(\delta L_1 + \zeta L \|w^*\|_2 + \zeta \sigma)^2}{4\mu},
\]
with derivative
\[
\frac{\partial}{\partial n_e}\left( \frac{1}{2} A + \frac{1}{4\mu} B^2 \right) = \frac{1}{2} \frac{\partial A}{\partial n_e} + \frac{B}{2\mu} \frac{\partial B}{\partial n_e}.
\]
Both $\partial A/\partial n_e$ and $\partial B/\partial n_e$ are increasing functions, vanishing at the two settings of $n_e$ derived above, so the solution of $\frac{\partial}{\partial n_e}\left( \frac{1}{2} A + \frac{1}{4\mu} B^2 \right) = 0$ lies in the interval between $\frac{2}{\ln 2}\left[ 1 - W\!\left( -\frac{e \sigma \|w^*\|_2}{8 s \sigma_1} \right) \right]$ and $\frac{2}{\ln 2}\left[ 1 - W\!\left( -\frac{e (L \|w^*\|_2 + \sigma)}{8 s L_1} \right) \right]$.
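The Lambert-W manipulations can be checked numerically. The helper `lambert_w` below is a hypothetical utility (Newton iteration on the principal branch), used only to confirm that $W$ inverts $V$ and that the rewriting of the derivative via $V$ is an identity:

```python
import math

def V(x):
    return x * math.exp(x)

def lambert_w(y, tol=1e-12):
    # Principal branch W0 via Newton's method; assumes y > 0 here.
    x = math.log1p(y)  # crude but adequate initial guess for y > 0
    for _ in range(100):
        step = (V(x) - y) / (math.exp(x) * (1 + x))
        x -= step
        if abs(step) < tol:
            break
    return x

# W inverts V: V(W(y)) == y.
for y in [0.1, 1.0, math.e, 100.0]:
    assert abs(V(lambert_w(y)) - y) < 1e-9

# Identity used to rewrite dA/dn_e: (1 - u) * e^(-u) == V(1 - u) / e.
for u in [0.0, 0.5, 2.0, 7.3]:
    assert abs((1 - u) * math.exp(-u) - V(1 - u) / math.e) < 1e-12
print("lambert_w(e) =", round(lambert_w(math.e), 6))
```

In the optimal-$n_e$ formulas the argument of $W$ comes out negative (in $(-1/e, 0)$), where $V$ is two-to-one; the large-$n_e$ solution corresponds to the lower branch, which the same Newton scheme reaches from a sufficiently negative initial guess.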
Proof of Theorem 7. In normal floating-point quantization, we know that
\[
\delta_{\text{normal}} = \frac{4s}{2^{2^{b_e - 1}}}, \qquad \zeta = 2^{-b_m}, \qquad \eta = \frac{\zeta^2}{4(\zeta + 1)}.
\]
Same as before, denote $n_e = 2^{b_e}$, $n_m = 2^{b_m}$, $C = 2^b = 2^{b_e + b_m}$; then
\[
\delta_{\text{normal}} = \frac{4s}{\left(\sqrt{2}\right)^{n_e}}, \qquad \zeta = \frac{1}{n_m} = \frac{n_e}{C}, \qquad \eta = \frac{1}{4 n_m (n_m + 1)},
\]
and the noise ball size we wish to minimize (as $T \to \infty$, $\alpha \to 0$), according to Theorem 4, is
\[
L \delta \sigma_1 + \frac{1}{2} (L \zeta R_0)^2.
\]
Denote this result as $S$; then
\[
S = \frac{4 s L \sigma_1}{\left(\sqrt{2}\right)^{n_e}} + \frac{L^2 R_0^2 n_e^2}{2 C^2}, \qquad \frac{\partial S}{\partial n_e} = -\frac{2 (\ln 2) s L \sigma_1}{\left(\sqrt{2}\right)^{n_e}} + \frac{L^2 R_0^2}{C^2} n_e.
\]
The noise ball is minimized when $\partial S / \partial n_e = 0$, that is:
\[
n_e \left(\sqrt{2}\right)^{n_e} = \frac{2 (\ln 2) s \sigma_1 C^2}{L R_0^2}, \qquad \text{i.e.} \qquad \left( \frac{\ln 2}{2} n_e \right) e^{\frac{\ln 2}{2} n_e} = \frac{(\ln 2)^2 s \sigma_1 C^2}{L R_0^2}.
\]
Let $V(x) = x \cdot e^x$ and let the Lambert W function be $W(y) = V^{-1}(y)$, $y \ge -\frac{1}{e}$; then the solution is
\[
n_e = \frac{2}{\ln 2} W\!\left( \frac{(\ln 2)^2 s \sigma_1 C^2}{L R_0^2} \right),
\]
thus
\[
b_e = \log_2 n_e = \log_2 \left[ \frac{2}{\ln 2} W\!\left( \frac{(\ln 2)^2 s \sigma_1 2^{2b}}{L R_0^2} \right) \right].
\]
For denormal FPQ on non-convex objectives,
\[
S = \frac{8 s L \sigma_1 n_e}{C \left(\sqrt{2}\right)^{n_e}} + \frac{L^2 R_0^2 n_e^2}{2 C^2}, \qquad \frac{\partial S}{\partial n_e} = \frac{8 s \sigma_1 L}{C} \cdot \frac{1 - (\ln \sqrt{2}) n_e}{\left(\sqrt{2}\right)^{n_e}} + \frac{L^2 R_0^2}{C^2} n_e.
\]
Setting $\partial S / \partial n_e = 0$, we have:
\[
\frac{8 s \sigma_1 C}{L R_0^2} \cdot \frac{1 - (\ln \sqrt{2}) n_e}{\left(\sqrt{2}\right)^{n_e}} + n_e = 0.
\]
This is a transcendental equation: it has no analytical solution, and it may have no solution at all. If a solution does exist, we can solve for it numerically and use it as the optimal setting of the exponent bits.
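As suggested, the transcendental stationarity condition can be handled numerically: grid-search the noise-ball size $S$ for its minimizer and confirm that the minimizer approximately satisfies the equation. All constants below are hypothetical stand-ins:

```python
import math

# Hypothetical constants for the denormal, non-convex case.
s, sigma1, L, R0 = 1.0, 2.0, 10.0, 1.0
b = 16
C = 2.0 ** b

def S(ne):
    # Noise-ball size for denormal FPQ on non-convex objectives.
    return (8 * s * L * sigma1 * ne / (C * (2 ** 0.5) ** ne)
            + (L * R0 * ne) ** 2 / (2 * C ** 2))

def g(ne):
    # Left-hand side of the transcendental stationarity condition.
    k = 8 * s * sigma1 * C / (L * R0 ** 2)
    return k * (1 - math.log(math.sqrt(2)) * ne) / (2 ** 0.5) ** ne + ne

grid = [i / 100 for i in range(100, 6400)]
ne_star = min(grid, key=S)
assert abs(g(ne_star)) < 0.5  # the grid minimizer nearly solves g(ne) = 0
print("optimal n_e:", ne_star, "-> b_e ~", round(math.log2(ne_star), 2))
```

Whether a root of $g$ exists depends on the constants: the negative exponential dip must be deep enough to cancel the linear term, which is the case here because $C = 2^b$ is large.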