{"title": "Solving Random Systems of Quadratic Equations via Truncated Generalized Gradient Flow", "book": "Advances in Neural Information Processing Systems", "page_first": 568, "page_last": 576, "abstract": "This paper puts forth a novel algorithm, termed \\emph{truncated generalized gradient flow} (TGGF), to solve for $\\bm{x}\\in\\mathbb{R}^n/\\mathbb{C}^n$ a system of $m$ quadratic equations $y_i=|\\langle\\bm{a}_i,\\bm{x}\\rangle|^2$, $i=1,2,\\ldots,m$, which even for $\\left\\{\\bm{a}_i\\in\\mathbb{R}^n/\\mathbb{C}^n\\right\\}_{i=1}^m$ random is known to be \\emph{NP-hard} in general. We prove that as soon as the number of equations $m$ is on the order of the number of unknowns $n$, TGGF recovers the solution exactly (up to a global unimodular constant) with high probability and complexity growing linearly with the time required to read the data $\\left\\{\\left(\\bm{a}_i;\\,y_i\\right)\\right\\}_{i=1}^m$. Specifically, TGGF proceeds in two stages: s1) A novel \\emph{orthogonality-promoting} initialization that is obtained with simple power iterations; and, s2) a refinement of the initial estimate by successive updates of scalable \\emph{truncated generalized gradient iterations}. The former is in sharp contrast to the existing spectral initializations, while the latter handles the rather challenging nonconvex and nonsmooth \\emph{amplitude-based} cost function. Numerical tests demonstrate that: i) The novel orthogonality-promoting initialization method returns more accurate and robust estimates relative to its spectral counterparts; and ii) even with the same initialization, our refinement/truncation outperforms Wirtinger-based alternatives, all corroborating the superior performance of TGGF over state-of-the-art algorithms.", "full_text": "Solving Random Systems of Quadratic Equations via\n\nTruncated Generalized Gradient Flow\n\nGang Wang\u2217,\u2020 and Georgios B. Giannakis\u2020\n\n\u2217 ECE Dept. and Digital Tech. Center, Univ. 
of Minnesota, Minneapolis, MN 55455, USA
† School of Automation, Beijing Institute of Technology, Beijing 100081, China

{gangwang, georgios}@umn.edu

Abstract

This paper puts forth a novel algorithm, termed truncated generalized gradient flow (TGGF), to solve for $x \in \mathbb{R}^n/\mathbb{C}^n$ a system of $m$ quadratic equations $y_i = |\langle a_i, x\rangle|^2$, $i = 1, 2, \ldots, m$, which even for $\{a_i \in \mathbb{R}^n/\mathbb{C}^n\}_{i=1}^m$ random is known to be NP-hard in general. We prove that as soon as the number of equations $m$ is on the order of the number of unknowns $n$, TGGF recovers the solution exactly (up to a global unimodular constant) with high probability and complexity growing linearly with the time required to read the data $\{(a_i;\, y_i)\}_{i=1}^m$. Specifically, TGGF proceeds in two stages: s1) a novel orthogonality-promoting initialization that is obtained with simple power iterations; and s2) a refinement of the initial estimate by successive updates of scalable truncated generalized gradient iterations. The former is in sharp contrast to the existing spectral initializations, while the latter handles the rather challenging nonconvex and nonsmooth amplitude-based cost function. Empirical results demonstrate that: i) the novel orthogonality-promoting initialization method returns more accurate and robust estimates relative to its spectral counterparts; and ii) even with the same initialization, our refinement/truncation outperforms Wirtinger-based alternatives, all corroborating the superior performance of TGGF over state-of-the-art algorithms.

1 Introduction

Consider a system of $m$ quadratic equations

    $y_i = |\langle a_i, x\rangle|^2, \quad i \in [m] := \{1, 2, \ldots, m\}$    (1)

where the data vector $y := [y_1 \cdots y_m]^T$ and the feature vectors $a_i \in \mathbb{R}^n/\mathbb{C}^n$, collected in the $m \times n$ matrix $A := [a_1 \cdots a_m]^H$, are known, whereas the vector $x \in \mathbb{R}^n/\mathbb{C}^n$ is the wanted unknown. When $\{a_i\}_{i=1}^m$ and/or $x$ are complex, their amplitudes are given but phase information is lacking, whereas in the real case only the signs of $\{\langle a_i, x\rangle\}$ are unknown. Supposing that the system of equations in (1) admits a unique solution $x$ (up to a global unimodular constant), our objective is to reconstruct $x$ from the $m$ phaseless quadratic equations, or equivalently, to recover the missing signs/phases of $\langle a_i, x\rangle$ in the real-/complex-valued settings. Indeed, it has been established that $m \ge 2n-1$ or $m \ge 4n-4$ generic data $\{(a_i;\, y_i)\}_{i=1}^m$ as in (1) suffice for uniqueness of an $n$-dimensional real- or complex-valued vector $x$ [1, 2], respectively, and the former with equality has also been shown to be necessary [1].

The problem in (1) constitutes an instance of nonconvex quadratic programming, which is generally known to be NP-hard [3]. Specifically, for real-valued vectors this can be understood as a combinatorial optimization: one seeks a sequence of signs $s_i = \pm 1$ such that the solution of the system of linear equations $\langle a_i, x\rangle = s_i\psi_i$, where $\psi_i := \sqrt{y_i}$, obeys the given quadratic system (1). Concatenating all amplitudes $\{\psi_i\}_{i=1}^m$ to form the vector $\psi := [\psi_1 \cdots \psi_m]^T$, there are clearly a total of $2^m$ different combinations of $\{s_i\}_{i=1}^m$, among which only two lead to $x$ up to a global sign.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

The complex case becomes even more complicated: instead of a set of signs $\{s_i\}_{i=1}^m$, one must specify for uniqueness a collection of unimodular complex scalars $\{\sigma_i \in \mathbb{C}\}_{i=1}^m$. In many fields of physical sciences and engineering, the problem of recovering the phase from intensity/magnitude-only measurements is commonly referred to as phase retrieval [4, 5].
Applications abound, including X-ray crystallography, optics, and array imaging, where due to physical limitations, optical detectors can record only the (squared) modulus of the Fresnel or Fraunhofer diffraction pattern, while losing the phase of the incident light reaching the object [5]. It has been shown that reconstructing a discrete, finite-duration signal from its Fourier transform magnitude is NP-complete [6]. Despite its simple form and practical relevance across various fields, tackling the quadratic system (1) under real-/complex-valued settings is challenging and NP-hard in general.

1.1 Nonconvex Optimization

Adopting the least-squares criterion, the task of recovering $x$ can be recast as that of minimizing the intensity-based empirical loss

    $\min_{z \in \mathbb{C}^n}\; f(z) := \frac{1}{2m} \sum_{i=1}^m \left( y_i - |a_i^H z|^2 \right)^2$    (2)

or the amplitude-based one

    $\min_{z \in \mathbb{C}^n}\; \ell(z) := \frac{1}{2m} \sum_{i=1}^m \left( \psi_i - |a_i^H z| \right)^2.$    (3)

Unfortunately, both cost functions (2) and (3) are nonconvex. Minimizing nonconvex objectives, which may exhibit many stationary points, is in general NP-hard [7]. In a nutshell, solving problems of the form (2) or (3) is challenging.

Existing approaches to solving (2) (or related ones using the Poisson likelihood; see, e.g., [8]) or (3) fall under two categories: nonconvex and convex ones. Popular nonconvex solvers include alternating projection methods such as Gerchberg-Saxton [9] and Fienup [10], AltMinPhase [11], and (truncated) Wirtinger flow (WF/TWF) [12, 8], as well as trust-region methods [13].
Convex approaches, on the other hand, rely on the so-called matrix-lifting technique to obtain the solvers abbreviated as PhaseLift [14] and PhaseCut [15].

In terms of sample complexity for Gaussian $\{a_i\}$ designs, convex approaches enable exact recovery from^1 $O(n)$ noiseless measurements [16], but they require solving a semidefinite program for a matrix variable of size $n \times n$, thus incurring worst-case computational complexity on the order of $O(n^{4.5})$ [15], which does not scale well with the dimensionality $n$. Upon exploiting the underlying problem structure, $O(n^{4.5})$ can be reduced to $O(n^3)$ [15]. Solving for vector variables, nonconvex approaches achieve significantly improved computational performance. Using formulation (3), AltMinPhase adopts a spectral initialization and establishes exact recovery with sample complexity $O(n\log^3 n)$ under Gaussian $\{a_i\}$ designs with resampling [11]. Concerning formulation (2), WF iteratively refines the spectral initial estimate by means of a gradient-like update [12]. The follow-up TWF improves upon WF through a truncation procedure that separates out gradient components of excessively extreme size. Likewise, at the initialization stage, since the term $(a_i^T x)^2 a_i a_i^H$ responsible for the spectral initialization is heavy-tailed, the data $\{y_i\}_{i=1}^m$ are pre-screened in the truncated spectral initialization to yield improved initial estimates [8]. Under Gaussian sampling models, WF allows exact recovery from $O(n\log n)$ measurements in $O(mn^2\log(1/\epsilon))$ time/flops to yield an $\epsilon$-accurate solution for any given $\epsilon > 0$ [12], while TWF advances these to $O(n)$ measurements and $O(mn\log(1/\epsilon))$ time [8]. Interestingly, the truncation procedure in the gradient stage turns out to be useful in avoiding spurious stationary points in the context of nonconvex optimization; similar ideas, including censoring, have also been studied for large-scale linear regressions [17, 18]. It is worth mentioning that when $m \ge Cn\log^3 n$ for sufficiently large $C > 0$, the objective function in (3) admits benign geometric structure that allows certain iterative algorithms (e.g., trust-region methods) to efficiently find a global minimizer with random initializations [13].

Although achieving a linear (in the number of unknowns $n$) sample and computational complexity, the state-of-the-art TWF scheme still requires at least $4n \sim 5n$ equations to yield a stable empirical success rate (e.g., $\ge 99\%$) under the real Gaussian model [8, Section 3], which is more than twice the known information limit of $m = 2n-1$ [1]. Similar though less obvious results hold also in the complex-valued scenario. Even though the truncated spectral initialization improves upon the "plain vanilla" spectral initialization, its performance still suffers when the number of measurements is relatively small, and its advantage (over the untruncated version) narrows as the number of measurements grows. Further, it is worth stressing that extensive numerical and experimental validation confirms that the amplitude-based cost function performs better than the intensity-based one; that is, formulation (3) is superior to (2) [19].

^1 The notation $\phi(n) = O(g(n))$ means that there is a constant $c > 0$ such that $|\phi(n)| \le c|g(n)|$.
Hence, besides enhancing initialization, markedly improved performance in the gradient stage could be expected by re-examining the amplitude-based cost function and incorporating judiciously designed truncation rules.

2 Algorithm: Truncated Generalized Gradient Flow

Along the lines of suitably initialized nonconvex schemes, and building upon the amplitude-based formulation (3), this paper develops a novel linear-time (in both $m$ and $n$) algorithm, referred to as truncated generalized gradient flow (TGGF), that provably recovers $x \in \mathbb{R}^n/\mathbb{C}^n$ exactly from a near-optimal number of noise-free measurements, while also featuring near-perfect statistical performance in the noisy setup. Our TGGF proceeds in two stages: s1) a novel orthogonality-promoting initialization that relies on simple power iterations to markedly improve upon spectral initialization; and s2) a refinement of the initial estimate by successive updates of truncated generalized gradient iterations. Stages s1) and s2) are delineated next in reverse order. For concreteness, our analysis will focus on the real Gaussian model with $x \in \mathbb{R}^n$ and independently and identically distributed (i.i.d.) design vectors $a_i \in \mathbb{R}^n \sim \mathcal{N}(0, I_n)$, whereas numerical implementations for the complex Gaussian model having $x \in \mathbb{C}^n$ and i.i.d. $a_i \sim \mathcal{CN}(0, I_n) := \mathcal{N}(0, I_n/2) + j\mathcal{N}(0, I_n/2)$ will be discussed briefly. To start, define the Euclidean distance of any estimate $z$ to the solution set: $\mathrm{dist}(z, x) := \min \|z \pm x\|$ for real signals, and $\mathrm{dist}(z, x) := \min_{\phi \in [0, 2\pi)} \|z - x e^{i\phi}\|$ for complex ones [12].
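This distance admits a simple closed form, sketched below in NumPy (the function name `dist` and the explicit phase alignment $e^{i\phi^*} = \langle x, z\rangle/|\langle x, z\rangle|$ are our own illustration of the definitions above, not code from the paper):

```python
import numpy as np

def dist(z, x):
    """Distance of z to the solution set: min ||z +- x|| in the real case,
    or min over phi of ||z - x e^{j phi}|| in the complex case."""
    if np.iscomplexobj(z) or np.iscomplexobj(x):
        # the minimizing phase has a closed form: e^{j phi*} = <x, z> / |<x, z>|
        c = np.vdot(x, z)
        phase = c / abs(c) if abs(c) > 0 else 1.0
        return np.linalg.norm(z - x * phase)
    return min(np.linalg.norm(z - x), np.linalg.norm(z + x))
```

In particular, $\mathrm{dist}(\pm x, x) = 0$ in the real case, and any global phase rotation of a complex solution is likewise indistinguishable.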
Define also the indistinguishable global phase constant in real-valued settings as

    $\phi(z) := \begin{cases} 0, & \|z - x\| \le \|z + x\|, \\ \pi, & \text{otherwise}. \end{cases}$    (4)

Henceforth, fixing $x$ to be any solution of the given quadratic system (1), we always assume that $\phi(z) = 0$; otherwise, $z$ is replaced by $e^{-j\phi(z)} z$, but for simplicity of presentation, the constant phase adaptation term $e^{-j\phi(z)}$ is dropped whenever it is clear from the context.

Numerical tests comparing TGGF, TWF, and WF will be presented throughout our analysis, so let us first describe our basic test settings. Simulated estimates will be averaged over 100 independent Monte Carlo (MC) realizations without mentioning this explicitly each time. Performance is evaluated in terms of the relative root mean-square error, i.e., Relative error $:= \mathrm{dist}(z, x)/\|x\|$, and the success rate among 100 trials, where a success will be declared for a trial if the resulting estimate incurs a relative error less than $10^{-5}$ [8]. Simulated tests under both noiseless and noisy Gaussian models with i.i.d. $a_i \sim \mathcal{N}(0, I_n)$ or $a_i \sim \mathcal{CN}(0, I_n)$ are performed, corresponding to $\psi_i = |a_i^H x + \eta_i|$ with $\eta_i = 0$ and $\eta_i \sim \mathcal{N}(0, \sigma^2)$ [11], respectively.

2.1 Truncated generalized gradient stage

Let us rewrite the amplitude-based cost function in matrix-vector form as

    $\min_{z \in \mathbb{R}^n}\; \ell(z) = \frac{1}{2m} \left\| \psi - |Az| \right\|^2$    (5)

where $|Az| := [|a_1^T z| \cdots |a_m^T z|]^T$. Apart from being nonconvex, $\ell(z)$ is nondifferentiable. In the presence of smoothness or convexity, convergence analysis of iterative algorithms relies either on the continuity of the gradient (gradient methods) [20], or on the convexity of the objective functional (subgradient methods) [20].
Although subgradient methods have found widespread applicability in nonsmooth optimization, they are limited to the class of convex functions [20, Page 4]. In nonconvex nonsmooth optimization, the so-termed generalized gradient broadens the scope of the (sub)gradient to the class of almost everywhere differentiable functions [21]. Consider a continuous function $h(z) \in \mathbb{R}$ defined over an open region $S \subseteq \mathbb{R}^n$.

Definition 1 [22, Definition 1.1] The generalized gradient of a function $h$ at $z$, denoted by $\partial h$, is the convex hull of the set of limits of the form $\lim \nabla h(z_k)$, where $z_k \to z$ as $k \to +\infty$, i.e.,

    $\partial h(z) := \mathrm{conv}\left\{ \lim_{k \to +\infty} \nabla h(z_k) :\; z_k \to z,\; z_k \notin G_\ell \right\}$

where the symbol 'conv' signifies the convex hull of a set, and $G_\ell$ denotes the set of points in $S$ at which $h$ fails to be differentiable.

Having introduced the notion of the generalized gradient, and with $t$ denoting the iteration number, our approach to solving (5) amounts to iteratively refining the initial guess $z_0$ by means of the ensuing truncated generalized gradient iterations

    $z_{t+1} = z_t - \mu_t\, \partial\ell_{\mathrm{tr}}(z_t)$    (6)

where $\mu_t > 0$ is the stepsize, and a piece of the (truncated) generalized gradient $\partial\ell_{\mathrm{tr}}(z_t)$ is given by

    $\partial\ell_{\mathrm{tr}}(z_t) := \frac{1}{m} \sum_{i \in \mathcal{I}_{t+1}} \left( a_i^T z_t - \psi_i \frac{a_i^T z_t}{|a_i^T z_t|} \right) a_i$    (7)

for some index set $\mathcal{I}_{t+1} \subseteq [m]$ to be designed shortly; the convention $\frac{a_i^T z_t}{|a_i^T z_t|} := 0$ is adopted if $a_i^T z_t = 0$.
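In NumPy, a piece of the truncated generalized gradient (7), with membership in $\mathcal{I}_{t+1}$ tested via the threshold rule designed below, can be sketched as follows (a real-valued illustration; the function name and the use of `np.sign`, whose value at 0 encodes the zero convention above, are our choices):

```python
import numpy as np

def truncated_grad(A, psi, z, gamma=0.7):
    """Truncated generalized gradient, eq. (7), for the real Gaussian model.

    A: (m, n) design matrix, psi: amplitudes |a_i^T x|, z: current iterate."""
    Az = A @ z
    # truncation: keep indices with |a_i^T z| >= psi_i / (1 + gamma)
    keep = np.abs(Az) >= psi / (1.0 + gamma)
    # np.sign(0) == 0 realizes the convention a_i^T z / |a_i^T z| := 0
    r = Az[keep] - psi[keep] * np.sign(Az[keep])
    return A[keep].T @ r / A.shape[0]
```

At the true solution $z = x$ the residual vanishes (every kept sign is correct), so the returned gradient is zero, consistent with $x$ being a global minimizer of (5).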
Further, it is easy to verify that the update in (6) monotonically decreases the objective value in (5).

Recall that, since they offer descent iterations, the alternating projection variants are guaranteed to converge to a stationary point of $\ell(z)$, and any limit point $z^*$ adheres to the following fixed-point equation [23]

    $A^T \left( A z^* - \psi \odot \frac{A z^*}{|A z^*|} \right) = 0$    (8)

with $\odot$ the entry-wise product, which may have many solutions. Clearly, if $z^*$ is a solution, so is $-z^*$. Further, both solutions/global minimizers $x$ and $-x$ satisfy (8) due to $Ax - \psi \odot \frac{Ax}{|Ax|} = 0$. Considering any stationary point $z^* \ne \pm x$ that has been adapted such that $\phi(z^*) = 0$, one can write $z^* = x + (A^T A)^{-1} A^T \left[ \psi \odot \left( \frac{A z^*}{|A z^*|} - \frac{Ax}{|Ax|} \right) \right]$. A necessary condition for $z^* \ne x$ is $\frac{A z^*}{|A z^*|} \ne \frac{Ax}{|Ax|}$. Expressed differently, there must be sign differences between $\frac{A z^*}{|A z^*|}$ and $\frac{Ax}{|Ax|}$ whenever one gets stuck at an undesirable stationary point $z^*$.

Figure 1: Empirical success rate for WF, TWF, and TGGF with the same truncated spectral initialization under the noiseless real Gaussian model.

Building on this observation, it is reasonable to devise algorithms that can detect and separate out the generalized gradient components corresponding to mistakenly estimated signs $\frac{a_i^T z_t}{|a_i^T z_t|}$ along the iterates $\{z_t\}$. Precisely, if $z_t$ and $x$ lie on different sides of the hyperplane $a_i^T z = 0$, then the sign of $a_i^T z_t$ will be different from that of $a_i^T x$; that is, $\frac{a_i^T z_t}{|a_i^T z_t|} \ne \frac{a_i^T x}{|a_i^T x|}$. Specifically, one can write the $i$-th generalized gradient component as

    $\partial\ell_i(z) = \left( a_i^T z - \psi_i \frac{a_i^T z}{|a_i^T z|} \right) a_i = a_i a_i^T h + \left( \frac{a_i^T x}{|a_i^T x|} - \frac{a_i^T z}{|a_i^T z|} \right) \psi_i a_i \;\triangleq\; a_i a_i^T h + r_i$    (9)

where $h := z - x$. The strong law of large numbers (SLLN) asserts that averaging the first term $a_i a_i^T h$ over $m$ instances approaches $h$, which qualifies it as a desirable search direction. However, certain generalized gradient entries involve erroneously estimated signs of $a_i^T x$; hence, nonzero $r_i$ terms exert a negative influence on the search direction $h$ by dragging the iterate away from $x$, and they typically have sizable magnitudes. To see why, recall that the quantities $\max_{i \in [m]} \psi_i$ and $(1/m)\sum_{i=1}^m \psi_i$ have magnitudes on the order of $\sqrt{2\log m}\,\|x\|$ and $\sqrt{2/\pi}\,\|x\|$, respectively, whereas $\|h\| \le \rho\|x\|$ for some small constant $0 < \rho \le 1/10$, to be discussed shortly. To maintain a meaningful search direction, those 'bad' generalized gradient entries should be detected and excluded from the search direction.

Nevertheless, it is difficult or even impossible to check whether the sign of $a_i^T z_t$ equals that of $a_i^T x$. Fortunately, when the initialization is accurate enough, most spurious gradient entries (those corrupted by nonzero $r_i$ terms) provably hover around the watershed hyperplane $a_i^T z_t = 0$.
For this reason, TGGF includes only those components whose $z_t$ is sufficiently away from its watershed, namely,

    $\mathcal{I}_{t+1} := \left\{ 1 \le i \le m \;:\; \frac{|a_i^T z_t|}{|a_i^T x|} \ge \frac{1}{1+\gamma} \right\}, \quad t \ge 0$    (10)

for an appropriately selected threshold $\gamma > 0$. It is worth stressing that our novel truncation rule deviates from the intuition behind TWF: among its complicated truncation procedures, TWF also throws away gradient components of large size in addition to enforcing a rule akin to (10), which is not the case with TGGF. As demonstrated by our analysis, it rarely happens that a generalized gradient component having a large $|a_i^T z_t|/\|z_t\|$ yields an incorrect sign of $a_i^T x$. Further, discarding too many samples (those $i \notin \mathcal{I}_{t+1}$) introduces a large bias into $(1/m)\sum_{i \in \mathcal{I}_{t+1}} a_i a_i^T h_t$, thus rendering TWF less effective when $m/n$ is small. The numerical comparison depicted in Fig. 1 suggests that, even starting with the same truncated spectral initialization, TGGF's refinement outperforms those of TWF and WF, corroborating the merits of our novel truncation and update rule over TWF/WF.

2.2 Orthogonality-promoting initialization stage

Leveraging the SLLN, spectral methods estimate $x$ using the (appropriately scaled) leading eigenvector of $Y := \frac{1}{m}\sum_{i \in \mathcal{T}_0} y_i a_i a_i^T$, where $\mathcal{T}_0$ is an index set accounting for possible truncation. As asserted in [8], each summand $(a_i^T x)^2 a_i a_i^T$ follows a heavy-tail probability density function lacking a moment generating function. This causes major performance degradation, especially when the number of measurements is limited. Instead of spectral initialization, we shall take another route to bypass this hurdle.
To gain intuition for selecting our alternate route, a motivating example is presented first that reveals fundamental characteristics among high-dimensional random vectors.

Example: Fixing any nonzero vector $x \in \mathbb{R}^n$, generate data $\psi_i = |\langle a_i, x\rangle|$ using i.i.d. $a_i \sim \mathcal{N}(0, I_n)$, $\forall i \in [m]$, and evaluate the squared normalized inner-product

    $\cos^2\theta_i := \frac{|\langle a_i, x\rangle|^2}{\|a_i\|^2 \|x\|^2} = \frac{\psi_i^2}{\|a_i\|^2 \|x\|^2}, \quad \forall i \in [m]$    (11)

where $\theta_i$ is the angle between $a_i$ and $x$. Consider ordering all $\cos^2\theta_i$'s so that $\cos^2\theta_{[1]} \ge \cdots \ge \cos^2\theta_{[m]}$, and collectively denote them in ascending order as $\xi := [\cos^2\theta_{[m]} \cdots \cos^2\theta_{[1]}]^T$.

Figure 2: Ordered squared normalized inner-product for pairs $x$ and $a_i$, $\forall i \in [m]$, with $m/n$ varying by 2 from 2 to 10, and $n = 10^3$.

Fig. 2 plots the ordered entries in $\xi$ for $m/n$ varying by 2 from 2 to 10 with $n = 10^3$. Observe that almost all $\{a_i\}$ vectors have a squared normalized inner-product smaller than $10^{-2}$, while half of the inner-products are less than $10^{-3}$, which implies that $x$ is nearly orthogonal to many $a_i$'s.

This example corroborates that random vectors in high-dimensional spaces are almost always nearly orthogonal to each other [24]. This inspired us to pursue an orthogonality-promoting initialization method. Our key idea is to approximate $x$ by a vector that is most orthogonal to a subset of vectors $\{a_i\}_{i \in \mathcal{I}_0}$, where $\mathcal{I}_0$ is a set with cardinality $|\mathcal{I}_0| < m$ that includes the indices of the smallest squared normalized inner-products $\{\cos^2\theta_i\}$. Since $\|x\|$ appears in all inner-products, its exact value does not influence their ordering. Henceforth, we assume without loss of generality that $\|x\| = 1$. Using $\{(a_i; \psi_i)\}$, evaluate $\cos^2\theta_i$ according to (11) for each pair $x$ and $a_i$.
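The Example above is easy to reproduce numerically; the short sketch below (the sizes and seed are arbitrary choices, not the paper's exact setup) checks that most squared normalized inner-products fall below $10^{-2}$ and about half below $10^{-3}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 6000
x = rng.standard_normal(n)
A = rng.standard_normal((m, n))
# squared normalized inner-products, eq. (11)
cos2 = (A @ x) ** 2 / (np.sum(A * A, axis=1) * (x @ x))
print(np.mean(cos2 < 1e-2), np.median(cos2))
```

For $n = 10^3$, $\cos^2\theta_i$ concentrates around $1/n = 10^{-3}$, matching the behavior reported in Fig. 2.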
Instrumental for the ensuing derivations is noticing that the summation of $\cos^2\theta_i$ over indices $i \in \mathcal{I}_0$ is very small; rigorous justification is deferred to Section 3 and the supplementary materials. Thus, a meaningful approximation denoted by $z_0 \in \mathbb{R}^n$ can be obtained by solving

    $\min_{\|z\|=1}\; z^T \left( \frac{1}{|\mathcal{I}_0|} \sum_{i \in \mathcal{I}_0} \frac{a_i a_i^T}{\|a_i\|^2} \right) z$    (12)

which amounts to finding the smallest eigenvalue and the associated eigenvector of $\frac{1}{|\mathcal{I}_0|}\sum_{i \in \mathcal{I}_0} \frac{a_i a_i^T}{\|a_i\|^2}$. Yet finding the smallest eigenvalue calls for eigen-decomposition or matrix inversion, each requiring computational complexity $O(n^3)$. Such a computational burden can be intractable when $n$ grows large. Applying a standard concentration result greatly simplifies those computations next [25]. Since $a_i/\|a_i\|$ has unit norm and is uniformly distributed on the unit sphere, it is uniformly spherically distributed.^2 Spherical symmetry implies that $a_i/\|a_i\|$ has zero mean and covariance matrix $I_n/n$ [25]. Appealing again to the SLLN, the sample covariance matrix $\frac{1}{m}\sum_{i=1}^m \frac{a_i a_i^T}{\|a_i\|^2}$ approaches $I_n/n$ as $m$ grows. Simple derivations lead to

    $\sum_{i \in \mathcal{I}_0} \frac{a_i a_i^T}{\|a_i\|^2} = \sum_{i=1}^m \frac{a_i a_i^T}{\|a_i\|^2} - \sum_{i \in \bar{\mathcal{I}}_0} \frac{a_i a_i^T}{\|a_i\|^2} \simeq \frac{m}{n} I_n - \sum_{i \in \bar{\mathcal{I}}_0} \frac{a_i a_i^T}{\|a_i\|^2}$

where $\bar{\mathcal{I}}_0$ is the complement of $\mathcal{I}_0$ in the set $[m]$. Define $S := [a_1/\|a_1\| \cdots a_m/\|a_m\|]^T \in \mathbb{R}^{m \times n}$, and form $\bar{S}_0$ by removing the rows of $S$ whose indices do not belong to $\bar{\mathcal{I}}_0$. The task of seeking the smallest eigenvalue of $Y_0 := \frac{1}{|\mathcal{I}_0|} S_0^T S_0$ (with $S_0$ collecting the rows of $S$ indexed by $\mathcal{I}_0$) then reduces to computing the largest eigenvalue of $\bar{Y}_0 := \frac{1}{|\bar{\mathcal{I}}_0|} \bar{S}_0^T \bar{S}_0$, namely,

    $\tilde{z}_0 := \arg\max_{\|z\|=1}\; z^T \bar{Y}_0 z$    (13)

which can be efficiently solved using simple power iterations. If, on the other hand, $\|x\| \ne 1$, the estimate $\tilde{z}_0$ from (13) is further scaled so that its norm matches approximately that of $x$ (whose square is estimated by $\frac{1}{m}\sum_{i=1}^m y_i$), thus yielding $z_0 = \sqrt{\sum_{i=1}^m y_i/m}\;\tilde{z}_0$. It is worth stressing that the constructed matrix $\bar{Y}_0$ does not depend on $\{y_i\}$ explicitly, saving our initialization from suffering the heavy tails of the fourth order of $\{a_i\}$ present in spectral initialization schemes.

Figure 3: Relative error versus $m/n$ for: i) the spectral method; ii) the truncated spectral method; and iii) our orthogonality-promoting method under the noiseless real Gaussian model.

Fig. 3 compares the three initialization schemes, showing their relative errors versus the measurement/unknown ratio $m/n$ under the noise-free real Gaussian model, where $x \in \mathbb{R}^{1{,}000}$ and $m/n$ increases by 2 from 2 to 20. Apparently, all schemes enjoy improved performance as $m/n$ increases. In particular, the proposed initialization method outperforms its spectral alternatives. Interestingly, the spectral and truncated spectral schemes exhibit similar performance when $m/n$ is sufficiently large (e.g., $m/n \ge 14$).
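A compact power-iteration sketch of the initialization just described follows (the helper name, the $1/6$ fraction of retained rows, the iteration count, and the seed are illustrative defaults, not the paper's proof conditions):

```python
import numpy as np

def orth_init(A, psi, n_iter=100, frac=1 / 6, seed=1):
    """Orthogonality-promoting initialization: power iterations for (13),
    followed by norm matching. A: (m, n) design matrix, psi: amplitudes."""
    m, n = A.shape
    norms = np.linalg.norm(A, axis=1)
    k = int(np.ceil(frac * m))
    # complement trick: keep the k rows with the LARGEST psi_i / ||a_i||,
    # i.e., those most aligned with x
    rows = np.argsort(psi / norms)[-k:]
    S0bar = A[rows] / norms[rows, None]
    z = np.random.default_rng(seed).standard_normal(n)
    z /= np.linalg.norm(z)
    for _ in range(n_iter):
        z = S0bar.T @ (S0bar @ z)   # apply Ybar0 up to the 1/k factor
        z /= np.linalg.norm(z)
    return np.sqrt(np.mean(psi ** 2)) * z   # match ||x|| via (1/m) sum y_i
```

Note that $\bar{Y}_0$ is never formed explicitly: each power iteration costs two matrix-vector products, i.e., $O(|\bar{\mathcal{I}}_0|\, n)$ flops.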
This confirms that truncation helps only if $m/n$ is relatively small. Indeed, truncation is effected by discarding measurements of excessively large size emerging from the heavy tails of the data distribution. Hence, its advantage over the untruncated version narrows as the number of measurements increases, thus straightening out the heavy tails. On the contrary, the orthogonality-promoting initialization method achieves consistently superior performance over its spectral alternatives.

3 Main results

TGGF is summarized in Algorithm 1 with default values set for the pertinent algorithmic parameters. Postulating independent samples $\{(a_i; \psi_i)\}$, the following result establishes the performance of our TGGF approach.

^2 A random vector $z \in \mathbb{R}^n$ is said to be spherical (or spherically symmetric) if its distribution does not change under rotations of the coordinate system; that is, the distribution of $Pz$ coincides with that of $z$ for any given orthogonal $n \times n$ matrix $P$.

Algorithm 1 Truncated generalized gradient flow (TGGF) solver
1: Input: Data $\{\psi_i\}_{i=1}^m$ and feature vectors $\{a_i\}_{i=1}^m$; the maximum number of iterations $T = 1{,}000$; by default, constant step size $\mu = 0.6$ ($\mu = 1$) for the real (complex) Gaussian model, truncation threshold $|\bar{\mathcal{I}}_0| = \lceil \frac{1}{6}m \rceil$ ($\lceil\cdot\rceil$ the ceiling operation), and $\gamma = 0.7$.
2: Evaluate $\psi_i/\|a_i\|$, $\forall i \in [m]$, and find $\bar{\mathcal{I}}_0$ comprising the indices corresponding to the $|\bar{\mathcal{I}}_0|$ largest $(\psi_i/\|a_i\|)$'s.
3: Initialize $z_0$ to $\sqrt{\sum_{i=1}^m \psi_i^2/m}\;\tilde{z}_0$, where $\tilde{z}_0$ is the unit leading eigenvector of $\bar{Y}_0 := \frac{1}{|\bar{\mathcal{I}}_0|} \sum_{i \in \bar{\mathcal{I}}_0} \frac{a_i a_i^T}{\|a_i\|^2}$.
4: Loop: for $t = 0$ to $T-1$,
    $z_{t+1} = z_t - \frac{\mu}{m} \sum_{i \in \mathcal{I}_{t+1}} \left( a_i^T z_t - \psi_i \frac{a_i^T z_t}{|a_i^T z_t|} \right) a_i$
   where $\mathcal{I}_{t+1} := \left\{ 1 \le i \le m \,:\, |a_i^T z_t| \ge \frac{1}{1+\gamma}\psi_i \right\}$.
5: Output: $z_T$.

Theorem 1 Let $x \in \mathbb{R}^n$ be an arbitrary signal vector, and consider (noise-free) measurements $\psi_i = |a_i^T x|$, in which $a_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I_n)$, $1 \le i \le m$. Then with probability at least $1 - (m+5)e^{-n/2} - e^{-c_0 m} - 1/n^2$ for some universal constant $c_0 > 0$, the initialization $z_0$ returned by the orthogonality-promoting method in Algorithm 1 satisfies

    $\mathrm{dist}(z_0, x) \le \rho\,\|x\|$    (14)

with $\rho = 1/10$ (or any sufficiently small positive constant), provided that $m \ge c_1|\bar{\mathcal{I}}_0| \ge c_2 n$ for some numerical constants $c_1, c_2 > 0$ and sufficiently large $n$. Further, choosing a constant step size $\mu \le \mu_0$ along with a fixed truncation level $\gamma \ge 1/2$, and starting from any initial guess $z_0$ satisfying (14), successive estimates of the TGGF solver (tabulated in Algorithm 1) obey

    $\mathrm{dist}(z_t, x) \le \rho\,(1-\nu)^t\,\|x\|, \quad t = 0, 1, \ldots$    (15)

for some $0 < \nu < 1$, which holds with probability exceeding $1 - (m+5)e^{-n/2} - 8e^{-c_0 m} - 1/n^2$. Typical parameters are $\mu = 0.6$ and $\gamma = 0.7$.

Theorem 1 asserts that: i) TGGF recovers the solution $x$ exactly as soon as the number of equations is about the number of unknowns, which is theoretically order optimal.
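For reference, the whole of Algorithm 1 fits in a few lines of NumPy for the real Gaussian model; the sketch below uses the paper's default parameters $\mu = 0.6$, $\gamma = 0.7$, and $|\bar{\mathcal{I}}_0| = \lceil m/6 \rceil$, while the initialization seed and iteration counts are our own choices:

```python
import numpy as np

def tggf(A, psi, T=800, mu=0.6, gamma=0.7, n_pow=50):
    """Sketch of Algorithm 1 (TGGF) for the real Gaussian model."""
    m, n = A.shape
    norms = np.linalg.norm(A, axis=1)
    # s1) orthogonality-promoting initialization (steps 2-3)
    rows = np.argsort(psi / norms)[-int(np.ceil(m / 6)):]
    S = A[rows] / norms[rows, None]
    z = np.random.default_rng(1).standard_normal(n)
    for _ in range(n_pow):
        z = S.T @ (S @ z)               # power iterations on Ybar0
        z /= np.linalg.norm(z)
    z *= np.sqrt(np.mean(psi ** 2))     # scale to approximately ||x||
    # s2) truncated generalized gradient iterations (step 4)
    for _ in range(T):
        Az = A @ z
        keep = np.abs(Az) >= psi / (1.0 + gamma)   # truncation rule (10)
        z -= (mu / m) * (A[keep].T @ (Az[keep] - psi[keep] * np.sign(Az[keep])))
    return z
```

On a random instance with $m/n = 6$, this sketch typically drives the relative error below the $10^{-5}$ success threshold used in the simulations.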
Our numerical tests demonstrate\nthat for the real Gaussian model, TGGF achieves a success rate of 100% when m/n is as small as 3,\nwhich is slightly larger than the information limit of m/n = 2 (Recall that m \u2265 2n \u2212 1 is necessary\nfor a unique solution); this is a signi\ufb01cant reduction in the sample complexity ratio, which is 5 for\nTWF and 7 for WF. Surprisingly, TGGF enjoys also a success rate of over 50% when m/n is 2, which\nhas not yet been presented for any existing algorithm under Gaussian sampling models and thus, our\nTGGF bridges the gap; see further discussion in Section 4; and, ii) TGGF converges exponentially fast.\nSpeci\ufb01cally, TGGF requires at most O(log(1/\u0001)) iterations to achieve any given solution accuracy\n\u0001 > 0 (a.k.a., dist(zt, x) \u2264 \u0001(cid:107)x(cid:107)), with iteration cost O(mn). Since truncation takes time on the\norder of O(m), the computational burden of TGGF per iteration is dominated by evaluating the\ngeneralized gradients. The latter involves two matrix-vector multiplications that are computable in\nO(mn) \ufb02ops, namely, Azt yields ut, and AT vt the generalized gradient, where vt := ut\u2212\u03c8(cid:12) ut|ut|.\nHence, the total running time of TGGF is O(mn log(1/\u0001)), which is proportional to the time taken\nto read the data O(mn). The proof of Theorem 1 can be found in the supplementary material.\n4 Simulated tests and conclusions\n\nAdditional numerical tests evaluating performance of the proposed scheme relative to TWF/WF\nare presented in this section. For fairness, all pertinent algorithmic parameters involved in each\nscheme are set to their default values. The Matlab implementations of TGGF are available at\nhttp://www.tc.umn.edu/\u02dcgangwang/TAF. The initial estimate was found based on 50\npower iterations, and was subsequently re\ufb01ned by T = 103 gradient-like iterations in each scheme.\nLeft panel in Fig. 
4 presents the average relative error of three initialization methods on a series of noiseless/noisy real Gaussian problems with m/n = 6 fixed and n varying from 500 to $10^4$, while those for the corresponding complex Gaussian instances are shown in the right panel.

Figure 4: The average relative error using: i) the spectral method [11, 12]; ii) the truncated spectral method [8]; and iii) the proposed orthogonality-promoting method on noise-free (solid) and noisy (dotted) instances with m/n = 6, and n varying from 500/100 to 10,000/5,000 for real/complex vectors. Left: real Gaussian model with $x \sim \mathcal{N}(0, I_n)$, $a_i \sim \mathcal{N}(0, I_n)$, and $\sigma^2 = 0.2^2\|x\|^2$. Right: complex Gaussian model with $x \sim \mathcal{CN}(0, I_n)$, $a_i \sim \mathcal{CN}(0, I_n)$, and $\sigma^2 = 0.2^2\|x\|^2$.

Figure 5: Empirical success rate for WF, TWF, and TGGF with n = 1,000 and m/n varying from 1 to 7. Left: noiseless real Gaussian model with $x \sim \mathcal{N}(0, I_n)$ and $a_i \sim \mathcal{N}(0, I_n)$; Right: noiseless complex Gaussian model with $x \sim \mathcal{CN}(0, I_n)$ and $a_i \sim \mathcal{CN}(0, I_n)$.

Fig. 5 compares the empirical success rate of the three schemes under both real and complex Gaussian models with $n = 10^3$ and m/n varying from 1 to 7 in increments of 1. Evidently, the proposed initialization method returns more accurate and robust estimates than the spectral ones. Moreover, for real-valued vectors, TGGF achieves a success rate of over 50% when m/n = 2, and guarantees perfect recovery from about 3n measurements; for complex-valued ones, TGGF enjoys a success rate of 95% when m/n = 3.4, and ensures perfect recovery from about 4.5n measurements. Regarding running times, TGGF converges slightly faster than TWF, while both are markedly faster than WF. The curves in Fig. 5 clearly corroborate the merits of TGGF over the Wirtinger alternatives.
This paper developed a linear-time algorithm termed TGGF for solving random systems of quadratic equations. 
TGGF builds on three key ingredients: a novel orthogonality-promoting initialization, a simple yet effective truncation rule, and simple scalable gradient-like iterations. Numerical tests corroborate the superior performance of TGGF over state-of-the-art solvers.

Acknowledgements

Work in this paper was supported in part by NSF grants 1500713 and 1514056.

References

[1] R. Balan, P. Casazza, and D. Edidin, “On signal reconstruction without phase,” Appl. Comput. Harmon. Anal., vol. 20, no. 3, pp. 345–356, May 2006.

[2] A. Conca, D. Edidin, M. Hering, and C. Vinzant, “An algebraic characterization of injectivity in phase retrieval,” Appl. Comput. Harmon. Anal., vol. 38, no. 2, pp. 346–356, Mar. 2015.

[3] P. M. Pardalos and S. A. Vavasis, “Quadratic programming with one negative eigenvalue is NP-hard,” J. Global Optim., vol. 1, no. 1, pp. 15–22, 1991.

[4] H. A. Hauptman, “The phase problem of X-ray crystallography,” Rep. Prog. Phys., vol. 54, no. 11, p. 1427, 1991.

[5] E. J. Candès, Y. C. Eldar, T. Strohmer, and V. Voroninski, “Phase retrieval via matrix completion,” SIAM Rev., vol. 57, no. 2, pp. 225–251, May 2015.

[6] H. Sahinoglou and S. D. Cabrera, “On phase retrieval of finite-length sequences using the initial time sample,” IEEE Trans. 
Circuits and Syst., vol. 38, no. 8, pp. 954–958, Aug. 1991.

[7] K. G. Murty and S. N. Kabadi, “Some NP-complete problems in quadratic and nonlinear programming,” Math. Prog., vol. 39, no. 2, pp. 117–129, 1987.

[8] Y. Chen and E. J. Candès, “Solving random quadratic systems of equations is nearly as easy as solving linear systems,” Comm. Pure Appl. Math., 2016 (to appear).

[9] R. W. Gerchberg and W. O. Saxton, “A practical algorithm for the determination of phase from image and diffraction,” Optik, vol. 35, pp. 237–246, Nov. 1972.

[10] J. Fienup, “Phase retrieval algorithms: A comparison,” Appl. Opt., vol. 21, no. 15, pp. 2758–2769, 1982.

[11] P. Netrapalli, P. Jain, and S. Sanghavi, “Phase retrieval using alternating minimization,” IEEE Trans. Signal Process., vol. 63, no. 18, pp. 4814–4826, Sept. 2015.

[12] E. J. Candès, X. Li, and M. Soltanolkotabi, “Phase retrieval via Wirtinger flow: Theory and algorithms,” IEEE Trans. Inf. Theory, vol. 61, no. 4, pp. 1985–2007, Apr. 2015.

[13] J. Sun, Q. Qu, and J. Wright, “A geometric analysis of phase retrieval,” arXiv:1602.06664, 2016.

[14] E. J. Candès, T. Strohmer, and V. Voroninski, “PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming,” Comm. Pure Appl. Math., vol. 66, no. 8, pp. 1241–1274, Nov. 2013.

[15] I. Waldspurger, A. d’Aspremont, and S. Mallat, “Phase recovery, MaxCut and complex semidefinite programming,” Math. Prog., vol. 149, no. 1-2, pp. 47–81, 2015.

[16] E. J. Candès and X. Li, “Solving quadratic equations via PhaseLift when there are about as many equations as unknowns,” Found. Comput. Math., vol. 14, no. 5, pp. 1017–1026, 2014.

[17] G. Wang, D. Berberidis, V. Kekatos, and G. B. 
Giannakis, “Online reconstruction from big data via compressive censoring,” in IEEE Global Conf. Signal and Inf. Process., Atlanta, GA, 2014, pp. 326–330.

[18] D. K. Berberidis, V. Kekatos, G. Wang, and G. B. Giannakis, “Adaptive censoring for large-scale regressions,” in IEEE Intl. Conf. Acoustics, Speech and Signal Process., South Brisbane, QLD, Australia, 2015, pp. 5475–5479.

[19] L.-H. Yeh, J. Dong, J. Zhong, L. Tian, M. Chen, G. Tang, M. Soltanolkotabi, and L. Waller, “Experimental robustness of Fourier ptychography phase retrieval algorithms,” Opt. Express, vol. 23, no. 26, pp. 33214–33240, Dec. 2015.

[20] N. Z. Shor, K. C. Kiwiel, and A. Ruszczyński, Minimization Methods for Non-differentiable Functions. Springer-Verlag New York, Inc., 1985.

[21] F. H. Clarke, Optimization and Nonsmooth Analysis. SIAM, 1990, vol. 5.

[22] ——, “Generalized gradients and applications,” Trans. Am. Math. Soc., vol. 205, pp. 247–262, 1975.

[23] P. Chen, A. Fannjiang, and G.-R. Liu, “Phase retrieval with one or two diffraction patterns by alternating projections of the null vector,” arXiv:1510.07379v2, 2015.

[24] T. Cai, J. Fan, and T. Jiang, “Distributions of angles in random packing on spheres,” J. Mach. Learn. Res., vol. 14, no. 1, pp. 1837–1864, Jan. 2013.

[25] R. Vershynin, “Introduction to the non-asymptotic analysis of random matrices,” arXiv:1011.3027, 2010.