{"title": "Nearly Isometric Embedding by Relaxation", "book": "Advances in Neural Information Processing Systems", "page_first": 2631, "page_last": 2639, "abstract": "Many manifold learning algorithms aim to create embeddings with low or no distortion (i.e. isometric). If the data has intrinsic dimension d, it is often impossible to obtain an isometric embedding in d dimensions, but possible in s > d dimensions. Yet, most geometry preserving algorithms cannot do the latter. This paper proposes an embedding algorithm that overcomes this problem. The algorithm directly computes, for any data embedding Y, a distortion loss(Y), and iteratively updates Y in order to decrease it. The distortion measure we propose is based on the push-forward Riemannian metric associated with the coordinates Y. The experiments confirm the superiority of our algorithm in obtaining low distortion embeddings.", "full_text": "Nearly Isometric Embedding by Relaxation\n\nJames McQueen\n\nMarina Meil\u02d8a\n\nDominique Perrault-Joncas\n\nDepartment of Statistics\nUniversity of Washington\n\nSeattle, WA 98195\n\njmcq@u.washington.edu\n\nmmp@stat.washington.edu\n\nDepartment of Statistics\nUniversity of Washington\n\nSeattle, WA 98195\n\nGoogle\n\nSeattle, WA 98103\n\ndcpjoncas@gmail.com\n\nAbstract\n\nMany manifold learning algorithms aim to create embeddings with low or no dis-\ntortion (isometric). If the data has intrinsic dimension d, it is often impossible to\nobtain an isometric embedding in d dimensions, but possible in s > d dimensions.\nYet, most geometry preserving algorithms cannot do the latter. This paper pro-\nposes an embedding algorithm to overcome this. The algorithm accepts as input,\nbesides the dimension d, an embedding dimension s \u2265 d. For any data embedding\nY, we compute a Loss(Y), based on the push-forward Riemannian metric associ-\nated with Y, which measures deviation of Y from from isometry. 
Riemannian\nRelaxation iteratively updates Y in order to decrease Loss(Y). The experiments\ncon\ufb01rm the superiority of our algorithm in obtaining low distortion embeddings.\n\n1\n\nIntroduction, background and problem formulation\n\nSuppose we observe data points sampled from a smooth manifold M with intrinsic dimension d\nwhich is itself a submanifold of D-dimensional Euclidean space M \u2282 RD. The task of manifold\nlearning is to provide a mapping \u03c6 : M \u2192 N (where N \u2282 Rs) of the manifold into lower\ndimensional space s \u226a D. According to the Whitney Embedding Theorem [11] we know that\nM can be embedded smoothly into R2d using one homeomorphism \u03c6. Hence we seek one smooth\nmap \u03c6 : M \u2192 Rs with d \u2264 s \u2264 2d \u226a D.\nSmooth embeddings preserve the topology of the original M. Nevertheless, in general, they distort\nthe geometry. Theoretically speaking1, preserving the geometry of an embedding is embodied in the\nconcepts of Riemannian metric and isometric embedding. A Riemannian metric g is a symmetric\npositive de\ufb01nite tensor \ufb01eld on M which de\ufb01nes an inner product <, >g on the tangent space TpM\nfor every point p \u2208 M. A Riemannian manifold is a smooth manifold with a Riemannian metric at\nevery point. A diffeomorphism \u03c6 : M \u2192 N is called an isometry iff for all p \u2208 M, u, v \u2208 TpM\nwe have < u, v >gp =< d\u03c6pu, d\u03c6pv >h\u03c6(p). By Nash\u2019s Embedding Theorem [13], it is known that\nany smooth manifold of class C k, k \u2265 3 and intrinsic dimension d can be embedded isometrically\nin the Euclidean space Rs with s polynomial in d.\nIn unsupervised learning, it is standard to assume that (M, g0) is a submanifold of RD and that it\ninherits the Euclidean metric from it2. 
An embedding \u03c6 : M \u2192 \u03c6(M) = N defines a metric g on N given by < u, v >g(\u03c6(p)) = < d\u03c6\u22121u, d\u03c6\u22121v >g0(p), called the pushforward Riemannian metric; (M, g0) and (N , g) are isometric.\nMuch previous work in non-linear dimension reduction [16, 20, 19] has been driven by the desire to find smooth embeddings of low dimension that are isometric in the limit of large n. This work has met with mixed success. There exists the constructive implementation [19] of Nash's proof technique, which guarantees consistency and isometry. However, the algorithm presented falls short of being practical, as the embedding dimension s it requires is significantly higher than the minimum necessary, a major drawback in practice. Overall, the algorithm leads to mappings \u03c6 that, albeit having the desired properties, are visually unintuitive, even for intrinsic dimensions as low as d = 1.\nThere are many algorithms, too many for an exhaustive list, which map the data using a cleverly chosen reconstruction criterion. The criterion is chosen so that the mapping \u03c6 can be obtained as the unique solution of a \u201cclassic\u201d optimization problem, e.g. Eigendecomposition for Laplacian Eigenmaps [2], Diffusion Maps [12] and LTSA [21], Semidefinite Programming for Maximum Variance Unfolding [20] or Multidimensional Scaling for Isomap [3]. These embedding algorithms sometimes come with guarantees of consistency [2] and, only in restricted cases, isometry [3].\n\n1For a more complete presentation the reader is referred to [8] or [15] or [10].\n2Sometimes the Riemannian metric on M is not inherited, but user-defined via a kernel or distance function.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nIn this paper we propose an approach which departs from both these existing directions. 
The main difference, from the algorithmic point of view, is that the loss function we propose does not have a form amenable to a standard solver (and is not even guaranteed to be convex or unimodal). Thus, we do not obtain a mapping \u03c6 in \u201cone shot\u201d, as the previous algorithms do, but by gradual improvements of an initial guess, i.e. by gradient descent. Nevertheless, the loss we define directly measures the deviation from isometry; therefore, when this loss is (near) 0, (near) isometry is achieved.\nThe algorithm is initialized with a smooth embedding Y = \u03c6(M) \u2286 Rs, s \u2265 d; we define the objective function Loss(Y) as the averaged deviation of the pushforward metric from isometry. Then Y is iteratively changed in a direction that decreases Loss. To construct this loss function, we exploit the results of [15], who showed how a pushforward metric can be estimated, for finite samples and in any given coordinates, using a discrete estimator of the Laplace-Beltrami operator \u2206M. The optimization algorithm is outlined in Algorithm 1.\n\nInput: data X \u2208 Rn\u00d7D, kernel function Kh(), weights w1:n, intrinsic dimension d, embedding dimension s, initial coordinates Y \u2208 Rn\u00d7s, with Yk,: representing the coordinates of point k.\nInit: Compute the Laplacian matrix L \u2208 Rn\u00d7n using X and Kh().\nwhile not converged do\n  Compute H = [Hk]k=1:n \u2208 Rn\u00d7s\u00d7s, the (dual) pushforward metric at the data points, from Y and L.\n  Compute Loss(H1:n) and \u2207Y Loss(H).\n  Take a gradient step Y \u2190 Y \u2212 \u03b7\u2207Y Loss(H).\nend\nOutput: Y\n\nAlgorithm 1: Outline of the Riemannian Relaxation Algorithm.\n\nA remark on notation is necessary. Throughout the paper, we denote by M, p \u2208 M, TpM, \u2206M a manifold, a point on it, the tangent subspace at p, and the Laplace-Beltrami operator in the abstract, coordinate-free form. 
When we describe algorithms acting on data, we will use coordinate and finite sample representations. The data is X \u2208 Rn\u00d7D, and an embedding thereof is denoted Y \u2208 Rn\u00d7s; rows k of X, Y, denoted Xk, Yk, are the coordinates of data point k, while the columns, e.g. Yj, represent functions of the points, i.e. restrictions to the data of functions on M. The construction of L (see below) requires a kernel, which can be the (truncated) Gaussian kernel Kh(z) = exp(\u2212z2/h2), |z| < rh, for some fixed r > 0 [9, 17]. Besides these, the algorithm is given a set of weights w1:n, \u2211_k wk = 1.\nThe construction of the loss is based on two main sets of results that we briefly review here. First, an estimator L of the Laplace-Beltrami operator \u2206M of M, and second, an estimator of the pushforward metric g in the current coordinates Y.\nTo construct L we use the method of [4], which guarantees that, if the data are sampled from a manifold M, L converges to \u2206M [9, 17]. Given a set of points in high-dimensional Euclidean space RD, represented by the n\u00d7D matrix X, construct a weighted neighborhood graph G = ({1 : n}, W) over them, with W = [Wkl]k,l=1:n. The weight Wkl between Xk: and Xl: is the heat kernel [2] Wkl \u2261 Kh(||Xk: \u2212 Xl:||), with h a bandwidth parameter fixed by the user, and || || the Euclidean norm. Next, construct the matrix L = [Lkl]k,l=1:n of G by\n\nD = diag(W1), \u02dcW = D^{\u22121}WD^{\u22121}, \u02dcD = diag(\u02dcW1), and L = \u02dcD^{\u22121}\u02dcW    (1)\n\nEquation (1) represents the discrete version of the renormalized Laplacian construction from [4]. Note that W, D, \u02dcD, \u02dcW, L all depend on the bandwidth h via the heat kernel. The consistency of L has been proved in e.g. [9, 17].\nThe second fact we use is the relationship between the Laplace-Beltrami operator and the Riemannian metric on a manifold [11]. 
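As a concrete illustration, the construction (1) can be prototyped in a few lines. The sketch below is ours, not the authors' code; it assumes dense numpy arrays, the truncated Gaussian kernel above, and an illustrative truncation radius r = 3.

```python
import numpy as np

def renormalized_laplacian(X, h, r=3.0):
    # Truncated Gaussian heat kernel weights W between all point pairs.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-d2 / h ** 2)
    W[d2 > (r * h) ** 2] = 0.0          # truncate the kernel at radius r*h

    # Renormalization steps of eq. (1).
    Dv = W.sum(axis=1)                   # D  = diag(W 1)
    Wt = W / Dv[:, None] / Dv[None, :]   # W~ = D^-1 W D^-1
    Dt = Wt.sum(axis=1)                  # D~ = diag(W~ 1)
    return Wt / Dt[:, None]              # L  = D~^-1 W~
```

By construction each row of the returned L sums to 1, and the diagonal kernel entries W_kk = 1 keep the normalizations well defined.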
Based on this, [15] gives a construction method for a discrete estimator of the Riemannian metric g, in any given coordinate system, from an estimate L of \u2206M. In a given coordinate representation Y, a Riemannian metric g at each point is an s \u00d7 s positive semidefinite matrix of rank d. The method of [15] obtains the Moore-Penrose pseudoinverse of this metric (which must therefore be inverted to obtain the pushforward metric). We denote this inverse at point k by Hk; let H = [Hk, k = 1, . . . , n] be the three dimensional array containing the inverse for each data point. Note that H is itself the (discrete estimate of) a Riemannian metric, called the dual (pushforward) metric. With these preliminaries, the method of [15] computes H by\n\nHij = (1/2) [L(Yi \u00b7 Yj) \u2212 Yi \u00b7 (LYj) \u2212 Yj \u00b7 (LYi)]    (2)\n\nwhere Hij is the vector whose kth entry is the ijth element of the dual pushforward metric H at the point k, and \u00b7 denotes element-by-element multiplication.\n\n2 The objective function Loss\n\nThe case s = d (embedding dimension equals intrinsic dimension). Under this condition, it can be shown [10] that \u03c6 : M \u2192 Rd is an isometry iff gp, p \u2208 M, expressed in a normal coordinate system, equals the unit matrix Id. Based on this observation, it is natural to measure the quality of the data embedding Y as the departure of the Riemannian metric obtained via (2) from the unit matrix. This is the starting idea for the distortion measure we propose to optimize. We develop it further as follows. First, we choose to use the dual of g, evaluated by H, instead of the pushforward metric itself. Naturally Hk = Id iff Hk\u207b\u00b9 = Id, so the dual metric identifies isometry as well. When no isometric transformation exists, it is likely that optimizing w.r.t g and optimizing w.r.t h will arrive at different embeddings. 
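In code, the estimator (2) amounts to applying L to products of coordinate functions. A minimal numpy sketch (ours; the dense Laplacian and the stacked-array layout for H are assumptions):

```python
import numpy as np

def dual_metric(L, Y):
    # L : (n, n) Laplacian; Y : (n, s) embedding coordinates.
    # Returns H as an (n, s, s) array, H[k] the dual metric at point k.
    n, s = Y.shape
    H = np.empty((n, s, s))
    for i in range(s):
        for j in range(s):
            # eq. (2): 0.5 * [ L(Y_i*Y_j) - Y_i*(L Y_j) - Y_j*(L Y_i) ]
            H[:, i, j] = 0.5 * (L @ (Y[:, i] * Y[:, j])
                                - Y[:, i] * (L @ Y[:, j])
                                - Y[:, j] * (L @ Y[:, i]))
    return H
```

Since the formula is symmetric in i and j, every H[k] comes out symmetric, as a metric estimate should.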
There is no mathematically compelling reason, however, to prefer optimizing one over the other. We choose to optimize w.r.t h for three reasons: (1) it is computationally faster, (2) it is numerically more stable, and (3) in our experience users find H more interpretable.3\nSecond, we choose to measure the distortion of Hk by ||Hk \u2212 I||, where || || denotes the matrix spectral norm. This choice will be motivated shortly. Third, we choose the weights w1:n to be proportional to \u02dcD from (1). As [4] show, these values converge to the sampling density \u03c0 on M. Putting these together, we obtain the loss function\n\nLoss(Y; L, w) = \u2211_{k=1}^{n} wk ||Hk \u2212 Id||^2.    (3)\n\nTo motivate the choice of a \u201csquared loss\u201d instead of simply using ||Hk \u2212 Id||, notice (the proofs are straightforward) that || || is not differentiable at 0, but || ||2 is.\nA natural question to ask about Loss is if it is convex. The following proposition, proved in the Supplement, summarizes a set of relevant convexity facts.\n\nProposition 1 Denote by \u03bb1:d(Hk) \u2265 0 the eigenvalues of Hk, in decreasing order, and assume Y is in a compact, convex set. Then\n\n1. \u03bb1(Hk), \u03bb1(Hk) \u2212 \u03bbd(Hk) and \u03bb1(Hk) \u2212 \u2211_{d\u2032=1}^{d} \u03bbd\u2032(Hk) are convex in Y.\n2. ||Hk \u2212 Id|| is convex in Y for (\u03bb1(Hk) + \u03bbd(Hk))/2 \u2265 1 and concave otherwise.\n3. ||Hk \u2212 Id||2 is convex in Y whenever ||Hk \u2212 Id|| is convex and differentiable in Y.\n\nThis proposition shows that Loss may not be convex near its minimum, and moreover that squaring the loss only improves convexity.\n\nChoosing the right measure of distortion The norm of a Hermitian bilinear functional (i.e. symmetric tensor of order 2) g : Rs \u00d7 Rs \u2192 R is defined as sup_{u\u22600} |g(u, u)|/||u||^2. In a fixed orthonormal basis of Rs, g(u, v) = u\u2032Gv and ||g|| = sup_{u\u22600} |u\u2032Gu|/||u||^2. 
One can define norms with respect to any metric g0 on Rs (where g0 is represented in coordinates by G0, a symmetric, positive definite matrix), by ||u||_{G0}^2 = u\u2032G0u, respectively ||g||_{G0} = sup_{u\u22600} |u\u2032Gu|/||u||_{G0}^2 = sup_{\u02dcu\u22600} |\u02dcu\u2032 G0^{\u22121/2} G G0^{\u22121/2} \u02dcu|/||\u02dcu||^2 = \u03bbmax(G0^{\u22121/2} G G0^{\u22121/2}).\n\n3Hk represents the direction & degree of distortion as opposed to the scaling required to \u201ccorrect\u201d the space.\n\nIn particular, since any Riemannian metric at a point k is a g as above, setting g and g0 respectively to Hk and Id, we measure the operator norm of the distortion by ||Hk \u2212 Id||. In other words, the appropriate operator norm we seek can be expressed as a matrix spectral norm.\nThe expected loss over the data set, given a distribution represented by the weights w1:n, is then identical to the expression of Loss in (3). If the weights are computed as in (1), it is easy to see that the loss function in (3) is the finite sample version of the squared L2 distance between h and g0 on the space of Riemannian metrics on M, w.r.t the base measure \u03c0 dVg0:\n\n||h \u2212 g0||^2_{g0} = \u222b_M ||h \u2212 g0||^2_{g0} \u03c0 dVg0,  with dVg0 the volume element on M.    (4)\n\nDefining Loss for embeddings with s > d dimensions Consider G, G0 \u2208 Rs\u00d7s, two symmetric matrices with G0 semipositive definite of rank d < s. We would like to extend the G0 norm of G to this case. We start with the family of norms || ||_{G0+\u03b5Is} for \u03b5 > 0 and we define\n\n||G||_{G0} = lim_{\u03b5\u21920} ||G||_{G0+\u03b5Is}.    (5)\n\nProposition 2 Let G, G0 \u2208 Rs\u00d7s be symmetric matrices, with G0 semipositive definite of rank d < s, and let \u03b5 > 0, \u03b3(u, \u03b5) = u\u2032Gu/(u\u2032G0u + \u03b5||u||^2). Then,\n\n1. ||G||_{G0+\u03b5Is} = ||\u02dcG||2 with \u02dcG = (G0 + \u03b5I)^{\u22121/2} G (G0 + \u03b5I)^{\u22121/2}.\n2. 
If ||G||_{G0+\u03b5Is} < r, then \u03bb\u2020(G) < \u03b5r, where \u03bb\u2020(G) = sup_{v\u2208Null(G0)} \u03b3(v, \u03b5).\n3. || ||_{G0} is a matrix norm that takes infinite values when Null G0 \u2288 Null G.\n\nHence, || ||_{G0+\u03b5Is} can be computed as the spectral norm of a matrix. The computation of || ||_{G0} is similar, with the additional step of checking first if Null G0 \u2288 Null G, in which case we output the value \u221e. Let B\u03b5(0, r) (respectively B(0, r)) denote the r-radius ball centered at 0 in the norm || ||_{G0+\u03b5Is} (respectively || ||_{G0}). From Proposition 2 it follows that if G \u2208 B\u03b5(0, r) then \u03bb\u2020(G) < \u03b5r, and if G \u2208 B(0, r) then Null(G0) \u2286 Null(G). In particular, if rank G = rank G0 then Null(G) = Null(G0).\nTo define the loss for s > d we set G = Hk and G0 = UkUk\u2032, with Uk an orthonormal basis for TkM, the tangent subspace at k. The norms || ||_{G0+\u03b5Is}, || ||_{G0} act as soft and hard barrier functions constraining the span of Hk to align with the tangent subspace of the data manifold.\n\nLoss(Y; L, w, d, \u03b5orth) = \u2211_{k=1}^{n} wk || (UkUk\u2032 + \u03b5orth^2 Is)^{\u22121/2} (Hk \u2212 UkUk\u2032) (UkUk\u2032 + \u03b5orth^2 Is)^{\u22121/2} ||^2,    (6)\n\nwhere we denote by \u02dcGk the matrix inside the norm.\n\n3 Optimizing the objective\n\nLet Lk denote the kth row of L; then Hk can be rewritten in the convenient form\n\nHk(Y) = (1/2) Y\u2032[diag(Lk) \u2212 (ek ek\u2032 L) \u2212 (ek ek\u2032 L)\u2032]Y \u2261 (1/2) Y\u2032 Lk Y    (7)\n\nwhere ek is the kth standard basis vector of Rn, diag(Lk) is the diagonal matrix with the kth row of L on its diagonal, and (with a slight abuse of notation) Lk now denotes the symmetric positive semi-definite matrix in the brackets, precomputed from entries in L; Lk has non-zero rows only for the neighbors of k.\n\nProposition 3 Let Lossk denote term k of Loss. If s = d, the gradient of Lossk as given by (3) is
If s = d, the gradient of Lossk as given by (3) is\n\n\u2202 Lossk\n\n\u2202Y\n\n= 2wk\u03bb\u2217\nk\n\nLkYuku\u2032\nk,\n\nk the largest eigenvalue of Hk \u2212 Id and uk is the corresponding eigenvector.\n\nwith \u03bb\u2217\nIf s > d, the gradient of Lossk of (6) is\n\u2202 Lossk\n\n= 2wk\u03bb\u2217\nk\n\nLkY\u03a0kuku\u2032\nk\n\n\u03a0\u2032\nk\n\n\u2202Y\n\n(8)\n\n(9)\n\nwhere \u03a0k = (UkU\u2032\ncorresponding eigenvector.\n\nk + (\u03b5orth)kIs)\u22121/2, \u03bb\u2217\n\nk is the largest eigenvalue of \u02dcGk of (6) and uk is the\n\n4\n\n\fWhen embedding in s > d dimensions, the loss function depends at each point k on \ufb01nding the\nd-dimensional subspace Uk. Mathematically, this subspace coincides with the span of the Jacobian\nDYk which can be identi\ufb01ed with the d-principal subspace of Hk. When computing the gradient of\nLoss we assume that U1:n are \ufb01xed. Since the derivatives w.r.t Y are taken only of H and not of the\ntangent subspace Uk, the algorithm below is actually an alternate minimization algorithm, which\nreduces the cost w.r.t Y in one step, and w.r.t U1:n in the alternate step.\n\n3.1 Algorithm\n\nvation above). The projection consists of imposingPk\n\nWe optimize the loss (3) or (6) by projected gradient descent with line search (subject to the obser-\nYk = 0, which we enforce by centering \u2207Y\nbefore taking a step. This eliminates the degeneracy of the Loss in (3) and (6) w.r.t constant shift\nin Y. To further improve the good trade-off between time per iteration and number of iterations,\nwe found that a heavy-ball method with parameter \u03b1 is effective. At each iteration computing the\ngradient is O((S + s3)n) where S is the number of nonzero entries of L.\n\nInput\n\n:data X, kernel function Kh(), initial coordinates Y0, weights w1:n, intrinsic dimension d,\northonormal tolerance \u03b5orth, heavy ball parameter \u03b1 \u2208 [0, 1)\n:Compute: graph Laplacian L by (1), matrices L1:n as in (7). 
Set S = 0.\nwhile not converged do\nCompute \u2207Loss:\nfor all k do\n1. Calculate Hk via (2);\n2. If s > d:\n(a) Compute Uk by SVD from Hk;\n(b) Compute the gradient \u2207Lossk(Y) using (9);\n3. Else (s = d): calculate the gradient \u2207Lossk(Y) using (8);\n4. Add \u2207Lossk(Y) to the total gradient;\nend\nTake a step in Y:\n1. Compute the projected direction S \u2190 (In \u2212 en en\u2032)\u2207Loss + \u03b1S;\n2. Find the step size \u03b7 by line search and update Y \u2190 Y \u2212 \u03b7S;\nend\nOutput: Y\n\nAlgorithm 2: RIEMANNIANRELAXATION (RR)\n\n3.2 For large or noisy data\n\nHere we describe an extension of the RR algorithm which naturally adapts to large or noisy data, where the manifold assumption holds only approximately. The idea is to subsample the data, but in a highly non-uniform way that improves the estimation of the geometry.\nA simple preliminary observation is that, when an embedding is smooth, optimizing the loss on a subset of the data will be sufficient. Let I \u2282 {1, . . . , n} be a set of size n\u2032 < n. The subsampled loss LossI is computed only for the points k\u2032 \u2208 I. If every point k has O(d) neighbors in I, this ensures that the gradient of LossI will be a good approximation of \u2207Loss at point k, even if k \u2209 I and LossI contains no term in Hk. To optimize LossI by RR, it suffices to run the \u201cfor\u201d loop over k\u2032 \u2208 I. Algorithm PCS-RR below describes how we choose a \u201cgood\u201d subsample I, with the help of the PRINCIPALCURVES algorithm of [14].\n\nInput: data X, kernel function Kh(), initial coordinates Y0, intrinsic dimension d, subsample size n\u2032, other parameters for RR\nCompute \u02c6X = PRINCIPALCURVES(X, Kh, d)\nTake a uniform sample I0 of size n\u2032 from {1, . . . , n} (without replacement).\nfor k\u2032 in I0 do\nFind Xl, the nearest neighbor in X of \u02c6Xk\u2032, and add l to I (removing duplicates)\nend\nOutput: Y = RR(Y0, Kh, d, I, . . .)\n\nAlgorithm 3: PRINCIPALCURVES-RIEMANNIANRELAXATION (PCS-RR)\n\n[Figure 1 panels: sphere + noise; hourglass + noise; final embedding; sigma vs. (log10) loss and MSE.]\n\nFigure 1: Hourglass to sphere. From left to right: target Y (noisy sphere), initialization Y0 of RR (noisy hourglass), output of RR, mean-squared error and Loss vs. noise level \u03c3 (on a log10 scale). Convergence of RR was achieved after 400 iterations.\n\nInformally speaking, PRINCIPALCURVES uses a form of Mean-Shift to obtain points on the d-dimensional manifold of highest density in the data. The result is generally biased; however, [7] have shown that this algorithm offers a very advantageous bias-variance trade-off in the case of manifolds with noise. We use the output \u02c6Y of PRINCIPALCURVES to find a subset of points that (1) lie in a high density region relative to most directions in RD and (2) are \u201cin the middle\u201d of their neighbors, or, more formally, have neighborhoods of dimension at least d. In other words, this is a good heuristic to avoid \u201cborder effects\u201d, or other regions where the d-manifold assumption is violated.\n\n4 Experimental evaluation\n\nHourglass to sphere illustrates how the algorithm works for s = 3, d = 2. The data X is sampled uniformly from a sphere of radius 1 with intrinsic dimension d = 2. We sample n = 10000 points from the sphere and add i.i.d. Gaussian noise with \u03a3 = \u03c32/s Is4, estimating the Laplacian L on the noisy data X. We initialize with a noisy \u201chourglass\u201d shape in s = 3 dimensions, with the same noise distribution as the sphere. 
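The noisy sphere data used here can be generated as follows (our own sketch; the function name and seed handling are illustrative, not from the paper):

```python
import numpy as np

def noisy_sphere(n, s=3, sigma=0.01, seed=0):
    # Uniform sample on the unit sphere S^2, embedded in R^s, plus
    # i.i.d. Gaussian noise with covariance (sigma^2 / s) I_s.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 3))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # project onto the sphere
    if s > 3:
        X = np.hstack([X, np.zeros((n, s - 3))])   # pad extra dimensions
    return X + rng.normal(scale=sigma / np.sqrt(s), size=(n, s))
```

Normalizing standard Gaussian vectors is the usual way to obtain a uniform distribution on the sphere.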
If the algorithm works correctly, by using solely the Laplacian and weights from X, it should morph the hourglass Y0 back into a sphere. The results after convergence at 400 iterations are shown in Fig. 1 (and an animation of this convergence in the Supplement). We see that RR not only recovers the sphere, but it also suppresses the noise.\nThe next two experiments compare RR to several embedding algorithms w.r.t geometric recovery. The algorithms are Isomap, Laplacian Eigenmaps, HLLE [6], and MVU5. The embeddings YLE,MVU,HLLE need to be rescaled before being evaluated, and we use a Procrustes transformation to the original data. The algorithms are compared w.r.t the dual metric distortion Loss, and w.r.t the mean squared error in pairwise distances (the loss optimized by Isomap6). This is\n\ndis(Y, Ytrue) = 2/(n(n\u22121)) \u2211_{k\u2260k\u2032} (||Yk \u2212 Yk\u2032|| \u2212 ||Y^{true}_k \u2212 Y^{true}_{k\u2032}||)^2    (10)\n\nwhere Y is the embedding resulting from the chosen method and Ytrue are the true noiseless coordinates. Note that none of Isomap, MVU, HLLE could have been tested on the hourglass-to-sphere data of the previous example, because they work only for s = d. The sample size is n = 3000 in both experiments, and noise is added as described above.\nFlat \u201cswiss roll\u201d manifold, s = d = 2. The results are displayed in Fig. 2.\nCurved \u201chalf sphere\u201d manifold, s = d = 2. Isometric embedding into 2D is not possible. We examine which of the algorithms achieves the smallest distortions in this scenario. The true distances were computed as arc-lengths on the half-sphere. The results are displayed in Fig. 2.\nRR was initialized at each method. In almost every initialization and noise level, RR achieves a decrease in dis, in some cases significant decreases. Isomap also performs well, and even though RR optimizes a different loss function it never increases dis and often improves on it. 
This demonstrates the ability of the Riemannian metric to encode simultaneously all aspects of manifold geometry. Convergence of RR varies with the initialization but was in all cases faster than Isomap. \n\n4For this artificial noise, adding dimensions beyond s has no effect except to increase \u03c3.\n5Embeddings were computed using drtoolbox: https://lvdmaaten.github.io/drtoolbox/\n6Isomap estimates the true distances using graph shortest paths.\n\n[Figure 2 panels: initial embeddings by Isomap, Laplacian Eigenmaps, MVU and HLLE, each with its RR version; curves for Leigs, Isomap, HLLE and MVU over several noise levels.]\n\nFigure 2: Swiss hole (left) & half sphere (right). Top plots display example initial embeddings and their Riemannian Relaxed versions. Middle row displays the dis value vs. noise level \u03c3. Bottom row displays the Loss value vs. noise level \u03c3. As RR was initialized at each method, dashed lines indicate relaxed embeddings.
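The distance distortion (10) used in these comparisons can be computed directly. The sketch below is ours, reading (10) as the mean squared discrepancy over pairs k != k'; it keeps the full distance matrices in memory, so it is O(n^2):

```python
import numpy as np

def dis(Y, Y_true):
    # Pairwise Euclidean distance matrices for both embeddings.
    D  = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    Dt = np.linalg.norm(Y_true[:, None, :] - Y_true[None, :, :], axis=-1)
    n = Y.shape[0]
    off = ~np.eye(n, dtype=bool)          # exclude the diagonal (k == k')
    # Average the squared distance discrepancies over all ordered pairs.
    return float(np.sum((D[off] - Dt[off]) ** 2) / (n * (n - 1)))
```

In practice Y would first be aligned to Y_true by a Procrustes transformation, as the text describes.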
The extension of RR to PCS-RR allows for scaling to much larger data sets.\n\n4.1 Visualizing the main SDSS galaxy sample in spectra space\n\nThe data consists of spectra of galaxies from the Sloan Digital Sky Survey7 [1]. We extracted a subset of spectra whose SNR was sufficiently high, known as the main sample. This set contains 675,000 galaxies observed in D = 3750 spectral bins, preprocessed by first moving them to a common rest-frame wavelength and filling in missing data following [18], but using the more sophisticated weighted PCA algorithm of [5], before computing a sparse neighborhood graph and pairwise distances between neighbors in this graph. A log-log plot of the average number of neighbors m(r) vs. the neighborhood radius r (shown in the Supplement) indicates that the intrinsic dimension of these data varies with the scale r. In particular, in order to support m = O(d) neighbors, the radius must be above 60, in which case d \u2264 3. We embedded the whole data set by Laplacian Eigenmaps, obtaining the graph in Fig. 3 a. This figure strongly suggests that d is not constant for this data cloud, and that the embedding is not isometric (Fig. 3 b). We \u201crescaled\u201d the data along the three evident\n\n7 www.sdss.org\n\n[Figure 3: four panels a\u2013d; the color bars in (b) and (c) show log10(||H||), and in (d) log10(H\u03b1 emission).]\n\nFigure 3: a: Initial LE embedding from D = 3750 to s = 3 dimensions, with the principal curves \u02c6Y superimposed. 
For clarity, we only show a small subsample of the Y0; a larger one is in the Supplement; b: same embedding, only points \u201con\u201d the principal curves, colored by log10 ||Hk|| (hence, 0 represents isometry); c: same points as in (b), after RR (color on the same scale as in (b)); d: 40,000 galaxies in the coordinates from (c), colored by the strength of Hydrogen \u03b1 emission, a very nonlinear feature which requires dozens of dimensions to be captured in a linear embedding. Convergence of PCS-RR was achieved after 1000 iterations and took 2.5 hours, optimizing a Loss with n\u2032 = 2000 terms over the n \u00d7 s = 10^5 \u00d7 3 coordinates, corresponding to the highest density points. (Please zoom for better viewing.)\n\nprincipal curves shown in Figure 3 a, by running PCS-RR(Y, n = 10^5, n\u2032 = 2000, s = 3, d = 1). In the new coordinates (Fig. 3 c), Y is now close to isometric along the selected curves, while in Fig. 3 b, ||Hk|| was in the thousands on the uppermost \u201carm\u201d. This means that, at the largest scale, the units of distance in the space of galaxy spectra are being preserved (almost) uniformly along the sequences, and that they correspond to the distances in the original D = 3750 data. Moreover, we expect the distances along the final embedding to be closer on average to the true distances, because of the denoising effect of the embedding. Interpreting the coordinates along these \u201carms\u201d is in progress. As a next step of the analysis, RR with s = d = 3 will be used to rescale the high-density region at the confluence of the three principal curves.\n\n5 Discussion\n\nContributions: we propose a new, natural way to measure the distortion from isometry of any embedding Y \u2208 Rn\u00d7s of a data set X \u2208 Rn\u00d7D, and study its properties. 
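For s = d, the per-point distortion in (3) is a squared spectral norm, which for the symmetric matrices Hk is the largest eigenvalue magnitude of Hk minus the identity. A minimal sketch of evaluating Loss (ours; the stacked-array layout is an assumption):

```python
import numpy as np

def loss(H, w):
    # H : (n, d, d) stack of dual metrics; w : (n,) weights summing to 1.
    n, d, _ = H.shape
    dev = H - np.eye(d)                    # H_k - I_d for every point k
    # Spectral norm of a symmetric matrix = largest |eigenvalue|.
    spec = np.abs(np.linalg.eigvalsh(dev)).max(axis=1)
    return float(np.sum(w * spec ** 2))    # eq. (3)
```

An exactly isometric embedding (all Hk equal to the identity) gives zero loss, and any uniform rescaling shows up immediately as a positive value.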
The distortion loss is based on an estimate of the push-forward Riemannian metric into Euclidean space Rs.\nThe RR algorithm we propose departs from existing non-linear embedding algorithms in several ways. First, instead of a heuristically chosen loss, like pairwise distances or local linear reconstruction error, it directly optimizes the (dual) Riemannian metric of the embedding Y. When this is successful and the loss is 0, all geometric properties (lengths, angles, volumes) are preserved simultaneously. Second, from the computational point of view, the non-convex loss is optimized iteratively by projected gradient descent. Third, our algorithm explicitly requires both an embedding dimension s and an intrinsic dimension d as inputs. Estimating the intrinsic dimension of a data set is not a solved problem, and beyond the scope of this work. However, as a rule of thumb, we propose choosing the smallest d for which Loss is not too large, for s fixed, or, if d is known (something that all existing algorithms assume), increasing s until the loss becomes almost 0. Most existing embedding algorithms, such as Isomap, LLE, HLLE, MVU, LTSA, only work in the case s = d, while Laplacian Eigenmaps/Diffusion Maps requires only s but does not attempt to preserve geometric relations. Finally, RR is computationally competitive with existing algorithms, and can be seamlessly adapted to a variety of situations arising in the analysis of real data sets.\n\nReferences\n\n[1] K. N. Abazajian et al. The Seventh Data Release of the Sloan Digital Sky Survey. Astrophysical Journal Supplement Series, 182:543\u2013558, June 2009.\n\n[2] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373\u20131396, 2002.\n\n[3] M. Bernstein, V. de Silva, J. C. Langford, and J. Tenenbaum. Graph approximations to geodesics on embedded manifolds. Science, 290, 2000.\n\n[4] R. R. Coifman and S. Lafon. Diffusion maps. 
Applied and Computational Harmonic Analysis, 21(1):6\u201330, 2006.\n\n[5] L. Delchambre. Weighted principal component analysis: a weighted covariance eigendecomposition approach. Monthly Notices of the Royal Astronomical Society, 446(4):3545\u20133555, 2015.\n\n[6] David L. Donoho and Carrie Grimes. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proc Natl Acad Sci, 100(10):5591\u20135596, May 2003.\n\n[7] Christopher Genovese, Marco Perone-Pacifico, Isabella Verdinelli, and Larry Wasserman. Minimax manifold estimation. Journal of Machine Learning Research, 13:1263\u20131291, May 2012.\n\n[8] M. Hein and J.-Y. Audibert. Intrinsic dimensionality estimation of submanifolds in Rd. In Proceedings of the 22nd International Conference on Machine Learning, ICML, pages 289\u2013296, 2005.\n\n[9] M. Hein, J.-Y. Audibert, and U. von Luxburg. Graph Laplacians and their convergence on random neighborhood graphs. Journal of Machine Learning Research, 8:1325\u20131368, 2007.\n\n[10] J. M. Lee. Riemannian Manifolds: An Introduction to Curvature. Springer, New York, 1997.\n\n[11] J. M. Lee. Introduction to Smooth Manifolds. Springer, New York, 2003.\n\n[12] B. Nadler, S. Lafon, R. R. Coifman, and Kevrekidis. Diffusion maps, spectral clustering and reaction coordinates of dynamical systems. Applied and Computational Harmonic Analysis, 21:113\u2013127, 2006.\n\n[13] J. Nash. The imbedding problem for Riemannian manifolds. Annals of Mathematics, 63:20\u201363, 1956.\n\n[14] Umut Ozertem and Deniz Erdogmus. Locally defined principal curves and surfaces. Journal of Machine Learning Research, 12:1249\u20131286, 2011.\n\n[15] Dominique Perrault-Joncas and Marina Meila. Non-linear dimension reduction: Riemannian metric estimation and the problem of geometric recovery. arXiv:1305.7255v1, 2013.\n\n[16] J. Tenenbaum, V. de Silva, and J. C. Langford. 
A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319\u20132323, 2000.\n\n[17] D. Ting, L. Huang, and M. I. Jordan. An analysis of the convergence of graph Laplacians. In ICML, pages 1079\u20131086, 2010.\n\n[18] Jake Vanderplas and Andrew Connolly. Reducing the dimensionality of data: Locally linear embedding of Sloan galaxy spectra. The Astronomical Journal, 138(5):1365, 2009.\n\n[19] Nakul Verma. Distance preserving embeddings for general n-dimensional manifolds. Journal of Machine Learning Research, 14:2415\u20132448, 2013.\n\n[20] K. Q. Weinberger and L. K. Saul. Unsupervised learning of image manifolds by semidefinite programming. International Journal of Computer Vision, 70:77\u201390, 2006. 10.1007/s11263-005-4939-z.\n\n[21] Z. Zhang and H. Zha. Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. SIAM J. Scientific Computing, 26(1):313\u2013338, 2004.\n", "award": [], "sourceid": 1365, "authors": [{"given_name": "James", "family_name": "McQueen", "institution": "University of Washington"}, {"given_name": "Marina", "family_name": "Meila", "institution": "University of Washington"}, {"given_name": "Dominique", "family_name": "Joncas", "institution": "Google"}]}