{"title": "Learning to Traverse Image Manifolds", "book": "Advances in Neural Information Processing Systems", "page_first": 361, "page_last": 368, "abstract": null, "full_text": "Learning to Traverse Image Manifolds\n\nPiotr Doll\u00e1r, Vincent Rabaud and Serge Belongie\n{pdollar,vrabaud,sjb}@cs.ucsd.edu\n\nUniversity of California, San Diego\n\nAbstract\n\nWe present a new algorithm, Locally Smooth Manifold Learning (LSML), that\nlearns a warping function from a point on a manifold to its neighbors. Important\ncharacteristics of LSML include the ability to recover the structure of the manifold\nin sparsely populated regions and beyond the support of the provided data. Appli-\ncations of our proposed technique include embedding with a natural out-of-sample\nextension and tasks such as tangent distance estimation, frame rate up-conversion,\nvideo compression and motion transfer.\n\n1 Introduction\n\nA number of techniques have been developed for dealing with high dimensional data sets that fall\non or near a smooth low dimensional nonlinear manifold. Such data sets arise whenever the number\nof modes of variability of the data is much smaller than the dimension of the input space, as is the\ncase for image sequences. Unsupervised manifold learning refers to the problem of recovering the\nstructure of a manifold from a set of unordered sample points. Manifold learning is often equated\nwith dimensionality reduction, where the goal is to \ufb01nd an embedding or \u2018unrolling\u2019 of the manifold\ninto a lower dimensional space such that certain relationships between points are preserved. Such\nembeddings are typically used for visualization, with the projected dimension being 2 or 3.\nImage manifolds have also been studied in the context of measuring distance between images un-\ndergoing known transformations. 
For example, the tangent distance [20, 21] between two images is\ncomputed by generating local approximations of a manifold from known transformations and then\ncomputing the distance between these approximated manifolds. In this work, we seek to frame the\nproblem of recovering the structure of a manifold as that of directly learning the transformations\na point on a manifold may undergo. Our approach, Locally Smooth Manifold Learning (LSML),\nattempts to learn a warping function W with d degrees of freedom that can take any point on the\nmanifold and generate its neighbors. LSML recovers a \ufb01rst order approximation of W, and by mak-\ning smoothness assumptions on W can generalize to unseen points.\nWe show that LSML can recover the structure of the manifold where data is given, and also in\nregions where it is not, including regions beyond the support of the original data. We propose a\nnumber of uses for the recovered warping function W, including embedding with a natural out-of-\nsample extension, and in the image domain discuss how it can be used for tasks such as computation\nof tangent distance, image sequence interpolation, compression, and motion transfer. We also show\nexamples where LSML is used to simultaneously learn the structure of multiple \u201cparallel\u201d manifolds,\nand even generalize to data on new manifolds. Finally, we show that by exploiting the manifold\nsmoothness, LSML is robust under conditions where many embedding methods have dif\ufb01culty.\nRelated work is presented in Section 2 and the algorithm in Section 3. Experiments on point sets\nand results on images are shown in Sections 4 and 5, respectively. We conclude in Section 6.\n\n\f2 Related Work\n\nRelated work can be divided into two categories. The \ufb01rst is the literature on manifold learning,\nwhich serves as the foundation for this work. 
The second is work in computer vision and computer\ngraphics addressing image warping and generative models for image formation.\nA number of classic methods exist for recovering the structure of a manifold. Principal component\nanalysis (PCA) tries to \ufb01nd a linear subspace that best captures the variance of the original data.\nTraditional methods for nonlinear manifolds include self organizing maps, principal curves, and\nvariants of multi-dimensional scaling (MDS) among others, see [11] for a brief introduction to these\ntechniques. Recently the \ufb01eld has seen a number of interesting developments in nonlinear manifold\nlearning. [19] introduced a kernelized version of (PCA). A number of related embedding methods\nhave also been introduced, representatives include LLE [17], ISOMAP [22], and more recently SDE\n[24]. Broadly, such methods can be classi\ufb01ed as spectral embedding techniques [24]; the embed-\ndings they compute are based on an eigenvector decomposition of an n \u00d7 n matrix that represents\ngeometrical relationships of some form between the original n points. Out-of-sample extensions\nhave been proposed [3]. The goal of embedding methods (to \ufb01nd structure preserving embeddings)\ndiffers from the goals of LSML (learn to traverse the manifold).\nFour methods that we share inspiration with are [6, 13, 2, 16]. [6] employs a novel charting based\ntechnique to achieve increased robustness to noise and decreased probability of pathological behav-\nior vs. LLE and ISOMAP; we exploit similar ideas in the construction of LSML but differ in motiva-\ntion and potential applicability. [2] proposed a method to learn the tangent space of a manifold and\ndemonstrated a preliminary illustration of rotating a small bitmap image by about 1\u25e6. Work by [13]\nis based on the notion of learning a model for class speci\ufb01c variation, the method reduces to com-\nputing a linear tangent subspace that models variability of each class. 
[16] shares one of our goals\nas it addresses the problem of learning Lie groups, the in\ufb01nitesimal generators of certain geometric\ntransformations.\nIn image analysis, the number of dimensions is usually reduced via approaches like PCA [15], epit-\nomic representation [12], or generative models like in the realMOVES system developed by Di\nBernardo et al. [1]. Sometimes, a precise model of the data, like for faces [4] or eyes [14], is even\nused to reduce the complexity of the data. Another common approach is simply to have instances of\nan object in different conditions: [5] start by estimating feature correspondences between a novel in-\nput with unknown pose and lighting and a stored labeled example in order to apply an arbitrary warp\nbetween pictures. The applications range from video texture synthesis [18] and facial expression\nextrapolation [8, 23] to face recognition [10] and video rewrite [7].\n\n3 Algorithm\n\nLet D be the dimension of the input space, and assume the data lies on a smooth d-dimensional\nmanifold (d (cid:28) D). For simplicity assume that the manifold is diffeomorphic with a subset of Rd,\nmeaning that it can be endowed with a global coordinate system (this requirement can easily be\nrelaxed) and that there exists a continuous bijective mapping M that converts coordinates y \u2208 Rd\nto points x \u2208 RD on the manifold. The goal of most dimensionality reduction techniques given a\nset of data points xi is to \ufb01nd an embedding yi = M\u22121(xi) that preserves certain properties of the\noriginal data like the distances between all points (classical MDS) or the distances or angles between\nnearby points (e.g. spectral embedding methods).\nInstead, we seek to learn a warping function W that can take a point on the manifold and return\nany neighboring point on the manifold, capturing all the modes of variation of the data. 
Let us use\nW(x, \u0001) to denote the warping of x, with \u0001 \u2208 Rd acting on the degrees of freedom of the warp\naccording to the formula M: W(x, \u0001) = M(y + \u0001), where y = M\u22121(x). Taking the \ufb01rst order\napproximation of the above gives: W(x, \u0001) \u2248 x +H(x)\u0001, where each column H\u00b7k(x) of the matrix\nH(x) is the partial derivative of M w.r.t. yk: H\u00b7k(x) = \u2202/\u2202ykM(y). This approximation is valid\ngiven \u0001 small enough, hence we speak of W being an in\ufb01nitesimal warping function.\nWe can restate our goal of learning to warp in terms of learning a function H\u03b8 : RD \u2192 RD\u00d7d\nparameterized by a variable \u03b8. Only data points xi sampled from one or several manifolds are given.\nFor each xi, the set N i of neighbors is then computed (e.g. using variants of nearest neighbor such\n\n\f(a)\n\n(b)\n\n(c)\n\n(d)\n\n(e)\n\n(f)\n\nFigure 1: Overview. Twenty points (n=20) that lie on 1D curve (d=1) in a 2D space (D=2) are shown in (a).\nBlack lines denote neighbors, in this case the neighborhood graph is not connected. We apply LSML to train H\n(with f = 4 RBFs). H maps points in R2 to tangent vectors; in (b) tangent vectors computed over a regularly\nspaced grid are displayed, with original points (blue) and curve (gray) overlayed. Tangent vectors near original\npoints align with the curve, but note the seam through the middle. Regularization \ufb01xes this problem (c), the\nresulting tangents roughly align to the curve along its entirety. We can traverse the manifold by taking small\nsteps in the direction of the tangent; (d) shows two such paths, generated starting at the red plus and traversing\noutward in large steps (outer curve) and \ufb01ner steps (inner curve). This generates a coordinate system for the\ncurve resulting in a 1D embedding shown in (e). 
In (f) two parallel curves are shown, with n=8 samples each.\nTraining a common H results in a vector \ufb01eld that more accurately \ufb01ts each curve than training a separate H\nfor each (if the structure of the two manifolds was very different this need not be the case).\n\nas kNN or \u0001NN), with the constraint that two points can be neighbors only if they come from the\nsame manifold. To proceed, we assume that if xj is a neighbor of xi, there then exists an unknown\n\u0001ij such that W(xi, \u0001ij) = xj to within a good approximation. Equivalently: H\u03b8(xi)\u0001ij \u2248 xj \u2212 xi.\nWe wish to \ufb01nd the best \u03b8 in the squared error sense (the \u0001ij being additional free parameters that\nmust be optimized over). The expression of the error we need to minimize is therefore:\n\nnX\n\nX\n\ni=1\n\nj\u2208N i\n\n(cid:13)(cid:13)H\u03b8(xi)\u0001ij \u2212 (xj \u2212 xi)(cid:13)(cid:13)2\n\n2\n\nerror1(\u03b8) = min\n{\u0001ij}\n\nMinimizing the above error function can be interpreted as trying to \ufb01nd a warping function that can\ntransform a point into its neighbors. Note, however, that the warping function has only d degrees of\nfreedom while a point may have many more neighbors. This intuition allows us to rewrite the error\nin an alternate form. Let \u2206i be the matrix where each column is of the form (xj \u2212 xi) for each\nneighbor of xi. Let \u2206i = U i\u03a3iV i>\nbe the thin singular value decomposition of \u2206i. Then, one can\nshow [9] that error1 is equivalent to the following:\n\n(cid:13)(cid:13)H\u03b8(xi)Ei \u2212 U i\u03a3i(cid:13)(cid:13)2\n\nF\n\nnX\n\ni=1\n\nerror2(\u03b8) = min\n{Ei}\n\n(1)\n\n(2)\n\n(3)\n\nHere, the matrices Ei are the additional free parameters. Minimizing the above can be interpreted\nas searching for a warping function that directly explains the modes of variation at each point. This\nform is convenient since we no longer have to keep track of neighbors. 
Furthermore, if there is no\nnoise and the linearity assumption holds there are at most d non-zero singular values. In practice we\nuse the truncated SVD, keeping at most 2d singular values, allowing for signi\ufb01cant computational\nsavings.\nWe now give the remaining details of LSML for the general case [9]. For the case of images, we\npresent an ef\ufb01cient version in Section 5 which uses some basic domain knowledge to avoid solving\na large regression. Although potentially any regression technique is applicable, a linear model is\nparticularly easy to work with. Let f i be f features computed over xi. We can then de\ufb01ne\n$H_\\theta(x_i) = [\\Theta^1 f^i \\cdots \\Theta^D f^i]^\\top$, where each \u0398k is a d \u00d7 f matrix. Rearranging error2 gives:\n\n$$\\mathrm{error}_{lin}(\\theta) = \\min_{\\{E^i\\}} \\sum_{i=1}^{n} \\sum_{k=1}^{D} \\left\\| f^{i\\top} \\Theta^{k\\top} E^i - U^i_{k\\cdot} \\Sigma^i \\right\\|_2^2 \\qquad (3)$$\n\nFigure 2: Robustness. LSML used to recover the embedding of the S-curve under a number of sampling\nconditions. In each plot we show the original points along with the computed embedding (rotated to align\nvertically), correspondence is indicated by coloring/shading (color was determined by the y-coordinate of the\nembedding). In each case LSML was run with f = 8, d = 2, and neighbors computed by \u03b5NN with \u03b5 = 1\n(the height of the curve is 4). The embeddings shown were recovered from data that was: (a) densely sampled\n(n=500), (b) sparsely sampled (n=100), (c) highly structured (n=190), and (d) noisy (n=500, random Gaussian\nnoise with \u03c3 = .1). In each case LSML recovered the correct embedding. For comparison, LLE recovered good\nembeddings for (a) and (c) and ISOMAP for (a), (b), and (c). The experiments were repeated a number of times\nyielding similar results. For a discussion see the text.\n\nSolving simultaneously for E and \u0398 is complex, but if either E or \u0398 is \ufb01xed, solving for the\nremaining variable becomes a least squares problem (an equation of the form AXB = C can be\nrewritten as (B\u22a4 \u2297 A) vec(X) = vec(C), where \u2297 denotes the Kronecker product and vec the\nmatrix vectorization function). To solve for \u03b8, we use an alternating minimization procedure. In all\nexperiments in this paper we perform 30 iterations of the above procedure, and while local minima\ndo not seem to be too prevalent, we randomly restart the procedure 5 times. Finally, nowhere in\nthe construction have we enforced that the learned tangent vectors be orthogonal (such a constraint\nwould only be appropriate if the manifold was isometric to a plane). To avoid numerically unstable\nsolutions we regularize the error:\n\n$$\\mathrm{error}'_{lin}(\\theta) = \\mathrm{error}_{lin}(\\theta) + \\lambda_E \\sum_{i=1}^{n} \\left\\| E^i \\right\\|_F^2 + \\lambda_\\theta \\sum_{k=1}^{D} \\left\\| \\Theta^k \\right\\|_F^2 \\qquad (4)$$\n\nFor the features we use radial basis functions (RBFs) [11], the number of basis functions, f, being an\nadditional parameter. Each basis function is of the form $f^j(x) = \\exp(-\\|x - \\mu_j\\|_2^2 / 2\\sigma^2)$ where the\ncenters \u00b5j are obtained using K-means clustering on the original data with f clusters and the width\nparameter \u03c3 is set to be twice the average of the minimum distance between each cluster and its\nnearest neighbor center. The feature vectors are then simply de\ufb01ned as $f^i = [f^1(x_i) \\cdots f^p(x_i)]^\\top$.\nThe parameter f controls the smoothness of the \ufb01nal mapping H\u03b8; larger values result in mappings\nthat better \ufb01t local variations of the data, but whose generalization abilities to other points on the\nmanifold may be weaker. This is exactly analogous to the standard supervised setting and techniques\nlike cross validation could be used to optimize over f.\n\n4 Experiments on Point Sets\n\nWe begin with a discussion on the intuition behind various aspects of LSML. 
We then show exper-\niments demonstrating the robustness of the method, followed by a number of applications. In the\n\ufb01gures that follow we make use of color/shading to indicate point correspondences, for example\nwhen we show the original point set and its embedding.\nLSML learns a function H from points in RD to tangent directions that agree, up to a linear combina-\ntion, with estimated tangent directions at the original training points of the manifold. By constraining\nH to be smooth (through use of a limited number of RBFs), we can compute tangents at points not\nseen during training, including points that may not lie on the underlying manifold. This general-\nization ability of H will be central to the types of applications considered. Finally, given multiple\nnon-overlapping manifolds with similar structure, we can train a single H to correctly predict the\ntangents of each, allowing information to be shared. Fig. 1 gives a visual tutorial of these different\nconcepts.\nLSML appears quite robust. Fig. 2 shows LSML successfully applied for recovering the embedding of\nthe \u201cS-curve\u201d under a number of sampling conditions (similar results were obtained on the \u201cSwiss-\nroll\u201d). After H is learned, the embedding is computed by choosing a random point on the manifold\nand establishing a coordinate system by traversing outward (the same procedure can be used to\nembed novel points, providing a natural out-of-sample extension). Here we compare only to LLE\nand ISOMAP using published code. The densely sampled case, Fig. 2(a), is comparatively easy and\na number of methods have been shown to successfully recover an embedding. On sparsely sampled\ndata, Fig. 2(b), the problem is more challenging; LLE had problems for n < 250 (lowering LLE\u2019s\nregularization parameter helped somewhat). Real data need not be uniformly sampled, see Fig. 2(c).\nIn the presence of noise, Fig. 2(d), ISOMAP and LLE performed poorly. A single outlier can distort\nthe shortest path computed by ISOMAP, and LLE does not directly use global information necessary\nto disambiguate noise. Other methods are known to be robust [6], and in [25] the authors propose a\nmethod to \u201csmooth\u201d a manifold as a preprocessing step for manifold learning algorithms; however\na full comparison is outside the scope of this work.\n\nFigure 3: Reconstruction. Reconstruction examples are used to demonstrate quality and generalization of\nH. (a) Points sampled from the Swiss-roll manifold (middle), some recovered tangent vectors in a zoomed-\nin region (left) and embedding found by LSML (right). Here n = 500, f = 20, d = 2, and neighbors were\ncomputed by \u03b5NN with \u03b5 = 4 (height of roll is 20). Reconstruction of Swiss-roll (b), created by a backprojection\nfrom regularly spaced grid points in the embedding (traversal was done from a single original point located at\nthe base of the roll, see text for details). Another reconstruction (c), this time using all points and extending\nthe grid well beyond the support of the original data. The Swiss-roll is extended in a reasonable manner both\ninward (occluded) and outward. (d) Reconstruction of unit hemisphere (LSML trained with n = 100, f = 6,\nd = 2, \u03b5NN with \u03b5 = .3) by traversing outward from topmost point, note reconstruction in regions with no\npoints.\n\nHaving learned H and computed an embedding, we can also backproject from a point y \u2208 Rd to a\npoint x on the manifold by \ufb01rst \ufb01nding the coordinate of the closest point yi in the original data,\nthen traversing from xi by $\\epsilon_j = y_j - y^i_j$ along each tangent direction j (see Fig. 1(d)). Fig. 3(a)\nshows tangents and an embedding recovered by LSML on the Swiss-roll. In Fig. 
3(b) we backpro-\nject from a grid of points in R2; by linking adjacent sets of points to form quadrilaterals we can\ndisplay the resulting backprojected points as a surface. In Fig. 3(c), we likewise do a backprojection\n(this time keeping all the original points), however we backproject grid points well below and above\nthe support of the original data. Although there is no ground truth here, the resulting extension of\nthe surface seems \u201cnatural\u201d. Fig. 3(d) shows the reconstruction of a unit hemisphere by traversing\noutward from the topmost point. There is no isometric mapping (preserving distance) between a\nhemisphere and a plane, and given a sphere there is actually not even a conformal mapping (pre-\nserving angles). In the latter case an embedding is not possible, however, we can still easily recover\nH for both (only hemisphere results are shown).\n\n5 Results on Images\nBefore continuing, we consider potential applications of H in the image domain, including tangent\ndistance estimation, nonlinear interpolation, extrapolation, compression, and motion transfer. We re-\nfer to results on point- sets to aid visualization. Tangent distance estimation: H computes the tangent\nand can be used directly in invariant recognition schemes such as [21]. Compression: Fig. 3(b,d)\nsuggest how given a reference point and H nearby points can be reconstructed using d numbers\n(with distortion increasing with distance). Nonlinear interpolation and extrapolation: points can be\ngenerated within and beyond the support of given data (cf . Fig. 3); of potential use in tasks such as\nframe rate up- conversion, reconstructing dropped frames and view synthesis. Motion transfer: for\ncertain classes of manifolds with \u201cparallel\u201d structure (cf . Fig. 1(f)), a recovered warp may be used\non an entirely novel image. 
These applications will depend not only on the accuracy of the learned\nH but also on how close a set of images is to a smooth manifold.\n\n\f-\n\n-\n\n-\n\n(a)\n\n(b)\n\n(c)\n\n(d)\n\nFigure 4: The translation manifold. Here F i = X i; s = 17, d = 2 and 9 sets of 6 translated images each\nwere used (not including the cameraman). (a) Zero padded, smoothed test image x. (b) Visualization of learned\n\u0398, see text for details. (c) H\u03b8(x) computed via convolution. (d) Several transformations obtained after multiple\nsteps along manifold for different linear combinations of H\u03b8(x). Some artifacts due to error propagation start\nto appear in the top \ufb01gures.\n\nThe key insight to working with images is that although images can live in very high dimensional\nspaces (with D \u2248 106 quite common), we do not have to learn a transformation with that many\nparameters. Let x be an image and H\u00b7k(x), k \u2208 [1, d], be the d tangent images. Here we assume\nthat each pixel in H\u00b7k(x) can be computed based only on the information in s \u00d7 s patch centered\non the corresponding pixel in x. Thus, instead of learning a function RD \u2192 RD\u00d7d we learn a\nfunction Rs2 \u2192 Rd, and to compute H we apply the per patch function at each of the D locations\nin the image. The resulting technique scales independently of D, in fact different sized images can\nbe used. The per patch assumption is not always suitable, most notably for transformations that are\nbased only on image coordinate and are independent of appearance.\nThe approach of Section 3 needs to be slightly modi\ufb01ed to accommodate patches. We rewrite each\nimage xi \u2208 RD as a s2 \u00d7 D matrix X i where each row contains pixels from one patch in xi (in\ntraining we sub-sample patches). Patches from all the images are clustered to obtain the f RBFs;\neach X i is then transformed to a f \u00d7 D matrix F i that contains the features computed for each\npatch. 
The per patch linear model can now be written as H\u03b8(xi) = (\u0398F i)\u22a4, where \u0398 is a d \u00d7 f\nmatrix (compare with the D \u0398s needed without the patch assumption). The error function, which is\nminimized in a similar way [9], becomes:\n\n$$\\mathrm{error}_{img}(\\Theta) = \\min_{\\{E^i\\}} \\sum_{i=1}^{n} \\left\\| F^{i\\top} \\Theta^\\top E^i - U^i \\Sigma^i \\right\\|_F^2 \\qquad (5)$$\n\nWe begin with the illustrative example of translation (Fig. 4). Here, RBFs were not used, instead\nF i = X i. The learned \u0398 is a 2 \u00d7 s\u00b2 matrix, which can be visualized as two s \u00d7 s images as in Fig.\n4(b). These resemble derivative of Gaussian \ufb01lters, which are in fact the in\ufb01nitesimal generators for\ntranslation [16]. Computing the dot product of each row of \u0398 with each patch can be done using\na convolution. Fig. 4 shows applications of the learned transformations, which resemble translations\nwith some artifacts.\nFig. 5 shows the application of LSML for learning out-of-plane rotation of a teapot. On this size\nproblem training LSML (in MATLAB) takes a few minutes; convergence occurs within about 10\niterations of the minimization procedure. H\u03b8(x) for novel x can be computed with f convolutions\n(to compute cross correlation) and is also fast. The outer frames in Fig. 5 highlight a limitation\nof the approach: with every successive step error is introduced; eventually signi\ufb01cant error can\naccumulate. Here, we used a step size which gives roughly 10 interpolated frames between each\npair of original frames. With out-of-plane rotation, information must be created and the problem\nbecomes ambiguous (multiple manifolds can intersect at a single point), hence generalization across\nimages is not expected to be good.\nIn Fig. 6, results are shown on an eye manifold with 2 degrees of freedom. 
LSML was trained on\nsparse data from video of a single eye; H\u03b8 was used to synthesize views within and also well outside\nthe support of the original data (cf. Fig. 6(c)). In Fig. 6(d), we applied the transformation learned\nfrom one person\u2019s eye to a single image of another person\u2019s eye (taken under the same imaging\nconditions). LSML was able to start from the novel test image and generate a convincing series of\ntransformations. Thus, motion transfer was possible: H\u03b8 trained on one series of images generalized\nto a different set of images.\n\nFigure 5: Manifold generated by out-of-plane rotation of a teapot (data from [24], sub-sampled and\nsmoothed). Here, d = 1, f = 400 and roughly 3000 patches of width s = 13 were sampled from 30 frames.\nBottom row shows the ground truth images; dashed box contains 3 of 30 training images, representing \u223c 8\u00b0\nof physical rotation. The top row shows the learned transformation applied to the central image. By observing\nthe tip, handle and the two white blobs on the teapot, and comparing to ground truth data, we can observe the\nquality of the learned transformation on seen data (b) and unseen data (d), both starting from a single frame (c).\nThe outermost \ufb01gures (a)(e) show failure for large rotations.\n\nFigure 6: Traversing the eye manifold. LSML trained on one eye moving along \ufb01ve different lines (3 vertical\nand 2 horizontal). Here d = 2, f = 600, s = 19 and around 5000 patches were sampled; 2 frames were\nconsidered neighbors if they were adjacent in time. Figure (a) shows images generated from the central image.\nThe inner 8 frames lie just outside the support of the training data (not shown), the outer 8 are extrapolated\nbeyond its support. Figure (b) details H\u03b8(x) for two images in a warping sequence: a linear combination can\nlead the iris/eyelid to move in different directions (e.g. the sum would make the iris go up). Figure (c) shows\nextrapolation far beyond the training data, i.e. an eye wide open and fully closed. Finally, Figure (d) shows how\nthe eye manifold we learned on one eye can be applied on a novel eye not seen during training.\n\n6 Conclusion\n\nIn this work we presented an algorithm, Locally Smooth Manifold Learning, for learning the struc-\nture of a manifold. Rather than pose manifold learning as the problem of recovering an embedding,\nwe posed the problem in terms of learning a warping function for traversing the manifold. Smooth-\nness assumptions on W allowed us to generalize to unseen data. Proposed uses of LSML include\ntangent distance estimation, frame rate up-conversion, video compression and motion transfer.\nWe are currently engaged in scaling the implementation to handle large datasets; the goal is to\nintegrate LSML into recognition systems to provide increased invariance to transformations.\n\n\fAcknowledgements\n\nThis work was funded by the following grants and organizations: NSF Career Grant #0448615,\nAlfred P. Sloan Research Fellowship, NSF IGERT Grant DGE-0333451, and UCSD Division of\nCalit2. We would like to thank Sameer Agarwal, Kristin Branson, Matt Tong, and Neel Joshi for\nvaluable input and Anna Shemorry for helping us make it through the deadline.\n\nReferences\n[1] E. Di Bernardo, L. Goncalves and P. 
Perona. US Patent 6,552,729: Automatic generation of animation of synthetic characters, 2003.\n[2] Y. Bengio and M. Monperrus. Non-local manifold tangent learning. In NIPS, 2005.\n[3] Y. Bengio, J.F. Paiement, P. Vincent, O. Delalleau, N. Le Roux, and M. Ouimet. Out-of-sample extensions for LLE, isomap, MDS, eigenmaps, and spectral clustering. In NIPS, 2004.\n[4] D. Beymer and T. Poggio. Face recognition from one example view. In ICCV, page 500, Washington, DC, USA, 1995. IEEE Computer Society.\n[5] Volker Blanz and Thomas Vetter. Face recognition based on fitting a 3D morphable model. PAMI, 25(9):1063\u20131074, 2003.\n[6] M. Brand. Charting a manifold. In NIPS, 2003.\n[7] Christoph Bregler, Michele Covell, and Malcolm Slaney. Video rewrite: driving visual speech with audio. In SIGGRAPH, pages 353\u2013360, 1997.\n[8] E. Chuang, H. Deshpande, and C. Bregler. Facial expression space learning. In Pacific Graphics, 2002.\n[9] P. Doll\u00e1r, V. Rabaud, and S. Belongie. Learning to traverse image manifolds. Technical Report CS2007-0876, UCSD CSE, Jan. 2007.\n[10] G. J. Edwards, T. F. Cootes, and C. J. Taylor. Face recognition using active appearance models. In ECCV, 1998.\n[11] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2001.\n[12] N. Jojic, B. Frey, and A. Kannan. Epitomic analysis of appearance and shape. In ICCV, 2003.\n[13] D. Keysers, W. Macherey, J. Dahmen, and H. Ney. Learning of variability for invariant statistical pattern recognition. In ECML, 2001.\n[14] T. Moriyama, T. Kanade, J. Xiao, and J. F. Cohn. Meticulously detailed eye region model. PAMI, 2006.\n[15] H. Murase and S.K. Nayar. Visual learning and recognition of 3D objects from appearance. IJCV, 1995.\n[16] R. Rao and D. Ruderman. Learning Lie groups for invariant visual perception. In NIPS, 1999.\n[17] L. K. Saul and S. T. Roweis. 
Think globally, fit locally: unsupervised learning of low dimensional manifolds. JMLR, 2003.\n[18] A. Sch\u00f6dl, R. Szeliski, D.H. Salesin, and I. Essa. Video textures. In SIGGRAPH, 2000.\n[19] B. Sch\u00f6lkopf, A. Smola, and K. M\u00fcller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 1998.\n[20] P. Simard, Y. LeCun, and J. S. Denker. Efficient pattern recognition using a new transformation distance. In NIPS, 1993.\n[21] P. Simard, Y. LeCun, J. S. Denker, and B. Victorri. Transformation invariance in pattern recognition - tangent distance and tangent propagation. In Neural Networks: Tricks of the Trade, 1998.\n[22] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290, 2000.\n[23] Joshua B. Tenenbaum and William T. Freeman. Separating style and content with bilinear models. Neural Computation, 12(6):1247\u20131283, 2000.\n[24] K. Q. Weinberger and L. K. Saul. Unsupervised learning of image manifolds by semidefinite programming. In CVPR, 2004.\n[25] Z. Zhang and Zha. Local linear smoothing for nonlinear manifold learning. Technical report, 2003.\n", "award": [], "sourceid": 3035, "authors": [{"given_name": "Piotr", "family_name": "Doll\u00e1r", "institution": null}, {"given_name": "Vincent", "family_name": "Rabaud", "institution": null}, {"given_name": "Serge", "family_name": "Belongie", "institution": null}]}