{"title": "A Complexity-Distortion Approach to Joint Pattern Alignment", "book": "Advances in Neural Information Processing Systems", "page_first": 1425, "page_last": 1432, "abstract": null, "full_text": "A Complexity-Distortion Approach to Joint Pattern Alignment\n\nAndrea Vedaldi\n\nStefano Soatto\n\nDepartment of Computer Science\nUniversity of California at Los Angeles\nLos Angeles, CA 90035\n{vedaldi,soatto}@cs.ucla.edu\n\nAbstract\n\nImage Congealing (IC) is a non-parametric method for the joint alignment of a collection of images affected by systematic and unwanted deformations. The method attempts to undo the deformations by minimizing a measure of complexity of the image ensemble, such as the averaged per-pixel entropy. This enables alignment without an explicit model of the aligned dataset, as required by other methods (e.g. transformed component analysis). While IC is simple and general, it may introduce degenerate solutions when the transformations allow minimizing the complexity of the data by collapsing them to a constant. Such solutions need to be explicitly removed by regularization.\nIn this paper we propose an alternative formulation which solves this regularization issue on more principled grounds. We make the simple observation that alignment should simplify the data while preserving the useful information carried by them. Therefore we trade off fidelity and complexity of the aligned ensemble rather than minimizing the complexity alone. This eliminates the need for an explicit regularization of the transformations, and has a number of other useful properties such as noise suppression. We show the modeling and computational benefits of the approach on some of the problems on which IC has been demonstrated.\n\n1 Introduction\n\nJoint pattern alignment attempts to remove from an ensemble of patterns the effect of nuisance transformations of a systematic nature. 
The aligned patterns then have a simpler structure and can be processed more easily. Joint pattern alignment is not the same problem as aligning one pattern to another; instead all the patterns are projected to a common \u201creference\u201d (usually a subspace) which is unknown and needs to be discovered in the process.\nJoint pattern alignment is useful in many applications and has been addressed by several authors. Here we only review the methods that are most related to the present work.\nTransform Component Analysis [7] (TCA) explicitly models the aligned ensemble as a Gaussian linear subspace of patterns. In fact, TCA is a direct extension of Probabilistic Principal Component Analysis (PPCA) [10]: patterns are generated as in standard PPCA and additional hidden layers model the nuisance deformations. Expectation-maximization is used to learn the model from the data, which results in their alignment. Unfortunately the method requires the space of transformations to be quantized, and it is not clear how well the approach could scale to complex scenarios.\nImage Congealing (IC) [9] takes a different perspective. The idea is that, as the nuisance deformations should increase the complexity of the data, one should be able to identify and undo them by contrasting this effect. Thus IC transforms the data to minimize an appropriate measure of the \u201ccomplexity\u201d of the ensemble. With respect to TCA, IC results in a lighter formulation which enables addressing more complex transformations and makes fewer assumptions on the aligned ensemble.\nAn issue with the standard formulation of IC is that it does not require the aligned data to be a faithful representation of the original data. Thus simplifying the data might remove not only the nuisance factors, but also the useful information carried by the patterns. 
For example, if entropy is used to measure complexity, a typical degenerate solution is obtained by mapping all the data to a constant, which results in minimum (null) entropy. Such solutions are avoided by explicitly regularizing the transformations, in ways that are however rather arbitrary [9].\nOne should instead search for an optimal compromise between the complexity of the simplified data and the preservation of the useful information (Sect. 2). This approach is not only more direct, but also conceptually more straightforward, as no ad hoc regularization needs to be introduced. We illustrate its relationship with rate-distortion theory (Sect. 2.1) and the information bottleneck [2] (Sect. 2.2), and we contrast it with IC (Sect. 2.4).\nIn Sect. 3 we specialize our model to the problem of image alignment as done in [9]. For this case, we show that the new model has the same computational complexity as IC (Sect. 3.1). We also show that a Gauss-Newton based algorithm is possible, which is useful for converging quickly during the final stage of the optimization (Sect. 3.2; in a similar context a descent-based algorithm was introduced in [1]). In Sect. 4 we illustrate the practical behavior of the algorithm, showing how the complexity-distortion compromise affects the final solution. In particular, our results compare favorably with the ones of [9], with added simplicity and other benefits, such as noise suppression.\n\n2 Problem formulation\n\nWe formulate joint pattern alignment as the problem of finding a deformed pattern ensemble which is simpler but faithful to the original data. This is similar to a lossy compression problem [5, 4, 3] and is in fact equivalent to it in some cases (Sect. 2.1).\nA pattern (or data) ensemble x \u2208 X is a random variable with density p(x). Similarly, an aligned ensemble or alignment y \u2208 X of the ensemble x is another variable y with conditional distribution p(y|x). 
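The collapse described above is easy to reproduce numerically. The following is a minimal sketch (ours, not from [9]; 1-D "patterns" and free translations as the transformation class) showing that entropy alone is minimized by annihilating the data:

```python
import numpy as np

def entropy(vals, bins=16):
    # empirical entropy of the transformed ensemble
    p, _ = np.histogram(vals, bins=bins, range=(-4, 4))
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=100)          # 1-D "patterns"

# If each pattern may be translated freely, choosing w_i = -x_i maps the
# whole ensemble to a constant: null entropy, but all information is lost.
collapsed = x + (-x)
print(entropy(collapsed) == 0.0, entropy(x) > 0)   # → True True
```

The distortion term introduced below vetoes exactly this solution, since the constant ensemble is a poor reconstruction of the data.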
We seek an alignment that is \u201csimpler\u201d than x but \u201cfaithful\u201d to x. The complexity R of the alignment y is measured by an operator R = H(y) such as, for example, the entropy of the random variable y (but we will see other options). The cost of representing x by y is expressed by a distortion function d(x, y) \u2208 R+ and the faithfulness of the alignment y is quantified as the expected distortion D = E[d(x, y)].\nConsider a class W of deformations w : X \u2192 X acting on the patterns X. In order for the alignment y to factor out W we consider a distortion function which is invariant to the action of W; in particular, given a base distortion d0(x, y), we consider the deformation-invariant distortion\n\nd(x, y) = min_{w \u2208 W} d0(x, w(y)).\n\nThus an aligned pattern y is faithful to a deformed pattern x if it is possible to map y to x by a nuisance deformation w.\nFinding the best alignment y boils down to optimizing p(y|x) for complexity and distortion. However, this requires trading off complexity and distortion, and there is no unique way of doing so. The distortion-complexity function D(R) gives the best distortion D that can be achieved by alignments of complexity R. All such distortion-optimal alignments are equally good in principle, and it is the application that poses an upper bound on the acceptable distortion.\nD(R) can be computed by optimizing the distortion D w.r.t. p(y|x) while keeping the complexity R constant. However, it is usually easier to optimize the Lagrangian\n\nmin_{p(y|x)} D + \u03bbR     (1)\n\nwhose optimum is attained where the derivative of D(R) equals \u2212\u03bb. Then by varying \u03bb one spans the graph of D(R) and finds all the optimal alignments for the given complexities.\n\n2.1 Relation to rate-distortion and entropy constrained vector quantization\n\nIf one chooses the mutual information I(x, y) as complexity measure H(y) in eq. 
(1), then (1) becomes a rate-distortion problem and the function D(R) a rate-distortion function [5]. The formulation is valid both for discrete and continuous spaces X, but yields a mapping p(y|x) that is genuinely stochastic. Therefore the alignment y of a pattern x is in general not unique. This is because in rate-distortion y is an auxiliary variable used to derive a deterministic code for long sequences (x1, . . . , xn) of data, not for data x in isolation.\nIn contrast, entropy constrained vector quantization [4, 3] assumes that y is finite (i.e. that it spans a finite subset of X) and that it is functionally determined by x (i.e. y = y(x)). Then it measures the complexity of y as the (discrete) entropy H(y). This is analogous to a rate-distortion problem, except that one searches for a \u201csingle letter\u201d optimal coding y of x rather than an optimal coding for long sequences (x1, . . . , xn). Unlike rate-distortion, however, the aligned ensemble y is discrete even if the ensemble x is continuous.\n\n2.2 Relation to information bottleneck\n\nInformation Bottleneck (IB) [2] is a special rate-distortion problem in which one compresses a variable x while preserving the information carried by x about another variable z, representing the task of interest. In this sense IB is similar to the idea proposed here. By designing an appropriate distribution p(x, z) it may also be possible to obtain an alignment effect similar to the one we seek here. For example, if W is a group of transformations, one may define z = z(x) = {w(x) : w \u2208 W}, for which z is insensitive exactly to the deformations w of x.\n\n2.3 Alternative measures of complexity\n\nInstead of the entropy H(y) or the mutual information I(x, y) we can use alternative measures of complexity that yield more convenient computations. An example is the averaged per-pixel entropy introduced by IC [9] and discussed in Sect. 3. 
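The effect of sweeping \u03bb in (1) can be illustrated with a toy quantization example in the spirit of entropy constrained vector quantization [4, 3]. The quantizer family, bin widths, and function names below are our own illustrative choices, not part of the paper:

```python
import numpy as np

def entropy(vals):
    # discrete entropy (bits) of the empirical distribution of vals
    _, counts = np.unique(vals, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def quantize(x, b):
    # map each sample to the centre of its bin of width b
    return (np.floor(x / b) + 0.5) * b

def best_width(x, lam, widths=(1, 2, 4, 8)):
    # minimise the Lagrangian D + lambda * R over a family of quantizers
    cost = {}
    for b in widths:
        y = quantize(x, b)
        D = float(np.mean((x - y) ** 2))   # expected distortion
        R = entropy(y)                     # complexity of the alignment
        cost[b] = D + lam * R
    return min(cost, key=cost.get)

x = np.arange(8, dtype=float)       # a toy 8-letter ensemble
print(best_width(x, lam=0.01))      # → 1 (fidelity dominates: fine bins)
print(best_width(x, lam=100.0))     # → 8 (complexity dominates: one bin)
```

A small \u03bb prices complexity cheaply and selects the finest quantizer; a large \u03bb collapses the ensemble into a single bin, tracing the two ends of the D(R) curve.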
Generalizing this idea, we assume that the aligned data y depend functionally on the patterns x (i.e. y = y(x)) and we express the complexity of y as the total entropy of lower dimensional projections \u03c61(y), . . . , \u03c6M(y), \u03c6i : X \u2192 R^k, of the ensemble.\nDistortion and entropies are estimated empirically and non-parametrically. Concretely, given an ensemble x1, . . . , xK \u2208 X of patterns, we recover transformations w1, . . . , wK \u2208 W and aligned patterns y1, . . . , yK \u2208 X that minimize\n\n(1/K) \u03a3_{i=1}^K d(xi, wi(yi)) \u2212 \u03bb (1/K) \u03a3_{j=1}^M \u03a3_{i=1}^K log pj(\u03c6j(yi)),\n\nwhere the densities pj(\u03c6j(y)) are estimated from the samples \u03c6j(y1), . . . , \u03c6j(yK) by histogramming (discrete case) or by a Parzen estimator [6] with Gaussian kernel g\u03c3(y) of variance \u03c3 (continuous case^1), i.e.\n\npj(\u03c6j(y)) = (1/N) \u03a3_{i=1}^N g\u03c3(\u03c6j(y) \u2212 \u03c6j(yi)).\n\n2.4 Comparison to image congealing\n\nIn IC [9], given data x1, . . . , xK \u2208 X, one looks for transformations v : X \u2192 X, x \u21a6 y, such that the density p(y) estimated from the samples y1 = v1(x1), . . . , yK = vK(xK) has minimum entropy. If the transformations make it possible, one can minimize the entropy by mapping all the patterns to a constant; to avoid this, one considers the regularized cost function\n\nH(y) + \u03b1 \u03a3_i R(vi)     (2)\n\nwhere R(v) is a term penalizing unacceptable deformations.\n\n^1 The Parzen estimator implies that the differential entropy of the distributions pj is always lower bounded by the entropy of the kernel g\u03c3. This prevents the differential entropy from taking arbitrarily large negative values. 
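The empirical objective above can be sketched directly. In this hedged illustration (the names are ours, and the projections \u03c6j are simply taken to be the coordinates of y), the complexity term is the summed average negative log-density under the Parzen estimate:

```python
import numpy as np

def parzen_log_density(samples, sigma):
    # log p(s_i) for each sample under a Gaussian-kernel Parzen estimate
    d = samples[:, None] - samples[None, :]              # pairwise differences
    k = np.exp(-0.5 * (d / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return np.log(k.mean(axis=1))                        # p(s_i) = mean_j g(s_i - s_j)

def complexity(Y, sigma=0.1):
    # Y is K x M: row i holds phi_1(y_i), ..., phi_M(y_i);
    # complexity = -(1/K) sum_j sum_i log p_j(phi_j(y_i))
    return float(-sum(parzen_log_density(Y[:, j], sigma).mean()
                      for j in range(Y.shape[1])))

rng = np.random.default_rng(0)
tight = rng.normal(0.0, 0.01, size=(50, 4))    # a well-aligned ensemble
loose = rng.normal(0.0, 1.00, size=(50, 4))    # a scattered ensemble
print(complexity(tight) < complexity(loose))   # → True
```

A tightly clustered ensemble scores a lower complexity than a scattered one, which is exactly the effect that the alignment step exploits.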
Compared to IC, in our formulation:\n- The distortion term E[d(x, y)] substitutes the arbitrary regularization R(v).\n- The aligned patterns y are not obtained by deforming the patterns x; instead y is obtained as a simplification of x within an acceptable level of distortion. This fact induces a noise-cancellation effect (Sect. 4).\n- The transformations w can be rather general, even non-invertible. IC can use complex transformations too, but most likely these would need to be heavily regularized as they would tend to annihilate the patterns.\n\n3 Application to joint image alignment\n\nWe apply our model to the problem of removing a family of geometric distortions from images. This is the same application for which IC [9] was proposed in the first place.\nWe are given a set I1(x), . . . , IK(x) of digital images (pattern ensemble) defined on a regular lattice x \u2208 \u039b \u2282 R\u00b2 and with range in [0, 1]. The images may be affected by parametric transformations wi(\u00b7) = w(\u00b7; qi) : R\u00b2 \u2192 R\u00b2, so that\n\nIi(x) = Ti(wix) + ni(x),  x \u2208 \u039b\n\nfor templates (aligned ensemble\u00b2) Ti(y), y \u2208 \u039b, and residuals ni(x). Here qi is the vector of parameters of the transformation wi (for example, wi might be a 2-D affine transformation y = Lx + l and qi the vector q = [L11 L21 L12 L22 l1 l2]).\n\u00b2 With respect to Sect. 2, the patterns xi are now the images Ii and the alignments y are the templates Ti.\nThe templates Ti(y), y \u2208 \u039b are digital images themselves. In order to define Ti(wx) when wx \u2209 \u039b, bilinear interpolation and zero-padding are used. Therefore the symbol Ti(wix) really denotes the quantity\n\nTi(wix) = A(x; wi)Ti,  x \u2208 \u039b\n\nwhere A(x; wi) is a row vector of mixing coefficients determined by wi and the interpolation method being used, and Ti is the vector obtained by stacking the pixels of the template Ti(y), y \u2208 \u039b. We will also use the notation wi \u25e6 Ti = A(wi)Ti, where the left hand side is the stacking of the warped template Ti(wix), x \u2208 \u039b, and A(wi) is the matrix whose rows are the vectors A(x; wi) for x \u2208 \u039b.\nThe distortion is defined to be the squared l2 norm of the residual, d(Ii, wi \u25e6 Ti) = \u03a3_{x\u2208\u039b} (Ii(x) \u2212 Ti(wix))\u00b2. The complexity of the aligned ensemble T(y), y \u2208 \u039b is computed as in Sect. 2.3 by projecting on the image pixels and averaging their entropies (this is equivalent to assuming that the pixels are statistically independent). For each pixel y \u2208 \u039b a density p(T(y) = t), t \u2208 [0, 1], is estimated non-parametrically from the data {T1(y), . . . , TK(y)} (we use a Parzen window as explained in Sect. 2.3). The complexity of a pixel is thus\n\nH(T(y)) = \u2212(1/K) \u03a3_{i=1}^K log p(Ti(y)).\n\nFinally the overall cost function is obtained by summing over all pixels and averaging over all images:\n\nL(w1, . . . , wK, T1, . . . , TK) = (1/K) \u03a3_{i=1}^K \u03a3_{x\u2208\u039b} (Ii(x) \u2212 Ti(wix))\u00b2 \u2212 \u03bb (1/K) \u03a3_{i=1}^K \u03a3_{y\u2208\u039b} log p(Ti(y)).     (3)\n\n3.1 Basic search\n\nIn this section we show how the optimization algorithm from [9] can be adapted to work with the new formulation. This algorithm is a simple coordinate search over the dimensions of the search space:\n\n1: Estimate the probabilities p(T(y)), y \u2208 \u039b, from the templates {Ti(y) : i = 1, . . . , K}.\n2: For each pattern i = 1, . . .
, K and for each component qji of the parameter vector qi, try a few values of qji. For each value re-compute the cost function (3) and keep the best.\n3: Repeat, refining the sampling step of the parameters.\n\nThis algorithm is appropriate if the dimensionality of the parameter vector q is reasonably small. Here we consider affine transformations for the sake of the illustration, so that q is six-dimensional.\nIn (1.) estimating the probabilities p(Ti(y)) and in (2.) evaluating the cost function L(w1, . . . , wK, T1, . . . , TK) requires knowing Ti(y). As a first order approximation (as the final result will be refined by Gauss-Newton as explained in the next section), we bypass this problem and simply set Ti = wi^{\u22121} \u25e6 Ii, exploiting the fact that the affine transformations wi are invertible\u00b3. Eventually all we do is substitute the regularization term \u03a3_i R(vi) of [9] with the expected distortion\n\n(1/K) \u03a3_{i=1}^K \u03a3_{x\u2208\u039b} (Ii(x) \u2212 wi \u25e6 (wi^{\u22121} \u25e6 Ii)(x))\u00b2 = (1/K) \u03a3_{i=1}^K \u03a3_{x\u2208\u039b} (Ii(x) \u2212 A(x; wi) A(wi^{\u22121}) Ii)\u00b2.\n\nNote that warping and un-warping the image Ii is a lossy operation even if wi is bijective, because the transformation, applied to digital images, introduces aliasing. Thus the new algorithm simply avoids those transformations wi that would introduce an excessive loss of fidelity.\n\n3.2 Gauss-Newton search\n\nWith respect to IC, where only the transformations w1, . . . , wK are estimated, here we compute the templates T1, . . . , TK as well. While this might not be so important when a coarse approximation to the solution has to be found (for which the algorithm of Sect. 3.1 can be used), it must be taken into account to get refined results. This can be done (with a bit of numeric care) by Gauss-Newton (GN). Applying Gauss-Newton requires taking derivatives with respect to the pixel values Ti(y). 
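The coordinate search of Sect. 3.1 can be sketched on a deliberately simplified problem: 1-D signals, integer translations instead of affine warps, and histogrammed per-pixel entropies. All names are ours, and this is an illustration rather than the paper's implementation:

```python
import numpy as np

def shift(sig, t):
    # integer translation with zero padding (a stand-in for w_i)
    out = np.zeros_like(sig)
    src = slice(max(0, -t), len(sig) - max(0, t))
    dst = slice(max(0, t), len(sig) - max(0, -t))
    out[dst] = sig[src]
    return out

def cost(signals, shifts, lam=1.0, bins=8):
    # distortion + lambda * summed per-pixel entropy, mimicking eq. (3);
    # the templates are taken as the un-warped signals (T_i = w_i^{-1} o I_i)
    T = np.array([shift(s, -t) for s, t in zip(signals, shifts)])
    D = np.mean([np.sum((s - shift(Ti, t)) ** 2)
                 for s, Ti, t in zip(signals, T, shifts)])
    H = 0.0
    for y in range(T.shape[1]):                      # per-pixel entropies
        p, _ = np.histogram(T[:, y], bins=bins, range=(0, 1))
        p = p / p.sum()
        p = p[p > 0]
        H -= float(np.sum(p * np.log(p)))
    return D + lam * H

def congeal(signals, steps=(-2, -1, 0, 1, 2), sweeps=3):
    # coordinate search: perturb one translation at a time, keep the best
    shifts = [0] * len(signals)
    for _ in range(sweeps):
        for i in range(len(signals)):
            trial = lambda d: shifts[:i] + [shifts[i] + d] + shifts[i + 1:]
            best = min(steps, key=lambda d: cost(signals, trial(d)))
            shifts[i] += best
    return shifts

base = np.zeros(16)
base[6:10] = 1.0                                     # a "bump" pattern
signals = [shift(base, t) for t in (0, 2, -1)]       # misaligned copies
recovered = congeal(signals)
```

Because every coordinate update keeps the current value among the candidates, the cost never increases; and collapsing the signals off-lattice is unattractive because the distortion term vetoes it, which is the regularization-free behavior argued for above.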
We exploit the fact that the variables T(y) are continuous, as opposed to [9]. We still process a single image at a time, iterating several times over the whole ensemble {I1(x), . . . , IK(x)}. For a given image Ii we update the warp parameters qi and the template Ti simultaneously. We exploit the fact that, as the number K of images is usually large, the density p(T(y)) does not change significantly when only one of the templates Ti is being changed. Therefore p(T(y)) can be assumed constant in the computation of the gradient and the Hessian of the cost function (3). The gradient is given by\n\n\u2202L/\u2202qi^\u22a4 = \u03a3_{x\u2208\u039b} 2\u0394i(x) \u2207Ti(wix) (\u2202wi/\u2202qi^\u22a4)(x),\n\u2202L/\u2202Ti(y) = \u03a3_{x\u2208\u039b} 2\u0394i(x) (A(x; wi)\u03b4y) \u2212 \u03bb \u1e57(Ti(y))/p(Ti(y)),\n\nwhere \u0394i(x) = Ti(wix) \u2212 Ii(x) is the reconstruction residual, A(x; wi) is the linear map introduced in Sect. 3, and \u03b4y = \u03b4(z \u2212 y) is the 2-D discrete delta function centered on y, encoded as a vector.\nThe approximated Hessian of the cost function (3) can be obtained as follows. First, we use the Gauss-Newton approximation for the derivative w.r.t. the transformation parameters qi:\n\n\u2202\u00b2L/\u2202qi\u2202qi^\u22a4 \u2248 \u03a3_{x\u2208\u039b} 2 (\u2202wi^\u22a4/\u2202qi)(x) \u2207^\u22a4Ti(wix) \u2207Ti(wix) (\u2202wi/\u2202qi^\u22a4)(x).\n\nWe then have\n\n\u2202\u00b2L/\u2202Ti(y)\u00b2 = \u03a3_{x\u2208\u039b} 2(A(x; wi)\u03b4y)\u00b2 \u2212 \u03bb (p\u0308(Ti(y)) p(Ti(y)) \u2212 \u1e57(Ti(y))\u00b2) / p(Ti(y))\u00b2,\n\u2202\u00b2L/\u2202Ti(y)\u2202Ti(z) = \u03a3_{x\u2208\u039b} 2(A(x; wi)\u03b4y)(A(x; wi)\u03b4z),\n\u2202\u00b2L/\u2202Ti(y)\u2202qi^\u22a4 = \u03a3_{x\u2208\u039b} 2(A(x; wi)\u03b4y) \u2207Ti(wix) (\u2202wi/\u2202qi^\u22a4)(x) + \u03a3_{x\u2208\u039b} 2\u0394i(x) A(x; wi) [D1\u03b4y D2\u03b4y] (\u2202wi/\u2202qi^\u22a4)(x)\n\n\u00b3 Our criterion implicitly avoids non-invertible affine transformations, as they yield highly distorted codes.\n\nFigure 1: Toy example. Top left. We distort the patterns by applying translations drawn uniformly from the 8-shaped region (the center corresponds to the null translation). Top. We show the gradient based algorithm as it gradually aligns the patterns by reducing the complexity of the alignment y. Dark areas correspond to high values of the density of the alignment; we also superimpose the trajectory of one of the patterns. Unfortunately the gradient based algorithm, being a local technique, gets trapped in two local modes (the modes can however be fused in a post-processing stage). Bottom. The basic algorithm completely eliminates the effect of the nuisance transformations, doing a better job of avoiding local minima. Although for this simple problem the basic search is more effective, on more difficult scenarios the extra complexity of the Gauss-Newton search pays off (see Sect. 
4).\n\nwhere D1 is the discrete linear operator used to compute the derivative of Ti(y) along its first dimension and D2 is the analogous operator for the second dimension. The second term of the last equation gives a very small contribution and can be dropped.\nThe equations are all straightforward and result in a linear system\n\n\u03b4\u03b8^\u22a4 (\u2202\u00b2L/\u2202\u03b8\u2202\u03b8^\u22a4) = \u2212\u2202L/\u2202\u03b8^\u22a4\n\nwhere the vector \u03b8^\u22a4 = [q^\u22a4 T(y1) . . . T(yn)] has size on the order of the number of pixels of the template T(y), y \u2208 \u039b. While this system is large, it is also extremely sparse and can be solved rather efficiently by standard methods [8].\n\n4 Experiments\n\nThe first experiment (Fig. 1) is a toy problem illustrating our method. We collect K patterns xi, i = 1, . . . , K, which are arrays of M 2-D points xi = (x1i, . . . , xMi). Such points are generated by drawing M i.i.d. samples from a 2-D Gaussian distribution and adding a random translation wi \u2208 R\u00b2 to them. The distribution of the translations wi is generic (in the example wi is drawn uniformly from an 8-shaped region of the plane): this is not a problem, as we do not need to make any particular assumption on w besides that it is a translation. The distortion d(xi, yi) is simply the sum of the squared Euclidean distances \u03a3_{j=1}^M ||yji + wi \u2212 xji||\u00b2 between the patterns xi and the transformed codes wi(yi) = (y1i + wi, . . . , yMi + wi). The distribution p(yi) of the codes is assumed to factorize as p(yi) = \u03a0_{j=1}^M p(yji), where the p(yji) are identical densities estimated by Parzen window from all the available samples {yji, j = 1, . . . , M, i = 1, . . . , K}.\nIn the second experiment (Fig. 2) we align hand-written digits extracted from the NIST Special Database 19. The results (Fig. 
3) should be compared to the ones from [9]: they are of analogous quality, but they were achieved without regularizing the class of admissible transformations. Despite this, we did not observe any of the aligned patterns collapse. In Fig. 4 we show the effect of choosing different values of the parameter \u03bb in the cost function (3). As \u03bb is increased, the alignment complexity is reduced and the fidelity of the alignment is degraded. By an appropriate choice of \u03bb, the alignment can be regarded as a \u201crestoration\u201d or \u201ccanonization\u201d of the pattern which abstracts from details of the specific instance.\n\nFigure 2: Basic vs GN image alignment algorithms. Left. We show the results of applying the basic image alignment algorithm of Sect. 3.1. The patterns are zeroes from the NIST Special Database 19. We show, in reading order: the expected value E[T(y)]; the per-pixel entropy H(T(y)) (it can be negative as it is differential); a 3-D plot of the same function H(T(y)); the distortion-complexity diagram as the algorithm minimizes the function D + \u03bbR (in green we show some lines of constant cost); the probability p(T(y) = l) as l \u2208 [0, 1] and y varies along the middle scan-line; and the per-pixel distortion D(x) = E[(I(x) \u2212 T(wx))\u00b2]. Right. We demonstrate the GN algorithm of Sect. 3.2. The algorithm achieves a significantly better solution in terms of the cost function (3). Moreover GN converges in only two sweeps of the dataset, while the basic algorithm after 10 sweeps is still slowly moving. This is due to the fact that GN selects both the best search direction and step size, resulting in a more efficient search strategy.\n\nFigure 3: Aligned patterns. Left. A few patterns from NIST Special Database 19. Middle. Basic algorithm: results are very similar to [9], except that no regularization on the transformations is used. Right. 
GN algorithm: Patterns achieve a better alignment due to the more efficient search strategy; they also appear to be much more \u201cregular\u201d due to the noise cancellation effect discussed in Fig. 4. Bottom. More examples of patterns before and after GN alignment.\n\n5 Conclusions\n\nIC is a useful algorithm for joint pattern alignment, both robust and flexible. In this paper we showed that the original formulation can be improved by realizing that alignment should result in a simplified representation of the useful information carried by the patterns rather than a simplification of the patterns. This results in a formulation that does not require inventing regularization terms in order to prevent degenerate solutions. We also showed that Gauss-Newton can be successfully applied to this problem for the case of image alignment, and that this is in some regards more effective than the original IC algorithm.\n\n(a) Distortion-Complexity\n\n(b) Not aligned\n\n(c) Aligned\n\nFigure 4: Distortion-complexity balance. We illustrate the effect of varying the parameter \u03bb in (3). (a) Estimated distortion-complexity function D(R). The green (dashed) lines have slope equal to \u03bb and should be tangent to D(R) (Sect. 2). (b) We show the alignment T(wix) of eight patterns (rows) as \u03bb is increased (columns). 
In order to reduce the entropy of the alignment, the algorithm \u201cforgets\u201d about specific details of each glyph. (c) The same as (b), but aligned.\n\nAcknowledgments\n\nWe would like to acknowledge the support of AFOSR FA9550-06-1-0138 and ONR N00014-03-1-0850.\n\nReferences\n\n[1] P. Ahammad, C. L. Harmon, A. Hammonds, S. S. Sastry, and G. M. Rubin. Joint nonparametric alignment for analyzing spatial gene expression patterns in Drosophila imaginal discs. In Proc. CVPR, 2005.\n[2] K. Branson. The information bottleneck method. Lecture slides, 2003.\n[3] J. Buhmann and H. K\u00fchnel. Vector quantization with complexity costs. IEEE Trans. on Information Theory, 39, 1993.\n[4] P. A. Chou, T. Lookabaugh, and R. M. Gray. Entropy-constrained vector quantization. IEEE Trans. on Acoustics, Speech, and Signal Processing, 37(1), 1989.\n[5] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 2006.\n[6] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience, 2001.\n[7] B. J. Frey and N. Jojic. Transformation-invariant clustering and dimensionality reduction using EM. PAMI, 2000.\n[8] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 1996.\n[9] E. G. Learned-Miller. Data driven image models through continuous joint alignment. PAMI, 28(2), 2006.\n[10] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61(3), 1999.\n", "award": [], "sourceid": 3076, "authors": [{"given_name": "Andrea", "family_name": "Vedaldi", "institution": null}, {"given_name": "Stefano", "family_name": "Soatto", "institution": null}]}