{"title": "A Model for Learning Variance Components of Natural Images", "book": "Advances in Neural Information Processing Systems", "page_first": 1391, "page_last": 1398, "abstract": "", "full_text": "A Model for Learning Variance Components of\n\nNatural Images\n\nYan Karklin\n\nyan+@cs.cmu.edu\n\nMichael S. Lewicki*\n\nlewicki@cnbc.cmu.edu\n\nComputer Science Department &\n\nCenter for the Neural Basis of Cognition\n\nCarnegie Mellon University\n\nAbstract\n\nWe present a hierarchical Bayesian model for learning ef\ufb01cient codes of\nhigher-order structure in natural images. The model, a non-linear gen-\neralization of independent component analysis, replaces the standard as-\nsumption of independence for the joint distribution of coef\ufb01cients with\na distribution that is adapted to the variance structure of the coef\ufb01cients\nof an ef\ufb01cient image basis. This offers a novel description of higher-\norder image structure and provides a way to learn coarse-coded, sparse-\ndistributed representations of abstract image properties such as object\nlocation, scale, and texture.\n\n1 Introduction\n\nOne of the major challenges in vision is how to derive from the retinal representation\nhigher-order representations that describe properties of surfaces, objects, and scenes. Phys-\niological studies of the visual system have characterized a wide range of response proper-\nties, beginning with, for example, simple cells and complex cells. These, however, offer\nonly limited insight into how higher-order properties of images might be represented or\neven what the higher-order properties might be. Computational approaches to vision of-\nten derive algorithms by inverse graphics, i.e. by inverting models of the physics of light\npropagation and surface re\ufb02ectance properties to recover object and scene properties. 
A\ndrawback of this approach is that, because of the complexity of modeling, only the sim-\nplest and most approximate models are computationally feasible to invert and these often\nbreak down for realistic images. A more fundamental limitation, however, is that this for-\nmulation of the problem does not explain the adaptive nature of the visual system or how it\ncan learn highly abstract and general representations of objects and surfaces.\n\nAn alternative approach is to derive representations from the statistics of the images them-\nselves. This information theoretic view, called ef\ufb01cient coding, starts with the observation\nthat there is an equivalence between the degree of structure represented and the ef\ufb01ciency\nof the code [1]. The hypothesis is that the primary goal of early sensory coding is to en-\ncode information ef\ufb01ciently. This theory has been applied to derive ef\ufb01cient codes for\n\n*To whom correspondence should be addressed\n\n\fnatural images and to explain a wide range of response properties of neurons in the visual\ncortex [2\u20137].\n\nMost algorithms for learning ef\ufb01cient representations assume either simply that the data\nare generated by a linear superposition of basis functions, as in independent component\nanalysis (ICA), or, as in sparse coding, that the basis function coef\ufb01cients are \u2019sparsi\ufb01ed\u2019\nby lateral inhibition. Clearly, these simple models are insuf\ufb01cient to capture the rich struc-\nture of natural images, and although they capture higher-order statistics of natural images\n(correlations beyond second order), it remains unclear how to go beyond this to discover\nhigher-order image structure.\n\nOne approach is to learn image classes by embedding the statistical density assumed by\nICA in a mixture model [8]. 
This provides a method for modeling classes of images and\nfor performing automatic scene segmentation, but it assumes a fundamentally local repre-\nsentation and therefore is not suitable for compactly describing the large degree of structure\nvariation across images. Another approach is to construct a speci\ufb01c model of non-linear\nfeatures, e.g. the responses of complex cells, and learn an ef\ufb01cient code of their outputs [9].\nWith this, one is limited by the choice of the non-linearity and the range of image regulari-\nties that can be modeled.\n\nIn this paper, we take as a starting point the observation by Schwartz and Simoncelli [10]\nthat, for natural images, there are signi\ufb01cant statistical dependencies among the variances\nof \ufb01lter outputs. By factoring out these dependencies with divisive normalization, Schwartz\nand Simoncelli showed that the model could account for a wide range of non-linearities\nobserved in neurons in the auditory nerve and primary visual cortex.\n\nHere, we propose a statistical model for higher-order structure that learns a basis on the\nvariance regularities in natural images. This higher-order, non-orthogonal basis describes\nhow, for a particular visual image patch, image basis function coef\ufb01cient variances deviate\nfrom the default assumption of independence. This view offers a novel description of\nhigher-order image structure and provides a way to learn sparse distributed representations\nof abstract image properties such as object location, scale, and surface texture.\n\nEf\ufb01cient coding of natural images\n\nThe computational goal of ef\ufb01cient coding is to derive from the statistics of the pattern\nensemble a compact code that maximally reduces the redundancy in the patterns with min-\nimal loss of information. 
The standard model assumes that the data is generated using a set\nof basis functions A and coef\ufb01cients u:\n\nx = Au , (1)\n\nBecause coding ef\ufb01ciency is being optimized, it is necessary, either implicitly or explicitly,\nfor the model to capture the probability distribution of the pattern ensemble. For the linear\nmodel, the data likelihood is [11, 12]\n\np(x|A) = p(u)/|det A| . (2)\n\nThe coef\ufb01cients ui are assumed to be statistically independent:\n\np(u) = \u220fi p(ui) . (3)\n\nICA learns ef\ufb01cient codes of natural scenes by adapting the basis vectors to maximize\nthe likelihood of the ensemble of image patterns, p(x1, ..., xN) = \u220fn p(xn|A), which maxi-\nmizes the independence of the coef\ufb01cients and optimizes coding ef\ufb01ciency within the limits\nof the linear model.\n\n\fFigure 1: Statistical dependencies among natural image independent component basis co-\nef\ufb01cients. Each scatter plot shows the joint distribution of coef\ufb01cients for the pair of basis\nfunctions in the corresponding row and column. Each point represents the encoding of\na 20 \u00d7 20 image patch centered at a random location in the image. (a) For complex natural\nscenes, the joint distributions appear to be independent, because the joint distribution can\nbe approximated by the product of the marginals. (b) Closer inspection of particular image\nregions (the image in (b) is contained in the lower middle part of the image in (a)) reveals\ncomplex statistical dependencies for the same set of basis functions. (c) Images such as\ntextures can also show complex statistical dependencies.\n\nStatistical dependencies among \u2018independent\u2019 components\n\nA linear model can only achieve limited statistical independence among the basis function\ncoef\ufb01cients and thus can only capture a limited degree of visual structure. 
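The linear model of eqns. (1)-(3) can be sketched in a few lines of code. This is an illustrative sketch, not the paper's implementation; the basis A, the exponent q, and the scale lam are assumed inputs:

```python
import math
import numpy as np

def ica_log_likelihood(x, A, q=1.0, lam=1.0):
    """log p(x|A) = sum_i log p(u_i) - log |det A|, with a generalized
    Gaussian prior p(u_i) = z * exp(-|u_i/lam|^q)  (eqns. 2-3)."""
    u = np.linalg.solve(A, x)                     # invert x = Au  (eqn. 1)
    z = q / (2.0 * lam * math.gamma(1.0 / q))     # normalizing constant
    log_prior = np.sum(np.log(z) - np.abs(u / lam) ** q)
    return log_prior - math.log(abs(np.linalg.det(A)))
```

For q = 1 the prior is Laplacian; q = 2 recovers a Gaussian.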
Deviations from\nindependence among the coef\ufb01cients re\ufb02ect particular kinds of visual structure (\ufb01g. 1). If\nthe coef\ufb01cients were independent, it would be possible to describe the joint distribution as\nthe product of two marginal densities, p(ui,uj) = p(ui)p(uj). This is approximately true\nfor natural scenes (\ufb01g. 1a), but for particular images, the joint distribution of coef\ufb01cients\nshows complex statistical dependencies that re\ufb02ect the higher-order structure (\ufb01gs. 1b and\n1c). The challenge for developing more general models of ef\ufb01cient coding is formulating\na description of these higher-order correlations in a way that captures meaningful higher-\norder visual structure.\n\n2 Modeling higher-order statistical structure\n\nThe basic model of standard ef\ufb01cient coding methods has two major limitations. First,\nthe transformation from the pattern to the coef\ufb01cients is linear, so only a limited class\nof computations can be achieved. Second, the model can capture statistical relationships\namong the pixels, but does not provide any means to capture higher-order relationships\nthat cannot be simply described at the pixel level. As a \ufb01rst step toward overcoming these\nlimitations, we extend the basic model by introducing a non-independent prior to model\nhigher-order statistical relationships among the basis function coef\ufb01cients.\n\nGiven a representation of natural images in terms of a Gabor-wavelet-like representation\nlearned by ICA, one salient statistical regularity is the covariation of basis function coef\ufb01-\ncients in different visual contexts. Any speci\ufb01c type of image region, e.g. a particular kind\nof texture, will tend to yield large values for some coef\ufb01cients and not others. Different\ntypes of image regions will exhibit different statistical regularities among the variances of\nthe coef\ufb01cients. 
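The kind of dependence shown in fig. 1 is easy to reproduce synthetically: two coefficients can be linearly uncorrelated while their magnitudes remain strongly dependent. The construction below is our own illustration, not the paper's data; the lognormal modulator s stands in for the visual "context":

```python
import numpy as np

rng = np.random.default_rng(0)

# A shared scale ("context") multiplies two otherwise independent coefficients.
s = np.exp(rng.normal(size=100_000))
u1 = s * rng.normal(size=s.size)
u2 = s * rng.normal(size=s.size)

linear_corr = np.corrcoef(u1, u2)[0, 1]                     # near zero
magnitude_corr = np.corrcoef(np.abs(u1), np.abs(u2))[0, 1]  # clearly positive
```

The marginal statistics look "independent" to a second-order test, yet the variances co-vary, which is exactly the regularity the higher-order basis is meant to capture.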
For a large ensemble of images, the goal is to \ufb01nd a code that describes\nthese higher-order correlations ef\ufb01ciently.\n\n\fIn the standard ef\ufb01cient coding model, the coef\ufb01cients are often assumed to follow a gen-\neralized Gaussian distribution\n\np(ui) = z exp(-|ui/\u03bbi|^q) , (4)\n\nwhere z = q/(2\u03bbi\u0393[1/q]). The exponent q determines the distribution\u2019s shape and the weight\nof the tails, and can be \ufb01xed or estimated from the data for each basis function coef\ufb01cient.\nThe parameter \u03bbi determines the scale of variation (usually \ufb01xed in linear models, since\nbasis vectors in A can absorb the scaling). \u03bbi is a generalized notion of variance; for clarity,\nwe refer to it simply as variance below.\n\nBecause we want to capture regularities among the variance patterns of the coef\ufb01cients,\nwe do not want to model the values of u themselves. Instead, we assume that the relative\nvariances in different visual contexts can be modeled with a linear basis as follows:\n\n\u03bbi = exp([Bv]i) , (5)\nlog \u03bb = Bv , (6)\n\nwhere [Bv]i refers to the ith element of the product vector Bv. This formulation is useful\nbecause it uses a basis to represent the deviation from the variance assumed by the stan-\ndard model. If we assume that vi also follows a zero-centered, sparse distribution (e.g. a\ngeneralized Gaussian), then Bv is peaked around zero, which yields a variance of one, as\nin standard ICA. Because the distribution is sparse, only a few of the basis vectors in B\nare needed to describe how any particular image deviates from the default assumption of\nindependence. 
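A generative sketch of this two-layer model (our own minimal version, with assumed matrix sizes and q = 1, so each ui is Laplacian given its scale):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_coefficients(B, n_patches):
    """Draw (v, lam, u) from the two-layer model: sparse v sets the
    per-coefficient scales lambda_i = exp([Bv]_i)  (eqns. 5-6)."""
    L, M = B.shape
    v = rng.laplace(size=(n_patches, M))   # p(v_i) ~ exp(-|v_i|), sparse
    lam = np.exp(v @ B.T)                  # log lambda = Bv
    u = rng.laplace(scale=lam)             # q = 1: Laplacian with scale lambda_i
    return v, lam, u
```

With B = 0 every scale is 1 and the model reduces to standard ICA, matching the text's default assumption of independence.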
The joint distribution for the prior (eqn. 3) becomes\n\n-log p(u|B,v) \u221d \u2211i=1..L |ui / exp([Bv]i)|^q . (7)\n\nHaving formulated the problem as a statistical model, the choice of the value of v for a\ngiven u is determined by maximizing the posterior distribution\n\n\u02c6v = argmaxv p(v|u,B) = argmaxv p(u|B,v) p(v) . (8)\n\nUnfortunately, computing the most probable v is not straightforward. Because v speci\ufb01es\nthe variance of u, there is a range of values that could account for a given pattern \u2013 all that\nchanges is the probability of the \ufb01rst-order representation, p(u|B,v). For the simulations\nbelow, \u02c6v was estimated by gradient ascent.\nBy maximizing the posterior p(v|u,B), the algorithm is computing the best way to describe\nhow the distribution of the ui\u2019s for the current image patch deviates from the default\nassumption of independence, i.e. v = 0. This aspect of the algorithm makes the transformation from\nthe data to the internal representation fundamentally non-linear. The basis functions in\nB represent an ef\ufb01cient, sparse, distributed code for commonly observed deviations. In\ncontrast to the \ufb01rst layer, where basis functions in A correspond to speci\ufb01c visual features,\nhigher-order basis functions in B describe the shapes of image distributions.\n\nThe parameters are adapted by performing gradient ascent on the data likelihood. Using the\ngeneralized prior, the data likelihood is computed by marginalizing over the coef\ufb01cients.\nAssuming independence between B and v, the marginal likelihood is\n\np(x|A,B) = \u222b p(u|B,v) p(v)/|det A| dv . (9)\n\nThis, however, is intractable to compute, so we approximate it by the maximum a posteriori\nvalue \u02c6v:\n\np(x|A,B) \u2248 p(u|B,\u02c6v) p(\u02c6v)/|det A| . (10)\n\nWe assume that p(v) = \u220fi p(vi) and that p(vi) \u221d exp(-|vi|). We adapt B by maximizing\nthe likelihood over the data ensemble\n\nB = argmaxB \u2211n log p(un|B,\u02c6vn) + log p(B) . (11)\n\nFor reasons of space, we omit the (straightforward) derivations of the gradients.\n\n\fFigure 2: A subset of the 400 image basis functions. Each basis function is 20x20 pixels.\n\n3 Results\n\nThe algorithm described above was applied to a standard set of ten 512 \u00d7 512 natural images\nused in [2]. For computational simplicity, prior to the adaptation of the higher-order basis\nB, a 20 \u00d7 20 ICA image basis was derived using standard methods (e.g. [3]). A subset of\nthese basis functions is shown in \ufb01g. 2.\n\nBecause of the computational complexity of the learning procedure, the number of basis\nfunctions in B was limited to 30, although in principle a complete basis of 400 could be\nlearned. The basis B was initialized to small random values and gradient ascent was per-\nformed for 4000 iterations, with a \ufb01xed step size of 0.05. For each batch of 5000 randomly\nsampled image patches, \u02c6v was derived using 50 steps of gradient ascent at a \ufb01xed step size\nof 0.01.\nFig. 3 shows three different representations of the basis functions in the matrix B adapted to\nnatural images. The \ufb01rst 10 \u00d7 3 block (\ufb01g. 3a) shows the values of the 30 basis functions in\nB in their original learned order. Each square represents 400 weights Bi,j from a particular\nvj to all the image basis functions ui\u2019s. Black dots represent negative weights; white,\npositive weights. In this representation, the weights appear sparse, but otherwise show no\napparent structure, simply because the basis functions in A are unordered.\n\nFigs. 3b and 3c show the weights rearranged in two different ways. In \ufb01g. 
3b, the dots rep-\nresenting the same weights are arranged according to the spatial location within an image\npatch (as determined by \ufb01tting a 2D Gabor function) of the basis function which the weight\naffects. Each weight is shown as a dot; white dots represent positive weights, black dots\nnegative weights. In \ufb01g. 3c, the same weights are arranged according to the orientation and\nspatial scale of the Gaussian envelope of the \ufb01tted Gabor. Orientation ranges from 0 to \u03c0\ncounter-clockwise from the horizontal axis, and spatial scale ranges radially from DC at the\nbottom center to Nyquist. (Note that the learned basis functions can only be approximately\n\ufb01t by Gabor functions, which limits the precision of the visualizations.)\n\nIn these arrangements, several types of higher-order regularities emerge. The predominant\none is that coef\ufb01cient variances are spatially correlated, which re\ufb02ects the fact that a com-\nmon occurrence is an image patch with a small localized object against a relatively uniform\nbackground. For example, the pattern in row 5, column 3 of \ufb01g. 3b shows that often the\ncoef\ufb01cient variances in the top and bottom halves of the image patch are anti-correlated,\ni.e. either the object or scene is primarily across the top or across the bottom. Because\nvi can be positive or negative, the higher-order basis functions in B represent contrast in\nthe variance patterns. Other common regularities are variance-contrasts between two ori-\nentations for all spatial positions (e.g. row 7, column 1) and between low and high spatial\nscales for all positions and orientations (e.g. row 9, column 3). Most higher-order basis\nfunctions have simple structure in either position, orientation, or scale, but there are some\nwhose organization is less obvious.\n\n\fFigure 3: The learned higher-order basis functions. 
The same weights shown in the original\norder (a); rearranged according to the spatial location of the corresponding image basis\nfunctions (b); rearranged according to frequency and orientation of image basis functions\n(c). See text for details.\n\n\fFigure 4: Image patches that yielded the largest coef\ufb01cients for two basis functions in B.\nThe central block contains nine image patches corresponding to higher-order basis function\ncoef\ufb01cients with values near zero, i.e. small deviations from independent variance patterns.\nPositions of the other nine-patch blocks correspond to the associated values of the higher-order\ncoef\ufb01cients, here v15 and v27 (whose weights to the ui\u2019s are shown at the axes extrema). For\nexample, the upper-left block contains image patches for which v15 was highly negative\n(contrast localized to the bottom half of the patch) and v27 was highly positive (power predomi-\nnantly at low spatial scales). This illustrates how different combinations of basis functions\nin B de\ufb01ne distributions of images (in this case, over spatial frequency and location).\n\nAnother way to get insight into the code learned by the model is to display, for a large\nensemble of image patches, the patches that yield the largest values of particular vi\u2019s (and\ntheir corresponding basis functions in B). This is shown in \ufb01g. 4.\n\nAs a check to see if any of the higher-order structure learned by the algorithm was simply\ndue to random variations in the dataset, we generated a dataset by drawing independent\nsamples un from a generalized Gaussian to produce the patterns xn = Aun. The resulting\nbasis B was composed only of small random values, indicating essentially no deviation\nfrom the standard assumption of independence and unit variance. 
In addition, adapting\nthe model on a synthetic dataset generated from a hand-speci\ufb01ed B recovers the original\nhigher-order basis functions.\nIt is also possible to adapt A and B simultaneously (although with considerably greater\ncomputational expense). To check the validity of \ufb01rst deriving B for a \ufb01xed A, both ma-\ntrices were adapted simultaneously for small 8 \u00d7 8 patches on the same natural image data\nset. The results for both the image basis matrix A and the higher-order basis B were quali-\ntatively similar to those reported above.\n\n\f4 Discussion\n\nWe have presented a model for learning higher-order statistical regularities in natural im-\nages by learning an ef\ufb01cient, sparse-distributed code for the basis function coef\ufb01cient vari-\nances. The recognition algorithm is non-linear, but we have not yet tested whether it can\naccount for non-linearities similar to the types reported in [10].\n\nA (cautious) neurobiological interpretation of the higher-order units is that they are anal-\nogous to complex cells, which pool output over speci\ufb01c \ufb01rst-order feature dimensions.\nRather than achieving a simplistic invariance, however, the model presented here has the\nspeci\ufb01c goal of ef\ufb01ciently representing the higher-order structure by adapting to the statis-\ntics of natural images, and thus may predict a broader range of response properties than are\ncommonly tested physiologically.\nOne salient type of higher-order structure learned by the model is the position of image\nstructure within the patch. It is interesting that, rather than encoding speci\ufb01c locations, the\nmodel learned a coarse code of position using broadly tuned spatial patterns. 
This could\noffer novel insights into the function of the broad tuning of higher-level visual neurons.\nBy learning higher-order basis functions for different classes of visual images, the model\ncould not only provide insights into other types of visual response properties, but could\nalso provide a way to simplify some of the computations in perceptual organization and other\ncomputations in mid-level vision.\n\nReferences\n\n[1] H. B. Barlow. Possible principles underlying the transformation of sensory messages. In W. A.\nRosenblith, editor, Sensory Communication, pages 217\u2013234. MIT Press, Cambridge, 1961.\n\n[2] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive-\ufb01eld properties by learning\na sparse code for natural images. Nature, 381:607\u2013609, 1996.\n\n[3] A. J. Bell and T. J. Sejnowski. The \u2019independent components\u2019 of natural scenes are edge \ufb01lters.\nVision Res., 37(23):3327\u20133338, 1997.\n\n[4] J. H. van Hateren and A. van der Schaaf. Independent component \ufb01lters of natural images\ncompared with simple cells in primary visual cortex. Proc. Royal Soc. Lond. B, 265:359\u2013366,\n1998.\n\n[5] J. H. van Hateren and D. L. Ruderman. Independent component analysis of natural image\nsequences yields spatiotemporal \ufb01lters similar to simple cells in primary visual cortex. Proc.\nRoyal Soc. Lond. B, 265:2315\u20132320, 1998.\n\n[6] P. O. Hoyer and A. Hyvarinen. Independent component analysis applied to feature extraction\nfrom colour and stereo images. Network, 11(3):191\u2013210, 2000.\n\n[7] E. Simoncelli and B. Olshausen. Natural image statistics and neural representation. Ann. Rev.\nNeurosci., 24:1193\u20131216, 2001.\n\n[8] T-W. Lee and M. S. Lewicki. Unsupervised classi\ufb01cation, segmentation and de-noising of\nimages using ICA mixture models. IEEE Trans. Image Proc., 11(3):270\u2013279, 2002.\n\n[9] P. O. Hoyer and A. Hyvarinen. 
A multi-layer sparse coding network learns contour coding from\nnatural images. Vision Research, 42(12):1593\u20131605, 2002.\n\n[10] O. Schwartz and E. P. Simoncelli. Natural signal statistics and sensory gain control. Nat.\nNeurosci., 4:819\u2013825, 2001.\n\n[11] B. A. Pearlmutter and L. C. Parra. A context-sensitive generalization of ICA. In International\nConference on Neural Information Processing, pages 151\u2013157, 1996.\n\n[12] J-F. Cardoso. Infomax and maximum likelihood for blind source separation. IEEE Signal\nProcessing Letters, 4:109\u2013111, 1997.\n\f", "award": [], "sourceid": 2167, "authors": [{"given_name": "Yan", "family_name": "Karklin", "institution": null}, {"given_name": "Michael", "family_name": "Lewicki", "institution": null}]}